
Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning

Waseem AlShikh, Manhal Daaboul, Kirk Goddard, Brock Imel, Kiran Kamble, Parikshith Kulkarni, Melisa Russak

Writer, Inc.
{waseem,...,melisa}@writer.com

arXiv:2307.03692v1 [cs.CL] 5 Jul 2023
Preprint. Under review.

Abstract

In this paper, we introduce the Instruction Following Score (IFS), a metric that detects language models' ability to follow instructions. The metric has a dual purpose. First, IFS can be used to distinguish between base and instruct models. We benchmark publicly available base and instruct models and show that the ratio of well-formatted responses to partial and full sentences can be an effective measure for separating those two model classes. Second, the metric can be used as an early stopping criterion for instruct tuning. We compute IFS for Supervised Fine-Tuning (SFT) of 7B and 13B LLaMA models, showing that models learn to follow instructions relatively early in the training process, and that further finetuning can result in changes in the underlying base model semantics. As an example of semantic change, we show the objectivity of model predictions, as defined by an auxiliary metric, ObjecQA. We show that in this particular case, semantic changes are steepest when the IFS tends to plateau. We hope that decomposing instruct tuning into IFS and semantic factors starts a new trend in better controllable instruct tuning and opens possibilities for designing minimal instruct interfaces for querying foundation models.

1 Introduction

Large Language Models (LLMs) finetuned on instruct data can behave like conversational agents (Alpaca: Taori et al. 2023, Self-Instruct: Wang et al. 2023). The recipe for a chat model is well defined: one needs to perform instruction tuning, which means supervised finetuning (SFT) of an LLM on tuples of instruction and response (Longpre et al. 2023).

Open-source datasets vary in quality and quantity, ranging from 1k examples (Zhou et al. 2023) to over 800k examples (Anand et al. 2023). In addition, there are more than a dozen open-source base LLMs, such as LLaMA (Touvron et al. 2023), OPT (Zhang et al. 2022), GPT-Neo (Gao et al. 2020), Palmyra (Writer 2023), and others, which result in a plethora of possible combinations leading to distinct instruct models.

We can see instruct tuning attempts through the lens of the "imitation models" concept introduced by Gudibande et al. 2023, i.e., efforts to distil closed (and possibly much bigger) proprietary models like ChatGPT (OpenAI 2022), Bard (Pichai 2023), and Claude (AnthropicAI 2023).

Little is known about the qualitative impact of the distillation process on the base model (Hinton, Vinyals, and Dean 2015). Imitation success is measured in terms of knowledge (e.g., HELM, Liang et al. 2022), skills (e.g., Natural Questions, Kwiatkowski et al. 2019) or manual checks based on human preferences (Zhou et al. 2023). There is no consensus whether a manual check, which might skew the metric towards style and formatting of responses, is a good overall metric (Gudibande et al. 2023).


A fairly recent attempt to more robustly evaluate instruct models is the Huggingface Leaderboard (Huggingface 2023b), which evaluates models against four key benchmarks from the Eleuther AI Language Model Evaluation Harness (Gao et al. 2021).

Ablation studies have shown that both the diversity and quality of the training data play a crucial role in model performance (Chen et al. 2023, Zhou et al. 2023). Low Training Data Instruction Tuning (LTD Tuning) suggests that task-specific models can gain 2% performance when trained on less than 0.5% of the original data. Moreover, prolonged instruction tuning can decrease foundational model knowledge (Gudibande et al. 2023) and can be seen as an out-of-distribution task for the downstream task of instruct tuning (Kumar et al. 2022).

In this study, we want to lay the foundation for instruct models research by defining the necessary (but not sufficient) condition for an instruct model. Let's conduct a thought experiment: put all models behind a closed API (a recent equivalent of a black box). Is the model instruct-tuned or not? Knowledge benchmarks could be similar for vanilla and instruct models under LTD tuning. Skills tests would highly depend on the model size, which is not known. The simplest way of solving the riddle would be to... chat with the model and judge the tone of the response. For a vanilla model, we expect a next-word prediction attempt, whereas for instruct models, we expect them to follow instructions. We introduce a metric that captures this tone difference: the Instruct Following Score (IFS). We call this problem a "tone alignment" issue.

The IFS is defined as the ratio of "answer-like" responses to "continuation-like" responses on a predefined set of instructions, where the class of a response is determined by a binary classifier.

We benchmark publicly available base and instruct models, and show that the ratio of well-formatted responses to partial and full sentences can be an effective measure for separating vanilla and instruction-following models. Moreover, we calculate IFS during SFT of 7B and 13B LLaMA models, in the hope of finding a stopping criterion for minimal instruct tuning.

To draw a comparison between the learning curve for response tone and the acquisition of semantic and domain-specific knowledge, we propose a supplementary metric called ObjecQA. This auxiliary metric quantifies the objectivity of a model's predictions, as this signal can be identified within the dataset. While this feature choice is arbitrary, we aim to discover possibly more general heuristics for better control over the training phases, including identification of "format-infusion" and "knowledge-infusion" stages.

The paper is organised as follows. In Section 2, we discuss the necessary conditions for a model to be considered an instruct model and data preparation for IFS. The response tone classifier training is described in Section 4. In Section 5, we present results for instruct models and compare them to baseline vanilla models in terms of instruct tone and semantic shifts. The study ends with conclusions and future directions proposed in Section 6.

2 Background and Related Work

The response tone alignment problem is part of the broader intent alignment topic. In principle, LLMs are not aligned with users' intents because their language modeling objective, e.g., predicting the next token of a training document, is different from the instruction-following target.

One successful approach for aligning both objectives is to prompt models using zero- or n-shot techniques, where the response would look like a completion of a document containing QA (Brown et al. 2020, Radford et al. 2018).

Another approach is to instruction-tune a vanilla model on tuples of instruction and response, so that the model, as part of learning, acquires the skill to imitate the correct response format (Alpaca: Taori et al. 2023, Self-Instruct: Wang et al. 2023).

In the InstructGPT paper (Ouyang et al. 2022), the criterion "fails to follow the correct instruction / task" was included in the list of human evaluation metadata for a reward model (RM) used in the PPO algorithm (Schulman et al. 2017) to fine-tune the SFT models to maximize their reward.

We aim to isolate and understand the tone component by evaluating each strategy as a style formatting problem rather than using knowledge and language understanding-based metrics, e.g., MMLU (Hendrycks et al. 2021).

3 Instruction Following Index

3.1 Motivation

An instruction-following model intuitively behaves like a conversational agent, i.e. it always assumes the input is an instruction and, depending on its understanding, tries to provide an answer or ask follow-up questions. In contrast, a model that does not follow instructions will try to predict the next tokens and optionally provide an answer or continue with the next instruction.
The distinction between the two model classes becomes clearer for an instruction that is an incomplete sentence fragment: an instruction-following model will never try to complete the instruction.

It is crucial to emphasise that the quality of responses is purposely beyond the scope of this classification. The above criteria are thus necessary but not sufficient conditions for a chat model.

In this paper, we introduce the Instruction Following Score (IFS), defined as the ratio of "answer-like" responses to "continuation-like" responses to a predefined set of instructions. The class of a response is determined by a binary classifier (subsequently called the "response tone classifier"). The process of training and gathering data for IFS is outlined in the sections that follow.

In this paper, we use "conversational tone" and "instruction following tone" interchangeably, meaning the class of "answer-like" responses. The process of fine-tuning a base model to obtain an instruct model is called "instruction tuning."

3.2 Dataset

The dataset for IFS is derived from a chat dataset, which originally consists of (instruction, response) pairs. We also need to model inputs and outputs for models that do not follow instructions. The main idea for data generation is to append the instruction to the response and then consider different subdivisions into two phrases, as shown in Figure 1.

Figure 1: IFS dataset generation. Different splits define the fragments I, R, Ip, Ic.

If the cut regenerates (instruction, response), we get the ideal input and output for a chat model. If we shift the split to the right or to the left, we obtain incomplete (fragmented) sentences, which represent unfinished instructions or continuations of instructions followed by responses. To summarize, we can get (a minimal code sketch of this split procedure follows the list):

• Inference inputs:

  I - Instruction
  Ip - Partial (fragmented) instruction

• Inference outputs:

  Ic - Continuation of the instruction
  R - Response
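The snippet below is an illustrative reconstruction of this split procedure, not the authors' released code; the single random word-boundary cut and the helper name are assumptions made for the sake of the example.

import random

def ifs_fragments(instruction: str, response: str, rng: random.Random):
    """Derive the IFS fragments from one (instruction, response) chat pair:
    I  - full instruction; Ip - partial (fragmented) instruction;
    Ic - continuation of the instruction (the words cut off from Ip);
    R  - original response.
    """
    words = instruction.split()
    cut = rng.randint(1, max(1, len(words) - 1))  # split at a word boundary
    partial_i = " ".join(words[:cut])
    continuation = " ".join(words[cut:])
    # Instructions dataset: prompts later fed to the evaluated model.
    instructions = [(instruction, "full"), (partial_i, "partial")]
    # Responses dataset: training data for the response tone classifier,
    # label 1 for answer-like (R), label 0 for continuation-like (Ic, Ic + R).
    responses = [(response, 1), (continuation, 0), (continuation + " " + response, 0)]
    return instructions, responses

instructions, responses = ifs_fragments(
    "What if people had 40 legs?", "If people had 40 legs, ...", random.Random(0))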

In fact, combinations of those four parts give all possible pairs of inputs and outputs for vanilla and chat models. Below we recombine the parts and assign each pair a binary score depending on whether the model responds like a chat model.

(I, R) The response R for instruction I is conversational. A model whose responses all resemble this form would be instruction following, so the response has label 1.

(Ip, R) The response R for the partial instruction Ip is also conversational, but in this case the model does not have enough context to provide any answer except a request for more information. This response is also labeled 1.

(Ip, Ic) The model completes the fragmented instruction (executing the next-word prediction task). The pair does not look like a conversation, so the label is 0.

(I, Ic) The model generates further instructions (again similar to next-word prediction), which gives the response label 0.

(Ip, Ic+R) The model completes the instruction and then replies (also executing next-word prediction). Although one might imagine people attempting such a dialogue, we treat instruction completion as a sign of a failed conversation. The label is 0.

(I, Ic+R) The model generates another instruction and then replies to its own generation. The dialogue fails, giving the response label 0.

Examples for each case are shown in Table 1.
Case        Example                                                    chat?

(I, R)      I: What if people had 40 legs?                             1
            R: If people had 40 legs, they'd be human centipedes on
            the go, setting world records in races and always winning
            at Twister!

(Ip, R)     Ip: What if                                                1
            R: It seems like your question is incomplete. Please
            provide more context or details so I can better
            understand and answer your question.

(Ip, Ic)    Ip: What if                                                0
            Ic: people had 40 legs?

(I, Ic)     I: What if people had 40 legs?                             0
            Ic: What if people had 3 eyes?

(Ip, Ic+R)  Ip: What if                                                0
            Ic+R: people had 40 legs? If people had 40 legs, they'd
            be human centipedes on the go, setting world records in
            races and always winning at Twister!

(I, Ic+R)   I: What if people had 40 legs?                             0
            Ic+R: What if people had 3 eyes? If people had 3 eyes,
            sunglasses would come in trendy trinocular styles and
            "I've got my eye on you" would be a whole new level of
            surveillance.

Table 1: Examples of possible combinations of the fragments I, R, Ip, Ic. The tone score indicates whether the model follows the instruction (1) or not (0).

In summary, among the six potential combinations, only two instruct-model cases exist: (Ip, R) and (I, R). With this classification established, we can now create the set of instructions and corresponding model responses.

We split pairs coming from all perfect and shifted cuts, and create two datasets: all instructions and all responses. The set of instructions is used to generate data for prompting models, while the set of responses is used to generate data for the binary classifier. Figure 2 shows how the chat data is split and used in our experiment.

As a source of clean text, we utilized the OpenAssistant chat dataset (Köpf et al. 2023). To control the context of the conversation, we only considered the first instruction and its corresponding response from each dialogue.

3.2.1 Instructions dataset

In the instruction dataset, data points consist of instructions sourced from OpenAssistant data, either unmodified (I) or fragmented (Ip). We obtained a total of 7340 examples, with an approximately 50% split between fragments and complete sentences. We recognise that the algorithm may potentially generate complete sentences labeled as fragmented, making the score split based on this label a rough estimate.

Table 2 shows examples of full and partial instructions.

Instruction                                           Label
What is the difference between HTML                   partial
What is the difference between HTML and JavaScript?   full
Who wears                                             partial
Who wears short shorts?                               full

Table 2: Examples of instructions and their category.

3.2.2 Responses dataset

The set of responses represents the right side of Fig. 1, i.e., original responses or responses shifted to the right. The collected classes are:

Label 0: Ic, Ic+R
Label 1: R

We drop the fine-grained classification of responses and assign them only to "answer-like" (label 1) or "continuation-like" (label 0). These samples are later used to train the binary classifier. Table 3 shows examples of responses and their labels.
Response                                                           chat?
it fly so fast? The fastest flying bird is the peregrine falcon.   0
agent? I'm not a FBI agent.                                        0
When onions are cut, they release a chemical called sulfuric       1
acid.
James Madison was the primary author of the Constitution and      1
the Bill of Rights.

Table 3: Examples of responses and their categories.

4 Binary classifier and Instruction Following Score

The binary classifier for response tone classification was chosen as the best-performing binary classifier trained on the set of responses using Huggingface AutoTrain (Huggingface 2023a). Since the dataset consisted of a roughly equal split of negative and positive samples, we chose accuracy as the comparison metric. The winning architecture was BertForSequenceClassification, and the final classifier metrics (as reported by AutoTrain) are presented in Table 4.

Metric      Value
Accuracy    0.970
Precision   0.983
Recall      0.925

Table 4: Validation metrics.
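AutoTrain selects and trains the model without code, so the exact training configuration is not reported. The sketch below is a rough manual equivalent using the transformers Trainer; the base checkpoint, hyperparameters, and the responses.csv file layout are assumptions for illustration, not the authors' setup.

# Minimal sketch of a response tone classifier (assumed hyperparameters).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# responses.csv is assumed to have two columns: "text" and "label" (0/1).
dataset = load_dataset("csv", data_files="responses.csv")["train"]
dataset = dataset.train_test_split(test_size=0.1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="tone-classifier", num_train_epochs=3,
                         per_device_train_batch_size=16,
                         evaluation_strategy="epoch")
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["test"])
trainer.train()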
We define the Instruction Following Score (IFS) as the ratio of all responses classified as "answer-like" (label 1) to all responses obtained by prompting the instructions dataset. A perfect instruction-tuned model should always maintain a conversational tone (i.e. respond like a chat model to all instructions, whether partial or full), so the maximum IFS is 1. We can additionally define two related metrics, IFS_partial and IFS_full, being the ratio of "answer-like" responses to all partial and full instructions respectively.

In the following sections, we will use IFS to evaluate vanilla models as well as the response tone changes achieved by prompt engineering and an SFT process.
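In code, the metric reduces to ratios of classifier positives. The sketch below assumes a classify_tone helper wrapping the trained response tone classifier (returning 1 for "answer-like" outputs); the function and argument names are illustrative, not part of the paper.

def instruction_following_score(model_outputs, is_partial, classify_tone):
    """Compute IFS, IFS_partial and IFS_full.

    model_outputs : responses generated for the instructions dataset
    is_partial    : parallel list of booleans, True if the prompt was a fragment (Ip)
    classify_tone : binary classifier, returns 1 for "answer-like" responses
    """
    labels = [classify_tone(out) for out in model_outputs]

    def ratio(selected):
        kept = [label for label, keep in zip(labels, selected) if keep]
        return sum(kept) / max(1, len(kept))

    ifs = sum(labels) / len(labels)
    ifs_partial = ratio(is_partial)
    ifs_full = ratio([not p for p in is_partial])
    return ifs, ifs_partial, ifs_full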
5 Results

5.1 Baseline

We used the IFS metric to evaluate several publicly available models. Since the dataset consists of less than 50% fragmented instructions (including false positives generated by the algorithm), we expected a base model to obtain an IFS below this level when prompted without additional affixes. Scores for SFT and RLHF models presented in Table 5 show that the expected maximum is around 0.8-0.9, whereas the most prominent difference between base and instruction-following LLMs is the relative difference between IFS_partial and IFS_full.

Model              IFS    IFS_partial   IFS_full
GPT-2              0.68   0.67          0.7
RedPajama-3B       0.33   0.17          0.49
LLaMA-7B           0.34   0.19          0.5
LLaMA-13B          0.81   0.79          0.82
LLaMA-33B          0.74   0.68          0.81
davinci            0.29   0.17          0.42
Palmyra-x          0.68   0.45          0.91
Palmyra-base       0.32   0.17          0.48
Palmyra-large      0.32   0.17          0.47
text-davinci-003   0.62   0.37          0.88
GPT-3.5-turbo      0.9    0.83          0.97
GPT-4              0.88   0.8           0.97
Palmyra-instruct   0.61   0.36          0.86

Table 5: Baseline: Instruction Following Score (IFS) for selected publicly available models.

5.2 Prompt engineering

A very simple method to encourage LMs to follow instructions is to add extra prompt suffixes or wrappers around instructions, which can disrupt the next-token prediction task and produce responses. Figure 3 presents three versions of prompts:

Figure 3: Comparative illustration of instruction tuning prompts. A: Alpaca prompt, a wrapper around the instruction; B: only the Alpaca suffix; C: no prompt, the baseline.
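For concreteness, the three variants could be written as the templates below. Variant A quotes the public Alpaca template; the exact strings used for variants B and C in Figure 3 are not reproduced in the text, so these are assumptions.

# A. Full Alpaca wrapper around the instruction (public Alpaca template).
ALPACA_WRAPPER = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)
# B. Alpaca suffix only (assumed form).
ALPACA_SUFFIX = "{instruction}\n\n### Response:"
# C. No prompt: the raw instruction is fed to the model (baseline).
NO_PROMPT = "{instruction}"

prompt = ALPACA_WRAPPER.format(instruction="What if people had 40 legs?")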

Figure 2: IFS training and evaluation pipeline.

The results presented in Table 6 show that variants of both prompts are equally effective. Comparing them with the baseline (C), we see that for all models the improvement in IFS is in the range 0.5-0.6. It turns out that for Large Language Models (LLMs) a single prompt change can effectively encourage models to follow instructions, reaching performance levels comparable to several publicly available instruct models. We did not test n-shot prompting, which could possibly improve results further.

Model            IFS    IFS_partial   IFS_full
LLaMA-7B (A)     0.74   0.71          0.77
LLaMA-7B (B)     0.75   0.73          0.78
LLaMA-7B (C)     0.34   0.19          0.5
LLaMA-13B (A)    0.81   0.74          0.88
LLaMA-13B (B)    0.81   0.79          0.82
LLaMA-13B (C)    0.31   0.18          0.43
LLaMA-33B (A)    0.87   0.85          0.89
LLaMA-33B (B)    0.74   0.68          0.81
LLaMA-33B (C)    0.33   0.18          0.47

Table 6: Instruction Following Score (IFS) for models with and without prompt suffixes (A, B, C denote the prompt variants from Figure 3).

5.3 Supervised finetuning

In this study, we opted for 7B and 13B LLaMA models as the base LLMs for SFT. To ensure comparability of results, we followed the same training procedure and evaluation.

We used the gpt4all v1.3-groovy dataset introduced in Anand et al. 2023 as the instruct dataset. We set the character limit to 2k (similar to the LLaMA models' pretraining objective, which used a 512-token length). Through this filtering process, we obtained approximately 410k examples for the instruct tuning.

Models were trained with the modified Alpaca prompt:

PROMPT_DICT = {
    "prompt_input": ("{instruction}\n\n{input}### Response:"),
    "prompt_no_input": ("{instruction}### Response:"),
}

The modification integrates the instruction and the optional input while eliminating the prefix prompt. This approach is consistent with how user interfaces for chat models are typically implemented, i.e., as a single dialog input box. We could have used the full Alpaca wrapper, but since both prompting techniques lead to similar scores, we chose the shorter one for efficiency reasons.
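As a usage illustration of the PROMPT_DICT above (the example instruction and input values are hypothetical):

example = {"instruction": "Name three uses of IFS.", "input": ""}

template = (PROMPT_DICT["prompt_input"] if example["input"]
            else PROMPT_DICT["prompt_no_input"])
prompt = template.format(**example)
# -> "Name three uses of IFS.### Response:"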
Results of SFT are shown in Figure 4(a). We see that the models' instruction-tuning capabilities stabilize at a level of 0.9-0.95 after seeing approximately 8k examples (marked as a dashed line). We will refer to this training phase as the "format-infusion" phase. As a side note, we observe that bigger models might reach the 0.9 IFS level relatively faster (as far as we can infer from a two-point experiment), which speaks in favor of the good results reported for SFT of 65B LLaMA on 1k examples (Zhou et al. 2023).
In order to contrast tone changes with the semantic shifts in model responses that may occur during SFT, we looked for a feature that could be acquired while observing chat examples. Since it is difficult to estimate what features can be learned from the gpt4all v1.3-groovy dataset without a detailed inspection, we aimed for a (successful) guess: "objectiveness." We expect the model not to possess human-like preferences (e.g., "cats" or "dogs") because: (a) it has been trained on instructions modelling an AI giving universal recommendations; and/or (b) it has seen many examples with different answers to similar questions, with objectivity as an emergent property (Wei et al. 2022).

We propose an ObjecQA benchmark that consists of 100 questions that involve subjective choices or preferences. A high-scoring model in ObjecQA should present a range of possibilities or avoid direct answers (e.g., "it depends on preferences").

The first 10 examples of subjective questions from ObjecQA:

1. Which is better, chocolate or vanilla ice cream?
2. Is coffee superior to tea, or is tea better than coffee?
3. Are cats or dogs the ultimate pet?
4. Do you prefer the beach or the mountains for a vacation?
5. Would you rather live in a bustling city or a quiet countryside?
6. Are e-books or physical books the superior reading format?
7. Is it better to watch a movie or read a book?
8. Which type of music is the best: classical, pop, rock, or jazz?
9. Are sunrises or sunsets more breathtaking?
10. In your opinion, is winter or summer the preferred season?

We employed GPT-3.5-turbo for the semantic categorization of model outputs, utilizing a two-shot prediction approach in all instances. We used the following prompt:

"Classify the below responses as subjective opinions, preferences or objective. The subjective response will choose an option when asked to pick best or will voice an opinion about a disputable topic. The objective opinion will try to show the full scope of possible answers, defer to the lack of context or simply reject to make one definite choice.

Response: I prefer the thrill of riding a roller coaster.
Class: Subjective

Response: It depends on the situation. In some cases, practicality is more important, while in others, fun is more important.
Class: Objective

Response: "
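The categorization step can be scripted against the OpenAI API. The snippet below is a sketch rather than the authors' pipeline: it assumes the classification prompt above is stored in objecqa_prompt, uses the current openai Python client (not necessarily the client version used at the time), and takes the ObjecQA score to be the fraction of responses labeled objective.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def objecqa_score(model_responses, objecqa_prompt):
    """Fraction of responses that GPT-3.5-turbo labels as objective."""
    objective = 0
    for response in model_responses:
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[{"role": "user",
                       "content": objecqa_prompt + response + '"\nClass:'}],
        )
        label = completion.choices[0].message.content.strip().lower()
        objective += label.startswith("objective")
    return objective / len(model_responses)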
The results of ObjecQA scores during SFT are shown in Figure 4(b). We observe that the progression of scores is similar for both models, and most of the learning process occurs after the black line marker (approx. 8k examples). We call this phase the "knowledge-infusion" phase. One striking insight is that the most significant semantic shift (knowledge infusion) occurs exactly after the formatting shift (the format-infusion phase). (Since all queries from ObjecQA are full sentences, we expect LLaMA base models to be able to provide the answer also as a next-token prediction task.) Moreover, the models' ObjecQA scores continue to grow long after the IFS plateaus. This observation implies that for this combination of features (IFS and ObjecQA), both the LLaMA 7B and 13B models, when trained on the selected dataset, exhibit disjoint format-infusion and knowledge-infusion phases. In theory, one could minimize the impact of the semantic shift by applying an early stopping criterion; a sketch of one such criterion is given below. We can imagine different learning dynamics, ranging from those behind simple features (with overlapping phases) to very complex and spread-out factors. On the other hand, a model with a relatively high IFS can be a good starting point for chat models. If we combine chat abilities with a minimized impact of the SFT stage, we see that "tone-instruct" models might be an interface for querying pretraining-stage knowledge.
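As a hedged illustration of such a criterion (a plausible realization of the idea, not a procedure specified in the paper), one could stop SFT once IFS has reached a target level and has stopped improving over a patience window:

def ifs_early_stop(ifs_history, patience=3, min_delta=0.01, target=0.9):
    """Stop SFT once IFS has reached `target` and has not improved by more
    than `min_delta` over the last `patience` evaluations."""
    if len(ifs_history) <= patience or ifs_history[-1] < target:
        return False
    return (ifs_history[-1] - ifs_history[-(patience + 1)]) < min_delta

# Example: IFS measured at regular checkpoints during SFT (illustrative values)
history = [0.35, 0.61, 0.82, 0.90, 0.93, 0.93, 0.93, 0.93]
stop_now = ifs_early_stop(history)  # True: IFS has plateaued above 0.9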

6 Conclusion and Future Work

In conclusion, the Instruction Following Score (IFS) was introduced as a metric to detect language models' ability to follow instructions. Benchmarks of a range of publicly available models show that there is a significant gap between base models and instruct-tuned models, but there is no clear gap between SFT and RLHF models.
Figure 4: (a) IFS characteristics for the 7B and 13B LLaMA models during SFT. High values of IFS mean that the model follows instructions. (b) ObjecQA for the 7B and 13B LLaMA models during SFT. Models with no strong preferences (of the "cats or dogs" type) score higher.

IFS evaluation of an SFT process for LLaMA 7B and 13B shows that the instruction tone is learned relatively early. The supplementary metric ObjecQA was proposed to contrast the tone learning curve with the acquisition of semantic and domain-specific knowledge. Key results show that the inspected models' instruction-tuning capabilities (the format-infusion phase) plateau at 0.9-0.95 after seeing approximately 8k examples, which is where we observe the semantic shift (the knowledge-infusion phase). Bigger models reached the 0.9 IFS level relatively faster, and a high IFS was attained early in the process, enabling minimal semantic changes by reducing the number of sample points required for learning the style.

For future work, research should focus on composable feature blocks that can be applied to foundation models to achieve desired alignment aspects, such as helpfulness, formality, or strict formats, without unexpected downgrades in upstream tasks or semantic shifts. The response tone classifier developed in this study serves as a starting point for the concept of designing chat interfaces for foundation models.

References
Taori, Rohan et al. (2023). Stanford Alpaca: An Instruction-following LLaMA model. https://
github.com/tatsu-lab/stanford_alpaca.
Wang, Yizhong et al. (2023). Self-Instruct: Aligning Language Models with Self-Generated Instruc-
tions. arXiv: 2212.10560 [cs.CL].
Longpre, Shayne et al. (2023). The Flan Collection: Designing Data and Methods for Effective
Instruction Tuning. arXiv: 2301.13688 [cs.AI].
Zhou, Chunting et al. (2023). LIMA: Less Is More for Alignment. arXiv: 2305.11206 [cs.CL].
Anand, Yuvanesh et al. (2023). GPT4All: Training an Assistant-style Chatbot with Large Scale Data
Distillation from GPT-3.5-Turbo. https://github.com/nomic-ai/gpt4all.
Touvron, Hugo et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:
2302.13971 [cs.CL].
Zhang, Susan et al. (2022). OPT: Open Pre-trained Transformer Language Models. arXiv: 2205.01068 [cs.CL].
Gao, Leo et al. (2020). “The Pile: An 800GB Dataset of Diverse Text for Language Modeling”. In:
arXiv preprint arXiv:2101.00027.
Writer (2023). Palmyra LLMs empower secure, enterprise-grade generative AI for business. Writer
Blog. URL: https://writer.com/blog/palmyra/.
Gudibande, Arnav et al. (2023). The False Promise of Imitating Proprietary LLMs. arXiv: 2305.15717 [cs.CL].
OpenAI (2022). ChatGPT: Optimizing language models for dialogue. URL: https://online-chatgpt.com/.

Pichai, Sundar (2023). An important next step on our AI journey. Google AI Blog. URL: https://blog.google/intl/en-africa/products/explore-get-answers/an-important-next-step-on-our-ai-journey/.
AnthropicAI (2023). Introducing Claude. URL: https://www.anthropic.com/index/introducing-claude.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean (2015). Distilling the Knowledge in a Neural Network.
arXiv: 1503.02531 [stat.ML].
Liang, Percy et al. (2022). Holistic Evaluation of Language Models. arXiv: 2211.09110 [cs.CL].
Kwiatkowski, Tom et al. (2019). “Natural Questions: a Benchmark for Question Answering Research”.
In: Transactions of the Association of Computational Linguistics.
Huggingface (2023b). Open LLM Leaderboard. Accessed: 2023-06-10. URL: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
Gao, Leo et al. (Sept. 2021). A framework for few-shot language model evaluation. Version v0.0.1. DOI: 10.5281/zenodo.5371628. URL: https://doi.org/10.5281/zenodo.5371628.
Chen, Hao et al. (2023). Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low
Training Data Instruction Tuning. arXiv: 2305.09246 [cs.AI].
Kumar, Ananya et al. (2022). Fine-Tuning can Distort Pretrained Features and Underperform
Out-of-Distribution. arXiv: 2202.10054 [cs.LG].
Brown, Tom B. et al. (2020). Language Models are Few-Shot Learners. arXiv: 2005.14165 [cs.CL].
Radford, Alec et al. (2018). "Language Models are Unsupervised Multitask Learners". In: URL: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
Ouyang, Long et al. (2022). Training language models to follow instructions with human feedback.
arXiv: 2203.02155 [cs.CL].
Schulman, John et al. (2017). Proximal Policy Optimization Algorithms. arXiv: 1707.06347 [cs.LG].
Hendrycks, Dan et al. (2021). Measuring Massive Multitask Language Understanding. arXiv: 2009.03300 [cs.CY].
Köpf, Andreas et al. (2023). OpenAssistant Conversations – Democratizing Large Language Model
Alignment. arXiv: 2304.07327 [cs.CL].
Huggingface (2023a). AutoTrain: Create powerful AI models without code. URL: https://huggingface.co/autotrain.
Wei, Jason et al. (2022). Emergent Abilities of Large Language Models. arXiv: 2206.07682 [cs.CL].
