Parameter Efficient Instruction Tuning:
An Empirical Study

Pengfei He
AdaBit AI

Abstract

Instruction tuning [27] [25] [16] [24] has become an important step in finetuning pretrained language models to better follow human instructions and generalize across tasks. As pretrained language models [2] grow ever larger, full-parameter finetuning becomes prohibitively costly. Parameter-Efficient Finetuning (PEFT) [3] [14] [6] has therefore emerged as a cost-effective practice for instruction tuning because it requires significantly less computation, memory, and storage than full finetuning. Despite the widespread adoption of PEFT methods, the vast hyperparameter space, the number of PEFT methods, and the differing capabilities targeted by instruction tuning make it difficult to disentangle the impact of each aspect. This study systematically investigates several representative PEFT methods. We survey the effect of hyperparameter choices, including training hyperparameters and PEFT-specific hyperparameters; how model size and the number of instruction tasks affect performance; and in-task-distribution memorization and open instruction-following capability [23]. Our empirical study shows that only LoRA [8] and adapter [7] come close to full finetuning, and only under ideal training settings: an appropriate learning rate, the largest LoRA rank or adapter size allowed, and a diverse set of training tasks. When these conditions are not met, LoRA and adapter suffer from training instability. Additionally, LoRA requires a greater number of tasks for effective generalization to unseen tasks and learns more slowly. Moreover, LoRA has weaker task-level memorization. Lastly, in open instruction tuning settings, both LoRA and adapter fall short of finetuning in complex reasoning, coding, and long-form generation, although LoRA shows stronger capabilities than adapter.
We hope our work can guide practitioners through PEFT optimization and provide insight for further research into which aspects of these methods to improve in instruction tuning scenarios.

1 Introduction

Recent studies demonstrate that instruction tuning, which trains language models on instruction-output pairs, can enhance a model’s ability to comprehend and follow human instructions. Efforts are also underway to compile diverse mixtures of high-quality instruction tuning data [22] [23] [9] [28] so that language models generalize better on downstream tasks and align better with user intent. Consequently, instruction tuning has become a standard method for aligning LLMs closely with human instructions. However, the vast number of parameters in LLMs often puts traditional finetuning out of reach because of the high training, storage, and memory costs it incurs. PEFT has recently demonstrated remarkable success in addressing this concern. This is primarily due to its cost-efficiency: PEFT trains only a fraction of the model’s parameters, which results in much lower training memory and storage requirements.

Adapter [7], a pioneering PEFT method, is a bottleneck network inserted between layers of a frozen pretrained model. LoRA [8] trains a low-rank weight matrix that is added to each selected weight matrix independently within the transformer layer. Zaken et al. [26] propose finetuning only the biases of the neural network. Inspired by the effectiveness of text prompting in directing LLMs, Prompt Tuning [10] and Prefix Tuning [11] concatenate a sequence of soft tokens (continuous vectors) to the model input or activations and train only these soft tokens.
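To make the two most relevant designs concrete, the following is a minimal sketch in plain Python of a LoRA-augmented linear map and a bottleneck adapter. It is an illustration under stated assumptions rather than the implementation used in this study: the dimensions are hypothetical, biases and layer normalization are omitted, and the zero initialization of the adapter up-projection is chosen only so that both modules start as an identity modification of the frozen model.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen weight W plus a low-rank update: W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter with a residual connection: h + W_up relu(W_down h)."""
    z = [max(0.0, v) for v in matvec(W_down, h)]
    return [h_i + u_i for h_i, u_i in zip(h, matvec(W_up, z))]

random.seed(0)
d, r = 4, 2  # hypothetical hidden size and rank / bottleneck size
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, trainable
B = [[0.0] * r for _ in range(d)]                                  # d x r, zero init
x = [1.0, -2.0, 0.5, 3.0]

y_lora = lora_forward(W, A, B, x, alpha=2, r=r)

W_down = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]
W_up = [[0.0] * r for _ in range(d)]  # zero init is an assumption for illustration
y_adapter = adapter_forward(x, W_down, W_up)
```

With `B` initialized to zero, the LoRA branch contributes nothing at the start of training, so `y_lora` equals the frozen output `W x`. The adapter differs structurally in that it places a nonlinearity between its two projections.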

In light of these, our study begins by identifying effective PEFT methods. To this end, we train on the expert-written instruction tuning dataset SuperNI [22] and conduct a series of experiments that vary hyperparameters, data sizes, data distribution, and model sizes. To explore the effectiveness of PEFT further, we extend our training and evaluation to the more challenging Tülu [23] setup, which consists of more diverse instruction tuning tasks and a more comprehensive evaluation covering a range of complex model capabilities (i.e., factual knowledge, reasoning, multilinguality, coding) and open-ended instruction-following abilities.

Our key findings are as follows:

1. We identify LoRA and adapter as the most effective PEFT methods for instruction tuning (Sec. 3.1). Greater LoRA ranks and adapter sizes help improve performance (Sec. 3.2).
2. We find that LoRA and adapter training exhibits some instability compared to finetuning, and that this instability correlates with greater rank and higher learning rate (Sec. 3.3).
3. We validate that PEFTs with larger base models consistently improve the final performance (Sec. 3.4).
4. We show that both LoRA and adapter have weaker generalization capability in low-data instruction tuning settings (Sec. 3.5), and that LoRA shows weaker task-level memorization than both adapter and finetuning (Sec. 3.6).
5. For open instruction tuning, LoRA has generally better multifaceted capabilities than adapter, which makes it an ideal alternative to finetuning. However, both adapter and LoRA demonstrate weak coding, complex reasoning, and long-form generation capabilities (Sec. 3.7).

2 Experiment Setups

2.1 Setup 1: T5 finetuning on SuperNI

In the first experiment, our goal is to identify the instruction-following capabilities of PEFTs. We limit the hyperparameter search space to learning rates and PEFT-specific hyperparameters only. We choose SuperNI [22] as our dataset for two reasons. First, it comprises a large number of different tasks, offering an ideal testbed for the cross-task generalization capabilities of LLMs. Second, it is considerably smaller than some of the latest, larger instruction data mixtures, which significantly reduces per-experiment runtime and thus overall runtime. For the first setup, we use the T5-3B "LM-adapted" language model [18], which is further trained with a language modeling objective.

In our standard setup, the full dataset comprises 707 training tasks and 50 validation tasks, all randomly selected from the original SuperNI [22] training tasks. The test data comprises 119 unseen English tasks, following the original setup. For each task, we sample 100 instances. To format the data examples as instruction data, we concatenate the task definition and the task input for each example. Starting from this standard setup, we vary model sizes, data sizes, and PEFT methods for ablation studies on one or more variables.
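The data preparation described above can be sketched as follows. This is only an illustration: the record layout, the field names, and the blank-line separator between definition and input are assumptions, not the SuperNI schema or the exact template used in the experiments.

```python
import random

def format_example(definition, task_input):
    """Concatenate the task definition and the task input into one prompt."""
    return f"{definition.strip()}\n\n{task_input.strip()}"

def sample_instances(instances, k=100, seed=0):
    """Sample up to k instances from one task without replacement."""
    rng = random.Random(seed)
    return rng.sample(instances, min(k, len(instances)))

# Hypothetical task record; the field names are illustrative only.
task = {
    "definition": "Translate the sentence to French.",
    "instances": [
        {"input": f"sentence {i}", "output": f"phrase {i}"} for i in range(250)
    ],
}

examples = [
    {
        "source": format_example(task["definition"], inst["input"]),
        "target": inst["output"],
    }
    for inst in sample_instances(task["instances"])
]
```

Fixing the seed keeps the sampled subset identical across runs, which matters when several PEFT methods are compared on the same data.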

Each PEFT method is trained for 4 epochs to ensure adequate training. For PEFTs, we use the AdamW optimizer, paired with a linear learning rate scheduler and a warm-up phase that comprises 3% of the total training steps. For finetuning, we use a constant learning rate scheduler, keeping the same training setting as Tk-Instruct [22]. We use the RougeL [13] score as our evaluation metric, and the best-performing checkpoint is selected for evaluation on the test set. Overall, our first experiment design balances the ability to test effectiveness and identify patterns across PEFTs against the efficiency of running experiments at scale.
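As a rough illustration of the PEFT schedule, the sketch below computes a learning rate that warms up linearly over the first 3% of steps and then decays linearly. The decay to exactly zero and the total step count are assumptions; the text above specifies only a linear scheduler with a 3% warm-up.

```python
def linear_schedule_with_warmup(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Learning rate at `step`: linear warm-up to peak_lr, then linear decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total = 1000  # hypothetical number of optimizer steps
lrs = [linear_schedule_with_warmup(s, total, peak_lr=1e-4) for s in range(total + 1)]
peak_step = max(range(len(lrs)), key=lrs.__getitem__)
```

With these values the peak is reached at step 30, the end of the warm-up phase.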

Given the heterogeneous nature of PEFT methods, the ratios of their trainable parameters span a wide range, and their PEFT-specific hyperparameters vary with their design space. The search ranges for PEFT-specific hyperparameters and learning rates are listed in Table 1.
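The search ranges in Table 1 can be enumerated programmatically. The sketch below assumes a full Cartesian grid per method, uses illustrative key names, and represents the "null" reparameterization setting of prefix tuning as `None`.

```python
from itertools import product

LEARNING_RATES = [1e-5, 5e-5, 1e-4, 5e-4, 1e-3]
SIZES = [8, 32, 64, 128, 256, 512]

# Search ranges copied from Table 1; the key names are illustrative.
search_space = {
    "lora": {"rank": SIZES, "lr": LEARNING_RATES},
    "adapter": {"size": SIZES, "lr": LEARNING_RATES},
    "prefix_tuning": {
        "prefix_length": SIZES,
        "mlp_dim": [None, 256, 512, 1024],
        "lr": LEARNING_RATES,
    },
    "prompt_tuning": {"prompt_length": [8, 32, 128, 256], "lr": LEARNING_RATES},
    "bitfit": {"lr": LEARNING_RATES},
    "full_finetuning": {"lr": LEARNING_RATES},
}

def expand(space):
    """Enumerate every hyperparameter combination in one method's grid."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]

grids = {method: expand(space) for method, space in search_space.items()}
num_configs = {method: len(configs) for method, configs in grids.items()}
```

Under the full-grid assumption, LoRA and adapter each have 30 configurations, while the extra reparameterization dimension gives prefix tuning 120.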

2.2 Setup 2: LLaMa-2 finetuning on Tülu datasets

With the selected PEFTs and the best hyperparameters found in the first setup, we compare PEFTs against finetuning on a modern LLM in more comprehensive instruction training and evaluation settings. For this purpose, we use the LLaMa-2 [20] language model as our base model. We leverage the open instruction datasets of Tülu [23], which comprise both human-generated and GPT-generated instruction tuning data, and we follow the same multi-faceted evaluation setup covering factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following.

We train the selected PEFT methods for 3 epochs with a maximum input length of 2048, consistent with the finetuning setting from Tülu [23]. Refer to Sec. C for detailed training hyperparameters.

3 Empirical Findings

We derive Findings 1 through 6 from Setup 1 (Sec. 2.1) and Finding 7 from Setup 2 (Sec. 2.2). Findings 3 through 6 provide a deeper exploration of LoRA.

3.1 Finding 1: LoRA and adapter are effective for instruction tuning.

Table 1: Experiments containing the grid search on PEFT hyperparameters and the best results across all hyperparameter combinations. $r$ is the LoRA rank, $\eta$ is the learning rate, $l$ is the prefix/prompt length, $h$ is the reparameterized MLP dimension. For each PEFT method, best performance is reported as the best RougeL score on SuperNI across all hyperparameter combinations. Additionally, we list the ratio between the trainable PEFT parameters and the model frozen parameters corresponding to the best performance setting.

Method	Searched hparams	Best hparams	% trainable	Best Perf.
LoRA	$r\in\{8,32,64,128,256,512\}$ , $\eta\in\{1\text{e-}5,5\text{e-}5,1\text{e-}4,5\text{e-}4,1\text{e-}3\}$	$r$ =512, $\eta$ =1e-4	9.6%	47.1
Adapter	$s\in\{8,32,64,128,256,512\}$ , $\eta\in\{1\text{e-}5,5\text{e-}5,1\text{e-}4,5\text{e-}4,1\text{e-}3\}$	$s$ =512, $\eta$ =1e-4	6.6%	46.7
Prefix Tuning	$l\in\{8,32,64,128,256,512\}$ , $h\in\{null,256,512,1024\}$ , $\eta\in\{1\text{e-}5,5\text{e-}5,1\text{e-}4,5\text{e-}4,1\text{e-}3\}$	$l$ =512, $h$ =256, $\eta$ =1e-4	0.9%	41.6
Prompt Tuning	$l\in\{8,32,128,256\}$ , $\eta\in\{1\text{e-}5,5\text{e-}5,1\text{e-}4,5\text{e-}4,1\text{e-}3\}$	$l$ =8, $\eta$ =5e-5	0.001%	16.3
BitFit	$\eta\in\{1\text{e-}5,5\text{e-}5,1\text{e-}4,5\text{e-}4,1\text{e-}3\}$	$\eta$ =5e-4	0.04%	41.5
Full Finetuning	$\eta\in\{1\text{e-}5,5\text{e-}5,1\text{e-}4,5\text{e-}4,1\text{e-}3\}$	$\eta$ =1e-5	100%	47.8

Our extensive hyperparameter search reveals that, among the five PEFTs, only LoRA and adapter come close to full finetuning in instruction tuning settings (see Table 1). Prompt tuning shows no effective learning, which we attribute to the difficulty of optimizing soft prompts, as reported in other work [8], and to the cross-task nature of instruction tuning. Prefix tuning shows some learning but still significantly underperforms finetuning. This observation is in line with recent theoretical analyses [17], which suggest that prefix tuning and prompt tuning are less expressive than finetuning. Prefix tuning also has a reparameterization trick that parameterizes the prefix matrix through an MLP, and our experimental results indicate that this reparameterization improves training stability. Lastly, BitFit also falls short of finetuning, and we believe that tuning biases alone likewise has limited expressivity.

The detailed experimental results for each PEFT are provided in Sec. E.

3.2 Finding 2: More PEFT trainable parameters yield better performance.

We investigate the impact of LoRA rank and adapter size in instruction tuning. As indicated by Fig. 1(a), a higher LoRA rank or adapter size consistently yields better performance when coupled with the optimal learning rate. This finding aligns with the general principle in machine learning that an increase in the number of parameters correlates with enhanced model capacity and improved performance. Also, given the task diversity inherent in instruction tuning, more tasks coupled with more trainable PEFT parameters do improve model performance. However, the performance gain from a higher rank tends to diminish, even though each step increases the rank by at least a factor of two. See detailed results in Table 4 and Table 5.
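To relate rank to parameter count, recall that LoRA adds two factors per wrapped weight matrix, so the number of trainable parameters grows linearly with the rank. The sketch below uses hypothetical dimensions chosen for illustration; they are not the T5-3B configuration.

```python
def lora_trainable_params(rank, d_in, d_out, num_matrices):
    """Each wrapped d_out x d_in matrix gains A (rank x d_in) and B (d_out x rank)."""
    return num_matrices * rank * (d_in + d_out)

# Hypothetical dimensions for illustration only.
d_model, num_wrapped = 1024, 48
ranks = [8, 32, 64, 128, 256, 512]
counts = {r: lora_trainable_params(r, d_model, d_model, num_wrapped) for r in ranks}
growth = {r: counts[r] // counts[8] for r in ranks}
```

Going from rank 8 to rank 512 multiplies the trainable parameters by 64, while the corresponding score gain reported in Table 4 is only a few RougeL points, which is the diminishing return noted above.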

Figure 1(a): Performance of different LoRA ranks and adapter sizes. Each point represents the score averaged across three runs with different random seeds.

3.3 Finding 3: Higher LoRA rank is more sensitive to the learning rate

As Fig. 1(b) indicates, a learning rate that is too small can cause underfitting for the same number of training epochs, while a high learning rate can cause training instability. If we also take rank into consideration, a lower rank tolerates a high learning rate better. For example, at rank 8, a learning rate of 1e-3 even produces the best result for that rank. However, as the rank goes higher, a relatively lower learning rate stabilizes training and improves performance. According to the grid search results, a learning rate of 1e-4 yields the best result for ranks of 128 and above and the best result overall (see Table 4).
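The interaction between rank and learning rate can be read directly from the grid search numbers. The following sketch copies the RougeL scores from Table 4 and picks the best learning rate for each rank; the helper name is illustrative.

```python
LEARNING_RATES = [1e-5, 5e-5, 1e-4, 5e-4, 1e-3]

# RougeL scores copied from Table 4 (rows: LoRA rank, columns: learning rate).
TABLE_4 = {
    8:   [37.2, 42.3, 43.0, 44.3, 44.8],
    32:  [38.9, 41.5, 43.6, 45.5, 42.2],
    64:  [40.4, 43.8, 44.2, 44.4, 27.8],
    128: [40.8, 44.5, 45.5, 35.1, 33.3],
    256: [40.9, 44.8, 46.5, 38.0, 0.0],
    512: [42.9, 45.8, 47.1, 23.2, 0.0],
}

def best_lr(scores):
    """Return the learning rate with the highest score in one row."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return LEARNING_RATES[best_index]

best_per_rank = {rank: best_lr(scores) for rank, scores in TABLE_4.items()}
overall_best = max(
    ((rank, lr, score)
     for rank, scores in TABLE_4.items()
     for lr, score in zip(LEARNING_RATES, scores)),
    key=lambda item: item[2],
)
```

The best learning rate moves from 1e-3 at rank 8, through 5e-4 at ranks 32 and 64, down to 1e-4 at ranks 128 and above, and the overall best cell is rank 512 with learning rate 1e-4.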

3.4 Finding 4: LoRA and adapter with larger base models perform better

Both Fig. 2(a) and Fig. 2(b) illustrate that the performance of both LoRA and adapter improves consistently from T5-base to T5-3B. This trend indicates that as the capacity of the underlying model increases, finetuning with LoRA and adapter also becomes more effective. Consequently, this highlights the importance of selecting an appropriately scaled base model to maximize the benefits of LoRA and adapter.

3.5 Finding 5: LoRA underperforms in low-data setting

Although fully trained LoRA is close to or on par with full finetuning, a larger data size always benefits LoRA across ranks, as shown in Fig. 3(b). On the other hand, in many industrial applications, only a limited number of instruction tasks may be available. The important question is how many different instruction tasks each method requires before it starts to generalize. Our experiments reveal that LoRA is a "slow learner" in that it requires more tasks to ramp up cross-task generalization than adapter and finetuning (see Fig. 3(a)), and it suffers from training instability when the number of training tasks is low (see Fig. 3(b)). Therefore, in low-data instruction tuning settings, especially domain-specific ones, finetuning is still the optimal choice, and adapter could be a PEFT alternative if the slight inference latency overhead is acceptable.

3.6 Finding 6: LoRA has worse task-level memorization

In an ideal downstream tuning setting, the training instruction tasks are the same as the testing instruction tasks, which resembles a traditional instance-level split into training and test data. This raises a natural research question: "How do LoRA and adapter perform on test data that is in-distribution at the task level but out-of-distribution at the instance level?" In such scenarios, a key capability of language models is their memorization ability: how effectively they can learn and retain task-level features seen during training and perform well on tasks of the same types during testing. To assess this, we selected 100 instances from each training task type, provided there were sufficient instances outside of the training set. Our experiments indicate that LoRA demonstrates weaker task-level memorization than both adapter and finetuning. By contrast, adapter is only slightly weaker than finetuning in this respect (see Table 2). We hypothesize that the key factor in memorization capability is the number of extra parameters with nonlinearities, as pre-trained knowledge is predominantly located in the feed-forward network (FFN) layers [4] rather than the attention layers. Since LoRA adds no nonlinear layer and only the weight deltas of the query and key projection layers are tuned in our LoRA configuration, its capacity to store task-specific knowledge is reduced.
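The construction of this seen-task, unseen-instance test set can be sketched as follows. The data layout, the `id` field, and the function name are hypothetical; only the rule of sampling 100 held-out instances per training task, and skipping tasks with too few of them, comes from the description above.

```python
import random

def build_seen_task_test_set(all_instances, train_ids, k=100, seed=0):
    """For each training task, sample k instances that were not used for training.

    Tasks with fewer than k held-out instances are skipped.
    """
    rng = random.Random(seed)
    test_set = {}
    for task, instances in all_instances.items():
        used = train_ids.get(task, set())
        held_out = [inst for inst in instances if inst["id"] not in used]
        if len(held_out) >= k:
            test_set[task] = rng.sample(held_out, k)
    return test_set

# Hypothetical data: task_a has enough held-out instances, task_b does not.
all_instances = {
    "task_a": [{"id": i} for i in range(300)],
    "task_b": [{"id": i} for i in range(150)],
}
train_ids = {"task_a": set(range(100)), "task_b": set(range(100))}

test_set = build_seen_task_test_set(all_instances, train_ids)
```

Here `task_b` is dropped because only 50 of its instances lie outside the training set.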

It is worth noting that Mireshghallah et al. [15] investigate instance-level memorization in the context of extraction attacks and suggest that adapter models exhibit comparatively less memorization at this level. Combined with our finding, we consider adapter a valuable alternative when both task-level in-distribution memorization and privacy are important for downstream tasks, despite the slight inference latency.

Table 2: Performance of LoRA, adapter and fine-tuning tested on training tasks but unseen instances on SuperNI.

	Rouge-L
LoRA ( $r=512$ )	55.4
Adapter ( $s=512$ )	58.4
Fine tuning	59.7

3.7 Finding 7: PEFTs underperform in open instruction tuning and multifaceted testing

As Table 3 suggests, both LoRA and adapter perform substantially worse than finetuning on GSM (reasoning) and Codex-Eval (coding), while LoRA matches or outperforms adapter across the full range of challenging open-instruction tasks.

Table 3: Performance of LoRA ($r=512$), adapter ($s=512$) and finetuning based on LLaMa-2 7B and trained on the Tülu-1.1 data mixture. $r$ is the LoRA rank and $s$ is the adapter size.

	MMLU	GSM	BBH	TydiQA	Codex-Eval	Average
	(factuality)	(reasoning)	(reasoning)	(multilinguality)	(coding)
	EM	EM	EM	F1	P@10
	(0-shot)	(8-shot, CoT)	(3-shot, CoT)	(1-shot, GP)	(0-shot)
LoRA	49.7	29.1	43.3	52.1	19.7	38.78
Adapter	46.9	20.5	40.8	48.4	19.7	35.26
Finetuning	49.2	37.0	44.2	52.9	33.9	43.4
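As a small worked check of Table 3, the sketch below recomputes the Average column as the unweighted mean of the five benchmark scores and lists each PEFT method's gap to finetuning per benchmark. The variable names are illustrative; the scores are copied from the table.

```python
BENCHMARKS = ["MMLU", "GSM", "BBH", "TydiQA", "Codex-Eval"]

# Scores copied from Table 3, in the order of BENCHMARKS.
TABLE_3 = {
    "LoRA": [49.7, 29.1, 43.3, 52.1, 19.7],
    "Adapter": [46.9, 20.5, 40.8, 48.4, 19.7],
    "Finetuning": [49.2, 37.0, 44.2, 52.9, 33.9],
}

averages = {m: round(sum(s) / len(s), 2) for m, s in TABLE_3.items()}

# Positive values mean finetuning scores higher than the PEFT method.
gaps = {
    method: {
        bench: round(ft - score, 1)
        for bench, score, ft in zip(BENCHMARKS, scores, TABLE_3["Finetuning"])
    }
    for method, scores in TABLE_3.items()
    if method != "Finetuning"
}
```

The largest gaps for both methods are on GSM and Codex-Eval, whereas LoRA is within one point of finetuning on MMLU, BBH, and TydiQA.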

4 Related Work

Large Language Models (LLMs) have revolutionized the field of natural language processing with their vast knowledge base and advanced reasoning capabilities. Yet, their extensive parameter sizes pose challenges for traditional downstream finetuning.

As summarized in Lialin et al. [12], there are five distinct categories of PEFT methods, with some methods straddling multiple categories. For the initial stage of our survey, we have selected five PEFT methods that span these categories. We employ LoRA [8] as a representative reparameterization-based method, building on its proven effectiveness in works such as QLoRA [5]. BitFit [26] serves as our selective method. Prompt Tuning [10] and Prefix Tuning [11] are selected from the intersection of soft prompts and additive methods. Lastly, we include the adapter [7] method, which is both an additive and an adapter-based method.

The line of research related to instruction tuning has developed in several key ways. Ziegler et al. [30] focused on finetuning language models through the use of human preferences, aiming to produce model outputs that better align with human intent and values. SuperNI [22] introduced an expert-written instruction dataset that spans a variety of task types. Wang et al. [21] leveraged pre-trained models to autonomously generate instruction data for subsequent finetuning. This approach led to notable improvements in the model’s ability to follow instructions accurately, reducing the need for costly human-annotated instruction datasets. Taori et al. [19] contributed a synthetic dataset generated from the outputs of OpenAI’s text-davinci-003 model. This dataset was created using self-instruct methods and was subsequently distilled for the purpose of instruction tuning. Wang et al. [23] study how combinations of human-annotated and GPT-4-generated instruction datasets modulate the performance of the trained model, with a blend of both proving optimal. In our experiments, we use the same experimental setting but integrate PEFT methods.

As reported in Biderman et al. [1], LoRA learns less but has a stronger regularization effect, and it is sensitive to learning rates. Our findings are complementary to theirs, with more focus on instruction tuning. In subsection 3.6, we show that LoRA also has weaker in-task-distribution memorization than finetuning, and in subsection 3.5 that LoRA requires more instruction tasks to generalize. Other aspects of PEFTs for instruction tuning have also been explored. Zhuo et al. [29] investigate PEFTs for instruction tuning on coding and reveal that full finetuning still surpasses all PEFTs in downstream performance. Biderman et al. [1] also test coding and math reasoning capability with domain datasets and show that LoRA is strong at math but weaker at coding. In subsection 3.7, with the mixed-task dataset in our work, we find that LoRA is inferior to finetuning on a wider range of challenging tasks, including coding, complex reasoning, and open-ended generation.

5 Conclusion

This study has demonstrated that PEFTs, particularly LoRA and adapter, present viable alternatives to full fine-tuning in instruction-tuning scenarios, offering a balance between performance and computational efficiency. Our comprehensive empirical analysis highlighted that greater LoRA ranks and adapter sizes enhance performance significantly, though they may introduce some training instability. Additionally, while LoRA outperforms adapter in open instruction tuning settings due to its robust generalization across diverse tasks, it requires a substantial number of tasks to achieve effective unseen task generalization.

Moreover, the limitations observed in the long-form generation capabilities of both LoRA and adapter underscore the ongoing challenges within PEFT methods, necessitating further innovation and exploration in this field. These findings advocate for a nuanced application of PEFT methods, tailored to specific model sizes, task types, and data availability, aligning them more closely with real-world applications. As the landscape of language model finetuning evolves, the insights from this study will hopefully guide future research towards optimizing the efficiency and effectiveness of instruction tuning across various domains.

Future work should focus on refining these methods to enhance their stability and expand their applicability to more complex and varied datasets. By continuing to investigate the trade-offs between different PEFT strategies and their impacts on model performance, the field can move towards more sophisticated and nuanced PEFT techniques that maximize both performance and efficiency.

References

Biderman et al. [2024] D. Biderman, J. G. Ortiz, J. Portes, M. Paul, P. Greengard, C. Jennings, D. King, S. Havens, V. Chiley, J. Frankle, C. Blakeney, and J. P. Cunningham. Lora learns less and forgets less, 2024. URL https://arxiv.org/abs/2405.09673.
Bommasani et al. [2022] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Goel, N. Goodman, S. Grossman, N. Guha, T. Hashimoto, P. Henderson, J. Hewitt, D. E. Ho, J. Hong, K. Hsu, J. Huang, T. Icard, S. Jain, D. Jurafsky, P. Kalluri, S. Karamcheti, G. Keeling, F. Khani, O. Khattab, P. W. Koh, M. Krass, R. Krishna, R. Kuditipudi, A. Kumar, F. Ladhak, M. Lee, T. Lee, J. Leskovec, I. Levent, X. L. Li, X. Li, T. Ma, A. Malik, C. D. Manning, S. Mirchandani, E. Mitchell, Z. Munyikwa, S. Nair, A. Narayan, D. Narayanan, B. Newman, A. Nie, J. C. Niebles, H. Nilforoshan, J. Nyarko, G. Ogut, L. Orr, I. Papadimitriou, J. S. Park, C. Piech, E. Portelance, C. Potts, A. Raghunathan, R. Reich, H. Ren, F. Rong, Y. Roohani, C. Ruiz, J. Ryan, C. Ré, D. Sadigh, S. Sagawa, K. Santhanam, A. Shih, K. Srinivasan, A. Tamkin, R. Taori, A. W. Thomas, F. Tramèr, R. E. Wang, W. Wang, B. Wu, J. Wu, Y. Wu, S. M. Xie, M. Yasunaga, J. You, M. Zaharia, M. Zhang, T. Zhang, X. Zhang, Y. Zhang, L. Zheng, K. Zhou, and P. Liang. On the opportunities and risks of foundation models, 2022.
Chen et al. [2023] J. Chen, A. Zhang, X. Shi, M. Li, A. Smola, and D. Yang. Parameter-efficient fine-tuning design spaces. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=XSRSWxyJIC.
Dai et al. [2022] D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, and F. Wei. Knowledge neurons in pretrained transformers, 2022.
Dettmers et al. [2023] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023.
He et al. [2022] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig. Towards a unified view of parameter-efficient transfer learning, 2022.
Houlsby et al. [2019] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for nlp, 2019.
Hu et al. [2021] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models, 2021.
Ivison et al. [2023] H. Ivison, Y. Wang, V. Pyatkin, N. Lambert, M. Peters, P. Dasigi, J. Jang, D. Wadden, N. A. Smith, I. Beltagy, and H. Hajishirzi. Camels in a changing climate: Enhancing lm adaptation with tulu 2, 2023.
Lester et al. [2021] B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021. URL https://arxiv.org/abs/2104.08691.
Li and Liang [2021] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation, 2021.
Lialin et al. [2023] V. Lialin, V. Deshpande, and A. Rumshisky. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647, 2023.
Lin [2004] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013.
Liu et al. [2022] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022.
Mireshghallah et al. [2022] F. Mireshghallah, A. Uniyal, T. Wang, D. Evans, and T. Berg-Kirkpatrick. Memorization in nlp fine-tuning methods, 2022.
Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback, 2022.
Petrov et al. [2023] A. Petrov, P. H. S. Torr, and A. Bibi. When do prompting and prefix-tuning work? a theory of capabilities and limitations, 2023.
Raffel et al. [2020] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 2020. URL https://arxiv.org/abs/1910.10683.
Taori et al. [2023] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford alpaca: An instruction-following llama model. GitHub repository, 2023. URL https://github.com/tatsu-lab/stanford_alpaca.
Touvron et al. [2023] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.
Wang et al. [2022a] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Annual Meeting of the Association for Computational Linguistics, 2022a. URL https://api.semanticscholar.org/CorpusID:254877310.
Wang et al. [2022b] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al. Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ Tasks. In EMNLP, 2022b.
Wang et al. [2023a] Y. Wang, H. Ivison, P. Dasigi, J. Hessel, T. Khot, K. R. Chandu, D. Wadden, K. MacMillan, N. A. Smith, I. Beltagy, and H. Hajishirzi. How far can camels go? exploring the state of instruction tuning on open resources, 2023a.
Wang et al. [2023b] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023b. URL https://arxiv.org/abs/2212.10560.
Wei et al. [2022] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners, 2022.
Zaken et al. [2022] E. B. Zaken, S. Ravfogel, and Y. Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models, 2022.
Zhang et al. [2023] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu, T. Zhang, F. Wu, and G. Wang. Instruction tuning for large language models: A survey, 2023.
Zhou et al. [2023] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy. Lima: Less is more for alignment, 2023.
Zhuo et al. [2024] T. Y. Zhuo, A. Zebaze, N. Suppattarachai, L. von Werra, H. de Vries, Q. Liu, and N. Muennighoff. Astraios: Parameter-efficient instruction tuning code large language models, 2024. URL https://arxiv.org/abs/2401.00788.
Ziegler et al. [2020] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences, 2020.

Supplementary Material

Acknowledgments and Disclosure of Funding

We thank our founding advisor Jieyu Zhang at AdaBit AI for proofreading this paper.

Appendix A Limitations

Despite the comprehensiveness of our training and evaluations, we neither cover all PEFT methods exhaustively nor perform a more fine-grained hyperparameter grid search. We select only the most representative PEFT methods across the broad PEFT categories, together with commonly used hyperparameters.

Given our computing constraints, our first set of experiments on hyperparameter search is based on SuperNI and T5. Therefore, the optimal hyperparameters found there might not reflect the performance of the latest model architectures on the latest instruction tuning datasets, which have broader topic coverage. In addition, Tülu suggests that incorporating SuperNI in the data mixture is harmful to model performance.

Appendix B Broader Impact

We believe that a comprehensive validation of PEFT methods is broadly positive. It could help practitioners narrow the search space of PEFT methods and hyperparameters, saving experiment time and effort.

Appendix C Model Training Details and Compute

We designed the first set of experiments to find hyperparameter patterns while taking each PEFT’s characteristics into account. For instance, the prompt length in prompt tuning [10] and the prefix length in prefix tuning [11] are both constrained by the input length. BitFit [26] features significantly fewer trainable parameters ( $\ll 1\%$ ) and lacks any adjustable PEFT hyperparameters. Considering these complexities, we opted not to standardize the number of trainable parameters for performance comparison. Instead, our experimental design adheres closely to the original configurations reported in the respective foundational works. We selected these representative PEFTs and conducted experiments within a restricted hyperparameter search space to yield meaningful insights.

For the SuperNI [22] experiments in Sec. 2.1, we trained our models primarily on the High-Flyer cluster, each node of which contains 8 Nvidia A100 GPUs. We use DistributedDataParallel for most training jobs when GPU memory permits; otherwise, we employ the DeepSpeed library with the ZeRO optimizer. The training hyperparameters that are not part of the grid search are as follows:

• Precision: FP32
• Epochs: 4
• Weight decay: 0
• Warmup ratio: 0.03
• Max. seq. length: 1,024
• Effective batch size: 128
• Dropout: 0.1
• LoRA Layers wrapped: all query and key layers
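As a rough illustration of how the effective batch size and warmup ratio listed above translate into concrete schedule values, the following sketch splits an effective batch size of 128 across 8 GPUs and derives a warmup step count. The per-device batch size (4) and the dataset size (64,000 examples) are hypothetical values chosen only for illustration and are not reported in this paper; rounding the warmup step count to the nearest integer is likewise an assumption.

```python
import math


def grad_accum_steps(effective_batch: int, per_device_batch: int, num_gpus: int) -> int:
    """Gradient accumulation steps needed to reach the effective batch size."""
    per_step = per_device_batch * num_gpus
    if effective_batch % per_step != 0:
        raise ValueError("effective batch size must be a multiple of per_device_batch * num_gpus")
    return effective_batch // per_step


def schedule_steps(num_examples: int, effective_batch: int, epochs: int, warmup_ratio: float):
    """Return (total optimizer steps, warmup steps) for a linear warmup schedule."""
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    # Assumption: warmup steps are rounded to the nearest integer.
    warmup = round(total_steps * warmup_ratio)
    return total_steps, warmup


if __name__ == "__main__":
    # Hypothetical per-device batch size of 4 on one 8-GPU node.
    print(grad_accum_steps(effective_batch=128, per_device_batch=4, num_gpus=8))
    # Hypothetical dataset of 64,000 examples; epochs and warmup ratio from the list above.
    print(schedule_steps(num_examples=64_000, effective_batch=128, epochs=4, warmup_ratio=0.03))
```

Under these assumed values, each optimizer step accumulates gradients over 4 forward/backward passes per GPU, and warmup covers roughly the first 3% of optimizer steps.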

For the Tülu experiments in Sec. 2.2, we trained our models primarily on a local cluster. Each experiment is conducted on a single NVIDIA A6000 GPU without any additional training framework. Our training hyperparameters are as follows:

• Precision: BFloat16
• Epochs: 3
• Weight decay: 0
• Warmup ratio: 0.03
• Learning rate: 1e-4
• Max. seq. length: 4,096
• Effective batch size: 128
• LoRA Rank: 512
• LoRA Alpha: 512
• LoRA dropout: 0.1
• Layers wrapped: all query and key layers
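To make the LoRA settings above more concrete: in the standard LoRA formulation the low-rank update is scaled by alpha / r, so rank 512 with alpha 512 gives a scaling factor of 1. The sketch below computes that factor and a rough trainable-parameter count when only the query and key projections are wrapped. The hidden size (4096) and layer count (32) are hypothetical values used only for illustration, and the count assumes square query and key projections, which does not hold for every architecture (for example, models with grouped-query attention).

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_params_per_linear(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one wrapped linear layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_total_params(d_model: int, num_layers: int, rank: int, wrapped_per_layer: int = 2) -> int:
    """Total LoRA parameters, assuming square projections and `wrapped_per_layer`
    wrapped matrices per layer (2 = query and key)."""
    return num_layers * wrapped_per_layer * lora_params_per_linear(d_model, d_model, rank)


if __name__ == "__main__":
    print(lora_scaling(alpha=512, rank=512))
    # Hypothetical model shape: hidden size 4096, 32 transformer layers.
    print(lora_total_params(d_model=4096, num_layers=32, rank=512))
```

With these assumed dimensions, a rank of 512 adds a few hundred million trainable parameters, which is useful context for the remark elsewhere in this appendix that a large volume of trainable parameters can contribute to unstable training.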

Appendix D Reproducibility

We cannot guarantee exact reproduction of all experiment numbers because the experiments were conducted on preemptible clusters, where jobs may restart multiple times. However, the overall findings remain valid. In addition, some unstable training runs failed to generate non-empty content, which leads to near-zero RougeL scores and lowers the average RougeL scores reported in the results. Such results are caused by the inherent training instability of PEFT methods, high learning rates, or a large number of trainable parameters.

Appendix E PEFT hyperparameter search results

Table 4: Performance of LoRA with different hyperparameters. $r$ denotes the LoRA rank and lr the learning rate.

	lr=1e-5	lr=5e-5	lr=1e-4	lr=5e-4	lr=1e-3
$r=8$	37.2	42.3	43.0	44.3	44.8
$r=32$	38.9	41.5	43.6	45.5	42.2
$r=64$	40.4	43.8	44.2	44.4	27.8
$r=128$	40.8	44.5	45.5	35.1	33.3
$r=256$	40.9	44.8	46.5	38.0	0.0
$r=512$	42.9	45.8	47.1	23.2	0.0
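One way to read Table 4 is to pick, for each rank, the learning rate with the highest score. The short sketch below does this using the values copied from the table; the helper name `best_lr_per_rank` is ours and is not part of any released code.

```python
# Scores copied from Table 4 (rows: LoRA rank, columns: learning rate).
LRS = ["1e-5", "5e-5", "1e-4", "5e-4", "1e-3"]
TABLE4 = {
    8:   [37.2, 42.3, 43.0, 44.3, 44.8],
    32:  [38.9, 41.5, 43.6, 45.5, 42.2],
    64:  [40.4, 43.8, 44.2, 44.4, 27.8],
    128: [40.8, 44.5, 45.5, 35.1, 33.3],
    256: [40.9, 44.8, 46.5, 38.0, 0.0],
    512: [42.9, 45.8, 47.1, 23.2, 0.0],
}


def best_lr_per_rank(table, lrs):
    """Map each rank to the learning rate with the highest score in its row."""
    best = {}
    for rank, scores in table.items():
        idx = max(range(len(scores)), key=lambda i: scores[i])
        best[rank] = lrs[idx]
    return best


if __name__ == "__main__":
    for rank, lr in best_lr_per_rank(TABLE4, LRS).items():
        print(f"r={rank}: best lr = {lr}")
```

In this table the best learning rate moves from 1e-3 at $r=8$ down to 1e-4 for $r \geq 128$, so larger ranks favor smaller learning rates.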

Table 5: Performance of adapter with different hyperparameters. $s$ is the adapter bottleneck size and lr the learning rate.

	lr=1e-5	lr=5e-5	lr=1e-4	lr=5e-4	lr=1e-3
$s=8$	37.3	43.9	43.3	43.2	44.9
$s=32$	41.5	42.0	44.2	46.0	43.2
$s=64$	41.8	44.2	45.2	20.7	5.3
$s=128$	41.1	45.1	46.0	5.8	41.4
$s=256$	41.4	45.1	46.4	42.7	4.3
$s=512$	42.7	46.2	46.7	5.0	4.3
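The instability visible in Table 5 can be summarized by flagging cells whose score collapses. The sketch below uses the values copied from the table and marks any configuration scoring below 10.0 as collapsed; this threshold is an arbitrary choice made for illustration and is not a criterion used in the paper.

```python
# Scores copied from Table 5 (rows: adapter bottleneck size, columns: learning rate).
LRS = ["1e-5", "5e-5", "1e-4", "5e-4", "1e-3"]
TABLE5 = {
    8:   [37.3, 43.9, 43.3, 43.2, 44.9],
    32:  [41.5, 42.0, 44.2, 46.0, 43.2],
    64:  [41.8, 44.2, 45.2, 20.7, 5.3],
    128: [41.1, 45.1, 46.0, 5.8, 41.4],
    256: [41.4, 45.1, 46.4, 42.7, 4.3],
    512: [42.7, 46.2, 46.7, 5.0, 4.3],
}

# Assumption: a score below this value counts as a collapsed run.
COLLAPSE_THRESHOLD = 10.0


def collapsed_configs(table, lrs, threshold=COLLAPSE_THRESHOLD):
    """Return (size, lr) pairs whose score falls below the threshold."""
    return [
        (size, lrs[i])
        for size, scores in table.items()
        for i, score in enumerate(scores)
        if score < threshold
    ]


if __name__ == "__main__":
    for size, lr in collapsed_configs(TABLE5, LRS):
        print(f"s={size}, lr={lr}: collapsed")
```

Under this threshold, every flagged configuration has a bottleneck size of at least 64 and a learning rate of at least 5e-4.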

Table 6: Performance of prompt tuning with different hyperparameters. $l$ is the prompt length and lr the learning rate.

	lr=1e-5	lr=5e-5	lr=1e-4	lr=5e-4	lr=1e-3
$l=8$	16.3	16.3	16.3	16.3	16.3
$l=32$	16.3	16.3	16.3	16.1	16.2
$l=128$	15.6	15.6	15.6	15.5	15.2
$l=256$	15.2	15.2	15.2	15.1	9.0

Table 7: Performance of prefix tuning with different hyperparameters. $l$ is the prefix length, $h$ is the hidden dimension of the reparameterization feedforward network, and lr is the learning rate. $h=\text{null}$ indicates that reparameterization is not applied.

	lr=1e-5	lr=5e-5	lr=1e-4	lr=5e-4	lr=1e-3
$l=32,\ h=\text{null}$	15.1	6.4	8.3	30.7	26.5
$l=32,\ h=256$	26.1	31.6	31.0	31.4	22.9
$l=32,\ h=512$	30.5	31.5	21.4	20.1	15.2
$l=32,\ h=1024$	29.0	21.2	30.3	21.0	8.8
$l=64,\ h=\text{null}$	12.2	1.0	6.7	34.0	35.3
$l=64,\ h=256$	37.4	41.2	40.2	39.0	0.0
$l=64,\ h=512$	39.5	40.2	40.7	39.8	37.2
$l=64,\ h=1024$	41.0	41.2	41.2	36.7	9.8
$l=128,\ h=\text{null}$	5.0	1.8	29.3	37.9	39.4
$l=128,\ h=256$	27.7	30.8	30.6	22.1	21.1
$l=128,\ h=512$	31.4	31.6	20.5	39.8	23.9
$l=128,\ h=1024$	28.8	31.2	30.9	9.3	11.2
$l=256,\ h=\text{null}$	8.2	15.3	33.0	31.1	38.0
$l=256,\ h=256$	29.8	30.0	34.7	30.4	31.9
$l=256,\ h=512$	26.3	30.7	31.2	25.1	16.5
$l=256,\ h=1024$	25.4	18.3	31.5	4.2	12.9
$l=512,\ h=\text{null}$	8.3	18.6	33.6	37.7	40.0
$l=512,\ h=256$	38.1	30.5	41.6	41.2	8.3
$l=512,\ h=512$	38.2	37.7	39.5	38.8	16.9
$l=512,\ h=1024$	40.6	32.8	34.1	21.6	6.5

Table 8: Performance of BitFit with different hyperparameters. lr is the learning rate.

	lr=1e-5	lr=5e-5	lr=1e-4	lr=5e-4	lr=1e-3
All biases	1.3	28.1	35.0	41.5	20.8

Appendix F Model size and LoRA rank/adapter size

Table 9: Performance of LoRA with different model sizes and LoRA ranks. $r$ denotes the LoRA rank.

model	$r=8$	$r=32$	$r=64$	$r=128$	$r=256$	$r=512$
t5-base-lm-adapt	30.5	31.2	31.8	32.5	33.0	33.9
t5-large-lm-adapt	34.1	36.6	35.1	37.8	36.3	38.2
t5-xl-lm-adapt	43.0	43.6	44.2	45.5	46.5	47.1
