Proceedings of Machine Learning Research, 2024. ACML 2024
arXiv:2406.12381v3 [cs.CL] 16 Jul 2024

QOG: Question and Options Generation based on Language Model

Jincheng Zhou  ZHOUJINCHENG@GMAIL.COM
Zhuoshi Technology, University of Electronic Science and Technology of China

Yue Hu
Zhuoshi Technology

Ya Wang
Zhuoshi Technology

Editors: Vu Nguyen and Hsuan-Tien Lin

Abstract
Question-Options Generation (QOG) is a task that involves generating a set of question-options pairs for a given input. This task has various applications, including fine-tuning large models, information retrieval, and automated multiple-choice question generation for education. In this paper, we develop QOG models using three different methods based on fine-tuning sequence-to-sequence language models (LMs). Experiments demonstrate that the end-to-end QOG model is computationally efficient and stable during both training and inference, outperforming the other methods. Furthermore, our analysis indicates that our QOG models are competitive on the QOG task compared to the large language model Llama 3-8B.
Keywords: Question-Options Generation (QOG); information retrieval; sequence-to-sequence language models

1. Introduction
Question-Options (QO) is derived from Question-Answering (QA), extending question-answer pairs by adding three text-related distractors. Compared to QA, QO provides more information by combining the correct answer with several specially designed incorrect answers. This information enables a model to incorporate more knowledge and improves its ability to identify potential errors. When used for fine-tuning large language models (Hu et al., 2021; Liu et al., 2021, 2023), the goal is to identify the correct answer from four candidate answers, which improves the model's understanding of the question and its ability to distinguish between different options, leading to better text understanding. When used for model evaluation, whether the model can select the correct answer becomes a key indicator of its text learning and processing capabilities (Hendrycks et al., 2020; Huang et al., 2024; Cobbe et al., 2021).
Question and Options Generation (QOG) refers to the task of generating a set of question-options pairs given an input context (e.g. a paragraph). QOG can be used to develop unsupervised question answering and as a data augmentation tool (Alberti et al., 2019; Yu et al., 2018; Riabi et al., 2020; Singh et al., 2019; Longpre et al., 2019) to enhance the text understanding capabilities of large language models. QOG can also be used as an educational aid (Agarwal et al., 2019; Cai et al., 2023; Wang et al., 2023), to enhance information retrieval models (Nogueira et al., 2019; Pyatkin et al., 2021), and as a way to explain models.

© 2024 J. Zhou, Y. Hu & Y. Wang.


[Figure 1 depicts the three approaches. Pipeline QOG: independent LMs for answer extraction, question generation, and distractor generation, applied in sequence to the context. Multitask QOG: a single shared LM handles the three subtasks. End2end QOG: one LM maps the context directly to questions, answers, and distractors (e.g. for the context "...I think friends are those people who can help you when you are in trouble...", the question "What are friends?" with answer "Friends" and distractors "Strangers", "Competitors", "Acquaintances").]

Figure 1: Overview of the considered QOG approaches.

QOG originates from Question Generation (QG) (Duan et al., 2017; Du et al., 2017; Ali et al., 2010), which involves generating a question for an answer given an input context. Compared to QG and Question-Answer Generation (QAG), QOG presents a more intricate challenge, as the correct answer and distractors must be constructed rather than assumed as part of the input. It is currently unknown which models are effective for the QOG task, because no comprehensive comparative study has been conducted to date.
In this paper, we consider QOG as the task of generating questions and options given a context, and compare three simple QOG methods based on fine-tuning encoder-decoder language models such as T5 (Raffel et al., 2020). The three methods are common ways of fine-tuning language models: (1) Pipeline QOG, which decomposes the task into three parts, answer extraction, question generation, and distractor generation, and learns a separate model for each subtask; (2) Multi-task QOG, which trains the three subtasks simultaneously with a single shared model instead of separate models; and (3) End-to-end QOG, which directly generates question-options pairs using end-to-end sequence-to-sequence learning. Finally, we introduce GPT-4 as a referee alongside traditional evaluation methods and compare the three methods across multiple domains to objectively evaluate the cross-domain generalization ability of the models.

2. Related Work
There is no prior work using pre-trained LMs for QOG, but there are related works on QG and QAG.

2.1. Question Generation

Question generation (QG) aims to generate questions given documents and answers. This approach has been applied in a variety of scenarios, including data augmentation (Alberti et al., 2019; Pellicer et al., 2023) and document retrieval (Nogueira et al., 2019; Hofstätter et al., 2020; Ram et al., 2023). Puri et al. (2020) fine-tuned an autoregressive LM for QG to generate related questions. These works focus on generating large-scale data at once without paying much attention to data quality. Xiong and Wu (2020) followed the idea of adversarial generation and used question-answering agents and QG models to train adversarial models. That work used semantic similarity as an evaluation indicator to improve the quality of generated data, but did not consider the generalization ability of the model in other fields. To this end, our work focuses on improving data quality while testing the generalization ability of the model on datasets from different fields.

2.2. Question-Answer Generation

Question-Answer Generation (QAG) generates question-answer pairs based on a given document. QAG can be used to develop QA models without human supervision and as an auxiliary tool for data augmentation of large language models. Bartolo et al. (2021) used a QAG model to generate adversarial examples for QA, and Lewis et al. (2021) improved extractive QA by generating millions of question-answer pairs through QAG; in both cases, the fine-tuned model is BART. To improve the recall of QA, Fang et al. (2020) regarded answer extraction as a classification problem and used RoBERTa (Liu et al., 2019) to implement it. Han et al. (2023) generated QA pairs iteratively and improved the data quality of each iteration by modifying the seed set. Ushio et al. (2023) compared several QAG methods and established a complete system. These works can generate high-quality QA pairs; however, they are limited to machine metrics and do not evaluate QA quality according to human standards.

2.3. QG & QAG Evaluation


In terms of evaluating the quality of QG, Zhang and Bansal (2019) believed that the traditional
evaluation criteria were not accurate enough, they proposed using the QG model to generate ques-
tions for each answer in the data set, and then using the synthetic QA pairs to train the QA model.
The performance of these QA models will indirectly measure the ability of the QG model. Sultan
et al. (2020) also emphasized the limitations of traditional QG evaluation methods and introduced
new indicators for evaluating diversity to better evaluate the quality of QG. These studies enrich the
evaluation system by introducing new evaluation criteria. However, these criteria still have certain
deviations from human evaluation criteria. Shortly after the release of GPT-4, Moore et al. (2023)
proposed a method for evaluating the quality of multiple-choice questions based on GPT-4, with
a correlation of more than 80% with human evaluation. In this paper, we also choose to intro-
duce GPT-4 as part of the evaluation system to make the generated QO more in line with human
evaluation standards.

3. Question & Options Generation


Given a context c (e.g. a paragraph), the goal of QOG is to generate question-options pairs that
are relevant to the information in c: QOc = {(q 1 , o1 ), (q 2 , o2 ), . . . }. In the following, we will
introduce three different QOG methods based on fine-tuned language models.

3.1. Pipeline QOG

This approach decomposes the QOG task into three simpler subtasks: answer extraction (AE), question generation (QG), and distractor generation (DG). The AE, QG, and DG models learn from samples containing a context, sentences, answers, and corresponding questions, thereby optimizing their abilities to extract correct answers and to generate relevant questions and distractors.
The AE model $P_{ae}$ first extracts an answer $a$ from a given context $c$; the QG model $P_{qg}$ then generates a question $q$ based on the context $c$ and the answer $a$, such that $q$ can be answered by $a$. Finally, the DG model $P_{dg}$ generates distractors $d$ based on the context $c$ and the answer $a$. The AE, QG, and DG models can be trained independently on any dataset of triples $(c, a, q)$ by maximizing the conditional log-likelihood:

$$\tilde{a} = \arg\max_a P_{ae}(a \mid c, s) \tag{1}$$
$$\tilde{q} = \arg\max_q P_{qg}(q \mid c, s, a) \tag{2}$$
$$\tilde{d} = \arg\max_d P_{dg}(d \mid c, s, a) \tag{3}$$

where $s$ is the sentence of the context that contains the answer. As in other sequence-to-sequence learning, the log-likelihood here is based on token-level predictions. During learning, the input to the AE model has the form

$$[c_1, c_2, \ldots, c_{|c|}]$$

where $c_i$ is the $i$-th token in context $c$ and $|c|$ is the number of tokens in the context. In the input to the QG model, the answer $a$ is taken into account, and the position of the answer $a$ in the context is marked with <hl>:

$$[c_1, \ldots, \texttt{<hl>}, a_1, \ldots, a_{|a|}, \texttt{<hl>}, \ldots, c_{|c|}]$$

where $a_i$ is the $i$-th token in the answer $a$. The model learns the pattern of generating a question from the highlighted answer. Finally, the DG model $P_{dg}$ receives the answer and context in the same highlighted form:

$$[c_1, \ldots, \texttt{<hl>}, a_1, \ldots, a_{|a|}, \texttt{<hl>}, \ldots, c_{|c|}]$$

At inference time, we replace the gold answer $a$ that is fed to the QG (2) and DG (3) models during training with the prediction $\tilde{a}$ of the AE model (1), and then run over the context $c$ to obtain the generated question-options. Since $\tilde{a}$ serves as the input of $P_{qg}$ and $P_{dg}$, the effectiveness of this method depends almost entirely on $P_{ae}$.
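To make the three-stage inference concrete, here is a minimal sketch in Python using the Hugging Face transformers API. The checkpoint names, the single-answer simplification, and the naive string-based highlighting are our own assumptions, not the authors' released code.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical fine-tuned checkpoints for the three subtasks;
# these names are placeholders, not the authors' released models.
ae = AutoModelForSeq2SeqLM.from_pretrained("qog-t5-ae")
qg = AutoModelForSeq2SeqLM.from_pretrained("qog-t5-qg")
dg = AutoModelForSeq2SeqLM.from_pretrained("qog-t5-dg")
tok = AutoTokenizer.from_pretrained("qog-t5-ae")

def generate(model, text: str) -> str:
    """Run one seq2seq generation step and decode the result."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    out = model.generate(ids, max_length=256)
    return tok.decode(out[0], skip_special_tokens=True)

def pipeline_qog(context: str) -> dict:
    # (1) AE: extract a candidate answer from the raw context.
    answer = generate(ae, context)
    # Mark the predicted answer span with <hl>, as described above
    # (a naive first-occurrence replacement for illustration).
    highlighted = context.replace(answer, f"<hl> {answer} <hl>", 1)
    # (2) QG: generate a question for the highlighted answer.
    question = generate(qg, highlighted)
    # (3) DG: generate distractors from the same highlighted input.
    distractors = generate(dg, highlighted)
    return {"question": question, "answer": answer, "distractors": distractors}
```

Note how any AE error propagates: both downstream models consume the highlighted span, which is the dependency on $P_{ae}$ described above.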

3.2. Multitask QOG


The method mentioned in 3.1 trains an independent model for each subtask. Instead of this, Multi-
task QOG adopts a multi-task learning approach to fine-tune a shared model for AE, QG, and DG
at the same time. To be precise, we mix the training data of AE, QG and DG together, and a random
batch is extracted at each fine-tuning iteration. Each subtask is distinguished by adding a task pre-
fix at the beginning of the input text: “extract answer” (AE), “question generation”
(QG), and “distractor generation” (DG). The loss function will combine the loss func-
tions of AE, QG, and DG, and the total loss is the weighted sum of the three:

L = αLAE + βLQG + γLDG (4)

where α, β, and γ are hyperparameters. This design enables the model to jointly learn three different
but related task modes at once, thereby improving the overall generalization ability and efficiency.
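A minimal sketch of how such a mixed, prefixed training set could be assembled; the record layout and field names are illustrative assumptions (the example content follows Figure 1), not the authors' released data format.

```python
import random

# Illustrative record: context, answer, question, and distractors
# (field names assumed; example content taken from Figure 1).
sample = {
    "context": "...I think friends are those people who can help you when you are in trouble...",
    "answer": "friends",
    "question": "What are friends?",
    "distractors": "Strangers | Competitors | Acquaintances",
}

def to_multitask_examples(s: dict) -> list[tuple[str, str]]:
    """Turn one QO record into three task-prefixed (input, target) pairs."""
    highlighted = s["context"].replace(s["answer"], f"<hl> {s['answer']} <hl>", 1)
    return [
        ("extract answer: " + s["context"], s["answer"]),           # AE
        ("question generation: " + highlighted, s["question"]),      # QG
        ("distractor generation: " + highlighted, s["distractors"])  # DG
    ]

dataset = [sample]  # in practice, thousands of SQuAD-QO records
train = [ex for s in dataset for ex in to_multitask_examples(s)]
random.shuffle(train)  # random batches then blend AE, QG, and DG examples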
3.3. End2end QOG

This method directly takes the context $c$ as input and outputs text containing all question-options pairs. We model this by converting the question-options pairs in the dataset into a flattened sentence $y$ and fine-tuning a sequence-to-sequence model to generate $y$ given the context $c$. The format of $y$ is as follows:

$$T(Q_c) = \text{``question: } q_1\text{, options: } o_1 \mid \text{question: } q_2\text{, options: } o_2 \mid \ldots\text{''} \tag{5}$$

where each question-options pair is converted into this textual form and the pairs are joined by the separator '|'. The end-to-end QOG model $P_{qog}$ is optimized by maximizing the following conditional log-likelihood:

$$\tilde{y} = \arg\max_y P_{qog}(y \mid c) \tag{6}$$
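A minimal sketch of the flattening in Eq. (5) and the inverse parsing applied to the model's output at inference time; the helper names and the option separator inside each pair are our own choices.

```python
def flatten(pairs: list[tuple[str, str]]) -> str:
    """Serialize (question, options) pairs into the target string y of Eq. (5)."""
    return " | ".join(f"question: {q}, options: {o}" for q, o in pairs)

def parse(y: str) -> list[tuple[str, str]]:
    """Recover question-options pairs from a generated flattened string."""
    pairs = []
    for chunk in y.split(" | "):
        if "question:" in chunk and ", options:" in chunk:
            q, o = chunk.split(", options:", 1)
            pairs.append((q.replace("question:", "", 1).strip(), o.strip()))
    return pairs

y = flatten([("What are friends?", "Friends; Strangers; Competitors; Acquaintances")])
assert parse(y)[0][0] == "What are friends?"
```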

4. Evaluation
4.1. Experimental Setting
Data. The training data for QOG is a multiple-choice question dataset generated from SQuAD. We send each QA pair in SQuAD to a large language model (such as GPT-4) to generate distractors, forming the dataset SQuAD-QO. For SQuADShifts and FinQA, we generated the test sets SQuADShifts-QO and FinQA-QO in the same way. The relevant datasets have been publicly released on HuggingFace.
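As a sketch of this data-construction step, the following shows how distractors might be requested from GPT-4 via the openai client; the prompt wording, output parsing, and function name are illustrative assumptions rather than the authors' actual script.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def make_distractors(context: str, question: str, answer: str) -> list[str]:
    """Ask an LLM for three plausible but incorrect options for one QA pair."""
    prompt = (
        "Given the context, question, and correct answer below, "
        "write three plausible but incorrect answer options, one per line.\n"
        f"Context: {context}\nQuestion: {question}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]
```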

Model      Approach    Aver    Amazon  Wiki    NYT     Reddit  Fin

           Llama 3-8B  49.79   51.34   47.26   59.02   49.92   41.43

T5-SMALL   Pipeline    42.99   42.74   40.47   42.76   41.12   51.90
           Multitask   46.03   50.40   41.92   43.68   46.11   48.03
           End2end     47.98   48.41   44.72   48.47   47.48   50.84

T5-BASE    Pipeline    46.45   48.52   43.46   44.43   45.53   55.29
           Multitask   47.05   47.84   45.84   51.08   44.56   45.93
           End2end     50.78   49.90   47.94   52.14   51.14   52.77

T5-LARGE   Pipeline    51.45   50.73   48.42   46.27   49.82   62.01
           Multitask   51.61   48.95   49.22   49.19   52.65   58.04
           End2end     54.66   54.58   51.28   55.56   52.16   59.71

Table 1: QO evaluation results (F1) of different QOG models on the test set. For comparison, we introduce Llama 3-8B. The best score of the QOG methods for each LM is shown in bold in the original, and the best result in each domain across all models is underlined.

Evaluation. Since QOG's output involves a variety of questions and options, traditional natural language generation metrics are not applicable to it. To evaluate the quality of the generated questions and options (QO) comprehensively, we adopt the following two methods: 1) We randomly select 100 questions from the generated QO dataset and call GPT-4 to answer them; the more questions answered correctly, the higher the quality of the generated QOs. Here, high quality only indicates answerability, as we noticed that the model sometimes generates questions with unclear meanings or fails to generate them properly. 2) We generate test sets from datasets other than SQuAD and use F1 scores to assess the generalisation ability of the QOG model to other domains. For this purpose, we choose two datasets: SQuADShifts, which covers English reading comprehension tasks in four domains (Amazon/Wiki/NYT/Reddit), and FinQA, a Q&A dataset in the financial domain. Together they cover Q&A in different domains and can effectively judge the applicability and flexibility of the model within each domain.

[Figure 2 is a bar chart over the domains Average, Amazon, Wiki, NYT, Reddit, and Fin, with F1 scores ranging roughly from 40 to 60 for Pipeline, Multitask, End2end, and Llama 3.]

Figure 2: QO quality assessment results (F1 scores, 95% confidence intervals) generated by T5-LARGE multitask/pipeline/end2end on different domain datasets, compared with Llama 3-8B.
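Evaluation method 1) can be sketched as a judging loop over sampled QO pairs; the prompt, the exact-match correctness check, and the data layout are simplifying assumptions.

```python
import random
from openai import OpenAI

client = OpenAI()

def judge_score(qo_pairs: list[dict], n: int = 100) -> float:
    """Fraction of sampled questions GPT-4 answers correctly (answerability proxy)."""
    sampled = random.sample(qo_pairs, min(n, len(qo_pairs)))
    correct = 0
    for qo in sampled:
        options = "\n".join(qo["options"])  # correct answer shuffled among distractors
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"Answer with the option text only.\n"
                           f"Question: {qo['question']}\nOptions:\n{options}",
            }],
        )
        if resp.choices[0].message.content.strip() == qo["answer"]:
            correct += 1
    return correct / len(sampled)
```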
T5 & Llama 3. For the three approaches mentioned above (i.e.,
Dataset Size
pipelined, multitasking, and end-to-end), we use T5 as the base
language model for our experiments. The model weights used SQuAD-QO 87,399
include T5-small,base,large , which are all open source on the Hug- SQuADShifts-QO 37,691
gingFace platform. In addition, we report the results of the latest FinQA-QO 8,281
open-source large model Llama 3-8B as a QOG model and com-
pare it with T5. Table 2: Size of Datasets

4.2. Results
Table 1 shows the evaluation results for the three methods considered. From T5-SMALL to T5-LARGE, the scores improve for every method, showing that model size has a significant positive impact on performance. The best model overall, T5-LARGE (End2end), achieves the best results in three of the five domains and outperforms Llama 3-8B on average. Even smaller models, such as T5-SMALL, generate QO pairs of decent quality.
Given the results, the End2end approach achieves the best results in the vast majority of domains, with the highest average scores on T5 models of all sizes. Our analysis is that the performance of the Pipeline method depends on the first model, AE, which passes its predictions downstream, so errors accumulate and the overall quality degrades. The Multitask method shares the same neural network, so the output of the three tasks is more stable; its problem is that, on average, only one third of the parameters are available for each task, which degrades the model's performance. The advantage of End2end is that it treats QOG as a single task: all parameters are updated during training to optimise this one task, which gives the model the best generalisation and performance of the three.
We also notice that the Pipeline method performs best on FinQA. The answers in FinQA are mainly short numbers (e.g. 236), and the AE model focuses on this pattern during learning and acquires a highly efficient extraction pattern. This suggests that if the QOG data has some regularity, the Pipeline method can be used to train a better-performing model.

4.3. Assessing the quality of QO pairs using GPT-4
GPT-4 is currently the strongest large language model and the one that best matches human evaluation criteria. To comprehensively evaluate the quality of the QO pairs generated by each QOG model, we use GPT-4 as the judging model. Specifically, each QOG model generates 100 QO pairs in different fields, and GPT-4 then answers these QO pairs. If GPT-4 answers correctly, the question-options are logical and answerable, which indicates that these QO pairs are qualified. In this way, we performed a comprehensive evaluation of the quality of the QOs generated by each model; the results are shown in Table 3.

Approach              GPT-4
Llama 3-8B            94
T5-SMALL (pipeline)   64
T5-SMALL (multitask)  53
T5-SMALL (end2end)    61
T5-BASE (pipeline)    79
T5-BASE (multitask)   68
T5-BASE (end2end)     76
T5-LARGE (pipeline)   85
T5-LARGE (multitask)  77
T5-LARGE (end2end)    83

Table 3: Mean scores for each QOG model under the GPT-4 judgement.
We found that Llama 3-8B, which performed mediocrely on F1 scores, achieved the best results in all areas under the judgement of GPT-4, indicating that the QO pairs generated by Llama 3 are superior to those of the other models in terms of logic. One possible explanation is that Llama 3 does not have a fixed way of generating QO, while the fine-tuned T5 does, which yields higher scores when computing similarity-based metrics. This result shows that evaluating with the F1 score alone is not enough; the GPT-4-based evaluation we introduce provides a valuable quality indicator for the QO pairs generated by QOG models.
Although the proposed model did not outperform Llama 3-8B, there is one positive outcome: the best fine-tuned model, T5-LARGE, remains competitive under this evaluation and can accurately generate QOs that meet human requirements.

Model      Approach    Compute (ms)   Memory (MB)

T5-SMALL   Pipeline     363.62        1176.38
           Multitask    105.94         696.95
           End2end      142.14         695.45

T5-BASE    Pipeline     909.23        3053.30
           Multitask    217.75        1336.99
           End2end      301.83        1336.28

T5-LARGE   Pipeline    1480.73        8960.01
           Multitask   1597.59        3310.34
           End2end      724.96        3314.11

Table 4: Inference time and memory usage of each QOG model (CPU environment; results are averaged over 100 experiments).
[Figures 3 and 4 plot the inference time (ms) and memory usage (MB) from Table 4 for the Pipeline, Multitask, and End2end approaches at the Small, Base, and Large model sizes.]

Figure 3: Inference time. Figure 4: Memory usage.

4.4. QOG Model Comparison
In this paper, we have compared the performance of three QOG methods. However, performance is not the only criterion to consider when choosing a QOG model, as each method has its own advantages and limitations in terms of computational cost and usability. We ran each QOG model in the same environment to measure its inference time and memory usage; the results are shown in Table 4.
In terms of computational resources consumed, the End2end method outperforms the other two because it generates QO pairs in a single inference pass. In contrast, both the Pipeline and Multitask methods require three inference passes in total to generate QO pairs. The computational cost of all three methods is proportional to the number of tokens in the input text. From the perspective of memory usage, the Multitask and End2end methods each use only one model to complete the task, while the Pipeline method consists of three models and therefore uses roughly three times as much memory as the other two.
Finally, the performance of the Pipeline method depends on the ability of answer extraction and is limited by the fact that the three models are trained independently and are not kept consistent with respect to a shared objective, although it performs very well on datasets with regular answers. The Multitask method shares a single model, which improves consistency between the three tasks; however, each task gets only one third of the parameters on average, which limits its performance. The End2end method treats QOG as a single whole task and therefore performs better than the other methods in generating QO pairs.
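For reference, a CPU timing comparison like Table 4's could be sketched as follows; the base checkpoint, the fixed example input, and the parameter-size memory estimate are our own assumptions (the paper's memory figures reflect the full process, which this simple estimate does not capture).

```python
import time
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "t5-small"  # placeholder; substitute each fine-tuned QOG checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
model.eval()

context = "I think friends are those people who can help you when you are in trouble."
ids = tok(context, return_tensors="pt").input_ids

# Average wall-clock inference time over 100 runs, as in Table 4.
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(100):
        model.generate(ids, max_length=256)
    avg_ms = (time.perf_counter() - start) / 100 * 1000

# Rough memory footprint: bytes held by the model parameters.
param_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 2**20
print(f"avg inference: {avg_ms:.2f} ms, parameter memory: {param_mb:.2f} MB")
```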

5. Conclusion
In this paper, we treat QOG as the task of generating questions and options given an input context, and train QOG models using three different approaches. To evaluate them, we propose two methods: first, we use the QOG models to generate QO pairs in different domains and use F1 scores to assess the generalisation ability of the QOG models to other domains; second, GPT-4 is utilised to select the correct options in the QOs, with higher GPT-4 scores indicating better question quality. The evaluation shows that the End2end QOG model is not only the fastest at generation but also the most effective. The findings of this study are encouraging, as the quality of the QO datasets generated by the QOG model is close to that generated by Llama 3, while consuming far fewer resources than the large language model.
Limitations
The input to our QOG models is limited to 1024 tokens, so they do not work on longer texts. The options used for training are short and the logic required to answer the questions is simple; as a result, our models cannot generate longer and more complex options. The models are only applicable to English scenarios; applying them to other languages would require a multilingual QO dataset for training and evaluation. We will try our best to address these limitations in the future.

References
Abhishek Agarwal, Nikhil Sachdeva, Raj Kamal Yadav, Vishaal Udandarao, Vrinda Mittal, Anubha
Gupta, and Abhinav Mathur. Eduqa: Educational domain question answering system using con-
ceptual network mapping. In ICASSP 2019-2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 8137–8141. IEEE, 2019.

Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. Synthetic qa corpora
generation with roundtrip consistency. arXiv preprint arXiv:1906.05416, 2019.

Husam Ali, Yllias Chali, and Sadid A Hasan. Automatic question generation from sentences. In
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts,
pages 213–218, 2010.

Max Bartolo, Tristan Thrush, Robin Jia, Sebastian Riedel, Pontus Stenetorp, and Douwe Kiela.
Improving question answering model robustness with synthetic adversarial data generation. arXiv
preprint arXiv:2104.08678, 2021.

Pengshan Cai, Zonghai Yao, Fei Liu, Dakuo Wang, Meghan Reilly, Huixue Zhou, Lingxi Li, Yi Cao,
Alok Kapoor, Adarsha Bajracharya, et al. Paniniqa: Enhancing patient education through inter-
active question answering. Transactions of the Association for Computational Linguistics, 11:
1518–1536, 2023.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to
solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Xinya Du, Junru Shao, and Claire Cardie. Learning to ask: Neural question generation for reading
comprehension. arXiv preprint arXiv:1705.00106, 2017.

Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. Question generation for question answering. In
Proceedings of the 2017 conference on empirical methods in natural language processing, pages
866–874, 2017.

Yuwei Fang, Shuohang Wang, Zhe Gan, Siqi Sun, Jingjing Liu, and Chenguang Zhu. Accelerating
real-time question answering via question generation. arXiv preprint arXiv:2009.05167, 2020.

Xu Han, Kunlun Zhu, Shihao Liang, Zhi Zheng, Guoyang Zeng, Zhiyuan Liu, and Maosong Sun.
Qasnowball: An iterative bootstrapping framework for high-quality question-answering data gen-
eration. arXiv preprint arXiv:2309.10326, 2023.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint
arXiv:2009.03300, 2020.

Sebastian Hofstätter, Hamed Zamani, Bhaskar Mitra, Nick Craswell, and Allan Hanbury. Local self-
attention over long text for efficient document retrieval. In Proceedings of the 43rd International
ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2021–
2024, 2020.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang,
and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint
arXiv:2106.09685, 2021.

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu,
Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-eval: A multi-level multi-discipline chinese
evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36,
2024.

Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus,
Pontus Stenetorp, and Sebastian Riedel. Paq: 65 million probably-asked questions and what you
can do with them. Transactions of the Association for Computational Linguistics, 9:1098–1115,
2021.

Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Lam Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-
tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks.
arXiv preprint arXiv:2110.07602, 2021.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt
understands, too. AI Open, 2023.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach. arXiv preprint arXiv:1907.11692, 2019.

Shayne Longpre, Yi Lu, Zhucheng Tu, and Chris DuBois. An exploration of data augmentation and
sampling techniques for domain-agnostic question answering. arXiv preprint arXiv:1912.02145,
2019.

Steven Moore, Huy A Nguyen, Tianying Chen, and John Stamper. Assessing the quality of multiple-
choice questions using gpt-4 and rule-based methods. In European Conference on Technology
Enhanced Learning, pages 229–245. Springer, 2023.

Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document expansion by query
prediction. arXiv preprint arXiv:1904.08375, 2019.

Lucas Francisco Amaral Orosco Pellicer, Taynan Maier Ferreira, and Anna Helena Reali Costa.
Data augmentation techniques in natural language processing. Applied Soft Computing, 132:
109803, 2023.

Raul Puri, Ryan Spring, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Training
question answering models from synthetic data. arXiv preprint arXiv:2002.09599, 2020.

Valentina Pyatkin, Paul Roit, Julian Michael, Reut Tsarfaty, Yoav Goldberg, and Ido Dagan.
Asking it all: Generating contextualized questions for any semantic role. arXiv preprint
arXiv:2109.04832, 2021.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. Journal of Machine Learning Research, 21:1–67, 2020.

Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and
Yoav Shoham. In-context retrieval-augmented language models. Transactions of the Association
for Computational Linguistics, 11:1316–1331, 2023.

Arij Riabi, Thomas Scialom, Rachel Keraron, Benoît Sagot, Djamé Seddah, and Jacopo Staiano.
Synthetic data augmentation for zero-shot cross-lingual question answering. arXiv preprint
arXiv:2010.12643, 2020.

Jasdeep Singh, Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. Xlda:
Cross-lingual data augmentation for natural language inference and question answering. arXiv
preprint arXiv:1905.11471, 2019.

Md Arafat Sultan, Shubham Chandel, Ramón Fernandez Astudillo, and Vittorio Castelli. On the
importance of diversity in question generation for qa. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, pages 5651–5656, 2020.

Asahi Ushio, Fernando Alva-Manchego, and Jose Camacho-Collados. An empirical comparison of
lm-based question and answer generation methods. arXiv preprint arXiv:2305.17002, 2023.

Yi Wang, Jinsheng Deng, Xi Yang, Jianyu Yi, and Zhaohui Ye. Mcqa: A responsive question-
answering system for online education. Sensors & Materials, 35, 2023.

Peixi Xiong and Ying Wu. Ta-student vqa: Multi-agents training by self-questioning. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10065–
10075, 2020.

Adams Wei Yu, David Dohan, Quoc Le, Thang Luong, Rui Zhao, and Kai Chen. Fast and accurate
reading comprehension by combining self-attention and convolution. In International conference
on learning representations, volume 2, 2018.

Shiyue Zhang and Mohit Bansal. Addressing semantic drift in question generation for semi-
supervised question answering. arXiv preprint arXiv:1909.06356, 2019.

Appendix A. Hyper Parameters

In fine-tuning each QOG model, we searched for the optimal hyperparameters; Table 5 shows the best ones. The maximum input length is fixed at 512, and the maximum output length at 256.

Approach       Model     Epoch  LR       Batch
Pipeline (AE)  T5-SMALL  6      0.0001   256
Pipeline (QG)  T5-SMALL  9      0.0001   64
Pipeline (DG)  T5-SMALL  13     0.00005  64
Multitask      T5-SMALL  7      0.0001   128
End2end        T5-SMALL  16     0.0001   128
Pipeline (AE)  T5-BASE   8      0.0001   64
Pipeline (QG)  T5-BASE   5      0.0001   32
Pipeline (DG)  T5-BASE   13     0.00005  32
Multitask      T5-BASE   7      0.0001   64
End2end        T5-BASE   17     0.0001   64
Pipeline (AE)  T5-LARGE  9      0.0001   32
Pipeline (QG)  T5-LARGE  6      0.00005  32
Pipeline (DG)  T5-LARGE  8      0.00001  32
Multitask      T5-LARGE  5      0.0001   32
End2end        T5-LARGE  12     0.0001   32

Table 5: Optimal hyperparameters for each QOG model.

Appendix B. Additional Results of LLM QO Evaluation

Table 6 shows the QO evaluation results of several large language models with relatively small parameter counts. Qwen2-7B has the highest score among all models. Phi-3, despite having fewer parameters, also performs well in the QO evaluation.

Appendix C. Additional Results of GPT-4 Evaluation

We also use the GPT-4 evaluation method to assess the QO generation quality of each LLM. The results are shown in Table 7.

Model               Aver   Amazon  Wiki   NYT    Reddit  Fin

Qwen2-7B            57.98  47.37   58.72  55.29  52.51   76.03
Qwen1.5-MoE-A2.7B   48.73  34.16   55.29  43.56  35.91   74.72
Deepseek-7B         54.32  51.58   37.78  61.20  62.09   58.95
GLM-4-9B            42.00  51.56   43.06   8.61  36.24   70.55
Baichuan2-7B        43.64  48.82   33.07  39.01  18.35   78.97
Mistral-7B          38.68  42.86   34.23  37.32  44.45   34.54
Gemma-7B            43.38  48.59   46.67  57.45  10.93   53.26
Phi-3-3.8B          52.98  48.25   52.01  70.48  51.07   43.09

Table 6: QO evaluation results (F1) of LLMs on the test set.

Model               GPT-4
Qwen2-7B            98
Qwen1.5-MoE-A2.7B   88
Deepseek-7B         95
GLM-4-9B            81
Baichuan2-7B        85
Mistral-7B          83
Gemma-7B            93
Phi-3-3.8B          87

Table 7: Mean scores for each LLM under the GPT-4 judgement.
