QOG: Question and Options Generation Based on Language Model
Jincheng Zhou
Zhuoshi Technology
Abstract
Question-Options Generation (QOG) is the task of generating a set of question-options
pairs for a given input context. This task has various applications, including fine-tuning large models,
information retrieval, and automated multiple-choice question generation for education. In this
paper, we develop QOG models using three different methods based on fine-tuning sequence-to-
sequence language models (LMs). Experiments demonstrate that the end-to-end QOG model is
computationally efficient and stable during both training and inference, outperforming other meth-
ods. Furthermore, our analysis indicates that our QOG models are competitive on the QOG task
compared to the large language model Llama 3-8B.
Keywords: Question-Options Generation (QOG); Information retrieval; sequence-to-sequence
language models
1. Introduction
Question-Options (QO) is derived from Question-Answering (QA), extending question-answer pairs
by adding three text-related distractors. Compared to QA, QO provides more information by com-
bining the correct answer with several specially designed incorrect answers. This information en-
ables the model to incorporate more knowledge and improve its ability to identify potential errors.
When used for fine-tuning large language models (Hu et al., 2021; Liu et al., 2021, 2023), the goal
is to identify the correct answer among four candidates, which improves the model's understanding
of the question and its ability to distinguish between the different options, helping it understand
the text better. When used for model evaluation, whether the model
can select the correct answer becomes a key indicator of its text learning and processing capabilities
(Hendrycks et al., 2020; Huang et al., 2024; Cobbe et al., 2021).
Question and Options Generation (QOG) refers to the task of generating a set of question-
options pairs given an input context (e.g. a paragraph). QOG can be used to develop unsupervised
question answering and as a data augmentation tool (Alberti et al., 2019; Yu et al., 2018; Riabi et al.,
2020; Singh et al., 2019; Longpre et al., 2019) to enhance the text understanding capabilities of large
language models. QOG can also be used as an educational aid (Agarwal et al., 2019; Cai et al., 2023;
Wang et al., 2023), to enhance information retrieval models (Nogueira et al., 2019; Pyatkin et al.,
2021), and as a way to explain models.
[Figure 1: Overview of the three QOG approaches. Pipeline QOG chains separate answer extraction, question generation, and distractor generation models; Multitask QOG trains the three subtasks with a single shared LM; End2End QOG generates question-options pairs directly from the context. Example output: question "Who help you when you are in trouble?", answer "Friends", distractors "Strangers", "Competitors", "Acquaintances".]
QOG originates from Question Generation (QG) (Duan et al., 2017; Du et al., 2017; Ali et al.,
2010), which involves generating a question for an answer given an input context. Compared to
QG and Question-Answer Generation (QAG), QOG presents a more intricate challenge, as the correct answer and distractors must
be constructed rather than assumed as part of the input. It is currently unknown which models are
effective for the QOG task, because no comprehensive comparative study has been conducted to
date.
In this paper, we treat QOG as the task of generating questions and options given a context,
and compare three simple QOG methods based on fine-tuning encoder-decoder language models
such as T5 (Raffel et al., 2020). The three methods are common approaches to fine-tuning language
models: (1) Pipeline QOG, which decomposes the task into three parts, answer extraction,
question generation, and distractor generation, and learns a separate model for each subtask; (2)
Multitask QOG, which trains the three subtasks simultaneously with a shared single model instead
of separate models; and (3) End-to-end QOG, which directly generates question-options pairs using end-
to-end sequence-to-sequence learning. Finally, we introduce GPT-4 as a referee on top of the traditional
evaluation methods and compare the three methods in multiple domains to objectively evaluate the
cross-domain generalization ability of the models.
2. Related Work
There is no prior work that uses pre-trained LMs for QOG, but there is related work on QG and QAG.
Xiong and Wu (2020) followed the idea of adversarial generation and used question-answering agents
and QG models to train adversarial models. This work used semantic similarity as an evaluation
metric to improve the quality of the generated data, but did not focus on the generalization ability of
the model in other fields. In contrast, our work focuses on improving data quality while also testing
the generalization ability of the model with datasets from different fields.
3. QOG Methods
In the Pipeline approach, we fine-tune a separate language model for each of three subtasks, answer extraction (AE), question generation (QG), and distractor generation (DG), thereby optimizing their abilities to generate relevant questions and distractors and to extract correct answers.
The AE model P_ae first extracts an answer a from a given context c; the QG model P_qg then
generates a question q, answerable by a, based on the context c and the answer a; finally, the DG
model P_dg generates distractors d based on the context c and the answer a. The three models can
be trained independently on any dataset of (c, a, q, d) tuples by maximizing the conditional
log-likelihoods

    L_ae = log P_ae(a | c),    L_qg = log P_qg(q | c, a),    L_dg = log P_dg(d | c, a).

The input to the AE model is the tokenized context

    x_ae = [c_1, c_2, ..., c_|c|],

where c_i is the i-th token in context c and |c| is the number of tokens in the text. In the input to
the QG model, the answer a is taken into account and its position in the context is marked with
<hl>:

    x_qg = [c_1, ..., <hl>, a_1, ..., a_|a|, <hl>, ..., c_|c|],

where a_i is the i-th token of the answer a. The model thus learns the pattern of generating a
question from the highlighted answer. Finally, the answer and the context are fed into the model
P_dg:

    x_dg = [a_1, ..., a_|a|, c_1, ..., c_|c|].
At inference time, we replace the answer a that is fed to the QG and DG models during training
with the prediction ã of the AE model, and then run the three models over the context c to obtain the
generated question-options. Since ã is used as the input of P_qg and P_dg, the performance of this
method largely depends on P_ae.
In the Multitask approach, the three subtasks instead share a single model, which is trained jointly by
maximizing the weighted objective

    L_multi = α L_ae + β L_qg + γ L_dg,

where α, β, and γ are hyperparameters. This design enables the model to jointly learn three different
but related task patterns at once, thereby improving its overall generalization ability and efficiency.
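A minimal sketch of one way to implement such a weighted joint objective with a single shared T5 model is shown below; the task prefixes, the example values of α, β, γ, and the use of plain teacher forcing on a single example are assumptions made for illustration.

    # Sketch of one multitask training step with a shared model and weighted subtask losses.
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tok = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    alpha, beta, gamma = 1.0, 1.0, 1.0  # hyperparameters weighting the AE / QG / DG losses

    def task_loss(prefix, source, target):
        # Teacher-forced cross-entropy for one subtask; the prefix tells the shared
        # model which subtask it is currently solving.
        enc = tok(prefix + source, return_tensors="pt", truncation=True, max_length=1024)
        labels = tok(target, return_tensors="pt", truncation=True, max_length=64).input_ids
        return model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels).loss

    c = "Friends are the people who help you when you are in trouble."
    a, q = "Friends", "Who help you when you are in trouble?"
    d = "Strangers | Competitors | Acquaintances"

    loss = (alpha * task_loss("extract answer: ", c, a)
            + beta * task_loss("generate question: ", c.replace(a, f"<hl> {a} <hl>", 1), q)
            + gamma * task_loss("generate distractors: ", f"{a} {c}", d))
    loss.backward()
    optim.step()
    optim.zero_grad()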
In the End-to-end approach, all question-options pairs of a context are generated at once as a single
target text y, where each question-options pair is converted into a textual form and the pairs are
connected by the separator '|'. The end-to-end QOG model P_qog is optimized by maximizing the
conditional log-likelihood

    L_e2e = log P_qog(y | c).
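As an illustration of the flattened target y and of how a generated sequence can be split back into question-options pairs, consider the sketch below; the field labels ("question:", "answer:", "distractors:") are assumed, and only the '|' separator comes from the description above.

    # Sketch of the end-to-end target format and of parsing a generated sequence.
    # Field names and the exact delimiters are assumptions.
    def to_target(pairs):
        # Flatten all question-options pairs of one context into a single string y,
        # joining pairs with the separator '|'.
        chunks = []
        for p in pairs:
            chunks.append(f"question: {p['question']}, answer: {p['answer']}, "
                          f"distractors: {', '.join(p['distractors'])}")
        return " | ".join(chunks)

    def parse_output(y):
        # Inverse of to_target: recover question-options pairs from model output.
        # Brittle if a field itself contains ', ' or ' | '; acceptable for a sketch.
        pairs = []
        for chunk in y.split(" | "):
            fields = dict(f.split(": ", 1) for f in chunk.split(", ", 2))
            pairs.append({"question": fields.get("question", ""),
                          "answer": fields.get("answer", ""),
                          "distractors": fields.get("distractors", "").split(", ")})
        return pairs

    pairs = [{"question": "Who help you when you are in trouble?",
              "answer": "Friends",
              "distractors": ["Strangers", "Competitors", "Acquaintances"]}]
    y = to_target(pairs)
    assert parse_output(y) == pairs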
4. Evaluation
4.1. Experimental Setting
Data. The data for training QOG is a multiple-choice question dataset built from SQuAD. We send
each QA pair in SQuAD to a large language model (such as GPT-4) to generate distractors, forming
the dataset SQuAD-QO. For SQuADShifts and FinQA, we generate the test sets SQuADShifts-QO
and FinQA-QO in the same way. The resulting datasets have been publicly released on HuggingFace.
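A minimal sketch of how a SQuAD QA pair can be turned into a QO item, assuming the OpenAI chat completions API; the prompt wording and output parsing are illustrative, not the exact procedure used to build SQuAD-QO.

    # Sketch: turn a SQuAD-style (context, question, answer) triple into a QO item
    # by asking a large language model for three distractors. Prompt and parsing are assumptions.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def make_qo(context, question, answer, model="gpt-4"):
        prompt = (
            "Given the context, question and correct answer below, write three plausible "
            "but incorrect options (distractors), one per line.\n"
            f"Context: {context}\nQuestion: {question}\nAnswer: {answer}"
        )
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        lines = resp.choices[0].message.content.splitlines()
        distractors = [l.strip("- ").strip() for l in lines if l.strip()][:3]
        return {"question": question, "answer": answer, "distractors": distractors}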
Table 1: QO evaluation results (F1) of different QOG models on the test set. For comparison, we
introduced Llama 3-8B. The best score of QOG methods in each LM is shown in bold, and the best
result in each domain across all models is underlined.
Evaluation. Since the output of QOG involves a variety of questions and options, traditional natural
language generation metrics are not directly applicable to it. In order to evaluate the quality of the
generated questions and options (QO) comprehensively, we adopt the following two methods: 1) We
randomly select 100 questions from the generated QO dataset and ask GPT-4 to answer them; the
more questions are answered correctly, the higher the quality of the generated QOs, where high
quality only indicates answerability, as we noticed that the model sometimes generates questions
with unclear meanings or fails to generate them properly. 2) We build test sets from datasets other
than SQuAD and use F1 scores to assess the generalisation ability of the QOG model to other
domains. For this purpose, we choose two datasets: SQuADShifts, which covers English reading
comprehension tasks in four domains (Amazon/Wiki/NYT/Reddit), and FinQA, a Q&A dataset in
the financial domain. Together they cover Q&A in different domains and can effectively judge the
applicability and flexibility of the model in each domain.

Figure 2: QO quality assessment results (F1 scores, 95% confidence intervals) of the T5-Large
multitask/pipeline/end2end models on different domain datasets, compared with Llama 3-8B.
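As a reference point, a common choice for such scores is SQuAD-style token-level F1 between a generated string and a reference string; the sketch below shows this metric as an assumed illustration of how the F1 in Table 1 could be computed, not as the paper's exact procedure.

    # Sketch of SQuAD-style token-level F1 between a generated string and a reference.
    import re
    from collections import Counter

    def normalize(text):
        text = text.lower()
        text = re.sub(r"[^a-z0-9 ]", " ", text)  # drop punctuation
        return text.split()

    def token_f1(prediction, reference):
        pred, ref = normalize(prediction), normalize(reference)
        common = Counter(pred) & Counter(ref)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred)
        recall = overlap / len(ref)
        return 2 * precision * recall / (precision + recall)

    print(token_f1("Who helps you when you are in trouble?",
                   "Who help you when you are in trouble?"))  # high but not perfect overlap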
T5 & Llama 3. For the three approaches mentioned above (i.e., pipelined, multitasking, and
end-to-end), we use T5 as the base language model for our experiments. The model weights used
include T5-Small, T5-Base, and T5-Large, which are all open source on the HuggingFace platform.
In addition, we report the results of the latest open-source large model Llama 3-8B as a QOG model
and compare it with T5.

Table 2: Size of datasets.

Dataset            Size
SQuAD-QO         87,399
SQuADShifts-QO   37,691
FinQA-QO          8,281
4.2. Results
Table 1 shows the evaluation results for the three methods considered. From T5-Small to T5-Large,
the scores improve for every method, showing that model size has a significant positive impact
on performance. The best model overall, T5-Large (End2end), achieves the best results in three of
the five domains and outperforms Llama 3-8B on average. Even smaller models, such as T5-Small,
generate QO pairs of decent quality.
Given these results, the End2end approach achieved the best results in the vast majority of domains,
with the highest average scores on T5 models of all sizes. Our analysis is that the performance of the
Pipeline method depends on the first AE model, whose predictions are passed downstream, so errors
accumulate and the overall performance degrades. The Multitask method shares the same neural
network, so the outputs of the three tasks are more stable; its problem is that, on average, only one
third of the parameters are available for each task, which degrades the model's performance. The
advantage of End2end is that it treats QOG as a single task: all parameters are updated during
training to optimise this one task, which gives the model the best generalisation and performance
of the three.
We also notice that the Pipeline method performs best on FinQA, because the answers in FinQA
are mainly short numbers (e.g. 236), and the AE model focuses on this pattern during learning and
acquires a very efficient extraction pattern. This suggests that when the QOG data has some
regularity, the Pipeline method can be used to train the model for better performance.

4.3. Assessing the quality of QO pairs using GPT-4

GPT-4 is currently the best large language model and the one that best meets human evaluation
criteria. In order to comprehensively evaluate the quality of the QO pairs generated by each QOG
model, we use GPT-4 as the judging model. Specifically, each QOG model generates 100 QO pairs
in different fields, and GPT-4 then answers these QO pairs. If GPT-4 answers correctly, the
question-options are logical and answerable, which indicates that the QO pair is qualified. In this
way, we performed a comprehensive evaluation of the quality of the QOs generated by each model;
the results are shown in Table 3.

Table 3: Mean scores for each QOG model under the GPT-4 judgement.

Approach              GPT-4
Llama 3-8B               94
T5-Small (pipeline)      64
T5-Small (multitask)     53
T5-Small (end2end)       61
T5-Base (pipeline)       79
T5-Base (multitask)      68
T5-Base (end2end)        76
T5-Large (pipeline)      85
T5-Large (multitask)     77
T5-Large (end2end)       83
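A sketch of the judging loop described above, assuming the OpenAI chat completions API; the prompt format, option shuffling, and single-letter answer matching are illustrative assumptions rather than the exact protocol of this work.

    # Sketch: let GPT-4 answer each generated multiple-choice item and report accuracy.
    import random
    from openai import OpenAI

    client = OpenAI()

    def judge(qo_items, model="gpt-4"):
        correct = 0
        for item in qo_items:
            options = item["distractors"] + [item["answer"]]
            random.shuffle(options)
            letters = "ABCD"
            listed = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(options))
            prompt = ("Answer the multiple-choice question with a single letter.\n"
                      f"Question: {item['question']}\n{listed}")
            reply = client.chat.completions.create(
                model=model, messages=[{"role": "user", "content": prompt}]
            ).choices[0].message.content.strip()
            gold = letters[options.index(item["answer"])]
            correct += reply[:1].upper() == gold
        return 100.0 * correct / len(qo_items)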
We found that Llama 3-8B, which previously performed mediocrely in terms of F1 scores, achieved
the best results in all areas under the judgement of GPT-4, indicating that the QO pairs generated by
Llama 3 are superior to those of the other models in terms of logic. One possible reason is that the
F1 score only measures textual overlap when calculating similarity. This result shows that evaluating
only by F1 score is not enough; the GPT-4-based evaluation we introduce provides a valuable
additional quality indicator for the generated QO pairs.
[Figure 3: Inference time in ms (left) and memory usage in MB (right) of the Pipeline, Multitask, and End2end QOG models at the Small, Base, and Large T5 sizes.]
4.4. Computational cost

Computational cost is also an important factor when choosing a QOG model, as each method has its
own advantages and limitations in terms of computational cost and usability. We ran each QOG
model in the same environment to measure its inference time and memory usage; the results are
shown in Table 4.

Table 4: Inference time and memory usage of each QOG model.

Approach               Compute (ms)   Memory (MB)
T5-Small (pipeline)          363.62       1176.38
T5-Small (multitask)         217.75       1336.99
T5-Small (end2end)           301.83       1336.28
T5-Large (pipeline)         1480.73       8960.01
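A minimal sketch of how per-model inference time and peak GPU memory can be measured with PyTorch; warm-up handling, batch size, and generation settings are assumptions rather than the exact measurement protocol used here.

    # Sketch: measure average inference latency and peak GPU memory of one QOG model.
    import time
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    def profile(model_name, contexts, device="cuda"):
        tok = T5Tokenizer.from_pretrained(model_name)
        model = T5ForConditionalGeneration.from_pretrained(model_name).to(device).eval()
        torch.cuda.reset_peak_memory_stats(device)
        timings = []
        with torch.no_grad():
            for c in contexts:
                ids = tok(c, return_tensors="pt", truncation=True,
                          max_length=1024).input_ids.to(device)
                torch.cuda.synchronize(device)
                start = time.perf_counter()
                model.generate(ids, max_new_tokens=64)
                torch.cuda.synchronize(device)
                timings.append((time.perf_counter() - start) * 1000)  # milliseconds
        peak_mb = torch.cuda.max_memory_allocated(device) / 2**20
        return sum(timings) / len(timings), peak_mb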
In terms of computational resources, the End2end method outperforms the other two because it
generates QO pairs in a single inference pass. In contrast, both the Pipeline and Multitask methods
require three inference passes in total to generate QO pairs. The computational cost of all three
methods is proportional to the number of tokens in the input text. In terms of memory usage, the
Multitask and End2end methods each use a single model to complete the task, while the Pipeline
method consists of three models and therefore needs roughly three times as much memory as the
other two.
Finally, the performance of the Pipeline method depends on the quality of answer extraction and is
limited by the fact that the three models are trained independently and are therefore not optimised
towards a consistent objective, although it performs very well on datasets with regular answers. The
Multitask method shares a single model, which improves the consistency between the three tasks;
however, each task gets only one third of the parameters on average, which limits its performance.
The End2end method treats QOG as a single task and therefore generates QO pairs better than the
other methods.
5. Conclusion
In this paper, we treat QOG as the task of generating questions and options given an input context, and
train QOG models using three different approaches. To evaluate them, we propose two methods:
first, we use the QOG models to generate QO pairs in different domains and use F1 scores to
assess the generalisation ability of the QOG models to other domains; second, GPT-4 is used to
select the correct options in the QOs, with higher GPT-4 scores indicating better question quality. The
evaluation shows that the End2end QOG model is not only the fastest at generation but also the most
effective. The findings of this study are encouraging, as the quality of the QO datasets generated by
our QOG models is close to that generated by Llama 3, while consuming far fewer resources than
the large language model.
Limitations
The input to our QOG models is limited to 1024 tokens, so they cannot handle longer texts. The
options used for training are short and the logic required to answer the questions is simple; as a
result, our models cannot be used to generate longer and more complex options either. The models
are only applicable to English scenarios; applying them to other languages would require a
multilingual QO dataset for training and evaluation. We will try our best to address these limitations
in the future.
References
Abhishek Agarwal, Nikhil Sachdeva, Raj Kamal Yadav, Vishaal Udandarao, Vrinda Mittal, Anubha
Gupta, and Abhinav Mathur. Eduqa: Educational domain question answering system using con-
ceptual network mapping. In ICASSP 2019-2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 8137–8141. IEEE, 2019.
Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. Synthetic qa corpora
generation with roundtrip consistency. arXiv preprint arXiv:1906.05416, 2019.
Husam Ali, Yllias Chali, and Sadid A Hasan. Automatic question generation from sentences. In
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts,
pages 213–218, 2010.
Max Bartolo, Tristan Thrush, Robin Jia, Sebastian Riedel, Pontus Stenetorp, and Douwe Kiela.
Improving question answering model robustness with synthetic adversarial data generation. arXiv
preprint arXiv:2104.08678, 2021.
Pengshan Cai, Zonghai Yao, Fei Liu, Dakuo Wang, Meghan Reilly, Huixue Zhou, Lingxi Li, Yi Cao,
Alok Kapoor, Adarsha Bajracharya, et al. Paniniqa: Enhancing patient education through inter-
active question answering. Transactions of the Association for Computational Linguistics, 11:
1518–1536, 2023.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to
solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
Xinya Du, Junru Shao, and Claire Cardie. Learning to ask: Neural question generation for reading
comprehension. arXiv preprint arXiv:1705.00106, 2017.
Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. Question generation for question answering. In
Proceedings of the 2017 conference on empirical methods in natural language processing, pages
866–874, 2017.
Yuwei Fang, Shuohang Wang, Zhe Gan, Siqi Sun, Jingjing Liu, and Chenguang Zhu. Accelerating
real-time question answering via question generation. arXiv preprint arXiv:2009.05167, 2020.
Xu Han, Kunlun Zhu, Shihao Liang, Zhi Zheng, Guoyang Zeng, Zhiyuan Liu, and Maosong Sun.
Qasnowball: An iterative bootstrapping framework for high-quality question-answering data gen-
eration. arXiv preprint arXiv:2309.10326, 2023.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint
arXiv:2009.03300, 2020.
Sebastian Hofstätter, Hamed Zamani, Bhaskar Mitra, Nick Craswell, and Allan Hanbury. Local self-
attention over long text for efficient document retrieval. In Proceedings of the 43rd International
ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2021–
2024, 2020.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang,
and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint
arXiv:2106.09685, 2021.
Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu,
Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-eval: A multi-level multi-discipline chinese
evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36,
2024.
Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus,
Pontus Stenetorp, and Sebastian Riedel. Paq: 65 million probably-asked questions and what you
can do with them. Transactions of the Association for Computational Linguistics, 9:1098–1115,
2021.
Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Lam Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-
tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks.
arXiv preprint arXiv:2110.07602, 2021.
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt
understands, too. AI Open, 2023.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach. arXiv preprint arXiv:1907.11692, 2019.
Shayne Longpre, Yi Lu, Zhucheng Tu, and Chris DuBois. An exploration of data augmentation and
sampling techniques for domain-agnostic question answering. arXiv preprint arXiv:1912.02145,
2019.
Steven Moore, Huy A Nguyen, Tianying Chen, and John Stamper. Assessing the quality of multiple-
choice questions using gpt-4 and rule-based methods. In European Conference on Technology
Enhanced Learning, pages 229–245. Springer, 2023.
Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document expansion by query
prediction. arXiv preprint arXiv:1904.08375, 2019.
Lucas Francisco Amaral Orosco Pellicer, Taynan Maier Ferreira, and Anna Helena Reali Costa.
Data augmentation techniques in natural language processing. Applied Soft Computing, 132:
109803, 2023.
Raul Puri, Ryan Spring, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Training
question answering models from synthetic data. arXiv preprint arXiv:2002.09599, 2020.
Valentina Pyatkin, Paul Roit, Julian Michael, Reut Tsarfaty, Yoav Goldberg, and Ido Dagan.
Asking it all: Generating contextualized questions for any semantic role. arXiv preprint
arXiv:2109.04832, 2021.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. Journal of Machine Learning Research, 21:1–67, 2020.
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and
Yoav Shoham. In-context retrieval-augmented language models. Transactions of the Association
for Computational Linguistics, 11:1316–1331, 2023.
Arij Riabi, Thomas Scialom, Rachel Keraron, Benoît Sagot, Djamé Seddah, and Jacopo Staiano.
Synthetic data augmentation for zero-shot cross-lingual question answering. arXiv preprint
arXiv:2010.12643, 2020.
Jasdeep Singh, Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. Xlda:
Cross-lingual data augmentation for natural language inference and question answering. arXiv
preprint arXiv:1905.11471, 2019.
Md Arafat Sultan, Shubham Chandel, Ramón Fernandez Astudillo, and Vittorio Castelli. On the
importance of diversity in question generation for qa. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, pages 5651–5656, 2020.
Yi Wang, Jinsheng Deng, Xi Yang, Jianyu Yi, and Zhaohui Ye. Mcqa: A responsive question-
answering system for online education. Sensors & Materials, 35, 2023.
Peixi Xiong and Ying Wu. Ta-student vqa: Multi-agents training by self-questioning. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10065–
10075, 2020.
Adams Wei Yu, David Dohan, Quoc Le, Thang Luong, Rui Zhao, and Kai Chen. Fast and accurate
reading comprehension by combining self-attention and convolution. In International conference
on learning representations, volume 2, 2018.
Shiyue Zhang and Mohit Bansal. Addressing semantic drift in question generation for semi-
supervised question answering. arXiv preprint arXiv:1909.06356, 2019.
Table 7: Mean scores for each LLM under the GPT-4 judgement.

Model                  GPT-4
Qwen2-7B                  98
Qwen1.5-MoE-A2.7B         88
Deepseek-7B               95
GLM-4-9B                  81
Baichuan2-7B              85
Mistral-7B                83
Gemma-7B                  93
Phi-3-3.8B                87