Evaluating Mathematical Reasoning of Large Language Models - A Focus On Error Identification and Correction
Xiaoyuan Li1, Wenjie Wang2*, Moxin Li2, Junrong Guo1, Yang Zhang1, Fuli Feng1*
University of Science and Technology of China1
National University of Singapore2
{xiaoyuanli,godrong,zy2015}@mail.ustc.edu.cn limoxin@u.nus.edu
{wenjiewang96,fulifeng93}@gmail.com
[Figure 2 body: GPT-4 converts a correct solution (Step 1: Each shirt cost 20*.5=$10 with the sale; Step 2: So he paid 10*6=$60; correct answer 60) into a wrong solution (wrong answer 50) annotated with the error type (Calculation Error) and the wrong step (Step 2), supporting the four evaluation tasks: Error-Presence, Error-Step, Error-Type Identification, and Error Correction.]
Figure 2: Illustration of dataset construction and the four evaluation tasks. For dataset construction, we use GPT-4 to convert ground-truth solutions into wrong solutions containing specific error types. The four evaluation tasks comprehensively assess LLMs' error identification and correction abilities from diverse perspectives.
• Task 2: Error-Step Identification (ES) intends to find the first wrong step t in a wrong solution. For Task 2, we require LLMs to output the judgment ŷ, and if s contains errors, we also instruct LLMs to identify the first erroneous step t in the solution. We devise zero-shot prompts and few-shot prompts with in-context learning examples for the ES task. Figures 25 and 26 show the zero-shot prompts for open-source and closed-source models. For ES evaluation, we compute acc1 as in EP and the accuracy in identifying the error step by $\mathrm{acc}_2 = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\{t = \hat{t}\}$, where $\hat{t}$ denotes the first wrong step predicted by LLMs.

• Task 3: Error-Type Identification (ET) endeavors to identify the error type. We instruct LLMs to output the judgment for y and identify the error type c of the first wrong step if s contains errors. Here, c is selected from the pre-defined error types, such as calculation error. We define the error types in the prompts and design zero-shot and few-shot prompts, where the few-shot prompt provides an example for each error type. Figures 29 and 30 showcase the zero-shot prompts for open-source and closed-source models. Considering that the order of error types might affect the accuracy of identifying error types, we design prompts that reverse the original order of error types and randomly shuffle them. We compute acc1 and the accuracy in identifying the error type $\mathrm{acc}_3 = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\{c = \hat{c}\}$, where $\hat{c}$ is the error type of the first wrong step identified by LLMs.

• Task 4: Error Correction (EC) seeks to rectify the error and output the correct solution. We prompt LLMs to output the judgment for y and provide the corrected solution and answer $\hat{a}$ if s contains errors. We devise zero-shot and few-shot prompts as in ES. The prompts are displayed in Figures 31 and 32. We calculate acc1 and the accuracy of correction $\mathrm{acc}_4 = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\{a = \hat{a}\}$, where $\hat{a}$ and a are the predicted and ground-truth answers, respectively.

For Task 2 and Task 4, we propose to leverage the error type information in the prompts to hint LLMs for error step identification and error correction. Accordingly, we design the zero-shot and few-shot prompts with error type information as shown in Figures 27, 28, 33 and 34.
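For concreteness, the following is a minimal Python sketch of the four metrics; it is not the authors' implementation, and the per-case field names (gold_judgment, pred_step, etc.) are hypothetical.

# Minimal sketch of acc1-acc4, assuming each case is a dict with gold labels
# and parsed model predictions; field names are hypothetical.
def accuracy(pairs):
    matched = sum(1 for gold, pred in pairs if gold == pred)
    return matched / len(pairs) if pairs else 0.0

def evaluate(cases):
    acc1 = accuracy([(c["gold_judgment"], c["pred_judgment"]) for c in cases])  # error presence y
    acc2 = accuracy([(c["gold_step"], c["pred_step"]) for c in cases])          # first wrong step t
    acc3 = accuracy([(c["gold_type"], c["pred_type"]) for c in cases])          # error type c
    acc4 = accuracy([(c["gold_answer"], c["pred_answer"]) for c in cases])      # corrected answer a
    return {"acc1": acc1, "acc2": acc2, "acc3": acc3, "acc4": acc4}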
Error Type Definition
Calculation Error (CA) Error appears during the calculation process.
Counting Error (CO) Error occurs during the counting process.
Context Value Error (CV) Error arises when attributes of named entities do not align with the information provided.
Hallucination (HA) Error involves adding fictitious, unrelated statements that contradict the question.
Unit Conversion Error (UC) Error occurs during the unit conversion process.
Operator Error (OP) Error involves a single operator being erroneously applied within an expression.
Formula Confusion Error (FC) Error appears when a formula is applied in an inappropriate scenario.
Missing Step (MS) Error entails an incomplete generation of the reasoning process, lacking a necessary step.
Contradictory Step (CS) Error manifests as inconsistency between preceding and subsequent reasoning steps.
Table 1: Definitions of nine common error types. Among them, unit conversion error, operator error, and formula confusion error can be categorized as common-sense errors, i.e., errors in relationships that should be understood as worldly common sense. The generation rules and examples are presented in Appendix C.
3 Dataset Construction

A significant challenge in achieving the four evaluation tasks is the lack of compatible datasets with fine-grained error annotation. Therefore, we opt to construct a dataset that meets the requirements of our evaluation tasks. This dataset should encompass erroneous solutions, error steps, error types, and correct answers for mathematical questions.

Initially, we distill nine common error types from existing works (Wei et al., 2022; Toh et al., 2023; Lightman et al., 2023; Shakarian et al., 2023; Bubeck et al., 2023; Sawada et al., 2023; Suzgun et al., 2022; Lyu et al., 2023; Kojima et al., 2022; Li et al., 2023; Wang et al., 2022; Wang et al., 2023; Paul et al., 2023; Golovneva et al., 2022; Ribeiro et al., 2023; Lewkowycz et al., 2022) and practical examples. Table 1 shows the error names and definitions, covering single-step and cross-step errors. The specific definition differences and illustrative examples are presented in Appendix B.

Data Generation. As illustrated in Figure 2, we utilize the state-of-the-art LLM, GPT-4 (OpenAI, 2023), to generate the dataset, EIC-Math (Error Identification and Correction on Mathematical problems), to support the evaluation tasks. We design generation rules for different error types, which regulate the generated wrong solutions to strictly meet the definition of one error type². Then we construct the data generation prompt based on these generation rules and the in-context learning approach (Brown et al., 2020; Ouyang et al., 2022; Min et al., 2022) to instruct GPT-4 to transform correct solutions into wrong solutions. The data generation process is detailed in Appendix F.1.1 to save space. Note that we use two datasets, GSM8K (Cobbe et al., 2021) and MathQA (Amini et al., 2019), to construct the error cases, where GSM8K has annotated multi-step solutions and MathQA adopts the correct solutions generated by GPT-3.5. Each dataset comprises 100 cases per error type, resulting in a total of 1,800 cases for the error identification and correction tasks.
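To make the generation step concrete, the sketch below shows how a correct solution could be turned into a wrong one for a given error type with GPT-4; the prompt wording, function names, and client usage are illustrative assumptions, not the exact prompts of Appendix F.1.1.

# Illustrative sketch of the EIC-Math generation step (assumed prompt wording).
from openai import OpenAI

client = OpenAI()

def make_wrong_solution(question, correct_solution, error_type, rule, demos):
    # `rule` constrains the wrong solution to contain exactly one error of
    # `error_type` in a single step; `demos` are in-context examples for it.
    prompt = (
        f"Rewrite the correct solution so that it contains exactly one {error_type} "
        "in a single step, keeping all other steps unchanged.\n"
        f"Rule: {rule}\n\n"
        + "\n\n".join(demos)
        + f"\n\nQuestion: {question}\nCorrect solution: {correct_solution}\n"
        "Return the wrong solution, the index of the wrong step, and the wrong answer."
    )
    response = client.chat.completions.create(
        model="gpt-4",      # the paper uses a GPT-4 snapshot
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content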
Human Evaluation. To evaluate the quality of EIC-Math, we randomly select 180 cases and invite three evaluators for human evaluation. The results indicate that 92.5% of the cases exactly satisfy the requirements of the data generation prompts, demonstrating the high quality of the generated dataset. More details of the human evaluation can be found in Appendix D.

4 Experiment

We conduct extensive experiments to address the following research questions:
- RQ1: How do different LLMs perform on the four tasks of error identification and correction?
- RQ2: How difficult is it to identify and correct different error types?
- RQ3: How robust are LLMs to different prompts w.r.t. the four evaluation tasks?

Experiment Setup. We select typical commercial closed-source LLMs, GPT-3.5, GPT-4, GLM-4, and Gemini Pro, along with the general-purpose open-source LLaMA-2 series and the state-of-the-art mathematical MetaMath series in their 7B and 13B versions for evaluation. Besides, we also evaluate three other cutting-edge mathematical models: Mistral, Llemma and LEMA in their 7B versions.³ To minimize randomness, we set the temperature to 0. For ease of statistical analysis, we prompt closed-source LLMs to output in JSON format. However, open-source models do not consistently adhere to the format requirement, so we use a relaxed format for their prompts.
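The sketch below illustrates the kind of relaxed parsing such a setup implies: closed-source models return JSON, while open-source outputs are scraped with permissive patterns. The regular expressions and field names are our own illustration, not the parser used in the paper.

# Illustrative output parsing (not the paper's parser).
import json
import re

def parse_closed_source(text):
    # e.g. {"judgment": "incorrect", "wrong_step": 2, "answer": "50"}
    return json.loads(text)

def parse_open_source(text):
    judgment = "incorrect" if re.search(r"\b(incorrect|wrong)\b", text, re.I) else "correct"
    step = re.search(r"step\s*(\d+)", text, re.I)
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return {
        "judgment": judgment,
        "wrong_step": int(step.group(1)) if step else None,
        "answer": numbers[-1] if numbers else None,  # take the last number as the answer
    }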
² In this work, we only consider generating the wrong solution with only one error type in a single step to simplify the evaluation process, leaving more complicated error identification and correction to future work.
³ Specifically, we conduct experiments using gpt-3.5-turbo-1106, gpt-4-1106-preview, LLaMA-2-7B-chat, LLaMA-2-13B-chat, MetaMath-7B-V1.0, MetaMath-13B-V1.0, Mistral-7B-V0.1, Llemma-7B, and LEMA-V1-PEFT-LLaMA-2-7B-GSM8K.
GSM8K: EP(acc1) ES(acc2 acc1) ET(acc3 acc1) EC(acc4 acc1) Avg(acc acc1) | MathQA: EP(acc1) ES(acc2 acc1) ET(acc3 acc1) EC(acc4 acc1) Avg(acc acc1) | Overall Avg(acc acc1)
GPT-3.5 0.547 0.147 0.598 0.211 0.737 0.169 0.340 0.269 0.556 0.493 0.173 0.642 0.173 0.676 0.141 0.302 0.245 0.528 0.257 0.542
GPT-4 0.930 0.843 0.946 0.516 0.951 0.883 0.929 0.793 0.939 0.917 0.714 0.954 0.481 0.957 0.810 0.909 0.731 0.934 0.762 0.937
GLM-4 0.849 0.640 0.819 0.349 0.941 0.804 0.881 0.661 0.873 0.772 0.551 0.892 0.327 0.910 0.574 0.808 0.556 0.846 0.609 0.860
Gemini Pro 0.217 0.359 0.541 0.090 0.312 0.248 0.279 0.229 0.337 0.197 0.239 0.389 0.096 0.603 0.200 0.260 0.183 0.362 0.206 0.350
LLaMA-2-7B 0.538 0.184 0.914 0.048 0.396 0.067 0.871 0.209 0.680 0.536 0.176 0.861 0.052 0.358 0.039 0.792 0.201 0.637 0.205 0.659
LLaMA-2-13B 0.166 0.007 0.027 0.127 0.843 0.000 0.008 0.075 0.261 0.219 0.009 0.071 0.116 0.939 0.000 0.010 0.086 0.310 0.081 0.286
Avg 0.541 0.363 0.641 0.224 0.697 0.362 0.551 0.372 0.608 0.522 0.310 0.635 0.208 0.741 0.294 0.514 0.334 0.603 0.353 0.605
Table 2: Average accuracy of different models on the four tasks, reported separately for GSM8K and MathQA under zero-shot prompts. EP reports the average acc1 over all error types. For ES, the first and second columns report the average acc2 and acc1, respectively; ET and EC follow the same convention. The first Avg column is the average of acc1, acc2, acc3, and acc4 over all error types and represents the ability to both identify and correct errors, while the second Avg column is the average acc1 of the four tasks and represents only the ability to identify errors.
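As a worked example of the caption's definition, the two Avg columns can be reproduced from a single row of Table 2; the small sketch below uses GPT-4's GSM8K values.

# Reproducing the two Avg columns of Table 2 for GPT-4 on GSM8K.
ep_acc1 = 0.930
es_acc2, es_acc1 = 0.843, 0.946
et_acc3, et_acc1 = 0.516, 0.951
ec_acc4, ec_acc1 = 0.883, 0.929

avg_identify_and_correct = (ep_acc1 + es_acc2 + et_acc3 + ec_acc4) / 4  # 0.793
avg_identify_only = (ep_acc1 + es_acc1 + et_acc1 + ec_acc1) / 4         # 0.939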
CA(acc acc1) CO(acc acc1) CV(acc acc1) CS(acc acc1) MS(acc acc1) HA(acc acc1) UC(acc acc1) OP(acc acc1) FC(acc acc1) Avg(acc acc1)
GPT-3.5 0.201 0.366 0.285 0.518 0.246 0.581 0.339 0.640 0.189 0.525 0.319 0.645 0.215 0.354 0.256 0.619 0.261 0.629 0.257 0.542
GPT-4 0.606 0.681 0.733 0.955 0.841 0.986 0.719 0.934 0.608 0.935 0.860 0.968 0.833 0.988 0.780 0.988 0.878 0.995 0.762 0.937
GLM-4 0.338 0.468 0.653 0.839 0.611 0.933 0.544 0.859 0.523 0.878 0.794 0.949 0.676 0.884 0.605 0.949 0.733 0.975 0.608 0.859
Gemini Pro 0.089 0.128 0.171 0.310 0.243 0.386 0.131 0.274 0.201 0.350 0.396 0.594 0.096 0.210 0.271 0.476 0.251 0.420 0.206 0.350
LLaMA-2-7B 0.310 0.675 0.131 0.533 0.195 0.695 0.239 0.698 0.236 0.821 0.234 0.641 0.148 0.540 0.210 0.735 0.141 0.586 0.205 0.658
LLaMA-2-13B 0.036 0.265 0.043 0.260 0.088 0.306 0.166 0.299 0.071 0.318 0.131 0.294 0.054 0.234 0.088 0.328 0.046 0.265 0.080 0.285
Avg 0.263 0.430 0.336 0.569 0.371 0.648 0.356 0.617 0.305 0.638 0.456 0.682 0.337 0.535 0.368 0.682 0.385 0.645 0.353 0.605
Table 3: Average accuracy of different models on different error types on GSM8K and MathQA under zero-shot prompts. Error types are denoted by the abbreviations defined in Table 1. The first and second columns for each error type are computed in the same way as the Avg columns in Table 2.
4.1 Model Performance (RQ1)

Overall Performance. Table 2 presents the average accuracy of each LLM on the four tasks on the EIC-Math dataset with GSM8K and MathQA. Overall, GPT-4 demonstrates overwhelming superiority, followed by GLM-4. GPT-3.5, Gemini Pro, and LLaMA-2-7B have their own strengths and weaknesses across the four tasks. It is noteworthy that LLaMA-2-7B performs better than LLaMA-2-13B, which may be related to inverse scaling (McKenzie et al., 2023). This suggests that the ability of models to identify and correct errors does not necessarily increase with model size. Moreover, the mathematical models can only provide answers without error identification or correction abilities, and thus their accuracy is low, as showcased in Appendices E.4 and F.2. This indicates that they can only solve problems and lack comprehensive reasoning abilities.

Comparison Across Tasks. The average accuracy of EP (acc1) is the highest among the four tasks (acc1, acc2, acc3, acc4), as it is the simplest. ES (acc2) and ET (acc3) tend to have average accuracy close to EC (acc4), despite being intuitively less challenging. In fact, ES involves an additional counting process, while ET involves additional classification, leading to different emphases. It can also be noted that the average accuracy acc1 fluctuates across the four tasks, which is due to the efforts of LLMs to maintain consistency with their different generated contents.

Regarding the difference between the two average accuracies (acc1, acc4) of EC, among the models with poor performance, Gemini Pro exhibits the smallest difference, while LLaMA-2-7B shows the largest. This suggests that Gemini Pro is cautious in error identification, with most identified errors being correctable, whereas LLaMA-2-7B is more liberal in error identification than in correction.

Comparison Between Datasets. Comparing the two datasets, the same model often has lower accuracy across the four tasks on MathQA than on GSM8K. This is attributed to the higher difficulty level of MathQA.

Future Direction. Additionally, despite the overwhelming superiority of GPT-4, its average accuracy across the four tasks on the two simple MWP datasets is only 76.2%. This indicates that the error identification and correction tasks we design are challenging, and the lack of error identification and correction capability in LLMs somewhat restricts their mathematical reasoning abilities.

4.2 Error Type Analysis (RQ2)

Difficulty Levels of Error Types. In Table 3, we compute the average accuracy of each model across different error types.
CA(acci acc1) CO(acci acc1) CV(acci acc1) CS(acci acc1) MS(acci acc1) HA(acci acc1) UC(acci acc1) OP(acci acc1) FC(acci acc1) Avg(acci acc1)
EP - 0.350 - 0.482 - 0.575 - 0.552 - 0.557 - 0.609 - 0.428 - 0.648 - 0.584 - 0.532
ES 0.203 0.476 0.323 0.589 0.362 0.697 0.367 0.667 0.320 0.683 0.408 0.697 0.323 0.583 0.383 0.699 0.343 0.652 0.337 0.638
ET 0.312 0.541 0.204 0.682 0.177 0.751 0.163 0.713 0.029 0.763 0.433 0.817 0.298 0.655 0.082 0.785 0.241 0.761 0.215 0.719
EC 0.188 0.355 0.335 0.523 0.369 0.569 0.344 0.537 0.313 0.549 0.372 0.604 0.298 0.473 0.361 0.598 0.373 0.583 0.328 0.532
Avg 0.263 0.430 0.336 0.569 0.371 0.648 0.356 0.617 0.305 0.638 0.337 0.535 0.368 0.682 0.385 0.645 0.456 0.682 0.353 0.605
Table 4: Average accuracy of different tasks on different error types on GSM8K and MathQA under zero-shot prompts. For EP we report the average acc1 over all models; for ES, ET, and EC, the first and second columns report the average acci (i = 2, 3, 4) and acc1 over all models, respectively.
Comparison between Different Models on the Same Error Type. Furthermore, GPT-3.5 and Gemini Pro struggle with unit conversion error, and the LLaMA-2 series also perform poorly on unit conversion error and formula confusion error. At the same time, GPT-4 and GLM-4 perform well on unit conversion error and formula confusion error. We speculate that this may be related to the size of the stored parameter knowledge: due to the lack of relevant common sense in their parameter knowledge, it becomes challenging for smaller models to identify and correct related errors.

The average accuracy of LLaMA-2-7B surprisingly reaches 31% on calculation error, on par with GLM-4. Compared to other error types, LLaMA-2-7B and LLaMA-2-13B excel at contradictory step but perform poorly on counting error.

Statistical Classification of Error Types. Table 5 provides statistics on the counts of error types classified by GPT-3.5 on GSM8K. Similar statistics for most other models and datasets are presented in Appendix F.2.2. It can be observed that most of the error types are often misclassified as calculation error, which may be attributed to the models' lack of true understanding of the meaning of each error type and of relevant classification training data.

Table 6: F1 scores on EP under three prompt settings.

4.3 Prompt Robustness (RQ3)

We devise a variety of prompts for the four tasks to explore the robustness of different models to different prompts. In addition, we investigate whether providing the error types to models can improve the accuracy in ET and EC.

Prompt Robustness of EP. For EP, we select 50 negative samples and add an equal number of positive samples for each error type, totaling 100 samples for testing. In Table 6, we compute their average F1 scores under three different prompts: Simple, Normal and Misleading. By calculating the difference in average F1 scores across all error types for each model, we evaluate their robustness to different prompts. It is observed that closed-source models exhibit greater robustness to different prompts, with the maximum difference in average F1 scores around 0.2. In contrast, open-source models are highly sensitive to different prompts, exhibiting a tendency to classify almost all cases as correct without much consideration under Simple and being misled to classify most cases as incorrect under Misleading.
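A small sketch of this protocol follows, assuming EP is scored as binary classification (erroneous vs. correct solutions) and robustness is measured as the spread of average F1 across the three prompts; function names and the example numbers are illustrative, not values from Table 6.

# Sketch of the EP robustness protocol (illustrative, not the paper's code).
def f1_score(gold, pred, positive="erroneous"):
    tp = sum(1 for g, p in zip(gold, pred) if g == p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def robustness_gap(avg_f1_by_prompt):
    # e.g. {"Simple": 0.52, "Normal": 0.70, "Misleading": 0.55} -> 0.18
    scores = list(avg_f1_by_prompt.values())
    return max(scores) - min(scores)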
GSM8K MathQA
Zero-shot Few-shot Zero-shot-type Few-shot-type Zero-shot Few-shot Zero-shot-type Few-shot-type
GPT-3.5 0.147 0.198 0.294 0.352 0.173 0.157 0.248 0.304
GPT-4 0.843 0.841 0.878 0.881 0.714 0.691 0.739 0.739
GLM-4 0.640 0.632 0.744 0.689 0.551 0.496 0.603 0.581
Gemini Pro 0.359 0.052 0.567 0.112 0.239 0.031 0.394 0.086
LLaMA-2-7B 0.184 0.109 0.209 0.094 0.176 0.133 0.197 0.120
LLaMA-2-13B 0.007 0.003 0.002 0.004 0.009 0.003 0.004 0.017
Table 7: Average accuracy acc2 of models in ES on GSM8K and MathQA separately under four different prompt
settings. Zero-shot-type and Few-shot-type provide models with the error types. Few-shot is set to 2-shot. The
maximum average accuracy for each model on each dataset is in boldface.
GSM8K MathQA
Zero-shot Few-shot Zero-shot-reverse Zero-shot-random Zero-shot Few-shot Zero-shot-reverse Zero-shot-random
GPT-3.5 0.211 0.171 0.281 0.256 0.173 0.129 0.228 0.204
GPT-4 0.516 0.577 0.538 0.483 0.481 0.520 0.471 0.443
GLM-4 0.349 0.409 0.411 0.381 0.327 0.218 0.360 0.353
Gemini Pro 0.108 0.090 0.147 0.122 0.096 0.052 0.132 0.113
LLaMA-2-7B 0.048 0.076 0.097 0.081 0.052 0.104 0.121 0.082
LLaMA-2-13B 0.127 0.003 0.112 0.136 0.116 0.017 0.127 0.133
Table 8: Average accuracy acc3 of models in ET on GSM8K and MathQA separately under four different prompt
settings. Few-shot is set to 2-shot. Few-shot-random and Few-shot-reverse present similar results and are included
in Appendix F.2.3.
GSM8K MathQA
Zero-shot Few-shot Zero-shot-type Few-shot-type Zero-shot Few-shot Zero-shot-type Few-shot-type
GPT-3.5 0.296 0.169 0.477 0.594 0.274 0.141 0.402 0.572
GPT-4 0.901 0.883 0.922 0.929 0.834 0.810 0.847 0.874
GLM-4 0.853 0.804 0.912 0.937 0.692 0.574 0.694 0.752
Gemini Pro 0.117 0.248 0.844 0.283 0.082 0.200 0.680 0.186
LLaMA-2-7B 0.067 0.066 0.071 0.050 0.039 0.063 0.041 0.048
LLaMA-2-13B 0.000 0.006 0.000 0.010 0.000 0.018 0.000 0.019
Table 9: Average accuracy acc4 of models in EC on GSM8K and MathQA separately under four different prompt
settings. Zero-shot-type and Few-shot-type provide models with the error types. Few-shot is set to 2-shot. The
maximum average accuracy for each model on each dataset is in boldface.
Prompt Robustness of ES. For ES, we design zero-shot and few-shot prompts for comparison and find that increasing the number of shots has minimal effect on improving the accuracy of this task and can even be counterproductive, as shown in Table 7. This indicates that simple examples cannot make models fully understand the meaning of identifying the first erroneous step. By providing models with the error types, the accuracy of identifying error steps is significantly improved, with an average relative increase of 45.9% and a maximum increase of 12.71 times. This shows that carefully designed examples can effectively improve the models' ability to identify erroneous steps.

Prompt Robustness of ET. For ET, we define the nine error types in the prompts and design zero-shot and few-shot prompts. Recognizing that the sequence of error types may impact the accuracy of identifying errors, we also devise prompts that reverse the default order of error types and randomly shuffle them. In Table 8, the impact of increasing the number of shots on accuracy is also negligible, as seen by comparing the zero-shot and few-shot prompts. The order of error types does indeed affect classification accuracy, as shown in Tables 35 and 36. For example, hallucination is listed last in the sequential prompt, and its average classification accuracy under the sequential prompt is much lower than under the reversed order. It is noteworthy that in the random order we place missing step first, yet its classification accuracy remains consistently low, indicating its inherent difficulty in identification.

Prompt Robustness of EC. For EC, we adopt prompt settings similar to ES and obtain similar results. Only delicately constructed prompts that provide the error types can effectively improve the models' ability to correct errors, with an average relative increase of 47.9% and up to a maximum of 8.29 times, as displayed in Table 9.

4.4 In-depth Analysis

Comparison with Traditional Task. We conduct the traditional task by inputting the questions from our
[Figure legend: Traditional Task vs. Our Task]

5 Related Work