Automatic English Essay Scoring Algorithm Based on Machine Learning
Authorized licensed use limited to: ULAKBIM UASL - Anadolu Universitesi. Downloaded on September 16,2023 at 09:43:32 UTC from IEEE Xplore. Restrictions apply.
as BERT can learn semantic features at a high level, and the model input for topic decision making uses BERT encoding to enable better representation of the text data [9]. Specifically, let the sentence sent = {f1, ..., fn}, where fi (1 ≤ i ≤ n) denotes the i-th word of the current sentence sent; each word within the sentence is given a vector representation wi by the pre-trained BERT model:

wi = BERT(fi^represent)    (2)

represent = (1/3) * Σ_{k=9}^{11} layer_k    (3)

The superscript represent of fi indicates the specific encoding method of the current i-th word fi, and layer_k denotes the BERT encoding of the k-th (9 ≤ k ≤ 11) layer.
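As a minimal illustration of the layer averaging in Eq. (3), the sketch below assumes the per-layer BERT encodings of one sentence are already available as an array of shape [layers, tokens, dim]; the array index is treated as the layer number k (a real model would supply these, e.g. from its hidden states). This is a sketch under those assumptions, not the authors' implementation:

```python
import numpy as np

def represent(hidden_states, lo=9, hi=11):
    """Average the BERT encodings of layers lo..hi (inclusive), as in Eq. (3).

    hidden_states: array of shape [num_layers, num_tokens, dim] holding
    layer_k for one sentence; the array index is treated as the layer
    number k, so layers 9-11 are hidden_states[9], [10], [11].
    """
    selected = hidden_states[lo:hi + 1]   # layer_9, layer_10, layer_11
    return selected.mean(axis=0)          # (1/3) * sum_k layer_k

# toy example: 12 layers, 5 tokens, 4-dimensional encodings
rng = np.random.default_rng(0)
states = rng.normal(size=(12, 5, 4))
w = represent(states)
assert w.shape == (5, 4)
assert np.allclose(w, (states[9] + states[10] + states[11]) / 3)
```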
III. AUTOMATIC ENGLISH ESSAY SCORING SYSTEM BASED ON MACHINE LEARNING
A. AES System Design
In Fig. 1, the main functions of the system are divided into a user login module, an evaluation module, a statistics module, and an error correction module.

1) Evaluation module: The evaluation module is the core module of the system. The AES model, the sentence grace model, and the topic relevance model are built using the already labeled corpus. The user's input essays are then evaluated in terms of vocabulary quality, sentence elegance, and topic relevance, and the predicted score is derived.

a) Vocabulary quality evaluation: The linguistic features of the compositions, including the lexical statistics of the whole composition and of each sentence, are extracted and input into a gradient boosting regression tree to obtain the score prediction model; the user's compositions are then input into this model, and the resulting score is taken as the final vocabulary quality score [10].

b) Sentence grace evaluation: Each sentence is input into the trained grace model to obtain its grace value, and the mean of these values, in the range 0-1, is taken as the grace value of the sentences. The Keras deep learning framework is used to model the convolutional neural network with convolutional kernels of sizes 3, 4, and 5, with 50 kernels of each size, and the output of the pooling layer is obtained by the Flatten method. Since the grace model is a multiple-input single-output model, the pooling output of the convolutional neural network and the manually extracted feature vectors, with lengths of 150 and 30 respectively, are connected using the concatenate method and sent to the fully connected layer after stitching. By training the gracefulness model with the rmsprop optimization method, the gracefulness value of the sentences can be predicted [11-12].

c) Topic relevance evaluation: Topic relevance is measured in three dimensions: word granularity, sentence granularity, and chapter granularity. The final topic relevance score is obtained by computing the similarity with the topic at each of the three granularities and taking the mean value.

2) Statistics module: The statistics module mainly counts and displays information about words and sentences, including the total number of (TNO) words, the number of sentences, the average length of sentences (ALOS), advanced vocabulary, and excellent sentences. The TNO words, the number of sentences, and the ALOS are obtained by first splitting the text into words and sentences and then counting. Advanced vocabulary is identified and counted using the advanced vocabulary databases of IELTS, TOEFL, and TEM-8; the number of excellent sentences is identified and counted by the sentence grace model [13].

3) Error correction module: The error correction module identifies possible spelling errors in the essay and suggests corrections by calling the third-party library pyaspeller.

B. Structure of English Composition Scoring Model

To generate the deduction degree feature, the term frequency (TF) of each feature item is first calculated using the word as the semantic unit, then the inverse document frequency (IDF) of the feature item is calculated, and the TF-IDF weight is obtained by multiplying the two. The five words with the largest weights are selected as keywords, and the trained word2vec model is used to find the word vectors (WV) of these five words. The mean cosine distance between the WV of these five words and the WV of the keywords of the propositional essay is calculated as the deduction degree feature.

C. Machine Learning Model Prediction

Three machine learning models, random forest (RF), gradient boosting tree (GBDT), and XGBoost, were selected as English essay scoring prediction models, and the accuracy is further improved by fusing the prediction results of the RF, GBDT, and XGBoost models.

The RF, GBDT, and XGBoost models are trained separately on the training set and used to predict on the test set, and the prediction results of the three models are then linearly fused according to the Bagging method.

IV. EXPERIMENTAL RESULTS

A. Experimental Data

ASAP: The Kaggle competition released a dataset containing eight sets of essays, each with an essay topic and each essay corresponding to an overall rating. The data were obtained from local students in grades seven through ten, and all essays were scored twice by professionals. Information such as time, place, person, and organization appearing in the essays was desensitized. The pre-processing of ASAP in the experiment mainly included removing the @ symbols marking anonymized information, splitting the composition data at word and sentence granularity, and normalizing the composition scores; 60% of the data were used as training data and 40% as test data.

Grade 4 model test machine assessment data: This dataset contains a total of 5600 college students' Grade 4 model test essays, and the essay scores are machine scores.

Critique.com machine assessment data: This dataset comes from the same source as the Grade 4 model exam data and contains 20,000 essay samples.

B. Analysis of Results

1) Validation of the effect of non-text features: The model was trained on each of the eight essay subsets of ASAP and used to predict the test set scores, and the corresponding quadratic weighted kappa values (QWKV) were calculated, as shown in Fig. 2.
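The quadratic weighted kappa used throughout this section can be computed directly from two lists of integer scores. The sketch below follows the standard QWK definition; it is a self-contained illustration, not the authors' code:

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic weighted kappa between two integer score sequences."""
    num = max_rating - min_rating + 1
    n = len(rater_a)
    # observed score-pair counts
    observed = [[0.0] * num for _ in range(num)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1
    hist_a, hist_b = Counter(rater_a), Counter(rater_b)
    numerator = denominator = 0.0
    for i in range(num):
        for j in range(num):
            w = (i - j) ** 2 / (num - 1) ** 2          # quadratic weight
            expected = hist_a[i + min_rating] * hist_b[j + min_rating] / n
            numerator += w * observed[i][j]
            denominator += w * expected
    return 1.0 - numerator / denominator

assert quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4) == 1.0   # perfect agreement
```

Perfect agreement gives 1.0 and perfect disagreement gives -1.0, matching the scale on which the QWKV results below are reported.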
From Fig. 2, we can see that the QWKV of RF is the largest on all ASAP essay datasets (1-8), with a value of 0.7597, followed by XGBoost at 0.7424 and the gradient boosting tree at 0.7169. The GBDT and XGBoost models are relatively complex and give too much weight to the individuality of the samples, which obscures their commonality and leads to poor prediction results. In contrast, RF can effectively avoid overfitting and reduce variance even with small training samples, so RF shows better prediction results at small data sizes.

In ASAP essay subsets 4 and 8, the QWKV of the XGBoost model is higher than that of RF. This is because essay set 8 has a wide range of scores, and the corresponding error is magnified when the scoring range is large.

2) Validation of the effect of textual features: The results are divided into non-text features (NTF) and NTF plus the deduction degree feature. Based on the NTF, the model was trained on all essay sets, the corresponding scores were predicted on the test set, and the corresponding QWKV were calculated, as shown in Table I.
From Table I, we can see that the prediction index of each model improved after adding the deduction degree feature to the non-text features. Among them, the RF model shows the largest improvement: its QWKV rises from 0.7597 to 0.7684 after the deduction degree feature is added, which verifies the effectiveness of the deduction degree feature for the AES model.

3) Verification of machine scoring validity: To verify the validity of the RF model, the ASAP dataset, the Grade 4 model test machine assessment data, and the Critique.com machine assessment data were used as experimental data, and the RF model was compared with models such as BiLSTM and RNN. The Pearson correlation coefficient was used to measure the correlation between the predicted total composition score and the real total score, as shown in Table II.
TABLE II. COMPARISON RESULTS (Pearson correlation)

Dataset                                        RF      BiLSTM   RNN
ASAP_1                                         0.93    0.86     0.71
ASAP_2                                         0.85    0.72     0.64
ASAP_3                                         0.87    0.80     0.69
ASAP_4                                         0.73    0.65     0.57
ASAP_5                                         0.81    0.73     0.62
ASAP_6                                         0.78    0.77     0.54
ASAP_7                                         0.75    0.66     0.51
ASAP_8                                         0.82    0.73     0.64
Grade 4 model test machine assessment data     0.88    0.72     0.58
Critique.com machine assessment data           0.81    0.73     0.60
Average value                                  0.823   0.737    0.610
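The Pearson values in Table II measure the linear correlation between predicted and reference total scores. For reference, a minimal sketch of the coefficient (standard definition, not the authors' code):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two score sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# predictions that are a perfect linear function of the true scores give 1.0
assert abs(pearson([60, 70, 80], [63, 72, 81]) - 1.0) < 1e-9
```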
Comparing the RF model with the BiLSTM algorithm in Table II, it can be seen that the performance of the random forest algorithm is overall higher than that of the deep learning-based composition scoring methods. The experiments of the RF model on the ASAP dataset, the Grade 4 model test machine assessment dataset, and the Critique.com machine assessment dataset show that the mean Pearson correlation reaches 0.823. Compared with the mean results of 0.737 and 0.610 for the baseline models BiLSTM and RNN, this is an improvement of 0.086 and 0.213, respectively, which demonstrates the effectiveness of RF. The performance of the RF algorithm on the ASAP dataset is better than on the Grade 4 model test data and the Critique.com data; this is because the real scores of the latter two datasets are themselves machine-graded, and there is still a difference between machine-graded and manually-graded scores. Nevertheless, the results show that the RF algorithm outperforms the BiLSTM-based deep learning approach.

V. CONCLUSION

In traditional offline teaching, teachers evaluate students' English ability through different dimensions of their English compositions, such as mastery of words, grammar, long and difficult sentences, and full-text expression. However, due to the serious mismatch between classroom teaching time and the number of students in the classroom, it is difficult for teachers to check every student's work seriously, carefully, and comprehensively, and teachers' energy and subjective factors also affect the judgment of students' composition level. Therefore, this paper constructs an AES system to realize efficient and fair grading; the system introduces the RF algorithm and achieves better machine grading results.

REFERENCES

[1] Cameron Cooper: Using Machine Learning to Identify At-risk Students in an Introductory Programming Course at a Two-year Public College. Adv. Artif. Intell. Mach. Learn. 2(3): 407-421 (2022).
[2] Keon-Myung Lee, Chan Sik Han, Kwang-Il Kim, Sang Ho Lee: Word recommendation for English composition using big corpus data processing. Clust. Comput. 22(Suppl 1): 1911-1924 (2019).
[3] Elisabete A. De Nadai Fernandes, Gabriel A. Sarries, Yuniel T. Mazola, Robson C. de Lima, Gustavo N. Furlan, Marcio A. Bacchi: Machine learning to support geographical origin traceability of Coffea Arabica. Adv. Artif. Intell. Mach. Learn. 2(1): 273-287 (2022).
[4] Dmitry V. Vinogradov: Algebraic Machine Learning: Emphasis on Efficiency. Autom. Remote Control 83(6): 831-846 (2022).
[5] G. P. Ramesh, N. Mohan Kumar: Radiometric analysis of ankle edema via RZF antenna for biomedical applications. Wireless Pers. Commun. 102(2): 1785-1798 (2018).
[6] Waleed Alsanie, Mohamed I. Alkanhal, Mohammed Alhamadi, Abdulaziz O. Al-Qabbany: Automatic scoring of Arabic essays over three linguistic levels. Prog. Artif. Intell. 11(1): 1-13 (2022).
[7] Sami Nikkonen, Henri Korkalainen, Akseli Leino, Sami Myllymaa, Brett Duce, Timo Leppanen, Juha Toyras: Automatic Respiratory Event Scoring in Obstructive Sleep Apnea Using a Long Short-Term Memory Neural Network. IEEE J. Biomed. Health Informatics 25(8): 2917-2927 (2021).
[8] Mumu Aktar, Donatella Tampieri, Hassan Rivaz, Marta Kersten-Oertel, Yiming Xiao: Automatic collateral circulation scoring in ischemic stroke using 4D CT angiography with low-rank and sparse matrix decomposition. Int. J. Comput. Assist. Radiol. Surg. 15(9): 1501-1511 (2020).
[9] M. Srinivas, R. Bharath, P. Rajalakshmi, C. K. Mohan: Multi-level classification: A generic classification method for medical datasets. 2015 17th International Conference on E-health Networking, Application & Services (HealthCom), Boston, MA, USA, 2015, pp. 262-267, doi: 10.1109/HealthCom.2015.7454509.
[10] Bob D. de Vos, Jelmer M. Wolterink, Tim Leiner, Pim A. de Jong, Nikolas Lessmann, Ivana Isgum: Direct Automatic Coronary Calcium Scoring in Cardiac and Chest CT. IEEE Trans. Medical Imaging 38(9): 2127-2138 (2019).
[11] S. Begum, R. Banu, A. Ahamed, B. D. Parameshachari: A comparative study on improving the performance of solar power plants through IoT and predictive data analytics. 2016 International Conference on Electrical, Electronics, Communication, Computer and Optimization Techniques (ICEECCOT), Mysuru, India, 2016, pp. 89-91, doi: 10.1109/ICEECCOT.2016.7955191.
[12] Brenda Such: Scaffolding English language learners for online collaborative writing activities. Interact. Learn. Environ. 29(3): 473-481 (2021).
[13] Yea-Ru Tsai: Exploring the effects of corpus-based business English writing instruction on EFL learners' writing proficiency and perception. J. Comput. High. Educ. 33(2): 475-498 (2021).