
2023 IEEE International Conference on Integrated Circuits and Communication Systems (ICICACS)

Automatic English Essay Scoring Algorithm Based on Machine Learning
Lijuan Wu
Heilongjiang International University
Harbin, China
[email protected]

979-8-3503-9846-5/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICICACS57338.2023.10099945

Abstract—With the development of natural language processing (NLP) technology and machine learning, the research task of automatic English essay scoring (AES) is becoming clearer and clearer, and the research difficulties arise from the mutual constraints of research methods and annotation data. How to build a complete and reliable scoring system has become a great challenge in today's research. In this paper, we design an English AES system and verify the effectiveness of the random forest (RF) English scoring model by analyzing the prediction effect of RF on non-text features and text features; we then compare the Pearson correlation coefficients (PCC) of RF, GBDT, and XGBoost. The study shows that the performance of the RF algorithm is higher than that of the other two composition scoring methods.

Keywords—machine learning, random forest algorithm, automatic scoring of English essays, Pearson correlation coefficient.

I. INTRODUCTION

For English teaching, a teacher often teaches several classes at the same time, with hundreds of students, so if each student hands in an essay to grade, the teacher's workload will be huge, and the subsequent need for targeted revision and re-grading multiplies it. Automatic grading is fast and efficient: it can greatly reduce the time teachers spend on grading essays, allowing teachers to spend more time on teaching and allowing English learners to train their writing skills in a targeted manner.

Research on automatic scoring of English essays has yielded good results. For example, some scholars developed the PEG system, the earliest automatic essay scoring system. PEG assumes that the overall quality of an essay can be reflected by several shallow linguistic features, including the length of the essay, the length of the vocabulary, the number of prepositions, the number of pronouns, and so on. These shallow features are extracted and used to predict the fluency of the essay, the richness of the vocabulary, the complexity of the sentence structure, etc.; the regression coefficients are then derived by multiple regression, and the score of the composition is predicted by regression calculation. This approach neither uses natural language processing techniques nor studies the content and chapter structure of the composition, nor does it consider its theme [1-2]. Other scholars input the theme of the composition together with its content into a neural network model so that the model learns the relevance of the content to the theme by itself, and the obtained tangency features take part in the training of the overall scoring model [3]. There are many other English AES models, and all of them have achieved effective scoring results.

In this paper, we first introduce AES-related techniques and propose an evaluation method for English scoring; then we design an English AES system containing four functional modules and a machine learning prediction model; finally we validate the effectiveness of the RF model in predicting scores on the ASAP essay set, the Grade 4 essay data, and the Critique.com essay data.

II. TECHNIQUES AND METHODS

A. Technologies Related to Automatic Scoring of Essays

1) NLP: Word separation and sentence separation are the basis of NLP. English word division is relatively simple, usually using spaces and punctuation for natural separation, although there are a few cases of contractions such as don't and they'll. Sentence division is the cutting of a text into individual sentences, usually according to punctuation marks. The techniques of English word and sentence segmentation are now relatively well established, and there are many publicly available tools whose use is very effective [4-5].

2) Topic decision making: The topic decision model divides the topic strength. The model takes the endpoint of the topic in the English translation as the sentence breaking point; the sentences within a topic have a special structure, and whether a sentence in the text is a truncation point is the basis of topic segmentation [6-7].

B. Evaluation methods

1) PCC: The PCC measures the degree of linear correlation between variables. For two variables A and B, the degree of correlation between A and B is recorded as x, where x lies in [-1, 1]: the closer the absolute value of x is to 1, the stronger the correlation between A and B, and the closer it is to 0, the weaker the correlation. A positive value of x indicates a positive correlation between A and B, and a negative value indicates a negative correlation [8].

The PCC ρ is equal to the covariance between A and B divided by the product of their standard deviations, as shown in Equation (1).

ρ = cov(A, B) / (σ_A σ_B)    (1)

2) Calculation of English composition tangency scores based on topic richness within essays: The Transformer-based bi-directional encoding representation model BERT has achieved recent research results in downstream tasks such as question-answer matching and text classification.
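Looking back at Section II-B.1, the PCC of Equation (1) can be verified numerically; a minimal NumPy sketch with made-up score arrays:

```python
import numpy as np

def pcc(a, b):
    """Pearson correlation coefficient: cov(A, B) / (sigma_A * sigma_B), Equation (1)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    cov = np.mean((a - a.mean()) * (b - b.mean()))
    return cov / (a.std() * b.std())

# Made-up example: predicted vs. human essay scores (strongly, positively correlated).
pred = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
human = np.array([2.5, 2.8, 4.1, 4.9, 6.2])
print(pcc(pred, human))  # a value close to +1
```

A value near +1 or -1 indicates a strong linear correlation and a value near 0 a weak one, matching the description above.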
BERT can learn semantic features at a high level, and the model input for topic decision making uses BERT encoding to obtain a better representation of the text data [9]. Specifically, let the sentence sent = {f1, ..., fn}, where fi (1 ≤ i ≤ n) denotes the i-th word of the current sentence sent; each word within the sentence is given a vector representation wi by the pre-trained BERT model, as in Equations (2) and (3).

w_i = BERT^represent(f_i)    (2)

represent = (1/3) Σ_k layer_k    (3)

The superscript represent indicates the specific encoding method for the current i-th word f_i, and layer_k denotes the BERT encoding of the k-th (9 ≤ k ≤ 11) layer.
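The layer averaging of Equation (3) can be sketched as follows; the per-layer hidden states here are random stand-ins with BERT-base shapes (a real system would take them from a pre-trained BERT, e.g. via the transformers library):

```python
import numpy as np

SEQ_LEN, HIDDEN = 12, 768  # sentence length (assumed) and BERT-base hidden size
rng = np.random.default_rng(0)

# Stand-in for the outputs of BERT-base's 12 transformer layers:
# hidden_states[k] holds one vector per word f_i after layer k.
hidden_states = {k: rng.normal(size=(SEQ_LEN, HIDDEN)) for k in range(1, 13)}

# Equation (3): average the encodings of layers 9-11 to get w_i for every word.
represent = sum(hidden_states[k] for k in (9, 10, 11)) / 3

print(represent.shape)  # (12, 768): one averaged vector per word
```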

III. AUTOMATIC ENGLISH ESSAY SCORING SYSTEM BASED ON MACHINE LEARNING
A. AES System Design

Fig. 1. Overall framework of the system

In Fig. 1, the main functions of the system are divided into the user login module, evaluation module, statistics module, and error correction module.

1) Evaluation module: The evaluation module is the core module of the system. The AES model, sentence grace model, and topic relevance model are built using an already labeled corpus. The user's input essays are then evaluated in terms of vocabulary quality, sentence elegance, and topic relevance, and the predicted score is derived.

a) Vocabulary quality evaluation: The linguistic features of the compositions, including the lexical statistics of the whole composition and of each sentence, are extracted and input into a gradient boosting regression tree to obtain the score prediction model; the user's input compositions are then fed into this model, and its output is taken as the final score of vocabulary quality [10].

b) Sentence grace evaluation: Based on the sentence grace evaluation model, the grace value of each sentence is obtained by inputting it into the trained grace model, and the mean of these values is taken as the grace value of the sentences; the value ranges from 0 to 1. The Keras deep learning framework is used to model the convolutional neural network with convolution kernels of widths 3, 4, and 5, with 50 kernels each, and the output of the pooling layer is obtained by the Flatten method. Since the grace model is a multiple-input single-output model, the pooling output of the convolutional neural network and the manually extracted feature vector, with lengths of 150 and 30 respectively, are connected using the concatenate method and sent to the fully connected layer after stitching. By training the grace model with the rmsprop optimization method, the grace value of the sentences can be predicted [11-12].

c) Topic relevance evaluation: The topic relevance results are measured in three dimensions: word granularity, sentence granularity, and chapter granularity. The final topic relevance score is obtained by computing the similarity with the topic under the three granularities and taking the mean value.

2) Statistics module: The statistics module mainly counts and displays information about words and sentences, including the total number of (TNO) words, the number of sentences, the average length of sentences (ALOS), advanced vocabulary, and excellent sentences. The TNO words, the number of sentences, and the ALOS are counted after dividing words and sentences. Advanced vocabulary is identified and counted using the advanced vocabulary databases of IELTS, TOEFL, and TEM-8 (Specialized Eight); the number of excellent sentences is identified and counted by the sentence grace model [13].

3) Error correction module: The error correction module identifies possible spelling errors in the essay and suggests corrections by calling the third-party library pyaspeller.

B. Structure of English Composition Scoring Model

In order to generate the deduction degree features, the TF of each feature item is first calculated using the word as the semantic unit, then the inverse document frequency (IDF) of the feature item is calculated, and the TF-IDF weights are obtained by multiplying the two. The top five words with the largest weights are selected as the keywords; the trained word2vec model is then used to find the word vectors (WV) of these five words, and the mean value of the cosine distance between these WV and the WV of the keywords of the propositional essay is calculated as the deduction degree feature.
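The deduction degree feature just described can be sketched end to end; the tokenized essays, the topic keywords, and the word vectors below are small made-up stand-ins (a real system would use TF-IDF over a large corpus and a trained word2vec model):

```python
import math
from collections import Counter

def tfidf_keywords(essay, corpus, top_n=5):
    """Top-n keywords of one tokenized essay by TF-IDF weight (word = semantic unit)."""
    tf = Counter(essay)
    def idf(word):
        df = sum(1 for doc in corpus if word in doc)
        return math.log(len(corpus) / (1 + df))
    weights = {w: (tf[w] / len(essay)) * idf(w) for w in tf}
    return sorted(weights, key=weights.get, reverse=True)[:top_n]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def deduction_feature(essay, corpus, topic_keywords, wv):
    """Mean cosine value between the essay's top TF-IDF keywords and the prompt keywords."""
    sims = [cosine(wv[k], wv[t]) for k in tfidf_keywords(essay, corpus)
            for t in topic_keywords]
    return sum(sims) / len(sims)

# Toy example: "cat" is the word most specific to this essay, so it ranks first.
essay = ["cat", "cat", "dog", "the", "the", "the"]
corpus = [essay, ["the", "a", "dog"], ["the", "a", "bird"]]
print(tfidf_keywords(essay, corpus, top_n=1))  # ['cat']
```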

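The sentence grace network of Section III-A (three parallel convolution branches of widths 3, 4, and 5 with 50 kernels each, pooled and flattened into a 150-dimensional vector, concatenated with the 30-dimensional handcrafted feature vector, and trained with rmsprop) can be sketched in Keras; the sequence length and embedding size are assumed, since they are not stated:

```python
import numpy as np
from tensorflow.keras import Model, layers

SEQ_LEN, EMB_DIM = 50, 300  # assumed shape of a pre-embedded input sentence

sent_in = layers.Input(shape=(SEQ_LEN, EMB_DIM), name="sentence")
feat_in = layers.Input(shape=(30,), name="handcrafted")  # manually extracted features

# Three convolution branches; pooling over all positions then Flatten yields
# 50 values per branch, so the CNN side contributes 3 * 50 = 150 features.
branches = []
for width in (3, 4, 5):
    x = layers.Conv1D(filters=50, kernel_size=width, activation="relu")(sent_in)
    x = layers.MaxPooling1D(pool_size=SEQ_LEN - width + 1)(x)
    branches.append(layers.Flatten()(x))

merged = layers.concatenate(branches + [feat_in])      # 150 + 30 inputs
grace = layers.Dense(1, activation="sigmoid")(merged)  # grace value in [0, 1]

model = Model(inputs=[sent_in, feat_in], outputs=grace)
model.compile(optimizer="rmsprop", loss="mse")
```

The sigmoid output keeps the predicted grace value in the 0-1 range described above; training would fit the model on sentences labeled with grace values.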
C. Machine Learning Model Prediction

Three machine learning models, namely RF, gradient boosting tree (GBDT), and XGBoost, were selected as English essay scoring prediction models. The accuracy can be further improved by fusing the prediction results of the RF, GBDT, and XGBoost models.

The RF, GBDT, and XGBoost models are trained separately on the training set and make predictions on the test set; the prediction results of the three models are then linearly fused according to the Bagging method.

IV. EXPERIMENTAL RESULTS

A. Experimental Data

ASAP: The Kaggle competition released a dataset containing eight sets of essays, each with an essay topic and each essay corresponding to an overall rating. The data were obtained from local students in grades seven through ten, and all essays were secondarily scored by professionals. Information such as times, places, persons, and organizations appearing within the essays was desensitized. The pre-processing of ASAP in the experiment mainly included removing the @ symbols of the anonymized words, splitting the composition data at word and sentence granularity, and normalizing the composition scores; 60% of the data were used as training data and 40% as test data.

Grade 4 model test machine assessment data: This dataset contains a total of 5600 college students' Grade 4 model test essays, and the essay scores are machine scores.

Critique.com machine assessment data: This dataset comes from the same source as the Grade 4 model exam data and contains 20,000 essay samples.

B. Analysis of Results

1) Validation of the effect of non-text features: The model was trained on each of the eight essay subsets of ASAP and used to predict the test set scores, and the corresponding quadratic weighted kappa values (QWKV) were calculated, as shown in Fig. 2.

Fig. 2. Experimental results of non-text features (NTF) on each composition set
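The quadratic weighted kappa reported in Fig. 2 can be computed as follows; a minimal sketch assuming integer scores in a known range:

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two integer score lists, with quadratic disagreement penalties."""
    n = max_score - min_score + 1
    a = np.asarray(rater_a) - min_score
    b = np.asarray(rater_b) - min_score

    # Observed score-pair histogram and the expected one under independence.
    observed = np.zeros((n, n))
    np.add.at(observed, (a, b), 1)
    expected = np.outer(np.bincount(a, minlength=n),
                        np.bincount(b, minlength=n)) / len(a)

    # Quadratic penalty weights: zero on the diagonal, largest in the corners.
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4))  # 1.0: perfect agreement
```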

From Fig. 2, we can see that the average QWKV of RF is the largest across the ASAP essay sets (1-8), with a value of 0.7597, followed by XGBoost with 0.7424 and gradient boosting tree with 0.7169. The boosting models are relatively complex and give too much consideration to the individuality of the samples, which obscures the commonality of the samples and leads to poor prediction results. In contrast, RF can effectively avoid overfitting and reduce variance even in the case of small training samples. Therefore, RF shows better prediction results on small datasets.

In ASAP essay subsets 4 and 8, the QWKV of the XGBoost model is higher than that of RF. This is because essay set 8 has a wide range of scores: the results are divided into non-text features and non-text features + deduction degree features, and the corresponding error is magnified when the scoring range is large.

2) Validation of the effect of textual features: On top of the NTF, the model was trained on all essay sets and used to predict the corresponding scores on the test set, and the corresponding QWKV were calculated, as shown in Table I.

TABLE I. EFFECT OF THE DEDUCTION DEGREE FEATURES (QWKV)

                                               RF      XGBoost  GBDT
Non-text features                              0.7597  0.7424   0.7169
Non-text features + deduction degree features  0.7684  0.7496   0.7235

From Table I, we can see that the prediction indexes of each model improved after adding the deduction degree feature to the non-text features. Among them, the RF model has the largest improvement: its QWKV improves from 0.7597 to 0.7684 after adding the deduction degree feature, which verifies the effectiveness of the deduction degree feature for the AES model.

3) Verification of machine scoring validity: In order to verify the validity of the RF, the ASAP, the Grade 4 model test machine assessment data, and the Critique.com machine assessment dataset were used as experimental data; the RF model was compared with models such as BiLSTM and RNN, and the Pearson evaluation index was used to measure the correlation between the predicted total composition score and the real total score, as shown in Table II.
TABLE II. COMPARISON RESULTS
RF BiLSTM RNN
ASAP_1 0.93 0.86 0.71
ASAP_2 0.85 0.72 0.64
ASAP_3 0.87 0.80 0.69
ASAP_4 0.73 0.65 0.57
ASAP_5 0.81 0.73 0.62
ASAP_6 0.78 0.77 0.54
ASAP_7 0.75 0.66 0.51
ASAP_8 0.82 0.73 0.64
Grade 4 model test machine assessment data 0.88 0.72 0.58
Critique.com machine assessment data 0.81 0.73 0.60
Average value 0.823 0.737 0.610

Comparing the RF model with the BiLSTM algorithm in Table II, it can be seen that the performance of the random forest algorithm is overall higher than that of the deep learning-based composition scoring methods. The experiments of the RF model on the ASAP dataset, the Grade 4 model test machine assessment dataset, and the Critique.com machine assessment dataset show that the Pearson mean reaches 0.823. Compared with the mean results of 0.737 and 0.610 for the baseline models BiLSTM and RNN, it improves by 0.086 and 0.213, respectively, which proves the effectiveness of the RF. The performance of the RF algorithm on the ASAP dataset is better than that on the Grade 4 model test data and the Critique.com data, because the "real" scores of the latter two datasets are themselves machine-graded, and there is still a difference between machine-graded and manually-graded scores; nevertheless, the results show that the RF algorithm outperforms the BiLSTM-based deep learning approach.

V. CONCLUSION

In traditional offline teaching, teachers evaluate students' English ability through different dimensions, such as mastery of words, grammar, long and difficult sentences, and full-text expression ability within their English compositions. However, due to the serious mismatch between classroom teaching time and the number of students in the classroom, it is difficult for teachers to do a serious and careful all-round check on every student, and teachers' energy and subjective factors also affect the judgment of students' composition level. Therefore, this paper constructs an AES system to realize efficient and fair grading; it introduces the RF algorithm and obtains better machine grading results.

REFERENCES

[1] Cameron Cooper: Using Machine Learning to Identify At-risk Students in an Introductory Programming Course at a Two-year Public College. Adv. Artif. Intell. Mach. Learn. 2(3): 407-421 (2022).
[2] Keon-Myung Lee, Chan Sik Han, Kwang-Il Kim, Sang Ho Lee: Word recommendation for English composition using big corpus data processing. Clust. Comput. 22(Suppl 1): 1911-1924 (2019).
[3] Elisabete A. De Nadai Fernandes, Gabriel A. Sarries, Yuniel T. Mazola, Robson C. de Lima, Gustavo N. Furlan, Marcio A. Bacchi: Machine learning to support geographical origin traceability of Coffea arabica. Adv. Artif. Intell. Mach. Learn. 2(1): 273-287 (2022).
[4] Dmitry V. Vinogradov: Algebraic Machine Learning: Emphasis on Efficiency. Autom. Remote Control 83(6): 831-846 (2022).
[5] G. P. Ramesh, N. Mohan Kumar: Radiometric analysis of ankle edema via RZF antenna for biomedical applications. Wirel. Pers. Commun. 102(2): 1785-1798 (2018).
[6] Waleed Alsanie, Mohamed I. Alkanhal, Mohammed Alhamadi, Abdulaziz O. Al-Qabbany: Automatic scoring of Arabic essays over three linguistic levels. Prog. Artif. Intell. 11(1): 1-13 (2022).
[7] Sami Nikkonen, Henri Korkalainen, Akseli Leino, Sami Myllymaa, Brett Duce, Timo Leppanen, Juha Toyras: Automatic Respiratory Event Scoring in Obstructive Sleep Apnea Using a Long Short-Term Memory Neural Network. IEEE J. Biomed. Health Informatics 25(8): 2917-2927 (2021).
[8] Mumu Aktar, Donatella Tampieri, Hassan Rivaz, Marta Kersten-Oertel, Yiming Xiao: Automatic collateral circulation scoring in ischemic stroke using 4D CT angiography with low-rank and sparse matrix decomposition. Int. J. Comput. Assist. Radiol. Surg. 15(9): 1501-1511 (2020).
[9] M. Srinivas, R. Bharath, P. Rajalakshmi, C. K. Mohan: Multi-level classification: A generic classification method for medical datasets. 17th International Conference on E-health Networking, Application & Services (HealthCom), Boston, MA, USA, 2015, pp. 262-267, doi: 10.1109/HealthCom.2015.7454509.
[10] Bob D. de Vos, Jelmer M. Wolterink, Tim Leiner, Pim A. de Jong, Nikolas Lessmann, Ivana Isgum: Direct Automatic Coronary Calcium Scoring in Cardiac and Chest CT. IEEE Trans. Medical Imaging 38(9): 2127-2138 (2019).
[11] S. Begum, R. Banu, A. Ahamed, B. D. Parameshachari: A comparative study on improving the performance of solar power plants through IoT and predictive data analytics. 2016 International Conference on Electrical, Electronics, Communication, Computer and Optimization Techniques (ICEECCOT), Mysuru, India, 2016, pp. 89-91, doi: 10.1109/ICEECCOT.2016.7955191.
[12] Brenda Such: Scaffolding English language learners for online collaborative writing activities. Interact. Learn. Environ. 29(3): 473-481 (2021).
[13] Yea-Ru Tsai: Exploring the effects of corpus-based business English writing instruction on EFL learners' writing proficiency and perception. Comput. High. Educ. 33(2): 475-498 (2021).
