Article
Software System for Automatic Grading of Paper Tests
Vladimir Jocovic 1,*, Bosko Nikolic 1 and Nebojsa Bacanin 2
1 The Department of Computer Science and Information Technology, School of Electrical Engineering,
University of Belgrade, Bulevar kralja Aleksandra, 11000 Belgrade, Serbia; [email protected]
2 Faculty of Informatics and Computing, Singidunum University, Danijelova 32, 11000 Belgrade, Serbia;
[email protected]
* Correspondence: [email protected]
Abstract: The advent of digital technology has revolutionized numerous aspects of modern life,
including the field of assessment and testing. However, paper tests, despite their seemingly archaic
nature, continue to hold a prominent position in various assessment domains. The accessibility,
familiarity, security, cost-effectiveness, and versatility of paper tests collectively contribute to their
continued prominence. Hence, numerous educational institutions responsible for conducting exami-
nations involving a substantial number of candidates continue to rely on paper tests. Consequently,
there arises a demand for the possibility of automated assessment of these tests, aiming to allevi-
ate the burden on teaching staff, enhance objectivity in evaluation, and expedite the delivery of
test results. Therefore, diverse software systems have been developed, showcasing the capability
to automatically score specific question types. Thus, it becomes imperative to categorize related
question types systematically, thereby facilitating a preliminary classification based on the content
and format of the questions. This classification serves the purpose of enabling effective compari-
son among existing software solutions. In this research paper, we present the implementation of
such a software system using artificial intelligence techniques, progressively expanding its capa-
bilities to evaluate increasingly complex question types, with the ultimate objective of achieving
a comprehensive evaluation of all question types encountered in paper-based tests. The system
detailed above demonstrated a recognition success rate of 99.89% on a curated dataset consisting of 734,825 multiple-choice answers. For the matching type, it achieved a recognition success rate of 99.91% on 86,450 answers. In the case of the short answer type, the system achieved a recognition success rate of 95.40% on 129,675 answers.

Keywords: artificial intelligence; automated test assessment; machine learning; paper test

Citation: Jocovic, V.; Nikolic, B.; Bacanin, N. Software System for Automatic Grading of Paper Tests. Electronics 2023, 12, 4080. https://doi.org/10.3390/electronics12194080
Academic Editor: Jian Sun

Received: 25 August 2023; Revised: 20 September 2023; Accepted: 23 September 2023; Published: 28 September 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

While digital assessments offer advantages such as enhanced interactivity, immediate feedback, and streamlined administration [1], the persistent utilization of paper tests stems from several factors [2]. Firstly, the accessibility of paper tests allows for their deployment in regions with limited access to digital devices or electricity. Secondly, the familiarity associated with the pen-and-paper format engenders a sense of comfort and alleviates test anxiety for many individuals. Thirdly, the perceived security of paper tests cannot be disregarded [3]. Moreover, the cost-effectiveness of paper tests proves advantageous in certain contexts, as they eliminate the need for substantial investments in digital infrastructure and ongoing maintenance [4]. Additionally, the versatility of paper tests in accommodating various assessment formats, particularly those requiring intricate drawings, complex mathematical equations, or freehand writing, remains a prominent advantage [5]. The tangible nature of paper facilitates these tasks more naturally than their digital counterparts [6].

In light of the aforementioned factors, paper tests continue to maintain their widespread usage within educational institutions. However, the reliance on paper tests necessitates a
manual review process, which poses significant challenges, particularly when dealing with
a substantial number of candidates. Due to the sheer volume of applicants, the task of evaluation often becomes monotonous, unstimulating, and wearisome. Moreover,
the evaluation process introduces a level of subjectivity that can be largely mitigated by
utilizing question types that allow for precise and unambiguous responses. Nevertheless,
it is inevitable that errors may arise during the evaluation of such a vast number of answer
sheets within the review process itself. Hence, there arises a compelling necessity to auto-
mate the complete grading process. By doing so, not only would the burden be lifted from
reviewers who often find the task uninspiring, but it would also afford them additional
time to dedicate to crafting high-quality tests.
Therefore, the imperative to develop a system for automated assessment of paper tests
became evident. Initially, automated assessment systems were implemented to handle
exclusively multiple-choice questions, with answers separated from the question text and
presented on a distinct answer sheet. However, the practice of segregating the questions’
and answers’ text into separate forms introduces errors in form completion and distracts
candidates from the essence of the question itself. Therefore, a desirable feature for the
system is the ability to evaluate tests where the questions’ and answers’ text are unified.
Moreover, as the demand arose for automated evaluation of various question types, novel
solutions were devised specifically tailored to each respective question type.
Paper tests encompass a diverse array of question types, making it challenging to
exhaustively enumerate every potential question variant encountered in these assessments.
The presence of synonymous terms that interchangeably denote essentially the same
question type further contributes to this complexity. However, for the purpose of facilitating
transparent and comparative evaluation of implemented software systems capable of
automatically assessing paper tests, it becomes crucial to classify distinct question types
and consolidate related question types into cohesive classes.
The following classes have been identified:
• Multiple Choice: Candidates select one or more correct answers from a matrix-like
arrangement of choices, typically represented by circles or squares.
• Matching: Candidates establish associations between related concepts or underline
specific responses.
• Short Answer: Candidates provide concise answers in the form of one or more words
or numerical values, without requiring an extensive interpretation of the written
content.
• Essay: Candidates compose extended written responses, emphasizing a comprehen-
sive understanding and articulation of the underlying concepts during the evaluation
process.
These question classes are presented in ascending order of complexity, considering
the technologies required to effectively address them. Initially, the implemented system
focused solely on automating the evaluation of multiple-choice questions [7], and a com-
prehensive account of this capability can be found in previous research [8]. However,
subsequent developments have facilitated the expansion of the system to encompass the
automated assessment of matching questions, where correct answers are underlined, as
well as short answer questions involving numerical responses. It is envisaged that future
iterations of the system will further evolve to encompass evaluation capabilities across all
the aforementioned question classes, thereby automating the entire assessment process.
The primary contributions of the presented work can be summarized as follows:
• Examination and classification of the questions employed in the tests.
• Assessment and categorization of current systems designed for automated test grading.
• Development and deployment of an automated test grading system utilizing artificial
intelligence techniques, which outperformed other existing tools.
• A comprehensive evaluation of the implemented system across a vast collection of
digitally formatted paper tests.
The rest of this paper is organized as follows. Section 2 of this paper presents a
collection of software systems that demonstrate the ability to automatically score specific
categories of questions. In Section 3, a control flow diagram of the developed software
system is presented and illuminated. The subsequent Section 4 highlights the implementa-
tion specifics of the system, placing particular emphasis on the components responsible
for evaluating the aforementioned question types. Section 5 presents an evaluation of
the implemented software system, encompassing performance data, limitations encoun-
tered, and challenges faced during its development. Finally, Section 6 concludes the paper,
by providing a brief summary and outlining future endeavors and enhancements to be
pursued.
2. Related Work
The initial phase involved conducting a comprehensive search for software systems
capable of automated grading across the aforementioned question types. Extensive ex-
ploration was carried out within the realm of open-access literature. An analysis of their
accuracy, performance, and limitations will be undertaken and thoroughly discussed.
It has become apparent that each system is designed to specialize in the evaluation of
a particular question class. This specialization has facilitated the identification of notable
resemblances among software systems belonging to the same class. Nonetheless, even
among software systems capable of automatically scoring the same question class, diverse
approaches to detection and grading exist, even when utilizing identical technologies.
The filtering criteria excluded software systems documented in papers published more than a decade ago, ensuring a focus on recent advancements. The prioritization criteria favored systems with recent publication dates or those that garnered a higher number of citations, indicative of their prominence in the field.
The chosen systems satisfying the previously given criteria were developed at the
School of Electrical Engineering, University of Belgrade (SEE-UB) [8], School of Engineering,
Edith Cowan University (SE-ECU) [9], Department of Telematic Engineering, University
Carlos III of Madrid (DTE-UCM) [10], School of Electronic and Information Engineering,
Foshan University (SEIE-FU) [11], Prince Mohammad bin Fahd University (PMFU) [12],
Artificial intelligence Department, Faculty of Computers and Artificial Intelligence, Benha
University (FCAI-BU) [13], Information Technologies Division, Adana Alparslan Turkes
Science and Technology University (ITD-AATSU) [14], School of Software South China,
University of Technology Guangzhou and College of Medical Information Engineering,
Guangzhou University of Chinese Medicine (SSSC-UTG/CMIE-GUCM) [15].
Each of the selected software systems will be accompanied by a concise overview
of introductory details. This includes the software system’s designated name, year of
publication, applicable domains, and the corresponding methods used. The selected
software systems and their introductory details can be seen in Table 1.
2.1. Multiple-Choice
Multiple-choice questions are widely employed in exams as they offer a versatile
format. Such questions typically consist of a stem or partial statement, along with multiple
answer choices. Another category within this class is True/False questions. The answer
choices in a multiple-choice question include one or more correct options, as well as
incorrect alternatives known as distractors. This streamlined approach allows candidates
to swiftly answer such questions, making them particularly suitable for assessing a broad
range of knowledge.
All the systems capable of automatically assessing multiple-choice question types
employ diverse computer vision algorithms during the processing phase. Each system indicates the technologies it utilizes, and all are implemented in programming languages such as Python or MATLAB.
The majority of selected systems utilize the OpenCV library in conjunction with
proprietary algorithms to detect different shapes (such as circles and rectangles) that
represent question-and-answer regions. Some systems (DTE-UCM, for instance) perform
the recognition process on a template test utilizing the Hough transform [16], translating
the identified question and answer structure onto filled-in tests by aligning the template
with the test boundaries. Conversely, other systems (SEE-UB, for instance) solely focus on
detecting regions of interest on filled-in tests.
Regarding accuracy in grading, all selected systems demonstrate high levels of precision: SEE-UB (99.9%), DTE-UCM (99.4%), and SE-ECU (95–100%). Among
them, the SEE-UB system stands out as the fastest, processing a test sheet in approximately
250 milliseconds, while the slowest system requires several seconds. It should be noted,
however, that the number of questions and answers on the answer sheets varies, and some
systems impose limitations on the quantity allowed per test sheet. Additionally, all systems,
except for the SEE-UB system, require questions and answers to be separated on distinct
sheets of paper. In our evaluation, the system developed by the authors (the SEE-UB) has
demonstrated the most impressive results. An example of a multiple-choice question can
be seen in Figure 1a (Figure 1b is the English-translated version of Figure 1a).
Figure 1. (a) Multiple-choice question in Serbian. (b) Multiple-choice questions translated into English.
2.2. Matching
Matching questions are commonly employed in exams as an effective method for
evaluating relationships between concepts. Typically, these questions consist of two sides,
with statements or stems occupying the left side, and corresponding answers or choices on
the right side. Moreover, this question type can be presented in a textual format where it
becomes imperative to underline the words that correspond to the answer to the provided
question.
Matching questions are typically employed when detailed knowledge assessment is
crucial. Additionally, multiple-choice questions that provide designated answer placehold-
ers, requiring students to write the selected answer letter or number, are also classified
within this question category. This decision stems from the fact that by increasing the
number of answer placeholders in these questions, they can be easily generalized to the
Matching question format.
The number of systems capable of grading matching questions is relatively limited.
The SEIE-FU system is implemented using the Python programming language, while the PMFU system additionally utilizes the MATLAB programming language alongside other tools.
Each of the chosen systems utilizes Python libraries for convolutional neural networks,
accompanied by proprietary algorithms to detect regions containing questions and answers.
Both systems perform recognition exclusively on filled-in tests. The SEIE-FU system
employs the YOLO [17] convolutional network, while the PMFU system develops its own
convolutional neural network.
The PMFU system demonstrates a high grading accuracy of at least 92%. The SEIE-FU
system did not explicitly state the achieved accuracy in grading. None of the selected
systems provided information regarding the processing time required for a single test
sheet. Additionally, none of the systems impose limitations on the number of questions
and answers per test sheet.
In our assessment, the SEIE-FU system has showcased superior results. It effectively
handles issues related to incorrect recognition caused by scribbles, and it does not restrict
students in terms of where they can provide their answers. Furthermore, the system
supports multiple languages in addition to English. On the other hand, the PMFU system
encounters challenges when processing slightly tilted scanned test images, leading to the
exclusion of those particular tests. An example of a matching question can be seen in
Figure 2a (Figure 2b is the English-translated version of Figure 2a).
Figure 2. (a) Matching question in Serbian. (b) Matching questions translated into English.
2.3. Short Answer

Short answer questions require candidates to supply a brief written response; typically, these questions consist of a concise statement or paragraph with blank placeholders that
must be filled in. These types of questions are usually employed when assessing candidates’
depth of knowledge.
All of the systems included in the selection employ some form of artificial intelligence,
utilizing a range of techniques such as machine learning, natural language processing, deep
learning, relation networks, long short-term memory, bidirectional encoder representations
from transformers, semantic matching, and text mining. Among the selected systems,
FCAI-BU and SSSC-UTG/CMIE-GUCM utilize convolutional neural networks to detect
handwritten short answers on paper tests. In contrast, other systems perform the grading
process directly on the extracted text from digital images or input forms, without involving
the recognition of handwritten text from paper test images.
Regarding the reported grading accuracy, not all systems disclose the achieved values.
FCAI-BU presents a Pearson score ranging from 0.77 to 0.95, ITD-AATSU reports a score of
0.95, and SSSC-UTG/CMIE-GUCM demonstrates 95% accuracy in grading. None of the
systems provide information on the processing time required for grading tests.
In our assessment, the SSSC-UTG/CMIE-GUCM system has displayed the most
favorable outcomes. It showcases remarkable accuracy in grading and incorporates a
handwriting text recognition module, supported by a preprocessing module that assists in
locating the handwritten answers and performs various image manipulation operations to
enhance the accuracy of text recognition. An example of a short answer question can be
seen in Figure 3a (Figure 3b is the English-translated version of Figure 3a).
Figure 3. (a) Short answer question in Serbian. (b) Short answer questions translated into English.
3. System Overview

Since the limitations of one question type can be balanced by the strengths of another, test diversity, encompassing various question classes, becomes paramount. Therefore, it is imperative to establish a system capable of automatically evaluating as many question classes within a test as feasible, thereby accommodating a broader spectrum of tests. Furthermore, it is crucial for this system to facilitate the
composition or selection of answers directly within the designated answer space for each
question, rather than relying on separate forms for response submission. This approach
enhances candidates’ focus on the question content and their responses while substantially
reducing the likelihood of errors compared to tests featuring separate forms for questions,
answers, and response submission.
The entire system comprises three fundamental components: zoning, scanning, and
processing.
The zoning component is realized as an application that takes a digital blank test
as input. A zoner operator is responsible for identifying and marking the QR code and
barcode zones on the initial page, which contain the test and candidate information, respec-
tively. Additionally, the upper left border region is designated as the reference point for
all subsequent regions of interest, encompassing the questions and their corresponding
answers. On subsequent pages, QR codes are present, carrying information about the data
on each page. Subsequently, the zoner operator selects and outlines rectangular zones for
the questions and their respective answers by utilizing mouse-dragging techniques on each
page. The operator also specifies the question type and verifies the captured data.
Zoning is performed on a blank test devoid of question-and-answer text, primarily
for security reasons. The zoning process generates a zoned test configuration file, which
stores crucial details such as the number of pages, the quantity and types of questions
on each page, and the approximate locations of the questions and their answers. This
information is solely utilized for validation purposes within the processing component,
which is responsible for locating the regions of interest.
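For illustration only, a zoned-test configuration of the kind described above might look like the following Python dictionary; every field name and value here is an assumption of this sketch, as the paper does not specify the actual file format.

# Hypothetical structure of a zoned-test configuration (illustrative only);
# coordinates are approximate pixel locations relative to the upper left border.
zoned_test_config = {
    "test_id": "entrance-exam-2023",      # read from the QR code zone
    "num_pages": 4,
    "pages": [
        {
            "page": 1,
            "qr_zone": {"x": 40, "y": 40, "w": 120, "h": 120},
            "barcode_zone": {"x": 400, "y": 40, "w": 300, "h": 80},
            "questions": [
                {
                    "id": 1,
                    "type": "multiple_choice",
                    "region": {"x": 60, "y": 300, "w": 1100, "h": 260},
                    "answers": [{"x": 120, "y": 480, "w": 980, "h": 60}],
                },
                {
                    "id": 2,
                    "type": "short_answer",
                    "region": {"x": 60, "y": 620, "w": 1100, "h": 180},
                    "answers": [{"x": 700, "y": 720, "w": 300, "h": 50}],
                },
            ],
        },
        # ... remaining pages follow the same layout
    ],
}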
The scanning component is realized as an application designed to handle filled-in
paper tests. It operates by accepting the filled-in test sheets as input. The scanning
component of the system takes over, performing QR code recognition and arranging the
test pages in the correct order. The scanned test pages are then assembled into a multi-page
.tiff file.
Each scanned test can fall into one of three categories: complete, incomplete, or
erroneous. Fully successful scans, representing complete tests, are sent to a dedicated
repository on the server that stores the complete tests. Incomplete tests, which are missing
one or more pages, are also sent to a specific repository for incomplete tests. While they
are forwarded for further processing, manual intervention becomes necessary to address
the missing pages. Tests with errors occur when the QR codes or barcodes cannot be read
or when the test pages exhibit significant deformities. Such tests are not dispatched for
processing and require manual intervention.
For each scanned test, the scanner component sends messages to the associated mes-
sage queue. Each message contains pertinent information and, additionally, the component
periodically transmits statistical information regarding the number of complete, incomplete,
and erroneous tests. Furthermore, when scanning the same test multiple times, distinct
messages are dispatched to the corresponding message queue, resulting in the processing
component handling the same test on multiple occasions.
The test processing component of the system has been designed as a versatile service
capable of running in multiple instances. Each instance of the service establishes a pool of
processing threads, with each thread being assigned to an incoming message queue. For
every scanned and completed test, a corresponding message is inserted into the incoming
message queue by the scanning component. These messages contain the file path to the
digitally scanned test stored on the server, as well as the path to the configuration file
specific to that test.
Upon successful processing of a test, the resulting data and outcomes are stored in the
directory dedicated to completed tests, residing within the corresponding server directory
for that particular test. Upon receipt of a message, each processing thread meticulously
deciphers its contents, subsequently retrieving the designated test and configuration file
from the specified path. Once these essential components are obtained, the intricate process
of test and question processing commences, whereby the appropriate procedures are
employed based on the specific question types encountered.
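As a rough illustration of this design, the sketch below uses Python's standard threading and queue modules in place of the system's actual message-queue technology, which the paper does not name; the message fields and the process_test helper are hypothetical.

import queue
import threading

incoming_messages = queue.Queue()

def process_test(test_path, config_path):
    # Placeholder for the actual test and question processing described above.
    print(f"processing {test_path} with {config_path}")

def worker():
    # Each processing thread blocks on the incoming queue and handles one
    # scanned test per message until it receives a stop sentinel.
    while True:
        message = incoming_messages.get()
        if message is None:
            incoming_messages.task_done()
            break
        process_test(message["test_path"], message["config_path"])
        incoming_messages.task_done()

# One service instance establishes a pool of processing threads.
pool = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
for thread in pool:
    thread.start()

# The scanning component would enqueue one message per completed test, e.g.:
incoming_messages.put({"test_path": "/tests/0001.tiff",
                       "config_path": "/tests/0001_config.json"})
incoming_messages.join()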
4. Implementation

When a test is processed, each individual page is scanned for a QR code. These QR codes are inserted during the test
design phase and contain information pertaining to the test identification and the respective
page number. Since these QR codes are incorporated into every page during test design,
they are inherently well-oriented. This orientation allows for the system to accurately
determine any necessary rotation adjustments needed to correct any inadvertent page
rotation that may occur during the digitization process of the test.
Prior to initiating the question processing phase, it is imperative to identify the specific
regions encompassing the questions and their corresponding answers. These regions are
characterized by rectangular shapes and are positioned between four borders placed at
each corner of the test page, forming a large letter L shape. Each test is represented by a
multi-page .tiff file, where each image within the file corresponds to a single page of the test.
To enhance the recognition process, several morphological operations need to be applied to
the images.
At first, the test image undergoes grayscale conversion, as the subsequent algorithms
necessitate this image format. Subsequently, a Gaussian blur is applied to the image using
an appropriately sized rectangular kernel, effectively eliminating Gaussian noise. The
image is then transformed into a binary inverted representation using an appropriate
threshold function, employing the OTSU algorithm for the automated selection of the
optimal threshold value. The inversion image is essential since the frames outlining the
regions of interest in the original test are black, whereas shape detection algorithms require
the shapes to be white. Morphological operations are then implemented to retain only
horizontal and vertical lines, preserving rectangular regions encompassing the questions,
answers, and borders within the image. Additionally, a closing morphological operation
is performed to address any discontinuities in shapes of interest introduced during the
scanning process. With these transformations completed, the search for regions of interest
can be initiated. A search for shapes within the image yields a list of contours, which are
defined as lines that enclose regions consisting of pixels with the same intensity. In the
context of a binary image, where pixels can have either a value of 1 or 0, contours outline
areas where the intensity remains constant (with a value of 1 in the case of a binary inverted
image).
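The preprocessing steps described above can be sketched with OpenCV roughly as follows; the kernel sizes and line lengths are illustrative assumptions, not the values used by the system.

import cv2

def find_region_contours(page_image_bgr):
    # Grayscale conversion followed by Gaussian blur to suppress scanning noise.
    gray = cv2.cvtColor(page_image_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # Binary inverted thresholding with Otsu's automatic threshold selection,
    # so that the black frames of interest become white shapes.
    _, binary = cv2.threshold(blurred, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Keep only horizontal and vertical lines (frames of questions, answers, borders).
    horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                                  cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                                cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40)))
    frames = cv2.bitwise_or(horizontal, vertical)

    # Closing bridges small discontinuities introduced during scanning.
    closed = cv2.morphologyEx(frames, cv2.MORPH_CLOSE,
                              cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5)))

    # Contours enclose connected regions of constant (white) intensity.
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return contours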
The identification of answer regions varies depending on the question type. Fur-
thermore, the approaches to processing answers also exhibit variations. The subsequent
subsections will elaborate on the techniques employed for their detection. Algorithm 1
illustrates the pseudocode for the processing component.
4.1. Multiple-Choice
The system is capable of processing questions that require selecting one or more
provided answers by blackening the corresponding circles. These types of questions
involve zoning the answer region, which is achieved through the zoning process using
the zoner operator. Algorithm 2 provides the pseudocode for the processing component
responsible for managing multiple-choice questions, while Figure 5 visually details the
sequential steps involved in handling this question type on a test page, connected with
arrows to illustrate the flow.
The answer regions in multiple-choice questions are situated within the question
regions and are identified through the use of tables. The process of determining these
answer tables, which are comprised of rectangular cells, involves conducting a search
for rectangular contours within the question region. In addition to the location of each
contour, the tree search method yields information about the hierarchical parent-child-
sibling relations between the contours. Subsequently, a list of rectangular contours is
obtained, which can potentially form a rectangular answer table.
Upon confirming the answer tables for a given question, the next step involves iden-
tifying the answer circles that need to be marked in order to answer the question. Each
answer circle is represented by a single contour located at the center of a table cell. While
one could simply check for darkened central regions within each cell, an additional search
is conducted to detect the contours of the response circles. This approach ensures that the way students mark their answers is not misinterpreted and verifies adherence to the specified method of filling, as outlined in the test instructions.
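A minimal sketch of such a hierarchical rectangle search with OpenCV is shown below; the area bounds and the approximation tolerance are assumptions for illustration.

import cv2

def find_table_cells(question_region_binary, min_area=500, max_area=20000):
    # RETR_TREE returns the full hierarchy: hierarchy[0][i] holds
    # [next sibling, previous sibling, first child, parent] for contour i.
    contours, hierarchy = cv2.findContours(question_region_binary,
                                           cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    cells = []
    for i, contour in enumerate(contours):
        area = cv2.contourArea(contour)
        if not (min_area <= area <= max_area):
            continue
        # Approximate the contour; a table cell should reduce to four vertices.
        peri = cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, 0.02 * peri, True)
        if len(approx) == 4:
            parent = hierarchy[0][i][3]
            cells.append((cv2.boundingRect(approx), parent))
    # Cells sharing the same parent contour are candidates for one answer table.
    return cells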
The process of circle detection involves various mathematical operations and checks.
For each potential circle contour, the perimeter is calculated, and based on this value, a
reliability index is determined. This index is computed as the ratio of 4 times PI times the
area of the contour to the square of its perimeter. A perfect circle would yield an index
value of 1. To further validate the contour, the moment of the candidate circle is calculated,
aiding in determining the contour’s center of mass. The distance from the center of mass
to each point on the contour is then calculated, and the median value is obtained. It is
assessed whether a sufficient number of contour edge points meet the condition that their
distance to the center of mass falls within a predetermined epsilon neighborhood of the
previously computed median value. The selection of thresholds is based on experimental
determinations. Algorithm 3 presents the process for testing and verifying circles.
Algorithm 3. Pseudocode within the processing component designed for evaluating circle candidates

def is_circle(candidate, thresh, eps):
    num = 4 * math.pi * get_contour_area(candidate)
    den = candidate.get_perimeter() ** 2
    if not ((1 - thresh) <= num / den <= (1 + thresh)):
        return False
    center, points = candidate.get_moments(), candidate.get_points()
    median_dist = median([get_distance(center, p) for p in points])
    dists = [
        abs(get_distance(center, p) - median_dist) / median_dist < eps
        for p in points
    ]
    if sum(dists) <= len(dists) * thresh:
        return False
    return True
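Expressed as a formula, the reliability index computed in Algorithm 3 is the standard circularity measure C = 4πA / P², where A is the contour area and P its perimeter; a perfect circle yields C = 1, and a candidate is rejected unless (1 − thresh) ≤ C ≤ (1 + thresh).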
If the number of detected circles is insufficient for the given question, an error is
recorded, indicating that an insufficient number of circles were detected. Conversely, if the
number of circles exceeds the requirement for the question, the N largest circles (where N
represents the number of answer circles expected for that question) are chosen based on
their area. This selection accounts for situations where certain letters (e.g., “o”) may be
recognized as circles. However, there is a constraint that the answer circles must be larger
than any other letters forming the answer word within the table. Finally, a final check is
conducted to ensure that the identified circles adhere to the criteria of forming a regular
grid.
Once the grid of answer circles for a question is established, the process of determining
the fill status of each circle begins. Initially, a portion of the square image containing the
contour of each circle is extracted. Subsequently, the number of black pixels within the
image, which correspond to the area inside the circle, is computed. However, it is important
to note that even unfilled circles exhibit a small percentage of black pixels due to the
contribution of the circle’s borders. Additionally, students may vary in their degree of
shading when filling the circles. To address these factors, a min-max scaling approach
is employed to derive a score for the level of circle filling. It is important to consider
that scenarios may arise where either all circles are filled, or none are filled, and they are
handled appropriately. Figure 6a displays the multiple-choice question featured in the test
(Figure 6b is the English-translated version of Figure 6a).
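The fill-level scoring can be sketched as follows, assuming binary inverted crops of the circle cells (ink pixels are nonzero); the handling of the degenerate case where all circles look alike is simplified here.

import numpy as np

def circle_fill_scores(binary_inverted_page, circle_boxes):
    # Count ink pixels inside the square patch around each circle; even empty
    # circles contribute some pixels because of their printed border.
    raw_counts = []
    for (x, y, w, h) in circle_boxes:
        patch = binary_inverted_page[y:y + h, x:x + w]
        raw_counts.append(int(np.count_nonzero(patch)))

    counts = np.asarray(raw_counts, dtype=np.float64)
    lo, hi = counts.min(), counts.max()
    if hi - lo < 1e-9:
        # All circles look alike: either none or all are filled; such grids are
        # handled separately by comparing against an absolute reference value.
        return np.zeros_like(counts)
    # Min-max scaling maps the least-shaded circle to 0 and the most-shaded to 1,
    # which compensates for differences in how heavily candidates shade.
    return (counts - lo) / (hi - lo)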
4.2. Matching
For questions of this type in the reference test, the format involves underlining one or
more correct words within the free text of the answer. The answer region of the question
corresponds to the question region itself and is not zoned by the zoner operator during
the zoning process. Algorithm 4 illustrates the pseudocode for the processing component
responsible for managing matching questions, while Figure 7 visually details the sequential
steps involved in handling this question type on a test page, connected with arrows to
illustrate the flow.
The initial step entails localizing and cropping the image of the question region of inter-
est from the blank template test. This image is then converted to grayscale and subsequently
binarized to facilitate further processing. The Tesseract optical character recognition (OCR)
engine is employed to extract the words from the question’s and answer’s text, determining
their positions and dimensions. The words are organized into paragraphs and within
each paragraph, they are arranged into lines according to their appearance order. This
segmentation aids in identifying the regions below the text lines where pencil underlining
may occur.
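A sketch of this word-extraction step with the pytesseract wrapper is given below; the Serbian language code and the confidence filter are assumptions of the illustration.

import pytesseract
from collections import defaultdict

def extract_words_by_line(question_image_gray, lang="srp"):
    # image_to_data returns, for every recognized word, its bounding box together
    # with block, paragraph, and line indices, which is enough to group words
    # into lines in their order of appearance.
    data = pytesseract.image_to_data(question_image_gray, lang=lang,
                                     output_type=pytesseract.Output.DICT)
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        if not text.strip() or float(data["conf"][i]) < 0:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
        lines[key].append((text, box))
    # Sorting the keys preserves paragraph order and line order within paragraphs.
    return [lines[key] for key in sorted(lines)]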
Two distinct methods are utilized to determine whether a word is underlined, with
differences in the underlining regions. In the first method, the underlining regions span
from the baseline of one text line to the top line of the next text line within the same
paragraph. If the word is the last line of the paragraph, the underlining extends to the
beginning of the subsequent paragraph or the end of the question region if it is the last
line of the last paragraph. In the second method, the top edge of the underlining region
is adjusted to align with the middle of the text line. In both cases, the underlining region
of each word is divided into an appropriate odd number of segments, and the number
of blackened pixels is calculated. A distinction between the two methods is that in the
first method, the number of blackened pixels is computed on the binary question image,
while in the second method, the question image undergoes morphological transformation
to retain only horizontal lines. This transformation is achieved by applying an appropriate
kernel to preserve lines of at least one letter width.
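The per-word decision can be sketched as follows; the number of segments and the exact comparison against the template are simplified assumptions of the illustration.

import numpy as np

def is_word_underlined(filled_region, template_region, reference_diff,
                       num_segments=5):
    # Both arguments are binary inverted crops (ink pixels are nonzero) of the
    # underlining region of one word, from the filled-in test and the blank
    # template. The region is split into an odd number of vertical segments and
    # the blackened-pixel counts are compared segment by segment.
    filled_cols = np.array_split(filled_region, num_segments, axis=1)
    template_cols = np.array_split(template_region, num_segments, axis=1)
    exceeded = 0
    for f_seg, t_seg in zip(filled_cols, template_cols):
        diff = np.count_nonzero(f_seg) - np.count_nonzero(t_seg)
        if diff > reference_diff:
            exceeded += 1
    # The word counts as underlined if at least half of the segments show a
    # darkness increase above the configured reference value.
    return exceeded >= (num_segments + 1) // 2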
Both methods are employed to determine if a word is underlined, with the second
method utilized only when the first method indicates that the word is not underlined.
The first method suffices because observing the difference in pixel darkness within the
underlining region of the question image in the template blank test compared to the
filled test has proven to be effective. The second method is necessary as candidates often
underline the word in a way that the underlining line passes through the word rather than
beneath it. Thus, the underlining region's location in the second method, as previously
mentioned, is adjusted. Additionally, the first method yields inadequate results when
the underlining line is obscured by black pixels forming the letters of the word, and the
difference in pixel darkness between the template test and the specific filled-in test sample
is insufficient.
To observe the difference in pixel darkness within the underlined region between
the template and the filled-in test, it is necessary to search for the words in the filled-in
test or transform the question image from the filled-in test after successfully identifying
the question region. The latter approach is chosen due to its efficiency, as underlining
the word impedes the accurate recognition of the word’s position and dimensions. The
transformation of the filled-in test question image involves converting it to grayscale,
followed by binarization. The ORB (Oriented FAST and Rotated BRIEF) detector is then
initialized to detect key points and descriptors in both the template test and the filled-
in test images. Given the similarity between the images and their negligible rotation
angles, a simple detector like ORB is sufficient. Subsequently, a BFMatcher (Brute-force
matcher) object is created, specifying the Hamming distance as the distance criterion for the
similarity between the two sets of descriptors. Using the matcher object and the descriptors,
the locations where the matching occurs are determined, and the best matches (smallest
distance) are selected. These matches are used to find a perspective transformation that
aligns the filled-in test question image with the template test question image.
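A sketch of this alignment with OpenCV's ORB detector, brute-force Hamming matcher, and a RANSAC homography is shown below; the feature count and the number of matches kept are illustrative assumptions.

import cv2
import numpy as np

def align_to_template(filled_gray, template_gray, keep_best=50):
    # Detect ORB key points and binary descriptors in both question images.
    orb = cv2.ORB_create(nfeatures=1000)
    kp_f, des_f = orb.detectAndCompute(filled_gray, None)
    kp_t, des_t = orb.detectAndCompute(template_gray, None)

    # Brute-force matching with the Hamming distance, as appropriate for ORB.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_f, des_t), key=lambda m: m.distance)[:keep_best]

    # Estimate a perspective transform from the best (smallest-distance) matches.
    src = np.float32([kp_f[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_t[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    homography, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # Warp the filled-in question image into the template's coordinate frame.
    height, width = template_gray.shape[:2]
    return cv2.warpPerspective(filled_gray, homography, (width, height))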
With the transformed question image in hand, the next step is to determine the level
of pixel darkening within the underlined regions. The process is repeated using the two
aforementioned methods, which were initially applied to the template test question image
to establish a reference value for pixel darkness within the underlined region of each word.
The obtained values from the filled-in test and the template test are compared. If the
difference exceeds the reference value specified in the configuration file in at least half
of the segments, the word is considered underlined. Notably, the blackness threshold is
notably lower in the second method, taking into account that the morphological operation,
which retains only horizontal lines on the test, eliminates a significant number of blackened
pixels. Figure 8 displays the matching question featured in the test (Figure 8 will not be
translated into English as it depicts the identical question as presented in Figure 2a).
4.3. Short Answer

Algorithm 5. Processing component pseudocode for handling short answer questions

def get_short_written(page, answer):
    answer_image = get_answer_image(page, answer)
    lines_image = morph_image_horizontal_lines(answer_image)
    lines = find_contours(lines_image)
    answer_lines = find_answer_lines(answer, lines)
    morphed_image = erase_lines(answer_image, answer_lines)
    morphed_image = repair_image(morphed_image)
    hw_model = get_or_load_model()
    result = []
    for line in answer_lines:
        seg_image = segment_image(morphed_image, line)
        contours = find_contours(seg_image)
        contours = sorted(contours, key=lambda c: c.get_x(), reverse=False)
        digits = []
        for c in contours:
            c_image = segment_image(seg_image, c)
            d = hw_model.decode(c_image)
            digits.append(d)
        result.append((line, ''.join([str(d) for d in digits])))
    return result
Figure 9. The sequence of processing short answer question type on a test page.
The first step involves identifying and extracting the image of the question region of
interest from the filled-in test. This image is then converted to grayscale and subsequently
transformed into a binary format to aid in subsequent processing. Then, a morphological
operation is applied to the image of the question, resulting in only the horizontal lines
remaining visible. This is achieved by applying an appropriate kernel element with di-
mensions specified in the corresponding configuration file for that test. Next, contours
representing the horizontal lines are detected, and their location information is overlapped
with the answer regions provided by the zoning process for each question. Once the hori-
zontal lines for answer writing are determined, an image processing operation is performed
to erase the answer-writing line from the image. Following that, a morphological closing
operation is applied to potentially recover any extended portions of the written answer,
considering that candidates often write their answers in a way that extends not only above
the line (as it should) but also below it, effectively intersecting with the answer line. Then,
the answer region is extracted from the image, and a search for digit contours is conducted.
Each character is extracted from the image and transformed into a suitable format for
the model used for the recognition of the characters. This process entails resizing the
recognized region of the characters, aligning it to the center, and adding padding to the
smaller dimension to achieve a final size of 32 pixels by 32 pixels, which matches the input
data requirements of the network model.
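The normalization to the 32-by-32 network input can be sketched as follows; the interpolation mode is an assumption of the illustration.

import cv2
import numpy as np

def normalize_character(char_crop_binary, target=32):
    # Resize the larger dimension of the character crop to the target size while
    # preserving the aspect ratio, then pad the smaller dimension symmetrically
    # so the character ends up centered in a target x target image.
    h, w = char_crop_binary.shape[:2]
    scale = target / max(h, w)
    resized = cv2.resize(char_crop_binary,
                         (max(1, int(round(w * scale))), max(1, int(round(h * scale)))),
                         interpolation=cv2.INTER_AREA)
    rh, rw = resized.shape[:2]
    top = (target - rh) // 2
    left = (target - rw) // 2
    canvas = np.zeros((target, target), dtype=resized.dtype)
    canvas[top:top + rh, left:left + rw] = resized
    return canvas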
For the purpose of digit recognition, a modified version of the convolutional neural
network LENET5 [18] is utilized [19]. The original neural network model exhibits issues
of significant bias and variance, and suitable optimizations have been implemented to
address these challenges effectively [20–22]. The modified neural network model consists
of two convolutional layers followed by a max-pooling layer with ReLU activation. These
are then followed by two more convolutional layers and another max-pooling layer with
ReLU activation. Afterward, three fully connected layers are employed, culminating in
10 outputs that utilize the softmax function.
Batch normalization is conducted after every two convolutional layers and after each
fully connected layer to stabilize the network and promote faster convergence of the learn-
ing algorithm. After each pooling layer and the final fully connected layer, regularization is
performed by randomly dropping a quarter of the neurons from the network to enhance generalization and reduce overfitting.
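The text above specifies only the overall layer pattern; the following PyTorch sketch is one way such a modified LeNet-5 could be assembled. The channel widths, fully connected sizes, and the omission of batch normalization on the output layer are assumptions of this illustration, and the paper does not state which deep learning framework was used.

import torch
import torch.nn as nn

class ModifiedLeNet5(nn.Module):
    """Sketch of the described architecture for 32 x 32 grayscale digit images."""
    def __init__(self, num_classes=10, drop=0.25):
        super().__init__()
        self.features = nn.Sequential(
            # Two convolutional layers, batch norm, ReLU, max pooling, dropout.
            nn.Conv2d(1, 16, 3, padding=1), nn.Conv2d(16, 32, 3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2), nn.Dropout(drop),
            # Two more convolutional layers with the same pattern.
            nn.Conv2d(32, 64, 3, padding=1), nn.Conv2d(64, 64, 3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2), nn.Dropout(drop),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Linear(256, 84), nn.BatchNorm1d(84), nn.ReLU(),
            nn.Linear(84, num_classes), nn.Dropout(drop),
        )

    def forward(self, x):
        # Softmax over the 10 digit classes; during training one would normally
        # feed the pre-softmax logits to a cross-entropy loss instead.
        return torch.softmax(self.classifier(self.features(x)), dim=1)

# Example: a batch of four 32 x 32 grayscale character images.
model = ModifiedLeNet5()
probabilities = model(torch.randn(4, 1, 32, 32))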
Figure 10. (a) Example of a short answer question in Serbian. (b) Example of a short answer question
translated into English.
5. Evaluation

The first page of each test serves as a cover and identification page, while the final page is left blank. Within the test, the questions
are distributed across these pages. The answer region is positioned within the question
region. In certain cases, both the question and its corresponding answers may span multiple
pages. In the case of multiple-choice questions, the number of answer options provided
may vary. Similarly, for matching questions, the number of words to be underlined can
range from a single word to multiple words. Each test consisted of 20 questions, with
17 being multiple-choice, 2 matching questions, and 1 short numerical answer question.
Subsequently, the results for each question type will be presented.
5.1. Multiple-Choice
In the system testing process, a total of 734,825 multiple-choice questions were in-
cluded. Among these questions, 734,487 question-and-answer regions were successfully
detected, resulting in a 99.95% success rate in identifying question-and-answer regions.
Furthermore, 734,132 grids of circles representing the answer choices to these questions
were accurately recognized, representing a success rate of 99.91%.
However, in a small fraction of cases (0.09%), specifically 693 questions, the answer
region was successfully identified, but the grid of circles was not correctly recognized and
necessitated manual verification. Upon manual review, it was discovered that candidates
often crossed out certain circles to invalidate previous answers, which caused the circles
to go unrecognized. Additionally, some candidates encircled the answer choices, created
black squares around the circles, or made additional markings on the circles, resulting
in the omission of certain circles from recognition. Consequently, such questions were
flagged for manual inspection to ensure accuracy. Figure 11a illustrates an instance where a
particular response necessitates manual review (Figure 11b is the English-translated version
of Figure 11a).
Figure 11. (a) Responses needing a manual review in Serbian. (b) Responses needing a manual
review translated into English.
It was anticipated that the cumulative count of answer circles would reach 3,319,086 across the 734,487 recognized answer regions. After a thorough examination, the actual number of identified answer circles was 3,316,215, corresponding to a recognition rate of 99.91% for answer circles.
Out of a total of 734,132 successfully recognized grids of circles, 733,987 grids had the fullness of all of their circles correctly identified, a rate of 99.98%. Considering the 3,316,926 circles present in these 733,987 grids, 99.99% of the circles were correctly identified with their fullness status, totaling 3,316,458 circles.
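The reported rates can be recomputed directly from the counts above, as the short check below illustrates.

# Recognition rates reported in Section 5.1, recomputed from the raw counts.
regions = 734_487 / 734_825               # 99.95% question-and-answer regions
grids = 734_132 / 734_825                 # 99.91% recognized circle grids
grid_fullness = 733_987 / 734_132         # 99.98% grids with all fullness states correct
circle_fullness = 3_316_458 / 3_316_926   # 99.99% circles with correct fullness
print(f"{regions:.2%} {grids:.2%} {grid_fullness:.2%} {circle_fullness:.2%}")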
5.2. Matching
During the system testing phase, the number of questions classified as “matching
type” was 86,450. Out of this total, 86,391 question-and-answer regions were successfully
identified, resulting in a success rate of 99.93% in detecting question-and-answer regions.
Moreover, 86,373 answers to these questions were accurately recognized, representing a
success rate of 99.98%.
However, for a small subset of 28 questions (0.02%), although the question regions were successfully detected, either words other than the underlined ones were recognized, or not all of the underlined words were detected. A manual review of these
tests revealed that candidates often crossed out one of the words to invalidate a previous
answer, which inadvertently led to additional words being recognized. Certain candidates
utilized excessively thin and barely discernible underlines on words, resulting in their
non-recognition. Figure 12 illustrates one such instance (Figure 12 will not be translated
into English as it depicts the identical question as presented in Figure 2a).
In the first question, one word should have been underlined, while in the second
question, seven words should have been underlined. Nevertheless, it is important to note
that not all candidates provided correct answers or completed the question accurately.
As a result, the total number of underlined words was 781,779.
5.4. Summary
Table 2 provides a comprehensive summary of the performance data acquired during
the evaluation process of the software system, categorized according to each question type
that the system is capable of assessing.
In the table provided, for answer recognition in each question type, two values are
presented: “Correctly recognized answers” and “Correctly recognized elements (circles,
words, digits)”. These values vary for each question type, with the second value being
higher. The disparity arises from the fact that each question entails a larger number of individual elements (circles, words, or digits) than complete answers.
6. Conclusions
The primary objective of this study is to showcase the software system, which has
demonstrated high precision in evaluating different question types and automatically
recognizing marked answers on a combined question-and-answer paper format. The
system can evaluate multiple-choice questions; matching-type questions, wherein the candidate underlines specific words within the test; and short answer questions. To address these challenges, the
system employs additional artificial intelligence techniques, including convolutional neural
networks and computer vision. Moreover, it encompasses the capability to identify and
categorize various types of errors, assigning them to different severity levels, and managing
their processing accordingly.
However, it would be highly desirable to enhance the system’s resilience and adapt-
ability when confronted with challenges posed by suboptimal scan quality. By fortifying
the system’s capabilities in this regard, it would ensure consistent and reliable performance,
even in less-than-ideal scanning conditions. Furthermore, the most significant challenge
lies in enabling the system to grade essay-type questions. To tackle this issue, further
augmentation of the system is needed, utilizing additional artificial intelligence techniques
such as natural language processing [27]. It would also be advantageous for the system to
verify the candidate’s identity based on their handwriting [28,29]. These aspects present
avenues for future research and development.
The system that has been put into operation is utilized during the assessment of stu-
dents, although its applicability extends beyond academic settings. For instance, this system
could find utility in assessing knowledge for purposes such as acquiring a driver’s license.
Furthermore, it can serve various purposes, including conducting large-scale paper-based
surveys or collecting information on medical histories through medical questionnaires and
other paper-based forms.
Author Contributions: Conceptualization, V.J.; Data curation, N.B.; Formal analysis, B.N.; Funding
acquisition, B.N.; Investigation, V.J.; Methodology, B.N.; Project administration, B.N.; Resources, N.B.;
Software, V.J.; Supervision, B.N.; Validation, N.B. and B.N.; Visualization, V.J.; Writing—original draft
preparation, V.J.; Writing—review and editing, B.N. and N.B. All authors have read and agreed to the
published version of the manuscript.
Funding: This research was supported and funded by the Science Fund of the Republic of Serbia,
grant no. 6526093, AI-AVANTES (www.fondzanauku.gov.rs). The authors gratefully acknowledge
the support.
Data Availability Statement: The data supporting this study’s findings belong to the Ministry of
Education of the Republic of Serbia. For this reason, the data cannot be made public. The authors receive
data under special terms for research purposes only. If needed, the authors can send a request to the
Ministry to make data available to the editor to verify the submitted manuscript.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Jocović, V.; Ðukić, J.; Mišić, M. First Experiences with Moodle and Coderunner Platforms in Programming Course. In Proceedings
of the Tenth International Conference on e-Learning, Belgrade Metropolitan University, Belgrade, Serbia, 29 September 2019;
pp. 81–86, ISBN 978-86-89755-18-3.
2. Lewis, I.; Watson, B.; White, K.M. Internet versus paper-and-pencil survey methods in psychological experiments: Equivalence
testing of participant responses to health-related messages. Aust. J. Psychol. 2009, 61, 107–116. [CrossRef]
3. Hüseyin, Ö.Z.; Özutran, T. Computer-based and paper-based testing: Does the test administration mode influence the reliability
and validity of achievement tests? J. Lang. Linguist. Stud. 2018, 14, 67–85.
4. McClelland, T.; Cuevas, J. A comparison of computer based testing and paper and pencil testing in mathematics assessment.
Online J. New Horiz. Educ. 2020, 10, 78–89.
5. Candrlic, S.; Katić, M.A.; Dlab, M.H. Online vs. Paper-based testing: A comparison of test results. In Proceedings of the 37th
International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija,
Croatia, 26–30 May 2014; pp. 657–662. [CrossRef]
6. Jeong, H. A comparative study of scores on computer-based tests and paper-based tests. Behav. Inf. Technol. 2014, 33, 410–422.
[CrossRef]
7. Santosh, K.C.; Antani, S.K. Recent trends in image processing and pattern recognition. Multimed. Tools Appl. 2020, 79, 34697–34699.
[CrossRef]
8. Jocovic, V.; Marinkovic, M.; Stojanovic, S.; Nikolic, B. Automated assessment of pen and paper tests using computer vision.
Multimed. Tools Appl. 2023, 1–22. [CrossRef]
9. Alomran, M.; Chai, D. Automated scoring system for multiple choice test with quick feedback. Int. J. Inf. Educ. Technol. 2018, 8,
538–545. [CrossRef]
10. Fisteus, J.A.; Pardo, A.; García, N.F. Grading Multiple Choice Exams with Low-Cost and Portable Computer-Vision Techniques.
J. Sci. Educ. Technol. 2013, 22, 560–571. [CrossRef]
11. Lu, M.; Zhou, W.; Ji, R. Automatic Scoring System for Handwritten Examination Papers Based on YOLO Algorithm. J. Phys. Conf.
Ser. 2021, 2026, 12–30. [CrossRef]
12. Shaikh, E.; Mohiuddin, I.; Manzoor, A.; Latif, G.; Mohammad, N. Automated grading for handwritten answer sheets using
convolutional neural networks. In Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences
(ICTCS), Amman, Jordan, 9–11 October 2019; pp. 1–6. [CrossRef]
13. Abdul Salam, M.; El-Fatah, M.A.; Hassan, N.F. Automatic grading for Arabic short answer questions using optimized deep
learning model. PLoS ONE 2022, 17, 269–272. [CrossRef]
14. Tulu, C.N.; Ozkaya, O.; Orhan, U. Automatic short answer grading with SemSpace sense vectors and MaLSTM. IEEE Access 2021,
9, 19270–19280. [CrossRef]
15. Lin, Y.; Zheng, L.; Chen, F.; Sun, S.; Lin, Z.; Chen, P. Design and Implementation of Intelligent Scoring System for Handwritten
Short Answer Based on Deep Learning. In Proceedings of the 2020 IEEE International Conference on Artificial Intelligence and
Information Systems (ICAIIS), Dalian, China, 20–22 March 2020; pp. 184–189. [CrossRef]
16. Sigut, J.; Castro, M.; Arnay, R.; Sigut, M. OpenCV basics: A mobile application to support the teaching of computer vision
concepts. IEEE Trans. Educ. 2020, 63, 328–335. [CrossRef]
17. Huang, R.; Pedoeem, J.; Chen, C. YOLO-LITE: A real-time object detection algorithm optimized for non-GPU computers.
In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018;
pp. 2503–2510. [CrossRef]
18. Meir, Y.; Tevet, O.; Tzach, Y.; Hodassman, S.; Gross, R.D.; Kanter, I. Efficient shallow learning as an alternative to deep learning.
Sci. Rep. 2023, 13, 5423. [CrossRef] [PubMed]
19. Zhang, J.; Yu, X.; Lei, X.; Wu, C. A novel deep LeNet-5 convolutional neural network model for image recognition. Comput. Sci.
Inf. Syst. 2022, 19, 1463–1480. [CrossRef]
20. Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K.A. Survey of quantization methods for efficient neural
network inference. arXiv 2021, arXiv:2103.13630.
21. Huang, Q. Weight-quantized squeezenet for resource-constrained robot vacuums for indoor obstacle classification. AI 2022, 3,
180–193. [CrossRef]
22. Tasci, M.; Istanbullu, A.; Kosunalp, S.; Iliev, T.; Stoyanov, I.; Beloev, I. An Efficient Classification of Rice Variety with Quantized
Neural Networks. Electronics 2023, 12, 2285. [CrossRef]
23. Tang, Z.; Luo, L.; Xie, B.; Zhu, Y.; Zhao, R.; Bi, L.; Lu, C. Automatic sparse connectivity learning for neural networks. IEEE Trans.
Neural Netw. Learn. Syst. 2022, 1–15. [CrossRef]
24. Sher, A.; Trusov, A.; Limonova, E.; Nikolaev, D.; Arlazarov, V.V. Neuron-by-Neuron Quantization for Efficient Low-Bit QNN
Training. Mathematics 2023, 11, 2112. [CrossRef]
25. Lin, S.; Ma, X.; Ye, S.; Yuan, G.; Ma, K.; Wang, Y. Toward extremely low bit and lossless accuracy in dnns with progressive admm.
arXiv 2019, arXiv:1905.00789.
26. Deng, L. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process.
Mag. 2012, 29, 141–142. [CrossRef]
27. Camus, L.; Filighera, A. Investigating transformers for automatic short answer grading. In Proceedings of the Artificial Intelligence
in Education: 21st International Conference, AIED 2020: Proceedings, Part II 21, Ifrane, Morocco, 6–10 July 2020; pp. 43–48.
[CrossRef]
28. Patil, A.; Rane, M. Convolutional neural networks: An overview and its applications in pattern recognition. In Proceedings of
the Information and Communication Technology for Intelligent Systems: Proceedings of ICTIS 2020, Singapore, Singapore, 22
October 2020; pp. 21–30. [CrossRef]
29. Rehman, A.; Naz, S.; Razzak, M.I. Writer identification using machine learning approaches: A comprehensive review. Multimed.
Tools Appl. 2019, 78, 10889–10931. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.