
International Journal of Data Science and Analytics (2025) 20:1581–1592

https://doi.org/10.1007/s41060-024-00554-5

REGULAR PAPER

Clustering source code from automated assessment of programming assignments

José Carlos Paiva1,2 · José Paulo Leal1,2 · Álvaro Figueira1,2

Received: 3 January 2024 / Accepted: 15 April 2024 / Published online: 15 May 2024
© The Author(s) 2024

Abstract
Clustering of source code is a technique that can help improve feedback in automated program assessment. Grouping code
submissions that contain similar mistakes can, for instance, facilitate the identification of students’ difficulties to provide
targeted feedback. Moreover, solutions with similar functionality but possibly different coding styles or progress levels can
allow personalized feedback to students stuck at some point based on a more developed source code or even detect potential
cases of plagiarism. However, existing clustering approaches for source code are mostly inadequate for automated feedback
generation or assessment systems in programming education. They either give too much emphasis to syntactical program
features, rely on expensive computations over pairs of programs, or require previously collected data. This paper introduces
an online approach and implemented tool—AsanasCluster—to cluster source code submissions to programming assignments.
The proposed approach relies on program attributes extracted from semantic graph representations of source code, including
control and data flow features. The obtained feature vector values are fed into an incremental k-means model. Such a model
aims to determine, in a timely manner, the closest cluster of solutions as they enter the system, considering that clustering is an intermediate step for feedback generation in automated assessment. We have conducted a twofold evaluation of the tool to assess (1)
its runtime performance and (2) its precision in separating different algorithmic strategies. To this end, we have applied
our clustering approach on a public dataset of real submissions from undergraduate students to programming assignments,
measuring the runtimes for the distinct tasks involved: building a model, identifying the closest cluster to a new observation,
and recalculating partitions. As for the precision, we partition two groups of programs collected from GitHub. One group
contains implementations of two searching algorithms, while the other has implementations of several sorting algorithms.
AsanasCluster matches and, in some cases, improves the state-of-the-art clustering tools in terms of runtime performance and
precision in identifying different algorithmic strategies. It does so without requiring the execution of the code. Moreover, it
is able to start the clustering process from a dataset with only two submissions and continuously partition the observations as
they enter the system.

Keywords Programming learning · Automated assessment · Programming assignments · Clustering · Semantic graph

1 Introduction

Learning to program requires extensive and varied practice, obtained through solving a wide range of programming assignments supported with accurate, timely, and formative feedback [2, 10, 30]. Such feedback cannot be guaranteed manually on learners' demand, as instructors can neither verify the code attempts of all learners in a class nor are they always available outside classes. Thus, scalable and automatic techniques to assess programming assignments have long been investigated to address this need and are still a target of increasing research interest [35].

B José Carlos Paiva
[email protected]
José Paulo Leal
[email protected]
Álvaro Figueira
arfi[email protected]

1 CRACS, INESC TEC, Rua do Campo Alegre, 4169-007 Porto, Portugal
2 DCC, FCUP, Rua do Campo Alegre, 4169-007 Porto, Portugal


The clustering of source code was initially introduced into the automated assessment of programming assignments for plagiarism detection purposes [31]. By grouping submissions that exhibit high similarity, the space of possible cases of plagiarism is reduced considerably, enabling a more thorough pairwise inspection of them [34, 50]. Multiple strategies to measure similarity have been proposed, including structural [9, 20, 33, 39], semantical [4, 5, 7], and behavioral [21, 28, 49] approaches. Eventually, clustering emerged as a promising technique to support the generation of feedback on the correctness of solutions and on how to progress after mistakes. Having solutions with similar functionality, code complexity, structure, or behavior together can, for instance, facilitate the delivery of targeted feedback on common mistakes by grouping submissions that contain similar errors or misconceptions [11, 18], and enable the generation of personalized feedback to improve a program based on a solution adopting a similar, but correct, strategy [6, 22].

Nevertheless, clustering source code is a complex task. On the one hand, approaches often require computing an edit distance between each pair of solutions (e.g., abstract syntax tree edit distance), which is expensive. On the other hand, the quality of the clusters is highly dependent not only on the selected representation of source code but also on the model used and the available data. Therefore, most of the proposed approaches perform poorly in providing feedback in programming courses. These either: (1) overly focus on syntax and/or require exact matching of program features, generating a large number of clusters as a consequence [14, 17]; (2) rely on expensive pairwise computations [19, 32]; (3) require a large amount of previously generated data [37, 44]; or (4) are specialized in a specific type of assignment (e.g., dynamic programming) and not generalizable [22].

This work proposes an approach and tool—AsanasCluster—to cluster correct source code submissions to programming assignments based on their algorithmic strategies. To this end, we extract from the submitted source code the control flow graph (CFG), which encodes the execution order of the individual statements of a program, and the data flow graph (DFG), which describes how data variables get updated between instructions. The combined information of these representations captures the key aspects of the algorithmic strategy adopted in the original program [12], ignoring its syntax. As even computing the pairwise graph edit distance of these simplified representations would be expensive, we instead compile a vector of features from them, which is used as the input to the clustering model. This model is an incremental mini-batch k-means variant [43] of Lloyd's classic k-means algorithm [26]. Such a model moves the clusters' centers as new correct submissions enter the system, reducing training time considerably when compared to re-training the model on the complete dataset.

The ultimate goal of this clustering process is, given an incorrect solution, to determine the closest cluster considering all correct submissions to date. From this cluster, we select a correct solution, which theoretically follows the same algorithmic strategy, to compare against the wrong program and generate personalized feedback for the student. Consequently, the runtime of the clustering process must allow for near real-time feedback, considering that feedback generation is a subsequent task. Moreover, it is important that the model solution adopts the same strategy as the incorrect one, if it is a valid approach, to support students' development in their own line of thought. Hence, we evaluate AsanasCluster on a public dataset—PROGpedia [36]—regarding both the runtimes and the effectiveness in identifying different algorithmic solution strategies.

The remainder of this paper is organized as follows. Section 2 presents some of the most important works involving the clustering of source code for the automated assessment of programming assignments. Section 3 reviews the concepts necessary for the correct understanding of this work. Section 4 describes the proposed approach. Section 5 demonstrates the effectiveness of this approach using a public dataset of real submissions to programming assignments. Finally, Sect. 6 discusses and summarizes the contributions of this work.

2 Related work

Earlier approaches for clustering source code in programming education are based on textual similarity. These approaches often involve the extraction of tokens or the selection of keywords from the source code, followed by pairwise comparison using some well-known distance metric or common text mining techniques [24, 31, 34]. While such approaches can inherit much from text clustering, they are generally very sensitive to changes in code structure or formatting.

A popular program representation used in clustering approaches is the abstract syntax tree (AST), as it captures just enough information to understand the structure of the code. Such clustering approaches compute similarity using distances in feature space [15, 16], string edit distance [42], tree edit distance [19], or normalization [41, 47]. For instance, Codewebs [32] customizes and employs a set of semantics-preserving AST transformations to normalize and cluster student submissions.

Luxton-Reilly et al. [29] claim that different solutions have distinct structural variations, which can be encoded using control flow graphs (CFGs). This means that clustering source codes by their control flow structures divides them into categories. OverCode [14] and CLARA [17] combine these structures with dynamic information on variable values to cluster solutions. However, these techniques generate a large


number of clusters as they focus excessively on the syntactic details of the source code.

SemCluster [37] uses a vector representation of programs based on semantic program features, which can be used with standard clustering algorithms such as k-means. The features include control flow features and data flow features. The former describe how the problem space is partitioned into sub-spaces (i.e., the control flow paths), while the latter capture the frequency of occurrence of distinct pairs of successive values of individual variables in test executions.

Using deep learning to learn program embeddings from token sequences, ASTs, CFGs, program states, or other program representations is the recent trend in program clustering [27, 38, 40, 44, 45]. Nevertheless, training such models still requires considerable effort and a meticulous selection of inputs. Finally, other clustering approaches specialize in specific programming problems such as dynamic programming [22] and interactive programs [8].

3 Definitions

In this section, we present the concepts of control flow graph (CFG), evaluation order graph (EOG), data flow graph (DFG), and k-means clustering that form the basis of the proposed approach.

3.1 Control flow graph

A control flow graph (CFG) is a directed graph G = (N, E, n_0, n_f), where N represents the set of nodes, E is the set of directed edges (i.e., pairs of elements of N), and n_0, n_f correspond to the entry and exit nodes, respectively. The set of nodes N = {n_1, n_2, ...} ∪ {n_0, n_f} corresponds to basic blocks, i.e., maximal-length sequences of branch-free instructions of a program. The set of edges E represents control dependencies between the nodes. The two extra nodes n_0 and n_f, which represent the node through which control enters the graph (entry node n_0) and the node through which control exits the graph (exit node n_f), are added such that each node of the graph has at most two successors.

The CFG captures the control flow behavior of a program, considering the possible paths and decisions taken during program execution. It provides a structured representation of the control flow of the program, supporting program analysis, optimization, and the understanding of its behavior.

3.2 Evaluation order graph

The evaluation order graph (EOG) [46] is a directed graph G = (N, E), where N represents the set of nodes and E is the set of directed edges, designed to capture the order in which code is executed, similarly to a CFG, but at a finer level of granularity, i.e., including the order in which expressions and sub-expressions are evaluated. The nodes of the EOG are the same nodes as those of the abstract syntax tree of the program, whereas an edge (n_i, n_j) means that n_j is evaluated after n_i.

The differences between the EOG and the CFG, which connects basic blocks of statements, are only a few, particularly: methods without explicit return statements have an edge in the EOG to a virtual return node; the EOG considers opening blocks (e.g., {) as separate nodes; the EOG uses separate nodes for the if keyword and its condition; and the EOG considers a method header as a node.

3.3 Data flow graph

A data flow graph (DFG) is a directed graph G = (N, E), where N is the set of nodes and E is the set of directed edges. Each node within the set N = {n_1, n_2, ...} denotes a distinct computational unit or instruction, whereas the directed edges (n_i, n_j), for n_i, n_j ∈ N, within the set E represent the data dependencies, i.e., the output data of n_i is consumed by n_j. Such a representation enables a clear view of the data processing pipeline (i.e., the flow of data along the edges establishes the sequence in which operations should be executed), supporting the analysis and optimization of the program through the identification of opportunities for parallel execution.

3.4 K-means clustering

The k-means clustering method is a popular unsupervised machine learning technique for partitioning a set of observations (or data points) into k different clusters. Firstly, k initial centroids are randomly selected, where k is a user-defined parameter. Each data point d is then assigned to the closest mean (or centroid), and the collection of points assigned to a centroid forms a cluster. Afterward, the centroid of each cluster is updated based on all points in the cluster. This iterative procedure is repeated until no changes occur in the clusters.

The method can be formally defined as follows. Consider D = {d_1, ..., d_n} the set of observations to be clustered, where each d_i ∈ R^m is represented by an m-dimensional feature vector. Then, k-means partitions the data points in D into k clusters C* = {C_1, ..., C_k} such that ∑_{i=1}^{k} ∑_{d ∈ C_i} dist(d, µ_i) is minimal, where µ_i = (1/|C_i|) ∑_{d ∈ C_i} d is the centroid of cluster C_i and "dist" is the distance function used. There are many distance metrics that can be used, such as the squared Euclidean distance, i.e., dist(d, µ) = ||d − µ||², and the cosine distance, i.e., dist(d, µ) = 1 − (d · µ)/(||d|| ||µ||). The best one depends on the dataset composition.
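To make the objective above concrete, the cost that k-means minimizes can be sketched in a few lines. This is an illustrative Python sketch, not the paper's implementation; the function names are ours:

```python
import numpy as np

def sq_euclidean(d, mu):
    # dist(d, mu) = ||d - mu||^2
    return float(np.sum((d - mu) ** 2))

def cosine_distance(d, mu):
    # dist(d, mu) = 1 - (d . mu) / (||d|| ||mu||)
    return 1.0 - float(np.dot(d, mu) / (np.linalg.norm(d) * np.linalg.norm(mu)))

def kmeans_cost(D, assign, centroids, dist=sq_euclidean):
    # Sum over clusters C_i of sum over d in C_i of dist(d, mu_i),
    # where assign[j] gives the cluster index of observation D[j].
    return sum(dist(d, centroids[assign[j]]) for j, d in enumerate(D))
```

For instance, two points at distance 1 on either side of a single centroid yield a cost of 2 under the squared Euclidean metric, while two orthogonal vectors have a cosine distance of 1.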


Even though this problem is known to be NP-hard, such gradient descent methods generally converge to a local optimum if seeded with an initial set of k observations drawn uniformly at random from D [3]. Bottou et al. [3] used this property to propose an online stochastic gradient descent variant that computes a gradient descent step on one observation at a time, which makes it converge faster on large datasets but degrades the quality of the clusters (due to stochastic noise). Sculley [43] proposes an optimization of k-means clustering that processes mini-batches rather than individual data points; mini-batches tend to have lower stochastic noise and are not affected in terms of cost when datasets grow large with redundant observations.

4 Clustering source code with AsanasCluster

This section introduces the design and the implementation of a tool, named AsanasCluster, to cluster correct source code solutions submitted to programming assignments in real time. This approach addresses a few gaps in existing techniques. First, it groups programs by their algorithmic strategy from a high-level perspective, which generates fewer clusters than most existing clustering approaches. Second, it extracts and relies on a vector of features from the semantic graph representations of the program, avoiding expensive pairwise computations such as the graph edit distance across the complete dataset. Lastly, it follows an incremental clustering model, meaning that solutions are assigned to clusters as they enter the system rather than all at once. Such a model not only reduces considerably the time to discover the closest cluster to a new observation but also enables this task to run with up-to-date information on submitted solutions.

The workflow of AsanasCluster is illustrated in Fig. 1. Given a set of existing solutions P to a programming assignment, for each new program p, received as input, it generates both an EOG and a DFG using an adaptation of an existing Kotlin library [13], designed to extract the code property graph (which includes the representations needed) out of source code. This step guarantees support for programs written in either Python, Java, C, or C++. The obtained EOG is transformed into a CFG through a process involving edge contraction, i.e., every edge whose source has an out-degree of one and whose destination has an in-degree of one is contracted. These two final representations, DFG and CFG, are analyzed to compute the control flow and data flow features that compose the feature vector of a program (described in Subsection 4.1). Finally, the resulting feature vector is fed into the implemented k-means clustering algorithm (described in Subsection 4.2).

4.1 Feature engineering

One key characteristic of the proposed approach lies in the representation of the program used. The clustering process aims to separate source code solutions by their algorithmic strategy, i.e., a sequence of instructions executed in a well-defined order to solve a problem or calculate a function. The flow of execution of a program, i.e., the order in which the instructions execute, is thus an essential aspect of the algorithmic strategy. Combining this with knowledge of the data dependencies among these instructions, the algorithmic strategy is largely covered [12]. The former information is captured by the CFG (or the EOG), whereas the latter is encoded in the DFG, as explained in Subsections 3.1 and 3.3.

To obtain these representations, we first adapted a Kotlin library [13], initially developed to extract the code property graph (CPG) [48] out of source code written in either Python, Java, C, or C++. The CPG is a data structure combining the AST, DFG, and EOG, designed to mine large codebases for programming patterns that represent security vulnerabilities. As this representation includes the required information, our adaptation consists of adding a feature to the library for exporting the CPG in comma-separated values (CSV) format. The exported artifact is composed of two CSV files: one containing the description of the nodes, including ID, type of construct, token, and location, and the other describing the edges, including source, location, origin (AST, EOG, or DFG), among other information specific to their origin (e.g., the variable identifier for edges of the DFG). While both the EOG and the CFG encode the control flow of a program, the latter is a significantly smaller graph. Hence, before further computations, the obtained EOG is transformed into a CFG through a process involving edge contraction, i.e., every edge whose source has an out-degree of one and whose destination has an in-degree of one is contracted.

Clustering by the CFG and the DFG directly would require measuring two pairwise graph edit distances over the full dataset. These are complex operations whose computational cost grows exponentially with the graph and dataset sizes. Therefore, our approach is instead feature-based. We derive a feature vector composed of numeric values calculated from the characteristics of both graphs, CFG and DFG. This vector contains 11 features, namely: connected_components, the number of connected components in the control flow graph (i.e., being an intra-procedural representation, the multiple procedures have no connection in the graph); loop_statements, the number of loop statements (e.g., for, foreach, while, and do...while) in the program; conditional_statements, the number of conditional


Fig. 1 Scheme of how AsanasCluster works on a high level

statements (e.g., if) in the program; cycles, the number of different cycles in the control flow graph; paths, the number of different paths in the control flow graph; cyclomatic_complexity, a software metric that measures the complexity of a program by analyzing its control flow (i.e., it provides a quantitative measure of the number of possible execution paths in the program); variable_count, the number of variables used in the program, excluding variables that are never read; total_reads, the total number of read operations on variables; total_writes, the total number of write operations on variables; max_reads, the maximum number of read operations on a single variable; and max_writes, the maximum number of write operations on a single variable. Table 1 summarizes the features of the model.

Table 1 Features of the model

Feature                   Type      Source   Weight
connected_components      Integer   CFG      1
loop_statements           Integer   CFG      1
conditional_statements    Integer   CFG      1
cycles                    Integer   CFG      1
paths                     Integer   CFG      1
cyclomatic_complexity     Integer   CFG      1
variable_count            Integer   DFG      0.6
total_reads               Integer   DFG      0.1
total_writes              Integer   DFG      0.1
max_reads                 Integer   DFG      0.1
max_writes                Integer   DFG      0.1

As the order of execution of instructions has the most relevance to the algorithmic strategy of a solution, we decided to split the weight of the data flow features. Among these, variable_count weighs more, as the others depend on it by definition. The summed weight of all the data flow features is the same as that of a single control flow feature. Moreover, we have scaled the data so that it has zero mean and unit variance. To that end, running means and variances are maintained for each feature. Even though, the model being incremental, the exact means and variances are not known in advance, this does not have a detrimental impact in the long term.

Having a high number of features in the model makes it more difficult to manage and may even add noise, as some of these features can be redundant. To prevent this, the correlation of the 11 features of our model has been measured using Pearson's correlation coefficient [23] on the 16 programming exercises of the PROGpedia dataset [36]. The correlation coefficient takes values from −1 to 1: a value closer to 0 implies weaker correlation (i.e., 0 is no correlation); a value closer to 1 means stronger positive correlation; and a value closer to −1 implies stronger negative correlation. Each programming exercise is analyzed separately and casts a vote on pairs with a correlation above 0.9. For pairs with half or more of the total votes, one member is eliminated. Nevertheless, in this case, no correlated pair has been identified under these conditions.

4.2 Clustering model

The values of the final feature vector are given as input to the implemented k-means clustering algorithm (see Subsection 3.4). This specific implementation starts by randomly instantiating k centroids, according to a Gaussian distribution. The value of k is the main hyper-parameter of the model and sets the limit on the number of formed clusters. As the goal is to have as many clusters as there are algorithmic solution strategies, an adequate value would be greater than or equal to the expected count of different strategies. We have limited the maximum number of clusters to 16, as the possibility of an academic-level programming assignment having more than 16 algorithmic solution strategies can be neglected. Nevertheless, this value can be defined explicitly, per assignment.

Given a new submission, more precisely the feature vector extracted from it, we first identify the closest centroid. This is done by measuring the distance from the new observation to each centroid, using a certain distance metric, and selecting the minimum of these distances. In this case, we tried the Manhattan, Euclidean, and cosine distances in two sets of submissions to programming assignments with well-defined algorithmic solution strategies.
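The per-exercise redundancy check from Subsection 4.1 can be sketched as follows. This is an illustrative Python sketch; the use of the absolute correlation and the choice to drop the second member of a flagged pair are our assumptions, as the paper does not fix these details:

```python
import numpy as np
from itertools import combinations

def redundant_features(exercise_matrices, threshold=0.9):
    """Each exercise (rows = solutions, columns = features) votes for
    feature pairs whose |Pearson r| exceeds the threshold; one member of
    every pair voted by at least half of the exercises is flagged."""
    votes = {}
    for X in exercise_matrices:
        corr = np.corrcoef(X, rowvar=False)  # feature-by-feature correlations
        for i, j in combinations(range(X.shape[1]), 2):
            if abs(corr[i, j]) > threshold:
                votes[(i, j)] = votes.get((i, j), 0) + 1
    half = len(exercise_matrices) / 2
    return {j for (i, j), count in votes.items() if count >= half}
```

With the PROGpedia data, this filter flagged no pair, so all 11 features were kept.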


The Euclidean distance revealed a lower average error index (0) than the Manhattan (0.3) and cosine (0.25) distances and was, thus, the one applied. After identifying the centroid (and cluster) to which the new observation belongs, the centroid's position "moves" in the direction of the new element. The amount by which the centroid moves is the product of their scalar distance and the learning rate. The learning rate is the inverse of the number of solutions assigned to the cluster during the process, i.e., as the number of elements increases, the effect of new elements is reduced.

The pseudocode of this clustering process is presented in Algorithm 1. It assumes that the feature vector is provided as the solution object, omitting the extraction of the graph representations and the subsequent computation of the feature vector values. Moreover, when centroids "move," the closest centroid is re-identified for previous solutions.

Algorithm 1 Pseudocode of the k-means clustering process

Require: 2 ≤ k ≤ 16, the number of centroids to initialize.
Ensure: dist(c, S) is a function that computes the distance between two feature vectors, according to the metric used.
Ensure: C has k centroids randomly initialized according to the k-means++ seeding algorithm.
Ensure: N has k zeroes.
repeat
    Let S be the new solution
    min, min_c ← ∞, 0
    for c ∈ C do                          ▷ Identify the closest centroid
        d ← dist(c, S)
        if d ≤ min then
            min ← d
            min_c ← c
        end if
    end for
    N[min_c] ← N[min_c] + 1
    if S is correct then                  ▷ A correct solution moves its centroid
        min_c ← min_c + (1/N[min_c]) × (S − min_c)
    end if
until no more submissions

4.3 Mooshak integration

AsanasCluster aims to integrate with automated assessment engines, consuming their submissions' data both offline (i.e., previously submitted solutions) and in real time (i.e., new submissions entering the system). To this end, AsanasCluster has two modes. One builds a clustering model from all existing submissions to a specified programming assignment. The other loads a clustering model saved to disk and identifies the closest cluster to the given submission, including it in the model if it is an accepted solution.

Mooshak [25] is one of the existing systems providing automated assessment capabilities and the one selected for the development and testing of AsanasCluster. Mooshak uses the file system as its object database, storing and retrieving data in Tcl-code files organized in directories. Therefore, the submissions' metadata is stored alongside the source code and the extracted CSV files of the CPG in the submission folder.

For building a clustering model, AsanasCluster simply iterates over the submissions' directory and, for each submission folder, loads the CPG and processes it into the model (if accepted). When a new submission enters the system, AsanasCluster acts as the last evaluator of Mooshak, adding the submission to the model and echoing the classification of the previous evaluator. If the submission has a rejection classification, the identification of the closest cluster is also printed, and the model updates are discarded.

5 Evaluation

This section presents the results of the evaluation of the accuracy and time adequacy of AsanasCluster for the automated assessment of programming assignments. To this end, we have evaluated the performance of clustering on a public collection—PROGpedia [36]—of source code submitted to 16 programming assignments on Mooshak [25] in undergraduate Computer Science courses within multiple years of the 2003–2020 time span. The dataset comprises a total of 9117 submissions. As we intend to use the clustering output as input to a program repair tool, we separate the submissions not only by programming exercise but also by programming language. Only solutions written in C/C++ (C17), Java (Java 8), and Python (version 3) were considered (note: the version within parentheses means "compatible with," not an exact match). All tests run on a Dell XPS 15 9570.

5.1 Runtime

Our goal is to use AsanasCluster as an intermediate step in the automated assessment of programming assignments. While no time limit for a single evaluation is formally defined in the literature on the automated assessment of programming assignments, one minute is a reasonable limit for a task that is meant to be nearly real time [1]. To evaluate the scalability of AsanasCluster, we measure the amount of time required to (1) build a clustering model with past submissions from scratch, (2) process a new correct solution into the model, and (3) determine the cluster of a newly submitted solution. As for (1), we have built models for the set of correct solutions from PROGpedia [36], separating data by programming exercise and language. Table 2 summarizes the composition of the dataset regarding submissions, including the number of submissions and the average lines of code for each pair assignment/programming language. For (2), a new correct solution has been developed. Finally, in (3), we randomly select a wrong attempt.
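The update rule of Algorithm 1 can be reproduced in a short runnable sketch. This is illustrative Python, not the tool's implementation; we assume the Euclidean metric and, for brevity, Gaussian initialization instead of k-means++ seeding, and the class and method names are ours:

```python
import numpy as np

class IncrementalKMeans:
    """Sketch of Algorithm 1: each correct solution pulls its closest
    centroid toward it with learning rate 1/N_c, where N_c is the number
    of solutions assigned to that centroid so far."""

    def __init__(self, k, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = rng.normal(size=(k, dim))  # random (Gaussian) init
        self.counts = np.zeros(k, dtype=int)

    def closest(self, x):
        # Index of the centroid with the minimum Euclidean distance to x
        return int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))

    def add(self, x, correct=True):
        c = self.closest(x)
        self.counts[c] += 1
        if correct:  # only correct solutions move the centroid
            lr = 1.0 / self.counts[c]
            self.centroids[c] += lr * (x - self.centroids[c])
        return c
```

With k = 1, the first correct solution places the centroid exactly on it; each later solution moves the centroid by a progressively smaller fraction of the remaining distance, which is why re-training on the complete dataset is unnecessary.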


Table 2 Submissions' details from PROGpedia dataset

ID | # of Submissions (C / C++ / Java / PY) | Avg. LoC (C / C++ / Java / PY)
06 | 40 / – / 100 / 64 | 30 / – / 36 / 22
16 | 20 / – / 105 / 30 | 32 / – / 45 / 17
18 | 1 / – / 61 / 5 | 73 / – / 166 / 57
19 | 2 / – / 66 / 139 | 88 / – / 141 / 98
21 | 2 / – / 21 / 112 | 137 / – / 227 / 89
22 | 3 / – / 52 / 60 | 55 / – / 90 / 28
23 | 1 / – / 71 / 38 | 141 / – / 189 / 63
34 | 172 / 26 / 205 / – | 50 / 34 / 31 / –
35 | 76 / 24 / 140 / – | 60 / 60 / 60 / –
39 | 75 / 25 / 154 / – | 96 / 77 / 88 / –
42 | 58 / 26 / 138 / – | 67 / 66 / 65 / –
43 | 77 / 32 / 178 / – | 52 / 49 / 52 / –
45 | 54 / 21 / 148 / – | 49 / 50 / 51 / –
48 | 29 / 24 / 136 / – | 49 / 49 / 56 / –
53 | 1 / 43 / 152 / – | 110 / 119 / 148 / –
56 | 1 / 22 / 85 / – | 76 / 95 / 110 / –

Table 3 Runtime and number of clusters for PROGpedia dataset

ID | Training time (C / C++ / Java / PY) | Number of clusters (C / C++ / Java / PY)
06 | 1m37s / – / 4m24s / 3m1s | 4 / – / 5 / 4
16 | 0m51s / – / 4m46s / 1m30s | 4 / – / 4 / 4
18 | 0m5s / – / 4m11s / 0m26s | 1 / – / 3 / 1
19 | 0m11s / – / 4m10s / 8m58s | 1 / – / 2 / 1
21 | 0m14s / – / 1m36s / 7m39s | 1 / – / 3 / 3
22 | 0m13s / – / 2m41s / 2m59s | 3 / – / 7 / 5
23 | 0m6s / – / 4m34s / 2m59s | 1 / – / 3 / 1
34 | 8m18s / 1m12s / 9m21s / – | 7 / 4 / 5 / –
35 | 3m39s / 1m9s / 7m3s / – | 4 / 3 / 4 / –
39 | 4m23s / 1m29s / 9m14s / – | 7 / 4 / 8 / –
42 | 2m55s / 1m18s / 6m49s / – | 9 / 4 / 4 / –
43 | 3m32s / 1m30s / 8m51s / – | 6 / 2 / 3 / –
45 | 2m45s / 1m6s / 7m32s / – | 8 / 2 / 5 / –
48 | 1m20s / 1m9s / 6m32s / – | 2 / 3 / 3 / –
53 | 0m6s / 2m31s / 9m30s / – | 1 / 4 / 3 / –
56 | 0m5s / 1m9s / 4m49s / – | 1 / 4 / 4 / –

For (2), a new correct solution has been developed. Finally, for (3), we randomly select a wrong attempt.

Building a clustering model on a set of submissions (1) requires four steps. First, search for and select the adequate solutions (i.e., accepted solutions written in the programming language of the model) from the directory containing all submissions to an assignment. Second, generate the needed semantic graph representations, i.e., the DFG and the CFG. Third, compute the feature vector from these representations. Finally, build the k-means model, processing the existing observations. Table 3 presents the building times and the number of generated clusters for each pair (programming assignment, programming language), for the solution sets detailed in Table 2.

The maximum model building time is 9 min and 30 s, for the 152 Java submissions to programming assignment 53, which demands the implementation of a graph searching algorithm. As expected, the number of submissions has the greatest impact on training performance, compared to the programming language or the complexity of the programming assignment. However, the complexity of the solutions also affects the runtime negatively. For instance, processing the 205 submissions to programming assignment 34, which requires sorting a vector of numbers, takes only 9 s less than processing the 152 submissions to assignment 53.

The number of generated clusters has no noticeable correlation with either the number of submissions or the programming language. The median number of clusters for the built models is 4. The set of solutions written in C for exercise 42 has 9 clusters, the highest number of clusters identified among the evaluated sets.

Table 4 depicts the time needed to (2) process a new solution into the model and (3) identify the best cluster for a newly submitted solution, using the built models. On average, it takes 5.397 s to learn (2) and 4.910 s to predict the cluster of a new observation (3). No task runs for 7 s or more (6.981 s is the worst case).

5.2 Error index

This analysis aims to validate the effectiveness of AsanasCluster in separating the different algorithmic strategies implemented in solutions. To this end, we consider a simple metric (1), which we named the Error Index. The Error Index takes values between 0 and 1, where 0 indicates that all solutions were correctly grouped. A solution is considered wrongly grouped if it is in the cluster of a different strategy (i.e., a cluster belongs to the algorithmic strategy with the most solutions in it). This metric purposely ignores the case where solutions adopting the same algorithmic strategy spread across different clusters. The reason is that those solutions can still be quite distinct.

Error Index = Nr. of wrongly grouped solutions / Nr. of solutions    (1)

To evaluate this, we conduct two separate tasks. The first task consists of clustering a set of 100 different implementations of two graph searching algorithms: 50 depth-first search and 50 breadth-first search. These programs were collected during an Algorithm Design and Analysis class. In the second task, we cluster a collection of 100 programs

Table 4 Runtime for cluster discovery and processing a new case, using previously built models

ID | Predict time (C / C++ / Java / PY) | Learn time (C / C++ / Java / PY)
06 | 4.837 s / – / 4.419 s / 4.259 s | 4.400 s / – / 4.610 s / 4.677 s
16 | 4.330 s / – / 4.351 s / 4.345 s | 4.557 s / – / 4.830 s / 4.737 s
18 | 4.508 s / – / 5.238 s / 4.960 s | 5.043 s / – / 5.821 s / 5.493 s
19 | 4.449 s / – / 4.887 s / 4.277 s | 5.105 s / – / 5.515 s / 4.648 s
21 | 4.955 s / – / 5.201 s / 4.426 s | 5.933 s / – / 6.271 s / 5.668 s
22 | 4.392 s / – / 4.714 s / 4.999 s | 4.877 s / – / 5.070 s / 5.322 s
23 | 5.012 s / – / 5.723 s / 6.611 s | 5.877 s / – / 6.321 s / 6.981 s
34 | 4.614 s / 4.485 s / 4.926 s / – | 4.981 s / 4.996 s / 5.454 s / –
35 | 4.403 s / 4.713 s / 5.017 s / – | 5.903 s / 5.351 s / 5.362 s / –
39 | 4.945 s / 4.682 s / 5.690 s / – | 5.969 s / 5.329 s / 5.575 s / –
42 | 4.784 s / 4.698 s / 5.088 s / – | 5.407 s / 5.037 s / 5.301 s / –
43 | 4.518 s / 4.699 s / 5.081 s / – | 4.806 s / 4.868 s / 5.530 s / –
45 | 5.076 s / 5.261 s / 6.193 s / – | 5.234 s / 4.879 s / 5.769 s / –
48 | 4.854 s / 4.742 s / 4.972 s / – | 5.566 s / 4.956 s / 5.261 s / –
53 | 4.903 s / 5.290 s / 6.261 s / – | 6.436 s / 6.232 s / 6.676 s / –
56 | 4.880 s / 4.724 s / 5.277 s / – | 5.408 s / 5.305 s / 5.702 s / –
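The Error Index of Eq. (1) can be computed directly from cluster assignments and the known strategy labels. The sketch below is a hypothetical helper (function and variable names are ours, not AsanasCluster's), reproducing the 3-out-of-103 selection-sort example discussed in the evaluation:

```python
from collections import Counter

def error_index(clusters, strategies):
    """Eq. (1): fraction of solutions lying in a cluster whose majority
    strategy differs from their own. Solutions of one strategy spread
    over several clusters are deliberately not penalized."""
    per_cluster = {}
    for c, s in zip(clusters, strategies):
        per_cluster.setdefault(c, Counter())[s] += 1
    # each cluster "belongs" to its most frequent strategy
    majority = {c: cnt.most_common(1)[0][0] for c, cnt in per_cluster.items()}
    wrong = sum(1 for c, s in zip(clusters, strategies) if majority[c] != s)
    return wrong / len(clusters)

# 103 solutions: 3 selection-sort programs land in the insertion-sort cluster
clusters = [0] * 28 + [1] * 25 + [2] * 25 + [3] * 25
strategies = (["insertion"] * 25 + ["selection"] * 3 + ["heap"] * 25 +
              ["quick"] * 25 + ["merge"] * 25)
print(round(error_index(clusters, strategies), 2))  # → 0.03
```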

Fig. 2 2-Component PCA visualization of the clustering of implementations of graph searching (left) and sorting algorithms (right)

from GitHub implementing sorting algorithms, namely heap, merge, insertion, and quick sort. There are 25 samples of each sorting algorithm.

Figure 2 illustrates the 2-component principal component analysis (PCA) visualization of the resulting clusters for both tasks. (PC1 and PC2 explain 83% and 6% of the variability, respectively, in the left chart, and 60% and 24% in the right chart.) The red crosses indicate clusters' centroids. On the left chart, there are four clusters. The green cluster contains the 50 points corresponding to the depth-first search implementations. The breadth-first search implementations are assigned to the gray cluster. On the right, four clusters match the different sorting algorithms: insertion sort (gray), heap sort (green), quick sort (blue), and merge sort (yellow). Note that some points are not visible as they share the same (or very close) values of the feature vector. In both experiments, there is one cluster corresponding to each of the different included algorithmic strategies, and solutions are split evenly by these clusters. Furthermore, the Error Index of both tasks evaluates to 0, as there is no cluster with solutions of different algorithmic strategies.

Extending the collection of the second task with one implementation of the radix sort algorithm also does not affect the Error Index. As depicted in the 2-component PCA visualization of Fig. 3 (PC1 and PC2 explain 61% and 23% of the variability, respectively), a new cluster (blue) is created

with this new solution, whereas the existing clusters are not changed. However, adding a few implementations of the selection sort algorithm increases the Error Index. As its implementation is semantically similar to insertion sort, they are both assigned to the same cluster until there are enough samples to form a new cluster. For instance, including 3 implementations results in an Error Index of 0.03 (3 incorrectly grouped solutions out of 103).

Fig. 3 2-Component PCA visualization of the clustering of implementations of sorting algorithms, adding a single radix sort implementation

To demonstrate how AsanasCluster handles unrelated solutions submitted intentionally or accidentally, we have added a breadth-first search implementation to the clustering model of the sorting algorithms. The result is depicted in the 2-component PCA visualization of Fig. 4. (PC1 and PC2 explain 63% and 21% of the variability, respectively.) The unrelated solution is isolated in a new cluster (gray area), while other clusters remain unaffected. Therefore, the Error Index is 0.

Fig. 4 2-Component PCA visualization of the clustering of implementations of sorting algorithms, adding a breadth-first search implementation

5.3 Discussion

There are a few tools in the literature that introduce clustering approaches comparable to the one described in this paper, namely SemCluster [37], OverCode [14], and CLARA [17]. The evaluation of SemCluster includes a comparison with the latter two tools (OverCode and CLARA). Even though computing the similarity between two small implementations, such as those referred to in Table 2, can be done in a short period of time, clustering a new solution requires pairwise comparisons between the new solution and each of the existing ones. The evaluation of SemCluster demonstrates that this has a tremendous impact on the runtime of these approaches (e.g., the CLARA tool can take more than 100 min for programs with less than 100 lines of code).

SemCluster reveals better runtime performance and precision in identifying different algorithmic solution strategies than the existing tools [37]. Nevertheless, none of the proposed approaches is incremental, i.e., they require rebuilding the clustering model on every new submission. This takes much more than a minute in any of the analyzed tools, even for small programs with less than 50 lines of code. Assuming that representations of source code are stored between model training sessions, SemCluster still has a median runtime of 18 s to recalculate the model when adding an element, possibly reaching 30 s.

Due to the unavailability of the dataset used in [37], the evaluation of the runtime performance described in this paper applies our tool to identical tasks but using a different, publicly available, dataset [36]. This dataset contains 16 assignments of various complexities, delivered at multiple stages of undergraduate CS courses, solved with several distinct algorithms, and with implementations written in C, C++, Java, and Python. The composition of the dataset is fairly similar to the dataset used in [37], as shown in Table 2. In these conditions, our tool has median runtimes of 4 (Python) and 5 (C, C++, and Java) seconds to identify the cluster of a new solution and integrate it into the model. In the worst case, it can take up to 7 s.

Regarding the precision in identifying the different algorithmic solution strategies, SemCluster has proven its effectiveness in two tasks. Firstly, it successfully separates 100 solutions to an assignment involving sorting algorithms into 4 clusters, according to the adopted algorithm: bubble sort, quicksort, and none specifically (two clusters). Lastly, it can perfectly partition 100 programs into 2 clusters depending on the searching algorithm applied, i.e., depth-first and breadth-first search. Similarly, the experiment conducted to evaluate

AsanasCluster achieved optimal results in the two tasks, as described in Subsection 5.2.

Therefore, considering its inclusion in the process of automated assessment of programming assignments, AsanasCluster can achieve better performance than the most similar tools presented in the literature. Nevertheless, its main benefits are (1) not requiring the execution of the code to extract the feature vector and (2) being able to start the clustering process from a dataset with only two submissions and to recalculate clusters incrementally.

5.4 Threats to validity

Only a direct comparison with SemCluster [37] (i.e., using the same dataset) would allow us to demonstrate an improvement of the state of the art in terms of runtime and precision in separating algorithmic strategies. Unfortunately, neither the dataset nor the tool is publicly available, and neither was made available upon request to the authors. We have, however, tried to select a similarly complex dataset, with solutions that are, on average, even slightly larger in lines of code.

Furthermore, we have evaluated our approach on small-to-medium size programs typically found in introductory programming problems. While this is in line with related work, we aim to validate the extension of our approach to larger programs, as found in more advanced courses, in future work.

6 Conclusion

This paper presents a novel online approach to clustering source code to support the automated assessment of programming assignments, based on quantitative program features extracted from the programs' semantic graph representations, namely the CFG and the DFG. This approach aims to (1) generate a number of clusters close to the number of different algorithmic solution strategies, (2) avoid expensive pairwise computations, and (3) learn incrementally, i.e., every solution processed becomes part of the model's "knowledge" for subsequent observations.

Even though the evaluation presents some building times close to 10 min (see Table 3), building a model from scratch is a step performed only once (when loading the programming assignment) in online clustering approaches, with no effect on a submission's assessment time. In fact, the assessment of a submission involves recalculating centroids to include the new observation and/or determining the closest cluster, which takes under 7 s in all trials performed (see Table 4). Such a delay (well below one minute) is acceptable for the automated assessment of programming assignments. Furthermore, the experiments conducted to measure the precision of clustering reveal great accuracy in separating different algorithmic strategies.

Our goal is to integrate AsanasCluster as the first step of our workflow to repair incorrect student attempts. For a given programming assignment, we rely on AsanasCluster to cluster the correct student solutions. Given an incorrect student program, we identify the cluster of solutions most similar to the submitted program and compare it against one of the solutions in the selected cluster, generating the most pertinent modifications that get the student to the correct solution. Furthermore, the tool has the potential to be applied in many other automated reasoning tasks in programming education and beyond (e.g., learning analytics, similarity detection, and fault localization).

Acknowledgements Not applicable.

Author Contributions JCP, JPL, and ÁF contributed to conceptualization, methodology, project administration, validation, and visualization; JCP was involved in data curation, funding acquisition, investigation, resources, and writing—original draft, and provided software; JPL and ÁF contributed to supervision. All authors have read and agreed to the published version of the manuscript.

Funding Open access funding provided by FCT|FCCN (b-on). J.C.P.'s work is funded by the FCT—Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology), Portugal—for the PhD Grant 2020.04430.BD.

Availability of data and materials The datasets generated and/or analyzed during the current study are available in the PROGpedia repository, https://doi.org/10.5281/zenodo.7449056.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Ala-Mutka, K.M.: A survey of automated assessment approaches for programming assignments. Comput. Sci. Educ. 15(2), 83–102 (2005). https://doi.org/10.1080/08993400500150747

2. Bennedsen, J., Caspersen, M.E.: Failure rates in introductory programming. SIGCSE Bull. 39(2), 32–36 (2007). https://doi.org/10.1145/1272848.1272879
3. Bottou, L., Bengio, Y.: Convergence properties of the k-means algorithms. In: Proceedings of the 7th International Conference on Neural Information Processing Systems, pp. 585–592. MIT Press, Cambridge, MA, USA, NIPS'94 (1994)
4. Chae, D.K., Ha, J., Kim, S.W., et al.: Software plagiarism detection: a graph-based approach. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 1577–1580. Association for Computing Machinery, New York, NY, USA, CIKM '13 (2013). https://doi.org/10.1145/2505515.2507848
5. Chen, R., Hong, L., Lu, C., et al.: Author identification of software source code with program dependence graphs. In: Proceedings of the 2010 IEEE 34th Annual Computer Software and Applications Conference Workshops, pp. 281–286. IEEE Computer Society, USA, COMPSACW '10 (2010). https://doi.org/10.1109/COMPSACW.2010.56
6. Chow, S., Yacef, K., Koprinska, I., et al.: Automated data-driven hints for computer programming students. In: Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization, pp. 5–10. Association for Computing Machinery, New York, NY, USA, UMAP '17 (2017). https://doi.org/10.1145/3099023.3099065
7. Cosma, G., Joy, M.: An approach to source-code plagiarism detection and investigation using latent semantic analysis. IEEE Trans. Comput. 61(3), 379–394 (2012). https://doi.org/10.1109/TC.2011.223
8. Drummond, A., Lu, Y., Chaudhuri, S., et al.: Learning to grade student programs in a massive open online course. In: Proceedings of the 2014 IEEE International Conference on Data Mining, pp. 785–790. IEEE Computer Society, USA, ICDM '14 (2014). https://doi.org/10.1109/ICDM.2014.142
9. Durić, Z., Gašević, D.: A source code similarity system for plagiarism detection. Comput. J. 56(1), 70–86 (2012). https://doi.org/10.1093/comjnl/bxs018
10. Elmaleh, J., Shankararaman, V.: Improving student learning in an introductory programming course using flipped classroom and competency framework. In: 2017 IEEE Global Engineering Education Conference (EDUCON), pp. 49–55. IEEE, Athens, Greece (2017). https://doi.org/10.1109/EDUCON.2017.7942823
11. Emerson, A., Smith, A., Rodriguez, F.J., et al.: Cluster-based analysis of novice coding misconceptions in block-based programming. In: Proceedings of the 51st ACM Technical Symposium on Computer Science Education. Association for Computing Machinery, New York, NY, USA, SIGCSE '20, pp. 825–831 (2020). https://doi.org/10.1145/3328778.3366924
12. Feautrier, P.: Dataflow analysis of array and scalar references. Int. J. Parallel Prog. 20(1), 23–53 (1991). https://doi.org/10.1007/BF01407931
13. Fraunhofer AISEC: Code Property Graph (2023). https://github.com/Fraunhofer-AISEC/cpg. Accessed 20 May 2023
14. Glassman, E.L., Scott, J., Singh, R., et al.: OverCode: visualizing variation in student solutions to programming problems at scale. ACM Trans. Comput. Hum. Interact. 22(2), 25 (2015). https://doi.org/10.1145/2699751
15. Gross, S., Zhu, X., Hammer, B., et al.: Cluster based feedback provision strategies in intelligent tutoring systems. In: Cerri, S.A., Clancey, W.J., Papadourakis, G., et al. (eds.) Intelligent Tutoring Systems, pp. 699–700. Springer, Berlin (2012). https://doi.org/10.1007/978-3-642-30950-2_127
16. Gross, S., Mokbel, B., Hammer, B., et al.: Towards providing feedback to students in absence of formalized domain models. In: Lane, H.C., Yacef, K., Mostow, J., et al. (eds.) Artificial Intelligence in Education, pp. 644–648. Springer, Berlin (2013). https://doi.org/10.1007/978-3-642-39112-5_79
17. Gulwani, S., Radiček, I., Zuleger, F.: Automated clustering and program repair for introductory programming assignments. In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 465–480. Association for Computing Machinery, New York, NY, USA, PLDI 2018 (2018). https://doi.org/10.1145/3192366.3192387
18. Head, A., Glassman, E., Soares, G., et al.: Writing reusable code feedback at scale with mixed-initiative program synthesis. In: Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale, pp. 89–98. Association for Computing Machinery, New York, NY, USA, L@S '17 (2017). https://doi.org/10.1145/3051457.3051467
19. Huang, J., Piech, C., Nguyen, A., et al.: Syntactic and functional variability of a million code submissions in a machine learning MOOC. In: Walker, E., Looi, C. (eds.) Proceedings of the Workshops at the 16th International Conference on Artificial Intelligence in Education AIED 2013, CEUR Workshop Proceedings, vol. 1009. CEUR-WS.org, Memphis, TN, USA, pp. 25–32 (2013). https://ceur-ws.org/Vol-1009/0105.pdf
20. Inoue, U., Wada, S.: Detecting plagiarisms in elementary programming courses. In: 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, pp. 2308–2312. IEEE, Chongqing, China (2012). https://doi.org/10.1109/FSKD.2012.6234186
21. Jhi, Y.C., Wang, X., Jia, X., et al.: Value-based program characterization and its application to software plagiarism detection. In: Proceedings of the 33rd International Conference on Software Engineering. Association for Computing Machinery, New York, NY, USA, ICSE '11, pp. 756–765 (2011). https://doi.org/10.1145/1985793.1985899
22. Kaleeswaran, S., Santhiar, A., Kanade, A., et al.: Semi-supervised verified feedback generation. In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. Association for Computing Machinery, New York, NY, USA, FSE 2016, pp. 739–750 (2016). https://doi.org/10.1145/2950290.2950363
23. Kirch, W. (ed.): Pearson's Correlation Coefficient, pp. 1090–1091. Springer, Dordrecht (2008)
24. Koivisto, T., Hellas, A.: Evaluating CodeClusters for effectively providing feedback on code submissions. In: 2022 IEEE Frontiers in Education Conference (FIE), pp. 1–9. IEEE (2022). https://doi.org/10.1109/FIE56618.2022.9962751
25. Leal, J.P., Silva, F.: Mooshak: a web-based multi-site programming contest system. Softw. Pract. Exp. 33(6), 567–581 (2003). https://doi.org/10.1002/spe.522
26. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theor. 28(2), 129–137 (1982). https://doi.org/10.1109/TIT.1982.1056489
27. Luo, L., Zeng, Q.: Solminer: mining distinct solutions in programs. In: Proceedings of the 38th International Conference on Software Engineering Companion. Association for Computing Machinery, New York, NY, USA, ICSE '16, pp. 481–490 (2016). https://doi.org/10.1145/2889160.2889202
28. Luo, L., Ming, J., Wu, D., et al.: Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. Association for Computing Machinery, New York, NY, USA, FSE 2014, pp. 389–400 (2014). https://doi.org/10.1145/2635868.2635900
29. Luxton-Reilly, A., Denny, P., Kirk, D., et al.: On the differences between correct student solutions. In: Proceedings of the 18th ACM Conference on Innovation and Technology in Computer Science Education. Association for Computing Machinery, New York, NY, USA, ITiCSE '13, pp. 177–182 (2013). https://doi.org/10.1145/2462476.2462505
30. Luxton-Reilly, A., Simon, Albluwi, I., et al.: Introductory programming: a systematic literature review. In: Proceedings Companion of the 23rd Annual ACM Conference on Innovation and Technology in Computer Science Education. Association for Computing Machinery, New York, NY, USA, ITiCSE 2018 Companion, pp. 55–106 (2018). https://doi.org/10.1145/3293881.3295779
31. Moussiades, L., Vakali, A.: PDetect: a clustering approach for detecting plagiarism in source code datasets. Comput. J. 48(6), 651–661 (2005). https://doi.org/10.1093/comjnl/bxh119
32. Nguyen, A., Piech, C., Huang, J., et al.: Codewebs: scalable homework search for massive open online programming courses. In: Proceedings of the 23rd International Conference on World Wide Web. Association for Computing Machinery, New York, NY, USA, WWW '14, pp. 491–502 (2014). https://doi.org/10.1145/2566486.2568023
33. Ohmann, T., Rahal, I.: Efficient clustering-based source code plagiarism detection using PIY. Knowl. Inf. Syst. 43(2), 445–472 (2014). https://doi.org/10.1007/s10115-014-0742-2
34. Ohmann, T., Rahal, I.: Efficient clustering-based source code plagiarism detection using PIY. Knowl. Inf. Syst. 43(2), 445–472 (2015). https://doi.org/10.1007/s10115-014-0742-2
35. Paiva, J.C., Leal, J.P., Figueira, A.: Automated assessment in computer science education: a state-of-the-art review. ACM Trans. Comput. Educ. (2022). https://doi.org/10.1145/3513140
36. Paiva, J.C., Leal, J.P., Figueira, Á.: Progpedia: collection of source-code submitted to introductory programming assignments. Data Brief 46, 108887 (2023). https://doi.org/10.1016/j.dib.2023.108887
37. Perry, D.M., Kim, D., Samanta, R., et al.: SemCluster: clustering of imperative programming assignments based on quantitative semantic features. In: Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. Association for Computing Machinery, New York, NY, USA, PLDI 2019, pp. 860–873 (2019). https://doi.org/10.1145/3314221.3314629
38. Piech, C., Huang, J., Nguyen, A., et al.: Learning program embeddings to propagate feedback on student code. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning, vol. 37, pp. 1093–1102. JMLR.org, ICML'15 (2015)
39. Poon, J.Y., Sugiyama, K., Tan, Y.F., et al.: Instructor-centric source code plagiarism detection and plagiarism corpus. In: Proceedings of the 17th ACM Annual Conference on Innovation and Technology in Computer Science Education. Association for Computing Machinery, New York, NY, USA, ITiCSE '12, pp. 122–127 (2012). https://doi.org/10.1145/2325296.2325328
40. Pu, Y., Narasimhan, K., Solar-Lezama, A., et al.: Sk_p: a neural program corrector for MOOCs. In: Companion Proceedings of the 2016 ACM SIGPLAN International Conference on Systems, Programming, Languages and Applications: Software for Humanity. Association for Computing Machinery, New York, NY, USA, SPLASH Companion 2016, pp. 39–40 (2016). https://doi.org/10.1145/2984043.2989222
41. Rivers, K., Koedinger, K.R.: A canonicalizing model for building programming tutors. In: Cerri, S.A., Clancey, W.J., Papadourakis, G., et al. (eds.) Intelligent Tutoring Systems, pp. 591–593. Springer, Berlin (2012). https://doi.org/10.1007/978-3-642-30950-2_80
42. Rivers, K., Koedinger, K.R.: Automatic generation of programming feedback: a data-driven approach. In: The First Workshop on AI-supported Education for Computer Science (AIEDCS 2013), pp. 50–59. Memphis, USA (2013)
43. Sculley, D.: Web-scale k-means clustering. In: Proceedings of the 19th International Conference on World Wide Web. Association for Computing Machinery, New York, NY, USA, WWW '10, pp. 1177–1178 (2010). https://doi.org/10.1145/1772690.1772862
44. Wang, K., Singh, R., Su, Z.: Dynamic neural program embedding for program repair (2018). https://doi.org/10.48550/arXiv.1711.07163
45. Wang, K., Singh, R., Su, Z.: Search, align, and repair: data-driven feedback generation for introductory programming exercises. In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. Association for Computing Machinery, New York, NY, USA, PLDI 2018, pp. 481–495 (2018). https://doi.org/10.1145/3192366.3192384
46. Weiss, K., Banse, C.: A language-independent analysis platform for source code (2022). https://doi.org/10.48550/arXiv.2203.08424
47. Xu, S., Chee, Y.S.: Transformation-based diagnosis of student programs for programming tutoring systems. IEEE Trans. Softw. Eng. 29(4), 360–384 (2003). https://doi.org/10.1109/TSE.2003.1191799
48. Yamaguchi, F., Golde, N., Arp, D., et al.: Modeling and discovering vulnerabilities with code property graphs. In: 2014 IEEE Symposium on Security and Privacy, pp. 590–604. IEEE, Berkeley, CA, USA (2014). https://doi.org/10.1109/SP.2014.44
49. Zhang, F., Wu, D., Liu, P., et al.: Program logic based software plagiarism detection. In: 2014 IEEE 25th International Symposium on Software Reliability Engineering, pp. 66–77. IEEE, Naples, Italy (2014). https://doi.org/10.1109/ISSRE.2014.18
50. Ďuračík, M., Kršák, E., Hrkút, P.: Scalable source code plagiarism detection using source code vectors clustering. In: 2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS), pp. 499–502. IEEE, Beijing, China (2018). https://doi.org/10.1109/ICSESS.2018.8663708

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
