Source Code Management System for E-Learning Based Programming Education

Conference Paper · November 2007

DOI: 10.1109/ICDIM.2007.4444250 · Source: IEEE Xplore




Jeong-Hoon Ji, Su-Hyun Park, Gyun Woo, Hwan-Gue Cho


Dept. of Computer Engineering
Pusan National University
Busan, 609-735, Republic of Korea
{jhji,shpark,woogyun,hgcho}@pusan.ac.kr

1-4244-1476-8/07/$25.00 ©2007 IEEE.

Abstract

Content-based data retrieval technology is well known in the
information retrieval area. It assumes that the subject data is
unstructured, but sometimes this data can be structured. This is true
for computer programs. In this paper, we propose a canonical form of
program source codes, which explicitly reveals the program structure.
As an application of the canonical form, program similarity can be
computed with an additional technique named local alignment, which was
previously proposed in computational biology. We have implemented a
source code management system based on this method as a subsystem of
the E-learning system called ESPA, supporting the evaluation of
programs submitted in programming courses. We have compared the
effectiveness of our method with JPlag, one of the most stable
plagiarism detection tools currently in use.

1 Introduction

E-learning systems are gaining importance and are especially effective
in programming courses. Since program source codes are generally less
readable than reports in a natural language, a mechanical tool for
evaluating program source codes is very helpful in maintaining a
programming course. One of the important facilities of an E-learning
system supporting programming courses is automatic similarity
checking, basically plagiarism detection.

Various plagiarism detection techniques and tools have been
introduced, such as YAP [17], MOSS [12], SIM [4], SID [2], CodeMatch
[18], and JPlag [11]. All of these tools have a similar architecture.
They extract data from a pair of source codes and then compare the
resulting data to find the final similarity of the source codes. They
all rely on the correlation between the similarities of the original
codes and those of the extracted data.

Most plagiarism detection tools capture different levels of plagiarism
strategies. The levels of these plagiarism strategies imply the
granularity of information needed to plagiarize without knowing the
internal logic of the program. In this paper, we will discuss the
levels of plagiarism strategies in Section 4.

We also propose a canonical representation of the source codes. The
canonical representation partly reflects the semantics of the control
structures. Therefore, we are able to identify equivalent control
structures by using canonical forms. Our new approach is powerful
enough to capture the attacks of altering control structures.

This paper is organized as follows. In Section 2, we briefly review
the related works. In Section 3, we take a glance at our system and
explain the overall procedure to detect code plagiarism. In Section 4,
we describe how to construct a canonical form from a given source
code. In Section 5, experimental results are described. In Section 6,
we draw a conclusion and sketch further directions.

2 Previous Studies

A basic approach for detecting plagiarism is fingerprinting [6]. A
fingerprint of a program is a specific characteristic value of the
program, such as the frequency vector of keywords. Fingerprinting can
also be considered a kind of software metric. Therefore this technique
is closely related to the software metric comparison method
[8, 16, 3]. This method is very popular because the fingerprint of a
program is very easy to compute. Its effectiveness is limited in that
it cannot locate the plagiarized section, but can only detect the
pairs of plagiarized programs.

Linearization of source codes is generally used to locate the exact
similar parts of two programs. Greedy string tiling is a typical
method based on linearization. In order to detect the maximal
identical subsequences, it compares two token sequences obtained from
linearization. Then the detected subsequences are deleted from both
sequences. These two steps are performed iteratively until no similar
subsequences exist. JPlag [11] adopts this technique.
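The two iterated steps of greedy string tiling described above can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions: the function names, the min_len threshold, and the
coverage-based similarity are hypothetical choices made here, and the
code is not JPlag's implementation, which is more refined.

```python
def longest_common_run(a, b):
    """Return (length, start_a, start_b) of the longest identical contiguous run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1          # extend the run ending at (i-1, j-1)
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def greedy_tiles(a, b, min_len=2):
    """Repeatedly detect the longest common run and delete it from both sequences."""
    a, b = list(a), list(b)
    tiles = []
    while True:
        length, i, j = longest_common_run(a, b)
        if length < min_len:                      # assumption: very short runs are ignored
            break
        tiles.append(a[i:i + length])
        del a[i:i + length]                       # step 2: remove the detected run
        del b[j:j + length]
    return tiles


def tiling_similarity(a, b, min_len=2):
    """Assumed measure: fraction of all tokens covered by the detected tiles."""
    covered = sum(len(t) for t in greedy_tiles(a, b, min_len))
    total = len(a) + len(b)
    return 2 * covered / total if total else 1.0
```

For the token strings "ABCDEFG" and "XXCDEFYYAB", the sketch first
removes the run CDEF and then the run AB, after which no run of length
two or more remains.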
Local alignment [14] is another technique based on lin-
earization. The alignment technique is a very popular
method to compare two DNA sequences in computational
biology. Local alignment finds the similar parts of two se-
quences by deleting some symbols and inserting gap sym-
bols. A gap symbol can be considered a wild card, which
matches any symbols. The method used in this paper is
based on the local alignment.
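As a rough illustration of local alignment on token strings, the
following Python sketch computes a Smith-Waterman [14] style score. It
assumes the integer scoring that Section 3 of this paper describes (+1
match, −1 mismatch, −2 gap) and the normalization given there; the
function names are hypothetical, and the paper's own implementation is
not shown in the text.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of locally aligned substrings of a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # 0 restarts the alignment, which is what makes it local
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def normalized_similarity(a, b):
    """SIM(A, B) = 2 * Score(A, B) / (Score(A, A) + Score(B, B))."""
    denom = local_alignment_score(a, a) + local_alignment_score(b, b)
    return 2 * local_alignment_score(a, b) / denom if denom else 1.0
```

A single mismatch inside a long common region costs one point but does
not break the alignment: "ABCDXEFGH" against "ABCDYEFGH" scores
4 − 1 + 4 = 7 under this scheme.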
Quite recently, a very different approach using Kolmogorov complexity
has been proposed. The Kolmogorov complexity of a sequence can be
considered to indicate the amount of information contained in that
sequence. The basic idea is to compare the compressed results of two
programs using a specific compression algorithm, for example an
LZW-type compression algorithm [2]. This method can filter out
redundant code blocks but is unable to filter out meaninglessly
attached unreachable code blocks. Hence, an additional tool for
removing unreachable code blocks is necessary to use this method.
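The compression idea can be illustrated with a small Python sketch. It
is not the metric of [2]: as a stand-in it uses the normalized
compression distance, with zlib from the Python standard library as an
assumed compressor, and the function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a crude proxy for information content."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small when x and y share much content."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)                  # shared content compresses away
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Two copies of the same program compress almost as well as one copy, so
their distance is close to zero, while unrelated inputs give a
distance close to one.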
There are also other kinds of structural comparison methods, such as
the common interval method [13] or the syntax tree comparison method
[1, 15]. Our approach is similar to these methods in that it reflects
the structure of source codes, but it is different in that it is a
merged approach applying local alignment to structure-preserving
canonical forms.

Mishne and de Rijke have proposed a source code retrieval method using
conceptual graphs [9]. The conceptual graph is similar to the abstract
syntax tree, yet it may contain other information such as comments.
This approach seems effective for retrieving source codes from a large
database of source codes, but it is not precise enough to locate very
similar parts within the retrieved source codes.

3 Source Code Management Procedure

Our method is based on local alignment. The unique characteristic of
our approach is that a canonical form is suggested to cope with
structure-aware plagiarism attacks. The canonical form is a variant of
a control flow graph, which contains no back edges. Figure 1 shows the
overall steps of our approach.

Figure 1. Linearizing scheme using canonical representation of source
codes

Before the linearization steps, we obtain canonical forms from the
given source codes. These canonical forms cannot be attained at the
lexical level. Parsing steps should be involved to construct the
canonical forms from the source codes. A parser reads the token stream
and then generates the corresponding syntax tree. Our system generates
the canonical graph rather than the syntax tree. The syntax tree may
be useful for generating an object code, but since the token stream
should be regenerated, a series-parallel graph is adopted instead of
the syntax tree.

Once the canonical forms are generated, the rest of the steps are
straightforward. These steps are largely based on local alignment. The
pairs of program DNAs are generated during the linearization step.
Then the local alignment algorithm is applied to the pair of program
DNAs to get similar subsequences. The similarity score is reported
according to the length of the detected similar region.

To compare two sequences of tokens, we have adopted a local alignment,
specifically the Smith-Waterman algorithm [14]. Given two strings
consisting of finite symbols, a local alignment algorithm finds the
most similar subparts, which are common to both strings. These
subparts do not need to be equal because some mismatches are allowed
if the matches surrounding the mismatches are long enough to
compensate for the mismatches.

To get the most similar part from two strings, we need a specific
scoring method. Usually, an integer scoring scheme is used: +1 for a
match, −1 for a mismatch, and −2 for a match using a gap symbol. A gap
symbol is like a wildcard
that can be matched to any symbol. The similarity matrix $M$ is
defined as follows:

$$
M(k_i, k_j) =
\begin{cases}
+1 & \text{if } k_i = k_j \\
-1 & \text{if } k_i \neq k_j \\
-2 & \text{if } k_i = \_ \text{ or } k_j = \_
\end{cases}
$$

Once the similarity matrix is defined, the local alignment algorithm
can locate the most similar part of any two given strings. The
programs to compare are $A = \langle a_1, a_2, \ldots, a_p \rangle$
and $B = \langle b_1, b_2, \ldots, b_q \rangle$. Next, the local
alignment algorithm align locates a pair of subsequences of the same
length: $align(A, B) = \langle (a_1, b_1), (a_2, b_2), \ldots,
(a_m, b_m) \rangle$. (In fact, the lengths of the subsequences may not
be equal because of gap symbols.) The score of this interval is
defined as:

$$
Score(A, B) = \sum_{(a, b) \in align(A, B)} M(a, b)
$$

After the similarity score is evaluated, the normalized similarity can
be computed. Though a normalized similarity can be defined in several
different ways, the following normalization is generally used:

$$
SIM(A, B) = \frac{2\, Score(A, B)}{Score(A, A) + Score(B, B)}
$$

If two strings A and B are exactly equal, then SIM(A, B) is 100%. If
two strings have no substrings in common, then SIM(A, B) is 0%. This
measure has a weakness in that it can be substantially lowered if a
cheater inserts a lot of unused code.

The similarity of two programs can be computed after the linearization
step. In Figure 1, we showed that the linearization and the
construction of the canonical forms are separate steps. It is true
that they are conceptually separated, but these two steps can be
effectively incorporated into a single step. We will describe this
merged step as a linearization function. The formal definition of this
canonical linearization function will be explained in Section 4.

4 Canonical Representation of Source Code

4.1 Typical Techniques in Code Plagiarism

Parker and Hamblen [10] defined a plagiarized program as a program
which has been produced from another program, namely an original
program, with a small number of syntactic transformations and without
understanding the meaning or internal logic of the program. Since the
code transformation methods are well known, students can easily obtain
a plagiarized program by applying them to an original source. Edward
Jones has shown a set of code plagiarism techniques [7].

We found that the most common code plagiarism uses a simple syntactic
transformation of the if-then-else clause and the for loop to while
loop transformation. For example, we can simply transform
if (A) then B; else C; to if (¬A) then C; else B; without
understanding the internal logic. In a similar way, for (A;B;C) {D}
can be transformed deterministically into A; while (B) {D; C;}.
However, this syntactic perturbation at the source level cannot easily
be detected by manual inspection if the code has complicated
statements.

We have explained in Section 3 how to extract the program DNA by
scanning the whole source code. If we apply the naive scanning method
which was used in the previous paper [5], the transformation of
if (A) then B; else C; to if (¬A) then C; else B; can generate quite a
different code DNA, although the logic of the code is unchanged.

In order to overcome this kind of syntactic plagiarism, we propose a
novel canonical representation for source codes. If we use this
canonical code transformation, then a set of syntactically transformed
codes converges to a unique canonical form. Thus we can improve the
similarity value of two different codes which follow the same program
logic. We propose that the Series-Parallel Graph Model works well for
the canonical representation of programs.

4.2 SP-Graph Model for Canonical Representation

Every procedural language (say, C/C++, Pascal, Java) consists of three
types of basic building blocks: sequential blocks of statements,
conditional (multi-) branches (if-then-else or switch-case), and
iterative looping structures (for or while loops). The control
structure of a source code can be nicely modeled by a Series-Parallel
graph (SP-graph). We can represent a sequential or looping block as a
serial edge in the SP-graph, and a conditional branch as parallel
edges. Let us give a formal definition of SP-Graph:

Definition 1 A graph is a Series-Parallel Graph if it may be turned
into $K_2$, a complete graph with 2 vertices, by a sequence of the
following operations: (1) replacement of a pair of parallel edges with
a single edge that connects their common endpoints; and (2)
replacement of a pair of edges incident to a vertex of degree 2 with a
single edge that connects the two other endpoints of the original
edges.

So we are able to freely rearrange a set of parallel edges and blocks
in an SP-graph, which is the main idea of our code canonicalization.
In the following figure, we show the
corresponding SP-graph component for sequential block, if-then-else,
and switch-case.

Figure 2. An SP-graph for (a) sequential blocks (b) if-then-else block
(c) switch-case (edge labels: B1, B2, B3; then, else; case 1, case 2,
..., case n)

The canonical representation of program DNA can be obtained using a
recursive procedure. Let dna(B) denote the program DNA of subblock B
in a program. Note that dna(B) is a kind of string or sequence of
integers, so we are able to give a lexicographical ordering to
dna(·). If the set of blocks is sequentially ordered as shown in
Figure 2-(a), the whole dna(B) is the concatenated form of each basic
block's dna, such as
$dna(B) = dna(B_1) \oplus dna(B_2) \oplus dna(B_3)$, where $\oplus$
denotes string concatenation.

In the if-then-else case, we first construct the basic dna(·) of the
two clauses, then and else. Next we construct the whole
dna(if (A) then B else C) as follows:
$dna(A) \oplus first\{dna(B), dna(C)\} \oplus last\{dna(B), dna(C)\}$,
where first{A, B} returns the first element among A and B in terms of
lexicographical order and last returns the last. In a switch-case
clause, the dna(·) of each case block is placed in lexicographical
order.

4.3 Example of Canonical Representation

Let us explain the power of SP-graph modeling using quite a complex
example. Figure 3 shows the structure of an example source code. It
involves intricately nested structures of if-then-else and switch-case
blocks. The rightmost subblock from D to F is a basic if-then-else
block. Let us then make a canonical form for this block. Since M
precedes N, the dna() of this block should be DMNF.

Let us consider the outer if-then-else block, from C to Q. The dna of
the left parallel edge is LE and the dna of the right edge is DMNF, so
the whole dna from block C to Q should be C DMNF LE Q; spaces were
intentionally inserted to help the reader's understanding.

In a similar way, let us compute the dna of the switch block (from H
to R). The set of dna's of the four parallel edges is
{O, BS, KIXY, J}. If we sort these four dna's, then the dna sequence
is BS J KIXY O. Finally we are able to get the entire dna from H to R
as HBSJKIXYOR. So the dna() of the outermost left edge is
GHBSJKIXYORUTVW, and the canonical dna of the whole source CodeA
should be ACDMNFLEQPGHBSJKIXYORUTVWZ.

If we do not consider the canonical form of code dna, then the total
dna of CodeA might be AGHOBSKXIYJRUTVWCLEDMNFQPZ, which is quite
different from ACDMNFLEQPGHBSJKIXYORUTVWZ. We can easily see that this
canonicalization of program code prevents simple syntactic plagiarism
such as reordering of if-then-else and switch-case clauses.

5 Experiments

5.1 Preparing Test Data

We have conducted experiments for detecting plagiarized codes using
canonicalized code dna. Since it is not easy to find real plagiarized
code in a programming class, because cheating in programming
assignments results in a severe penalty, we have prepared a set of
artificially plagiarized codes by asking 10 subject students to
plagiarize sample programs within a fixed time. We selected four
groups from the programs submitted to the 2006 ICPC (International
Collegiate Programming Contest 2006) as shown in Table 1.

Table 1. The experimental data

                            Lines of source code
No  Group     Files  Pairs  Max   Min   Avg.
1   ICPC0601  15     105    234   93    122.07
2   ICPC0602  15     105    110   41     86.73
3   ICPC0603  15     105    215   65    104.33
4   ICPC0604  15     105    227   79    110.47

For each group, there are 5 independent programs (IP) and 10
plagiarized programs (PP) which have been plagiarized from the 5 IP
codes. So we obtain 105 pairs of code dna for comparison. In each
testing group, there are 10 pairs of plagiarized programs.
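The canonical code dna used in these experiments comes from the
recursive dna(·) procedure of Section 4.2. A minimal Python sketch of
that recursion is given here, applied to the CodeA example of
Section 4.3. The Seq and Par classes, the canonical flag, and the
nesting of CODE_A are assumptions made for illustration: the nesting
is read off the two dna strings quoted in Section 4.3, and single
letters stand in for the dna of basic blocks.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Seq:
    """Serial edges: blocks that follow one another."""
    parts: List["Node"]


@dataclass
class Par:
    """Parallel edges: the branches of an if-then-else or switch-case."""
    branches: List["Node"]


Node = Union[str, Seq, Par]  # a str is the dna of one basic block


def dna(node: Node, canonical: bool = True) -> str:
    """Concatenate serial parts; place parallel branches in lexicographic order."""
    if isinstance(node, str):
        return node
    if isinstance(node, Seq):
        return "".join(dna(p, canonical) for p in node.parts)
    branch_dnas = [dna(b, canonical) for b in node.branches]
    if canonical:
        branch_dnas.sort()  # canonical form: branch order in the source is ignored
    return "".join(branch_dnas)


# Assumed structure of CodeA, with branches listed in source order.
CODE_A = Seq(["A", Par([
    Seq(["G", "H",
         Par(["O", Seq(["B", "S"]), Seq(["K", Par(["X", "I"]), "Y"]), "J"]),
         "R", "U", Par(["T", "V"]), "W"]),
    Seq(["C", Par([Seq(["L", "E"]), Seq(["D", Par(["M", "N"]), "F"])]), "Q", "P"]),
]), "Z"])
```

With this structure, dna(CODE_A) yields the canonical string quoted in
Section 4.3, and dna(CODE_A, canonical=False) yields the naive
source-order string. Reordering the branches inside any Par node
changes only the naive result.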
Figure 3. The series-parallel graph for a fairly complex source code
(vertices A through Z). The canonical dna for this program is
ACDMNFLEQPGHBSJKIXYORUTVWZ

5.2 Accuracy Comparison to JPlag

We would like to compare the performance of our system to the JPlag
tool, since JPlag is an excellent tool for code plagiarism detection.
The effectiveness and accuracy of a detection tool are measured by two
values: specificity and sensitivity.

Before evaluating the specificity and sensitivity, we sorted the pairs
of compared programs according to their similarity value. The higher
the rank of a pair of programs, the more probable it is that the two
programs are involved in plagiarism. Assume that we collect the top k
pairs of programs in the sorted pair list in terms of dna(·)
similarity. Then Sensitivity(k) and Specificity(k) are defined as
follows:

$$
Sensitivity(k) = \frac{\#\ \text{of plagiarized pairs in } P(k)}{RP}
$$

$$
Specificity(k) = \frac{\#\ \text{of plagiarized pairs in } P(k)}{|P(k)|}
$$

Here, P(k) denotes the set of k program pairs from the first rank to
the k-th rank; RP denotes the number of real plagiarized pairs in a
testing group; N denotes the total number of programs in the testing
group.

Figure 4 shows that our canonical representation is superior to JPlag
in terms of sensitivity and specificity. We have also conducted
another interesting experiment. The second author modified the source
code shown in Figure 3 by transforming all if-then-else and
switch-case clauses. Our similarity measure using canonical
representation between the two codes gives 92.4%, while JPlag
estimated the similarity at 47.3%.

We have investigated one pair of programs in group 4, where a student
transformed a switch clause into a few if-then-else clauses. Our
system has successfully detected this plagiarism since such a
transformation has no effect on the canonical representation.

Since our system resolves macro expansion (e.g., the #define feature),
this kind of attack does not work, while JPlag does not cope with this
macro expansion attack. For one extreme case using a macro expansion
attack found in Group ICPC0604, our detection system showed 98.4%
similarity, while JPlag showed only 46.2% similarity.

6 Conclusion

Plagiarism detection in source codes is an essential facility in an
E-learning programming education system. In this paper, we have
introduced an improved plagiarism detection method, which is based on
canonical forms of source codes. Let us summarize the main
contributions of our paper.

• In order to detect plagiarism, we introduced a similarity measure
  between two different programs by exploiting local alignment, which
  is mainly used in biological sequence analysis.

• We showed that this local alignment methodology is quite reliable in
  computing the plagiarism similarity among source codes.

• The canonical representation scheme works quite well to detect the
  typical plagiarism of average students; this scheme is especially
  effective in capturing alterations of the control structure.

• In experiments with some artificially plagiarized programs, our
  system appears to be superior to JPlag in terms of specificity and
  sensitivity.

Currently, we are developing an algorithm to trace the evolution
process of a program by exploiting the analogy of biological
phylogenetic tree construction. It would be interesting to reconstruct
the intermediate processes of program improvement (or plagiarism).
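The Sensitivity(k) and Specificity(k) measures of Section 5.2, which
Figure 4 plots against rank, can be computed as in the following
Python sketch. It treats P(k) as the top k pairs of the list sorted by
similarity; the function name and the toy scores are hypothetical and
are not data from the experiments. With these definitions,
Sensitivity(k) corresponds to what is usually called recall at k and
Specificity(k) to precision at k.

```python
def sensitivity_specificity(scores, plagiarized, k):
    """Return (Sensitivity(k), Specificity(k)) for a dict mapping pair -> similarity."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # highest similarity first
    top_k = ranked[:k]                                     # P(k)
    hits = sum(1 for pair in top_k if pair in plagiarized)
    return hits / len(plagiarized), hits / len(top_k)      # RP = len(plagiarized)


# Toy data, invented for illustration only.
TOY_SCORES = {
    ("p1", "p2"): 0.95,
    ("p3", "p4"): 0.90,
    ("i1", "i2"): 0.60,
    ("p5", "p6"): 0.55,
    ("i3", "i4"): 0.20,
}
TOY_PLAGIARIZED = {("p1", "p2"), ("p3", "p4"), ("p5", "p6")}
```

On the toy data, the top three pairs contain two of the three
plagiarized pairs, so both values are 2/3 at k = 3; at k = 4 the
sensitivity reaches 1 while the specificity is 3/4.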
Figure 4. The experimental results of measuring the Sensitivity and
Specificity for ICPC06 groups: (a) ICPC06-01 Group, (b) ICPC06-02
Group, (c) ICPC06-03 Group, (d) ICPC06-04 Group. Curves: SP(ours),
SN(ours), SP(JPlag), SN(JPlag); x-axis: rank (0 to 100).

References

[1] I. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone
    detection using abstract syntax trees. In Proceedings of the
    International Conference on Software Maintenance, pages 368–377.
    IEEE Computer Society Press, 1998.
[2] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker. Shared
    information and program plagiarism detection. IEEE Trans. on
    Information Theory, 50(7):1545–1551, 2004.
[3] C. Daly and J. Horgan. A technique for detecting plagiarism in
    computer code. Comput. J, 48(6):662–666, 2005.
[4] D. Gitchell and N. Tran. Sim: a utility for detecting similarity
    in computer programs. In SIGCSE, pages 266–270, 1999.
[5] J.-H. Ji, G. Woo, and H.-G. Cho. A source code linearization
    technique for detecting plagiarized programs. ACM SIGCSE Bulletin,
    39(3):73–77, 2007.
[6] J. H. Johnson. Identifying redundancy in source code using
    fingerprints. In CASCON '93, pages 171–183. IBM Press, 1993.
[7] E. L. Jones. Plagiarism monitoring and detection - towards an open
    discussion. J. of Consortium for Computing in Small Colleges,
    16(3):229–236, 2001.
[8] S. Mann and Z. Frew. Similarity and originality in code:
    plagiarism and normal variation in student assignments. In
    Proceedings on Eighth Australasian Computing Education Conference,
    volume 52, pages 143–150. ACS, 2006.
[9] G. Mishne and M. de Rijke. Source code retrieval using conceptual
    similarity. In Proceedings RIAO 2004, pages 539–554, 2004.
[10] A. Parker and J. O. Hamblen. Computer algorithms for plagiarism
    detection. IEEE Transactions on Education, 32(2):94–99, 1989.
[11] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms
    among a set of programs with JPlag. Journal of Universal Computer
    Science, 8(11):1016–1038, 2002.
[12] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: Local
    algorithms for document fingerprinting. In SIGMOD Conference,
    pages 76–85, 2003.
[13] T. Schmidt and J. Stoye. Quadratic time algorithms for finding
    common intervals in two and more sequences. In Proceedings of the
    15th Annual Symposium on Combinatorial Pattern Matching, pages
    347–358. Springer, 2004.
[14] T. F. Smith and M. S. Waterman. Identification of common
    molecular subsequences. Journal of Molecular Biology, 147:195–197,
    1981.
[15] J.-W. Son, S.-B. Park, and S.-Y. Park. Program plagiarism
    detection using parse tree kernels. In PRICAI 2006, pages
    1000–1004. Springer, Aug. 2006.
[16] K. L. Verco and M. J. Wise. Software for detecting suspected
    plagiarism: Comparing structure and attribute-counting systems. In
    Proceedings of the 1st Australian Conference on Computer Science
    Education, pages 130–134, Sydney, Australia, July 1996.
[17] M. J. Wise. Yap3: improved detection of similarities in computer
    program and other texts. In SIGCSE, pages 130–134, 1996.
[18] B. Zeidman. Detecting source-code plagiarism. Dr. Dobb's Journal,
    July 2004.