to detect the maximal identical subsequences, it compares two token sequences obtained by linearization. The detected subsequences are then deleted from both sequences, and these two steps are repeated until no similar subsequences remain. JPlag [11] adopts this technique.
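The following is a naive sketch of this iterative detect-and-delete idea over token lists: each round finds the longest common contiguous run of tokens and removes it from both sequences, stopping when the run falls below a minimum length. The function name, the dynamic-programming search, and the min_len threshold are illustrative assumptions; this is not JPlag's greedy-string-tiling implementation.

```python
def common_token_runs(a, b, min_len=3):
    """Iteratively detect and delete maximal identical token runs.

    `a` and `b` are token lists obtained by linearization. Each round finds
    the longest common contiguous run, records it, and deletes it from both
    sequences; the loop stops when no run of at least `min_len` tokens is
    left. A naive O(n*m) illustration, not JPlag's implementation.
    """
    runs = []
    while True:
        best_len, best_i, best_j = 0, -1, -1
        # Longest common substring by dynamic programming.
        table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    table[i][j] = table[i - 1][j - 1] + 1
                    if table[i][j] > best_len:
                        best_len, best_i, best_j = table[i][j], i, j
        if best_len < min_len:
            break
        runs.append(a[best_i - best_len:best_i])
        # Delete the detected run from both sequences and repeat.
        a = a[:best_i - best_len] + a[best_i:]
        b = b[:best_j - best_len] + b[best_j:]
    return runs
```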
Local alignment [14] is another technique based on linearization. Alignment is a very popular method for comparing two DNA sequences in computational biology. Local alignment finds the similar parts of two sequences by deleting some symbols and inserting gap symbols; a gap symbol can be considered a wildcard that matches any symbol. The method used in this paper is based on local alignment.
Quite recently, a very different approach based on Kolmogorov complexity has been proposed. The Kolmogorov complexity of a sequence can be regarded as the amount of information contained in that sequence. The basic idea is to compare the compressed results of two programs obtained with a specific compression algorithm, for example an LZW-type compression algorithm [2]. This method can filter out redundant code blocks, but it cannot filter out unreachable code blocks that are attached merely to pad the program. Hence, an additional tool for removing unreachable code blocks is needed when using this method.
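The following is a minimal sketch of this compression-based idea, using a general-purpose compressor from the standard library (zlib rather than the LZW variant used in [2]); the similarity formula is a common normalized-compression form and is not necessarily the exact measure of [2].

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a rough stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def compression_similarity(p1: str, p2: str) -> float:
    """Similarity of two programs via shared compressible information.

    If p1 and p2 share a lot of structure, compressing their concatenation
    costs little more than compressing the larger one alone. This is a
    normalized-compression-distance style sketch, not the measure of [2].
    """
    a, b = p1.encode(), p2.encode()
    ca, cb, cab = compressed_size(a), compressed_size(b), compressed_size(a + b)
    ncd = (cab - min(ca, cb)) / max(ca, cb)   # 0 = near-identical, ~1 = unrelated
    return 1.0 - ncd
```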
There are also other kinds of structural comparison methods, such as the common interval method [13] or syntax tree comparison [1, 15]. Our approach is similar to these methods in that it reflects the structure of the source code, but it differs in that it is a merged approach, applying local alignment to structure-preserving canonical forms.

Mishne and de Rijke have proposed a source code retrieval method using conceptual graphs [9]. A conceptual graph is similar to an abstract syntax tree, yet it may contain additional information such as comments. This approach seems effective for retrieving source codes from a large database, but it is not precise enough to locate the very similar parts within the retrieved source codes.

3 Source Code Management Procedure

Our method is based on local alignment. The unique characteristic of our approach is a canonical form proposed to cope with structure-aware plagiarism attacks. The canonical form is a variant of a control flow graph that contains no back edges. Figure 1 shows the overall steps of our approach.

Figure 1. Linearizing scheme using canonical representation of source codes

Before the linearization step, we derive canonical forms from the given source codes. These canonical forms cannot be obtained at the lexical level; parsing is required to construct them from the source code. A parser reads the token stream and then generates the corresponding syntax tree, but our system generates a canonical graph rather than a syntax tree. The syntax tree may be useful for generating object code, but since the token stream must be regenerated, a series-parallel graph is adopted instead of the syntax tree.

Once the canonical forms are generated, the remaining steps are straightforward and largely based on local alignment. The pairs of program DNAs are generated during the linearization step. Then the local alignment algorithm is applied to each pair of program DNAs to obtain similar subsequences, and the similarity score is reported according to the length of the detected similar region.

To compare two token sequences, we have adopted local alignment, specifically the Smith-Waterman algorithm [14]. Given two strings over a finite alphabet, a local alignment algorithm finds the most similar subparts that are common to both strings. These subparts need not be identical, because some mismatches are allowed if the matches surrounding them are long enough to compensate.

To find the most similar part of two strings, we need a specific scoring scheme. Usually, an integer scoring scheme is used: +1 for a match, −1 for a mismatch, and −2 for a match against a gap symbol. A gap symbol is like a wildcard that can be matched to any symbol.
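To make the scoring concrete, here is a minimal sketch of Smith-Waterman local alignment over token sequences with this +1/−1/−2 scheme, together with the normalized similarity SIM defined below. The function names and the dynamic-programming formulation are illustrative, not the paper's implementation.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score of two token sequences.

    Scoring follows the scheme in the text: +1 for a match, -1 for a
    mismatch, and -2 when a token is aligned against a gap symbol.
    """
    p, q = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (q + 1) for _ in range(p + 1)]
    best = 0
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,    # align a[i-1] with a gap
                          H[i][j - 1] + gap)    # align b[j-1] with a gap
            best = max(best, H[i][j])
    return best


def sim(a, b):
    """Normalized similarity: 2*Score(A,B) / (Score(A,A) + Score(B,B))."""
    return 2.0 * local_alignment_score(a, b) / (
        local_alignment_score(a, a) + local_alignment_score(b, b))
```

For two identical token sequences sim returns 1.0 (i.e., 100%), and for sequences sharing no tokens it returns 0.0, matching the behaviour of SIM described below.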
The similarity matrix M is defined as follows:

M(k_i, k_j) =
  +1  if k_i = k_j
  −1  if k_i ≠ k_j
  −2  if k_i = _ or k_j = _ (a gap)

Once the similarity matrix is defined, the local alignment algorithm can locate the most similar part of any two given strings. The programs to compare are A = a_1, a_2, ..., a_p and B = b_1, b_2, ..., b_q. The local alignment algorithm align locates a pair of subsequences of the same length, align(A, B) = (a_1, b_1), (a_2, b_2), ..., (a_m, b_m) (in fact, the lengths of the subsequences may differ because of gap symbols), and the score of this interval is defined as:

Score(A, B) = Σ_{(a,b) ∈ align(A,B)} M(a, b)

After the similarity score is evaluated, the normalized similarity can be computed. Though a normalized similarity can be defined in several different ways, the following normalization is generally used:

SIM(A, B) = 2 · Score(A, B) / (Score(A, A) + Score(B, B))

If two strings A and B are exactly equal, then SIM(A, B) is 100%. If two strings have no substrings in common, then SIM(A, B) is 0%. This measure has a weakness in that it can be substantially lowered if a cheater inserts a lot of unused code.

The similarity of two programs can be computed after the linearization step. In Figure 1, the linearization and the derivation of the canonical forms are shown as separate steps. They are conceptually separate, but the two steps can be effectively incorporated into a single step. We describe this merged step as a linearization function; the formal definition of this canonical linearization function is given in Section 4.

4 Canonical Representation of Source Code

4.1 Typical Techniques in Code Plagiarism

Parker and Hamblen [10] defined a plagiarized program as a program that has been produced from another program, namely an original program, by a small number of syntactic transformations and without understanding the meaning or internal logic of the program. Since these code transformation methods are well known, students can easily obtain a plagiarized program by applying them to an original source. Edward Jones has described a set of code plagiarism techniques [7].

We found that the most common code plagiarism uses simple syntactic transformations of if-then-else clauses and for-loop-to-while-loop transformations. For example, we can transform if (A) then B; else C; into if (¬A) then C; else B; without understanding the internal logic. In a similar way, for (A; B; C) {D} can be transformed deterministically into A; while (B) {D; C;}. Such syntactic perturbation at the source level cannot easily be detected by manual inspection if the code contains complicated statements.

We explained in Section 3 how to extract the program DNA by scanning the whole source code. If we apply the naive scanning method used in our previous paper [5], the transformation of if (A) then B; else C; into if (¬A) then C; else B; generates quite a different code DNA, even though the two codes are essentially the same.

To overcome this kind of syntactic plagiarism, we propose a novel canonical representation for code. Using this canonical code transformation, a set of syntactically transformed codes converges to a unique canonical form, so we can improve the similarity value of two different codes that follow the same program logic. We propose the series-parallel graph model for the canonical representation of programs.

4.2 SP-Graph Model for Canonical Representation

Every procedural language (say, C/C++, Pascal, or Java) consists of three types of basic building blocks: sequential blocks of statements, conditional (multi-way) branches (if-then-else or switch-case), and iterative looping structures (for or while loops). The control structure of a source code can be nicely modeled by a series-parallel graph (SP-graph). We represent a sequential or looping block as a series edge in the SP-graph and a conditional branch as parallel edges. Let us give a formal definition of an SP-graph:

Definition 1 A graph is a series-parallel graph if it can be turned into K_2, the complete graph on two vertices, by a sequence of the following operations: (1) replacement of a pair of parallel edges with a single edge that connects their common endpoints; and (2) replacement of a pair of edges incident to a vertex of degree 2 with a single edge that connects the two other endpoints of the original edges.

So we are able to freely rearrange a set of parallel edges and blocks in an SP-graph, which is the main idea of our code canonicalization.
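As a concrete reading of Definition 1, the following small sketch applies the two reduction operations to an undirected multigraph given as an edge list; the representation and the function name are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

def is_series_parallel(edges):
    """Reduce parallel and series edges until only K2 remains (Definition 1).

    `edges` is a list of 2-tuples of vertex names, read as an undirected
    multigraph. Illustrative sketch only, not the paper's implementation.
    """
    edges = [tuple(sorted(e)) for e in edges]
    changed = True
    while changed:
        changed = False
        # Operation (1): merge parallel edges that share both endpoints.
        merged = list(Counter(edges))
        if len(merged) < len(edges):
            edges, changed = merged, True
        # Operation (2): splice out one vertex of degree 2.
        degree = Counter(v for e in edges for v in e)
        for v in degree:
            incident = [e for e in edges if v in e]
            if degree[v] == 2 and len(incident) == 2:
                u = incident[0][0] if incident[0][1] == v else incident[0][1]
                w = incident[1][0] if incident[1][1] == v else incident[1][1]
                if u != w:  # do not create a self-loop
                    edges = [e for e in edges if v not in e]
                    edges.append(tuple(sorted((u, w))))
                    changed = True
                    break
    # Series-parallel iff a single edge between two distinct vertices remains.
    return len(edges) == 1 and edges[0][0] != edges[0][1]
```

For example, two parallel branches between an entry and exit vertex followed by a series edge, is_series_parallel([("P", "Q"), ("P", "Q"), ("Q", "E")]), reduce to K_2 and the check returns True.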
Figure 2 shows the corresponding SP-graph components for a sequential block, an if-then-else block, and a switch-case block.

Figure 2. An SP-graph for (a) sequential blocks, (b) an if-then-else block, (c) a switch-case block

Let us explain the power of SP-graph modeling using a fairly complex example. Figure 3 shows the structure of an example source code, CodeA, which involves deeply nested if-then-else and switch-case blocks. The rightmost subblock, from D to F, is a basic if-then-else block; let us first construct the canonical form for this block. Since M precedes N, the dna() of this block is DMNF.

Now consider the outer if-then-else block, from C to Q. The dna of the left parallel edge is LE and the dna of the right edge is DMNF, so the whole dna from block C to Q is C DMNF LE Q (spaces are inserted only to aid readability).

In a similar way, let us compute the dna of the switch block, from H to R. The set of dna's of its four parallel edges is {O, BS, KIXY, J}. Sorting these four dna's gives the sequence BS J KIXY O, so the entire dna from H to R is HBSJKIXYOR, and the dna() of the outermost left edge is GHBSJKIXYORUTVW. The canonical dna of the whole source CodeA is therefore ACDMNFLEQPGHBSJKIXYORUTVWZ.

If we do not use the canonical form, the dna of CodeA might instead be AGHOBSKXIYJRUTVWCLEDMNFQPZ, which is quite different from ACDMNFLEQPGHBSJKIXYORUTVWZ. It is easy to see that this canonicalization of program code prevents simple syntactic plagiarism such as reordering of if-then-else and switch-case clauses.
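The recursive construction used in this worked example can be sketched as follows. The nested series/parallel representation and the function name dna are illustrative assumptions; parallel branches are sorted lexicographically here, matching the ordering used in the example above.

```python
def dna(block):
    """Canonical linearization (program DNA) of a nested SP-graph block.

    A block is either a token string (a series edge), or a tuple
    ("seq", [children]) for blocks composed in series, or
    ("par", [children]) for parallel branches (if-then-else, switch-case).
    Parallel branches are sorted by their own dna, so reordering the
    branches in the source code cannot change the result.
    """
    if isinstance(block, str):
        return block
    kind, children = block
    parts = [dna(c) for c in children]
    if kind == "par":
        parts = sorted(parts)   # canonical order of parallel branches
    return "".join(parts)

# The worked example: the D-F if-then-else block, and the outer C-Q block
# that contains it as one of its parallel branches.
d_to_f = ("seq", ["D", ("par", ["M", "N"]), "F"])
c_to_q = ("seq", ["C", ("par", ["LE", d_to_f]), "Q"])
print(dna(d_to_f))   # -> DMNF
print(dna(c_to_q))   # -> CDMNFLEQ
```

Swapping the branches inside any "par" node, for instance writing ("par", ["N", "M"]), yields exactly the same canonical dna, which is why clause reordering is neutralized.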
Figure 3. The series-parallel graph for a fairly complex source code. The canonical dna for this program is ACDMNFLEQPGHBSJKIXYORUTVWZ

5 Experiments

For each group, there are 5 independent programs (IP) and 10 plagiarized programs (PP) derived from the 5 IP codes. We therefore obtain 105 pairs of code dna for comparison (15 programs yield 15·14/2 = 105 pairs), and each testing group contains 10 pairs of plagiarized programs.

In the sensitivity (SN) and specificity (SP) measures plotted in Figure 4, P(k) denotes the set of programs from the first-ranked to the k-th-ranked program, RP denotes the number of real plagiarized pairs in a testing group, and N denotes the total number of programs in the testing group.

Figure 4 shows that our canonical representation is superior to JPlag in terms of sensitivity and specificity. We also conducted another interesting experiment: the second author modified the source code shown in Figure 3 by transforming all if-then-else and switch-case clauses. Our similarity measure using the canonical representation gives 92.4% between the two codes, while JPlag estimated the similarity at 47.3%.

We have also investigated one pair of programs in group 4, where a student transformed a switch clause into a few if-then-else clauses. Our system successfully detected this plagiarism, since such a transformation has no effect on the canonical representation.

Since our system resolves macro expansions (e.g., the #define feature), this kind of attack does not work against it, whereas JPlag does not cope with macro expansion attacks. For one extreme case using a macro expansion attack found in group ICPC06-04, our detection system reported 98.4% similarity, while JPlag reported only 46.2%.

6 Conclusion

Plagiarism detection in source code is an essential facility in an e-learning programming education system. In this paper, we have introduced an improved plagiarism detection method, which is based on canonical forms of source codes. Let us summarize the main contributions of our paper.
Figure 4. Experimental results measuring sensitivity (SN) and specificity (SP) against rank, for our method and JPlag, on the ICPC06 groups: (a) ICPC06-01 Group, (b) ICPC06-02 Group, (c) ICPC06-03 Group, (d) ICPC06-04 Group

References

[1] I. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In Proceedings of the International Conference on Software Maintenance, pages 368–377. IEEE Computer Society Press, 1998.
[2] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker. Shared information and program plagiarism detection. IEEE Transactions on Information Theory, 50(7):1545–1551, 2004.
[3] C. Daly and J. Horgan. A technique for detecting plagiarism in computer code. The Computer Journal, 48(6):662–666, 2005.
[4] D. Gitchell and N. Tran. Sim: a utility for detecting similarity in computer programs. In SIGCSE, pages 266–270, 1999.
[5] J.-H. Ji, G. Woo, and H.-G. Cho. A source code linearization technique for detecting plagiarized programs. ACM SIGCSE Bulletin, 39(3):73–77, 2007.
[6] J. H. Johnson. Identifying redundancy in source code using fingerprints. In CASCON '93, pages 171–183. IBM Press, 1993.
[7] E. L. Jones. Plagiarism monitoring and detection - towards an open discussion. Journal of Computing in Small Colleges, 16(3):229–236, 2001.
[8] S. Mann and Z. Frew. Similarity and originality in code: plagiarism and normal variation in student assignments. In Proceedings of the Eighth Australasian Computing Education Conference, volume 52, pages 143–150. ACS, 2006.
[9] G. Mishne and M. de Rijke. Source code retrieval using conceptual similarity. In Proceedings of RIAO 2004, pages 539–554, 2004.
[10] A. Parker and J. O. Hamblen. Computer algorithms for plagiarism detection. IEEE Transactions on Education, 32(2):94–99, 1989.
[11] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. Journal of Universal Computer Science, 8(11):1016–1038, 2002.
[12] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. In SIGMOD Conference, pages 76–85, 2003.
[13] T. Schmidt and J. Stoye. Quadratic time algorithms for finding common intervals in two and more sequences. In Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching, pages 347–358. Springer, 2004.
[14] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.
[15] J.-W. Son, S.-B. Park, and S.-Y. Park. Program plagiarism detection using parse tree kernels. In PRICAI 2006, pages 1000–1004. Springer, Aug. 2006.
[16] K. L. Verco and M. J. Wise. Software for detecting suspected plagiarism: Comparing structure and attribute-counting systems. In Proceedings of the 1st Australian Conference on Computer Science Education, pages 130–134, Sydney, Australia, July 1996.
[17] M. J. Wise. YAP3: improved detection of similarities in computer program and other texts. In SIGCSE, pages 130–134, 1996.
[18] B. Zeidman. Detecting source-code plagiarism. Dr. Dobb's Journal, July 2004.