A Study of Wheat and Chaff in Source Code (1502.01410v1)
Martin Velez∗, Dong Qiu†, You Zhou∗, Earl T. Barr‡, Zhendong Su∗
∗ Department of Computer Science, University of California, Davis, USA
† School of Computer Science and Engineering, Southeast University, China
‡ Department of Computer Science, University College London, UK
Email: {marvelez, yyzhou, su}@ucdavis.edu, dongqiu@seu.edu.cn, e.barr@ucl.ac.uk
Abstract—Natural language is robust against noise. The meaning of many sentences survives the loss of words, sometimes many of them. Some words in a sentence, however, cannot be lost without changing the meaning of the sentence. We call these words "wheat" and the rest "chaff". The word "not" in the sentence "I do not like rain" is wheat and "do" is chaff. For human understanding of the purpose and behavior of source code, we hypothesize that the same holds. To quantify the extent to which we can separate code into "wheat" and "chaff", we study a large (100M LOC), diverse corpus of real-world projects in Java. Since methods represent natural, likely distinct units of code, we use the ∼9M Java methods in the corpus to approximate a universe of "sentences." We "thresh", or lex, functions, then "winnow" them to extract their wheat by computing the function's minimal distinguishing subset (MINSET). Our results confirm that programs contain much chaff. On average, MINSETs have 1.56 words (none exceeds 6) and comprise 4% of their methods. Beyond its intrinsic scientific interest, our work offers the first quantitative evidence for recent promising work on keyword-based programming and insight into how to develop powerful, alternative programming systems.

I. INTRODUCTION

Words are the smallest meaningful units in most languages. We group them into sentences, sentences into paragraphs, and paragraphs into novels and technical papers like this one. Some words in a sentence are more important to its meaning than the others. Indeed, from a few distinctive words in a sentence, we can often guess the meaning of the original sentence.

This paper studies whether this intuitive observation about the importance of some words to the meaning of sentences in a natural language also holds for programming languages.

This work follows recent, seminal studies on the "uniqueness" [9] and the "naturalness" [12] of code. We study a different dimension — the "essence" of code as captured in its syntax and amenable to human interpretation. Our study is inspired by recent work on keyword-based programming [14, 16, 17, 23]. Keyword programming is a technique that translates keyword queries into Java expressions [16]. Sloppy programming is a general term that describes several tools and techniques that interpret, via translation to code, keyword queries [17, 23]. SmartSynth [14], another notable tool, combines techniques from natural language processing and program synthesis to generate scripts for smartphones from natural language queries. This promising, new programming paradigm rests on the untested assumption that 1) small sets of distinctive keywords characterize code and 2) humans can produce them. Our work is the first to provide quantitative and qualitative evidence to validate this assumption. We show the existence of small distinctive sets that characterize code, establishing a necessary condition of this paradigm, which allows programmers to write code naturally and easily using keyword queries, alleviating syntactic frustration.

We focus our study on a diverse corpus of real-world Java projects with 100M lines of code. The approximately 9M Java methods in the corpus form our universe of discourse, as methods capture natural, likely distinct units of source code. Against this corpus, we compute a minimal distinguishing subset (MINSET) for each method. This MINSET is the wheat of the method and the rest is chaff. We develop procedures for "threshing" functions, via lexing, and "winnowing" them, by computing their MINSETs. A lexicon is a set of words. Like web search queries, MINSETs are built from words in a lexicon. We run our algorithms over different lexicons, ranging from raw, unprocessed source tokens to various abstractions of those tokens, all in a quest to find a natural, expressive and meaningful lexicon; the quest culminated in the discovery of a natural lexicon to use for queries (Section IV-B).

Our results show that programs do indeed contain a great deal of chaff. Using the most concrete lexicon, formed over raw lexemes, MINSETs comprise only 4% of their methods on average. This means that about 96% of code is chaff. While the ratios vary and can be large, MINSETs are always small, containing, on average, 1.56 words, and none exceeds 6. We observed the same trend over other lexicons. Detailed results are in Section IV. Section V also discusses existing and preliminary applications of our work. Our project web site (http://jarvis.cs.ucdavis.edu/code_essence) also contains more information on this work, and interested readers are invited to explore it.

While our work is not code search, the results have direct implications in that area because they provide evidence that addresses an assumption of code search: humans can efficiently search for code. This assumption is closely related to the second part of the assumption on which keyword programming is based. Work on code search breaks the problem into three subproblems: 1) how to store and index code [2, 20], 2) what queries (and results) to support [27, 28], and 3) how to filter and rank the results [2, 18, 21]. The programmer's only concern is "What do I need to type to find the code I want?". We take a step back and ask, "Is there anything you can type?", and answer, "Yes, a MINSET."

Our main contributions follow:
• We define and formalize the MINSET problem for rigorously testing the "wheat" and "chaff" hypothesis (Section II-B);
• We prove that MINSET is NP-hard and provide a greedy algorithm to solve it (Section II-C);
• We validate our central hypothesis — source code contains much chaff — against a large (100M LOC), diverse corpus of real-world Java programs (Section IV); and
• We design and compare various lexicons to find one that is natural, expressive, and understandable (Section IV-B).

The rest of this paper is organized as follows. Section II describes threshing and winnowing source code. Section III describes our Java corpus and the implementations of the function thresher and winnowing tool (MINSET algorithm). Section IV presents our detailed quantitative and qualitative results. Section VI analyzes our results and their implications. Section VII places our work into the context of related work, and Section VIII concludes.

II. PROBLEM FORMULATION

After harvesting, farmers thresh and winnow the wheat. Threshing is the process of loosening the grain from the chaff that surrounds it. Winnowing is the process of separating the grain from the chaff.

/* Standard BubbleSort algorithm.
 * @param array The array to sort.
 */
private static void bubbleSort(int array[]) {
    int length = array.length;
    for (int i = 0; i < length; i++) {
        for (int j = 1; j < length - i; j++) {
            if (array[j - 1] > array[j]) {
                int temp = array[j - 1];
                array[j - 1] = array[j];
                array[j] = temp;
            }
        }
    }
}

Threshed Function (23 words; all unique lexemes):
int length = array . for ( i 0 < ; ++ ) { if [ j 1 - ] > temp }

Threshed Function (18 words; all unique lexer token types):
int ID = . for ( INTLIT < ; ++ ) { if [ - ] > }
Algorithm 1. /* Given the universe U, the finite set S, and the finite set of finite sets C, MINSET has type 2^U × 2^(2^U) → 2^U × 2^(2^U), and its application MINSET(S, C) computes 1) S* ⊂ S, a subset that distinguishes S from sets in C, and 2) C′, a "remainder", i.e., a subset of C whose sets contain S and therefore from which S could not be distinguished; when C′ = ∅, S* distinguishes S from all the sets in C; when C′ = C, S* = ∅. */
Input: S, the set to minimize.
Input: C, the collection of sets against which S is minimized.
1: Ce = {C | C ∈ C ∧ e ∈ C} are those sets in C that contain e.

Theorem II.1. The MINSET problem is NP-hard.

Proof: We reduce HITTING-SET to MINSET.

C. The MINSET Algorithm

Since the MINSET problem is NP-hard, we present Algorithm 1, a greedy algorithm that finds a locally minimal distinguishing subset of a set S. Given inputs S, the target set to be minimized, and C, a collection of sets against which S is minimized, the MINSET algorithm computes S* and C′. C′ is the subset of C whose sets contain S, so C \ C′ contains those sets in C that do not contain S. When C′ = ∅, S* is a subset of S that distinguishes S from all sets in C. The core of the algorithm is Line 4. Equality is needed in the cardinality test for cases like S = {a, b}, C = {{a, x}, {a, y}, {b, x}, {b, y}}, where all the elements in S differentiate S from the same number of sets in C. Equality also means that Cx can be empty, as for S = {a} and C = {{x}, {y}}, since |Ca| = 0, and Cx can also be C again, when S ⊆ C for all C ∈ C, as in S = {a} and C = {{a}, {a, b}, {a, b, c}}.

Theorem II.2. Consider MINSET(S, C) = S*, C′. The S* that …
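To make the greedy procedure concrete, the following is a minimal Java sketch of it, reconstructed from the description above rather than taken from the authors' implementation; the class and method names, the use of String words, and the exact tie-breaking order are our own illustrative choices.

import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** A sketch of the greedy MINSET computation, re-implemented from the prose above. */
public final class GreedyMinset {

    /** Holds the distinguishing subset S* and the remainder C'. */
    public static final class Result {
        public final Set<String> minset = new LinkedHashSet<>();
        public final List<Set<String>> remainder = new ArrayList<>();
    }

    public static Result minset(Set<String> s, Collection<Set<String>> c) {
        Result result = new Result();
        List<Set<String>> remaining = new ArrayList<>(c);  // sets S is not yet distinguished from
        Set<String> candidates = new LinkedHashSet<>(s);   // words of S not yet picked

        while (!remaining.isEmpty() && !candidates.isEmpty()) {
            // For each candidate word e, Ce is the subset of the remaining sets that contain e.
            // Picking the e with the smallest Ce eliminates the most sets; "<=" allows ties,
            // as the cardinality-test discussion above requires.
            String best = null;
            List<Set<String>> bestCe = null;
            for (String e : candidates) {
                List<Set<String>> ce = new ArrayList<>();
                for (Set<String> other : remaining) {
                    if (other.contains(e)) {
                        ce.add(other);
                    }
                }
                if (bestCe == null || ce.size() <= bestCe.size()) {
                    best = e;
                    bestCe = ce;
                }
            }
            if (bestCe.size() == remaining.size()) {
                // Every remaining set contains all of S: no word of S can eliminate any of them,
                // so they form the remainder C' and S cannot be distinguished from them.
                break;
            }
            result.minset.add(best);
            candidates.remove(best);
            remaining = bestCe;  // only the sets containing 'best' still need to be distinguished
        }
        result.remainder.addAll(remaining);
        return result;
    }
}

On the examples above, this sketch behaves as described: for S = {a, b} and C = {{a, x}, {a, y}, {b, x}, {b, y}} it returns S* = {a, b} (both words are needed) with an empty remainder, and for S = {a} and C = {{a}, {a, b}} it returns an empty S* with C′ = C.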
TABLE I: Corpus summary.

Repository  | Projects | Files   | Lines of Code
Apache      | 103      | 101,480 | 10,891,228
Eclipse     | 102      | 287,669 | 32,770,246
Github      | 170      | 133,793 | 13,752,295
Sourceforge | 533      | 373,556 | 42,434,029
Total       | 908      | 896,498 | 99,847,798

Curation. Since some projects in our corpus are hosted in multiple code repositories, we removed all but the most recent copy of each project. Also, since many project folders contained earlier or alternative versions of the same project, and even other projects, where we could, we identified the main project and kept only its most current version. Table I summarizes our curated corpus. After curation, clones may still exist in the corpus, for example, within projects. A search program we wrote helps us find clones. When we compute minsets, we assume no clones remain. Our results in Section IV-A give us confidence that this is the case.

Filtering Scaffolding Methods. Java, in particular, requires that a programmer write many short scaffolding methods, for example, getters and setters. Many languages, like Ruby and Python, eliminate the need for such scaffolding code. After manual inspection, we found that such methods usually contain
less than 50 tokens, or about 5 lines of code. This is consistent with other research [3, 15] that also ignores shorter methods. At this size, we also filter methods with very simple functionality. After filtering, 905 out of 908 projects are still represented. Table II shows the method counts.

TABLE II: Method counts.

Methods                    | Count
Total (in corpus)          | 8,918,575
Unique                     | 8,135,663
Unique (50 or more tokens) | 1,870,905
Unique (50 to 562 tokens)  | 1,801,370

B. The Function Thresher

We developed a tool, which we call JavaMT, that threshes all the functions in our corpus. JavaMT leverages the Eclipse JDT parser (http://www.eclipse.org/jdt/), which parses Java code and builds the syntax tree. JavaMT can take as input .java, .class, and .jar files. Projects can contain these and other types of files. The tool builds a list of tokens for each method. It collects the lexeme of each token and additional information as it traverses the syntax tree.

To address the homonym problem, JavaMT collects the fully qualified method name (FQMN) for method name lexemes, and the fully qualified type name (FQTN) for variable identifiers and type identifiers. Collecting this information allows us later to classify methods and types based on whether they are part of the Java SDK library or local to specific projects. When projects are missing dependencies, resolving names to either FQMN or FQTN may not be possible. In our corpus, we encountered this problem with 0.03% of the tokens. JavaMT can also collect more abstract information, like the lexer token types defined in the javac implementation of OpenJDK, an open-source Java platform [26].
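The sketch below illustrates the threshing step on the LEX lexicon: split a method into lexemes and keep each unique lexeme once. It is a deliberately simplified, regex-based stand-in for JavaMT, which instead walks the JDT syntax tree and records FQMNs and FQTNs; the class name, the pattern, and the example in main are ours.

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** A toy thresher: a method body becomes the set of its unique lexemes (the LEX view). */
public final class ToyThresher {

    // Identifiers and keywords, numeric literals, "++"/"--", or single punctuation/operator characters.
    private static final Pattern LEXEME = Pattern.compile(
            "[A-Za-z_$][A-Za-z0-9_$]*|\\d+(?:\\.\\d+)?|\\+\\+|--|[^\\sA-Za-z0-9_$]");

    public static Set<String> thresh(String methodSource) {
        Set<String> lexemes = new LinkedHashSet<>();  // a set: repetition is discarded
        Matcher m = LEXEME.matcher(methodSource);
        while (m.find()) {
            lexemes.add(m.group());
        }
        return lexemes;
    }

    public static void main(String[] args) {
        String body = "int length = array.length; for (int i = 0; i < length; i++) { }";
        // Prints: [int, length, =, array, ., ;, for, (, i, 0, <, ++, ), {, }]
        System.out.println(thresh(body));
    }
}

Real Java strings, comments, and multi-character operators such as <= would need more care; the point here is only the set-of-words view of a method that the rest of the paper relies on.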
C. The Winnowing Tool

All the information collected by JavaMT is stored in a PostgreSQL database. We developed a tool that runs MINSET for each method and stores the result in the same database. If a method does not have a minset, it stores a list of methods that are strict supersets and a list of methods that are duplicates after threshing.
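A minimal driver for this winnowing pass might look like the following, reusing the GreedyMinset sketch from Section II-C; the real tool reads threshed methods from, and writes minsets, supersets, and duplicates back to, the PostgreSQL database, which this sketch replaces with an in-memory map and console output.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** A sketch of the winnowing pass: compute a minset for every threshed method. */
public final class ToyWinnower {

    /** threshedMethods maps a method identifier (e.g. its FQMN) to its set of words. */
    public static void winnow(Map<String, Set<String>> threshedMethods) {
        for (Map.Entry<String, Set<String>> target : threshedMethods.entrySet()) {
            // C is every other threshed method in the corpus.
            List<Set<String>> others = new ArrayList<>();
            for (Map.Entry<String, Set<String>> other : threshedMethods.entrySet()) {
                if (!other.getKey().equals(target.getKey())) {
                    others.add(other.getValue());
                }
            }
            GreedyMinset.Result r = GreedyMinset.minset(target.getValue(), others);
            if (r.remainder.isEmpty()) {
                System.out.println(target.getKey() + " -> minset " + r.minset);
            } else {
                // No minset: the remainder consists of duplicates of, and strict supersets of, this method.
                System.out.println(target.getKey() + " -> no minset; "
                        + r.remainder.size() + " supersets/duplicates");
            }
        }
    }
}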
IV. RESULTS AND ANALYSIS

Our core research question can be addressed in terms of absolute minset sizes, or in terms of minset ratios, where a minset ratio is minset size over threshed method size. While the minset sizes and minset ratios will almost undoubtedly vary across functions, we hypothesize that the mean minset size and the mean minset ratio are small — that there is a great deal of chaff in code. Our results show that code contains much chaff.

The data we present, and the database queries we used, can be downloaded from Bitbucket (https://bitbucket.org/martinvelez/code_essence_dev/downloads).

A. How Much of Code is Wheat?

Cast in terms of wheat, our core research question — How much of code is wheat? — can be answered in two ways: in terms of the size of minsets, or the ratio of minsets to their function. We report both. There are also two natural views we can take of code: the raw sequence of lexemes the programmer sees when writing and reading code, and the abstract sequence of tokens the compiler sees in parsing code. We want to explore those two views, and we capture each one as a lexicon, a set of words. LEX is the set of all lexemes found in code (5,611,561 words). LTT is the set of lexer token types defined by the compiler (101 words). Each word in LTT is an abstraction of a lexeme, like 3 into INTLIT.
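To make the LEX-to-LTT abstraction concrete, the sketch below maps each word of a threshed method to a coarse token-type word, in the style of the second threshed form of the bubbleSort example in Section II (identifiers collapse to ID, integer literals to INTLIT, keywords and operators stay). The type names and the abridged keyword list are illustrative; the actual LTT lexicon consists of the 101 token types of OpenJDK's javac.

import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative LEX -> LTT abstraction of a threshed method. */
final class ToyLttAbstraction {

    // Abridged; javac defines the full keyword and token-type inventory.
    private static final Set<String> KEYWORDS =
            Set.of("int", "for", "if", "while", "return", "new", "class", "void", "static", "private");

    static String abstractLexeme(String lexeme) {
        if (lexeme.matches("\\d+")) return "INTLIT";           // e.g. 3 -> INTLIT
        if (lexeme.matches("[A-Za-z_$][A-Za-z0-9_$]*")) {
            return KEYWORDS.contains(lexeme) ? lexeme : "ID";  // identifiers collapse to ID
        }
        return lexeme;                                         // operators and separators unchanged
    }

    static Set<String> toLtt(Set<String> lexWords) {
        Set<String> ltt = new LinkedHashSet<>();
        for (String lexeme : lexWords) {
            ltt.add(abstractLexeme(lexeme));
        }
        return ltt;
    }
}

Applied to the 23-word LEX set of bubbleSort, this collapses length, array, i, j, and temp to ID and 0 and 1 to INTLIT, leaving the 18-word LTT set shown in Section II.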
LEX is the primordial lexicon; all others are abstractions of its words. Unfortunately, it is noisy: it is sensitive to any syntactic difference, including typos or the use of synonyms, so it tends to overstate the number of minsets and understate their sizes; spurious homonyms can have the opposite effect, but are unlikely in Java when one can employ fully qualified names. LTT is the minimal lexicon a parser needs to determine whether or not a string is in a language. We computed minsets with our winnowing tool for all the methods in our random sample of 10,000 (given the population size, this gives us a confidence level of 95% and a margin of error of ±1%), using each lexicon, and display a summary of our results in Figure 3 and Figure 4.

Fig. 3: The histogram of minset sizes tells us that minsets are small. Comparing minset sizes with method sizes shows that minsets are also relatively small. The minset ratio histogram confirms this. (Summary statistics over the 10,000-method random sample, LEX. Method Size (Threshed): min 12, mode 28, median 35, mean 42.6, max 4004. Minset Size: min 1, mode 1, median 1, mean 1.6, max 6. Minset Ratio: min 2e-04, mode 0.071, median 0.037, mean 0.046, max 0.333.)

Fig. 4: Random Sample of 10,000 Methods: (left) Proportion of Methods with Minsets: There is a stark difference in that proportion between LEX and LTT. (right) Proportion of Methods with Duplicates: LEX induces very few duplicates compared to LTT. LTT maps almost three quarters of the methods to the same set as another. It is too coarse, and does not thresh well.

Using LEX, wheat is a tiny proportion of code. The minset of a method, on average, contains 4.57% of the unique lexemes in a method, which means that methods in Java contain a significant amount of chaff, 95.43% on average. More surprisingly, the number of lexemes in a minset is also just plain small. The mean minset size is 1.55. The minset sizes also do not vary much. In 85.62% of the methods, one or two unique lexemes suffice to distinguish the code from all others. The largest minset consists of only 6 lexemes. Minset ratios also do not vary much. 75% of all methods have a minset ratio of 6.35% or smaller. While the ratios are sometimes large, the absolute sizes never are. The method with the largest minset ratio, 33.3%, for example, consists of 18 unique lexemes but has a minset size of 6. The method with the second largest minset ratio, 29.41%, consists of 17 unique lexemes and has a minset size of 5.

Minsets are surprisingly small; especially surprising is that the maximum size is small. One reason might be the compression inherent to representing functions as sets. We address this later when we experiment with multisets. To test the robustness of our results, we also focused our investigation on larger methods because they may encode more behavior and therefore have more information. Hence, they may have larger minsets. Selected uniformly at random, our sample set does not include many of the largest methods: the largest method in our random sample has 2,025 lines of code while the largest one in our corpus contains 4,606 lines of code. To answer this question about minset properties conditioned on large methods, we selected the 1,000 largest methods, by lines of source code, and computed their minsets. The mean and maximum minset sizes of the largest methods are slightly lower but similar to the previous sample, 1.12 and 4, respectively. This shows that minsets are small and potentially effective indices of unique information even for abnormally large methods.

Using LTT, the proportion of wheat in code is larger but still small. The minset of a method, on average, contains 18.45% of the unique token types in a method. We observe again that
sometimes minset ratios can be large but the absolute minset sizes never are. It is not surprising that the minset ratio is larger. Information is lost in mapping millions of distinct lexemes to only 101 distinct lexer token types. Information is also lost as method sizes decrease from 42.7 using LEX to 18.2 using LTT.

LTT does not separate the grains from the chaff well. Its coarseness seems to cause 6,640 methods to be threshed to the same set as another. Only 87 out of 10,000 (0.87%) methods have a minset using LTT.

These results show that code contains a lot of chaff, in relative and absolute terms. Given that we preserve a lot of information with LEX, we claim that the mean minset size and mean minset ratios we found are approximate lower bounds. In essence, we can define a lexicon spectrum where LEX is one of the poles, and LTT is a more abstract point on the lexicon spectrum.

The yield of a lexicon is its percentage of threshable methods. Our exploration also shows that the yield decreases as the lexicon becomes coarser, measured roughly by the number of words in the lexicon.

We have shown that a method can be threshed and winnowed to a small minset over LEX and LTT. Raw lexemes and token types are cryptic. We also want to determine whether we can thresh and winnow a method to a small and meaningful minset. By meaningful, we refer to how much information a minset reveals about functionality and behavior to us, humans. By the definition of minset, what they reveal should also be distinguishing.

We address this question by exploring the lexicon spectrum toward more abstract views of code. Our challenge is to find a lexicon that differentiates methods while being sufficiently small to be easily understandable and useful for humans. In short, we seek here to approximate the set of words a programmer might use to search for or synthesize code. We additively construct a bag of words a programmer might naturally use.

Two issues confound this search: lexicon specialization can overfit, while lexicon abstraction introduces imprecision. To ameliorate overfitting, we restricted our search to natural lexicons. By natural, we mean simple and intuitive. We pursue natural abstractions to avoid unnatural abstractions that overfit our corpus, like one that maps every function in our corpus to a unique meaningless word. In our context, imprecision leads to spurious homonyms, which reduce yield (although LEX is rife with synonyms, our candidate lexicons have almost none). To handle this problem, we relax the definition of threshability to k-threshability: a method is k-threshable if its minset has k or fewer supersets. Henceforth, when we say threshable we mean
10-threshable. We chose 10 because that is consistent with what humans can process in a glance or two. Humans can rapidly process short lists [22].

We considered four lexicons. Table III shows their names and sizes. Our results appear in Figure 5 and Figure 6. We focused on the absolute minset size. In searching for or synthesizing code using minsets, the minset size is likely more important to the programmer than the minset ratio. We also focused on yield, the proportion of threshable methods. It approximates the proportion of methods a programmer can synthesize or search for using a given lexicon. Broadly, it gives us a sense of the effectiveness and usefulness of a programming model involving minsets.

TABLE III: Candidate Lexicons.

Fig. 5: (left) As the lexicon grows from MIN1 to MIN4, the average size of the threshed methods also grows. (right) As the lexicon grows, the average minset size hardly changes. At least three quarters of the methods have a minset smaller than 4. Even as the lexicon grows, the maximum minset size is never more than 10. (Box plots of method size and minset size per lexicon, MIN1 to MIN4.)

Fig. 6: (left) Yield: The yield clearly improves with each change. At MIN4, the yield is 44.79%. (right) Proportion of Methods With Duplicates: Using this proportion as a rough gauge of threshing precision, there is a substantial improvement in threshing precision with each lexicon — fewer methods have duplicates. MIN4 pushes that precision past 50%. (Bar charts per lexicon, MIN1 to MIN4; the panels annotate non-threshable proportions of 73.07%, 70.27%, 58.56%, and 55.21%.)

First, we considered MIN1, a lexicon including only method names and operators. For public API methods, we used fully qualified method names to prevent the spurious creation of homonyms. For local methods, we abstracted all names to a single abstract word to capture their presence. Local methods tend to implement project-specific functionality not provided by the public API, and are not aimed at general use. The intuition behind including method names is that a lot of the semantics is captured in method calls. They are the verbs or action words of program sentences. Our intuition is further supported by the effectiveness of API birthmarking [31]. We also included operators because all primitive program semantics are applications of operators. Using this lexicon, the mean and maximum minset sizes are small, 2.73 and 7, respectively. The imprecision of MIN1 manifests itself in its low yield of 26.86%.
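As a sketch of the MIN1 abstraction policy just described, the helper below maps a resolved method call to its MIN1 word and keeps operators verbatim; the LOCAL_METHOD word and the SDK-prefix test stand in for JavaMT's binding-based classification of public-API versus project-local methods, so they are illustrative assumptions rather than the authors' code.

/** Illustrative MIN1 word selection: only method names and operators enter the lexicon. */
final class Min1Policy {

    // A single abstract word records that some project-local method is called.
    static final String LOCAL_METHOD = "<local-method>";

    /** Maps a resolved call to its MIN1 word, given its fully qualified method name. */
    static String methodCallWord(String fullyQualifiedMethodName) {
        boolean publicApi = fullyQualifiedMethodName.startsWith("java.")
                || fullyQualifiedMethodName.startsWith("javax.");
        return publicApi ? fullyQualifiedMethodName : LOCAL_METHOD;
    }

    /** Operators are kept verbatim, e.g. "+", "<", "&&". */
    static String operatorWord(String operatorLexeme) {
        return operatorLexeme;
    }
}

For example, a call resolved to java.lang.String.replace(char,char) keeps its fully qualified name, while a call to a project-local helper such as makeDir becomes the single abstract word.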
To try to improve yield, we created lexicon MIN2 by including control flow keywords as well; there are 13 in Java. From the programmer's perspective, these words reveal a great deal about the structure of a method that is critical to semantics. For example, the word for alone immediately tells us that some behavior is repeated. Using this lexicon, the mean and maximum minset sizes are still small, 2.88 and 9, respectively. The yield does not increase much. Only an additional 288 methods become threshable. The likeliest and simplest explanation for the small change is that these words are very common; at least one of them is present in 83.26% of the methods. It is more difficult to interpret this change. On the one hand, it is small. On the other hand, it is the result of adding only 13 new, semantically-rich words. In balancing the size of the lexicon with the interpretability of minsets, this appears to be a good trade-off.

In our quest to improve yield, we defined MIN3 to include the types of variable identifiers (names). Those of a public type were mapped to their fully qualified type name. Those of a locally-defined type were mapped to a single abstract word to signal their presence. Locally-defined types, like local methods, tend to be project-specific and not of general use. Our reason for focusing on types is that they tell the programmer the kind of data on which methods and operators act. It is also a simple way of considering variable identifiers. Again, the mean and maximum minset sizes are small, 2.96 and 9, respectively. There is a notable increase in the yield, from 29.72% to 41.44%. It is now close to what we would imagine might be practical. In a MINSET-based programming model, a programmer would find 4 out of 10 methods. The lexicon also grew substantially, by 36,260 words. This trade-off appears reasonable considering as well that it is natural to supply the programmer with the convenience of a variety of primitive and composite types.

We defined a final lexicon, MIN4, which includes false, true, and null, object reference keywords, like this and new, and the token types of constant values, such as the token type Character-Literal for 'Z' or, for 5, Integer-Literal. In total, we added 13 new words. Our intuition is that the use of hard-coded strings and numbers is connected to semantics.
Fig. 7: Multiplicity: (left) Like in Figure 5, as the lexicon grows, so does the threshed method size. In this case, methods are much larger because repetition is allowed. (right) The minset sizes, allowing repetition, are evidently larger. However, on average, they are still small across all lexicons. (To visualize both distributions, we omitted extreme outliers, i.e., points beyond Q3 + 3*IQ or below Q1 - 3*IQ, where IQ = Q3 - Q1.) (Box plots of method size and minset size per lexicon, MIN1 to MIN4.)

Fig. 8: Multiplicity: (left) Yield: Multiplicity improves the yield of all lexicons. The yield of MIN4 now exceeds 50%. (right) Proportion of Methods With Duplicates: Using this proportion as a rough measure of threshing precision, multiplicity also improves the threshing precision of each lexicon. Less than 25% of the methods have duplicates using MIN4. (Note: Compare with Figure 6.) (Bar charts per lexicon, MIN1 to MIN4.)
(Figure: proportions of methods with minsets, with 10 or fewer supersets, and with more than 10 supersets, per lexicon; the annotated yields range from 53.63% to 73.03%.)

From these case studies, we learned that minsets computed over LEX are small but do not reveal much about the behavior …

… to that type in Table IV.

… values like "Joda". The strings can represent error or information …
TABLE IV: Types of lexemes (or words) in the minsets we computed over the lexicon LEX.

Variable Identifier (of Public Type) (3235): abilityType (java.lang.StringBuffer), defaultValue (int), lostCandidate (boolean), twinsItem (java.util.List)
String and Character Literal (3202): '\u203F', '&', "192.168.1.36", "audit.pdf", "Error: 3", "Joda", "Record Found", "secret4"
Method Call (Local) (2942): classNameForCode, getInstanceProperty, isUserDefaultAdmin, makeDir, shouldAutoComplete
Variable Identifier (of Local Type) (1574): arcTgt, component, iVRPlayPropertiesTab, nestedException, this_TemplateCS_1, wordFSA
Type Identifier (a Local Type) (1413): ErrorApplication, IWorkspaceRoot, Literals, NNSingleElectron, PickObject, TrainingComparator
Method Call (a Public Method) (508): currentTimeMillis (java.lang.System.currentTimeMillis()), replace (java.lang.String.replace(char,char))
Number Literal (integer, float, etc.) (310): 0, 1, 3, 150, 2010, 0xD0, 0x017E, 0x7bcdef42, 255.0f, 0x1000000000041L, 46.666667
Type Identifier (a Public Type) (265): int, ArrayList, Collection, IllegalArgumentException, PropertyChangeSupport, SimpleDateFormat
Operator (260): ^=, <, <<=, <=, =, ==, >, >=, >>, >>=, >>>=, |, |=, ||, -, -=, --, !, !=, ?, /, /=, @, *, &, &&, +, +=, ++
Keyword (Except Types) (196): break, catch, do, else, extends, final, finally, for, instanceof, new, return, super, synchronized, this, try, while
Separator (148): <, >, ", ", ., ]
Reserved Words (Literals) (104): false, null, true
Other (112): COLUMNNAME_PostingType, E, ec2, element, ModelType, org, T, TC
… lexicon, MIN4, that induces those minsets to be more natural and meaningful. Thus, our results clearly support our "wheat and chaff" hypothesis.

Our results offer insight into how to develop powerful, alternative programming systems. Consider an integrated development environment (IDE), like Eclipse or IntelliJ, that can search a MINSET-indexed database of code and requirements to 1) propose related code that may be adapted to purpose, 2) auto-complete whole code fragments as the programmer works, 3) speed concept location for navigation and debugging, and 4) support traceability by interconnecting requirements and code [6].

Other Lexicons. Our lexicon exploration avoided variable names because they are so unconstrained, noisy, and rife with homonyms and synonyms. Minsets over lexicons, like LEX, that incorporate them could include trivial, semantically insignificant differences, like user vs. usr in Unix. At the same time, variable names are an alluring source of signal. Intuitively, and in this corpus, they are the largest class of identifiers, which comprise 70% of source code [8], and they connect a program's source to its problem domain [4]. In future work, we plan to separate the "wheat from the chaff" in variable names.

Alternatives to Functions. We chose functions as our semantic unit of discourse. However, we can apply the same methodology at other semantic levels. One alternative is to study blocks of code. A single function can have many blocks. This could be very useful in alternative programming systems where the user seeks a common block of code but for which there is no individual function. Another alternative is to use abstract syntax trees (ASTs).

Threats to Validity. We identify two main threats. The first is that we only studied Java. However, we have no reason to believe that the "wheat and chaff" hypothesis does not hold for other programming languages. Java, though more modern, was designed to be very similar to C and C++ so that it could be adopted easily. The second threat comes from our corpus: its size and diversity. We downloaded a very large corpus, by any standard. In fact, we downloaded all the Java projects listed as "Most Popular" in the four code repositories we crawled. Those code repositories are known primarily for hosting open-source projects. Thus, there is no indication that they are biased toward any specific types of projects. We plan to replicate this study on a larger Java corpus and with languages of different paradigms, like Lisp and Prolog, to help us understand to what extent the "wheat and chaff" phenomenon varies.

VII. RELATED WORK

Although we are the first to study the phenomenon of "wheat" and "chaff" in code (others have used the "wheat and chaff" analogy in the computing world, but in different domains [29, 30]), a few strands of related work exist.

Code Uniqueness. At a basic level, our study is about uniqueness. Gabel and Su also studied uniqueness [9]. They found that software generally lacks uniqueness, which they measure as the proportion of unique, fixed-length token sequences in a software project. We studied uniqueness differently. We captured the distinguishing core semantics (the essence) of a piece of code in a unique subset of syntactic features, a MINSET, whose elements may not be unique or even rare but together uniquely identify a piece of code. We keep in mind that syntactic differences do not always imply functional differences, as Jiang and Su demonstrated [13]. Thus, in some cases two minsets may represent the same high-level behavior.

Code Completion and Search. Observations about natural language phenomena provide a promising path toward making programming easier. Hindle et al. focused on the 'naturalness' of software [12]. They showed that actual code is "regular and predictable", like natural language utterances. To do so, they trained an n-gram model on part of a corpus, and then tested it on the rest. They leveraged code predictability to enhance Eclipse's code completion tool. Their work followed that of Gabel and Su, who posited and gave supporting evidence that we are approaching a 'singularity', a point in time where all the small fragments of code we need to write already exist [9]. When that happens, many programming tasks can be reduced to finding the desired code in a corpus. Our work suggests that a small, natural set of words, captured in a MINSET, can index and retrieve code. As for code completion, a MINSET-based approach could exploit not just the previous n − 1 tokens, but all the previous tokens, and complete not just the next token but whole pieces of code.

Sourcerer and Portfolio, two modern code search engines, support basic term queries, in addition to more advanced queries [2, 20]. Our research suggests the natural and efficient term query is a MINSET. Results may differ in granularity. Portfolio focuses on finding functions [20] while Exemplar, another engine, finds whole applications [11]; MINSET easily generalizes to arbitrary code fragments. Finally, code search must also be 'internet-scale' [10], and with a modest computer, we can compute minsets for corpora of code in various languages, and update them regularly as new code is added.

Code completion tools suggest code a programmer might want to use. They infer relevant code and rank it. Many diverse, useful tools and strategies exist [5, 24, 25, 32]. Our work suggests a different, complementary MINSET-based strategy: if what the programmer is coding contains the MINSET of some piece of code, suggest that piece of code.

Genetics and Debugging. At a high level, Algorithm 1 isolates a minimal set of essential elements. Central to synthetic biology is the search for the 'minimal genome', the minimal set of genes essential to living organisms [1, 19]. Delta debugging is very similar in that it finds a minimal set of lines of code that trigger a bug [7]. Both approaches rely on an oracle that defines what is 'essential', whereas we define 'essentialness' with respect to other sets.

VIII. CONCLUSION AND FUTURE WORK

We imagine that code, to the human mind, is amorphous, and ask: "If a programmer were reading this code, what features would be semantically important?" and "If a programmer were trying to write this piece of code, what key ideas would the programmer communicate?" A MINSET is our proposal of a useful, formal definition of these key ideas as 'wheat.' Our definition is constructive, so a computer can compute minsets to generate or retrieve an intended piece of code.

We evaluated minsets over a large corpus of real-world Java programs, using various natural lexicons: the computed minsets are sufficiently small and understandable for use in code search, code completion, and natural programming.
REFERENCES

[1] C. G. Acevedo-Rocha, G. Fang, M. Schmidt, D. W. Ussery, and A. Danchin. From essential to persistent genes: a functional approach to constructing synthetic life. Trends in Genetics, 29(5):273–279, 2013.
[2] S. Bajracharya, T. Ngo, E. Linstead, Y. Dou, P. Rigor, P. Baldi, and C. Lopes. Sourcerer: a search engine for open source code supporting structure-based search. In Companion to the 21st ACM SIGPLAN Symposium on Object-Oriented Programming Systems, Languages, and Applications, pages 681–682, 2006.
[3] H. A. Basit and S. Jarzabek. Efficient token based clone detection with flexible tokenization. In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 513–516, 2007.
[4] D. Binkley, M. Davis, D. Lawrie, J. I. Maletic, C. Morrell, and B. Sharif. The impact of identifier style on effort and comprehension. Empirical Software Engineering, 18(2):219–276, Apr. 2013.
[5] M. Bruch, M. Monperrus, and M. Mezini. Learning from examples to improve code completion systems. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 213–222, 2009.
[6] J. Cleland-Huang, R. Settimi, O. BenKhadra, E. Berezhanskaya, and S. Christina. Goal-centric traceability for managing non-functional requirements. In Proceedings of the International Conference on Software Engineering, pages 362–371, 2005.
[7] H. Cleve and A. Zeller. Finding failure causes through automated testing. In Proceedings of the Fourth International Workshop on Automated Debugging, 2000.
[8] F. Deißenböck and M. Pizka. Concise and consistent naming. In Proceedings of the 13th International Workshop on Program Comprehension, pages 97–106, 2005.
[9] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proceedings of the 18th ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 147–156, 2010.
[10] R. E. Gallardo-Valencia and S. Elliott Sim. Internet-scale code search. In Proceedings of the 2009 ICSE Workshop on Search-Driven Development-Users, Infrastructure, Tools and Evaluation, pages 49–52, 2009.
[11] M. Grechanik, C. Fu, Q. Xie, C. McMillan, D. Poshyvanyk, and C. Cumby. A search engine for finding highly relevant applications. In Proceedings of the ACM/IEEE International Conference on Software Engineering, pages 475–484, 2010.
[12] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu. On the naturalness of software. In Proceedings of the International Conference on Software Engineering, pages 837–847, 2012.
[13] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proceedings of the 18th International Symposium on Software Testing and Analysis, pages 81–92, 2009.
[14] V. Le, S. Gulwani, and Z. Su. SmartSynth: synthesizing smartphone automation scripts from natural language. In Proceedings of the 11th Annual International Conference on Mobile Systems, Applications, and Services, pages 193–206, 2013.
[15] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proceedings of the Symposium on Operating Systems Design & Implementation, pages 289–302, 2004.
[16] G. Little and R. C. Miller. Keyword programming in Java. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, pages 84–93, 2007.
[17] G. Little, R. C. Miller, V. H. Chou, M. Bernstein, T. Lau, and A. Cypher. Sloppy programming. In A. Cypher, M. Dontcheva, T. Lau, and J. Nichols, editors, No Code Required, pages 289–307. Morgan Kaufmann, 2010.
[18] D. Mandelin, L. Xu, R. Bodík, and D. Kimelman. Jungloid mining: helping to navigate the API jungle. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 48–61, 2005.
[19] J. Maniloff. The minimal cell genome: "on being the right size". Proceedings of the National Academy of Sciences, 93(19):10004–10006, 1996.
[20] C. McMillan, M. Grechanik, D. Poshyvanyk, Q. Xie, and C. Fu. Portfolio: finding relevant functions and their usage. In Proceedings of the 33rd International Conference on Software Engineering, pages 111–120, 2011.
[21] C. McMillan, N. Hariri, D. Poshyvanyk, J. Cleland-Huang, and B. Mobasher. Recommending source code for use in rapid software prototypes. In Proceedings of the 34th International Conference on Software Engineering, pages 848–858, 2012.
[22] G. A. Miller. The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological Review, 63(2):81, 1956.
[23] R. C. Miller, V. H. Chou, M. Bernstein, G. Little, M. Van Kleek, D. Karger, and m. schraefel. Inky: a sloppy command line for the web with rich visual feedback. In Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology, pages 131–140, 2008.
[24] A. T. Nguyen, T. T. Nguyen, H. A. Nguyen, A. Tamrawi, H. V. Nguyen, J. Al-Kofahi, and T. N. Nguyen. Graph-based pattern-oriented, context-sensitive source code completion. In Proceedings of the 34th International Conference on Software Engineering, pages 69–79, 2012.
[25] T. T. Nguyen, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen. A statistical semantic language model for source code. In Proceedings of the 9th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, 2013.
[26] Oracle. OpenJDK. http://openjdk.java.net/, 2012.
[27] S. P. Reiss. Semantics-based code search. In Proceedings of the 31st International Conference on Software Engineering, pages 243–253, 2009.
[28] S. P. Reiss. Specifying what to search for. In Proceedings of the 2009 ICSE Workshop on Search-Driven Development-Users, Infrastructure, Tools and Evaluation, pages 41–44, 2009.
[29] R. Rivest. Chaffing and winnowing: Confidentiality without encryption, March 1998. Web page.
[30] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD '03, pages 76–85, New York, NY, USA, 2003. ACM.
[31] D. Schuler, V. Dallmeier, and C. Lindig. A dynamic birthmark for Java. In Proceedings of the International Conference on Automated Software Engineering, pages 274–283, 2007.
[32] C. Zhang, J. Yang, Y. Zhang, J. Fan, X. Zhang, J. Zhao, and P. Ou. Automatic parameter recommendation for practical API usage. In Proceedings of the 34th International Conference on Software Engineering, pages 826–836, 2012.