A Study of Wheat and Chaff in Source Code (1502.01410v1)

This study explores the concept of 'wheat' and 'chaff' in source code, identifying essential components of code (wheat) that are crucial for understanding its meaning, as opposed to less significant elements (chaff). By analyzing a diverse corpus of 100 million lines of Java code, the researchers quantify the minimal distinguishing subset (M INSET) of code, revealing that on average, only 4% of code is essential. The findings have implications for code search and keyword-based programming, providing evidence that small sets of distinctive keywords can effectively characterize code.


A Study of “Wheat” and “Chaff” in Source Code

Martin Velez∗ Dong Qiu† You Zhou∗ Earl T. Barr‡ Zhendong Su∗

∗Department of Computer Science, University of California, Davis, USA
†School of Computer Science and Engineering, Southeast University, China
‡Department of Computer Science, University College London, UK
Email: {marvelez, yyzhou, su}@ucdavis.edu, dongqiu@seu.edu.cn, e.barr@ucl.ac.uk

arXiv:1502.01410v1 [cs.SE] 5 Feb 2015

Abstract—Natural language is robust against noise. The meaning of many sentences survives the loss of words, sometimes many of them. Some words in a sentence, however, cannot be lost without changing the meaning of the sentence. We call these words “wheat” and the rest “chaff”. The word “not” in the sentence “I do not like rain” is wheat and “do” is chaff. For human understanding of the purpose and behavior of source code, we hypothesize that the same holds. To quantify the extent to which we can separate code into “wheat” and “chaff”, we study a large (100M LOC), diverse corpus of real-world projects in Java. Since methods represent natural, likely distinct units of code, we use the ∼9M Java methods in the corpus to approximate a universe of “sentences”. We “thresh”, or lex, functions, then “winnow” them to extract their wheat by computing each function’s minimal distinguishing subset (MINSET). Our results confirm that programs contain much chaff. On average, MINSETs have 1.56 words (none exceeds 6) and comprise 4% of their methods. Beyond its intrinsic scientific interest, our work offers the first quantitative evidence for recent promising work on keyword-based programming and insight into how to develop powerful, alternative programming systems.

I. INTRODUCTION

Words are the smallest meaningful units in most languages. We group them into sentences, sentences into paragraphs, and paragraphs into novels and technical papers like this one. Some words in a sentence are more important to its meaning than the others. Indeed, from a few distinctive words in a sentence, we can often guess the meaning of the original sentence.

This paper studies whether this intuitive observation about the importance of some words to the meaning of sentences in a natural language also holds for programming languages.

This work follows recent, seminal studies on the “uniqueness” [9] and the “naturalness” [12] of code. We study a different dimension — the “essence” of code as captured in its syntax and amenable to human interpretation. Our study is inspired by recent work on keyword-based programming [14, 16, 17, 23]. Keyword programming is a technique that translates keyword queries into Java expressions [16]. Sloppy programming is a general term that describes several tools and techniques that interpret, via translation to code, keyword queries [17, 23]. SmartSynth [14], another notable tool, combines techniques from natural language processing and program synthesis to generate scripts for smartphones from natural language queries. This promising, new programming paradigm rests on the untested assumption that 1) small sets of distinctive keywords characterize code and 2) humans can produce them. Our work is the first to provide quantitative and qualitative evidence to validate this assumption. We show the existence of small distinctive sets that characterize code, establishing a necessary condition of this paradigm that allows programmers to write code naturally and easily using keyword queries, alleviating syntactic frustration.

We focus our study on a diverse corpus of real-world Java projects with 100M lines of code. The approximately 9M Java methods in the corpus form our universe of discourse, as methods capture natural, likely distinct units of source code. Against this corpus, we compute a minimal distinguishing subset (MINSET) for each method. This MINSET is the wheat of the method and the rest is chaff. We develop procedures for “threshing” functions via lexing and “winnowing” them by computing their MINSETs. A lexicon is a set of words. Like web search queries, MINSETs are built from words in a lexicon. We run our algorithms over different lexicons, ranging from raw, unprocessed source tokens to various abstractions of those tokens, all in a quest to find a natural, expressive, and meaningful lexicon — a quest that culminated in the discovery of a natural lexicon to use for queries (Section IV-B).

Our results show programs do indeed contain a great deal of chaff. Using the most concrete lexicon, formed over raw lexemes, MINSETs comprise only 4% of their methods on average. This means that about 96% of code is chaff. While the ratios vary and can be large, MINSETs are always small, containing, on average, 1.56 words, and none exceeds 6. We observed the same trend over other lexicons. Detailed results are in Section IV. Section V also discusses existing and preliminary applications of our work. Our project web site (http://jarvis.cs.ucdavis.edu/code_essence) also contains more information on this work, and interested readers are invited to explore it.

While our work is not code search, the results have direct implications in that area because they provide evidence that addresses an assumption of code search: humans can efficiently search for code. This assumption is closely related to the second part of the assumption on which keyword programming is based. Work on code search breaks the problem into three subproblems: 1) how to store and index code [2, 20], 2) what queries (and results) to support [27, 28], and 3) how to filter and rank the results [2, 18, 21]. The programmer’s only concern is “What do I need to type to find the code I want?”. We take a step back and ask, “Is there anything you can type?”, and answer, “Yes, a MINSET.”

Our main contributions follow:
• We define and formalize the MINSET problem for rigorously testing the “wheat” and “chaff” hypothesis (Section II-B);
• We prove that MINSET is NP-hard and provide a greedy algorithm to solve it (Section II-C);
• We validate our central hypothesis — source code contains much chaff — against a large (100M LOC), diverse corpus of real-world Java programs (Section IV); and
• We design and compare various lexicons to find one that is natural, expressive, and understandable (Section IV-B).

The rest of this paper is organized as follows. Section II describes threshing and winnowing source code. Section III describes our Java corpus and the implementations of the function thresher and the winnowing tool (the MINSET algorithm). Section IV presents our detailed quantitative and qualitative results. Section VI analyzes our results and their implications. Section VII places our work into the context of related work, and Section VIII concludes.

/* Standard BubbleSort algorithm.
 * @param array The array to sort.
 */
private static void bubbleSort(int array[]) {
    int length = array.length;
    for (int i = 0; i < length; i++) {
        for (int j = 1; j < length - i; j++) {
            if (array[j - 1] > array[j]) {
                int temp = array[j - 1];
                array[j - 1] = array[j];
                array[j] = temp;
            }
        }
    }
}

Threshed Function (23 words; all unique lexemes)
int length = array . for ( i 0 < ; ++ ) { if [ j 1 - ] > temp }

Threshed Function (18 words; all unique lexer token types)
int ID = . for ( INTLIT < ; ++ ) { if [ - ] > }

Fig. 1: The top part shows a Java method that implements the Bubble Sort algorithm. The bottom part shows two threshing results. In the first set, we keep all (unique) lexemes. In the second set, we map each lexeme to its lexer token type. Note that, since some lexemes map to the same lexer token type, the second set is smaller.

II. PROBLEM FORMULATION

After harvesting, farmers thresh and winnow the wheat. Threshing is the process of loosening the grain from the chaff that surrounds it. Winnowing is the process of separating the grain or kernels from the chaff. In this section, we define “wheat” and “chaff”, describe code threshing, and present MINSET, our winnowing algorithm.

A. Threshing

We view functions as the “stalks of wheat”. Functions are natural, likely distinct, units of code and functionality. One could also choose other units like individual statements, blocks, or classes. This granularity seems adequate. Functions are usually the building blocks of more complex components. To thresh, we parse a function to get its set of lexemes. Then, we map this set of lexemes to a set (or bag) of “words”.

What is a “word”? We are free to define the lexicon, the set of (allowed) words. A natural, basic lexicon is the set of lexemes; a lexeme is a delimited string of characters in code, where space and punctuation are typical delimiters; it is an atomic syntactic unit in a programming language. (Linguistics defines a lexeme differently: there, a lexeme is the set of forms a single word can take; for example, ‘run’, ‘runs’, and ‘running’ are all forms of the same lexeme identified by the word ‘run’.) Under this lexicon, words are lexemes. New lexicons can be formed by abstraction over lexemes. In natural languages, for example, the words in a sentence can be replaced by their part of speech, like NOUN, VERB, or ADJECTIVE, to highlight structure. Similarly, code parsers tag each lexeme with one of a set of token types. Thus, another natural, but more abstract, lexicon consists of token types. New lexicons can also be defined by filtering specific lexemes. For example, we can allow all lexemes except delimiters, like ‘(’ and ‘)’. Under this lexicon, a function’s set would be all its lexemes except the delimiters.

Figure 1 illustrates the threshing process. It shows the source code of a Java method that sorts numbers using bubble sort. It also shows the threshed function under a lexicon consisting of all raw lexemes and under a lexicon consisting only of lexer token types.

Varying the lexicon allows us to explore programming-language-specific information. The lexicon consisting of all lexemes probably includes many elements that we suspect have little to do with the behavior of functions, i.e., delimiters and string literals like "Joe". We can filter those lexemes by not scooping them into the winnowing screen. We can also filter other lexemes, like the type annotation “int” in “int cars = 0;”, to explore how important they are in the model.

Functions may also contain, to adapt a word from linguistics, homonyms: identical lexemes with distinct effects on behavior. For example, in Java, the lexeme “get” could be a method call of “java.util.Map.get()” or “java.util.List.get()”. In Java, we fully qualify homonyms to distinguish them, as shown. In general, we can map lexemes to distinct words to capture the difference in behavior. We can also abstract distinct lexemes we suspect have the same effect on behavior, i.e., synonyms, to the same word. For example, variable identifiers can be replaced with their type under a language’s type system. In general, a lexicon that is fine-grained and concrete may exaggerate unimportant differences between functions, while one that is coarse and abstract may blur important differences. At both ends of the spectrum of lexicons, it may be difficult to separate the grain from the chaff later.

B. Winnowing

In threshing, we simplified the representation of a function by mapping its source code to a set of lexical features, words. Finding the wheat of a function is thus reduced to finding a unique subset of code features. This unique subset distinguishes each function from all other functions (when all functions are represented as sets of words). We call any such subset a distinguishing subset, and define it precisely in Definition II.1. We call the problem of finding the minimum distinguishing subset (MINSET) the MINSET problem.

Definition II.1. Given a finite set S and a finite collection of finite sets C, S∗ is a distinguishing subset of S if and only if
(P1) S∗ ⊆ S (S∗ is a subset of S)
(P2) ∀C ∈ C, S∗ ⊈ C (S∗ is only a subset of S)

What is wheat and what is chaff in code? The wheat grain of a piece of code is the MINSET. A MINSET identifies a piece of code — wheat and chaff together. The MINSET are distinguishing features, a kind of semantic core. The MINSET, however, is not itself executable. Just as a wheat grain depended on chaff to grow, a MINSET depends on its surrounding context to execute and provide functionality. We call this surrounding context chaff: it consists of the low-level technological details of a programming language and a platform that obscure the higher-level semantics of a function.
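To make threshing concrete, the following sketch lexes the body of the bubbleSort method of Fig. 1 and recovers the 23-word unique-lexeme set shown there. This is our illustration, not the paper’s JavaMT tool: the crude regular-expression tokenizer is an assumption standing in for a real Java lexer.

```python
import re

# Body of the bubbleSort method from Fig. 1.
BODY = """
int length = array.length;
for (int i = 0; i < length; i++) {
    for (int j = 1; j < length - i; j++) {
        if (array[j - 1] > array[j]) {
            int temp = array[j - 1];
            array[j - 1] = array[j];
            array[j] = temp;
        }
    }
}
"""

def thresh(source):
    """Map source code to its set of unique lexemes.

    A crude lexer: '++' is matched first so it is not split, then
    identifiers, integer literals, and single-character punctuation.
    """
    return set(re.findall(r"\+\+|[A-Za-z_]\w*|\d+|\S", source))

words = thresh(BODY)
print(len(words))   # 23 unique lexemes, matching Fig. 1
print(sorted(words))
```

Under a coarser lexicon, the same `thresh` output would then be abstracted further (e.g., identifiers to a token type), which is the subject of the lexicons studied in Section IV.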

The MINSET problem: We now formally define the core computational problem that we study.

Definition II.2 (The MINSET Problem). Given a finite set S and a finite collection of finite sets C, find a minimum distinguishing subset (minset) S∗ of S.

Theorem II.1. MINSET is NP-hard.

Proof: We reduce HITTING-SET to MINSET.

C. The MINSET Algorithm

Since the MINSET problem is NP-hard, we present Algorithm 1, a greedy algorithm that finds a locally minimal distinguishing subset of a set S. Given inputs S, the target set to be minimized, and C, a collection of sets against which S is minimized, the MINSET algorithm computes S∗ and C′. C′ is the subset of C whose sets contain S∗, so C \ C′ contains those sets in C that do not contain S∗. When C′ = ∅, S∗ is a subset of S that distinguishes S from all sets in C. The core of the algorithm is Line 4. Equality is needed in the cardinality test for cases like S = {a, b}, C = {{a, x}, {a, y}, {b, x}, {b, y}}, where all the elements in S differentiate S from the same number of sets in C. Equality also means that Cx can be empty, as for S = {a} and C = {{x}, {y}}, since |Ca| ≤ |Ca| = 0, and Cx can also be C again, when S ⊆ C for all C ∈ C, as in S = {a} and C = {{a}, {a, b}, {a, b, c}}.

Algorithm 1: Given the universe U, the finite set S, and the finite collection of finite sets C, MINSET has type 2^U × 2^(2^U) → 2^U × 2^(2^U), and its application MINSET(S, C) computes 1) S∗ ⊆ S, a subset that distinguishes S from sets in C, and 2) C′, a “remainder”, i.e., a subset of C whose sets contain S∗ and from which S therefore could not be distinguished; when C′ = ∅, S∗ distinguishes S from all the sets in C; when C′ = C, S∗ = ∅.

Input: S, the set to minimize.
Input: C, the collection of sets against which S is minimized.
1: Ce = {C | C ∈ C ∧ e ∈ C} denotes those sets in C that contain e.
2: S∗ := ∅
3: while S ≠ ∅ ∧ C ≠ ∅ do
       // Greedily pick an element that most differentiates S.
4:     e := CHOOSE({x ∈ S | |Cx| ≤ |Cy|, ∀y ∈ S})
5:     if Ce = C then break
6:     S∗ := S∗ ∪ {e}
7:     S := S \ {e}
8:     C := Ce
9: return S∗, C

Step   S∗        S          C                                CHOOSE
0      ∅         {a, b, e}  {{a, c}, {b, c, d}, {a, d, e}}   b (|Cb| = 1)
1      {b}       {a, e}     {{b, c, d}}                      e (|Ce| = 0)
2      {b, e}    {a}        ∅

Fig. 2: The execution of Algorithm 1 illustrated on the following problem instance: MINSET({a, b, e}, {{a, c}, {b, c, d}, {a, d, e}}).

Theorem II.2. Consider MINSET(S, C) = S∗, C′. The S∗ that Algorithm 1 computes distinguishes S from a subset of C; when C′ = ∅, S∗ is a minimally distinguishing subset of S.

Proof: By induction on S∗.

The worst-case complexity of MINSET(S, C) is O(|S|²|C|). First, there are |S| iterations and, in each, for each element x ∈ S we need to 1) compute Cx, each at a cost of |C|, for a total cost of O(|S||C|), and then 2) find the minimum |Cx| at a cost of O(|S|). Of course, S and C are smaller in each iteration, but we ignore this and over-approximate. Thus, we have O(|S|(|S||C| + |S|)) = O(|S|²|C|).

As mentioned earlier, modeling functions as sets discards differences in methods due to multiplicity. We have also developed a multiset version of the MINSET algorithm, which we omit due to lack of space.

III. SETUP AND IMPLEMENTATION

We selected a very popular, modern programming language, Java, and collected a large (100M lines of code), diverse corpus of real-world projects. Ignoring scaffolding and very simple methods, which we define as those containing fewer than 50 tokens, there are 1,870,905 distinct methods in our corpus. We selected a simple random sample of 10,000 methods; given the population size, this gives us a confidence level of 95% and a margin of error of ±1%. Our software and data are available at https://bitbucket.org/martinvelez/code_essence_dev/downloads.

A. Code Corpus

Over the summer of 2012, we downloaded almost one thousand of the most popular projects from four widely-used open source code repositories: Apache, Eclipse, Github, and Sourceforge.

TABLE I: Corpus summary.

Repository    Projects   Files     Lines of Code
Apache        103        101,480   10,891,228
Eclipse       102        287,669   32,770,246
Github        170        133,793   13,752,295
Sourceforge   533        373,556   42,434,029
Total         908        896,498   99,847,798

Curation: Since some projects in our corpus are hosted in multiple code repositories, we removed all but the most recent copy of each project. Also, since many project folders contained earlier or alternative versions of the same project, and even other projects, where we could, we identified the main project and kept only its most current version. Table I summarizes our curated corpus. After curation, clones may still exist in the corpus, for example, within projects. A search program we wrote helps us find clones. When we compute minsets, we assume no clones remain. Our results in Section IV-A give us confidence that this is the case.

Filtering Scaffolding Methods: Java, in particular, requires that a programmer write many short scaffolding methods, for example, getters and setters. Many languages, like Ruby and Python, eliminate the need for such scaffolding code. After manual inspection, we found that such methods usually contain fewer than 50 tokens, or about 5 lines of code.
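Algorithm 1 (Section II-C) is short enough to sketch directly. The Python below is our illustrative reimplementation of the greedy MINSET procedure, not the authors’ released tool, run on the instance of Fig. 2. Because CHOOSE breaks ties arbitrarily, a deterministic tie-break may return a different, equally small minset than the {b, e} of the trace.

```python
def minset(S, collection):
    """Greedy MINSET (Algorithm 1): compute a locally minimal subset
    S_star of S that distinguishes S from the sets in `collection`,
    plus the remainder of sets that still contain S_star."""
    S = set(S)
    C = [set(c) for c in collection]
    S_star = set()
    while S and C:
        # Line 4's CHOOSE: greedily pick the element of S contained in
        # the fewest remaining sets (sorted() makes ties deterministic,
        # where the paper leaves them arbitrary).
        e = min(sorted(S), key=lambda x: sum(1 for c in C if x in c))
        Ce = [c for c in C if e in c]
        if len(Ce) == len(C):   # no element distinguishes S further
            break
        S_star.add(e)
        S.remove(e)
        C = Ce                  # keep only the still-undistinguished sets
    return S_star, C

# The instance of Fig. 2.
C_fig2 = [{"a", "c"}, {"b", "c", "d"}, {"a", "d", "e"}]
s_star, remainder = minset({"a", "b", "e"}, C_fig2)
# An empty remainder means s_star distinguishes {a, b, e} from every set
# in C_fig2; our tie-break yields the 2-word minset {a, b}, the same size
# as the {b, e} found in the trace of Fig. 2.
print(sorted(s_star), remainder)
```

The remainder returned here is exactly the C′ of Theorem II.2: the sets from which the input could not be distinguished.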

This is consistent with other research [3, 15] that also ignores shorter methods. At this size, we also filter methods with very simple functionality. After filtering, 905 out of 908 projects are still represented. Table II shows the method counts.

TABLE II: Method counts.

Methods                      Count
Total (in corpus)            8,918,575
Unique                       8,135,663
Unique (50 or more tokens)   1,870,905
Unique (50 to 562 tokens)    1,801,370

B. The Function Thresher

We developed a tool, which we call JavaMT, that threshes all the functions in our corpus. JavaMT leverages the Eclipse JDT parser (http://www.eclipse.org/jdt/), which parses Java code and builds the syntax tree. JavaMT can take as input .java, .class, and .jar files. Projects can contain these and other types of files. The tool builds a list of tokens for each method. It collects the lexeme of each token and additional information as it traverses the syntax tree.

To address the homonym problem, JavaMT collects the fully qualified method name (FQMN) for method name lexemes, and the fully qualified type name (FQTN) for variable identifiers and type identifiers. Collecting this information allows us later to classify methods and types based on whether they are part of the Java SDK library or local to specific projects. When projects are missing dependencies, resolving names to either FQMN or FQTN may not be possible. In our corpus, we encountered this problem with 0.03% of the tokens. JavaMT can also collect more abstract information, like lexer token types as defined in the javac implementation of OpenJDK, an open-source Java platform [26].

C. The Winnowing Tool

All the information collected by JavaMT is stored in a PostgreSQL database. We developed a tool that runs MINSET for each method and stores the result in the same database. If a method does not have a minset, it stores a list of methods that are strict supersets and a list of methods that are duplicates after threshing.

IV. RESULTS AND ANALYSIS

Our core research question can be addressed in terms of absolute minset sizes, or in terms of minset ratios: minset size to threshed method size. While the minset sizes and minset ratios will almost undoubtedly vary across functions, we hypothesize that the mean minset size and the mean minset ratio are small — that there is a great deal of chaff in code. Our results show that code contains much chaff. The data we present, and the database queries we used, can be downloaded from Bitbucket (https://bitbucket.org/martinvelez/code_essence_dev/downloads).

A. How Much of Code is Wheat?

Cast in terms of wheat, our core research question — How much of code is wheat? — can be answered in two ways: in terms of the size of minsets, or the ratio of minsets to their function. We report both. There are also two natural views we can take of code: the raw sequence of lexemes the programmer sees when writing and reading code, and the abstract sequence of tokens the compiler sees in parsing code. We want to explore those two views, and capture each one as a lexicon, a set of words. LEX is the set of all lexemes found in code (5,611,561 words). LTT is the set of lexer token types defined by the compiler (101 words). Each word in LTT is an abstraction of a lexeme, like 3 into INTLIT.

LEX is the primordial lexicon; all others are abstractions of its words. Unfortunately, it is noisy: it is sensitive to any syntactic differences, including typos or use of synonyms, so it tends to overstate the number of minsets and understate their sizes; spurious homonyms can have the opposite effect, but are unlikely in Java when one can employ fully qualified names. LTT is the minimal lexicon a parser needs to determine whether or not a string is in a language. We computed minsets with our winnowing tool for all the methods in our random sample of 10,000 using each lexicon, and display a summary of our results in Figure 3 and Figure 4.

Using LEX, wheat is a tiny proportion of code. The minset of a method, on average, contains 4.57% of the unique lexemes in a method, which means that methods in Java contain a significant amount of chaff, 95.43% on average. More surprisingly, the number of lexemes in a minset is also just plain small. The mean minset size is 1.55. The minset sizes also do not vary much. In 85.62% of the methods, one or two unique lexemes suffice to distinguish the code from all others. The largest minset consists of only 6 lexemes. Minset ratios also do not vary much: 75% of all methods have a minset ratio of 6.35% or smaller. While the ratios are sometimes large, the absolute sizes never are. The method with the largest minset ratio, 33.3%, for example, consists of 18 unique lexemes but has a minset size of 6. The method with the second largest minset ratio, 29.41%, consists of 17 unique lexemes and has a minset size of 5.

Minsets are surprisingly small; especially surprising is that the maximum size is small. One reason might be the compression inherent to representing functions as sets. We address this later when we experiment with multisets. To test the robustness of our results, we also focused our investigation on larger methods because they may encode more behavior and therefore have more information. Hence, they may have larger minsets. Selected uniformly at random, our sample set does not include many of the largest methods: the largest method in our random sample has 2,025 lines of code while the largest one in our corpus contains 4,606 lines of code. To answer this question about minset properties conditioned on large methods, we selected the 1,000 largest methods, by lines of source code, and computed their minsets. The mean and maximum minset sizes of the largest methods are slightly lower than, but similar to, the previous sample: 1.12 and 4, respectively. This shows that minsets are small and potentially effective indices of unique information even for abnormally large methods.

Using LTT, the proportion of wheat in code is larger but still small. The minset of a method, on average, contains 18.45% of the unique token types in a method. We observe again that sometimes minset ratios can be large but the absolute minset sizes never are.
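The LEX-to-LTT abstraction described above can be illustrated on the word sets of Fig. 1. The mapping below is a simplified stand-in for javac’s token types (it abstracts only identifiers and integer literals), not OpenJDK’s actual lexer.

```python
# Unique-lexeme (LEX) word set of the bubbleSort method (Fig. 1).
LEX_WORDS = {"int", "length", "=", "array", ".", "for", "(", "i", "0",
             "<", ";", "++", ")", "{", "if", "[", "j", "1", "-", "]",
             ">", "temp", "}"}

JAVA_KEYWORDS = {"int", "for", "if"}

def to_ltt(word):
    """Abstract a lexeme to a (simplified) lexer token type."""
    if word in JAVA_KEYWORDS:
        return word        # keywords are their own token type
    if word.isdigit():
        return "INTLIT"    # integer literals collapse to one type
    if word[0].isalpha() or word[0] == "_":
        return "ID"        # identifiers collapse to one type
    return word            # operators/punctuation stand for themselves

LTT_WORDS = {to_ltt(w) for w in LEX_WORDS}
print(len(LEX_WORDS), len(LTT_WORDS))   # 23 18
```

The drop from 23 to 18 words reproduces Fig. 1: five distinct identifiers collapse to ID and two integer literals collapse to INTLIT, which is exactly the information loss the LTT results above describe at corpus scale.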

Fig. 3: Random sample of 10,000 methods (LEX). [Figure: three histograms; their summary statistics follow.]

         Method Size (Threshed)   Minset Size   Minset Ratio
Min      12                       1             2e-04
Mean     42.6                     1.6           0.046
Median   35                       1             0.037
Max      4004                     6             0.333
Mode     28                       1             0.071

The histogram of minset sizes tells us that minsets are small. Comparing minset sizes with method sizes shows that minsets are also relatively small. The minset ratio histogram confirms this.

Fig. 4: Random sample of 10,000 methods. [Figure: two bar charts comparing LEX and LTT.] (left) Proportion of Methods with Minsets: There is a stark difference in that proportion between LEX and LTT. (right) Proportion of Methods with Duplicates: LEX induces very few duplicates compared to LTT. LTT maps almost three quarters of the methods to the same set as another. It is too coarse, and does not thresh well.

It is not surprising that the minset ratio is larger under LTT. Information is lost in mapping millions of distinct lexemes to only 101 distinct lexer token types. Information is also lost as mean threshed method sizes decrease from 42.7 using LEX to 18.2 using LTT.

These results show that code contains a lot of chaff, in relative and absolute terms. Given that we preserve a lot of information with LEX, we claim that the mean minset size and mean minset ratios we found are approximate lower bounds. In essence, we can define a lexicon spectrum where LEX is one of the poles, and LTT is a more abstract point on the lexicon spectrum.

The yield of a lexicon is its percentage of threshable methods. Our exploration also shows that the yield decreases as the lexicon becomes coarser, measured roughly by the number of words. Our coarsest lexicon, LTT, does not loosen the grains from the chaff well. Its coarseness seems to cause 6,640 methods to be threshed to the same set as another. Only 87 out of 10,000 methods (0.87%) have a minset using LTT. In contrast, LEX appears to preserve sufficient information so that 9,087 out of 10,000 methods have a minset.

TABLE III: Candidate Lexicons.

Name           MIN1     MIN2     MIN3     MIN4
Size (words)   55,543   55,556   91,816   91,829

B. What is a Natural, Minimal Lexicon?

We have shown that a method can be threshed and winnowed to a small minset over LEX and LTT. Raw lexemes and token types are cryptic, however. We also want to determine whether we can thresh and winnow a method to a small and meaningful minset. By meaningful, we refer to how much information a minset reveals about functionality and behavior to us, humans. By the definition of minset, what they reveal should also be distinguishing.

We address this question by exploring the lexicon spectrum toward more abstract views of code. Our challenge is to find a lexicon that differentiates methods while being sufficiently small to be easily understandable and useful for humans. In short, we seek here to approximate the set of words a programmer might use to search for or synthesize code. We additively construct a bag of words a programmer might naturally use.

Two issues confound this search: lexicon specialization can overfit, while lexicon abstraction introduces imprecision. To ameliorate overfitting, we restricted our search to natural lexicons. By natural, we mean simple and intuitive. We pursue natural abstractions to avoid unnatural abstractions that overfit our corpus, like one that maps every function in our corpus to a unique meaningless word. In our context, imprecision leads to spurious homonyms, which reduces yield (although LEX is rife with synonyms, our candidate lexicons have almost none). To handle this problem, we relax the definition of threshability to k-threshability: a method is k-threshable if its minset has k or fewer supersets. Henceforth, when we say threshable we mean 10-threshable.

5
120 Have_more_than_10_supersets

12
Do_not_have_duplicates
inner fence
Have_10_or_fewer_supersets Have_duplicates
outer fence
Have_minsets

10000

10000
mild outlier
● mean

Non−Threshable (55.21%)
Non−Threshable (58.56%)

Do Not Have Minset (74.08%)


90

Non−Threshable (70.27%)

Do Not Have Minset (76.15%)


9

Non−Threshable (73.07%)

Do Not Have Minset (83.98%)


Do Not Have Minset (85.52%)
7500

7500
Method Size

Minset Size

Count
60

5000

5000
30

2500

2500
● ●





0

0
MIN1 MIN2 MIN3 MIN4 MIN1 MIN2 MIN3 MIN4 MIN1 MIN2 MIN3 MIN4 MIN1 MIN2 MIN3 MIN4
Lexicon Lexicon Lexicon Lexicon

Fig. 5: (left) As the lexicon grows from M IN 1 to M IN 4, the Fig. 6: (left) Yield: The yield clearly improves with each
average size of the threshed methods also grows. (right) As the change. At M IN 4, the yield is 44.79%. (right) Proportion
lexicon grows, the average minset size hardly changes. At least of Methods With Duplicates: Using this proportion as a rough
three quarters of the methods have a minset smaller than 4. gauge of threshing precision, there is a substantial improvement
Even as the lexicon grows, the maximum minset size is never in threshing precision with each lexicon — fewer methods have
more than 10. duplicates. M IN 4 pushes that precision past 50%.

10-threshable. We chose 10 because that is consistent with what humans can process in a glance or two. Humans can rapidly process short lists [22].

We considered four lexicons. Table III shows their names and sizes. Our results appear in Figure 5 and Figure 6. We focused on the absolute minset size. In searching or synthesizing code using minsets, the minset size is likely more important to the programmer than the minset ratio. We also focused on yield, the proportion of threshable methods. It approximates the proportion of methods a programmer can synthesize or search for using a given lexicon. Broadly, it gives us a sense of the effectiveness and usefulness of a programming model involving minsets.

First, we considered MIN1, a lexicon including only method names and operators. For public API methods, we used fully qualified method names to prevent the spurious creation of homonyms. For local methods, we abstracted all names to a single abstract word to capture their presence. Local methods tend to implement project-specific functionality not provided by the public API, and are generally not intended for general use. The intuition behind including method names is that much of the semantics is captured in method calls. They are the verbs or action words of program sentences. Our intuition is further supported by the effectiveness of API birthmarking [31]. We also included operators because all primitive program semantics are applications of operators. Using this lexicon, the mean and maximum minset sizes are small, 2.73 and 7, respectively. The imprecision of MIN1 manifests itself in the low yield of 26.86%.

To try to improve yield, we created lexicon MIN2 by including control flow keywords as well; there are 13 in Java. From the programmer's perspective, these words reveal a great deal about the structure of a method that is critical to semantics. For example, the word for alone immediately tells us that some behavior is repeated. Using this lexicon, the mean and maximum minset sizes are still small, 2.88 and 9, respectively. The yield does not increase much. Only an additional 288 methods become threshable. The likeliest and simplest explanation for the small change is that these words are very common; at least one of them is present in 83.26% of the methods. It is more difficult to interpret this change. On the one hand, it is small. On the other hand, it is the result of adding only 13 new, semantically rich words. In balancing the size of the lexicon with the interpretability of minsets, this appears to be a good trade-off.

In our quest to improve yield, we defined MIN3 to include the types of variable identifiers (names). Those of a public type were mapped to their fully qualified type name. Those of a locally-defined type were mapped to a single abstract word to signal their presence. Locally-defined types, like local methods, tend to be project-specific and not of general use. Our reason for focusing on types is that they tell the programmer the kind of data on which methods and operators act. It is also a simple way of considering variable identifiers. Again, the mean and maximum minset sizes are small, 2.96 and 9, respectively. There is a notable increase in the yield, from 29.72% to 41.44%. It is now close to what we would imagine might be practical. In a MINSET-based programming model, a programmer would find 4 out of 10 methods. The lexicon also grew substantially, by 36,260 words. This trade-off appears reasonable considering as well that it is natural to supply the programmer with the convenience of a variety of primitive and composite types.

We defined a final lexicon, MIN4, which includes false, true, and null, object reference keywords, like this and new, and the token types of constant values, such as the token type Character-Literal for 'Z' or, for 5, Integer-Literal. In total, we added 13 new words. Our intuition is that the use of hard-coded strings and numbers is connected to semantics. Certainly,

6 A point is an extreme outlier if it lies beyond Q3 + 3 × IQ or below Q1 − 3 × IQ, where IQ = Q3 − Q1.
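The minset computations that these lexicon experiments rely on can be sketched as a greedy loop. This is a hypothetical reconstruction of the idea behind the paper's Algorithm 1 (which prefers rarest words first), not the authors' code; the name `minset`, the corpus representation, and the threshold parameter `k` are all illustrative assumptions:

```java
import java.util.*;

// Hypothetical sketch: thresh a method to its set of lexicon words, then
// greedily add the target's rarest remaining word until at most k other
// methods contain the chosen subset. The subset then distinguishes the
// method (it is k-threshable); null means non-threshable at threshold k.
class MinsetSketch {
    static Set<String> minset(Set<String> target, List<Set<String>> corpus, int k) {
        // Rank the target's words rarest-first by corpus frequency.
        List<String> byRarity = new ArrayList<>(target);
        byRarity.sort(Comparator.comparingLong(
                (String w) -> corpus.stream().filter(m -> m.contains(w)).count()));

        Set<String> chosen = new LinkedHashSet<>();
        for (String word : byRarity) {
            chosen.add(word);
            // Count other methods that still contain every chosen word.
            long others = corpus.stream()
                    .filter(m -> m.containsAll(chosen) && !m.equals(target))
                    .count();
            if (others <= k) return chosen;
        }
        return null;
    }
}
```

Under this sketch, yield would simply be the fraction of corpus methods for which `minset(...)` returns non-null at the chosen threshold.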

[Figures 7 and 8 appear here: box plots (marking inner and outer fences, mild outliers, and means) of threshed method size and minset size under multiplicity, and bar charts of yield and the proportion of methods with duplicates, over the lexicons MIN1–MIN4.]

Fig. 7: Multiplicity: (left) Like in Figure 5, as the lexicon grows, so does the threshed method size. In this case, methods are much larger because repetition is allowed. (right) The minset sizes, allowing repetition, are evidently larger. However, on average, they are still small across all lexicons. (To visualize both distributions, we omitted extreme outliers.6)

Fig. 8: Multiplicity: (left) Yield: Multiplicity improves the yield of all lexicons. The yield of MIN4 now exceeds 50%. (right) Proportion of Methods With Duplicates: Using this proportion as a rough measure of threshing, multiplicity also improves the threshing precision of each lexicon. Less than 25% of the methods have duplicates using MIN4. (Note: Compare with Figure 6.)
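The multiset modeling behind Figures 7 and 8 (Section IV-C) can be sketched as follows; the types and method names are illustrative, not the paper's implementation:

```java
import java.util.*;

// Minimal sketch of the multiset view of a method: a word -> count map.
// Multiset containment requires the superset to hold every word with at
// least the subset's multiplicity, so "x + x" is no longer the same as "x".
class MultisetSketch {
    static Map<String, Integer> toMultiset(List<String> words) {
        Map<String, Integer> m = new HashMap<>();
        for (String w : words) m.merge(w, 1, Integer::sum);
        return m;
    }

    // a is contained in b iff b has every word of a at least as often.
    static boolean containedIn(Map<String, Integer> a, Map<String, Integer> b) {
        return a.entrySet().stream()
                .allMatch(e -> b.getOrDefault(e.getKey(), 0) >= e.getValue());
    }
}
```

Because containment is stricter than for plain sets, fewer methods subsume one another, which is consistent with the yield improvements the figures report.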

reading hard-coded values can be informative. Also, in a new programming model, a programmer may need to indicate that some constant string or number will be used. For example, if the programmer wishes to find a method that calculates the area of a circle, then it would be natural to indicate that the target method likely contains 3.14 or PI. After including these words, the mean and maximum minset sizes remain small, 3.06 and 10, respectively. The yield increased from 41.44% to 44.79%. Adding this small number of semantically rich words to the lexicon seems to be another reasonable exchange for a noticeable gain in yield: under this lexicon, the words are easier to interpret (see Section IV-D for our analysis of the interpretability of minsets built from these words) while remaining small enough for humans to work with, e.g., a human could potentially write a minset from scratch while programming using keywords [16].

C. Improving Threshing and Winnowing

Instead of continuing our search for lexicons generated from ever more complex abstractions over lexemes, we reconsidered multiplicity, the number of copies of a word in a method. We hypothesized that modeling methods as multisets would recapture some textual and semantic differences, and thereby increase the yield of the lexicons MIN1 through MIN4. We used the multiset version of Algorithm 1 to recompute minsets, and show our results in Figure 7.

Multiplicity improved yield at the cost of larger absolute minset sizes. The yield increased for all lexicons. The new yields ranged from 32.64%–53.63%. The smallest increase in yield was using MIN1 (3.18%) and the largest was using MIN4 (8.84%). More concretely, using MIN4, the number of threshable methods increased by 884. Multiplicity also improved the minset ratios over all lexicons. For example, using MIN4, the mean minset ratio decreased from 15.47% to 5.35%. The cost of considering multiplicity, however, was an overall increase in minset sizes; the range of mean minset sizes, 2.73–3.06, shifted and got a bit wider, 7.06–9.56. The outliers of minset sizes moved farther to the right. Previously, they ranged from 7–10 and now they range from 258–438. The right tails have grown longer. For example, using MIN4, 75.67% of the minsets have fewer than 10 words. Another cost of the gain in yield was in minset computation, where we observed an approximate slowdown factor ranging from 4 to 7. For example, computing multiset minsets using MIN1 took 44 hours instead of 6. In practice, the slowdown is much better than Algorithm 1's complexity implies. Overall, despite its cost, modeling methods as multisets over MIN4 produces a yield with practical value: it easily distinguishes more than half of the methods in our sample set.

Multiplicity appears to also improve how well methods are threshed. Threshing maps a method to a set (or multiset), and can map two unique methods to the same set or multiset. When this happens, the MINSET algorithm cannot distinguish them. We can use the proportion of methods with duplicates to gauge the precision of threshing. LEX gave us a baseline of 3.20%. When we experimented with lexicons MIN1 through MIN4 and no multiplicity, we observed the proportion improved from 66.4% using MIN1 down to 41.64% using MIN4 (Figure 8). Multiplicity cut those proportions nearly in half. For example, using MIN4, the proportion is now 23.59%.

The remaining portion of non-threshable methods is intriguing. There are still 46.37% non-threshable methods, entirely subsumed by more than 10 other methods. We certainly expected some methods to subsume others because of their sheer size. We also expected families of semantically related methods where some subsume others. However, given that methods are not that small, containing, on average, 72.8 words over MIN4, and that the portion of methods with duplicates is small, we suspected another reason. We hypothesized that there are abnormally large methods subsuming a great number

of methods.

[Figure 9 appears here: counts of methods that have minsets, have 10 or fewer supersets, or have more than 10 supersets, with per-bar yield annotations, as the maximum method size filter is halved from 72028 down to 70.]

Fig. 9: The number of threshable methods increases as the maximum method size filter is tuned down to 562. From there, the number of methods and the number of threshable methods decreases substantially. Thus, setting the filter at 562 seems appropriate.

We conducted an experiment where we gradually filtered large methods to observe the effect on yield (Figure 9). We initialized the filter size to 72,028, the maximum method size (in tokens) in our corpus, and repeatedly halved it down to 70; the minimum size of a method is 50. If we filter methods with more than 562 tokens, or about 56 lines of code, then the yield improves from 53.67% to 61.74%. This filter means that, in a new programming model, what the programmer is coding would not be compared against abnormally large methods, by default. With such a filter, 6 out of 10 methods can be easily distinguished from others via their minset. If we doubled the filter size, we would reconsider 55,953 methods, and the yield would still be higher at 57.32% than without the filter. Since there is a relatively low number of these large methods, 69,535 out of 1,870,905 (or 3.7%), the trade-off seems reasonable. A maximum size filter would clearly add practical value in a new programming model.

MIN4 is a natural lexicon suited for code search, synthesis, and robust programming. We recomputed minsets using MIN4, considering multiplicity, and with the filter size set to 562. As we already mentioned, the yield is 61.74%. The mean minset size increases with the filter from 9.56 to 11.03. The minset sizes vary but have a clear positive skew where fewer than 25% contain more than 12 words. That right tail of the distribution is significantly shorter; the maximum size decreased from 689 to 173 because of the filter.

D. Minset Case Studies

Recall that our definition of a MINSET, the wheat of a method, does not imply that a MINSET is unique. Nor does it imply that chaff is meaningless, containing little information about the method. A MINSET is also not executable. To be useful, a MINSET should capture core, distinguishing functionality in a method, and be easily understandable. We studied whether this is the case. From these case studies, we learned that minsets computed over LEX are small but do not reveal much about the behavior of the method. Minsets over MIN4, on the other hand, are still small but also give insight into the functionality of a method.

Study of LEX Since there are thousands of minsets, we took a broad view. For all minsets, we partitioned lexemes by type, leveraging information collected by JavaMT; the types we defined are similar to lexer token types but broader in some cases and narrower in others. We provide a list of the lexeme types we defined, along with the counts of lexemes belonging to that type, in Table IV7.

Public-type variable identifiers, and string and character literals, dominate minsets. String literals are constant string values like "Joda". The strings can represent error or information messages, IP addresses, names, pretty much anything. Perhaps this is why they are at the top of the list: they can be unique or very rare. We divide certain classes of words depending on whether they are public or local: method invocations, type identifiers, and variable names. Public words are more standard and common whereas local words are more specialized and rare. Not surprisingly, then, we observe that standard language features, like keywords and operators, and public types and methods, are less common in minsets. The only exceptions are variable identifiers of public types. Their distinctiveness is due in part to synonyms and homonyms. A programmer has great freedom in creating them. For example, dir appears 8017 times as a variable name in methods, while directory appears only 2774 times. Another reason is that variable identifiers are more prevalent than other types of identifiers, like types and method calls.

Study of MIN4 We studied the minsets produced in our last experiment in Section IV-C. We selected nine minsets (Figure 10); we partitioned the methods into low, medium, and high minset ratios and picked three uniformly at random from each subset. For each minset, we tried to understand each element and what they revealed together about the behavior of a method. Then we inspected the method source code more carefully to assess how well the minsets capture method functionality. Due to lack of space, we discuss only three in detail.

Low: L1 The method named javax.xml.bind.Unmarshaller.unmarshal(javax.xml.transform.Source) deserializes XML documents and returns a Java content tree object; java.awt.Image is an abstract class that represents graphical images. From this minset, we infer that this method handles images and XML files. Since it reads the XML file, we also infer that it uses XML data in some manner. Perhaps the file contains a list of images, or the data in the file is used to create or alter an image. After inspecting the source code, we find that it is a method in the LargeInlineBinaryTestCases class of the EclipseLink project, which manages XML files and other data stores. Our understanding was not far off: the method does read a binary XML file that contains images.

Medium: M1 The java.lang.Class.isInstance(java.lang.Object) method checks if a given object is an object of type Class or assignment-compatible with its calling object. The java.sql.Date.toString() method converts a Date object,

7 A caveat: Algorithm 1 at line 4 picks arbitrarily between two equally rare words. Thus, these counts could differ.
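The partition of minset lexemes by type reported in Table IV can be approximated with a simple classifier; the categories and rules below are an illustrative sketch, not the paper's JavaMT tooling:

```java
import java.util.*;

// Rough, hypothetical sketch of partitioning lexemes into types similar
// to (but simpler than) Table IV's categories. Rules are illustrative:
// a real lexer-based partition would use token kinds, not string shape.
class LexemeTypeSketch {
    static final Set<String> KEYWORDS =
            Set.of("for", "while", "if", "return", "new", "this", "super");
    static final Set<String> RESERVED_LITERALS = Set.of("true", "false", "null");

    static String classify(String lexeme) {
        if (KEYWORDS.contains(lexeme)) return "Keyword";
        if (RESERVED_LITERALS.contains(lexeme)) return "Reserved Word (Literal)";
        if (lexeme.matches("[0-9].*")) return "Number Literal";
        if (lexeme.startsWith("\"") || lexeme.startsWith("'"))
            return "String/Character Literal";
        if (lexeme.matches("[A-Za-z_$][A-Za-z0-9_$.]*")) return "Identifier";
        return "Operator/Separator";
    }
}
```

Counting `classify` results over all minset elements would reproduce a (much coarser) version of the frequency breakdown in Table IV.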

TABLE IV: Types of lexemes (or words) in the minsets we computed over the lexicon LEX.

Grain Type | Count | Examples
Variable Identifier (of Public Type) | 3235 | abilityType (java.lang.StringBuffer), defaultValue (int), lostCandidate (boolean), twinsItem (java.util.List)
String and Character Literal | 3202 | '\u203F', '&', "192.168.1.36", "audit.pdf", "Error: 3", "Joda", "Record Found", "secret4"
Method Call (Local) | 2942 | classNameForCode, getInstanceProperty, isUserDefaultAdmin, makeDir, shouldAutoComplete
Variable Identifier (of Local Type) | 1574 | arcTgt, component, iVRPlayPropertiesTab, nestedException, this_TemplateCS_1, wordFSA
Type Identifier (a Local Type) | 1413 | ErrorApplication, IWorkspaceRoot, Literals, NNSingleElectron, PickObject, TrainingComparator
Method Call (a Public Method) | 508 | currentTimeMillis (java.lang.System.currentTimeMillis()), replace (java.lang.String.replace(char,char))
Number Literal (integer, float, etc.) | 310 | 0, 1, 3, 150, 2010, 0xD0, 0x017E, 0x7bcdef42, 255.0f, 0x1000000000041L, 46.666667
Type Identifier (a Public Type) | 265 | int, ArrayList, Collection, IllegalArgumentException, PropertyChangeSupport, SimpleDateFormat
Operator | 260 | ^=, <, <<=, <=, =, ==, >, >=, >>, >>=, >>>=, |, |=, ||, -, -=, --, !, !=, ?, /, /=, @, *, &, &&, +, +=, ++
Keyword (Except Types) | 196 | break, catch, do, else, extends, final, finally, for, instanceof, new, return, super, synchronized, this, try, while
Separator | 148 | <, >, ", ", ., ]
Reserved Words (Literals) | 104 | false, null, true
Other | 112 | COLUMNNAME_PostingType, E, ec2, element, ModelType, org, T, TC

ID | Minset (MIN4) | Ratio
L1 | javax.xml.bind.Unmarshaller.unmarshal(javax.xml.transform.Source), java.awt.Image | 2.53%
L2 | javax.swing.DefaultBoundedRangeModel2Test.checkValues(javax.swing.BoundedRangeModel,int,int,int,int,boolean) | 2.04%
L3 | /, java.text.Bidi.getRunLevel(int) | 4.55%
M1 | java.lang.Class.isInstance(java.lang.Object), java.sql.Date.toString() | 12.5%
M2 | java.security.AccessController.<java.lang.Object>doPrivileged(java.security.PrivilegedAction<java.lang.Object>), javax.security.auth.Policy.getPermissions(javax.security.auth.Subject,java.security.CodeSource) | 12.5%
M3 | @, java.sql.PreparedStatement.setByte(int,byte) | 12.5%
H1 | =, 2, java.lang.Exception, java.security.Security.addProvider(java.security.Provider), super | 23.8%
H2 | boolean, java.lang.Object.equals(java.lang.Object), org.eclipse.linuxtools.tmf.core.trace.TmfExperiment<LTYPE>, 3 | 27.8%
H3 | com.sun.javadoc.ClassDoc, 3, java.lang.String[], java.lang.String.equals(java.lang.Object) | 31.3%

Fig. 10: This shows the minsets of nine methods (MIN4). L1–L3 are minsets that have low minset ratios. M1–M3 have medium minset ratios. H1–H3 have high minset ratios. The minset elements are rich and reveal some information about the behavior of their respective methods.

which has been wrapped as an SQL date value, to a String. From this minset, we understand the type of a variable is checked. Perhaps reflection is used on an object to ensure it is an instance of type Date before it is converted to a string, for printing or storage. Inspecting the source code, we find that this method resides in the DateType class of the Hibernate ORM project. Again, our understanding is very close to the behavior of the method. The method is passed an object, which it ensures is a java.sql.Date class object, and then returns the value as a string in the appropriate SQL dialect.

High: H1 The java.lang.Exception object is thrown in Java to indicate abnormal flow or behavior. The = operator tells us that there is an assignment but is very common. The java.security.Security.addProvider(java.security.Provider) method adds a security service object, Provider, to a Security object. The Security object centralizes all the security properties in an application. The super keyword refers to the superclass. From this minset, we can infer that it describes a constructor that probably overrides a method in its superclass. We also infer that it catches an exception when adding the provider fails. In the source, we confirm that it is a constructor in the HsqlSocketFactorySecure class in the CloverETL project. It wraps code that instantiates a Provider class and adds it to the Security object in a try block. If adding the provider fails, it catches the exception, as we had inferred.

V. APPLICATIONS

Though our study is primarily empirical, in this section we describe existing and new applications for minsets.

SmartSynth (Existing) As we mentioned earlier, the clearest and, perhaps, most promising application for minsets is in keyword-based programming. SmartSynth [14] is a recent, modern incarnation. SmartSynth generates a smartphone script from a natural language description (query). "Speak weather in the morning" is an example of a successful query. SmartSynth uses NLP techniques to parse the query and map it to a set of "components" (words) in its underlying programming language. Combining a variety of techniques, it then infers relationships between the words to generate and rank candidate scripts. At its heart is the idea that usable code can be constructed from a small set of words. This subset is a minset or another distinguishing subset.

Code Search Engine (New) A major problem of code search is ranking results [2, 18, 21]. We built a code search engine that uses a new ranking scheme8. Relevant methods are ranked by the similarity between their minsets and the user's query. For example, the query "sort array int" returns 135 methods. The top result, with minset "sort array parseInt 16", returns a sorted array of integers, if the 'sort' flag is set.

Code Summarizer (New) From our case studies of MIN4 minsets, we realized that minsets can effectively summarize code. We built a code summary web application8. A user enters the source code of a method, our tool computes a minset, and presents it as a concise summary. Due to space constraints, we omit a full example and invite interested readers to explore our web application. Figure 10 shows examples of minsets summarizing methods.

VI. DISCUSSION

The main purpose of this study was to test our "wheat and chaff" hypothesis. We have shown, over a variety of lexicons, that functions can be identified by a subset of their words, that those subsets tend to be very small, and suggested a

8 http://jarvis.cs.ucdavis.edu/code_essence

lexicon, MIN4, that induces those minsets to be more natural and meaningful. Thus, our results clearly support our "wheat and chaff" hypothesis.

Our results offer insight into how to develop powerful, alternative programming systems. Consider an integrated development environment (IDE), like Eclipse or IntelliJ, that can search a MINSET-indexed database of code and requirements to 1) propose related code that may be adapted to purpose, 2) auto-complete whole code fragments as the programmer works, 3) speed concept location for navigation and debugging, and 4) support traceability by interconnecting requirements and code [6].

Other Lexicons Our lexicon exploration avoided variable names because they are so unconstrained, noisy, and rife with homonyms and synonyms. Minsets over lexicons, like LEX, that incorporated them could include trivial, semantically insignificant differences, like user vs. usr in Unix. At the same time, variable names are an alluring source of signal. Intuitively, and in this corpus, they are the largest class of identifiers, which comprise 70% of source code [8], and connect a program's source to its problem domain [4]. In future work, we plan to separate the "wheat from the chaff" in variable names.

Alternatives to Functions We chose functions as our semantic unit of discourse. However, we can apply the same methodology at other semantic levels. One alternative is to study blocks of code. A single function can have many blocks. This could be very useful in alternative programming systems where the user seeks a common block of code but for which there is no individual function. Another alternative is to use abstract syntax trees (ASTs).

Threats to Validity We identify two main threats. The first is that we only studied Java. However, we have no reason to believe that the "wheat and chaff" hypothesis does not hold for other programming languages. Java, though more modern, was designed to be very similar to C and C++ so that it could be adopted easily. The second threat comes from our corpus: size and diversity. We downloaded a very large corpus, by any standard. In fact, we downloaded all the Java projects listed as "Most Popular" in the four code repositories we crawled. Those code repositories are known primarily for hosting open-source projects. Thus, there is no indication that they are biased toward any specific types of projects. We plan to replicate this study on a larger Java corpus and with languages of different paradigms, like Lisp and Prolog, to help us understand to what extent the "wheat and chaff" phenomenon varies.

VII. RELATED WORK

Although we are the first to study the phenomenon of "wheat" and "chaff" in code9, a few strands of related work exist.

Code Uniqueness At a basic level, our study is about uniqueness. Gabel and Su also studied uniqueness [9]. They found that software generally lacks uniqueness, which they measure as the proportion of unique, fixed-length token sequences in a software project. We studied uniqueness differently. We captured the distinguishing core semantics (the essence) of a piece of code in a unique subset of syntactic features, a MINSET, whose elements may not be unique or even rare but together uniquely identify a piece of code. We keep in mind that syntactic differences do not always imply functional differences, as Jiang and Su demonstrated [13]. Thus, in some cases two minsets may represent the same high-level behavior.

Code Completion and Search Observations about natural language phenomena provide a promising path toward making programming easier. Hindle et al. focused on the 'naturalness' of software [12]. They showed that actual code is "regular and predictable", like natural language utterances. To do so, they trained an n-gram model on part of a corpus, and then tested it on the rest. They leveraged code predictability to enhance Eclipse's code completion tool. Their work followed that of Gabel and Su, who posited and gave supporting evidence that we are approaching a 'singularity', a point in time where all the small fragments of code we need to write already exist [9]. When that happens, many programming tasks can be reduced to finding the desired code in a corpus. Our work suggests that a small, natural set of words, captured in a MINSET, can index and retrieve code. As for code completion, a MINSET-based approach could exploit not just the previous n − 1 tokens but all the previous tokens, and complete not just the next token but whole pieces of code.

Sourcerer and Portfolio, two modern code search engines, support basic term queries, in addition to more advanced queries [2, 20]. Our research suggests the natural and efficient term query is a MINSET. Results may differ in granularity. Portfolio focuses on finding functions [20] while Exemplar, another engine, finds whole applications [11]; MINSET easily generalizes to arbitrary code fragments. Finally, code search must also be 'internet-scale' [10], and with a modest computer, we can compute minsets for corpora of code of various languages, and update them regularly as new code is added.

Code completion tools suggest code a programmer might want to use. They infer relevant code and rank it. Many diverse, useful tools and strategies exist [5, 24, 25, 32]. Our work suggests a different, complementary MINSET-based strategy: if what the programmer is coding contains the MINSET of some piece of code, suggest that.

Genetics and Debugging At a high level, Algorithm 1 isolates a minimal set of essential elements. Central to synthetic biology is the search for the 'minimal genome', the minimal set of genes essential to living organisms [1, 19]. Delta debugging is very similar in that it finds a minimal set of lines of code that trigger a bug [7]. Both approaches rely on an oracle who defines what is 'essential' whereas we define 'essentialness' with respect to other sets.

VIII. CONCLUSION AND FUTURE WORK

We imagine that code, to the human mind, is amorphous, and ask: "If a programmer were reading this code, what features would be semantically important?" and "If a programmer were trying to write this piece of code, what key ideas would the programmer communicate?" A MINSET is our proposal of a useful, formal definition of these key ideas as 'wheat.' Our definition is constructive, so a computer can compute minsets to generate or retrieve an intended piece of code.

We evaluated minsets, over a large corpus of real-world Java programs, using various, natural lexicons: the computed minsets are sufficiently small and understandable for use in code search, code completion, and natural programming.

9 Others have used the "wheat and chaff" analogy in the computing world but in different domains [29, 30].

REFERENCES

[1] C. G. Acevedo-Rocha, G. Fang, M. Schmidt, D. W. Ussery, and A. Danchin. From essential to persistent genes: a functional approach to constructing synthetic life. Trends in Genetics, 29(5):273–279, 2013.
[2] S. Bajracharya, T. Ngo, E. Linstead, Y. Dou, P. Rigor, P. Baldi, and C. Lopes. Sourcerer: a search engine for open source code supporting structure-based search. In Companion to the 21st ACM SIGPLAN Symposium on Object-Oriented Programming Systems, Languages, and Applications, pages 681–682, 2006.
[3] H. A. Basit and S. Jarzabek. Efficient token based clone detection with flexible tokenization. In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 513–516, 2007.
[4] D. Binkley, M. Davis, D. Lawrie, J. I. Maletic, C. Morrell, and B. Sharif. The impact of identifier style on effort and comprehension. Empirical Software Engineering, 18(2):219–276, Apr. 2013.
[5] M. Bruch, M. Monperrus, and M. Mezini. Learning from examples to improve code completion systems. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 213–222, 2009.
[6] J. Cleland-Huang, R. Settimi, O. BenKhadra, E. Berezhanskaya, and S. Christina. Goal-centric traceability for managing non-functional requirements. In Proceedings of the International Conference on Software Engineering, pages 362–371, 2005.
[7] H. Cleve and A. Zeller. Finding failure causes through automated testing. In Proceedings of the Fourth International Workshop on Automated Debugging, 2000.
[8] F. Deißenböck and M. Pizka. Concise and consistent naming. In Proceedings of the 13th International Workshop on Program Comprehension, pages 97–106, 2005.
[9] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proceedings of the 18th ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 147–156, 2010.
[10] R. E. Gallardo-Valencia and S. Elliott Sim. Internet-scale code search. In Proceedings of the 2009 ICSE Workshop on Search-Driven Development–Users, Infrastructure, Tools and Evaluation, pages 49–52, 2009.
[11] M. Grechanik, C. Fu, Q. Xie, C. McMillan, D. Poshyvanyk, and C. Cumby. A search engine for finding highly relevant applications. In Proceedings of the ACM/IEEE International Conference on Software Engineering, pages 475–484, 2010.
[12] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu. On the naturalness of software. In Proceedings of the International Conference on Software Engineering, pages 837–847, 2012.
[13] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proceedings of the 18th International Symposium on Software Testing and Analysis, pages 81–92, 2009.
[14] V. Le, S. Gulwani, and Z. Su. SmartSynth: synthesizing smartphone automation scripts from natural language. In Proceedings of the 11th Annual International Conference on Mobile Systems, Applications, and Services, pages 193–206, 2013.
[15] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proceedings of
[16] G. Little and R. C. Miller. Keyword programming in Java. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, pages 84–93, 2007.
[17] G. Little, R. C. Miller, V. H. Chou, M. Bernstein, T. Lau, and A. Cypher. Sloppy programming. In A. Cypher, M. Dontcheva, T. Lau, and J. Nichols, editors, No Code Required, pages 289–307. Morgan Kaufmann, 2010.
[18] D. Mandelin, L. Xu, R. Bodík, and D. Kimelman. Jungloid mining: helping to navigate the API jungle. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 48–61, 2005.
[19] J. Maniloff. The minimal cell genome: "on being the right size". Proceedings of the National Academy of Sciences, 93(19):10004–10006, 1996.
[20] C. McMillan, M. Grechanik, D. Poshyvanyk, Q. Xie, and C. Fu. Portfolio: finding relevant functions and their usage. In Proceedings of the 33rd International Conference on Software Engineering, pages 111–120, 2011.
[21] C. McMillan, N. Hariri, D. Poshyvanyk, J. Cleland-Huang, and B. Mobasher. Recommending source code for use in rapid software prototypes. In Proceedings of the 34th International Conference on Software Engineering, pages 848–858, 2012.
[22] G. A. Miller. The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological Review, 63(2):81, 1956.
[23] R. C. Miller, V. H. Chou, M. Bernstein, G. Little, M. Van Kleek, D. Karger, and m. schraefel. Inky: a sloppy command line for the web with rich visual feedback. In Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology, pages 131–140, 2008.
[24] A. T. Nguyen, T. T. Nguyen, H. A. Nguyen, A. Tamrawi, H. V. Nguyen, J. Al-Kofahi, and T. N. Nguyen. Graph-based pattern-oriented, context-sensitive source code completion. In Proceedings of the 34th International Conference on Software Engineering, pages 69–79, 2012.
[25] T. T. Nguyen, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen. A statistical semantic language model for source code. In Proceedings of the 9th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, 2013.
[26] Oracle OpenJDK. http://openjdk.java.net/, 2012.
[27] S. P. Reiss. Semantics-based code search. In Proceedings of the 31st International Conference on Software Engineering, pages 243–253, 2009.
[28] S. P. Reiss. Specifying what to search for. In Proceedings of the 2009 ICSE Workshop on Search-Driven Development–Users, Infrastructure, Tools and Evaluation, pages 41–44, 2009.
[29] R. Rivest. Chaffing and winnowing: confidentiality without encryption. Web page, March 1998.
[30] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 76–85, 2003.
[31] D. Schuler, V. Dallmeier, and C. Lindig. A dynamic birthmark for Java. In Proceedings of the International Conference on Automated Software Engineering, pages 274–283, 2007.
[32] C. Zhang, J. Yang, Y. Zhang, J. Fan, X. Zhang, J. Zhao, and P. Ou. Automatic parameter recommendation for practical API usage. In Pro-
the Symposium on Operating Systems Design & Implementation, pages ceedings of the 34th International Conference on Software Engineering,
289–302, 2004. pages 826–836, 2012.

