CodeMine

The document discusses CODEMINE, a software development data analytics platform created by Microsoft to collect and analyze engineering process data across various product teams. It highlights the platform's architecture, data sources, and the common practices for utilizing the data to improve software development processes. The article aims to share lessons learned and design rationale to assist other organizations in building similar analytics platforms.

FOCUS: SOFTWARE ANALYTICS: SO WHAT?

CODEMINE: Building a Software Development Data Analytics Platform at Microsoft

Jacek Czerwonka, Microsoft
Nachiappan Nagappan, Microsoft Research
Wolfram Schulte, Microsoft
Brendan Murphy, Microsoft Research

IEEE Software, July/August 2013. Published by the IEEE Computer Society, © 2013 IEEE.

// The process Microsoft has gone through developing CODEMINE—a software development data analytics platform for collecting and analyzing engineering process data—includes constraints and pivotal organizational and technical choices. //

EARLY, TRUSTWORTHY DATA available at the required frequencies lets engineers and managers make data-driven decisions that enable the software development process to deliver a high-quality, on-time software system. At Microsoft, several teams use data to improve processes. Examples include

• trend monitoring and reports on development health,1
• risk evaluation and change impact analysis tools,2
• version control branch structure optimization,3
• socio-technical data analysis,4 and
• custom search for bugs and debug logs, speeding up investigations of new issues.5

When reviewing these and other solutions from our existing portfolio of tools, our teams realized that even though each solution is unique in its intended purpose and the way it improves the engineering process, there are commonalities of inputs, outputs, and methods among the tools. For example, a majority of the reviewed tools need similar input data: source code repositories and system binaries, defect databases, and organization hierarchies.

In late 2009, a team at Microsoft was established to explore and implement a common platform, CODEMINE, for collecting and analyzing engineering process data from across a diverse set of Microsoft's product teams. The project wasn't done for the sake of research or academic impact: CODEMINE quickly became pervasive, has hundreds of users, and is now deployed in all major Microsoft product groups: Windows, Windows Phone, Office, Exchange, Lync, SQL, Azure, Bing, and Xbox.

This article presents the motivation, challenges, solutions, and, most important, the lessons learned by the CODEMINE team to aid in replicating such a platform in other organizations. We hope our design rationale can help others who are building similar analytics platforms.

Data Sources and Schema

Figure 1 depicts a high-level schema of the repositories and types of artifacts mined by CODEMINE. In terms of both volume and frequency of change, source code repositories are the largest sources of engineering data for a company like Microsoft. They contain information on a variety of source code-related artifacts, divided into data describing the code's state, composition, and high-level attributes, as well as data describing ongoing code changes.


[Figure 1 omitted: a graph of work item, source code, and process information entities (person, organization, feature/defect, change, branch, integration, build, product, schedule, test, code review, source file, executable, test job) and the relationships among them.]

FIGURE 1. The types of data the CODEMINE platform collects. Artifacts are cross-referenced as much as possible, allowing queries against CODEMINE to go beyond an individual repository.
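The cross-referencing that Figure 1 depicts can be rendered as a toy schema. This is a sketch of the idea only; the class and field names below are our own invention, not CODEMINE's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    alias: str

@dataclass
class WorkItem:            # a feature or a defect
    id: int
    kind: str              # "feature" or "defect"
    opened_by: Person

@dataclass
class Change:              # one submission to version control
    id: int
    author: Person
    branch: str
    files: list                                   # edited source files
    resolves: list = field(default_factory=list)  # linked work items

# Because the defect and the change that resolves it are cross-referenced,
# a query can walk from a bug to the files it touched without consulting
# the bug tracker and the version control system separately.
alice = Person("alice")
bug = WorkItem(101, "defect", opened_by=alice)
fix = Change(5001, author=alice, branch="main", files=["parser.c"], resolves=[bug])
```

With the link in place, a question such as "which files were touched to fix bug 101?" reduces to reading `fix.files`.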

In the former category, the primary concepts are source files and their attributes: total size, size of code versus comments, implemented methods, and defined classes or types. In the latter, the concepts of a change, a branch, and an integration characterize the team's output over time.

Another large and important body of data resides in work item repositories. These typically encompass both features and defects, both of which are often tightly linked to source code changes. It's a bidirectional relationship—features and defects are both a trigger for and a cause of source code changes.

Data on builds describes the composition of the final software product and also allows us to map source code to the resulting executable. Code reviews and tests complete the picture of the engineering activity, taking into account the two most common software verification and validation activities. Organization information and process information (such as release schedules and development milestones) are also a part of CODEMINE. They provide context for the engineering activity, the code being developed, and all activities around that.

As Figure 1 depicts, artifacts are cross-referenced as much as possible, allowing queries against CODEMINE to go beyond an individual repository.

Architecture

Figure 2 describes the CODEMINE platform's high-level architecture. More than one instance currently exists; all conform to the same blueprint. We assume a high degree of commonality in the data stored in and accessible from each instance of the CODEMINE data platform; however, each instance might have slightly different capabilities, in terms of both the data stored and the analytics that execute on it. Client applications should run on any instance of the data platform as long as the data they need is present, ideally scaling their capabilities on the basis of which data is actually present. If an application can't run on a particular instance of the data platform, it should fail gracefully.

Data Store

The core element of the data platform is the data store. It's a logical concept realized as a collection of data sources—typically databases but also file shares with either text or binary files. These data sources don't have to be colocated but are likely to remain geographically close to the raw data they cache, consistent with individual product group data and security policies.
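A data store in this sense is little more than a named collection of source descriptors with logical locations. The following is a minimal sketch of that idea; every name, server, and connection string is invented for illustration.

```python
# Hypothetical sketch: a data store as a logical collection of data
# sources that are not necessarily colocated. Each entry records the
# kind of source and its logical location, which is all a client
# needs in order to connect.
DATA_STORE = {
    "changes":  {"kind": "database",   "location": "Server=vc-sql01;Database=Changes"},
    "builds":   {"kind": "database",   "location": "Server=build-sql02;Database=Builds"},
    "coverage": {"kind": "file_share", "location": r"\\lab-share\coverage"},
}

def locate(name):
    """Return the logical location of a data source, or None if this
    instance of the platform doesn't carry that data."""
    entry = DATA_STORE.get(name)
    return entry["location"] if entry else None
```

A client that receives `None` for an optional source can scale its functionality down rather than fail, which is the graceful-degradation behavior described above.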


It's not necessary for all data platform deployments to have the same data sources. Applications use the data catalog service to query for the presence and logical location (such as a connection string or file share name) of specific pieces of data.

[Figure 2 omitted: layered diagram showing, top to bottom, single-purpose analysis tools; the CODEMINE platform API; CODEMINE datamart services (data model exposed for querying; data catalog and data recovery); the CODEMINE data store (code, people, process, work items, builds; data publishing and replication, flexibility in data availability, access permissions, data archiving); CODEMINE loaders (understand the format of repositories, failure-resistant loading, noise removal, a common schema); and the product teams' repositories (code, bugs, builds, organization, tests), where a small number of source schemas yield many combinations.]

FIGURE 2. High-level architecture of a CODEMINE data platform instance.

Data Loaders

Data loaders are modules of code that read raw data and put it directly into the data store. They understand the schema of the raw data source they're querying from. Data loaders are built to be as independent of and decoupled from one another as possible.

The data collection workflow takes care of orchestrating data collections, enforcing any dependencies, and ensuring collections happen in the correct order. The workflow is defined in close cooperation with product groups and adheres to the "pull once" model of data collection as closely as possible.

Platform APIs (Data Model)

CODEMINE has a standard set of interfaces that expose data from the data platform. The interfaces target the most common entities, such as code, defects, features, tests, and people, and their attributes and relationships. The most common usage patterns should be realized through this data model.
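The loader behavior described above (one raw schema per loader, tolerance for bad records, output in a common schema) might be sketched like this. The raw field names and the common schema here are illustrative only.

```python
def load_bugs(raw_rows, store, log):
    """Failure-resistant loader: parse each raw bug-tracker row,
    skip and log noise instead of aborting, and save the survivors
    into the platform's common work-item schema."""
    for row in raw_rows:
        try:
            item = {                          # common schema, not the raw one
                "id": int(row["BugID"]),
                "kind": "defect",
                "state": row["Status"].strip().lower(),
            }
        except (KeyError, ValueError) as err:
            log.append(f"skipped noisy row {row!r}: {err}")
            continue                          # one bad record must not stop the load
        store.append(item)

store, log = [], []
load_bugs([{"BugID": "7", "Status": " Active "},
           {"BugID": "oops", "Status": "Closed"}], store, log)
```

The second row is noise (a non-numeric ID), so it is logged and skipped while the load continues, which is the failure-resistant behavior the loaders need.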


Applications that make use of the data platform will most often follow this pattern:

1. Query the data catalog to ensure that the needed data exists. Fetch connection strings to data sources or URLs to needed services.
2. Tailor functionality depending on the available data.
3. Connect to services and query for data through the data model.
4. Display the data.

The data model is the preferred way to access the data stored in the data platform. It needs to be expressive enough to support the data needs of the productized solutions. However, for specialized queries, one-off research tasks, or prototyping, access to the interfaces exposed by individual data sources is also available.

Platform Services

Platform services encompass a variety of features related to data cataloging, security and access permissions, event logging, data archiving, and data publishing.

Each part of the data platform system needs to be able to log events to a common place. Reasons for logging include health monitoring and trending, data access auditing, execution tracing, and alerting in failure cases.

Product groups need the ability to control access to their cached data the same way they control access to raw data sources. The security policy module must be able to understand the security configuration systems used by product groups, query the security policies at the right frequency, and apply them to both stored data and interfaces accessible from outside the data platform. Currently, data platform instances are protected by individual and separate security groups.

Data Platform Usage Scenarios

In the process of creating the platform and opening it up to both the Microsoft internal research community and product groups, three distinct patterns of data use emerged:

• As a data source for a reporting tool or methodology that's part of a product team's process. When a product team uses the CODEMINE platform and the client application in production, this usage pattern requires data freshness, reliability of data acquisition and analysis, and operational uptime and efficiency in getting to data.
• For one-time, custom analysis focused on answering a specific question. Although the data might not be stored in a way that's optimized for a particular query, the fact that the data is available at all and easy to access (compared to accessing raw data sources for the same data) makes CODEMINE the go-to data source when a product team needs to make a decision related to its product, process, or organization.
• To enable new research. Data from each product team, and especially from across product teams, is a compelling source of information and inspiration for new lines of research.

What follows are examples from each of these categories.

[Figure 3 omitted: screenshot.]

FIGURE 3. CRANE tool screenshot.

Example 1: Mature Research Encoded into a Tool

Change is a fundamental unit of work for software development teams that exists regardless of whether a product is a traditional boxed version or a service, or whether a team uses an agile process or a more traditional approach.

Making postrelease changes requires a thorough understanding of not only the architecture of the software component to be changed but also its dependencies and interactions with other system components.


Testing such changes in reasonable time and at reasonable cost is a problem because an infinite number of test cases can be executed for any modification.2 Furthermore, the changes apply to hundreds of millions of users; even the smallest mistakes can translate to very costly failures and rework.

CRANE is a failure prediction, change risk analysis, and test prioritization system at Microsoft that leverages existing research for the development and maintenance of the Windows operating system family.2 CRANE is built on top of the CODEMINE infrastructure, in the top layer of Figure 2, where tools leverage the CODEMINE platform. The CODEMINE data platform constantly monitors changes happening in the source code repository and can cross-reference them with features, defects, people, code reviews, and auxiliary data such as code coverage. CRANE uses this data, and consequently, teams can automatically receive information on change composition, associated bugs, similar changes, and involved people, along with possible risks and recommended risk-mitigation steps.

CRANE not only surfaces information about changes but also provides interpretation by overlaying coverage data and statistical risk models to identify the most risky and least covered parts of a change. Figure 3 shows a snapshot of a CRANE analysis, which identifies change, coverage, dependency, people, and prior bug information. It allows engineers and engineering managers to focus their attention on the most failure-prone parts of their work. Through the use of code coverage data and a maximum-coverage/minimum-cost algorithm, CRANE is able to recommend specific, high-value tests.

The system has already been successfully deployed in Windows, and pilots are underway in other product teams.

[Figure 4 omitted: scatterplot of velocity cost versus conflict avoidance under single-branch removal.]

FIGURE 4. Velocity versus conflict avoidance. Red dots indicate branches that aren't useful, green dots indicate branches that are useful, and blue dots indicate branches with mixed utility.

Example 2: Ad Hoc Analysis for Decision Making

Here's a simple but very important question: Is code coverage effective, and is there a code coverage percentage at which we should stop testing?

We analyzed code coverage of multiple released versions of Microsoft products and correlated branch and statement coverage with postrelease failures. There was a strong positive correlation between coverage and failures. From discussions with the relevant teams, we found that there are several reasons for this counterintuitive relationship:

• covering code doesn't guarantee that the code is correct;
• having 100 percent code coverage doesn't mean the system will have no failures—rather, it means that bugs could be found outside anticipated coverage scenarios; and
• each time a fix is made, a test case is written to cover the fix (changed binaries might therefore often have high code coverage simply because they have been modified several times).

This finding led us to a follow-up study on the use of code coverage in conjunction with code complexity (for example, cyclomatic complexity and class coupling) as a better indicator of code quality. In addition, we were able to benchmark our results against external organizations such as Avaya.6 Studies of unit testing show its increased effectiveness in obtaining high-quality code because it eliminates the need for testers to find the category of bugs that could be more easily found by developers, and it lets testers focus more on scenario testing.7
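The kind of correlation analysis in Example 2 can be reproduced in miniature. The numbers below are made up for illustration, not Microsoft data, and the Pearson coefficient is computed directly to keep the sketch self-contained.

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-binary statement coverage (%) and postrelease failures.
coverage = [55, 60, 70, 80, 90]
failures = [2, 3, 4, 6, 9]
r = pearson(coverage, failures)  # positive here: more coverage, more failures
```

A positive `r` on data like this is exactly the counterintuitive pattern the study observed, which is why coverage alone, without complexity or change history, is a weak quality indicator.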


Example 3: Use of Data in New Research

Many companies use branches in version control systems to coordinate the work of hundreds to thousands of developers building a software system or service. Branches isolate concurrent work, avoiding instability during development. The downside is an increase in the time changes take to move through the system. So, can we determine the optimal branch structure, guaranteeing fast code velocity and a high degree of isolation? Answering this question is important not only to Microsoft but also to other commercial companies and the research community.

Toward this end, we performed various experiments simulating different branch structures.3 For example, we replayed the check-in history of several product groups, assuming specific branches didn't exist. Under these conditions, all changes hierarchically roll up to a higher-level branch, and we can detect conflicts by identifying files getting modified together. The resulting graph (see Figure 4) plots the cost of a branch versus its value as a factor isolating parallel lines of development. In Figure 4, red dots indicate branches that aren't useful—that is, adding velocity cost and not providing much conflict avoidance. Green dots indicate branches that are useful, and blue dots indicate branches with mixed utility. Branch structures are created in context, to suit the needs of a specific product and organization; such branch evaluation lets teams identify the cost paid for the benefit and identify parts of the branch tree that should be restructured.3

We also analyzed the architectural structure of Windows (for both Vista and Windows 7) and observed that a branch structure that aligns with the team's organizational structure leads to fewer postrelease failures than branches aligned to the product's architectural layering.8

Lessons on Replicating CODEMINE

One of our primary goals in this article is to help other organizations replicate the work we're doing with CODEMINE to build their own data analytics platforms. We've compiled a list of suggestions from our experience that would assist in replicating our CODEMINE effort, along with some things for other organizations to consider doing differently when building their platforms.

Create an Independent Instance for Each Product Team in the Data Platform

Easy partitioning, the ability to constrain access, and the ability to move parts of the infrastructure greatly assisted us in creating independent instances.

Have Uniform Interfaces for Data Analysis

Even though multiple instances will exist, applications need to rely on a common set of services, APIs, or a stable schema present in each. The data platform interfaces must evolve very carefully; preserving backward compatibility should be of primary concern. This also greatly helps when you build an application once and can redeploy it multiple times across several data instances.

Encode Process Information

Process information, including the release schedule (milestones and dates), the organization of code bases, team structure, and so on, is very important for providing context—for example, why is there a sudden spike in bugs (more users added) or in code churn (a code integration milestone)? At Microsoft, this information isn't present in one place or tool. It might pop up in project-tracking, code repository, or bug-tracking tools, each needing some level of customization to interpret. Organizations should make plans to embed this information in the system to provide valuable metadata.

Provide Flexibility and Extensibility for Collected Data and Deployed Analytics

Product teams have varying requirements and need the ability to define which data and metadata are stored in the data platform and how they're analyzed; this will allow teams to best reflect their existing processes or enable new ones. For example, one team might decide to add customer user data to its instance of the data store. The system should be able to fully support such extensions.

Allow Dynamic Discovery of the Data Platform's Capabilities by Applications

Each application relying on the data platform needs the ability to identify the capabilities of a particular data platform instance and adjust its function accordingly. For example, some product groups collect and archive historical code coverage data and some choose not to.
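As a sketch of this scale-down behavior (every name and field below is invented, not a CODEMINE API), a client might probe the instance for optional data and degrade when it's missing:

```python
def build_report(catalog):
    """Build a quality report from whatever data this platform
    instance actually carries, degrading gracefully when a source
    (for example, code coverage) is absent."""
    if "changes" not in catalog:          # required data: fail gracefully
        return {"error": "change data unavailable on this instance"}
    report = {"churn": sum(c["edits"] for c in catalog["changes"])}
    if "coverage" in catalog:             # optional data: scale up if present
        report["coverage"] = catalog["coverage"]["statement_pct"]
    return report

# Two instances with different capabilities, one client.
full = {"changes": [{"edits": 3}, {"edits": 5}],
        "coverage": {"statement_pct": 72}}
minimal = {"changes": [{"edits": 2}]}
```

The same client code produces a richer report on the full instance and a smaller one on the minimal instance, instead of failing outright.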


The tools must be able to seamlessly scale their functionality down if code coverage data isn't available for a particular product group.

Support Policies for Security, Privacy, and Audit

The data platform must allow for setting authorization, authentication, privacy, and audit policies to reflect the security requirements and policies of the product group or the data owner. As a general rule, information leaving the data platform will be accessible only to people who were granted permission to the original raw data coming in; however, stricter rules might apply to subsets of data.

Allow Ongoing Support and Maintenance Outside of CODEMINE

In most cases, product teams eventually take over ownership and operations of their respective data platform instances. To ensure a smooth transition, the data platform must adhere to the rules of a well-behaved service defined by operations teams. Resiliency to failure, retry logic, logging of fatal and nonfatal errors, health monitoring, and notifications should be built in.

Host as a Cloud Service

Based on need, economic considerations, load, and availability requirements, carefully evaluate the necessity of hosting the service in the cloud or on traditional servers. Overengineering always leads to wasted effort.

Know the Data Platform Might Not Fulfill All Data Needs

The data platform will be scoped to provide data that's used by several client applications—that is, there must be a level of commonality of inputs for the platform to start serving the data. However, applications can still access other, more specialized data sources and federate with the data platform as their needs dictate.

Innovate at the Right Level of the Stack

Use mature foundational technology and existing programming skills. As much as possible, we try to use operating system, storage, and database platform technology that's mature and already part of Microsoft's stack, to avoid spending time innovating, for example, at the level of raw storage or methods of distributed computation. Instead, we focus on data availability, accurate data acquisition, data cleaning, abstracting representation of the engineering process, and data analysis. In terms of accessing data, we need to ensure any new programming models used are absolutely necessary for the task so we don't create artificial barriers of entry for users of our data.

[Figure 5 omitted: a cycle in which new research in software engineering yields solutions and tools easily deployed in product groups, which solve business problems, produce additional clean data available for further research, and surface further areas of improvement through collaboration.]

FIGURE 5. Cycle of collaboration and data availability.

We've observed that once data is easily accessible, new usage scenarios open up; for instance, CODEMINE is currently being used to understand onboarding processes, optimize individual processes (like build), and optimize overall code flow.

Another significant goal of the CODEMINE platform is enabling future research and analysis. Figure 5 shows the cycle of data availability, in which new research in software engineering spawns new solutions to be deployed in product groups. These solutions solve large business problems and, as the engineering work scales out, enable additional research. This further strengthens the collaboration, opens new avenues for it, and again leads to new research ideas.

As a way to propagate the ideas of data-driven decision making, we recently started a virtual community focused on sharing questions, solutions, methods, and tools related to engineering process data analysis. It is a cross-disciplinary group of product team members and researchers with experience and backgrounds in empirical software engineering, data analysis, and data visualization.


The group's goal is to emphasize data-driven decision making in our teams and to equip product teams with relevant guidelines, methods, and tools. As we realize our goals, the CODEMINE data platform often serves as the common denominator in our community activities.

References

1. N. Nagappan and T. Ball, "Use of Relative Code Churn Measures to Predict System Defect Density," Proc. 27th Int'l Conf. Software Eng. (ICSE 05), ACM, 2005, pp. 284–292.
2. J. Czerwonka et al., "CRANE: Failure Prediction, Change Analysis and Test Prioritization in Practice—Experiences from Windows," Proc. 4th Int'l Conf. Software Testing, Verification and Validation (ICST 11), IEEE CS, 2011, pp. 357–366.
3. C. Bird and T. Zimmermann, "Assessing the Value of Branches with What-If Analysis," Proc. ACM SIGSOFT 20th Int'l Symp. Foundations of Software Eng. (FSE 12), ACM, 2012, pp. 45–54.
4. C. Bird et al., "Putting It All Together: Using Socio-technical Networks to Predict Failures," Proc. 20th Int'l Symp. Software Reliability Eng. (ISSRE 09), IEEE CS, 2009, pp. 109–119.
5. B. Ashok et al., "DebugAdvisor: A Recommender System for Debugging," Proc. 7th Joint Meeting European Software Eng. Conf. and ACM SIGSOFT Symp. Foundations of Software Eng. (ESEC/FSE 09), ACM, 2009, pp. 373–382.
6. A. Mockus, N. Nagappan, and T.T. Dinh-Trong, "Test Coverage and Post-verification Defects: A Multiple Case Study," Proc. 3rd Int'l Symp. Empirical Software Eng. and Measurement (ESEM 09), IEEE CS, 2009, pp. 291–301.
7. L. Williams, G. Kudrjavets, and N. Nagappan, "On the Effectiveness of Unit Test Automation at Microsoft," Proc. 20th Int'l Symp. Software Reliability Eng. (ISSRE 09), IEEE CS, 2009, pp. 81–89.
8. E. Shihab, C. Bird, and T. Zimmermann, "The Effect of Branching Strategies on Software Quality," Proc. Int'l Symp. Empirical Software Eng. and Measurement (ESEM 12), ACM, 2012, pp. 301–310.

ABOUT THE AUTHORS

JACEK CZERWONKA is a principal software architect in the Tools for Software Engineers group at Microsoft. His research interests include software testing and quality assurance, systems-level testing, pairwise and model-based testing, and data-driven decision making on software projects. Czerwonka received an MSc in computer science from the Technical University of Szczecin. Contact him at [email protected].

NACHIAPPAN NAGAPPAN is a principal researcher in the Empirical Software Engineering group at Microsoft Research. His research interests include software analytics, focusing on software reliability, and empirical software engineering processes. Nagappan received a PhD in computer science from North Carolina State University. Contact him at [email protected].

WOLFRAM SCHULTE is an engineering general manager and principal researcher at Microsoft. His research interests include software engineering, focusing on build, modeling, verification, test, and programming languages, ranging from language design to runtimes. Schulte received a PhD in computer science from the Technical University of Berlin. Contact him at [email protected].

BRENDAN MURPHY is a principal researcher at Microsoft Research. His research interests include system dependability, encompassing measurement, reliability, and availability. Contact him at [email protected].
