Data Science: An Action Plan For Expanding The Technical Areas of The Field of Statistics
Abstract
An action plan to enlarge the technical areas of statistics focuses on the data analyst. The plan sets out six
technical areas of work for a university department and advocates a specific allocation of resources devoted
to research in each area and to courses in each area. The value of technical work is judged by the extent
to which it benefits the data analyst, either directly or indirectly. The plan is also applicable to government
research labs and corporate research organizations.
(20%) Theory: foundations of data science; general approaches to models and methods, to computing
with data, to teaching, and to tool evaluation; mathematical investigations of models and methods, of
computing with data, of teaching, and of evaluation.
Universities have been chosen as the setting for implementation because they have been our traditional institutions for innovation, and they can rapidly redirect areas of work by changing what is taught to graduate students of data science. A similar plan, however, would apply to government research labs and corporate research organizations.
Change is needed in the technical areas of data science because critical areas that could be of immense benefit to data analysts lack resources. Computing with data, an area whose importance has recently been recognized by computer scientists, needs far more resources. Multidisciplinary investigations are a major activity in some departments but not in others. Many students of data science go on to teach, yet a course on pedagogy is rare in university curricula. A rigorous evaluation of tools and of their development applies to data science the process-improvement methods that statisticians routinely advocate in other disciplines.
The primary agents for change should be university departments themselves. But it is reasonable for departments to look both to university administrators and to funding agencies for resources to assist in bringing about
the change.
Please read on for more detail and discussion.
2 Multidisciplinary Projects
The single biggest stimulus of new tools and theories of data science is the analysis of data to solve problems
posed in terms of the subject matter under investigation. Creative researchers, faced with problems posed by
data, will respond with a wealth of new ideas that often apply much more widely than the particular data sets
that gave rise to the ideas. If we look back on the history of statistics, for example, we see that the greatest advances have been made by people close to the analysis of data: R. A. Fisher inventing the design of experiments, stimulated by agricultural data; John Tukey inventing numerical spectrum analysis, stimulated by physical science and engineering data; and George Box inventing response surface analysis, based on chemical process data.
Because data are the heat engine for invention in data science, the action plan allocates 25% of resources to
data analysis investigations. This does not mean that every faculty member needs to analyze data. But data analysis needs to be part of the bloodstream of each department; all should be aware of the workings of subject matter investigations and derive stimulus from them.
Students should analyze data. Doing so should be a required, major part of undergraduate and graduate
programs in data science.
A vast array of methods exists for estimation and distribution, but far less effort has gone into methods for model building. True, methods have been extensively developed for certain classes of models; one example is classical linear regression. But many other widely used classes have virtually no methods; one example is random-parameter models (random effects, repeated measures, random coefficients, randomized blocks, etc.). For the data analyst, the imbalance is unwarranted. Often, the model building phase is the salient part of the
analysis, and the estimation and distribution phase is straightforward. Model building is complex because it
requires combining information from exploring the data and information from sources external to the data such
as subject matter theory and other sets of data. Inevitably, specifications must be chosen by an informal process
that balances information from the data, information from sources external to the data, and the desirability of
parsimony. Tools that help specification are much needed by data analysts.
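The balance described above is informal, but simple formal proxies for part of it exist. As a hypothetical illustration only (the simulated data, the candidate polynomial specifications, and the use of AIC are assumptions, not part of this plan), one might compare regression specifications of increasing complexity, trading goodness of fit against parsimony:

```python
import numpy as np

# Hypothetical sketch: choosing among candidate specifications by
# balancing fit against parsimony, here formalized with AIC.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
# Simulated data from a quadratic trend plus noise (an assumption
# made for illustration only).
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 0.1, size=x.size)

def aic(degree):
    """Gaussian AIC (up to an additive constant) of a degree-d polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    n, k = x.size, degree + 1
    return n * np.log(np.mean(resid**2)) + 2 * k

scores = {d: aic(d) for d in range(1, 6)}
best = min(scores, key=scores.get)  # specification with the best fit/parsimony trade-off
```

Nothing in such a criterion replaces the informal weighing of subject matter information that the text describes; it merely formalizes the fit-versus-parsimony part of the trade-off.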
data, and the ability of computer scientists to value such work highly. In 1998, Chambers and the S system for graphics and data analysis won the world's most prestigious software prize, the ACM Software System Award, receiving the citation, "For the S system, which has forever altered how people analyze, visualize, and manipulate data." The esteem in which S is held is made clear by the winners of the award in previous years, such as UNIX, VisiCalc, TeX, SMALLTALK, PostScript, TCP/IP, the World Wide Web, Mosaic, and Tcl/Tk.
5 Pedagogy
A data science department in a university must, of course, concern itself with teaching in its own setting. But
it is vital that resources be spent to study pedagogy and to teach pedagogy. It makes sense that such study
encompass more than the university setting; curricula in elementary and secondary schools, company training
programs, and continuing education programs are important as well. Education in data science does many things.
It trains statisticians. But, just as important, it trains non-statisticians, conveying how valuable data science is for
learning about the world.
6 Tool Evaluation
The outcome of two areas of data science, models and methods and computing with data, is tools for the data analyst. Work in these areas can be made more effective by formal surveys of the practice and needs of data analysts, and by formal study of the process of developing tools. In other words, we need to measure and evaluate data science.
Statisticians are the first to step up to assert that process improvement needs process measurement and an
analysis of the resulting data. Statisticians should turn this methodology inward to study data science. There
should be surveys to determine what methods and models and what computing methods and systems are used by
data analysts in practice today. There should be surveys that poll practicing data analysts to determine perceived
needs for new tools. There should be studies to determine how the process of tool development can be improved.
7 Theory
Theory, both mathematical and non-mathematical, is vital to data science. Theoretical work needs to have a clearly delineated outcome for the data analyst, albeit indirect in many cases. The tools of data science, models and methods together with computational methods and computing systems, link data and theory. New data create the need for new tools. New tools need new theory to guide their development.
Mathematics is an important knowledge base for theory. It is far too important to be taken for granted by requiring the same body of mathematics for all. Students should study mathematics on an as-needed basis. Some will
need the finite mathematics often taught in computer science departments. Others might need measure theory
and functional analysis.
However, all students need an intensive grounding in mathematical probability, but it must be probability in
the sense of random variation and random variables, as opposed to probability in the sense of measure theory
and measurable functions, which is needed by only a few students. The reason goes back to the data. Data vary; that variation typically needs to be thought of probabilistically, and a superb intuition for probabilistic variation becomes the basis for expert modeling of the variation.
Not all theory is mathematical. In fact, the most fundamental theories of data science are distinctly nonmathematical. For example, the fundamentals of the Bayesian theory of inductive inference involve nonmathematical
ideas about combining information from the data and information external to the data. Basic ideas are conveniently expressed by simple mathematical expressions, but mathematics is surely not at issue. Mathematical
theory is a means of investigation and can shed light on all areas of data science including the fundamental
theories.
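The Bayesian idea mentioned above, combining information from the data with information external to the data, can be sketched with a minimal conjugate example (the choice of prior, the data counts, and the function names are illustrative assumptions, not from this plan):

```python
from fractions import Fraction

# Minimal sketch of Bayesian combination of information: a Beta(a, b)
# prior (information external to the data) is updated by binomial
# observations (information from the data). Conjugacy makes the
# arithmetic exact.
def update_beta(a, b, successes, failures):
    """Posterior Beta parameters after observing the given counts."""
    return a + successes, b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return Fraction(a, a + b)

# Prior Beta(2, 2), centered at 1/2; data: 8 successes in 10 trials.
post_a, post_b = update_beta(2, 2, successes=8, failures=2)
posterior_mean = beta_mean(post_a, post_b)  # lies between prior mean and data frequency
```

The simple mathematics here is exactly the point made above: the expressions are elementary, while the fundamental idea, weighing external information against the data, is not mathematical at all.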
8 Outcomes
One outcome of the plan is that computer science joins mathematics as an area of competency for the field of
data science. This enlarges the intellectual foundations. It implies partnerships with computer scientists just as
there are now partnerships with mathematicians. It implies that students develop skills in computing as well as
mathematics. It implies faculty expertise in computing as well as mathematics. The reason goes back to the data
analyst. Today, exciting new frontiers of data science that hold great promise for analysts involve computing
with data.
Another outcome results from extensive involvement in data analysis projects. It carries statistical thinking to
subject matter disciplines. This is vital. A very limited view of data science is that it is practiced by statisticians.
The wide view is that data science is practiced by statisticians and subject matter analysts alike, blurring exactly who is and who is not a statistician. The wide view is the realistic one because all the statisticians in the world would not have time to analyze even a tiny fraction of the databases in the world. The wide view has far greater
promise of a widespread influence of the intellectual content of the field of data science.
Two other outcomes result from making explicit the link between theory and data. First, it guides theory in
important ways. Second, it can substantially increase the domain of support for theory. Departments can delineate the link explicitly in requests for funding for multidisciplinary investigations. This brings the possibility of
funding for the theory of data science from sources that support other subject matter disciplines.
Finally, a successful implementation of the plan will bring new, exciting areas of research and development
to data science.
References
Box, G. E. P. (1976). Science and Statistics. Journal of the American Statistical Association 71, 791–799.
Chambers, J. (1999). Computing with Data: Concepts and Challenges. The American Statistician 53, 73–84.
Cleveland, W. S. (1993). Visualizing Data. Summit, New Jersey, U.S.A.: Hobart Press.
Cobb, G. W. and D. S. Moore (1997). Mathematics, Statistics, and Teaching. The American Mathematical Monthly 104, 801–823.
Cooley, J. W. and J. W. Tukey (1965). An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation 19, 297–301.
Fisher, R. A. (1922). The Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society of London, A 222, 309–368.
Friedman, J. H. (2000). The Role of Statistics in the Data Revolution. Technical report, Statistics Department, Stanford University.
Gelfand, A. E. and A. F. M. Smith (1990). Sampling-Based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association 85, 398–409.
Marquardt, D. W. (1979). Statistical Consulting in Industry. The American Statistician 33, 102–107.
Moore, D. S., G. W. Cobb, J. Garfield, and W. Q. Meeker (1995). Statistics Education Fin de Siècle. The American Statistician 49, 250–260.
Nichols, D. (2000). Future Directions for the Teaching and Learning of Statistics at the Tertiary Level. Technical report, Department of Statistics and Econometrics, Australian National University.
Nolan, D. and T. Speed (1999). Teaching Statistics Theory Through Applications. The American Statistician 53, 370–375.
Smith, A. F. M. (2000). Public Policy Issues as a Route to Statistical Awareness. Technical report, Department of Mathematics, Imperial College.
Tukey, J. W. and M. B. Wilk (1986). Data Analysis and Statistics: An Expository Overview. In L. V. Jones (Ed.), The Collected Works of John W. Tukey, pp. 549–578. New York: Chapman & Hall.
Wegman, E. J. (2000). Visions: The Evolution of Statistics. Technical report, Center for Computational Statistics, George Mason University.