
Statistical Science 2001, Vol. 16, No. 3, 199-231

Statistical Modeling: The Two Cultures


Leo Breiman
Abstract. There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
Leo Breiman is Professor, Department of Statistics, University of California, Berkeley, California 94720-4735 (e-mail: [email protected]).

1. INTRODUCTION

Statistics starts with data. Think of the data as being generated by a black box in which a vector of input variables x (independent variables) go in one side, and on the other side the response variables y come out. Inside the black box, nature functions to associate the predictor variables with the response variables, so the picture is like this:

    y <-- [ nature ] <-- x

There are two goals in analyzing the data:

Prediction. To be able to predict what the responses are going to be to future input variables;

Information. To extract some information about how nature is associating the response variables to the input variables.

There are two different approaches toward these goals:

The Data Modeling Culture

The analysis in this culture starts with assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from

    response variables = f(predictor variables, random noise, parameters)

The values of the parameters are estimated from the data and the model then used for information and/or prediction. Thus the black box is filled in like this:

    y <-- [ linear regression, logistic regression, Cox model ] <-- x

Model validation. Yes-no using goodness-of-fit tests and residual examination.
Estimated culture population. 98% of all statisticians.

The Algorithmic Modeling Culture

The analysis in this culture considers the inside of the box complex and unknown. Their approach is to find a function f(x), an algorithm that operates on x to predict the responses y. Their black box looks like this:

    y <-- [ unknown ] <-- x

    decision trees, neural nets

Model validation. Measured by predictive accuracy.
Estimated culture population. 2% of statisticians, many in other fields.

In this paper I will argue that the focus in the statistical community on data models has:

* Led to irrelevant theory and questionable scientific conclusions;


* Kept statisticians from using more suitable algorithmic models;
* Prevented statisticians from working on exciting new problems.

I will also review some of the interesting new developments in algorithmic modeling in machine learning and look at applications to three data sets.
2. ROAD MAP

It may be revealing to understand how I became a member of the small second culture. After a seven-year stint as an academic probabilist, I resigned and went into full-time free-lance consulting. After thirteen years of consulting I joined the Berkeley Statistics Department in 1980 and have been there since. My experiences as a consultant formed my views about algorithmic modeling. Section 3 describes two of the projects I worked on. These are given to show how my views grew from such problems.

When I returned to the university and began reading statistical journals, the research was distant from what I had done as a consultant. All articles begin and end with data models. My observations about published theoretical research in statistics are in Section 4.

Data modeling has given the statistics field many successes in analyzing data and getting information about the mechanisms producing the data. But there is also misuse leading to questionable conclusions about the underlying mechanism. This is reviewed in Section 5. Following that is a discussion (Section 6) of how the commitment to data modeling has prevented statisticians from entering new scientific and commercial fields where the data being gathered is not suitable for analysis by data models.

In the past fifteen years, the growth in algorithmic modeling applications and methodology has been rapid. It has occurred largely outside statistics in a new community, often called machine learning, that is mostly young computer scientists (Section 7). The advances, particularly over the last five years, have been startling. Three of the most important changes in perception to be learned from these advances are described in Sections 8, 9, and 10, and are associated with the following names:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality - curse or blessing?

Section 11 is titled "Information from a Black Box" and is important in showing that an algorithmic model can produce more and more reliable information about the structure of the relationship between inputs and outputs than data models. This is illustrated using two medical data sets and a genetic data set. A glossary at the end of the paper explains terms that not all statisticians may be familiar with.

3. PROJECTS IN CONSULTING

As a consultant I designed and helped supervise surveys for the Environmental Protection Agency (EPA) and the state and federal court systems. Controlled experiments were designed for the EPA, and I analyzed traffic data for the U.S. Department of Transportation and the California Transportation Department. Most of all, I worked on a diverse set of prediction projects. Here are some examples:

Predicting next-day ozone levels.
Using mass spectra to identify halogen-containing compounds.
Predicting the class of a ship from high altitude radar returns.
Using sonar returns to predict the class of a submarine.
Identity of hand-sent Morse Code.
Toxicity of chemicals.
On-line prediction of the cause of a freeway traffic breakdown.
Speech recognition.
The sources of delay in criminal trials in state court systems.

To understand the nature of these problems and the approaches taken to solve them, I give a fuller description of the first two on the list.
3.1 The Ozone Project

In the mid to late 1960s ozone levels became a serious health problem in the Los Angeles Basin. Three different alert levels were established. At the highest, all government workers were directed not to drive to work, children were kept off playgrounds and outdoor exercise was discouraged.

The major source of ozone at that time was automobile tailpipe emissions. These rose into the low atmosphere and were trapped there by an inversion layer. A complex chemical reaction, aided by sunlight, cooked away and produced ozone two to three hours after the morning commute hours. The alert warnings were issued in the morning, but would be more effective if they could be issued 12 hours in advance. In the mid-1970s, the EPA funded a large effort to see if ozone levels could be accurately predicted 12 hours in advance.

Commuting patterns in the Los Angeles Basin are regular, with the total variation in any given


daylight hour varying only a few percent from one weekday to another. With the total amount of emissions about constant, the resulting ozone levels depend on the meteorology of the preceding days. A large data base was assembled consisting of lower and upper air measurements at U.S. weather stations as far away as Oregon and Arizona, together with hourly readings of surface temperature, humidity, and wind speed at the dozens of air pollution stations in the Basin and nearby areas.

Altogether, there were daily and hourly readings of over 450 meteorological variables for a period of seven years, with corresponding hourly values of ozone and other pollutants in the Basin. Let x be the predictor vector of meteorological variables on the nth day. There are more than 450 variables in x since information several days back is included. Let y be the ozone level on the (n+1)st day. Then the problem was to construct a function f(x) such that for any future day and future predictor variables x for that day, f(x) is an accurate predictor of the next day's ozone level y.

To estimate predictive accuracy, the first five years of data were used as the training set. The last two years were set aside as a test set. The algorithmic modeling methods available in the pre-1980s decades seem primitive now. In this project large linear regressions were run, followed by variable selection. Quadratic terms in, and interactions among, the retained variables were added and variable selection used again to prune the equations. In the end, the project was a failure - the false alarm rate of the final predictor was too high. I have regrets that this project can't be revisited with the tools available today.
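As a rough illustration of the workflow just described (a time-ordered train/test split, a large linear regression with quadratic and interaction terms, and a false alarm rate as the criterion), here is a minimal sketch in Python. The file name, column names, and alert threshold are hypothetical placeholders, and the variable-selection steps used in the actual project are omitted.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical file: daily meteorological readings plus next-day ozone, in time order.
df = pd.read_csv("ozone_basin.csv")                     # placeholder name
X = df.drop(columns="ozone_next_day")
y = df["ozone_next_day"].to_numpy()

# First five of seven years for training, the last two years as the test set.
n_train = int(len(df) * 5 / 7)
X_tr, X_te, y_tr, y_te = X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Add quadratic terms and interactions among the predictors, then fit by least squares.
quad = PolynomialFeatures(degree=2, include_bias=False)
model = LinearRegression().fit(quad.fit_transform(X_tr), y_tr)
pred = model.predict(quad.transform(X_te))

# Criterion used in the project: how often an alert is predicted that does not occur.
ALERT = 0.20                                            # hypothetical threshold
false_alarm_rate = ((pred >= ALERT) & (y_te < ALERT)).mean()
print(false_alarm_rate)
```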
3.2 The Chlorine Project

The EPA samples thousands of compounds a year and tries to determine their potential toxicity. In the mid-1970s, the standard procedure was to measure the mass spectra of the compound and to try to determine its chemical structure from its mass spectra.

Measuring the mass spectra is fast and cheap. But the determination of chemical structure from the mass spectra requires a painstaking examination by a trained chemist. The cost and availability of enough chemists to analyze all of the mass spectra daunted the EPA. Many toxic compounds contain halogens. So the EPA funded a project to determine if the presence of chlorine in a compound could be reliably predicted from its mass spectra.

Mass spectra are produced by bombarding the compound with ions in the presence of a magnetic field. The molecules of the compound split and the lighter fragments are bent more by the magnetic field than the heavier. Then the fragments hit an absorbing strip, with the position of the fragment on the strip determined by the molecular weight of the fragment. The intensity of the exposure at that position measures the frequency of the fragment. The resultant mass spectra has numbers reflecting frequencies of fragments from molecular weight 1 up to the molecular weight of the original compound. The peaks correspond to frequent fragments and there are many zeroes. The available data base consisted of the known chemical structure and mass spectra of 30,000 compounds.

The mass spectrum predictor vector x is of variable dimensionality. Molecular weight in the data base varied from 30 to over 10,000. The variable to be predicted is

y = 1: contains chlorine,
y = 2: does not contain chlorine.

The problem is to construct a function f(x) that is an accurate predictor of y where x is the mass spectrum of the compound.

To measure predictive accuracy the data set was randomly divided into a 25,000 member training set and a 5,000 member test set. Linear discriminant analysis was tried, then quadratic discriminant analysis. These were difficult to adapt to the variable dimensionality. By this time I was thinking about decision trees. The hallmarks of chlorine in mass spectra were researched. This domain knowledge was incorporated into the decision tree algorithm by the design of the set of 1,500 yes-no questions that could be applied to a mass spectra of any dimensionality. The result was a decision tree that gave 95% accuracy on both chlorines and nonchlorines (see Breiman, Friedman, Olshen and Stone, 1984).
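A schematic sketch of the tree-based approach described above, using scikit-learn's CART-style classifier on a random 25,000/5,000 train/test split. The feature files are hypothetical stand-ins for fixed-length encodings of the variable-dimension mass spectra; the 1,500 domain-specific yes-no questions built for the real project are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder arrays: each row encodes one compound's mass spectrum as features,
# each label is 1 (contains chlorine) or 2 (does not contain chlorine).
X = np.load("spectra_features.npy")    # hypothetical file
y = np.load("chlorine_labels.npy")     # hypothetical file

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=25_000, test_size=5_000, random_state=0)

tree = DecisionTreeClassifier(min_samples_leaf=20).fit(X_tr, y_tr)
print("test set accuracy:", tree.score(X_te, y_te))
```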
3.3 Perceptions on Statistical Analysis

As I left consulting to go back to the university, these were the perceptions I had about working with data to find answers to problems:

(a) Focus on finding a good solution - that's what consultants get paid for.
(b) Live with the data before you plunge into modeling.
(c) Search for a model that gives a good solution, either algorithmic or data.
(d) Predictive accuracy on test sets is the criterion for how good the model is.
(e) Computers are an indispensable partner.


4. RETURN TO THE UNIVERSITY

I had one tip about what research in the university was like. A friend of mine, a prominent statistician from the Berkeley Statistics Department, visited me in Los Angeles in the late 1970s. After I described the decision tree method to him, his first question was, "What's the model for the data?"

4.1 Statistical Research

Upon my return, I started reading the Annals of Statistics, the flagship journal of theoretical statistics, and was bemused. Every article started with

    Assume that the data are generated by the following model: ...

followed by mathematics exploring inference, hypothesis testing and asymptotics. There is a wide spectrum of opinion regarding the usefulness of the theory published in the Annals of Statistics to the field of statistics as a science that deals with data. I am at the very low end of the spectrum. Still, there have been some gems that have combined nice theory and significant applications. An example is wavelet theory.

Even in applications, data models are universal. For instance, in the Journal of the American Statistical Association (JASA), virtually every article contains a statement of the form:

    Assume that the data are generated by the following model: ...

I am deeply troubled by the current and past use of data models in applications, where quantitative conclusions are drawn and perhaps policy decisions made.

5. THE USE OF DATA MODELS

Statisticians in applied research consider data modeling as the template for statistical analysis: faced with an applied problem, think of a data model. This enterprise has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature. Then parameters are estimated and conclusions are drawn. But when a model is fit to data to draw quantitative conclusions:

* The conclusions are about the model's mechanism, and not about nature's mechanism.

It follows that:

* If the model is a poor emulation of nature, the conclusions may be wrong.

These truisms have often been ignored in the enthusiasm for fitting data models. A few decades ago, the commitment to data models was such that even simple precautions such as residual analysis or goodness-of-fit tests were not used. The belief in the infallibility of data models was almost religious. It is a strange phenomenon: once a model is made, then it becomes truth and the conclusions from it are infallible.

5.1 An Example

I illustrate with a famous (also infamous) example: assume the data are generated by independent draws from the model

(R)    y = b_0 + Σ_{m=1}^{M} b_m x_m + ε,

where the coefficients {b_m} are to be estimated, ε is N(0, σ²) and σ² is to be estimated. Given that the data are generated this way, elegant tests of hypotheses, confidence intervals, distributions of the residual sum-of-squares and asymptotics can be derived. This made the model attractive in terms of the mathematics involved. This theory was used both by academic statisticians and others to derive significance levels for coefficients on the basis of model (R), with little consideration as to whether the data on hand could have been generated by a linear model. Hundreds, perhaps thousands of articles were published claiming proof of something or other because the coefficient was significant at the 5% level.

Goodness-of-fit was demonstrated mostly by giving the value of the multiple correlation coefficient R², which was often closer to zero than one and which could be overinflated by the use of too many parameters. Besides computing R², nothing else was done to see if the observational data could have been generated by model (R). For instance, a study was done several decades ago by a well-known member of a university statistics department to assess whether there was gender discrimination in the salaries of the faculty. All personnel files were examined and a data base set up which consisted of salary as the response variable and 25 other variables which characterized academic performance; that is, papers published, quality of journals published in, teaching record, evaluations, etc. Gender appears as a binary predictor variable.

A linear regression was carried out on the data and the gender coefficient was significant at the 5% level. That this was strong evidence of sex discrimination was accepted as gospel. The design of the study raises issues that enter before the consideration of a model: can the data gathered


answer the question posed? Is inference justified when your sample is the entire population? Should a data model be used? The deficiencies in analysis occurred because the focus was on the model and not on the problem.

The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions. At the time, there were few objections from the statistical profession about the fairy-tale aspect of the procedure. But, hidden in an elementary textbook, Mosteller and Tukey (1977) discuss many of the fallacies possible in regression and write, "The whole area of guided regression is fraught with intellectual, statistical, computational, and subject matter difficulties."

Even currently, there are only rare published critiques of the uncritical use of data models. One of the few is David Freedman, who examines the use of regression models (1994); the use of path models (1987) and data modeling (1991, 1995). The analysis in these papers is incisive.
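A minimal sketch of the kind of analysis criticized here, on simulated data rather than the salary study: the response is generated by a nonlinear mechanism, yet an ordinary least squares fit of model (R) still yields a coefficient that is typically "significant at the 5% level" and an R² closer to zero than one, and neither output says anything about whether (R) generated the data. The mechanism, sample size, and seed are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 500, 5

# Nature's mechanism is nonlinear; model (R) is a poor emulation of it.
X = rng.normal(size=(n, p))
y = np.sin(2 * X[:, 0]) + X[:, 1] * X[:, 2] + 0.5 * rng.normal(size=n)

# Fit the linear model (R) anyway and read off the usual outputs.
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.pvalues.round(4))    # the first predictor typically comes out "significant"
print(round(fit.rsquared, 3))  # a small R^2, with no check that (R) fits at all
```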
5.2 Problems in Current Data Modeling

Current applied practice is to check the data model fit using goodness-of-fit tests and residual analysis. At one point, some years ago, I set up a simulated regression problem in seven dimensions with a controlled amount of nonlinearity. Standard tests of goodness-of-fit did not reject linearity until the nonlinearity was extreme. Recent theory supports this conclusion. Work by Bickel, Ritov and Stoker (2001) shows that goodness-of-fit tests have very little power unless the direction of the alternative is precisely specified. The implication is that omnibus goodness-of-fit tests, which test in many directions simultaneously, have little power, and will not reject until the lack of fit is extreme.

Furthermore, if the model is tinkered with on the basis of the data, that is, if variables are deleted or nonlinear combinations of the variables added, then goodness-of-fit tests are not applicable. Residual analysis is similarly unreliable. In a discussion after a presentation of residual analysis in a seminar at Berkeley in 1993, William Cleveland, one of the fathers of residual analysis, admitted that it could not uncover lack of fit in more than four to five dimensions. The papers I have read on using residual analysis to check lack of fit are confined to data sets with two or three variables.

With higher dimensions, the interactions between the variables can produce passable residual plots for a variety of models. A residual plot is a goodness-of-fit test, and lacks power in more than a few dimensions. An acceptable residual plot does not imply that the model is a good fit to the data.

There are a variety of ways of analyzing residuals. For instance, Landwehr, Pregibon and Shoemaker (1984, with discussion) gives a detailed analysis of fitting a logistic model to a three-variable data set using various residual plots. But each of the four discussants present other methods for the analysis. One is left with an unsettled sense about the arbitrariness of residual analysis.

Misleading conclusions may follow from data models that pass goodness-of-fit tests and residual checks. But published applications to data often show little care in checking model fit using these methods or any other. For instance, many of the current application articles in JASA that fit data models have very little discussion of how well their model fits the data. The question of how well the model fits the data is of secondary importance compared to the construction of an ingenious stochastic model.

5.3 The Multiplicity of Data Models

One goal of statistics is to extract information from the data about the underlying mechanism producing the data. The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses. For instance, logistic regression in classification is frequently used because it produces a linear combination of the variables with weights that give an indication of the variable importance. The end result is a simple picture of how the prediction variables affect the response variable plus confidence intervals for the weights.

Suppose two statisticians, each one with a different approach to data modeling, fit a model to the same data set. Assume also that each one applies standard goodness-of-fit tests, looks at residuals, etc., and is convinced that their model fits the data. Yet the two models give different pictures of nature's mechanism and lead to different conclusions.

McCullagh and Nelder (1989) write "Data will often point with almost equal emphasis on several possible models, and it is important that the statistician recognize and accept this." Well said, but different models, all of them equally good, may give different pictures of the relation between the predictor and response variables. The question of which one most accurately reflects the data is difficult to resolve. One reason for this multiplicity is that goodness-of-fit tests and other methods for checking fit give a yes-no answer.


With the lack of power of these tests with data having more than a small number of dimensions, there will be a large number of models whose fit is acceptable. There is no way, among the yes-no methods of gauging fit, of determining which is the better model. A few statisticians know this. Mountain and Hsiao (1989) write, "It is difficult to formulate a comprehensive model capable of encompassing all rival models. Furthermore, with the use of finite samples, there are dubious implications with regard to the validity and power of various encompassing tests that rely on asymptotic theory."

Data models in current use may have more damaging results than the publications in the social sciences based on a linear regression analysis. Just as the 5% level of significance became a de facto standard for publication, the Cox model for the analysis of survival times and logistic regression for survive-nonsurvive data have become the de facto standards for publication in medical journals. That different survival models, equally well fitting, could give different conclusions is not an issue.
5.4 Predictive Accuracy

Mosteller and Tukey(1977) wereearlyadvocates ofcross-validation. Theywrite, "Cross-validation is a naturalroutetotheindication thequalityofany of data-derived quantity.... We plan to cross-validate carefully wherever can." we Judging the infrequency estimatesof preby of dictiveaccuracyin JASA, this measure of model fitthat seems naturalto me (and to Mostellerand Tukey)is notnaturalto others. Morepublication of predictive accuracy estimates wouldestablishstandards forcomparison models,a practicethat is of in common machinelearning. OF 6. THE LIMITATIONS DATAMODELS Withthe insistence data models, on multivariate analysistoolsin statistics frozen discriminant are at in analysisand logistic regression classification and multiplelinear regressionin regression.Nobody reallybelievesthat multivariate data is multivariate normal,but that data model occupiesa large numberof pages in every graduate textbookon multivariate statistical analysis. With data gatheredfromuncontrolled observationson complex systems unknown involving physior cal, chemical, biological the mechanisms, a priori assumptionthat nature would generatethe data a modelselectedby the statisthrough parametric tician can result in questionableconclusions that cannotbe substantiated appeal to goodness-of-fit by tests and residual analysis. Usually,simple parametric modelsimposedon data generatedby complex systems, example,medical data, financial for data, resultin a loss ofaccuracy and information as to models(see Section11). compared algorithmic There is an old saying "If all a man has is a hammer, thenevery lookslike a nail."The problem trouble statisticians thatrecently for is someofthe problems have stoppedlookinglike nails. I conjecturethatthe resultofhitting thiswall is thatmore data models are appearingin current complicated published applications. combined Bayesianmethods withMarkovChain MonteCarlo are cropping all up over.This may signify that as data becomesmore the complex, data modelsbecomemorecumbersome and are losingthe advantageofpresenting simple a and clear picture nature'smechanism. of for Approaching problems looking a data model by imposesan a priori straight jacketthatrestricts the to abilityof statisticians deal witha wide range of statisticalproblems. The best available solutionto a data problem mightbe a data model;thenagain it might an algorithmic be model.The data and the problem To guide the solution. solve a widerrange ofdata problems, largerset oftoolsis needed. a

6. THE LIMITATIONS OF DATA MODELS

With the insistence on data models, multivariate analysis tools in statistics are frozen at discriminant analysis and logistic regression in classification and multiple linear regression in regression. Nobody really believes that multivariate data is multivariate normal, but that data model occupies a large number of pages in every graduate textbook on multivariate statistical analysis.

With data gathered from uncontrolled observations on complex systems involving unknown physical, chemical, or biological mechanisms, the a priori assumption that nature would generate the data through a parametric model selected by the statistician can result in questionable conclusions that cannot be substantiated by appeal to goodness-of-fit tests and residual analysis. Usually, simple parametric models imposed on data generated by complex systems, for example, medical data, financial data, result in a loss of accuracy and information as compared to algorithmic models (see Section 11).

There is an old saying "If all a man has is a hammer, then every problem looks like a nail." The trouble for statisticians is that recently some of the problems have stopped looking like nails. I conjecture that the result of hitting this wall is that more complicated data models are appearing in current published applications. Bayesian methods combined with Markov Chain Monte Carlo are cropping up all over. This may signify that as data becomes more complex, the data models become more cumbersome and are losing the advantage of presenting a simple and clear picture of nature's mechanism.

Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems. The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed.


Perhaps the most damaging consequence of the insistence on data models is that statisticians have ruled themselves out of some of the most interesting and challenging statistical problems that have arisen out of the rapidly increasing ability of computers to store and manipulate data. These problems are increasingly present in many fields, both scientific and commercial, and solutions are being found by nonstatisticians.
7. ALGORITHMIC MODELING

Under other names, algorithmic modeling has been used by industrial statisticians for decades. See, for instance, the delightful book "Fitting Equations to Data" (Daniel and Wood, 1971). It has been used by psychometricians and social scientists. Reading a preprint of Gifi's book (1990) many years ago uncovered a kindred spirit. It has made small inroads into the analysis of medical data starting with Richard Olshen's work in the early 1980s. For further work, see Zhang and Singer (1999). Jerome Friedman and Grace Wahba have done pioneering work on the development of algorithmic methods. But the list of statisticians in the algorithmic modeling business is short, and applications to data are seldom seen in the journals. The development of algorithmic methods was taken up by a community outside statistics.

7.1 A New Research Community

In the mid-1980s two powerful new algorithms for fitting data became available: neural nets and decision trees. A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.

Their interests range over many fields that were once considered happy hunting grounds for statisticians and have turned out thousands of interesting research papers related to applications and methodology. A large majority of the papers analyze real data. The criterion for any model is what is the predictive accuracy. An idea of the range of research of this group can be got by looking at the Proceedings of the Neural Information Processing Systems Conference (their main yearly meeting) or at the Machine Learning Journal.

7.2 Theory in Algorithmic Modeling

Data models are rarely used in this community. The approach is that nature produces data in a black box whose insides are complex, mysterious, and, at least, partly unknowable. What is observed is a set of x's that go in and a subsequent set of y's that come out. The problem is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y.

The theory in this field shifts focus from data models to the properties of algorithms. It characterizes their "strength" as predictors, convergence if they are iterative, and what gives them good predictive accuracy. The one assumption made in the theory is that the data is drawn i.i.d. from an unknown multivariate distribution.

There is isolated work in statistics where the focus is on the theory of the algorithms. Grace Wahba's research on smoothing spline algorithms and their applications to data (using cross-validation) is built on theory involving reproducing kernels in Hilbert Space (1990). The final chapter of the CART book (Breiman et al., 1984) contains a proof of the asymptotic convergence of the CART algorithm to the Bayes risk by letting the trees grow as the sample size increases. There are others, but the relative frequency is small.

Theory resulted in a major advance in machine learning. Vladimir Vapnik constructed informative bounds on the generalization error (infinite test set error) of classification algorithms which depend on the "capacity" of the algorithm. These theoretical bounds led to support vector machines (see Vapnik, 1995, 1998) which have proved to be more accurate predictors in classification and regression than neural nets, and are the subject of heated current research (see Section 10).

My last paper "Some infinity theory for tree ensembles" (Breiman, 2000) uses a function space analysis to try and understand the workings of tree ensemble methods. One section has the heading, "My kingdom for some good theory." There is an effective method for forming ensembles known as "boosting," but there isn't any finite sample size theory that tells us why it works so well.

7.3 Recent Lessons

The advances in methodology and increases in predictive accuracy since the mid-1980s that have occurred in the research of machine learning have been phenomenal. There have been particularly exciting developments in the last five years. What has been learned? The three lessons that seem most important to me are:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality - curse or blessing?
8. RASHOMON AND THE MULTIPLICITY OF GOOD MODELS

Rashomon is a wonderful Japanese movie in which four people, from different vantage points, witness an incident in which one person dies and another is supposedly raped. When they come to testify in court, they all report the same facts, but their stories of what happened are very different.

What I call the Rashomon Effect is that there is often a multitude of different descriptions [equations f(x)] in a class of functions giving about the same minimum error rate. The most easily understood example is subset selection in linear regression. Suppose there are 30 variables and we want to find the best five-variable linear regressions. There are about 140,000 five-variable subsets in competition. Usually we pick the one with the lowest residual sum-of-squares (RSS), or, if there is a test set, the lowest test error. But there may be (and generally are) many five-variable equations that have RSS within 1.0% of the lowest RSS (see Breiman, 1996a). The same is true if test set error is being measured. So here are three possible pictures with RSS or test set error within 1.0% of each other:
Picture 1
y = 2.1 + 3.8 x3 - 0.6 x8 + 83.2 x12 - 2.1 x17 + 3.2 x27

Picture 2
y = -8.9 + 4.6 x5 + 0.01 x6 + 12.0 x15 + 17.5 x21 + 0.2 x22

Picture 3
y = -76.7 + 9.3 x2 + 22.0 x7 - 13.2 x8 + 3.4 x11 + 7.2 x28
Which one is better? The problem is that each one tells a different story about which variables are important.
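A small sketch of the Rashomon Effect in subset selection, scaled down from 30-choose-5 so the full enumeration runs quickly: with correlated predictors acting as surrogates for one another, many k-variable least squares equations typically have RSS within 1% of the best one. The data-generating mechanism and sizes are invented for illustration.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, p, k = 100, 12, 4                       # smaller than 30-choose-5 for speed

# Correlated predictors: every column is a noisy copy of the same latent signal.
z = rng.normal(size=n)
X = z[:, None] + rng.normal(scale=0.7, size=(n, p))
y = z + rng.normal(size=n)

def rss(cols):
    # Residual sum-of-squares of the least squares fit on the chosen columns.
    A = np.column_stack([np.ones(n), X[:, list(cols)]])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return resid @ resid

scores = {cols: rss(cols) for cols in combinations(range(p), k)}
best = min(scores.values())
close = [cols for cols, s in scores.items() if s <= 1.01 * best]
print(len(scores), "subsets;", len(close), "within 1% of the lowest RSS")
```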

The Rashomon Effect also occurs with decision trees and neural nets. In my experiments with trees, if the training set is perturbed only slightly, say by removing a random 2-3% of the data, I can get a tree quite different from the original but with almost the same test set error. I once ran a small neural net 100 times on simple three-dimensional data, reselecting the initial weights to be small and random on each run. I found 32 distinct minima, each of which gave a different picture, and having about equal test set error.

This effect is closely connected to what I call instability (Breiman, 1996a) that occurs when there are many different models crowded together that have about the same training or test set error. Then a slight perturbation of the data or in the model construction will cause a skip from one model to another. The two models are close to each other in terms of error, but can be distant in terms of the form of the model.
If, in logistic regression or the Cox model, the common practice of deleting the less important covariates is carried out, then the model becomes unstable: there are too many competing models. Say you are deleting from 15 variables to 4 variables. Perturb the data slightly and you will very possibly get a different four-variable model and a different conclusion about which variables are important. To improve accuracy by weeding out less important covariates you run into the multiplicity problem. The picture of which covariates are important can vary significantly between two models having about the same deviance.

Aggregating over a large set of competing models can reduce the nonuniqueness while improving accuracy. Arena et al. (2000) bagged (see Glossary) logistic regression models on a data base of toxic and nontoxic chemicals where the number of covariates in each model was reduced from 15 to 4 by standard best subset selection. On a test set, the bagged model was significantly more accurate than the single model with four covariates. It is also more stable. This is one possible fix. The multiplicity problem and its effect on conclusions drawn from models needs serious attention.

9. OCCAM AND SIMPLICITY VS. ACCURACY

Occam's Razor, long admired, is usually interpreted to mean that simpler is better. Unfortunately, in prediction, accuracy and simplicity (interpretability) are in conflict. For instance, linear regression gives a fairly interpretable picture of the y, x relation. But its accuracy is usually less than that of the less interpretable neural nets. An example closer to my work involves trees. On interpretability, trees rate an A+.

TABLE 1
Data set descriptions

Data set      Training sample size    Test sample size    Variables    Classes
Cancer                 699                   --                 9          2
Ionosphere             351                   --                34          2
Diabetes               768                   --                 8          2
Glass                  214                   --                 9          6
Soybean                683                   --                35         19
Letters             15,000                5,000                16         26
Satellite            4,435                2,000                36          6
Shuttle             43,500               14,500                 9          7
DNA                  2,000                1,186                60          3
Digit                7,291                2,007               256         10

A project I worked on in the late 1970s was the analysis of delay in criminal cases in state court systems. The Constitution gives the accused the right to a speedy trial. The Center for the State Courts was concerned that in many states, the trials were anything but speedy. It funded a study of the causes of the delay. I visited many states and decided to do the analysis in Colorado, which had an excellent computerized court data system. A wealth of information was extracted and processed.

The dependent variable for each criminal case was the time from arraignment to the time of sentencing. All of the other information in the trial history were the predictor variables. A large decision tree was grown, and I showed it on an overhead and explained it to the assembled Colorado judges. One of the splits was on District N which had a larger delay time than the other districts. I refrained from commenting on this. But as I walked out I heard one judge say to another, "I knew those guys in District N were dragging their feet."

While trees rate an A+ on interpretability, they are good, but not great, predictors. Give them, say, a B on prediction.
9.1 Growing Forests for Prediction

Instead of a single tree predictor, grow a forest of trees on the same data, say 50 or 100. If we are classifying, put the new x down each tree in the forest and get a vote for the predicted class. Let the forest prediction be the class that gets the most votes. There has been a lot of work in the last five years on ways to grow the forest. All of the well-known methods grow the forest by perturbing the training set, growing a tree on the perturbed training set, perturbing the training set again, growing another tree, etc. Some familiar methods are bagging (Breiman, 1996b), boosting (Freund and Schapire, 1996), arcing (Breiman, 1998), and additive logistic regression (Friedman, Hastie and Tibshirani, 1998).

My preferred method to date is random forests. In this approach successive decision trees are grown by introducing a random element into their construction. For example, suppose there are 20 predictor variables. At each node choose several of the 20 at random to use to split the node. Or use a random combination of a random selection of a few variables. This idea appears in Ho (1998), in Amit and Geman (1997) and is developed in Breiman (1999).
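A minimal sketch of the forest idea just described, random feature selection at each node plus a vote over many trees, using scikit-learn's RandomForestClassifier as a stand-in for the paper's own implementation; the synthetic data and settings are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data with 20 predictor variables, as in the example above.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 trees; at each node only a random subset of the 20 features is searched,
# and the forest prediction is the class receiving the most votes.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X_tr, y_tr)
print("forest test accuracy:", forest.score(X_te, y_te))
```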
9.2 Forests Compared to Trees

We compare the performance of single trees (CART) to random forests on a number of small and large data sets, mostly from the UCI repository (ftp.ics.uci.edu/pub/MachineLearningDatabases). A summary of the data sets is given in Table 1.

Table 2 compares the test set error of a single tree to that of the forest. For the five smaller data sets above the line, the test set error was estimated by leaving out a random 10% of the data, then running CART and the forest on the other 90%. The left-out 10% was run down the tree and the forest and the error on this 10% computed for both. This was repeated 100 times and the errors averaged. The larger data sets below the line came with a separate test set. People who have been in the classification field for a while find these increases in accuracy startling. Some errors are halved. Others are reduced by one-third.
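A sketch of the evaluation protocol used for the smaller data sets in Table 2: repeat a random 90/10 split 100 times, fit a single CART-style tree and a forest on the 90%, and average the errors on the left-out 10%. The built-in breast cancer data set is used here only as a convenient stand-in for the UCI sets listed above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)       # stand-in for the UCI cancer set
splits = ShuffleSplit(n_splits=100, test_size=0.10, random_state=0)

err = {"single tree": [], "forest": []}
for tr, te in splits.split(X):
    tree = DecisionTreeClassifier(random_state=0).fit(X[tr], y[tr])
    forest = RandomForestClassifier(n_estimators=100,
                                    random_state=0).fit(X[tr], y[tr])
    err["single tree"].append(1 - tree.score(X[te], y[te]))
    err["forest"].append(1 - forest.score(X[te], y[te]))

for name, errors in err.items():
    print(name, round(100 * np.mean(errors), 1), "% test set error")
```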
TABLE 2
Test set misclassification error (%)

Data set             Forest    Single tree
Breast cancer          2.9         5.9
Ionosphere             5.5        11.2
Diabetes              24.2        25.3
Glass                 22.0        30.4
Soybean                5.7         8.6
Letters                3.4        12.4
Satellite              8.6        14.8
Shuttle (x 10^-3)      7.0        62.0
DNA                    3.9         6.2
Digit                  6.2        17.1


In regression, where the forest prediction is the average over the individual tree predictions, the decreases in mean-squared test set error are similar.

9.3 Random Forests are A+ Predictors

The Statlog Project (Michie, Spiegelhalter and Taylor, 1994) compared 18 different classifiers. Included were neural nets, CART, linear and quadratic discriminant analysis, nearest neighbor, etc. The first four data sets below the line in Table 1 were the only ones used in the Statlog Project that came with separate test sets. In terms of rank of accuracy on these four data sets, the forest comes in 1, 1, 1, 1 for an average rank of 1.0. The next best classifier had an average rank of 7.3.

The fifth data set below the line consists of 16 x 16 pixel gray scale depictions of handwritten ZIP Code numerals. It has been extensively used by AT&T Bell Labs to test a variety of prediction methods. A neural net handcrafted to the data got a test set error of 5.1% vs. 6.2% for a standard run of random forest.
9.4 The Occam Dilemma

So forests are A+ predictors. But their mechanism for producing a prediction is difficult to understand. Trying to delve into the tangled web that generated a plurality vote from 100 trees is a Herculean task. So on interpretability, they rate an F. Which brings us to the Occam dilemma:

* Accuracy generally requires more complex prediction methods. Simple and interpretable functions do not make the most accurate predictors.

Using complex predictors may be unpleasant, but the soundest path is to go for predictive accuracy first, then try to understand why. In fact, Section 10 points out that from a goal-oriented statistical viewpoint, there is no Occam's dilemma. (For more on Occam's Razor see Domingos, 1998, 1999.)

10. BELLMAN AND THE CURSE OF DIMENSIONALITY

The title of this section refers to Richard Bellman's famous phrase, "the curse of dimensionality." For decades, the first step in prediction methodology was to avoid the curse. If there were too many prediction variables, the recipe was to find a few features (functions of the predictor variables) that "contain most of the information" and then use these features to replace the original variables. In procedures common in statistics such as regression, logistic regression and survival models the advised practice is to use variable deletion to reduce the dimensionality. The published advice was that high dimensionality is dangerous. For instance, a well-regarded book on pattern recognition (Meisel, 1972) states "the features... must be relatively few in number." But recent work has shown that dimensionality can be a blessing.

10.1 Digging It Out in Small Pieces

Reducing dimensionality reduces the amount of information available for prediction. The more predictor variables, the more information. There is also information in various combinations of the predictor variables. Let's try going in the opposite direction: instead of reducing dimensionality, increase it by adding many functions of the predictor variables. There may now be thousands of features. Each potentially contains a small amount of information. The problem is how to extract and put together these little pieces of information. There are two outstanding examples of work in this direction, the Shape Recognition Forest (Y. Amit and D. Geman, 1997) and Support Vector Machines (V. Vapnik, 1995, 1998).

10.2 The Shape Recognition Forest

In 1992, the National Institute of Standards and Technology (NIST) set up a competition for machine algorithms to read handwritten numerals. They put together a large set of pixel pictures of handwritten numbers (223,000) written by over 2,000 individuals. The competition attracted wide interest, and diverse approaches were tried.

The Amit-Geman approach defined many thousands of small geometric features in a hierarchical assembly. Shallow trees are grown, such that at each node, 100 features are chosen at random from the appropriate level of the hierarchy, and the optimal split of the node based on the selected features is found. When a pixel picture of a number is dropped down a single tree, the terminal node it lands in gives probability estimates p0, ..., p9 that it represents numbers 0, 1, ..., 9. Over 1,000 trees are grown, the probabilities averaged over this forest, and the predicted number is assigned to the largest averaged probability.

Using a 100,000 example training set and a 50,000 test set, the Amit-Geman method gives a test set error of 0.7%, close to the limits of human error.
10.3 Support Vector Machines

Suppose there is two-class data having prediction vectors in M-dimensional Euclidean space. The prediction vectors for class #1 are {x(1)} and those for


class #2 are {x(2)}. If these two sets of vectors can be separated by a hyperplane then there is an optimal separating hyperplane. "Optimal" is defined as meaning that the distance of the hyperplane to any prediction vector is maximal (see below).

The set of vectors in {x(1)} and in {x(2)} that achieve the minimum distance to the optimal separating hyperplane are called the support vectors. Their coordinates determine the equation of the hyperplane. Vapnik (1995) showed that if a separating hyperplane exists, then the optimal separating hyperplane has low generalization error (see Glossary).
[Figure: the optimal separating hyperplane, with the support vectors lying closest to it.]
In two-class data, separability by a hyperplane does not often occur. However, let us increase the dimensionality by adding as additional predictor variables all quadratic monomials in the original predictor variables; that is, all terms of the form x_{m1} x_{m2}. A hyperplane in the original variables plus quadratic monomials in the original variables is a more complex creature. The possibility of separation is greater. If no separation occurs, add cubic monomials as input features. If there are originally 30 predictor variables, then there are about 40,000 features if monomials up to the fourth degree are added.

The higher the dimensionality of the set of features, the more likely it is that separation occurs. In the ZIP Code data set, separation occurs with fourth degree monomials added. The test set error is 4.1%. Using a large subset of the NIST data base as a training set, separation also occurred after adding up to fourth degree monomials and gave a test set error rate of 1.1%.

Separation can always be had by raising the dimensionality high enough. But if the separating hyperplane becomes too complex, the generalization error becomes large. An elegant theorem (Vapnik, 1995) gives this bound for the expected generalization error:

    Ex(GE) <= Ex(number of support vectors) / (N - 1),

where N is the sample size and the expectation is over all training sets of size N drawn from the same underlying distribution as the original training set.
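A sketch of these ideas with scikit-learn's support vector classifier: a polynomial kernel of degree 4 plays the role of adding monomials up to fourth degree, the fitted support vectors can be counted, and the ratio of support vectors to N - 1 can be printed alongside the test error for comparison with the bound above. The synthetic data are placeholders, and the soft-margin penalty C means this is the penalized variant rather than the strictly separable case.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data with 30 original predictor variables.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Polynomial kernel of degree 4: implicitly works in the space of monomials
# of the original variables up to fourth degree.
svm = SVC(kernel="poly", degree=4, C=1.0).fit(X_tr, y_tr)

n_sv = svm.support_vectors_.shape[0]
print("support vectors:", n_sv)
print("n_sv / (N - 1):", round(n_sv / (len(X_tr) - 1), 3))
print("test set error:", round(1 - svm.score(X_te, y_te), 3))
```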

The number of support vectors increases with the dimensionality of the feature space. If this number becomes too large, the separating hyperplane will not give low generalization error. If separation cannot be realized with a relatively small number of support vectors, there is another version of support vector machines that defines optimality by adding a penalty term for the vectors on the wrong side of the hyperplane.

Some ingenious algorithms make finding the optimal separating hyperplane computationally feasible. These devices reduce the search to a solution of a quadratic programming problem with linear inequality constraints that are of the order of the number N of cases, independent of the dimension of the feature space. Methods tailored to this particular problem produce speed-ups of an order of magnitude over standard methods for solving quadratic programming problems.

Support vector machines can also be used to provide accurate predictions in other areas (e.g., regression). It is an exciting idea that gives excellent performance and is beginning to supplant the use of neural nets. A readable introduction is in Cristianini and Shawe-Taylor (2000).

11. INFORMATION FROM A BLACK BOX

The dilemma posed in the last section is that the models that best emulate nature in terms of predictive accuracy are also the most complex and inscrutable. But this dilemma can be resolved by realizing the wrong question is being asked. Nature forms the outputs y from the inputs x by means of a black box with complex and unknown interior:

    y <-- [ nature ] <-- x

Current accurate prediction methods are also complex black boxes:

    y <-- [ neural nets, forests, support vectors ] <-- x
So we are facing two black boxes, where ours seems only slightly less inscrutable than nature's. In data generated by medical experiments, ensembles of predictors can give cross-validated error rates significantly lower than logistic regression. My biostatistician friends tell me, "Doctors can interpret logistic regression." There is no way they can interpret a black box containing fifty trees hooked together. In a choice between accuracy and interpretability, they'll go for interpretability.

Framing the question as the choice between accuracy and interpretability is an incorrect interpretation of what the goal of a statistical analysis is.


The point of a model is to get useful information about the relation between the response and predictor variables. Interpretability is a way of getting information. But a model does not have to be simple to provide reliable information about the relation between predictor and response variables; neither does it have to be a data model.

* The goal is not interpretability, but accurate information.

The following three examples illustrate this point. The first shows that random forests applied to a medical data set can give more reliable information about covariate strengths than logistic regression. The second shows that it can give interesting information that could not be revealed by a logistic regression. The third is an application to a microarray data set where it is difficult to conceive of a data model that would uncover similar information.

11.1 Example I: Variable Importance in a Survival Data Set

The data set contains survival or nonsurvival of 155 hepatitis patients with 19 covariates. It is available at ftp.ics.uci.edu/pub/MachineLearningDatabases and was contributed by Gail Gong. The description is in a file called hepatitis.names. The data set has been previously analyzed by Diaconis and Efron (1983), and Cestnik, Konenenko and Bratko (1987). The lowest reported error rate to date, 17%, is in the latter paper.

Diaconis and Efron refer to work by Peter Gregory of the Stanford Medical School who analyzed this data and concluded that the important variables were numbers 6, 12, 14, 19 and reports an estimated 20% predictive accuracy. The variables were reduced in two stages. The first was by informal data analysis. The second refers to a more formal (unspecified) statistical procedure which I assume was logistic regression.

Efron and Diaconis drew 500 bootstrap samples from the original data set and used a similar procedure to isolate the important variables in each bootstrapped data set. The authors comment, "Of the four variables originally selected not one was selected in more than 60 percent of the samples. Hence the variables identified in the original analysis cannot be taken too seriously." We will come back to this conclusion later.

Logistic Regression

The predictive error rate for logistic regression on the hepatitis data set is 17.4%. This was evaluated by doing 100 runs, each time leaving out a randomly selected 10% of the data as a test set, and then averaging over the test set errors.

Usually, the initial evaluation of which variables are important is based on examining the absolute values of the coefficients of the variables in the logistic regression divided by their standard deviations. Figure 1 is a plot of these values.

The conclusion from looking at the standardized coefficients is that variables 7 and 11 are the most important covariates. When logistic regression is run using only these two variables, the cross-validated error rate rises to 22.9%.

Another way to find important variables is to run a best subsets search which, for any value k, finds the subset of k variables having lowest deviance. This procedure raises the problems of instability and multiplicity of models (see Section 7.1). There are about 4,000 subsets containing four variables. Of these, there are almost certainly a substantial number that have deviance close to the minimum and give different pictures of what the underlying mechanism is.
FIG. 1. Standardized coefficients, logistic regression.

FIG. 2. Variable importance, random forest.

Random Forests

The random forests predictive error rate, evaluated by averaging errors over 100 runs, each time leaving out 10% of the data as a test set, is 12.3%, almost a 30% reduction from the logistic regression error.

Random forests consists of a large number of randomly constructed trees, each voting for a class. Similar to bagging (Breiman, 1996), a bootstrap sample of the training set is used to construct each tree. A random selection of the input variables is searched to find the best split for each node.

To measure the importance of the mth variable, the values of the mth variable are randomly permuted in all of the cases left out in the current bootstrap sample. Then these cases are run down the current tree and their classification noted. At the end of a run consisting of growing many trees, the percent increase in misclassification rate due to noising up each variable is computed. This is the measure of variable importance that is shown in Figure 2.

Random forests singles out two variables, the 12th and the 17th, as being important. As a verification both variables were run in random forests, individually and together. The test set error rates over 100 replications were 14.3% each. Running both together did no better. We conclude that virtually all of the predictive capability is provided by a single variable, either 12 or 17.

To explore the interaction between 12 and 17 a bit further, at the end of a random forest run using all variables, the output includes the estimated value of the probability of each class vs. the case number. This information is used to get plots of the variable values (normalized to mean zero and standard deviation one) vs. the probability of death. The variable values are smoothed using a weighted linear regression smoother. The results are in Figure 3 for variables 12 and 17.
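A sketch of the permutation-based importance measure described above, using scikit-learn's random forest and its permutation_importance helper on a held-out 10% rather than Breiman's out-of-bag cases; the hepatitis arrays are hypothetical placeholders for the UCI data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the 155 x 19 hepatitis covariates and outcome.
X = np.load("hepatitis_X.npy")     # hypothetical file
y = np.load("hepatitis_y.npy")     # hypothetical file

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, random_state=0)
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

# Permute one variable at a time in the held-out cases and record how much the
# error increases; large increases mark the important variables.
imp = permutation_importance(forest, X_te, y_te, n_repeats=50, random_state=0)
for m in np.argsort(imp.importances_mean)[::-1][:5]:
    print("variable", m + 1, round(imp.importances_mean[m], 3))
```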

FIG. 3. Variables 12 and 17 vs. probability of class #1.


FIG. 4. Variable importance, Bupa data.

The graphs of the variable values vs. class death probability are almost linear and similar. The two variables turn out to be highly correlated. Thinking that this might have affected the logistic regression results, it was run again with one or the other of these two variables deleted. There was little change.

Out of curiosity, I evaluated variable importance in logistic regression in the same way that I did in random forests, by permuting variable values in the 10% test set and computing how much that increased the test set error. Not much help: variables 12 and 17 were not among the 3 variables ranked as most important. In partial verification of the importance of 12 and 17, I tried them separately as single variables in logistic regression. Variable 12 gave a 15.7% error rate; variable 17 came in at 19.3%.

To go back to the original Diaconis-Efron analysis, the problem is clear. Variables 12 and 17 are surrogates for each other. If one of them appears important in a model built on a bootstrap sample, the other does not. So each one's frequency of occurrence is automatically less than 50%. The paper lists the variables selected in ten of the samples. Either 12 or 17 appear in seven of the ten.
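The permutation check applied to the logistic regression is the same loop as in the earlier sketch, with the forest replaced by the fitted logistic model; a hypothetical variant, again assuming X and y as before:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logit_permutation_importance(X_train, y_train, X_test, y_test, seed=0):
    """Increase in test-set error when each covariate of a fitted logistic regression is permuted."""
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    base_err = np.mean(model.predict(X_test) != y_test)
    increases = []
    for m in range(X_test.shape[1]):
        X_perm = X_test.copy()
        X_perm[:, m] = rng.permutation(X_perm[:, m])
        increases.append(np.mean(model.predict(X_perm) != y_test) - base_err)
    return np.array(increases)
```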
11.2 Example II: Clustering in Medical Data

The Bupa liver data set is a two-class biomedical data set also available at ftp.ics.uci.edu/pub/MachineLearningDatabases. The covariates are:

1. mcv - mean corpuscular volume
2. alkphos - alkaline phosphotase
3. sgpt - alamine aminotransferase
4. sgot - aspartate aminotransferase
5. gammagt - gamma-glutamyl transpeptidase
6. drinks - half-pint equivalents of alcoholic beverage drunk per day

The first five attributes are the results of blood tests thought to be related to liver functioning. The 345 patients are classified into two classes by the severity of their liver malfunctioning. Class two is severe malfunctioning.

FIG. 5. Cluster averages, Bupa data. [Plot not reproduced; plotted series: cluster 1 of class 2, cluster 2 of class 2, and class 1; x-axis: variable.]


In a random forests run, the misclassification error rate is 28%. The variable importance given by random forests is in Figure 4. Blood tests 3 and 5 are the most important, followed by test 4. Random forests also outputs an intrinsic similarity measure which can be used to cluster. When this was applied, two clusters were discovered in class two. The average of each variable is computed and plotted in each of these clusters in Figure 5.

An interesting facet emerges. The class two subjects consist of two distinct groups: those that have high scores on blood tests 3, 4, and 5 and those that have low scores on those tests.
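The intrinsic similarity used here is the fraction of trees in which two cases fall in the same terminal node. The sketch below is a reconstruction of the general idea using scikit-learn and scipy, not Breiman's own code; the choice of two clusters and of average linkage are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.ensemble import RandomForestClassifier

def rf_proximity(forest, X):
    """proximity[i, j] = fraction of trees in which cases i and j share a terminal node."""
    leaves = forest.apply(X)                   # (n_cases, n_trees) matrix of leaf indices
    prox = np.zeros((X.shape[0], X.shape[0]))
    for t in range(leaves.shape[1]):
        prox += leaves[:, t, None] == leaves[None, :, t]
    return prox / leaves.shape[1]

# Hypothetical usage on the Bupa covariates X (345 x 6) and labels y (1 or 2):
# forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
# prox = rf_proximity(forest, X[y == 2])                 # similarity among class-two cases
# dist = 1.0 - prox                                      # similarity -> distance
# Z = linkage(dist[np.triu_indices_from(dist, k=1)], method="average")
# clusters = fcluster(Z, t=2, criterion="maxclust")      # split class two into two clusters
# Per-cluster variable averages, as in Figure 5, are then column means within each cluster.
```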
11.3 Example III: Microarray Data

Random forests was run on a microarray lymphoma data set with three classes, sample size of 81 and 4,682 variables (genes) without any variable selection [for more information about this data set, see Dudoit, Fridlyand and Speed (2000)]. The error rate was low. What was also interesting from a scientific viewpoint was an estimate of the importance of each of the 4,682 gene expressions.

The graph in Figure 6 was produced by a run of random forests. This result is consistent with assessments of variable importance made using other algorithmic methods, but appears to have sharper detail.
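In outline, such a run is a single fit with no variable selection. The sketch below uses a synthetic matrix of the stated shape (81 x 4,682), since the lymphoma expressions are not reproduced here, and scikit-learn's impurity-based importance rather than the paper's permutation measure; it only illustrates that the forest handles many more variables than cases and returns one score per gene.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_cases, n_genes = 81, 4682                     # shape reported for the lymphoma data
X = rng.normal(size=(n_cases, n_genes))         # placeholder expression matrix (synthetic)
y = rng.integers(0, 3, size=n_cases)            # three classes

forest = RandomForestClassifier(n_estimators=1000, random_state=0)
forest.fit(X, y)                                # all 4,682 genes used, no variable selection
importance = forest.feature_importances_        # one score per gene, as plotted in Figure 6
top_genes = np.argsort(importance)[::-1][:20]   # genes ranked by importance
```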
11.4 Remarks about the Examples

The examples show that much information is available from an algorithmic model. Friedman (1999) derives similar variable information from a different way of constructing a forest. The similarity is that they are both built as ways to give low predictive error.

There are 32 deaths and 123 survivors in the hepatitis data set. Calling everyone a survivor gives a baseline error rate of 20.6%. Logistic regression lowers this to 17.4%. It is not extracting much useful information from the data, which may explain its inability to find the important variables. Its weakness might have been unknown and the variable importances accepted at face value if its predictive accuracy had not been evaluated.

Random forests is also capable of discovering important aspects of the data that standard data models cannot uncover. The potentially interesting clustering of class two patients in Example II is an illustration. The standard procedure when fitting data models such as logistic regression is to delete variables; to quote from Diaconis and Efron (1983) again, "...statistical experience suggests that it is unwise to fit a model that depends on 19 variables with only 155 data points available." Newer methods in machine learning thrive on variables; the more the better. For instance, random forests does not overfit. It gives excellent accuracy on the lymphoma data set of Example III, which has over 4,600 variables, with no variable deletion, and is capable of extracting variable importance information from the data.

FIG. 6. Microarray variable importance. [Plot not reproduced; x-axis: variable number.]


These examples illustrate the following points:

* Higher predictive accuracy is associated with more reliable information about the underlying data mechanism. Weak predictive accuracy can lead to questionable conclusions.

* Algorithmic models can give better predictive accuracy than data models, and provide better information about the underlying mechanism.
12. FINAL REMARKS

The goals in statistics are to use data to predict and to get information about the underlying data mechanism. Nowhere is it written on a stone tablet what kind of model should be used to solve problems involving data. To make my position clear, I am not against data models per se. In some situations they are the most appropriate way to solve the problem. But the emphasis needs to be on the problem and on the data.

Unfortunately, our field has a vested interest in data models, come hell or high water. For instance, see Dempster's (1998) paper on modeling. His position on the 1990 Census adjustment controversy is particularly interesting. He admits that he doesn't know much about the data or the details, but argues that the problem can be solved by a strong dose of modeling. That more modeling can make error-ridden data accurate seems highly unlikely to me.

Terabytes of data are pouring into computers from many sources, both scientific and commercial, and there is a need to analyze and understand the data. For instance, data is being generated at an awesome rate by telescopes and radio telescopes scanning the skies. Images containing millions of stellar objects are stored on tape or disk. Astronomers need automated ways to scan their data to find certain types of stellar objects or novel objects. This is a fascinating enterprise, and I doubt if data models are applicable. Yet I would enter this in my ledger as a statistical problem.

The analysis of genetic data is one of the most challenging and interesting statistical problems around. Microarray data, like that analyzed in Section 11.3, can lead to significant advances in understanding genetic effects. But the analysis of variable importance in Section 11.3 would be difficult to do accurately using a stochastic data model.

Problems such as stellar recognition or analysis of gene expression data could be high adventure for statisticians. But it requires that they focus on solving the problem instead of asking what data model they can create. The best solution could be an algorithmic model, or maybe a data model, or maybe a combination. But the trick to being a scientist is to be open to using a wide variety of tools.

The roots of statistics, as in science, lie in working with data and checking theory against data. I hope in this century our field will return to its roots. There are signs that this hope is not illusory. Over the last ten years, there has been a noticeable move toward statistical work on real world problems and reaching out by statisticians toward collaborative work with other disciplines. I believe this trend will continue and, in fact, has to continue if we are to survive as an energetic and creative field.

GLOSSARY

Since some of the terms used in this paper may not be familiar to all statisticians, I append some definitions.

Infinite test set error. Assume a loss function L(y, ŷ) that is a measure of the error when y is the true response and ŷ the predicted response. In classification, the usual loss is 1 if y ≠ ŷ and zero if y = ŷ. In regression, the usual loss is (y − ŷ)². Given a set of data (training set) consisting of {(y_n, x_n), n = 1, 2, ..., N}, use it to construct a predictor function φ(x) of y. Assume that the training set is i.i.d. drawn from the distribution of the random vector Y, X. The infinite test set error is E(L(Y, φ(X))). This is called the generalization error in machine learning. The generalization error is estimated either by setting aside a part of the data as a test set or by cross-validation.

Predictive accuracy. This refers to the size of the estimated generalization error. Good predictive accuracy means low estimated error.

Trees and nodes. This terminology refers to decision trees as described in the Breiman et al. book (1984).

Dropping an x down a tree. When a vector of predictor variables is "dropped" down a tree, at each intermediate node it has instructions whether to go left or right depending on the coordinates of x. It stops at a terminal node and is assigned the prediction given by that node.

Bagging. An acronym for "bootstrap aggregating." Start with an algorithm such that given any training set, the algorithm produces a prediction function φ(x). The algorithm can be a decision tree construction, logistic regression with variable deletion, etc. Take a bootstrap sample from the training set and use this bootstrap training set to construct the predictor φ_1(x). Take another bootstrap sample and using this second training set construct the predictor φ_2(x). Continue this way for K steps. In regression, average all of the {φ_k(x)} to get the bagged predictor. In classification, that class which has the plurality vote of the {φ_k(x)} is the bagged predictor. Bagging has been shown effective in variance reduction (Breiman, 1996b).
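As a concrete reading of the bagging definition above, here is a minimal sketch for regression; the tree learner from scikit-learn stands in for "any algorithm" and is an arbitrary choice, not part of the definition.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predictor(X_train, y_train, X_new, K=100, seed=0):
    """Average the predictions of K trees, each grown on a bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    predictions = np.zeros((K, X_new.shape[0]))
    for k in range(K):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample, drawn with replacement
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        predictions[k] = tree.predict(X_new)             # the k-th predictor phi_k(x)
    return predictions.mean(axis=0)                      # the bagged predictor
```

For classification, the average over k is replaced by a plurality vote over the K predicted classes.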


Boosting. This is a more complex way of forming an ensemble of predictors in classification than bagging (Freund and Schapire, 1996). It uses no randomization but proceeds by altering the weights on the training set. Its performance in terms of low prediction error is excellent (for details see Breiman, 1998).
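To make the reweighting idea concrete, here is an AdaBoost-style sketch in the spirit of Freund and Schapire (1996), with a one-split tree (stump) as the base classifier and labels coded -1/+1; it is illustrative rather than the exact algorithm studied in the references.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=50):
    """Boosting by reweighting: y must be coded -1/+1. Returns the stumps and their weights."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                       # start with uniform case weights
    stumps, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()        # weighted training error
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1.0 - err) / err)
        w *= np.exp(-alpha * y * pred)            # up-weight the misclassified cases
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def boosted_predict(stumps, alphas, X_new):
    votes = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.sign(votes)
```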
ACKNOWLEDGMENTS

Many of my ideas about data modeling were formed in three decades of conversations with my old friend and collaborator, Jerome Friedman. Conversations with Richard Olshen about the Cox model and its use in biostatistics helped me understand the background. I am also indebted to William Meisel, who headed some of the prediction projects I consulted on and helped me make the transition from probability theory to algorithms, and to Charles Stone for illuminating conversations about the nature of statistics and science. I'm grateful also for the comments of the editor, Leon Gleser, which prompted a major rewrite of the first draft of this manuscript and resulted in a different and better paper.

REFERENCES

AMIT, Y. and GEMAN, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation 9 1545-1588.
ARENA, C., SUSSMAN, N., CHIANG, K., MAZUMDAR, S., MACINA, O. and LI, W. (2000). Bagging structure-activity relationships: a simulation study for assessing misclassification rates. Presented at the Second Indo-U.S. Workshop on Mathematical Chemistry, Duluth, MI. (Available at [email protected]).
BICKEL, P., RITOV, Y. and STOKER, T. (2001). Tailor-made tests for goodness of fit for semiparametric hypotheses. Unpublished manuscript.
BREIMAN, L. (1996a). The heuristics of instability in model selection. Ann. Statist. 24 2350-2381.
BREIMAN, L. (1996b). Bagging predictors. Machine Learning 26 123-140.
BREIMAN, L. (1998). Arcing classifiers. Discussion paper, Ann. Statist. 26 801-824.
BREIMAN, L. (2000). Some infinity theory for tree ensembles. (Available at www.stat.berkeley.edu/technical reports).
BREIMAN, L. (2001). Random forests. Machine Learning 45 5-32.
BREIMAN, L. and FRIEDMAN, J. (1985). Estimating optimal transformations in multiple regression and correlation. J. Amer. Statist. Assoc. 80 580-619.
BREIMAN, L., FRIEDMAN, J., OLSHEN, R. and STONE, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
CRISTIANINI, N. and SHAWE-TAYLOR, J. (2000). An Introduction to Support Vector Machines. Cambridge Univ. Press.
DANIEL, C. and WOOD, F. (1971). Fitting Equations to Data. Wiley, New York.
DEMPSTER, A. (1998). Logicist statistics 1. Models and modeling. Statist. Sci. 13 248-276.
DIACONIS, P. and EFRON, B. (1983). Computer intensive methods in statistics. Scientific American 248 116-131.
DOMINGOS, P. (1998). Occam's two razors: the sharp and the blunt. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (R. Agrawal and P. Stolorz, eds.) 37-43. AAAI Press, Menlo Park, CA.
DOMINGOS, P. (1999). The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery 3 409-425.
DUDOIT, S., FRIDLYAND, J. and SPEED, T. (2000). Comparison of discrimination methods for the classification of tumors. (Available at www.stat.berkeley.edu/technical reports).
FREEDMAN, D. (1987). As others see us: a case study in path analysis (with discussion). J. Ed. Statist. 12 101-223.
FREEDMAN, D. (1991). Statistical models and shoe leather. Sociological Methodology 1991 (with discussion) 291-358.
FREEDMAN, D. (1991). Some issues in the foundations of statistics. Foundations of Science 1 19-83.
FREEDMAN, D. (1994). From association to causation via regression. Adv. in Appl. Math. 18 59-110.
FREUND, Y. and SCHAPIRE, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148-156. Morgan Kaufmann, San Francisco.
FRIEDMAN, J. (1999). Greedy predictive approximation: a gradient boosting machine. Technical report, Dept. Statistics, Stanford Univ.
FRIEDMAN, J., HASTIE, T. and TIBSHIRANI, R. (2000). Additive logistic regression: a statistical view of boosting. Ann. Statist. 28 337-407.
GIFI, A. (1990). Nonlinear Multivariate Analysis. Wiley, New York.
HO, T. K. (1998). The random subspace method for constructing decision forests. IEEE Trans. Pattern Analysis and Machine Intelligence 20 832-844.
LANDSWHER, J., PREIBON, D. and SHOEMAKER, A. (1984). Graphical methods for assessing logistic regression models (with discussion). J. Amer. Statist. Assoc. 79 61-83.
MCCULLAGH, P. and NELDER, J. (1989). Generalized Linear Models. Chapman and Hall, London.
MEISEL, W. (1972). Computer-Oriented Approaches to Pattern Recognition. Academic Press, New York.
MICHIE, D., SPIEGELHALTER, D. and TAYLOR, C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York.
MOSTELLER, F. and TUKEY, J. (1977). Data Analysis and Regression. Addison-Wesley, Redding, MA.
MOUNTAIN, D. and HSIAO, C. (1989). A combined structural and flexible functional approach for modeling energy substitution. J. Amer. Statist. Assoc. 84 76-87.
STONE, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. B 36 111-147.
VAPNIK, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
VAPNIK, V. (1998). Statistical Learning Theory. Wiley, New York.
WAHBA, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
ZHANG, H. and SINGER, B. (1999). Recursive Partitioning in the Health Sciences. Springer, New York.