Statistical Science
2001, Vol. 16, No. 3, 199-231

Statistical Modeling: The Two Cultures


Leo Breiman

Abstract. There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

Leo Breiman is Professor, Department of Statistics, University of California, Berkeley, California 94720-4735 (e-mail: leo@stat.berkeley.edu).

1. INTRODUCTION

Statistics starts with data. Think of the data as being generated by a black box in which a vector of input variables x (independent variables) go in one side, and on the other side the response variables y come out. Inside the black box, nature functions to associate the predictor variables with the response variables, so the picture is like this:

    y ← [ nature ] ← x

There are two goals in analyzing the data:

Prediction. To be able to predict what the responses are going to be to future input variables;
Information. To extract some information about how nature is associating the response variables to the input variables.

There are two different approaches toward these goals:

The Data Modeling Culture

The analysis in this culture starts with assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from

    response variables = f(predictor variables, random noise, parameters)

The values of the parameters are estimated from the data and the model then used for information and/or prediction. Thus the black box is filled in like this:

    y ← [ linear regression, logistic regression, Cox model ] ← x

Model validation. Yes-no using goodness-of-fit tests and residual examination.
Estimated culture population. 98% of all statisticians.

The Algorithmic Modeling Culture

The analysis in this culture considers the inside of the box complex and unknown. Their approach is to find a function f(x), an algorithm that operates on x to predict the responses y. Their black box looks like this:

    y ← [ unknown ] ← x
         decision trees
         neural nets

Model validation. Measured by predictive accuracy.
Estimated culture population. 2% of statisticians, many in other fields.

In this paper I will argue that the focus in the statistical community on data models has:

• Led to irrelevant theory and questionable scientific conclusions;
• Kept statisticians from using more suitable algorithmic models;
• Prevented statisticians from working on exciting new problems;

I will also review some of the interesting new developments in algorithmic modeling in machine learning and look at applications to three data sets.

2. ROAD MAP

It may be revealing to understand how I became a member of the small second culture. After a seven-year stint as an academic probabilist, I resigned and went into full-time free-lance consulting. After thirteen years of consulting I joined the Berkeley Statistics Department in 1980 and have been there since. My experiences as a consultant formed my views about algorithmic modeling. Section 3 describes two of the projects I worked on. These are given to show how my views grew from such problems.

When I returned to the university and began reading statistical journals, the research was distant from what I had done as a consultant. All articles begin and end with data models. My observations about published theoretical research in statistics are in Section 4.

Data modeling has given the statistics field many successes in analyzing data and getting information about the mechanisms producing the data. But there is also misuse leading to questionable conclusions about the underlying mechanism. This is reviewed in Section 5. Following that is a discussion (Section 6) of how the commitment to data modeling has prevented statisticians from entering new scientific and commercial fields where the data being gathered is not suitable for analysis by data models.

In the past fifteen years, the growth in algorithmic modeling applications and methodology has been rapid. It has occurred largely outside statistics in a new community, often called machine learning, that is mostly young computer scientists (Section 7). The advances, particularly over the last five years, have been startling. Three of the most important changes in perception to be learned from these advances are described in Sections 8, 9, and 10, and are associated with the following names:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality, curse or blessing?

Section 11 is titled "Information from a Black Box" and is important in showing that an algorithmic model can produce more and more reliable information about the structure of the relationship between inputs and outputs than data models. This is illustrated using two medical data sets and a genetic data set. A glossary at the end of the paper explains terms that not all statisticians may be familiar with.

3. PROJECTS IN CONSULTING

As a consultant I designed and helped supervise surveys for the Environmental Protection Agency (EPA) and the state and federal court systems. Controlled experiments were designed for the EPA, and I analyzed traffic data for the U.S. Department of Transportation and the California Transportation Department. Most of all, I worked on a diverse set of prediction projects. Here are some examples:

Predicting next-day ozone levels.
Using mass spectra to identify halogen-containing compounds.
Predicting the class of a ship from high altitude radar returns.
Using sonar returns to predict the class of a submarine.
Identity of hand-sent Morse Code.
Toxicity of chemicals.
On-line prediction of the cause of a freeway traffic breakdown.
Speech recognition.
The sources of delay in criminal trials in state court systems.

To understand the nature of these problems and the approaches taken to solve them, I give a fuller description of the first two on the list.

3.1 The Ozone Project

In the mid- to late 1960s ozone levels became a serious health problem in the Los Angeles Basin. Three different alert levels were established. At the highest, all government workers were directed not to drive to work, children were kept off playgrounds and outdoor exercise was discouraged.

The major source of ozone at that time was automobile tailpipe emissions. These rose into the low atmosphere and were trapped there by an inversion layer. A complex chemical reaction, aided by sunlight, cooked away and produced ozone two to three hours after the morning commute hours. The alert warnings were issued in the morning, but would be more effective if they could be issued 12 hours in advance. In the mid-1970s, the EPA funded a large effort to see if ozone levels could be accurately predicted 12 hours in advance.

Commuting patterns in the Los Angeles Basin are regular, with the total variation in any given
daylight hour varying only a few percent from one weekday to another. With the total amount of emissions about constant, the resulting ozone levels depend on the meteorology of the preceding days. A large data base was assembled consisting of lower and upper air measurements at U.S. weather stations as far away as Oregon and Arizona, together with hourly readings of surface temperature, humidity, and wind speed at the dozens of air pollution stations in the Basin and nearby areas.

Altogether, there were daily and hourly readings of over 450 meteorological variables for a period of seven years, with corresponding hourly values of ozone and other pollutants in the Basin. Let x be the predictor vector of meteorological variables on the nth day. There are more than 450 variables in x since information several days back is included. Let y be the ozone level on the (n + 1)st day. Then the problem was to construct a function f(x) such that for any future day and future predictor variables x for that day, f(x) is an accurate predictor of the next day's ozone level y.

To estimate predictive accuracy, the first five years of data were used as the training set. The last two years were set aside as a test set. The algorithmic modeling methods available in the pre-1980s decades seem primitive now. In this project large linear regressions were run, followed by variable selection. Quadratic terms in, and interactions among, the retained variables were added and variable selection used again to prune the equations. In the end, the project was a failure: the false alarm rate of the final predictor was too high. I have regrets that this project can't be revisited with the tools available today.

3.2 The Chlorine Project

The EPA samples thousands of compounds a year and tries to determine their potential toxicity. In the mid-1970s, the standard procedure was to measure the mass spectra of the compound and to try to determine its chemical structure from its mass spectra.

Measuring the mass spectra is fast and cheap. But the determination of chemical structure from the mass spectra requires a painstaking examination by a trained chemist. The cost and availability of enough chemists to analyze all of the mass spectra produced daunted the EPA. Many toxic compounds contain halogens. So the EPA funded a project to determine if the presence of chlorine in a compound could be reliably predicted from its mass spectra.

Mass spectra are produced by bombarding the compound with ions in the presence of a magnetic field. The molecules of the compound split and the lighter fragments are bent more by the magnetic field than the heavier. Then the fragments hit an absorbing strip, with the position of the fragment on the strip determined by the molecular weight of the fragment. The intensity of the exposure at that position measures the frequency of the fragment. The resultant mass spectrum has numbers reflecting frequencies of fragments from molecular weight 1 up to the molecular weight of the original compound. The peaks correspond to frequent fragments and there are many zeroes. The available data base consisted of the known chemical structure and mass spectra of 30,000 compounds.

The mass spectrum predictor vector x is of variable dimensionality. Molecular weight in the data base varied from 30 to over 10,000. The variable to be predicted is

    y = 1: contains chlorine,
    y = 2: does not contain chlorine.

The problem is to construct a function f(x) that is an accurate predictor of y where x is the mass spectrum of the compound.

To measure predictive accuracy the data set was randomly divided into a 25,000 member training set and a 5,000 member test set. Linear discriminant analysis was tried, then quadratic discriminant analysis. These were difficult to adapt to the variable dimensionality. By this time I was thinking about decision trees. The hallmarks of chlorine in mass spectra were researched. This domain knowledge was incorporated into the decision tree algorithm by the design of the set of 1,500 yes-no questions that could be applied to a mass spectrum of any dimensionality. The result was a decision tree that gave 95% accuracy on both chlorines and nonchlorines (see Breiman, Friedman, Olshen and Stone, 1984).

3.3 Perceptions on Statistical Analysis

As I left consulting to go back to the university, these were the perceptions I had about working with data to find answers to problems:

(a) Focus on finding a good solution; that's what consultants get paid for.
(b) Live with the data before you plunge into modeling.
(c) Search for a model that gives a good solution, either algorithmic or data.
(d) Predictive accuracy on test sets is the criterion for how good the model is (a minimal sketch of this criterion follows the list).
(e) Computers are an indispensable partner.
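Perception (d) is easy to make concrete. The following minimal sketch is not from the original projects: it assumes Python with scikit-learn, and synthetic data stands in for something like the chlorine spectra. It fits a single decision tree and scores it on a held-out test set, in the spirit of the 25,000/5,000 split used in the chlorine project.

```python
# Hedged sketch of perception (d): hold out a test set and judge the model by its
# error on it. scikit-learn assumed; the data here are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(30000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + 0.3 * rng.normal(size=30000) > 0).astype(int)

# Roughly the 25,000 / 5,000 split used in the chlorine project.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=5000, random_state=0)

tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0).fit(X_train, y_train)
test_error = 1.0 - tree.score(X_test, y_test)
print(f"test set error: {100 * test_error:.1f}%")
```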
4. RETURN TO THE UNIVERSITY

I had one tip about what research in the university was like. A friend of mine, a prominent statistician from the Berkeley Statistics Department, visited me in Los Angeles in the late 1970s. After I described the decision tree method to him, his first question was, "What's the model for the data?"

4.1 Statistical Research

Upon my return, I started reading the Annals of Statistics, the flagship journal of theoretical statistics, and was bemused. Every article started with

    Assume that the data are generated by the following model: ...

followed by mathematics exploring inference, hypothesis testing and asymptotics. There is a wide spectrum of opinion regarding the usefulness of the theory published in the Annals of Statistics to the field of statistics as a science that deals with data. I am at the very low end of the spectrum. Still, there have been some gems that have combined nice theory and significant applications. An example is wavelet theory. Even in applications, data models are universal. For instance, in the Journal of the American Statistical Association (JASA), virtually every article contains a statement of the form:

    Assume that the data are generated by the following model: ...

I am deeply troubled by the current and past use of data models in applications, where quantitative conclusions are drawn and perhaps policy decisions made.

5. THE USE OF DATA MODELS

Statisticians in applied research consider data modeling as the template for statistical analysis: Faced with an applied problem, think of a data model. This enterprise has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature. Then parameters are estimated and conclusions are drawn. But when a model is fit to data to draw quantitative conclusions:

• The conclusions are about the model's mechanism, and not about nature's mechanism.

It follows that:

• If the model is a poor emulation of nature, the conclusions may be wrong.

These truisms have often been ignored in the enthusiasm for fitting data models. A few decades ago, the commitment to data models was such that even simple precautions such as residual analysis or goodness-of-fit tests were not used. The belief in the infallibility of data models was almost religious. It is a strange phenomenon: once a model is made, then it becomes truth and the conclusions from it are infallible.

5.1 An Example

I illustrate with a famous (also infamous) example: assume the data is generated by independent draws from the model

    (R)    y = b₀ + Σ bₘxₘ + ε    (sum over m = 1, ..., M),

where the coefficients {bₘ} are to be estimated, ε is N(0, σ²) and σ² is to be estimated. Given that the data is generated this way, elegant tests of hypotheses, confidence intervals, distributions of the residual sum-of-squares and asymptotics can be derived. This made the model attractive in terms of the mathematics involved. This theory was used both by academic statisticians and others to derive significance levels for coefficients on the basis of model (R), with little consideration as to whether the data on hand could have been generated by a linear model. Hundreds, perhaps thousands of articles were published claiming proof of something or other because the coefficient was significant at the 5% level.

Goodness-of-fit was demonstrated mostly by giving the value of the multiple correlation coefficient R² which was often closer to zero than one and which could be overinflated by the use of too many parameters. Besides computing R², nothing else was done to see if the observational data could have been generated by model (R). For instance, a study was done several decades ago by a well-known member of a university statistics department to assess whether there was gender discrimination in the salaries of the faculty. All personnel files were examined and a data base set up which consisted of salary as the response variable and 25 other variables which characterized academic performance; that is, papers published, quality of journals published in, teaching record, evaluations, etc. Gender appears as a binary predictor variable.

A linear regression was carried out on the data and the gender coefficient was significant at the 5% level. That this was strong evidence of sex discrimination was accepted as gospel. The design of the study raises issues that enter before the consideration of a model: Can the data gathered
answer the question posed? Is inference justified when your sample is the entire population? Should a data model be used? The deficiencies in analysis occurred because the focus was on the model and not on the problem.

The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions. At the time, there were few objections from the statistical profession about the fairy-tale aspect of the procedure. But, hidden in an elementary textbook, Mosteller and Tukey (1977) discuss many of the fallacies possible in regression and write "The whole area of guided regression is fraught with intellectual, statistical, computational, and subject matter difficulties."

Even currently, there are only rare published critiques of the uncritical use of data models. One of the few is David Freedman, who examines the use of regression models (1994), the use of path models (1987) and data modeling (1991, 1995). The analysis in these papers is incisive.

5.2 Problems in Current Data Modeling

Current applied practice is to check the data model fit using goodness-of-fit tests and residual analysis. At one point, some years ago, I set up a simulated regression problem in seven dimensions with a controlled amount of nonlinearity. Standard tests of goodness-of-fit did not reject linearity until the nonlinearity was extreme. Recent theory supports this conclusion. Work by Bickel, Ritov and Stoker (2001) shows that goodness-of-fit tests have very little power unless the direction of the alternative is precisely specified. The implication is that omnibus goodness-of-fit tests, which test in many directions simultaneously, have little power, and will not reject until the lack of fit is extreme.

Furthermore, if the model is tinkered with on the basis of the data, that is, if variables are deleted or nonlinear combinations of the variables added, then goodness-of-fit tests are not applicable. Residual analysis is similarly unreliable. In a discussion after a presentation of residual analysis in a seminar at Berkeley in 1993, William Cleveland, one of the fathers of residual analysis, admitted that it could not uncover lack of fit in more than four to five dimensions. The papers I have read on using residual analysis to check lack of fit are confined to data sets with two or three variables.

With higher dimensions, the interactions between the variables can produce passable residual plots for a variety of models. A residual plot is a goodness-of-fit test, and lacks power in more than a few dimensions. An acceptable residual plot does not imply that the model is a good fit to the data.

There are a variety of ways of analyzing residuals. For instance, Landwehr, Pregibon and Shoemaker (1984, with discussion) gives a detailed analysis of fitting a logistic model to a three-variable data set using various residual plots. But each of the four discussants present other methods for the analysis. One is left with an unsettled sense about the arbitrariness of residual analysis.

Misleading conclusions may follow from data models that pass goodness-of-fit tests and residual checks. But published applications to data often show little care in checking model fit using these methods or any other. For instance, many of the current application articles in JASA that fit data models have very little discussion of how well their model fits the data. The question of how well the model fits the data is of secondary importance compared to the construction of an ingenious stochastic model.

5.3 The Multiplicity of Data Models

One goal of statistics is to extract information from the data about the underlying mechanism producing the data. The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses. For instance, logistic regression in classification is frequently used because it produces a linear combination of the variables with weights that give an indication of the variable importance. The end result is a simple picture of how the prediction variables affect the response variable plus confidence intervals for the weights. Suppose two statisticians, each one with a different approach to data modeling, fit a model to the same data set. Assume also that each one applies standard goodness-of-fit tests, looks at residuals, etc., and is convinced that their model fits the data. Yet the two models give different pictures of nature's mechanism and lead to different conclusions.

McCullagh and Nelder (1989) write "Data will often point with almost equal emphasis on several possible models, and it is important that the statistician recognize and accept this." Well said, but different models, all of them equally good, may give different pictures of the relation between the predictor and response variables. The question of which one most accurately reflects the data is difficult to resolve. One reason for this multiplicity is that goodness-of-fit tests and other methods for checking fit give a yes-no answer. With the lack of
power of these tests with data having more than a small number of dimensions, there will be a large number of models whose fit is acceptable. There is no way, among the yes-no methods for gauging fit, of determining which is the better model. A few statisticians know this. Mountain and Hsiao (1989) write, "It is difficult to formulate a comprehensive model capable of encompassing all rival models. Furthermore, with the use of finite samples, there are dubious implications with regard to the validity and power of various encompassing tests that rely on asymptotic theory."

Data models in current use may have more damaging results than the publications in the social sciences based on a linear regression analysis. Just as the 5% level of significance became a de facto standard for publication, the Cox model for the analysis of survival times and logistic regression for survive-nonsurvive data have become the de facto standard for publication in medical journals. That different survival models, equally well fitting, could give different conclusions is not an issue.

5.4 Predictive Accuracy

The most obvious way to see how well the model box emulates nature's box is this: put a case x down nature's box getting an output y. Similarly, put the same case x down the model box getting an output y'. The closeness of y and y' is a measure of how good the emulation is. For a data model, this translates as: fit the parameters in your model by using the data, then, using the model, predict the data and see how good the prediction is.

Prediction is rarely perfect. There are usually many unmeasured variables whose effect is referred to as "noise." But the extent to which the model box emulates nature's box is a measure of how well our model can reproduce the natural phenomenon producing the data.

McCullagh and Nelder (1989) in their book on generalized linear models also think the answer is obvious. They write, "At first sight it might seem as though a good model is one that fits the data very well; that is, one that makes μ (the model predicted value) very close to y (the response value)." Then they go on to note that the extent of the agreement is biased by the number of parameters used in the model and so is not a satisfactory measure. They are, of course, right. If the model has too many parameters, then it may overfit the data and give a biased estimate of accuracy. But there are ways to remove the bias. To get a more unbiased estimate of predictive accuracy, cross-validation can be used, as advocated in an important early work by Stone (1974). If the data set is larger, put aside a test set.

Mosteller and Tukey (1977) were early advocates of cross-validation. They write, "Cross-validation is a natural route to the indication of the quality of any data-derived quantity.... We plan to cross-validate carefully wherever we can."

Judging by the infrequency of estimates of predictive accuracy in JASA, this measure of model fit that seems natural to me (and to Mosteller and Tukey) is not natural to others. More publication of predictive accuracy estimates would establish standards for comparison of models, a practice that is common in machine learning.

6. THE LIMITATIONS OF DATA MODELS

With the insistence on data models, multivariate analysis tools in statistics are frozen at discriminant analysis and logistic regression in classification and multiple linear regression in regression. Nobody really believes that multivariate data is multivariate normal, but that data model occupies a large number of pages in every graduate textbook on multivariate statistical analysis.

With data gathered from uncontrolled observations on complex systems involving unknown physical, chemical, or biological mechanisms, the a priori assumption that nature would generate the data through a parametric model selected by the statistician can result in questionable conclusions that cannot be substantiated by appeal to goodness-of-fit tests and residual analysis. Usually, simple parametric models imposed on data generated by complex systems, for example, medical data, financial data, result in a loss of accuracy and information as compared to algorithmic models (see Section 11).

There is an old saying "If all a man has is a hammer, then every problem looks like a nail." The trouble for statisticians is that recently some of the problems have stopped looking like nails. I conjecture that the result of hitting this wall is that more complicated data models are appearing in current published applications. Bayesian methods combined with Markov Chain Monte Carlo are cropping up all over. This may signify that as data becomes more complex, the data models become more cumbersome and are losing the advantage of presenting a simple and clear picture of nature's mechanism.

Approaching problems by looking for a data model imposes an a priori straitjacket that restricts the ability of statisticians to deal with a wide range of statistical problems. The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed.
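Section 5.4 argues that cross-validation (Stone, 1974; Mosteller and Tukey, 1977) or a held-out test set gives a more honest measure of model fit than in-sample agreement. The following minimal sketch is not from the paper: it assumes scikit-learn, uses synthetic data, and puts a linear data model and an algorithmic model on the same cross-validated footing, so that the data and the problem, not the modeling culture, decide which one wins.

```python
# Hedged sketch: judge both a data model and an algorithmic model by
# cross-validated predictive accuracy on the same data. Synthetic data;
# scikit-learn and NumPy assumed.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 500, 7
X = rng.normal(size=(n, p))
# A mechanism with a nonlinear component; the "truth" is unknown to both models.
y = X[:, 0] + 2.0 * np.sin(2.0 * X[:, 1]) + X[:, 2] * X[:, 3] + rng.normal(scale=0.5, size=n)

models = [("linear data model", LinearRegression()),
          ("random forest (algorithmic)", RandomForestRegressor(n_estimators=200, random_state=0))]
for name, model in models:
    # 10-fold cross-validated mean squared error; lower is better.
    mse = -cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error").mean()
    print(f"{name}: cross-validated MSE = {mse:.3f}")
```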
Perhaps the damaging consequence of the insistence on data models is that statisticians have ruled themselves out of some of the most interesting and challenging statistical problems that have arisen out of the rapidly increasing ability of computers to store and manipulate data. These problems are increasingly present in many fields, both scientific and commercial, and solutions are being found by nonstatisticians.

7. ALGORITHMIC MODELING

Under other names, algorithmic modeling has been used by industrial statisticians for decades. See, for instance, the delightful book "Fitting Equations to Data" (Daniel and Wood, 1971). It has been used by psychometricians and social scientists. Reading a preprint of Gifi's book (1990) many years ago uncovered a kindred spirit. It has made small inroads into the analysis of medical data starting with Richard Olshen's work in the early 1980s. For further work, see Zhang and Singer (1999). Jerome Friedman and Grace Wahba have done pioneering work on the development of algorithmic methods. But the list of statisticians in the algorithmic modeling business is short, and applications to data are seldom seen in the journals. The development of algorithmic methods was taken up by a community outside statistics.

7.1 A New Research Community

In the mid-1980s two powerful new algorithms for fitting data became available: neural nets and decision trees. A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.

Their interests range over many fields that were once considered happy hunting grounds for statisticians and have turned out thousands of interesting research papers related to applications and methodology. A large majority of the papers analyze real data. The criterion for any model is its predictive accuracy. An idea of the range of research of this group can be got by looking at the Proceedings of the Neural Information Processing Systems Conference (their main yearly meeting) or at the Machine Learning Journal.

7.2 Theory in Algorithmic Modeling

Data models are rarely used in this community. The approach is that nature produces data in a black box whose insides are complex, mysterious, and, at least, partly unknowable. What is observed is a set of x's that go in and a subsequent set of y's that come out. The problem is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y.

The theory in this field shifts focus from data models to the properties of algorithms. It characterizes their "strength" as predictors, convergence if they are iterative, and what gives them good predictive accuracy. The one assumption made in the theory is that the data is drawn i.i.d. from an unknown multivariate distribution.

There is isolated work in statistics where the focus is on the theory of the algorithms. Grace Wahba's research on smoothing spline algorithms and their applications to data (using cross-validation) is built on theory involving reproducing kernels in Hilbert Space (1990). The final chapter of the CART book (Breiman et al., 1984) contains a proof of the asymptotic convergence of the CART algorithm to the Bayes risk by letting the trees grow as the sample size increases. There are others, but the relative frequency is small.

Theory resulted in a major advance in machine learning. Vladimir Vapnik constructed informative bounds on the generalization error (infinite test set error) of classification algorithms which depend on the "capacity" of the algorithm. These theoretical bounds led to support vector machines (see Vapnik, 1995, 1998) which have proved to be more accurate predictors in classification and regression than neural nets, and are the subject of heated current research (see Section 10).

My last paper "Some infinity theory for tree ensembles" (Breiman, 2000) uses a function space analysis to try and understand the workings of tree ensemble methods. One section has the heading, "My kingdom for some good theory." There is an effective method for forming ensembles known as "boosting," but there isn't any finite sample size theory that tells us why it works so well.

7.3 Recent Lessons

The advances in methodology and increases in predictive accuracy since the mid-1980s that have occurred in the research of machine learning have been phenomenal. There have been particularly exciting developments in the last five years. What has been learned? The three lessons that seem most
important to me:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality, curse or blessing.

8. RASHOMON AND THE MULTIPLICITY OF GOOD MODELS

Rashomon is a wonderful Japanese movie in which four people, from different vantage points, witness an incident in which one person dies and another is supposedly raped. When they come to testify in court, they all report the same facts, but their stories of what happened are very different.

What I call the Rashomon Effect is that there is often a multitude of different descriptions [equations f(x)] in a class of functions giving about the same minimum error rate. The most easily understood example is subset selection in linear regression. Suppose there are 30 variables and we want to find the best five variable linear regressions. There are about 140,000 five-variable subsets in competition. Usually we pick the one with the lowest residual sum-of-squares (RSS), or, if there is a test set, the lowest test error. But there may be (and generally are) many five-variable equations that have RSS within 1.0% of the lowest RSS (see Breiman, 1996a). The same is true if test set error is being measured.

So here are three possible pictures with RSS or test set error within 1.0% of each other:

Picture 1
    y = 2.1 + 3.8x₃ - 0.6x₈ + 83.2x₁₂ - 2.1x₁₇ + 3.2x₂₇,

Picture 2
    y = -8.9 + 4.6x₅ + 0.01x₆ + 12.0x₁₅ + 17.5x₂₁ + 0.2x₂₂,

Picture 3
    y = -76.7 + 9.3x₂ + 22.0x₇ - 13.2x₈ + 3.4x₁₁ + 7.2x₂₈.

Which one is better? The problem is that each one tells a different story about which variables are important.

The Rashomon Effect also occurs with decision trees and neural nets. In my experiments with trees, if the training set is perturbed only slightly, say by removing a random 2-3% of the data, I can get a tree quite different from the original but with almost the same test set error. I once ran a small neural net 100 times on simple three-dimensional data reselecting the initial weights to be small and random on each run. I found 32 distinct minima, each of which gave a different picture, and having about equal test set error.

This effect is closely connected to what I call instability (Breiman, 1996a) that occurs when there are many different models crowded together that have about the same training or test set error. Then a slight perturbation of the data or in the model construction will cause a skip from one model to another. The two models are close to each other in terms of error, but can be distant in terms of the form of the model.

If, in logistic regression or the Cox model, the common practice of deleting the less important covariates is carried out, then the model becomes unstable; there are too many competing models. Say you are deleting from 15 variables to 4 variables. Perturb the data slightly and you will very possibly get a different four-variable model and a different conclusion about which variables are important. To improve accuracy by weeding out less important covariates you run into the multiplicity problem. The picture of which covariates are important can vary significantly between two models having about the same deviance.

Aggregating over a large set of competing models can reduce the nonuniqueness while improving accuracy. Arena et al. (2000) bagged (see Glossary) logistic regression models on a data base of toxic and nontoxic chemicals where the number of covariates in each model was reduced from 15 to 4 by standard best subset selection. On a test set, the bagged model was significantly more accurate than the single model with four covariates. It is also more stable. This is one possible fix. The multiplicity problem and its effect on conclusions drawn from models needs serious attention.

9. OCCAM AND SIMPLICITY VS. ACCURACY

Occam's Razor, long admired, is usually interpreted to mean that simpler is better. Unfortunately, in prediction, accuracy and simplicity (interpretability) are in conflict. For instance, linear regression gives a fairly interpretable picture of the y, x relation. But its accuracy is usually less than that of the less interpretable neural nets. An example closer to my work involves trees.

On interpretability, trees rate an A+. A project I worked on in the late 1970s was the analysis of delay in criminal cases in state court systems. The Constitution gives the accused the right to a speedy trial. The Center for the State Courts was concerned
TABLE 1
Data set descriptions

                  Training       Test
Data set       sample size   sample size   Variables   Classes
Cancer                 699            --           9         2
Ionosphere             351            --          34         2
Diabetes               768            --           8         2
Glass                  214            --           9         6
Soybean                683            --          35        19
Letters             15,000         5,000          16        26
Satellite            4,435         2,000          36         6
Shuttle             43,500        14,500           9         7
DNA                  2,000         1,186          60         3
Digit                7,291         2,007         256        10

that in many states, the trials were anything but speedy. It funded a study of the causes of the delay. I visited many states and decided to do the analysis in Colorado, which had an excellent computerized court data system. A wealth of information was extracted and processed.

The dependent variable for each criminal case was the time from arraignment to the time of sentencing. All of the other information in the trial history were the predictor variables. A large decision tree was grown, and I showed it on an overhead and explained it to the assembled Colorado judges. One of the splits was on District N which had a larger delay time than the other districts. I refrained from commenting on this. But as I walked out I heard one judge say to another, "I knew those guys in District N were dragging their feet."

While trees rate an A+ on interpretability, they are good, but not great, predictors. Give them, say, a B on prediction.

9.1 Growing Forests for Prediction

Instead of a single tree predictor, grow a forest of trees on the same data, say 50 or 100. If we are classifying, put the new x down each tree in the forest and get a vote for the predicted class. Let the forest prediction be the class that gets the most votes. There has been a lot of work in the last five years on ways to grow the forest. All of the well-known methods grow the forest by perturbing the training set, growing a tree on the perturbed training set, perturbing the training set again, growing another tree, etc. Some familiar methods are bagging (Breiman, 1996b), boosting (Freund and Schapire, 1996), arcing (Breiman, 1998), and additive logistic regression (Friedman, Hastie and Tibshirani, 1998).

My preferred method to date is random forests. In this approach successive decision trees are grown by introducing a random element into their construction. For example, suppose there are 20 predictor variables. At each node choose several of the 20 at random to use to split the node. Or use a random combination of a random selection of a few variables. This idea appears in Ho (1998), in Amit and Geman (1997) and is developed in Breiman (1999).

9.2 Forests Compared to Trees

We compare the performance of single trees (CART) to random forests on a number of small and large data sets, mostly from the UCI repository (ftp.ics.uci.edu/pub/MachineLearningDatabases). A summary of the data sets is given in Table 1.

Table 2 compares the test set error of a single tree to that of the forest. For the five smaller data sets above the line, the test set error was estimated by leaving out a random 10% of the data, then running CART and the forest on the other 90%. The left-out 10% was run down the tree and the forest and the error on this 10% computed for both. This was repeated 100 times and the errors averaged. The larger data sets below the line came with a separate test set.

TABLE 2
Test set misclassification error (%)

Data set           Forest   Single tree
Breast cancer         2.9           5.9
Ionosphere            5.5          11.2
Diabetes             24.2          25.3
Glass                22.0          30.4
Soybean               5.7           8.6
Letters               3.4          12.4
Satellite             8.6          14.8
Shuttle (×10³)        7.0          62.0
DNA                   3.9           6.2
Digit                 6.2          17.1

People who have been in the classification field for a while find these increases in accuracy startling. Some errors are halved. Others are reduced by one-third. In regression, where the
forest prediction is the average over the individual tree predictions, the decreases in mean-squared test set error are similar.

9.3 Random Forests are A+ Predictors

The Statlog Project (Michie, Spiegelhalter and Taylor, 1994) compared 18 different classifiers. Included were neural nets, CART, linear and quadratic discriminant analysis, nearest neighbor, etc. The first four data sets below the line in Table 1 were the only ones used in the Statlog Project that came with separate test sets. In terms of rank of accuracy on these four data sets, the forest comes in 1, 1, 1, 1 for an average rank of 1.0. The next best classifier had an average rank of 7.3.

The fifth data set below the line consists of 16 × 16 pixel gray scale depictions of handwritten ZIP Code numerals. It has been extensively used by AT&T Bell Labs to test a variety of prediction methods. A neural net handcrafted to the data got a test set error of 5.1% vs. 6.2% for a standard run of random forest.

9.4 The Occam Dilemma

So forests are A+ predictors. But their mechanism for producing a prediction is difficult to understand. Trying to delve into the tangled web that generated a plurality vote from 100 trees is a Herculean task. So on interpretability, they rate an F. Which brings us to the Occam dilemma:

• Accuracy generally requires more complex prediction methods. Simple and interpretable functions do not make the most accurate predictors.

Using complex predictors may be unpleasant, but the soundest path is to go for predictive accuracy first, then try to understand why. In fact, Section 10 points out that from a goal-oriented statistical viewpoint, there is no Occam's dilemma. (For more on Occam's Razor see Domingos, 1998, 1999.)

10. BELLMAN AND THE CURSE OF DIMENSIONALITY

The title of this section refers to Richard Bellman's famous phrase, "the curse of dimensionality." For decades, the first step in prediction methodology was to avoid the curse. If there were too many prediction variables, the recipe was to find a few features (functions of the predictor variables) that "contain most of the information" and then use these features to replace the original variables. In procedures common in statistics such as regression, logistic regression and survival models the advised practice is to use variable deletion to reduce the dimensionality. The published advice was that high dimensionality is dangerous. For instance, a well-regarded book on pattern recognition (Meisel, 1972) states "the features... must be relatively few in number." But recent work has shown that dimensionality can be a blessing.

10.1 Digging It Out in Small Pieces

Reducing dimensionality reduces the amount of information available for prediction. The more predictor variables, the more information. There is also information in various combinations of the predictor variables. Let's try going in the opposite direction:

• Instead of reducing dimensionality, increase it by adding many functions of the predictor variables.

There may now be thousands of features. Each potentially contains a small amount of information. The problem is how to extract and put together these little pieces of information. There are two outstanding examples of work in this direction: the Shape Recognition Forest (Y. Amit and D. Geman, 1997) and Support Vector Machines (V. Vapnik, 1995, 1998).

10.2 The Shape Recognition Forest

In 1992, the National Institute of Standards and Technology (NIST) set up a competition for machine algorithms to read handwritten numerals. They put together a large set of pixel pictures of handwritten numbers (223,000) written by over 2,000 individuals. The competition attracted wide interest, and diverse approaches were tried.

The Amit-Geman approach defined many thousands of small geometric features in a hierarchical assembly. Shallow trees are grown, such that at each node, 100 features are chosen at random from the appropriate level of the hierarchy; and the optimal split of the node based on the selected features is found.

When a pixel picture of a number is dropped down a single tree, the terminal node it lands in gives probability estimates p₀, ..., p₉ that it represents numbers 0, 1, ..., 9. Over 1,000 trees are grown, the probabilities averaged over this forest, and the predicted number is assigned to the largest averaged probability.

Using a 100,000 example training set and a 50,000 test set, the Amit-Geman method gives a test set error of 0.7%, close to the limits of human error.

10.3 Support Vector Machines

Suppose there is two-class data having prediction vectors in M-dimensional Euclidean space. The prediction vectors for class #1 are {x(1)} and those for
class #2 are {x(2)}. If these two sets of vectors can be separated by a hyperplane then there is an optimal separating hyperplane. "Optimal" is defined as meaning that the distance of the hyperplane to any prediction vector is maximal (see below).

The set of vectors in {x(1)} and in {x(2)} that achieve the minimum distance to the optimal separating hyperplane are called the support vectors. Their coordinates determine the equation of the hyperplane. Vapnik (1995) showed that if a separating hyperplane exists, then the optimal separating hyperplane has low generalization error (see Glossary).

[Figure: the optimal separating hyperplane between the two classes, with the nearest cases marked as support vectors.]

In two-class data, separability by a hyperplane does not often occur. However, let us increase the dimensionality by adding as additional predictor variables all quadratic monomials in the original predictor variables; that is, all terms of the form xₘ₁xₘ₂. A hyperplane in the original variables plus quadratic monomials in the original variables is a more complex creature. The possibility of separation is greater. If no separation occurs, add cubic monomials as input features. If there are originally 30 predictor variables, then there are about 40,000 features if monomials up to the fourth degree are added.

The higher the dimensionality of the set of features, the more likely it is that separation occurs. In the ZIP Code data set, separation occurs with fourth degree monomials added. The test set error is 4.1%. Using a large subset of the NIST data base as a training set, separation also occurred after adding up to fourth degree monomials and gave a test set error rate of 1.1%.

Separation can always be had by raising the dimensionality high enough. But if the separating hyperplane becomes too complex, the generalization error becomes large. An elegant theorem (Vapnik, 1995) gives this bound for the expected generalization error:

    E(GE) ≤ E(number of support vectors)/(N - 1),

where N is the sample size and the expectation is over all training sets of size N drawn from the same underlying distribution as the original training set.

The number of support vectors increases with the dimensionality of the feature space. If this number becomes too large, the separating hyperplane will not give low generalization error. If separation cannot be realized with a relatively small number of support vectors, there is another version of support vector machines that defines optimality by adding a penalty term for the vectors on the wrong side of the hyperplane.

Some ingenious algorithms make finding the optimal separating hyperplane computationally feasible. These devices reduce the search to a solution of a quadratic programming problem with linear inequality constraints that are of the order of the number N of cases, independent of the dimension of the feature space. Methods tailored to this particular problem produce speed-ups of an order of magnitude over standard methods for solving quadratic programming problems.

Support vector machines can also be used to provide accurate predictions in other areas (e.g., regression). It is an exciting idea that gives excellent performance and is beginning to supplant the use of neural nets. A readable introduction is in Cristianini and Shawe-Taylor (2000).

11. INFORMATION FROM A BLACK BOX

The dilemma posed in the last section is that the models that best emulate nature in terms of predictive accuracy are also the most complex and inscrutable. But this dilemma can be resolved by realizing the wrong question is being asked. Nature forms the outputs y from the inputs x by means of a black box with complex and unknown interior.

    y ← [ nature ] ← x

Current accurate prediction methods are also complex black boxes.

    y ← [ neural nets, forests, support vectors ] ← x

So we are facing two black boxes, where ours seems only slightly less inscrutable than nature's. In data generated by medical experiments, ensembles of predictors can give cross-validated error rates significantly lower than logistic regression. My biostatistician friends tell me, "Doctors can interpret logistic regression." There is no way they can interpret a black box containing fifty trees hooked together. In a choice between accuracy and interpretability, they'll go for interpretability.

Framing the question as the choice between accuracy and interpretability is an incorrect interpretation of what the goal of a statistical analysis is.
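A minimal sketch of the support vector machine idea of Section 10.3 follows. It is illustrative only, not the paper's implementation: it assumes scikit-learn, uses a synthetic two-class problem, and a degree-4 polynomial kernel stands in for explicitly adding monomials; the number of support vectors and the test set error are then reported.

```python
# Hedged sketch of Section 10.3: fit a maximum-margin classifier after implicitly
# raising the dimensionality with a degree-4 polynomial kernel, then look at the
# number of support vectors and the test set error. Synthetic data; scikit-learn assumed.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
# Two classes that are not linearly separable in the original coordinates
# (inside vs. outside a circle), but separable after polynomial expansion.
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

svm = SVC(kernel="poly", degree=4, C=10.0)
svm.fit(X_train, y_train)

n_sv = svm.support_vectors_.shape[0]
test_error = 1.0 - svm.score(X_test, y_test)
print(f"support vectors: {n_sv} of {len(X_train)} training cases")
print(f"test set error: {100 * test_error:.1f}%")
# Vapnik's bound ties the expected generalization error to the expected number
# of support vectors divided by (N - 1), so few support vectors is reassuring.
```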
The point of a model is to get useful information about the relation between the response and predictor variables. Interpretability is a way of getting information. But a model does not have to be simple to provide reliable information about the relation between predictor and response variables; neither does it have to be a data model.

• The goal is not interpretability, but accurate information.

The following three examples illustrate this point. The first shows that random forests applied to a medical data set can give more reliable information about covariate strengths than logistic regression. The second shows that it can give interesting information that could not be revealed by a logistic regression. The third is an application to microarray data where it is difficult to conceive of a data model that would uncover similar information.

11.1 Example I: Variable Importance in a Survival Data Set

The data set contains survival or nonsurvival of 155 hepatitis patients with 19 covariates. It is available at ftp.ics.uci.edu/pub/MachineLearningDatabases and was contributed by Gail Gong. The description is in a file called hepatitis.names. The data set has been previously analyzed by Diaconis and Efron (1983), and Cestnik, Kononenko and Bratko (1987). The lowest reported error rate to date, 17%, is in the latter paper.

Diaconis and Efron refer to work by Peter Gregory of the Stanford Medical School who analyzed this data and concluded that the important variables were numbers 6, 12, 14, 19 and reports an estimated 20% predictive accuracy. The variables were reduced in two stages: the first was by informal data analysis. The second refers to a more formal (unspecified) statistical procedure which I assume was logistic regression.

Efron and Diaconis drew 500 bootstrap samples from the original data set and used a similar procedure to isolate the important variables in each bootstrapped data set. The authors comment, "Of the four variables originally selected not one was selected in more than 60 percent of the samples. Hence the variables identified in the original analysis cannot be taken too seriously." We will come back to this conclusion later.

Logistic Regression

The predictive error rate for logistic regression on the hepatitis data set is 17.4%. This was evaluated by doing 100 runs, each time leaving out a randomly selected 10% of the data as a test set, and then averaging over the test set errors.

Usually, the initial evaluation of which variables are important is based on examining the absolute values of the coefficients of the variables in the logistic regression divided by their standard deviations. Figure 1 is a plot of these values.

The conclusion from looking at the standardized coefficients is that variables 7 and 11 are the most important covariates. When logistic regression is run using only these two variables, the cross-validated error rate rises to 22.9%. Another way to find important variables is to run a best subsets search which, for any value k, finds the subset of k variables having lowest deviance.

This procedure raises the problems of instability and multiplicity of models (see Section 7.1). There are about 4,000 subsets containing four variables. Of these, there are almost certainly a substantial number that have deviance close to the minimum and give different pictures of what the underlying mechanism is.

FIG. 1. Standardized coefficients, logistic regression.
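The bootstrap check that Efron and Diaconis ran can be sketched for any selection procedure. The sketch below is not their analysis: it assumes scikit-learn and NumPy, uses synthetic correlated covariates in place of the hepatitis data, selects the k covariates with the largest absolute standardized logistic coefficients, and tabulates how often each covariate is selected across bootstrap samples.

```python
# Hedged sketch of the bootstrap stability check described above: repeat the
# variable-selection procedure on bootstrap samples and count how often each
# variable is picked. Synthetic data stands in for the hepatitis set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p, k = 155, 19, 4                           # sizes echo the hepatitis example
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=n)   # two highly correlated covariates
y = (X[:, 0] + X[:, 5] + rng.normal(size=n) > 0).astype(int)

def select_top_k(X, y, k):
    """Fit logistic regression on standardized covariates and return the
    indices of the k largest absolute coefficients."""
    Xs = StandardScaler().fit_transform(X)
    coef = LogisticRegression(max_iter=1000).fit(Xs, y).coef_[0]
    return np.argsort(np.abs(coef))[-k:]

counts = np.zeros(p, dtype=int)
for _ in range(500):                           # 500 bootstrap samples, as in Efron-Diaconis
    idx = rng.integers(0, n, size=n)
    counts[select_top_k(X[idx], y[idx], k)] += 1

for j in np.argsort(-counts)[:8]:
    print(f"variable {j:2d} selected in {100 * counts[j] / 500:.0f}% of bootstrap samples")
```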
FIG. 2. Variable importance, random forest.

Random Forests

The random forests predictive error rate, evaluated by averaging errors over 100 runs, each time leaving out 10% of the data as a test set, is 12.3%, almost a 30% reduction from the logistic regression error.

Random forests consists of a large number of randomly constructed trees, each voting for a class. Similar to bagging (Breiman, 1996), a bootstrap sample of the training set is used to construct each tree. A random selection of the input variables is searched to find the best split for each node.

To measure the importance of the mth variable, the values of the mth variable are randomly permuted in all of the cases left out in the current bootstrap sample. Then these cases are run down the current tree and their classification noted. At the end of a run consisting of growing many trees, the percent increase in misclassification rate due to noising up each variable is computed. This is the measure of variable importance that is shown in Figure 2.

Random forests singles out two variables, the 12th and the 17th, as being important. As a verification both variables were run in random forests, individually and together. The test set error rates over 100 replications were 14.3% each. Running both together did no better. We conclude that virtually all of the predictive capability is provided by a single variable, either 12 or 17.

To explore the interaction between 12 and 17 a bit further, at the end of a random forest run using all variables, the output includes the estimated value of the probability of each class vs. the case number. This information is used to get plots of the variable values (normalized to mean zero and standard deviation one) vs. the probability of death. The variable values are smoothed using a weighted linear regression smoother. The results are in Figure 3 for variables 12 and 17.

FIG. 3. Variable 17 vs. probability #1 (panels: variables 12 and 17 vs. class 1 probability).
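The forest construction of Section 9.1 and the permutation measure of variable importance just described can be sketched compactly. This is an illustration rather than the paper's implementation: it assumes scikit-learn, runs on synthetic data, and permutes variables on a held-out test set instead of the out-of-bag cases.

```python
# Hedged sketch: grow a forest of trees that each split on a random subset of the
# predictors (Section 9.1), then estimate variable importance by permuting one
# variable at a time and measuring the increase in error (Section 11.1).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 1000, 19
X = rng.normal(size=(n, p))
X[:, 17] = X[:, 12] + 0.2 * rng.normal(size=n)   # variables 12 and 17 as surrogates
y = (X[:, 12] + 0.5 * X[:, 3] + rng.normal(size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,        # many trees, each grown on a bootstrap sample
    max_features="sqrt",     # a random subset of predictors is tried at each node
    oob_score=True,          # out-of-bag estimate of the test set error
    random_state=0,
).fit(X_tr, y_tr)

print(f"out-of-bag error: {100 * (1 - forest.oob_score_):.1f}%")
print(f"test set error:   {100 * (1 - forest.score(X_te, y_te)):.1f}%")

imp = permutation_importance(forest, X_te, y_te, n_repeats=20, random_state=0)
for m in np.argsort(-imp.importances_mean)[:5]:
    print(f"variable {m:2d}: mean accuracy drop when permuted = {imp.importances_mean[m]:.3f}")
```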


FIG. 4. Variable importance, Bupa data.

The graphs of the variable values vs. class death probability are almost linear and similar. The two variables turn out to be highly correlated. Thinking that this might have affected the logistic regression results, it was run again with one or the other of these two variables deleted. There was little change.

Out of curiosity, I evaluated variable importance in logistic regression in the same way that I did in random forests, by permuting variable values in the 10% test set and computing how much that increased the test set error. Not much help: variables 12 and 17 were not among the 3 variables ranked as most important. In partial verification of the importance of 12 and 17, I tried them separately as single variables in logistic regression. Variable 12 gave a 15.7% error rate, variable 17 came in at 19.3%.

To go back to the original Diaconis-Efron analysis, the problem is clear. Variables 12 and 17 are surrogates for each other. If one of them appears important in a model built on a bootstrap sample, the other does not. So each one's frequency of occurrence is automatically less than 50%. The paper lists the variables selected in ten of the samples. Either 12 or 17 appear in seven of the ten.

11.2 Example II: Clustering in Medical Data

The Bupa liver data set is a two-class biomedical data set also available at ftp.ics.uci.edu/pub/MachineLearningDatabases. The covariates are:

1. mcv: mean corpuscular volume
2. alkphos: alkaline phosphatase
3. sgpt: alamine aminotransferase
4. sgot: aspartate aminotransferase
5. gammagt: gamma-glutamyl transpeptidase
6. drinks: half-pint equivalents of alcoholic beverage drunk per day

The first five attributes are the results of blood tests thought to be related to liver functioning. The 345 patients are classified into two classes by the severity of their liver malfunctioning. Class two is severe malfunctioning.

FIG. 5. Cluster averages, Bupa data (curves: cluster 1 of class 2, cluster 2 of class 2, class 1).

In a random forests run,
STATISTICAL MODELING: THE TWO CULTURES 213

In a random forests run, the misclassification error rate is 28%. The variable importance given by random forests is in Figure 4. Blood tests 3 and 5 are the most important, followed by test 4. Random forests also outputs an intrinsic similarity measure which can be used to cluster. When this was applied, two clusters were discovered in class two. The average of each variable is computed and plotted in each of these clusters in Figure 5.

An interesting facet emerges. The class two subjects consist of two distinct groups: those that have high scores on blood tests 3, 4, and 5 and those that have low scores on those tests.
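The intrinsic similarity measure mentioned above can be approximated, for readers who want to experiment, by counting how often two cases land in the same terminal node across the forest. The sketch below assumes scikit-learn and scipy and is illustrative only; it is not the clustering procedure used for Figure 5.

```python
# Sketch: approximate random forest proximities (the fraction of trees in which two
# cases share a terminal node), turn them into distances, and cluster. Illustrative
# only; X, y, the tree count and the cluster count are placeholders.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.ensemble import RandomForestClassifier

def forest_proximity_clusters(X, y, n_clusters=2, n_trees=500, seed=0):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X, y)
    leaves = forest.apply(X)                   # (n_cases, n_trees) terminal node ids
    n = X.shape[0]
    proximity = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
        proximity += same_leaf                 # count trees where the pair shares a leaf
    proximity /= leaves.shape[1]

    distance = 1.0 - proximity                 # similar cases get small distances
    np.fill_diagonal(distance, 0.0)
    links = linkage(squareform(distance, checks=False), method="average")
    return fcluster(links, t=n_clusters, criterion="maxclust")
```

Applied to the class two cases alone, this kind of procedure is what separates the two groups described above.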
11.3 Example III: Microarray Data

Random forests was run on a microarray lymphoma data set with three classes, sample size of 81 and 4,682 variables (genes) without any variable selection [for more information about this data set, see Dudoit, Fridlyand and Speed (2000)]. The error rate was low. What was also interesting from a scientific viewpoint was an estimate of the importance of each of the 4,682 gene expressions.

The graph in Figure 6 was produced by a run of random forests. This result is consistent with assessments of variable importance made using other algorithmic methods, but appears to have sharper detail.

11.4 Remarks about the Examples

The examples show that much information is available from an algorithmic model. Friedman (1999) derives similar variable information from a different way of constructing a forest. The similarity is that they are both built as ways to give low predictive error.

There are 32 deaths and 123 survivors in the hepatitis data set. Calling everyone a survivor gives a baseline error rate of 20.6%. Logistic regression lowers this to 17.4%. It is not extracting much useful information from the data, which may explain its inability to find the important variables. Its weakness might have been unknown and the variable importances accepted at face value if its predictive accuracy was not evaluated.

Random forests is also capable of discovering important aspects of the data that standard data models cannot uncover. The potentially interesting clustering of class two patients in Example II is an illustration. The standard procedure when fitting data models such as logistic regression is to delete variables; to quote from Diaconis and Efron (1983) again, "...statistical experience suggests that it is unwise to fit a model that depends on 19 variables with only 155 data points available." Newer methods in machine learning thrive on variables: the more the better. For instance, random forests does not overfit. It gives excellent accuracy on the lymphoma data set of Example III which has over 4,600 variables, with no variable deletion and is capable of extracting variable importance information from the data.


FIG. 6. Microarray variable importance.



These examples illustrate the following points:

* Higher predictive accuracy is associated with more reliable information about the underlying data mechanism. Weak predictive accuracy can lead to questionable conclusions.
* Algorithmic models can give better predictive accuracy than data models, and provide better information about the underlying mechanism.

12. FINAL REMARKS

The goals in statistics are to use data to predict and to get information about the underlying data mechanism. Nowhere is it written on a stone tablet what kind of model should be used to solve problems involving data. To make my position clear, I am not against data models per se. In some situations they are the most appropriate way to solve the problem. But the emphasis needs to be on the problem and on the data.

Unfortunately, our field has a vested interest in data models, come hell or high water. For instance, see Dempster's (1998) paper on modeling. His position on the 1990 Census adjustment controversy is particularly interesting. He admits that he doesn't know much about the data or the details, but argues that the problem can be solved by a strong dose of modeling. That more modeling can make error-ridden data accurate seems highly unlikely to me.

Terabytes of data are pouring into computers from many sources, both scientific and commercial, and there is a need to analyze and understand the data. For instance, data is being generated at an awesome rate by telescopes and radio telescopes scanning the skies. Images containing millions of stellar objects are stored on tape or disk. Astronomers need automated ways to scan their data to find certain types of stellar objects or novel objects. This is a fascinating enterprise, and I doubt if data models are applicable. Yet I would enter this in my ledger as a statistical problem.

The analysis of genetic data is one of the most challenging and interesting statistical problems around. Microarray data, like that analyzed in Section 11.3, can lead to significant advances in understanding genetic effects. But the analysis of variable importance in Section 11.3 would be difficult to do accurately using a stochastic data model.

Problems such as stellar recognition or analysis of gene expression data could be high adventure for statisticians. But it requires that they focus on solving the problem instead of asking what data model they can create. The best solution could be an algorithmic model, or maybe a data model, or maybe a combination. But the trick to being a scientist is to be open to using a wide variety of tools.

The roots of statistics, as in science, lie in working with data and checking theory against data. I hope in this century our field will return to its roots. There are signs that this hope is not illusory. Over the last ten years, there has been a noticeable move toward statistical work on real world problems and reaching out by statisticians toward collaborative work with other disciplines. I believe this trend will continue and, in fact, has to continue if we are to survive as an energetic and creative field.

GLOSSARY

Since some of the terms used in this paper may not be familiar to all statisticians, I append some definitions.

Infinite test set error. Assume a loss function L(y, ŷ) that is a measure of the error when y is the true response and ŷ the predicted response. In classification, the usual loss is 1 if y ≠ ŷ and zero if y = ŷ. In regression, the usual loss is (y - ŷ)². Given a set of data (training set) consisting of {(y_n, x_n), n = 1, 2, ..., N}, use it to construct a predictor function φ(x) of y. Assume that the training set is i.i.d. drawn from the distribution of the random vector Y, X. The infinite test set error is E(L(Y, φ(X))). This is called the generalization error in machine learning.

The generalization error is estimated either by setting aside a part of the data as a test set or by cross-validation.

Predictive accuracy. This refers to the size of the estimated generalization error. Good predictive accuracy means low estimated error.
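As a small illustration of these two estimates (the 0-1 loss, the split fraction, the fold count, and the use of scikit-learn utilities are assumptions for the sketch):

```python
# Sketch: estimating the generalization error of a classifier phi under 0-1 loss,
# either with a held-out test set or by K-fold cross-validation. phi is any
# estimator with fit/predict; X, y and the settings are placeholders.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def test_set_error(phi, X, y, test_size=0.2, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
    phi.fit(X_tr, y_tr)
    return float(np.mean(phi.predict(X_te) != y_te))   # error on the set-aside data

def cross_validation_error(phi, X, y, n_folds=10, seed=0):
    errors = []
    for train_idx, test_idx in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        phi.fit(X[train_idx], y[train_idx])
        errors.append(np.mean(phi.predict(X[test_idx]) != y[test_idx]))
    return float(np.mean(errors))
```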
Trees and nodes. This terminology refers to decision trees as described in the Breiman et al. (1984) book.

Dropping an x down a tree. When a vector of predictor variables is "dropped" down a tree, at each intermediate node it has instructions whether to go left or right depending on the coordinates of x. It stops at a terminal node and is assigned the prediction given by that node.
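A minimal sketch of this traversal, with a hypothetical node layout chosen for illustration:

```python
# Sketch: "dropping" a vector of predictor variables x down a decision tree.
# The Node layout is hypothetical; terminal nodes carry a prediction, intermediate
# nodes carry a split variable and threshold.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    prediction: Optional[float] = None     # set only at terminal nodes
    split_var: Optional[int] = None        # which coordinate of x is tested here
    threshold: Optional[float] = None
    left: Optional["Node"] = None          # taken when x[split_var] <= threshold
    right: Optional["Node"] = None         # taken otherwise

def drop_down_tree(x, node):
    while node.prediction is None:         # keep moving until a terminal node
        node = node.left if x[node.split_var] <= node.threshold else node.right
    return node.prediction
```

For example, dropping x = (2.5, 0.7) down a one-split tree with threshold 3.0 on the first coordinate follows the left branch and returns that terminal node's prediction.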
Bagging. An acronym for "bootstrap aggregating." Start with an algorithm such that given any training set, the algorithm produces a prediction function φ(x). The algorithm can be a decision tree construction, logistic regression with variable deletion, etc. Take a bootstrap sample from the training set and use this bootstrap training set to construct the predictor φ_1(x). Take another bootstrap sample and using this second training set construct the predictor φ_2(x). Continue this way for K steps. In regression, average all of the {φ_k(x)} to get the bagged predictor at x. In classification, that class which has the plurality vote of the {φ_k(x)} is the bagged predictor. Bagging has been shown effective in variance reduction (Breiman, 1996b).
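A sketch of the classification case, with scikit-learn decision trees standing in for the base algorithm and all settings illustrative:

```python
# Sketch of bagging for classification: K predictors grown on bootstrap samples and
# combined by plurality vote. The base learner and the data are placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_new, K=100, seed=0):
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    votes = []
    for k in range(K):
        boot = rng.integers(0, n, size=n)          # bootstrap training set
        phi_k = DecisionTreeClassifier(random_state=k).fit(X_train[boot], y_train[boot])
        votes.append(phi_k.predict(X_new))
    votes = np.asarray(votes)                      # shape (K, number of new cases)

    classes = np.unique(y_train)
    counts = np.array([(votes == c).sum(axis=0) for c in classes])
    return classes[np.argmax(counts, axis=0)]      # plurality vote over the K predictors
```

For regression, the plurality vote would be replaced by averaging the K predictions, as in the definition above.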

Boosting. This is a more complex way of forming an ensemble of predictors in classification than bagging (Freund and Schapire, 1996). It uses no randomization but proceeds by altering the weights on the training set. Its performance in terms of low prediction error is excellent (for details see Breiman, 1998).
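One well-known weight-altering scheme of this kind is AdaBoost (Freund and Schapire, 1996). The sketch below is an illustrative AdaBoost-style implementation with decision stumps, assuming labels coded as -1 and +1 and scikit-learn for the base learner; it is not the arcing algorithm of Breiman (1998).

```python
# Sketch of an AdaBoost-style booster with decision stumps, for labels coded -1/+1.
# Illustrative only; X, y, X_new and the round count are placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_predict(X, y, X_new, n_rounds=50):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                      # weights on the training set
    score = np.zeros(X_new.shape[0])
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # influence of this stump
        w *= np.exp(-alpha * y * pred)           # misclassified cases get more weight
        w /= w.sum()
        score += alpha * stump.predict(X_new)
    return np.sign(score)                        # weighted vote of the stumps
```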
ACKNOWLEDGMENTS

Many of my ideas about data modeling were formed in three decades of conversations with my old friend and collaborator, Jerome Friedman. Conversations with Richard Olshen about the Cox model and its use in biostatistics helped me to understand the background. I am also indebted to William Meisel, who headed some of the prediction projects I consulted on and helped me make the transition from probability theory to algorithms, and to Charles Stone for illuminating conversations about the nature of statistics and science. I'm grateful also for the comments of the editor, Leon Gleser, which prompted a major rewrite of the first draft of this manuscript and resulted in a different and better paper.
REFERENCES

AMIT, Y. and GEMAN, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation 9 1545-1588.
ARENA, C., SUSSMAN, N., CHIANG, K., MAZUMDAR, S., MACINA, O. and LI, W. (2000). Bagging Structure-Activity Relationships: A simulation study for assessing misclassification rates. Presented at the Second Indo-U.S. Workshop on Mathematical Chemistry, Duluth, MI. (Available at NSussman@server.ceoh.pitt.edu.)
BICKEL, P., RITOV, Y. and STOKER, T. (2001). Tailor-made tests for goodness of fit for semiparametric hypotheses. Unpublished manuscript.
BREIMAN, L. (1996a). The heuristics of instability in model selection. Ann. Statist. 24 2350-2381.
BREIMAN, L. (1996b). Bagging predictors. Machine Learning J. 26 123-140.
BREIMAN, L. (1998). Arcing classifiers. Discussion paper, Ann. Statist. 26 801-824.
BREIMAN, L. (2000). Some infinity theory for tree ensembles. (Available at www.stat.berkeley.edu/technical reports.)
BREIMAN, L. (2001). Random forests. Machine Learning J. 45 5-32.
BREIMAN, L. and FRIEDMAN, J. (1985). Estimating optimal transformations in multiple regression and correlation. J. Amer. Statist. Assoc. 80 580-619.
BREIMAN, L., FRIEDMAN, J., OLSHEN, R. and STONE, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
CRISTIANINI, N. and SHAWE-TAYLOR, J. (2000). An Introduction to Support Vector Machines. Cambridge Univ. Press.
DANIEL, C. and WOOD, F. (1971). Fitting Equations to Data. Wiley, New York.
DEMPSTER, A. (1998). Logicist statistics I. Models and modeling. Statist. Sci. 13 248-276.
DIACONIS, P. and EFRON, B. (1983). Computer-intensive methods in statistics. Scientific American 248 116-131.
DOMINGOS, P. (1998). Occam's two razors: the sharp and the blunt. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (R. Agrawal and P. Stolorz, eds.) 37-43. AAAI Press, Menlo Park, CA.
DOMINGOS, P. (1999). The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery 3 409-425.
DUDOIT, S., FRIDLYAND, J. and SPEED, T. (2000). Comparison of discrimination methods for the classification of tumors. (Available at www.stat.berkeley.edu/technical reports.)
FREEDMAN, D. (1987). As others see us: a case study in path analysis (with discussion). J. Ed. Statist. 12 101-223.
FREEDMAN, D. (1991). Statistical models and shoe leather. Sociological Methodology 1991 (with discussion) 291-358.
FREEDMAN, D. (1991). Some issues in the foundations of statistics. Foundations of Science 1 19-83.
FREEDMAN, D. (1994). From association to causation via regression. Adv. in Appl. Math. 18 59-110.
FREUND, Y. and SCHAPIRE, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148-156. Morgan Kaufmann, San Francisco.
FRIEDMAN, J. (1999). Greedy predictive approximation: a gradient boosting machine. Technical report, Dept. Statistics, Stanford Univ.
FRIEDMAN, J., HASTIE, T. and TIBSHIRANI, R. (2000). Additive logistic regression: a statistical view of boosting. Ann. Statist. 28 337-407.
GIFI, A. (1990). Nonlinear Multivariate Analysis. Wiley, New York.
HO, T. K. (1998). The random subspace method for constructing decision forests. IEEE Trans. Pattern Analysis and Machine Intelligence 20 832-844.
LANDWEHR, J., PREGIBON, D. and SHOEMAKER, A. (1984). Graphical methods for assessing logistic regression models (with discussion). J. Amer. Statist. Assoc. 79 61-83.
MCCULLAGH, P. and NELDER, J. (1989). Generalized Linear Models. Chapman and Hall, London.
MEISEL, W. (1972). Computer-Oriented Approaches to Pattern Recognition. Academic Press, New York.
MICHIE, D., SPIEGELHALTER, D. and TAYLOR, C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York.
MOSTELLER, F. and TUKEY, J. (1977). Data Analysis and Regression. Addison-Wesley, Reading, MA.
MOUNTAIN, D. and HSIAO, C. (1989). A combined structural and flexible functional approach for modeling energy substitution. J. Amer. Statist. Assoc. 84 76-87.
STONE, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. B 36 111-147.
VAPNIK, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
VAPNIK, V. (1998). Statistical Learning Theory. Wiley, New York.
WAHBA, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
ZHANG, H. and SINGER, B. (1999). Recursive Partitioning in the Health Sciences. Springer, New York.
