... important to one:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality, curse or blessing.

8. RASHOMON AND THE MULTIPLICITY OF GOOD MODELS

Rashomon is a wonderful Japanese movie in which four people, from different vantage points, witness an incident in which one person dies and another is supposedly raped. When they come to testify in court, they all report the same facts, but their stories of what happened are very different.

What I call the Rashomon Effect is that there is often a multitude of different descriptions [equations f(x)] in a class of functions giving about the same minimum error rate. The most easily understood example is subset selection in linear regression. Suppose there are 30 variables and we want to find the best five-variable linear regressions. There are about 140,000 five-variable subsets in competition. Usually we pick the one with the lowest residual sum-of-squares (RSS) or, if there is a test set, the lowest test error. But there may be (and generally are) many five-variable equations that have RSS within 1.0% of the lowest RSS (see Breiman, 1996a). The same is true if test set error is being measured.

So here are three possible pictures with RSS or test set error within 1.0% of each other:

Picture 1: y = 2.1 + 3.8x3 - 0.6x8 + 83.2x12 - 2.1x17 + 3.2x27,

Picture 2: y = -8.9 + 4.6x5 + 0.01x6 + 12.0x15 + 17.5x21 + 0.2x22,

Picture 3: y = -76.7 + 9.3x2 + 22.0x7 - 13.2x8 + 3.4x11 + 7.2x28.

Which one is better? The problem is that each one tells a different story about which variables are important.
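The multiplicity is easy to see directly. The sketch below is a minimal illustration on synthetic data: the 30-variable, five-variable, and 1.0% figures come from the text, but the data-generating model is an assumption. It enumerates all five-variable regressions and counts how many sit within 1.0% of the lowest RSS:

```python
# Sketch of the Rashomon Effect in best-subsets regression: count how many
# five-variable linear models fit almost as well as the best one.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 30, 5
X = rng.normal(size=(n, p))
# Assumed data-generating model: y depends on several variables plus noise.
y = X[:, :8] @ rng.normal(size=8) + rng.normal(scale=2.0, size=n)

def rss(cols):
    """Residual sum-of-squares of least squares on the chosen columns."""
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

# C(30, 5) = 142,506 subsets, the "about 140,000" of the text.
# Fitting them all takes a minute or two.
all_rss = {cols: rss(cols) for cols in combinations(range(p), k)}
best = min(all_rss.values())
near_best = [c for c, v in all_rss.items() if v <= 1.01 * best]
print(f"{len(near_best)} five-variable models within 1.0% of the lowest RSS")
```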
The Rashomon Effect also occurs with decision trees and neural nets. In my experiments with trees, if the training set is perturbed only slightly, say by removing a random 2-3% of the data, I can get a tree quite different from the original but with almost the same test set error. I once ran a small neural net 100 times on simple three-dimensional data, reselecting the initial weights to be small and random on each run. I found 32 distinct minima, each of which gave a different picture and had about equal test set error.
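The neural net experiment can be imitated along the following lines. This is a hypothetical reconstruction, not the original setup: the three-dimensional response surface, network size, and iteration budget are all assumptions.

```python
# Refit the same small net from many different small random initial weights:
# the fitted weight vectors land in different minima, yet test errors agree.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 3))            # simple three-dimensional data
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

test_errors, weights = [], []
for seed in range(100):                          # 100 runs, new initial weights each run
    net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=seed)
    net.fit(X_tr, y_tr)
    test_errors.append(np.mean((net.predict(X_te) - y_te) ** 2))
    weights.append(np.concatenate([w.ravel() for w in net.coefs_]))

W = np.array(weights)
dists = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)
print("spread of test MSE:", min(test_errors), max(test_errors))
# Large typical distances signal distinct minima (distances overstate
# distinctness a little, since relabeling hidden units gives equivalent nets).
print("typical distance between fitted weight vectors:", np.median(dists))
```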
This effect is closely connected to what I call instability (Breiman, 1996a) that occurs when there are many different models crowded together that have about the same training or test set error. Then a slight perturbation of the data or in the model construction will cause a skip from one model to another. The two models are close to each other in terms of error, but can be distant in terms of the form of the model.

If, in logistic regression or the Cox model, the common practice of deleting the less important covariates is carried out, then the model becomes unstable: there are too many competing models. Say you are deleting from 15 variables to 4 variables. Perturb the data slightly and you will very possibly get a different four-variable model and a different conclusion about which variables are important. In trying to improve accuracy by weeding out less important covariates you run into the multiplicity problem. The picture of which covariates are important can vary significantly between two models having about the same deviance.

Aggregating over a large set of competing models can reduce the nonuniqueness while improving accuracy. Arena et al. (2000) bagged (see Glossary) logistic regression models on a data base of toxic and nontoxic chemicals where the number of covariates in each model was reduced from 15 to 4 by standard best subset selection. On a test set, the bagged model was significantly more accurate than the single model with four covariates. It is also more stable. This is one possible fix. The multiplicity problem and its effect on conclusions drawn from models needs serious attention.
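A sketch of the same fix with scikit-learn's bagging wrapper, on a synthetic stand-in for the toxicity data (the per-bootstrap best subset selection step of Arena et al. is omitted here for brevity):

```python
# Bagging logistic regression: average models fit on bootstrap samples so the
# prediction no longer hinges on one unstable choice of fitted model.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, n_informative=4,
                           random_state=0)      # stand-in for the chemical data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
bagged = BaggingClassifier(LogisticRegression(max_iter=1000),
                           n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single:", single.score(X_te, y_te), "bagged:", bagged.score(X_te, y_te))
```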
9. OCCAM AND SIMPLICITY VS. ACCURACY

Occam's Razor, long admired, is usually interpreted to mean that simpler is better. Unfortunately, in prediction, accuracy and simplicity (interpretability) are in conflict. For instance, linear regression gives a fairly interpretable picture of the y, x relation. But its accuracy is usually less than that of the less interpretable neural nets. An example closer to my work involves trees.

On interpretability, trees rate an A+. A project I worked on in the late 1970s was the analysis of delay in criminal cases in state court systems. The Constitution gives the accused the right to a speedy trial. The Center for the State Courts was concerned that in many states the trials were anything but speedy. It funded a study of the causes of the delay. I visited many states and decided to do the analysis in Colorado, which had an excellent computerized court data system. A wealth of information was extracted and processed.
The dependent variable for each criminal case was the time from arraignment to the time of sentencing. All of the other information in the trial history served as the predictor variables. A large decision tree was grown, and I showed it on an overhead and explained it to the assembled Colorado judges. One of the splits was on District N, which had a larger delay time than the other districts. I refrained from commenting on this. But as I walked out I heard one judge say to another, "I knew those guys in District N were dragging their feet."

While trees rate an A+ on interpretability, they are good, but not great, predictors. Give them, say, a B on prediction.
9.1 Growing Forests for Prediction

Instead of a single tree predictor, grow a forest of trees on the same data, say 50 or 100 trees. If we are classifying, put the new x down each tree in the forest and get a vote for the predicted class. Let the forest prediction be the class that gets the most votes. There has been a lot of work in the last five years on ways to grow the forest. All of the well-known methods grow the forest by perturbing the training set, growing a tree on the perturbed training set, perturbing the training set again, growing another tree, and so on. Some familiar methods are bagging (Breiman, 1996b), boosting (Freund and Schapire, 1996), arcing (Breiman, 1998), and additive logistic regression (Friedman, Hastie and Tibshirani, 1998).

My preferred method to date is random forests. In this approach successive decision trees are grown by introducing a random element into their construction. For example, suppose there are 20 predictor variables. At each node choose several of the 20 at random to use to split the node. Or use a random combination of a random selection of a few variables. This idea appears in Ho (1998) and in Amit and Geman (1997), and is developed in Breiman (1999).
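In scikit-learn terms, the random-selection-per-node idea is controlled by the max_features argument of RandomForestClassifier. A minimal sketch, with the data set assumed:

```python
# Random forest: each tree is grown on a bootstrap sample, and each node is
# split using only a small random subset of the 20 predictors.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,     # "say 50 or 100" trees
    max_features=4,       # several of the 20 chosen at random at each node
    random_state=0,
).fit(X, y)
# The forest prediction is the plurality vote of the 100 trees.
print(forest.predict(X[:5]))
```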
9.2 Forests Compared to Trees

We compare the performance of single trees (CART) to random forests on a number of small and large data sets, mostly from the UCI repository (ftp.ics.uci.edu/pub/MachineLearningDatabases). A summary of the data sets is given in Table 1.

TABLE 1
Data set descriptions

Data set     Training sample size   Test sample size   Variables   Classes
Cancer                699                  --                9         2
Ionosphere            351                  --               34         2
Diabetes              768                  --                8         2
Glass                 214                  --                9         6
Soybean               683                  --               35        19
Letters            15,000               5,000               16        26
Satellite           4,435               2,000               36         6
Shuttle            43,500              14,500                9         7
DNA                 2,000               1,186               60         3
Digit               7,291               2,007              256        10

Table 2 compares the test set error of a single tree to that of the forest. For the five smaller data sets above the line, the test set error was estimated by leaving out a random 10% of the data, then running CART and the forest on the other 90%. The left-out 10% was run down the tree and the forest and the error on this 10% computed for both. This was repeated 100 times and the errors averaged. The larger data sets below the line came with a separate test set.

TABLE 2
Test set misclassification error (%)

Data set           Forest   Single tree
Breast cancer        2.9        5.9
Ionosphere           5.5       11.2
Diabetes            24.2       25.3
Glass               22.0       30.4
Soybean              5.7        8.6
Letters              3.4       12.4
Satellite            8.6       14.8
Shuttle (x10^-3)     7.0       62.0
DNA                  3.9        6.2
Digit                6.2       17.1

People who have been in the classification field for a while find these increases in accuracy startling. Some errors are halved. Others are reduced by one-third. In regression, where the ...
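The 10% holdout protocol described above is straightforward to restate in code. A sketch on a synthetic stand-in (the UCI files themselves are not downloaded here):

```python
# Repeat 100 times: hold out a random 10%, fit CART and a forest on the other
# 90%, record both errors on the held-out 10%; then average the errors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=700, n_features=9, random_state=0)
splitter = ShuffleSplit(n_splits=100, test_size=0.10, random_state=0)

tree_err, forest_err = [], []
for train, test in splitter.split(X):
    tree = DecisionTreeClassifier(random_state=0).fit(X[train], y[train])
    forest = RandomForestClassifier(n_estimators=100,
                                    random_state=0).fit(X[train], y[train])
    tree_err.append(1 - tree.score(X[test], y[test]))
    forest_err.append(1 - forest.score(X[test], y[test]))

print(f"single tree: {100 * np.mean(tree_err):.1f}%"
      f"  forest: {100 * np.mean(forest_err):.1f}%")
```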
... the prediction vectors in class #1 are {x(1)} and those in class #2 are {x(2)}. If these two sets of vectors can be separated by a hyperplane then there is an optimal separating hyperplane. "Optimal" is defined as meaning that the distance of the hyperplane to any prediction vector is maximal (see below).

The set of vectors in {x(1)} and in {x(2)} that achieve the minimum distance to the optimal separating hyperplane are called the support vectors. Their coordinates determine the equation of the hyperplane. Vapnik (1995) showed that if a separating hyperplane exists, then the optimal separating hyperplane has low generalization error (see Glossary).

[Figure: two classes of points separated by the optimal hyperplane, with the support vectors lying closest to it.]
In two-class data, separability by a hyperplane does not often occur. However, let us increase the dimensionality by adding as additional predictor variables all quadratic monomials in the original predictor variables; that is, all terms of the form x_m1 x_m2. A hyperplane in the original variables plus quadratic monomials in the original variables is a more complex creature. The possibility of separation is greater. If no separation occurs, add cubic monomials as input features. If there are originally 30 predictor variables, then there are about 40,000 features if monomials up to the fourth degree are added.
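In practice the monomials are rarely enumerated explicitly; a polynomial kernel searches the same expanded feature space implicitly. A sketch, with the data assumed (with 30 original variables there are C(34, 4) = 46,376 monomials of degree at most four, the "about 40,000" of the text):

```python
# A degree-4 polynomial kernel implicitly looks for a separating hyperplane
# in the space of all monomials up to degree 4 of the original variables.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
svm = SVC(kernel="poly", degree=4, coef0=1.0, C=10.0).fit(X, y)
print("training error:", 1 - svm.score(X, y))   # often 0: separation achieved
```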
The higher the dimensionality of the set of features, the more likely it is that separation occurs. In the ZIP Code data set, separation occurs with fourth degree monomials added. The test set error is 4.1%. Using a large subset of the NIST data base as a training set, separation also occurred after adding up to fourth degree monomials and gave a test set error rate of 1.1%.

Separation can always be had by raising the dimensionality high enough. But if the separating hyperplane becomes too complex, the generalization error becomes large. An elegant theorem (Vapnik, 1995) gives this bound for the expected generalization error:

Ex(GE) <= Ex(number of support vectors)/(N - 1),

where N is the sample size and the expectation is over all training sets of size N drawn from the same underlying distribution as the original training set.
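The bound is easy to evaluate for a fitted machine, since the support vectors are exposed after fitting. A sketch, with synthetic separable data assumed and a large C standing in for the hard-margin machine:

```python
# Evaluate Vapnik's bound: expected generalization error is at most the
# expected number of support vectors divided by N - 1.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=30, class_sep=2.0,
                           random_state=0)
svm = SVC(kernel="poly", degree=4, coef0=1.0, C=1e6).fit(X, y)  # near hard-margin

n_sv = len(svm.support_)                 # indices of the support vectors
print(f"{n_sv} support vectors; bound on expected GE: {n_sv / (len(y) - 1):.3f}")
```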
The number of support vectors increases with the dimensionality of the feature space. If this number becomes too large, the separating hyperplane will not give low generalization error. If separation cannot be realized with a relatively small number of support vectors, there is another version of support vector machines that defines optimality by adding a penalty term for the vectors on the wrong side of the hyperplane.

Some ingenious algorithms make finding the optimal separating hyperplane computationally feasible. These devices reduce the search to a solution of a quadratic programming problem with linear inequality constraints that are of the order of the number N of cases, independent of the dimension of the feature space. Methods tailored to this particular problem produce speed-ups of an order of magnitude over standard methods for solving quadratic programming problems.

Support vector machines can also be used to provide accurate predictions in other areas (e.g., regression). It is an exciting idea that gives excellent performance and is beginning to supplant the use of neural nets. A readable introduction is Cristianini and Shawe-Taylor (2000).

11. INFORMATION FROM A BLACK BOX

The dilemma posed in the last section is that the models that best emulate nature in terms of predictive accuracy are also the most complex and inscrutable. But this dilemma can be resolved by realizing the wrong question is being asked. Nature forms the outputs y from the inputs x by means of a black box with complex and unknown interior:

    y <-- [nature] <-- x

Current accurate prediction methods are also complex black boxes:

    y <-- [neural nets / forests / support vectors] <-- x

So we are facing two black boxes, where ours seems only slightly less inscrutable than nature's. In data generated by medical experiments, ensembles of predictors can give cross-validated error rates significantly lower than logistic regression. My biostatistician friends tell me, "Doctors can interpret logistic regression." There is no way they can interpret a black box containing fifty trees hooked together. In a choice between accuracy and interpretability, they'll go for interpretability. Framing the question as the choice between accuracy and interpretability is an incorrect interpretation of what the goal of a statistical analysis is.
...

[FIG. 1. Standardized coefficients, logistic regression.]

[Figure: values plotted against variables 1-20.]
[Figure: VARIABLE 12 vs PROBABILITY #1 and VARIABLE 17 vs PROBABILITY #1; x-axis, class 1 probability.]

[Figure: values plotted against variables 1-7.]
The graphs of the variable values vs. class death probability are almost linear and similar. The two variables turn out to be highly correlated. Thinking that this might have affected the logistic regression results, it was run again with one or the other of these two variables deleted. There was little change.

Out of curiosity, I evaluated variable importance in logistic regression in the same way that I did in random forests, by permuting variable values in the 10% test set and computing how much that increased the test set error. Not much help: variables 12 and 17 were not among the 3 variables ranked as most important. In partial verification of the importance of 12 and 17, I tried them separately as single variables in logistic regression. Variable 12 gave a 15.7% error rate; variable 17 came in at 19.3%.
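That permutation evaluation takes only a few lines. A sketch on a synthetic stand-in for the data of this example (19 predictor variables assumed, mirroring the figures; the 10% test set follows the text's protocol):

```python
# Permutation importance applied to logistic regression: permute one
# variable's values in the test set and see how much the error rises.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=19, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
base_error = 1 - model.score(X_te, y_te)

rng = np.random.default_rng(0)
for j in range(X.shape[1]):
    X_perm = X_te.copy()
    rng.shuffle(X_perm[:, j])               # destroy variable j's information
    increase = (1 - model.score(X_perm, y_te)) - base_error
    print(f"variable {j + 1}: error increase {increase:+.3f}")
```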
To go back to the original Diaconis-Efron analysis, the problem is clear. Variables 12 and 17 are surrogates for each other. If one of them appears important in a model built on a bootstrap sample, the other does not. So each one's frequency of occurrence is automatically less than 50%. The paper lists the variables selected in ten of the samples. Either 12 or 17 appear in seven of the ten.

11.2 Example II: Clustering in Medical Data

The Bupa liver data set is a two-class biomedical data set also available at ftp.ics.uci.edu/pub/MachineLearningDatabases. The covariates are:

1. mcv: mean corpuscular volume
2. alkphos: alkaline phosphotase
3. sgpt: alamine aminotransferase
4. sgot: aspartate aminotransferase
5. gammagt: gamma-glutamyl transpeptidase
6. drinks: number of half-pint equivalents of alcoholic beverage drunk per day

The first five attributes are the results of blood tests thought to be related to liver functioning. The 345 patients are classified into two classes by the severity of their liver malfunctioning. Class two is severe malfunctioning. In a random forests run, ...

[Figure: cluster averages over the variables; legend: cluster 1 class 2, cluster 2 class 2, cluster class 1.]

[Figure: scatter of values 0-600 against variable number 0-5000.]