Professional Documents
Culture Documents
T E C H N I C A L CO M M E N T
PSYCHOLOGY
Comment on Estimating
the reproducibility of
psychological science
Daniel T. Gilbert,1* Gary King,1 Stephen Pettigrew,1 Timothy D. Wilson2
A paper from the Open Science Collaboration (Research Articles, 28 August 2015,
aac4716) attempting to replicate 100 published studies suggests that the reproducibility of
psychological science is surprisingly low. We show that this article contains three
statistical errors and provides no support for such a conclusion. Indeed, the data are
consistent with the opposite conclusion, namely, that the reproducibility of psychological
science is quite high.
SCIENCE sciencemag.org
1037-a
R ES E A RC H | T E C H N I C A L C O M M E N T
1037-a
sciencemag.org SCIENCE
TechnicalComment:SupplementaryAppendix
Title:
SupplementaryappendixforCommentonEstimatingthereproducibilityofpsychological
science.
1
1
1
2
Authors:
DanielT.Gilbert
,GaryKing
,StephenPettigrew
,TimothyD.Wilson
(1)
1
2
Affiliations
:HarvardUniversity
,UniversityofVirginia
Summary:
OnewaytheOpenScienceCollaborationsrecentreport(
1,
hereinafterreferredto
asOSC2015)definesasuccessfulreplicationisasoneinwhichthe95%confidenceinterval
(CI)fromthereplicationcapturesthepointestimatefromtheoriginalpublishedresult(see
OSC2015Table1,column10).Thisdoesnotmakestatisticalsensebecausetheeffectthatthey
weretryingtoreplicatewasthatoftheoriginalstudy,notofthereplication.Thus,weinverted
thecalculationtodeterminewhetherthepointestimatesfromthereplicationswerecaptured
withinthe95%CIoftheoriginalarticle.Makingthisswitchhasanegligibleimpactonthe
numbersreportedbelowandchangesnoconclusions.
Iftheonlychangemadetotheoriginalstudieswasthatnewsamplesweredrawnfromeachof
theoriginalpopulations,wewouldexpectthat95%ofeffectsfromthereplicationswouldfall
insidetheCIoftheoriginalstudiesduetochancealone.OSC2015findsthatonly47%studies
replicatedbasedonthiscriterion(OSC2015,Table1,column10).However,weknowthatin
additiontodrawingnewsamples,thereplicationexperimentsconductedinOSC2015deviated
fromthedesignprotocolsoftheoriginalstudiesmuchmorethansimplydrawinganewsample.
Weconclude:(a)ThepercentofstudiesinOSC2015thatshouldbeexpectedtofailtoreplicate
bychancealoneisnot5%,butatleast34.54%,andprobablymuchhigher(b)Ifthereplication
studiesinOSC2015hadbeenmorehighlypowered,theobservedreplicationratewouldhave
beenquitehighand(c)IfOSC2015hadanalyzedonlythehighfidelityreplicationsthatwere
endorsedbytheoriginalauthors,thepercentofsuccessfulreplicationswouldhavebeen
statisticallyindistinguishablefrom100%.
Details:
TheManyLabsDesign.
InManyLabsproject(
2,
hereinafterreferredtoasML2014),the
authorschose13previouslypublishedstudiesfrompsychologicalscienceandhad36different
groupsofresearchers(manylabs)attempttoreplicateeachstudy.Eachlabdrewasample
fromadifferentpopulationandthereweredifferencesbetweenlabsinhowthesampleswere
drawn.Oneofthe13studieswasreplicatedinfourdifferentways,andafewwerereplicatedin
fewerthan36labs.Thisresultedinatotalof574totalreplicationdatasets.
Replicationinformation
:Allinformationandcodenecessarytoreplicateourresultsare
archivedindataverse(3).
1. ERROR
Thetotaluncertaintyduetoreplicationcouldbeestimatedbyhaving
fullyindependent
teams
replicatethesamestudy,wherefullyindependentmeansabsolutelynocommunication
betweentheteams.Overall,totalreplicationuncertaintyisdueto(a)identifyingthepopulation
ofinterest,(b)samplingsubjectsfromthatpopulation,and(c)choosingandimplementingthe
experimentalprotocols(suchasinpersonversusonlineinteractions,detailsaboutthesurveyor
othermeasurementinstrument,howthetreatmentwasadministered,whatcovariatesor
experimentalconditionswereheldphysicallyorstatisticallyconstant,whatstatisticalestimator
wasused,etc.).Themoreuncertaintyinreplication,thelargerthenumberofstudiesweshould
expecttofailtoreplicatebychancealone.
ThereplicationproceduresusedinOSC2015ensuredconsiderablevariabilityfromallthree
sources,andyet,whentheauthorsofOSC2015computedthenumberofstudiesoneshould
expecttofail,theyassumedthattheonlysourceofvariabilitywasitem(b)above.They
explainedthatonthebasisofonlytheaveragereplicationpowerofthe97original,significant
effects[M=0.92,median(Mdn)=0.95],wewouldexpectapproximately89positiveresultsin
thereplicationsifalloriginaleffectsweretrueandaccuratelyestimated.(OSC2015,page
aac47164)Thisestimateofa92%successrateistheaverageofthestatisticalpowerforthe100
replicationstudiesandrepresentstheexpectednumberofreplicationpointestimatesthatare
statisticallysignificantandinthesamedirectionoftheoriginalresult.OSC2015doesnot
provideasimilarbaselinefortheCIreplicationtestfromTable1,column10,althoughbasedon
statisticaltheoryweknowthat95%ofreplicationestimatesshouldfallwithinthe95%CIofthe
originalresults.Giventhatthereplicationprotocolsweredifferentfromthoseoftheoriginal
studies,thereisstrongreasontosuspectthatthesefiguresseverelyoverestimatetheactual
numberofstudiesoneshouldexpecttofailtoreplicatebychancealone.
ML2014replicatedthesamestudy35or36timesbuthadmanydependenciesacrossthe
replications:Webundledtheselectedstudiestogetherintoabrief,easytoadminister
experimentthatwasdeliveredtoeachparticipatingsample[i.e.,fromeachlab]throughasingle
infrastructure(
http://projectimplicit.net/
).Therearemanyfactorsthatcaninfluencethe
replicabilityofaneffectsuchassample,setting,statisticalpower,andproceduralvariations.The
presentdesignstandardizesproceduralcharacteristicsandensuresappropriatestatisticalpowerin
ordertoexaminetheeffectsofsampleandsettingonreplicability(ML2014,page143).This
meansthatwecanusetheML2014datatoestimatethenumberofstudiesthatwouldbe
expectednottoreplicateonthebasisofuncertaintyduetouncertaintyfrom(a),(b),butonly
partsof(c).Thus,acalculationusingthesedatawillbemorerealisticthantheoneinOSC2015
whichisbasedon(b)alonebutwillstillunderestimatethenumberoffailurestoreplicatethatare
duetochance.
WefirstusedtheCIfromonelabstestofoneofthe13studies(andnodatafromtheoriginal
studiestheyattemptedtoreplicate)asabaseline(analogoustothesituationwherethis
particularreplicationhadbeenthepublishedresult).Wetreatedthetestsofthatstudyfromthe
other35labsasreplicationsofthisresult.Repeatingthisforall574studylabcombinations,
wefindthat34.5%ofreplicationsgeneratedapointestimateoutsidethe95%CIofthe
publishedresult.Thisfarexceedsthe5%failureratewewouldexpectdueto(b)alone.
Conclusion1.
OSC2015concludedthatif100%oftheoriginalstudiestheyattemptedto
replicatehadproducedtrueeffects,theywouldexpect92%ofthem(or95%basedonthemetric
weareusing)toreplicate.Infact,usingtheML2014datasolelytoestimatereliabilityofthe
replicators,weshouldexpectonly65.5%ofthoseoriginalstudiestobesuccessfullyreplicated
givenOSC2015sdesign.Becausethislatterestimateexcludesmanysourcesofuncertaintydue
to(c),andbecausethedeviationsbetweentheprotocolsusedinthearticlesreplicatedby
OSC2015andthesinglereplicationsweobservearesolarge,
thepercentoforiginalarticles
containingtrueeffectsthatoneshouldexpecttoreplicategiventheOSC2015designwould
probablybelowerthan66%
.
2.POWER
TheauthorsofML2014assessedreplicationsuccessforeachoriginalpublishedstudybypooling
togetherdatafromall(approximately)36labsandconductedonehighpoweredanalysis.For11
ofthe13articles,thispooledteststatisticwasstatisticallysignificantandinthesamedirection
astheoriginalstudy,andbymostothercriteria,10or11ofthe13studiesweresuccessfully
replicated.TheauthorsofML2014tookthisasevidencethatoriginalresultswerereplicatedin
11of13or85%ofthearticles.Clearly,theyfoundnoevidenceofareplicationcrisisin
psychologicalscience.
ToemulatethesetupofOSC2015usingtheML2014data,weconsideredeachofthe574
replicationsindividually,ratherthanpoolingthem.Wecalculatedthe95%CIoftheestimate
reportedinthe13original,publishedarticlesanddeterminedwhetherthepointestimateofthe
574replicationsfellwithinthisinterval.Wefoundthat34.0%ofthe574replicationswere
withinthe95%CIreportedbytheoriginalauthor,whichwouldseemtosuggestevenmoreofa
replicationcrisisthandoesOSC2015,inwhich47%replicatedbasedonthiscriterion
(OSC2015,Table1,column10).
ThisapproachmirrorstheonetakenbyOSC2015,wherereplicationsamplesizewasroughlythe
sameastheoriginalpublishedarticle.WhenwereruntheaboveanalysisusingonlytheML2014
replicationthathadasampleatleastasbigastheoriginalpublishedarticle,ourconclusions
remainthesame.AmongthehighestpoweredManyLabsreplications,just40.7%reporteda
resultinsidethe95%CIoftheoriginalstudy.
Conclusion2.
ML2014,whichcombinedallthedatafrom36studiesintoonepooledstudy,was
verypowerful,andonthatbasisconcludedthat11of13or85%oftheoriginalstudieswere
replicated.Inotherwords,thereisnoevidenceofareplicationcrisisinpsychologicalscience.
Byselectingonereplicationoutofthe35or36replicationsfromML2014,weapproximatedthe
procedureusedinOSC2015.Whenwedidthis,wefoundthattheprobabilityofreplicationina
singlestudywasonlyabout34%,whichisevenworsethantheOSC2015figureof47%.The
implicationisthatifeachofthe100originalstudiesexaminedinOSC2015hadbeenreplicated
withamorepowerfuldesign(orifmanystudieswereconductedandpooledtogether,aswas
doneinML2014),anextremelylargepercentageofthe100originalstudieswouldhave
replicated.Theclearconclusionisthat
OSC2015providesnoevidenceforareplicationcrisisin
psychologicalscience.
3.BIAS
PriortorunningeachreplicationintheOSC2015study,eachreplicationteamcontactedthe
authorsoftheoriginalstudythattheywereseekingtoreplicate(givenprotocolsprovidedbythe
managingteam).Theoriginalauthorswereaskedtheextenttowhichtheyendorsedthe
replicationprotocolpriortorunningtheexperiment.Most(69%)endorsedthereplication
protocol,8%hadconcernsaboutthereplicationplanbasedoninformedjudgmentorspeculation,
3%hadconcernsbasedonpublishedempiricalevidenceabouttheconstraintsoftheeffect,and
18%didnotrespond.Weasked:Whatwouldthereplicationratehavebeenifallofthe
replicationsinOSC2015hadbeenhighenoughinfidelitytohavebeenendorsedbytheoriginal
authors?Oralternatively,ifOSC2015hadonlyanalyzedthe69%ofstudieswhichwere
endorsedbytheoriginalauthors?
InOSC2015,endorsementwasastrongpredictorofreplicationsuccess:Ofthereplication
protocolsthatwerenotendorsedbyoriginalauthors,only15.4%weresuccessfullyreplicated,
butofthereplicationprotocolsthatwereendorsedbytheoriginalauthors,astriking3.9times
moreor59.7%weresuccessfullyreplicated.IfOSC2015hadonlyanalyzedthereplications
thatwereendorsedbytheoriginalauthors,theirsuccessratewouldhaveincreasedfrom47%to
59.7%(95%CI:[47.7%,70.4%]).
Wealsoestimatedalogisticregressionmodelofreplicationsuccessonthedegreeof
endorsementbytheoriginalauthors,operationalizedasacategoricalvariablewith4levels.We
usedtheseregressionresultstogeneratepredictedprobabilitiesofsuccessfulreplicationinthe
counterfactualworldinwhichallstudieswereendorsed.Inthisworld,inwhichallreplications
wereofhighenoughfidelitytohavebeenendorsedbytheoriginalauthors,ourmodelpredicted
thatthereplicationratewouldhavebeen58.6%(95%CI:[47.0%,69.5%]),controllingfor
covariates(thecitationcountoftheoriginalpaper,thediscipline/subfieldoftheresearch,and
prereplicationassessmentsofwhetherhighlevelmethodologicalexpertisewouldberequired
andwhethertheoriginalpapersdesignhadahighprobabilityofexpectancybias).These
confidenceintervalsoverlapthe(lowerthan)65.5%rateofsuccessfulreplicationthatwewould
expectbychanceifalloftheoriginalstudieshadproducedtrueeffectsandallofthesourcesof
uncertaintyhadbeenconsidered.
Thislastanalysisisanextrapolationtoacounterfactualsituationandbydefinitionmoremodel
dependent.Theuncertaintiesgeneratedbythismodeldependencedonotmeanthismodel
shouldnotberunatpresent,itistheonlywaytoestimatethequantityofinterestthatOSC2015
soughttoestimate.Moreover,theuncertaintyexistswhetherornotthemodelisrun,includingif,
likeOSC2015,oneweretoignorethemistakesduetothereplicationinfidelities.
Conclusion3
.
IfOSC2015hadconductedonlyreplicationsthatwereofhighenoughfidelityto
beendorsedbyoriginalauthors,thepercentofsuccessfulreplicationstheyobservedwouldhave
beenstatisticallyindistinguishablefromthepercentonewouldexpectifalloftheoriginaleffects
hadbeentrue.
References
1.
OpenScienceCollaboration,Estimatingthereproducibilityofpsychologicalscience.
Science.
349
,aac47161aac47168(2015).
2.
R.A.Klein
etal
,InvestigatingVariationinReplicability:AManyLabsReplication
Project.
SocialPsychology
45
(3):142152(2014).
3.
Gilbert,DanielKing,GaryPettigrew,StephenWilson,Timothy,2016,"ReplicationData
for:Commenton`Estimatingthereproducibilityofpsychologicalscience.",
http://dx.doi.org/10.7910/DVN/5LKVH2
,HarvardDataverse,V1
[UNF:6:2+oJQsG7lz6j6Kqaj/bF+g==]