PLoS Medicine | www.plosmedicine.org | August 2005 | Volume 2 | Issue 8 | e124

Citation: Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.

Copyright: © 2005 John P. A. Ioannidis. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abbreviation: PPV, positive predictive value

John P. A. Ioannidis is in the Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece, and Institute for Clinical Research and Health Policy Studies, Department of Medicine, Tufts-New England Medical Center, Tufts University School of Medicine, Boston, Massachusetts, United States of America. E-mail: jioannid@cc.uoi.gr

Competing Interests: The author has declared that no competing interests exist.

DOI: 10.1371/journal.pmed.0020124

The Essay section contains opinion pieces on topics of broad interest to a general medical audience.

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.

Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1-3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6-8]. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key factors that influence this problem and some corollaries thereof.

Modeling the Framework for False Positive Findings

Several methodologists have pointed out [9-11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. "Negative" research is also very useful. "Negative" is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.

As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let $R$ be the ratio of the number of "true relationships" to "no relationships" among those tested in the field. $R$ is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is $R/(R + 1)$. The probability of a study finding a true relationship reflects the power $1 - \beta$ (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, $\alpha$. Assuming that $c$ relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets $\mathrm{PPV} = (1 - \beta)R/(R - \beta R + \alpha)$. A research finding is thus
more likely true than false if $(1 - \beta)R > \alpha$. Since usually the vast majority of investigators depend on $\alpha = 0.05$, this means that a research finding is more likely true than false if $(1 - \beta)R > 0.05$.

Table 1. Research Findings and True Relationships

| Research Finding | True Relationship: Yes | True Relationship: No | Total |
| Yes | $c(1 - \beta)R/(R + 1)$ | $c\alpha/(R + 1)$ | $c(R + \alpha - \beta R)/(R + 1)$ |
| No | $c\beta R/(R + 1)$ | $c(1 - \alpha)/(R + 1)$ | $c(1 - \alpha + \beta R)/(R + 1)$ |
| Total | $cR/(R + 1)$ | $c/(R + 1)$ | $c$ |

What is less well appreciated is that bias and the extent of repeated independent testing by different teams of investigators around the globe may further distort this picture and may lead to even smaller probabilities of the research findings being indeed true. We will try to model these two factors in the context of similar 2 × 2 tables.

Bias

First, let us define bias as the combination of various design, data, analysis, and presentation factors that tend to produce research findings when they should not be produced. Let $u$ be the proportion of probed analyses that would not have been "research findings," but nevertheless end up presented and reported as such, because of bias. Bias should not be confused with chance variability that causes some findings to be false by chance even though the study design, data, analysis, and presentation are perfect. Bias can entail manipulation in the analysis or reporting of findings. Selective or distorted reporting is a typical form of such bias. We may assume that $u$ does not depend on whether a true relationship exists or not. This is not an unreasonable assumption, since typically it is impossible to know which relationships are indeed true. In the presence of bias (Table 2), one gets $\mathrm{PPV} = ([1 - \beta]R + u\beta R)/(R + \alpha - \beta R + u - u\alpha + u\beta R)$, and PPV decreases with increasing $u$, unless $1 - \beta \le \alpha$, i.e., $1 - \beta \le 0.05$ for most situations. Thus, with increasing bias, the chances that a research finding is true diminish considerably. This is shown for different levels of power and for different pre-study odds in Figure 1.

Conversely, true research findings may occasionally be annulled because of reverse bias. For example, with large measurement errors relationships are lost in noise [12], or investigators use data inefficiently or fail to notice statistically significant relationships, or there may be conflicts of interest that tend to "bury" significant findings [13]. There is no good large-scale empirical evidence on how frequently such reverse bias may occur across diverse research fields. However, it is probably fair to say that reverse bias is not as common. Moreover, measurement errors and inefficient use of data are probably becoming less frequent problems, since measurement error has decreased with technological advances in the molecular era and investigators are becoming increasingly sophisticated about their data. Regardless, reverse bias may be modeled in the same way as bias above. Also, reverse bias should not be confused with chance variability that may lead to missing a true relationship because of chance.

Testing by Several Independent Teams

Several independent teams may be addressing the same sets of research questions. As research efforts are globalized, it is practically the rule that several research teams, often dozens of them, may probe the same or similar questions. Unfortunately, in some areas, the prevailing mentality until now has been to focus on isolated discoveries by single teams and interpret research experiments in isolation. An increasing number of questions have at least one study claiming a research finding, and this receives unilateral attention. The probability that at least one study, among several done on the same question, claims a statistically significant research finding is easy to estimate. For $n$ independent studies of equal power, the 2 × 2 table is shown in Table 3: $\mathrm{PPV} = R(1 - \beta^n)/(R + 1 - [1 - \alpha]^n - R\beta^n)$ (not considering bias). With increasing number of independent studies, PPV tends to decrease, unless $1 - \beta < \alpha$, i.e., typically $1 - \beta < 0.05$. This is shown for different levels of power and for different pre-study odds in Figure 2. For $n$ studies of different power, the term $\beta^n$ is replaced by the product of the terms $\beta_i$ for $i = 1$ to $n$, but inferences are similar.

Corollaries

A practical example is shown in Box 1. Based on the above considerations, one may deduce several interesting corollaries about the probability that a research finding is indeed true.

Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true. Small sample size means smaller power and, for all functions above, the PPV for a true research finding decreases as power decreases towards $1 - \beta = 0.05$. Thus, other factors being equal, research findings are more likely true in scientific fields that undertake large studies, such as randomized controlled trials in cardiology (several thousand subjects randomized) [14] than in scientific fields with small studies, such as most research of molecular predictors (sample sizes 100-fold smaller) [15].

Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true. Power is also related to the effect size. Thus research findings are more likely true in scientific fields with large effects, such as the impact of smoking on cancer or cardiovascular disease (relative risks 3-20), than in scientific fields where postulated effects are small, such as genetic risk factors for multigenetic diseases (relative risks 1.1-1.5) [7]. Modern epidemiology is increasingly obliged to target smaller

Table 2. Research Findings and True Relationships in the Presence of Bias

| Research Finding | True Relationship: Yes | True Relationship: No | Total |
| Yes | $(c[1 - \beta]R + uc\beta R)/(R + 1)$ | $(c\alpha + uc[1 - \alpha])/(R + 1)$ | $c(R + \alpha - \beta R + u - u\alpha + u\beta R)/(R + 1)$ |
| No | $(1 - u)c\beta R/(R + 1)$ | $(1 - u)c(1 - \alpha)/(R + 1)$ | $c(1 - u)(1 - \alpha + \beta R)/(R + 1)$ |
| Total | $cR/(R + 1)$ | $c/(R + 1)$ | $c$ |
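As a hedged illustration, the three PPV expressions discussed in this part of the essay (no bias, bias $u$, and $n$ independent studies of equal power) can be evaluated directly. The sketch below is not part of the original essay: the function names and the default values alpha = 0.05 and beta = 0.20 are assumptions made for the example.

```python
# Minimal sketch of the essay's PPV formulas.
# Function names and default alpha/beta values are illustrative assumptions.

def ppv(R, alpha=0.05, beta=0.20):
    """PPV without bias: (1 - beta) R / (R - beta R + alpha)."""
    return (1 - beta) * R / (R - beta * R + alpha)


def ppv_with_bias(R, u, alpha=0.05, beta=0.20):
    """PPV when a proportion u of non-findings is reported as findings."""
    true_positives = (1 - beta) * R + u * beta * R
    all_positives = R + alpha - beta * R + u - u * alpha + u * beta * R
    return true_positives / all_positives


def ppv_n_studies(R, n, alpha=0.05, beta=0.20):
    """PPV when at least one of n equally powered studies is significant."""
    true_positives = R * (1 - beta ** n)
    all_positives = R + 1 - (1 - alpha) ** n - R * beta ** n
    return true_positives / all_positives


if __name__ == "__main__":
    # Pre-study odds R from 1:1 down to 1:100, chosen only for illustration.
    for R in (1.0, 0.5, 0.1, 0.01):
        print(
            f"R={R:<5} no bias={ppv(R):.3f} "
            f"u=0.3: {ppv_with_bias(R, 0.3):.3f} "
            f"n=10: {ppv_n_studies(R, 10):.3f}"
        )
```

Setting `u = 0` or `n = 1` reduces the second and third functions to the first, which is a quick consistency check on the reconstructed formulas. With the assumed defaults, `ppv` crosses 0.5 exactly where $(1 - \beta)R = \alpha$, i.e., at $R = 0.0625$.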
alternating extreme research claims and extremely opposite refutations [29]. Empirical evidence suggests that this sequence of extreme opposites is very common in molecular genetics [29].

These corollaries consider each factor separately, but these factors often influence each other. For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large. Or prejudice may prevail in a hot scientific field, further undermining the predictive value of its research findings. Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results. Conversely, the fact that a field is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research findings. Or massive discovery-oriented testing may result in such a large yield of significant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.

Most Research Findings Are False for Most Research Designs and for Most Fields

In the described framework, a PPV exceeding 50% is quite difficult to get. Table 4 provides the results of simulations using the formulas developed for the influence of power, ratio of true to non-true relationships, and bias, for various types of situations that may be characteristic of specific study designs and settings. A finding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is eventually true about 85% of the time. A fairly similar performance is expected of a confirmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic finding from inconclusive studies where pooling is used to "correct" the low power of single studies, is probably false if $R \le 1{:}3$. Research findings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in five chance being true, if $R = 1{:}10$. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], PPV for each claimed relationship is extremely low, even with considerable

Figure 2. PPV (Probability That a Research Finding Is True) as a Function of the Pre-Study Odds for Various Numbers of Conducted Studies, n
Panels correspond to power of 0.20, 0.50, and 0.80.
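The scenarios described in the section on research designs can be checked numerically with the bias-adjusted expression $\mathrm{PPV} = ([1 - \beta]R + u\beta R)/(R + \alpha - \beta R + u - u\alpha + u\beta R)$. The sketch below is only an illustration. Table 4 is not reproduced in this text, so the power and bias values attached to each scenario are assumptions chosen for the example, as are the function and variable names; the passage itself states only the pre-study odds and the approximate outcomes.

```python
# Illustrative check of the scenarios in the text.
# Power and bias (u) values are ASSUMED for illustration; only the
# pre-study odds R and the approximate outcomes come from the passage.

def ppv_with_bias(R, power, u, alpha=0.05):
    """Bias-adjusted PPV for pre-study odds R, power 1 - beta, and bias u."""
    beta = 1.0 - power
    true_positives = (1.0 - beta) * R + u * beta * R
    all_positives = R + alpha - beta * R + u - u * alpha + u * beta * R
    return true_positives / all_positives


# (label, pre-study odds R, assumed power, assumed bias u)
SCENARIOS = [
    ("Adequately powered RCT, R = 1:1", 1.0, 0.80, 0.10),
    ("Underpowered early-phase trial, R = 1:5", 0.2, 0.20, 0.20),
    ("Well-powered exploratory epidemiological study, R = 1:10", 0.1, 0.80, 0.30),
    ("Discovery-oriented research, R = 1:1,000", 0.001, 0.20, 0.80),
]

if __name__ == "__main__":
    for label, R, power, u in SCENARIOS:
        print(f"{label}: PPV = {ppv_with_bias(R, power, u):.4f}")
```

Under these assumed inputs the first three scenarios land close to the figures quoted in the text (about 85%, about one in four, and about one in five), and the discovery-oriented scenario gives a PPV well under 1%. Other power and bias values would shift the numbers, which is why the inputs are kept in one explicit list.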
totality of the evidence. Diminishing bias through enhanced research standards and curtailing of prejudices may also help. However, this may require a change in scientific mentality that might be difficult to achieve. In some research designs, efforts may also be more successful with upfront registration of studies, e.g., randomized trials [35]. Registration would pose a challenge for hypothesis-generating research. Some kind of registration or networking of data collections or investigators within fields may be more feasible than registration of each and every hypothesis-generating experiment. Regardless, even if we do not see a great deal of progress with registration of studies in other fields, the principles of developing and adhering to a protocol could be more widely borrowed from randomized controlled trials.

Finally, instead of chasing statistical significance, we should improve our understanding of the range of $R$ values (the pre-study odds) where research efforts operate [10]. Before running an experiment, investigators should consider what they believe the chances are that they are testing a true rather than a non-true relationship. Speculated high $R$ values may sometimes then be ascertained. As described above, whenever ethically acceptable, large studies with minimal bias should be performed on research findings that are considered relatively established, to see how often they are indeed confirmed. I suspect several established "classics" will fail the test.

Nevertheless, most new discoveries will continue to stem from hypothesis-generating research with low or very low pre-study odds. We should then acknowledge that statistical significance testing in the report of a single study gives only a partial picture, without knowing how much testing has been done outside the report and in the relevant field at large. Despite a large statistical literature for multiple testing corrections [37], usually it is impossible to decipher how much data dredging by the reporting authors or other research teams has preceded a reported research finding. Even if determining this were feasible, this would not inform us about the pre-study odds. Thus, it is unavoidable that one should make approximate assumptions on how many relationships are expected to be true among those probed across the relevant research fields and research designs. The wider field may yield some guidance for estimating this probability for the isolated research project. Experiences from biases detected in other neighboring fields would also be useful to draw upon. Even though these assumptions would be considerably subjective, they would still be very useful in interpreting research claims and putting them in context.

References

1. Ioannidis JP, Haidich AB, Lau J (2001) Any casualties in the clash of randomised and observational evidence? BMJ 322: 879-880.
2. Lawlor DA, Davey Smith G, Kundu D, Bruckdorfer KR, Ebrahim S (2004) Those confounded vitamins: What can we learn from the differences between observational versus randomised trial evidence? Lancet 363: 1724-1727.
3. Vandenbroucke JP (2004) When are observational studies as credible as randomised trials? Lancet 363: 1728-1731.
4. Michiels S, Koscielny S, Hill C (2005) Prediction of cancer outcome with microarrays: A multiple random validation strategy. Lancet 365: 488-492.
5. Ioannidis JPA, Ntzani EE, Trikalinos TA, Contopoulos-Ioannidis DG (2001) Replication validity of genetic association studies. Nat Genet 29: 306-309.
6. Colhoun HM, McKeigue PM, Davey Smith G (2003) Problems of reporting genetic associations with complex outcomes. Lancet 361: 865-872.
7. Ioannidis JP (2003) Genetic associations: False or true? Trends Mol Med 9: 135-138.
8. Ioannidis JPA (2005) Microarrays and molecular research: Noise discovery? Lancet 365: 454-455.
9. Sterne JA, Davey Smith G (2001) Sifting the evidence-What's wrong with significance tests. BMJ 322: 226-231.
10. Wacholder S, Chanock S, Garcia-Closas M, El ghormli L, Rothman N (2004) Assessing the probability that a positive report is false: An approach for molecular epidemiology studies. J Natl Cancer Inst 96: 434-442.
11. Risch NJ (2000) Searching for genetic determinants in the new millennium. Nature 405: 847-856.
12. Kelsey JL, Whittemore AS, Evans AS, Thompson WD (1996) Methods in observational epidemiology, 2nd ed. New York: Oxford U Press. 432 p.
13. Topol EJ (2004) Failing the public health-Rofecoxib, Merck, and the FDA. N Engl J Med 351: 1707-1709.
14. Yusuf S, Collins R, Peto R (1984) Why do we need some large, simple randomized trials? Stat Med 3: 409-422.
15. Altman DG, Royston P (2000) What do we mean by validating a prognostic model? Stat Med 19: 453-473.
16. Taubes G (1995) Epidemiology faces its limits. Science 269: 164-169.
17. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, et al. (1999) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286: 531-537.
18. Moher D, Schulz KF, Altman DG (2001) The CONSORT statement: Revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet 357: 1191-1194.
19. Ioannidis JP, Evans SJ, Gotzsche PC, O'Neill RT, Altman DG, et al. (2004) Better reporting of harms in randomized trials: An extension of the CONSORT statement. Ann Intern Med 141: 781-788.
20. International Conference on Harmonisation E9 Expert Working Group (1999) ICH Harmonised Tripartite Guideline. Statistical principles for clinical trials. Stat Med 18: 1905-1942.
21. Moher D, Cook DJ, Eastwood S, Olkin I, Rennie D, et al. (1999) Improving the quality of reports of meta-analyses of randomised controlled trials: The QUOROM statement. Quality of Reporting of Meta-analyses. Lancet 354: 1896-1900.
22. Stroup DF, Berlin JA, Morton SC, Olkin I, Williamson GD, et al. (2000) Meta-analysis of observational studies in epidemiology: A proposal for reporting. Meta-analysis of Observational Studies in Epidemiology (MOOSE) group. JAMA 283: 2008-2012.
23. Marshall M, Lockwood A, Bradley C, Adams C, Joy C, et al. (2000) Unpublished rating scales: A major source of bias in randomised controlled trials of treatments for schizophrenia. Br J Psychiatry 176: 249-252.
24. Altman DG, Goodman SN (1994) Transfer of technology from statistical journals to the biomedical literature. Past trends and future predictions. JAMA 272: 129-132.
25. Chan AW, Hrobjartsson A, Haahr MT, Gotzsche PC, Altman DG (2004) Empirical evidence for selective reporting of outcomes in randomized trials: Comparison of protocols to published articles. JAMA 291: 2457-2465.
26. Krimsky S, Rothenberg LS, Stott P, Kyle G (1998) Scientific journals and their authors' financial interests: A pilot study. Psychother Psychosom 67: 194-201.
27. Papanikolaou GN, Baltogianni MS, Contopoulos-Ioannidis DG, Haidich AB, Giannakakis IA, et al. (2001) Reporting of conflicts of interest in guidelines of preventive and therapeutic interventions. BMC Med Res Methodol 1: 3.
28. Antman EM, Lau J, Kupelnick B, Mosteller F, Chalmers TC (1992) A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts. Treatments for myocardial infarction. JAMA 268: 240-248.
29. Ioannidis JP, Trikalinos TA (2005) Early extreme contradictory estimates may appear in published research: The Proteus phenomenon in molecular genetics research and randomized trials. J Clin Epidemiol 58: 543-549.
30. Ntzani EE, Ioannidis JP (2003) Predictive ability of DNA microarrays for cancer outcomes and correlates: An empirical assessment. Lancet 362: 1439-1444.
31. Ransohoff DF (2004) Rules of evidence for cancer molecular-marker discovery and validation. Nat Rev Cancer 4: 309-314.
32. Lindley DV (1957) A statistical paradox. Biometrika 44: 187-192.
33. Bartlett MS (1957) A comment on D.V. Lindley's statistical paradox. Biometrika 44: 533-534.
34. Senn SJ (2001) Two cheers for P-values. J Epidemiol Biostat 6: 193-204.
35. De Angelis C, Drazen JM, Frizelle FA, Haug C, Hoey J, et al. (2004) Clinical trial registration: A statement from the International Committee of Medical Journal Editors. N Engl J Med 351: 1250-1251.
36. Ioannidis JPA (2005) Contradicted and initially stronger effects in highly cited clinical research. JAMA 294: 218-228.
37. Hsueh HM, Chen JJ, Kodell RL (2003) Comparison of methods for estimating the number of true null hypotheses in multiplicity testing. J Biopharm Stat 13: 675-689.