
GUEST EDITORIAL

Ten rules for reading clinical research reports


John M. Yancey, PhD
Louisville, Ky.

From the Department of Diagnosis and General Dentistry, School of Dentistry, University of Louisville, Louisville, Kentucky.
Reprinted with permission from Am J Surg 1990;159:533-9.
American Journal of Orthodontics and Dentofacial Orthopedics, May 1996, Vol. 109, No. 5.

The Editor, along with most surgical scholars, has been concerned about excesses in statistical support of submitted and published surgical articles; problems range from a total absence of obviously needed statistical assessment to pointlessly overdone analysis of small numbers of experimental observations. Somewhere in between is the use of a blatantly wrong statistical test for analysis of data. Dr. Yancey is a frequent consultant to the Journal and has prepared this work at the Editor's request.

Prior to the recruitment of a statistical referee, The British Journal of Surgery commissioned an audit of the statistical standard of 28 research papers published by that journal in the January to March 1987 issues. The results of that audit were published in the July 1988 issue of The British Journal of Surgery.1 The audit identified only 39% of the articles as containing statistical errors so serious as to have warranted rejection had the evaluation been made prior to publication. This rate of serious error is below that reported by many other audits of the clinical literature.2-12

THE PROBLEM

One of the more comprehensive reviews of the methodologic quality of the medical literature was published by New England Journal of Medicine Books in 1986.5 Those authors reviewed the reviews of more than 4,200 research reports published in prestigious medical journals. Among their findings were the following: "The 16 assessment articles published before 1970 described a median of 23% of 2,063 research reports as meeting the assessor's criteria for validity; the 12 published during or after 1970 reported that a median of 6% of 2,172 publications met the assessors' criteria."5

It should be noted that the criteria for validity used were not esoteric. Most of the reviews were asking only if the authors' conclusions were valid extrapolations from the data presented. Included among the many shocking conclusions in that review of the reviews was the following: "The findings reported in the 28 assessment articles suggest that the average practitioner will find relatively few journal articles that are scientifically sound in terms of reporting usable data and providing even moderately strong support for their inferences. . . . The mere fact that research reports are published, even in the most prestigious journals, is no guarantee of their quality."5

When the Editor-in-Chief asked me to conduct an audit of the research papers published by The American Journal of Surgery in 1987 and 1988, I envisioned an article similar in format to the article published by The British Journal of Surgery.1 After a cursory review of those publications, however, it was clear that my article would be quite similar to that by Dr. Murray. The proportion, severity, and types of errors detected by my preliminary audit were consistent with those reported by The British Journal of Surgery audit.

Although a data analysis error rate substantially below that reported by similar audits of the clinical literature might be considered good news by some, the bad news is that I found many 1987-1988 Journal articles with methodologic errors so serious as to render invalid the conclusions of the authors. So, rather than devoting my time to a precise quantification of the errors made by recent contributors to the Journal, I decided to consider some possible solutions to this widespread defect in the clinical research literature.

THE SOLUTION?

Numerous books and articles have already been published with a view toward quantifying and correcting this serious problem.13 There is little indication the problem is going away; rather, the data suggest the problem is growing worse.

Perhaps the time has come for Journal readers to deal with authors who misinterpret their own data the way they deal with patients who misinterpret their own symptoms: reinterpret the misinterpretations and then draw conclusions based upon one's own analysis of the hard data.

CAVEATS

Before I commence my list of caveat emptor rules for reading the data analysis section of clinical research articles, I feel a need to make several points. First, these are my personal opinions. Second, I found little evidence that any of the 1987-1988 Journal authors were intentionally trying to mislead their readers. Third, to say that the conclusions of an article are invalid is not to say they are wrong. A surgeon may decide whether to operate upon a particular patient by consulting a Ouija board and still come to a correct conclusion.

It is important to understand that the data analysis errors identified by just about every review of the medical


literature are not trivial misunderstandings important only to those of us who teach statistics to Ph.D. students and clinical residents. Most of the errors are of the magnitude of claiming to have scientific evidence that one surgical procedure is superior to another when no such valid scientific data are presented.

The fact that most authors sincerely believe that their statistics do provide scientific support for their conclusions does not render the problem any less serious. If a surgeon based a clinical decision upon the output from a Ouija board, because he sincerely thought it to be a scientific instrument, his conclusion would still be scientifically invalid, professionally incompetent, and ethically wrong, even if it did turn out to be clinically correct.

TEN RULES FOR THE JOURNAL READER

Since it is reasonably clear that all clinical researchers are about as likely to start analyzing their data according to the rigorous rules of science as it is that all advertisements in clinical journals are going to do likewise in presenting their wares, the typical consumer of both the advertisements and the articles might benefit from the following rules for the proper understanding of research data published in contemporary clinical journals.

Rule I: Be skeptical. As noted earlier, the odds are good that the authors have arrived at invalid conclusions. This does not mean their conclusions are wrong, but it does mean that their conclusions may not have the scientific validity attributed to them by their authors. Many 1987-1988 Journal articles were not exceptions to this rule.

Rule II: Look for the data. Once upon a time, clinical researchers said, "Here is a detailed description of what we did and here is what happened and this is what we think it all means; but since we have reported our data, you are free to draw your own conclusions." Contemporary journal authors are more likely to say, "Here is a description of what we did, here are our statistical analyses, and this is what all that mumbo jumbo means; trust us." But the readers' only consult with nature is via the data reported in the article. Everything else is a consult with authority. That is not science.

Most 1987-1988 Journal articles did a good job of presenting the data, but many gave the impression that the raw data had been cooked by the only legal recipe. Generally, there are several legitimate recipes for cooking any given data set. Some recipes, however, not only are illegitimate but they present the data in a form incapable of reconstitution to the state required for appropriate cooking. In many instances, both the authors and the consumers of 1987-1988 Journal reports would have been better served by following the rule for raw vegetables: cook only if otherwise indigestible.

In any event, authors should present their data in a form that allows consumers to apply their own favorite recipe. Otherwise, the reader is faced with what is, in effect, a proprietary data base. This is precisely the opposite of the scientific canon that all data must be public. When science stops being public, it stops being science.

Rule III: Differentiate between descriptive and inferential statistics. I believe the single most serious error made by clinical researchers is the failure to understand the fundamental difference between statistical indices that summarize the data collected by the study and those that attempt to answer questions about data that have not been collected, and likely never will be collected.

Statistics such as means and standard deviations describe the data in hand; hence, they are called descriptive statistics. Tools that generate p values allow one to make inferences about the population of data from which the present study is but one random sample; hence, they are called inferential statistics.

That is all simple enough if one is drawing random samples from existing populations, as in epidemiologic research, but the vast majority of clinical studies produce data that are not random samples from existing populations. Indeed, one often has data from patients treated as no other patients in the world have ever been treated. So, it is only natural that many 1987-1988 Journal authors thought their standard errors of the mean, p values, and regression lines, to name just a few items, were indexing something important about the data they had just collected. They were wrong!

If one sample mean is 10 and the other 5, the difference is precisely 5, whether the associated p value is >0.95 or <0.0001. The question of whether that difference is of any clinical note is a clinical judgment, regardless of the associated p values. Authors who use p values to support their conclusions about the magnitude or importance of differences in group data are making the most serious and the most common error to be found in the clinical literature.

Most p values mean the same thing. p < 0.05, for example, generally means that there are fewer than 5 chances in 100 of getting the statistic in question (or one even more extreme) if the null hypothesis being tested were precisely true in the population from which one has a random sample. Most authors clearly do not understand this simple fact. If they did, they would clearly state: (1) What the statistic in question indexes about the data; (2) The null hypothesis; (3) The population one is talking about; and (4) What all of this has to do with the basic point of the study. In a great many cases, the statistic does not index what the authors think it indexes, the null hypothesis is of no clinical note, the population is not the one referenced in the author's discussion, and the statistical analysis is irrelevant to the basic question of the study.

One way to keep this absolutely crucial distinction in mind is to ask oneself the following question: If the study data were based upon all such patients in the world, rather than a very small nonrandom sample of such patients, would the results be of any clinical note? If the

answer is no, then no statistical analysis will make the results of the study clinically important, even if all the p values are "highly significant." If the answer is yes, the study is of clinical importance, even if all the p values are "not significant."

Rule IV: Question the validity of all descriptive statistics. Clinicians should be interested in individuals, but the majority of articles in the 1987-1988 Journal used means and standard errors of the mean to index the data. It has been said that a fellow with one leg frozen in ice and the other leg in boiling water is comfortable, on average. Although it is clear that the mean temperature of the water covering this poor fellow's lower half is a dismal index of the degree of his discomfort, equally inappropriate uses of the arithmetic mean are to be found in some of the 1987-1988 Journal articles.

An example of this type of error is found in an October 1987 Journal article.14 In Figure 5, the concentration of ornithine decarboxylase in tumor sites is reported as 3.5 (nmol CO2/mg protein/hour) ±3 (SEM). A careful reader would interpret the carefully composed bar graph as follows: With 15 subjects, a standard error of the mean of 3 becomes a standard deviation of 11.6 (SD = SEM × √n); however, since the concentration cannot drop below 0, this must have been one of the most positively skewed distributions ever recorded; and in a severely skewed distribution, the mean is pulled in the direction of the skew; hence, almost all of the 15 subjects must have shown a concentration very close to 0, with 1 or 2 patients having large concentrations.

Therefore, 3.5 is a poor index of the concentration at the tumor site; hence, it is a poor index of how the tumor site compared with other sites. The standard error of the mean is an even poorer index of the variability in the data. The bar graph with its error bar does not resemble the actual distribution of data points. After interpreting this graph, it is clear that the difference between the tumor site and the other sites is not in terms of the mean; it is in terms of the number of tumor sites having concentrations very much larger than anything observed at the other sites. Hence, the subsequent Student's t tests had no sensible interpretation, because the t tests were about population means, which had no clinical meaning.

It is clear that had many of the 1987-1988 Journal authors looked carefully at their raw data, it would have become obvious that the use of the mean ± 1 standard error of the mean gave a most misleading picture of what happened; it almost always does. When that was true, subsequent statistical analyses seldom made any sense because the majority of inferential statistical tools used in the 1987-1988 Journal were about population means. Population means, however, seldom present for treatment. Even if one knew the precise value of many of the population means being estimated by Journal authors, that would be about as valuable to practicing physicians as knowing the mean heart rate of all their patients currently in the hospital.

The mere fact that so many 1987-1988 Journal authors erroneously believed the mean to be a good index of nonsymmetrical and/or highly variable data, and the standard error of the mean to be an accurate index of that variability, is sufficient reason for all Journal readers to attempt to transform all data back to its raw form. The standard error of the mean, in fact, is neither an index of variability in the sample data nor an estimate of variability in the population. And the larger the sample size, the less well it indexes what should be of chief concern to all clinicians: variability among patients.

Rule V: Question the validity of all inferential statistics. Almost every review of the medical literature has found that many researchers are wrong about what inferential statistics actually index.1-13 This was clearly in evidence in the 1987-1988 Journal articles. Many authors interpreted "p < 0.05" to mean that an important treatment effect had been demonstrated and "p > 0.05" to mean that no important treatment effect existed. In truth, neither the presence of statistical significance (p < 0.05) nor the absence of statistical significance (p > 0.05) has any reliable relationship either to the magnitude of the treatment effect or to the extent of clinical importance.

The October 1987 Journal article cited earlier to illustrate the inappropriate use of the mean and standard error of the mean also serves as an example of the problem with p values.14 Figure 5 in that article reports a single value of p < 0.375 for three Student's t tests used to compare ornithine decarboxylase concentrations at the tumor site and three other sites. As already noted, the t test is about population means and the mean is not a legitimate index of tumor site concentrations. Even if this were not the case, the Student's t test is not legitimate here because of the grossly unequal sample sizes, the grossly unequal standard deviations (which are not presented but which can be calculated), the grossly skewed distribution of tumor site concentrations, the fact that most of the data are paired rather than independent, and the fact that one mean is used in three different post hoc comparisons. Even if one ignored all of these problems and accepted this "not significant" p value as saying something important about these data, what is it saying?

It clearly is not saying that there is no difference between the sample means; the tumor site mean is five times as great as the concentration at the other three sites. This difference in means is greater than any of the other comparisons presented in the article, all of which were "significant." It clearly is not saying that this difference is of no clinical note; the authors state otherwise. It is, in fact, just saying that in the population from which this is a random sample, we do not know which mean is larger. p > 0.05 is a statement of ignorance, not a statement of no difference.

In Figures 2 through 4 in this same article,14 the same statistical procedure is employed to compare mean polyamine concentrations at the tumor site and the three other sites. Five different p values (p < 0.05 to

p < 0.0005) are used to index nine comparisons. Does this mean that the p < 0.0005 differences are larger or more important than the p < 0.05 differences? No. As previously noted, none of these differences is even close to the magnitude of the "not significant" difference in ornithine decarboxylase concentrations. So what do they mean?

If the correct statistical procedures had been employed, a p < 0.05 would mean that one may be about 95% confident that precise replications of this study would yield population means with the same directional difference in magnitude as was observed in the sample. A p < 0.0005 means that one may be about 99.95% confident of precisely the same thing. Neither p value gives the slightest hint of what a clinician would need to know: (1) What is the probable magnitude of each population mean? (2) What is the probable magnitude of the difference between these population means? (3) What is the probable distribution of individual values around these means? (4) What proportion of the population has a concentration in these two sites anywhere near that of the sample means? (5) What is the clinical importance of this difference? (6) Is this of the slightest value in treating individual patients?

Still, p values have become a firmly fixed fetish in the clinical literature, so you must learn to live with them. A good place to start is by reading Medical Uses of Statistics, especially Chapter 8.13,15 When statisticians talk to each other, however, they find the use of p values by medical researchers a subject for much raucous ridicule.16

Rule VI: Be wary of correlation and regression analyses. The great majority of clinical studies I have examined over the past 30 years involving correlation and/or regression analysis have been seriously flawed. The worst mistake is accepting the authors' implicit suggestion that an individual clinician may safely use the regression equation to make predictions for individual patients. Sometimes equations or graphs are given for 95% confidence intervals, but these usually are misinterpreted to be confidence intervals for individual predictions rather than correctly understood to be estimates of population means, which, more often than not, are of no use to clinicians.

Equations for making individual predictions are presented in any worthwhile statistics book,17 yet they are almost never included in clinical journal articles. I was pleasantly surprised to find a couple of instances in the 1987-1988 Journal where these correct equations were used. It also was good to find that most authors presented scattergraphs. It is a shame that some of those authors did not examine their own graphs well enough to note that their significant p values constituted a poor index of the ability to predict y from x, even in the sample data.

For example, an article in the December 1988 Journal used a regression equation, a correlation coefficient, a p value, and a scattergraph where only the scattergraph should have been used.18 In Figure 2, the authors say, "The regression line shows that the length of the preoperative period was correlated with the amount of blood required. The two variables were related with a high degree of predictability (r = 0.66, p < 0.001)."18 The authors also provide the regression equation (y = 1.1393 + 1.6376x), suggesting that in the future one might use this equation to predict the amount of blood required. But if one plugs 10 days into the equation, one gets the prediction of 17.5 units of blood. In the sample of 38 patients, the four patients with a 10-day preoperative period required 6, 9, 21, and 34 units of blood, respectively. The five patients with a 12-day delay required 2, 3, 15, 40, and 58 units, respectively, rather than the 20.8 units predicted by the formula.

It is clear from an examination of this scattergraph that there is not "a high degree of predictability," even in the sample data. But a regression line and a regression equation imply that one can use this equation to predict what will happen in the future. (We already know precisely how much blood each of the sample patients required.) The authors do not provide the data to calculate precisely the 95% confidence intervals for the regression, but a rough estimate yields the following limits for a value of 12 days: 15 and 24. This means that one may be 95% confident that in the population from which these 38 patients are a random sample, the mean number of units of blood required for all patients with a 12-day preoperative period is a single value somewhere between 15 and 24.

It is quite clear from the great variability in the sample data that even if we knew precisely the mean value for all the 12-day patients in the world, we would have very little notion of what the fate of any particular patient might be. Such individual prediction limits may be calculated. Using the data presented in the article, we may be 95% confident that a single 12-day patient will require somewhere between minus 6 and plus 45 units of blood. This information is as useless as it is ridiculous. Even if this sample size had been 38,000 (instead of only 38), a correlation coefficient of only 0.66 would not have produced narrow individual prediction limits, but the p value would have been <0.00000001!

Since this was not a prediction problem in the first place, how did the authors come to the "high degree of predictability" conclusion? It is likely they did so because of the highly significant p value. All that p value means is that in the population from which this is a true random sample, we may be 99.9% confident that the correlation coefficient between days and blood is greater than absolute zero. A more useless piece of clinical information is hard to imagine. And keep in mind that if the p value had been <0.00000001, we just would be even more confident of precisely the same useless fact!

The sadness of it all is that the simple scattergraph contains all the support the authors needed for their observation that in this sample of 38 patients there was a

general tendency for more blood to be required by those patients spending the most time in the preoperative period. The regression equation, the regression line, and the p value stimulated the authors to claim "a high degree of predictability" in a situation where that is neither true nor relevant to the point being made. If they had just presented the scattergraph, the authors might have thought to show where the seven duplicate data points are located, because only 31 of the 38 patients are identifiable in their Figure 2.

Rule VII: Identify the population sampled. In the "old days," clinicians intuitively understood that what was true of Dr. Brown's patients in London may not be true of Dr. Smith's patients in Biloxi, Mississippi. Today, many researchers seem to believe that statistics can solve that problem. They cannot. In almost all instances, the statistics used are about what would happen if the present study were to be repeated over and over. They say nothing about what would happen if even slight changes were to be made in either treatments or patient populations.

Such extrapolations surely can be made, but it is scientific thinking and clinical judgment that justify the leap, not the statistical analyses generally cited. Still, most of the 1987-1988 Journal articles strongly implied that their statistics certified the extrapolation of the study results to patient populations much different from those actually sampled by the study. I found few articles in which the authors stated very clearly that their inferential statistics simply suggested directional differences in parameters of populations that would exist only if the present study were to be replicated precisely, warts and all.

Rule VIII: Identify the type of study. Only truly randomized, tightly controlled, prospective studies provide even an opportunity for strong cause-and-effect statements. Often there are very good reasons why such a study cannot be done, but that has no effect upon the logic that says that in the absence of true random assignment and rigid control of the treatment groups, any differences in outcome might be a function of differences in the groups other than the treatment difference. Too few authors of nonrandomized and loosely controlled studies in the 1987-1988 Journal drew sufficient attention to the fact that no statistical manipulation can compensate for the absence of true random assignment of patients to treatments and strong assurance that the only differences between the treatment groups come from the treatment itself.

Nonrandomized, retrospective, and loosely controlled studies may be suggestive of true cause-and-effect relationships, but many replications of such studies (preferably in randomly selected samples of the target population) are required before alternative explanations of the apparent treatment effect may reasonably be ignored.

Rule IX: Look for indices of probable magnitude of treatment effects. The great majority of the inferential statistical tools employed by 1987-1988 Journal authors were estimates of directional differences in population means or population proportions. Few articles provided estimates of the probable magnitude of these differences. Many authors used their p values to index the magnitude of treatment effects. They were wrong. p values are affected by the magnitude of treatment effects in the sample data, but they are affected much more by sample size, variability, and the particular recipe used to cook the data, none of which has any relationship to either the magnitude or clinical importance of the treatment effects.

The use of confidence intervals for population parameters and prediction limits for individual observations actually would have changed fundamentally many of the conclusions that were based upon p values. The regression example cited earlier is but one of many cases.18 An example of where confidence limits do not reverse the conclusions reached via p values, but still add a great deal to the analysis, may be found in a review I wrote to accompany a recent research report in the Journal.19,20

Most statistics texts devote almost all of their attention to hypothesis testing, but even those that give equal treatment to parameter estimation17 make the distinction more difficult to understand than is the case. The simple facts are these: (1) The larger the sample size, the less the variability, and the more powerful the statistical tool, the more likely one is to find trivial differences to be "highly significant," and the more precisely confidence intervals and prediction limits will tell one what one needs to know. (2) The smaller the sample size, the greater the variability, and the less powerful the statistical tool, the more likely one is to find profoundly important differences to be "not significant," and the more clearly confidence intervals and prediction limits will tell one that one does not yet know what one needs to know.

Rule X: Draw your own conclusions. The primary difference between science and other methods of finding the truth is that science consults nature while other methods consult authority. Your only consult with nature in a research article is the section that presents the raw data. Everything else is some authority telling you what the truth of the matter is.

If you had good reason to believe that almost all journal authors already had applied correctly the rules of scientific data analysis to the process of drawing valid scientific conclusions from their data, you could safely skip to the conclusions section of each article. Indeed, there would be little need for the Journal to publish any material other than the authors' conclusions. That is not the case. Sadly, in many cases, you will need to undo much of the data analysis that has been done before you can draw your own conclusions; but that, more often than not, will be worth the effort.
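The kind of "undoing" recommended here can be sketched in a few lines. The short script below uses the figures quoted earlier from the October 1987 example (mean 3.5, SEM 3, n = 15); it simply recovers the standard deviation from the reported standard error and applies the skewness reasoning of Rule IV. It is an illustrative sketch of that back-calculation, not the original authors' analysis.

```python
import math

# Reported summary statistics from the example discussed above:
# mean 3.5 (nmol CO2/mg protein/hour), SEM 3, n = 15 subjects.
mean, sem, n = 3.5, 3.0, 15

# Undo the SEM: SD = SEM * sqrt(n), roughly 11.6 here.
sd = sem * math.sqrt(n)

# Coefficient of variation: the SD is more than three times the mean.
cv = sd / mean

print(f"SD recovered from SEM: {sd:.1f}")
print(f"Coefficient of variation: {cv:.1f}")

# A zero-bounded measurement whose SD dwarfs its mean cannot be even
# roughly symmetric: most observations must sit near 0, with one or two
# large values pulling the mean upward, so the mean (and any t test on
# population means) is a poor index of these data.
if sd > mean:
    print("Distribution must be strongly positively skewed; "
          "the reported mean is a misleading summary.")
```

The same two lines of arithmetic can be applied to any report that gives only a mean, a standard error of the mean, and a sample size.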

If you are unable to reconstruct even a rough estimate of the distribution of original data points, ask yourself if the report is still of any use to clinicians who remain responsible for treating patients one by one.

AN APOLOGY TO JOURNAL AUTHORS

I hereby sincerely apologize for having so harshly criticized the majority of 1987-1988 Journal authors, whose only error was performing perfectly usual and customary analyses of their research data and then coming to perfectly usual and customary conclusions. I am even more apologetic for the role those of us who provide biostatistical services to the biomedical community have played in creating the sad state of affairs discussed in this essay. We provided quick answers to your quick questions just because you didn't want a lecture on statistics. We served you as technicians just because that is all you wanted. Worst of all, we took no responsibility for your gross misunderstanding of our tools. Of course, we knew we would never be called to account because you never replicate your studies often enough to see whether our predictions are correct or not. Indeed, most of you never understood that our magic p values are just trivial predictions, rather than the rabbi's kosher stamp we let you believe them to be.

Those of us who have spent a large part of our lives working in industrial and military settings are more to blame than anyone. We know that when one is serious about predicting what will happen if a particular procedure is repeated over and over (because it will be repeated over and over), one never uses p values or regression equations or standard errors of the mean. We know quite well that one uses confidence intervals and prediction limits and estimates of probable magnitude of effect. We also know that the statistical tools most commonly used in the biomedical literature seldom do nearly as good a job of predicting highly variable outcomes as do more sophisticated integrations of mathematics, data, judgment, logic, experience, and guts.

You wanted simple magic, and you got simple magic. Ask your own statistical guru whether the criticisms in this essay are essentially correct. If the answer is yes, you should be helping your less fortunate colleagues come to the same state of understanding. If the answer is no, ask your guru why he or she never explained all of this to you in the first place.

THE REAL SOLUTION

Someday, the consumers of the clinical literature will be justified in reading the conclusions of a research article with the same confidence they now read individual patient reports written by respected colleagues. That day will arrive, however, only after at least three events occur: (1) widespread recognition of the prevalence and severity of data analysis errors in the contemporary literature; (2) widespread acceptance of more sensible data analysis procedures; and (3) pronouncements by editors of the best journals that manuscripts that claim a level of scientific validity not supported by the design employed and data analyses performed will not be published.

Already, there is massive documentation of the problem. The references cited in this paper constitute only an introduction to this dismal literature. Already, there are very sensible alternatives to the inappropriate statistical tools most often misused in the clinical literature. But history suggests that neither science nor medicine changes usual and customary practice just because a terrible problem exists and a clearly superior approach is available.21,22

Perhaps the Uniform Requirements for Manuscripts Submitted to Biomedical Journals23 will someday be as concerned with the scientific validity of the authors' conclusions as it is with the size of the margins and format of the reference section. Perhaps the Guidelines for Statistical Reporting in Articles for Medical Journals24 will someday be as strictly enforced as the rules for the appearance of the manuscript.

Perhaps the journals that have been courageous enough to commission audits of their own methodologic quality will be the first to change the real standards for usual and customary data analysis. Perhaps they will include in their "Information for Authors" section the absolute requirement that all statistical analyses of re-
search data must contribute t o - r a t h e r than substitute
But this thing has gotten out of hand! It is time for those
f o r - r i g o r o u s scientific thinking and independent clinical
of us in my business to admit that there are no quick
judgment.
answers to ycur quick questions and that good statistical
consultations require a great deal m o r e time than good
SUMMARY
medical consultations, because the former involve far
more abstract concepts and far fewer pre-programmed This was not a scientific assessment of the scientific
skills than the latter. quality of the papers published by The American Journal
It is time for you to admit that you do not understand of Surgery. It was an informal audit of the adequacy of the
the indices we have given you. For those of you who do data analysis in the clinical research reports appearing in
think you understand these tools, I suggest the following the 1987-1988 issues.
exercise. Take your last half-dozen research reports to As one who has devoted more than three decades to
your favorite statistical guru. Ask him/her to do a com- helping a great variety of people make sense of scientific
puter simulation of 10,000 replications of each study. data, I found the overall quality of data analysis in these
Look at the results. Ask yourself if that is what the papers to be above average for the medical literature; and
conclusions section of each article predicted. If the yet, I found many instances of errors so serious as to
564 Yancey AmericanJournalof Orthodonticsand DentofacialOrthopedics
May 1996

data, I found many errors so serious as to render invalid the conclusions of the authors. My 10 proposed rules for reading clinical research reports constitute only an interim solution to a very worrisome problem. The real solution must come from the producers of and the gatekeepers for the medical literature.

Interested readers should examine "An exploratory study of statistical assessment of papers published in the British Medical Journal," especially Tables 1 and 2. (Gardner MJ, Bond J. JAMA 1990;263:1355-7.)

REFERENCES

1. Murray GD. The task of a statistical referee. Br J Surg 1988;75:664-7.
2. Bailar JC. Science, statistics, and deception. Ann Intern Med 1986;104:259-60.
3. DerSimonian R, Charette LJ, McPeek B, Mosteller F. Reporting on methods in clinical trials. In: Bailar JC, Mosteller F, eds. Medical uses of statistics. Waltham, MA: New England Journal of Medicine Books, 1986:272-88.
4. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial: survey of 71 "negative" trials. In: Bailar JC, Mosteller F, eds. Medical uses of statistics. Waltham, MA: New England Journal of Medicine Books, 1986:289-304.
5. Goldschmidt PG, Colton T. The quality of medical literature: an analysis of validation assessments. In: Bailar JC, Mosteller F, eds. Medical uses of statistics. Waltham, MA: New England Journal of Medicine Books, 1986:370-91.
6. Gore SM, Jones IG, Rytter EC. Misuse of statistical methods: critical assessment of articles in BMJ from January to March 1976. Br Med J 1977;1:85-7.
7. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials: a survey of three medical journals. N Engl J Med 1987;317:426-32.
8. Reed JF, Slaichert W. Statistical proof in inconclusive "negative" trials. Arch Intern Med 1981;141:1307-10.
9. Schor S. Statistical proof in inconclusive "negative" trials: an editorial. Arch Intern Med 1981;141:1263-4.
10. Schor S, Karten I. Statistical evaluation of medical journal manuscripts. JAMA 1966;195:1123-8.
11. Sheehan TJ. The medical literature: let the reader beware. Arch Intern Med 1980;140:472-4.
12. Thorn MD, Pullman CC, Symons MJ, Eckel FM. Statistical and research quality of the medical and pharmacy literature. Am J Hosp Pharm 1985;42:1077-82.
13. Bailar JC, Mosteller F, eds. Medical uses of statistics. Waltham, MA: New England Journal of Medicine Books, 1986.
14. Dimery IW, Nishioka K, Grossie B Jr, et al. Polyamine metabolism in carcinoma of the oral cavity compared with adjacent and normal oral mucosa. Am J Surg 1987;154:429-33.
15. Ware JH, Mosteller F, Ingelfinger JA. P values. In: Bailar JC, Mosteller F, eds. Medical uses of statistics. Waltham, MA: New England Journal of Medicine Books, 1986:149-69.
16. Salsburg DS. The religion of statistics as practiced in medical journals. Am Statistician 1985;39:220-3.
17. Zar JH. Biostatistical analysis, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1984.
18. Grossman MD, McGreevy JM. Effect of delayed operation for bleeding esophageal varices on Child's class and indices of liver function. Am J Surg 1988;156:502-5.
19. Yancey JM. Editorial comment. Am J Surg 1989;158:433-4.
20. Garcia-Rodriguez JA, Puig-Lacalle J, Arnau C, Porta M, Vallvé C. Antibiotic prophylaxis with cefotaxime in gastroduodenal and biliary surgery. Am J Surg 1989;158:428-32.
21. Kupfersmid J. Improving what is published: a model in search of an editor. Am Psychologist 1988;43:635-42.
22. Kuhn TS. The structure of scientific revolutions, 2nd ed. Chicago: University of Chicago Press, 1970.
23. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. Ann Intern Med 1988;108:258-65.
24. Bailar JC, Mosteller F. Guidelines for statistical reporting in articles for medical journals. Ann Intern Med 1988;108:266-73.
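The 10,000-replication exercise proposed in the apology section can be sketched in a few lines of code. Everything in this sketch is hypothetical and chosen only for illustration: a two-arm study with 20 patients per group, a true treatment effect of half a standard deviation, and a simple two-sample z test (known unit variances) standing in for whatever analysis a real report would use.

```python
import math
import random
import statistics

random.seed(1)

TRUE_EFFECT = 0.5   # hypothetical true mean difference, in SD units
N = 20              # hypothetical patients per group
REPS = 10_000       # replications, as in the suggested exercise

def z_test_p(diff, n):
    """Two-sided p value for a two-sample z test with unit variances."""
    se = math.sqrt(2 / n)
    z = abs(diff) / se
    # Normal tail probability via the error function, doubled for two sides
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

p_values, estimates = [], []
for _ in range(REPS):
    treated = [random.gauss(TRUE_EFFECT, 1) for _ in range(N)]
    control = [random.gauss(0, 1) for _ in range(N)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    estimates.append(diff)
    p_values.append(z_test_p(diff, N))

power = sum(p < 0.05 for p in p_values) / REPS
# 95% confidence-interval half-width for any single replication
half_width = 1.96 * math.sqrt(2 / N)

print(f"replications with p < 0.05: {power:.0%}")
print(f"effect estimates ranged {min(estimates):+.2f} to {max(estimates):+.2f}")
print(f"a single study pins the effect down only to within ±{half_width:.2f}")
```

Under these assumed numbers, only roughly a third of the 10,000 replications reach p < 0.05, even though the simulated effect is real, and the single-study estimates scatter widely around the true value. The confidence-interval half-width, by contrast, states plainly how imprecise one small study is, which is exactly the editorial's case for confidence intervals, prediction limits, and magnitude estimates over p values.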
