Professional Documents
Culture Documents
n the light of continuing debate over the applications of statistical methods only and is not meant as an assessment
significance testing in psychology journals and follow- of research methods in general. Psychology is a broad
ing the publication of Cohen's (1994) article, the Board science. Methods appropriate in one area may be inappro-
of Scientific Affairs (BSA) of the American Psychological priate in another.
Association (APA) convened a committee called the Task The title and format of this report are adapted from a
Force on Statistical Inference (TFSI) whose charge was "to similar article by Bailar and Mosteller (1988). That article
elucidate some of the controversial issues surrounding ap- should be consulted, because it overlaps somewhat with
plications of statistics including significance testing and its this one and discusses some issues relevant to research in
alternatives; alternative underlying models and data trans- psychology. Further detail can also be found in the publi-
formation; and newer methods made possible by powerful cations on this topic by several committee members (Abel-
computers" (BSA, personal communication, February 28, son, 1995, 1997; Rosenthal, 1994; Thompson, 1996;
1996). Robert Rosenthal, Robert Abelson, and Jacob Co- Wainer, in press; see also articles in Harlow, Mulaik, &
hen (cochairs) met initially and agreed on the desirability of Steiger, 1997).
having several types of specialists on the task force: stat-
isticians, teachers of statistics, journal editors, authors of Method
statistics books, computer experts, and wise elders. Nine Design
individuals were subsequently invited to join and all agreed.
These were Leona Aiken, Mark Appelbaum, Gwyneth Boo- Make clear at the outset what type of study you are doing.
doo, David A. Kenny, Helena Kraemer, Donald Rubin, Bruce Do not cloak a study in one guise to try to give it the
Thompson, Howard Wainer, and Leland Wilkinson. In addi- assumed reputation of another.For studies that have mul-
tion, Lee Cronbach, Paul Meehl, Frederick Mosteller and John tiple goals, be sure to define and prioritize those goals.
Tukey served as Senior Advisors to the Task Force and There are many forms of empirical studies in psychol-
commented on written materials. ogy, including case reports, controlled experiments, quasi-
The TFSI met twice in two years and corresponded experiments, statistical simulations, surveys, observational
throughout that period. After the first meeting, the task studies, and studies of studies (meta-analyses). Some are
force circulated a preliminary report indicating its intention hypothesis generating: They explore data to form or sharpen
to examine issues beyond null hypothesis significance test- hypotheses about a population for assessing future hypothe-
ing. The task force invited comments and used this feed- ses. Some are hypothesis testing: They assess specific a priori
back in the deliberations during its second meeting. hypotheses or estimate parameters by random sampling from
After the second meeting, the task force recommended that population. Some are meta-analytic: They assess specific
several possibilities for further action, chief of which a priori hypotheses or estimate parameters (or both) by syn-
would be to revise the statistical sections of the American thesizing the results of available studies.
Psychological Association Publication Manual (APA, Some researchers have the impression or have been
1994). After extensive discussion, the BSA recommended taught to believe that some of these forms yield information
that "before the TFSI undertook a revision of the APA that is more valuable or credible than others (see Cronbach,
Publication Manual, it might want to consider publishing 1975, for a discussion). Occasionally proponents of some
an article in American Psychologist, as a way to initiate research methods disparage others. In fact, each form of
discussion in the field about changes in current practices of research has its own strengths, weaknesses, and standards
data analysis and reporting" (BSA, personal communica- of practice.
tion, November 17, 1997).
This report follows that request. The sections in italics Jacob Cohen died on January 20, 1998. Without his initiative and gentle
are proposed guidelines that the TFSI recommends could persistence, this report most likely would not have appeared. Grant Blank
be used for revising the APA publication manual or for provided Kahn and Udry's (1986) reference. Gerard Dallal and Paul
developing other BSA supporting materials. Following Velleman offered helpful comments.
Correspondence concerning this report should be sent to the Task
each guideline are comments, explanations, or elaborations Force on Statistical Inference, c/o Sangeeta Panicker, APA Science Di-
assembled by Leland Wilkinson for the task force and rectorate, 750 First Street, NE, Washington, DC 20002-4242. Electronic
under its review. This report is concerned with the use of mail may be sent to spanicker@apa.org.
Y Y
700- 700-
PhD
'U
"X600- 9 600-
No PhD
500- 500-
Y
400- 400-
3.5 4.0 4.5 5.0
3.5 4.0 4.5 5.0 3.5 4.0 4.5 5.0
GPA GPA
Note. GRE = Graduate Record Examination; GPA = grade point average; PhD and No PhD = completed and did not complete the doctoral degree; Y = yes;
N = no.
and not just asserted on the basis of imprecise labeling Record Examination (GRE), the undergraduate grade point
conventions of the software. average (GPA), and whether a student completed a doctoral
Tables and figures. Although tables are com- degree in the department (PhD). Figure 2A shows a format
monly used to show exact values, well-drawn figures need appearing frequently in psychology journal articles: two
not sacrificeprecision. Figuresattract the reader'seye and regression lines, one for each group of students. This
help convey global results. Because individuals have dif- graphic conveys nothing more than four numbers: the
ferent preferences for processing complex information, it slopes and intercepts of the regression lines. Because the
often helps to provide both tables and figures. This works scales have no physical meaning, seeing the slopes of lines
best when figures are kept small enough to allow space for (as opposed to reading the numbers) adds nothing to our
both formats.Avoid complexfigures when simpler ones will understanding of the relationship.
do. In all figures, include graphical representations of Figure 2B shows a scatter plot of the same data with
interval estimates whenever possible. a locally weighted scatter plot smoother for each PhD
Bailar and Mosteller (1988) offer helpful information group (Cleveland & Devlin, 1988). This robust curvilinear
on improving tables in publications. Many of their recom- regression smoother (called LOESS) is available in modem
mendations (e.g., sorting rows and columns by marginal statistics packages. Now we can see some curvature in the
averages, rounding to a few significant digits, avoiding relationships. (When a model that includes a linear and
decimals when possible) are based on the clearly written quadratic term for GPA is computed, the apparent interac-
tutorials of Ehrenberg (1975, 1981). tion involving the PhD and no PhD groups depicted in
A common deficiency of graphics in psychological Figure 2A disappears.) The graphic in Figure 2B tells us
publications is their lack of essential information. In most many things. We note the unusual student with a GPA of
cases, this information is the shape or distribution of the less than 4.0 and a psychology GRE score of 800; we note
data. Whether from a negative motivation to conceal irreg- the less surprising student with a similar GPA but a low
ularities or from a positive belief that less is more, omitting GRE score (both of whom failed to earn doctoral degrees);
shape information from graphics often hinders scientific we note the several students who had among the lowest
evaluation. Chambers et al. (1983) and Cleveland (1995) GRE scores but earned doctorates, and so on. We might
offer specific ways to address these problems. The exam- imagine these kinds of cases in Figure 2A (as we should in
ples in Figure 2 do this using two of the most frequent any data set containing error), but their location and distri-
graphical forms in psychology publications. bution in Figure 2B tells us something about this specific
Figure 2 shows plots based on data from 80 graduate data set.
students in a Midwestern university psychology depart- Figure 3A shows another popular format for display-
ment, collected from 1969 through 1978. The variables are ing data in psychology journals. It is based on the data set
scores on the psychology advanced test of the Graduate used for Figure 2. Authors frequently use this format to
601
August
August 1999 American Psychologist
1999 ** American Psychologist 601
Figure 3
Graphics for Groups
A B
BO-
800-
700-
w
r 600- 0-
c,
500-
400-
N Y N Y
PhD PhD
Note. GRE = Graduate Record Exoaminotion; N = no; Y = yes.
display the results of t tests or ANOVAs. For factorial effects to those in previous studies) suggest the results are
ANOVAs, this format gives authors an opportunity to generalizable?Are the design and analytic methods robust
represent interactions by using a legend with separate sym- enough to support strong conclusions?
bols for each line. In more laboratory-oriented psychology Novice researchers err either by overgeneralizing their
journals (e.g., animal behavior, neuroscience), authors results or, equally unfortunately, by overparticularizing.
sometimes add error bars to the dots representing the Explicitly compare the effects detected in your inquiry with
means. the effect sizes reported in related previous studies. Do not
Figure 3B adds to the line graphic a dot plot repre- be afraid to extend your interpretations to a general class or
senting the data and 95% confidence intervals on the means population if you have reasons to assume that your results
of the two groups (using the t distribution). The graphic apply. This general class may consist of populations you
reveals a left skewness of GRE scores in the PhD group. have studied at your site, other populations at other sites, or
Although this skewness may not be severe enough to affect even more general populations. Providing these reasons in
our statistical conclusions, it is nevertheless noteworthy. It your discussion will help you stimulate future research for
may be due to ceiling effects (although note the 800 score yourself and others.
in the no PhD group) or to some other factor. At the least,
the reader has a right to be able to evaluate this kind of Conclusions
information. Speculation may be appropriate,but use it sparingly and
There are other ways to include data or distributions in explicitly. Note the shortcomings ofyour study. Remember,
graphics, including box plots and stem-and-leaf plots however, that acknowledging limitationsis for the purpose
(Tukey, 1977) and kernel density estimates (Scott, 1992; of qualifying results and avoiding pitfalls in future re-
Silverman, 1986). Many of these procedures are found in search. Confession should not have the goal of disarming
modem statistical packages. It is time for authors to take criticism. Recommendationsfor future research should be
advantage of them and for editors and reviewers to urge thoughtful and grounded in present and previous findings.
authors to do so. Gratuitous suggestions ('"further research needs to be
Discussion done . . ") waste space. Do not interpret a single study's
results as having importance independent of the effects
Interpretation reported elsewhere in the relevant literature. The thinking
When you interpret effects, think of credibility, generaliz- presented in a single study may turn the movement of the
ability, and robustness. Are the effects credible, given the literature, but the results in a single study are important
results of previous studies and theory? Do the features of primarily as one contribution to a mosaic of study effects.
the design and analysis (e.g., sample quality, similarity of Some had hoped that this task force would vote to
the design to designs of previous studies, similarity of the recommend an outright ban on the use of significance tests
603
August 1999 *'American Psychologist
.American Psychologist 603
Scott, D. W. (1992). Multivariate density estimation: Theory, practice, Thompson, B., & Snyder, P. A. (1998). Statistical significance and reli-
and visualization. New York: Wiley. ability analyses in recent JCD research articles. Journal of Counseling
Silverman, B. W. (1986). Density estimation for statistics and data and Development, 76, 436-441.
analysis. New York: Chapman & Hall. Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-
Snyder, P., & Lawson, S. (1993). Evaluating results using corrected and Wesley.
uncorrected effect size estimates. Journal of Experimental Education, Wainer, H. (in press). One cheer for null hypothesis significance testing.
61, 334-349. Psychological Methods.
Thompson, B. (1996). AERA editorial policies regarding statistical sig- Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. (1966).
nificance testing: Three suggested reforms. Educational Researcher, Unobtrusive measures: Nonreactive research in the social sciences.
25(2), 26-30. Chicago: Rand McNally.