
Journal of Abnormal and Social Psychology

1962, Vol. 65, No. 3, 145-153

THE STATISTICAL POWER OF ABNORMAL-SOCIAL
PSYCHOLOGICAL RESEARCH:
A REVIEW ¹
JACOB COHEN
New York University

Given an experimental effect in a population, how likely is the null hypothesis to be rejected? Equivalently, what is the power of the statistical test? What is the expectation that the (false) null hypothesis will be sustained and thus a Type II error committed?

It is a remarkable phenomenon that the research which is reported by psychological investigators rarely refers to this issue, and even more rarely actually investigates it. On the other hand, issues concerning Type I error or "significance," i.e., the validity of the rejection of the null hypothesis, are more or less conscientiously attended to. This marked asymmetry of sophistication and attention to these two types of error is mirrored, and largely determined, by the exposition of these issues in the statistics textbooks used in the graduate training of the investigators. These texts are characterized by an early explanation of Type I and Type II errors, followed by a neglect of the latter throughout the remainder of the text. Thus, every statistical test is described with careful attention to issues of significance, and typically no attention to power. (For a partial exception, see Walker & Lev, 1953.)

The problem of power is occasionally approached indirectly by concern with the sample size to be used in an investigation. Other things equal, power is a monotonic function of sample size, but decisions as to sample size are typically reached by recourse to local tradition, ready availability of data, unaided intuition, usually called "experience," and negotiation (the latter usually between doctoral candidate and sponsor, or author and editor) and rarely on the basis of a Type II error analysis, which can always be performed prior to the collection of data. These nonrational bases for setting sample size must often result in investigations being undertaken which have little chance of success despite the actual falsity of the null hypothesis, and probably less often in the use of a far larger sample than is necessary. Either of these circumstances is wasteful of research effort.

Stemming from these considerations, a program of investigation, computation, and reportage has been undertaken whose major aims are as follows:

1. To call these issues to the attention of investigators, consumers of research, and evaluators of research planned or completed (sponsors, agency panels, journal editors).
2. To provide tables and conventional standards which will facilitate the performance of power analyses for the most common statistical tests.
3. To conduct surveys of the psychological research literature to assess its current status with regard to power.

The present report is the first of the investigation, and seeks to achieve partially the first and third aim.² Specifically, it describes the results of a survey of the Journal of Abnormal and Social Psychology, 1960, 61, from the viewpoint of the power of the statistical tests employed. Less formally, it seeks to answer the question, "What kind of chance did these investigators have of rejecting false null hypotheses?"

METHOD

The statistical literature was searched for formulae and nomographs of power functions of the most

¹ This study was primarily supported by Grant M-5174(A) from the National Institute of Mental Health, United States Public Health Service, which support is gratefully acknowledged. I am also grateful to Catherine Henderson for her expert assistance in preparing the manuscript, and to the New York University Faculty Research Fund for supplementary assistance.
² A more detailed description of the statistical rationale, as well as the resulting power tables, is presently in preparation for separate publication.

commonly used statistical tests, from which tables were prepared relating power to the conditions of which it is a function: (a) size of effect (degree of departure in the population from the null hypothesis), (b) significance (Type I error) criterion, (c) sample size, and (d) choice of critical region (i.e., directional or nondirectional). The tests and the sources of the formulae or nomographs are given in Table 1.³

TABLE 1
STATISTICAL TESTS AND SOURCES OF FORMULAE AND NOMOGRAPHS OF POWER FUNCTIONS

Test of null hypothesis                                    Source
1. t test: difference between means                        Dixon and Massey, 1957, formula p. 253
2. Normal deviate test: difference between proportions     Dixon and Massey, 1957, p. 251 and Table A-28, p. 465
   (via arc sine transformation)
3. Normal deviate test for difference between              Dixon and Massey, 1957, p. 251 and Table A-30b, p. 468
   Pearson r's (via Fisher z transformation)
4. t test: Pearson r is zero                               Dixon and Massey, 1957, formula p. 253 (adapted)
5. Binomial and normal deviate test: proportion            Mosteller, Rourke, and Thomas, Table IV, pp. 369-388;
   equals .50 (sign test)                                  Walker and Lev, 1953, pp. 60-63, 67-69
6. F test in analysis of variance designs:                 Eisenhart, Hastay, and Wallis, 1947, pp. 256-259;
   k means are equal                                       Dixon and Massey, 1957, pp. 426-453 (nomographs)
7. χ² test: (a) k proportions are equal, or                Patnaik, 1949; Fix, Hodges, and Lehmann, 1959
   (b) kr proportions are independent
   (contingency test)

Standard Conditions

For the purpose of the survey, it was necessary to formulate a set of reasonable standard conditions on the basis of which the power of each test could be computed. Whether or not other significance criteria were indicated, the .05 Type I error level was used uniformly.⁴ Further, whether or not otherwise specified, the nondirectional version of the null hypothesis was used throughout. This means a two-sided test for normal, binomial, and t distributions, and the logically equivalent one-sided (high) value test for χ² and F distributions, as they are usually tabled and used in hypothesis testing. Although this criterion may lead to underestimating power in some cases, it avoids the more serious problem of inflated significance levels and the embarrassment of large effects in the nonpredicted direction.

Size of Effect

The most difficult problem for the psychological investigator in performing a power analysis on an experimental plan is the formulation of an answer to the question, "How large an effect (a difference, correlation coefficient, etc.) in the population do I expect actually exists, or want to be able to detect?" Only rarely in the abnormal-social area are theoretical models well enough specified to be of help in answering this question, yet the question must be answered. In the present study, this problem was further complicated by the need to answer it for diverse content areas, utilizing a large variety of dependent variables and many different types of statistical tests. A solution was needed for the present survey which would make possible a reasonable basis for integration across this diversity. Finally, in the hope of facilitating the performance of power analyses as a routine practice in research planning, a solution was sought which could serve, at least provisionally, as a standard set of conventional criteria in such analyses.

The solution which evolved took the following form:

1. For each type of statistical test where it was necessary, size of effect was expressed quantitatively in terms not dependent on the specific metric of the variable(s) involved, e.g., differences in means were expressed in units of standard deviation (the usual z scores), departures of true population from null hypothetical k category percentage distributions were formulated as constant proportions of 1/k, etc.
2. Three levels of size of effect to be detected were conceived: small, medium, and large.
3. Each of these levels was operationally defined for each type of statistical test, by assigning to them values of the relevant metric-free population parameter. These values are necessarily somewhat arbitrary, but were chosen so as to seem reasonable. The reader can render his own judgment as to their reasonableness from a study of Table 2 and the ensuing discussion, which set them forth, but whatever his judgment, he may at least be willing to accept them as conventional.

Discussion is necessary to amplify, exemplify, and perhaps justify the decisions summarized in Table 2 for each type of statistical test of the null hypothesis in turn:

1. t test for two means. Consider the medium level: it posits the existence of a difference between population means amounting to one-half of the population sigma. In more generally familiar terms, this would be exemplified by a research plan that sought to detect a difference of 8 points between the mean IQs of two populations. Similarly, small and large IQ mean differences would amount to 4 and

³ See Footnote 2.
⁴ With few exceptions, the articles provided no evidence of a significance level being set prior to data collection, either because it was not deemed worth mentioning, or none had indeed been set. In any case, and rightly or wrongly, the .05 level has trickled down from agronomy as a conventional standard, and is usually understood to be the significance criterion if no other is mentioned.
TABLE 2
VALUES OF POPULATION PARAMETERS WHICH DEFINE THE LEVELS OF SIZE OF EFFECT FOR THE VARIOUS STATISTICAL TESTS

Test                                     Population parameter             Small    Medium   Large
1. t (two means are equal)               |M1 - M2| / σ                    .25      .50      1.00
2. Normal (two proportions are equal)    |P1 - P2|                        .10      .20      .30
3. Normal (two r's are equal)            |r1 - r2|                        .10      .20      .30
4. t (r = 0)                             r                                .20      .40      .60
5. Sign test                             |P - .50|                        .10      .20      .30
6. F (k means are equal)                 f = σ_M / σ                      .125     .25      .50
7a. χ² (k proportions are equal)         Ratio: largest P / smallest P    3:2      2:1      4:1
7b. χ² (contingency test)                I = Σ (P1i - P0i)² / P0i,        Varies with table size, but uses criteria
                                         summed over i = 1, ..., k        equivalent, for equal degrees of freedom,
                                                                          to 7a (see text).
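Rows 2 and 3 of Table 2 define effects as raw differences between proportions and between correlations, and the discussion of those tests notes that such a difference is not equally detectable at every level. The following is a minimal sketch of that point, not the author's own tables: it assumes the usual normal approximation on the arc sine and Fisher z scales named in Table 1, with the two-sided .05 criterion. The function names are hypothetical.

```python
from math import asin, atanh, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # two-sided .05 criterion (about 1.96)


def power_from_noncentrality(delta: float) -> float:
    """Two-sided power of a normal deviate test whose statistic has mean delta."""
    return STD_NORMAL.cdf(delta - Z_CRIT) + STD_NORMAL.cdf(-delta - Z_CRIT)


def power_two_proportions(p1: float, p2: float, n: int) -> float:
    """Approximate power for two independent samples of n cases each,
    using the arc sine transformation 2*asin(sqrt(p)), whose sampling
    variance is about 1/n per sample."""
    h = abs(2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2)))
    return power_from_noncentrality(h * sqrt(n / 2))


def power_two_correlations(r1: float, r2: float, n: int) -> float:
    """Approximate power for two independent samples of n cases each,
    using Fisher's z = atanh(r), whose sampling variance is about 1/(n - 3)."""
    q = abs(atanh(r1) - atanh(r2))
    return power_from_noncentrality(q / sqrt(2 / (n - 3)))


if __name__ == "__main__":
    for p1, p2 in [(0.40, 0.60), (0.70, 0.90), (0.10, 0.30)]:
        print(f"P = {p1:.2f} vs {p2:.2f}: power = {power_two_proportions(p1, p2, 50):.2f}")
    for r1, r2 in [(0.10, 0.30), (0.70, 0.90)]:
        print(f"r = {r1:.2f} vs {r2:.2f}: power = {power_two_correlations(r1, r2, 50):.2f}")
```

With 50 cases per sample, this approximation should agree to two decimals with the power values quoted in the text for tests 2 and 3 (.52 and .73 for proportions, .17 and .83 for correlations), which suggests the same kind of transformation underlies the survey's tables.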

16 points, respectively. These values seem reasonable. For example, an 8-point mean IQ difference is large enough to be noticeable; this is the order of magnitude of the difference between people in professional and managerial occupations and also between clerical and semiskilled workers (Super, 1949, p. 98). Differences half this size (small) would not be readily perceptible, e.g., the mean IQ difference between twins and nontwins (Husen, 1959); differences twice this size (large) would be so obvious as to virtually render a statistical test superfluous, e.g., the mean IQ difference between college graduates and those with only a 50-50 chance of passing in an academic high school curriculum (Cronbach, 1960, p. 174).

2. Normal test for two proportions. The detectability of a population difference in proportions of any given magnitude is not constant for any given research plan, but increases as the average of the two proportions departs in either direction from .50. Thus, for example, with two samples of 50 cases each the power under our standard conditions to detect a difference between population proportions of .40 and .60 is .52, while for .70 and .90 (or .10 and .30) it is .73. In the survey, the level of average population proportion at which the power of the test was computed was the average of the sample proportions found. Although this procedure was tedious, the alternative was to use as the parameter the difference between the arc sine transformations of the population proportions, for which power is invariant over levels, but general unfamiliarity and awkwardness in thinking about differences between proportions in these terms led to its rejection.

Similar reasoning as detailed for t tests for means led to the selection of the values for the difference in population proportions to define small, medium, and large effects (Table 2). It was felt that a population incidence difference of .20 (medium) would be a fairly noticeable phenomenon, and the other levels were defined symmetrically about it.

3. Normal test for two r's. An analogous problem to that of two proportions exists here: a given population difference between two Pearson correlation coefficients is of varying detectability as a function of their level, even more so than in the case of proportions, e.g., with two samples of 50 cases each, the power under our standard conditions to detect a difference between population r's of .10 and .30 is .17, while for .70 and .90 it is .83. An exactly parallel solution was used, i.e., the sample values were used to approximate the level of population correlation of the test. Again, the problem was avoidable by using the difference in Fisher z transformation values to define size of effect (these are invariant for level of population r's), but again considerations of awkwardness and unfamiliarity led to the rejection of this alternative.

The argument of perceptibility for the definition of a correlation difference of .20 as medium (Table 2) is not uniformly convincing. At high and possibly at moderate levels of correlation, such a population difference would be noticeable, but not, say, when the population r's were .10 and .30. This difficulty, too, could have been avoided by definition via differences in Fisher z transformation values. In any case, this decision could not affect the results of the survey, since only a few minor instances of this statistical test were encountered. Small and large effects were again symmetrically defined as differences of .10 and .30.

4. t test of r = 0. There were no technical complications here, but the choice of "reasonable" values defining the levels of size of effect proved troublesome. Initially, for the sake of comparability, it was planned to use the values of the r which are implied by those selected for the t test between means, since any difference between (standardized) means can be expressed as a (point biserial) correlation coefficient, or vice versa. This led (on the assumption of populations of equal size⁵) to values of .125, .25, and .50 for the respective levels of size of effect. On the generally untenable further assumption that the populations result from the dichotomization of a normally distributed variable, the resulting biserial r values are somewhat larger, .16, .31, and .63. Thus, from this point of view we have as candidates for the definition of a medium effect, coefficients of .25 or (more questionably) .31.

On the other hand, conventional verbal descriptions would consider a "small" correlation, one between .20 and .40, a "moderate" one between .40 and .70, and a "high" one between .70 and .90 (Guilford, 1956, p. 145). Guilford points out that these verbal terms may be misleading, and points out that "the validity coefficient for a single test may be expected in the range from .00 to .60, with most of them in the lower half of that range" (p. 146).

Thus, the medium correlation defined for compatibility with the criterion for a t test between means would be about .25-.30, in conventional abstract terms about .50-.60, and in specific application as test validity coefficients, perhaps about .30-.40. A compromise among these considerations was struck: a medium effect size was defined as .40, with small and large effects, respectively, as .20 and .60. These are smaller than would be dictated by the abstract conventions, but rather more generous (i.e., give higher power estimates) than the criteria of the other statistical tests, and are reasonably in keeping with at least one common application of correlation, validity coefficients.

5. Sign test. This is more generally a test of the hypothesis that the population proportion having a given characteristic equals .5, and is accomplished by reference to the binomial distribution for small samples and the normal distribution for n > 25, where it gives an adequate approximation to the binomial. The same criteria were used for levels of size of effect here as were used for the hypothesis that two population proportions differ, i.e., .10, .20, and .30 for small, medium, and large effects, respectively (Table 2), and on the basis of the same considerations.

6. F test for k means. The population parameter f (Table 2) used to define degree of departure from the null hypothesis was the standard deviation of the k standardized population means, i.e., of the means expressed in units of the common population sigma, or as z scores. For the t test for two means, the parameter was the absolute difference between the two means so expressed; here, for k means, it is their standard deviation which measures their departure from each other, and therefore from the null hypothesis which holds them to be equal.

Since t is merely a special case of F (i.e., its square root when the numerator has 1 df) it was possible to define the levels of the parameter f to make them consistent with those of t. Expressed in terms of f, the t criteria are, respectively, .125, .25, and .50 (Table 2). Taking the medium level, .25, we illustrate for varying numbers of samples, the population means implied. For this illustration the means are equally spaced⁶ and are expressed in both standard terms and IQ units (M 100, σ 16):

k = 2              k = 3              k = 4
+.25   104         +.306  104.9       +.335  105.4
-.25    96         0      100.0       +.112  101.8
                   -.306   95.1       -.112   98.2
                                      -.335   94.6

Note that for two samples, the difference between means is .50, as defined for the medium level of the t test for means. Note also that the standard deviation of each column of standardized means is .25, and of IQ means (.25)(16) = 4.

The illustration serves as a guide as to the size of the disparities between means defined as a medium effect. Small effects are arrived at by halving the gaps between means, large effects by doubling them.

7a. χ² test that k proportions are equal. This test created the greatest problem in the selection of a parameter to define degree of departure from the null hypothesis. A plan to follow the same procedure as for k means, namely, a fixed standard deviation of the population proportions, was frustrated by the fact that proportions are bounded by zero and one, so that as the number of samples increases, with sigma fixed and 1/k decreasing toward zero, negative values are called for. After further exploration with other approaches, the problem was finally solved by choosing as a parameter of size of effect the ratio of the largest population proportion to the smallest, with equal spacing of the k proportions. So specified, this leads in turn (for any given value of k) to the standard departure function used with the noncentral χ² distribution (Patnaik, 1949, and formula for I given in 7b, Table 2).

Once the parametric function was chosen, specific values were then selected to define the levels (Table 2). Focusing again on the critical medium level, a ratio of 2:1 leads to the following specification of population departures from the null hypothesis of equiproportionality for some illustrative values of k:

k = 2     k = 3     k = 4     k = 5
.667      .444      .333      .267
.333      .333      .278      .233
          .222      .222      .200
                    .167      .167
                              .133

It should be noted that by this criterion as k increases, other things equal, the departure function I and hence the power decreases. Note, too, that for k = 2, this criterion is not quite the same as for

⁵ Assumptions of inequality lead to smaller values.
⁶ When there are more than two means, specifying f does not fix the standard mean values. To concretize the exposition, the further specification of their distribution is necessary, and equal spacing is chosen because it leads to maximum separation of the extreme means as well as for its intuitive simplicity. The power computed, however, is independent of the spacing, but is simply a function of f (Dixon & Massey, 1957, p. 257).
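The two illustrative tables for tests 6 and 7a follow mechanically from the equal-spacing convention described in Footnote 6 and in the text on the 2:1 ratio. The sketch below is one way to reproduce them, assuming the population (not sample) standard deviation is meant for f and that I is the departure function of Table 2, line 7b, evaluated against equal null proportions of 1/k. All function names are hypothetical.

```python
from statistics import pstdev


def equally_spaced_means(k: int, f: float) -> list[float]:
    """k standardized means, equally spaced and centred on zero,
    scaled so that their population standard deviation equals f."""
    raw = [k - 1 - 2 * i for i in range(k)]  # e.g. k = 4 gives 3, 1, -1, -3
    scale = f / pstdev(raw)
    return [scale * x for x in raw]


def equally_spaced_proportions(k: int, ratio: float) -> list[float]:
    """k equally spaced proportions that sum to 1, with the largest
    equal to `ratio` times the smallest."""
    raw = [ratio - (ratio - 1) * i / (k - 1) for i in range(k)]
    total = sum(raw)
    return [x / total for x in raw]


def departure_I(props: list[float]) -> float:
    """Sum of (P1i - P0i)^2 / P0i with every null proportion P0i = 1/k."""
    p0 = 1 / len(props)
    return sum((p - p0) ** 2 / p0 for p in props)


if __name__ == "__main__":
    for k in (2, 3, 4):
        means = equally_spaced_means(k, 0.25)
        print(k, [round(m, 3) for m in means], [round(100 + 16 * m, 1) for m in means])
    for k in (2, 3, 5):
        props = equally_spaced_proportions(k, 2.0)
        print(k, [round(p, 3) for p in props], round(departure_I(props), 4))
```

Running the loop for the medium level shows I shrinking as k grows, which is the decrease in the departure function that the text remarks on.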

the statistically equivalent sign test (Table 2), which calls for populations of .70 and .30 and hence a criterion of 2 1/3:1, instead of the 2:1 value used. This incompatibility was tolerated both because of the greater simplicity of the latter, and also because the former gave rise to discrepancies between proportions for larger values of k which seemed intuitively too large to be deemed medium. A large effect is defined as a ratio twice as large as those illustrated above (i.e., 4:1) and a small effect as one three-quarters as large (i.e., 3:2), since a ratio half as large defines no effect (i.e., 1:1).

7b. χ² contingency test. A definition of size of population contingency which was both simple and direct could not be achieved, in contrast to the other tests. Instead, the same criteria values I were used here as those which resulted from the ratio of the largest to smallest proportion in the simpler one-dimensional χ² test (Table 2, line 7a). Thus, for any given number of degrees of freedom, medium contingency is implicitly defined as departure from null association (as measured by I) equal to that of medium departure from the equiproportionality hypothesis of 7a, i.e., a 2:1 ratio of extremes (Table 2). This results in I values which vary as a function of df. What this leads to, as a definition of medium size of contingency, is perhaps more clearly illustrated by examples than described. Following are some contingency tables of varying degrees of freedom whose proportions exemplify medium contingency (decimal points omitted):

111 167 222            167 139 111 083
222 167 111            083 111 139 167
df = 2, I = .0741      df = 3, I = .0617

067 083 100 117 133    084 098 152
133 117 100 083 067    098 138 098
                       152 098 084
df = 4, I = .0556      df = 4, I = .0556

Note that in the 2 × k tables above the extreme columns' cells are in 2:1 ratio and the values in each row are equally spaced. Of course, other tables of equal size of effect (therefore leading to equal power) can be constructed, provided that they yield the I value appropriate to the df involved.

Limitations of space preclude the presentation of tables which exemplify small and large contingency effects, but the interested reader can construct his own by analogy with the material presented.

Survey Procedure

With power tables prepared for .05 level nondirectional tests for varying values of n for each statistical test type, and the size of effect levels chosen, the next step was the survey of articles in the volume. Each article was read in turn, and the nature of each statistical test performed (or implied) in the article was noted. Generally, when sample sizes (and for F tests and χ², df) were added to the standard conditions, the power of the test for small, medium, and large effect size could be read directly from the appropriate prepared tables, or by interpolation between tabled values. The statistical tests given in Table 1 are not inclusive of all used in the volume; most noteworthily, nonparametric tests based on ranks could not be studied from the point of view of power due to unavailability of systematic studies of this issue in the literature. In the relatively few instances where such tests had been used, the power was determined for the analogous parametric test, e.g., the t test for means for the Mann-Whitney U test and for the Wilcoxon matched-pairs signed-ranks test, and the F test for the Kruskal-Wallis H test and for the Friedman test (Siegel, 1956). Note that the effect of this substitution was to slightly overestimate the power of the tests on the usual assumption that the conditions required by the parametric tests obtained. Even if this assumption is questioned, it is quite unlikely that the substitution results in an underestimation of power. In general, in the few instances where statistical tests were so described as to leave a doubt about the exact details, the doubt was resolved in favor of higher power estimates. For example, if a group of n cases was divided into two subgroups for comparison, but the subgroup sizes were not given, it was assumed they were equal, which then leads to a maximum power estimate for that value of n.

In this way, the power was determined for the 4,829 statistical tests⁷ in the volume. But it was desired to characterize the power of each of the research studies in the volume. The typical article involved a number of tests not all of equal relevance to its major hypotheses. To determine an average set of power values across all the statistical tests of an article might lead to a distorted result, if, for example, a few hypothesis relevant tests were performed on the total data followed by a large number of subsidiary exploratory tests on only a portion of the cases. The latter would be less powerful (since sample sizes are smaller), more numerous, and less relevant to the issues central to the investigation.

These considerations led to the classification of all tests performed either as bearing directly on the status of the major hypotheses or experimental issues, of which there were 2,088, or as being peripheral to these issues, an additional 2,741. The latter typically included such things as exploratory tests, routine tests of the significance of all correlation

⁷ Fortunately, this did not demand that many separate determinations. In these characteristically multivariable studies, a single test, e.g., the significance of an r for a given n might be applied as many as 861 times, i.e., to each intercorrelation of a 42 × 42 matrix (to take the most extreme example) which counts as 861 statistical tests, but requires only a single power determination.
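The medium-contingency tables given under 7b can be checked against their stated I values. The sketch below assumes that, for a contingency table, the null proportion of each cell is the product of its row and column marginal proportions, and that I is the sum of squared departures divided by those null proportions (Table 2, line 7b). The function name and the dictionary layout are hypothetical; the cell entries are the printed ones, in thousandths.

```python
def contingency_I(table: list[list[int]]) -> float:
    """Departure from independence for a table of cell counts or
    proportions: sum over cells of (P1 - P0)^2 / P0, where P0 is the
    product of the row and column marginal proportions."""
    total = sum(sum(row) for row in table)
    row_props = [sum(row) / total for row in table]
    col_props = [sum(col) / total for col in zip(*table)]
    value = 0.0
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            p1 = cell / total
            p0 = row_props[i] * col_props[j]
            value += (p1 - p0) ** 2 / p0
    return value


# Printed tables (decimal points omitted in the source), keyed by the
# printed (label, I) pair.
PRINTED = {
    ("2 x 3", 0.0741): [[111, 167, 222], [222, 167, 111]],
    ("2 x 4", 0.0617): [[167, 139, 111, 83], [83, 111, 139, 167]],
    ("2 x 5", 0.0556): [[67, 83, 100, 117, 133], [133, 117, 100, 83, 67]],
    ("3 x 3", 0.0556): [[84, 98, 152], [98, 138, 98], [152, 98, 84]],
}

if __name__ == "__main__":
    for (label, printed_i), table in PRINTED.items():
        print(f"{label}: computed I = {contingency_I(table):.4f}, printed I = {printed_i:.4f}")
```

Because the printed cells are rounded to three decimals, the computed values only approximate the stated ones; agreement to within about .001 or .002 is what this sketch can be expected to show.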
TABLE 3
FREQUENCY AND CUMULATIVE PERCENTAGE DISTRIBUTIONS OF THE POWER OF THE 70 ARTICLESᵃ TO DETECT SMALL, MEDIUM, AND LARGE POPULATION EFFECTS UNDER NONDIRECTIONAL .05 LEVEL CONDITIONS

Power     Small effect          Medium effect         Large effect
          Frequency Cumulative  Frequency Cumulative  Frequency Cumulative
                    percentage            percentage            percentage
.99-                                                  9         100
.95-.98                                               7         87
.90-.94                         2         100         18        77
.80-.89                         4         97          14        51
.70-.79                         4         91          7         31
.60-.69                         8         86          7         21
.50-.59                         12        74          6         11
.40-.49   3         100         15        57          0         3
.30-.39   6         96          10        36          2         3
.20-.29   14        87          13        21
.10-.19   39        67          2         3
.05-.09   8         11

n         70                    70                    70
M         .18                   .48                   .83
Median    .17                   .46                   .89
σ         .08                   .20                   .16
Q1        .12                   .32                   .73
Q3        .23                   .60                   .94

ᵃ From the Journal of Abnormal and Social Psychology, 1960, 61.
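The cumulative percentage columns of Table 3 can be recomputed from the frequency columns, which is a convenient check on the table. A minimal sketch follows, with frequencies listed from the lowest occupied power interval upward; the dictionary layout and function name are hypothetical.

```python
from itertools import accumulate

# Frequencies from Table 3, lowest occupied power interval first.
FREQUENCIES = {
    "small": [8, 39, 14, 6, 3],                 # .05-.09 up to .40-.49
    "medium": [2, 13, 10, 15, 12, 8, 4, 4, 2],  # .10-.19 up to .90-.94
    "large": [2, 0, 6, 7, 7, 14, 18, 7, 9],     # .30-.39 up to .99-
}


def cumulative_percentages(freqs: list[int]) -> list[int]:
    """Cumulative percentage of articles at or below each interval."""
    total = sum(freqs)
    return [round(100 * c / total) for c in accumulate(freqs)]


if __name__ == "__main__":
    for level, freqs in FREQUENCIES.items():
        print(level, sum(freqs), cumulative_percentages(freqs))
    # Share of articles with power of .95 or higher for large effects.
    top_two = sum(FREQUENCIES["large"][-2:])
    print(top_two, round(100 * top_two / sum(FREQUENCIES["large"])))
```

Each column sums to 70 articles, and the top two intervals of the large-effect column hold 16 of them, or about 23%.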

coefficients in a factor analysis, significance tests of reliability coefficients of dependent variables, tests of "by-product" control variables or unhypothesized interactions in analysis of variance designs, etc.

Once the tests were so classified, the mean power of the major tests was determined at the three levels of size of effect for each research study.⁸ By this procedure, no matter what the number of tests a particular study might involve, all articles count equally in the description of the total volume. The mean power values of the studies at hypothesized small, medium, and large population effects were then distributed and their central tendency and variability determined.

RESULTS

There are, in all, 78 articles in the Journal of Abnormal and Social Psychology, 1960, 61. Of these, 6 involved no statistical tests at all (case reports, factor analytic studies, etc.) and two additional articles (both factor analytic) involved no major tests as above defined. The frequency and cumulative percentage distributions and relevant descriptive statistics of the (mean) power to detect small, medium, and large effects of the remaining 70 articles are given in Table 3. As can be seen from the distributions and their summarizing statistics, given .05 level nondirectional statistical tests of the major hypothesis, the power to detect the size of effect levels previously defined are as follows:

Small effects. On the average, the studies reviewed had only about one chance in five or six of detecting small effects. About a fourth of the articles had as much as one chance in four of yielding significant results, and another fourth had no more than one chance in eight under these conditions. Not one of the studies had as much as a 50-50 chance of detecting a slight effect!

Medium effects. When one posits medium effects in the population (generally of the order of twice as large as small effects) the studies average slightly less than a 50-50 chance of successfully rejecting their major null hypotheses. No more than one-quarter of these studies have as good as three chances in five of succeeding under these conditions, and another quarter have less than one chance in three.

Large effects. Only when one assumes large effects (roughly twice as large as medium) does one find typically a good chance of

⁸ The less important mean power for all tests, both major and peripheral, was also found for each article.

rejecting the major null hypotheses, about five out of six. Even under these most favorable circumstances, a quarter of the studies have less than three chances in four of succeeding.

Another way of viewing these results is to determine the proportion of the studies which would meet the criterion of a Type II error level as small as the conventional Type I level, namely, .05 (power, therefore, would be at .95 or higher). None of the studies meet this criterion when one posits small or even medium effects, and only 23% (i.e., 16 of 70) meet it under conditions of large effect.

Incidentally, if the reader questions the validity of the author's judgment in classifying the statistical tests into major and peripheral (see above, Survey Procedure) or is for any reason curious about the power of the researches when all tests, major and peripheral, are considered, the power means for small, medium, and large effects are .20, .50, and .83, respectively, hardly different from the means in Table 3.

DISCUSSION

The results indicate that the investigators contributing to Volume 61 of the Journal of Abnormal and Social Psychology had, on the average, a relatively (or even absolutely) poor chance of rejecting their major null hypotheses, unless the effect they sought was large. This surprising (and discouraging) finding needs some further consideration to be seen in full perspective.

First, it may be noted that with few exceptions, the 70 studies did have significant results. This may then suggest that perhaps the definitions of size of effect were too severe, or perhaps, accepting the definitions, one might seek to conclude that the investigators were operating under circumstances wherein the effects were actually large, hence their success. Perhaps, then, research in the abnormal-social area is not as "weak" as the above results suggest. But this argument rests on the implicit assumption that the research which is published is representative of the research undertaken in this area. It seems obvious that investigators are less likely to submit for publication unsuccessful than successful research, to say nothing of a similar editorial bias in accepting research for publication. Consider this paradigm: 100 investigations are undertaken in which, in fact, there is actually a medium population effect. From the above findings, about 50 get positive results and are likely to come to publication; the other 50 fail to reject their (assumed false) null hypotheses and are unlikely to come to publication. Thus, the general success of the articles in the volume under review does not successfully argue for their antecedent probabilities of success being any higher than the results of the analysis suggest, or, equivalently, that the criteria for size of effect used were overly stringent.⁹

On the contrary, there is a line of argument that suggests that the criteria were not stringent enough. Assume that a medium effect exists in the population with regard to some psychological construct or constructs, e.g., a correlation between two (pure factor) attitudes of .40. By the time we have measured each, the variance of our scores contains error and other construct irrelevant variance which serve to attenuate the population effect we seek to a correlation of perhaps .20 to .30 between fallible attitude scores. We always must draw inferences from variables containing error and irrelevant variance while we normally conceptualize our problems in terms of constructs. The net effect of the fallibility of our measurement and classification is to attenuate the effects we seek. Thus, the size of effect criteria, relating as they do to fallible observations, imply even larger construct effects, and from the viewpoint of the latter, are on the generous side.

If we then accept the diagnosis of general weakness of the studies, what treatment can be prescribed? Formally, at least, the answer is simple: increase sample sizes. The mean of the maximum sample sizes used to test major hypotheses in the 70 studies was 68.¹⁰ The power of a statistical test depends formally on several parameters, but unless one is

⁹ The paradigm can be continued: assume that at the same time another 100 investigations are undertaken in which, in fact, there is no effect, i.e., the null hypothesis is true. At the .05 level, these will contribute, on the average, another five candidates for publication. This reduces even further the strength of this argument.
¹⁰ Since the distribution is positively skewed, as evidenced by a standard deviation of 55, the median would be considerably less than 68.
152 JACOB COHEN

to increase the significance level (i.e., in- sight of the fact that, following the discovery
crease the risk of Type I errors) or use di- of a significant F ratio involving several
rectional tests (e.g., a one-sided test for t ) , groups, they are usually left with a multiple
power can generally be increased only by an comparison problem where the means are no
increase in sample size. Taking 68 cases, it more stable than the sample size on which
is instructive (and chastening) to see how each is based. Thus, if in the last example
much power they provide for various tests (seven groups of 10 cases each) F is found
under standard conditions (.OS significance significant, the determination of which group
criterion, nondirectional) assuming the exist- differs significantly from which then depends
ence of a medium population effect: on means based on 10 cases. Even if one then
1.1 test for a difference between two means. follows the overliberal practice of performing
Assuming samples of 34 cases each, the power t tests between pairs of these means at the
is .52. If the sample was unequally divided, tabled, but actually higher, .05 level (using
say for SO and 18, power would be only .42. 1! the within-group error term based on 63 dj)
2. Normal deviate test for a difference be- the power of each test under medium effect
tween two proportions. With two samples of conditions is only .19, despite the overall F
34 cases, assuming extreme population pro- test power of .31!
portions, say .70 and .90 (or .10 and .30), 7a. x2 test that k proportions are equal.
power is .57; assuming population proportions This parallels the situation for F tests, power
of .40 and .60, power is only .38. varying with k. For three groups of 23,
3. Normal deviate test for a difference be- power to detect a medium effect is .38; for
tween r's. Again dividing the 68 cases equally four groups of 17, .30; for seven groups of
for maximum power, with high population 10, .21. The same considerations apply when
r's, say .70 and .90, power is .66; with low it is necessary to follow-up the overall x2
r's. differing by the same amount, say .10 and test.
.30, power is only .13! 7b. x2 contingency test. As above, power
4. t test that a population r = 0. If it is, varies with dj. For example, for the contin-
in fact, .40 (medium), 68 cases give the gency tables illustrated above (Size of Effect)
high power value of .93. This high power assuming 68 cases in each, power is as fol-
is a consequence of the definition of medium lows: dj = 2, .50; dj = 3, .37; df = 4 (both
as .40, rather than a lower value which com- tables) .30.
patability with the other test criteria would Given these generally meager power values
dictate (see above, Size of Effect). for 68 cases, it is not surprising to find a mean
5. Normal deviate test that a population power value assuming medium effect size over
proportion is .50 (sign test}. If the popula- the 70 articles of only .48. Are these studies
tion proportion is actually .70 (or .30), 68 representative of abnormal-social research
cases give the high power value of .92, pro- undertaken? It follows from our earlier
vided that the design yields 68 differences. If, reasoning that, if anything, published studies
however, the 68 cases are set up to yield 34 are more powerful than those which do not
matched pairs, n is effectively 34, and power reach publication, certainly not less powerful.
is only .63. Therefore, the going antecedent probability
6. F test for k means. Power here depends of success of current abnormal-social research
upon the number of groups. Assuming three is much lower than one would like to see it,
groups of 23 cases each, power is .41; with a situation which is capable of improvement
four groups of 17 cases each, .36; with seven by increasing the size of the samples custom-
groups of 10 cases each, power drops to .31. arily employed.12 The investigator on the
The F test in the analysis of variance is, track of a subtle issue in the area of subcep-
indeed, a most versatile statistical tool (cf.
12
Anderson, 1961) but investigators may lose Other means for increasing power: improving
experimental design efficiency and/or experimental
11
It is demonstrable that for the statistical test control, and renouncing a slavish adherence to a
of any difference between or among samples which standard Type I level, usually .05. In some in-
total n cases, power is at a maximum when the vestigations, an increase in the latter may result in
n cases are equally divided. so large an increase in power as to justify the greater
STATISTICAL POWER 153

tion who plans to study 30 cases would do and the mean power values for the major
well to take heed! tests of each article were used to characterize
The consequences of this state of affairs are that article. The distributions of these values
fairly obvious. If many investigators are run- were presented and summarized.
ning high risks of failing to detect substantial It was found that the average power (prob-
population effects, much research is resulting ability of rejecting false null hypotheses)
in spuriously "negative" results. One can over the 70 research studies was .18 for small
only speculate on the number of potentially effects, .48 for medium effects, and .83 for
fruitful lines of investigation which have been large effects.
abandoned because Type II errors were made, These values are deemed to be far too
a situation which is substantially remediable small and suggest that much research in the
by using double or triple the original sample abnormal-social area has lead to the failure
size. A generation of researchers could be to reject null hypotheses which are in fact
profitably employed in repeating interesting false. This in turn may have lead to frequent
studies which originally used inadequate premature abandonment of useful lines of
sample sizes. Unfortunately, the ones most investigation.
needing such repetition are least likely to Since power is a direct monotonic function
have appeared in print. of sample size, it is recommended that in-
It is quite likely that similar conditions vestigators use larger sample sizes than they
prevail in other areas of psychological re- customarily do. It is further recommended
search. It is recommended that psychological that research plans be routinely subjected to
investigators attend to issues of power in their power analysis, using as conventions the
planning of experiments, and that the defini- criteria of population effect size employed in
tions of size of effect employed in this survey this survey.
be used conventionally. In the absence of any REFERENCES
basis for specifying an alternative to the null ANDERSON, N. H. Scales and statistics: Parametric
hypothesis for purposes of power analysis, and nonparametric. Psychol. Bull., 1961, 58, 305-
the criterion values for a medium effect 316.
(Table 2) are offered as a convention. CRONBACII, L. J. Essentials of psychological testing,
(2nd ed.) New York: Harper, 1960.
DIXON, W. J., & MASSEY, F. J., JR. Introduction to
SUMMARY statistical analysis, (2nd ed.) New York: Mc-
Graw-Hill, 19S7.
The purpose of the study was to survey EISENHART, C., HASTAY, M. W., & WALLIS, W. A.
the articles of the Journal of Abnormal and Techniques of statistical analysis. New York:
Social Psychology, 1960, 61, from the point McGraw-Hill, 1947.
Fix, E., HODGES, J. L., JR., & LEIIMANN, E. L.
of view of the power of their statistical tests The restricted x 2 test. In, Studies in probability
to reject their major null hypotheses, for and statistics dedicated to Harald Cramer. Stock-
defined levels of departure of population para- holm: Almquist & Wiksell, 1959.
meters from null conditions, i.e., size of ef- GTJILFORD, J. P. Fundamental statistics in psychology
fect. Conventional test conditions were em- and education. (3rd ed.) New York: McGraw-
Hill, 1956.
ployed in power determination: nondirectional HUSEN, T. Psychological twin research, Stockholm:
tests at the .05 level of significance. Almquist & Wiksell, 19S9.
For this purpose, extensive tables for the MOSTEIXER, F., ROURKE, R. E. K., & THOMAS,
common statistical tests were prepared from G. B., JR. Probability and statistics. Reading,
Mass.: Addison-Wesley, 1961.
which the power of a test could be read as a PATNAIK, P. B. The non-central x2 and F-distribu-
function of sample size and size of effect. tions and their applications. Biontetrika, 1949, 36,
From these tables, the power to detect small, 202-232.
medium, and large effects of each statistical SIEGEL, S. Nonparametric statistics for the behavioral
test employed in each article was determined, sciences. New York: McGraw-Hill, 1956.
SUPER, D. E. Appraising vocational fitness. New
York: Harper, 1949.
Type I risk. However, normal scientific conservatism WALKER, HELEN M., & LEV, J. Statistical inference.
would not tolerate too long a trip on this road. New York: Holt, 1953.
Increased sample size is likely to prove the most
effective general prescription for improving power. (Received October 5, 1961)
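
The power values quoted in the Discussion for 68 cases under Tests 2, 3, and 4 can be roughly checked with a short sketch. What follows is an illustration under stated assumptions, not a reconstruction of the tables actually used in the survey: it treats each test statistic as a unit-variance normal deviate, uses the arcsine transformation for proportions and the Fisher z transformation for correlations (the latter standing in for the t test of r = 0 in Test 4), and all function names are hypothetical.

```python
from math import asin, atanh, sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def two_sided_power(shift, alpha=0.05):
    """Power of a nondirectional normal-deviate test.

    `shift` is the mean of the test statistic under the alternative,
    expressed in standard-error units (assumed unit variance).
    """
    z_crit = _STD.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection tail.
    return _STD.cdf(shift - z_crit) + _STD.cdf(-shift - z_crit)


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Test 2: difference between two proportions, equal group sizes.

    Assumption: arcsine transformation, whose difference has variance
    of roughly 2 / n_per_group.
    """
    h = 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
    return two_sided_power(abs(h) * sqrt(n_per_group / 2), alpha)


def power_two_correlations(r1, r2, n_per_group, alpha=0.05):
    """Test 3: difference between two r's, equal group sizes.

    Assumption: Fisher z transformation, each z with variance 1 / (n - 3).
    """
    se = sqrt(2 / (n_per_group - 3))
    return two_sided_power(abs(atanh(r1) - atanh(r2)) / se, alpha)


def power_correlation_vs_zero(rho, n, alpha=0.05):
    """Test 4: population r = 0, approximated with Fisher z
    rather than the t test named in the text."""
    return two_sided_power(abs(atanh(rho)) * sqrt(n - 3), alpha)


if __name__ == "__main__":
    print(f"proportions .70 vs .90, n = 34 each: {power_two_proportions(0.70, 0.90, 34):.2f}")
    print(f"proportions .40 vs .60, n = 34 each: {power_two_proportions(0.40, 0.60, 34):.2f}")
    print(f"r's .70 vs .90, n = 34 each:         {power_two_correlations(0.70, 0.90, 34):.2f}")
    print(f"r's .10 vs .30, n = 34 each:         {power_two_correlations(0.10, 0.30, 34):.2f}")
    print(f"r = .40 vs 0, n = 68:                {power_correlation_vs_zero(0.40, 68):.2f}")
```

Under these approximations the computed values fall within about .01 of the figures given in the text (.57, .38, .66, .13, and .93). Raising `n_per_group` or `n` in any of the calls shows directly how power grows with sample size, which is the prescription the Discussion arrives at.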
