
Psychological Methods Copyright 2003 by the American Psychological Association, Inc.

2003, Vol. 8, No. 3, 305–321 1082-989X/03/$12.00 DOI: 10.1037/1082-989X.8.3.305

Sample Size for Multiple Regression: Obtaining Regression Coefficients That Are Accurate, Not Simply Significant
Ken Kelley and Scott E. Maxwell
University of Notre Dame

An approach to sample size planning for multiple regression is presented that emphasizes accuracy in parameter estimation (AIPE). The AIPE approach yields precise estimates of population parameters by providing necessary sample sizes in order for the likely widths of confidence intervals to be sufficiently narrow. One AIPE method yields a sample size such that the expected width of the confidence interval around the standardized population regression coefficient is equal to the width specified. An enhanced formulation ensures, with some stipulated probability, that the width of the confidence interval will be no larger than the width specified. Issues involving standardized regression coefficients and random predictors are discussed, as are the philosophical differences between AIPE and the power analytic approaches to sample size planning.

Editor’s Note. Samuel B. Green served as action editor for this article. (SGW)

Correspondence concerning this article should be addressed to Ken Kelley or Scott E. Maxwell, Department of Psychology, University of Notre Dame, 118 Haggar Hall, Notre Dame, Indiana 46556. E-mail: kkelley@nd.edu or smaxwell@nd.edu

Sample size estimation from a power analytic perspective is often performed by mindful researchers in order to have a reasonable probability of obtaining parameter estimates that are statistically significant. In general, the social sciences have slowly become more aware of the problems associated with underpowered studies and their corresponding Type II errors, which can yield misleading results in a given domain of research (Cohen, 1994; Muller & Benignus, 1992; Rossi, 1990; Sedlmeier & Gigerenzer, 1989). The awareness of underpowered studies in the literature has led vigilant researchers attempting to curtail this problem in their investigations to perform a power analysis (PA) prior to data collection. Researchers who have used various power analytic procedures have undoubtedly strengthened their own research findings and added meaningful results to their respective research areas. However, even with PA becoming more common, it is known that null hypotheses of point estimates are rarely exactly true in nature (Cohen, 1994). Therefore, performing sample size planning solely for the purpose of obtaining statistically significant parameter estimates may often be improved by planning sample sizes that lead to accurate parameter estimates, not merely statistically significant ones.

The zeitgeist of null hypothesis significance testing seems to be losing ground in the behavioral sciences as the generally more informative confidence interval begins to gain widespread usage. Instead of simply testing whether a given parameter estimate is some exact and specified value, typically zero, forming a 100(1 − α) percent confidence interval around the parameter of interest frequently provides more meaningful information. Although null hypothesis significance tests and confidence intervals can be thought of as complementary techniques, confidence intervals can provide researchers with a high degree of assurance that the true parameter value is within some confidence limits. Understanding the likely range of the parameter value typically provides researchers with a better understanding of the phenomenon in question than does simply inferring that the parameter is or is not statistically significant. With regard to accuracy in parameter estimation (AIPE), all other things being equal, the narrower the confidence interval, the more certain one can be that the observed parameter estimate closely approximates the corresponding population parameter. Accuracy in this

306 KELLEY AND MAXWELL

sense is a measure of the discrepancy between an estimated value and the parameter it represents.1

1 The formal definition of accuracy is given by the square root of the mean square error and can be expressed by the following formulation:

$$\mathrm{RMSE} = \sqrt{E[(\hat{\theta} - \theta)^2]} = \sqrt{E[(\hat{\theta} - E[\hat{\theta}])^2] + (E[\hat{\theta} - \theta])^2},$$

where E is the expectation operator and $\hat{\theta}$ is an estimate of θ, the value of the parameter of interest (Hellmann & Fowler, 1999; Rozeboom, 1966, p. 500). The first component under the second radical sign represents precision, whereas the second component represents bias. Thus, when the expected value of a parameter is equal to the parameter value it represents (i.e., when it is unbiased), accuracy and precision are equivalent concepts and the terms can be used interchangeably.

One position that can be taken is that AIPE leads to a better understanding of the effect in question and is more important for a productive science than a dichotomous decision from a null hypothesis significance test. Many times obtaining a statistically significant parameter estimate provides a research community with little new knowledge of the behavior of a given system. However, obtaining confidence intervals that are sufficiently narrow can help lead to a knowledge base that is more valuable than a collection of null hypotheses that have been rejected or that failed to reach significance, given that the desire is to understand a particular phenomenon, process, or system.

If we assume that the correct model is fit, observations are randomly sampled, and the appropriate assumptions are met, (1 − α) is the probability that any given confidence interval from a collection of confidence intervals calculated under the same circumstances will contain the population parameter of interest. However, it is not true that a specific confidence interval is correct with (1 − α) probability, as a computed confidence interval either does or does not contain the parameter value. The meaning of a 100(1 − α) percent confidence interval for some unknown parameter was summarized by Hahn and Meeker (1991) as follows: “If one repeatedly calculates such [confidence] intervals from many [technically an infinite number of] independent random samples, 100(1 − α)% of the intervals would, in the long run, correctly bracket the true value of [the parameter of interest]” (p. 31). It is important to realize that the probability level refers to the procedures for constructing a confidence interval, not to a specific confidence interval (Hahn & Meeker, 1991).2

2 It should be noted that the interpretation of confidence intervals given in the present article follows a frequentist interpretation. The Bayesian interpretation of a confidence interval was well summarized by Carlin and Louis (1996), who stated that “the probability that [the parameter of interest] lies in [the computed interval] given the observed data y is at least (1 − α)” (p. 42). Thus, the Bayesian framework allows for a probabilistic statement to be made about a specific interval. However, when a Bayesian confidence interval is computed with a noninformative prior distribution (which uses only information obtained from the observed data), the computed confidence interval will exactly match that of a frequentist confidence interval; the interpretation is what differs. Regardless of whether one approaches confidence intervals from a frequentist or a Bayesian perspective, the suggestions provided in this article are equally informative and useful.

Many of the arguments in the present article regarding the use and utility of confidence intervals echo a similar sentiment that has been long recommended, as well as the more recent discussions in Wilkinson and the American Psychological Association Task Force on Statistical Inference (1999), essentially an entire issue of Educational and Psychological Measurement (Thompson, 2001) devoted to confidence intervals and measures of effect size, Algina and Olejnik (2000), and Steiger and Fouladi (1997), as well as the still salient views offered by Cohen (1990, 1994). In fact, Cohen (1994) argued that the reason confidence intervals have previously seldom been reported in behavioral research is because the widths of the intervals are often “embarrassingly large” (p. 1002). The AIPE approach presented here attempts to curtail the problem of embarrassingly large confidence intervals and provides sample size estimates that lead to confidence intervals that are sufficiently precise and thereby produce results that are presumably more meaningful than simply being statistically significant.

In the context of multiple regression, sample size can be approached from at least four different perspectives: (a) power for the overall fit of the model, (b) power for a specific predictor, (c) precision of the estimate for the overall fit of the model, and (d) precision of the estimate for a specific predictor. The goal of the first perspective is to estimate the necessary sample size such that the null hypothesis of the population multiple correlation coefficient equaling zero can be correctly rejected with some specified probability (e.g., Cohen, 1988, chapter 13; Gatsonis & Sampson, 1989; S. B. Green, 1991; Mendoza &
SAMPLE SIZE AND ACCURACY IN PARAMETER ESTIMATION 307

Stafford, 2001). With the second perspective, sample size is computed on the basis of the desired power for the test of a specific predictor rather than the desired power for the test of the overall fit of the model (Cohen, 1988, chapter 13; Maxwell, 2000).

The precision of the overall fit of the model leads to another reason for planning sample size. One alternative within this perspective provides the necessary sample size such that the width of the one-sided (lower bound) confidence interval of the population multiple correlation coefficient is sufficiently precise (Darlington, 1990, section 15.3.4). Another alternative within this perspective provides the sample size such that the total width of the confidence interval around the population multiple correlation squared is specified by the researcher (Algina & Olejnik, 2000).

The final perspective for sample size estimation within the multiple regression framework provides the main purpose of the present article. Necessary sample size from this perspective is obtained such that the confidence interval around a regression coefficient is sufficiently narrow. Oftentimes confidence intervals are computed at the conclusion of a study, and only then is it realized the sample size used was not large enough to yield precise estimates. The AIPE approach to sample size planning allows researchers to plan necessary sample size, a priori, such that the computed confidence interval is likely to be as narrow as specified.

Figure 1 illustrates the relation between confidence intervals and null hypothesis significance testing as they relate to the issue of sample size for AIPE and PA. Specifically, the figure shows the limits of a confidence interval for a standardized regression coefficient in each of four hypothetical studies with a different predictor variable in each instance. In all four studies the null hypothesis that the regression coefficient equals zero is false.

From a purely power analytic perspective, Study 1 is considered a “success.” The confidence interval in this study shows that the parameter is not likely to be zero and is thus judged to be statistically significant. However, the confidence interval is wide, and thus the parameter is not accurately estimated. In this study little information about the population parameter is learned other than it is likely to be some positive value, a “failure” according to the goals of AIPE. This study had an adequate sample size from the perspective of power, but a larger sample is needed in order to obtain a more precise estimate.

Study 2, on the other hand, not only indicates that the null hypothesis should be rejected but also provides precise information about the size of the population parameter. Here the confidence interval is narrow, and thus the population parameter is precisely estimated. Study 2 is a success according to both the PA and AIPE frameworks.

Study 3 shows a nonsignificant effect that is accompanied by a wide confidence interval, illustrating a failure by both methods. Had a larger sample size

Figure 1. Illustration of possible scenarios in which planned sample size was considered a
“success” or “failure” according to the accuracy in parameter estimation and the power
analysis frameworks. Parentheses are used to indicate the width of the confidence interval.

been used and had the effect been of approximately the same magnitude, the width of the confidence interval would have likely been smaller, leading to a potential rejection of the null hypothesis. Thus, the sample size of Study 3 was inadequate from both perspectives.

Study 4 illustrates a case in which the confidence interval contains zero, yet the parameter is estimated precisely. Study 4 exemplifies a failed PA but a successful application of AIPE, as the population parameter is bounded by a narrow confidence interval. Of course, one could argue that this study is not literally a failure from a PA perspective, because as a conditional probability, power depends on the population effect size. In this study the population effect size may be smaller than the minimal effect size of theoretical or practical importance.

The goals for PA and AIPE are fundamentally different. The goal of PA is to obtain a confidence interval that correctly excludes the null value, thus making the direction of the effect unambiguous. The necessary sample size from this perspective clearly depends on the value of the effect itself. On the other hand, the goal of AIPE is to obtain an accurate estimate of the parameter, regardless of whether the interval happens to contain the null value. Thus, sample size from the AIPE perspective does not depend on the value of the effect itself. However, these two methods of sample size planning are not rivals; rather they can be viewed as complementary. In general, the most desirable study design is one in which there is enough power to detect some minimally important effect while also being able to accurately estimate the size of the effect. In this sense, designing a study can entail selecting a sample size based on whichever perspective implies the need for the largest sample size for the desired power and precision. We revisit this possibility in the Power Analysis Versus Accuracy in Parameter Estimation section, in which AIPE and PA are formally compared in a multiple regression framework.

For the moment let us suppose that a researcher has decided to adopt the AIPE perspective. Provided the input population parameters are correct, the techniques that are presented in this article allow researchers to plan sample size in a multiple regression framework such that the confidence interval around the regression coefficient of interest is sufficiently narrow.3 One approach provides the necessary sample size such that the expected width of the confidence interval will be the value specified. However, achieving an interval no larger than the specified width will be realized only (approximately) 50% of the time. A reformulation provides the necessary sample size such that there is a specified degree of assurance that the computed confidence interval will be no larger than the specified width. The precision of the confidence interval and the degree of assurance of this precision depend on the goals of the researcher. Not surprisingly, all other things being equal, greater precision and greater assurance of the precision necessitate a larger sample size. It is believed that if AIPE were widely applied, it would facilitate the accumulation of a more meaningful knowledge base than does a collection of studies reporting only parameters that are statistically significant but which do not precisely bound the value of the parameter of interest.

Sample Size Estimation for Regression Coefficients

In order to develop a general set of procedures for determining the sample size needed to obtain a desired degree of precision for confidence intervals in multiple regression analysis, we use standardized regression coefficients.4 Standardized regression coefficients are used for two reasons in developing procedures for determining sample size using an AIPE approach. First, due to the arbitrary nature of the many measurement scales used in the behavioral sciences, standardized coefficients are more directly interpretable. Second, standardized coefficients provide a more general framework in that variances and covariances need not be estimated when planning an appropriate sample size.5

3 Although the present article illustrates AIPE in a multiple regression framework, the extension to other applications of the general linear model is not difficult, many of which can be thought of as special cases of multiple regression.

4 The use of standardized regression coefficients may give rise to technical issues that are addressed in a later section of this article. Standardizing regression coefficients in the presence of random predictors has many appealing characteristics with regard to interpretability, but under certain circumstances problems can develop when using this popular technique.

5 If the desire is to form confidence intervals around unstandardized regression coefficients, the techniques presented here are equally useful. The desired width of the computed confidence interval is measured in terms of the

ratio of the standard deviation of Y to the standard deviation of $X_j$. Thus, following the methods presented for standardized regression coefficients, application to unstandardized coefficients is straightforward.

The formula for a 100(1 − α) percent symmetric confidence interval for a single population standardized regression coefficient, $\beta_j$, can be written as follows:

$$\hat{\beta}_j \pm t_{(1-\alpha/2;\,N-p-1)} \sqrt{\frac{1 - R^2}{\left(1 - R^2_{XX_j}\right)(N - p - 1)}}, \tag{1}$$

where $\hat{\beta}_j$ is the observed standardized regression coefficient, j represents a specific predictor (j = 1, ..., p), p is the number of predictors (independent or concomitant variables, covariates, or regressors), $R^2$ is the observed multiple correlation coefficient of the model, $R^2_{XX_j}$ represents the observed multiple correlation coefficient predicting the jth predictor ($X_j$) from the remaining p − 1 predictors, and N is the sample size (Cohen & Cohen, 1983; Harris, 1985).6 The value that is added to and subtracted from $\hat{\beta}_j$ to define the upper and lower bounds of a symmetric confidence interval is defined as w, which is the half-width of the entire confidence interval. Thus, the total width of a confidence interval is 2w. The value of w is of great importance for accuracy in estimation, because the width of the interval determines the precision of the estimated parameter.

In the procedure for planning sample size, the critical value for $t_{(1-\alpha/2;\,N-p-1)}$ is replaced by the critical $z_{(1-\alpha/2)}$ value. Justification for this can be made because precise estimates generally require a relatively large sample size, and replacing the critical $t_{(1-\alpha/2;\,N-p-1)}$ value with the critical $z_{(1-\alpha/2)}$ value has virtually no impact on the outcome for the sample size in most cases.7 The formula used to determine the planned sample size, such that confidence intervals around a particular population regression coefficient, $\beta_j$, will have an expected value of the width specified, is obtained by solving for N in Equation 1 and by making use of the presumed knowledge of the population multiple correlation coefficients:

$$N = \left(\frac{z_{(1-\alpha/2)}}{w}\right)^2 \left(\frac{1 - R^2}{1 - R^2_{XX_j}}\right) + p + 1, \tag{2}$$

where $R^2$ represents the population multiple correlation coefficient predicting the criterion (dependent) variable Y from the p predictor variables and $R^2_{XX_j}$ represents the population multiple correlation coefficient predicting the jth predictor from the remaining p − 1 predictors. The calculated N should be rounded to the next larger integer for sample size. The w in the above equation is the desired half-width of the confidence interval. It should be kept in mind that this procedure yields a planned sample size that leads to a confidence interval width for a specific predictor. In practice, both $R^2$ and $R^2_{XX_j}$ must be estimated prior to data collection, a complication we address momentarily. Although not frequently acknowledged in the behavioral literature on regression analysis, Equation 1 is derived assuming predictors are fixed and unstandardized. Equation 2 is a reformulation of Equation 1 and thus is based on the same assumptions. Results from a Monte Carlo study are provided later in the article indicating that sample size estimates based on Equation 2 are reasonably accurate when predictors are random and have been standardized.

Equation 2 is intended to determine N such that the expected half-width of an interval is under the researcher’s control. However, there is approximately only a 50% chance that the interval will be no larger than specified. The reason for this can be seen from Equation 1. Notice that the width of an interval will depend in part on $R^2$ and $R^2_{XX_j}$, both of which will vary from sample to sample. Thus, for a fixed sample size, the interval width will also vary over replications. However, it is possible to modify Equation 2 in order to increase the likelihood that the obtained interval will be no wider than desired.

6 We introduce the notational system used throughout the article. A boldface italicized R denotes the population multiple correlation coefficient, while a standard-print italicized R is used for its corresponding sample value. A population correlation matrix is denoted by a nonitalicized, boldface, nonserif-font R. A population zero-order correlation coefficient is denoted as a lowercase rho (ρ), whereas a vector of population zero-order correlation coefficients is denoted as a boldface lowercase rho (ρ).

7 The z approximation is poor if the correlations between the predictors and the criterion are large and the correlations among the predictors are small. In this case, the standard error of $\hat{\beta}_j$ is small, producing a relatively small estimated sample size. Under these conditions, the degrees of freedom of the critical t value are small, and thus the critical t value will not closely match the critical z value. We do not believe that this occurs frequently in behavioral research. The al-

ternative method is to solve for the appropriate sample size iteratively, which generally adds unnecessary complications.

If γ is the desired degree of uncertainty of the computed confidence interval being the specified width, Equation 2 can be modified with a multiplicative factor that will provide a modified N such that a researcher can have approximately 100(1 − γ) percent assurance that a computed confidence interval will be of the specified width or less. For example, if there were a desire to be 80% confident that the obtained w would be no larger than the desired half-width, γ would be defined as 0.20 and there would be only a 20% chance that the half-width of the confidence interval around $\beta_j$ would be larger than the specified w.

Hahn and Meeker (1991, section 8.3) showed how to plan sample size for confidence intervals when a specified width around the mean of a normal distribution is desired, as well as modifying that formula to obtain 100(1 − γ) percent confidence that the interval will be of the desired width or less. Taking similar logic and applying it to multiple regression leads to the creation of a formula for a modified N, $N_M$. This modified formulation provides the necessary sample size in order for researchers to be 100(1 − γ) percent confident that the $\beta_j$ of interest will have a corresponding confidence interval width that is no larger than specified. The formula for $N_M$ is given as follows:

$$N_M = \left(\frac{z_{(1-\alpha/2)}}{w}\right)^2 \left(\frac{1 - R^2}{1 - R^2_{XX_j}}\right) \left(\frac{\chi^2_{(1-\gamma;\,N-1)}}{N - p - 1}\right) + p + 1, \tag{3}$$

where N is the value obtained in Equation 2 and $\chi^2_{(1-\gamma;\,N-1)}$ is the critical value from a chi-square distribution at the 1 − γ quantile having N − 1 degrees of freedom. Like N, $N_M$ should also be rounded to the next larger integer.

Rather than using the parameter value of the variance for $\hat{\beta}_j$ as was done in the calculation of N, to compute $N_M$, Equation 3 uses the upper bound of the 100(1 − γ) percent confidence interval for the variance of $\hat{\beta}_j$. Recall that in any given sample the obtained variance of $\hat{\beta}_j$ will be either larger or smaller than the parameter value specified in Equation 2. Equation 3 uses the maximum value expected for the variance of $\hat{\beta}_j$ at the 100(1 − γ) percent confidence level. This value is substituted into Equation 2 for the variance of $\hat{\beta}_j$ and thus leads to Equation 3. Because the only random variable in Equation 2 is the variance of $\hat{\beta}_j$, use of Equation 3 provides probabilistic assurance that the obtained confidence interval of interest around $\beta_j$ will have a half-width no larger than the specified w with 100(1 − γ) percent confidence.

With regard to choosing a 100(1 − γ) percent confidence interval for estimation, when compared with a 100(1 − α) percent confidence interval for hypothesis testing, important distinctions arise. The most obvious difference in the present context is that γ represents the probability of obtaining a confidence interval with an observed w that is larger than the specified w, whereas alpha is the probability of rejecting a null hypothesis that is true. When making use of Equation 3, a researcher is expected to obtain a w that is larger than the value specified only 100γ percent of the time, regardless of whether or not the null hypothesis is true. Whereas alpha is typically thought of as one of two essentially constant values, .05 or .01, γ is chosen by the researcher in order to achieve some desired degree of assurance that the precision of the estimated parameter will be realized. Thus, confidence intervals formed in the realm of hypothesis testing represent an attempt to accomplish a different goal than those formed when a researcher’s interest is in obtaining a precise estimate of the parameter of interest.

Specifying Population Parameters as Input Values

As illustrated in the last section, determining sample size through an AIPE approach requires one to know, or anticipate, $R^2$ and $R^2_{XX_j}$. This is by no means an easy task, but with some careful planning and sound theoretical judgment, it is possible to develop appropriate estimates of the two parameters. In the remainder of this section we suggest different methods for anticipating the values of $R^2$ and $R^2_{XX_j}$, such that sample size planning can be accomplished.

Given that estimates are available for the p(p + 1)/2 zero-order population correlation coefficients, the squared multiple correlation coefficient predicting Y from the p predictors can be calculated using the following equation:

$$R^2 = \boldsymbol{\rho}'_{YX}\,\mathbf{R}_{XX}^{-1}\,\boldsymbol{\rho}_{YX}, \tag{4}$$

where $\boldsymbol{\rho}_{YX}$ is the population p × 1 column vector of correlations of each $X_j$ regressor with Y (and $\boldsymbol{\rho}'_{YX}$, its transpose), and $\mathbf{R}_{XX}$ is the p × p population intercor-

relation matrix of all of the predictor variables with do so (B. F. Green, 1977; Raju, Bilgic, Edwards, &
one another.8 Fleer, 1999; Wainer, 1976).
Finding the squared multiple correlation coefficient If a researcher does not have a good idea of the
of variable j from the other p − 1 predictors can be relationship of the zero-order correlations, conven-
readily computed from RXX in two steps. The first tions such as Cohen’s (1988, section 3.2) small (␳ ⳱
step is to calculate rjj, which for the jth predictor .10), medium (␳ ⳱ .30), and large (␳ ⳱ .50) effect
variable is defined as the jth principal diagonal ele- sizes for correlations can be used. These correlations
ment of R−1 2
XX (Harris, 1985). In the second step, RXXj can be used directly in Equation 4 or used in an ex-
for the jth predictor variable is found from the fol- changeable structure. For example, if exchangeability
lowing expression: seems reasonable and the predictor variables are mod-
erately or highly correlated with one another, a re-
1 searcher could fill the off-diagonal elements of the
2
RXX =1− . (5) RXX intercorrelation matrix with values of .30, .40, or
j rjj
.50. Further, suppose that it is reasonable to expect
The inverse of rjj is known as the tolerance of variable that the correlations of the predictors with the crite-
j with the other p − 1 predictors. The tolerance (1 − rion are, in general, small or medium. In this case the
R2XXj) is the proportion of variance of a predictor that vector ␳YX can be filled with correlations of .10, .20,
cannot be explained by the remaining p − 1 predictor or .30. Once acceptable estimates for the two types of
variables included in the model. As the tolerance of Xj correlations have been determined, the multiple cor-
approaches zero, Xj becomes highly correlated with relations can be obtained from Equations 4 and 5.
the remaining predictor variables and R2XXj becomes The third way to determine values for R2 and R2XXj
larger, which means there is more predictability, or is to consult previous literature in order to determine
collinearity, of predictor Xj from the other p − 1 pre- likely values for these two parameters or for likely
dictors (Darlington, 1990, p. 128). values of the zero-order correlation coefficients
The second method of finding R2 is a variation of (whether the data follow an exchangeable structure or
the first method and depends on the notion of exchangeability. An exchangeable structure (Maxwell, 2000) is one in which the intercorrelations of the predictors are all the same and the correlations of the predictors with the criterion variable are all the same (but ρXX and ρYX need not be equal to one another, where ρ represents a population zero-order correlation coefficient). Thus, instead of estimating the p(p + 1)/2 zero-order correlations, it is necessary to estimate only two correlations, one for the correlation of each of the predictors with one another and another correlation for each of the predictors with the criterion variable. The two zero-order correlations used in exchangeable structures should be of the same general magnitude as the set of correlations they represent. Since B. F. Green (1977) showed that “many linear composites [that is, predicted scores] are barely different from using equal weights” (p. 274), the exchangeable structure offers a potentially useful tool when planning necessary sample size (see Maxwell, 2000, for a thorough treatment and rationale of the exchangeable structure, as well as a similar correlational structure that is somewhat relaxed). Many times an exchangeable structure may be a sensible place to start when planning sample size for a multiple regression analysis, unless there are obvious theoretical reasons not to

not). Meta-analytic studies may be of help when estimating the required population parameters; however, in many domains of research, meta-analytic studies have not yet been conducted or the construct of interest may differ from those previously examined.

The final method is presented here more as a warning than a recommendation. This method is based on the commonly recommended approach of sample size planning based on parameter estimates obtained from pilot studies. Pilot studies are sometimes undertaken when literature reviews provide little or no information about the population parameter(s) necessary for sample size planning. However, a potential problem with pilot studies is that these small-scale investigations may yield parameter estimates that do not closely correspond with the parameter values they represent. Thus, basing Equations 2 and 3 on param-

8 A caution is warranted when estimating the p(p + 1)/2 zero-order correlation coefficients, as it is feasible to estimate an impossible set of correlations. If an impossible set is estimated, the multiple correlation coefficient can be greater than one. If this were to occur, adjustments to RXX and/or ρYX must be made, such that a realistic set of parameter values can be used for estimating N and NM.
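The caution in Footnote 8 can be checked numerically before any sample size is planned. The sketch below is only an illustration, not the authors' code: it computes the squared multiple correlation as ρYX′ RXX⁻¹ ρYX and treats a value above one as the sign of an impossible set. The function names (`solve`, `multiple_r2`, `exchangeable`) and the second, deliberately impossible, set of correlations are assumptions introduced here.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def multiple_r2(rxx, ryx):
    """Squared multiple correlation: rho_YX' Rxx^{-1} rho_YX."""
    beta = solve(rxx, ryx)  # standardized regression coefficients
    return sum(b * r for b, r in zip(beta, ryx))


def exchangeable(p, rho_xx, rho_yx):
    """Build Rxx and rho_YX for an exchangeable structure with p predictors."""
    rxx = [[1.0 if i == j else rho_xx for j in range(p)] for i in range(p)]
    return rxx, [rho_yx] * p


# Exchangeable structure with five predictors, rho_XX = .40 and rho_YX = .30.
rxx, ryx = exchangeable(5, 0.40, 0.30)
r2 = multiple_r2(rxx, ryx)

# Hypothetical impossible set: two predictors correlated .90 with each other
# but correlated .90 and .10 with the criterion.
r2_bad = multiple_r2([[1.0, 0.9], [0.9, 1.0]], [0.9, 0.1])
impossible = r2_bad > 1.0
```

When `impossible` is true, the hypothesized correlations would need to be adjusted before they are used for planning.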

eter estimates obtained from pilot studies may yield inappropriate estimates of the required sample size if the obtained estimates do not closely approximate their corresponding parameter values.

When planning an appropriate sample size, regardless of whether it is for an application of PA or AIPE, it is typically unrealistic to proceed as if the values of the necessary population parameters are known exactly. Given that, a researcher who uses methods of sample size planning should conduct a sensitivity analysis. A sensitivity analysis involves calculating appropriate sample sizes using a range of realistic values of the necessary population parameters. In the context of the present article, a researcher would specify likely values of R² and R²XXj in order to determine their effects on N and NM. For the values of N and NM computed with the various parameter values in the sensitivity analysis, the most appropriate estimate of sample size is chosen given what is deemed to be the most appropriate input parameter values. It is also advantageous to triangulate planned sample sizes from multiple methods, rather than focusing only on a single technique. A sensitivity analysis and multiple methods of obtaining estimates of sample size are suggested so that the researcher has a firm grasp on the nonlinear relationship between the required sample size and the unknown parameter values.

Although the particular value of w is arbitrary and depends only on the desired width for the confidence interval, researchers should keep in mind the likely range of βj when choosing w, even though the value of βj itself need not be known. Although there have been conventions established regarding the magnitude of particular effect sizes (e.g., Cohen’s, 1988, conventions for the standardized mean difference and the zero-order correlation coefficient), no such conventions have been established for standardized regression coefficients. For example, a medium standardized regression coefficient might be viewed as resulting from medium zero-order correlations. In reality, however, the population βj will depend greatly on the number of predictors, even when all zero-order correlations are medium. In such multiparameter situations, it becomes very difficult to develop a meaningful scale for small, medium, and large effect sizes.9

Even though effect size conventions do not exist for the relative size of the standardized regression coefficient, the likely value of βj is in the interval [−1, 1]. In the special case in which there is only one predictor, βj is literally the population correlation coefficient between the predictor and the criterion variable. However, if there is more than one predictor variable, the βjs are not confined to the interval [−1, 1], as they do not represent correlations. Thus, the choice of w is not necessarily obvious, in large part because of the interpretation of the standardized regression coefficient and its interrelatedness with the other predictors in the model. Not surprisingly, all other things being equal, the smaller the specified w, the larger the required sample size.

Example and Application of the Procedures

Suppose that a researcher is interested in performing an analysis using multiple regression. Further suppose that the researcher is interested in obtaining a precise estimate of a particular population standardized regression coefficient. In particular, rather than having an embarrassingly large confidence interval around the estimated βj of interest, the researcher decides that a confidence interval with an expected width of 0.20 will provide a sufficiently precise estimate of βj; thus, w is defined as 0.10. The researcher is also interested in calculating NM, such that there will be an 80% chance that the βj of interest will have a corresponding confidence interval that has a half-width no larger than the specified w of 0.10.

Suppose that after consulting past research and in line with theory, the researcher determines that an exchangeable correlational structure seems reasonable, and the five predictor variables that are to be used in the analysis are hypothesized to correlate with one another at .40. Further, suppose there is reason to believe that there is likely to be a medium effect, a correlation of .30, between each of the predictor variables and the criterion.

Following Equation 4, the R² can be shown to equal .17, and from Equation 5, the R²XXj predicting the jth regressor from the remaining p − 1 predictors equals .29. The researcher then solves for the estimated N by use of Equation 2, which yields a value of 453.98. When rounded to the next largest integer, the estimated N from Equation 2 provides the researcher with an estimated sample size of 454. Accordingly, if the

9 Cohen (1988) even acknowledged the difficulties and inconsistencies in conventions for effect size measures in the context of multiple regression. These inconsistencies are due to the interrelatedness of p, the multiple correlation coefficients, and the zero-order correlation coefficients (Cohen, 1988, p. 413; see also Maxwell, 2000, p. 438).
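The worked example and the recommended sensitivity analysis can be sketched in a few lines. Equations 2, 4, and 5 are not reproduced in this part of the article, so the sketch rests on two assumptions: that Equation 2 has the form N = (z/w)²(1 − R²)/(1 − R²XXj) + p + 1, and that R² and R²XXj for an exchangeable structure follow the standard closed forms for equicorrelated variables. Under those assumptions the code reproduces the reported 453.98; all function names are illustrative.

```python
from math import ceil
from statistics import NormalDist


def r2_exchangeable(p, rho_xx, rho_yx):
    """R^2 when all predictors intercorrelate rho_xx and each correlates rho_yx with Y."""
    return p * rho_yx ** 2 / (1 + (p - 1) * rho_xx)


def r2_xxj_exchangeable(p, rho_xx):
    """R^2 from regressing one predictor on the remaining p - 1 equicorrelated predictors."""
    return (p - 1) * rho_xx ** 2 / (1 + (p - 2) * rho_xx)


def aipe_n(w, r2, r2_xxj, p, alpha=0.05):
    """Assumed form of Equation 2: unrounded N for an expected half-width of w."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (z / w) ** 2 * (1 - r2) / (1 - r2_xxj) + p + 1


# Worked example: p = 5, rho_XX = .40, rho_YX = .30, w = 0.10.
p, w = 5, 0.10
n_raw = aipe_n(w, r2_exchangeable(p, 0.40, 0.30), r2_xxj_exchangeable(p, 0.40), p)
n_planned = ceil(n_raw)

# Sensitivity analysis over a hypothetical grid of plausible correlations.
sensitivity = {
    (rho_xx, rho_yx): ceil(
        aipe_n(w, r2_exchangeable(p, rho_xx, rho_yx), r2_xxj_exchangeable(p, rho_xx), p)
    )
    for rho_xx in (0.30, 0.40, 0.50)
    for rho_yx in (0.20, 0.30, 0.40)
}
```

Inspecting `sensitivity` shows how strongly the planned N reacts to each input, which is the point of the sensitivity analysis described above.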

input parameter values were correct, using a sample size of 454 will yield a confidence interval around βj that has an expected half-width of 0.10.

To compute NM, such that there is an 80% chance of obtaining a confidence interval for βj with a half-width no larger than 0.10, the researcher uses Equation 3. Implicit in Equation 3 for this example is the fact that the sample variance of β̂j is expected to be less than the parameter value 80% of the time. Because the obtained w will be less than the w specified if the variance of β̂j is smaller in the sample than the parameter value used to estimate sample size, the obtained w will be no greater than the specified w with a probability of .80.

The .80 quantile of the chi-square distribution with N − 1 degrees of freedom is 478.12. This critical chi-square value is then divided by N − p − 1, yielding a variance correction factor of 1.07. Following Equation 3, NM is estimated at 484.10 and after being rounded up to the next largest integer yields a value of 485. If the parameter values estimated by the researcher were correct, using an NM of 485 will provide the researcher with approximately an 80% chance of obtaining a w of 0.10 or less for the confidence interval around the beta weight of interest. Notice that sample size increases by only 31 (or 6.83%) when specifying 80% confidence that the obtained w would be less than the specified width. Typically NM is not considerably greater than N and should be considered for the added assurance it provides for a precise estimate with what generally amounts to a relatively small cost.

When the assumption of exchangeability does not hold, generally a different sample size will be estimated for each of the p predictors. In the following example, suppose a researcher hypothesizes the following population parameters for the RXX intercorrelation matrix and the ρYX vector, respectively:

$$R_{XX} = \begin{bmatrix} 1 & & \\ .40 & 1 & \\ .60 & .05 & 1 \end{bmatrix}, \qquad \rho_{YX} = \begin{bmatrix} .50 \\ .30 \\ .10 \end{bmatrix}.$$

Further suppose the desired half-width and alpha were set to 0.15 and .05, respectively. In this scenario, the planned sample sizes would be estimated as 237, 154, and 201 for Predictors 1, 2, and 3, respectively. Furthermore, if the researcher wanted to have 90% confidence that the obtained w would be less than or equal to 0.15, NM would be 268, 180, and 229 for Predictors 1, 2, and 3, respectively. Thus, when exchangeability does not hold, planning sample size for a specific predictor may provide expected ws narrower or wider than the specified value for the remaining p − 1 predictors, depending on the tolerance of the predictor for which sample size was calculated.

When interest lies in the w for a specific predictor, no problems arise regardless of whether the correlational structure is or is not exchangeable. Sample size is calculated for the specific predictor regardless of whether the tolerance for the predictor of interest is smaller or larger than any of the remaining p − 1 predictors. Under this strategy, researchers are concerned foremost with the width of the confidence interval for the beta of interest and less so for the remaining p − 1 predictors. For example, in the scenario in the previous paragraph, a researcher whose question pertains specifically to estimating the relationship between X3 and Y controlling for X1 and X2 should choose an N of 201 or an NM of 229.

Another strategy in situations in which exchangeability does not hold leads to the expected value of all of the confidence intervals being as narrow as or narrower than the specified w. In this approach the sample size used for the study is the largest of the p different sample sizes. Thus, the expected half-width for the predictor with the lowest tolerance is w, whereas the expected half-widths for the remaining p − 1 confidence intervals will be less than w; to what degree depends on the tolerance of the other predictors. For example, given NM values of 268, 180, and 229 for the three predictors, respectively, a researcher interested in a narrow confidence interval for each and every predictor should choose an NM of 268.

Power Analysis Versus Accuracy in Parameter Estimation

Estimating sample size from a PA perspective is conceptually different from estimating sample size to achieve AIPE. This conceptual difference can potentially translate into very different practical implications. This section considers the relative sample sizes required by the two approaches. Maxwell (2000) showed that sample size could be estimated for a given predictor to obtain a specified power using the following formula:

$$N = \left(\frac{\lambda}{\beta_j^2}\right)\left(\frac{1 - R^2}{1 - R^2_{XX_j}}\right) + p - 1, \tag{6}$$

where λ is a noncentrality parameter from an F distribution with 1 numerator and N − p − 1 denominator degrees of freedom. The λ value in Equation 6 is a tabled critical value that determines the power of a given statistical test for a predictor of interest. The required value of λ for a specified degree of power can be obtained from Cohen’s (1988, pp. 448–455) tables or from the appropriate noncentral F distribution.

The relative sample size required for AIPE versus PA can be compared by the following two multiplicative ratios found in Equations 2 and 6, respectively:

$$\left(\frac{z_{(1-\alpha/2)}}{w}\right)^2 \quad \text{versus} \quad \left(\frac{\lambda}{\beta_j^2}\right).$$

Unless p is very large, the ratio of required sample size for AIPE compared with PA is approximately (z(1−α/2)βj)²/(λw²) to 1. Note that the population standardized regression coefficient is the only one of the four values beyond the researcher’s control. Whereas α, λ, and w are chosen to coincide with the goals of the research project, the PA approach requires that the parameter value or the minimally important value of the standardized regression coefficient be specified. Note that a value for the standardized regression coefficient is not necessary when planning sample size for precision. For this reason, planning sample size from the AIPE perspective is actually easier than approaching sample size planning from the PA perspective.

Unless p is very large, sample size for PA is approximately

$$N = M_{PA}\left(\frac{1 - R^2}{1 - R^2_{XX_j}}\right), \tag{7}$$

where MPA = λ/βj², which is the multiplier used for the PA approach. Similarly, sample size for AIPE is approximately

$$N = M_{AIPE}\left(\frac{1 - R^2}{1 - R^2_{XX_j}}\right), \tag{8}$$

where MAIPE = (z(1−α/2)/w)², which is the multiplier used in the AIPE approach. Figure 2 depicts the re-

Figure 2. Relationship of the relative planned sample size for the accuracy in parameter estimation (AIPE) and the power analytic (PA) approaches to sample size planning as a function of the population beta weight (approximate sample size in the special case when R² = R²XXj).
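The two multipliers in Equations 7 and 8 can be compared directly. The following is a rough sketch rather than the procedure used for Figure 2: instead of reading λ from Cohen's (1988) tables or a noncentral F distribution, it substitutes the large-sample approximation λ ≈ (z(1−α/2) + z(power))² for a test with one numerator degree of freedom. That substitution, the function names, and the chosen values of βj are assumptions made for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def lambda_approx(power, alpha=0.05):
    """Large-sample approximation to the noncentrality parameter for a 1-df test."""
    return (_STD_NORMAL.inv_cdf(1 - alpha / 2) + _STD_NORMAL.inv_cdf(power)) ** 2


def m_pa(beta_j, power, alpha=0.05):
    """Multiplier for the power analytic approach (Equation 7): lambda / beta_j^2."""
    return lambda_approx(power, alpha) / beta_j ** 2


def m_aipe(w, alpha=0.05):
    """Multiplier for the AIPE approach (Equation 8): (z / w)^2."""
    return (_STD_NORMAL.inv_cdf(1 - alpha / 2) / w) ** 2


# Precision (w = 0.10) relative to power (.80) when beta_j is fairly large.
aipe_over_pa = m_aipe(0.10) / m_pa(0.30, 0.80)

# Power (.80) relative to precision (w = 0.10) when beta_j is small.
pa_over_aipe = m_pa(0.08, 0.80) / m_aipe(0.10)

# Halving the half-width multiplies the AIPE multiplier by four.
halving_factor = m_aipe(0.05) / m_aipe(0.10)
```

Because both multipliers are scaled by the same ratio (1 − R²)/(1 − R²XXj), these comparisons do not depend on the values of R² and R²XXj.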
SAMPLE SIZE AND ACCURACY IN PARAMETER ESTIMATION 315

lationship of the multipliers for PA and AIPE for population betas for various values of power and precision (α = .05). As Equations 7 and 8 show, multiplying the corresponding value on the ordinate for either power or precision in Figure 2 by the ratio (1 − R²)/(1 − R²XXj) yields an approximate sample size. More generally, the relative elevation of a curve or line represents the relative sample size required to achieve a desired level of power or precision.

Several practical implications emerge from Figure 2. First, as the curves and lines show, as the population βj becomes larger, sample size for power can be much smaller than it is for precision. Conversely, when the βj is small, sample size for power can be much larger than is required for precision. For example, when the βj equals 0.30, the sample size required to obtain a confidence interval with an expected half-width of 0.10 is just over 4 times as large as the sample size needed to obtain a power of .80. However, when βj is 0.08, the sample size needed for a power of .80 is more than 3 times larger than that needed to obtain a confidence interval with an expected half-width of 0.10. Note that these relationships hold true regardless of the values of R² and R²XXj, as both of these values play the same role in Equations 7 and 8. Second, for constant values of R² and R²XXj, sample size for precision is independent of the value of βj, whereas smaller samples can provide adequate power for larger values of βj. Third, implicit in Equation 8 and as depicted in Figure 2, halving the width of a confidence interval for βj requires approximately a fourfold increase in sample size. Fourth, in the special case in which R² is equal to R²XXj (that is, (1 − R²)/(1 − R²XXj) = 1.00), the values on the ordinate based on the curve for power and the line for precision are approximately the required sample sizes.

Thus, it is clear that the two methods are different from the outset and can yield very different estimates of sample size in the same study. Each is designed to answer a different question, and as can be seen, they do just that. The two approaches differ on a philosophical level, one designed to achieve a narrow interval and one designed to obtain an interval that does not contain the specified null value. The point is that depending on what the researcher’s question is and the desired outcome, a different approach to sample size estimation will be needed. Neither approach is necessarily “right” or “wrong” for a given problem; these approaches are merely different in the questions that they attempt to answer. It is recommended that the two approaches be used in conjunction with one another in order to achieve reasonable statistical power while obtaining confidence intervals that are sufficiently narrow.

Random Versus Fixed Predictors and the Issue of Standardization

In the present article it was assumed that the predictor variables were random and that all variables were standardized. Standardized values were discussed exclusively because correlations tend to be easier to hypothesize and work with than variances and covariances, which would be necessary to carry out AIPE in the unstandardized case. Standardized regression coefficients are also beneficial because of the arbitrariness of most scales of measurement used in the behavioral sciences. Furthermore, a widely used convention of the magnitude of effect is available for correlations in psychology (Cohen, 1988, section 3.2). It should be clear, however, that if the hypothesized values are correct when finding N and NM for standardized values, they will provide the same relative degree of precision around the unstandardized regression coefficients. The relative degree of precision regarding w is scaled in terms of the ratio of the standard deviation of the criterion to the standard deviation of the jth predictor (sY/sXj).

With regard to random and fixed predictor values in the unstandardized case, Sampson (1974) showed that regardless of the predictors being fixed or random, “we obtain the same estimates for the regression coefficients and the variance of the error” (p. 684 from Theorem 1). There is, however, a difference between the two cases. Note that if R² = 0, then the distribution of R² is identical in both cases and follows a central F distribution. However, the distribution of R² is different for the two cases when R² ≠ 0 (Stuart, Ord, & Arnold, 1999, section 28.29). In fact, the distribution of R² is a noncentral F distribution in the case of fixed predictors, whereas it is not in the case of random predictors (Rencher, 2000, pp. 240–241). Accordingly, the distribution of the test statistic under the null hypothesis is the same for the fixed as well as the random X case, but the power functions for the test statistic are different for the two cases (Rencher, 2000, chapter 10). Gatsonis and Sampson (1989) showed that Cohen’s (1988) power tables for determining sample size are approximations, because Cohen treated random predictors as though they were fixed. However, Gatsonis and Sampson concluded that “Cohen’s approximation works quite well in many situations” (p. 519). Thus, practically speaking, random versus fixed X values have little effect on applied research because the consequences, in most cases, are trivial. The issue of standardization, however, is quite different, especially when standardization is performed on random predictor variables.

Even though multiple regression using standardized random predictors is common practice in behavioral research, as well as in many other fields, there are nuances associated with this strategy that are not widely known and are potentially problematic. As previously stated, the formula (see Equation 1) for the standard error of a regression coefficient that is random and standardized is approximate. The formula, as given explicitly in sources such as Cohen and Cohen (1983) and Harris (1985) and implicitly in many others, treats the standard deviation of each predictor as a constant value. This is obviously not the case when the predictors are random, as the standard deviation of the predictor is itself a random variable. This is contrasted with the situation in which the values of the predictor variables are preset in advance and thus the standard deviation of those predictors would not vary across replications of the study.

In order to transform an unstandardized regression coefficient to a standardized regression coefficient, one can multiply the raw score regression coefficient by sXj/sY, so as to remove the (generally arbitrary) scaling of Y and Xj. Likewise, this same procedure is commonly done in order to obtain the standard error of the standardized regression coefficient.10 However, “standard errors of standardized parameters, in general, are not a simple rescaling of the standard errors of the original parameter estimates” (Jamshidian & Bentler, 2000, p. 74). The problem with scaling the standard error of a standardized regression coefficient in the random predictor case can be seen by a well-known property of variances. If C is a constant and V is a random variable, Var(CV) = C²Var(V), where Var(·) represents the variance of the quantity in parentheses. However, if C̃ is itself a random variable, then Var(C̃V) ≠ C̃²Var(V). Common formulas for the standard error of standardized regression coefficients (e.g., Equation 1) assume that the standard deviation of the predictor is fixed. In the case of random predictor variables, such an assumption implies that Var(C̃V) = C̃²Var(V). Because this assumption is false, the variability of Xj is not taken into consideration when calculating the standard error of standardized regression coefficients from the random X case, which generally leads to incorrect standard errors.

In structural equation modeling (SEM), which can be viewed as a generalization of multiple regression, several authors have illustrated the potential problems of analyzing a correlation matrix as if it were a covariance matrix (e.g., Babakus, Ferguson, & Jöreskog, 1987; Browne, 1982; Cudeck, 1989; Jamshidian & Bentler, 2000). Steiger (2001) concluded that SEM parameter estimates based on a correlation matrix (analogous to standardized coefficients in multiple regression) may be correct, whereas their standard errors are incorrect (see also Lawley & Maxwell, 1971, chapter 7, for technical details). MacCallum and Austin (2000) stated that when a correlation matrix is analyzed as if it were a covariance matrix in SEM, “in all cases, standard errors of parameter estimates as well as confidence intervals and test statistics for parameter estimates will be incorrect,” and they further emphasized that the “correct standard errors will generally be smaller than the incorrect values which results in narrower confidence intervals and larger test statistics” (p. 217). For the reasons outlined in this section regarding the approximate nature of Equation 1, a simulation study was conducted to verify the integrity of the procedures suggested throughout the article.

Results of Monte Carlo Simulations

If Equation 1 were exact, the assumptions were met, and the multiple correlation coefficients were correctly specified, the sample size estimation procedures presented here would yield correct estimates of required sample size. However, whenever the values of the predictors are random and standardized, rather than being fixed, Equation 1 is an approximation. In applications of multiple regression to observational studies in the behavioral sciences, predictors are typically random, not fixed. Further, standardization often occurs in the behavioral sciences because of the in-

10 The reason that multiplying the standard error of the unstandardized regression coefficient by sXj/sY removes the scaling of the jth predictor can be seen by the formula for the standard error of the unstandardized regression coefficient: (sY/sXj) √{(1 − R²)/[(1 − R²XXj)(N − p − 1)]}. Multiplying this formula by sXj/sY removes the scaling of Y and Xj from the standard error and is commonly, yet inappropriately, assumed to be the correct standard error for the jth standardized regression coefficient when the predictor is random.
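The variance property invoked above can be illustrated with a small simulation. This is a toy sketch, not the regression setting itself: it assumes that the multiplier and V are independent normal variables with arbitrarily chosen means and standard deviations, and all names are hypothetical. It shows only that a random multiplier with mean C adds variability beyond C²Var(V).

```python
import random

random.seed(2003)   # fixed seed so the sketch is reproducible
N_DRAWS = 100_000


def variance(values):
    """Population variance computed directly from the draws."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


# V ~ Normal(mean 2, sd 1); the constant C equals 1.
v = [random.gauss(2.0, 1.0) for _ in range(N_DRAWS)]
c_constant = 1.0

# Random multiplier with the same mean as C but its own variability (sd 0.2).
c_random = [random.gauss(1.0, 0.2) for _ in range(N_DRAWS)]

var_constant_multiplier = variance([c_constant * x for x in v])
var_random_multiplier = variance([c * x for c, x in zip(c_random, v)])

# Exact variance of the product of two independent variables:
# E[C^2] E[V^2] - (E[C] E[V])^2 = (1 + 0.2^2)(2^2 + 1) - (1 * 2)^2.
var_product_exact = (1.0 + 0.2 ** 2) * (2.0 ** 2 + 1.0) - (1.0 * 2.0) ** 2
```

With a constant multiplier the simulated variance stays near C²Var(V) = 1, whereas the random multiplier pushes it toward the exact product variance, which is larger.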

terpretational problems associated with arbitrary scales of measurement. Under these circumstances, it was unclear whether basing planned sample size on Equation 2 would produce an interval with the desired width. In addition to ensuring that Equation 2 consistently yields accurate estimates of sample size, a Monte Carlo study was necessary because Equation 3 implicitly assumes Equation 2 is correct.

One scenario studied in the Monte Carlo simulation was the aforementioned exchangeable structure with five predictors and where ρXX = .40 and ρYX = .30. The simulation revealed that Equations 2 and 3 produced very accurate results in this situation. Recall that when w is specified as 0.10 for this scenario, Equation 2 dictates a necessary sample size of 454. The mean w for the five betas, each based on 10,000 replications, using a sample size of 454, was 0.101, with a standard deviation of 0.003; the median w was also 0.101. Recall that having an 80% chance of obtaining a w no larger than the specified value of 0.10 requires a necessary sample size of 485 based on Equation 3. The mean and the median confidence interval half-width using a sample size of 485 was 0.098, with a standard deviation of 0.003. Most important, 81.64% of the obtained ws were no larger than the specified value of 0.10. Further, the 80th percentile for the empirical distribution of the obtained ws was 0.10. In summarizing the results for this scenario, the suggested procedures yielded an original sample size such that the mean of the ws was 0.101 and a modified sample size that led to just over 80% of the confidence intervals being no larger than specified.

This example was selected because we thought it was reasonably typical of a behavioral research scenario. However, this single scenario cannot address the extent to which the approximation is accurate for other situations. To investigate the general accuracy of the procedures, we undertook a large Monte Carlo simulation study to address the appropriateness of Equation 2. In the simulation study 166 different conditions were examined. In the different conditions a variety of correlational structures were used. The ws were specified to be 0.025, 0.05, 0.10, 0.15, 0.20, 0.15, and 0.35, using ps of 2, 5, and 10. Presumably the simulations encompass the likely ranges of w and p that are commonly of interest to behavioral researchers, combined with a variety of correlation structures to show generality. Each condition in the simulation study was based on 10,000 replications. The results showed that the suggested procedures generally performed very well. Because of the large number of conditions that were studied, the tabled results could not be presented; however, detailed descriptions of the results follow.11

The mean, median, and standard deviation of the percentage of error were determined for each of the 166 conditions that were examined. The percentage of error was determined by subtracting the specified w from the mean of the obtained ws, dividing this difference by the specified w, and then multiplying by 100. For example, if the mean of the obtained ws was 0.204 when the specified w was 0.20, the percentage of error would be computed as follows: 100(0.204 − 0.20)/0.20 = 2.00. Thus, in this condition the mean of the obtained ws was 2.00% larger than the specified w.

In the simulation conditions in which p was 2, all combinations of small, medium, and large correlations among the predictors as well as the criterion (27 total) were completely crossed with ws of 0.05, 0.10, and 0.20. Thus, a total of 81 different conditions were examined for p = 2. The mean and median of the percentage of error were 0.33 and 0.17, respectively, with a standard deviation of 0.34. The minimum percentage of error was 0.01 for a case in which w was 0.05, and the maximum percentage of error was 1.85 for a case in which w was 0.20. Thus, in the worst case out of the 81 different conditions for p = 2, the mean of the obtained w was less than 0.01 units larger than expected.

In the case in which p was 5, the results are reported separately for two different types of correlational structures. In the first type of correlational structure, 25 different exchangeable structures were examined. In any single one of the 25 combinations, all predictors correlated equally among themselves and each correlated equally with the criterion variable. Correlations among predictors consisted of ρXX values of .10, .20, .30, .40, and .50. Correlations of the predictors with the criterion consisted of ρYX values of .10, .20, .30, .40, and .50. Thus, ρXX and ρYX each varied from small to large by .10 and yielded a 5 × 5 factorial design.

Two combinations of correlations are excluded

11 The complete set of simulation results is available in tabular format from Ken Kelley or Scott E. Maxwell. The code, which was written in R/S-PLUS, is also available on request. Note that the anonymous reviewers were provided with the simulation results as part of their assessment of our procedures.
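The two planned sample sizes recalled for the exchangeable scenario (454 and 485) can be recomputed with the standard library alone. The sketch assumes that Equation 3 multiplies the variance term of Equation 2 by the chi-square quantile divided by N − p − 1, as the article's description of the variance correction factor suggests, and it swaps in the Wilson-Hilferty approximation for the chi-square quantile, which is not the authors' method. The unrounded N of 453.98 is taken from the text; function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def chi2_quantile_wh(prob, df):
    """Wilson-Hilferty approximation to the chi-square quantile (an approximation only)."""
    z = NormalDist().inv_cdf(prob)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * sqrt(a)) ** 3


def modified_n(n_raw, p, gamma):
    """Assumed form of Equation 3, starting from the unrounded N of Equation 2."""
    n = ceil(n_raw)
    variance_term = n_raw - p - 1  # (z / w)^2 (1 - R^2) / (1 - R^2_XXj)
    correction = chi2_quantile_wh(gamma, n - 1) / (n - p - 1)
    return variance_term * correction + p + 1


p = 5
n_raw = 453.98                      # unrounded N reported for the example
chi2_80 = chi2_quantile_wh(0.80, ceil(n_raw) - 1)
nm_raw = modified_n(n_raw, p, 0.80)
nm_planned = ceil(nm_raw)
extra_participants = nm_planned - ceil(n_raw)
```

Under these assumptions the approximate quantile and the modified sample size agree with the reported 478.12 and 485 to the precision shown in the text.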

from the following descriptive statistics because their multiple correlations between the predictors and criterion are greater than .80 and not representative of most psychological research.12 The mean and median percentage of error for the remaining 23 ws were 1.87 and 1.03, respectively, with a standard deviation of 2.22. The minimum percentage of error was 0.22, and the maximum was 10.00. This worst case occurred when the correlations among the predictors were .10 and the correlations between the predictors and criterion were .40. This correlational structure is unlikely in most behavioral research because R = .76. However, even this condition had a mean w that was only 0.01 units larger than expected.

The other simulations that were conducted for p = 5 were based on two published correlational structures. The first was a subset of a correlation matrix obtained from the developmental literature (Smari, Petursdottir, & Porsteinsdottir, 2001), and the other was obtained from an example given in an SEM text (Table 7.1 in Loehlin, 1998). The mean and median of the absolute percentage of error for the 30 conditions (15 from each example) were 0.55 and 0.23, respectively, with a standard deviation of 0.76. The minimum of the absolute percentage of error was 0.01 in a condition in which w was 0.025, and the maximum was 2.75 in a condition in which w was 0.35. Thus, the worst condition in this situation produced a mean w of 0.36 when the specified w was 0.35.

For p = 10, the correlation matrix used was a subset of one obtained from the clinical–counseling literature that had previously been cited in an SEM text (Worland, Weeks, Janes, & Strock, 1984, as cited in Kline, 1998, p. 254). The mean and median of the percentage of error for the 30 conditions that were examined were 0.18 and 0.09, respectively, with a standard deviation of 0.19. The smallest absolute percentage of error was less than 0.01 for a case in which w was 0.05, and the largest percentage of error was 0.67 for a condition in which w was 0.20. Thus, the condition with the largest discrepancy had a percentage of error less than 1%.

Recall the cited SEM literature in which it has been shown that the standard errors of parameter estimates are generally inflated when a correlation matrix is treated as a covariance matrix. Because ordinary least squares (OLS) multiple regression is a special case of SEM, it follows that the standard errors of OLS multiple regression are often inflated when predictor variables are random and standardized. In 130 of the 166 conditions investigated (78.31%), the confidence interval coverage was greater than 95% (the nominal alpha was set to .05). The mean and median percentage of coverage were 95.53 and 95.24, respectively, with a standard deviation of 0.78. Whereas the smallest percentage of coverage was 94.34, the largest percentage of coverage was 97.89. Thus, the results of the simulations have shown empirically the approximate nature of Equation 1 and the fact that OLS multiple regression tends to have inflated standard errors when predictor variables are random and have been standardized.13

The fact that Equation 1 is approximate and generally provides confidence intervals wider than necessary raises some questions regarding its use as well as the use of Equations 2 and 3 in the context of sample size planning for precise estimates of standardized regression coefficients. For example, in the case in which the largest confidence discrepancy occurred, 97.89% of the computed confidence intervals bracketed the population parameter. Applying Equation 1 to this condition (w = 0.10, ρYX1 = .50, ρYX2 = .10, ρX1X2 = .10, p = 2), we found that the population correlations would suggest that the standard error was 0.051. A simulation based on 1,000,000 replications showed that, consistent with the SEM literature, the standard deviation of the regression coef-

12 The two excluded cases consisted of unlikely scenarios for much behavioral research. The first excluded scenario consisted of correlations among the predictors of .10 and correlations between the predictor and the criterion of .50. Such a combination of correlations leads to an R of .95 and where the requirement of a positive definite correlation matrix is nearly violated. In this case the mean w was 0.151 when it was specified to be 0.10. Poor performance of the technique in this particular scenario is not surprising, given that many statistical procedures fail when parameters approach their theoretical bounds. The second excluded case is similar to the first and consisted of correlations among the predictors of .20 and correlations between the predictor and the criterion of .50. This combination of correlations leads to an R of .83. In this second excluded scenario, the mean w was 0.112 when it was specified to be 0.10.

13 Many behavioral scientists would see no problem with an empirical alpha smaller than the nominal alpha level and thus with being more conservative. However, a toxicologist or bioscientist working with chemical agents or medicine would likely argue that a Type II error may be more costly than a Type I error, as concluding that there is “no effect” of a noxious substance could be a harmful mistake. Further, power and precision will be sacrificed if the actual Type I error rate is smaller than the nominal alpha level.
SAMPLE SIZE AND ACCURACY IN PARAMETER ESTIMATION 319

ficients was 0.044, a value smaller than implied by precision, sample size estimates become inflated as R2
Equation 1. This result suggests that the sample size approaches zero. The opposite pattern of results oc-
calculated from Equation 2, which assumes the stan- curs when R2 begins to approach one. In this case the
dard error from Equation 1 is correct, is approximate proportion of variance unaccounted for is, on average,
and in this particular case somewhat negatively bi- larger in the sample than is implied by R2. Conse-
ased. Unfortunately, no exact formula for the standard quently, the use of Equation 2 or Equation 3 will tend
error is known to exist when predictors are random to underestimate sample size.
and standardized. Thus, given the current state of The same phenomenon happens in the denominator
knowledge, researchers need to continue to use Equa- with R2XXj as it does in the numerator with R2; the only
tion 1 for forming confidence intervals around regres- difference is that the relationship is the exact opposite.
sion coefficients for predictors that are random and Because R2XXj is in the denominator of Equation 2, the
standardized. Equations 2 and 3 can then be used in sample size is over- or underestimated in a reverse
the research design phase in order to determine ap- fashion as was illustrated for R2.
proximate sample sizes for precise estimates of the For simplicity, the discussion has been limited to
regression coefficients of interest. regression models that include only main effects and
no interaction or other higher order (polynomial)
terms, as there are certain nuances associated with
Limitations of the Procedure
multiplicative terms that have been scaled in multiple
Although the distribution of R2 is asymptotically regression models (see chapter 3 of Aiken & West,
normal throughout most of its domain (Stuart et al., 1991, for details regarding multiplicative effects in
1999, section 28.33), this is not the case as R2 ap- multiple regression). Furthermore, the procedures
proaches its limits. When R2 begins to approach zero, given here assume that all predictors are included in
the distribution of the observed R2 values becomes the regression model and that no selection of predic-
positively skewed because of the lower bound at zero. tors occurs (as would be the case in, e.g., a stepwise
The converse is true as R2 begins to approach one, and regression analysis).
thus the distribution of the observed R2 values will be
negatively skewed.
Discussion
The fact that the distribution of R2 becomes nega- Approaching sample size estimation from a per-
tively or positively skewed affects sample size esti- spective of AIPE rather than one exclusively empha-
mation in two ways. Recall from Equations 2 and 3 sizing power is beneficial for a productive science.
that there are two multiple correlations in the equa- Although planning sample size through PA studies is
tions for determining sample size, the model R2 in the important and undeniably improves research findings,
numerator and R2XXj in the denominator. As R2 ap- the accuracy in those parameter estimates should be at
proaches zero in the population, the estimated sample least as much of a concern as their probability value,
size for a planned study based on Equation 2 or Equa- perhaps even more so. An optimal experimental de-
tion 3 will, with everything else held constant, tend to sign consists of an adequate sample size from an
be larger than necessary. One way to understand why AIPE perspective as well as an adequate sample size
overestimation occurs is to inspect Equation 1. On the from the PA perspective. Ensuring that sample size is
basis of this equation, a confidence interval becomes adequate from both perspectives leads to parameter
narrower as 1 − R2 becomes smaller. As R2 ap- estimates that will likely be accurate as well as sta-
proaches zero and thus the distribution of R2 becomes tistically significant.
more positively skewed, the mean R2 tends to be A special case in which precision is especially im-
greater than R2, implying that the mean 1 − R2 tends portant occurs when the goal is to provide evidence in
to be less than 1 − R2. Accordingly, the observed support of the null hypothesis. If a confidence interval
confidence intervals will tend to be narrower than is sufficiently narrow and power is of sufficient
expected based on the value of R2. The estimated strength (say, power > .90), at times it may be appro-
sample size from Equation 2 or Equation 3 is a func- priate to show support for the null hypothesis, in the
tion of R2; thus, confidence intervals based on sample sense that the value of the parameter is not meaning-
size estimates from these equations will tend to be fully different from the null value. Note that this is not
narrower than specified when the model R2 ap- “accepting the null hypothesis” but is merely showing
proaches zero. In other words, for a desired degree of support for it (Greenwald, 1975).
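To make the planning step concrete, the following minimal sketch revisits the condition described earlier (w = 0.10, predictor-criterion correlations of .50 and .10, predictor intercorrelation of .10, p = 2). Equations 1 and 2 are not reproduced in this part of the article, so the forms coded here are assumptions: a standard error of sqrt((1 − R²) / ((1 − R²_XXj)(N − p − 1))) for a standardized coefficient, and N = (z/w)² (1 − R²)/(1 − R²_XXj) + p + 1, with w treated as the confidence interval half-width. The function names are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def r2_two_predictors(r_y1, r_y2, r_12):
    """Population R^2 when a criterion is regressed on two standardized predictors."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)


def aipe_n(w, r2, r2_xxj, p, conf=0.95):
    """Assumed form of Equation 2: sample size for an expected half-width of w."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # standard normal critical value
    return math.ceil((z / w) ** 2 * (1 - r2) / (1 - r2_xxj) + p + 1)


def se_beta(n, r2, r2_xxj, p):
    """Assumed form of the Equation 1 standard error for a standardized coefficient."""
    return math.sqrt((1 - r2) / ((1 - r2_xxj) * (n - p - 1)))


# Condition from the text: rho_YX1 = .50, rho_YX2 = .10, rho_X1X2 = .10, p = 2, w = 0.10
r2 = r2_two_predictors(0.50, 0.10, 0.10)
r2_xxj = 0.10**2  # with two predictors, R^2_XXj is the squared predictor intercorrelation
n = aipe_n(0.10, r2, r2_xxj, p=2)
se = se_beta(n, r2, r2_xxj, p=2)
print(n, round(se, 3))  # prints: 294 0.051
```

Under these assumed forms, the planned sample size is 294 and the implied standard error rounds to 0.051, which agrees with the value reported for this condition. The simulated standard deviation of 0.044 cannot be recovered from these formulas, which is the point of the caution about random, standardized predictors.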

The simulation study showed that the procedures presented here were effective in accomplishing their respective goals. The mean and median of the observed ws were very close to their specified values when the estimated N (Equation 2) was used to select sample size. When using N, researchers are reminded that this provides the necessary sample size such that the expected half-width of the confidence interval is, on average, the specified width. However, this does not ensure that the particular observed w will be the specified width in any given sample. The modified sample size (Equation 3) takes into consideration the variability of the standard error of β̂ and adjusts the sample size accordingly, such that one can be approximately 100(1 − γ) percent confident that the width around a particular β_j will have a corresponding w that is no larger than the specified w.

A caution is given because of the problems that can arise when using standardized variables from random X values in the context of multiple regression. Although there are numerous reasons to use standardized values as input into multiple regression models, and thus make use of their corresponding estimates for interpretational reasons, the standard errors of such estimates are generally not exact. Even though the simulations show that the common method of standardizing random predictors produces confidence intervals for standardized regression coefficients that are generally wider than they should be, the sample size procedures we present typically produce the desired degree of precision.

In conclusion, the AIPE procedures presented here are applicable to researchers working within the framework of OLS multiple regression who want to determine sample size a priori in order to obtain accurate parameter estimates. Given reasonably accurate input parameters, use of these procedures provides researchers with confidence intervals around regression coefficients whose expected widths are the values specified or, alternatively, with some degree of probabilistic assurance. As with all sample size planning, the AIPE procedures will be less accurate to the extent that the input parameters deviate from their true values. However, the problem with the choice of input parameters should not be used as a reason to avoid sample size planning. In addition, we have shown that planning sample size for precise estimates of standardized regression coefficients requires less a priori knowledge (i.e., fewer input parameters) than the corresponding planning necessary to obtain sufficient statistical power. We believe that obtaining accurate parameter estimates, not merely statistically significant ones, leads to a more productive science and yields research findings that are more beneficial to a given area of inquiry.

References

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: Sage.

Algina, J., & Olejnik, S. (2000). Determining sample size for accurate estimation of the squared multiple correlation coefficient. Multivariate Behavioral Research, 35, 119–137.

Babakus, E., Ferguson, C. E., & Jöreskog, K. G. (1987). The sensitivity of confirmatory maximum likelihood factor analysis to violations of measurement scale and distribution assumptions. Journal of Marketing Research, 24, 222–228.

Browne, M. W. (1982). Covariance structures. In D. M. Hawkings (Ed.), Topics in applied multivariate analysis (pp. 72–141). New York: Cambridge University Press.

Carlin, B. P., & Louis, T. A. (1996). Bayes and empirical Bayes methods for data analysis. New York: Chapman & Hall.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.

Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cudeck, R. (1989). Analysis of correlation matrices using covariance structure models. Psychological Bulletin, 105, 317–327.

Darlington, R. B. (1990). Regression and linear models. New York: McGraw-Hill.

Gatsonis, C., & Sampson, A. R. (1989). Multiple correlation: Exact power and sample size calculations. Psychological Bulletin, 106, 516–524.

Green, B. F. (1977). Parameter sensitivity in multivariate methods. Multivariate Behavioral Research, 12, 263–288.

Green, S. B. (1991). How many subjects does it take to do a regression analysis? Multivariate Behavioral Research, 26, 499–510.

Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1–20.

Hahn, G. J., & Meeker, W. Q. (1991). Statistical intervals: A guide for practitioners. New York: Wiley.

Harris, R. J. (1985). A primer of multivariate statistics (2nd ed.). New York: Academic Press.

Hellmann, J. J., & Fowler, G. W. (1999). Bias, precision, and accuracy of four measures of species richness. Ecological Applications, 9, 824–834.

Jamshidian, M., & Bentler, P. M. (2000). Improved standard errors of standardized parameters in covariance structure models: Implications for construct explication. In R. D. Goffin & E. Helmes (Eds.), Problems and solutions in human assessment (pp. 73–94). Dordrecht, the Netherlands: Kluwer Academic.

Kline, R. B. (1998). Principles and practice of structural equation modeling. New York: Guilford Press.

Lawley, D. N., & Maxwell, A. E. (1971). Factor analysis as a statistical method (2nd ed.). London: Butterworth.

Loehlin, J. C. (1998). Latent variable models: An introduction to factor, path, and structural analysis (3rd ed.). Mahwah, NJ: Erlbaum.

MacCallum, R. C., & Austin, J. T. (2000). Applications of structural equation modeling in psychological research. Annual Review of Psychology, 51, 201–226.

Maxwell, S. E. (2000). Sample size and multiple regression analysis. Psychological Methods, 5, 434–458.

Mendoza, J. L., & Stafford, K. L. (2001). Confidence intervals, power calculation, and sample size estimation for the squared multiple correlation coefficient under the fixed and random regression models: A computer program and useful standard tables. Educational and Psychological Measurement, 61, 650–667.

Muller, K. E., & Benignus, V. A. (1992). Increasing scientific power with statistical power. Neurotoxicology and Teratology, 14, 211–219.

Raju, N. S., Bilgic, R., Edwards, J. E., & Fleer, P. F. (1999). Accuracy of population validity and cross-validity estimation: An empirical comparison of formula-based, traditional empirical, and equal weights procedures. Applied Psychological Measurement, 23, 99–115.

Rencher, A. C. (2000). Linear models in statistics. New York: Wiley.

Rossi, J. C. (1990). Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology, 58, 646–656.

Rozeboom, W. W. (1966). Foundations of the theory of prediction. Homewood, IL: Dorsey Press.

Sampson, A. R. (1974). A tale of two regressions. Journal of the American Statistical Association, 69, 682–689.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316.

Smari, J., Petursdottir, G., & Porsteinsdottir, V. (2001). Social anxiety and depression in adolescents in relation to perceived competence and situational appraisal. Journal of Adolescence, 24, 199–207.

Steiger, J. H. (2001). Driving fast in reverse: The relationship between software development, theory, and education in structural equation modeling. Journal of the American Statistical Association, 96, 331–338.

Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical methods. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 221–257). Mahwah, NJ: Erlbaum.

Stuart, A., Ord, J. K., & Arnold, S. (1999). Kendall’s advanced theory of statistics (Vol. 2A, 6th ed.). New York: Oxford University Press.

Thompson, B. (Ed.). (2001). Confidence intervals around effect sizes [Special issue]. Educational and Psychological Measurement, 61(4).

Wainer, H. (1976). Estimating coefficients in linear models: It don’t make no nevermind. Psychological Bulletin, 83, 213–217.

Wilkinson, L., & the American Psychological Association Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.

Received December 11, 2001
Revision received March 18, 2003
Accepted April 23, 2003
