Classic parametric statistical significance tests, such as analysis of variance and least squares regression, are widely used by researchers in many disciplines, including psychology. For classic parametric tests to produce accurate results, the assumptions underlying them (e.g., normality and homoscedasticity) must be satisfied. These assumptions are rarely met when analyzing real data. The use of classic parametric methods with violated assumptions can result in the inaccurate computation of p values, effect sizes, and confidence intervals. This may lead to substantive errors in the interpretation of data. Many modern robust statistical methods alleviate the problems inherent in using parametric methods with violated assumptions, yet modern methods are rarely used by researchers. The authors examine why this is the case, arguing that most researchers are unaware of the serious limitations of classic methods and are unfamiliar with modern alternatives. A range of modern robust and rank-based significance tests suitable for analyzing a wide range of designs is introduced. Practical advice on conducting modern analyses using software such as SPSS, SAS, and R is provided. The authors conclude by discussing robust effect size indices.

Keywords: robust statistics, nonparametric statistics, effect size, significance testing, software

Author note: David M. Erceg-Hurn, School of Psychology, University of Western Australia, Crawley, Western Australia, Australia; Vikki M. Mirosevich, Department of Health, Government of Western Australia. We thank Kale Dyer and Rand Wilcox for their helpful feedback on drafts of this article. Thanks also to Amy Lampard and Jake England for their support and encouragement. Correspondence concerning this article should be addressed to David M. Erceg-Hurn, School of Psychology, M304, University of Western Australia, 35 Stirling Highway, Crawley, Western Australia 6009, Australia. E-mail: david.erceg-hurn@grs.uwa.edu.au

Null hypothesis significance testing is the workhorse of research in many disciplines, including medicine, education, ecology, economics, sociology, and psychology. A recent study of 10 leading international psychology journals found that null hypothesis significance testing was used in 97% of articles (Cumming et al., 2007). The most widely used null hypothesis tests are classic parametric procedures, such as Student's t, analysis of variance (ANOVA), and ordinary least squares regression. For classic parametric tests to produce accurate results, the assumptions underlying them must be sufficiently satisfied. However, these assumptions are rarely met when analyzing real data. The use of classic parametric tests when assumptions are violated can lead to the inaccurate calculation of p values. This can result in an increased risk of falsely rejecting the null hypothesis (i.e., concluding that real effects exist when they do not). In contrast, power to detect genuine effects is often substantially reduced. An additional problem is that common measures of effect size (e.g., Cohen's d) and confidence intervals may be inaccurately estimated when classic parametric assumptions (e.g., normality) are violated. The miscomputation of p values, coupled with the inaccurate estimation of effect sizes and confidence intervals, can lead to substantive errors in the interpretation of data. Several prominent statisticians and researchers have described the use of classic parametric statistics in the face of assumption violations as invalid (e.g., Keselman et al., 1998; Leech & Onwuegbuzie, 2002; Wilcox, 2001; Zimmerman, 1998).

Modern robust statistical procedures exist that can solve the problems inherent in using classic parametric methods when assumptions are violated. Many modern statistical procedures are easily conducted with widely used software such as SPSS, SAS, and R. Despite the advantages of modern methods and the ease with which these procedures can be conducted, they are rarely used by researchers. In the first part of this article, we examine why this is the case. We argue that most researchers are unaware of the limitations of classic methods and do not realize that modern alternatives exist. In the second half of the article, we provide a practical, nontechnical introduction to some modern methods.

Problems With Classic Parametric Methods

Classic parametric methods are based on certain assumptions. One important assumption is that the data being analyzed are normally distributed. In practice, this assumption is rarely met. Micceri (1989) examined 440 large data sets from the psychological and educational literature, including a wide range of ability and aptitude measures (e.g., math and reading tests) and psychometric measures (e.g., scales measuring personality, anxiety, anger, satisfaction, locus of control). None of the data were normally distributed, and few distributions remotely resembled the normal curve. Instead, the distributions were frequently multimodal, skewed, and heavy tailed. Micceri's study indicated that real data are more likely to resemble an exponential
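Micceri's finding that real data are typically heavy tailed is not a cosmetic issue, because even a modest fraction of heavy-tailed observations can drain the power of a classic test. The following Monte Carlo sketch in Python illustrates the point; the sample size, effect size, and mixture weights below are invented for illustration and are not a reproduction of any published example.

```python
import math
import random

def one_sample_t(sample, mu0=0.0):
    """Return the one-sample t statistic for H0: mean = mu0."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (mean - mu0) / math.sqrt(var / n)

def power(draw, n=20, reps=4000, crit=2.093):
    """Estimate power as the proportion of |t| values exceeding the
    two-sided .05 critical value for df = 19 (2.093)."""
    hits = 0
    for _ in range(reps):
        sample = [draw() for _ in range(n)]
        if abs(one_sample_t(sample)) > crit:
            hits += 1
    return hits / reps

random.seed(1)
# True mean is 0.8 in both conditions; only the tail weight differs.
normal = lambda: random.gauss(0.8, 1.0)
# Mixed ("contaminated") normal: 10% of observations come from a
# distribution with ten times the standard deviation -- heavy tails.
mixed = lambda: random.gauss(0.8, 10.0 if random.random() < 0.1 else 1.0)

p_normal, p_mixed = power(normal), power(mixed)
print(f"power under normality:  {p_normal:.2f}")
print(f"power with heavy tails: {p_mixed:.2f}")
```

Both populations have the same true mean, yet the estimated power drops sharply in the mixed-normal condition.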
may be much higher (or lower) than the obtained p value. Consider performing an ANOVA on data when variances and sample sizes are unequal. An ANOVA is run in SPSS or SAS, and the p value reported is .05. This should mean that if the null hypothesis is rejected, there is less than a 5% chance that a Type I error has been made. However, the true risk of committing a Type I error may be closer to 30% (Wilcox, Charlin, & Thompson, 1986). Contrary to popular belief, equal sample sizes offer little protection against inflated Type I error when variances are heterogeneous (Harwell, Rubinstein, Hayes, & Olds, 1992). Conducting a regression analysis with violated assumptions can also lead to inflated rates of Type I error. The probability of a Type I error when testing at an alpha rate of .05 can exceed 50% when data are nonnormal and heteroscedastic (Wilcox, 2003). Researchers need to be aware that the p values reported by statistical packages such as SPSS may be extremely inaccurate if the data being analyzed are nonnormal and/or heteroscedastic. The inaccuracy may lead researchers to unwittingly make Type I errors.

An additional problem is that the power of classic parametric tests can be substantially lowered when the assumptions of normality or homoscedasticity are violated. See Wilcox (1998) for an example in which only a small departure from normality reduces the power of the t test from .96 to .28. Wilcox (1998) summarized the effect on power of violating the normality and homoscedasticity assumptions as follows:

    As hundreds of articles in statistical journals have pointed out and for reasons summarized in several books . . . arbitrarily small departures from normality can result in low power; even when distributions are normal, heteroscedasticity can seriously lower the power of standard ANOVA and regression methods. The practical result is that in applied work, many nonsignificant results

Most researchers do not realize that modern robust statistical methods exist. This is largely due to lack of exposure to these methods. For example, the psychology statistics curriculum, journal articles, popular textbooks, and software are dominated by statistics developed before the 1960s. This problem is not limited to psychology but also exists in many other disciplines (e.g., medicine, ecology). The field of statistics has progressed markedly since 1960, yet most researchers rely on outdated methods. We are not trying to blame researchers for not being familiar with modern statistical methods. Researchers are busy in their own areas of expertise and cannot be expected to be familiar with cutting-edge developments within statistics. At the same time, it is essential that researchers are made aware of important developments within statistics that have the potential to improve research in domains such as psychology.

Assumption Testing Issues

Another reason why modern methods are underused is that researchers frequently fail to check whether the data they are analyzing meet the assumptions underlying classic parametric tests (Keselman et al., 1998). This may be due to forgetfulness or not knowing how to check assumptions. A related problem is that, due to low power, statistical assumption tests built into software such as SPSS often do a poor job of detecting violations from normality and homoscedasticity (Jaccard & Guilamo-Ramos, 2002). For example, Levene's test is often used to test the homoscedasticity assumption. A p value greater than .05 is usually taken as evidence that the assumption has not been violated. However, Levene's test can yield a p value greater than .05, even when variances are unequal to a degree that could significantly affect the results of a classic parametric test. This is particularly true when small samples are being
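The inflation of Type I error under variance heterogeneity discussed above can be checked with a short simulation. The sketch below (pure Python; the group sizes and standard deviations are invented for illustration, not taken from the studies cited) runs Student's pooled-variance t test on two groups with identical population means but unequal variances and unequal sample sizes.

```python
import math
import random

def pooled_t(a, b):
    """Student's two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    sp2 = (ssa + ssb) / (na + nb - 2)   # pooled variance
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

random.seed(2)
reps, rejections = 4000, 0
# Both groups share the same population mean (the null is true), but the
# smaller group has five times the standard deviation of the larger one.
for _ in range(reps):
    big = [random.gauss(0.0, 1.0) for _ in range(25)]
    small = [random.gauss(0.0, 5.0) for _ in range(5)]
    # 2.048 is the two-sided .05 critical value for df = 25 + 5 - 2 = 28.
    if abs(pooled_t(big, small)) > 2.048:
        rejections += 1

rate = rejections / reps
print(f"empirical Type I error rate: {rate:.2f} (nominal: .05)")
```

Because the null hypothesis is true in every replication, the printed rate should be close to .05; with the small group carrying the large variance, it is several times higher.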
Table 1
Original Scores, Ranks, and Relative Treatment Effects for Groups A, B, and C

                              Group A        Group B        Group C
Original scores               11 12 17 18    10 11 13 15    20 21 23 25
Corresponding rank scores     2.5 4 7 8      1 2.5 5 6      9 10 11 12
Sum of ranks                  21.50          14.50          42.00
Mean rank                     5.38           3.63           10.50
Relative treatment effect     0.41           0.26           0.83
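The quantities in Table 1 can be reproduced in a few lines of code. The sketch below pools the 12 scores, assigns mid-ranks (tied observations share the average of the ranks they span), and computes each group's relative treatment effect as (mean rank − 1/2) / N, the formulation that yields the values in the table.

```python
def midranks(values):
    """Rank all values, giving tied observations the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # average of ranks i+1 .. j+1
        i = j + 1
    return ranks

groups = {"A": [11, 12, 17, 18], "B": [10, 11, 13, 15], "C": [20, 21, 23, 25]}
pooled = [x for g in groups.values() for x in g]
ranks = midranks(pooled)
N = len(pooled)

results = {}
start = 0
for name, scores in groups.items():
    g_ranks = ranks[start:start + len(scores)]
    start += len(scores)
    mean_rank = sum(g_ranks) / len(g_ranks)
    rte = (mean_rank - 0.5) / N   # relative treatment effect
    results[name] = (sum(g_ranks), mean_rank, rte)
    print(f"Group {name}: sum of ranks {sum(g_ranks):.2f}, "
          f"mean rank {mean_rank:.3f}, relative effect {rte:.2f}")
```

Note that the two scores of 11 occupy ranks 2 and 3 in the pooled data, so each receives the mid-rank 2.5, exactly as in the table.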
be interpreted as indicating that the groups do not differ much (in terms of participants' typical response on the dependent variable). In contrast, large differences in relative treatment effects suggest that the groups differ significantly. Relative treatment effects for the current data set are shown in Table 1. Participants in Group C (p̂C = .83) tend to have higher scores on the dependent variable than do participants in Groups A (p̂A = .41) or B (p̂B = .26). The ATS can be used to determine whether these differences are statistically significant. In the present case, the null hypothesis that the groups have equivalent relative treatment effects is rejected, ATS(1.81, 7.42) = 11.38, p = .006. The values in parentheses (1.81 and 7.42) refer to the degrees of freedom for the test.

The ATS can analyze data from independent groups, repeated measures, and mixed designs. Brunner et al. (2002) developed macros that allow the easy generation of the ATS in SAS. Most of the macros are now also available for R. They can be downloaded from http://www.ams.med.uni-goettingen.de/de/sof/ld/makros.html. The Web page is written in German; however, the macros are clearly labeled and downloading them is straightforward. Brunner et al. (2002) provided instructions for using the SAS macros. Shah and Madden (2004) is another useful source of information, targeted at applied researchers unfamiliar with the ATS rather than statisticians. The article is accompanied by a comprehensive electronic supplement that illustrates how to use the SAS macros to generate and interpret the ATS and relative treatment effects. Wilcox (2003, 2005) also provided coverage of the ATS, which he referred to as the Brunner, Dette, and Munk (BDM) method. Wilcox discussed code for calculating the ATS using R. He also discussed how to follow up significant ATS main effects and interactions with multiple comparison procedures.

Other Rank-Based Methods

The ATS is just one of many modern robust rank-based methods. A prominent rank-based approach to ANOVA and regression is that of Hettmansperger and McKean (1998). Their approach is sometimes called Wilcoxon analysis (WA). WA evaluates hypotheses analogous to those assessed by classic parametric methods. For example, the null hypothesis tested in regression (i.e., no relationship between the predictors and the criterion variable; beta weights of zero) is exactly the same in WA as it is in ordinary least squares regression. The difference between the two procedures is that they use different methods to fit the regression line. In ordinary least squares regression, the regression line minimizes the sum of the squared residuals. A single outlier can substantially alter the slope of the regression line, reducing its fit to the data. In contrast, WA, which involves ranking residuals, minimizes the impact that extreme criterion (y-axis) scores have on the regression line. The result is that the WA line often provides a better fit to data than does the least squares line. It is important to note that although WA is robust to outliers in the y-space, it is not robust to extreme predictor (x-axis) values (neither is ordinary least squares regression). In such situations, a modified version of WA called weighted Wilcoxon techniques (WW) can be used. WW ensures that analyses are robust to outliers in both the x- and y-spaces. However, it is preferable to use WA in situations where there are no outliers in the x-space, as WA is more powerful than WW. See Hettmansperger and McKean (1998) and McKean (2004) for more details.

Free Web-based software known as RGLM can conduct WA via the Internet. RGLM is located at http://www.stat.wmich.edu/slab/RGLM. To conduct an analysis, a user uploads data to RGLM or enters data into a form. Analysis options are selected, and RGLM then conducts the analysis and outputs the results. The interface can perform WA simple and multiple regression, as well as WA alternatives to the single and paired samples t tests, one-way and factorial ANOVAs, and analysis of covariance. An appealing aspect of RGLM is that both WA and classic parametric analyses are reported side-by-side. This allows users to observe whether the two procedures produce equivalent or conflicting results. Abebe, Crimin, and McKean (2001) and Crimin, Abebe, and McKean (in press) provided a guide to conducting robust analyses using RGLM. It is important to note that the RGLM interface only conducts WA. Users wishing to conduct WW analyses should make use of the experimental site http://www.stat.wmich.edu/slab/HBR2. It is also possible to conduct WA and WW analyses using R. Terpstra and McKean (2005) provided instructions for carrying out WW analyses using R. R code for WA and WW is available for download from http://www.jstatsoft.org/v14/i07 and http://www.stat.wmich.edu/mckean/HMC/Rcode.