
Tutorials in Quantitative Methods for Psychology
2007, Vol. 3(2), p. 28-34.

Statistical power: An historical introduction

Jean Descôteaux
Université de Sherbrooke

Despite funding agencies' growing demands for power analyses, we believe researchers are still not fully aware of the concept of statistical power, of the possible benefits of power analysis in the planning phase, and of the ways to increase the chances of significantly detecting a given effect in their study. The following review falls within this area of interest. We discuss the history of the concept of statistical power, the reasons for its ongoing neglect, its potential benefits to researchers, as well as practical ways to improve statistical power. We also touch upon the impact of power analysis on the scientific literature.

The concept of statistical power is not new. It was formulated in the 1930s by Jerzy Neyman, a Moldavian who later immigrated to the United States, and Egon S. Pearson, the son of Karl Pearson, the British statistician who introduced the famous r (Neyman & Pearson, 1928, 1933). While considered promising, the concept never gained much popularity. Its application in the planning of scientific research was strongly opposed by Sir Ronald Fisher (also from Great Britain), an influential figure in the field of statistics at the time, which might explain why statistical power remained relatively unknown until Jacob Cohen (USA) brought it back to light in the early 1960s. Interest in statistical power was revived partly thanks to Cohen's article (1962), in which he described the insufficient statistical power of studies published in the 1960 volume of the Journal of Abnormal and Social Psychology. In his review, Cohen concluded that the reported studies had, on average, less than a 1 in 2 probability (a 48% chance) of obtaining statistical confirmation of an actual medium effect.

Since the publication of Cohen's analysis, the concept of statistical power has gained ground, as evidenced by the sharp increase in the number of references to Cohen's popular book (1969, 1977) in the scientific literature, from 4 instances in 1971 to 214 in 1987 (Sedlmeier & Gigerenzer, 1989). However, despite this apparent growing awareness, the insufficiency of statistical power described by Cohen in 1962 persists to this day. Rossi (1990) applied Cohen's 1962 method to reports published in 1982 in the Journal of Abnormal Psychology, the Journal of Consulting and Clinical Psychology, and the Journal of Personality and Social Psychology, and found that these studies had, on average, at most a 57% probability of obtaining statistical confirmation of a medium effect, comparable to the 48% reported by Cohen in 1962. For their part, Sedlmeier and Gigerenzer (1989) used the same method to analyze reports published in 1984 in the Journal of Abnormal Psychology and showed that these studies had, on average, no more than a 37% chance of detecting an actual medium effect. More recently, Bezeau and Graves (2001), Clark-Carter (1997), Kosciulek and Szymanski (1993), and Mone, Mueller, and Mauland (1996) reported a similar lack of power in work published in areas as diverse as clinical neuropsychology, articles in the British Journal of Psychology, rehabilitation counseling research, and management. The only exception to the rule was provided by Maddock and Rossi (2001), who showed that research published in 1997 in three health-related journals (Health Psychology, Addictive Behaviors, and Journal of Studies on Alcohol) had adequate power to detect large and medium effects.

In general, these latest results point to a profound paradox. It is hard to explain why, despite the great importance that reviewers attribute to the significance of test results and despite the ever-growing difficulty of obtaining research funding, researchers seem content with experimental protocols that yield inconclusive results in 1 case out of 2. Would they rather spend time and money to little avail than actually plan their research to ensure sufficient statistical power (a power of .80, for example)? Admittedly, the concept has gained ground.
The American Psychological Association does its part in popularizing the idea, urging researchers, for instance, to include at least some index of effect size in their results sections (APA, 2001). Journals such as the Journal of Consulting and Clinical Psychology also contribute by specifically instructing authors to report effect sizes for primary study findings, as well as confidence intervals for them (Instructions to Authors, Journal of Consulting and Clinical Psychology, 2007). So, if today's researchers are more alert to statistical power, it seems they remain unconvinced of its possible benefits in the planning phase and are not fully aware of the ways to increase the chances of significantly detecting a given effect in their study. The following review falls within this area of interest. We discuss the concept of statistical power, the reasons for its ongoing neglect, its potential benefits to researchers, as well as practical ways to improve statistical power. We also touch upon the impact of power analysis on the scientific literature.

Statistical power: definition and application

Simply put, the power of a statistical test is the probability that the test will yield statistically significant results, given the existence of an actual effect. While seemingly simple, this definition conceals a mathematical complexity that often discourages the uninitiated. Statistical power is determined by several criteria: the sample size (N), the effect size of the observed phenomenon (e.g. d) and the applied level of statistical significance (α). The mathematical relationship between these four elements allows any one of them to be quantified as a function of the other three (Cohen, 1988).

While adding a bit of complexity, these interrelations allow for flexibility in applying the statistical power concept. For example, the calculation of power as a function of the other parameters is particularly useful in the research-planning phase (a priori) or to quantify the power of a completed test (a posteriori). One of the most common applications of the concept is to compute the sample size needed to detect an actual effect with high probability. Another is to determine the alpha level (α) from the other established parameters; this application is less frequent, for reasons some of which are explained later. Finally, the effect size can be calculated as a function of the other elements of the formula, that is, α, N, and power; this application is also relatively rare (Cascio & Zedeck, 1983).

Instead of giving a didactic example for each of these applications, the section entitled Empirical example below describes a typical research-planning sequence. It covers several applications of statistical power and therefore offers a comprehensive illustration of the various aspects. Before detailing the research-planning protocol, let us consider the general sequence and define the concepts used.

Concepts related to statistical power

Statistical power in the research-planning phase

In order to maximize the power of a test, Cascio and Zedeck (1983) recommend the following sequence in the research-planning phase (a priori application; a minimal code sketch of the sequence follows the list):
1. Determine the minimum effect size that would be considered useful or significant.
2. Determine the appropriate sample size based on the desired power, the selected effect size and the given alpha level.
3. For a fixed sample size that proves insufficient to achieve the desired power given the other parameters, adjust the alpha level while weighing the relative impact of type I and type II errors.
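
To make the sequence concrete, here is a minimal sketch in Python (ours, not part of the original article). It approximates the power of a two-tailed test as Φ(d√N − z₁₋α/₂); the article itself works from Howell's (1998) tables, so results may differ slightly. The function names are illustrative.

```python
from math import ceil, erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def z_quantile(p):
    """Standard normal quantile, found by bisection (no external libraries needed)."""
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def power(d, n, alpha=0.05):
    """Approximate power of a two-tailed test for effect size d and sample size n."""
    return phi(d * sqrt(n) - z_quantile(1.0 - alpha / 2.0))

def sample_size(d, target_power=0.80, alpha=0.05):
    """Step 2: smallest n reaching the target power for effect size d at level alpha."""
    delta = z_quantile(1.0 - alpha / 2.0) + z_quantile(target_power)  # noncentrality
    return ceil((delta / d) ** 2)

# Step 1: judge the minimum useful effect, e.g. d = .50 (a "medium" effect).
# Step 2: sample_size(0.50)  -> about 32 participants for power .80 at alpha = .05.
# Step 3: if that n is out of reach, relax alpha instead, e.g. power(0.50, 25, alpha=0.10).
```
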
But first, let us review some of the concepts involved.

Significance levels α (alpha) and β (beta)

In the section entitled Statistical power: definition and application, we said that, for a given effect size (e.g. d) and a given sample size (N), the significance level α quantifies the power of a test, and vice versa. For this vice versa to fully apply, it is imperative to treat the α level as variable. Most researchers take it for granted that the significance level must be set at α = .05; but why not at .08, .10 or .025?

The conventional α = .05 is widely believed to have been established, more or less arbitrarily, by Sir Ronald Fisher (Sedlmeier & Gigerenzer, 1989; Ryan, 1985; Cohen, 1990). Indeed, Fisher considered that deviations from the normal greater than two standard deviations (which roughly corresponds to α = .05 for two-tailed tests) should be judged significant (Fisher, 1925). Later, Fisher stated that he personally preferred to set a low significance level and to reject results that did not meet this criterion (Fisher, 1926). In this context, the word prefer is of great importance: this preference has since been questioned and challenged by many accomplished statisticians, who described the unconditional use of α = .05 as an almost religious extreme (Cascio & Zedeck, 1983), as sacred (Skipper, Guenther, & Nass, 1967), as an arbitrary unreasonable tyranny (Cohen, 1990), or as decreed by tradition and reviewers (Tabachnick & Fidell, 2001).

This debate is all the more pertinent today because current statistical theory differs from Fisher's teaching. As stated by Sedlmeier and Gigerenzer (1989), the current theory is a hybrid of the approaches developed by Fisher, on the one hand, and by Neyman and Pearson, on the other (see below). Fisher's preference for α = .05 has therefore been transposed to a different context and, however well intentioned at the outset, no longer corresponds to current reality. The following paragraphs explain the controversy between the Fisher and Neyman-Pearson approaches.

In Fisher's time, and largely on his recommendation, only the null hypothesis (H0: the hypothesis we formulate in the hope of rejecting it) was specified and tested (Fisher, 1935, 1966). Such a test consists of computing the value of a statistic (for example, t or F) from the results, under the assumption that the null hypothesis is true. This value is then compared to a critical value computed for a given probability level α. If the statistic exceeds the critical value, the results are presumed not to have arisen by chance, and H0 is rejected. If H0 is rejected when it is in fact true, a type I error has been committed (rejecting H0 when H0 is true).

However, working in Fisher's shadow, Neyman and Pearson had already developed the concepts of the alternative hypothesis (H1) and of the type II error (failing to reject H0 when H1 is in fact true). In other words, the alternative hypothesis H1 is the statement in favor of which the null hypothesis may be rejected. These concepts make it possible to compute the probability of committing a type II error, denoted β: the probability that the alternative hypothesis will be rejected in favor of the null hypothesis when H1 is in fact true, that is, the probability that a true effect (H1) will be attributed to chance alone and hence judged false. The next step is to compute (1 − β), the probability that an effect will be found true when it is in fact true. Therefore, (1 − β) represents the power of a test to detect an effect that actually exists.

While of great value, the Neyman-Pearson theory did not achieve the desired impact, perhaps owing to the strong opposition of Fisher, who described those interested in the concepts of type II error and statistical power as Russians trained for technological efficiency rather than statistical inference (Fisher, 1955). This battle of opinions, which passed relatively unnoticed in North America, had far-reaching consequences. Its main outcome was a new hybrid theory (as described by Sedlmeier & Gigerenzer, 1989), the one taught in human sciences programs nowadays, endorsed neither by Fisher's followers nor by Neyman-Pearson supporters. The hybrid theory states that only the null hypothesis must be tested, as recommended by Fisher, but it also recognizes the importance of the type II error, as suggested by Neyman and Pearson, thereby allowing different values of α to be set before gathering the data (some reports use α = .10). However, the hybrid theory treats the type II error and statistical power as matters of purely academic interest, since their calculation calls for an alternative hypothesis (H1) that is not part of the theory.

The hybrid theory could be improved by considering the β level, and thus statistical power, in order to determine the α level (which, of course, implies that its value can vary). In fact, many statisticians (e.g. Cascio & Zedeck, 1983) recommend evaluating beforehand the relative impact of type I and type II errors at the desired power (note that this approach is not unanimously accepted; see Ryan, 1985). For example, suppose a researcher wants to maintain a typical statistical power of .80; by definition, β = .20. If this researcher chooses the traditional α = .05, then β/α = .20/.05 = 4, which means that a type I error is considered four times as serious and harmful as a type II error (reasonable). Consider another example: a researcher believes that a lower α yields a better test and chooses α = .001. According to Cohen (1988), such a strict α level is typically associated with very low statistical power, say power = .10. Then β = 1 − power = 1 − .10 = .90, and β/α = .90/.001 = 900. In other words, according to this researcher, a type I error would be 900 times as critical and harmful as a type II error. With few exceptions, such a relative weighting of errors seems quite unreasonable.
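
These two weightings reduce to one line of arithmetic each; a quick check (illustrative only):

```python
# Relative weight of a type I vs a type II error, expressed as the beta/alpha ratio.
power, alpha = 0.80, 0.05
beta = 1 - power                     # .20
print(round(beta / alpha, 2))        # 4.0 -> a type I error weighted 4 times a type II error

power, alpha = 0.10, 0.001           # a very strict alpha, bought at the price of weak power
print(round((1 - power) / alpha))    # 900 -> an implausible 900:1 weighting
```
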
The values of the α and β levels may therefore have a significant impact on the conclusions of statistical tests (type I and type II errors). Their respective importance is weighed against a quantity that, by definition, is closely tied to the data under study: the effect size, which we discuss next.

Effect size of the observed phenomenon

In 1988, Cohen stated that effect size is the least known concept related to statistical inference. He attributed this relative obscurity to the historical split between Fisher's testing philosophy and that of Neyman and Pearson (1928, 1933). Fisher's test procedures include no explicitly defined alternative hypothesis, which makes it impossible to calculate the probability β and, consequently, the statistical power of a test. To do so, we must formulate H1, which implies a certain degree of presence of the effect in the population and/or a certain degree of falsity of the null hypothesis. That degree is precisely the effect size.

More specifically, when comparing two populations (i.e. between-group tests), the null hypothesis usually takes the following form: the difference between the measured parameters of the two populations is zero, or μ₂ − μ₁ = 0. Therefore, if the null hypothesis is true, the effect size has to be zero. Conversely, if the null hypothesis is false, then μ₂ − μ₁ ≠ 0, which amounts to recognizing the existence of a difference between the means of the two populations, i.e. μ₂ − μ₁ = x, where x is the effect size or, in other words, the degree of falsity of the null hypothesis (H0). The higher the value of x, the farther the null hypothesis is from the truth. Note that specifying x is, as per Neyman and Pearson, equivalent to specifying H0: μ₂ − μ₁ = 0 and H1: μ₂ − μ₁ = x.

The equation μ₂ − μ₁ = x defines x in terms of the unit scale used (e.g. seconds, IQ points, etc.). So, if we want to use power charts or to compare the results of several tests, the effect size must be expressed as a dimensionless number. Depending on the test, the effect size may be expressed as d (difference between two means), r (correlation between two variables), f (ANOVA) or any other index suited to the specific test (see Cohen, 1992). As an example, the formula for d is:

d = (μ₂ − μ₁) / σ

Dividing the difference between the means, measured on a given scale, by the standard deviation expressed in the same units makes the effect size d independent of the scale used. The same holds for the other effect size indices (e.g. r, f, etc.).
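
For instance (an illustrative computation of ours, not the article's): on an IQ scale, means of 104 and 100 with σ = 15 give the same d as means of 10.4 and 10.0 with σ = 1.5.

```python
# d is scale-free: rescaling the means and the standard deviation leaves it unchanged.
def cohen_d(mu2, mu1, sigma):
    return (mu2 - mu1) / sigma

print(round(cohen_d(104, 100, 15), 2))      # 0.27
print(round(cohen_d(10.4, 10.0, 1.5), 2))   # 0.27 -- identical, by construction
```
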

Since d is defined as the difference between two means divided by a standard deviation, it is easily computed once the sample data have been collected. However, as mentioned earlier, the most common application of the statistical power concept is in the research-planning phase, when sample means are not yet available. In this case, we must find other ways to obtain a realistic estimate of the effect size (including the minimum useful or significant effect specified in the sequence by Cascio & Zedeck, 1983). Cohen (1988) suggests two approaches: 1) calculate the effect size from previous work in a similar area (e.g. meta-analyses); or 2) if such data are not available, use personal judgment, theoretical principles, or any combination thereof to estimate the plausible effect size for the study. As an alternative to the first approach, many funding agencies now encourage researchers to carry out pilot studies in order to test the feasibility of the research protocol and to obtain preliminary effect sizes with the very measures intended for the main study. It nevertheless remains a good idea to base preliminary effect size estimates on prior work; papers by Levine (1997) and Thalheimer and Cook (2002) provide interesting introductions to the subject. Under the second approach, the researcher states the anticipated effect size in terms of the conventional values of small, medium, and large. In the human sciences, an effect size is considered medium if it is perceptible to the naked eye of an attentive observer; small if it is clearly smaller than the medium effect yet not trivial; and large if it exceeds the medium effect by about as much as the medium effect exceeds the small one (see Cohen, 1992). Mathematically speaking, effect sizes are defined as small if d = .20, medium if d = .50, and large if d = .80.

Before we return to the research-design sequence described by Cascio and Zedeck (1983), one more point must be addressed: the relationship between power and sample size.

Sample size

The most frequent application of power analysis is to compute the minimum sample size (N) required to test an effect of the estimated size with a desired power and a known α level. Generally, a larger sample reduces the variability of sample statistics (means, correlations, etc.); in other words, it reduces error variance and therefore increases the likelihood of detecting an effect of the specified (or larger) magnitude. In statistical terms, it reduces the probability β and thereby increases statistical power (1 − β), as the sketch below illustrates.

Unless major constraints make it impossible, sample size should be the first criterion adjusted in order to increase power. However, Tabachnick and Fidell (2001, p. 35) caution researchers against excessively large samples (N), which may make the statistical power of a test too strong: the null hypothesis is then almost certain to be rejected, and the test may detect effects that are too small to be of any substantive significance. In a way, the fact that journals now insist on effect size estimates being reported alongside statistically significant results should temper the impact of such findings.
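
As an illustration of both points, the relationship and the caveat, here is a short sketch (ours), again approximating power as Φ(d√N − z₁₋α/₂), with the two-tailed critical value z = 1.96 for α = .05:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Power rises with N for a medium effect (d = .50) at alpha = .05.
for n in (10, 20, 30, 50, 100):
    print(n, round(phi(0.50 * sqrt(n) - 1.96), 2))   # .35, .61, .78, .94, 1.0

# Tabachnick and Fidell's caution: with a huge sample, even a trivial
# effect (d = .05) is detected almost surely.
print(round(phi(0.05 * sqrt(10000) - 1.96), 3))      # 0.999
```
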

Empirical example

To illustrate the research-design sequence of Cascio and Zedeck (1983) and its related concepts, here is an example, purely fictional, that reflects recurring concerns in clinical research.

Suppose a researcher wants to investigate the effectiveness of a short-term (14-week) psychodynamic therapy in the treatment of minor depressive disorder (introduced for further study in the DSM-IV and DSM-IV-TR; American Psychiatric Association, 1994, 2000). In her study, she chooses the Beck Depression Inventory II (BDI-II; Beck, Steer, & Brown, 1996) to compare patients' scores at the end of the psychotherapy treatment (posttest) with their scores at the beginning (pretest).

For this project, the researcher adopts the planning sequence suggested by Cascio and Zedeck (1983). The priority is thus to define the minimum effect size that would be judged useful or important. She reads on the subject and concludes that, based on the little data available, individuals classified as suffering from minor depression have an average initial BDI-II score of about 19, with a standard deviation of 8. She also learns that in a nonclinical population this BDI-II score is usually around 8, with a standard deviation of 6 (these last figures, however, will not enter the calculation). Finally, she concludes that the initial BDI-II score is a weak predictor of the posttest score in the treatment of severe depression, the correlation between the two scores being merely .20.

Suppose now that the researcher considers that a final mean BDI-II score of 15, compared with the initial score of 19, would indicate that the treatment is somewhat effective and worth pursuing; on the other hand, an improvement of less than 4 points by the end of the treatment would lead her to abandon this treatment method and to reconsider its pertinence.

Since the pretest and posttest data come from a repeated-measures design and are therefore not independent (for further details, see Howell, 1998, section 8.5), the researcher computes the effect size as follows:

d = (μ₁ − μ₂) / σ_{X₁−X₂} = (μ₁ − μ₂) / (σ √(2(1 − ρ)))

where σ_{X₁−X₂} is the standard deviation of the difference scores and ρ the pretest-posttest correlation. Since the posttest standard deviation is not available, she estimates that it should roughly correspond to the pretest value. Hence:

d = (19 − 15) / (8 √(2(1 − .20))) = .40

Further, being familiar with power analysis, the researcher believes it wise to maintain a minimum test power of .80. Moreover, she knows that using the standard α level of .05 in her tests will improve her chances of publication. In Howell's table (1998, p. 762) she finds the corresponding δ (the so-called noncentrality parameter), which in this case is 2.80. She substitutes these values into the formula

δ = d √N, hence N = δ² / d² = 2.80² / .40² = 49.

The calculation shows that she needs to recruit 49 participants to have an 8-in-10 chance of obtaining a significant result at α = .05 if the treatment yields an effect size of at least .40.

Suppose now that she wishes to investigate the effect of her treatment on patients who present comorbid personality disorders (e.g., borderline, histrionic, and narcissistic) and that she has access to a maximum of 30 participants for her study. Given the preliminary nature of the study, there is no indication that the treatment will be effective with such patients. In addition, the presence of comorbid personality disorders is likely to make the group more heterogeneous than one might conclude from initial BDI-II scores alone. Consequently, instead of an α/β ratio of 1 to 4, the researcher opts for a ratio of 1 to 2. If she is determined to maintain a power of .80, then β = .20 and α = .10, which corresponds to δ = 2.50. Redoing the earlier calculation with these new criteria shows that the number of participants should now be 39 (strictly 39.06, so 40 when rounded upward). The researcher is thus about 10 patients short.

Under the circumstances, the researcher has several options. First, she may postpone her research. Second, she may team up with colleagues working in the same field in order to increase the number of participants available for inclusion in the study. Third, she may proceed with the study if she accepts a statistical power below .80 (in the design just described, with N = 30 and α = .10, power will be around .70); in this case, she could also try to limit her sample to a particular subgroup in order to reduce within-group variance. Finally, she may opt for a one-tailed t test; note, however, that one-tailed tests are not unanimously accepted and should be used sparingly and with great caution.
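
The arithmetic of this example is easy to verify. The sketch below (ours, using the normal approximation rather than Howell's tables, so the powers are approximate) reproduces the figures above:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Effect size for the repeated-measures design, rounded to .40 as in the text.
rho = 0.20
d = round((19 - 15) / (8 * sqrt(2 * (1 - rho))), 2)
print(d)                                  # 0.4

# Required N from the noncentrality parameter: N = (delta / d) ** 2.
print(round((2.80 / d) ** 2))             # 49 (power .80, alpha = .05)
print(round((2.50 / d) ** 2))             # 39 (power .80, alpha = .10; 39.06 before rounding)

# Power actually available with the 30 patients on hand at alpha = .10
# (two-tailed critical z = 1.645): about .70, as stated above.
print(round(phi(d * sqrt(30) - 1.645), 2))   # 0.71
```
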
Consequences of power analysis

Statistical significance vs. design quality in reviewers' decisions

In 1982, Atkinson, Furlong and Wampold concluded that a study report submitted for publication was most likely to be rejected unless at least some of its major results were significant at the traditionally accepted levels of p < .05 or p < .01 (see also Sedlmeier & Gigerenzer, 1989). Perhaps owing to the impact of this study, the statistical significance of findings has become something of a censorship gauge (self-censorship on the part of authors or censorship on the part of publishers, who knows? For an interesting perspective on this question, see Reysen, 2006). It remains difficult to say whether the situation has improved since 1982. Judging from the bits and pieces of information found in various published sources (e.g., DeVaney, 2001), reviewers' position seems to have improved somewhat since the 1982 article by Atkinson et al., but unfortunately not by much.

Without dwelling on this subject, we would like to point out that the stance promoted by reviewers and editors, as described by Atkinson et al., contradicts the third step of the research-design sequence of Cascio and Zedeck (1983; see the planning sequence above). Reviewers' position, just like the old Fisherian α = .05, does not allow α to be treated as a variable. We share the views expressed by Sedlmeier and Gigerenzer (1989) and believe that promoting the concept of statistical power could shake the foundations of this status quo.

One of the benefits of the statistical power concept is that it offers a rigorous, rational approach that justifies the use of a variable α. Another advantage is that it encourages researchers to define beforehand the minimum effect size that would be judged useful or significant. According to Cohen (1994), careful attention to effect size leads to a reconsideration of error variance (a smaller error variance being desirable), which should in turn improve experimental designs.

Overestimation of effect size in the literature

Reviewers' bias toward studies reporting significant results has yet another consequence: studies with larger effect sizes are much more likely to be accepted for publication, since they report significant results more often. Lane and Dunlap (1978) accordingly point out that in studies conducted in a low-statistical-power context, reported effect sizes may be much higher than they are in reality. They explain that the low α levels (e.g. α = .01) used in reports may result in distorted and artificially inflated effect sizes. The authors conclude that the general trend of publishing significant results only cannot coexist with adequate estimation of effect sizes from the literature (in particular, with regard to meta-analyses). Consequently, they recommend accepting for publication all experiments that relate to important concepts and have a well-structured design.

While this issue has been addressed in many publications, the work by Lane and Dunlap (1978) stands out because it spotlights a paradox that persists to this day: the gap between the use of meta-analytic results to compute power, on the one hand, and publication requirements, on the other. To resolve the issue, the literature must mirror the real world more accurately. To achieve that, we must question the conventionally established α level and promote studies of higher statistical power designed around the concept of a minimum effect size.

Conclusion

To sum up, it seems that statistical power and its derived concepts have gained ground compared with the situation that prevailed several decades ago. However, this new awareness exists mostly in theory, since the power of recent studies does not seem to differ much from the statistical power of the studies reviewed by Cohen in his 1962 article. Given the positive impact that the use of statistical power could have on the scientific literature in general and on research planning in particular, we hope that its popularity will move from theory to practice. More specifically, we should go back to the sources and reconceptualize the type II error in terms of the minimum effect size (i.e. the minimum effect judged useful or significant). Moreover, we should stop regarding the α level as a constant (i.e. α = .05), as some magic limit that defines the truth. The α level should instead be considered a variable whose value is determined by a realistic α/β ratio. The statistical power of tests should likewise be maintained above a minimum, so that nonsignificant results can be considered meaningful: in the context of powerful tests, nonsignificant results are in fact very pertinent, since they suggest that the obtained effect sizes are relatively trivial (very small) or negligible (below the minimum effect judged useful or significant). Finally, reviewers should readily accept significant and nonsignificant results alike for publication, to ensure that the reality depicted in the literature actually corresponds to the reality we know.

References

American Psychiatric Association (1994). Diagnostic and statistical manual of mental disorders (4th Ed.). Washington, DC: American Psychiatric Association.
American Psychiatric Association (2000). Diagnostic and statistical manual of mental disorders (4th Ed., Text Revision). Washington, DC: American Psychiatric Association.
American Psychological Association (2001). Publication manual of the American Psychological Association (5th Ed.). Washington, DC: Author.
Atkinson, D. R., Furlong, M. J., & Wampold, B. E. (1982). Statistical significance, reviewer evaluations, and the scientific process: Is there a (statistically) significant relationship? Journal of Counseling Psychology, 29, 189-194.
Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Beck Depression Inventory manual (2nd Ed.). San Antonio, TX: Psychological Corporation.
Bezeau, S., & Graves, R. (2001). Statistical power and effect sizes of clinical neuropsychology research. Journal of Clinical and Experimental Neuropsychology, 23, 399-406.
Cascio, W. F., & Zedeck, S. (1983). Opening a new window in rational research planning: Adjust alpha to maximize statistical power. Personnel Psychology, 36, 517-526.
Clark-Carter, D. (1997). The account taken of statistical power in research published in the British Journal of Psychology. British Journal of Psychology, 88, 71-83.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. San Diego, CA: Academic Press.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (Rev. Ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd Ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Denton, F. T. (1990). The effects of publication selection on test probabilities and estimator distributions. Risk Analysis, 10, 131-136.
DeVaney, T. A. (2001). Statistical significance, effect size, and replication: What do the journals say? Journal of Experimental Education, 69, 310-320.
Fagley, N. S. (1985). Applied statistical power analysis and the interpretation of nonsignificant results by research consumers. Journal of Counseling Psychology, 32, 391-396.
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, Scotland: Oliver & Boyd.
Fisher, R. A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture, 33, 503-513.
Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society, Series B, 17, 69-78.
Fisher, R. A. (1966). The design of experiments (8th Ed.). Edinburgh, Scotland: Oliver & Boyd. (1st edition published in 1935)
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989). The empire of chance. Cambridge: Cambridge University Press.
Howell, D. C. (1998). Méthodes statistiques en sciences humaines. Paris: De Boeck Université.
Instructions to Authors, Journal of Consulting and Clinical Psychology (2007). Retrieved September 23, 2007, from http://www.apa.org/journals/ccp/submission.html
Kosciulek, J. F., & Szymanski, E. M. (1993). Statistical power analysis of rehabilitation counseling research. Rehabilitation Counseling Bulletin, 36, 212-219.
Lane, D. M., & Dunlap, W. P. (1978). Estimating effect size: Bias resulting from the significance criterion in editorial decisions. British Journal of Mathematical and Statistical Psychology, 31, 107-112.
Levine, J. (1997). Overcoming feelings of powerlessness in aging researchers: A primer on statistical power in analysis of variance designs. Psychology and Aging, 12, 84-106.
Maddock, J. E., & Rossi, J. S. (2001). Statistical power of articles published in three health psychology-related journals. Health Psychology, 20, 76-78.
Mone, M. A., Mueller, G. C., & Mauland, W. (1996). The perceptions and usage of statistical power in applied psychology and management research. Personnel Psychology, 49, 103-120.
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika, 20A, 175-240, 263-294.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, 231, 289-337.
Pollard, P., & Richardson, J. T. E. (1987). On the probability of making type I errors. Psychological Bulletin, 102, 159-163.
Reysen, S. (2006). Publication of nonsignificant results: A survey of psychologists' opinions. Psychological Reports, 98, 169-175.
Rossi, J. S. (1990). Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology, 58, 646-656.
Ryan, T. A. (1985). Ensemble-adjusted p values: How are they to be weighted? Psychological Bulletin, 97, 521-526.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.
Skipper, J. K., Guenther, A. L., & Nass, G. (1967). The sacredness of .05: A note concerning the uses of statistical levels of significance in social science. The American Sociologist, 2, 16-18.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th Ed.). Boston, MA: Allyn and Bacon.
Thalheimer, W., & Cook, S. (2002). How to calculate effect sizes from published research articles: A simplified methodology. Retrieved January 30, 2007, from http://work-learning.com/effect_sizes.htm

Manuscript received September 26th, 2006
