The paper presents and illustrates two areas of widespread abuse of statistics in social science research. The first is the use of techniques based on random sampling with cases that are not random and often not even samples. The second is that, even where the use of such techniques meets the assumptions for their use, researchers are almost universally reporting the results incorrectly. Significance tests and confidence intervals cannot answer the kinds of analytical questions most researchers want to answer. Once their reporting is corrected, the use of these techniques will almost certainly cease completely. There is nothing to replace them with, but there is no pressing need to replace them anyway. As this paper illustrates, removing the erroneous elements in the analysis is usually improvement enough, enabling readers to judge claims more fairly. Without them, it is hoped that analysts will focus rather more on the meaning and limitations of their numeric results.
The term 'statistics' is an ambiguous one. It emerged from the collation and use of figures concerning the nation state from the 17th century onwards in the UK, and subsequently in the US and elsewhere (Porter, 1986). Such figures involved relatively simple analyses, and 'political arithmetic' was largely used to lay bare inefficiencies, inequalities and injustice (Gorard, 2012). However, more recently and for many commentators the term has come to mean a set of techniques derived from sampling theory, and/or the products of those techniques. It is the abuse of such techniques that is the subject of this new paper. These techniques include the use of standard errors, confidence intervals and significance tests (both explicitly and disguised within more complex statistical modelling). They are supposedly used to help analysts to decide whether something that is found to be true of the sample achieved in a piece of research is also likely to be true of the known population from which that sample was drawn. All of these statistical techniques, including confidence intervals, are based on a modified form of the argument modus tollendo tollens. In formal logic, the argument of denying the consequent is as follows:

If A is true then B is true
B is not true
Therefore, A is not true also

This is a perfectly valid argument, and the conclusion must be true, as long as the premising statements are all definitive. If B is not true then it is certain that A is not true. However, as soon as tentativeness or probability enters, the argument fails:

If A is true then B is probably true also
B may not be true
Therefore, A may not be true also

This is not really a valid argument, and the truth of the conclusion is contingent on many factors beyond pure logic. Characteristic A may or may not be true. If it is true, characteristic B could be true as well, or not. The observation that B may (or may not) be true says almost precisely nothing about the truth of A. The first premise may be likened to the null hypothesis in statistical
analysis, and the second to the evidence from the achieved research sample. The probabilistic argument is now contingent upon the frequency with which A and B are true together in reality, and on the accuracy of the research finding about the likelihood of B being true. Knowing both of these facts, it would be possible to draw a probabilistic conclusion about the likelihood of A being true (the desired research conclusion). But in reality, neither of these facts would be known. In fact, the main supposed objective of the analysis would be to help decide on the accuracy of the research finding that B may not be true. The analysis assumes from the outset something about that which it is supposed to be assessing. The misunderstanding caused by this assumption is widespread.

Using sampling theory techniques in inappropriate contexts
However, the most obvious abuse of sampling theory techniques is their use in situations for which they were not designed and for which they ought not to be used. Data for a population cannot have a standard error, by definition. The standard error is defined as the standard deviation of a random sampling distribution, of samples drawn repeatedly from a population. It is used (but incorrectly, see below) to try to estimate the proximity of the sample mean to the population mean. When working with population data the population mean is known; therefore, such an estimation is neither needed nor valid. Of course, the population data may be incomplete due to missing cases or missing values, but this is a cause of bias, not a consequence of random sampling variation. Bias ought to be addressed in any analysis (although it rarely is addressed by those who use 'statistics' instead) but it cannot be addressed through significance tests and the like. None of the techniques of sampling theory statistics can or should be used with population data. When commentators like Goldstein (2008, p.396) advocate the use of confidence intervals with population-based data, they are betraying ignorance of the meaning of confidence intervals (see below), misleading policy-makers and other researchers, and harming those who will be affected by supposedly evidence-informed decisions.

Exactly the same applies to samples other than the random samples on which sampling theory techniques such as significance tests are based (Fielding & Gilbert, 2000). Opportunity, convenience, snowball samples and the like also do not have a standard error, by definition. Findings derived from such samples have no probabilistic uncertainty; they will just have bias. In the same way that findings from population data can be tentatively generalised to other cases not in the population, so findings from non-random samples can be generalised to other cases. But in both situations the generalisation can only be based on judgement, and how well the sampled cases match the non-sampled ones in terms of what is already known. In reality, the judgement is not a generalisation from the sample (or population) but a decision about what is already known about non-sampled cases and how well they match the sampled ones. None of this concerns random sampling variation. When researchers like Carr and Marzouq (2012), to take just one of many available examples, cite significance tests and p-values derived from two complete classes of children in one primary school, they are making a key analytical error. Their probabilities cannot mean anything in the context where only a convenience sample of a year group from one school is involved. Even if their results had been based on a random sample, the statistical population to which such results could be generalised does not exist outside the sample. Such abuse of statistical techniques simply has to cease. As with Goldstein's use of confidence intervals for population data, such abuse of non-random samples leads to errors, wasted opportunities, vanishing breakthroughs, and unwarranted conclusions.
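The definition of the standard error given above can be illustrated with a short simulation (a sketch with made-up numbers, not data from the paper): the standard deviation of the means of many repeated random samples matches the familiar sigma divided by the square root of n.

```python
import random
import statistics

random.seed(0)
# An illustrative known population (the values are arbitrary for the demo)
population = [random.gauss(100, 15) for _ in range(100_000)]
n = 25

# The standard error is the standard deviation of a random sampling
# distribution: the spread of the means of repeated random samples.
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(5000)]
empirical_se = statistics.stdev(sample_means)
theoretical_se = statistics.pstdev(population) / n ** 0.5

print(round(empirical_se, 2), round(theoretical_se, 2))  # both near 3.0
```

Crucially, the simulation only makes sense because repeated random samples can be drawn. For a complete population, or for a convenience sample, there is no such distribution whose standard deviation could be taken, which is exactly the point being made.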
The final situation for this first kind of abuse is when random samples are planned but not achieved. Strictly speaking, an incomplete random sample is not a random sample at all. Rolling 1000 unbiased dice to estimate the probability of gaining each outcome would be a (pseudo-)random process. Rolling the dice and then re-rolling any that showed a six would not lead to a good estimate of the probability of gaining each outcome. This is obvious. In the same way, selecting 1000 cases by chance from a known population is very different from selecting 1000 cases and then replacing 100 of these because they refused to participate. This means that in almost all real-life research situations, sampling theory statistical techniques are not relevant, do not mean anything and must not be used. In a sense, the paper could end at this point, because it would be rare for an analyst to be dealing with a complete random sample.

Misunderstanding and misrepresenting the outputs of significance tests
However, there is a second kind of widespread abuse of statistics that is even worse but somewhat harder to explain. This is because there is such a common misunderstanding of this form of analysis. Put simply, statistical analysis, even when conducted appropriately and with all underlying assumptions met, does not do what most analysts want and what many methods instructors portray that it does. The nature of the conditional probabilities involved is commonly and mistakenly reversed, whether through incompetence or intention to deceive.

This confusion between the probabilities for a sample and a population is clear in the logic of significance testing and the quotation of p-values. As with the modus tollens argument above, a significance test assumes from the outset that what is being 'tested' is true for the population, and so calculates the probability of obtaining a specific value from the random sample achieved (Siegel, 1956). Analysts then generally mistake this p(data|Hyp) as being the probability of what is being 'tested' also being true for the population, given the value obtained from the random sample achieved (p(Hyp|data)). These two probabilities are clearly very different, and neither can be safely inferred from the other. One may be small and the other large, or vice versa, or any combination in between (Gorard, 2010). The p-value calculation depends on the initial assumption of a null hypothesis about what is true for the population. As soon as it is allowed that the null hypothesis may not be true, the calculation goes wrong. The actual computation for a significance test involves no real information about the population, and this means that the same sample from two very different populations would yield the same p-values. A sample mean of 50 would, quite absurdly, produce the same p-value if the population mean were 40, 50, 60 or 70, etc. This is because the population value is not known (else there would be no point in conducting the significance test), and the entire calculation is based only on the achieved sample value.

To illustrate the common misunderstanding of this, consider a simplified situation. There is a bag, containing 100 well-shuffled balls of identical size, and the balls are known to be of only two colours. A sample of 10 balls is selected at random from the bag. This sample contains seven red balls and three blue balls. The analytical question to be addressed is: how likely is it that the observed balance of the colours in the sample is also true of the original 100 balls in the bag? The situation is clearly analogous to many analyses reported in social science research. The bag of balls is the population, from which a sample is selected randomly. A moment's thought shows that it is not possible to say anything very much about the other 90 balls in the bag. The remaining 90 might all be red or all blue, or any share of red and blue in between. Yet the purpose of such a significance test analysis is to find out via sampling something about the balance of
colours in the bag. Without knowing what is in the bag there is no way of assessing how improbable it is that the sample has ended up with seven red balls. Once this impossibility is realised, the pointlessness of significance testing becomes clear.

What a significance test does instead is to make an artificial assumption about what is in the bag. Here the null hypothesis might be that the bag contains 50 balls of each colour at the outset. Knowing this, it becomes relatively easy to calculate the chances of picking seven reds and three blues in a random sample of 10. If this probability is small (traditionally less than one-in-20, or 0.05) it is customary to claim that this is evidence that the bag must have contained an unbalanced set of balls at the outset. This claim is obviously nonsense. The assumption of the null hypothesis tells us nothing about what is actually in the bag. For example, imagine that the bag started with 80 red balls and 20 blues. The sample is drawn as above, and contains seven reds. The significance test approach assumes that there are 50 reds in the bag and calculates a probability of getting seven in a sample of 10. This probability will be clearly incorrect because the balls are less balanced in fact than the null assumption requires. Now imagine that the sample is still the same but that the bag had 80 blue balls and only 20 red originally. The significance test approach again assumes that there are 50 reds in the bag and calculates the same probability of getting seven reds. This probability will also be clearly incorrect because the balls are less balanced than the null assumption requires. More absurdly, this second probability must be the same as the first one, since they are both calculated in the same way on the same assumption. So the significance test would give exactly the same probability of having drawn seven reds in a random sample from a bag of 80 per cent reds as from a bag of 20 per cent reds. This absurdity happens because the test takes no account of the actual proportion of each colour in the population. It cannot, since finding out that balance is supposed to be the purpose of the analysis.

Of course, the probability of getting seven reds from a bag containing 80 reds is different, a priori, from the probability of getting seven reds from a bag containing 20 reds. But the significance test is conducted post hoc. There is no way of telling what the remaining population is from the sample alone. To imagine otherwise would be equivalent to deciding that rolling a three followed by a four with a die showed that the die was biased (since the probability of that result is only 1/36, which is much less than five per cent, of course).

For anyone who has spotted this misunderstanding, there is little doubt that their use of significance testing would cease (Falk & Greenbaum, 1995). No one wants to know the probabilistic answer the tests actually provide (about the probability of the observed data given the assumption), and the test cannot provide the answer analysts really want (the probability of the assumption being true given the data observed). This conclusion is not new (Harlow et al., 1997). It has been known for a long time, perhaps since their earliest adoption, that significance tests do not work as hoped, and may well be harmful because their results are so widely misinterpreted (Carver, 1978). Yet unwary methods resources and purported experts continue to peddle the fiction that p-values are, or are closely related to, the probability of the sample result being 'true', real or relevant. Relatively recent examples among many include the following in a textbook on social science methods:

[Statistical significance is] 'the likelihood that a real difference or relationship between two sets of data has been found' (Somekh & Lewin, 2005, p.224).

And perhaps even more worrying is the 'explanation' (in relation to statistical modelling) given during the training of heavily selected UK national experts in rigorous evaluation:

Significance of b4 indicates whether there is evidence of an interaction effect (Connolly, 2013, slide 5).
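The absurdity described above is easy to verify directly. In this minimal sketch (the 7-red sample and the bag compositions follow the example in the text), the probability computed under the null hypothesis uses only the sample and the assumed 50:50 split, so the true contents of the bag cannot affect it:

```python
from math import comb

def prob_under_null(reds_seen: int, n: int, null_p: float = 0.5) -> float:
    """Probability of the observed count assuming the null hypothesis.
    Only the sample and the assumed null enter the calculation."""
    return (comb(n, reds_seen) * null_p ** reds_seen
            * (1 - null_p) ** (n - reds_seen))

# The same sample (7 red out of 10) gives the same figure whether the bag
# really held 80, 50 or 20 red balls -- the test cannot distinguish them.
for true_reds_in_bag in (80, 50, 20):
    print(true_reds_in_bag, round(prob_under_null(7, 10), 4))
```

Whatever the bag actually contains, the printed value is 0.1172 in every case; the true proportion of reds appears nowhere in the computation.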
Both of these explanations are the wrong way around. The 'significance' value is really the likelihood of finding a fake 'difference' or 'effect' if none actually exists. This is a very different value from the likelihood of there actually being a difference or effect. It is like saying the probability of being a professional footballer if a person is over six feet tall is the same as the probability of a person being over six feet tall if they are a footballer. The first of these values will be much, much smaller than the second. To confuse the two as the supposed experts above do is to make a very serious mistake. It is possible to convert one figure to the other using Bayes' theorem, as long as the unconditional probabilities are already known (such as what proportion of people are footballers and what proportion are over six feet tall). But there would be no point in conducting a significance test in this situation since both conditional probabilities would be calculated precisely.

Misunderstanding and misrepresenting confidence intervals
Faced with increasing criticism of significance testing and its abuse, in 1999 the American Psychological Association (APA) set up a Task Force on Statistical Inference. This considered a ban on the reporting of such tests in all APA journals. Unfortunately, their final recommendation fell short of such a radical but useful step, and the APA instead focused on moving beyond significance to a consideration of the 'precision' of any research findings. Its influential publication manual now states that:

[Null hypothesis significance testing] is but a starting point and that additional reporting elements such as effect sizes, confidence intervals, and extensive description are needed (APA, 2010, p.33).

This is a shame because confidence intervals use the same underlying logic as significance tests, share the same fatal flaws, and are at least as widely misunderstood. For example, talking about confidence intervals, Goldstein (2008, p.399) says of their use in value-added calculations:

A confidence interval provides a range of values that, with a given probability – typically 0.95 – is estimated to contain the true value of the school score.

Connolly (2007, p.149) says that a 95 per cent confidence interval shows that:

There is a 95 per cent chance that the true population mean is within just under two standard errors of the sample mean.

Both of these statements are wrong. With population data, or where the true population value (such as its mean) was already known, there would be no need for confidence intervals (CIs). A CI is calculated only from the sample value, and no reference at all is made to the true population value (how could it be?). Instead of the above, a CI for a sample value means precisely this:

If we assume that the value from a complete random sample is identical to the true population value, then the CIs of many repeated complete random samples of the same size would contain the population value for 95 per cent (or the selected interval) of these samples.

This is why any reported CI for a specific sample is centred around the sample value. Of course, based on this correct definition (of how a CI is actually calculated), the technique is completely useless. It cannot be used to assess how close the sample value is to the unknown population value, because it is based on the assumption that the two are identical from the outset. As soon as it is allowed that the two might differ at all, then the calculation of the CI fails. If the sample mean is not at the precise centre of the normally distributed population (or sampling distribution) then it is not true that 95 per cent of the population will lie within 1.96 standard deviations of the sample mean. The absurdity of this kind of artificial calculation is perhaps even clearer when considering what happens in an example. Imagine that a sample mean was 50, and that this was drawn from a population with mean
60. The CI would have a particular range centred around 50. Now imagine that all else remains the same but that the population mean was actually 70. The CI would remain the same, because the CI is unrelated to the actual population mean. This suggests that a CI based on an estimate of 50 for a real value of 60 would imply the same level of accuracy as for a real value of 70. In practice, and even when used as intended, CIs are pointless. Worse than this, because even purported authorities are explaining their interpretation incorrectly, they are being used to draw invalid inferences. Again, money and research effort are being wasted and those intended to benefit from research may be being harmed. Simply stating the number of cases underlying any sample value is sufficient and valid.

What should happen instead?
There is a tendency to want to cling to traditional statistics, not understanding them or even knowing that they do not make sense, due to not being sure what to do instead. In general, the answer is that nothing should be done instead. Removal of the error is improvement enough. In the paper used as an example above (among countless others), Carr and Marzouq (2012) presented Table 1 (p.7) as below and textual discussion of these findings (p.6):

As seen in Table 1 children endorsed all four of the achievement goals to similar degrees. However, the range of responses for both the mastery-approach and mastery-avoidance scales were narrower than the performance scales and were focused at the top end of the scale. Correlations between goals (Table 1) are consistent with the 2 x 2 framework where goals sharing a dimension (mastery/performance or approach/avoidance) are positively correlated while those not sharing a dimension are unrelated (Elliot & McGregor, 2001; Elliot & Murayama, 2008). Although this pattern of correlation is evident in this sample the association between mastery approach and performance approach goals is smaller than expected, just approaching significance.
Clearly, much of this reporting is incorrect. With a convenience sample of 58 children from one school, Carr and Marzouq (2012) should not be discussing statistical 'significance' or quoting p-values. Therefore, parts of the report such as the gobbledegook at the foot of the table can be simply removed. In addition, the use of decimal places should be curtailed. It is unlikely that the reported means are really accurate to five one-thousandths of a unit in a study measuring things as vague as 'performance approach' with only 58 cases. The result could look like this:

As seen in Table 1 children endorsed all four of the achievement goals to similar degrees. However, the range of responses for both the mastery-approach and mastery-avoidance scales was narrower than the performance scales and focused at the top end of the scale. Correlations between goals (Table 1) are consistent with the 2 x 2 framework where goals sharing a dimension (mastery/performance or approach/avoidance) are positively correlated while those not sharing a dimension are unrelated (Elliot & McGregor, 2001; Elliot & Murayama, 2008). Although this pattern of correlation is evident in this sample the association between mastery approach and performance approach goals is smaller than expected.

Nothing much has changed with the invalid p-values removed. If the findings of the paper were important (or not) before, they remain so now that they are reported without abusing statistics. It is entirely possible that making the results simpler, and not misleading readers or even the researchers themselves with false probabilities, would encourage a greater emphasis on the analytical issues that really matter and on the substantive (or not) nature of the results. Key issues in this example appear to be whether the measures are measuring anything at all, whether they can measure it accurately, how they could be calibrated, what the bias might be in the sample, the nature of any non-response, and how any of these initial errors might propagate through ensuing calculations. The answers to these questions and others like them will help readers and researchers decide whether the results warrant the claim in the paper – that the researchers have tested 'the 2 x 2 achievement goal model' (p.6). Moving away from the convenient but invalid push-button approach to analysis might yield benefits beyond mere cessation of the abuse. It might introduce more transparency and judgement in reporting (Gorard, 2006).
References
American Psychological Association (APA) (2010). Publication manual of the APA (6th ed.). Washington, DC: APA.
Carr, A. & Marzouq, S. (2012). The 2 x 2 achievement goal framework in primary school: Do young children pursue mastery-avoidance goals? The Psychology of Education Review, 36(2), 3–8.
Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378–399.
Connolly, P. (2007). Quantitative data analysis in education. New York: Sage.
Connolly, P. (2013). Analysis of Randomised Controlled Trials (RCTs). Presentation to Conference of EEF Evaluators: Building Evidence in Education, London.
Falk, R. & Greenbaum, C. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory and Psychology, 5, 75–98.
Fielding, J. & Gilbert, N. (2000). Understanding social statistics. London: Sage.
Goldstein, H. (2008). Evidence and education policy – some reflections and allegations. Cambridge Journal of Education, 38(3), 393–400.
Gorard, S. (2006). Towards a judgement-based statistical analysis. British Journal of Sociology of Education, 27(1), 67–80.
Gorard, S. (2010). All evidence is equal: The flaw in statistical reasoning. Oxford Review of Education, 36(1), 63–77.
Gorard, S. (2012). The increasing availability of official datasets: Methods, opportunities, and limitations for studies of education. British Journal of Educational Studies, 60(1), 77–92.
Gorard, S. (2013). Research design: Robust approaches for the social sciences. London: Sage.
Harlow, L., Mulaik, S. & Steiger, J. (1997). What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum.
Porter, T. (1986). The rise of statistical thinking. Princeton: Princeton University Press.
Rozeboom, W. (1997). Good science is abductive not hypothetico-deductive. In L. Harlow, S. Mulaik & J. Steiger (Eds.), What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum.
Siegel, S. (1956). Non-parametric statistics for the behavioural sciences. Tokyo: McGraw Hill.
Somekh, B. & Lewin, C. (2005). Research methods in the social sciences. London: Sage.
Correspondence
Professor Gene V. Glass
Arizona State University and
University of Colorado, Boulder.
Email: glass@asu.edu
IT IS RIGHT to question the application of
argument may well benefit from a Bayesian
might be representative of students elsewhere in the country in similar schools, should they receive the same intervention. Intuitively this appears less of a leap of faith than claiming that the actual performance, an example of a population parameter, lies between its confidence intervals (the respondent is aware of the simplification inherent in 'lies between'). The main reason why such generalisation may be reasonable is that it is for the same intervention on the same outcome, rather than the estimation of a measure that might be influenced by a myriad of other factors.

Virtual population
The previous argument is definitely questionable, since the effect itself may also be vulnerable to this myriad of other factors. However, it might be strong enough to justify another trial or even intervention roll-out if the sample is seen to be representative 'enough'. From a frequentist point of view, when analysing the results of any trial, we need to establish how easily the results we see could have occurred by chance alone; this is the basis of a frequentist statistical test. Similarly, it is useful to estimate a confidence interval, thus encapsulating the chance element of the effect size we see. If we reject any generalisation to a wider population, Gorard's conundrum comes into sharp focus: if this were not a random sample, where are the other subjects from which we could have sampled that might give rise to the other results upon which our confidence interval is based? They certainly do not exist physically. However, it is often helpful to regard them as existing virtually. The concept of a virtual population is often used without acknowledgement, for example, when assigning a confidence interval to a school's value-added results even when all students are measured (Goldstein, 2008). Rather than conceptualising the 95 per cent confidence interval as one of many from a large series of trials run on members of a physical population, 95 per cent of which would contain the true effect, we imagine the trial being run many times on students in the same schools at the same time in a virtual population from which we did sample randomly. This allows us to quantify chance and gives meaning to the p-values and confidence intervals used. Whilst the concept is abstract, ignoring uncertainty is far worse and may result in concluding that things work when they do not, and vice versa, even if this is just for the sample in question.

By introducing the concept of a virtual population, we are acknowledging that the students in the trial can be regarded as a physical 'population'. They were not randomly sampled, so no further physical population exists to which confidence intervals can apply strictly. Gorard states that 'data for a [physical] population cannot have a standard error, by definition.' Indeed, if we are estimating a population parameter and have measured everyone in the population, we need no standard error. However, things are rarely so straightforward, since the population itself may be limited in terms of the research question and may need to be seen as a sample within a larger virtual population, as illustrated in the previous paragraph.

Merely reporting effect sizes and numbers of participants would engender a culture of conjecture around the uncertainty of any result. Bayesian statistics has a lot to offer the concerns raised by Gorard and, rather than not attempting to measure uncertainty, all users of statistics should embrace the frequentist versus Bayesian debate more seriously.

Correspondence
Dr Ben Styles
Research Director, National Foundation for Educational Research,
The Mere, Upton Park, Slough,
Berkshire SL1 2DQ.
Email: b.styles@nfer.ac.uk

Reference
Goldstein, H. (2008). Evidence and education policy – some reflections and allegations. Cambridge Journal of Education, 38(3), 393–400.
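The Bayesian route raised in the commentary above can be made concrete for Gorard's bag-of-balls example. This is only a sketch, and it assumes a uniform prior over the bag's composition, an assumption made in neither article: with 7 reds seen in a random sample of 10, a posterior probability that the bag is majority red follows directly.

```python
from math import comb

TOTAL, N, REDS_SEEN = 100, 10, 7  # bag size, sample size, reds observed

def likelihood(reds_in_bag: int) -> float:
    """Hypergeometric probability of seeing 7 reds in 10 draws
    from a bag with the given number of red balls."""
    blues_needed = N - REDS_SEEN
    if reds_in_bag < REDS_SEEN or TOTAL - reds_in_bag < blues_needed:
        return 0.0
    return (comb(reds_in_bag, REDS_SEEN)
            * comb(TOTAL - reds_in_bag, blues_needed)) / comb(TOTAL, N)

# Uniform prior over 0..100 reds; posterior via Bayes' theorem.
unnormalised = [likelihood(r) for r in range(TOTAL + 1)]
total_mass = sum(unnormalised)
posterior = [u / total_mass for u in unnormalised]

# Posterior probability that the bag holds a majority of reds (51 or more)
p_majority_red = sum(posterior[51:])
print(round(p_majority_red, 2))
```

Unlike the significance test, this calculation depends explicitly on a stated prior belief about the bag, which is exactly the information the p-value computation silently does without.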
Gorard makes two claims: (1) there is widespread abuse of statistics in the social sciences; and (2) researchers almost universally report results incorrectly. According to Gorard, the way forward is to correct the reporting, so that the abuse will disappear.
The Authors
Professor Victor H.P. van Daal
Edge Hill University.
Dr Herman J. Adèr
Johannes van Kessel Advising.
Correspondence
Professor Victor H.P. van Daal
Director,
The Centre for Literacy and Numeracy Research,
Faculty of Education,
Edge Hill University,
St. Helen's Road, Ormskirk,
Lancashire, L39 4QP.
Email: vandaalv@edgehill.ac.uk
few minutes of his much-lauded The Joy of Stats documentary he calculates confidence intervals for data that have not been sampled randomly (Rosling, 2013). With such high-profile examples of misuse, it is perhaps not surprising that flouting the assumptions required for these statistics is routine.

In social science research, using inferential statistics despite having non-random, incomplete or even population data is commonplace. One only has to glance at recent publications in highly ranked social science journals to see that there is little in the way of quality control exercised by reviewers or editors in relation to the incorrect use of these techniques.

This situation would be sufficiently worrying if it stopped there. However, the effects of the widespread acceptance of the inappropriate use of inferential statistics are not limited to the publication of articles by authors who, for whatever reason, are erroneously using these techniques. My experience – and that of close colleagues – suggests that the common and accepted abuse of these techniques leads to a further problem. Authors who have quite properly eschewed the use of inferential statistics because their data do not meet the required assumptions are often asked by reviewers to include the results of these tests in their research reports in order to have them accepted for publication. I have come across this situation in papers I have authored or co-authored and also in reports from colleagues who have received these kinds of recommendations from reviewers. Most commonly, however, I have seen it in the comments from other reviewers on many occasions when I have acted as a referee. A related situation also frequently occurs when, as a referee, I recommend that the inappropriate use of inferential statistics be removed from a paper before publication.

What is notable in the situations described above is the fervour displayed by advocates of the inappropriate use of inferential statistics, both when defending their own practice and insisting that others follow suit. Also interesting are the arguments used to justify their position. In my experience the most common defence involves an appeal to the idea of a 'superpopulation'.

The myth of the 'superpopulation'
In educational research it is not uncommon for research to use population data. The populations used can range from the quite modest, such as data on all students in a single school, to those such as the National Pupil Database (NPD) that include every student enrolled in a state school in England. What these data sets have in common is that they have not been generated through processes of random sampling but cover all the cases in a particular institution, geographical area, and so on. It is quite common for analyses of these population data sets to include the use of inferential statistics. As Gorard points out, the results of such analyses are meaningless. With population data there is no need for any inference as any analyses are conducted at the population level. Any errors in the results will not be due to random sampling and so, in any case, inferential statistics should not be used. As Berk (2004, p.42) concludes:

If the data are a population, there is no sampling, no uncertainty because of sampling, and no need for statistical inference. Indeed, statistical inference makes no sense. The only game is describing patterns in the data on hand.

Unlike in some situations in the physical sciences, in the social sciences other deficiencies in the data (such as drop-out, non-response or measurement error) cannot be assumed to be random. The problem of bias caused by non-response has been acknowledged for many years (see Hansen & Hurwitz, 1946) and repeated studies have demonstrated the non-random nature of non-response (e.g. Sheikh & Mattingly, 1981).

As mentioned above, a common defence of the use of inferential statistics with population data appeals to the idea of a 'superpopulation', 'hyperpopulation', 'infinite
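The point about non-random non-response (Hansen & Hurwitz, 1946) can be made concrete with a small simulation. This is a hedged sketch with invented numbers, not an analysis from any of the papers under discussion; it assumes that lower-scoring cases are less likely to respond:

```python
# Sketch (invented numbers): when non-response is related to the outcome
# being measured, the respondents' mean is biased, and randomly
# re-sampling from the respondents cannot remove that bias.
import random
import statistics

random.seed(2)

# A complete population of 10,000 scores.
population = [random.gauss(50.0, 10.0) for _ in range(10_000)]

# Non-random non-response: the lower a case's score, the less likely
# it is to respond (an assumed mechanism, for illustration only).
respondents = [x for x in population if random.random() < min(1.0, x / 60.0)]

true_mean = statistics.mean(population)
resp_mean = statistics.mean(respondents)

# A random sub-sample of the respondents reproduces the same bias.
sub_mean = statistics.mean(random.sample(respondents, 500))

print(f"population mean:               {true_mean:.1f}")
print(f"respondent mean:               {resp_mean:.1f}")
print(f"random sub-sample of resp.:    {sub_mean:.1f}")
```

Because the drop-out mechanism depends on the score itself, the respondents' mean sits above the population mean, and no amount of random selection from within the respondents moves it back – randomness applied after a biased selection does not undo the bias.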
these assumptions. Even if it can be demonstrated that data could or do meet the necessary requirements, the outputs of inferential statistics do not tell us anything we want to know. Put simply, they are not useful even in the (hypothetical?) cases where the underlying assumptions for their use are met.

P-values are not the only problem
One of the elements that sets Gorard's paper apart from most critiques of the use of inferential statistics is that his concern is not limited to the use of null-hypothesis statistical testing (NHST) and the associated reliance on p-values. Criticisms of the use of p-values are relatively common but often advocate a change of focus not only to effect sizes (which are not an inferential technique) but also to the use of standard errors (SEs) and confidence intervals (CIs) (e.g. Cumming, 2012; Hubbard & Lindsay, 2008; Lambdin, 2012). Even Rozeboom (1960) and Cohen (1994), both of whom have written about the logical problems with NHST and are ardent critics of the use of p-values, view CIs as an acceptable alternative. This view is also shared by Meehl (1997), who has written extensively on the problems of NHST in psychology.

While discussion of other issues can be found in the literature, the problem central to the use of p-values is that the probability that they refer to is not the probability we want to know. P-values can only provide the probability of the data given a hypothesis. What we actually want to know is the probability of a hypothesis given the data. As Gorard shows in his paper, the former probability – p(D|H) – is of no use to researchers, and it is not possible to convert this information to the more useful latter probability – p(H|D) – using only sample data. This problem has led to the commentators mentioned above advocating the use of CIs as an alternative to p-values.

However, as Gorard shows, the advantages of using SEs and CIs in place of, or in addition to, p-values are illusory. Once the correct definition of a confidence interval is adopted, it becomes unclear how such information would be of use to a researcher. As with NHST and p-values, the starting point is still based on assumptions about the population that are unverified and cannot be tested using sample data. The logical problem that renders NHST and p-values useless also prevents the information provided by CIs being useful. Those who correctly criticise the use of p-values on these grounds often seem to miss the point that all inferential statistical techniques suffer from a similar flaw.

Apart from Gorard, there are very few commentators who are willing to abandon the 'project' of inferential statistics altogether. Concerns are expressed that to abandon inferential statistics as a 'bad job' would be akin to 'throwing the baby out with the bath water'. It is now relatively uncontroversial for researchers to express concern about the use of NHST and p-values, and these concerns are even beginning to be raised in texts aimed at undergraduate students (e.g. Field, 2009). Recommending abandoning the use of all inferential statistical techniques, however, is much less common and likely to generate considerably more criticism.

However, what is missing from the more cautious interrogations of the use of inferential statistics is a convincing account of why we should continue to use any of them at all. Those who advocate a move to CIs do not adequately explain how the information provided by these measures can be useful when they are interpreted correctly. My current view is that no such account will be forthcoming, simply because it is not possible to construct one.

Against inferential statistics
As I stated at the beginning of this response, I do not expect Gorard's views to be popular – especially those that extend the traditional critique of NHST and p-values to other inferential outputs such as CIs. I expect my support of his views to receive similar reactions. However, I believe these arguments
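The gap between p(D|H) and p(H|D) described above can be shown numerically. The sketch below uses invented probabilities, not figures from any of the papers under discussion; it applies Bayes' theorem to a single 'significant' result and shows that the posterior probability of the null hypothesis depends entirely on a prior that sample data cannot supply:

```python
# Sketch (invented numbers): p(D|H), which a significance test reports,
# is not p(H|D), which a researcher wants. Converting one into the other
# requires a prior probability for the hypothesis.
#
# Suppose a result this extreme occurs with probability 0.05 if the null
# hypothesis H0 is true (the p-value), and with probability 0.60 if H0
# is false (the power against the assumed alternative).
p_data_given_h0 = 0.05
p_data_given_h1 = 0.60

# Bayes' theorem: p(H0|D) depends on the prior p(H0), which we must assume.
for prior_h0 in (0.2, 0.5, 0.8):
    num = p_data_given_h0 * prior_h0
    denom = num + p_data_given_h1 * (1.0 - prior_h0)
    print(f"prior p(H0) = {prior_h0:.1f}  ->  p(H0|D) = {num / denom:.2f}")
```

With these inputs, the very same p-value of 0.05 corresponds to a posterior p(H0|D) of anywhere between roughly 0.02 and 0.25, depending entirely on the assumed prior; this is the sense in which p(D|H) cannot be converted into p(H|D) 'using only sample data'.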
References
Berk, R. (2004). Regression analysis: A constructive critique. Thousand Oaks, CA: Sage.
Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49(12), 997–1003.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals and meta-analysis. New York: Routledge.
Field, A. (2009). Discovering statistics using IBM SPSS Statistics. London: Sage.
Freedman, D.A. (2004). Sampling. In M.S. Lewis-Beck, A. Bryman & T.F. Liao (Eds.), Sage encyclopaedia of social science research methods (pp.987–991). Thousand Oaks, CA: Sage.
Goldstein, H. (2003). Multilevel statistical models (3rd ed.). London: Arnold.
Hagood, M.J. (1970). The notion of a hypothetical universe. In D.E. Morrison & R.E. Henkel (Eds.), The significance test controversy. Chicago: Aldine.
Hansen, M.H. & Hurwitz, W.N. (1946). The problem of non-response in sample surveys. Journal of the American Statistical Association, 41(236), 517–529.
Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.) (1997). What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum.
Hubbard, R. & Lindsay, R.M. (2008). Why p-values are not a useful measure of evidence in statistical significance testing. Theory & Psychology, 18(1), 69–88.
Lambdin, C. (2012). Significance tests as sorcery: Science is empirical – significance tests are not. Theory & Psychology, 22(1), 67–90.
Meehl, P.E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L.L. Harlow, S.A. Mulaik & J.H. Steiger (Eds.), What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum.
Morrison, D.E. & Henkel, R.E. (Eds.) (1970). The significance test controversy. Chicago: Aldine.
Rosling, H. (2013). The Joy of Stats. BBC 4, 16 October. Accessed 25 January 2014, from: www.bbc.co.uk/programmes/b00wgq0l
Rozeboom, W.W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416–428.
Sheikh, K. & Mattingly, S. (1981). Investigating non-response bias in mail surveys. Journal of Epidemiology and Community Health, 35, 293–296.
Ziliak, S.T. & McCloskey, D.N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice and lives. Ann Arbor, MI: University of Michigan Press.
everything that appears in reputable methods texts, and it is the kind of error that leads Glass to say 'The fiction that probability statements are meaningful in the absence of random acts underlying them is preposterous'.

Common responses when I write or lecture about the abuse of statistics are 'everyone does it', 'it has happened for a long time' and eventually 'we already know all this but what should we do instead?'. I hope readers can see that none of these is a valid counter-argument, and that they remain invalid when deployed by four respondents here who largely re-state their own existing practice. Styles claims that in rejecting the use of significance and CIs I am rejecting any attempt to consider uncertainty in research findings. This is not true, and the original paper urges researchers to consider a wider and more important range of factors that lead to uncertainty but which are ignored by the significance approach (such as design bias or respondent attrition). I feel I am the more concerned because I do not just want to pretend I am assessing uncertainty via an invalid technique.

Howe claims that I argued that we should not use convenience samples, and quotes APA guidance suggesting that convenience samples are perfectly proper. They are, and I never suggested otherwise. In fact, I clearly stated that we often have no practical alternative. What the APA does not say is that we should use significance tests with convenience samples. As ever, it is presumably easier to mis-portray what I said and argue with that. Van Daal and Adèr do something similar. I showed that denying the consequent is only valid in logic when the premises are certain, and that the modus tollens argument fails once any premise is uncertain or probabilistic. They portray this as me saying that probabilistic argument in general is invalid, citing weather forecasting as an illustration. But weather forecasting does not employ this 'denying the consequent' argument structure at all. These three examples show the lengths that commentators have to go to in order to try and defend the indefensible.

There are some less common variants in the responses that try to maintain the edifice of significance testing. If we have a non-random sample we could randomly sample from within that and then use significance with the sub-sample (Van Daal & Adèr). This seems truly desperate. Purportedly, there are techniques to 'fix' a non-random sample and make it back into a random one (Van Daal & Adèr). No, there are not, because if we knew the key values for the missing cases then we would have a complete sample. If not, we can only use the values we do have to make up for what is missing, so enhancing the bias caused by the missing cases in the first place (Gorard, 2013). The same practical problem eliminates Putwain's suggestion that if the achieved sample looks similar to what the random sample would have been, if available, then using significance is justified. To imagine a random sample based on a convenience sample, and then try to compute real probabilities accurately based on that imagination, is surely incorrect.

Even stranger is the notion that the super-, hyper- or virtual population invented to help differentiate between theoretically finite and infinite populations can then be used to justify treating actual population data as a random sample (Styles). Just envisage what Styles means when he writes 'we imagine the trial being run many times on students in the same schools at the same time in a virtual population from which we did sample randomly'. And note that this entirely ignores the logical problems raised by point two at the outset. I have written about this absurdity many times before (e.g. Gorard, 2008), and White handles this briefly but well in his response.

The search for an alternative, or what to do 'instead' of significance, especially with non-random samples/allocation, is an odd one. Since the existing approaches do not work we must abandon them. A Bayesian approach would certainly be more logical but is no panacea and no substitute for