
Open Dialogue:
The widespread abuse of statistics by researchers: What is the problem and what is the ethical way forward?
Professor Stephen Gorard

The paper presents and illustrates two areas of widespread abuse of statistics in social science research. The
first is the use of techniques based on random sampling but with cases that are not random and often not
even samples. The second is that even where the use of such techniques meets the assumptions for use,
researchers are almost universally reporting the results incorrectly. Significance tests and confidence
intervals cannot answer the kinds of analytical questions most researchers want to answer. Once their
reporting is corrected, the use of these techniques will almost certainly cease completely. There is nothing to
replace them with, but there is no pressing need to replace them anyway. As this paper illustrates, removing
the erroneous elements in the analysis is usually sufficient improvement (to enable readers to judge claims
more fairly). Without them it is hoped that analysts will focus rather more on the meaning and limitations
of their numeric results.

Which kind of statistics is being abused?

The term 'statistics' is an ambiguous one. It emerged from the collation and use of figures concerning the nation state from the 17th century onwards in the UK, and subsequently in the US and elsewhere (Porter, 1986). Such figures involved relatively simple analyses, and 'political arithmetic' was largely used to lay bare inefficiencies, inequalities and injustice (Gorard, 2012). However, more recently and for many commentators the term has come to mean a set of techniques derived from sampling theory, and/or the products of those techniques. It is the abuse of such techniques that is the subject of this paper. These techniques include the use of standard errors, confidence intervals and significance tests (both explicitly and disguised within more complex statistical modelling). They are supposedly used to help analysts to decide whether something that is found to be true of the sample achieved in a piece of research is also likely to be true of the known population from which that sample was drawn.

All of these statistical techniques, including confidence intervals, are based on a modified form of the argument modus tollendo tollens. In formal logic, the argument of denying the consequent is as follows:

If A is true then B is true
B is not true
Therefore, A is not true

This is a perfectly valid argument, and the conclusion must be true, as long as the premises are definitive. If B is not true then it is certain that A is not true. However, as soon as tentativeness or probability enters, the argument fails:

If A is true then B is probably true
B may not be true
Therefore, A may not be true

This is not really a valid argument, and the truth of the conclusion is contingent on many factors beyond pure logic. Characteristic A may or may not be true. If it is true, characteristic B could be true as well, or not. The observation that B may (or may not) be true says almost precisely nothing about the truth of A. The first premise may be likened to the null hypothesis in statistical analysis, and the second to the evidence from the achieved research sample.


The probabilistic argument is now contingent upon the frequency with which A and B are true together in reality, and on the accuracy of the research finding about the likelihood of B being true. Knowing both of these facts, it would be possible to draw a probabilistic conclusion about the likelihood of A being true (the desired research conclusion). But in reality, neither of these facts would be known. In fact, the main supposed objective of the analysis would be to help decide on the accuracy of the research finding that B may not be true. The analysis assumes from the outset something about that which it is supposed to be assessing. The misunderstanding caused by this assumption is widespread.

Using sampling theory techniques in inappropriate contexts

However, the most obvious abuse of sampling theory techniques is their use in situations for which they were not designed and for which they ought not to be used. Data for a population cannot have a standard error, by definition. The standard error is defined as the standard deviation of a random sampling distribution, of samples drawn repeatedly from a population. It is used (but incorrectly, see below) to try to estimate the proximity of the sample mean to the population mean. When working with population data the population mean is known; therefore, such an estimation is neither needed nor valid. Of course, the population data may be incomplete due to missing cases or missing values, but this is a cause of bias, not a consequence of random sampling variation. Bias ought to be addressed in any analysis (although it rarely is addressed by those who use 'statistics' instead) but it cannot be addressed through significance tests and the like. None of the techniques of sampling theory statistics can or should be used with population data.
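To make the definition concrete, here is a minimal simulation, written in Python for this discussion (the population of 10,000 invented scores, and all other figures, are illustrative assumptions, not taken from the paper). It shows that the standard error describes the spread of means across repeated random samples, matching the familiar sd/sqrt(n) formula, and that with the complete population in hand there is no repeated sampling left for it to describe.

    import random
    import statistics

    random.seed(1)

    # An invented finite 'population' of 10,000 scores (mean 50, sd 10).
    population = [random.gauss(50, 10) for _ in range(10_000)]

    # The standard error is the standard deviation of the random sampling
    # distribution: the spread of the means of repeated random samples.
    sample_means = [statistics.mean(random.sample(population, 100))
                    for _ in range(5_000)]
    print(statistics.stdev(sample_means))             # empirical SE, roughly 1.0
    print(statistics.stdev(population) / 100 ** 0.5)  # textbook sd/sqrt(n), also ~1.0

    # With the complete population there is nothing left to estimate:
    # the population mean is simply known.
    print(statistics.mean(population))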


When commentators like Goldstein (2008, p.396) advocate the use of confidence intervals with population-based data, they are betraying ignorance of the meaning of confidence intervals (see below), misleading policy-makers and other researchers, and harming those who will be affected by supposedly evidence-informed decisions.

Exactly the same applies to samples other than the random samples on which sampling theory techniques such as significance tests are based (Fielding & Gilbert, 2000). Opportunity, convenience, snowball samples and the like also do not have a standard error, by definition. Findings derived from such samples have no probabilistic uncertainty; they will just have bias. In the same way that findings from population data can be tentatively generalised to other cases not in the population, so findings from non-random samples can be generalised to other cases. But in both situations the generalisation can only be based on judgement, and on how well the sampled cases match the non-sampled ones in terms of what is already known. In reality, the judgement is not a generalisation from the sample (or population) but a decision about what is already known about non-sampled cases and how well they match the sampled ones. None of this concerns random sampling variation. When researchers like Carr and Marzouq (2012), to take just one of many available examples, cite significance tests and p-values derived from two complete classes of children in one primary school, they are making a key analytical error. Their probabilities cannot mean anything in a context where only a convenience sample of a year group from one school is involved. Even if their results had been based on a random sample, the statistical population to which such results could be generalised does not exist outside the sample. Such abuse of statistical techniques simply has to cease. As with Goldstein's use of confidence intervals for population data, such abuse of non-random samples leads to errors, wasted opportunities, vanishing breakthroughs, and unwarranted conclusions.

The final situation for this first kind of abuse is when random samples are planned but not achieved. Strictly speaking, an incomplete random sample is not a random sample at all. Rolling 1000 unbiased dice to estimate the probability of gaining each outcome would be a (pseudo-)random process. Rolling the dice and then re-rolling any that showed a six would not lead to a good estimate of the probability of gaining each outcome. This is obvious. In the same way, selecting 1000 cases by chance from a known population is very different from selecting 1000 cases and then replacing 100 of these because they refused to participate. This means that in almost all real-life research situations, sampling theory statistical techniques are not relevant, do not mean anything and must not be used. In a sense, the paper could end at this point, because it would be rare for an analyst to be dealing with a complete random sample.
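The dice illustration is easy to verify by simulation. The sketch below (Python, written for this discussion; the figures are purely illustrative) rolls 1000 dice and then spoils the 'sample' by re-rolling every six once, mimicking the replacement of refusing cases:

    import random
    from collections import Counter

    random.seed(1)

    def roll_batch(n=1000, reroll_sixes=False):
        rolls = [random.randint(1, 6) for _ in range(n)]
        if reroll_sixes:
            # Replace every six with a fresh roll, as when refusers are replaced.
            rolls = [random.randint(1, 6) if r == 6 else r for r in rolls]
        return Counter(rolls)

    print(roll_batch())                   # each outcome near 1000/6, about 167
    print(roll_batch(reroll_sixes=True))  # sixes fall to about 1/36 of rolls (~28)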


Misunderstanding and misrepresenting the outputs of significance tests

However, there is a second kind of widespread abuse of statistics that is even worse but somewhat harder to explain. This is because there is such a common misunderstanding of this form of analysis. Put simply, statistical analysis, even when conducted appropriately and with all underlying assumptions met, does not do what most analysts want and what many methods instructors portray that it does. The nature of the conditional probabilities involved is commonly and mistakenly reversed, whether through incompetence or intention to deceive.

This confusion between the probabilities for a sample and a population is clear in the logic of significance testing and the quotation of p-values. As with the modus tollens argument above, a significance test assumes from the outset that what is being 'tested' is true for the population, and so calculates the probability of obtaining a specific value from the random sample achieved (Siegel, 1956). Analysts then generally mistake this, p(data|hypothesis), for the probability of what is being 'tested' also being true for the population, given the value obtained from the random sample achieved: p(hypothesis|data). These two probabilities are clearly very different, and neither can be safely inferred from the other. One may be small and the other large, or vice versa, or any combination in between (Gorard, 2010). The p-value calculation depends on the initial assumption of a null hypothesis about what is true for the population. As soon as it is allowed that the null hypothesis may not be true, the calculation goes wrong. The actual computation for a significance test involves no real information about the population, and this means that the same sample from two very different populations would yield the same p-values. A sample mean of 50 would, quite absurdly, produce the same p-value whether the population mean were 40, 50, 60 or 70. This is because the population value is not known (else there would be no point in conducting the significance test), and the entire calculation is based only on the achieved sample value.

To illustrate the common misunderstanding of this, consider a simplified situation. There is a bag containing 100 well-shuffled balls of identical size, and the balls are known to be of only two colours. A sample of 10 balls is selected at random from the bag. This sample contains seven red balls and three blue balls. The analytical question to be addressed is: how likely is it that this observed balance of the colours in the sample is also true of the original 100 balls in the bag? The situation is clearly analogous to many analyses reported in social science research. The bag of balls is the population, from which a sample is selected randomly. A moment's thought shows that it is not possible to say anything very much about the other 90 balls in the bag. The remaining 90 might all be red or all blue, or any share of red and blue in between. Yet the purpose of such a significance test analysis is to find out via sampling something about the balance of colours in the bag. Without knowing what is in the bag there is no way of assessing how improbable it is that the sample has ended up with seven red balls. Once this impossibility is realised, the pointlessness of significance testing becomes clear.

What a significance test does instead is to make an artificial assumption about what is in the bag. Here the null hypothesis might be that the bag contains 50 balls of each colour at the outset. Knowing this, it becomes relatively easy to calculate the chances of picking seven reds and three blues in a random sample of 10. If this probability is small (traditionally less than one-in-20, or 0.05) it is customary to claim that this is evidence that the bag must have contained an unbalanced set of balls at the outset. This claim is obviously nonsense. The assumption of the null hypothesis tells us nothing about what is actually in the bag. For example, imagine that the bag started with 80 red balls and 20 blues. The sample is drawn as above, and contains seven reds. The significance test approach assumes that there are 50 reds in the bag and calculates a probability of getting seven in a sample of 10. This probability will clearly be incorrect because the balls are in fact less balanced than the null assumption requires. Now imagine that the sample is still the same but that the bag had 80 blue balls and only 20 red originally. The significance test approach again assumes that there are 50 reds in the bag, and calculates the probability of getting seven reds in a sample of 10. This probability will also clearly be incorrect because the balls are less balanced than the null assumption requires. More absurdly, this second probability must be the same as the first one, since they are both calculated in the same way on the same assumption. So the significance test would give exactly the same probability of having drawn seven reds in a random sample from a bag of 80 per cent reds as from a bag of 20 per cent reds. This absurdity happens because the test takes no account of the actual proportion of each colour in the population. It cannot, since finding out that balance is supposed to be the purpose of the analysis.

Of course, the probability of getting seven reds from a bag containing 80 reds is different, a priori, to the probability of getting seven reds from a bag containing 20 reds. But the significance test is conducted post hoc. There is no way of telling what the remaining population is from the sample alone. To imagine otherwise would be equivalent to deciding that rolling a three followed by a four with a die showed that the die was biased (since the probability of that result is only 1/36, which is much less than five per cent, of course).
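The point can be demonstrated directly. The sketch below (Python, written for this discussion; it uses the binomial approximation of sampling with replacement for simplicity) computes the one-tailed probability of seven or more reds in ten draws. The calculation uses only the null assumption of a 50/50 bag, so the answer cannot change however the real bag happens to be composed:

    from math import comb

    def p_at_least(k, n=10, p_null=0.5):
        # Chance of k or more reds in n draws, computed entirely under the
        # NULL assumption p_null; the real bag never enters the calculation.
        return sum(comb(n, i) * p_null ** i * (1 - p_null) ** (n - i)
                   for i in range(k, n + 1))

    # Seven reds out of ten, 'tested' against a 50/50 null: p is about 0.17.
    # The printed value is identical whether the bag really held 80, 50 or
    # 20 red balls, because none of those figures appears anywhere above.
    for true_reds in (80, 50, 20):
        print(true_reds, round(p_at_least(7), 3))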


For anyone who has spotted this misunderstanding, there is little doubt that their use of significance testing would cease (Falk & Greenbaum, 1995). No one wants to know the probabilistic answer the tests actually provide (about the probability of the observed data given the assumption), and the test cannot provide the answer analysts really want (the probability of the assumption being true given the data observed). This conclusion is not new (Harlow et al., 1997). It has been known for a long time, perhaps since their earliest adoption, that significance tests do not work as hoped, and may well be harmful because their results are so widely misinterpreted (Carver, 1978). Yet unwary methods resources and purported experts continue to peddle the fiction that p-values are, or are closely related to, the probability of the sample result being 'true', real or relevant. Relatively recent examples among many include the following, in a textbook on social science methods:

[Statistical significance is] 'the likelihood that a real difference or relationship between two sets of data has been found' (Somekh & Lewin, 2005, p.224).

And perhaps even more worrying is the 'explanation' (in relation to statistical modelling) given during the training of heavily selected UK national experts in rigorous evaluation:

Significance of b4 indicates whether there is evidence of an interaction effect (Connolly, 2013, slide 5).

Both of these explanations are the wrong way around. The 'significance' value is really the likelihood of finding a fake 'difference' or 'effect' if none actually exists. This is a very different value to the likelihood of there actually being a difference or effect. It is like saying that the probability of being a professional footballer if a person is over six feet tall is the same as the probability of a person being over six feet tall if they are a footballer. The first of these values will be much, much smaller than the second. To confuse the two, as the supposed experts above do, is to make a very serious mistake. It is possible to convert one figure to the other using Bayes' theorem, as long as the unconditional probabilities are already known (such as what proportion of people are footballers and what proportion are over six feet tall). But there would be no point in conducting a significance test in this situation, since both conditional probabilities could be calculated precisely.
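The conversion itself is a one-line application of Bayes' theorem. In the sketch below (Python; all three input probabilities are invented purely for illustration), p(footballer|tall) turns out to be a tiny fraction of p(tall|footballer):

    # Bayes' theorem: p(A|B) = p(B|A) * p(A) / p(B).
    p_tall_given_footballer = 0.60  # say most professional footballers are tall
    p_footballer = 0.0002           # but very few people are footballers
    p_tall = 0.15                   # while many people are over six feet tall

    p_footballer_given_tall = p_tall_given_footballer * p_footballer / p_tall
    print(p_footballer_given_tall)  # 0.0008: far smaller than the 0.60 above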


Misunderstanding and misrepresenting confidence intervals

Faced with increasing criticism of significance testing and its abuse, in 1999 the American Psychological Association (APA) set up a Task Force on Statistical Inference. This considered a ban on the reporting of such tests in all APA journals. Unfortunately, their final recommendation fell short of such a radical but useful step, and the APA instead focused on moving beyond significance to a consideration of the 'precision' of any research findings. Its influential publication manual now states that:

[Null hypothesis significance testing] is but a starting point and that additional reporting elements such as effect sizes, confidence intervals, and extensive description are needed (APA, 2010, p.33).

This is a shame because confidence intervals use the same underlying logic as significance tests, share the same fatal flaws, and are at least as widely misunderstood. For example, talking about confidence intervals, Goldstein (2008, p.399) says of their use in value-added calculations:

A confidence interval provides a range of values that, with a given probability – typically 0.95 – is estimated to contain the true value of the school score.

Connolly (2007, p.149) says that a 95 per cent confidence interval shows that:

There is a 95 per cent chance that the true population mean is within just under two standard errors of the sample mean.

Both of these statements are wrong. With population data, or where the true population value (such as its mean) was already known, there would be no need for confidence intervals (CIs). A CI is calculated only from the sample value, and no reference at all is made to the true population value (how could it be?). Instead of the above, a CI for a sample value means precisely this:

If we assume that the value from a complete random sample is identical to the true population value, then the CIs of many repeated complete random samples of the same size would contain the population value for 95 per cent (or the selected interval) of these samples.

This is why any reported CI for a specific sample is centred around the sample value. Of course, based on this correct definition (of how a CI is actually calculated) the technique is completely useless. It cannot be used to assess how close the sample value is to the unknown population value, because it is based on the assumption that the two are identical from the outset. As soon as it is allowed that the two might differ at all, then the calculation of the CI fails. If the sample mean is not at the precise centre of the normally distributed population (or sampling distribution) then it is not true that 95 per cent of the population will lie within 1.96 standard deviations of the sample mean. The absurdity of this kind of artificial calculation is perhaps even clearer when considering what happens in an example. Imagine that a sample mean was 50, and that this was drawn from a population with mean 60. The CI would have a particular range centred around 50. Now imagine that all else remains the same but that the population mean was actually 70. The CI would remain the same, because the CI is unrelated to the actual population mean. This suggests that a CI based on an estimate of 50 for a real value of 60 would imply the same level of accuracy as for a real value of 70. In practice, and even when used as intended, CIs are pointless. Worse than this, because even purported authorities are explaining their interpretation incorrectly, they are being used to draw invalid inferences. Again, money and research effort are being wasted, and those intended to benefit from research may be being harmed. Simply stating the number of cases underlying any sample value is sufficient and valid.
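Both halves of this argument can be checked by simulation. The sketch below (Python; the population figures are invented for illustration) first confirms the correct frequentist reading, that about 95 per cent of intervals from repeated complete random samples straddle the population mean, and then shows that any single interval is built from its sample alone, and so is unchanged by whatever the population value actually is:

    import random
    import statistics

    random.seed(1)

    def ci95(sample):
        # A 95 per cent CI computed from the sample alone (normal approximation).
        m = statistics.mean(sample)
        se = statistics.stdev(sample) / len(sample) ** 0.5
        return m - 1.96 * se, m + 1.96 * se

    population = [random.gauss(60, 10) for _ in range(100_000)]
    true_mean = statistics.mean(population)

    # Correct reading: coverage across many repeated random samples.
    hits = 0
    for _ in range(2_000):
        lo, hi = ci95(random.sample(population, 50))
        hits += lo <= true_mean <= hi
    print(hits / 2_000)  # roughly 0.95

    # But one interval knows nothing of the population: the same sample
    # gives the same interval whatever the true mean is assumed to be.
    sample = random.sample(population, 50)
    print(ci95(sample))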

What should happen instead?

There is a tendency to want to cling to traditional statistics, not understanding them or even knowing that they do not make sense, due to not being sure what to do instead. In general, the answer is that nothing should be done instead. Removal of the error is improvement enough. In the paper used as an example above (among countless others), Carr and Marzouq (2012) presented Table 1 (p.7), reproduced below, with textual discussion of these findings (p.6):

As seen in Table 1 children endorsed all four of the achievement goals to similar degrees. However, the range of responses for both the mastery-approach and mastery avoidance scales were narrower than the performance scales and were focused at the top end of the scale. Correlations between goals (Table 1) are consistent with the 2 x 2 framework where goals sharing a dimension (mastery/performance or approach/avoidance) are positively correlated while those not sharing a dimension are unrelated (Elliot & McGregor, 2001; Elliot & Murayama, 2008). Although this pattern of correlation is evident in this sample the association between mastery approach and performance approach goals is smaller than expected, just approaching significance.

Table 1: Descriptives and intercorrelations for achievement goal responses.

Variable                  M     SD    Min   Max   Range  1      2      3
1. Performance Approach   3.39  1.07  1.00  5.00  4.00   –
2. Performance Avoidance  3.58  .88   1.67  5.00  3.33   .69**  –
3. Mastery Approach       4.51  .53   3.00  5.00  2.00   .20+   .10    –
4. Mastery Avoidance      3.92  .89   3.00  5.00  3.00   .13    .41**  .42**

+p<0.06; *p<0.05; **p<0.001 (one-tailed)


Clearly, much of this reporting is incorrect. With a convenience sample of 58 children from one school, Carr and Marzouq (2012) should not be discussing statistical 'significance' or quoting p-values. Therefore, parts of the report, such as the gobbledegook at the foot of the table, can simply be removed. In addition, the use of decimal places should be curtailed. It is unlikely that the reported means are really accurate to five one-thousandths of a unit in a study measuring things as vague as 'performance approach' with only 58 cases. The result could look like this:

As seen in Table 1 children endorsed all four of the achievement goals to similar degrees. However, the range of responses for both the mastery-approach and mastery avoidance scales was narrower than the performance scales and focused at the top end of the scale. Correlations between goals (Table 1) are consistent with the 2 x 2 framework where goals sharing a dimension (mastery/performance or approach/avoidance) are positively correlated while those not sharing a dimension are unrelated (Elliot & McGregor, 2001; Elliot & Murayama, 2008). Although this pattern of correlation is evident in this sample the association between mastery approach and performance approach goals is smaller than expected.

Table 1: Descriptives and intercorrelations for achievement goal responses.

Variable                  M    SD   Min  Max  Range  1     2     3
1. Performance Approach   3.4  1.1  1.0  5.0  4.0
2. Performance Avoidance  3.6  0.9  1.7  5.0  3.3    +0.7
3. Mastery Approach       4.5  0.5  3.0  5.0  2.0    +0.2  +0.1
4. Mastery Avoidance      3.9  0.9  3.0  5.0  3.0    +0.1  +0.4  +0.4

Nothing much has changed with the invalid p-values removed. If the findings of the paper were important (or not) before, they remain so now that they are reported without abusing statistics. It is entirely possible that making the results simpler, and not misleading readers or even the researchers themselves with false probabilities, would encourage a greater emphasis on the analytical issues that really matter and on the substantive (or not) nature of the results. Key issues in this example appear to be whether the measures are measuring anything at all, whether they can measure it accurately, how they could be calibrated, what the bias might be in the sample, the nature of any non-response, and how any of these initial errors might propagate through ensuing calculations. The answers to these questions, and others like them, will help readers and researchers decide whether the results warrant the claim in the paper – that the researchers have tested 'the 2 x 2 achievement goal model' (p.6). Moving away from the convenient but invalid push-button approach to analysis might yield benefits beyond mere cessation of the abuse. It might introduce more transparency and judgement in reporting (Gorard, 2006).


Conclusion

When an analyst is trying to decide on the substantive importance of an apparent research finding, they are faced with a number of alternative explanations. If they have used a random sample then one of these explanations is that the result is a fluke introduced by sampling variation. This is the explanation that significance testing, confidence intervals and associated statistical techniques are meant to address (but which they do not). However, this is only one explanation. Other methods-based explanations include design errors, bias in the sample, errors in measuring or recording data, researcher effects and so on. These other explanations ought to be considered and discussed whether the sample is a random one or not, or even if a population is involved. But the current abuse of significance testing seems to have replaced all other considerations. What should happen instead of the false logic of statistics is a greater focus on the meaning and authority of the evidence that analysts uncover, using transparent judgement to decide whether a difference is worth pursuing or whether a coefficient is worth retaining in a model. There are a number of simple techniques that can assist in making and portraying these judgements, including graphical displays, and a range of effect sizes from odds ratios to R².

Of course, none of the above is any kind of argument against measurement or the crucial role of numbers in social science research. This paper is rather an argument that researchers should take numbers more seriously, and think rather more carefully than at present about their meaning. Similarly, this is not an argument against the random selection of cases in a sample, or the random allocation of what is in effect population data to treatment groups in a trial design. Randomisation is the best protection against imbalance or bias in the sample or the groupings, both in terms of known characteristics that could be matched and in terms of the unknown characteristics that any attempted matching procedure is forced to neglect (Gorard, 2013). But random sampling is used to minimise bias, not so that significance tests can be run. With a high quality random sample the best estimate of any equivalent value for the linked population will be the sample value. No amount of dredging with the sample data alone (as happens with standard errors, significance tests and CIs) can improve this estimate.

Real people, their lives, well-being, health and education are affected by evidence-informed decisions in policy and practice. At present, these decisions are (unknown to policy-makers and practitioners) overly influenced by a superstitious ritual that few seem to understand but many seem happy to follow and pass on to new researchers. This ritual was described by Rozeboom (1997, p.335) as:

surely the most bone-headedly misguided procedure ever institutionalised in the rote training of science students.

As illustrated above, the techniques of sampling theory statistics do not work as intended, and can give misleading results leading to vanishing breakthroughs and even harmful interventions. It is time for this wasteful and dangerous nonsense to cease.

Correspondence
Professor Stephen Gorard
Professor of Education and Well-being,
School of Education,
Durham University,
Leazes Road,
Durham DH1 1TA.
Email: s.a.c.gorard@durham.ac.uk


References
American Psychological Association (APA) (2010). Publication manual of the APA (6th ed.). Washington, DC: APA.
Carr, A. & Marzouq, S. (2012). The 2 x 2 achievement goal framework in primary school: Do young children pursue mastery-avoidance goals? The Psychology of Education Review, 36(2), 3–8.
Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378–399.
Connolly, P. (2007). Quantitative data analysis in education. New York: Sage.
Connolly, P. (2013). Analysis of Randomised Controlled Trials (RCTs). Presentation to Conference of EEF Evaluators: Building Evidence in Education, London.
Falk, R. & Greenbaum, C. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory and Psychology, 5, 75–98.
Fielding, J. & Gilbert, N. (2000). Understanding social statistics. London: Sage.
Goldstein, H. (2008). Evidence and education policy – some reflections and allegations. Cambridge Journal of Education, 38(3), 393–400.
Gorard, S. (2006). Towards a judgement-based statistical analysis. British Journal of Sociology of Education, 27(1), 67–80.
Gorard, S. (2010). All evidence is equal: The flaw in statistical reasoning. Oxford Review of Education, 36(1), 63–77.
Gorard, S. (2012). The increasing availability of official datasets: Methods, opportunities, and limitations for studies of education. British Journal of Educational Studies, 60(1), 77–92.
Gorard, S. (2013). Research design: Robust approaches for the social sciences. London: Sage.
Harlow, L., Mulaik, S. & Steiger, J. (1997). What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum.
Porter, T. (1986). The rise of statistical thinking. Princeton: Princeton University Press.
Rozeboom, W. (1997). Good science is abductive, not hypothetico-deductive. In L. Harlow, S. Mulaik & J. Steiger (Eds.), What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum.
Siegel, S. (1956). Non-parametric statistics for the behavioural sciences. Tokyo: McGraw Hill.
Somekh, B. & Lewin, C. (2005). Research methods in the social sciences. London: Sage.



Open Dialogue peer review:
A response to Gorard
Professor Gene V. Glass

Random selection, random assignment and Sir Ronald Fisher

In every generation, the widespread misunderstandings of the application of inferential statistical techniques in psychological research must be dispelled. In the 1960s, Rozeboom (1960), Grant (1962), and a few others performed the service. In 1978, the late Paul E. Meehl shined light on what was going horribly wrong with empirical research in the 'soft' social sciences, among which he included social, clinical and educational psychology. The title of Meehl's article hints at many of the issues currently on the table: 'Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology.' As with 'The widespread abuse of statistics by researchers: What is the problem and what is the ethical way forward?', Meehl even opens his treatment of the then prevalent methodological mess with an exposition of Popper's famous modus tollens.

Stephen Gorard (2014) has performed the function of setting straight the misapplication of statistical significance testing for this generation of educational psychologists. I hasten to record my agreement with nearly everything he has written on the topic. Let me briefly make one small point of elaboration and two smaller points of clarification.

The fiction that probability statements are meaningful in the absence of random acts underlying them is preposterous. For example: a researcher grabs the 50 closest undergraduates available to her, asks how many are left-handed – answer: eight – and proceeds to place a 95 per cent confidence interval around the proportion; conclusion: 'The interval .08 to .29 was generated by a process that has a 95 per cent probability of capturing the proportion of persons in the population who are left-handed.' Now there is nothing wrong with the way this finding is reported; but the question remains, 'What population?' The fact is, there is no population to which an inference is being made. Our researcher has made a probabilistic statement about a non-probabilistic event.

I remember questioning my superiors – whom I shall not name, but who were icons in the history of educational psychology – about this situation many decades ago. At the time, their answer was accepted ex cathedra: 'The population is that population from which this sample could have been randomly drawn.' Of course, now, looking back, this answer is absurd, and was only given to justify applying the statistical techniques that seemingly aggrandised some rather pedestrian research. If I were forced to describe the population 'from which this sample could have been randomly drawn', I could only imagine a population that looks like the sample itself, only much larger in number. If non-random samples are just replicas of the populations from which they are drawn, then there is no need for any kind of inference. On this, Gorard and I are in perfect accord.

There exists one important application of statistical tests that does not depend on random sampling from populations. Sir Ronald A. Fisher invented it and illustrated it in his famous text The Design of Experiments (1935). When subjects are randomly assigned to conditions in a comparative experiment, the eventual observed results can be compared with what would be expected if the random assignment had been the sole cause of any observed difference.


And this comparison can legitimately be stated in the form of a probability statement, for example, 'The difference observed would occur fewer than one time in 1000 by mere random assignment alone.' The resulting test is known as a permutation test (see 'permutation test' in http://en.wikipedia.org/wiki/Permutation_test).
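A minimal sketch of such a permutation test is given below (Python; the two sets of outcome scores are invented for illustration). The random assignment is re-run many times by shuffling the pooled scores, and the observed difference in means is compared with the distribution that assignment alone produces:

    import random
    import statistics

    random.seed(1)

    # Invented outcome scores for a small two-arm experiment.
    treatment = [14, 18, 11, 16, 19, 15, 17, 13]
    control = [12, 10, 15, 9, 13, 11, 14, 10]
    observed = statistics.mean(treatment) - statistics.mean(control)

    # Shuffle the pooled scores and re-split them into arms of the original
    # sizes, mimicking random assignment being the only cause of differences.
    pooled = treatment + control
    n_perm = 10_000
    extreme = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        diff = statistics.mean(pooled[:8]) - statistics.mean(pooled[8:])
        extreme += diff >= observed
    # How often chance assignment alone matches the observed difference.
    print(observed, extreme / n_perm)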
One final point. I studied some of my statistics under G.E.P. Box, the famous statistician at the University of Wisconsin, and the son-in-law of Sir Ronald A. Fisher. It was common in those days for professors to punish students who gave interpretations of confidence intervals like this: 'The probability is 95 per cent that the population proportion of left-handed persons lies between .08 and .29.' Box, however, said: relax; it makes perfect sense when understood in the Bayesian sense.

Correspondence
Professor Gene V. Glass
Arizona State University and
University of Colorado, Boulder.
Email: glass@asu.edu

References
Fisher, R.A. (1935). The design of experiments. Edinburgh: Oliver & Boyd.
Gorard, S. (2014). The widespread abuse of statistics by researchers: What is the problem and what is the ethical way forward? The Psychology of Education Review, 38(1), 3–11.
Grant, D.A. (1962). Testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review, 69(1), 54–61.
Meehl, P.E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806–834.
Rozeboom, W.R. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57(4), 416–428.



Open Dialogue peer review:
A response to Gorard
Professor Christine Howe

In what, I am sure, is an intentionally iconoclastic analysis of the use of statistics in social science research, Stephen Gorard makes five major points. First, statistical inference is often couched in terms that resemble modus tollens, which creates an inappropriate aura of logic. Second, sampling theory techniques are frequently used when working with populations, when they are only appropriate with samples that are drawn from populations. Third, sampling theory techniques presuppose random samples, and randomisation is not always achieved. Fourth, significance tests presume some 'truth' at the population level and test the probability of obtaining observed sample values given that truth. However, social scientists often assume that they can infer truths at the population level given sample values, that is, make precisely the reverse inference. Fifth, in many cases, it would be preferable to publish descriptive data only, accompanied with measures of effect size. In the following, I shall discuss the points, not from the perspective of social science in general but of psychology in particular, psychology being the discipline in which I have worked for most of my career. The gist of my argument is that while Stephen Gorard's points do have relevance for psychology, and I therefore welcome them, the relevance is of a lower magnitude than his paper envisages, and for somewhat different reasons.

I have relatively little to say about the first couple of points. Statistical arguments do indeed sometimes follow the linguistic structure of modus tollens. Moreover, there is no doubt that, as Stephen Gorard points out, the validity of statistical arguments depends upon contingent probabilities and not logic. However, I am unconvinced that the distinction is seldom understood, and, therefore, that 'the misunderstanding caused by this assumption is widespread'. Surely the distinction between logical and probabilistic reasoning is generally appreciated within the academic community. As regards the second point, it is certainly inappropriate to use techniques drawn from sampling theory when working with populations, but research based on population data is actually quite rare within psychology. The examples I can think of are concerned with atypical groups where the numbers are relatively small, for example, in forensic and clinical contexts or relating to developmental disorders. However, the vast majority of studies in psychology are concerned with samples.

Ideally the samples used in research should be random, and, moving to Stephen Gorard's third point, it can be a cause of concern when randomisation is not achieved. Nevertheless, I do not agree that sampling theory techniques should be deemed inappropriate just because the sample was based on opportunity, convenience or snowballing. In my opinion, the onus lies upon researchers who use such methods of selection to demonstrate representativeness, and to make a case for treating their sample as if it were random, that is, to defend the judgement that Stephen Gorard acknowledges is inevitable in such circumstances. In other words, my views are closer to those of the APA task force that addressed the issue in the 1990s: 'Using a convenience sample does not automatically disqualify a study from publication, but it harms your objectivity to try to conceal this by implying that you used a random sample. Sometimes the case for the representativeness of a convenience sample can be strengthened by explicit comparison of sample characteristics with those of a defined population' (Wilkinson and the Task Force on Statistical Inference, 1999, p.595).


If the case for representativeness is strong, sampling theory techniques may prove as useful with convenience samples as with a truly random sample, and, like the many thousands of psychological studies conducted in these circumstances (for any of an enormous number of practical reasons), the results will turn out to generalise. If the case is weak, the weakness will normally be exposed at some point along the line, whether or not sampling theory techniques were used and abused.

From a psychological perspective, the key point in Stephen Gorard's paper is probably the fourth one, relating to significance testing, for significance tests are undoubtedly pre-eminent amongst the discipline's nuts and bolts. Fortunately, they are not typically used in an inappropriate fashion, for the starting point in much psychological research is indeed the presumption of populations that do not differ, that is, the situation amongst those depicted in the paper where the bag of balls contains 50 red balls and 50 blue balls. When the research is concerned with putatively universal human characteristics like working memory, the populations are the human race as it currently exists and the human race as it might be if the characteristics were different. In other words, the populations are not necessarily real, let alone in possession of qualities that research aspires to document. When the focus is upon variables like quality of parenting, teaching or peer relations, the populations are those who are privy to high quality experiences and those whose experiences are not so good. Here the bag of balls analogy may need extending to a continuum, but the general point continues to apply. This is that in all cases, the null hypothesis is no difference at the population level, and that samples are assessed to estimate the probability of their parameters given this hypothesis. For this reason, generations of psychology undergraduates are trained to write laboratory reports in terms of testing and sometimes rejecting the null hypothesis.

In other words, it is part of the disciplinary wisdom that, as Stephen Gorard writes, 'the significance value is really the likelihood of finding a fake difference or effect if none actually exists'. Moreover, far from 'no one [wanting] to know the probabilistic answer the tests actually provide', such answers lie at the heart of what psychology is about. The reason for this apparent perversity is that, as well as testing null hypotheses, psychologists reject these when the probability of sample parameters is relatively small (conventionally .05 or lower) and infer that alternative hypotheses of population difference are more plausible. Once more, psychology undergraduates are taught to regard such inferences as conjectures, not as necessary consequences of significance testing, and to avoid at all costs extrapolating probabilities from those computed for the null hypothesis. However, space in academic journals is tight, and what is spelled out in student reports typically becomes implicit in professional publications. In particular, a tradition has developed of expressing alternative hypotheses (e.g. as the claims to be examined), and leaving unstated the null hypotheses that are actually tested with sampling theory techniques. I should like to think that most psychologists know that this practice is only a tradition, although some may, I guess, forget this with the passage of time. I certainly do not believe that it results from what Stephen Gorard refers to as 'incompetence or intention to deceive'. Where I detect dangers is in the dissemination of psychological research to the wider public, because in that context uncritical use of disciplinary discourse and presumption of shared understanding could mislead. Stephen Gorard's point about real people being affected by research evidence is well taken.


In general then, I do not think there is widespread misuse of significance testing within psychology, but does such testing actually add something? I am unfamiliar with the hapless study of achievement goals that Stephen Gorard cites. (Why this particular study, I kept wondering?) Its convenience sample may or may not be legitimately treated as if it were random. Its measures may or may not be appropriately regarded as interval scales. However, assuming the prerequisites for correlational analysis are fulfilled, the asterisks in the first table and the p-values at its foot strike me as more than 'gobbledegook'. Rather, they reflect tests that take account of sample parameters (e.g. N) in a more formal and standardised fashion than could be achieved using the second table, even if this table were accompanied with graphical displays like scatter-plots. Of course, like any acceptance of alternative hypotheses, the reasoning process is inferential and provisional, and for that reason I could not agree more with Stephen Gorard's call for focus on 'the meaning and authority of the evidence analysts uncover'. The proof of the pudding, though, is ultimately in the eating. Just as I argued earlier for sampling, so I would argue for significance testing: to the extent that null hypothesis testing and conjectured alternative hypotheses result in a body of knowledge that advances theoretical understanding and practical action, it can, despite undoubted abuses, be justified. Dud psychology is not unusual and sometimes the problems lie with misused statistics. Nevertheless, on balance, I think the evidence is positive.

Correspondence
Professor Christine Howe
Professor of Education,
Faculty of Education,
University of Cambridge,
184 Hills Road,
Cambridge CB2 8PQ.
Email: cjh82@cam.ac.uk

Reference
Wilkinson, L. and the Task Force on Statistical Inference (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.



Open Dialogue peer review:
A response to Gorard
Professor Dave Putwain

Purism over pragmatism?

Stephen Gorard outlines in his paper several ways in which social scientists may not be using statistical techniques in the correct way(s), or the way(s) in which they were intended. By and large, this position and the sentiments behind it are to be applauded. Would one wish to be using analytic techniques inappropriately or wrongly in one's research? Would you wish to inadvertently pass on the use of inappropriate techniques to others as part of your teaching? Even if this were the case, I am not sure that it would strictly represent an abuse of statistics, which, according to my dictionary, indicates that there must be a moral or ethical dimension. Stephen's paper does not set out the case for why using statistical techniques incorrectly is ethically or morally dubious, although he does hint where this could lie: where, for example, findings from wrongly-used statistical analyses are used to inform or justify policy decisions, with poor outcomes. It would seem more appropriate to describe the arguments and practices set out in his paper as the misuse, rather than abuse, of statistics. I presume that the term abuse was used as a rhetorical device intended to set out a position and provoke a response, and there is nothing wrong with that, per se. So, based on my understanding of the thrust of the arguments presented in the paper, I will respond to the misuse of statistics and focus on the use of inferential statistics with samples that are not randomly sampled.

I have chosen to focus on this aspect in particular partly because some of the other issues in the paper (e.g. misuse of confidence intervals) are corollaries of this. In addition, this is the issue (of those raised in the paper) that, from my reading of the educational psychology literature, arises most frequently. One could pick up any educational psychology journal and find examples of researchers performing inferential statistical analysis on samples that were anything other than random. To illustrate this, a recent edition of the Journal of Educational Psychology (August, 2013), arguably one of the most prestigious journals in the field of educational psychology, and one which should represent the most learned and pre-eminent scholars, contained 24 quantitative articles. All of them used inferential statistics, yet none reported a random sample. Stephen and I would agree that it is common practice. I think it is only fair to note that Stephen's argument is technically correct. The various analytic approaches used in this edition of the Journal of Educational Psychology included almost every analysis that one could try: bivariate and intraclass correlation coefficients, t-tests, analysis of variance, single and multi-level regression models and structural equation models. All were accompanied by tests of significance, which estimated the likelihood of sampling error on the basis that the sample was a random one. With some caveats regarding the use of multilevel models in which data were clustered, none of these studies had random samples. Although not reported in the majority of cases, one would infer that these were purposeful or convenience samples and that, strictly speaking, these techniques should, therefore, not have been used.


In the majority of cases, attaining random samples from populations of interest is not a realistic goal for psychological or educational researchers. Either persons or groups are unavailable or unwilling to participate for perfectly good reasons, or it is simply not possible to obtain the kinds of information about the population required to select persons or groups at random. Indeed, it would only require one person from a random sample to withdraw their participation or data and that sample is no longer random. Hence, most researchers use the next best available option, which is typically a convenience sample. The approach advocated in Stephen's paper is to subsequently drop inferential analyses where sampling is anything other than random. I would like to consider the principal reason why I believe researchers in the field of educational psychology, and perhaps more widely in the fields of education, social science and psychology, continue to use such methods of analysis. Given the standard of statistical expertise taught in undergraduate UK psychology courses (to meet thresholds for BPS accreditation) and on ESRC-recognised Master's level courses in pre-doctoral training, along with the explanations and advice offered in popular textbooks which focus on statistical analysis, I would argue that the majority of analysts know that these are techniques that require the use of random samples and that their purpose is to establish sampling error. I propose that the continued use of inferential statistics with non-random samples does not, necessarily, arise from ignorance about their intended purpose.

Rather, it is a pragmatic decision, although rarely articulated, that if a non-random sample shows broadly the same characteristics as a random sample would, then it is permissible to use inferential statistics with the aim of attempting to establish sampling error. For instance, if an analyst is dealing with continuous data and expects a random sample to be normally distributed, but only a convenience sample is available (as is typically the case), and if the convenience sample was normally distributed (within acceptable limits), it is typically treated as if it were random. Researchers still wish to treat their data as if they represent an underlying population and, accordingly, to estimate the degree of error present in their sample parameters (means, correlation coefficients, path coefficients and so on) as they pertain to an underlying population. The critical issue is the extent to which the convenience sample would have resembled the random sample, had such a sample been available. If the random and convenience samples are considered to be well matched, then conclusions about the degree of sampling error are likely to be very similar, the substantive conclusions regarding research findings will remain unchanged, and one can move forward to make recommendations for practice or policy without undue concern that such recommendations are erroneous or misplaced.

I would argue that it is not so heinous a decision to continue using inferential statistics, but that analysts should pay more careful consideration to the sample characteristics, something that Stephen highlights. The reporting of distribution characteristics for continuous data (such as skewness and kurtosis) appears to be becoming more common in educational psychology journals, but it should be essential if one is using inferential statistics when, in the strictest sense, one shouldn't. Interestingly, in the example cited in Stephen's paper (Carr & Marzouq, 2012), there was absolutely no difference to the interpretation of the findings, regardless of whether the inferential statistics accompanied the interpretation of means and correlations. Stephen uses this to support his argument that it is possible to interpret findings without the use of inferential statistics, and I would not disagree. However, in another sense it also undermines that argument. When inferential statistics were used, they did not result in misinterpretation, and it is difficult to imagine what the ethical and moral consequences of this might be. So the issue, to me, boils down to one simple point: purism or pragmatism? If one is a purist, these practices are not acceptable; if a pragmatist, then they are permissible.


If Stephen is determined to persuade the pragmatists that they are wrong, I don't think they will be convinced solely by a technical argument that they may already be aware of. However, they may be swayed by the possible moral or ethical implications of using inferential statistics with non-random samples. So, in addition to the technical points made in Stephen's paper, I would like to see some clear examples of research or analysis where the misplaced use of inferential statistics has led to erroneous conclusions that have subsequently led to poor changes in practice or policy. This would considerably strengthen the claim that there are ethical and moral reasons for not using inferential statistics in this way, and may help convince the pragmatists that there are good reasons for engaging more deeply with the technical argument. While I applaud Stephen's aim to generate discussion, I fear that without this additional aspect, his argument will not have the impact he is hoping for.

Correspondence
Professor Dave Putwain
Centre for Literacy and Numeracy Research,
Faculty of Education,
Edge Hill University,
St. Helen's Road,
Ormskirk,
Lancashire L39 4QP.
Email: putwaind@edgehill.ac.uk



Open Dialogue peer review:
A response to Gorard
Dr Ben Styles

It is right to question the application of frequentist statistics to any sample in any field of research. When making a point estimate of a population parameter, this question is fundamental. If the sample were random but not simple random, then we have to take into account the sample design when estimating our parameter and its standard error. If the sample were one of convenience, or the response rate of a random sample were low, then we are not strictly justified in assigning confidence intervals to point estimates, regardless of any sophisticated weighting we might carry out. When asking whether an internally valid result from an RCT is generalisable to a wider population the question is the same: if an intervention is shown to have a certain effect size and 95 per cent confidence interval, are we justified in claiming that, had we run the trial many times within the population in question, 95 per cent of the confidence intervals produced would contain the true effect size? Unless we had carried out random sampling of subjects, the answer has to be no. It is very often the case in education research evaluations, including RCTs, that the sampling was not random. Even if it purports to be a random sample of schools, the response rate is unlikely to be sufficient to avoid bias. Gorard is, therefore, right to question the use of frequentist confidence intervals. However, he goes too far in advocating a world where we do not attempt to measure uncertainty, and this short response attempts to show how the use of these traditional techniques can be beneficial even for the worst samples. The response does not attempt to unpick the notion that frequentist confidence intervals are useless even for a completely random sample; this latter argument may well benefit from a Bayesian statistician's response. Questioning how uncertainty is reported is sensible; arguing for its removal from reporting is against a fundamental tenet of research itself.

I see two arguments that justify the retention of frequentist confidence intervals in certain scenarios when sampling was not random. The first concerns the generalisation of effect size and is more of an intuitive argument than a statistical one. The second concerns the concept of a virtual population and is more justifiable.

Generalisation of effect size

I believe it is wise, up to a point, to operate less stringent sampling protocols for the estimation of differences in outcomes between groups than is necessary for point estimates. Such flexibility in education research is needed due to the consistent difficulty of recruiting schools. The groups might be randomised arms of a trial, or test-takers whose results on two or more tests are being equated. What matters in these scenarios is that the relative performance of the groups can be compared. Providing the allocation to trial arms (or tests) was at random and attrition was minimal, we have no worries about bias; what is at stake here is how we report uncertainty. Can we be more justified in generalising a group difference from a non-random sample than a population parameter point estimate? Strictly speaking, we cannot generalise in either scenario. However, in practice, what we are often surmising in education research when we create confidence intervals around an effect size is that the difference in performance between groups of students in a trial whose schools have not been randomly sampled

20 The Psychology of Education Review, Vol. 38, No. 1, Spring 2014


© The British Psychological Society – ISSN 0262-4087
A response to Gorard
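
The 'run the trial many times' definition above is easy to check by simulation when sampling genuinely is random. The following sketch (my illustration, with a made-up population; not part of Styles' response) draws repeated random samples from a known population and counts how often the usual 95 per cent interval covers the true mean:

    import numpy as np

    rng = np.random.default_rng(0)
    true_mean, sd, n, trials = 100.0, 15.0, 50, 10_000

    covered = 0
    for _ in range(trials):
        sample = rng.normal(true_mean, sd, size=n)
        se = sample.std(ddof=1) / np.sqrt(n)  # estimated standard error
        covered += (sample.mean() - 1.96 * se <= true_mean
                    <= sample.mean() + 1.96 * se)

    print(covered / trials)  # roughly 0.95 under genuinely random sampling

The licence for that arithmetic is exactly the random sampling that, as the discussion below notes, is usually absent in education research.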

Generalisation of effect size
I believe it is wise, up to a point, to operate less stringent sampling protocols for the estimate of differences in outcomes between groups than is necessary for point estimates. Such flexibility in education research is needed due to the consistent difficulty of recruiting schools. The groups might be randomised arms of a trial or test-takers whose results on two or more tests are being equated. What matters in these scenarios is that the relative performance of the groups can be compared. Providing the allocation to trial arms (or tests) was at random and attrition was minimal, we have no worries about bias; what is at stake here is how we report uncertainty. Can we be more justified in generalising a group difference from a non-random sample than a population parameter point estimate? Strictly speaking, we cannot generalise in either scenario. However, in practice, what we are often surmising in education research when we create confidence intervals around an effect size is that the difference in performance between groups of students in a trial whose schools have not been randomly sampled might be representative of students elsewhere in the country in similar schools, should they receive the same intervention. Intuitively this appears less of a leap of faith than claiming that the actual performance, an example of a population parameter, lies between its confidence intervals (the respondent is aware of the simplification inherent in 'lies between'). The main reason why such generalisation may be reasonable is that it is for the same intervention on the same outcome, rather than the estimation of a measure that might be influenced by a myriad of other factors.
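
To fix ideas, the quantity under discussion can be sketched as follows (my illustration, with invented outcome data; the standard-error formula is one common large-sample approximation for Cohen's d, not a prescription from Styles):

    import numpy as np

    rng = np.random.default_rng(7)
    treat = rng.normal(0.2, 1.0, size=80)    # hypothetical treatment-arm scores
    control = rng.normal(0.0, 1.0, size=80)  # hypothetical control-arm scores

    n1, n2 = len(treat), len(control)
    pooled_sd = np.sqrt(((n1 - 1) * treat.var(ddof=1)
                         + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
    d = (treat.mean() - control.mean()) / pooled_sd  # Cohen's d

    # Approximate standard error of d and a conventional 95 per cent interval.
    se_d = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    print(f"d = {d:.2f}, 95% CI = ({d - 1.96*se_d:.2f}, {d + 1.96*se_d:.2f})")

Whether that interval says anything about schools outside the trial is exactly the intuitive leap described above.
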
Virtual population
The previous argument is definitely questionable, since the effect itself may also be vulnerable to this myriad of other factors. However, it might be strong enough to justify another trial or even intervention roll-out if the sample is seen to be representative 'enough'. From a frequentist point of view, when analysing the results of any trial, we need to establish how easily the results we see could have occurred by chance alone; this is the basis of a frequentist statistical test. Similarly, it is useful to estimate a confidence interval, thus encapsulating the chance element of the effect size we see. If we reject any generalisation to a wider population, Gorard's conundrum comes into sharp focus: if this were not a random sample, where are the other subjects from which we could have sampled that might give rise to the other results upon which our confidence interval is based? They certainly do not exist physically. However, it is often helpful to regard them as existing virtually. The concept of a virtual population is often used without acknowledgement, for example, when assigning a confidence interval to a school's value-added results even when all students are measured (Goldstein, 2008). Rather than conceptualising the 95 per cent confidence interval as one of many from a large series of trials run on members of a physical population, 95 per cent of which would contain the true effect, we imagine the trial being run many times on students in the same schools at the same time in a virtual population from which we did sample randomly. This allows us to quantify chance and gives meaning to the p-values and confidence intervals used. Whilst the concept is abstract, ignoring uncertainty is far worse and may result in concluding that things work when they do not and vice versa, even if this is just for the sample in question.

By introducing the concept of a virtual population, we are acknowledging that the students in the trial can be regarded as a physical 'population'. They were not randomly sampled, so no further physical population exists to which confidence intervals can apply strictly. Gorard states that 'data for a [physical] population cannot have a standard error, by definition.' Indeed, if we are estimating a population parameter and have measured everyone in the population, we need no standard error. However, things are rarely so straightforward, since the population itself may be limited in terms of the research question and may need to be seen as a sample within a larger virtual population, as illustrated in the previous paragraph.
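
One concrete way to operationalise the 'many virtual replications' thought experiment above is resampling, as in the sketch below. This is my illustration, not a procedure from Styles' response (and Gorard's reply rejects the underlying move); each bootstrap draw stands in for one run of the trial on a virtual cohort resampled from the students actually observed:

    import numpy as np

    rng = np.random.default_rng(42)
    treat = rng.normal(0.3, 1.0, size=60)    # hypothetical observed trial arms
    control = rng.normal(0.0, 1.0, size=60)

    # Re-run the 'trial' on 5000 virtual cohorts resampled within arms.
    diffs = np.array([
        rng.choice(treat, size=treat.size).mean()
        - rng.choice(control, size=control.size).mean()
        for _ in range(5000)
    ])

    # The spread of effects across the virtual replications.
    print(np.percentile(diffs, [2.5, 97.5]))
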
Merely reporting effect sizes and numbers of participants would engender a culture of conjecture around the uncertainty of any result. Bayesian statistics has a lot to offer the concerns raised by Gorard and, rather than not attempting to measure uncertainty, all users of statistics should embrace the frequentist versus Bayesian debate more seriously.

Correspondence
Dr Ben Styles
Research Director,
National Foundation for Educational Research,
The Mere, Upton Park, Slough,
Berkshire SL1 2DQ.
Email: b.styles@nfer.ac.uk

Reference
Goldstein, H. (2008). Evidence and education policy – some reflections and allegations. Cambridge Journal of Education, 38(3), 393–400.



Open Dialogue peer review:
A response to Gorard
Professor Victor H.P. van Daal & Dr Herman J. Adèr

Gorard makes two claims: (1) there is widespread abuse of statistics in the social sciences; and (2) researchers almost universally report results incorrectly. According to Gorard, the way forward is to correct the reporting, so that the abuse will disappear.

We have a number of concerns with this paper, which are listed below.
1. Gorard does not provide any hard data – and here we don't need any inferential statistics – on how widespread the abuse of statistics in the social sciences is, nor does he inform us of how 'universal' the poor reporting is. So, we don't get any idea at all about how widespread the abuse of statistics and how universal the poor reporting of results are. Furthermore, even if statistical methods were abused and reporting were poor, it remains unclear in what way and to what extent this invalidates the interpretation of published research results. Finally, is there any evidence that statistics abuse and/or faulty reporting necessarily leads to bad decisions by policy makers or other stakeholders? In our opinion, wrong decisions are caused by not being able to assess the methodological quality of any research and by not being able to assess which pieces of research are relevant for the decision to be taken.
2. The take on probabilistic reasoning presented in the beginning of the paper is not convincing. The consequence would be that, for example, weather forecasts should not be trusted, because they are based on probabilistic reasoning. For an in-depth discussion of probabilistic reasoning, see Pearl (2000). Morgan and Winship (2007) provide a more accessible treatment, while Shadish, Cook and Campbell (2002) focus on causal inferential reasoning in social science research.
3. Unlike in physics, where truly random samples are used, convenience samples dominate in the social sciences. The main problem with convenience samples concerns generalisation. What is found in a convenience sample cannot be generalised to the population from which samples are drawn. In the field of medical research a so-called RCT design (randomised clinical trial) is often used, in which different treatments (usually two) are randomly assigned to different patients, so that alternative explanations for any treatment effect found can be ruled out. However, random assignment can still go wrong, though statistical techniques are available to fix such problems (Van Renswoude, 2013). Statistical techniques cannot and should not be abandoned, because hypothesis testing and effect size estimation are essential for the interpretation of any RCT. Generalisation in medical research is achieved by conducting meta-analyses, in which studies relevant for a specific topic are systematically combined.
4. Gorard seems to join others who time and again discuss the inadequacy of statistical reasoning. However, this is not very helpful for researchers. Instead, researchers should systematically be taught to scrutinise their conclusions in view of the limitations of the statistical techniques used, so-called content robustness (Adèr, 2008), and, equally important, in view of the influence of potential confounders in their designs. In general: a researcher should take a methodological point of view rather than a statistical one.


5. Another perspective on quantitative data analysis is completely ignored in Gorard's paper: the difference between exploratory and confirmatory data analysis. His discussion is confined to the second form of analysis, whereas most data in the social sciences are of an explorative nature. Every researcher in the field is aware of this difference and of the different statistical techniques that should be applied.
6. Finally, even in an explorative data analysis, statistical inference can be applied by using cross-validation. A relatively large data set (though this requirement can be weakened) is randomly split into two parts. One part is used to describe the sample and the data, and to generate hypotheses. Corresponding statistical hypotheses are then tested on the other part of the data set.
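
A minimal sketch of this split-half procedure follows (my illustration, with simulated data and an invented hypothesis-generation rule; Van Daal and Adèr specify no particular code). Half the cases are used to spot the strongest-looking relationship, and only that pre-specified hypothesis is then tested on the held-out half:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, k = 400, 10
    X = rng.normal(size=(n, k))             # hypothetical predictors
    y = 0.3 * X[:, 4] + rng.normal(size=n)  # outcome with one real signal

    idx = rng.permutation(n)
    explore, confirm = idx[: n // 2], idx[n // 2:]

    # Exploratory half: generate a hypothesis (the strongest correlate).
    corrs = [abs(stats.pearsonr(X[explore, j], y[explore])[0]) for j in range(k)]
    best = int(np.argmax(corrs))

    # Confirmatory half: test only that pre-specified hypothesis.
    r, p = stats.pearsonr(X[confirm, best], y[confirm])
    print(f"predictor {best}: r = {r:.2f}, p = {p:.3f} on the held-out half")
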
References
Adèr, H.J. (2008). Methodological quality. In H.J. Adèr & G.J. Mellenbergh (Eds.), Advising on research methods: A consultant's companion (pp.49–70). Huizen, The Netherlands: Johannes van Kessel Publishing.
Morgan, S.L. & Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social research. New York: Cambridge University Press.
Pearl, J. (2000). Causality: Models, reasoning, and inference. Cambridge, MA: Cambridge University Press.
Shadish, W.R., Cook, T.D. & Campbell, D.T. (2002). Experimental and quasi-experimental designs for generalised causal inference. New York: Houghton Mifflin.
Van Renswoude, D.R. (2013). Random or non-random assignment: What difference does it make? In H.J. Adèr & G.J. Mellenbergh (Eds.), Advising on research methods: Selected topics 2013. Huizen, The Netherlands: Johannes van Kessel Publishing.

The Authors
Professor Victor H.P. van Daal
Edge Hill University.
Dr Herman J. Adèr
Johannes van Kessel Advising.

Correspondence
Professor Victor H.P. van Daal
Director,
The Centre for Literacy and
Numeracy Research,
Faculty of Education,
Edge Hill University,
St. Helen's Road, Ormskirk,
Lancashire, L39 4QP.
Email: vandaalv@edgehill.ac.uk



Open Dialogue peer review:
A response to Gorard
Dr Patrick White

Orard’s paper is a welcome Neither of these problems can be easily over-

G reminder of the problems that under-


mine the value of some of the most
commonly-used statistical analyses. His
come by any amount of statistical or philo-
sophical manoeuvring. The history of debate
in this area suggests, however, that when
paper concentrates on two important these issues are raised there is no shortage of
problems inherent in the use of inferential advocates ready to defend the continued use
statistics in social science that, while purport- of inferential statistics, even in cases where
edly widely recognised, have not led to any the underlying assumptions are not met (for
dramatic decline in the use of these tech- examples see some of the contributions to
niques. as Gorard acknowledges, the points Morrison and Henkel (1970) and Harlow et
he makes are by no means new; they have al. (1997)). The arguments presented in
been regularly raised by concerned favour of their continued use are often
commentators and date back to the very convoluted and sometimes display an almost
early use of this type of statistical analysis. religious attachment to these techniques
However, these views have not always been that has led some commentators to suggest
expressed as clearly or forcefully, and Gorard that users of inferential statistical are akin to
illustrates his points using the type of illus- some kind of ‘cult’ (Ziliak & McCloskey
trative examples seldom seen in other 2008). In this response I will expand on the
commentaries. While I do not expect his points made by Gorard with reference to the
paper to be warmly welcomed by many of the wider literature on the topic and my experi-
respondents in this issue, the points he raises ence of how this debate impacts on the
are important ones that cannot be ignored process of peer review.
by a responsible research community.
Meeting the assumptions for the use of
Two objections to the use of inferential inferential statistical techniques
statistics in social science as Gorard correctly points out, in social
The two key points that Gorard raises are as science research data rarely (if ever) meet
follows: the assumptions underlying the use of infer-
1. Most social science data do not meet the ential statistical techniques. The difficulty of
assumptions required for the use of achieving true random samples and full
inferential statistics. despite this, these response (both at the case and variable
techniques are widely used regardless of levels) mean that these assumptions are
whether these assumptions are met. unrealistic in the context of most social
2. p(data|hypothesis) ≠ p(hypothesis|data). science research. This does not prevent
This problem undermines the use not purported ‘experts’ ignoring these assump-
only of tests of statistical significance that tions, however, and defending the use of
produce p-values but also has implications inferential statistics with inappropriate data.
for the correct interpretation of a high profile example of this is Hans
confidence intervals, which have been rosling, hailed by some as a key player in
promoted as a more useful alternative. stimulating public interest in statistics and
increasing statistical literacy. Within the first


few minutes of his much-lauded The Joy of Stats documentary he calculates confidence intervals for data that have not been sampled randomly (Rosling, 2013). With such high profile examples of misuse, it is perhaps not surprising that flouting the assumptions required for these statistics is routine.

In social science research, using inferential statistics regardless of having non-random, incomplete or even population data is commonplace. One only has to glance at recent publications in highly-ranked social science journals to see that there is little in the way of quality control exercised by reviewers or editors in relation to the incorrect use of these techniques.

This situation would be sufficiently worrying if it stopped there. However, the effects of the widespread acceptance of the inappropriate use of inferential statistics are not limited to the publication of articles by authors who, for whatever reason, are erroneously using these techniques. My experience – and that of close colleagues – suggests that the common and accepted abuse of these techniques leads to a further problem. Authors who have quite properly eschewed the use of inferential statistics because their data do not meet the required assumptions are often asked by reviewers to include the results of these tests in their research reports in order to have them accepted for publication. I have come across this situation in papers I have authored or co-authored and also in reports from colleagues who have received these kinds of recommendations from reviewers. Most commonly, however, I have seen it in the comments from other reviewers on many occasions when I have acted as a referee. A related situation also frequently occurs when, as a referee, I recommend that the inappropriate use of inferential statistics be removed from a paper before publication.

What is notable in the situations described above is the fervour displayed by advocates of the inappropriate use of inferential statistics, both when defending their own practice and insisting that others follow suit. Also interesting are the arguments used to justify their position. In my experience the most common defence involves an appeal to the idea of a 'superpopulation'.

The myth of the 'superpopulation'
In educational research it is not uncommon for research to use population data. The populations used can range from the quite modest, such as data on all students in a single school, to those such as the National Pupil Database (NPD) that include every student enrolled in a state school in England. What these data sets have in common is that they have not been generated through processes of random sampling but cover all the cases in a particular institution, geographical area, and so on. It is quite common for analyses of these population data sets to include the use of inferential statistics. As Gorard points out, the results of such analyses are meaningless. With population data there is no need for any inference, as any analyses are conducted at the population level. Any errors in the results will not be due to random sampling and so, in any case, inferential statistics should not be used. As Berk (2004, p.42) concludes:

If the data are a population, there is no sampling, no uncertainty because of sampling, and no need for statistical inference. Indeed, statistical inference makes no sense. The only game is describing patterns in the data on hand.

Unlike in some situations in the physical sciences, in the social sciences other deficiencies in the data (such as drop-out, non-response or measurement error) cannot be assumed to be random. The problem of bias caused by non-response has been acknowledged for many years (see Hansen & Hurwitz, 1946) and repeated studies have demonstrated the non-random nature of non-response (e.g. Sheikh & Mattingly, 1981).

As mentioned above, a common defence of the use of inferential statistics with population data appeals to the idea of a 'superpopulation', 'hyperpopulation', 'infinite

population’ or ‘hypothetical universe’. This p.306) point out, there is no ‘empirically


argument is also used in cases where conven- demonstrable random generation proce-
ience sampling has been used, when non- dure’ and so no statistical inference is
response rates are very high, or in other possible. In any case, it is difficult to know
cases where a sample is clearly of a non- what any probabilities or other inferential
random nature. outputs would actually refer to as they ‘do
relating such a concept to statistical not directly answer any empirical question’
inference dates back to at least 1941 with the (Berk, 2004, p.51). These difficulties,
publication of an article by Hagood (1970) however, have not stopped many researchers
and the idea has more recently been cham- appealing to this idea. But as Freedman
pioned by Goldstein (2003) among others. (2004, p.989) argues, ‘the frequency with
Hagood (1970, p.66) defines this phenom- which the assumption has been made in the
enon as ‘the universe of all possible samples past does not provide any justification for
(which may be limited universes) which making it again, and neither does the
could have been produced under similar grandiloquent name… the problem of
conditions of time, place, culture, and other induction is unlikely to be solved by fiat’.
relevant factors’ (p.66). advocates of using Berk (2004, p.52) is similarly dismissive of
inferential statistics with population data or these arguments and the lack of critical eval-
non-random samples argue that we are ulti- uation of the assumptions that underlie
mately interested in ‘universals’, rather than them: ‘It is as if by mouthing the term super-
the actual situation at hand, and we can use population, a spell is cast making all statistical
inferential statistics to generalise from popu- inference legitimate’.
lation data to these wider truths. Goldstein appeals to ‘superpopulations’ are philo-
(2003, p.164, original emphasis) recom- sophical in nature rather than statistical.
mends considering ‘the actual population My experience of explaining this concept to
as if it was a realisation of a conceptually infi- colleagues with little expertise in statistics
nite population extending through time, usually results in gasps of disbelief that such
and also possibly through space’. This, he a concept could be taken seriously at all.
claims, enables the researcher to ‘make That the idea is fundamentally flawed has
generalisations and predictions beyond the not prevented attempts to use it to defend
units that comprise the real population’. the inappropriate use of inferential statistical
We would probably all agree that discov- techniques among educational and social
ering ‘universal’ truths that transcend time, researchers.
place (or space) and culture is desirable.
agreeing with this aim, however, is very Does any of the above matter?
different from agreeing that it is possible. The ‘appropriate’ use of inferential statistics
and it is different again from agreeing that it is the first point addressed by Gorard in his
is possible to do this using population data paper. Given that social science data are
(or other data not selected randomly) and unlikely to meet the required assumptions,
inferential statistics. Gorard suggests that the matter could end
as Berk (2004) notes, the first problem there, with the conclusion that inferential
with the superpopulation approach is that statistics are not suited to social science. His
these superpopulations are imaginary and second point, however, suggests that the
do not actually exist. The second issue is that issue of the appropriate use of inferential
even if we could agree that such ‘superpopu- statistical techniques is actually a red
lations’ exist, data used in any existing herring. rather than the debate beginning
research project cannot have been randomly and ending with this conclusion, it is actually
selected from any of these ‘superpopula- the case that we do not need to scrutinise the
tions’. as Morrison and Henkel (1970, extent to which social science data meet


these assumptions. Even if it can be demonstrated that data could or do meet the necessary requirements, the outputs of inferential statistics do not tell us anything we want to know. Put simply, they are not useful even in the (hypothetical?) cases where the underlying assumptions for their use are met.

P-values are not the only problem
One of the elements that sets Gorard's paper apart from most critiques of the use of inferential statistics is that his concern is not limited to the use of null-hypothesis statistical testing (NHST) and the associated reliance on p-values. Criticisms of the use of p-values are relatively common but often advocate a change of focus not only to effect sizes (which are not an inferential technique) but also to the use of standard errors (SEs) and confidence intervals (CIs) (e.g. Cumming, 2012; Hubbard & Lindsay, 2008; Lambdin, 2012). Even Rozeboom (1960) and Cohen (1994), both of whom have written about the logical problems with NHST and are ardent critics of the use of p-values, view CIs as an acceptable alternative. This view is also shared by Meehl (1997), who has written extensively on the problems of NHST in psychology.

While discussion of other issues can be found in the literature, the problem central to the use of p-values is that the probability that they refer to is not the probability we want to know. P-values can only provide the probability of the data given a hypothesis. What we actually want to know is the probability of a hypothesis given the data. As Gorard shows in his paper, the former probability – p(D|H) – is of no use to researchers, and it is not possible to convert this information to the more useful latter probability – p(H|D) – using only sample data. This problem has led to the commentators mentioned above advocating the use of CIs as an alternative to p-values.
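
The gap between these two probabilities is easy to demonstrate with Bayes' theorem, which also shows why sample data alone cannot bridge it: a prior is needed. The following is my worked illustration with invented numbers, not an example from Gorard or White:

    # p(H|D) = p(D|H) * p(H) / p(D), with p(D) summed over the hypotheses.
    p_D_given_null = 0.04  # a 'significant' result: data unlikely under H0
    p_D_given_alt = 0.30   # assumed likelihood of the same data under H1
    prior_null = 0.90      # assumed prior: most tested hypotheses are null

    p_D = p_D_given_null * prior_null + p_D_given_alt * (1 - prior_null)
    p_null_given_D = p_D_given_null * prior_null / p_D

    print(round(p_null_given_D, 2))  # about 0.55: p(D|H0)=0.04, yet H0 remains better than evens

Change the assumed prior and the answer changes, which is the point: nothing in the sample itself licenses the inversion.
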
However, as Gorard shows, the advantages of using SEs and CIs in place of, or in addition to, p-values are illusory. Once the correct definition of a confidence interval is adopted, it becomes unclear how such information would be of use to a researcher. As with NHST and p-values, the starting point is still based on assumptions about the population that are unverified and cannot be tested using sample data. The logical problem that renders NHST and p-values useless also prevents the information provided by CIs being useful. Those who correctly criticise the use of p-values on these grounds often seem to miss the point that all inferential statistical techniques suffer from a similar flaw.

Apart from Gorard, there are very few commentators who are willing to abandon the 'project' of inferential statistics altogether. Concerns are expressed that to abandon inferential statistics as a 'bad job' would be akin to 'throwing the baby out with the bath water'. It is now relatively uncontroversial for researchers to express concern about the use of NHST and p-values, and these concerns are even beginning to be raised in texts aimed at undergraduate students (e.g. Field, 2009). Recommending abandoning the use of all inferential statistical techniques, however, is much less common and likely to generate considerably more criticism.

However, what is missing from the more cautious interrogations of the use of inferential statistics is a convincing account of why we should continue to use any of them at all. Those who advocate a move to CIs do not adequately explain how the information provided by these measures can be useful when they are interpreted correctly. My current view is that no such account will be forthcoming simply because it is not possible to construct one.

Against inferential statistics
As I stated at the beginning of this response, I do not expect Gorard's views to be popular – especially those that extend the traditional critique of NHST and p-values to other inferential outputs such as CIs. I expect my support of his views to receive similar reactions. However, I believe these arguments


need to be repeated if any progress is to be made in changing the way we conduct and teach statistical analysis. We need to abandon these practices that are at best wasteful and at worst harmful. Data in social science do not meet the assumptions required for inferential statistics and, in any case, the outputs produced by these techniques are not useful. It's time to move on.

Correspondence
Dr Patrick White
Senior Lecturer,
Department of Sociology,
University of Leicester,
University Road,
Leicester LE1 7RH.
Email: patrick.white@le.ac.uk

References
Berk, R. (2004). Regression analysis: A constructive critique. Thousand Oaks, CA: Sage.
Cohen, J. (1994). The earth is round (p<.05). American Psychologist, 49(12), 997–1003.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals and meta-analysis. New York: Routledge.
Field, A. (2009). Discovering statistics using IBM SPSS Statistics. London: Sage.
Freedman, D.A. (2004). Sampling. In M.S. Lewis-Beck, A. Bryman & T.F. Liao (Eds.), Sage encyclopaedia of social science research methods (pp.987–991). Thousand Oaks, CA: Sage.
Goldstein, H. (2003). Multilevel statistical models (3rd ed.). London: Arnold.
Hagood, M.J. (1970). The notion of a hypothetical universe. In D.E. Morrison & R.E. Henkel (Eds.), The significance test controversy. Chicago: Aldine.
Hansen, M.H. & Hurwitz, W.N. (1946). The problem of non-response in sample surveys. Journal of the American Statistical Association, 41(236), 517–529.
Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.) (1997). What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum.
Hubbard, R. & Lindsay, R.M. (2008). Why p-values are not a useful measure of evidence in statistical significance testing. Theory & Psychology, 18(1), 69–88.
Lambdin, C. (2012). Significance tests as sorcery: Science is empirical – significance tests are not. Theory & Psychology, 22(1), 67–90.
Meehl, P.E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L.L. Harlow, S.A. Mulaik & J.H. Steiger (Eds.), What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum.
Morrison, D.E. & Henkel, R.E. (Eds.) (1970). The significance test controversy. Chicago: Aldine.
Rosling, H. (2013). The Joy of Stats. BBC 4, 16 October. Accessed 25 January 2014, from: www.bbc.co.uk/programmes/b00wgq0l
Rozeboom, W.W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416–428.
Sheikh, K. & Mattingly, S. (1981). Investigating non-response bias in mail surveys. Journal of Epidemiology and Community Health, 35, 293–296.
Ziliak, S.T. & McCloskey, D.N. (2008). The cult of statistical significance: How the standard error cost us jobs, justice and lives. Ann Arbor, MI: University of Michigan Press.



Open Dialogue:
Authors’ response to peer commentary
Professor Stephen Gorard

Perpetuating the ‘preposterous’


My brief paper on the abuse of statistics ('abuse' being a synonym for 'misuse' in the Oxford English Dictionary, and explained as meaning 'to use mistakenly or for a bad purpose') argued three main points.
1. Once random sampling variation is eliminated as a possible explanation for any apparent finding, analysts need to focus on alternative explanations based on design, measurement, bias, and attrition rather more than they do now.
2. The widespread use of probability calculations such as significance tests for eliminating random sampling variation as a possible explanation is based on a misunderstanding.
3. The kinds of probability calculations involved in significance tests and confidence intervals cannot be used with non-random cases, such as convenience samples and population data, anyway.
It is heartening to read support for these ideas in the responses from such a high-profile influence in the area (Gene Glass, from whom I have borrowed the term 'preposterous') and an emerging influence on methods in social science (Patrick White, the only respondent to engage fully with the second point above). Their response papers take these points forward and provide elegant examples with both clarity and some subtle humour. Yet, in a sense, I cannot understand why everyone does not comprehend and support these three relatively simple points. They were not made by me in a spirit of 'purism' (Dave Putwain), or of 'iconoclasm' (Christine Howe). These three points are simply the truth of the matter. They have long been recognised as the truth by some before us, and papers and books have even been written to try and explain the political, financial and career reasons why others continue to ignore their obvious truth.

A few important misunderstandings
It is intriguing that the four respondents who wish to retain the use of significance, etc., all focus on the third point above, because this is surely the hardest to defend against. The computation of significance, standard errors and confidence intervals (and the associated algorithms, now hidden by software) is clearly predicated on true random sampling. The fact that the software still operates and provides an 'answer' when the data do not come from randomisation and contain no probabilistic uncertainty is merely an illustration of the well-known garbage-in garbage-out principle. I guess that for these respondents to accept the truth (and in clearer terms than Putwain's 'technically correct' and the double-negative 'would not disagree', or Ben Styles' 'not strictly justified', for example) would mean the end of statistics as it is practised, the end of purported expertise in these methods, and the casting of doubt over prior work.

Howe states resolutely: 'I do not agree that sampling theory techniques should be deemed inappropriate just because the sample was based on opportunity, convenience or snowballing'. To some extent this is merely a description of current practice, and so many researchers must agree with Howe. But to see it so blatantly in print like that is truly shocking. It is the exact opposite of


everything that appears in reputable methods texts, and it is the kind of error that leads Glass to say: 'The fiction that probability statements are meaningful in the absence of random acts underlying them is preposterous'.

Common responses when I write or lecture about the abuse of statistics are 'everyone does it', 'it has happened for a long time' and eventually 'we already know all this but what should we do instead?'. I hope readers can see that none of these is a valid counter-argument, and that they remain invalid when deployed by four respondents here who largely re-state their own existing practice. Styles claims that in rejecting the use of significance and CIs I am rejecting any attempt to consider uncertainty in research findings. This is not true, and the original paper urges researchers to consider a wider and more important range of factors that lead to uncertainty but which are ignored by the significance approach (such as design bias or respondent attrition). I feel I am the more concerned because I do not just want to pretend I am assessing uncertainty via an invalid technique.

Howe claims that I argued that we should not use convenience samples, and quotes APA guidance suggesting that convenience samples are perfectly proper. They are, and I never suggested otherwise. In fact, I clearly stated that we often have no practical alternative. What the APA does not say is that we should use significance tests with convenience samples. As ever, it is presumably easier to mis-portray what I said and argue with that. Van Daal and Adèr do something similar. I showed that denying the consequent is only valid in logic when the premises are certain, and that the modus tollens argument fails once any premise is uncertain or probabilistic. They portray this as me saying that probabilistic argument in general is invalid, citing weather forecasting as an illustration. But weather forecasting does not employ this 'denying the consequent' argument structure at all. These three examples show the lengths that commentators have to go to in order to try and defend the indefensible.

There are some less common variants in the responses that try to maintain the edifice of significance testing. If we have a non-random sample we could randomly sample from within that and then use significance with the sub-sample (Van Daal & Adèr). This seems truly desperate. Purportedly, there are techniques to 'fix' a non-random sample and make it back into a random one (Van Daal & Adèr). No there are not, because if we knew the key values for the missing cases then we would have a complete sample. If not, we can only use the values we do have to make up for what is missing, so enhancing the bias caused by the missing cases in the first place (Gorard, 2013). The same practical problem eliminates Putwain's suggestion that if the achieved sample looks similar to what the random sample would have been if available, then using significance is justified. To imagine a random sample based on a convenience sample, and then try to compute real probabilities accurately based on that imagination, is surely incorrect.

Even stranger is the notion that the super-, hyper- or virtual-population invented to help differentiate between theoretically finite and infinite populations can then be used to justify treating actual population data as a random sample (Styles). Just envisage what Styles means when he writes 'we imagine the trial being run many times on students in the same schools at the same time in a virtual population from which we did sample randomly'. And note that this entirely ignores the logical problems raised by point two at the outset. I have written about this absurdity many times before (e.g. Gorard, 2008), and White handles this briefly but well in his response.

The search for an alternative, or what to do 'instead' of significance, especially with non-random samples/allocation, is an odd one. Since the existing approaches do not work we must abandon them. A Bayesian approach would certainly be more logical but is no panacea and no substitute for


judgement. As my original paper outlined, we do not need alternatives as such, since we should be considering all competing substantive, design and methods explanations for any apparent finding anyway (even if we want to eliminate chance first). But significance testing has somehow come to replace such real analysis, perhaps because the latter is not push-button. Nowhere is this more apparent than in the Carr and Marzouq (2012) paper cited in my original paper. I would have thought that my views on that paper were plain, and that once the gobbledegook is removed nothing of scientific value is left. But I do not intend to pursue this further. That paper was merely chosen as a recent example to represent hundreds similar in this journal and the many thousands in journals worldwide.

The abuse of significance is a big problem
I did not, in my brief article, explain just how widespread the abuse of sampling theory with non-random samples is, nor how often the results of statistical analyses are poorly reported. I assumed that respondents like Van Daal and Adèr would know (and realise the damage that ensues). They do tell readers that 'convenience samples dominate in the social sciences' – which surely means that significance tests should hardly ever be encountered, as their computation depends entirely on prior randomisation. However, these tests are still widely encountered, as Van Daal and Adèr should and probably do know, and as Putwain helpfully illustrates via consideration of articles in the Journal of Educational Psychology. None of the 24 articles involving numbers was based on random samples, and all of them used inferential statistics (incorrectly). This does not surprise me, and I have done similar analyses of education journals and found the same thing (e.g. Gorard, 2008). The problem is so widespread that it is almost universal, and perhaps in some strange way that makes it hard for Van Daal and Adèr to notice. Similarly, although the meaning of significance tests is often correctly described in methods texts, these tests are then generally misused in the examples and in research practice. The error is now usually only implicit, since analysts simply take the probability of the data observed given the truth of the null hypothesis as being the same as, or closely related to, the probability of the null hypothesis being true given the data observed. That is, they 'reject' the null hypothesis where the probability of the data observed under that hypothesis is low, and they do so without explanation or justified argument.

This key error of confusing p(Hyp|data) with p(data|Hyp) is again almost universal. That is why I proposed asking all researchers to explain the steps in the 'logic' they are using explicitly, since once they have written the argument down in clear terms (if they can) it should be obvious to them and their readers that the argument is invalid (see the worked example in Chapter 5 of Gorard, 2013).
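
The practical cost of this inversion can be shown by simulation. The sketch below is my illustration, with invented proportions and effect sizes rather than figures from the paper: even with genuinely random allocation, a substantial share of the 'significant' results at p<0.05 come from true null hypotheses, so a low p(data|Hyp) has not delivered a low p(Hyp|data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    studies, n = 4000, 30
    null_true = rng.random(studies) < 0.8   # assume 80% of tested nulls are true
    effect = np.where(null_true, 0.0, 0.5)  # assumed real effect when H0 is false

    hits = false_hits = 0
    for i in range(studies):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect[i], 1.0, n)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
            false_hits += int(null_true[i])

    print(false_hits / hits)  # the share of 'significant' findings that are false
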
Significance tests just do not work as intended by their users, even when applied to random sampling/allocation. The situation is even worse for confidence intervals, because I have yet to encounter an analyst who can correctly explain what CIs mean. CIs are as bad as p-values since they 'suffer from similar flaws to p-values, exaggerating both the size of implausible effects and their significance' (Matthews, 1998, p.5), yet they are even harder to describe. And why should the fact that 95 per cent of hypothetical repeated sample figures lie within 1.96 population standard deviations of the population average then imply that the population average will lie within 1.96 one-sample standard deviations of a specific one-sample average 95 per cent of the time? It will not and it does not, as anyone can see if they think about it clearly. Please let us not worry about whether it is a 'standard deviation' or a 'standard error', a 'population' or a 'sampling distribution'. Let the reader insert their terms of choice, and the argument of CIs remains just as clearly nonsense.

These problems are not just commonplace, they are also dangerous. At best, they


make research reports harder to read, perhaps confusing readers, and certainly waste people's time with producing, publishing and consuming fake results. In a worse case, they waste public funding, causing needless opportunity costs, and they waste people's energy pursuing what turn out to be all too easily predictable vanishing breakthroughs. At worst, these errors damage lives and kill people (see examples from diet, cancer research, and heart treatments in Matthews, 1998).

'Seventy years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding' (Matthews, 1998, p.1).

This is especially shocking because research in the UK is largely funded by the public (taxpayers and charity-givers). Where social science has impact in practice, the effect is largely on the public in areas of policy like education, crime, housing, transport and health. Yet ethical committees and guidelines still largely ignore the interests of the wider public in their focus on possible harm to the researchers and the researched (Gorard, 2002). I urge Putwain to pursue the implications of this second principle of ethics, now creeping into ethical guidelines such as those of the SRA. Public money is being wasted, and public lives are being made worse (or at least not improved as much as they could be), by this invalid practice of significance testing. To call it merely a 'cult' is to downplay its importance. It should cease. Now.

References
Gorard, S. (2002). Ethics and equity: Pursuing the perspective of non-participants. Social Research Update, 39, 1–4.
Gorard, S. (2008). Quantitative research in education: Volumes 1 to 3. London: Sage.
Gorard, S. (2010). All evidence is equal: The flaw in statistical reasoning. Oxford Review of Education, 36(1), 63–77.
Gorard, S. (2013). Research design: Robust approaches for the social sciences. London: Sage.
Matthews, R. (1998). Bayesian critique of statistics in health: The great health hoax. http://www2.isye.gatech.edu/~brani/isyebayes/bank/pvalue.pdf


