P-Value:
Or, Comparing the Price of One’s Shoes to a Wyoming Guide Pole
by
Anna E. Baldwin
Submitted to
Patty Kero, Ed.D.
To a novice reader of research, the words “statistically significant” might leave one a bit
star-struck: this phrase ignites a kind of wonder, as in, “This study found something important!”
Or, “Look how much work these researchers put into this effort, and the results were statistically
significant. How satisfying.” To the researcher, statistical significance may appear to validate the
long hours and substantiate the hunch he had all along: “I knew there was a negative correlation between
the number of hours a person works and the marital satisfaction reported by his or her spouse!”
The p-value itself is the computed probability that is compared against the level of significance pre-set by the
researcher. Typically set at .05, the level of significance (or alpha, α) indicates the likelihood of a
researcher finding a difference by chance or accident, given that the null hypothesis is true
(Huck, 2008). For example, let’s say an educational researcher wants to study the
average amount of time 9th graders spend on homework compared to the average amount of time
10th graders spend on homework. The null hypothesis would be H0: µ9 = µ10, and the researcher sets
the level of significance at .05. Should H0 be true, the researcher is allowing for a 5% chance that
she will find that there is a difference; if this happens, she will reject H0 and commit a Type I
error. However, if she conducts her research and correctly rejects H0 because it was not true, she
might then report the p-value as p < .05 or calculate a more precise value, say p < .001.
This sounds like good research! Not so fast, according to Carver (1993): “…statistical
significance tells us nothing directly relevant to whether the results we found are large or small,
and it tells us nothing with respect to whether the sampling error is large or small” (p. 291).
In addition, Shaver (1993) argues that statistical significance tests
are not indicators of a result occurring by chance, of a null hypothesis’s plausibility, or of causality.
Thompson (1993) identifies three problems with statistical
significance testing. First, he calls the problem “tautological”: that is, it is repetition of an already
understood fact – that “virtually all null hypotheses will be rejected at some sample size” (p.
362); therefore, tests of statistical significance can be both redundant and pointless. Next, these
tests can lead researchers to devise unrealistic comparisons when they start pairing statistics that
came from different hypotheses, sample sizes, means, and so on. Finally, what Thompson calls
“inescapable dilemmas” arise when researchers use statistical significance testing in a way
that lets the size of the sample, rather than the size of the effect, dictate their conclusions.
The main criticism of statistical significance appears to be that too many researchers and
readers of research put too much stock in the number achieved and the phrase “statistically
significant,” when a statistically significant finding may not be significant in a practical sense at
all (Snyder & Lawson, 1993). Particularly because the calculated p-value may be just barely under
the level set a priori, the statistical significance of a result may be tainted by arbitrariness.
Critics would rather know more: “if theory is supported, we would like to know how big the effect really is; if theory is
not supported, we want to know how close we came” (Serlin, 1993, p. 352). Effect size
is one of the ways these critics have suggested improving the overall reporting of findings.
Others include discussing sample size, ensuring randomness, building in replication, and reporting confidence
intervals.
Effect size, or magnitude of effect (ME), is a way that researchers can report findings in a
practical sense rather than solely through p-value (Snyder & Lawson, 1993). Consider reporting
the size of a shoe sale to one’s husband: “These shoes are 30% off today” has a much different
effect than “These shoes cost only $120 today.” Over $100 for a pair of shoes might be the limit!
Similarly, increasing the sample size can change the outcome of a test of statistical significance: with a large enough sample, even a trivially small difference will be flagged as statistically significant.
Both Carver (1993) and Thompson (1993)
proffer replication as a genuine way of reducing the emphasis placed on tests of statistical
significance. According to Carver, “building replication into our research helps to eliminate
chance or sampling error as one of the threats to the internal validity of our results”
(pp. 291-292). Thompson goes into greater detail about how to accomplish
replication, given that doing so is costly in terms of money and the researcher’s time. He
suggests halving the sample and using cross-validation; another idea is the jackknife approach,
which involves repeating the analysis on subsets of the sample, leaving out a different group of
subjects each time. His strongest suggestion, the one for which the article was
named, is the bootstrap: “Conceptually, these methods involve copying the data set on
top of itself again and again infinitely many times to thus create an infinitely large mega data set
… Then hundreds or thousands of different samples are drawn from the mega file, and the results
are computed separately for each sample and then averaged” (p. 369). The bootstrap method helps
a researcher evaluate the stability of his results over different samples.
Finally, using a confidence interval rather than a point null hypothesis can, according to
Serlin (1993), provide more credible results than the p-value. He calls it
the “good-enough belt,” and it seems like common sense. As a basic analogy, I might tell my 8-
year-old to clean her room, which is very, very messy. My definition of “clean” is that every
article is picked up and put away, the bed is made, and the windows have been washed. The
reward for this chore is a movie-date with her friend. She spends two hours cleaning; when I go
in to check her work, she has picked up every article and put it away and made the bed, but she
has not washed the windows (which weren’t really that dirty anyway). Is this good enough? Or
does she fall short of my somewhat arbitrary level of significance for the movie-date? Serlin
would say that deciding in advance how close is close enough is fairer, and more informative, than a single all-or-nothing cutoff.
While few of the critics are ready to throw out p-value entirely, they all agree that it is
overused, undercriticized, and misunderstood by many researchers and most readers of research.
P-value may be useful for identifying how far a result is from the null hypothesis, but without
considering sample size, effect size, its proximity to the level of significance, and other factors,
using p-value to evaluate a study’s results is like using six-foot guide poles to drive through a
Wyoming blizzard: the poles mark where the road is, but they say nothing about whether it is safe to drive.
References
Carver, R. P. (1993). The Case Against Statistical Significance Testing, Revisited. Journal of
Experimental Education, 61(4), 287-292.
Huck, S. W. (2008). Reading Statistics and Research (5th ed.). Boston, MA: Pearson.
Serlin, R. C. (1993). Confidence Intervals and the Scientific Method: A Case for Holm on the
Range. Journal of Experimental Education, 61(4), 350-358.
Shaver, J. P. (1993). What Statistical Significance Testing Is, and What It Is Not. Journal of
Experimental Education, 61(4), 293-316.
Snyder, P., & Lawson, S. (1993). Evaluating Results Using Corrected and Uncorrected Effect
Size Estimates. Journal of Experimental Education, 61(4), 334-349.
Thompson, B. (1993). The Use of Statistical Significance Tests in Research: Bootstrap and Other
Alternatives. Journal of Experimental Education, 61(4), 361-377.