P-Value:
Or, Comparing the Price of One’s Shoes to a Wyoming Guide Pole
by
Anna E. Baldwin
Submitted to
Patty Kero, Ed.D.
To a novice reader of research, the words “statistically significant” might leave one a bit
star-struck: this phrase ignites a kind of wonder, as in, “This study found something important!”
Or, “Look how much work these researchers put into this effort, and the results were statistically
significant. How satisfying.” To the researcher, statistical significance may appear to validate the
long hours and substantiate the hunch he had all along: “I knew there was a negative correlation between
the number of hours a person works and the marital satisfaction reported by his or her spouse!”
The p-value itself is the computed probability that is compared against the level of significance pre-set by the
researcher. Typically set at .05, the level of significance (or alpha, α) indicates the likelihood of a
researcher finding a difference by chance or accident, given that the null hypothesis is true
(Huck, 2008). For example, let’s say an educational researcher wants to study the
average amount of time 9th graders spend on homework compared to the average amount of time
10th graders spend on homework. The null hypothesis would be H0: µ9 = µ10, and the researcher sets
the level of significance at .05. Should H0 be true, the researcher is allowing for a 5% chance that
she will find that there is a difference; if this happens, she will reject H0 and commit a Type I
error. However, if she conducts her research and correctly rejects H0 because it was not true, she
might then report the p-value as p < .05 or calculate a more precise value, say p < .001.
This sounds like good research! Not so fast, according to Carver (1993): “…statistical
significance tells us nothing directly relevant to whether the results we found are large or small,
and it tells us nothing with respect to whether the sampling error is large or small” (p. 291).
In addition, Shaver (1993) argues that statistical significance tests
are not indicators of a result occurring by chance, of a null hypothesis’s plausibility, or of causality.
Thompson (1993) identifies three problems with statistical
significance testing. First, he calls the problem “tautological”: that is, it is repetition of an already
understood fact – that “virtually all null hypotheses will be rejected at some sample size” (p.
362); therefore, tests of statistical significance can be both redundant and pointless. Next, these
tests can lead researchers to devise unrealistic comparisons when they start pairing statistics that
came from different hypotheses, sample sizes, means, and so on. Finally, what Thompson calls
“inescapable dilemmas” arise when researchers use statistical significance testing in a way
that lets the size of the sample, rather than the size of the effect, dictate their conclusions.
The main criticism of statistical significance appears to be that too many researchers and
readers of research put too much stock in the number achieved and the phrase “statistically
significant,” when a statistically significant finding may not be significant in a practical sense at
all (Snyder & Lawson, 1993). Particularly because the calculated p-value may be just barely under
the level set a priori, the statistical significance of a result may be tainted by arbitrariness.
Critics would rather know more: “if theory is supported, we would like to know how big the effect really is; if theory is
not supported, we want to know how close we came” (Serlin, 1993, p. 352). Effect size
is one of the ways these critics have suggested improving the overall reporting of findings.
Others include discussing sample size, ensuring randomness, building in replication, and reporting confidence
intervals.
Effect size, or magnitude of effect (ME), is a way that researchers can report findings in a
practical sense rather than solely through p-value (Snyder & Lawson, 1993). Consider reporting
the size of a shoe sale to one’s husband: “These shoes are 30% off today” has a much different
effect than “These shoes cost only $120 today.” Over $100 for a pair of shoes might be the limit!
Similarly, increasing the sample size can change the outcome of a test of statistical significance: with a large enough sample, even a trivially small difference will be flagged as statistically significant.
Both Carver (1993) and Thompson (1993)
proffer replication as a genuine way of reducing the emphasis placed on tests of statistical
significance. According to Carver, “building replication into our research helps to eliminate
chance or sampling error as one of the threats to the internal validity of our results”
(pp. 291-292). Thompson goes into greater detail about how to accomplish
replication, given that doing so is costly in terms of money and the researcher’s time. He
suggests halving the sample and using cross-validation; another idea is the jackknife approach,
which involves repeating the analysis on subsets of the sample, leaving out a different group of
subjects each time. His strongest suggestion, the one for which the article was
named, is the bootstrap: “Conceptually, these methods involve copying the data set on
top of itself again and again infinitely many times to thus create an infinitely large mega data set
… Then hundreds or thousands of different samples are drawn from the mega file, and the results
are computed separately for each sample and then averaged” (p. 369). The bootstrap method helps
a researcher evaluate the stability of his results over different samples.
Finally, using a confidence interval rather than a point null hypothesis can, according to
Serlin (1993), provide more credible results than the p-value. He calls it
the “good-enough belt,” and it seems like common sense. As a basic analogy, I might tell my 8-
year-old to clean her room, which is very, very messy. My definition of “clean” is that every
article is picked up and put away, the bed is made, and the windows have been washed. The
reward for this chore is a movie-date with her friend. She spends two hours cleaning; when I go
in to check her work, she has picked up every article and put it away and made the bed, but she
has not washed the windows (which weren’t really that dirty anyway). Is this good enough? Or
does she fall short of my somewhat arbitrary level of significance for the movie-date? Serlin
would say that deciding in advance how close is close enough is fairer, and more informative, than a single all-or-nothing cutoff.
While few of the critics are ready to throw out p-value entirely, they all agree that it is
overused, undercriticized, and misunderstood by many researchers and most readers of research.
P-value may be useful for identifying how far a result is from the null hypothesis, but without
considering sample size, effect size, its proximity to the level of significance, and other factors,
using p-value to evaluate a study’s results is like using six-foot guide poles to drive through a
Wyoming blizzard: the poles mark where the road is, but they say nothing about whether it is safe to drive.
References
Carver, R. P. (1993). The Case Against Statistical Significance Testing, Revisited. Journal of
Experimental Education, 61(4), 287-292.
Huck, S. W. (2008). Reading Statistics and Research (5th ed.). Boston, MA: Pearson.
Serlin, R. C. (1993). Confidence Intervals and the Scientific Method: A Case for Holm on the
Range. Journal of Experimental Education, 61(4), 350-358.
Shaver, J. P. (1993). What Statistical Significance Testing Is, and What It Is Not. Journal of
Experimental Education, 61(4), 293-316.
Snyder, P., & Lawson, S. (1993). Evaluating Results Using Corrected and Uncorrected Effect
Size Estimates. Journal of Experimental Education, 61(4), 334-349.
Thompson, B. (1993). The Use of Statistical Significance Tests in Research: Bootstrap and Other
Alternatives. Journal of Experimental Education, 61(4), 361-377.