

P-Value:
Or, Comparing the Price of One’s Shoes to a Wyoming Guide Pole

by
Anna E. Baldwin

Submitted to
Patty Kero, Ed.D.

In Partial Fulfillment of the Requirements of


EDLD 618: Educational Statistics
The University of Montana
Spring 2010

To a novice reader of research, the words "statistically significant" might leave one a bit star-struck: this phrase ignites a kind of wonder, as in, "This study found something important!" Or, "Look how much work these researchers put into this effort, and the results were statistically significant. How satisfying." To the researcher, statistical significance may appear to validate the long hours and substantiate the hunch he had all along: "I knew there was a negative correlation between the number of hours a person works and the marital satisfaction reported by his or her spouse! And look, here it is in this tiny p-value."

The p-value itself is the probability the researcher computes from the data and compares against the level of significance set in advance. Typically set at .05, the level of significance (or alpha, α) indicates the likelihood of a researcher finding a difference by chance or accident, given that the null hypothesis is true (Huck, 2008). For example, let's say an educational researcher wants to study the average amount of time 9th graders spend on homework compared to the average amount of time 10th graders spend on homework. The null hypothesis would be H0: µ9 = µ10, and the researcher sets the level of significance at .05. Should H0 be true, the researcher is allowing a 5% chance that she will nonetheless find a difference; if this happens, she will reject H0 and commit a Type I error. However, if she conducts her research and correctly rejects H0 because it was not true, she might then report the p-value as < .05, or calculate it to reflect its actual value (say, p < .001).
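
To make the logic of this example concrete, the short Python sketch below simulates the homework comparison with invented hours (the data, sample sizes, and SciPy usage are all assumptions for illustration, not a reproduction of any actual study) and compares the resulting p-value to the pre-set α = .05.

    # A minimal sketch of the homework comparison described above, using
    # hypothetical data; every number here is invented for illustration only.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    grade9 = rng.normal(loc=5.0, scale=1.5, size=30)   # simulated weekly homework hours, 9th graders
    grade10 = rng.normal(loc=5.4, scale=1.5, size=30)  # simulated weekly homework hours, 10th graders

    alpha = 0.05                                        # level of significance set before seeing the data
    t_stat, p_value = stats.ttest_ind(grade9, grade10)  # two-sample t-test of H0: mu9 = mu10

    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    if p_value < alpha:
        print("Reject H0: the observed difference is statistically significant at alpha = .05")
    else:
        print("Fail to reject H0: the observed difference could plausibly be chance")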

This sounds like good research! Not so fast, according to Carver (1993): "…statistical significance tells us nothing directly relevant to whether the results we found are large or small, and it tells us nothing with respect to whether the sampling error is large or small" (p. 291). In addition, Shaver (1993) argues that statistical significance tests are not indicators of a result occurring by chance, of a null hypothesis's plausibility, or of causality between treatments. Thompson (1993) presents three fundamental criticisms of statistical significance testing. First, he calls the problem "tautological": that is, it is the repetition of an already understood fact, namely that "virtually all null hypotheses will be rejected at some sample size" (p. 362); therefore, tests of statistical significance can be both redundant and pointless. Next, these tests can lead researchers to devise unrealistic comparisons when they start pairing statistics that came from different hypotheses, sample sizes, means, and so on. Finally, what Thompson calls "inescapable dilemmas" are created when researchers use statistical significance testing in a way that produces predictable outcomes.

The main criticism of statistical significance appears to be that too many researchers and readers of research put too much stock in the number achieved and the phrase "statistically significant," when a statistically significant finding may not be significant in a practical sense at all (Snyder & Lawson, 1993). Particularly because the calculated p-value may be just barely under the level set a priori, the statistical significance of a result may be tainted by arbitrariness. Rather, "if theory is supported, we would like to know how big the effect really is; if theory is not supported, we want to know how close we came" (Serlin, 1993, p. 352). Effect size is one of the ways these critics have suggested improving the overall reporting of findings. Others include discussion of sample size, ensuring randomness, replicability, and confidence intervals.

Effect size, or magnitude of effect (ME), is a way that researchers can report findings in a practical sense rather than solely through the p-value (Snyder & Lawson, 1993). Consider reporting the size of a shoe sale to one's husband: "These shoes are 30% off today" has a much different effect than "These shoes cost only $120 today." Over $100 for a pair of shoes might be the limit! Similarly, increasing the sample size can change the outcome of a test of statistical significance; thus it is important to find other ways of reporting an outcome.
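
As a rough illustration of this point, the sketch below (hypothetical data; Python with SciPy assumed, and Cohen's d used as one common magnitude-of-effect measure) runs the same comparison on a small and a very large sample. The tiny underlying difference tends to reach "significance" only when the sample is huge, even though its effect size stays small.

    # A minimal sketch showing how a trivially small difference can become
    # "statistically significant" with a large enough sample while the effect
    # size stays small. All data here are simulated for illustration only.
    import numpy as np
    from scipy import stats

    def cohens_d(a, b):
        """Standardized mean difference: (mean_a - mean_b) / pooled standard deviation."""
        na, nb = len(a), len(b)
        pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
        return (a.mean() - b.mean()) / np.sqrt(pooled_var)

    rng = np.random.default_rng(1)
    for n in (30, 3000):                    # small sample vs. very large sample
        a = rng.normal(5.0, 1.5, size=n)    # true difference between the groups is only 0.1
        b = rng.normal(5.1, 1.5, size=n)
        _, p = stats.ttest_ind(a, b)
        print(f"n = {n}: p = {p:.3f}, Cohen's d = {cohens_d(a, b):.2f}")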

Both Carver (1993) and Thompson (1993) proffer replication as a genuine way of reducing the emphasis placed on tests of statistical significance. According to Carver, "building replication into our research helps to eliminate chance or sampling error as one of the threats to the internal validity of our results" (pp. 291-292). Thompson goes into greater detail about how to accomplish replication, given that doing so is costly in terms of money and the researcher's time. He suggests halving the sample and using cross-validation; another idea is the jackknife approach, which involves repeating the analysis on subsamples of the data, leaving out a different subject or group of subjects each time. His strongest suggestion, the one for which the article was named, is the bootstrap method: "Conceptually, these methods involve copying the data set on top of itself again and again infinitely many times to thus create an infinitely large mega data set … Then hundreds or thousands of different samples are drawn from the mega file, and the results are computed separately for each sample and then averaged" (p. 369). The bootstrap method helps a researcher evaluate the stability of his results over different groupings of subjects, and it can be used as part of inference to the population.
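
The sketch below shows the core of that resampling idea in Python (hypothetical data; the 5,000 resamples and the choice of a sample mean as the statistic are arbitrary assumptions for illustration): draw many samples with replacement from the observed data and examine how much the statistic varies across them.

    # A minimal sketch of the bootstrap idea described above: resample the
    # observed data (with replacement) many times and look at how stable the
    # statistic is across resamples. Data and settings are invented.
    import numpy as np

    rng = np.random.default_rng(2)
    sample = rng.normal(5.0, 1.5, size=40)      # one observed sample (hypothetical)

    boot_means = np.array([
        rng.choice(sample, size=len(sample), replace=True).mean()
        for _ in range(5000)                    # thousands of resamples, as Thompson describes
    ])

    print(f"observed mean = {sample.mean():.2f}")
    print(f"bootstrap standard error = {boot_means.std(ddof=1):.2f}")
    print(f"middle 95% of resampled means: "
          f"{np.percentile(boot_means, 2.5):.2f} to {np.percentile(boot_means, 97.5):.2f}")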

Finally, using a confidence interval rather than a point null hypothesis can, according to Serlin (1993), provide more credible results than the p-value. He calls it the "good-enough belt," and it seems like common sense. As a basic analogy, I might tell my 8-year-old to clean her room, which is very, very messy. My definition of "clean" is that every article is picked up and put away, the bed is made, and the windows have been washed. The reward for this chore is a movie-date with her friend. She spends two hours cleaning; when I go in to check her work, she has picked up every article and put it away and made the bed, but she has not washed the windows (which weren't really that dirty anyway). Is this good enough? Or does she fall short of my somewhat arbitrary level of significance for the movie-date? Serlin would probably say she could go.
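
A rough sketch of how this might look in practice follows (Python; the data and the half-unit "good-enough" band are hypothetical stand-ins for Serlin's belt): compute a confidence interval for the difference and ask whether it clears the range of differences judged too small to matter, rather than asking whether it excludes exactly zero.

    # A minimal sketch of the "good-enough" idea: compare a confidence interval
    # for the difference against a band of negligible differences instead of a
    # point null of exactly zero. All values are invented for illustration.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    a = rng.normal(5.0, 1.5, size=40)           # hypothetical group A scores
    b = rng.normal(5.8, 1.5, size=40)           # hypothetical group B scores

    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    se = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_crit = stats.t.ppf(0.975, df=n_a + n_b - 2)   # critical value for a 95% interval
    diff = a.mean() - b.mean()
    ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

    good_enough = 0.5    # hypothetical band: differences smaller than this are treated as negligible
    print(f"95% CI for the difference: ({ci_low:.2f}, {ci_high:.2f})")
    if ci_high < -good_enough or ci_low > good_enough:
        print("The whole interval clears the negligible band: a practically meaningful difference")
    else:
        print("The interval overlaps the negligible band: not clearly a meaningful difference")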

While few of the critics are ready to throw out the p-value entirely, they all agree that it is overused, undercriticized, and misunderstood by many researchers and most readers of research. The p-value may be useful for identifying how far a result is from the null hypothesis, but without considering sample size, effect size, the p-value's proximity to the level of significance, and other factors, using the p-value to evaluate a study's results is like using six-foot guide poles to drive through a nine-foot Wyoming snowdrift.



References

Carver, R. P. (1993). The Case Against Statistical Significance Testing, Revisited. Journal of Experimental Education, 61(4), 287-292.

Huck, S. (2008). Reading Statistics (5th ed.). Boston: Pearson.

Serlin, R. (1993). Confidence Intervals and the Scientific Method: A Case for Holm on the Range. Journal of Experimental Education, 61(4), 350-360.

Shaver, J. P. (1993). What Statistical Significance Testing Is, and What It Is Not. Journal of Experimental Education, 61(4), 293-316.

Snyder, P., & Lawson, S. (1993). Evaluating Results Using Corrected and Uncorrected Effect Size Estimates. Journal of Experimental Education, 61(4), 334-349.

Thompson, B. (1993). The Use of Statistical Significance Tests in Research: Bootstrap and Other Alternatives. Journal of Experimental Education, 61(4), 361-377.
