
Handout No. 6
Phil.015
April 12, 2016

HYPOTHESIS TESTING IN BINOMIAL AND NORMAL DISTRIBUTION MODELS
1. Hypotheses
Recall that in one old segment of the TV show The Odd Couple, Felix claimed to have ESP. Oscar was skeptical and suggested testing Felix's claim. Oscar would draw a card at random from a deck of four large cards, each with a different geometric figure on it (e.g., a circle, square, triangle, and a cross). Without showing the card, Felix was asked to identify it. Felix and Oscar repeated the basic card identification experiment 10 times. Remarkably, Felix made 6 correct identifications. Felix's score is surprisingly good in view of the fact that the average number of correct identifications is only 2.5 for anyone who does not have ESP. Well, does Felix's score prove that he actually has ESP? Of course not. This may have been his lucky day. But Felix's high score may provide some evidence for his ESP capability. Just how much evidence? That depends on how strict Oscar chooses to be about the discrepancy between the gathered data (i.e., 6 correct identifications) and the predictions of his "no ESP" statistical model (i.e., on average one can make only 2.5 correct identifications).

2. Statistical model building


To build a statistical model for the Felix-Oscar ESP experiment, we need the
following two conceptual ingredients:
(i) A test statistic Xn, i.e., a random variable that counts the number of Felix's correct identifications in n independent trials. It is easy to see that the possible values of the test statistic Xn (with n = 10) are 0, 1, 2, . . . , 10.

(ii) A parametrized family of seriously possible probability distribution functions of Xn. Because (i) the probability that Felix correctly identifies a card drawn by Oscar in any trial is always the same, namely p = 1/4, (ii) there are only two possible outcomes for each trial, called success (correct identification) and failure (wrong identification), (iii) there are n = 10 trials, where n is fixed, and (iv) all n trials are statistically independent of each other, the binomial probability distribution is an appropriate model for the experiment.

Remember that there are 4 cards, and without ESP Felix will guess correctly any card drawn by Oscar with probability p = 1/4. If Felix has ESP, then the probability could be much higher, but we may not know exactly how much higher. Given the foregoing problem description, it is most adequate to consider as a model the following parametrized family of binomial probability distribution functions:


BinX10(k | p) =df P(X10 = k | p) = C(10, k) p^k (1 − p)^(10−k),


where the parameter that parametrizes the possible binomial statistical models
is the probability p. Of course, we do not know the exact value of p. The
business of hypothesis testing is to generate statistical inferences about the
likely values of p.
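The binomial formula above can be evaluated directly in code. A minimal Python sketch (not part of the handout; the function name binom_pmf is ours):

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial pdf: P(X_n = k | p) = C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Felix's observed score under Oscar's "no ESP" value p = 1/4:
print(round(binom_pmf(6, 10, 0.25), 3))  # 0.016
```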

Statisticians often write Xn ∼ Bin(k | p) to indicate that the random variable (i.e., the test statistic) Xn has a pdf specified by Bin(k | p) with parameter p whose value is unknown. This is their way to introduce a parametric family of statistical models that hopefully includes the correct model with a specific value for p. Because this value is not known, the next move is to hypothesize a specific value of p and then let the experimental outcome decide whether that hypothesis about p's value is acceptable.
Because Oscar does not believe that Felix has ESP, he starts pessimistically with the so-called null hypothesis

H0 : p = 1/4,

stating plausibly that Felix has no ESP and is only guessing. In other words, Oscar believes that the binomial statistical model

X10 ∼ Bin(k | 1/4)

correctly characterizes Felix's ESP capabilities. To emphasize the extant hypothesis H0, it is common to symbolize the extant statistical model as

X10 ∼ Bin(k | H0)

in place of the above notation. (Think of H0 as a condition in the model.) It is easy to verify that the mean (expected value) of the test statistic Xn under H0 is μ = E(X10) = 10 · (1/4) = 2.5 (from E(Xn) = np). Recall also that under H0, the pdf of X10 is given by the table below:

Specification of BinX10(k | H0): probability assignment under p = 1/4

x         0      1      2      3      4      5      6      7      8      9      10
pX10(x)   0.056  0.188  0.282  0.250  0.146  0.058  0.016  0.003  0.000  0.000  0.000
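The table's entries can be reproduced directly from the binomial formula. Here is a minimal Python sketch (not part of the handout; the name `table` is ours):

```python
from math import comb

# pdf of X10 under H0: p = 1/4, n = 10 trials
n, p = 10, 0.25
table = {k: round(comb(n, k) * p**k * (1 - p) ** (n - k), 3) for k in range(n + 1)}
for k, prob in table.items():
    print(k, prob)
```

Rounding to three decimals makes the entries for k = 8, 9, 10 display as 0.000, exactly as in the table.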

The graphical representation of the pdf pX10 is displayed next. Note that the diagram has its highest values at X10 = 2 and X10 = 3, consistent with the mean value E(X10) = 2.5. As alluded to above, with no ESP, the correct identification scores will be quite close to 2 or 3.

[Graph: pX10(x) = BinX10(x | H0) = P(X10 = x | p = 1/4), plotted for x = 0, . . . , 10.]

Remember that Felix has correctly identified 6 cards in a sample of n = 10 trials. That is, the observed value xobs of X10 is 6. It is easy to see that the probability of correctly identifying 6 or more cards is P(X10 ≥ 6 | H0) = 0.016 + 0.003 + 0.000 + 0.000 + 0.000 = 0.019. The fact that this probability is rather small tells us that X10 = 6 is quite far from what we expect (namely only E(X10) = 2.5) under hypothesis H0, or under Oscar's extant statistical model X10 ∼ Bin(k | H0). Note again that the sample result X10 = 6 seems rather inconsistent with the null hypothesis H0. To put it another way, the null hypothesis H0 does not explain in a satisfactory manner the observed value X10 = 6. For example, if p were considerably larger, say p = 0.5, captured by a different null hypothesis, say H0′, then the mean of X10 would be μ = 10 · (1/2) = 5, and the graph of the revised probability distribution function would be as follows:

[Graph: pX10(x) = BinX10(x | H0′) = P(X10 = x | p = 0.5), plotted for x = 0, . . . , 10.]

This model would indeed explain Felix's results much better, but Oscar does not accept it! Oscar is a skeptic! Be that as it may, we should definitely consider the so-called alternative hypothesis

Ha : p > 1/4,

stating that in the case of Felix's ESP performance the probability of correctly identifying a card drawn by Oscar is actually greater than the guessing-type probability. In other words, the correct model for the experiment is somewhere in the binomial family

X10 ∼ Bin(k | Ha)

of seriously possible statistical models. The next problem is how to decide which hypothesis we should accept, H0 or Ha, in the face of the fresh observation X10 = 6.
Before moving on, note also that if Oscar were a true believer in Felix's ESP capabilities, he might as well consider yet another hypothesis, say H0′′ : p = 0.75, that leads to the binomial probability distribution

[Graph: pX10(x) = BinX10(x | H0′′) = P(X10 = x | p = 0.75), plotted for x = 0, . . . , 10.]

giving the mean μ = 10 · 0.75 = 7.5, which treats Felix's ESP performance far too optimistically. This is something Oscar is not prepared to do.
What hypotheses H0′ and H0′′ indicate is that there are many hypotheses that perhaps explain Felix's experimental result much better than H0. But Oscar is not convinced as yet that this might be the case.

3. Model validation in binomial right-tail tests


One way to indicate the discrepancy between what we expect based on H0 and what we actually observe is to calculate the so-called P-value, i.e., the right-tail probability

P(X10 ≥ 6 | H0) = 0.016 + 0.003 + 0.000 + 0.000 + 0.000 = 0.019,

or in general the P-value P(Xn ≥ xobs | H0), where xobs is the value of Xn observed in an experiment consisting of n trials. The fact that in Felix's case the P-value is small indicates that the observation X10 = 6 is far from what we expect when p = 1/4 (i.e., when we hold the view that H0 is the correct hypothesis). To repeat, the model based on H0 cannot explain Felix's data! However, the result X10 = 6 is not at all surprising if Felix really has some degree of ESP. As a matter of fact, the evidence tends to favor the conclusion that Felix has an ESP capability.
It was Ronald Fisher (1890–1962) who stated, by convention or by way of benchmarks, that

(i) If the P-value satisfies

P(Xn ≥ xobs | H0) < 0.01,

then xobs shows strong evidence against H0 (or simply: the test result is highly statistically significant), prompting a rejection of H0 at the 1% significance level! Because the P-value 0.019 in Felix's case is greater than 0.01, Oscar can still retain H0 at the 1% significance level!

(ii) However, if the P-value satisfies

0.01 ≤ P(Xn ≥ xobs | H0) < 0.05,

then xobs shows moderate evidence against H0 (or simply: the test result is statistically significant), still prompting a rejection of H0, but this time only at the 5% significance level! Since the P-value 0.019 (obtained for Felix) is strictly less than 0.05, Oscar must give up his pessimistic hypothesis H0 at the 5% significance level!

(iii) Finally, a P-value satisfying

0.10 ≤ P(Xn ≥ xobs | H0)

indicates no evidence against H0.
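Fisher's benchmarks applied to Felix's right-tail P-value can be sketched in Python (not part of the handout; the function names are ours, and for simplicity the sketch lumps every P-value at or above 0.05 into the weakest verdict). Note that the exact P-value is about 0.0197; the handout's 0.019 comes from summing the rounded table entries:

```python
from math import comb

def right_tail_p(x_obs: int, n: int, p: float) -> float:
    """Right-tail P-value: P(X_n >= x_obs | H0)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x_obs, n + 1))

def fisher_verdict(p_value: float) -> str:
    """Classify a P-value by Fisher's conventional benchmarks."""
    if p_value < 0.01:
        return "strong evidence against H0 (highly significant)"
    if p_value < 0.05:
        return "moderate evidence against H0 (significant)"
    return "little or no evidence against H0"

pv = right_tail_p(6, 10, 0.25)          # Felix: 6 hits in 10 trials, H0: p = 1/4
print(round(pv, 4), "->", fisher_verdict(pv))  # 0.0197 -> moderate evidence ...
```

The verdict matches the handout's conclusion: H0 survives at the 1% level but is rejected at the 5% level.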

In addition to the P-value approach described above, there is also a dual critical value approach, according to which a designated critical value xcr of the test statistic Xn determines when H0 ought to be rejected. Specifically, in a right-tailed test (a prime example is Felix's ESP experiment) H0 is rejected precisely when xcr ≤ xobs. In a right-tail test, the set of values of Xn that are equal to or larger than the critical value is called the rejection region. All other values of Xn belong to the nonrejection region.
Question: How do we find the critical value? Answer: The statistician specifies it prior to the experiment or calculates it from the equation

P(Xn ≥ xcr | H0) = 0.01

at the 1% significance level. Since P(Xn ≥ xcr | H0) = 1 − P(Xn ≤ xcr − 1 | H0), and hence

P(Xn ≤ xcr − 1 | H0) = 0.99,

we can look up the value of xcr − 1 in the table for the binomial cumulative probability distribution for sample size n and the probability specified by H0.

In the case of Felix's ESP experiment, the values n = 10 and p = 1/4 give xcr − 1 = 6 (since P(X10 ≤ 6 | H0) ≈ 0.996 is the smallest cumulative probability reaching 0.99), and therefore the critical value at the 1% significance level is xcr = 7. What this means is that any score xobs of Felix belonging to the rejection region {7, 8, 9, 10} on the right of the binomial pdf graph calls for a rejection of H0 at the 1% significance level.
Now, at the 5% significance level we solve the equation

P(Xn ≥ xcr | H0) = 0.05

for xcr. Equivalently, we look up the value of xcr − 1 in the table for the binomial cumulative probability distribution for sample size n and the probability specified by H0, satisfying the formula

P(Xn ≤ xcr − 1 | H0) = 0.95.

In Felix's ESP example, we find (by interpolation) that the critical value at the 5% significance level is approximately xcr − 1 = 4.5, i.e., we have xcr = 5.5. What this means is that any test score above 5.5 leads to the rejection of H0 at the 5% significance level. Thus the rejection region for Felix's ESP experiment is now given by the set {6, 7, 8, 9, 10}.
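The critical values can be found by scanning the right-tail probabilities directly, rather than via the cumulative table. A minimal Python sketch (function names are ours); it confirms xcr = 7 at the 1% level, since P(X10 ≥ 7 | H0) ≈ 0.0035 ≤ 0.01, and xcr = 6 at the 5% level:

```python
from math import comb

def tail(x: int, n: int = 10, p: float = 0.25) -> float:
    """Right-tail probability P(X_n >= x | H0)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))

def critical_value(alpha: float, n: int = 10, p: float = 0.25) -> int:
    """Smallest x_cr with P(X_n >= x_cr | H0) <= alpha."""
    return next(x for x in range(n + 1) if tail(x, n, p) <= alpha)

print(critical_value(0.01))  # 7 -> rejection region {7, ..., 10}
print(critical_value(0.05))  # 6 -> rejection region {6, ..., 10}
```

At the 5% level the integer cutoff 6 agrees with the interpolated value 5.5: any score of 6 or more falls into the rejection region.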

The binomial graph for a biased coin that makes heads more likely (i.e., the alternative hypothesis has the form Ha : p(head) > 1/2) in n = 16 tosses has the form

[Graph: pX16(x) = BinX16(x | H0) = P(X16 = x | p = 1/2), plotted for x = 0, . . . , 16, with the rejection region on the right.]

The rejection region is given by the head counts X16 = 12, 13, 14, 15, 16.

4. Model validation in left-tail tests


So far we have been analyzing the so-called one-sided or one-tailed tests, in which the alternative hypothesis Ha has the general right-tail or "greater than" form θ > θ0, indicating that the unknown parameter θ is greater than the parameter value θ0 in the null hypothesis H0. However, in one-tailed tests the inequality can also go in the reversed left-tail or "smaller than" direction θ < θ0.
For example, suppose we have a coin that may be biased (loaded) in such a way that a head is less likely to occur than a tail. In this case the null hypothesis has the equational form

H0 : p = 1/2,

stating that the coin is fair, so that the probability of getting a head is P(H) = 0.5. But because it appears that a head is less likely (i.e., P(H) < 0.5), the obvious alternative (or research) hypothesis has the form

Ha : p < 1/2.
Suppose the coin has been tossed n = 16 times. As above, the statistical model
is once again given by a binomial probability distribution.
Before the performance of the coin-tossing experiment we may stipulate that
the rejection region will be specified by the set {0, 1, 2, 3, 4, 5}. In other words,
this time the critical value is xcr = 5, and the hypothesis H0 will be rejected
whenever only 5 or a smaller number of heads occurs in 16 independent tosses
of the coin. The graph of the corresponding statistical model looks as follows:
[Graph: pX16(x) = BinX16(x | H0) = P(X16 = x | p = 1/2), plotted for x = 0, . . . , 16, with the rejection region {0, 1, 2, 3, 4, 5} on the left, cut off at xcr = 5.]

In the graph the cutoff point (critical value) xcr is indicated by a vertical line.
In this setting hypothesis testing is really quite simple. In a given experiment
of 16 coin tosses, simply count the total number of heads and verify whether it
is above the cutoff point xcr . If so, hypothesis H0 is retained, and otherwise H0
is rejected.

Of course, we can calculate the P-value P(Xn ≤ xcr | H0). If it turns out to be smaller than 0.01, then H0 is rejected at the 1% significance level, and similarly for 0.05. Specifically, the binomial table gives P(X16 ≤ 5 | H0) = 0.1051, which is a bit too weak for rejecting H0. However, because P(X16 ≤ 3 | H0) = 0.010, only 3 heads in total in 16 coin tosses would be on the border of a highly significant test at the 1% significance level.
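These left-tail cumulative probabilities can be checked directly. A minimal Python sketch (the function name left_tail_p is ours):

```python
from math import comb

def left_tail_p(x_obs: int, n: int = 16, p: float = 0.5) -> float:
    """Left-tail P-value: P(X_n <= x_obs | H0)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x_obs + 1))

print(round(left_tail_p(5), 4))  # 0.1051
print(round(left_tail_p(3), 4))  # 0.0106 (rounded to 0.010 in the binomial table)
```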

5. Model validation in two-tail tests


As you may have guessed, there is also a two-tail test, in which the null hypothesis H0 has the general form θ = θ0 and the alternative hypothesis Ha is expressed by the inequality θ ≠ θ0. A typical example is a coin that may be biased, but we do not know which way. Therefore the hypotheses are as follows:

H0 : p = 1/2,

stating that the coin is fair, so that the probability of getting a head is P(H) = 0.5. But because it appears that a head and a tail may not be equally likely, the alternative hypothesis has the form

Ha : p ≠ 1/2.
Once again, let us assume that the coin is tossed n = 16 times, and let us stipulate that the rejection region is given by the set {0, 1, 2, 3} ∪ {13, 14, 15, 16}. Thus the critical value on the left is xcr = 3 and on the right it is xcr = 13. In this case the P-value is given by the sum P(X16 ≤ 3 | H0) + P(X16 ≥ 13 | H0) of the P-values on the left and right.
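The two-tail P-value for this rejection region can be computed directly. A minimal Python sketch (variable names are ours); by the symmetry of Bin(k | 1/2), the two tails contribute equally:

```python
from math import comb

def pmf(k: int, n: int = 16, p: float = 0.5) -> float:
    """P(X_16 = k | H0) for the fair-coin model."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

left = sum(pmf(k) for k in range(0, 4))     # P(X16 <= 3 | H0)
right = sum(pmf(k) for k in range(13, 17))  # P(X16 >= 13 | H0)
print(round(left + right, 4))  # 0.0213
```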

The graph of a two-tailed binomial test for a biased coin is as follows:

[Graph: pX16(x) = BinX16(x | H0) = P(X16 = x | p = 1/2), plotted for x = 0, . . . , 16, with rejection regions on both the left and the right.]

6. Model validation in normal distribution experiments


If the sample in a binomial model is large, say n ≥ 30, then the limit theorems of statistics suggest using a normal distribution, which approximates binomial distributions with large n to a high degree of accuracy. Recall that the normal distribution has the graphic form

[Graph: the normal density pX(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)), shown with σ = 0.5 and μ = 0.]

The approximation of a binomial distribution by a normal distribution can be done as follows: if Xn ∼ Bin(k | p) with large n, then Xn is approximately normally distributed with mean μ = np and standard deviation σ = √(np(1 − p)).
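The quality of this approximation can be checked numerically. A Python sketch (the sample size n = 100 and the tail cutoff below are our hypothetical choices, not from the handout); it compares an exact binomial tail probability against the normal tail with a continuity correction:

```python
from math import comb, sqrt
from statistics import NormalDist

# Hypothetical large-sample example: X ~ Bin(n = 100, p = 0.25)
n, p = 100, 0.25
mu, sigma = n * p, sqrt(n * p * (1 - p))  # mu = 25, sigma ~ 4.33

# Exact right-tail probability P(X >= 31) ...
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(31, n + 1))
# ... versus its normal approximation with a continuity correction:
approx = 1 - NormalDist(mu, sigma).cdf(30.5)
print(exact, approx)
```

The two values agree to about two decimal places, which is why large-sample binomial tests are routinely carried out in the normal setting.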

Because it is often easier and far more universal to work with the normal distribution, for large binomial samples hypothesis testing is performed in a normal distribution setting. There is one important technical problem, however. Tables for the normal probability distribution are available only for the special case where the mean is μ = 0 and the standard deviation is σ = 1. In order to be able to use this rather specialized table also for the other normal probability distributions (in which in general μ ≠ 0 and σ ≠ 1), we must transform the original test statistic Xn into its so-called Z-statistic or Z-score (or standardized random variable), defined by
Z =df (X̄n − μ) / σX̄n,

where σX̄n = σ/√n, and the distribution of Z is the standard normal distribution (with μ = 0 and σ = 1) and therefore can be found in the table. If σX̄n is not known, it can be replaced by the sample standard deviation Sn.

7. Model validation in normal right-tail tests


From the normal tables it has been determined that in normal right-tail tests the critical value Z = zcr at the 1% significance level is zcr = 2.33. Likewise, at the 5% significance level the critical value of Z is zcr = 1.645. What this means is that hypothesis H0 is rejected at the 0.05 significance level precisely when the Z-score computed from the test statistic Xn lies beyond the value zcr = 1.645, in the rejection region on the right of the graph, indicated in the diagram below.
[Graph: normal density curve with the rejection region to the right of the cutoff zcr = 1.645.]
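The tabulated right-tail critical values can be reproduced with Python's standard library (statistics.NormalDist; not part of the handout):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mu = 0, sigma = 1
# Right-tail critical value z_cr puts the whole alpha in the upper tail:
print(round(z.inv_cdf(0.99), 2))   # 2.33  -> z_cr at the 1% level
print(round(z.inv_cdf(0.95), 3))   # 1.645 -> z_cr at the 5% level
```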

8. Model validation in normal left-tail tests


In the case of normal left-tail tests the situation is symmetrical to right-tail tests. This time, however, in order to reject H0, the observed value Z = zobs of the Z-score computed from the test statistic Xn should be smaller than, and hence to the left of, the critical value zcr = −1.645, as indicated by the rejection region in the graph below:
[Graph: normal density curve with the rejection region to the left of the cutoff zcr = −1.645.]

9. Model validation in normal two-tail tests


As in the case of binomial tests, two-sided normal tests divide the rejection region into two areas, on the left and on the right. However, in this case the critical values zcr at the 0.01 level of significance are −2.58 and 2.58. Likewise, at the 0.05 significance level the critical values zcr are −1.96 and 1.96, as shown in the graph below:
[Graph: normal density curve with rejection regions to the left of −1.96 and to the right of 1.96.]

Once again, hypothesis H0 is rejected provided that the observed result Z = zobs in a pertinent experiment falls into the rejection region.
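The two-sided critical values arise by splitting the significance level α equally between the two tails; a minimal Python check (statistics.NormalDist; not part of the handout):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mu = 0, sigma = 1
# A two-tail test puts alpha/2 in each tail, so z_cr = inv_cdf(1 - alpha/2):
print(round(z.inv_cdf(1 - 0.01 / 2), 2))  # 2.58 -> +/- z_cr at the 1% level
print(round(z.inv_cdf(1 - 0.05 / 2), 2))  # 1.96 -> +/- z_cr at the 5% level
```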
