
STAT171

Statistical Data Analysis


(2015)
Topic 9
Inference regarding
proportions
(one and two populations)

J&B
Chapter 8 section 5 (one proportion)
Chapter 10 section 7 (two proportions)

1. Testing a hypothesis about π (J&B 8.5)
2. Confidence interval for π (J&B 8.5)
3. Testing a hypothesis about two proportions, π1 and π2 (J&B 10.7)
4. Confidence interval for π1 − π2 (J&B 10.7)

Notation
Text & Lecture notes
n = sample size (number of independent Bernoulli trials)
X = count of the number of successes

Lecture notes
π = population proportion (a constant)
P = the sample proportion, P = X / n

Text book
p = population proportion (a constant)
p̂ = the sample proportion, p̂ = X / n

Care is needed due to the different notation!

Testing a single proportion


Example:
In past years, 15% of people who insured their car made a claim each year. This year, of a random sample of 400 policies, 76 made a claim. Is there any evidence that the proportion has changed?

Setting up the problem:
X = the number in the sample who made a claim this year
$$X \sim B(n, \pi), \qquad x = 0, 1, \ldots, n$$
We have to assume the policyholders are independent.

P = the sample proportion who made a claim this year
$$P = \frac{X}{n}, \qquad p = \frac{0}{n}, \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n}{n}$$

Distribution results for X:

We have:
X = number of claims made in the sample this year
X ~ Binomial with n = 400, and π = 0.15 IF the claim rate is unchanged:
$$X \sim B(400, \pi)$$

For X ~ Bin(n, π):
$$E(X) = n\pi, \qquad \mathrm{Var}(X) = n\pi(1-\pi)$$

For n "sufficiently large" (CLT applies; both nπ > 15 and n(1 − π) > 15):
$$X \overset{\text{approx}}{\sim} N\big(n\pi,\; n\pi(1-\pi)\big) \quad\text{or}\quad \frac{X - n\pi}{\sqrt{n\pi(1-\pi)}} \overset{\text{approx}}{\sim} N(0,1)$$

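Distribution results for P:

The sample proportion of claims, P, has a scaled binomial distribution
P ~ (1/n) × B(400, π), with π = 0.15 IF the claim rate is unchanged.

For P = X/n:
$$E(P) = E\!\left(\frac{X}{n}\right) = \frac{E(X)}{n} = \frac{n\pi}{n} = \pi$$
$$\mathrm{Var}(P) = \mathrm{Var}\!\left(\frac{X}{n}\right) = \frac{\mathrm{Var}(X)}{n^2} = \frac{n\pi(1-\pi)}{n^2} = \frac{\pi(1-\pi)}{n}$$

For n "sufficiently large" (CLT applies; both nπ > 15 and n(1 − π) > 15):
$$P \overset{\text{approx}}{\sim} N\!\left(\pi,\; \frac{\pi(1-\pi)}{n}\right) \quad\text{or}\quad \frac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \overset{\text{approx}}{\sim} N(0,1)$$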

Example (cont)
Here, we have observed the sample result p = 76/400 = 0.19.
We want to test: given that p is 0.19, do we have evidence that π is no longer 0.15? (i.e. has the claim rate changed from 0.15?)
Under the assumption that π is 0.15, we want the probability of getting a sample proportion at least as far away from 0.15 as 0.19 is (that is, p ≤ 0.11 or p ≥ 0.19).
This is the same as getting a sample count X ≥ 76 or X ≤ 44, since
0.15 × 400 = 60
76 − 60 = 16 (we observed 16 more than we expected)
and 60 − 16 = 44 (the same distance away in the other direction)

We can obtain this probability in two ways:

(1) Using the exact binomial:
$$\text{Prob} = \sum_{x=0}^{44} \binom{400}{x} (0.15)^x (0.85)^{400-x} + \sum_{x=76}^{400} \binom{400}{x} (0.15)^x (0.85)^{400-x}$$

(2) Using the normal approximation to the binomial:
$$\text{Prob}(P \le 0.11) + \text{Prob}(P \ge 0.19) \quad\text{where}\quad \frac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \overset{\text{approx}}{\sim} N(0,1)$$
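The two routes can be compared numerically. The following is a minimal Python sketch (Python is not used in these notes; the helper name `binom_pmf` and the variable names are illustrative assumptions). It sums the two binomial tails directly and then evaluates the same two tails under the normal approximation without continuity correction.

```python
from math import comb, sqrt
from statistics import NormalDist

n, pi0 = 400, 0.15   # sample size and the claim rate assumed under H0

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p) (hypothetical helper)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# (1) exact binomial: P(X <= 44) + P(X >= 76)
exact = (sum(binom_pmf(x, n, pi0) for x in range(0, 45))
         + sum(binom_pmf(x, n, pi0) for x in range(76, n + 1)))

# (2) normal approximation, no continuity correction:
#     P(P <= 0.11) + P(P >= 0.19) with P approx N(pi0, pi0(1 - pi0)/n)
se = sqrt(pi0 * (1 - pi0) / n)
p_dist = NormalDist(mu=pi0, sigma=se)
approx = p_dist.cdf(0.11) + (1 - p_dist.cdf(0.19))

print(f"exact binomial: {exact:.4f}")
print(f"normal approx:  {approx:.4f}")
```

The exact sum takes a few hundred terms, which is why the normal approximation is attractive for hand calculation when n is large.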

For the general case

Following the steps as for a test of μ:
<H> H0: π = π0   e.g. H0: π = 0.15
    H1: π ≠ π0        H1: π ≠ 0.15
    α = 0.05

<A> CLT assumption check
The text book states that to use the z-test for proportions, we need both
nπ0 ≥ 15 and n(1 − π0) ≥ 15
⇒ We will use this check (as it is the one in the quizzes).

<T> If H0 is true, set up the test statistic:
$$\text{if } P \overset{\text{approx}}{\sim} N\!\left(\pi_0,\; \frac{\pi_0(1-\pi_0)}{n}\right) \text{ then } \frac{P - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \overset{\text{approx}}{\sim} N(0,1)$$
Note: the mean and variance are exact, the normality is approximate here,
with observed value
$$z_{obs} = \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$$

<P> Obtain the p-value
This enables us to determine whether zobs is a believable or not-believable value from the Z distribution.
For H1: π ≠ π0, p-value = P(|Z| ≥ |zobs|)

Make the decision:
p-value ≤ α ⇒ Reject H0
p-value > α ⇒ Retain H0

<C> Write a meaningful conclusion

Continuity Correction
See J&B p.254

We are approximating a discrete (binomial) distribution with a continuous (normal) distribution. Therefore, the continuity correction should be used.
For a two-sided alternative, the corrected test statistic is:
$$z_{obs} = \frac{|p - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$
(This allows finding the area in both tails including the observed sample p.)

The larger n is, the less important it is to use the continuity correction.
Note: The text book (J&B) does not use the c.c. in hypothesis testing for proportions (which leads to less accurate approximations to the p-value), and this is also the case for the quizzes.

One tailed tests

The hypothesis test can be one or two-tailed.
If one-tailed where H1: π > π0, the test statistic is:
$$z_{obs} = \frac{p - \pi_0 - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$
p-value = P(Z ≥ zobs)

If one-tailed where H1: π < π0, the test statistic is:
$$z_{obs} = \frac{p - \pi_0 + \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$
p-value = P(Z ≤ zobs)

For the example

H0: π = 0.15
H1: π ≠ 0.15
α = 0.05
Checking the assumption of approximate normality:
nπ0 = 400 × 0.15 = 60 ≥ 15
and n(1 − π0) = 400 × 0.85 = 340 ≥ 15
⇒ reasonable to assume normality

The test statistic is
$$Z = \frac{|P - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$

With observed value
$$z_{obs} = \frac{|0.19 - 0.15| - \frac{1}{800}}{\sqrt{0.15(1-0.15)/400}} = \frac{0.03875}{0.0178536} \approx 2.17$$
p-val = P(|Z| ≥ 2.17) ≈ 0.030

Reject H0 at the 5% level of significance. There is sufficient evidence to conclude the proportion of claimants is different from previous years. The sample proportion of insured claiming this year is significantly greater than 15%.
In the above example, not using the continuity correction gives zobs = 2.24 with an associated p-value of 0.025.
That is, no c.c. will give a smaller p-value than when c.c. is used ⇒ the actual Type I error rate will be higher than specified by the significance level.
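The effect of the continuity correction can be checked with a short script. This is a sketch only, assuming Python's standard library in place of z-tables; the function name `one_prop_ztest` and its `continuity` flag are hypothetical and not part of the notes or of Minitab.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_ztest(x, n, pi0, continuity=True):
    """Two-sided z-test of H0: pi = pi0. Returns (z_obs, p_value)."""
    # CLT check under H0, as stated in the notes
    if n * pi0 < 15 or n * (1 - pi0) < 15:
        raise ValueError("CLT check failed: need n*pi0 >= 15 and n*(1-pi0) >= 15")
    p = x / n
    se0 = sqrt(pi0 * (1 - pi0) / n)          # standard error evaluated at pi0
    cc = 1 / (2 * n) if continuity else 0.0  # continuity correction term
    z = (abs(p - pi0) - cc) / se0
    p_value = min(1.0, 2 * (1 - NormalDist().cdf(z)))
    return z, p_value

z_cc, p_cc = one_prop_ztest(76, 400, 0.15, continuity=True)
z_no, p_no = one_prop_ztest(76, 400, 0.15, continuity=False)
print(f"with c.c.:    z = {z_cc:.2f}, p-value = {p_cc:.3f}")
print(f"without c.c.: z = {z_no:.2f}, p-value = {p_no:.3f}")
```

Dropping the correction always moves zobs further from zero, so the uncorrected p-value is never larger than the corrected one.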

Confidence interval for π

[Usually a confidence interval is of the form: statistic ± z_{α/2} × std. error(statistic)]
Here, it should be:
$$p \pm z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}$$
But π is unknown. (Ideally, to get the CI for π, we have to solve a quadratic.)
We don't even have a hypothesised value!

So, we use p as our best estimate of π ⇒ an approx confidence interval for π is:
$$p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$
... we use an approximation for the standard error of P instead of the exact standard error, but we still refer to the z-tables, not the t ... theory to be done next year.

Confidence interval CLT check:

In the hypothesis test for π, we used the normal approximation to the binomial, and had to check the validity of this under H0.
We also need to check the validity of using the CLT for the confidence interval, but here we do not have a π0.
Instead, we check using the sample p:
CLT check: we need np ≥ 15 and n(1 − p) ≥ 15
where np = the sample number of successes
and n(1 − p) = the sample number of failures

Continuity correction:
When doing confidence intervals for π, we don't worry about continuity correction. It is pointless trying to improve accuracy when the standard error is only approximated.

For the example

CLT check:
np = 76 ≥ 15
n(1 − p) = 324 ≥ 15
⇒ We can validly use the normal approximation to the binomial here.

95% confidence interval for π:
$$0.19 \pm 1.96\sqrt{\frac{0.19(0.81)}{400}}$$
= 0.19 ± 1.96 × 0.01962
= 0.19 ± 0.0385
= (0.1515, 0.2285)

We are 95% confident that the interval (0.1515, 0.2285) includes the true population proportion of claimants this year.
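The interval above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `wald_ci` is an assumption for illustration, and it uses the unrounded z value (about 1.95996) rather than the tabled 1.96, so the last digit can differ slightly from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(x, n, conf=0.95):
    """Approximate CI for a proportion: p +/- z_{alpha/2} * sqrt(p(1-p)/n)."""
    p = x / n
    # CLT check using the sample counts of successes and failures
    if n * p < 15 or n * (1 - p) < 15:
        raise ValueError("CLT check failed: need np >= 15 and n(1-p) >= 15")
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z * sqrt(p * (1 - p) / n)   # estimated s.e. uses the sample p
    return p - half_width, p + half_width

lo, hi = wald_ci(76, 400)
print(f"95% CI: ({lo:.4f}, {hi:.4f})")
```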

Using the CI for π for testing H0

Even though this interval for π does not contain 0.15 (the hypothesised proportion for this year), we cannot accurately use it to test the hypothesis H0: π = 0.15 vs H1: π ≠ 0.15.
Why is this so?
In evaluating the standard error of P:
the hypothesis test uses π0; but
the C.I. uses the sample p.
Under H0: π = 0.15 we used:
$$se(p) = \sqrt{\frac{\pi_0(1-\pi_0)}{n}} = \sqrt{\frac{0.15 \times 0.85}{400}} \approx 0.01785$$
For the C.I. calculations we used
$$\widehat{se}(p) = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.19 \times 0.81}{400}} \approx 0.019615$$

However, in most cases, the difference between the two s.e.s will be very small.
Only if π0 is close to the CI boundaries is there a problem with using the CI to perform the hypothesis test.
Here, the 95% c.i. for π was (0.1515, 0.2285) and we were testing H0: π = 0.15, so it is a bit too close to call in this case (so we would have to do the hypothesis test).

Limits on c.i.s for π

A two-sided approx confidence interval for π is:
$$p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$
However, π must be in the interval (0,1) as it is a proportion. (Ideally, to get the CI for π, we have to solve a quadratic.)

The confidence interval CLT check
np ≥ 15 and n(1 − p) ≥ 15
should guarantee the c.i. will not be outside the interval (0,1).
The 3σ CLT check will guarantee the c.i. for π is in the interval (0,1), as long as z_{α/2} < 3.
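For reference, "solving a quadratic" means finding the values of π that satisfy |p − π| = z_{α/2} √(π(1 − π)/n). The resulting interval is commonly called the Wilson score interval, a name these notes do not use. The sketch below is illustrative only; the function name `score_ci` is an assumption, and the closed form is the standard solution of that quadratic.

```python
from math import sqrt
from statistics import NormalDist

def score_ci(x, n, conf=0.95):
    """CI from solving |p - pi| = z * sqrt(pi(1-pi)/n) for pi (a quadratic in pi)."""
    p = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half_width, centre + half_width

lo, hi = score_ci(76, 400)
print(f"95% score interval: ({lo:.4f}, {hi:.4f})")
```

Because each endpoint uses its own value of π in the standard error, this interval stays inside (0,1) without needing a separate check.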

One sided c.i.s for π

For a one-sided CI for π using the normal approximation, we cannot have a boundary of ±∞ ⇒ we have boundaries of 0 or 1 for a proportion.
The 100(1 − α)% c.i. for π:
For a "<" alternative:
$$\left(0,\; p + z_{\alpha}\sqrt{\frac{p(1-p)}{n}}\right)$$
For a ">" alternative:
$$\left(p - z_{\alpha}\sqrt{\frac{p(1-p)}{n}},\; 1\right)$$

Using Minitab (16):

Under Stat > Basic Stats > 1 Proportion
(In MTB 17, there is a drop-down panel for this.)

Under Options, click on "Use test based on normal distribution" to carry out the z test.
Large n ⇒ normal approx quite accurate and quicker than many binomial calculations.

Otherwise, the p-value is calculated using exact binomial probabilities.
Small n ⇒ normal approx not necessarily accurate, and a small number of binomial calculations is quite quick.

Resulting Minitab output

(Minitab does not use π notation in the output, it uses p.)

MTB > POne 400 76;
SUBC> Test 0.15;
SUBC> UseZ.

Test and CI for One Proportion
Test of p = 0.15 vs p not = 0.15
X    N    Sample p  95% CI                Z-Val  P-Val
76   400  0.190     (0.151555, 0.228445)  2.24   0.025
Using the normal approximation.

MTB > POne 400 76;
SUBC> Test 0.15.

Test and CI for One Proportion
Test of p = 0.15 vs p not = 0.15
                                          Exact
X    N    Sample p  95% CI                P-Value
76   400  0.190     (0.152721, 0.231938)  0.036

Two sample test of proportions


Used if we have two independent
samples, where we measure the
proportion of something in each.
Example
Children randomly selected from two different schools take the same test.
The number who pass at each school is
recorded.
At School1, 40 out of 70 pass the test.
At School2, 45 out of 100 pass the test.
We want to know: is there any difference
between the two schools in their overall
pass rates?
The (hypothetical) populations of interest are all
students who may ever be in either of the schools.


Here we have two independent samples:

School1   p1 = 40/70 ≈ 0.57    n1 = 70
School2   p2 = 45/100 = 0.45   n2 = 100

Based on these samples, we need to decide which scenario we believe:
The proportions estimate the same π
(and the difference between p1 and p2 can be explained by random variation)
or
The proportions estimate two different population proportions π1 and π2
(and the difference between p1 and p2 is due to this systematic difference)

General case for two proportions

Sample1: observe X1 successes from n1 observations ⇒ P1 = X1 / n1
Sample2: observe X2 successes from n2 observations ⇒ P2 = X2 / n2

We want to test: H0: π1 = π2 (= π)
                 H1: π1 ≠ π2
at sig level α

If n1 and n2 are large enough to apply the CLT:
$$P_1 \overset{\text{approx}}{\sim} N\!\left(\pi_1,\; \frac{\pi_1(1-\pi_1)}{n_1}\right), \qquad P_2 \overset{\text{approx}}{\sim} N\!\left(\pi_2,\; \frac{\pi_2(1-\pi_2)}{n_2}\right)$$

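If the two samples are independent:
$$P_1 - P_2 \overset{\text{approx}}{\sim} N\!\left(\pi_1 - \pi_2,\; \frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}\right)$$
If H0 is true, i.e. if π1 = π2 = π:
$$P_1 - P_2 \overset{\text{approx}}{\sim} N\!\left(\pi - \pi,\; \frac{\pi(1-\pi)}{n_1} + \frac{\pi(1-\pi)}{n_2}\right) = N\!\left(0,\; \pi(1-\pi)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right)$$
Therefore the test statistic is:
$$z_{obs} = \frac{p_1 - p_2}{\sqrt{\pi(1-\pi)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
But π is unknown!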

We cannot get the exact standard error of P1 − P2, as we need the (unknown) value of π to substitute in.
⇒ use the pooled sample proportion to estimate π.
Use
π̂ = p = weighted average of p1 and p2
  = (number of successes in sample1 + number of successes in sample2) / total n
So
$$\hat{\pi} = p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{x_1 + x_2}{n_1 + n_2}$$
The sample proportions are weighted by the sample sizes.

This then gives an estimated standard error of P1 − P2, and we get the observed test statistic:
$$z_{obs} = \frac{p_1 - p_2}{\sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
Obtain the p-value and then reject or retain H0 like any other z-test.
As for any test, this can be one or two tailed.
This IS a z-test (even though we have estimated the standard error of (P1 − P2) using the pooled p-hat) ... as we have used binomial distributional properties in this estimation.

CLT check for two proportions

We need approximate normality for both sample proportions under H0, but we don't know the value of π, so use its estimate, the pooled sample proportion p:
Need: n1 p ≥ 15 and n1(1 − p) ≥ 15
      n2 p ≥ 15 and n2(1 − p) ≥ 15
These are just the numbers of successes and failures expected in the two samples under H0.

Continuity correction?
There is no need for continuity correction in two sample proportions tests, as you need to add and subtract a correction term (one for p1 and one for p2) and they will approximately cancel.

For the school example

H0: π1 = π2
H1: π1 ≠ π2
α = 0.05
p1 = 40/70 ≈ 0.57 based on n1 = 70
p2 = 45/100 = 0.45 based on n2 = 100
Under H0, the pooled proportion is:
$$p = \hat{\pi} = \frac{40 + 45}{70 + 100} = \frac{85}{170} = \frac{1}{2} = 0.50$$
Checking for approximate normality:
n1 p = 35 ≥ 15 and n1(1 − p) = 35 ≥ 15
n2 p = 50 ≥ 15 and n2(1 − p) = 50 ≥ 15
⇒ CLT applies

$$z_{obs} = \frac{\frac{40}{70} - \frac{45}{100}}{\sqrt{0.5(0.5)\left(\frac{1}{70} + \frac{1}{100}\right)}} = \frac{17/140}{0.07792} \approx 1.558$$
p-value = P(|Z| ≥ 1.56) ≈ 2 × 0.0594 ≈ 0.119
p-value > 0.05 ⇒ retain H0

There is insufficient evidence, at the 5% significance level, to be able to conclude there is a difference in the pass rate between the two schools.
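The pooled test for the school example can be sketched in Python as follows. The function name `two_prop_ztest` is an illustrative assumption, and the script works with the unrounded zobs, so the p-value may differ in the third decimal from a z-table lookup.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_ztest(x1, n1, x2, n2):
    """Two-sided pooled z-test of H0: pi1 = pi2. Returns (z_obs, p_value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)           # pooled estimate of the common pi
    # CLT check under H0, using the pooled proportion for both samples
    for n in (n1, n2):
        if n * pooled < 15 or n * (1 - pooled) < 15:
            raise ValueError("CLT check failed for a sample of size %d" % n)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z_obs, p_value = two_prop_ztest(40, 70, 45, 100)
print(f"z = {z_obs:.3f}, p-value = {p_value:.3f}")
```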

Confidence interval for π1 − π2

Here, we have no null hypothesis, so we are not assuming that π1 = π2.
When doing the hypothesis test, we averaged p1 and p2 to get a pooled p, and used that in our estimate of s.e.(p1 − p2).
However, to evaluate the confidence interval, we still need an estimate of
$$se(p_1 - p_2) = \sqrt{\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}}$$
⇒ use p1 as an estimate of π1 and p2 as an estimate of π2.
So, an approx 100(1 − α)% C.I. for π1 − π2 is:
$$(p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

Warning: We can't use the confidence interval to accurately carry out the hypothesis test H0: π1 = π2, as the standard error of the difference in the two sample proportions is evaluated in different ways:
Under H0, there was only one value of π to estimate, and we then used the pooled p to estimate the relevant standard error
$$\widehat{se}(p_1 - p_2) = \sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
In the CI calculations, we are not assuming π1 = π2, so the relevant standard error is estimated by:
$$\widehat{se}(p_1 - p_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

For the example

95% C.I. for π1 − π2 is:
$$\left(\frac{40}{70} - \frac{45}{100}\right) \pm 1.96\sqrt{\frac{\frac{40}{70}\cdot\frac{30}{70}}{70} + \frac{\frac{45}{100}\cdot\frac{55}{100}}{100}}$$
= (0.5714 − 0.45) ± 1.96 × 0.077289
= 0.1214 ± 0.1515
= (−0.030, 0.273)

We are 95% confident that the above interval includes the true difference between the population proportions.
Note:
the standard error used in the hypothesis test calculations was 0.07792;
the standard error used in the c.i. calculations was 0.07729
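A minimal Python sketch of the unpooled interval follows; the function name `two_prop_ci` is an assumption for illustration. It returns the unpooled standard error as well, so it can be set beside the pooled value used in the hypothesis test.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_ci(x1, n1, x2, n2, conf=0.95):
    """Approximate CI for pi1 - pi2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # unpooled s.e.
    diff = p1 - p2
    return diff - z * se, diff + z * se, se

lo, hi, se_unpooled = two_prop_ci(40, 70, 45, 100)
print(f"95% CI for pi1 - pi2: ({lo:.4f}, {hi:.4f}), s.e. = {se_unpooled:.5f}")
```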


Using Minitab
Under Stat > Basic Stats > 2 Proportions

"Use pooled estimate of p for test" must be ticked, or the CI (unpooled p) estimate of the standard error is used in the hypothesis test.

Note the only option is to use the normal approximation. There is no exact binomial test.

Output (pooled option)

MTB > PTwo 70 40 100 45;
SUBC> Pooled.

Test and CI for Two Proportions
Sample  X   N    Sample p
1       40  70   0.571429
2       45  100  0.450000

Difference = p(1) - p(2)
Estimate for difference: 0.121429
95% CI for diff: (-0.0300545, 0.272912)
Test for difference = 0 (vs not = 0):
Z = 1.56 P-Value = 0.119
Fisher's exact test: P-Value = 0.161 (ignore this until next year)

Output (unpooled option)

MTB > PTwo 70 40 100 45.

Test and CI for Two Proportions
Sample  X   N    Sample p
1       40  70   0.571429
2       45  100  0.450000

Difference = p(1) - p(2)
Estimate for difference: 0.121429
95% CI for diff: (-0.0300545, 0.272912)
Test for difference = 0 (vs not = 0):
Z = 1.57 P-Value = 0.116

Results:
same CI
only a small difference in z (1.56 vs 1.57) and p values (0.119 vs 0.116)

Topic 9. Appendix A
Insurance claims example:
In past years, 15% of the policy holders
have made an insurance claim per year.
This year, of a random sample of 400
policies, 76 have made a claim.
Is there any evidence that the proportion
has increased by a factor of more than 1.2
times?
Sample estimate of π is p = 76/400 = 0.19
Ratio of proportions is
$$\frac{\text{sample proportion}}{\text{past proportion}} = \frac{0.19}{0.15} \approx 1.267$$
(bigger than 1.2)

The results of previous inference were:

H0: π = 0.15 vs H1: π ≠ 0.15 was rejected with p-val = 0.030
The 95% CI for π is (0.1515, 0.2285)
⇒ We found that there is evidence that the true proportion this year is higher than 15%

BUT we have not yet answered the question about whether the proportion has increased by a factor of more than 1.2 times!

We can approach this a couple of ways:

(1) Hypothesis test
H0: π = 0.15 × 1.2 = 0.18
H1: π > 0.18
α = 0.05
CLT check:
nπ0 = 400 × 0.18 = 72 ≥ 15
n(1 − π0) = 400 × 0.82 = 328 ≥ 15
$$z_{obs} = \frac{p - \pi_0 - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}} = \frac{\frac{76}{400} - 0.18 - \frac{1}{800}}{\sqrt{0.18(0.82)/400}} = \frac{0.00875}{0.0192094} \approx 0.46$$
p-value = P(Z ≥ 0.46) ≈ 0.3228
⇒ Insufficient evidence (at α = 5%) to conclude the proportion has increased by a factor of more than 1.2.
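The one-tailed calculation in (1) can be checked with the short sketch below (variable names are illustrative assumptions). It keeps zobs unrounded, so the p-value differs slightly from the table value obtained after rounding zobs to 0.46.

```python
from math import sqrt
from statistics import NormalDist

n, x = 400, 76
pi0 = 0.15 * 1.2                  # hypothesised proportion under H0 (0.18)
p = x / n

se0 = sqrt(pi0 * (1 - pi0) / n)   # standard error evaluated at pi0
# upper-tailed test with continuity correction: subtract 1/(2n)
z_obs = (p - pi0 - 1 / (2 * n)) / se0
p_value = 1 - NormalDist().cdf(z_obs)

print(f"z = {z_obs:.4f}, one-sided p-value = {p_value:.4f}")
```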

(2) Confidence interval (one sided)
(not strictly equivalent)

CLT check (sample numbers):
np = 76 ≥ 15
n(1 − p) = 324 ≥ 15

95% CI lower bound for π:
$$p - z_{0.05}\sqrt{\frac{p(1-p)}{n}} = \frac{76}{400} - 1.645\sqrt{\frac{\frac{76}{400}\cdot\frac{324}{400}}{400}} = 0.19 - 0.032267 = 0.1577$$
The max value for a proportion is 1, so the upper limit cannot be ∞.

We are 95% certain the interval (0.1577, 1) contains the true proportion. Because 0.18 is in the interval (and is not close to the boundary), there is insufficient evidence to be able to claim that the proportion has increased by a factor of more than 1.2.
To claim the proportion has increased by a factor of more than 1.2, the CI for the new π would have to be completely ABOVE 0.18.

(3) Confidence interval for the ratio

The claim is that the ratio of the proportions is more than 1.2 ⇒ i.e. that π / 0.15 > 1.2
The appropriate 95% one-sided CI for the new π was found to be (0.1577, 1).
So, the appropriate 95% one-sided CI for the ratio π / 0.15 is
$$\left(\frac{0.1577}{0.15},\; \frac{1}{0.15}\right) \approx (1.051,\; 6.667)$$
Because 1.2 is in the interval (and is not close to the boundary), there is insufficient evidence to be able to claim that the ratio of the proportions is more than 1.2.
To claim the ratio of proportions is more than 1.2, the CI would have to be completely ABOVE 1.2.
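Approaches (2) and (3) amount to one lower bound followed by a rescaling, which the sketch below illustrates. Variable names such as `past` and `ratio_ci` are assumptions for illustration; dividing both limits by 0.15 treats the past proportion as a known constant, as the notes do.

```python
from math import sqrt
from statistics import NormalDist

n, x, past = 400, 76, 0.15
p = x / n

# one-sided 95% lower bound: p - z_{0.05} * sqrt(p(1-p)/n); upper limit is 1
z = NormalDist().inv_cdf(0.95)
lower = p - z * sqrt(p * (1 - p) / n)

# rescale both limits by the (known) past proportion to get a CI for the ratio
ratio_ci = (lower / past, 1 / past)

# the claim "ratio > 1.2" needs the whole interval above 1.2
exceeds = ratio_ci[0] > 1.2

print(f"CI for pi: ({lower:.4f}, 1)")
print(f"CI for ratio: ({ratio_ci[0]:.3f}, {ratio_ci[1]:.3f}); entirely above 1.2? {exceeds}")
```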


Summary: One Sample Proportions Test & C.I.

Hypothesis test:
H0: π = π0 versus H1: π ≠ π0
Sample: P = X/n where X ~ Bin(n, π)
CLT check: nπ0 ≥ 15 and n(1 − π0) ≥ 15
$$z_{obs} = \frac{|p - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$

Confidence Interval:
CLT check: np ≥ 15 and n(1 − p) ≥ 15
An approximate 100(1 − α)% CI for π is
$$p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$
One-sided CIs for π must have as their limit 0 or 1 (not ±∞)

Summary: Two Sample Proportions Test

Hypothesis test:
H0: π1 = π2 versus H1: π1 ≠ π2
Sample: P1 = X1/n1 and P2 = X2/n2
Pooled estimate of π is
$$P = \frac{X_1 + X_2}{n_1 + n_2}$$
CLT check:
n1 p ≥ 15 and n1(1 − p) ≥ 15
and
n2 p ≥ 15 and n2(1 − p) ≥ 15
$$z_{obs} = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

Summary: Two Sample Proportions C.I.

Confidence Interval:
Sample: P1 = X1/n1 and P2 = X2/n2
CLT check (simply uses observed counts):
n1 p1 ≥ 15 and n1(1 − p1) ≥ 15
and
n2 p2 ≥ 15 and n2(1 − p2) ≥ 15
An approximate 100(1 − α)% CI for (π1 − π2) is
$$(p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$
One-sided CIs for (π1 − π2) must have as their limit −1 or +1 (not ±∞)
