STAT171
Statistical Data Analysis

(2015)
Topic 9
Inference regarding proportions
(one and two populations)
J&B
Chapter 8, Section 5 (one proportion)
Chapter 10, Section 7 (two proportions)
1. Testing a hypothesis about $\pi$ (J&B 8.5)
2. Confidence interval for $\pi$ (J&B 8.5)
3. Testing a hypothesis about two proportions, $\pi_1$ and $\pi_2$ (J&B 10.7)
4. Confidence interval for $\pi_1 - \pi_2$ (J&B 10.7)
Notation
Text & Lecture notes
n = sample size
(number of independent Bernoulli trials)
X = count of the number of successes

Lecture notes
$\pi$ = population proportion (a constant)
P = the sample proportion, P = X / n
Text book
p = population proportion (a constant)
$\hat{p}$ = the sample proportion, $\hat{p}$ = X / n
Care is needed due to the different notation!
Testing a single proportion

Example:
In past years, 15% of people who insured their car made a claim each year. This year, of a random sample of 400 policies, 76 made a claim. Is there any evidence that the proportion has changed?
Setting up the problem:
X = the number in the sample who made a claim this year
$X \sim B(n, \pi)$, $(x = 0, 1, \ldots, n)$
We have to assume the policyholders are independent.
P = the sample proportion who made a claim this year
$P = X/n$, with possible values $p = \frac{0}{n}, \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n}{n}$
Distribution results for X:

We have:
X = number of claims made in the sample this year
X ~ Binomial with $n = 400$ and $\pi = 0.15$ IF the claim rate is unchanged:
$X \sim B(400, \pi)$
For $X \sim \mathrm{Bin}(n, \pi)$:
$E(X) = n\pi$
$\mathrm{Var}(X) = n\pi(1-\pi)$
For n "sufficiently large" (CLT applies: both $n\pi > 15$ and $n(1-\pi) > 15$):

$X \overset{\text{approx}}{\sim} N\big(n\pi,\ n\pi(1-\pi)\big)$
or
$\dfrac{X - n\pi}{\sqrt{n\pi(1-\pi)}} \overset{\text{approx}}{\sim} N(0, 1)$
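Distribution results for P:

The sample proportion of claims, P, has a scaled binomial distribution
$P \sim \frac{1}{n} B(400, \pi)$
$\pi = 0.15$ IF the claim rate is unchanged
For $P = X/n$:
$E(P) = E\left(\dfrac{X}{n}\right) = \dfrac{E(X)}{n} = \dfrac{n\pi}{n} = \pi$
$\mathrm{Var}(P) = \mathrm{Var}\left(\dfrac{X}{n}\right) = \dfrac{\mathrm{Var}(X)}{n^2} = \dfrac{n\pi(1-\pi)}{n^2} = \dfrac{\pi(1-\pi)}{n}$
For n "sufficiently large" (CLT applies: both $n\pi > 15$ and $n(1-\pi) > 15$):

$P \overset{\text{approx}}{\sim} N\left(\pi,\ \dfrac{\pi(1-\pi)}{n}\right)$
or
$\dfrac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \overset{\text{approx}}{\sim} N(0, 1)$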
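The mean and variance results for P can be checked numerically. The sketch below is an illustration only (it is not part of the lecture material, and the helper name `binom_pmf` is made up for this example). It sums over the exact binomial distribution with n = 400 and $\pi$ = 0.15 and compares the results with $\pi$ and $\pi(1-\pi)/n$.

```python
from math import comb

def binom_pmf(x, n, pi):
    """P(X = x) for X ~ Bin(n, pi); hypothetical helper for this sketch."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

n, pi = 400, 0.15

# Possible values of P = X/n and their exact probabilities
values = [x / n for x in range(n + 1)]
probs = [binom_pmf(x, n, pi) for x in range(n + 1)]

# Mean and variance of P computed directly from the distribution
mean_p = sum(v * w for v, w in zip(values, probs))
var_p = sum((v - mean_p) ** 2 * w for v, w in zip(values, probs))

print("E(P)   =", mean_p, " vs pi          =", pi)
print("Var(P) =", var_p, " vs pi(1-pi)/n =", pi * (1 - pi) / n)
```

The two pairs of numbers should agree up to floating-point rounding. This confirms that only the normal shape is approximate, while the mean and variance are exact.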
Example (cont)
Here, we have observed the sample result
p = 76/400 = 0.19
We want to test: given that p is 0.19, do we have evidence that $\pi$ is no longer 0.15?
(i.e. has the claim rate changed from 0.15?)
Under the assumption that $\pi$ is 0.15, we want the probability of getting a sample proportion at least as far from 0.15 as 0.19 is (that is, $p \le 0.11$ or $p \ge 0.19$).
This is the same as getting a sample count $X \ge 76$ or $X \le 44$,
since 0.15 × 400 = 60,
76 - 60 = 16 (we observed 16 more than we expected)
and 60 - 16 = 44 (the same distance away in the other direction).
We can obtain this probability in two ways:

(1) using the exact binomial:
$\text{Prob} = \sum_{x=0}^{44} \binom{400}{x} (0.15)^x (0.85)^{400-x} + \sum_{x=76}^{400} \binom{400}{x} (0.15)^x (0.85)^{400-x}$
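As a rough illustration of option (1), the two binomial sums can be evaluated directly. This is a minimal sketch using only the Python standard library, and `binom_pmf` is a hypothetical helper name. It adds the two equal-distance tails exactly as written above, which need not match the rule a statistics package uses for its "exact" two-sided p-value.

```python
from math import comb

def binom_pmf(x, n, pi):
    """P(X = x) for X ~ Bin(n, pi); hypothetical helper for this sketch."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

n, pi0 = 400, 0.15

# Lower tail: X <= 44 (16 or more below the expected count of 60)
lower_tail = sum(binom_pmf(x, n, pi0) for x in range(0, 45))
# Upper tail: X >= 76 (16 or more above the expected count of 60)
upper_tail = sum(binom_pmf(x, n, pi0) for x in range(76, n + 1))

prob = lower_tail + upper_tail
print("P(X <= 44) =", lower_tail)
print("P(X >= 76) =", upper_tail)
print("two-tailed probability =", prob)
```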
(2) Using the normal approximation to the binomial:
$\text{Prob}(P \le 0.11) + \text{Prob}(P \ge 0.19)$
where
$\dfrac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \overset{\text{approx}}{\sim} N(0, 1)$
For the general case

Following the steps as for a test of $\mu$:
<H> H0: $\pi = \pi_0$, e.g. H0: $\pi = 0.15$
    H1: $\pi \ne \pi_0$, e.g. H1: $\pi \ne 0.15$
    $\alpha = 0.05$
<A> CLT assumption check

The text book states that to use the z-test for proportions, we need both
$n\pi_0 \ge 15$ and $n(1-\pi_0) \ge 15$
We will use this check (as it is the one in the quizzes).
<T> If H0 is true, set up the test statistic:

if $P \overset{\text{approx}}{\sim} N\left(\pi_0,\ \dfrac{\pi_0(1-\pi_0)}{n}\right)$
then $\dfrac{P - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \overset{\text{approx}}{\sim} N(0, 1)$
Note: the mean and variance are exact, the normality is approximate here
with observed value
$z_{obs} = \dfrac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$
<P> Obtain the p-value

This enables us to determine whether $z_{obs}$ is a believable or not-believable value from the Z distribution.
For HA: $\pi \ne \pi_0$:
p-value $\approx P(|Z| \ge |z_{obs}|)$
Make the decision:

p-value $\le \alpha$ $\Rightarrow$ Reject H0
p-value $> \alpha$ $\Rightarrow$ Retain H0
<C> Write a meaningful conclusion
Continuity Correction
See J&B p.254
We are approximating a discrete (binomial) distribution with a continuous (normal) distribution. Therefore, the continuity correction should be used.
For a two-sided alternative, the corrected test statistic is:
$z_{obs} = \dfrac{|p - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$
This allows finding the area in both tails, including the observed sample p.
The larger n is, the less important it is to use the continuity correction.
Note: The text book (J&B) does not use the c.c. in hypothesis testing for proportions (which leads to less accurate approximations to the p-value), and this is also the case for the quizzes.
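One tailed tests

The hypothesis test can be one or two-tailed.
If one-tailed where H1: $\pi > \pi_0$
Test statistic is:
$z_{obs} = \dfrac{p - \pi_0 - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$
p-value $\approx P(Z \ge z_{obs})$

If one-tailed where H1: $\pi < \pi_0$
Test statistic is:
$z_{obs} = \dfrac{p - \pi_0 + \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$
p-value $\approx P(Z \le z_{obs})$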
For the example

H0: $\pi = 0.15$
H1: $\pi \ne 0.15$
$\alpha = 0.05$
Checking the assumption of approximate normality:
$n\pi_0 = 400 \times 0.15 = 60 \ge 15$
and
$n(1-\pi_0) = 400 \times 0.85 = 340 \ge 15$
$\Rightarrow$ reasonable to assume normality

The test statistic is
$Z = \dfrac{|P - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$
With observed value

$z_{obs} = \dfrac{|0.19 - 0.15| - \frac{1}{800}}{\sqrt{0.15(1-0.15)/400}} = \dfrac{0.03875}{0.0178536} \approx 2.17$
p-val $= P(|Z| \ge 2.17) \approx 0.030$

Reject H0 at the 5% level of significance.
There is sufficient evidence to conclude the proportion of claimants is different from previous years. The sample proportion of insured claiming this year is significantly greater than 15%.
In the above example, not using the continuity correction gives $z_{obs} = 2.24$ with an associated p-value of 0.025.
That is, omitting the c.c. gives a smaller p-value than when the c.c. is used, so the actual Type I error rate will be higher than specified by the significance level.
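The two-sided test for the insurance example, with and without the continuity correction, can be reproduced with a short script. This is a sketch rather than a reference implementation. The function names are invented for illustration, the normal CDF is built from `math.erf` instead of z-tables, and the `max(..., 0.0)` guard (which stops the corrected distance going negative) is an added assumption not stated in the notes.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF, P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def one_prop_ztest_two_sided(x, n, pi0, continuity=True):
    """Return (|z_obs|, p-value) for H0: pi = pi0 vs H1: pi != pi0."""
    p = x / n
    se0 = sqrt(pi0 * (1 - pi0) / n)      # standard error under H0
    dist = abs(p - pi0)
    if continuity:
        # Continuity correction: shrink the distance by 1/(2n).
        # The floor at 0 is an assumption added for this sketch.
        dist = max(dist - 1 / (2 * n), 0.0)
    z = dist / se0
    p_value = 2 * (1 - normal_cdf(z))    # area in both tails
    return z, p_value

z_cc, p_cc = one_prop_ztest_two_sided(76, 400, 0.15, continuity=True)
z_plain, p_plain = one_prop_ztest_two_sided(76, 400, 0.15, continuity=False)
print(f"with c.c.:    z = {z_cc:.2f}, p-value = {p_cc:.3f}")
print(f"without c.c.: z = {z_plain:.2f}, p-value = {p_plain:.3f}")
```

The printed values should agree with the hand calculation above (2.17 and 0.030 with the correction, 2.24 and 0.025 without).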
Confidence interval for $\pi$

[Usually a confidence interval is of the form:
statistic $\pm\ z_{\alpha/2} \times$ std. error(statistic)]
Here, it should be: $p \pm z_{\alpha/2} \sqrt{\dfrac{\pi(1-\pi)}{n}}$
But $\pi$ is unknown. (Ideally, to get the CI for $\pi$, we have to solve a quadratic.)
We don't even have a hypothesised value!

So, we use p as our best estimate of $\pi$ $\Rightarrow$ an approx confidence interval for $\pi$ is:
$p \pm z_{\alpha/2} \sqrt{\dfrac{p(1-p)}{n}}$
... we use an approximation for the standard error of P instead of the exact standard error, but we still refer to the z-tables, not the t ... theory to be done next year.
Confidence interval CLT check:

In the hypothesis test for $\pi$, we used the normal approximation to the binomial, and had to check the validity of this under H0.
We also need to check the validity of using the CLT for the confidence interval, but here we do not have a $\pi_0$.
Instead, we check using the sample p:
CLT check:
we need $np \ge 15$ and $n(1-p) \ge 15$

where np = the sample number of successes
and n(1-p) = the sample number of fails
Continuity correction:
When doing confidence intervals for $\pi$, we don't worry about continuity correction. It is pointless trying to improve accuracy when the standard error is only approximated.
For the example

CLT check:
$np = 76 \ge 15$
$n(1-p) = 324 \ge 15$
We can validly use the normal approximation to the binomial here.
95% confidence interval for $\pi$:
$0.19 \pm 1.96 \sqrt{\dfrac{0.19(0.81)}{400}}$
$= (0.19 \pm 1.96 \times 0.01963)$
$= (0.19 \pm 0.0385)$
$= (0.1515,\ 0.2285)$
We are 95% confident that the interval (0.1515, 0.2285) includes the true population proportion of claimants this year.
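A minimal sketch of the approximate interval computed above, assuming the multiplier 1.96 for 95% confidence. The function name `approx_ci_proportion` and the decision to raise an error when the CLT check fails are choices made for this illustration only.

```python
from math import sqrt

def approx_ci_proportion(x, n, z=1.96):
    """Approximate two-sided CI for pi: p +/- z * sqrt(p(1-p)/n)."""
    # CLT check on the observed numbers of successes and failures
    if x < 15 or n - x < 15:
        raise ValueError("CLT check failed: need np >= 15 and n(1-p) >= 15")
    p = x / n
    se_hat = sqrt(p * (1 - p) / n)   # estimated standard error of P
    return p - z * se_hat, p + z * se_hat

lower, upper = approx_ci_proportion(76, 400)
print(f"95% CI for pi: ({lower:.4f}, {upper:.4f})")
```

Carrying full precision gives limits that differ from the hand-rounded slide values only in the fourth decimal place.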
Using the CI for $\pi$ for testing H0

Even though this interval for $\pi$ does not contain 0.15 (the hypothesised proportion for this year), we cannot accurately use it to test the hypothesis H0: $\pi = 0.15$ vs H1: $\pi \ne 0.15$.
Why is this so?
In evaluating the standard error of P:
the hypothesis test uses $\pi_0$; but
the C.I. uses the sample p.
Under H0: $\pi = 0.15$ we used:
$se(p) = \sqrt{\dfrac{\pi_0(1-\pi_0)}{n}} = \sqrt{\dfrac{0.15 \times 0.85}{400}} \approx 0.01785$
For the C.I. calculations we used
$\widehat{se}(p) = \sqrt{\dfrac{p(1-p)}{n}} = \sqrt{\dfrac{0.19 \times 0.81}{400}} \approx 0.019615$
However, in most cases, the difference between the two s.e.s will be very small.
Only if $\pi_0$ is close to the CI boundaries is there a problem with using the CI to perform the hypothesis test.
Here, the 95% CI for $\pi$ was (0.1515, 0.2285) and we were testing H0: $\pi = 0.15$, so it is a bit too close to call in this case (so we would have to do the hypothesis test).
Limits on c.i.s for $\pi$

A two-sided approx confidence interval for $\pi$ is:
$p \pm z_{\alpha/2} \sqrt{\dfrac{p(1-p)}{n}}$
However, $\pi$ must be in the interval (0,1) as it is a proportion.
Ideally, to get the CI for $\pi$, we have to solve a quadratic.
The confidence interval CLT check
$np \ge 15$ and $n(1-p) \ge 15$
should guarantee the CI will not be outside the interval (0,1).
The 3$\sigma$ CLT check will guarantee the CI for $\pi$ is in the interval (0,1), as long as $z_{\alpha/2} < 3$.
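One sided c.i.s for $\pi$

For a one-sided CI for $\pi$ using the normal approximation, we cannot have a boundary of $\pm\infty$
$\Rightarrow$ we have boundaries of 0 or 1 for a proportion.
The 100(1-$\alpha$)% CI for $\pi$:
For a "<" alternative: $\left(0,\ p + z_{\alpha} \sqrt{\dfrac{p(1-p)}{n}}\right)$
For a ">" alternative: $\left(p - z_{\alpha} \sqrt{\dfrac{p(1-p)}{n}},\ 1\right)$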
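The one-sided intervals can be sketched in the same way. The function below is illustrative only: its name, the `alternative` argument and the default $z_{0.05}$ = 1.645 are assumptions made for this example. It pins the open end of the interval at 0 or 1.

```python
from math import sqrt

def one_sided_ci(x, n, z_alpha=1.645, alternative="greater"):
    """One-sided approximate CI for pi, bounded by 0 or 1 instead of infinity."""
    p = x / n
    margin = z_alpha * sqrt(p * (1 - p) / n)
    if alternative == "greater":      # ">" alternative: lower bound, upper limit 1
        return max(p - margin, 0.0), 1.0
    if alternative == "less":         # "<" alternative: upper bound, lower limit 0
        return 0.0, min(p + margin, 1.0)
    raise ValueError("alternative must be 'greater' or 'less'")

# Insurance example: 76 claims in 400 policies, 95% one-sided intervals
print(one_sided_ci(76, 400, alternative="greater"))
print(one_sided_ci(76, 400, alternative="less"))
```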
Using Minitab (16):

Under Stat > Basic Stats > 1 Proportion
In MTB 17, there is a drop-down panel for this.
Under options, click on "use test based on normal distribution" to carry out the z test.
Large n $\Rightarrow$ normal approx quite accurate and quicker than many binomial calculations.
Otherwise, the p-value is calculated using exact binomial probabilities.
Small n $\Rightarrow$ normal approx not necessarily accurate, and a small number of binomial calculations is quite quick.
Resulting Minitab output

Minitab does not use $\pi$ notation in the output; it uses p.

MTB > POne 400 76;
SUBC> Test 0.15;
SUBC> UseZ.

Test and CI for One Proportion

Test of p = 0.15 vs p not = 0.15
 X    N  Sample p  95% CI                Z-Val  P-Val
76  400  0.190     (0.151555,0.228445)   2.24   0.025
Using the normal approximation.

MTB > POne 400 76;
SUBC> Test 0.15.

Test and CI for One Proportion

Test of p = 0.15 vs p not = 0.15
 X    N  Sample p  95% CI                Exact P-Value
76  400  0.190     (0.152721,0.231938)   0.036
Two sample test of proportions

Used if we have two independent samples, where we measure the proportion of something in each.
Example
Children randomly selected from two different schools take the same test.
The number who pass at each school is recorded.
At School1, 40 out of 70 pass the test.
At School2, 45 out of 100 pass the test.
We want to know: is there any difference between the two schools in their overall pass rates?
The (hypothetical) populations of interest are all students who may ever be in either of the schools.
Here we have two independent samples:

School1: $p_1 = 40/70 \approx 0.57$, $n_1 = 70$
School2: $p_2 = 45/100 = 0.45$, $n_2 = 100$
Based on these samples, we need to decide which scenario we believe:
The proportions estimate the same $\pi$
(and the difference between $p_1$ and $p_2$ can be explained by random variation)
or
The proportions estimate two different population proportions $\pi_1$ and $\pi_2$
(and the difference between $p_1$ and $p_2$ is due to this systematic difference)
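General case for two proportions

Sample1: observe $X_1$ successes from $n_1$ observations, $P_1 = X_1 / n_1$
Sample2: observe $X_2$ successes from $n_2$ observations, $P_2 = X_2 / n_2$
We want to test: H0: $\pi_1 = \pi_2$ $(= \pi)$
                 H1: $\pi_1 \ne \pi_2$
at sig level $\alpha$
If $n_1$ and $n_2$ are large enough to apply the CLT:
$P_1 \overset{\text{approx}}{\sim} N\left(\pi_1,\ \dfrac{\pi_1(1-\pi_1)}{n_1}\right)$
$P_2 \overset{\text{approx}}{\sim} N\left(\pi_2,\ \dfrac{\pi_2(1-\pi_2)}{n_2}\right)$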
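If the two samples are independent:

$P_1 - P_2 \overset{\text{approx}}{\sim} N\left(\pi_1 - \pi_2,\ \dfrac{\pi_1(1-\pi_1)}{n_1} + \dfrac{\pi_2(1-\pi_2)}{n_2}\right)$
If H0 is true, i.e. if $\pi_1 = \pi_2 = \pi$:
$P_1 - P_2 \overset{\text{approx}}{\sim} N\left(\pi - \pi,\ \dfrac{\pi(1-\pi)}{n_1} + \dfrac{\pi(1-\pi)}{n_2}\right) = N\left(0,\ \pi(1-\pi)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)\right)$
Therefore the test statistic is:

$z_{obs} = \dfrac{p_1 - p_2}{\sqrt{\pi(1-\pi)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
But $\pi$ is unknown!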
We cannot get the exact standard error of $P_1 - P_2$, as we need the (unknown) value of $\pi$ to substitute in.
$\Rightarrow$ use the pooled sample proportion to estimate $\pi$.
Use
$\hat{\pi} = \hat{p}$ = weighted average of $p_1$ and $p_2$
= (number of successes in sample1 + number of successes in sample2) / total n
So
$\hat{\pi} = \hat{p} = \dfrac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \dfrac{x_1 + x_2}{n_1 + n_2}$
The sample proportions are weighted by the sample sizes.
This then gives an estimated standard error of $P_1 - P_2$, and we get the observed test statistic:
$z_{obs} = \dfrac{p_1 - p_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
Obtain the p-value and then reject or retain H0 like any other z-test.
As for any test, this can be one or two tailed.
This IS a z-test (even though we have estimated the standard error of $(P_1 - P_2)$ using the pooled p-hat) ... as we have used binomial distributional properties in this estimation.
CLT check for two proportions

We need approximate normality for both sample proportions under H0, but we don't know the value of $\pi$, so use its estimate, the pooled sample proportion $\hat{p}$:
Need: $n_1\hat{p} \ge 15$ and $n_1(1-\hat{p}) \ge 15$
      $n_2\hat{p} \ge 15$ and $n_2(1-\hat{p}) \ge 15$
These are the numbers of successes and failures expected in each sample under H0, estimated using the pooled proportion.
Continuity correction?
There is no need for continuity correction in two sample proportions tests, as you need to add and subtract a correction term (one for $p_1$ and one for $p_2$) and they will approximately cancel.
For the school example

H0: $\pi_1 = \pi_2$
H1: $\pi_1 \ne \pi_2$
$\alpha = 0.05$
$p_1 = 40/70 \approx 0.57$ based on $n_1 = 70$
$p_2 = 45/100 = 0.45$ based on $n_2 = 100$
Under H0, the pooled proportion is:
$\hat{p} = \hat{\pi} = \dfrac{40 + 45}{70 + 100} = \dfrac{85}{170} = \dfrac{1}{2} = 0.50$
Checking for approximate normality:

$n_1\hat{p} = 35 \ge 15$ and $n_1(1-\hat{p}) = 35 \ge 15$
$n_2\hat{p} = 50 \ge 15$ and $n_2(1-\hat{p}) = 50 \ge 15$
$\Rightarrow$ CLT applies
$z_{obs} = \dfrac{\dfrac{40}{70} - \dfrac{45}{100}}{\sqrt{0.5(0.5)\left(\dfrac{1}{70} + \dfrac{1}{100}\right)}} = \dfrac{17/140}{0.07792} \approx 1.558$
p-value $\approx P(|Z| \ge 1.55) \approx 2 \times 0.061 \approx 0.121$
p-value > 0.05 $\Rightarrow$ retain H0
There is insufficient evidence, at the 5% significance level, to be able to conclude there is a difference in the pass rate between the two schools.
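The pooled two-sample z-test for the school example can be sketched as follows. The function names are hypothetical, and the normal CDF comes from `math.erf` rather than z-tables, so the p-value is computed at full precision and differs slightly from the table-based hand value.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF, P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def two_prop_ztest_pooled(x1, n1, x2, n2):
    """Two-sided test of H0: pi1 = pi2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                 # pooled estimate of pi
    # CLT check: expected successes and failures under H0 in each sample
    clt_ok = all(c >= 15 for c in (n1 * pooled, n1 * (1 - pooled),
                                   n2 * pooled, n2 * (1 - pooled)))
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return z, p_value, pooled, clt_ok

z, p_value, pooled, clt_ok = two_prop_ztest_pooled(40, 70, 45, 100)
print(f"pooled p = {pooled:.2f}, CLT check passed: {clt_ok}")
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```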
Confidence interval for $\pi_1 - \pi_2$

Here, we have no null hypothesis, so we are not assuming that $\pi_1 = \pi_2$.
When doing the hypothesis test, we averaged $p_1$ and $p_2$ to get a pooled p, and used that in our estimate of s.e.$(p_1 - p_2)$.
However, to evaluate the confidence interval, we still need an estimate of
$se(p_1 - p_2) = \sqrt{\dfrac{\pi_1(1-\pi_1)}{n_1} + \dfrac{\pi_2(1-\pi_2)}{n_2}}$
$\Rightarrow$ use $p_1$ as an estimate of $\pi_1$
and $p_2$ as an estimate of $\pi_2$
So, an approx 100(1-$\alpha$)% C.I. for $\pi_1 - \pi_2$ is:
$(p_1 - p_2) \pm z_{\alpha/2} \sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}$
Warning: We can't use the confidence interval to accurately carry out the hypothesis test H0: $\pi_1 = \pi_2$, as the standard error of the difference in the two sample proportions is evaluated in different ways:
Under H0, there was only one value of $\pi$ to estimate, and we then used the pooled p to estimate the relevant standard error:
$\widehat{se}(p_1 - p_2) = \sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$
In the CI calculations, we are not assuming $\pi_1 = \pi_2$, so the relevant standard error is estimated by:
$\widehat{se}(p_1 - p_2) = \sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}$
For the example

95% C.I. for $\pi_1 - \pi_2$ is:
$\left(\dfrac{40}{70} - \dfrac{45}{100}\right) \pm 1.96 \sqrt{\dfrac{\frac{40}{70} \cdot \frac{30}{70}}{70} + \dfrac{\frac{45}{100} \cdot \frac{55}{100}}{100}}$
$= (0.5714 - 0.45) \pm 1.96 \times 0.077289$
$= 0.1214 \pm 0.1515$
$= (-0.030,\ 0.273)$
We are 95% confident that the above interval includes the true difference between the population proportions.
Note:
the standard error used in the hypothesis test calculations was 0.07792;
the standard error used in the CI calculations was 0.07729.
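For comparison with the hand calculation, here is a minimal sketch of the unpooled interval for $\pi_1 - \pi_2$. The function name is made up for this illustration, and 1.96 is assumed as the 95% multiplier.

```python
from math import sqrt

def two_prop_ci(x1, n1, x2, n2, z=1.96):
    """Approximate CI for pi1 - pi2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    # Each sample proportion estimates its own population proportion
    se_hat = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se_hat, diff + z * se_hat, se_hat

lower, upper, se_hat = two_prop_ci(40, 70, 45, 100)
print(f"estimated s.e. = {se_hat:.5f}")
print(f"95% CI for pi1 - pi2: ({lower:.3f}, {upper:.3f})")
```

The interval contains 0, which is consistent with retaining H0 in the pooled test, although the two procedures use different standard errors.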
Using Minitab
Under Stat > Basic Stats > 2 Proportions
"Use pooled estimate of p for test" must be ticked, or the CI (unpooled p) estimate of the standard error is used in the hypothesis test.
Note the only option is to use the normal approximation. There is no exact binomial test.
Output (pooled option)

MTB > PTwo 70 40 100 45;
SUBC> Pooled.

Test and CI for Two Proportions

Sample   X    N  Sample p
1       40   70  0.571429
2       45  100  0.450000

Difference = p(1) - p(2)
Estimate for difference: 0.121429
95% CI for diff: (-0.0300545, 0.272912)
Test for difference = 0 (vs not = 0): Z = 1.56  P-Value = 0.119

Fisher's exact test: P-Value = 0.161
(Ignore this until next year.)
Output (unpooled option)

MTB > PTwo 70 40 100 45.

Test and CI for Two Proportions

Sample   X    N  Sample p
1       40   70  0.571429
2       45  100  0.450000

Difference = p(1) - p(2)
Estimate for difference: 0.121429
95% CI for diff: (-0.0300545, 0.272912)
Test for difference = 0 (vs not = 0): Z = 1.57  P-Value = 0.116

Results:
same CI
only a small difference in z (1.56 vs 1.57) and p-values (0.119 vs 0.116)
Topic 9. Appendix A
Insurance claims example:
In past years, 15% of the policy holders have made an insurance claim per year.
This year, of a random sample of 400 policies, 76 have made a claim.
Is there any evidence that the proportion has increased by a factor of more than 1.2 times?
Sample estimate of $\pi$ is p = 76/400 = 0.19
Ratio of proportions is
$\dfrac{\text{sample proportion}}{\text{past proportion}} = \dfrac{0.19}{0.15} \approx 1.267$ (bigger than 1.2)
The results of previous inference were:

H0: $\pi = 0.15$ vs H1: $\pi \ne 0.15$
H0 was rejected with p-val = 0.030
The 95% CI for $\pi$ is (0.1515, 0.2285)
We found that there is evidence that the true proportion this year is higher than 15%
BUT we have not yet answered the question about whether the proportion has increased by a factor of more than 1.2 times!
We can approach this in a few ways:

(1) Hypothesis test
H0: $\pi = 0.15 \times 1.2 = 0.18$
H1: $\pi > 0.18$
$\alpha = 0.05$
CLT check:
$n\pi_0 = 400 \times 0.18 = 72 \ge 15$
$n(1-\pi_0) = 400 \times 0.82 = 328 \ge 15$
$z_{obs} = \dfrac{p - \pi_0 - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}} = \dfrac{\frac{76}{400} - 0.18 - \frac{1}{800}}{\sqrt{0.18(0.82)/400}} = \dfrac{0.00875}{0.0192094} \approx 0.46$
p-value $= P(Z \ge 0.46) \approx 0.3228$

Insufficient evidence (at $\alpha$ = 5%) to conclude the proportion has increased by a factor of more than 1.2.
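The one-tailed calculation in (1) can be checked with a few lines of code. This sketch applies the continuity-corrected statistic for a ">" alternative. The function names are illustrative, and because the normal CDF is evaluated at the unrounded $z_{obs}$, the p-value comes out slightly different from the table lookup at 0.46.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF, P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def one_prop_ztest_greater(x, n, pi0):
    """Continuity-corrected test of H0: pi = pi0 vs H1: pi > pi0."""
    p = x / n
    se0 = sqrt(pi0 * (1 - pi0) / n)          # standard error under H0
    z = (p - pi0 - 1 / (2 * n)) / se0        # subtract 1/(2n) for an upper-tail test
    p_value = 1 - normal_cdf(z)              # P(Z >= z_obs)
    return z, p_value

pi0 = 0.15 * 1.2                             # hypothesised proportion 0.18
z, p_value = one_prop_ztest_greater(76, 400, pi0)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```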
(2) Confidence interval (one sided)

(not strictly equivalent)
CLT check (sample numbers):
$np = 76 \ge 15$
$n(1-p) = 324 \ge 15$
95% CI lower bound for $\pi$:
$p - z_{0.05} \sqrt{\dfrac{p(1-p)}{n}} = \dfrac{76}{400} - 1.645 \sqrt{\dfrac{\frac{76}{400} \cdot \frac{324}{400}}{400}} = 0.19 - 0.032267 = 0.1577$
Max value for a proportion is 1, so the upper limit cannot be $\infty$.
We are 95% confident the interval (0.1577, 1) contains the true proportion. Because 0.18 is in the interval (and is not close to the boundary), there is insufficient evidence to be able to claim that the proportion has increased by a factor of more than 1.2.
To claim the proportion has increased by a factor of more than 1.2, the CI for the new $\pi$ would have to be completely ABOVE 0.18.
(3) Confidence interval for the ratio

The claim is that the ratio of the proportions is more than 1.2,
i.e. that $\pi / 0.15 > 1.2$
The appropriate 95% one-sided CI for the new $\pi$ was found to be (0.1577, 1).
So, the appropriate 95% one-sided CI for the ratio $\pi / 0.15$ is
$\left(\dfrac{0.1577}{0.15},\ \dfrac{1}{0.15}\right) \approx (1.051,\ 6.667)$
Because 1.2 is in the interval (and is not close to the boundary), there is insufficient evidence to be able to claim that the ratio of the proportions is more than 1.2.
To claim the ratio of proportions is more than 1.2, the CI would have to be completely ABOVE 1.2.
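Summary: One Sample Proportions Test & C.I.

Hypothesis test:
H0: $\pi = \pi_0$ versus H1: $\pi \ne \pi_0$
Sample: P = X/n where $X \sim \mathrm{Bin}(n, \pi)$
CLT check: $n\pi_0 \ge 15$ and $n(1-\pi_0) \ge 15$
$z_{obs} = \dfrac{|p - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$
Confidence Interval:
CLT check: $np \ge 15$ and $n(1-p) \ge 15$
An approximate 100(1-$\alpha$)% CI for $\pi$ is
$p \pm z_{\alpha/2} \sqrt{\dfrac{p(1-p)}{n}}$
One-sided CIs for $\pi$ must have as their limit 0 or 1 (not $\pm\infty$)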
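Summary: Two Sample Proportions Test

Hypothesis test:
H0: $\pi_1 = \pi_2$ versus H1: $\pi_1 \ne \pi_2$
Sample: $P_1 = X_1/n_1$ and $P_2 = X_2/n_2$
Pooled estimate of $\pi$ is $\hat{P} = \dfrac{X_1 + X_2}{n_1 + n_2}$
CLT check:
$n_1\hat{p} \ge 15$ and $n_1(1-\hat{p}) \ge 15$
and
$n_2\hat{p} \ge 15$ and $n_2(1-\hat{p}) \ge 15$
$z_{obs} = \dfrac{p_1 - p_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$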
Summary: Two Sample Proportions C.I.

Confidence Interval:
Sample: $P_1 = X_1/n_1$ and $P_2 = X_2/n_2$
CLT check (simply uses observed counts):

$n_1 p_1 \ge 15$ and $n_1(1-p_1) \ge 15$
and
$n_2 p_2 \ge 15$ and $n_2(1-p_2) \ge 15$
An approximate 100(1-$\alpha$)% CI for $(\pi_1 - \pi_2)$ is
$(p_1 - p_2) \pm z_{\alpha/2} \sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}$
One-sided CIs for $(\pi_1 - \pi_2)$ must have as their limit -1 or +1 (not $\pm\infty$)
