10 Inferential Statistics

Inferential Statistics Part 1
Chapter 8
P. 253-278
Collecting a random sample
Goal: to understand characteristics about a
population
Examples:
What's the average commuting time for city residents?
What's the average household income of the patrons of a
particular grocery store?
What's the average leaf size of birch trees on August 1
in a particular state park?
What proportion of people in a particular tropical city have
had malaria?
Estimating the mean
One of the most common goals of statistical
inference is estimating a population mean
with a sample mean
Central Limit Theorem
When we have n independent, identically distributed
random variables ($X_1, \ldots, X_n$), the mean of those random
variables approaches a normal distribution with mean
$\mu$ and variance $\sigma^2 / n$ as n gets large.
Independence of random variables means that the
value of one observation has no effect on the value
of another observation.
Identical distribution of random variables means that
each random variable comes from the same
population (e.g., roll of a die, coin flip).
Simple random sampling
Each observation drawn does not depend on others
drawn
Thus observations are independent
Each observation (i.e., each random variable) is
identically distributed
The population has a distribution that doesn't change (each
observation is randomly drawn from an identical distribution:
the distribution of the population).
So the Central Limit Theorem applies
(when n is large)!
What does this mean?
Suppose we take a sample of n = 50
observations from a population that
has this distribution:
[Figure: frequency histogram of the population; horizontal axis ticks at 0, 10, 20, 30]
Mean ($\mu$) = 20
Variance ($\sigma^2$) = 100
Std. dev. ($\sigma$) = 10
We then find the mean of this sample (suppose this mean = 19). Take
another sample of 50 observations and find the mean (suppose it's 24).
Do this many times, and we'll come up with a distribution of means. The
Central Limit Theorem tells us this distribution will be approximately normal,
as on the next slide (as long as n is large, and 50 is large enough):
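The normal curve
[Figure: normal curve of the sample mean $\bar{x}$; horizontal axis ticks at 16, 18, 20, 22, 24]
Mean ($\mu$) = 20, sample size (n) = 50, variance of the sample mean = $\sigma^2 / n = 100 / 50 = 2$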
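The repeated-sampling idea above can be checked with a short simulation. This is only a sketch: the shape of the population on the slide is not recoverable from the text, so the code assumes a uniform population with the stated mean (20) and standard deviation (10). The function name `simulate_sample_means`, the number of replicates, and the seed are illustrative choices.

```python
import math
import random
import statistics


def simulate_sample_means(n=50, replicates=5000, mu=20.0, sigma=10.0, seed=0):
    """Draw `replicates` samples of size n and return the list of sample means.

    Assumption: the population is uniform with mean mu and std. dev. sigma
    (the histogram on the slide is not available, so any shape is a guess).
    """
    rng = random.Random(seed)
    # A uniform(a, b) variable has sd (b - a) / sqrt(12), so the half-width
    # needed for a given sigma is sigma * sqrt(3).
    half_width = sigma * math.sqrt(3)
    low, high = mu - half_width, mu + half_width
    means = []
    for _ in range(replicates):
        sample = [rng.uniform(low, high) for _ in range(n)]
        means.append(sum(sample) / n)
    return means


means = simulate_sample_means()
mean_of_means = statistics.fmean(means)
var_of_means = statistics.pvariance(means)
print(f"mean of sample means:     {mean_of_means:.2f} (theory: 20)")
print(f"variance of sample means: {var_of_means:.2f} (theory: 100 / 50 = 2)")
```

Even though the assumed population is flat rather than bell-shaped, the simulated means should cluster around 20 with a variance near 2.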
Symbols
Population Parameter: $\theta$
Estimate: $\hat{\theta}$
Expected: $E(\hat{\theta})$
Basic Types of Inference
Point Inference
The value of a population parameter is estimated using a
single value
Examples: mean, standard deviation, etc.
Interval Inference
Attaching a probability to an estimate (i.e., making a
confidence interval)
Example: we are 95% confident that $\mu$ is between 10 and 20

Judging the Quality of the Estimator
Bias: the difference between $E(\hat{\theta})$ and $\theta$
(i.e., Bias $= E(\hat{\theta}) - \theta$)
Bias may be positive or negative (e.g., a
positively biased estimator would indicate the
population parameter is higher than it actually
is)
Efficiency: how clustered the distribution of
$\hat{\theta}$ is (i.e., how peaked its distribution is)
Judging the Quality of the Estimator
Best case scenario: to have an unbiased estimator
with a high level of efficiency
We can measure the quality of the estimator using
the Mean Squared Error (MSE) or its counterpart
RMSE (the square root of the MSE)
$\text{MSE} = \text{Bias}^2 + \text{Variance}$
Remember that the variance in this case is the
variance of the estimator (here, the sample mean), so we use the
equation: $\text{Variance} = \sigma^2 / n$
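As a small worked illustration of MSE as squared bias plus variance, the sketch below compares the sample mean from the earlier slide (unbiased, variance 100 / 50) with a made-up competing estimator. The second estimator's bias and variance are hypothetical numbers chosen only to show the trade-off.

```python
import math


def mse(bias, variance):
    """Mean squared error of an estimator: bias squared plus variance."""
    return bias ** 2 + variance


# Sample mean from the earlier slide: unbiased, variance = sigma^2 / n = 100 / 50.
unbiased = mse(bias=0.0, variance=100 / 50)

# Hypothetical competitor: more tightly clustered (variance 1) but biased by +1.5.
biased = mse(bias=1.5, variance=1.0)

print(f"unbiased estimator: MSE = {unbiased}, RMSE = {math.sqrt(unbiased):.3f}")
print(f"biased estimator:   MSE = {biased}, RMSE = {math.sqrt(biased):.3f}")
```

Here the more efficient estimator still loses overall because its squared bias outweighs its smaller variance.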
Point Estimates (inferring population
parameters from samples)
Population Mean: $\hat{\mu} = \bar{x}$
Population Proportion: $\hat{\pi} = p = x / n$
Population Variance: $\hat{\sigma}^2 = s^2$
Population Standard Deviation: $\hat{\sigma} = s$

Confidence Intervals
The degree of confidence we have in our estimates, defined
by a percentage
Common examples: 90, 95, or 99% confident
The confidence level is defined using the symbol $\alpha$
In confidence intervals, alpha ($\alpha$) is the proportion of the time
your confidence interval is wrong
The typical usage is: $z_{\alpha/2}$
Why do we divide by 2?
Confidence Interval Example
What is the 95% confidence interval for a normally distributed
variable?
$\alpha$ = 1 - desired confidence level
$\alpha = 1 - 0.95 = 0.05$
Remember that we divide by 2 since we have uncertainty both
above and below the mean (i.e., 2 tails)
Therefore we use $z_{0.025}$ for the 95% confidence interval
From the z-table we find that $z_{0.025} = 1.96$
What does this mean?
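The z-table lookup can be reproduced in code. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_critical` is an illustrative choice, not something from the slides.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided critical value z_{alpha/2} for a given confidence level."""
    alpha = 1 - confidence
    # alpha / 2 sits in each tail, so look up the 1 - alpha/2 quantile
    # of the standard normal distribution.
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

In other words, 95% of the area under the standard normal curve lies within about 1.96 standard deviations of the mean, with 2.5% left in each tail.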

Interval Estimation (making confidence intervals
for population parameters estimated from samples)
Case #1: estimating an interval for $\mu$ when X
is normally distributed and we know $\sigma$
This is the simplest case because normality
allows us to use the z-table
This is also unlikely since it requires knowing
the distribution and $\sigma$ (which implies
knowing $\mu$ already)
Example #1: Create a confidence
interval for $\mu$
A town is considering building a new bridge over a
river. The primary goal is to reduce workers'
commute times from a particular community. A
random sample of workers in that community is
asked to estimate their reduction in commute time if
the bridge were built.
Our goal is to estimate the mean reduction in
commute time for the whole community if the bridge
were built. Create a 95% confidence interval for this
mean.
Example #1 Data
n = 100 workers are sampled
$\bar{x}$ = 17 minutes
$\sigma$ = 30 minutes
What is the 95% confidence interval for
the mean?
Constructing a confidence interval
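Construct a 95% confidence interval around the sample mean
$P\left(\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95$
$P\left(17 - 1.96 \cdot \frac{30}{\sqrt{100}} \le \mu \le 17 + 1.96 \cdot \frac{30}{\sqrt{100}}\right) = 0.95$
$P(17 - 1.96 \cdot 3 \le \mu \le 17 + 1.96 \cdot 3) = 0.95$
$P(17 - 5.88 \le \mu \le 17 + 5.88) = 0.95$
So we can say that the 95% C.I. is 17 +/- 5.88, or (11.12, 22.88)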
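The same calculation can be written as a small function. This is a sketch for the known-sigma case only; the name `z_interval` is illustrative, and the critical value comes from the standard library rather than a printed z-table.

```python
import math
from statistics import NormalDist


def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)  # z times the standard error
    return x_bar - margin, x_bar + margin


# Example #1: n = 100 workers, sample mean 17 minutes, sigma = 30 minutes.
low, high = z_interval(x_bar=17, sigma=30, n=100)
print(f"95% C.I.: ({low:.2f}, {high:.2f})")
```

Changing `confidence` or `n` in the call shows how the width of the interval responds to each.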
Example #1 Questions
What would happen to our interval if we
used a 99% confidence interval
instead?
What would happen to our confidence
interval if we sampled 200 people
instead of 100 people?
Case #2: estimating an interval for $\mu$ when X
is not normally distributed and we know $\sigma$
In this case n matters a lot. Why?
This is also unlikely since it requires knowing
the distribution and $\sigma$ (which implies
knowing $\mu$ already)
Case #3: estimating an interval for $\mu$ when
$\sigma$ and the distribution are unknown
What should we use instead of $\sigma$?
Can we use the z-table in this case?
This case is what we see most commonly

t-distribution vs. z-distribution
When we only have s (and not $\sigma$) we use the t-distribution
rather than the z-distribution
To do so we use the t-table
How are they different?
The t-distribution changes depending on the degrees of
freedom (n - 1)
This is reflected in the table and in the symbol $t_{\alpha/2,\, n-1}$
The t-distribution accounts for more uncertainty (i.e., wider
confidence intervals) since s is just an estimate for $\sigma$
t-distribution vs. z-distribution
As n approaches infinity, t and z become equal
This means that even when we have s instead of $\sigma$ we can use the
z-distribution if n is large
Central Limit Theorem: "as n gets large."
What is large?
Rule of thumb: 30
For n less than 30, the standardized sample mean $(\bar{x} - \mu) / (s / \sqrt{n})$ does not follow the normal
distribution accurately enough.
But it does closely follow a t-distribution for sample
sizes of less than 30 (when the population is roughly normal).
For this class, use the t-distribution any time you have s instead of $\sigma$
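One way to see why the t-table matters for small samples is to simulate how often each interval actually contains the true mean. The sketch below assumes a normal population and n = 16, and compares the z-table value 1.96 with the t-table value 2.131 (15 degrees of freedom). The function name, number of replicates, and seed are illustrative choices.

```python
import math
import random


def coverage(n=16, t_crit=2.131, replicates=10000, mu=0.0, sigma=1.0, seed=1):
    """Fraction of intervals x_bar +/- c * s / sqrt(n) that contain mu,
    for c = 1.96 (z-table) and c = t_crit (t-table value for n - 1 d.f.).

    Assumption: the population is normal; t_crit must match n - 1.
    """
    rng = random.Random(seed)
    hits_z = hits_t = 0
    for _ in range(replicates):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
        standard_error = s / math.sqrt(n)
        if abs(x_bar - mu) <= 1.96 * standard_error:
            hits_z += 1
        if abs(x_bar - mu) <= t_crit * standard_error:
            hits_t += 1
    return hits_z / replicates, hits_t / replicates


cover_z, cover_t = coverage()
print(f"coverage using z = 1.96:  {cover_z:.3f}")
print(f"coverage using t = 2.131: {cover_t:.3f}")
```

With s in place of sigma, the z-based interval should capture the mean noticeably less often than the nominal 95%, while the t-based interval should land close to 95%.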
Example #2
n = 16
$\bar{x}$ = 30
$s^2$ = 1600
What is the 95% C.I. for the mean?
Example #2
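s = $\sqrt{1600}$ = 40
Degrees of freedom = n - 1 = 15
$t_{\alpha/2,\, n-1} = t_{0.05/2,\, 16-1} = t_{0.025,\, 15} = 2.131$ (from the t-table)
$P\left(\bar{X} - 2.131 \frac{s}{\sqrt{n}} \le \mu \le \bar{X} + 2.131 \frac{s}{\sqrt{n}}\right) = 0.95$
$P\left(30 - 2.131 \cdot \frac{40}{\sqrt{16}} \le \mu \le 30 + 2.131 \cdot \frac{40}{\sqrt{16}}\right) = 0.95$
$P(30 - 2.131 \cdot 10 \le \mu \le 30 + 2.131 \cdot 10) = 0.95$
$P(30 - 21.31 \le \mu \le 30 + 21.31) = 0.95$
The 95% confidence interval for the mean is (8.69, 51.31)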
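The t-based interval follows the same pattern as the z-based one, with s in place of sigma. The Python standard library has no t-distribution, so this sketch takes the critical value from the t-table as an argument; the name `t_interval` is illustrative.

```python
import math


def t_interval(x_bar, s, n, t_crit):
    """Confidence interval for the mean when only the sample sd s is known.

    t_crit is the t-table value t_{alpha/2, n-1}, supplied by the caller.
    """
    margin = t_crit * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin


# Example #2: n = 16, sample mean 30, sample variance 1600, t_{0.025,15} = 2.131.
low, high = t_interval(x_bar=30, s=math.sqrt(1600), n=16, t_crit=2.131)
print(f"95% C.I.: ({low:.2f}, {high:.2f})")
```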

Interval Estimation (making confidence intervals for
population parameters estimated from samples)
Case #4: estimating an interval for a proportion $\pi$
based on a sample proportion p
Remember that p = x/n
In other words, p = the number of successes divided by
the sample size
For example: the proportion of people over 6 ft tall
In this case we don't need s or $\sigma$, but we do need
the standard deviation of p: $\sigma_p = \sqrt{\pi(1 - \pi) / n}$
Which we estimate as: $s_p = \sqrt{p(1 - p) / n}$
Interval Estimation (making confidence intervals for
population parameters estimated from samples)
Case #4 continued
Equation: $p - z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}} \le \pi \le p + z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}$
We use the z-distribution for estimating an interval for a
proportion based on a sample proportion p
This also limits us to using only large samples (in this case n > 100)
For smaller samples, we calculate the entire distribution using
the binomial mass function: $P(x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}$ (i.e., solve
for all x values)
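For the small-sample case, "solving for all x values" means evaluating the binomial mass function at every possible count. The sketch below does that for a hypothetical sample of n = 10 with an assumed proportion of 0.4; both numbers are made up for illustration.

```python
import math


def binomial_pmf(x, n, pi):
    """P(x successes in n trials) when each trial succeeds with probability pi."""
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)


# Hypothetical small sample: n = 10 trials, assumed population proportion 0.4.
n, pi = 10, 0.4
pmf = [binomial_pmf(x, n, pi) for x in range(n + 1)]
for x, prob in enumerate(pmf):
    print(f"P(x = {x:2d}) = {prob:.4f}")
```

The probabilities across all x from 0 to n sum to 1, which is a quick sanity check on the calculation.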
Example #3
n = 150 people at a convention are sampled
63 of the people sampled were over 6 feet tall
What is the 99% C.I. for the true
proportion of all people over 6 ft tall at the
convention?
Example #3
p = 63/150 = 0.42
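99% C.I. -> $z_{\alpha/2} = z_{0.005} = 2.58$ (from the z-table)
$p - z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}} \le \pi \le p + z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}$
$0.42 - 2.58 \sqrt{\frac{0.42 \cdot 0.58}{150}} \le \pi \le 0.42 + 2.58 \sqrt{\frac{0.42 \cdot 0.58}{150}}$
$0.42 - 2.58 \cdot 0.0403 \le \pi \le 0.42 + 2.58 \cdot 0.0403$
$0.42 - 0.104 \le \pi \le 0.42 + 0.104$
The 99% confidence interval for $\pi$, given p = 0.42, is (0.316, 0.524)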
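The proportion interval can be computed the same way. This sketch uses the large-sample z approximation from the slides; the name `proportion_interval` is illustrative, and the critical value is computed rather than read from the z-table, so it is 2.576 instead of the rounded 2.58.

```python
import math
from statistics import NormalDist


def proportion_interval(successes, n, confidence=0.99):
    """Large-sample confidence interval for a population proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p * (1 - p) / n)  # z times the estimated sd of p
    return p - margin, p + margin


# Example #3: 63 of 150 sampled people were over 6 feet tall.
low, high = proportion_interval(successes=63, n=150)
print(f"99% C.I.: ({low:.3f}, {high:.3f})")
```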
Sample Size Determination
Often, before we collect a sample, we want to know
how large a sample we need
Required sample sizes can be determined for
population parameters (mean, proportions, etc.) by
modifying the equations we've been going through
An additional component is the error (E)
This is basically the term that defines how far off we are
willing to be (i.e., the margin of acceptable error)
Strictly speaking, E is one-half the difference between the
upper and lower values of the interval for a given confidence level
Note that E is not the same as the C.I.
Sample Size Determination
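Equation for $\mu$: $n = \left(\frac{z_{\alpha/2}\, \sigma}{E}\right)^2$
Equation for $\pi$: $n = \left(\frac{z_{\alpha/2} \sqrt{p(1-p)}}{E}\right)^2 = \frac{z_{\alpha/2}^2\, p(1-p)}{E^2}$
What obvious flaw do you see?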

Example #4
A movie theatre wants to know the mean
number of tickets sold per day. How many
days must they count to know the mean daily
ticket sales within 100 tickets with 95%
confidence?
From previous sales reports, it is determined
that $\sigma$ = 175
Example #4
What numbers do we plug into our equation, $n = \left(\frac{z_{\alpha/2}\, \sigma}{E}\right)^2$?
What should $z_{\alpha/2}$ be?
What should E be?
Why don't we multiply this by 2?
What should $\sigma$ be?

Example #4
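z = 1.96
E = 100
$\sigma$ = 175
n = number of days we should sample
$n = \left(\frac{z_{\alpha/2}\, \sigma}{E}\right)^2 = \left(\frac{1.96 \cdot 175}{100}\right)^2 = 3.43^2 \approx 11.765$
Rounding up, we should count ticket sales on 12 days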
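The sample-size formula for a mean is easy to wrap in a function. The sketch below returns both the raw value and the value rounded up to a whole number, since a fractional day cannot be sampled; the name `sample_size_for_mean` is illustrative.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, error, confidence=0.95):
    """Sample size so that the margin of error for the mean is at most `error`.

    Returns (raw value, value rounded up to a whole number of observations).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    raw = (z * sigma / error) ** 2
    return raw, math.ceil(raw)


# Example #4: sigma = 175 tickets, acceptable error E = 100 tickets.
raw_n, days = sample_size_for_mean(sigma=175, error=100)
print(f"raw n = {raw_n:.3f}, so count {days} days")
```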
Example #5
A city council election is being held with several
candidates expecting reasonably large returns.
To avoid a run-off between the top 2 vote-getters,
the leading candidate must receive at least 45% of
the vote
How many people do we need to sample using exit
polls to determine with 99% confidence and an
acceptable error of 0.005 whether there will be a
run-off vote?
Example #5
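z = 2.58
E = 0.005
p = 0.45
n = number of people we should sample
$n = \left(\frac{z_{\alpha/2} \sqrt{p(1-p)}}{E}\right)^2 = \frac{2.58^2 \cdot 0.45 \cdot 0.55}{0.005^2} = \frac{6.6564 \cdot 0.2475}{0.000025} \approx 65{,}898.4$
Rounding up, we should sample 65,899 people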
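The proportion version of the sample-size formula, $n = z_{\alpha/2}^2\, p(1-p) / E^2$, can be evaluated directly. This is a sketch with an illustrative function name; it computes the critical value (about 2.576) instead of using the rounded table value 2.58, so the result is slightly smaller than a hand calculation with 2.58. With these inputs the formula gives a value of roughly 66,000.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p, error, confidence=0.99):
    """Sample size so that the margin of error for a proportion is at most `error`.

    Returns (raw value, value rounded up to a whole number of people).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    raw = z ** 2 * p * (1 - p) / error ** 2
    return raw, math.ceil(raw)


# Example #5: planning value p = 0.45, acceptable error E = 0.005, 99% confidence.
raw_n, people = sample_size_for_proportion(p=0.45, error=0.005)
print(f"raw n = {raw_n:.1f}, so sample {people} people")
```

Note that the formula needs a value for p before any data are collected; p = 0.5 gives the largest, most conservative n.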
Class Problem
Given this sample of middle school kids'
heights (in inches):
56, 64, 52, 69, 66, 64, 63, 46, 46, 49, 47,
60, 54, 45, 45, 69, 62, 67, 49, 43, 59
What is the 99% confidence interval for
the population mean ($\mu$)?
Solution
n = 21
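$\bar{x}$ = 1175/21 = 55.95
s = 8.96
$t_{\alpha/2,\, n-1} = t_{0.005,\, 20} = 2.845$
$P\left(\bar{X} - 2.845 \frac{s}{\sqrt{n}} \le \mu \le \bar{X} + 2.845 \frac{s}{\sqrt{n}}\right) = 0.99$
$P\left(55.95 - 2.845 \cdot \frac{8.96}{\sqrt{21}} \le \mu \le 55.95 + 2.845 \cdot \frac{8.96}{\sqrt{21}}\right) = 0.99$
$P(55.95 - 5.563 \le \mu \le 55.95 + 5.563) = 0.99$
So the 99% C.I. for the population mean ($\mu$) is [50.387, 61.513]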
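The class problem can be checked end to end from the raw heights. In this sketch the t critical value 2.845 is taken from the t-table, as in the solution, because the Python standard library does not provide a t-distribution. Small differences in the last decimal place come from rounding the intermediate values by hand.

```python
import math
import statistics

# Middle school heights (inches) from the class problem.
heights = [56, 64, 52, 69, 66, 64, 63, 46, 46, 49, 47,
           60, 54, 45, 45, 69, 62, 67, 49, 43, 59]

n = len(heights)
x_bar = statistics.mean(heights)
s = statistics.stdev(heights)  # sample standard deviation (divides by n - 1)
t_crit = 2.845                 # t_{0.005, 20} from the t-table

margin = t_crit * s / math.sqrt(n)
low, high = x_bar - margin, x_bar + margin

print(f"n = {n}, mean = {x_bar:.2f}, s = {s:.2f}")
print(f"99% C.I.: [{low:.3f}, {high:.3f}]")
```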

For Friday
Come with questions about homework #6
For Monday
Read chapter 9: pages 280-306
