Estimation: June 12, 2012 Rebecca Slack

Estimation
Chapter 6
June 12, 2012
Rebecca Slack
Relationship between population and sample
Random Number Tables
Randomized Clinical Trials
Estimation of the Mean of a Distribution
Estimation of the Variance of a Distribution

There is a lot of material here. What we don't
finish today, we will pick up on Thursday. Focus
on the variance will be moved to Thursday's
lecture.
A parameter that is part
of a model for a
population is called a
population parameter.
We use data to estimate
population parameters.
Any summary found
from the data is a
statistic.
The statistics that
estimate population
parameters are called
sample statistics.
Population and Sample
A population is the group we want to study. We
often call the population of interest the reference
population, the target population or the study
population.
Parameters are used to describe a population:
$\mu$ (mean)
$\sigma$ (standard deviation)
$p$ (proportion)
In reality, we never know the values of parameters.
We try to estimate the values of parameters, or we
assume the parameters have specific values and
see if we can find data to support that assumption.

A sample is a subset of the population. A
sample should be representative of the
population.
There are many ways to select a sample:
Convenience
Systematic
Random.
Statistics are used to describe a sample:
$\bar{X}$ (mean)
$S$ (standard deviation)
$\hat{p}$ (proportion)

A random sample is a subset of a
population such that each member of
the population is chosen
independently of other members, and
each member of the population has a
known non-zero selection probability.
A simple random sample is a
random sample in which each
subject in the population has the
same selection probability.

Statistical inference usually involves
inductive reasoning rather than deductive
reasoning.
Estimation is concerned with estimating the
values of specific population parameters.
Point estimate: a specific value used to estimate a
parameter
Interval estimate: a range of values used to
estimate a parameter
Hypothesis testing is concerned with testing
whether the value of a population parameter
is equal to some specific value.
End of Population and Sample
Random Number Tables
It's surprisingly difficult to generate random
values even when they're equally likely.
Computers have become a popular way to
generate random numbers. Even though they
often do much better than humans,
computers can't generate truly random
numbers either.
Since computers follow programs, the
'random' numbers we get from computers are
really pseudorandom.
22177263043874100925370862705819976227258497959070328250011089
There are ways to generate random numbers so that
they are both equally likely and truly random.
The best ways we know to generate data that give a fair
and accurate picture of the world rely on randomness,
and the ways in which we draw conclusions from those
data depend on the randomness, too.
What is a simulation? A simulation is an experiment run
as a model of reality.
A simulation consists of a collection of things that
happen at random. There is a situation that is repeated.
These situations are called the components of the
simulation. Each component has a set of possible
outcomes.
Often, simulations are computer-based programs used
to manipulate the elements of a strategy mix rather
than test them in a real setting. We will do
non-computer-based simulations to get the concept.
1) Identify the component to be repeated.
2) Explain how you will model the outcome.
3) Explain how you will simulate the trial. (A
trial is the sequence of events that we are
pretending will take place.)
4) State clearly what the response variable is.
5) Run several trials.
6) Analyze the response variable.
7) State your conclusion (in the context of the
problem).
Suppose a couple will continue having children until
they have at least one boy and at least one girl.
What would the average family size be? Assume
boys and girls are equally likely.
1) Our component is the birth of a child.
2) We will model this with something that can
generate two outcomes randomly with 50%
chance each. Let's try a coin with H=F and T=M
3) Flip the coin until we have at least one head and
one tail.
4) The response variable is the number of coin flips
it took to accomplish #3.
5) OK, let's do it. After you flip, we will analyze
the results (6) and state our conclusion (7).
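
The coin-flip procedure above can also be run on a computer. The following is a minimal sketch, assuming Python's pseudorandom generator stands in for the coin; the function name `family_size`, the seed, and the number of trials are illustrative choices, not part of the lecture.

```python
import random
import statistics


def family_size(rng):
    """Flip a fair coin until both faces have appeared; return the number of flips."""
    seen = set()
    flips = 0
    while len(seen) < 2:
        seen.add(rng.choice("HT"))  # H = girl, T = boy, as on the slide
        flips += 1
    return flips


rng = random.Random(2012)  # fixed seed (arbitrary) so the run is repeatable
sizes = [family_size(rng) for _ in range(10_000)]  # step 5: run many trials
mean_size = statistics.mean(sizes)                 # step 6: analyze the response
print(f"average simulated family size: {mean_size:.2f}")
```

The simulated average should land near 3: after the first child, the wait for a child of the other sex averages 2 births when each birth is a 50/50 event. Future runs will not match any one simulated value exactly.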

A random sampling of 300 random digits from A Million Random
Digits with 100,000 Normal Deviates, copied from Wikipedia.
Suppose a couple will continue having children
until they have at least two boys and at least two
girls. What would the average family size be?
Assume boys and girls are equally likely.
1) Our component is the birth of a child.
2) We will model this with random digits. What are
some ways we can use random digits to
randomly model two outcomes with 50% chance
each?
3) Pick your starting location and sample the
digits until you have at least two boys and two
girls.
4) The response variable is the number of digits it
took to accomplish #3.
5) OK, let's do it. After you run it, we will
analyze the results (6) and state our conclusion (7).
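
One way to answer step 2 is to call an even digit a girl and an odd digit a boy (using 0-4 versus 5-9 would work equally well). The sketch below assumes that coding and replaces the printed table with pseudorandom digits; `digit_stream` and `family_size_from_digits` are hypothetical names introduced for illustration.

```python
import random
import statistics


def digit_stream(rng):
    """Endless stream of digits 0-9, standing in for a random number table."""
    while True:
        yield rng.randrange(10)


def family_size_from_digits(digits):
    """Read digits until at least two even (girls) and two odd (boys) have appeared."""
    girls = boys = count = 0
    for d in digits:
        count += 1
        if d % 2 == 0:
            girls += 1
        else:
            boys += 1
        if girls >= 2 and boys >= 2:
            return count
    raise ValueError("ran out of digits before the family was complete")


rng = random.Random(2012)  # arbitrary seed, for repeatability only
stream = digit_stream(rng)
# Each trial picks up in the stream where the previous one stopped.
sizes = [family_size_from_digits(stream) for _ in range(10_000)]
mean_size = statistics.mean(sizes)
print(f"average simulated family size: {mean_size:.2f}")
```

Reading successive trials from one continuing stream mirrors picking a starting location in the table and moving forward from there.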
Suppose 85% of the students in this class are in a MS
program and the rest are in a PhD program. If I
randomly assign students to work together, how likely
will I be to get a pair where both students are in the
PhD program if there are 40 students in the class?
1) Our component is a student.
2) We will model this with pairs of random digits. What
are some ways we can use random digits to
randomly model two outcomes with 85% and 15%
chances?
3) Sample the digits until you have 20 pairs of
students.
4) The response variable is the number of PhD pairs.
5) OK, let's do it. After you run it, we will analyze
the results (6) and state our conclusion (7).
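
A possible coding for step 2 is to read two-digit numbers and treat 00-84 as an MS student and 85-99 as a PhD student. The sketch below assumes that coding and, like the digit model on the slide, treats each student as an independent draw; a class with a fixed count of PhD students would call for sampling without replacement instead. The function name and trial count are illustrative.

```python
import random


def phd_pairs_in_class(rng, n_pairs=20):
    """Count pairs in which both two-digit draws fall in 85-99 (PhD)."""
    pairs = 0
    for _ in range(n_pairs):
        first = rng.randrange(100)   # two random digits, 00-99
        second = rng.randrange(100)
        if first >= 85 and second >= 85:
            pairs += 1
    return pairs


rng = random.Random(2012)  # arbitrary seed, for repeatability only
counts = [phd_pairs_in_class(rng) for _ in range(10_000)]
# Share of simulated classes that contain at least one PhD-PhD pair.
share_with_pair = sum(c >= 1 for c in counts) / len(counts)
print(f"share of classes with a PhD-PhD pair: {share_with_pair:.3f}")
```

Under this independence assumption the exact value is $1 - (1 - 0.15^2)^{20}$, so the simulated share can be compared against it.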
Use the correct model for your
simulation.
Don't overstate your case. Always be
sure to indicate that future results will
not match your simulated results
exactly.
Model the outcome chances
accurately.
Run enough trials.
End of Random Number Tables
Definition: A clinical trial is a prospective
study with human subjects designed to
compare the effect of one or more
interventions against a control.
Clinical trials are often considered the gold
standard method for assessing the
effectiveness of an intervention.
Clinical Trials:
are prospective
include an intervention
include a control
include human subjects

Randomized Clinical Trials
Assumptions
Subjects in each treatment group are
selected from the same population.
Many statistical tests assume that
subjects are randomized to treatments.
Features
Randomization
Masking/Blinding
Stratification

Randomization is a process of assigning subjects to treatments
in a randomized clinical trial
It helps ensure that the subjects in each treatment
group are similar with respect to both known and
unknown demographic and clinical characteristics
It helps control for selection bias
It is a rule for assigning subjects to treatment
groups in a way that produces a known probability
distribution for observed treatment differences in
the absence of a true treatment effect.
Each patient has a known chance of receiving
each treatment in the study.

Fixed Allocation:
Simple (flip a coin, random number table, computer
program)
Blocked (force equal group sizes after every fixed
number of patients; e.g., ABAB, AABB, ABBA,
BABA, etc.)
Stratified (equal treatment assignment within
prognostic groups)
Adaptive Allocation:
minimization (balance groups)
play the winner
NOT Randomization
Assigning subjects according to birth date or day of the
week.
Assigning subjects alternately, e.g. ABABABABA
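
To make the difference between simple and blocked allocation concrete, here is a minimal sketch for two arms, A and B. It assumes permuted blocks of size 4; the function names and the block size are illustrative choices, and real trials typically use validated randomization software.

```python
import random


def simple_allocation(n, rng):
    """Simple randomization: an independent fair coin flip for every subject."""
    return [rng.choice("AB") for _ in range(n)]


def blocked_allocation(n, rng, block_size=4):
    """Permuted blocks: each block holds equal numbers of A and B in random order."""
    if block_size % 2:
        raise ValueError("block_size must be even for two arms")
    assignments = []
    while len(assignments) < n:
        block = list("A" * (block_size // 2) + "B" * (block_size // 2))
        rng.shuffle(block)           # random order within the block
        assignments.extend(block)
    return assignments[:n]


rng = random.Random(2012)  # arbitrary seed, for repeatability only
simple = simple_allocation(20, rng)
blocked = blocked_allocation(20, rng)
print("simple :", "".join(simple), "A count =", simple.count("A"))
print("blocked:", "".join(blocked), "A count =", blocked.count("A"))
```

Simple allocation can leave the groups unequal by chance, while the blocked sequence is balanced after every fourth subject. Unlike strict alternation (ABAB...), the order within each block is not predictable in advance.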

Blinding helps eliminate information bias.
An ideal clinical trial design will include
randomization and blinding to help avoid
biased comparisons between treatment
groups.
Types of Blinding
Open label: treatment assignment is known by
subject and evaluator
Single blind: subject does not know treatment
assignment (or evaluator*)
Double blind: neither subject nor evaluator
knows treatment assignment

The population of interest needs to be well-
defined so an appropriate sample that
reflects that population can be selected.
An efficient clinical trial achieves clear-cut
answers with as small a sample as
necessary.
Warning: Selecting subjects that are likely to
respond well to treatment can improve
efficiency but definitely restricts
generalizability.

End of Randomized Clinical Trials

5 minute stretch break!!

When we return, we will switch to Mean
Estimation

Remember:
point estimate: a specific value used to estimate
a parameter
interval estimate: a range of values used to
estimate a parameter
Estimating the Mean



A point estimate for the mean, $\mu$, of a
distribution (or population) is the sample mean:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

An estimate for the variance, $\sigma^2$, of a distribution
(or population) is the sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
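
As a quick check on the two estimates, the sketch below computes the sample mean and the sample variance (with the $n-1$ divisor) by hand and compares them with Python's `statistics` module. The five data values are hypothetical and chosen only to keep the arithmetic easy.

```python
import statistics

# Hypothetical sample of n = 5 measurements (illustrative values only).
x = [98, 105, 112, 120, 125]
n = len(x)

x_bar = sum(x) / n                                   # sample mean
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)    # sample variance, n - 1 divisor

print(x_bar, s2)
# The standard library uses the same definitions.
print(statistics.mean(x), statistics.variance(x))
```

Here the deviations from the mean are -14, -7, 0, 8 and 13, so the sum of squares is 478 and the sample variance is 478 / 4 = 119.5.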
The sampling distribution of $\bar{X}$ is the
distribution of values of $\bar{x}$ over all possible
samples with n subjects that could have been
selected from the population.
This requires using your imagination and thinking
hypothetically.

Notation: Let $X_1, \ldots, X_n$ be a random sample
selected from some population with mean $\mu$.
Then

$$E(\bar{X}) = \mu$$
Definition: $\hat{\theta}$, an estimator for $\theta$, is said to be
unbiased if $E(\hat{\theta}) = \theta$.
Since $E(\bar{X}) = \mu$, we see that $\bar{X}$, the sample
mean, is an unbiased estimator of $\mu$, the
population mean.
When the underlying distribution is symmetric (for
example, normal), the sample median is also an unbiased
estimator of the population mean, $\mu$.
So, why do we use $\bar{X}$ as our estimator of the
population mean, instead of the median?
If the underlying distribution of the population is
normal, then $\bar{X}$ is the unbiased estimator of $\mu$
with the smallest variance.
Why is it preferable to estimate a parameter
from larger samples rather than from smaller
samples?

Consider each element in the sample as a
piece of information. It is intuitive that the more
information we have about a parameter, the
better we are able to estimate that parameter.

That is, we can be more precise (smaller
variance) when we have more data.
Let $X_1, \ldots, X_n$ be a random sample from a
population with mean $\mu$ and variance $\sigma^2$.
The set of sample means in repeated random
samples of size n from this population has
variance $\sigma^2/n$.
The standard deviation of this set of sample
means is then $\sqrt{\sigma^2/n} = \sigma/\sqrt{n}$ and is referred to
as the standard error of the mean (SEM or
SE).
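
The claim that sample means have standard deviation $\sigma/\sqrt{n}$ can be checked by drawing many samples and looking at the spread of their means. This is a minimal sketch with hypothetical values ($\mu = 50$, $\sigma = 12$, $n = 16$, so $\sigma/\sqrt{n} = 3$) and a normal population chosen only for convenience.

```python
import math
import random
import statistics

MU, SIGMA, N = 50.0, 12.0, 16      # hypothetical population and sample size
rng = random.Random(2012)          # arbitrary seed, for repeatability only

# Draw 20,000 samples of size N and record each sample mean.
sample_means = [
    statistics.mean([rng.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(20_000)
]

theoretical_se = SIGMA / math.sqrt(N)            # sigma / sqrt(n) = 3.0
observed_sd = statistics.stdev(sample_means)     # spread of the simulated means
print(f"theoretical SE = {theoretical_se:.2f}, observed SD of means = {observed_sd:.2f}")
```

The observed standard deviation of the means should sit close to 3, well below the population standard deviation of 12.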
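Central Limit Theorem (CLT): Let $X_1, \ldots, X_n$ be a random sample from a
population with mean $\mu$ and variance $\sigma^2$. Then
for large n, approximately,

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

regardless of the underlying distribution of
$X_1, \ldots, X_n$!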
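
To see the "regardless of the underlying distribution" part in action, the sketch below draws from a strongly skewed population (exponential with mean 1 and standard deviation 1, an illustrative choice) and checks how often the sample mean falls within $1.96\,\sigma/\sqrt{n}$ of $\mu$. If the normal approximation is reasonable, that share should be close to 95%.

```python
import math
import random
import statistics

MU, SIGMA, N = 1.0, 1.0, 50        # exponential(1) population: mean 1, SD 1
rng = random.Random(2012)          # arbitrary seed, for repeatability only

half_width = 1.96 * SIGMA / math.sqrt(N)
reps = 5_000
inside = 0
for _ in range(reps):
    x_bar = statistics.mean([rng.expovariate(1.0) for _ in range(N)])
    if abs(x_bar - MU) < half_width:
        inside += 1

share_inside = inside / reps
print(f"share of sample means within 1.96 SE of mu: {share_inside:.3f}")
```

Rerunning with a much smaller N, for example 5, is a simple way to see the approximation degrade for skewed data.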

Suppose we know that the mean birth weight
for 1000 infants from the Boston City Hospital
is 112 ounces with a standard deviation of
20.6 ounces. Let these 1000 infants be our
population of interest.
What is the probability that the mean birth
weight of a sample of 10 infants will be
between 98 and 126 ounces?
We know from the CLT that

$$\bar{X} \sim N\left(112, \frac{20.6^2}{10}\right)$$

Since $\bar{X} \sim N\left(112, \frac{20.6^2}{10}\right)$ and we want $\Pr(98 < \bar{X} < 126)$,
we first find

$$Z_L = \frac{98 - 112}{20.6/\sqrt{10}} = \frac{-14}{6.51} = -2.15$$

$$Z_R = \frac{126 - 112}{20.6/\sqrt{10}} = \frac{+14}{6.51} = +2.15$$

Since $\bar{X} \sim N\left(112, \frac{20.6^2}{10}\right)$, we find

$$
\begin{aligned}
\Pr(98 < \bar{X} < 126) &= \Pr(\bar{X} < 126) - \Pr(\bar{X} < 98) \\
&= \Pr(Z < +2.15) - \Pr(Z < -2.15) \\
&= \Phi(2.15) - \Phi(-2.15) \\
&= \Phi(2.15) - [1 - \Phi(2.15)] \\
&= 0.9842 - [1 - 0.9842] \\
&= 0.9842 - 0.0158 \\
&= 0.9684
\end{aligned}
$$

[Figure: Example: Birth Weights. Standard normal density, Z ~ N(0,1), with the area 0.9684 marked between Z = -2.15 and Z = 2.15.]
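
The birth weight calculation can be reproduced without a normal table. This sketch uses `statistics.NormalDist` from the Python standard library for $\Phi$ and rounds the z-scores to two decimals, as a table lookup would; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 112.0, 20.6, 10
se = sigma / math.sqrt(n)                 # standard error of the mean, about 6.51

z_left = round((98 - mu) / se, 2)         # -2.15
z_right = round((126 - mu) / se, 2)       # +2.15

std_normal = NormalDist()                 # Z ~ N(0, 1)
prob = std_normal.cdf(z_right) - std_normal.cdf(z_left)
print(f"Pr(98 < X-bar < 126) = {prob:.4f}")
```

The result agrees with the table-based value of 0.9684 to four decimal places.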
Let $X_1, \ldots, X_n$ be a random sample from a
normal population with mean $\mu$ and variance
$\sigma^2$.
That is,

$$X \sim N(\mu, \sigma^2)$$

Then

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

has a t distribution
with (n-1) degrees of
freedom (df) (Table 5,
p831).
William S. Gosset
1876-1937

[Figure: Student's t Distribution. Densities of the t distribution with 5 df, then with 2, 8, and 32 df, and finally 2, 8, and 32 df compared with N(0,1); horizontal axis: t, from -6 to 6.]

When the sample size is more than 30
(i.e., df ≥ 30) the standard normal
distribution is a good approximation to the
t distribution.

$$t_{(df \ge 30)} \approx Z \sim N(0, 1)$$
Recall that the sample mean ($\bar{X}$) is a point
estimate of the population mean ($\mu$).
A confidence interval is an interval estimate.
That is, a confidence interval is a range of
values that we use to estimate some
parameter. We usually construct what we call
95% confidence intervals.
In general, we construct $100\% \times (1-\alpha)$
confidence intervals. So, if we want a 95%
CI, then $\alpha$ = 0.05. If we want a 90% CI, then
$\alpha$ = 0.10.
[Figure: Confidence Interval. 95% confidence intervals shown against the true mean, $\mu$; axis: Age, 40 to 56.]

Yes: Of the collection of all 95% confidence
intervals that could be constructed from
repeated random samples of size n, 95% of
them will contain the parameter $\mu$.

Be careful: The probability that the parameter,
$\mu$, is contained in a particular confidence
interval is either 0 or 1, depending on the true
(unknown) value of the parameter, $\mu$.

A $100\% \times (1-\alpha)$ confidence interval (CI) for
the mean $\mu$ of a normal distribution with
an unknown variance is given by:

$$\left(\bar{x} - t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \ \bar{x} + t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}}\right)$$

Here $s/\sqrt{n}$ is the standard error of the mean, and
$t_{n-1,\,1-\alpha/2}$ is from Table 5, p 831.

[Figure: t density for sample size n = 10 (df = 9), horizontal axis from -6 to 6. The critical value is $t_{n-1,\,1-\alpha/2} = t_{9,\,1-0.05/2} = t_{9,\,0.975} = 2.262$, with area 0.975 to its left and 0.025 to its right.]
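
The repeated-sampling interpretation of a 95% confidence interval can be checked by simulation. The sketch below assumes a hypothetical normal population ($\mu = 112$, $\sigma = 20.6$), samples of size 10, and the critical value $t_{9,\,0.975} = 2.262$ quoted from Table 5; it counts how many of the simulated intervals contain $\mu$.

```python
import math
import random
import statistics

T_CRIT = 2.262                      # t_{9, 0.975}, the table value quoted above
MU, SIGMA, N = 112.0, 20.6, 10      # hypothetical normal population; n = 10 gives df = 9


def t_interval(sample):
    """Return (lower, upper) for x-bar +/- t * s / sqrt(n)."""
    x_bar = statistics.mean(sample)
    half = T_CRIT * statistics.stdev(sample) / math.sqrt(len(sample))
    return x_bar - half, x_bar + half


rng = random.Random(2012)           # arbitrary seed, for repeatability only
reps = 10_000
hits = 0
for _ in range(reps):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    low, high = t_interval(sample)
    hits += low < MU < high         # each single interval either contains mu or it does not

coverage = hits / reps
print(f"share of intervals containing mu: {coverage:.3f}")
```

Each individual interval either covers $\mu$ or misses it; the 95% describes the long-run share across repeated samples.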
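An approximate $100\% \times (1-\alpha)$ confidence
interval (CI) for the mean $\mu$ of a normal
distribution with an unknown variance is
given by (n > 30):

$$\left(\bar{x} - z_{1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \ \bar{x} + z_{1-\alpha/2}\,\frac{s}{\sqrt{n}}\right)$$

Here $z_{1-\alpha/2}$ is from Table 3, p 825.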

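The length of a confidence interval is:

$$2\, t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}} \quad \text{or} \quad 2\, z_{1-\alpha/2}\,\frac{s}{\sqrt{n}}$$

Half of this length, $t_{n-1,\,1-\alpha/2}\, s/\sqrt{n}$ (or $z_{1-\alpha/2}\, s/\sqrt{n}$), is the margin of error.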
The length of a confidence interval
is affected by n, s, and $\alpha$.

It decreases as n increases.

It increases as s increases.

It decreases as $\alpha$ increases.
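
The three effects on interval length can be seen by evaluating the large-sample formula $2\, z_{1-\alpha/2}\, s/\sqrt{n}$ at a few settings. This is a sketch using `statistics.NormalDist.inv_cdf` for the z quantile; `ci_length` and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def ci_length(n, s, alpha):
    """Length of the approximate (z-based) confidence interval: 2 * z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * z * s / math.sqrt(n)


base = ci_length(n=50, s=21.7, alpha=0.05)
larger_n = ci_length(n=200, s=21.7, alpha=0.05)      # more data
larger_s = ci_length(n=50, s=30.0, alpha=0.05)       # more variable data
larger_alpha = ci_length(n=50, s=21.7, alpha=0.10)   # 90% instead of 95% confidence

print(f"base {base:.2f} | n=200 {larger_n:.2f} | s=30 {larger_s:.2f} | alpha=0.10 {larger_alpha:.2f}")
```

Because n enters through $\sqrt{n}$, quadrupling the sample size only halves the interval length.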

Suppose we measure the mean birth weight of a
sample of 10 infants from the Boston City
Hospital. Suppose we find the sample mean to
be 116.9 ounces and the sample standard
deviation to be 21.7 ounces.

Since n = 10 ≤ 30, we know that we should use
the following formula to find the confidence
interval, with $\bar{x} = 116.9$ and $s = 21.7$:

$$\left(\bar{x} - t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \ \bar{x} + t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}}\right)$$
With $\bar{x} = 116.9$, $s = 21.7$, and

$$t_{n-1,\,1-\alpha/2} = t_{10-1,\,1-0.05/2} = t_{9,\,0.975} = 2.262,$$

the 95% confidence interval is

$$
\begin{aligned}
\left(\bar{x} - t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \ \bar{x} + t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}}\right)
&= \left(116.9 - 2.262\,\frac{21.7}{\sqrt{10}},\ \ 116.9 + 2.262\,\frac{21.7}{\sqrt{10}}\right) \\
&= (116.9 - 15.5,\ 116.9 + 15.5) \\
&= (101.4,\ 132.4)
\end{aligned}
$$
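
The arithmetic in this example is easy to verify. The sketch below plugs the sample values into the t-based formula, taking the critical value 2.262 from the table rather than computing it; the variable names are illustrative.

```python
import math

x_bar, s, n = 116.9, 21.7, 10     # sample mean, sample SD, sample size
t_crit = 2.262                    # t_{9, 0.975} from the t table

margin = t_crit * s / math.sqrt(n)            # margin of error, about 15.5
lower, upper = x_bar - margin, x_bar + margin
print(f"95% CI: ({lower:.1f}, {upper:.1f})")  # prints "95% CI: (101.4, 132.4)"
```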
Most of the slides were adapted from a lecture
created for this course by Mark F. Munsell in 2006.
Several slides were adapted from lectures I
created for BIST 501 at Georgetown University
using the text: Stats, Data, and Models by
DeVeaux, Velleman, and Bock.