1. selects a sample of n units from a real, finite population of N units, each unit
being identified (for the purposes of selection) by a distinct label (eg name and
address for humans)
(THE DESIGN STAGE)
HOW?
1. (THE DESIGN STAGE): How do we select the sample? ie what is the sample
design or sampling mechanism? There are many possible sample designs, but
the most important, even if little used in actual survey practice, is
Definition 1 (SRS: Simple Random Sampling) All possible samples of size n (ie
with n distinct units) are equally likely to be drawn, so that each has probability
1/C(N, n), where C(N, n) = N!/(n!(N − n)!) is the number of such samples.
The sampling fraction f = n/N is the probability that any particular unit is included
in the sample (see later). It ranges from 0 (sampling from an infinite population,
where sampling with replacement is equivalent to the without-replacement sampling
studied here) to 1 (a census, where the whole population is sampled and there is
clearly no sampling error: inferences on population parameters should be exact in
the absence of non-sampling errors: see below).
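As a quick check of this definition (a sketch: the tiny population of N = 5 labelled units is invented for illustration), one can enumerate every possible sample and verify that a given unit is included in a fraction f = n/N of the equally likely samples:

```python
from itertools import combinations
from math import comb

units = ["A", "B", "C", "D", "E"]        # distinct labels for the N units
N, n = len(units), 2

# Under SRS every one of the C(N, n) samples of n distinct units is equally likely.
samples = list(combinations(units, n))
assert len(samples) == comb(N, n)        # 10 samples, each with probability 1/10

# Inclusion probability of any particular unit, eg "A":
pi_A = sum(1 for s in samples if "A" in s) / len(samples)
print(pi_A, n / N)                       # 0.4 0.4 -- both equal f = n/N
```

Counting is used rather than summing the probabilities 1/10 only to avoid floating-point clutter; the two are equivalent because the samples are equally likely.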
eg just to some of the questions asked, item non-response), measurement error,
for example asking people how many cigarettes they smoked last week, and other
NON-SAMPLING ERRORS. For further discussion see later in this course.
Of course some estimators are sensible and give precise inferences, while others
may be either not sensible (eg giving values outside the known range of a population
parameter) or very imprecise (eg an estimator which ignores most of the sample
data).
Let Var[e] be the variance of an estimator e under its sampling distribution for the fixed
(unknown) population, and let v[e] be a sample estimator of this variance, called a vari-
ance estimator for e. Then under a normal approximation to the sampling distribution
of e, a nominal 95% confidence interval for the population parameter θ is given by

e ± 1.96 √v[e]    (1.1)
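Interval (1.1) is straightforward to compute; here is a sketch (the estimate e = 100 and variance estimate v[e] = 25 are invented purely for illustration):

```python
from math import sqrt

def ci95(e, v):
    """Nominal 95% confidence interval e +/- 1.96 * sqrt(v[e]), equation (1.1)."""
    half = 1.96 * sqrt(v)
    return e - half, e + half

# Hypothetical estimate e = 100 with variance estimate v[e] = 25:
lo95, hi95 = ci95(100.0, 25.0)
print(round(lo95, 1), round(hi95, 1))    # 90.2 109.8
```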
Definition 4 The actual probability that this interval contains the true value θ of the
population parameter is called the coverage probability for the estimator, design and
population (usually < 0.95: undercoverage, but sometimes above 0.95: overcoverage).
Let E[.] denote expectation (or mean) under the sampling distribution; then
Definition 5 e is unbiased for the population parameter θ if, for all populations,
E[e] = θ.
FARMS: Sampling distribution of the sample mean and variance estimator under
simple random sampling.

Sample   prob.    ȳ         s²       v[ȳ]    Coverage?
A,B      0.1    111.5     612.5    183.75    NO
A,C      0.1    122.0    1568.0    470.40    NO
A,D      0.1    147.5    5724.5   1717.35    YES
A,E      0.1    182.0   15488.0   4646.40    YES
B,C      0.1    139.5     220.5     66.15    NO
B,D      0.1    165.0    2592.0    777.60    YES
B,E      0.1    199.5    9940.5   2982.15    YES
C,D      0.1    175.5    1300.5    390.15    YES
C,E      0.1    210.0    7200.0   2160.00    YES
D,E      0.1    235.5    2380.5    714.15    NO
mean            168.8    4702.7   1410.81    0.6
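The whole table can be reproduced in a few lines of Python. Two assumptions, flagged because they are not stated in this extract: the individual farm values (A = 94, ..., E = 270) are reverse-engineered from the pair means ȳ in the table, and the variance estimator is taken to be the usual without-replacement form v[ȳ] = (1 − f)s²/n, which matches the tabulated column:

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance     # variance uses the n - 1 divisor

# Farm values reverse-engineered from the table's pair means (an assumption).
pop = {"A": 94, "B": 129, "C": 150, "D": 201, "E": 270}
N, n = len(pop), 2
f = n / N
theta = mean(pop.values())                # population mean, 168.8

rows = []
for pair in combinations(sorted(pop), n):
    ys = [pop[u] for u in pair]
    ybar, s2 = mean(ys), variance(ys)
    v = (1 - f) * s2 / n                  # assumed variance estimator for ybar
    covers = abs(ybar - theta) <= 1.96 * sqrt(v)   # does (1.1) contain theta?
    rows.append((ybar, s2, v, covers))

ybars = [r[0] for r in rows]
print(mean(ybars))                        # 168.8: matches the population mean
print(mean(r[1] for r in rows))           # 4702.7: matches the population variance
print(mean(r[3] for r in rows))           # 0.6: the coverage probability
```

Averaging each column over the ten equally likely samples reproduces the mean row of the table, including the coverage probability of 0.6 obtained by counting YES as 1 and NO as 0.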
The means of the sampling distributions of the sample mean and sample variance are
therefore equal to the population mean and population variance respectively. These
agree with the general theorem in the next chapter.
The coverage probability is considerably less than the nominal level of 0.95, a phe-
nomenon known as undercoverage. However, if we had used a percentage point of the
t-distribution with (say) one degree of freedom instead of the Normal, then the coverage
probability would have been 1, the phenomenon of overcoverage.
The variance of the (discrete) sampling distribution of ȳ also agrees with the theorem
in the next chapter, as

Var[ȳ] = E[ȳ²] − (E[ȳ])² = (111.5² + 122.0² + · · · + 235.5²)/10 − 168.8² = 1410.81.
However, it can be seen that the distribution of ȳ is a very poor approximation to the
Normal!
It is worthwhile noting the short-cut formula (y1 − y2)²/2 for the sample variance
when n = 2.
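A quick check of the short-cut against the usual n − 1 definition (a sketch; the two values are arbitrary):

```python
from statistics import variance           # sample variance with n - 1 divisor

y1, y2 = 111.0, 122.0                     # any two sample values
shortcut = (y1 - y2) ** 2 / 2             # short-cut formula for n = 2
print(shortcut, variance([y1, y2]))       # 60.5 60.5 -- they agree
```

The agreement is exact: with n = 2 each value deviates from the mean by |y1 − y2|/2, and summing the two squared deviations and dividing by n − 1 = 1 gives (y1 − y2)²/2.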
This further example illustrates the generality of the definitions we have made to all
sampling problems, not just SRS:
Example 2 (Unequal probability sampling) For this population e is biased (with a bias
of +0.02), and v[e] is also biased (with a bias of +0.0014). Note that the true variance
Var[e] is calculated directly from the sampling distribution using the standard formulae
for discrete distributions.
The coverage is NOT 2/3, but is obtained by adding the probabilities of the samples
whose intervals cover the true value, or equivalently by taking the expectation of the
coverage indicator (counting YES as 1 and NO as 0) under the sampling distribution.
Compare unbiased estimators e by their variances Var[e]:
e1 is better than e2 if Var[e1] < Var[e2], but this inequality often depends on the popu-
lation; that is, for some populations e1 is better than e2, whereas for others it is e2 which
is the better estimator. One of the aims of sampling theory is to identify the types of
population which make one estimator better than another. If an estimator is always
worse than another, ie for every population, no matter what the values in it, then it is
clearly not worth considering (we say it is inadmissible).
More generally, compare biased and unbiased estimators by their mean square errors,
MSE[e] = Var[e] + (E[e] − θ)².
2. to compare different estimators, perhaps using different sample designs and/or dif-
ferent sample sizes, by looking at variances or more generally mean square errors.
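Mean square error, not variance alone, is the right yardstick once bias is allowed. A sketch of why, using the FARMS population (the values below are reverse-engineered from the earlier table, and the shrinkage estimator 0.95ȳ is an invented example, so treat both as assumptions): a biased estimator can beat the unbiased sample mean on MSE.

```python
from itertools import combinations
from statistics import mean

pop = [94, 129, 150, 201, 270]            # assumed FARMS values (see earlier table)
theta = mean(pop)                         # population mean, 168.8

# Sampling distribution of the sample mean under SRS with n = 2:
ybars = [mean(s) for s in combinations(pop, 2)]

def mse(estimates):
    """Mean square error about theta; equals the variance when unbiased."""
    return mean((e - theta) ** 2 for e in estimates)

print(round(mse(ybars), 2))                      # 1410.81, the variance of ybar
print(round(mse([0.95 * y for y in ybars]), 2))  # 1344.49, biased yet smaller MSE
```

The shrunken estimator 0.95ȳ has bias −0.05θ, but its variance shrinks by the factor 0.95², and here the variance saving outweighs the squared bias.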