
1999 by CRC Press LLC

chapter nine
Sampling
Michael S. Broida
Miami University
Contents
9.1 Purpose
9.2 Strengths, weaknesses, and limitations
9.3 Inputs and related ideas
9.4 Concepts
9.4.1 Why sample?
9.4.2 Sample size and sampling error
9.4.3 Bias
9.4.4 Random sampling
9.4.5 Random-like samples
9.4.6 Stratified random sampling
9.4.6.1 Proportional allocation
9.4.6.2 Optimal allocation
9.5 Key terms
9.6 Software
9.7 References
9.1 Purpose
Sampling is a technique for obtaining an estimate from a population by
studying, measuring, or interviewing a subset (or sample) of that popula-
tion. This chapter discusses basic sampling concepts.
9.2 Strengths, weaknesses, and limitations
A well-selected sample yields an estimate of the target parameters in much
less time and at much less cost than studying, measuring, or interviewing
the entire population (conducting a census). It is often impossible to achieve
100 percent response because some of the entities to be studied, measured,
or interviewed are unavailable or do not respond. A sample is sometimes
more accurate than a census because obtaining numerous measurements
introduces errors owing to fatigue, inaccurate or inconsistent data entry, and
the use of less qualified personnel.
The sample answer, called an estimate, is almost never exactly the same
as the corresponding population value. (This difference is called error.)
Additionally, before a statistically valid sample can be selected, a great deal
of information about the population must be available.
9.3 Inputs and related ideas
Before conducting a sample, it is necessary to define the specific information
being sought and the population from which the sample will be drawn. For
example, if an analyst needs information about perceived weaknesses in the
existing sales order tracking system, the population would consist of all the
people who utilize the existing system.
Sampling can be used to select the subset of a population to be inter-
viewed (Chapter 8), the members of a JAD team (Chapter 14), or the
members of an inspection team (Chapter 23). Sampling is an effective way
to study an existing system by selecting the entities, transactions, occur-
rences, or personnel to be observed and measured. Sampling is an effective
tool for estimating population characteristics when using such mathemati-
cal tools as simulation (Chapter 19) and queuing theory (Chapter 79).
During the testing phase of the system development life cycle (Part VII),
sampling is used to generate test data and select the specific events to be
monitored. During the operation and maintenance phase (Part VIII), sam-
pling is an effective tool for evaluating and monitoring performance and for
implementing system controls (Chapter 77). For example, quality
control is often implemented by taking random samples of a process.
Sometimes the estimates generated by sampling a process are plotted on a
control chart (Chapter 10) to determine if the process is in control.
9.4 Concepts
Sampling is a technique for obtaining an estimate from a population by
studying, measuring, or interviewing a subset (or sample) of the population.
This chapter discusses basic sampling concepts.
9.4.1 Why sample?
Every year, Consumer Reports magazine conducts tests on new automobiles
and reports its findings to its readers. Given the (literally) millions of
automobiles that roll off the assembly lines every year, testing the entire
population would be incredibly time consuming, prohibitively expensive,
and practically impossible, so the test results are based on a sample.
In many cases, testing a sample is actually more accurate than testing
the entire population. A tester's reactions and perceptions are likely to
change between the first car and the tenth car, if only because of fatigue.
Multiple tests mean considerable data, and data entry errors are inevitable.
Multiple tests also imply multiple testers, not all of whom are equally
skilled. Finally, the test conditions and criteria will almost certainly change
over time. For example, if enough cars are crashed into a barrier, the barrier
will eventually be deformed, thus changing the test conditions.
If the sample is drawn properly, it is reasonable to assume that the sam-
ple estimate reflects the population. The balance of this chapter discusses the
process of drawing a good sample.
9.4.2 Sample size and sampling error
The difference between the sample estimate and the true population value is
called error. As a general rule, the sampling error decreases as the sample
size increases. For example, assuming a 95 percent confidence interval, a
sample of 1,000 voters might predict the outcome of an election with an error
of slightly more than plus or minus 3 percent. Increase the sample size to
4,000, and the error drops to plus or minus 1.5 percent, while a sample size
of 10,000 reduces the error to less than plus or minus 1 percent.
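The relationship between sample size and error can be checked with a short
calculation. The sketch below is illustrative only: it assumes the usual
margin-of-error formula for a sample proportion and the worst-case
proportion p = 0.5, neither of which is stated in this chapter, and the
function name is hypothetical.

```python
import math

def margin_of_error(n, z=1.96, p=0.5):
    """Approximate margin of error for a sample proportion.

    Assumes simple random sampling from a large population; p = 0.5 is
    the worst case (it gives the largest margin).
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (1_000, 4_000, 10_000):
    print(f"n = {n:>6}: +/- {100 * margin_of_error(n):.2f} percent")
```

Note that quadrupling the sample size only halves the error, so added
precision becomes progressively more expensive.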
A useful formula for computing the sample size is:
\[ n = \frac{z^2 \sigma^2}{E^2}, \tag{9.1} \]
where \(z\) is a number from the normal distribution table that corresponds
to the desired confidence interval, \(\sigma\) is the standard deviation of
the population as estimated by the sample standard deviation, and \(E\) is
the maximum acceptable error between the sample mean and the actual
population mean. For a 95 percent confidence interval, use z = 1.96. For a
99 percent confidence interval, use z = 2.575. As a practical matter,
one-fifth the sample range can be used as an estimate of the standard
deviation.
For example, suppose you want to estimate the average amount of
money a state university student spends on food and beverages in an
average week. The maximum acceptable error is $2. Based on a preliminary
sample, \(\sigma\) is estimated to be $8. The desired confidence interval
is 95 percent. Plugging those numbers into Equation (9.1) suggests a sample
size of:
\[ n = \frac{(1.96^2)(8^2)}{2^2} = 61.466 \]
or 62 students. (It is impossible to sample a fractional student, and rounding
up yields a confidence level slightly higher than 95 percent.) Assuming
the students answer truthfully, averaging the weekly food expenditures of
62 randomly selected university students will yield a value that is within $2
of the population average with 95 percent confidence. To put it another way,
there is a 0.95 probability that the sample mean will lie within $2 of the true
mean. (Note: A real statistician would probably argue that the last statement
is not technically correct, but in most cases it is a reasonable way to visualize
a confidence interval.)
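Equation (9.1) is easy to wrap in a small helper. The following is a
minimal sketch, not a definitive implementation; the function and parameter
names are hypothetical, and the result is rounded up because a fractional
observation cannot be sampled.

```python
import math

def required_sample_size(z, sigma, max_error):
    """Equation (9.1), rounded up to a whole number of observations."""
    return math.ceil((z ** 2) * (sigma ** 2) / (max_error ** 2))

# Inputs from the student food-spending example:
# 95 percent confidence, sigma estimated at $8, maximum error of $2.
n = required_sample_size(z=1.96, sigma=8.0, max_error=2.0)
print(n)
```

Because the sample size grows with the square of z and sigma and shrinks
with the square of the acceptable error, halving the acceptable error
roughly quadruples the required sample.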
9.4.3 Bias
Simply selecting the right sample size is not enough, however. For example,
a sample taken outside an expensive restaurant and a sample taken outside
a food bank will almost certainly yield two very different (and equally
invalid) estimates of the weekly food expenditures of university students
because those samples are likely to be biased. A biased sample systematically
favors some members of the population over others. To cite another exam-
ple, if a telephone book is used to select a sample, people with unlisted
numbers, people who have recently moved into that telephone market, and
people with no telephone are automatically excluded from the sample.
Non-response bias occurs when one or more members of the selected
group are not included in the sample. A survey that includes information
only from people who answer their telephones at a certain time of day
excludes one subset of the population. Dismissing or excluding people who
refuse to answer certain questions is another source of non-response bias. Be
aware of non-response bias. Before taking a sample, study the sampling
process, identify subsets of the population that might be excluded or choose
not to participate, and adjust the sampling process as necessary.
9.4.4 Random sampling
One relatively easy way to avoid introducing bias is to sample randomly. A
sample is considered random if each member of the population has the same
chance of being selected. Random samples yield unbiased estimates.
Generally, an unbiased estimate is high about half the time and low about
half the time.
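The "high about half the time" behavior can be illustrated with a small
simulation. This is a sketch under stated assumptions: the population
(the integers 1 through 1,000) is hypothetical and symmetric about its
mean, which is what makes the half-and-half split a reasonable
expectation, and the function name is invented for illustration.

```python
import random
import statistics

def fraction_of_high_estimates(population, sample_size, trials, seed=0):
    """Share of random samples whose mean exceeds the population mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    true_mean = statistics.mean(population)
    high = 0
    for _ in range(trials):
        sample = rng.sample(population, sample_size)
        if statistics.mean(sample) > true_mean:
            high += 1
    return high / trials

population = list(range(1, 1001))  # hypothetical population
share = fraction_of_high_estimates(population, sample_size=30, trials=2000)
print(f"Sample mean was high in {100 * share:.1f} percent of trials")
```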
There are two commonly used techniques for selecting a random sam-
ple. If the population is small, the members (or slips of paper representing
each member) can be mixed thoroughly and the sample selected directly
(like bingo markers or lottery tickets). For larger populations, assign each
member a number and use a random number generator or a table of random
numbers to select the sample.
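The second technique (number the members, then let a random number
generator choose) might be sketched as follows. The population size,
sample size, and function name are assumptions made for illustration.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Pick sample_size distinct member numbers from 1..population_size.

    random.sample draws without replacement, so every member has the
    same chance of being selected and no member is picked twice.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# Hypothetical population of 5,000 numbered members, sample of 25.
chosen = simple_random_sample(population_size=5000, sample_size=25, seed=42)
print(chosen)
```

Passing a seed makes the selection repeatable, which is useful when the
sampling procedure must be audited later.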
9.4.5 Random-like samples
In cases where it is impossible or inconvenient to select a true random sam-
ple, the objective is to generate estimates that behave as though they were
based on a random sample. The key to successful, almost random sampling is
to avoid introducing bias. For example, imagine a grocer inspecting a shipment
of fruit. An estimate based on a sample taken from a single box or even
from the tops of several boxes is unlikely to accurately reflect the quality of
all the fruit. However, if the grocer selects several boxes and then selects
fruit from the top, the middle, and the bottom of each, the sample is likely
to be random-like.
On an assembly line, selecting every tenth, hundredth, or thousandth
item (generally, every nth item) as it flows by might be an effective way to
select a random-like sample. An option is to select every (m ± n)th item,
where n is a random number (for example, every 100 ± 5th item).
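Selecting items at regular intervals can be sketched in a few lines. The
example below is one possible reading, not the author's procedure: the
randomized variant is taken to mean that each gap between consecutive
picks is the base interval plus a small random offset. The names step and
jitter are hypothetical.

```python
import random

def systematic_positions(total_items, step, jitter=0, seed=None):
    """Return the 1-based positions of the items to inspect.

    Each gap between consecutive picks is step plus a random offset
    drawn uniformly from -jitter..+jitter (jitter=0 gives every
    step-th item exactly).
    """
    if not 0 <= jitter < step:
        raise ValueError("jitter must be at least 0 and smaller than step")
    rng = random.Random(seed)
    positions = []
    position = 0
    while True:
        position += step + rng.randint(-jitter, jitter)
        if position > total_items:
            break
        positions.append(position)
    return positions

print(systematic_positions(1000, step=100))                    # fixed interval
print(systematic_positions(1000, step=100, jitter=5, seed=1))  # randomized gaps
```

The random offset makes the inspection points harder to anticipate, which
matters if the process has a cycle that happens to match the interval.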
Avoid predictability when sampling human beings, however, because it
often introduces bias. For example, if the boss walks through the work area
every hour on the hour, he or she is likely to find everyone hard at work. If
another boss were to use a random number table to define the times for
random visits to the work area, he or she is likely to gain a more accurate
picture of the employees' work habits.
9.4.6 Stratified random sampling
With stratified random sampling, a population of size N is divided into m
subgroups. Each subgroup is called a stratum, and each member of the
population must lie in exactly one stratum. For example, dividing a group
of people by sex yields two strata (male and female); dividing a group of
voters into Democrat, Republican, Independent, and Socialist yields four
strata; and comparing the products produced on the first, second, and
third shifts calls for three strata. Samples are taken randomly within each
stratum.
Stratified random sampling is important if the different strata have
different means and/or different levels of variability. For example, suppose
the newer, relatively inexperienced employees who work the third shift
produce markedly more errors than the people who work the other two
shifts. In such cases, stratified sampling tends to yield more accurate
estimates than simple random sampling.
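Drawing a stratified random sample amounts to running a separate random
selection inside each stratum. The sketch below assumes the strata are
already defined and the per-stratum sample sizes already chosen; the
employee identifiers, shift sizes, and function name are hypothetical.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample within each stratum.

    strata maps a stratum name to the list of its members;
    sizes maps the same names to the number of members to draw.
    """
    rng = random.Random(seed)
    return {name: rng.sample(members, sizes[name])
            for name, members in strata.items()}

# Hypothetical work force split by shift.
strata = {
    "first": [f"F{i}" for i in range(100)],
    "second": [f"S{i}" for i in range(60)],
    "third": [f"T{i}" for i in range(40)],
}
sample = stratified_sample(strata, {"first": 10, "second": 6, "third": 4}, seed=7)
for shift, members in sample.items():
    print(shift, members)
```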
9.4.6.1 Proportional allocation
One technique for distributing a sample across several strata is called
proportional allocation. If 200 employees are distributed over three shifts
with 100 on first shift, 60 on second shift, and 40 on third shift, a reasonable
sample distribution might be 50 percent first shift, 30 percent second shift,
and 20 percent third shift.
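Proportional allocation reduces to a one-line calculation per stratum. In
this sketch the total sample size of 40 is an assumption chosen for
illustration (the text gives only the percentages), and the function name
is hypothetical.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size.

    Each share is rounded to a whole number, so with awkward
    proportions the rounded counts may not sum exactly to total_sample.
    """
    population = sum(stratum_sizes.values())
    return {name: round(total_sample * size / population)
            for name, size in stratum_sizes.items()}

shifts = {"first": 100, "second": 60, "third": 40}
allocation = proportional_allocation(shifts, total_sample=40)
print(allocation)
```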
9.4.6.2 Optimal allocation
If one stratum exhibits significantly more variability than the others,
proportionally more samples should be taken from the inconsistent stratum.
Also, if one stratum is more costly to measure or interview than another,
proportionally fewer samples should be taken from the expensive stratum.
Optimal allocation is a technique for distributing a sample across
several strata that considers variability and cost. The optimum allocation
formula is:
\[ \frac{n_i}{n} = \frac{W_i \sigma_i / C_i^{1/2}}{\sum \left( W_i \sigma_i / C_i^{1/2} \right)}, \tag{9.2} \]
where \(n_i\) is the number of samples in stratum i, n is the total sample
size, \(W_i\) is the fraction of the population in stratum i, \(\sigma_i\)
is the standard deviation of stratum i, and \(C_i\) is the cost to sample
stratum i. The sum in the denominator is taken over all the strata. The
formula calculates a relatively larger sample size for a given stratum if
its variability (measured by \(\sigma_i\)) is higher than average or if the
cost of sampling from that stratum is lower than average.
For example, suppose n, the total sample size, is 500. The population is
divided among three strata, with costs to sample of $3, $4, and $5 per item
for strata 1, 2, and 3 respectively (\(C_1 = \$3\), \(C_2 = \$4\), and
\(C_3 = \$5\)). Stratum 1 contains 50 percent of the population
(\(W_1 = 0.5\)), stratum 2 contains 30 percent of the population
(\(W_2 = 0.3\)), and stratum 3 contains 20 percent of the population
(\(W_3 = 0.2\)). Finally, the estimated standard deviations for the
three strata are \(\sigma_1 = 1.5\), \(\sigma_2 = 2\), and \(\sigma_3 = 2.5\).
First calculate
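\[
\begin{aligned}
\sum \left( W_i \sigma_i / C_i^{1/2} \right)
  &= \left[ W_1 \sigma_1 / C_1^{1/2} \right] + \left[ W_2 \sigma_2 / C_2^{1/2} \right] + \left[ W_3 \sigma_3 / C_3^{1/2} \right] \\
  &= \left[ 0.5(1.5) / 3^{1/2} \right] + \left[ 0.3(2) / 4^{1/2} \right] + \left[ 0.2(2.5) / 5^{1/2} \right] \\
  &= 0.433 + 0.300 + 0.224 = 0.957.
\end{aligned}
\]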
Next, compute
\[
\begin{aligned}
n_1 / n &= 0.433 / 0.957 = 0.452 \\
n_2 / n &= 0.300 / 0.957 = 0.314 \\
n_3 / n &= 0.224 / 0.957 = 0.234.
\end{aligned}
\]
Those numbers suggest that \(n_1\) (the stratum 1 sample size) should be
45.2 percent (or 226 units) of the total sample size (500 items), \(n_2\)
should be 31.4 percent (or 157 units), and \(n_3\) should be 23.4 percent
(or 117 units).
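The optimal allocation arithmetic can be checked with a few lines of code.
This is a minimal sketch of Equation (9.2) using the inputs from the
example above; the function name is hypothetical, and rounding each share
to a whole unit is an assumption, so with other inputs the rounded counts
may not sum exactly to the total.

```python
import math

def optimal_allocation(total_sample, weights, std_devs, costs):
    """Allocate total_sample across strata per Equation (9.2).

    Each stratum's share is proportional to W_i * sigma_i / sqrt(C_i).
    """
    terms = [w * s / math.sqrt(c)
             for w, s, c in zip(weights, std_devs, costs)]
    total = sum(terms)
    return [round(total_sample * term / total) for term in terms]

sizes = optimal_allocation(
    total_sample=500,
    weights=[0.5, 0.3, 0.2],    # W_1, W_2, W_3
    std_devs=[1.5, 2.0, 2.5],   # sigma_1, sigma_2, sigma_3
    costs=[3.0, 4.0, 5.0],      # C_1, C_2, C_3 in dollars per item
)
print(sizes, sum(sizes))
```

A useful sanity check on any allocation is that the stratum sample sizes
add up to the total sample size.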
9.5 Key terms
Bias Any factor that systematically favors some members of the population
over others when a sample is drawn.
Census A set of measurements (or interviews) for every element of a
population.
Confidence interval A range of numbers around an estimate that
contains the corresponding population parameter with the stated
probability. For example, a 95 percent confidence interval for an estimate
of the population mean is a range of numbers that contains the population
mean with 95 percent certainty.
Error The difference between the value of a parameter as estimated by
a sample and the actual value of that parameter for the entire popula-
tion.
Estimate A value of a parameter determined by a sample.
Mean An arithmetic average; the sum of all the observations divided
by the number of observations.
Non-response bias A form of bias that occurs when one or more
members of the selected group are not included or choose not to par-
ticipate in the sample.
Population The entire set of relevant entities or measurements.
Random sample A sample in which each item in the population has
the same chance of being selected.
Range The difference between the highest value and the lowest
value in a set of measurements.
Sample A selected subset of a population.
Standard deviation The square root of the variance.
Strata The set of subgroups in a stratified random sample.
Stratified random sampling A random sampling technique in
which the population is divided into subgroups called strata such
that each element of the population lies in exactly one stratum; sam-
ples are taken randomly within each stratum.
Stratum A single subgroup in a stratified random sample.
Unbiased estimate An estimate that is high about half the time and
low about half the time.
Variance The average of the squared differences between the indi-
vidual population values and the population mean.
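The definitions of mean, range, variance, and standard deviation given
above can be made concrete with a short calculation. The five observations
below are hypothetical. The population forms of the functions are used
because the variance is defined here as an average over the whole
population.

```python
import statistics

observations = [12.0, 15.0, 9.0, 20.0, 14.0]  # hypothetical measurements

mean = statistics.mean(observations)
value_range = max(observations) - min(observations)
# pvariance averages the squared differences from the mean,
# matching the definition of variance given above.
variance = statistics.pvariance(observations)
# pstdev is the square root of that variance.
std_dev = statistics.pstdev(observations)

print(f"mean = {mean}, range = {value_range}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```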
9.6 Software
Random number tables are found in many statistics textbooks and/or in the
software packages that accompany those books. Random number functions
are found in most spreadsheet programs. SAS users can generate random
observations from a binomial distribution (RANBIN), an exponential
distribution (RANEXP), a normal distribution (RANNOR), a Poisson distri-
bution (RANPOI), or a uniform distribution (RANUNI). Minitab for
Windows users should check the RANDOM DATA sub-window on the
CALC pull-down menu.
9.7 References
1. Aczel, A. D., Complete Business Statistics, Irwin, Homewood, IL, 1989, chap. 16.
2. Badarinathi, R., Introduction to SAS, Dryden Press, New York, 1992, 21.
3. Bowerman, B. L. and O'Connell, R. T., Applied Statistics: Improving Business
Processes, Irwin, Chicago, 1997.
