You are on page 1of 33

Sampling

Distributions:
Section 9.1
Why Am I Here Again?
• Statistics is the science (and art) of learning from
data.
• Remember from the first week of class that there are
two basic kinds of statistics:
– exploratory data analysis: an informal and open-
ended examination of data for patterns
– statistical inference: follows strict rules and
focuses on judging whether the patterns you
found are the sort you would expect
Does It Matter Which I Use?
• Yes! Exploratory data analysis can be done with
any data, but formal inference should only be used
in certain situations. Although experts disagree
about how widely statistical inference should be
used, they all agree that inference is most secure
when we produce data through random sampling
or randomized comparative experiments. Because
when we use chance to choose respondents or
assign subjects, the laws of probability can answer
the question “What would happen if we did this
many, many times?”
Where We’ve Been
• In chapters 1 – 4 we focused on exploratory
data analysis where we developed tools and
strategies for organizing, describing, and
analyzing data.
• In chapter 5 we learned how to correctly
collect or produce data through surveys,
experiments, and observational studies.
• In chapters 6 – 8 we learned about probability.
Where We’re Going
• The purpose of chapter 9 is to prepare us for the
study of statistical inference (chapters 10 – 15) by
looking at the probability distributions of some
very common statistics: sample proportions and
sample means.
Some Basic Vocabulary
• Parameter: a number that describes the entire
population. It statistics this value is never
known.

• Statistic: a number that can be computed from


the sample data. In practice, we often use a
statistic to estimate an unknown parameter.
The Essence of the Matter
• As long as we were just doing some basic data
analysis, the distinction between statistics and
parameters was not all that important, but now
as we get into inference, it is essential.
• Remember:
– The sample mean isx . This is a statistic.
– The population mean is μ. This is a
parameter.
More Notation
• If instead of the mean we are interested in the
percent of people or things that have a certain
characteristic, we use a proportion.
• Remember:
– The sample proportion is p̂ . (statistic)
– The population proportion is p. (parameter)
It All Varies
• Each time we take a sample of things we
expect to get a slightly different mean or
proportion, even when sampling from the
same population.

• This basic fact is called sampling variability:


the value of a statistic varies in repeated
sampling.
Always Practice Safe Statistics
• Because of sampling variability, we would never
just collect data from a single sample and say that
our sample statistic is equal to the population
parameter. It may be close, but it may be very far
off.
• So how can we ever be sure that our sample
statistic is a good estimator of our population
parameter? Well, we can’t, but we will learn how
to be pretty confident during the next few
chapters of our book.
Customs Conundrum
• In practice it is too difficult and expensive to take
many samples from a population, so we can imitate
the sampling by using a simulation.
• Custom’s officials at the Guadalajara airport want
to make sure that travelers do not bring illegal items
into the country. They cannot afford to search
everyone though, so they have each traveler press a
button; green they go through, red they get
searched. The officers claim that the probability the
light shows green on any press of the button is 0.70.
Simulate an SRS
• We can imitate the population with a table of
random digits, such as Table B in our book,
with each entry standing for a traveler. How
can we simulate the results of the button
pushes of the next 100 people in line?
• Let’s do this, starting at line 101.
• If we continued this process through the first 100
digits, we would find that 71 of the 100 entries
are 0 through 6. So our sample proportion of
people who make it through Customs in
pˆ  0.71
Guadalajara is 0.71 or .
• If we carried out this process again using the next
100 digits, we get a different result,pˆ  0.62 .
• These two sample results are different , and
neither is equal to the true population value p =
0.7. That’s sampling variability!
Speedy Simulations
• Simulations are very powerful tools in
statistics because they allow us to study
chance without physically collecting the data.
Technology makes this even faster than using
a random digit table.
A Picture is Worth 1000 SRSs

The distribution of the sample proportion of 1000 SRSs of


size 100 drawn from a population with p = 0.7.
A Sampling Distribution
• The histogram approximates the sampling
distribution of p̂ .
Strictly Speaking
• The true sampling distribution for our Customs
situation is the ideal histogram that would
form when using all the possible samples of
size 100 from our population. The histogram
that was created using 1000 trials is only an
approximation of the true sampling
distribution.
Don’t Table the Issue!
The probability distribution used to construct a
random number table:

* Note this is a probability distribution, not a sampling distribution!


An Actual Sampling Distribution
• Consider the process of taking an SRS of size
2 from this population and computing the
mean of the sample. We could perform a
simulation many, many times and get an
approximate sampling distribution. Since the
data set is fairly small and calculating the
mean is easy, we can instead construct the
actual sampling distribution.
All Possible Means for n = 2
The Sampling Distribution of the
Means
Continue CUSSing!
• Whether we use probability to create the true
sampling distribution of an event, or use a
simulation to create an approximate sampling
distribution, we can still describe the
distribution that is created.
Let’s Describe the Distributions

SRS w/ n = 100:

SRS w/ n = 1000:
Sneaky Scales
Randomization Rules!
• The shape of the approximate sampling
distributions that we just looked at are a result of
random sampling. Non-random sampling would
not give such regular and predictable results.
• When randomization is used in a design for
producing data, the statistics computed from the
data have a definite pattern of behavior over
many repetitions even though the result of a
single repetition is uncertain.
Can You Really Trust a Statistic?
• The fact that statistics calculated from random
samples have definite sampling distributions
allows a more careful answer to the question
of how trustworthy a statistic is as an estimator
of a parameter.
Can a Statistic Be Biased?
• Yes! We have already discussed how sampling
methods can be biased, but what does it mean
if a statistic is biased?

• The bias of a statistic is the difference between


the parameter being estimated and the average
value of the statistic used to estimate that
parameter.
An Easier Definition:

* That does not mean that each time I calculate an


unbiased statistic it will exactly equal the population
parameter. It will sometimes be larger and sometimes
smaller.
Is Bigger Really Better?
• Yes! Larger samples are better than smaller
samples because they are much more likely to
produce an estimate close to the true
parameter. This is because large samples have
much less variability than small samples.
The Variability of a Statistic
Why?
• Well since a statistic is only calculated from
the sample data collected, it is only the size of
the sample that you collect that effects the
statistic. The size of the population doesn’t
have anything to do with that calculation. (Of
course this is only for populations that are at
least 10 times as big as the sample size.)
If You’re Going to San
Francisco…
• The fact that the variability of sample results is
controlled by the size of the sample has very
important consequences for sampling design.
This means that a statistic from a sample of
size 2500 from the population of the US (more
than 300 million) is just as precise as a sample
of size 2500 from the population of San
Francisco (about 750,000.) To obtain equally
likely results, you must use equal sample sizes.
Bias and Variability on the Dartboard
• If the bullseye is the true population parameter
and the arrows we throw are our sample
statistics, which pictures represent high and
low bias? High and low variability?

You might also like