Parametric Methods
Methods are classified on the basis of what we know about the population we are
studying. Parametric methods are typically the first methods studied in an
introductory statistics course. The basic idea is that there is a set of fixed parameters
that determine a probability model.
Parametric methods are often those for which we know that the population is
approximately normal, or we can approximate using a normal distribution after we
invoke the central limit theorem. There are two parameters for a normal
distribution: the mean and the standard deviation.
Ultimately the classification of a method as parametric depends upon the
assumptions that are made about a population. A few parametric methods include:
Confidence interval for a population mean, with known standard deviation.
Confidence interval for a population mean, with unknown standard deviation.
Confidence interval for a population variance.
Confidence interval for the difference of two means, with unknown standard
deviation.
Nonparametric Methods
To contrast with parametric methods, we will define nonparametric methods. These
are statistical techniques that do not require us to make assumptions about the
parameters of the population we are studying.
Indeed, the methods do not depend on the distribution of the population of interest.
The set of parameters is no longer fixed, and neither is the distribution that we use.
It is for this reason that nonparametric methods are also referred to as
distribution-free methods.
Nonparametric methods are growing in popularity and influence for a number of
reasons. The main reason is that we are not constrained as much as when we use a
parametric method. We do not need to make as many assumptions about the
population we are working with as we do when we use a parametric method.
Many of these nonparametric methods are easy to apply and to understand.
A few nonparametric methods include:
Sign test for population mean
Bootstrapping techniques
U test for two independent means
Spearman correlation test
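As a rough sketch of how two of the tests listed above look in practice, the following
Python snippet uses SciPy; the data values are invented purely for illustration and
are not part of the original discussion.

from scipy import stats

# Two small invented samples of measurements.
group_a = [12.1, 14.3, 11.8, 13.5, 15.0, 12.7]
group_b = [10.9, 11.5, 13.2, 10.4, 12.0, 11.1]

# Mann-Whitney U test: compares two independent samples without assuming
# that either population is normally distributed.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print("U test:", u_stat, u_p)

# Spearman correlation: based on ranks, so it makes no assumption about
# the shape of the underlying distributions.
rho, rho_p = stats.spearmanr(group_a, group_b)
print("Spearman:", rho, rho_p)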
Comparison
There are multiple ways to use statistics to find a confidence interval about a
mean. A parametric method would involve the calculation of a margin of error with
a formula, and the estimation of the population mean with a sample mean. A
nonparametric method to calculate a confidence interval for the mean would involve
the use of bootstrapping.
Why do we need both parametric and nonparametric methods for this type of
problem?
Many times parametric methods are more efficient than the corresponding
nonparametric methods. Although this difference in efficiency is typically not that
much of an issue, there are instances where we do need to consider which method is
more efficient.
A normal distribution is more commonly known as a bell curve. This type of curve
shows up throughout statistics and the real world.
For example, after I give a test in any of my classes, one thing that I like to do is to
make a graph of all the scores. I typically write down 10-point ranges such as 60-69,
70-79, and 80-89, then put a tally mark for each test score in that range. Almost
every time I do this, a familiar shape emerges.
A few students do very well and a few do very poorly. A bunch of scores end up
clumped around the mean score. Different tests may result in different means and
standard deviations, but the shape of the graph is nearly always the same. This shape
is commonly called the bell curve.
Why call it a bell curve? The bell curve gets its name quite simply because its shape
resembles that of a bell. These curves appear throughout the study of statistics, and
their importance cannot be overemphasized.
What Is a Bell Curve?
To be technical, the kinds of bell curves that we care about the most in statistics are
actually called normal probability distributions. For what follows we’ll just assume
the bell curves we’re talking about are normal probability distributions. Despite the
name “bell curve,” these curves are not defined by their shape. Instead, an
intimidating-looking formula serves as their formal definition.
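For reference, that formula is the probability density function of the normal
distribution; written in plain notation it is

f(x) = (1 / (σ · √(2π))) · e^(−(x − μ)² / (2σ²))

where μ is the mean and σ is the standard deviation.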
But we really don’t need to worry too much about the formula. The only two
numbers that we care about in it are the mean and standard deviation. The bell curve
for a given set of data has the center located at the mean. This is where the highest
point of the curve, or “top of the bell,” is located. A data set’s standard deviation
determines how spread out our bell curve is.
The larger the standard deviation, the more spread out the curve.
An Example
If we know that a bell curve models our data, we can use the above features of the
bell curve to say quite a bit. Going back to the test example, suppose we have 100
students who took a statistics test with a mean score of 70 and standard deviation of
10.
The standard deviation is 10. Subtract and add 10 to the mean. This gives us 60 and
80.
By the 68-95-99.7 rule, we would expect about 68% of 100, or 68 students, to score
between 60 and 80 on the test.
Two times the standard deviation is 20. If we subtract and add 20 to the mean we
have 50 and 90. We would expect about 95% of 100, or 95 students, to score between
50 and 90 on the test.
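These figures can be checked numerically. The short Python sketch below uses SciPy
(an assumed dependency) with the mean of 70 and standard deviation of 10 from the
example above:

from scipy.stats import norm

mean, sd = 70, 10

# Proportion of a normal distribution within one standard deviation (60 to 80).
within_one = norm.cdf(80, mean, sd) - norm.cdf(60, mean, sd)

# Proportion within two standard deviations (50 to 90).
within_two = norm.cdf(90, mean, sd) - norm.cdf(50, mean, sd)

print(round(within_one, 3), round(within_two, 3))  # about 0.683 and 0.954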
The central limit theorem is a result from probability theory. This theorem shows up
in a number of places in the field of statistics. Although the central limit theorem can
seem abstract and devoid of any application, this theorem is actually quite important
to the practice of statistics.
So what exactly is the importance of the central limit theorem? It all has to do with
the distribution of our population.
As we will see, this theorem allows us to simplify problems in statistics by allowing
us to work with a distribution that is approximately normal.
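As an illustration of this simplification, the sketch below (a minimal simulation of my
own, assuming NumPy is available) draws many samples from a strongly skewed
distribution and shows that the sample means are nonetheless clustered in a roughly
normal way around the population mean:

import numpy as np

rng = np.random.default_rng(0)

sample_size = 50
num_samples = 10_000

# The exponential distribution is strongly skewed and far from normal;
# with scale 1 its population mean is 1.
sample_means = rng.exponential(scale=1.0, size=(num_samples, sample_size)).mean(axis=1)

# By the central limit theorem the sample means are approximately normal,
# centered near 1 with standard deviation close to 1 / sqrt(50), about 0.141.
print(sample_means.mean())
print(sample_means.std(ddof=1))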
Inferential statistics gets its name from what happens in this branch of statistics.
Rather than simply describe a set of data, inferential statistics seeks to infer
something about a population on the basis of a statistical sample. One specific goal
in inferential statistics involves the determination of the value of an unknown
population parameter. The range of values that we use to estimate this parameter is
called a confidence interval.
A confidence interval has two parts. The first is an estimate of the unknown
parameter, calculated from our sample. The second part is the margin of error. This
is necessary because our estimate alone may differ from the true value of the
population parameter. In order to allow for other potential values of the parameter,
we need to produce a range of numbers. The margin of error does this.
The estimate sits at the center of the interval, and then we subtract the margin of
error from this estimate and add it to the estimate to obtain a range of values for
the parameter.
Confidence Level
Attached to every confidence interval is a level of confidence. This is a probability or
percentage that indicates how much certainty should be attributed to our confidence
interval.
If all other aspects of a situation are identical, the higher the confidence level the
wider the confidence interval.
This level of confidence can lead to some confusion. It is not a statement about the
sampling procedure or the population. Instead, it indicates how often the process
used to construct a confidence interval will succeed. For example, confidence
intervals with confidence of 80% will, in the long run, miss the true population
parameter one out of every five times.
Any number from zero to one could, in theory, be used for a confidence level. In
practice 90%, 95% and 99% are all common confidence levels.
Margin of Error
The margin of error of a confidence interval is determined by a couple of factors. We
can see this by examining the formula for the margin of error, which has the form:
Margin of Error = (Statistic for Confidence Level) × (Standard Deviation or Standard Error)
The statistic for the confidence level depends upon what probability distribution is
being used and what level of confidence we have chosen. For example, if C is our
confidence level and we are working with a normal distribution, then C is the area
under the curve between -z* and z*. This number z* is the number in our margin of
error formula.
In practice we often do not know the population standard deviation. To deal with
this uncertainty we instead use the standard error. The standard error that
corresponds to a standard deviation is an
estimate of this standard deviation. What makes the standard error so powerful is
that it is calculated from the simple random sample that is used to calculate our
estimate. No extra information is necessary as the sample does all of the estimation
for us.
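As a minimal sketch of how these pieces fit together (the sample size, standard
deviation, and use of SciPy here are assumptions made for illustration), the margin of
error for a mean can be computed as follows; note how raising the confidence level
from 90% to 99% widens the margin:

from scipy.stats import norm

sigma = 5    # population standard deviation (assumed known here)
n = 40       # sample size

for confidence in (0.90, 0.99):
    # z* is chosen so that the area between -z* and z* under the
    # standard normal curve equals the confidence level.
    z_star = norm.ppf((1 + confidence) / 2)
    margin_of_error = z_star * sigma / n ** 0.5
    print(confidence, round(z_star, 3), round(margin_of_error, 3))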
Confidence Intervals
Confidence intervals are all similar to one another in a few ways. First, many two-
sided confidence intervals have the same form:
Estimate ± Margin of Error
Second, the steps for calculating confidence intervals are very similar, regardless of
the type of confidence interval you are trying to find. The specific type of confidence
interval that will be examined below is a two-sided confidence interval for a
population mean when you know the population standard deviation. Also, assume
that you are working with a population that is normally distributed.
Example
To see how you can construct a confidence interval, work through an example.
Suppose you know that the IQ scores of all incoming college freshmen are normally
distributed with standard deviation of 15. You have a simple random sample of 100
freshmen, and the mean IQ score for this sample is 120. Find a 90-percent
confidence interval for the mean IQ score for the entire population of incoming
college freshmen.
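One way to carry out the calculation: for 90% confidence the critical value is
z* = 1.645, so the margin of error is 1.645 × 15 / √100 = 1.645 × 1.5, or about 2.47.
The confidence interval is therefore 120 ± 2.47, or roughly 117.5 to 122.5.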
Practical Considerations
Confidence intervals of the above type are not very realistic. It is very rare to know
the population standard deviation but not know the population mean. There are
ways that this unrealistic assumption can be removed.
While you have assumed a normal distribution, this assumption does not need to
hold. Nice samples, which exhibit no strong skewness and contain no outliers, along
with a large enough sample size, allow you to invoke the central limit theorem.
As a result, you are justified in using a table of z-scores, even for populations that are
not normally distributed.
Inferential statistics concerns the process of beginning with a statistical sample and
then arriving at the value of a population parameter that is unknown. The unknown
value is not determined directly. Rather we end up with an estimate that falls into a
range of values. This range is known in mathematical terms as an interval of real
numbers, and is specifically referred to as a confidence interval.
Confidence intervals are all similar to one another in a few ways. Two-sided
confidence intervals all have the same form:
Estimate ± Margin of Error
Example
To see how we can construct a confidence interval, we will work through an example.
Suppose we know that the heights of a specific species of pea plants are normally
distributed. A simple random sample of 30 pea plants has a mean height of 12 inches
with a sample standard deviation of 2 inches.
What is a 90% confidence interval for the mean height for the entire population of
pea plants?
The critical value for this interval comes from the t-distribution. Our sample has a
size of 30, and so there are 29 degrees of freedom. The critical value for a confidence
level of 90% is given by t* = 1.699.
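A short Python sketch (assuming SciPy; the rounding is mine) that reproduces this
critical value and finishes the interval:

from scipy.stats import t

n, xbar, s = 30, 12, 2
confidence = 0.90

# t* with n - 1 = 29 degrees of freedom, leaving 90% of the area between -t* and t*.
t_star = t.ppf((1 + confidence) / 2, df=n - 1)

margin_of_error = t_star * s / n ** 0.5
print(round(t_star, 3))                      # about 1.699
print(round(xbar - margin_of_error, 2),      # about 11.38
      round(xbar + margin_of_error, 2))      # about 12.62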
An Explanation of Bootstrapping
One goal of inferential statistics is to determine the value of a parameter of a
population. It is typically too expensive or even impossible to measure this directly.
So we use statistical sampling. We sample a population, measure a statistic of this
sample, and then use this statistic to say something about the corresponding
parameter of the population.
For example, in a chocolate factory, we might want to guarantee that candy bars
have a particular mean weight. It’s not feasible to weigh every candy bar that is
produced, so we use sampling techniques to randomly choose 100 candy bars. We
calculate the mean of these 100 candy bars and say that the population mean falls
within a margin of error from what the mean of our sample is.
Suppose that a few months later we want to know with greater accuracy -- or less of a
margin of error -- what the mean candy bar weight was on the day that we sampled
the production line.
We cannot use today’s candy bars, as too many variables have entered the picture
(different batches of milk, sugar and cocoa beans, different atmospheric conditions,
different employees on the line, etc.). All that we have from the day that we are
curious about are the 100 weights. Without a time machine back to that day, it would
seem that the initial margin of error is the best that we can hope for.
An Example
In practice, to truly use bootstrap techniques we need to use a computer. A bootstrap
sample is formed by drawing values from the original sample, with replacement, until
we have a new sample of the same size. The following numerical example will help to
demonstrate how the process works. If we begin with the sample 2, 4, 5, 6, 6, then all
of the following are possible bootstrap samples:
2, 5, 5, 6, 6
4, 5, 6, 6, 6
2, 2, 4, 5, 5
2, 2, 2, 4, 6
2, 2, 2, 2, 2
4, 6, 6, 6, 6
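A minimal Python sketch of this resampling, producing a percentile bootstrap
confidence interval for a mean (the weights are invented stand-ins for the 100
recorded candy bar weights, and NumPy is an assumed dependency):

import numpy as np

rng = np.random.default_rng(42)

# Invented stand-ins for the recorded candy bar weights (in grams).
weights = np.array([49.2, 50.1, 50.8, 49.7, 50.3, 49.9, 50.5, 49.4, 50.0, 50.6])

num_resamples = 10_000

# Each bootstrap sample is drawn from the original sample with replacement,
# exactly as in the numerical example above, and its mean is recorded.
boot_means = np.array([
    rng.choice(weights, size=weights.size, replace=True).mean()
    for _ in range(num_resamples)
])

# Percentile bootstrap: the middle 95% of the resampled means gives the interval.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(lower, upper)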
The name comes from the saying about pulling yourself up by your own bootstraps:
try as hard as you can, you cannot lift yourself into the air by tugging at pieces of
leather on your boots.