Refresher

REFRESHER: REVIEW OF BASIC STATISTICAL CONCEPTS
STAT 4020/MSA 5020
Regression Analysis
FALL 2014
Arthur Yeh
1. Introduction
Typically a problem in statistics seeks to study a
particular population, such as all BGSU undergraduate
students or the lifetime of a certain brand of tire.
In most cases it is not possible to examine the entire
population, so we work with a subset of the
population called a sample.
A statistical study typically has two phases. In the
first, descriptive statistics are used to explore the data.
In the second, inferential statistics are used to
generalize from the sample to the population.
Sampling and Statistics
The most common type of sampling is simple random
sampling where every item in the population is
equally likely to be selected.
Any numerical summary from a sample is a statistic
and each one has a different sampling distribution
that describes how the sampling statistic varies from
sample to sample (even though we will typically just
obtain one sample).
Understanding the sampling distribution provides the
guidelines for the inference process.
2. Descriptive Statistics
Table 2.1 (p. 8) shows the 5-year returns as of July
2002 for a random sample of 83 mutual funds.
Just looking at a list of numbers like this provides little
useful information.
The field of descriptive statistics can provide several
ways to meaningfully summarize such lists, even when
there are far more data.
Frequency Distributions
The table below is constructed by breaking the return
rates down into 7 categories of equal width.
Each observation (i.e. mutual fund) falls into a unique
bin because rates are to the nearest .1%.

5-year rates of return    # of funds
-8% to -4.01%              5
-4% to -0.01%              6
0% to 3.99%               17
4% to 7.99%               34
8% to 11.99%              12
12% to 15.99%              8
16% to 19.99%              1
Histograms (for numerical variables)
A graphical method to display a frequency distribution.
You can get a quick look at the data's symmetry and spread.
[Figure: histogram of 5yr ret (horizontal axis) against Frequency (vertical axis)]
Numerical Summaries (numerical variables)
These are single numbers computed from the sample
to describe some characteristic of the data set.
Measures of location include the mean, median and
the first and third quartiles.
Measures of spread (variability) include the standard
deviation, range and interquartile range.
A measure of the spread relative to the mean is the
coefficient of variation (standard deviation/mean).
Displaying Descriptive Statistics
Variable   N    Mean    Median   TrMean   StDev   SE Mean
5yr ret    83   5.371   5.100    5.391    5.229   0.574

Variable   Minimum   Maximum   Q1      Q3
5yr ret    -7.800    17.000    2.700   8.900

A boxplot is a good way to display many of the summary stats.
[Figure: boxplot of 5yr ret]
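As a quick check on the summary output for the 5-year returns, the derived quantities can be recomputed from the reported values. This is a minimal Python sketch, not Minitab's own procedure; the variable names are illustrative, and the coefficient of variation is taken as standard deviation divided by mean.

```python
import math

# Summary values reported for "5yr ret" in the descriptive statistics output.
n = 83
mean = 5.371
st_dev = 5.229
minimum, maximum = -7.800, 17.000
q1, q3 = 2.700, 8.900

# Standard error of the mean: sample standard deviation over sqrt(n).
se_mean = st_dev / math.sqrt(n)

# Spread measures derived from the reported quantities.
data_range = maximum - minimum
iqr = q3 - q1
coef_variation = st_dev / mean  # spread relative to the mean

print(f"SE Mean = {se_mean:.3f}")
print(f"Range = {data_range:.1f}, IQR = {iqr:.1f}")
print(f"Coefficient of variation = {coef_variation:.3f}")
```

The recomputed standard error agrees with the SE Mean column to three decimals.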
3. The Normal Distribution
A continuous random variable can take any value
over a given range.
The most important one is no doubt the normal
random variable, whose probability distribution is
often depicted as bell-shaped.
It is centered at its mean (\(\mu\)) and most of its probability
(about 99.7%) is within \(\pm 3\sigma\) (3 standard deviations) of the mean.
The Standard Normal Distribution
[Figure: standard normal curve, horizontal axis from -4 to 4. Units of measurement are standard deviations above/below the mean.]
A Simple Example (2.2 on p. 22)
A large retail firm has accounts receivable that
are assumed to be normal with mean \(\mu\) =
$281 and standard deviation \(\sigma\) = $35.
What proportion of accounts are above $316?
Above what value do 13.57% of the accounts lie?
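Both questions in this example can be answered with the normal distribution functions in Python's standard library. The sketch below assumes the stated model (normal, mean $281, standard deviation $35); the variable names are illustrative.

```python
from statistics import NormalDist

# Accounts receivable model from the example: normal, mean 281, sd 35.
accounts = NormalDist(mu=281, sigma=35)

# Proportion above $316: upper-tail area beyond z = (316 - 281) / 35 = 1.
prop_above_316 = 1 - accounts.cdf(316)

# Value with 13.57% of accounts above it: the (1 - 0.1357) quantile.
cutoff = accounts.inv_cdf(1 - 0.1357)

print(f"P(X > 316) = {prop_above_316:.4f}")
print(f"13.57% of accounts lie above ${cutoff:.2f}")
```

The first answer is the area beyond one standard deviation above the mean, roughly 0.1587. The second corresponds to a z-value of about 1.10, which puts the cutoff near $319.50.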
4. Populations, Samples and Sampling Distributions
We use a statistic's sampling distribution to make
inferences about the population.
For example, \(\bar{x}\) (the sample average) is the statistic most
often used to make inferences about the population
mean \(\mu\).
The sample mean is a random variable which follows
a certain probability distribution.
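The Sampling Distribution of \(\bar{x}\)
Assume the random variable \(X\) has mean \(\mu\) and
standard deviation \(\sigma\).
We observe a random sample of size \(n\) drawn from
\(X\) and compute the sample average \(\bar{x}\). The
expected value of \(\bar{x}\) is \(\mu\) and the standard deviation
of \(\bar{x}\) is
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]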
Distributional Form
If the observations come from a normal distribution,
then the distribution of \(\bar{x}\) is also normal.
If the observations come from any other non-normal
distribution, then the Central Limit Theorem asserts
that the distribution of \(\bar{x}\) is approximately normal
as long as the sample size is large (usually 30 or
more from a practical perspective).
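A small simulation can illustrate the Central Limit Theorem claim. The sketch below is an assumption-laden illustration, not part of the course material: it draws repeated samples of size 30 from an exponential distribution (clearly non-normal, with mean 1 and standard deviation 1) and compares the spread of the sample means with the standard deviation divided by the square root of the sample size. The seed, sample count, and choice of distribution are arbitrary.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Exponential distribution with rate 1: mean 1 and standard deviation 1.
pop_mean = 1.0
pop_sd = 1.0
n = 30              # size of each sample
num_samples = 5000  # number of repeated samples

# Compute the sample average for each simulated sample.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

observed_mean = statistics.fmean(sample_means)
observed_sd = statistics.stdev(sample_means)
theoretical_sd = pop_sd / n ** 0.5

print(f"Mean of sample means: {observed_mean:.3f} (population mean {pop_mean})")
print(f"SD of sample means:   {observed_sd:.3f} (sigma/sqrt(n) = {theoretical_sd:.3f})")
```

The average of the sample means should sit close to the population mean, and their standard deviation close to the theoretical standard error.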
Another Example (2.3 on p. 25)
In a manufacturing process, the diameter of a
certain part averages 40 cm (\(\mu\) = 40). The
variation appears to be normally distributed with
standard deviation .2 cm (\(\sigma\) = .2).
If a sample of 16 parts is chosen, what is the
probability that the average diameter is greater
than 40.1 cm?
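This question applies the sampling distribution of the sample mean: with 16 parts, the standard error is .2 divided by the square root of 16, or .05 cm, so 40.1 cm is two standard errors above the mean. A minimal Python sketch, with illustrative variable names:

```python
from statistics import NormalDist

mu = 40.0    # process mean diameter (cm)
sigma = 0.2  # process standard deviation (cm)
n = 16       # number of parts sampled

# Standard deviation of the sample mean (standard error).
standard_error = sigma / n ** 0.5

# The sample mean is normal with mean mu and sd equal to the standard error.
sample_mean_dist = NormalDist(mu=mu, sigma=standard_error)
prob_above = 1 - sample_mean_dist.cdf(40.1)

print(f"Standard error = {standard_error:.3f} cm")
print(f"P(sample mean > 40.1) = {prob_above:.4f}")
```

The result is the upper-tail area beyond z = 2, roughly 0.0228.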
5. Estimating a Population Mean
Point estimates are single numbers used as an
estimate of a population parameter.
In general, these will never be exactly right so we
use them as the basis for an interval estimate.
Because we base the interval on the sampling
distribution, we know the probability content of the
interval.
Example 2.5 (p. 28, \(\sigma\) is known)
In a department store, the charge account balances
of customers have a standard deviation of $35.
They will take a sample of 100 accounts and want
to make a 95% interval estimate (95% confidence
interval) for the mean balance of all accounts.
If the sample mean is $245, what is the interval
estimate?
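With the population standard deviation known, the interval is the sample mean plus or minus a normal critical value times the standard error. The following Python sketch works through the numbers from this example; the variable names are illustrative.

```python
from statistics import NormalDist

sigma = 35.0       # known population standard deviation ($)
n = 100            # sample size
x_bar = 245.0      # observed sample mean ($)
confidence = 0.95

# Two-sided critical value from the standard normal (about 1.96 for 95%).
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Margin of error: critical value times the standard error sigma / sqrt(n).
margin = z_crit * sigma / n ** 0.5
lower, upper = x_bar - margin, x_bar + margin

print(f"z = {z_crit:.2f}, margin = {margin:.2f}")
print(f"95% interval estimate: (${lower:.2f}, ${upper:.2f})")
```

The interval works out to roughly $238.14 to $251.86.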
Example 2.6 (p. 29, \(\sigma\) is unknown)
A manufacturer wants to estimate the average life
of an expensive electrical component.
Because the test destroys the component, a small
sample is used.
If the test results are 92, 110, 115, 103 and 98,
find a 95% interval estimate of the average
lifetime.
Note that in this example, an implicit assumption is
that the lifetime of an expensive electrical
component is normally distributed.
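When the population standard deviation is unknown, the sample standard deviation takes its place and the critical value comes from the t distribution with n - 1 degrees of freedom. The Python sketch below uses the five test results from this example. Because the standard library has no t quantile function, the critical value 2.776 (4 degrees of freedom, 95% two-sided) is hard-coded from a t table; that constant and the variable names are assumptions of the sketch.

```python
import statistics

lifetimes = [92, 110, 115, 103, 98]  # test results from the example
n = len(lifetimes)

x_bar = statistics.mean(lifetimes)  # sample mean
s = statistics.stdev(lifetimes)     # sample standard deviation (n - 1 divisor)

# t critical value for 95% confidence with n - 1 = 4 degrees of freedom,
# taken from a t table rather than computed.
T_CRIT_95_DF4 = 2.776

margin = T_CRIT_95_DF4 * s / n ** 0.5
lower, upper = x_bar - margin, x_bar + margin

print(f"mean = {x_bar:.1f}, s = {s:.2f}")
print(f"95% interval estimate: ({lower:.1f}, {upper:.1f})")
```

With these data the interval runs from about 92.2 to about 115.0.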
6. Hypothesis Tests About a Population Mean
In the previous section we discussed estimating an
unknown parameter.
Here we use the sample data to test a
preconceived belief about the value of a
parameter.
We state this belief in a hypothesis, thus the
procedure is called hypothesis testing.
An Example
Is the population average equal to 10?
\(H_0: \mu = 10\) (On average, the mean is 10)
\(H_a: \mu \ne 10\) (No, it isn't)
Notation and Terminology
\(H_0\) is called the Null Hypothesis
\(H_a\) is called the Alternative or Research Hypothesis
We will set up a decision rule to determine whether
we reject or do not reject the null hypothesis. The
decision rule is based on the level of significance
(usually referred to as \(\alpha\)) of the test.
Level of Significance
If we perform the test and the null hypothesis really is
correct, there is a chance we will say it is false (this is
called a Type-I error) because we happened to get
some fairly extreme values in our sample.
We control for this by setting up the decision rule so
there is a small probability of this happening.
This probability (which is the allowable probability of
making a Type-I error) is the level of significance.
A Type-II error is one in which the null hypothesis is
really false and we fail to reject the null hypothesis.
P-Value
Most software packages report the results of a
hypothesis test by computing the \(p\)-value of the test.
This is the probability, computed assuming the null
hypothesis is true, of a test statistic at least as extreme
as the one observed; it says how far out in the
tail the test statistic fell.
The decision rule is:
Reject \(H_0\) if the \(p\)-value < \(\alpha\) (the level of significance)
Example 2.8 (p. 36)
A company that manufactures rulers wants to ensure
the average length is correct (12 inches).
From each production run, a sample of 25 rulers is
selected and checked with accurate equipment.
One particular sample had an average of 12.02
inches with a standard deviation of .02 inch.
Using a 1% level of significance, test to see if
production is on target.
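The ruler test can be worked as a two-sided one-sample t test of a mean of 12 inches against the alternative that the mean differs from 12. The Python sketch below uses the critical-value form of the decision rule, since the standard library has no t distribution; the critical value 2.797 (24 degrees of freedom, 1% two-sided) is hard-coded from a t table and is an assumption of the sketch, as are the variable names.

```python
mu_0 = 12.0    # target length under the null hypothesis (inches)
x_bar = 12.02  # sample average (inches)
s = 0.02       # sample standard deviation (inches)
n = 25         # sample size

# Test statistic: distance of the sample mean from the target,
# measured in estimated standard errors.
t_stat = (x_bar - mu_0) / (s / n ** 0.5)

# Two-sided critical value at the 1% level with n - 1 = 24 degrees of
# freedom, taken from a t table rather than computed.
T_CRIT_01_DF24 = 2.797

reject_null = abs(t_stat) > T_CRIT_01_DF24

print(f"t = {t_stat:.2f}, critical value = {T_CRIT_01_DF24}")
print("Reject H0" if reject_null else "Do not reject H0")
```

The test statistic is about 5, well beyond the critical value, so at the 1% level the sample gives evidence that production is off target.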
7. Estimating the Difference Between Two Population Means
Here we have two samples and two sets of statistics
drawn from two populations, and we wish to test
\(H_0: \mu_1 = \mu_2\)
What Should We Use?
If we know the two population variances are about
equal, use the exact (pooled-variance) procedure.
If we think they differ a lot, we should use the
approximate (unequal-variance) result.
If we do not really know, the approximate
approach is probably best.
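The two approaches can be compared side by side. The Python sketch below assumes that the exact procedure is the pooled-variance t statistic and that the approximate result is the unequal-variance (Welch-Satterthwaite) version; the slides do not spell out either formula, so both are assumptions here. The function names are illustrative, and the summary statistics in the usage lines are hypothetical values, not the mutual fund data.

```python
import math

def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """t statistic and degrees of freedom assuming equal population variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, df

def unequal_var_t(mean1, s1, n1, mean2, s2, n2):
    """t statistic and approximate degrees of freedom without assuming equal variances."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    std_err = math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return (mean1 - mean2) / std_err, df

# Hypothetical summary statistics, for illustration only.
t_pooled, df_pooled = pooled_t(5.0, 4.0, 30, 6.0, 6.0, 40)
t_unequal, df_unequal = unequal_var_t(5.0, 4.0, 30, 6.0, 6.0, 40)

print(f"pooled:  t = {t_pooled:.3f}, df = {df_pooled}")
print(f"unequal: t = {t_unequal:.3f}, df = {df_unequal:.1f}")
```

When the two sample sizes and sample standard deviations are equal, the two versions give the same statistic and the same degrees of freedom. They diverge as the variances and sample sizes become more unbalanced, which is why the approximate approach is the safer default.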
Example 2.10 (p. 41)
For the 83 mutual funds we discussed earlier (Table
2.1 on p. 8), we want to compare the five-year
returns for load funds versus no-load funds.
8. Hypothesis Tests About the Difference Between Two Population Means
Our test is of the form:
\(H_0: \mu_1 = \mu_2\) (\(\mu_1 - \mu_2 = 0\)) (No difference)
\(H_a: \mu_1 \ne \mu_2\) (\(\mu_1 - \mu_2 \ne 0\)) (There is a difference)
Example 2.11 (p. 45)
To test the hypothesis that load and no-load funds
have the same return, we write:
\(H_0: \mu_{\text{load}} = \mu_{\text{no-load}}\)
\(H_a: \mu_{\text{load}} \ne \mu_{\text{no-load}}\)
Exercise #1 (due Sept. 4, 2014)
#44 (p. 49)
#45 (p. 50)
#46 (p. 50)
Note: You need to include computer output (please
use Minitab for this exercise) when you answer these
questions. The answers should be typed rather than
handwritten.
