
REFRESHER: REVIEW OF BASIC STATISTICAL CONCEPTS
STAT 4020/MSA 5020
Regression Analysis
FALL 2014
Arthur Yeh
1. Introduction
Typically a problem in statistics seeks to study a particular population, such as all BGSU undergraduate students or the lifetime of a certain brand of tire.
In most cases it is not possible to examine the entire
population, so we work with a subset of the
population called a sample.
A statistical study typically has two phases. In the
first, descriptive statistics are used to explore the data.
In the second, inferential statistics are used to
generalize from the sample to the population.
Sampling and Statistics
The most common type of sampling is simple random sampling, where every item in the population is equally likely to be selected.
Any numerical summary computed from a sample is a statistic, and each statistic has its own sampling distribution that describes how it varies from sample to sample (even though we will typically obtain just one sample).
Understanding the sampling distribution provides the
guidelines for the inference process.
2. Descriptive Statistics
Table 2.1 (p. 8) shows the 5-year returns as of July 2002 for a random sample of 83 mutual funds.
Just looking at a list of numbers like this provides little
useful information.
The field of descriptive statistics can provide several
ways to meaningfully summarize such lists, even when
there are far more data.
Frequency Distributions
The table below is constructed by breaking the return rates down into 7 categories of equal width.
Each observation (i.e. mutual fund) falls into a unique bin because rates are recorded to the nearest 0.1%.

5-year rate of return    # of funds
-8% to -4.01%                 5
-4% to -0.01%                 6
0% to 3.99%                  17
4% to 7.99%                  34
8% to 11.99%                 12
12% to 15.99%                 8
16% to 19.99%                 1
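The binning step can be sketched in a few lines of Python. This is only an illustration: the `returns` list is made-up data (not the Table 2.1 funds), and `frequency_table` is a hypothetical helper name. The bins are taken as half-open intervals [lo, lo + 4), which matches the table's categories when rates are recorded to 0.1%.

```python
# Hypothetical 5-year returns in percent; NOT the Table 2.1 data.
returns = [-7.8, -4.5, -1.2, 0.5, 2.7, 3.9, 4.0, 5.1, 6.3, 7.9, 8.9, 12.4, 17.0]

LOW = -8.0    # left edge of the first bin
WIDTH = 4.0   # common bin width
N_BINS = 7    # number of categories, as on the slide


def frequency_table(values, low=LOW, width=WIDTH, n_bins=N_BINS):
    """Count the values falling in each equal-width bin [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - low) // width)
        if 0 <= index < n_bins:   # values outside all bins are ignored
            counts[index] += 1
    return counts


counts = frequency_table(returns)
for i, c in enumerate(counts):
    lo = LOW + WIDTH * i
    print(f"{lo:6.1f}% to under {lo + WIDTH:5.1f}%: {c}")
```

Every observation lands in exactly one bin, so the counts add up to the sample size.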
Histograms (for numerical variables)
A graphical method to display a frequency distribution.
It gives a quick look at the data's symmetry and spread.
[Figure: histogram of 5yr ret; horizontal axis: 5yr ret, vertical axis: Frequency]
Numerical Summaries (numerical variables)
These are single numbers computed from the sample to describe some characteristic of the data set.
Measures of location include the mean, median and the first and third quartiles.
Measures of spread (variability) include the standard deviation, range and interquartile range.
A measure of the spread relative to the mean is the coefficient of variation (standard deviation/mean).
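As a rough illustration, these summaries can be computed with Python's standard `statistics` module. The `sample` values are made up for the sketch. Quartile conventions differ between packages, so the quartiles here (the module's default method) need not match Minitab's exactly.

```python
import statistics

# Hypothetical sample; any list of numbers would do.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

# Measures of location
mean = statistics.mean(sample)
median = statistics.median(sample)
q1, q2, q3 = statistics.quantiles(sample, n=4)  # quartiles, default method

# Measures of spread
stdev = statistics.stdev(sample)        # sample standard deviation (n - 1)
data_range = max(sample) - min(sample)
iqr = q3 - q1                           # interquartile range

# Spread relative to the mean
cv = stdev / mean                       # coefficient of variation

print(f"mean={mean:.3f} median={median:.3f} Q1={q1:.3f} Q3={q3:.3f}")
print(f"stdev={stdev:.3f} range={data_range:.3f} IQR={iqr:.3f} CV={cv:.3f}")
```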
Displaying Descriptive Statistics

Variable    N    Mean  Median  TrMean  StDev  SE Mean
5yr ret    83   5.371   5.100   5.391  5.229    0.574

Variable  Minimum  Maximum     Q1     Q3
5yr ret    -7.800   17.000  2.700  8.900

A boxplot is a good way to display many of the summary stats.
[Figure: boxplot of 5yr ret]
3. The Normal Distribution
A continuous random variable can take any value over a given range.
The most important one is no doubt the normal random variable, whose probability distribution is often depicted as bell-shaped.
It is centered at its mean (μ) and most of its probability is within ±3σ (3 standard deviations) of the mean.
The Standard Normal Distribution
[Figure: standard normal density curve; horizontal axis from -4 to 4]
Units of measurement are standard deviations above/below the mean
A Simple Example (2.2 on p. 22)
A large retail firm has accounts receivable that are assumed to be normal with mean μ = $281 and standard deviation σ = $35.
What proportion of accounts are above $316?
Above what value do 13.57% of the accounts lie?
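Both questions can be checked with `statistics.NormalDist` from the Python standard library. This is a sketch of the arithmetic only, not the textbook's worked solution; the variable names are illustrative.

```python
from statistics import NormalDist

# Accounts receivable modeled as normal with mean 281 and sd 35 (from the slide).
accounts = NormalDist(mu=281, sigma=35)

# Proportion above $316: 316 is one standard deviation above the mean.
p_above_316 = 1 - accounts.cdf(316)

# Value with 13.57% of accounts above it, i.e. the 86.43rd percentile.
cutoff = accounts.inv_cdf(1 - 0.1357)

print(f"P(X > 316) = {p_above_316:.4f}")        # about 0.1587
print(f"13.57% of accounts exceed ${cutoff:.2f}")  # about $319.50
```

The second answer corresponds to z = 1.10, since 281 + 1.10 × 35 = 319.5.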
4. Populations, Samples and Sampling Distributions
A statistic uses its own sampling distribution to make inference about the population.
For example, x̄ (the sample average) is the statistic most often used to make inference about the population mean μ.
The sample mean is a random variable which follows a certain probability distribution.
The Sampling Distribution of x̄
Assume the random variable X has mean μ and standard deviation σ.
We observe a random sample of size n drawn from X and compute the sample average x̄. The expected value of x̄ is μ and the standard deviation of x̄ is

σ_x̄ = σ / √n

Distributional Form
If the observations come from a normal distribution, then the distribution of x̄ is also normal.
If the observations come from a non-normal distribution, then the Central Limit Theorem asserts that the distribution of x̄ is approximately normal as long as the sample size is large (usually 30 or more from a practical perspective).
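A small simulation can make the Central Limit Theorem concrete. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (chosen only because it is clearly non-normal) and checks that sample means for n = 30 center at μ with spread close to σ/√n. The seed, sample size, and number of repetitions are arbitrary choices.

```python
import random
import statistics
from math import sqrt

random.seed(4020)  # arbitrary seed so the run is repeatable

N = 30        # sample size
REPS = 5000   # number of simulated samples


def one_sample_mean(n):
    """Draw n values from an exponential(1) population and return their mean."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])


means = [one_sample_mean(N) for _ in range(REPS)]

center = statistics.fmean(means)   # should be near the population mean, 1
spread = statistics.stdev(means)   # should be near 1 / sqrt(30)

print(f"mean of sample means: {center:.3f} (theory: 1.000)")
print(f"sd of sample means:   {spread:.3f} (theory: {1 / sqrt(N):.3f})")
```

A histogram of `means` would look roughly bell-shaped even though the individual observations are strongly skewed.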
Another Example (2.3 on p. 25)
In a manufacturing process, the diameter of a certain part averages 40 cm (μ = 40). The variation appears to be normally distributed with standard deviation 0.2 cm (σ = 0.2).
If a sample of 16 parts is chosen, what is the probability that the average diameter is greater than 40.1 cm?
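The calculation can be sketched as follows, using the standard deviation of the sample mean, σ/√n, as the scale for the sample average. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 40.0, 0.2, 16       # values given in the example

standard_error = sigma / sqrt(n)   # sd of the sample mean: 0.2 / 4 = 0.05
xbar_dist = NormalDist(mu, standard_error)

# P(sample mean > 40.1); 40.1 is two standard errors above 40.
p_greater = 1 - xbar_dist.cdf(40.1)

print(f"standard error = {standard_error:.3f}")
print(f"P(xbar > 40.1) = {p_greater:.4f}")   # about 0.0228
```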
5. Estimating a Population Mean
Point estimates are single numbers used as an estimate of a population parameter.
In general, these will never be exactly right so we
use them as the basis for an interval estimate.
Because we base the interval on the sampling
distribution, we know the probability content of the
interval.
Example 2.5 (p. 28, σ is known)
In a department store, the charge account balances of customers have a standard deviation of $35.
They will take a sample of 100 accounts and want to make a 95% interval estimate (95% confidence interval) for the mean balance of all accounts.
If the sample mean is $245, what is the interval estimate?
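With σ known, the interval is x̄ ± z · σ/√n. A minimal sketch of that arithmetic, with the z value taken from the standard normal distribution (variable names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

xbar, sigma, n = 245.0, 35.0, 100   # values given in the example
confidence = 0.95

# Two-sided critical value: leaves 2.5% in each tail (about 1.96).
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin = z * sigma / sqrt(n)        # 1.96 * 3.5, about 6.86
lower, upper = xbar - margin, xbar + margin

print(f"95% CI for the mean balance: (${lower:.2f}, ${upper:.2f})")
```

This gives roughly $238.14 to $251.86.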
Example 2.6 (p. 29, σ is unknown)
A manufacturer wants to estimate the average life of an expensive electrical component.
Because the test destroys the component, a small sample is used.
If the test results are 92, 110, 115, 103 and 98, find a 95% interval estimate of the average lifetime.
Note that in this example, an implicit assumption is that the lifetime of an expensive electrical component is normally distributed.
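With σ unknown, the sample standard deviation s replaces σ and a t critical value replaces z. The sketch below hard-codes the critical value 2.776 (4 degrees of freedom, 0.025 in the upper tail) as read from a standard t table, since the Python standard library has no t distribution. That constant is an input to the sketch, not something the code derives.

```python
from math import sqrt
import statistics

lifetimes = [92, 110, 115, 103, 98]   # test results from the example
n = len(lifetimes)

xbar = statistics.mean(lifetimes)     # 103.6
s = statistics.stdev(lifetimes)       # sample sd with n - 1 in the denominator

T_CRIT = 2.776   # t table value for df = n - 1 = 4, 95% two-sided (assumed input)

margin = T_CRIT * s / sqrt(n)
lower, upper = xbar - margin, xbar + margin

print(f"xbar = {xbar:.1f}, s = {s:.3f}")
print(f"95% CI for the mean lifetime: ({lower:.1f}, {upper:.1f})")
```

The interval is wide (about 92.2 to 115.0) because only five components were tested.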
6. Hypothesis Tests About a Population Mean
In the previous section we discussed estimating an unknown parameter.
Here we use the sample data to test a
preconceived belief about the value of a
parameter.
We state this belief in a hypothesis, thus the
procedure is called hypothesis testing.
An Example
Is the population average equal to 10?

H₀: μ = 10 (On average, the mean is 10)
Hₐ: μ ≠ 10 (No it isn't)
Notation and Terminology
H₀ is called the Null Hypothesis
Hₐ is called the Alternative or Research Hypothesis
We will set up a decision rule to determine whether we reject or do not reject the null hypothesis. The decision rule is based on the level of significance (usually referred to as α) of the test.
Level of Significance
If we perform the test and the null hypothesis really is correct, there is a chance we will say it is false (this is called a Type-I error) because we happened to get some fairly extreme values in our sample.
We control for this by setting up the decision rule so
there is a small probability of this happening.
This probability (which is the allowable probability of
making a Type-I error) is the level of significance.
A Type-II error is one in which the null hypothesis is
really false and we fail to reject the null hypothesis.
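The meaning of the level of significance can be illustrated by simulation: when the null hypothesis is true, a test run at α = 0.05 should reject in about 5% of samples. The sketch below assumes a two-sided z test with known σ. The population values, sample size, seed, and number of trials are all made up for the illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(2014)  # arbitrary seed so the run is repeatable

ALPHA = 0.05
MU0, SIGMA, N = 10.0, 2.0, 25   # hypothetical population where H0 is true
TRIALS = 10_000

z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)   # two-sided critical value


def rejects_null():
    """Draw one sample with H0 true and report whether the z test rejects."""
    sample = [random.gauss(MU0, SIGMA) for _ in range(N)]
    z = (fmean(sample) - MU0) / (SIGMA / sqrt(N))
    return abs(z) > z_crit


# Each rejection here is a Type-I error, because H0 really is true.
type_one_rate = sum(rejects_null() for _ in range(TRIALS)) / TRIALS

print(f"observed Type-I error rate: {type_one_rate:.3f} (alpha = {ALPHA})")
```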
P-Value
Most software packages report the results of a hypothesis test by computing the p-value of the test.
This is just a probability that says how far out in the tail the test statistic fell.
The decision rule is:

Reject H₀ if the p-value < α (the level of significance)


Example 2.8 (p. 36)
A company that manufactures rulers wants to ensure the average length is correct (12 inches).
From each production run, a sample of 25 rulers is selected and checked with accurate equipment.
One particular sample had an average of 12.02 inches with a standard deviation of 0.02 inch.
Using a 1% level of significance, test to see if production is on target.
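A sketch of the test arithmetic, assuming a two-sided one-sample t test of H₀: μ = 12. The critical value 2.797 (24 degrees of freedom, 0.005 in each tail) is copied from a standard t table because the Python standard library has no t distribution. It is an assumed input, and the decision is made by comparing the statistic with it rather than by computing a p-value.

```python
from math import sqrt

mu0 = 12.0            # target length under H0
xbar, s, n = 12.02, 0.02, 25   # sample results from the example

standard_error = s / sqrt(n)           # 0.02 / 5 = 0.004
t_stat = (xbar - mu0) / standard_error

T_CRIT = 2.797   # t table value for df = 24, alpha = 0.01 two-sided (assumed input)

reject_null = abs(t_stat) > T_CRIT

print(f"t = {t_stat:.2f}, critical value = {T_CRIT}")
print("Reject H0: production appears off target" if reject_null
      else "Do not reject H0")
```

The statistic is about 5, well beyond the critical value, so the null hypothesis is rejected at the 1% level.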
7. Estimating the Difference Between Two Population Means
Here we have two samples and two sets of statistics drawn from two populations, and we wish to test
H₀: μ₁ = μ₂
What Should We Use?
If we know the two population variances are about equal, use the exact procedure.
If we think they differ a lot, we should use the
approximate result.
If we do not really know, the approximate
approach is probably best.
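The slides do not name the two procedures. The sketch below assumes the "exact" procedure is the pooled-variance t statistic and the "approximate" one is the unequal-variance (Welch-Satterthwaite) statistic, which is the usual textbook pairing. The summary numbers and function names are hypothetical.

```python
from math import sqrt


def pooled_t(m1, s1, n1, m2, s2, n2):
    """Pooled-variance t statistic and its degrees of freedom."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, n1 + n2 - 2


def welch_t(m1, s1, n1, m2, s2, n2):
    """Unequal-variance t statistic with Satterthwaite degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return (m1 - m2) / se, df


# Hypothetical summary statistics: mean, sd, sample size for each group.
group1 = (6.0, 5.0, 30)
group2 = (5.0, 5.0, 50)

t_pooled, df_pooled = pooled_t(*group1, *group2)
t_welch, df_welch = welch_t(*group1, *group2)

print(f"pooled: t = {t_pooled:.3f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.3f}, df = {df_welch:.1f}")
```

With equal sample standard deviations the two statistics coincide and only the degrees of freedom differ. The gap between the methods grows as the two variances and sample sizes become more unbalanced.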
Example 2.10 (p. 41)
For the 83 mutual funds we discussed earlier (Table 2.1 on p. 8), we want to compare the five-year returns for load funds versus no-load funds.
8. Hypothesis Tests About the Difference Between Two Population Means
Our test is of the form:

H₀: μ₁ = μ₂ (μ₁ − μ₂ = 0) (No difference)
Hₐ: μ₁ ≠ μ₂ (μ₁ − μ₂ ≠ 0) (There is a difference)
Example 2.11 (p. 45)
To test the hypothesis that load and no-load funds have the same return, we write:

H₀: μ_load = μ_no-load
Hₐ: μ_load ≠ μ_no-load
Exercise #1 (due Sept. 4, 2014)
#44 (p. 49)
#45 (p. 50)
#46 (p. 50)

Note: You need to include computer outputs (please use Minitab for this exercise) when you answer these questions. The answers should be typed, rather than handwritten.