
REVIEW OF STATISTICAL INFERENCE: ESTIMATION & HYPOTHESIS TESTING






Introduction to Statistical Inference


Estimation
Constructing Confidence Intervals
Hypothesis Testing
Examples

 Introduction to Statistical Inference:


In any experiment, the goal is to draw inferences or make decisions about some hypotheses concerning the situation being studied. Extreme differences detected in the data may lead us to decide to take action. However, because any decision rests on a limited amount of data, there is a risk of taking the wrong decision: uncertainty is unavoidable.

Definitions:

Statistical inference: the process of inferring something about a population from a sample drawn from that population.

Population: all possible values of some random variable X (response or dependent variable).

Random sample: each member of the population has an equal chance of being included in the sample, and all members follow the same pattern of variation.

Parameters: characteristics of the population of X about which one wishes to obtain information. For instance: the average or expected value of X, $E(X) = \mu$, and the population variance, $\mathrm{Var}(X) = \sigma^2$, whose square root $\sigma$ is called the population standard deviation.

Sample statistics: quantities $U_n = f(X_1,\dots,X_n)$ computed from the sample $(X_1,\dots,X_n)$, where n is the number of observations in the sample, used as approximations of the corresponding parameters. For instance: the sample mean

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

and the sample variance

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$
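As a small illustration (not part of the original notes), these two statistics can be computed in Python with numpy, which is assumed to be available; the sample values below are arbitrary.

```python
import numpy as np

x = np.array([4.1, 5.3, 3.8, 4.9, 5.0])  # arbitrary illustrative sample

xbar = x.mean()        # sample mean: sum(x_i) / n
s2 = x.var(ddof=1)     # sample variance: sum((x_i - xbar)^2) / (n - 1)
print(xbar, s2)
```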

Estimator: given $f(x;\theta)$, the density function of X, with $\theta$ the unknown parameter, an estimator of $\theta$ is a random variable computed from the sample:

$$U = g(X_1,\dots,X_n),$$

a statistic that depends on f and n. Evaluating this formula at the specific observed values $x_1,\dots,x_n$,

$$\hat{\theta} = g(x_1,\dots,x_n)$$

is the single numerical value called the estimate of the parameter $\theta$. When is an estimator good for estimating a given parameter? U is an unbiased estimator of $\theta$ if $E(U) = \theta$.

 Estimation:
Two types of estimates:
1) Point estimate: a single numerical value used to estimate a parameter. However, two different samples will generally produce different estimated values.
2) Confidence interval: two numerical values defining a range of values within which the parameter falls with a specified degree of confidence (confidence coefficient or confidence level $1-\alpha$, most frequently 0.90, 0.95 and 0.99).

Probabilistic interpretation of a confidence interval: in repeated sampling, $100(1-\alpha)\%$ of all intervals will, in the long run, include the population parameter.

Constructing Confidence Intervals:

Confidence interval for a population mean $\mu$ ($\bar{x}$ = sample mean).

Confidence interval for the difference between two population means $\mu_1 - \mu_2$.

Confidence interval for the variance $\sigma^2$ of a normally distributed population ($s^2$ = sample variance).

Confidence interval for the ratio $\sigma_2^2/\sigma_1^2$ of the variances of two normally distributed populations.

Confidence interval for a population proportion p ($\hat{p}$ = sample proportion).

Confidence interval for the difference between two population proportions $p_1 - p_2$.

Reliability coefficients are found in the statistical tables of the corresponding sampling distributions.
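As an aside, the reliability coefficients used in the table below can also be obtained with software instead of printed tables. The following is a minimal sketch using Python with scipy.stats (an assumption: scipy is not part of the original notes); the confidence level and sample sizes are illustrative.

```python
from scipy import stats

alpha = 0.05            # illustrative confidence level: 1 - alpha = 0.95
n, n1, n2 = 15, 13, 17  # illustrative sample sizes

z = stats.norm.ppf(1 - alpha / 2)                  # z_{alpha/2} for normal-based intervals
t = stats.t.ppf(1 - alpha / 2, df=n - 1)           # t_{alpha/2} with n-1 degrees of freedom
chi2_lo = stats.chi2.ppf(alpha / 2, df=n - 1)      # chi-square lower critical value
chi2_hi = stats.chi2.ppf(1 - alpha / 2, df=n - 1)  # chi-square upper critical value
f_lo = stats.f.ppf(alpha / 2, dfn=n1 - 1, dfd=n2 - 1)      # F_{1-alpha/2}
f_hi = stats.f.ppf(1 - alpha / 2, dfn=n1 - 1, dfd=n2 - 1)  # F_{alpha/2}

print(z, t, chi2_lo, chi2_hi, f_lo, f_hi)
```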

ASSUMPTIONS AND CONFIDENCE INTERVALS

1) Normal population with known $\sigma$; non-normal population with known $\sigma$, $n > 30$; arbitrary population with unknown $\sigma$, $n > 30$ (use S in place of $\sigma$ when $\sigma$ is unknown):
$$\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
Sampling distribution: normal (Central Limit Theorem).

2) Normal population with unknown $\sigma$; two dependent normal populations (paired samples: apply the interval to the differences):
$$\bar{X} - t_{\alpha/2}\frac{S}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{X} + t_{\alpha/2}\frac{S}{\sqrt{n}}$$
Sampling distribution: Student's t with $n-1$ degrees of freedom.

3) Two independent normal populations with known $\sigma_1, \sigma_2$; two independent non-normal populations with known $\sigma_1, \sigma_2$, $n_1, n_2 > 30$:
$$(\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} \quad \text{for } \mu_1 - \mu_2$$
Sampling distribution: normal.

4) Two independent arbitrary populations with unknown $\sigma_1, \sigma_2$, $n_1, n_2 > 30$:
$$(\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}} \quad \text{for } \mu_1 - \mu_2$$
Sampling distribution: normal (Central Limit Theorem).

5) Two independent normal populations with unknown $\sigma_1 = \sigma_2$:
$$(\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2}\sqrt{\left(\frac{1}{n_1}+\frac{1}{n_2}\right)\frac{(n_1-1)S_1^2+(n_2-1)S_2^2}{n_1+n_2-2}} \quad \text{for } \mu_1 - \mu_2$$
Sampling distribution: Student's t with $n_1 + n_2 - 2$ degrees of freedom.

6) Normal population (confidence interval for $\sigma^2$):
$$\frac{(n-1)S^2}{\chi^2_{\alpha/2}} \;\le\; \sigma^2 \;\le\; \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}$$
Sampling distribution: chi-square with $n-1$ degrees of freedom.

7) Two independent normal populations (confidence interval for the ratio of variances):
$$F_{1-\alpha/2}\,\frac{S_2^2}{S_1^2} \;\le\; \frac{\sigma_2^2}{\sigma_1^2} \;\le\; F_{\alpha/2}\,\frac{S_2^2}{S_1^2}$$
Sampling distribution: F with $n_1 - 1$ numerator degrees of freedom and $n_2 - 1$ denominator degrees of freedom.

8) Binomial population, $n > 30$:
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \quad \text{for } p$$
Sampling distribution: normal.

9) Two independent binomial populations, $n_1, n_2 > 30$:
$$(\hat{p}_1-\hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \quad \text{for } p_1 - p_2$$
Sampling distribution: normal.

Examples: see the worked examples at the end of this document.

Hypothesis Testing:

The purpose of a test of significance is to assess the evidence provided by the data in favour of some statement.

Example:
XXL Company buys milk from several suppliers as the essential raw material for its cheese. XXL suspects that some producers are adding water to their milk to increase their profits. Excess water can be detected by determining the freezing point of the milk. The freezing temperature of natural milk varies normally, with a mean of $\mu = -0.545\,^{\circ}\mathrm{C}$ and a standard deviation of $\sigma = 0.008\,^{\circ}\mathrm{C}$. Added water raises the freezing temperature. XXL's laboratory manager measures the freezing temperature of five consecutive lots of milk from one producer. The mean measurement is $\bar{x} = -0.538\,^{\circ}\mathrm{C}$. Is this good evidence that this producer is adding water to the milk? (Moore-McCabe, p. 463)

1. Suppose that no water has been added, so that the mean freezing point of the population of all milk from this producer is $\mu = -0.545\,^{\circ}\mathrm{C}$. What, then, is the probability that five measurements would give a sample mean as high as $-0.538\,^{\circ}\mathrm{C}$ or higher?

2. An outcome this high or higher has probability 0.025 if natural milk is being measured.

3. Hence, since a mean freezing temperature as high as that observed would occur in only 2.5 out of 100 samples of natural milk, there is evidence that the producer is watering the milk.
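As a check on step 2, here is a minimal sketch of the probability calculation in Python using scipy.stats (assumed available, not part of the original notes); the numerical values are those quoted in the example.

```python
from math import sqrt
from scipy import stats

mu0, sigma, n = -0.545, 0.008, 5      # null mean, population sd, sample size
xbar = -0.538                         # observed sample mean

z = (xbar - mu0) / (sigma / sqrt(n))  # standardized sample mean, approx. 1.96
p_value = stats.norm.sf(z)            # P(Z >= z), approx. 0.025
print(z, p_value)
```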

Definitions and Steps in Hypothesis Testing:

1. Hypotheses:

Hypothesis: a statement (conjecture, supposition) about one or more populations,

expressed in terms of some parameter or parameters.

Null hypothesis ($H_0$): a statement of no effect or no difference. The test of significance is designed to assess the strength of the evidence against the null hypothesis. It contains a statement of equality (=, $\le$, or $\ge$).

Alternative hypothesis ($H_1$): the statement we hope or suspect to be true instead of $H_0$. The null and the alternative hypotheses are complementary.

Since, in all common significance-testing situations, only the parameter value in $H_0$ that is closest to $H_1$ influences the form of the test, in $H_0$ the parameter is given a specific value ($\theta = \theta_0$). Meanwhile, $H_1$ may be one-sided ($\theta > \theta_0$ or $\theta < \theta_0$) or two-sided ($\theta \neq \theta_0$).

Null hypothesis versus alternative hypothesis: if the null hypothesis is not rejected, the data on which the test is based do not provide sufficient evidence to cause rejection. If the testing procedure leads to rejection, the data are not compatible with the null hypothesis but provide evidence in favour of the alternative hypothesis.

2. Test statistic:
A statistic computed from the sample data that serves as the decision maker: it shows whether or not the data give evidence against the null hypothesis. One has to:
a) Choose a test statistic to test the null hypothesis, taking into account any assumptions about the normality of the population distribution, equality of variances, independence of samples, etc.

b) Determine the sampling distribution of the test statistic when the null hypothesis is
true.
c) Compute a value of the test statistic from the data contained in the sample (calculation
of test statistic).

3. Types of errors

                        Condition of the null hypothesis
  Possible action       H0 true             H0 false
  Fail to reject H0     Correct action      Type II error
  Reject H0             Type I error        Correct action

Significance level ($\alpha$): the probability of rejecting a true null hypothesis, that is, the probability of committing a type I error.

4. Rejection and nonrejection regions


Set up the rejection region (or critical region) on this test statistic, where the null hypothesis will be rejected in $100\alpha\%$ of the samples when the null hypothesis is true. The values of the test statistic that separate the rejection and nonrejection regions are called the critical values of the test statistic.

5. Decision rule
The null hypothesis is rejected if the value of the test statistic computed from the sample falls in
the rejection region. Otherwise, the null hypothesis is not rejected.

6. p-value
The p-value is the probability, assuming that H0 is true, that the test statistic will take a value at least as extreme, in the direction of H1, as that actually computed. The smaller the p-value, the stronger the evidence against H0 provided by the data.

7. Decision rule (p-value approach)
p-value $\le \alpha$: we reject H0 (the data are statistically significant at level $\alpha$),
whereas
p-value $> \alpha$: we do not reject H0.

Neither hypothesis testing nor statistical inference, in general, leads to the


proof of a hypothesis; it merely indicates whether the hypothesis is
supported or is not supported by the available data. An accepted
hypothesis does not mean that the hypothesis is true, but that it may be true.

Confidence intervals and hypothesis testing


One may use confidence intervals to reach the same conclusions as those obtained with hypothesis-testing procedures. When testing a null hypothesis by means of a two-sided confidence interval, we reject H0 at the $\alpha$ level of significance if the hypothesized parameter is not contained within the $100(1-\alpha)\%$ confidence interval. Otherwise, we cannot reject H0.

ASSUMPTIONS, HYPOTHESES, TEST STATISTICS AND REJECTION REGIONS

1) Normal population with known $\sigma$; non-normal population with known $\sigma$, $n > 30$; arbitrary population with unknown $\sigma$, $n > 30$ (replace $\sigma$ by S when $\sigma$ is unknown).
$H_0$: $\mu = \mu_0$. Test statistic:
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1)$$
Rejection region: $H_1: \mu \neq \mu_0 \Rightarrow |Z| \ge z_{\alpha/2}$; $H_1: \mu > \mu_0 \Rightarrow Z \ge z_{\alpha}$; $H_1: \mu < \mu_0 \Rightarrow Z \le -z_{\alpha}$.

2) Normal population with unknown $\sigma$; two dependent normal populations (paired samples: apply the statistic to the differences).
$H_0$: $\mu = \mu_0$. Test statistic:
$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t \text{ with } n-1 \text{ d.f.}$$
Rejection region: $H_1: \mu \neq \mu_0 \Rightarrow |T| \ge t_{\alpha/2}$; $H_1: \mu > \mu_0 \Rightarrow T \ge t_{\alpha}$; $H_1: \mu < \mu_0 \Rightarrow T \le -t_{\alpha}$.

3) Two independent normal populations with known $\sigma_1, \sigma_2$; two independent non-normal populations with known $\sigma_1, \sigma_2$, $n_1, n_2 > 30$.
$H_0$: $\mu_1 - \mu_2 = d_0$. Test statistic:
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - d_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0,1)$$
Rejection region: $H_1: \mu_1-\mu_2 \neq d_0 \Rightarrow |Z| \ge z_{\alpha/2}$; $H_1: \mu_1-\mu_2 > d_0 \Rightarrow Z \ge z_{\alpha}$; $H_1: \mu_1-\mu_2 < d_0 \Rightarrow Z \le -z_{\alpha}$.

4) Two independent arbitrary populations with unknown $\sigma_1, \sigma_2$, $n_1, n_2 > 30$.
$H_0$: $\mu_1 - \mu_2 = d_0$. Test statistic:
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - d_0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \sim N(0,1)$$
Rejection region: as in 3).

5) Two independent normal populations with unknown $\sigma_1 = \sigma_2$.
$H_0$: $\mu_1 - \mu_2 = d_0$. Test statistic:
$$T = \frac{(\bar{X}_1 - \bar{X}_2) - d_0}{\sqrt{\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)\dfrac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}}} \sim t \text{ with } n_1+n_2-2 \text{ d.f.}$$
Rejection region: $H_1: \mu_1-\mu_2 \neq d_0 \Rightarrow |T| \ge t_{\alpha/2}$; $H_1: \mu_1-\mu_2 > d_0 \Rightarrow T \ge t_{\alpha}$; $H_1: \mu_1-\mu_2 < d_0 \Rightarrow T \le -t_{\alpha}$.

6) Normal population (test for a variance).
$H_0$: $\sigma^2 = \sigma_0^2$. Test statistic:
$$\chi^2 = \frac{(n-1)S^2}{\sigma_0^2} \sim \chi^2 \text{ with } n-1 \text{ d.f.}$$
Rejection region: $H_1: \sigma^2 \neq \sigma_0^2 \Rightarrow \chi^2 \le \chi^2_{1-\alpha/2}$ or $\chi^2 \ge \chi^2_{\alpha/2}$; $H_1: \sigma^2 > \sigma_0^2 \Rightarrow \chi^2 \ge \chi^2_{\alpha}$; $H_1: \sigma^2 < \sigma_0^2 \Rightarrow \chi^2 \le \chi^2_{1-\alpha}$.

7) Two independent normal populations (test for equality of variances).
$H_0$: $\sigma_1^2 = \sigma_2^2$. Test statistic:
$$F = \frac{S_1^2}{S_2^2} \sim F \text{ with } (n_1-1,\, n_2-1) \text{ d.f.}$$
Rejection region: $H_1: \sigma_1^2 \neq \sigma_2^2 \Rightarrow F \ge F_{\alpha/2}$ or $F \le F_{1-\alpha/2}$; $H_1: \sigma_1^2 > \sigma_2^2 \Rightarrow F \ge F_{\alpha}$; $H_1: \sigma_1^2 < \sigma_2^2 \Rightarrow F \le F_{1-\alpha}$.

8) Binomial population, $n > 30$.
$H_0$: $p = p_0$. Test statistic:
$$Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}} \sim N(0,1)$$
Rejection region: $H_1: p \neq p_0 \Rightarrow |Z| \ge z_{\alpha/2}$; $H_1: p > p_0 \Rightarrow Z \ge z_{\alpha}$; $H_1: p < p_0 \Rightarrow Z \le -z_{\alpha}$.

9) Two independent binomial populations, $n_1, n_2 > 30$.
$H_0$: $p_1 = p_2$. Test statistic (with $\bar{p}$ the pooled sample proportion):
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\bar{p}(1-\bar{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim N(0,1)$$
Rejection region: $H_1: p_1 \neq p_2 \Rightarrow |Z| \ge z_{\alpha/2}$; $H_1: p_1 > p_2 \Rightarrow Z \ge z_{\alpha}$; $H_1: p_1 < p_2 \Rightarrow Z \le -z_{\alpha}$.

Examples: see the worked examples below.


Steps in the hypothesis-testing procedure [adapted from Daniel (1999), p. 211]:

1. Evaluate the data.
2. Review the assumptions.
3. State the hypotheses.
4. Select the test statistic.
5. Determine the distribution of the test statistic.
6. State the decision rule.
7. Calculate the test statistic.
8. Make the statistical decision:
   - Do not reject H0: conclude that H0 may be true.
   - Reject H0: conclude that H1 is true.


 Examples:

Example 1. Confidence interval for a population mean, sampling from a normal population
A physical therapist wished to estimate, with 99% confidence, the mean maximal strength of a particular muscle in a certain group of individuals. He is willing to assume that strength scores are approximately normally distributed with a variance of 144. A sample of 15 subjects who participated in the experiment yielded a mean of 84.3. (Daniel, p. 157)
Solution: The z value corresponding to a confidence coefficient of 0.99 is found in the table of the standard normal distribution to be 2.58. The standard error is $\sigma/\sqrt{n} = 12/\sqrt{15} = 3.0984$. The 99% confidence interval for $\mu$ is:
$$84.3 \pm 2.58(3.0984)$$
$$84.3 \pm 8.0$$
$$76.3 \le \mu \le 92.3$$
We are 99% confident that the population mean is between 76.3 and 92.3.
Probabilistic interpretation: in repeated sampling, 99% of all intervals constructed in the manner described would include the population mean.
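A minimal sketch of this calculation in Python (scipy.stats assumed available, not part of the original notes); the numbers are those of the example.

```python
from math import sqrt
from scipy import stats

xbar, var, n, conf = 84.3, 144, 15, 0.99
z = stats.norm.ppf(1 - (1 - conf) / 2)   # reliability coefficient, approx. 2.576
se = sqrt(var) / sqrt(n)                 # standard error, approx. 3.0984
lower, upper = xbar - z * se, xbar + z * se
print(lower, upper)                      # approx. 76.3 and 92.3
```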

Example 2. Confidence interval for a population mean, sampling from a non-normal population
Punctuality of patients in keeping appointments is of interest to a research team. In a study of patient flow through the offices of general practitioners, it was found that a sample of 35 patients was, on average, 17.2 minutes late for appointments, with a standard deviation of 8 minutes. The population distribution was felt to be non-normal. What is the 90% confidence interval for $\mu$, the true mean amount of time late for appointments? (Daniel, p. 158)
Solution: The sample size is fairly large (greater than 30). Although the population standard deviation is unknown, we can use the sample variance as a replacement for the unknown population variance. Therefore we assume the sampling distribution of the sample mean to be approximately normally distributed (application of the central limit theorem). From the table of the standard normal distribution we find the z value corresponding to a confidence coefficient of 0.90 to be about 1.645. The standard error is $s/\sqrt{n} = 8/\sqrt{35} = 1.3522$. The 90% confidence interval for $\mu$ is:
$$17.2 \pm 1.645(1.3522)$$
$$17.2 \pm 2.2$$
$$15.0 \le \mu \le 19.4$$

Example 3. Confidence interval for the difference between two population means sampling from
a normal population with equal variances
The purpose of a study by Stone et al. was to determine the effects of long-term exercise
intervention on corporate executives enrolled in a supervised fitness program. Data were
collected on 13 subjects (the exercise group) who voluntarily entered a supervised exercise
program and remained active for an average of 13 years and 17 subjects (the sedentary group)
who elected not to join the fitness program. Among the data collected on the subjects was
maximum number of sit-ups completed in 30 seconds. The exercise group had a mean and
standard deviation for this variable of 21.0 and 4.9, respectively. The mean and standard
deviation for the sedentary group were 12.1 and 5.6, respectively. We assume that the two
populations of overall muscle condition measures are approximately normally distributed and that
the two population variances are equal. We wish to construct a 95% confidence interval for the
difference between the means of the populations represented by these two samples. (Daniel, pp.
170-171)

Solution: First we compute the pooled estimate of the common population variance:
$$S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2} = \frac{(13-1)(4.9)^2 + (17-1)(5.6)^2}{13+17-2} = 28.21$$
Entering the table of the t distribution with $13 + 17 - 2 = 28$ degrees of freedom and a desired confidence level of 0.95 ($\alpha = 0.05$), we find that the reliability factor is 2.048. The 95% confidence interval for the difference between the population means is computed as follows:
$$(21.0 - 12.1) \pm 2.048\sqrt{\left(\frac{1}{13} + \frac{1}{17}\right)(28.21)}$$
$$8.9 \pm 4.008$$
$$4.9 \le \mu_1 - \mu_2 \le 12.9$$
We are 95% confident that the difference between the population means lies between 4.9 and 12.9. Since the interval does not include zero, we conclude that the population means are not equal.
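A minimal sketch of the pooled-variance interval in Python (scipy.stats assumed available); the data are those of the example.

```python
from math import sqrt
from scipy import stats

x1bar, s1, n1 = 21.0, 4.9, 13   # exercise group: mean, sd, sample size
x2bar, s2, n2 = 12.1, 5.6, 17   # sedentary group: mean, sd, sample size
conf = 0.95

sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance, approx. 28.21
t = stats.t.ppf(1 - (1 - conf) / 2, df=n1 + n2 - 2)          # reliability factor, approx. 2.048
half_width = t * sqrt(sp2 * (1 / n1 + 1 / n2))               # approx. 4.008
diff = x1bar - x2bar
print(diff - half_width, diff + half_width)                  # approx. 4.9 and 12.9
```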

Example 4. Confidence interval for the ratio of the variances of two normally distributed
populations
A study was conducted to determine if an acute dose of dextroamphetamine might have positive
effects on affect and cognition in schizophrenic patients maintained on a regimen of haloperidol.
Among the variables measured was the change in patients tension-anxiety states. For n2 = 4
patients who responded to amphetamine, the standard deviation for this measurement was 3.4.
For n1 = 11 patients who did not respond, the standard deviation was 5.8. Let us assume that
these patients constitute independent simple random samples from populations of similar
patients. Let us also assume that change scores in tension-anxiety state is a normally distributed
variable in both populations. We wish to construct a 95% confidence interval for the ratio of the
variances of these two populations. (Daniel, p. 192)
Solution: The information we have:
$$n_1 = 11, \quad n_2 = 4, \quad s_1^2 = 5.8^2 = 33.64, \quad s_2^2 = 3.4^2 = 11.56, \quad df_1 = 10, \quad df_2 = 3$$
The sampling distribution follows an F distribution with 10 and 3 degrees of freedom, and the reliability factors found in the F table are:
$$F_{0.05} = 8.79 \qquad F_{0.95} = \frac{1}{3.71} = 0.27$$
The confidence interval for $\sigma_2^2/\sigma_1^2$ is:
$$\frac{11.56}{33.64}(0.27) \;\le\; \frac{\sigma_2^2}{\sigma_1^2} \;\le\; \frac{11.56}{33.64}(8.79)$$
$$0.093 \;\le\; \frac{\sigma_2^2}{\sigma_1^2} \;\le\; 3.02$$
Since the interval includes 1, we conclude that the two population variances may be equal.
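A minimal sketch reproducing this interval in Python with scipy.stats (assumed available); the F tail areas used below are the ones quoted in the example.

```python
from scipy import stats

n1, s1_sq = 11, 5.8**2   # non-responders: sample size and variance (33.64)
n2, s2_sq = 4, 3.4**2    # responders: sample size and variance (11.56)

# F quantiles with (n1-1, n2-1) = (10, 3) degrees of freedom,
# using the tail areas quoted in the example (0.05 in each tail)
f_upper = stats.f.ppf(0.95, dfn=n1 - 1, dfd=n2 - 1)   # approx. 8.79
f_lower = stats.f.ppf(0.05, dfn=n1 - 1, dfd=n2 - 1)   # approx. 0.27

ratio = s2_sq / s1_sq                                  # s2^2 / s1^2, approx. 0.344
print(ratio * f_lower, ratio * f_upper)                # approx. 0.093 and 3.02
```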

Example 5. Confidence interval for a population proportion


An association of Christmas tree growers in Indiana sponsored a sample survey of Indiana households to help improve the marketing of Christmas trees. A simple random sample of 500 households was contacted by telephone and asked several questions in a 2-minute interview. One question was "Did you have a Christmas tree this year?" Of the 500 respondents, 421 answered "Yes". The association wished to construct a 95% confidence interval for the true proportion of all Indiana households who displayed a Christmas tree. (Moore-McCabe, pp. 583-584)
Solution: The sample proportion is:
$$\hat{p} = \frac{421}{500} = 0.842$$
From the table of the standard normal distribution we find the z value corresponding to a confidence coefficient of 0.95 to be 1.96. The interval is:
$$0.842 \pm 1.96\sqrt{\frac{0.842 \times 0.158}{500}}$$
$$0.842 \pm 0.032$$
$$0.810 \le p \le 0.874$$
The association was 95% confident that between 81% and 87% of Indiana homes displayed Christmas trees.
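A minimal sketch of the proportion interval in Python (scipy.stats assumed available); the counts are those of the example.

```python
from math import sqrt
from scipy import stats

count, n, conf = 421, 500, 0.95
p_hat = count / n                                  # sample proportion, 0.842
z = stats.norm.ppf(1 - (1 - conf) / 2)             # approx. 1.96
half_width = z * sqrt(p_hat * (1 - p_hat) / n)     # approx. 0.032
print(p_hat - half_width, p_hat + half_width)      # approx. 0.810 and 0.874
```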

Example 6. Test for a population proportion


Of the 500 respondents in the Christmas tree market survey, 38% were from rural areas
(including small towns) and the other 62% were from urban areas (including suburbs). According
to the 1980 Census, 36% of Indiana residents live in rural areas and the remaining 64% live in
urban areas. To examine how well the sample represents the state population in regard to rural
versus urban residence, we perform a hypothesis test of
$$H_0: p = 0.36$$
versus the alternative:
$$H_1: p \neq 0.36$$
where p represents the proportion of rural households that would be obtained by the telephone sampling procedure if it were repeated over and over again. (Moore-McCabe, p. 584)
Solution: The test statistic is:
$$z_{calc} = \frac{0.38 - 0.36}{\sqrt{\dfrac{0.36 \times 0.64}{500}}} = 0.93$$
From the table of the standard normal distribution, we find that the probability that Z is less than or equal to 0.93 is 0.8238. The probability in each tail is therefore $1 - 0.8238 = 0.1762$, and the p-value is $2 \times 0.1762 = 0.35$. There is a 35% chance of getting a value of Z larger than 0.93 or smaller than $-0.93$ if H0 is true. We therefore have no reason to reject the hypothesis that the sampling procedure is unbiased with respect to rural versus urban residence.
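A minimal sketch of this two-sided test in Python (scipy.stats assumed available); the values are those of the example.

```python
from math import sqrt
from scipy import stats

p_hat, p0, n = 0.38, 0.36, 500
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # approx. 0.93
p_value = 2 * stats.norm.sf(abs(z))          # two-sided p-value, approx. 0.35
print(z, p_value)
```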

Example 7. Test for comparing two proportions


If a respondent to the Christmas tree survey introduced in Example 5 did display a tree during the
last holiday season, the next question asked was whether the tree was natural or artificial.
Respondents were also asked if they lived in an urban area or in a rural area. Of the 421
households displaying a Christmas tree, 261 were urban and 160 lived in rural areas. We will
compare the urban tree users and the rural tree users with respect to their preference for natural
versus artificial trees. Other studies have shown that there is a tendency toward greater use of natural Christmas trees in rural areas. A one-sided test will be used to examine whether or not this sample supports this previous finding. Take population 1 to consist of the urban households that use a tree and population 2 to be the rural tree users. (Moore-McCabe, pp. 599-600)
Solution: We want to test the hypotheses:
$$H_0: p_1 = p_2$$
$$H_1: p_1 < p_2$$
The survey responses show that 89 of the urban households and 64 of the rural households who displayed a tree chose a natural tree. So:
$$\hat{p}_1 = \frac{89}{261} = 0.341 \qquad \hat{p}_2 = \frac{64}{160} = 0.4$$
The pooled estimate of the common proportion of respondents who chose a natural tree is:
$$\bar{p} = \frac{89 + 64}{261 + 160} = 0.363$$
The test statistic is calculated as follows:
$$\sqrt{0.363 \times 0.637\left(\frac{1}{261} + \frac{1}{160}\right)} = 0.04828$$
$$z_{calc} = \frac{0.341 - 0.4}{0.04828} = -1.22$$
From the table of the standard normal distribution, and since we are doing a one-sided test, the p-value is:
$$P(Z \le -1.22) = 0.1112$$
Even though rural households in the survey chose natural Christmas trees more often than urban households, our calculations indicate that there is not sufficient evidence in the data to conclude that this difference in preferences holds in the population of Indiana tree users. If the preferences of rural and urban households were identical, rural usage would exceed urban usage by an amount leading to a z statistic at least as large as the one observed in 11% of all samples of this size.
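A minimal sketch of the pooled two-proportion test in Python (scipy.stats assumed available); the counts are those of the example.

```python
from math import sqrt
from scipy import stats

x1, n1 = 89, 261    # urban households choosing a natural tree
x2, n2 = 64, 160    # rural households choosing a natural tree

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                          # approx. 0.363
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))    # approx. 0.04828
z = (p1 - p2) / se                                      # approx. -1.22
p_value = stats.norm.cdf(z)                             # one-sided P(Z <= z), approx. 0.11
print(z, p_value)
```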

Example 8. Test for a population mean


A laboratory analyzes specimens of a pharmaceutical product to determine the concentration of
the active ingredient. Such chemical analyses are not perfectly precise. Repeated measurements
on the same specimen will give slightly different results. The results of repeated measurements
follow a normal distribution quite closely. The analysis procedure has no bias, so that the mean $\mu$ of the population of all measurements is the true concentration in the specimen. The standard deviation of this distribution is a property of the analytical procedure and is known to be $\sigma = 0.0068$ gram per litre. The laboratory analyzes each specimen three times and reports the mean result. The laboratory is asked to evaluate the claim that the concentration of the active ingredient in a specimen is 0.86%. The mean of three repeated analyses of the specimen is $\bar{x} = 0.8404$. The true concentration is the mean $\mu$ of the population of repeated analyses. Can the laboratory conclude that $\mu$ is different from 0.86? (Moore-McCabe, pp. 474-475)
Solution: The hypotheses are:
$$H_0: \mu = 0.86$$
$$H_1: \mu \neq 0.86$$
The lab chooses the 1% level of significance ($\alpha = 0.01$). The computed value of the test statistic is:
$$z_{calc} = \frac{0.8404 - 0.86}{0.0068/\sqrt{3}} = -4.99$$
Because the alternative is two-sided, we compare $z = -4.99$ with the $\alpha/2 = 0.005$ critical value from the table of the standard normal distribution. This critical value is $z^* = 2.576$; that is to say, the nonrejection region of H0 is $[-2.576, 2.576]$. Since $|z| > z^*$, we reject H0.
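A minimal sketch of this z test in Python (scipy.stats assumed available); the values are those of the example.

```python
from math import sqrt
from scipy import stats

xbar, mu0, sigma, n, alpha = 0.8404, 0.86, 0.0068, 3, 0.01
z = (xbar - mu0) / (sigma / sqrt(n))        # approx. -4.99
z_crit = stats.norm.ppf(1 - alpha / 2)      # approx. 2.576
p_value = 2 * stats.norm.sf(abs(z))         # two-sided p-value, well below 0.01
print(z, z_crit, p_value, abs(z) > z_crit)  # |z| > z_crit, so we reject H0
```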

References
Daniel (1999), pp. 150-271.
Moore and McCabe (1989), pp. 443-605.
