
ST 205: STATISTICAL INFERENCE I

(Teaching Notes)
Introduction:
Statistics is a science: the science of inference.
Data, summarized or otherwise, are used in the inference along with the
tools of probability theory and inductive or deductive reasoning.
Definition
Statistical inference comprises those methods concerned with the
analysis of a subset of data leading to predictions or inferences about the
entire set of data.
Also, statistical inference means making a probability judgment
concerning a population on the basis of one or more samples.
There are two subdivisions within statistics: Descriptive statistics and
inferential statistics.
Descriptive statistics simply summarize the given data, bringing
out their important features; no attempt is made to infer
anything that pertains to more than the data themselves.
E.g.
In the financial year 2008/2009, seventy-one (71) out of 133
district councils obtained a clean financial audit, which is 53.4%.

Inferential statistics uses a number of quantitative techniques that
enable us to make appropriate generalizations from limited
observations.
e.g.
Suppose the Department of Mathematics and Statistics wants to
establish a masters programme in the academic year 2011/2012. To
meet the market demand for statisticians, the department will conduct
a survey in various institutions to explore the skills that are lacking,
so that the masters curriculum can focus on them.
Note: Statistical inference relies on the theory of probability, regardless of
whether we are dealing with point or interval estimation, tests of hypotheses
or correlation.
A numerical measure of a population is called a population
parameter, or simply parameter.
A numerical measure of the sample is called a sample statistic, or
simply a statistic.
Population parameters are estimated by sample statistics. When a
sample statistic is used to estimate a population parameter, the
statistic is called an estimator of the parameter.
Statistical inference is the subject of this course.
Two important problems in statistical inference are estimation and tests of
hypotheses.
Topic 1: ESTIMATION
Assume that some characteristic of the elements in a population can be
represented by a random variable $X$ whose density is $f_X(x;\theta) = f(x;\theta)$,
where the form of the density is assumed known except that it contains an
unknown parameter $\theta$ (if $\theta$ were known, the density function would be
completely specified, and there would be no need to make inferences about
it).
Furthermore, assume that the values $x_1, x_2, x_3, \ldots, x_n$ of a random sample
$X_1, X_2, \ldots, X_n$ from $f(x;\theta)$ can be observed. On the basis of the observed
sample values $x_1, x_2, x_3, \ldots, x_n$ it is desired to estimate the value of the
unknown parameter $\theta$ or the value of some function, say $\tau(\theta)$, of the
unknown parameter. This estimation can be made in two ways (types of
estimates): point and interval estimation.
1.1 POINT ESTIMATION
Definition
- A point estimate lets the value of some statistic represent (estimate) the
unknown parameter.
- It is a single number which is used to estimate an unknown population
parameter.

e.g. The sample mean $\bar{X}$ is the sample statistic used as an estimator
of the population mean $\mu$. This estimate is a point estimate because it
constitutes a single number.
Although it is a common way of expressing an estimate, a point estimate
suffers from a limitation: it fails to indicate how close it is to the quantity it
is supposed to estimate, i.e. it lacks a measure of reliability and precision.
For instance, if someone claims that 40% of all second year B.Sc.
statistics students (91) do not appreciate the Statistical Inference
lecturer, the claim would not be very helpful if it is based on a small
number of students, say 5. However, it is more meaningful and
reliable as the number of students sampled increases.
This implies that a point estimate should always be accompanied by some
relevant information so that it is possible to judge how reliable it is.
Two further problems with point estimation are: first, to plan carefully some
means of obtaining a statistic to use as an estimator, and second, to select
criteria and techniques to define and find a best estimator among many
possible estimators.
Methods of finding estimators
Assume that $X_1, \ldots, X_n$ is a random sample from a density $f(x;\theta)$, where
the form of the density is known but the parameter $\theta$ is unknown. Further
assume that $\theta$ is a vector of real numbers, say $\theta = (\theta_1, \ldots, \theta_k)$,
so that there are $k$ parameters.
We will let $\Theta$, called the parameter space, denote the set of possible values
that the parameter $\theta$ can assume. The object is to find statistics to be
used as estimators of certain functions of $\theta$.
There are several methods of finding point estimators, but for our case we
are going to study only three: method of moments, maximum likelihood and
least square methods. Probably the most important is the method of
maximum likelihood.
A statistic that is used to obtain a point estimate is called an estimator.
The word estimator stands for the function, and the word estimate stands for
a value of that function.
e.g. $\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$ is an estimator of the mean $\mu$, and $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$ is an estimate of $\mu$.


Notation in estimation that has widespread usage: $\hat{\theta}$ is used to denote
an estimate of $\theta$.
Methods of Moments
In mechanics moment is used to denote the rotating effect of a force. In
statistics, it is used to indicate peculiarities of a frequency distribution. We
can measure central tendency, dispersion or variability, skewness and the
peakedness of the curve.
The moments about the actual arithmetic mean are:

First moment: $\mu_1 = \dfrac{1}{N}\sum (X - \bar{X})$

Second moment: $\mu_2 = \dfrac{1}{N}\sum (X - \bar{X})^2 = \sigma^2$ (VARIANCE)

Third moment: $\mu_3 = \dfrac{1}{N}\sum (X - \bar{X})^3$ (SKEWNESS)

Fourth moment: $\mu_4 = \dfrac{1}{N}\sum (X - \bar{X})^4$ (KURTOSIS)
Let $f(x; \theta_1, \ldots, \theta_k)$ be a density of a random variable $X$ which has $k$
parameters $\theta_1, \ldots, \theta_k$. Let $\mu'_r$ denote the $r$th moment about 0; that is,
$\mu'_r = E[X^r]$. In general $\mu'_r$ will be a known function of the $k$ parameters
$\theta_1, \ldots, \theta_k$. This can be denoted by writing $\mu'_r = \mu'_r(\theta_1, \ldots, \theta_k)$.

Let $X_1, \ldots, X_n$ be a random sample from the density $f(x; \theta_1, \ldots, \theta_k)$,
and let $M'_j$ be the $j$th sample moment; i.e.

$M'_j = \dfrac{1}{n}\sum_{i=1}^{n} X_i^{\,j}$

Form the $k$ equations

$M'_j = \mu'_j(\theta_1, \ldots, \theta_k), \quad j = 1, \ldots, k$

in $k$ variables $\theta_1, \ldots, \theta_k$ and let $\hat{\theta}_1, \ldots, \hat{\theta}_k$ be their solution (we assume that there is a
unique solution). We say that the estimator $(\hat{\theta}_1, \ldots, \hat{\theta}_k)$, where $\hat{\theta}_j$
estimates $\theta_j$, is the estimator of $(\theta_1, \ldots, \theta_k)$ obtained by the method of moments. The
estimators were obtained by replacing population moments by sample
moments.
For simplicity it can be defined that:
- Population moment. Let $X$ follow a specific population distribution. The $k$-th
moment of the population distribution with pdf $f(x)$ is $E[X^k]$.
- Sample moment. Let $X_1, X_2, \ldots, X_n$ be a random sample from a pdf $f(x)$.
The $k$-th sample moment is $\dfrac{1}{n}\sum_{i=1}^{n} X_i^k$.

First moment: $M'_1 = \dfrac{1}{n}\sum_{i=1}^{n} X_i$ (these are moments about zero)

Second moment: $M'_2 = \dfrac{1}{n}\sum_{i=1}^{n} X_i^2$, etc.

Sample moments can be used to estimate population moments.
Example 1
Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with
mean $\mu$ and variance $\sigma^2$. Let $(\theta_1, \theta_2) = (\mu, \sigma)$. Estimate the parameters
$\mu$ and $\sigma$ by the method of moments.
Solution
Recall $\mu'_1 = \mu$ and $\mu'_2 = \sigma^2 + \mu^2$.

$M'_1 = \mu'_1 = \mu$, so the method of moments estimator of $\mu$ is

$\hat{\mu} = M'_1 = \dfrac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$

(Treat $\mu$ as a first moment, then equate the first sample moment with $\mu$.)

$M'_2 = \mu'_2 = \sigma^2 + \mu^2$

(Treat $\sigma^2 + \mu^2$ as a second moment, then equate the second sample moment
with the formula for calculating the variance.)

The method of moments estimator of $\sigma$ is therefore obtained from

$\hat{\sigma}^2 = \dfrac{1}{n}\sum_{i=1}^{n} X_i^2 - \left(\dfrac{1}{n}\sum_{i=1}^{n} X_i\right)^2 = \dfrac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2 = \dfrac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$

Note: the estimator of $\sigma^2$ is not $S^2$.
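As a minimal numerical sketch of the method above (added for illustration, with a hypothetical sample, not part of the original notes):

import numpy as np

# Hypothetical sample, assumed drawn from a normal population
x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4])
n = len(x)

m1 = x.sum() / n          # first sample moment (about zero)
m2 = (x**2).sum() / n     # second sample moment (about zero)

mu_hat = m1               # method of moments estimate of the mean
sigma2_hat = m2 - m1**2   # equals (1/n)*sum((x - xbar)^2), not S^2

print(mu_hat, sigma2_hat)
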
Example 2
Let $X_1, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$.
Solution
$\hat{\mu} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$ and $\hat{\sigma}^2 = \dfrac{1}{n}\sum_{i=1}^{n} X_i^2 - \hat{\mu}^2$,

which give $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = \dfrac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$.
NOTE:
Method of moments estimators are not uniquely defined. So far we have
been using first k raw moments. But central moments could also be used to
obtain equations whose solution would also produce estimators that would
be labeled method of moment estimators. Also moments other than the first k
could be used to obtain estimators.
Exercise
1. Let $X_1, \ldots, X_n$ be a random sample from a uniform distribution on
$(\theta - 3,\ \theta + 3)$. Use the method of moments to estimate the parameter $\theta$.
2. Let $X_1, \ldots, X_n$ be a random sample from a Poisson distribution with
parameter $\lambda$. Estimate $\lambda$.
Maximum Likelihood
This technique of finding estimators was first used and developed by Sir
R.A. Fisher in 1922, who called it the maximum likelihood method.
The maximum likelihood method provides estimators with desirable
properties such as efficiency, consistency and sufficiency. It usually does not
give unbiased estimators.
Example 1:
Suppose we want to estimate the average grade $\mu$ of the LG university
examination. A random sample of size $n = 128$ is taken and the sample mean
$\bar{x}$ is found to be 64 marks.
Clarification:
- The assumption is that a sample of $n = 128$ represents the population.
- From which population did $\bar{x} = 64$ most probably come? A
population with $\mu = 60$, 64 or 75?
Note: The population mean $\mu$ is either 64 or not; it has only one value.
- Hence, the term likely is used instead of probably.
Example 2:
Suppose that an urn contains a number of black and a number of white balls,
and suppose that it is known that the ratio of the numbers is 3:1 but it is not
known whether the black or the white balls are more frequent.
Explanation:
- The probability of drawing a black ball is either $\frac{1}{4}$ or $\frac{3}{4}$.
- If $n$ balls are drawn with replacement from the urn, the distribution of $X$,
the number of black balls, is given by the binomial distribution

$f(x; p) = \dbinom{n}{x} p^x q^{n-x}$ for $x = 0, 1, 2, \ldots, n$

where $q = 1 - p$ and $p$ is the probability of drawing a black ball, so
$p = \frac{1}{4}$ or $p = \frac{3}{4}$.
- Here we draw a sample of three balls, $n = 3$, with replacement and attempt
to estimate the unknown parameter $p$ of the distribution.
- The choice is to be made between only two numbers, $\frac{1}{4}$ and $\frac{3}{4}$.
The possible outcomes and their probabilities are as follows:

Outcome: x       0      1      2      3
f(x; 3/4)      1/64   9/64  27/64  27/64
f(x; 1/4)     27/64  27/64   9/64   1/64

For example, if we found $x = 0$ in a sample of 3, the estimate $\frac{1}{4}$ for $p$ would
be preferred over $\frac{3}{4}$ because the probability $\frac{27}{64}$ is greater than $\frac{1}{64}$.
Generally we should estimate $p$ by $\frac{1}{4}$ when $x = 0$ or 1 and by $\frac{3}{4}$ when
$x = 2$ or 3. The estimator may be defined as

$\hat{p} = \hat{p}(x) = \begin{cases} 0.25 & x = 0, 1 \\ 0.75 & x = 2, 3 \end{cases}$

The estimator thus selects for every possible $x$ the value of $p$, say $\hat{p}$, such that
$f(x; \hat{p}) > f(x; p')$, where $p'$ is an alternative value of $p$.
Furthermore:
If several alternative values of $p$ were possible, we might reasonably proceed
in the same manner. Thus if we found $x = 6$ in a sample of 25 from a
binomial population, we should substitute all possible values of $p$ in the
expression

$f(6; p) = \dbinom{25}{6} p^6 (1-p)^{19}$ for $0 \le p \le 1$

and choose as our estimate that value of $p$ which maximizes $f(6; p)$.
The maximum value can be found by equating the first derivative to zero, i.e.

$\dfrac{d}{dp} f(6; p) = \dbinom{25}{6} p^5 (1-p)^{18}\left[6(1-p) - 19p\right] = 0$

We find that $p = 0, 1, \frac{6}{25}$ are the roots. The root which gives the maximum
value is $\hat{p} = \frac{6}{25}$.
The estimate has the property $f(6; \hat{p}) > f(6; p')$, where $p'$ is an alternative
value of $p$ in the interval $0 \le p \le 1$.
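As an illustrative check (not part of the original notes), $f(6;p)$ can be maximized numerically on a grid, confirming that the maximum occurs at $p = 6/25 = 0.24$:

import numpy as np
from scipy.special import comb

# Binomial likelihood of observing x = 6 successes in n = 25 trials
p = np.linspace(0.001, 0.999, 9999)
likelihood = comb(25, 6) * p**6 * (1 - p)**19

p_hat = p[np.argmax(likelihood)]
print(p_hat)  # approximately 0.24 = 6/25
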
Definition of Likelihood Function (likelihood = chance = probability)
The likelihood function of $n$ random variables $X_1, X_2, \ldots, X_n$ is defined to
be the joint density of the $n$ random variables, say
$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta)$, which is considered to be a function of $\theta$.
In particular, if $X_1, \ldots, X_n$ is a random sample from the density $f(x; \theta)$,
then the likelihood function is

$f(x_1; \theta)\, f(x_2; \theta) \cdots f(x_n; \theta)$.

Notation for the likelihood function is $L(\theta; x_1, \ldots, x_n)$.

- The likelihood function $L(\theta; x_1, \ldots, x_n)$ gives the likelihood that the
random variables assume the particular values $x_1, x_2, \ldots, x_n$.
- The likelihood is the value of a density function; so for a discrete random
variable it is a probability.
- We want to find the value of $\theta$ which maximizes the likelihood
function $L(\theta; x_1, \ldots, x_n)$.
Definition of Maximum likelihood estimator:
Let $L(\theta) = L(\theta; x_1, \ldots, x_n)$ be the likelihood function for the random
variables $X_1, X_2, \ldots, X_n$. If $\hat{\theta}$ [where $\hat{\theta} = \hat{\theta}(x_1, x_2, \ldots, x_n)$ is a function of
the observations $x_1, \ldots, x_n$] is the value of $\theta$ in $\Theta$ which maximizes $L(\theta)$,
then $\hat{\Theta} = \hat{\theta}(X_1, X_2, \ldots, X_n)$ is the maximum likelihood estimator of $\theta$,
and $\hat{\theta} = \hat{\theta}(x_1, \ldots, x_n)$ is the maximum-likelihood estimate of $\theta$ for the sample
$x_1, \ldots, x_n$.

If $X_1, X_2, \ldots, X_n$ is a random sample from some density $f(x; \theta)$, the
likelihood function is

$L(\theta) = f(x_1; \theta)\, f(x_2; \theta) \cdots f(x_n; \theta)$.
Many likelihood functions satisfy regularity conditions; in that case the maximum
likelihood estimator is the solution of the equation

$\dfrac{dL(\theta)}{d\theta} = 0$

Also $L(\theta)$ and $\ln L(\theta)$ have their maxima at the same value of $\theta$, and it is
sometimes easier to find the maximum of the natural logarithm of the
likelihood.
If the likelihood function contains $k$ parameters, i.e.

$L(\theta_1, \theta_2, \ldots, \theta_k) = \prod_{i=1}^{n} f(x_i; \theta_1, \theta_2, \ldots, \theta_k)$

then the maximum-likelihood estimators of the parameters $\theta_1, \theta_2, \ldots, \theta_k$ are
the random variables $\hat{\Theta}_1 = \hat{\theta}_1(X_1, \ldots, X_n)$, $\hat{\Theta}_2 = \hat{\theta}_2(X_1, \ldots, X_n)$, ...,
$\hat{\Theta}_k = \hat{\theta}_k(X_1, \ldots, X_n)$, where $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k$ are the values which
maximize $L(\theta_1, \theta_2, \ldots, \theta_k)$.

If certain regularity conditions are satisfied, the point where the likelihood is
a maximum is a solution of the $k$ equations

$\dfrac{\partial L(\theta_1, \ldots, \theta_k)}{\partial \theta_1} = 0$

$\dfrac{\partial L(\theta_1, \ldots, \theta_k)}{\partial \theta_2} = 0$

$\vdots$

$\dfrac{\partial L(\theta_1, \ldots, \theta_k)}{\partial \theta_k} = 0$

In this case it may also be easier to work with the natural logarithm of the
likelihood.
Example
Suppose that a random sample of size $n$ is drawn from the Bernoulli
distribution

$f(x; p) = p^x q^{1-x}$, $0 \le p \le 1$ and $q = 1 - p$.

The sample values $x_1, x_2, \ldots, x_n$ will be a sequence of 0s and 1s.
The likelihood function is

$L(p) = \prod_{i=1}^{n} p^{x_i} q^{1-x_i} = p^{\sum x_i}\, q^{\,n - \sum x_i}$

Applying ln we get:

$\ln L(p) = \sum x_i \ln p + \left(n - \sum x_i\right) \ln q$

First derivative:

$\dfrac{d \ln L(p)}{dp} = \dfrac{\sum x_i}{p} - \dfrac{n - \sum x_i}{q}$

By substituting $q = 1 - p$ and equating to zero, we find the estimate

$\hat{p} = \dfrac{\sum x_i}{n}$

Consider $n = 3$; the likelihood function can be represented by the
following four curves:

$L_0 = L(p;\ \textstyle\sum x_i = 0) = (1-p)^3$
$L_1 = L(p;\ \textstyle\sum x_i = 1) = p(1-p)^2$
$L_2 = L(p;\ \textstyle\sum x_i = 2) = p^2(1-p)$
$L_3 = L(p;\ \textstyle\sum x_i = 3) = p^3$

[Figure: the four likelihood curves $L_0, L_1, L_2, L_3$ plotted as $L(p)$ against $p$ over $0 \le p \le 1$.]
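A minimal sketch (added here for illustration, with made-up 0/1 data) evaluating the likelihood curve and the closed-form MLE:

import numpy as np

# Hypothetical Bernoulli sample of size n = 3
x = np.array([1, 0, 1])
n, s = len(x), x.sum()

p_hat = s / n  # closed-form MLE: sum(x)/n
print(p_hat)   # 0.666...

# Likelihood curve L(p) = p^s * (1-p)^(n-s) for this sample
p = np.linspace(0.001, 0.999, 999)
L = p**s * (1 - p)**(n - s)
print(p[np.argmax(L)])  # numerically close to p_hat
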
Example
A random sample of size $n$ from the normal distribution has the likelihood

$L(\mu, \sigma^2) = \prod_{i=1}^{n} \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\dfrac{1}{2\sigma^2}(x_i - \mu)^2\right] = \left(\dfrac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left[-\dfrac{1}{2\sigma^2}\sum (x_i - \mu)^2\right]$

Taking ln:

$\ln L = -\dfrac{n}{2}\ln(2\pi) - \dfrac{n}{2}\ln \sigma^2 - \dfrac{1}{2\sigma^2}\sum (x_i - \mu)^2$, where $\sigma > 0$ and $-\infty < \mu < \infty$.

To locate the maximum, compute the first derivatives:

$\dfrac{\partial (\ln L)}{\partial \mu} = \dfrac{1}{\sigma^2}\sum (x_i - \mu)$ and $\dfrac{\partial (\ln L)}{\partial \sigma^2} = -\dfrac{n}{2\sigma^2} + \dfrac{1}{2\sigma^4}\sum (x_i - \mu)^2$

Equating the equations to 0 gives

$\hat{\mu} = \dfrac{1}{n}\sum x_i = \bar{x}$ and $\hat{\sigma}^2 = \dfrac{1}{n}\sum (x_i - \bar{x})^2$
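For illustration (not part of the original notes), a minimal Python sketch with made-up data confirms that numerical maximization of this log-likelihood reproduces the closed-form estimates:

import numpy as np
from scipy.optimize import minimize

x = np.array([2.3, 1.9, 3.1, 2.8, 2.2, 2.6])  # hypothetical sample

def neg_log_lik(params):
    mu, log_sigma2 = params          # log-parameterize to keep sigma^2 > 0
    sigma2 = np.exp(log_sigma2)
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) \
        + ((x - mu)**2).sum() / (2 * sigma2)

res = minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])

# Agrees with the closed-form MLEs: xbar and (1/n)*sum((x - xbar)^2)
print(mu_hat, x.mean())
print(sigma2_hat, ((x - x.mean())**2).mean())
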


NOTE:
- One must not always rely on the differentiation process to
locate the maximum.
- $\dfrac{dL(\theta)}{d\theta} = 0$ locates both minima and maxima, and hence one
must avoid using a root of the equation which actually
locates a minimum.
[Figure: a likelihood curve $L(\theta)$ whose actual maximum is at a boundary
point $\hat{\theta}$, while setting the derivative equal to 0 would locate an interior
stationary point as the maximum.]
- The maximum likelihood estimator has some desirable optimum
properties beyond its natural (intuitive) appeal.
- Maximum likelihood estimators possess a property which is
sometimes called the invariance property of maximum
likelihood estimators.
Theorem: Invariance property of maximum likelihood estimators
Let $\hat{\Theta} = \hat{\theta}(X_1, X_2, \ldots, X_n)$ be the maximum likelihood estimator of $\theta$ in the
density $f(x; \theta)$, where $\theta$ is assumed unidimensional. If $\tau(\theta)$ is a function
with a single-valued inverse, then the maximum likelihood estimator of $\tau(\theta)$
is $\tau(\hat{\Theta})$.

For example, in the normal density with $\mu_0$ known, the maximum likelihood
estimator of $\sigma^2$ is

$\hat{\sigma}^2 = \dfrac{1}{n}\sum_{i=1}^{n} (X_i - \mu_0)^2$

By the invariance property of maximum likelihood estimators, the maximum
likelihood estimator of $\sigma$ is

$\hat{\sigma} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} (X_i - \mu_0)^2}$

Similarly, the maximum likelihood estimator of, say, $\ln \sigma^2$ is

$\ln\!\left[\dfrac{1}{n}\sum_{i=1}^{n} (X_i - \mu_0)^2\right]$
Extension of invariance property of maximum likelihood estimators
Extension is done in two ways:
1. First, $\theta$ is taken as $k$-dimensional rather than unidimensional.
2. The assumption that $\tau(\theta)$ has a single-valued inverse is removed.
Consider the following simple examples:
- Suppose an estimate of the variance, namely $\theta(1-\theta)$, of a Bernoulli
distribution is desired. We know an estimate of $\theta$ is $\bar{x}$, but since
$\theta(1-\theta)$ is not a one-to-one function of $\theta$, the theorem of invariance does
not give the maximum likelihood estimator of $\theta(1-\theta)$.
- The theorem below will give such an estimate as $\bar{x}(1-\bar{x})$.
Also consider that an estimate of $E[X^2] = \mu^2 + \sigma^2$ is desired ($\mu^2 + \sigma^2$ is
not a one-to-one function of $\mu$ and $\sigma^2$), since we know the estimates of $\mu$
and $\sigma^2$. Then the estimate will be

$\bar{x}^2 + \dfrac{1}{n}\sum (x_i - \bar{x})^2$
Theorem
Let $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_k)$, where $\hat{\theta}_j = \hat{\theta}_j(X_1, \ldots, X_n)$, be a maximum likelihood
estimator of $\theta = (\theta_1, \ldots, \theta_k)$ in the density $f(x; \theta_1, \ldots, \theta_k)$. If
$\tau(\theta) = (\tau_1(\theta), \ldots, \tau_r(\theta))$, for $1 \le r \le k$, is a transformation of the parameter space
$\Theta$, then a maximum likelihood estimator of $\tau(\theta) = [\tau_1(\theta), \ldots, \tau_r(\theta)]$ is
$\tau(\hat{\theta}) = [\tau_1(\hat{\theta}), \ldots, \tau_r(\hat{\theta})]$. Note that $\tau_j(\theta) = \tau_j(\theta_1, \ldots, \theta_k)$; so the maximum
likelihood estimator of $\tau_j(\theta_1, \ldots, \theta_k)$ is $\tau_j(\hat{\theta}_1, \ldots, \hat{\theta}_k)$, $j = 1, \ldots, r$.
Exercise 2
1. Uniform distribution
Least Square
Regression refers to the statistical technique of modeling the relationship
between variables.
Consider the following simple linear regression:
[Figure: a scatter of data points for two variables X and Y with a fitted regression line through them.]

- The points on the graph are randomly chosen observations
of the two variables, X and Y.
- The straight line describes the general movement in the data
We would like our model to explain as much as possible about the process
underlying our data. However, due to the uncertainty inherent in all real-world
situations, our model will probably not explain everything, and we
will always have some remaining errors. The errors are due to unknown
outside factors that affect the process generating our data.
A good statistical model uses as few mathematical terms as possible to
describe the real situation. The model captures the systematic behavior of
the data, leaving out the factors that are nonsystematic and cannot be
foreseen or predicted: the errors.
[Diagram: Data = systematic component + random errors; the model extracts
everything systematic in the data, leaving purely random errors.]
The errors, denoted by $\varepsilon$, constitute the random component in the model.


How do we deal with errors?
- This is where probability theory comes in.
- The random errors are probably due to a large number of
minor factors that we cannot trace.
- We assume the random errors, $\varepsilon$, are normally distributed.
- If we have a properly constructed model, the resulting
observed errors will have an average of zero (although few,
if any, will actually equal zero).
- The errors $\varepsilon$ should be independent of each other.
Note: The assumption of a normal distribution of the errors is not absolutely
necessary in the regression model; rather, the assumption is made so
that we can carry out statistical hypothesis tests using the F and t
distributions.
The necessary assumption is that the errors $\varepsilon$ have mean zero, a
constant variance $\sigma^2$, and are uncorrelated.
Consider the simple linear regression model:
- We estimate the model parameters from the random sample
of data we have.
- Next we consider the observed errors resulting from the fit
of the model to the data. These observed errors, called
residuals, represent the information in the data not
explained by the model.
- If the residuals are found to contain some nonrandom,
systematic component, we reevaluate our proposed model and,
if possible, adjust it to incorporate the systematic component
found in the residuals, or we discard the model and try another.
If the residuals are found to contain only randomness, then the model
is used.
The population simple linear regression model:

$Y = \beta_0 + \beta_1 X + \varepsilon$ ... (1)

where $Y$ is the dependent variable, $X$ is the independent variable
(predictor), and $\varepsilon$ is the error term.
The model contains two parameters: $\beta_0$ = population intercept and $\beta_1$ =
population slope.
Equation (1) above is composed of two components: a nonrandom component,
which is the line itself, and a purely random component, the error term:

$Y = \underbrace{\beta_0 + \beta_1 X}_{\text{nonrandom}} + \underbrace{\varepsilon}_{\text{random error}}$

The nonrandom part of the model, the straight line, is the equation for the
mean of $Y$ given $X$, i.e. $E(Y \mid X)$. If the model is correct, the average value of
$Y$ for a given value of $X$ falls right on the regression line.
The conditional mean of $Y$:

$E(Y \mid X) = \beta_0 + \beta_1 X$ ... (2)
Sometimes $E(Y)$ or $\mu(x)$ is used instead of $E(Y \mid X)$ to denote the conditional
mean of $Y$ for a given value of $X$.
As $X$ increases, the average population value of $Y$ also increases, assuming a
positive slope of the line, and vice versa.
The actual population value of $Y$ is equal to the average $Y$ conditional on $X$,
plus a random error, $\varepsilon$. Thus, for a given value of $X$:

Y = Average Y for given X + Error

[Figure: population points of X and Y scattered about the regression line
$E(Y) = \beta_0 + \beta_1 X$, with intercept $\beta_0$, slope $\beta_1$, and a vertical deviation
from the line marked as the error.]
Model assumptions:
1. The relationship between X and Y is a straight-line relationship.
2. The values of the independent variable X are assumed fixed (not
random); the only randomness in the values of Y comes from the error
term, $\varepsilon$.
3. The errors, $\varepsilon$, are normally distributed with mean 0 and a constant
variance $\sigma^2$. The errors are uncorrelated with each other in successive
observations, i.e. $\varepsilon \sim N(0, \sigma^2)$, or equivalently $E(\varepsilon_i) = 0$ and
$\operatorname{var}(\varepsilon_i) = \sigma^2$.
Estimation
So far, we have described the population model, that is, the assumed true
relationship between the two variables X and Y. Our interest is focused on
this unknown population relationship, and we want to estimate it using
sample information.
We want to find good estimates of the regression parameters $\beta_0$ and $\beta_1$. A
method that gives us good estimates of the regression coefficients is the
method of least squares (compared to other methods, such as minimizing the
sum of the absolute errors).
The estimated regression equation:

$Y = \hat{\beta}_0 + \hat{\beta}_1 X + e$
In terms of data, it can be written as follows, with the subscript $i$ signifying
each particular data point:

$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i$, where $i = 1, 2, 3, \ldots, n$.

Generally:

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
$e_i = Y_i - \hat{Y}_i$
Sum of squares for error:

$SSE = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n} e_i^2$

Calculus (first derivatives with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$) is used in finding the
expressions for $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize SSE. These expressions are called
the normal equations:

$\sum_{i=1}^{n} y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i$

$\sum_{i=1}^{n} x_i y_i = \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2$
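For illustration (not part of the original notes), this minimal Python sketch solves the two normal equations for hypothetical data; the closed-form slope formula derived below gives the same answer:

import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Normal equations in matrix form: A @ [b0, b1] = rhs
A = np.array([[n,        x.sum()],
              [x.sum(), (x**2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])

b0, b1 = np.linalg.solve(A, rhs)
print(b0, b1)

# Equivalent closed form: b1 = S_xy / S_xx, b0 = ybar - b1 * xbar
b1_alt = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
print(y.mean() - b1_alt * x.mean(), b1_alt)
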
To discuss inference procedures, two assumptions will be considered:
a. We assume that the $n$ random variables are jointly independent
and each $Y_i$ is a normal random variable. Point estimation of
$\beta_0, \beta_1, \sigma^2$ and $\mu(x)$ for any $x$ will be discussed. [Not to be discussed
here: confidence intervals for $\beta_0, \beta_1, \sigma^2$ and $\mu(x)$ for any $x$, and tests of
hypotheses on $\beta_0, \beta_1, \sigma^2$.]
For point estimation: $Y_1, Y_2, \ldots, Y_n$ are independent normal random
variables with means $\beta_0 + \beta_1 x_1, \beta_0 + \beta_1 x_2, \ldots, \beta_0 + \beta_1 x_n$ and variance
$\sigma^2$. To find point estimators, we shall use the method of maximum
likelihood. The likelihood function is

$L(\beta_0, \beta_1, \sigma^2) = L(\beta_0, \beta_1, \sigma^2; y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\dfrac{1}{2\sigma^2}\left(y_i - \beta_0 - \beta_1 x_i\right)^2\right]$
and

$\ln L(\beta_0, \beta_1, \sigma^2) = -\dfrac{n}{2}\ln(2\pi) - \dfrac{n}{2}\ln \sigma^2 - \dfrac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2$

The partial derivatives of $\ln L(\beta_0, \beta_1, \sigma^2)$ with respect to $\beta_0, \beta_1, \sigma^2$ are
obtained and set equal to zero. We have three equations:

$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0$

$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)\, x_i = 0$

$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = n\sigma^2$

The first two equations are called the normal equations. Solving the above
equations we get

$\hat{\beta}_1 = \dfrac{\sum (y_i - \bar{y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2}$, $\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, $\quad \hat{\sigma}^2 = \dfrac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$

These are the maximum likelihood estimates of $\beta_0$, $\beta_1$ and $\sigma^2$
respectively. We notice that the $x_i$'s must be such that $\sum (x_i - \bar{x})^2 \neq 0$;
that is, there must be at least two distinct values for the $x_i$.
(Properties of point estimation, such as minimum variance, will be
shown later.)
b. The assumption is that the $Y_i$ are only pairwise uncorrelated;
that is, $\operatorname{cov}[Y_i, Y_j] = 0$ for all $i \neq j$, $i, j = 1, 2, \ldots, n$. Point estimation of
$\beta_0, \beta_1, \sigma^2$ and $\mu(x)$ for any $x$ will be discussed.
For this case, $Y_1, Y_2, \ldots, Y_n$ are pairwise uncorrelated random
variables with means $\beta_0 + \beta_1 x_1, \beta_0 + \beta_1 x_2, \ldots, \beta_0 + \beta_1 x_n$ and variance
$\sigma^2$. Since the joint density of the $Y_i$ is not specified, maximum-likelihood
estimators of $\beta_0$, $\beta_1$ and $\sigma^2$ cannot be obtained. In models
where the joint density of the observable random variables is not given,
a method of estimation called least squares can be utilized.
i.e. The values of $\beta_0, \beta_1$ that minimize the sum of squares

$\sum_{i=1}^{n} \left(Y_i - \beta_0 - \beta_1 x_i\right)^2$

are defined to be the least-squares estimators of $\beta_0$ and $\beta_1$.
From the normal equations shown above, we get

$\hat{\beta}_1 = \dfrac{\sum (Y_i - \bar{Y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$

The least-squares method gives no estimator for $\sigma^2$, but an estimator of $\sigma^2$
based on the least-squares estimators of $\beta_0$ and $\beta_1$ is

$\hat{\sigma}^2 = \dfrac{1}{n-2}\sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$
1.2 INTERVAL ESTIMATE
Point estimates are useful, yet they leave something to be desired. When the
point estimator under consideration had a probability density function, the
probability that the estimator actually equaled the value of the parameter
being estimated was zero (The probability that a continuous random variable
equals any one value is 0).
Hence it seems desirable that a point estimate should be accompanied by some
measure of the possible error of the estimate. That is, instead of making the
inference of estimating the true value of the parameter to be a point, we
might make the inference of estimating that the true value of the parameter is
contained in some interval (we are referring to interval estimation).
Interval estimate is an estimate constituting an interval of numbers rather
than a single number. An interval estimate is an interval believed likely to
contain the unknown population parameter. It conveys more information
than just the point estimate on which it is based.
Like point estimation, the problem of interval estimation is twofold.
There is the problem of finding interval estimators (we need
methods of finding a confidence interval).
There is the problem of determining good or optimal interval
estimators (we need criteria for comparing competing
confidence interval or for assessing the goodness of a
confidence interval).
An interval estimate of a population parameter $\theta$ is an interval of the form
$\hat{\theta}_1 < \theta < \hat{\theta}_2$, where $\hat{\theta}_1$ and $\hat{\theta}_2$ depend on the value of the statistic
$\hat{\Theta}$ for a particular sample and also on the sampling distribution of $\hat{\Theta}$.
e.g. a random sample of matriculation examination scores for students
entering B.A. Statistics at the University of Dar es Salaam in the year 2002
produces an interval $50 < \mu < 70$ within which we expect to find the true
average of all scores. The values of the end points, 50 and 70, will depend on the
computed sample mean $\bar{x}$ and the sampling distribution of $\bar{X}$.
As the sample size increases, we know that $\sigma^2_{\bar{X}} = \sigma^2/n$ decreases, and
consequently our estimate is likely to be closer to the parameter $\mu$,
resulting in a shorter interval. Thus the interval estimate indicates, by its
length, the accuracy of the point estimate.
Since different samples will generally yield different values of $\hat{\Theta}$ and,
therefore, different values of $\hat{\theta}_1$ and $\hat{\theta}_2$, we shall be able to determine
$\hat{\theta}_1$ and $\hat{\theta}_2$ such that $P(\hat{\theta}_1 < \theta < \hat{\theta}_2)$ is equal to any positive fractional value
we care to specify:

$P(\hat{\theta}_1 < \theta < \hat{\theta}_2) = 1 - \alpha$, for $0 < \alpha < 1$.

Then we have a probability of $1 - \alpha$ of selecting a random
sample that will produce an interval containing $\theta$.
The interval $\hat{\theta}_1 < \theta < \hat{\theta}_2$, computed from the selected sample, is then called
a $(1-\alpha)100\%$ confidence interval; the fraction $1 - \alpha$ is called the confidence
coefficient or the degree of confidence, and the end points $\hat{\theta}_1$ and $\hat{\theta}_2$ are
called the lower and upper confidence limits.
Note: 95% is the most commonly used confidence level.
e.g. it is better to be 95% confident that the average life of an LG
refrigerator is between 7 and 8 years than to be 99% confident that it is
between 4 and 11. We prefer a short interval with a high degree of
confidence.

Sometimes restrictions on the size of our sample prevent us from
achieving short intervals.
In practice, estimates are often given in the form of the estimate plus or
minus a certain amount. e.g. the National Bureau of Statistics, department
of labor statistics, may estimate the number of unemployed in a certain area
to be $5.7 \pm 0.2$ million at a given time, feeling rather sure that the actual
number is between 5.5 and 5.9 million.
Suppose that a random sample (1.2, 3.4, 0.6, 5.6) of four observations is
drawn from a normal population with an unknown mean $\mu$ and a known
standard deviation 3. The maximum likelihood estimate of $\mu$ is the mean of
the sample observations: $\bar{x} = 2.7$.
We wish to determine upper and lower limits which are rather certain to
contain the true unknown parameter value between them.
For a sample of size 4 from a normal distribution, $Z = \dfrac{\bar{X} - \mu}{3/2}$ will be normally
distributed with mean 0 and unit variance. Hence

$f(z) = \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}$

We can compute the probability that $Z$ will be between any two arbitrarily
chosen numbers. Consider 95%.

[Figure: standard normal density $f(z)$ with the central area $1-\alpha$ between $-z_{\alpha/2}$ and $z_{\alpha/2}$.]

Thus

$P[-1.96 < Z < 1.96] = \int_{-1.96}^{1.96} f(z)\,dz = 0.95$

Substituting for $Z$, we get

$P\!\left[-1.96 < \dfrac{\bar{X} - \mu}{3/2} < 1.96\right] = 0.95$

$P\!\left[\bar{X} - 1.96\,(3/2) < \mu < \bar{X} + 1.96\,(3/2)\right] = 0.95$

$P[2.7 - 2.94 < \mu < 2.7 + 2.94] = 0.95 \;\Rightarrow\; (-0.24,\ 5.64)$

The method for finding a confidence interval that has been used in the example
above is a general method. This technique is applicable in many important
problems, but in others it is not, because in those others it is either impossible
to find functions of the desired form or impossible to rewrite the derived
probability statements.
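A short sketch reproducing the interval above (added for illustration, not part of the original notes):

import numpy as np
from scipy.stats import norm

x = np.array([1.2, 3.4, 0.6, 5.6])
sigma = 3.0                      # known population standard deviation
z = norm.ppf(0.975)              # 1.959963... for a 95% interval

half_width = z * sigma / np.sqrt(len(x))
print(x.mean() - half_width, x.mean() + half_width)  # about (-0.24, 5.64)
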
1.2.1 Confidence Interval of MEAN
There are really two cases to consider, depending on whether or not $\sigma^2$ is
known.
Confidence Interval of Mean when the Population Standard Deviation is Known
The central limit theorem tells us that when we select a large random sample
from any population with mean $\mu$ and standard deviation $\sigma$, the sample
mean $\bar{X}$ is (at least approximately) normally distributed with mean $\mu$ and
standard deviation $\sigma/\sqrt{n}$. If the population itself is normal, $\bar{X}$ is normally
distributed for any sample size.
Transforming $Z$ to the random variable $\bar{X}$ with mean $\mu$ and standard
deviation $\sigma/\sqrt{n}$, we find that before the sampling there is a $1-\alpha$
probability that $\bar{X}$ will fall within the interval:

$\mu \pm z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$
Once we have obtained our random sample, we have a particular value $\bar{x}$.
This particular $\bar{x}$ either lies within the range of values specified by the
formula above or does not lie within this range.
Since the random sampling has already taken place and a particular $\bar{x}$ has
been computed, we no longer have a random variable and may no longer
talk about probabilities. We may say that we are 95% confident that $\bar{x}$ lies
within the interval (about 95% of the values of $\bar{X}$ obtained in a large
number of repeated samplings will fall within the interval).
Note: We cannot say that there is a 0.95 probability that $\mu$ is inside the
interval, because the interval $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ is not random, and neither is $\mu$.
The population mean $\mu$ is unknown to us but is a fixed quantity, not a
random variable.
We define $z_{\alpha/2}$ as the $z$ value that cuts off an area of $\alpha/2$ to the right.
A $(1-\alpha)100\%$ confidence interval for $\mu$ when $\sigma$ is known and sampling is
done from a normal population, or with a large sample:

$\bar{x} \pm z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$
The z value for a 90% CI is 1.645.
For a 99% CI it is 2.58 (using interpolation, 2.576).
For a 95% CI it is 1.96.
Note: When sampling from the same population, using a fixed sample
size, the higher the confidence level, the wider the interval.
e.g. an 80% CI for $\mu$ with $n = 25$, $\bar{x} = 122$ and $\sigma = 20$ is (116.88,
127.12), but the 95% CI is [114.16, 129.84];
the 80% interval is narrow compared to the 95% one.
This means a wider interval has more of a presampling chance of capturing
the unknown population parameter. If we want a 100% CI for a parameter, the
interval must be $(-\infty, \infty)$, since the probability of capturing the parameter is 1.
Such a probability is obtained by allowing Z to be anywhere from $-\infty$ to $\infty$.
If we want both a narrow interval and a high degree of confidence, we need to
have a large amount of information, because the larger the sample size the
narrower the interval.
When sampling from the same population, using a fixed confidence level, the
larger the sample size, n, the narrower the confidence interval.
Confidence Interval for Mean when the Standard Deviation is Unknown
In constructing confidence intervals for $\mu$, we assume a normal distribution
or a large sample size. The assumption of a known standard deviation was
necessary for theoretical reasons, so that we could use standard normal
probabilities in constructing intervals.
In reality, $\sigma$ is rarely known, because both $\mu$ and $\sigma$ are population
parameters and have to be estimated. When $\sigma$ is unknown we may use
the sample standard deviation, S, in its place. If the population is normally
distributed, the standardized statistic

$t = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$

has a t distribution (Student's distribution) with $n - 1$ degrees of freedom. The
degrees of freedom of the distribution are the degrees of freedom associated
with the sample standard deviation.
The distribution was discovered by Gosset, a scientist at the
Guinness brewery in Dublin, Ireland, in 1908. Gosset published under the name
"Student" because his employer restricted its workers from publishing under
their own names.
The t distribution resembles the standard normal distribution Z: it is
symmetric and bell shaped. However, it is flatter than Z in the middle part
and has wider tails.
The mean of a t distribution is zero. For df > 2, the variance of the t
distribution is equal to $df/(df - 2)$.
The mean of t is the same as the mean of Z, but the variance of t is larger than
the variance of Z. As df increases, the variance of t approaches 1 (that
of Z). The larger variance of t implies greater uncertainty compared to Z, since it
involves estimating two random quantities, $\bar{X}$ and S. Since there are many t
distributions, we need a standardized table of probabilities.
A $(1-\alpha)100\%$ confidence interval for $\mu$ when $\sigma$ is not known (assuming a
normally distributed population):

$\bar{x} \pm t_{\alpha/2}\,\dfrac{s}{\sqrt{n}}$

where $t_{\alpha/2}$ is the value of the t distribution with $n - 1$ degrees of freedom that
cuts off a tail area of $\alpha/2$ to its right.
Although the t distribution is the correct distribution to use whenever $\sigma$ is
unknown, when df is large we may use the standard normal distribution, e.g.
for a sample size of 200 (df is 199).
Estimation problems can be divided into two kinds:
Small sample problems (sample is less than 30)
Large sample problems (sample is 30 or more)
Example
A stock market analyst wants to estimate the average return on a certain
stock. A random sample of 15 days yields an average (annualized) return of
$\bar{x} = 10.37\%$ and a standard deviation of $s = 3.5\%$. Assuming a normal
population of returns, give a 95% confidence interval for the average return
on this stock.
Solution

$\bar{x} \pm t_{\alpha/2}\,\dfrac{s}{\sqrt{n}} = 10.37 \pm 2.145\,\dfrac{3.5}{\sqrt{15}} = [8.43,\ 12.31]$

Thus the analyst may be 95% sure that the average annualized return on the
stock is anywhere from 8.43% to 12.31%.
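A minimal sketch reproducing this t-based interval (added for illustration):

import numpy as np
from scipy.stats import t

n, xbar, s = 15, 10.37, 3.5
t_crit = t.ppf(0.975, df=n - 1)   # 2.1447... for 95% confidence

half_width = t_crit * s / np.sqrt(n)
print(xbar - half_width, xbar + half_width)  # about (8.43, 12.31)
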
Theorem: Error in estimating $\mu$
If $\bar{x}$ is used as an estimate of $\mu$, we can then be $(1-\alpha)100\%$ confident that the
error will not exceed $z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$.
Frequently, we wish to know how large a sample is necessary to ensure that
the error in estimating $\mu$ will not exceed a specified amount $e$. It means we
must choose $n$ such that $z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}} = e$.
Theorem: Sample size for estimating $\mu$
If $\bar{x}$ is used as an estimate of $\mu$, we can be $(1-\alpha)100\%$ confident that the
error will not exceed a specified amount $e$ when the sample size is

$n = \left(\dfrac{z_{\alpha/2}\,\sigma}{e}\right)^2$
1.2.2 Confidence Interval of the Difference between Two Means in Paired and Independent Samples
If we have two populations with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and
$\sigma_2^2$, respectively, a point estimator of the difference between $\mu_1$ and $\mu_2$ is
given by the statistic $\bar{X}_1 - \bar{X}_2$. To estimate $\mu_1 - \mu_2$, we shall select two
independent random samples, one from each population, of sizes $n_1$ and $n_2$,
and compute the difference $\bar{x}_1 - \bar{x}_2$ of the sample means.
If the independent sample means (sample sizes greater than 30) are selected from normal
populations, we can establish a confidence interval for $\mu_1 - \mu_2$ by
considering the sampling distribution of $\bar{X}_1 - \bar{X}_2$.
We know the sampling distribution of $\bar{X}_1 - \bar{X}_2$ is normal with

$\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2$ and $\sigma^2_{\bar{X}_1 - \bar{X}_2} = \dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}$

Then

$Z = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$

with probability

$P\!\left[-z_{\alpha/2} < \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} < z_{\alpha/2}\right] = 1 - \alpha$
Confidence Interval for $\mu_1 - \mu_2$; $\sigma_1^2$ and $\sigma_2^2$ Known
If $\bar{x}_1$ and $\bar{x}_2$ are the means of independent random samples of sizes $n_1$ and
$n_2$ from populations with known variances $\sigma_1^2$ and $\sigma_2^2$, respectively, a
$(1-\alpha)100\%$ confidence interval for $\mu_1 - \mu_2$ is given by

$(\bar{x}_1 - \bar{x}_2) - z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$

where $z_{\alpha/2}$ is the value leaving an area of $\alpha/2$ to the right.
For small samples we use the t distribution when the populations are
approximately normally distributed.
Example
A standardized chemistry test was given to 50 girls and 75 boys. The girls made an
average grade of 76 with a standard deviation of 6, while the boys made an average grade
of 82 with a standard deviation of 8. Find a 96% confidence interval for the difference of
means, where the first mean is the mean score of all boys and the second mean is the mean
score of all girls who might take this test. (Answer: $3.43 < \mu_1 - \mu_2 < 8.57$)
Small Sample Confidence Interval for $\mu_1 - \mu_2$; $\sigma_1^2 = \sigma_2^2$ Unknown
If $\bar{x}_1$ and $\bar{x}_2$ are the means of small independent random samples of sizes $n_1$
and $n_2$, respectively, from approximately normal populations with unknown
but equal variances, a $(1-\alpha)100\%$ confidence interval for $\mu_1 - \mu_2$ is given by

$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2}\, s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2}\, s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$

where $s_p$ is the pooled estimate of the population standard deviation and $t_{\alpha/2}$ is the t value
with $v = n_1 + n_2 - 2$ degrees of freedom, leaving an area of $\alpha/2$ to the right.

$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
Example
A course in statistics is taught to 12 students by the conventional classroom procedure. A
second group of 10 students was given the same course by means of programmed
materials. At the end of the semester the same examination was given to each group. The
12 students meeting in the classroom made an average grade of 85 with a standard
deviation of 4, while the 10 students using programmed materials made an average of 81
with a standard deviation of 5. Find a 90% confidence interval for the difference between
the population means, assuming the populations are approximately normally distributed
with equal variances. (Answer: $0.69 < \mu_1 - \mu_2 < 7.31$)
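An illustrative sketch verifying this pooled-variance interval (not part of the original notes):

import numpy as np
from scipy.stats import t

n1, xbar1, s1 = 12, 85.0, 4.0
n2, xbar2, s2 = 10, 81.0, 5.0

sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
t_crit = t.ppf(0.95, df=n1 + n2 - 2)  # 90% CI -> alpha/2 = 0.05

half = t_crit * sp * np.sqrt(1 / n1 + 1 / n2)
diff = xbar1 - xbar2
print(diff - half, diff + half)  # about (0.69, 7.31)
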
Small Sample Confidence Interval for $\mu_1 - \mu_2$; $\sigma_1^2 \neq \sigma_2^2$ Unknown
If $\bar{x}_1$ and $s_1^2$, and $\bar{x}_2$ and $s_2^2$, are the means and variances of small
independent samples of sizes $n_1$ and $n_2$, respectively, from approximately
normal distributions with unknown and unequal variances, an approximate
$(1-\alpha)100\%$ confidence interval for $\mu_1 - \mu_2$ is given by

$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$

where $t_{\alpha/2}$ is the t value with

$v = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$

degrees of freedom, leaving an area of $\alpha/2$ to the right.
Example
Records for the past 15 years have shown the average rainfall in a certain region of the
country for the month of May to be 4.93 centimetres, with a standard deviation of 1.14
centimetres. A second region of the country has had an average rainfall in May of 2.64
centimetres with a standard deviation of 0.66 centimetres during the past 10 years. Find a
95% confidence interval for the difference of the true average rainfalls in these two
regions, assuming that the observations come from normal populations with different
variances. (Answer: $2.02 < \mu_1 - \mu_2 < 2.56$)
Difference of Two Means when the Samples are not Independent and the
Variances of the Two Populations are not Necessarily Equal
This is the case when the observations in the two samples occur in pairs, so that
the two observations are related.
e.g. If we run a test for second year B.Sc. Statistics on a new ST 200 lecturer
using 22 students, the scores before and after form our two samples.
Observations in the two samples made on the same students are related and
hence form a pair. To determine the effectiveness of the new lecturer we have to
consider the differences of scores.
e.g. 2: investigating maize output using different fertilizers but the same
area/soil/land.
Confidence Interval for $\mu_D = \mu_1 - \mu_2$ for Paired Observations
If $\bar{d}$ and $s_d$ are the mean and standard deviation of the differences of $n$
random pairs of measurements, a $(1-\alpha)100\%$ confidence interval for
$\mu_D = \mu_1 - \mu_2$ is

$\bar{d} - t_{\alpha/2}\,\dfrac{s_d}{\sqrt{n}} < \mu_D < \bar{d} + t_{\alpha/2}\,\dfrac{s_d}{\sqrt{n}}$

where $t_{\alpha/2}$ is the t value with $v = n - 1$ degrees of freedom, leaving an area of
$\alpha/2$ to the right.
Example
Twenty college freshmen were divided into 10 pairs, each member of the pair having
approximately the same IQ. One of each pair was selected at random and assigned to a
statistics section using programmed materials only. The other member of each pair was
assigned to a section in which the professor lectured. At the end of the semester each
group was given the same examination and the following results were recorded.

Pair                   1   2   3   4   5   6   7   8   9  10
Programmed Material   76  60  85  58  91  75  82  64  79  88
Lecturer              81  52  87  70  86  77  90  63  85  83

Find a 98% confidence interval for the true difference in the two learning
procedures. (Answer: $-7.29 < \mu_D < 4.09$)
1.3.1 Confidence Interval of PROPORTION
Sometimes our interest is in a qualitative rather than a quantitative variable.
The interest may be in the relative frequency of occurrence of some characteristic in a
population, e.g. the proportion of the population who are users of Colgate.
A point estimator of the proportion $p$ in a binomial experiment is given by
the statistic $\hat{P} = X/n$, where $X$ represents the number of successes in $n$ trials.
Therefore, the sample proportion $\hat{p} = x/n$ will be used as the point estimate of
the parameter.
If $p$ is not expected to be too close to zero or 1, we can establish a
confidence interval for $p$ by considering the sampling distribution of $\hat{P}$.
Therefore, for $n$ large, the distribution of $\hat{P}$ is approximately normal with mean

$\mu_{\hat{P}} = E(\hat{P}) = E\!\left(\dfrac{X}{n}\right) = \dfrac{np}{n} = p$

and variance

$\sigma^2_{\hat{P}} = \sigma^2_{X/n} = \dfrac{\sigma^2_X}{n^2} = \dfrac{npq}{n^2} = \dfrac{pq}{n}$

so that

$Z = \dfrac{\hat{P} - p}{\sqrt{pq/n}}$
Theorem: Confidence Interval of $p$
A large-sample $(1-\alpha)100\%$ confidence interval for the population proportion $p$:

$\hat{p} \pm z_{\alpha/2}\,\sqrt{\dfrac{\hat{p}\hat{q}}{n}}$

where the sample proportion $\hat{p}$ is equal to the number of successes in the
sample, $x$, divided by the number of trials (the sample size) $n$, and
$\hat{q} = 1 - \hat{p}$.
Example
A market research firm wants to estimate the share that foreign companies
have in the Tanzania market for certain products. A random sample of 100
consumers is obtained, and it is found that 34 people in the sample are users
of foreign-made products; the rest are users of domestic products. Give a 95%
confidence interval for the share of foreign products in this market.
Solution
We have $x = 34$ and $n = 100$, so $\hat{p} = \dfrac{x}{n} = \dfrac{34}{100} = 0.34$.

$\hat{p} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}\hat{q}}{n}} = 0.34 \pm 1.96\sqrt{\dfrac{(0.34)(0.66)}{100}} = [0.2472,\ 0.4328]$
Thus, the firm may be 95% confident that foreign manufacturers control
anywhere from 24.72% to 43.28% of the market.
Suppose the firm is not happy with such a wide confidence interval. What
can be done about it? Answer: either increase the sample size or, if the sample
cannot be increased, reduce the confidence level, say to 90%.
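A minimal sketch of the proportion interval above (added for illustration):

import numpy as np
from scipy.stats import norm

x, n = 34, 100
p_hat = x / n
z = norm.ppf(0.975)  # 95% confidence

half = z * np.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - half, p_hat + half)  # about (0.2472, 0.4328)
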
Note: When estimating proportions using small samples, the binomial
distribution may be used in forming confidence intervals. Since the
distribution is discrete, it may not be possible to construct an interval with an
exact, prespecified confidence level such as 95% or 99%.
If $\hat{p}$ is used as an estimate of $p$, then we can be $(1-\alpha)100\%$ confident that the
error will not exceed a specified amount $e$ when the sample size is

$n = \dfrac{z_{\alpha/2}^2\,\hat{p}\hat{q}}{e^2}$

The assumption is that the error cannot exceed $e = z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$.

e.g. How large a sample is required if we want to be 95% confident that our
estimate of $p$ is within 0.02? Let $\hat{p} = 0.32$.
Solution

$n = \dfrac{(1.96)^2 (0.32)(0.68)}{(0.02)^2} \approx 2090$

Since the sample size is obtained after estimating $p$, sometimes it is not
possible to estimate $p$ ($p$ is not given and cannot be computed); in that case
the following conservative formula is used:

$n = \dfrac{z_{\alpha/2}^2}{4e^2}$

e.g. How large a sample is required if we want to be 95% confident
that our estimate of $p$ is within 0.02? Answer: 2401.
1.3.2 DIFFERENCE BETWEEN TWO
PROPORTIONS
Sometimes the interest is to find the difference of two proportions. E.g.
estimating a difference between proportions of people with skin problems
who are using medicated soap and proportions of people with no skin
problem but using medicated soap.
We have two populations, and the problem is to estimate $p_1$ from the first
population and $p_2$ from the second. The sample from the first population has
size $n_1$ and from the second $n_2$, with

$\hat{p}_1 = \dfrac{x_1}{n_1}$ and $\hat{p}_2 = \dfrac{x_2}{n_2}$

Their means are $p_1$ and $p_2$, and their variances are $\dfrac{p_1 q_1}{n_1}$ and $\dfrac{p_2 q_2}{n_2}$.
The interest is in $\hat{P}_1 - \hat{P}_2$, with

$\mu_{\hat{P}_1 - \hat{P}_2} = p_1 - p_2$ and $\sigma^2_{\hat{P}_1 - \hat{P}_2} = \dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}$

where

$Z = \dfrac{(\hat{P}_1 - \hat{P}_2) - (p_1 - p_2)}{\sqrt{p_1 q_1/n_1 + p_2 q_2/n_2}}$

Hence the confidence interval is given by

$(\hat{p}_1 - \hat{p}_2) - z_{\alpha/2}\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}} < p_1 - p_2 < (\hat{p}_1 - \hat{p}_2) + z_{\alpha/2}\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}$
Example
A poll is taken among the residents of a city and the surrounding country to
determine the feasibility of a proposal to construct a civic center. If 2400 of
5000 city residents favor the proposal and 1200 of 2000 country residents
favor it, find a 90% confidence interval for the true difference in the
fractions favoring the proposal to construct the civic centre.
Answer: $-0.1414 < p_1 - p_2 < -0.0986$. Since both ends of the interval are
negative, we can also conclude that the proportion of country residents
favoring the proposal is greater than the proportion of city residents
favoring the proposal.
THE FINITE POPULATION CORRECTION FACTOR
So far we have been assuming that the population is much larger than the
sample, that is, that we sample from an infinite population. In some cases the
sample is obtained from a finite population. In such cases, the standard error of our
estimate needs to be corrected to reflect the fact that the sample constitutes a
non-negligible fraction of the entire population.
When the size of the sample, n, constitutes at least 5% of the size of the
population, N, we have to use a finite-population correction factor and
modify the standard error of our estimator.
We need the correction factor because the ordinary standard error does not
account for the relative size of the sample with respect to the size of the
sampled population.
Consider $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$: as $n$ approaches $N$ the standard error should
approach zero, since uncertainty decreases, and when $n = N$ the standard
error should be zero, since we are dealing with the entire population. The
formula above says that the standard error is zero only when the sample size is
infinite.
For that reason we need some reduction, obtained by multiplying the standard error by
a finite-population correction factor.
Finite population correction factor:

$\sqrt{\dfrac{N - n}{N - 1}}$

Note: The correction factor is close to 1 when the sample size is small relative to
the population size. The expression approaches zero as the sample
size approaches the population size, as required.
A large-sample $(1-\alpha)100\%$ confidence interval for $\mu$ using a finite-population
correction:

$\bar{x} \pm z_{\alpha/2}\,\dfrac{s}{\sqrt{n}}\sqrt{\dfrac{N - n}{N - 1}}$

A large-sample $(1-\alpha)100\%$ confidence interval for $p$ using a finite-population
correction:

$\hat{p} \pm z_{\alpha/2}\,\sqrt{\dfrac{\hat{p}\hat{q}}{n}}\sqrt{\dfrac{N - n}{N - 1}}$
Example
A company has 1000 accounts receivable. To estimate the average amount
of these accounts, a random sample of 100 accounts is chosen. In the
sample, the average amount is $\bar{x} = 532.35$ units and the standard deviation is
$s = 61.22$ units. Give a 95% confidence interval for the average of all 1000
accounts.
Solution
The sampling fraction is $\dfrac{n}{N} = \dfrac{100}{1000} = 0.10$. Since the fraction is greater than 0.05, we
need to use a confidence interval with a finite-population correction factor:

$\bar{x} \pm z_{\alpha/2}\,\dfrac{s}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}} = 532.35 \pm 1.96\,\dfrac{61.22}{10}\sqrt{\dfrac{900}{999}} = 532.35 \pm 11.39 = [520.96,\ 543.74]$
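An illustrative sketch of the corrected interval (not part of the original notes):

import numpy as np
from scipy.stats import norm

N, n = 1000, 100
xbar, s = 532.35, 61.22

fpc = np.sqrt((N - n) / (N - 1))          # finite-population correction
half = norm.ppf(0.975) * (s / np.sqrt(n)) * fpc
print(xbar - half, xbar + half)           # about (520.96, 543.74)
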
1.3.1 CONFIDENCE INTERVAL OF VARIANCE
In some situations, our interest centers on the population variance (or
population standard deviation); this happens in production processes, queuing
processes and other situations.
To compute confidence intervals for the population variance, we must have
knowledge of the chi-square distribution, denoted by $\chi^2$. The chi-square distribution, like the
t distribution, has associated with it a degrees of freedom parameter, $df = n - 1$.
Note: the $\chi^2$ distribution is used to estimate the population variance, while the t
distribution is used to estimate the population mean.
Unlike the t distribution and the normal distribution, however, the chi-square
distribution is not symmetric.
Definition
- The chi-square distribution is the probability distribution of the sum of several
independent, squared standard normal random variables.
- It is a parametric test used for comparing a sample variance to
a theoretical population variance.
Since it is a sum of squares it cannot take a negative value, and
therefore the distribution is bounded on the left by zero. The distribution is
skewed to the right.
The mean of a chi-square distribution is equal to the degrees of freedom
parameter, df. The variance of a chi-square distribution is equal to 2(df).
The chi-square distribution looks more and more like a normal distribution
as df increases.
In sampling from a normal population, the random variable

$\chi^2 = \dfrac{(n-1)S^2}{\sigma^2}$

has a chi-square distribution with $n - 1$ degrees of freedom.
The probability that a random sample produces a $\chi^2$ value greater than
some specified value is equal to the area under the curve to the right of this
value.

[Figure: chi-square density with central area $1-\alpha$ between $\chi^2_{1-\alpha/2}$ and $\chi^2_{\alpha/2}$, with area $\alpha/2$ in each tail.]

We are asserting that

$P\!\left(\chi^2_{1-\alpha/2} < X^2 < \chi^2_{\alpha/2}\right) = 1 - \alpha$

Substituting for $\chi^2$ we get

$P\!\left(\dfrac{(n-1)S^2}{\chi^2_{\alpha/2}} < \sigma^2 < \dfrac{(n-1)S^2}{\chi^2_{1-\alpha/2}}\right) = 1 - \alpha$

A $(1-\alpha)100\%$ confidence interval for the population variance $\sigma^2$ (where the
population is assumed normal):

$\left[\dfrac{(n-1)S^2}{\chi^2_{\alpha/2}},\ \dfrac{(n-1)S^2}{\chi^2_{1-\alpha/2}}\right]$

where $\chi^2_{\alpha/2}$ is the value
of the chi-square distribution with $n - 1$ degrees of freedom that cuts off an
area of $\alpha/2$ to its right, and $\chi^2_{1-\alpha/2}$ is the value of the distribution that cuts
off an area of $\alpha/2$ to its left (equivalently, an area of $1-\alpha/2$ to its right).
Since the $\chi^2$ distribution is not symmetric, we cannot use equal values with
opposite signs (e.g. $\pm 1.96$) and must construct the confidence interval using
the two distinct tails of the distribution.
Example
In an automated process, a machine fills cans of coffee. If the average
amount filled is different from what it should be, the machine may be
adjusted to correct the mean. If the variance of the filling process is too high,
however, the machine is out of control and needs to be repaired. Therefore,
from time to time regular checks of the variance of the filling process are
made. This is done by randomly sampling filled cans, measuring their
amounts, and computing the sample variance. A random sample of 30 cans
gives an estimate $s^2 = 18{,}540$. Give a 95% confidence interval for the
population variance $\sigma^2$.
Solution
Degrees of freedom $= n - 1 = 30 - 1 = 29$; $\chi^2_{0.025} = 45.7$ and $\chi^2_{0.975} = 16.0$.
Using these values, the confidence interval is

$\left[\dfrac{29(18{,}540)}{45.7},\ \dfrac{29(18{,}540)}{16.0}\right] = [11765,\ 33604]$

We can be 95% sure that the population variance is between 11,765 and 33,604.
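An illustrative sketch using exact chi-square percentiles rather than rounded table values (not part of the original notes):

from scipy.stats import chi2

n, s2 = 30, 18540
df = n - 1

lower = df * s2 / chi2.ppf(0.975, df)  # chi-square value cutting 0.025 on the right
upper = df * s2 / chi2.ppf(0.025, df)  # chi-square value cutting 0.975 on the right
print(lower, upper)  # close to (11765, 33604) from the rounded table values
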
1.3.2 CONFIDENCE INTERVAL OF THE RATIO OF TWO VARIANCES
A point estimate of the ratio of two population variances $\sigma_1^2/\sigma_2^2$ is given
by the ratio $s_1^2/s_2^2$ of the sample variances. If $\sigma_1^2$ and $\sigma_2^2$ are the variances
of normal populations, we can establish an interval estimate of $\sigma_1^2/\sigma_2^2$ by
using the statistic

$F = \dfrac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$

whose sampling distribution is called the F distribution.
Theoretically, it is the ratio of two independent chi-square random variables, each
divided by its degrees of freedom:

$F = \dfrac{\chi_1^2/v_1}{\chi_2^2/v_2} = \dfrac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$

with $v_1 = n_1 - 1$ and $v_2 = n_2 - 1$ degrees of freedom.
The number of degrees of freedom associated with the numerator sample variance is
stated first, followed by that of the sample variance in the denominator. The curve of the
F distribution depends not only on the two parameters $v_1$ and $v_2$ but also on the
order in which we state them.
The distribution is similar to the chi-square in that it is not symmetric, and it is
represented similarly.
Writing $f_{\alpha}(v_1, v_2)$ for the $f$ value with $v_1$ and $v_2$ degrees of freedom, then

$f_{1-\alpha}(v_1, v_2) = \dfrac{1}{f_{\alpha}(v_2, v_1)}$

[Figure: F density with central area $1-\alpha$ between $f_{1-\alpha/2}$ and $f_{\alpha/2}$, with area $\alpha/2$ in each tail.]
We can establish a confidence interval for $\sigma_1^2/\sigma_2^2$ from

$P\!\left[f_{1-\alpha/2}(v_1, v_2) < F < f_{\alpha/2}(v_1, v_2)\right] = 1 - \alpha$

where $f_{1-\alpha/2}(v_1, v_2)$ and $f_{\alpha/2}(v_1, v_2)$ are the values of the F distribution with $v_1$
and $v_2$ degrees of freedom.
Substituting for F, we get

$P\!\left[f_{1-\alpha/2}(v_1, v_2) < \dfrac{S_1^2\,\sigma_2^2}{S_2^2\,\sigma_1^2} < f_{\alpha/2}(v_1, v_2)\right] = 1 - \alpha$

Hence the confidence interval is

$\dfrac{s_1^2}{s_2^2}\,\dfrac{1}{f_{\alpha/2}(v_1, v_2)} < \dfrac{\sigma_1^2}{\sigma_2^2} < \dfrac{s_1^2}{s_2^2}\, f_{\alpha/2}(v_2, v_1)$
Example
A standardized placement test in ST 205 was given to 11 females and 80
males. Females made an average grade of 82 with a standard deviation of 8,
while males made an average grade of 78 with a standard deviation of 7.
Find a 98% confidence interval for $\sigma_1^2/\sigma_2^2$ and $\sigma_1/\sigma_2$, where $\sigma_1^2$ and $\sigma_2^2$ are
the variances of the population grades for all females and males, respectively.
Assume the populations to be normal.
Solution
$n_1 = 11$, $n_2 = 80$, $s_1 = 8$, $s_2 = 7$. For 98%, $\alpha = 0.02$.
Reading from the table, $f_{0.01}(10, 79) \approx 2.47$ (this is an approximation, since
$v_2 = 79$ is not shown) and $f_{0.01}(79, 10) \approx 4$.

$\dfrac{64}{49}\,\dfrac{1}{2.47} < \dfrac{\sigma_1^2}{\sigma_2^2} < \dfrac{64}{49}\,(4)$

Taking the square roots of the two end points gives the interval for $\sigma_1/\sigma_2$.
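As an illustrative sketch (not in the original notes), the interval can be computed with exact F percentiles rather than table values:

from scipy.stats import f

n1, n2, s1, s2 = 11, 80, 8.0, 7.0
v1, v2 = n1 - 1, n2 - 1
alpha = 0.02

ratio = s1**2 / s2**2
lower = ratio / f.ppf(1 - alpha / 2, v1, v2)   # divide by f_{0.01}(10, 79)
upper = ratio * f.ppf(1 - alpha / 2, v2, v1)   # multiply by f_{0.01}(79, 10)
print(lower, upper)
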
ONE-SIDED CONFIDENCE INTERVALS
It is possible to construct confidence intervals with only one side. This is useful
when we are interested in finding an upper bound only or a lower bound only.
A right-hand $(1-\alpha)100\%$ confidence interval for $\mu$:

$\left(-\infty,\ \bar{x} + z_{\alpha}\,\dfrac{s}{\sqrt{n}}\right]$

A left-hand $(1-\alpha)100\%$ confidence interval for $\mu$:

$\left[\bar{x} - z_{\alpha}\,\dfrac{s}{\sqrt{n}},\ \infty\right)$

Note: $z_{\alpha}$ replaces $z_{\alpha/2}$ because we have only one side where an error of
probability $\alpha$ may take place in the estimation.
Topic 2: PROPERTIES OF ESTIMATORS
The sample statistics we discussed, $\bar{X}$, $S$, $\hat{P}$, as well as other sample
statistics, are used as estimators of population parameters. We may ask
ourselves: are some of the many possible estimators better, in some
sense, than others?
There are several criteria by which we can evaluate the quality of a statistic
as an estimator. We are going to discuss: unbiasedness, efficiency, sufficiency,
minimum variance, the Cramér-Rao inequality and consistency.
Unbiasedness
This is very important property that an estimator should possess. If we
take all possible samples of the same size from a population and calculate
their means, the mean x

of all these means will be equal to the mean


of the population.
Repeated samples are drawn by resampling while keeping the values of
the independent variables unchanged. Bias is often assessed by
characterizing the sampling distribution of an estimator.
Definition:
An estimator is said to be unbiased if its expected value is equal to the population parameter it estimates.
- An estimator $\hat{\theta}$ is said to be unbiased if $E(\hat{\theta}) = \theta$.
- $E(\bar{x}) = \mu$. This is to say that the sample mean $\bar{x}$ is an unbiased estimator of the population mean.
This is an important property of the estimator because it means that there is no systematic bias away from the parameter of interest.
Suppose we take the smallest sample observation as an estimator of the population mean $\mu$. It can easily be shown that this estimator is biased, since the smallest observation is less than the mean: its expected value must be less than $\mu$, i.e. $E(X_{(1)}) < \mu$. Thus the estimator is biased downwards.
The extent of bias (systematic deviation) is the difference between the expected value of the estimator and the value of the parameter:
$$Bias = E(\bar{X}) - \mu$$
In general,
$$B = E(\hat{\theta}) - \theta$$
Any systematic deviation of the estimator away from the parameter of interest is called bias. $\hat{\theta}$ is said to be unbiased if $E(\hat{\theta}) - \theta = 0$.
Note: In reality we usually sample once and obtain our estimate.
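Although the notes contain no code, the bias of the "smallest observation" estimator is easy to see by simulation. The sketch below is a minimal illustration with an arbitrarily chosen normal population (all numbers are assumptions):

```python
# Minimal simulation sketch (arbitrary population: Normal(mu=10, sigma=2)).
# The sample mean averages out to mu; the sample minimum averages below mu.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 2.0, 25, 100_000
samples = rng.normal(mu, sigma, size=(reps, n))

print("E(sample mean) ~", samples.mean(axis=1).mean())   # close to 10
print("E(sample min)  ~", samples.min(axis=1).mean())    # well below 10
```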
Consistency
An estimator is said to be consistent if its probability of being close to the parameter it estimates increases as the sample size increases.
The sample mean $\bar{x}$ is a consistent estimator of $\mu$ because the standard error of $\bar{x}$ is $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. As the sample size increases the standard error decreases, and hence the probability that $\bar{x}$ will be close to its expected value $\mu$ increases.
- A consistent estimator is one that concentrates in a narrower and narrower band around its target as the sample size increases indefinitely.
Mean Squared Error (MSE)
The mean squared error of an estimator is defined as
$$MSE(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]$$
We know that
$$Var(\hat{\theta}) = E\left[\left(\hat{\theta} - E(\hat{\theta})\right)^2\right]$$
(recall that $B = E(\hat{\theta}) - \theta$, where $B$ is the bias). Then
$$\begin{aligned}
MSE(\hat{\theta}) &= E\left[(\hat{\theta} - \theta)^2\right] \\
&= E\left[\left((\hat{\theta} - E(\hat{\theta})) + (E(\hat{\theta}) - \theta)\right)^2\right] \\
&= E\left[\left(\hat{\theta} - E(\hat{\theta})\right)^2\right] + 2B\,E\left[\hat{\theta} - E(\hat{\theta})\right] + B^2 \\
&= Var(\hat{\theta}) + B^2
\end{aligned}$$
since $E\left[\hat{\theta} - E(\hat{\theta})\right] = 0$.
MSE = variance of estimator + $(bias)^2$
If the estimator is unbiased, then $MSE(\hat{\theta}) = Var(\hat{\theta})$.
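As a quick numerical check of this identity (not in the original notes), the sketch below estimates the MSE of a deliberately biased estimator, $\bar{x} + 0.5$, by simulation and confirms that it equals variance plus squared bias:

```python
# Minimal sketch: verify MSE = Var + Bias^2 for the (hypothetical) biased
# estimator theta_hat = x_bar + 0.5 of a Normal mean mu = 10.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 10.0, 2.0, 25, 200_000
theta_hat = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1) + 0.5

mse = ((theta_hat - mu) ** 2).mean()
var_plus_bias2 = theta_hat.var() + (theta_hat.mean() - mu) ** 2
print(mse, var_plus_bias2)   # both ~ sigma^2/n + 0.25 = 0.41
```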
Example
A company has 4,000 employees whose average monthly wage comes to Tshs 480,000 with a standard deviation of Tshs 120,000. Let $\bar{x}$ be the mean monthly wage for a random sample of employees selected from this company. Find the mean and standard deviation of $\bar{x}$ for sample sizes of 40 and 100.
Solution
$N = 4{,}000$, $\mu = \text{Tshs } 480{,}000$ and $\sigma = \text{Tshs } 120{,}000$
For a sample size of 40:
The mean of the sample means is $\mu_{\bar{x}} = \text{Tshs } 480{,}000$.
$n = 40$ and $N = 4{,}000$, which gives $n/N = 0.01$. As this value is less than 0.05, the finite population correction factor is not considered.
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{120{,}000}{\sqrt{40}} = 18{,}973.67$$
For a sample size of 100:
The mean of the sample means is $\mu_{\bar{x}} = \text{Tshs } 480{,}000$.
$n = 100$ and $N = 4{,}000$, which gives $n/N = 0.025$. The value is less than 0.05, so there is no need for the correction factor.
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{120{,}000}{\sqrt{100}} = 12{,}000$$
From the example we learn that the mean of the sampling distribution of $\bar{x}$ equals the population mean regardless of the sample size. The standard deviation of $\bar{x}$, however, is affected by the sample size: as the sample size increases, it decreases.
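The computation is easy to script. The following minimal sketch (not from the notes) reproduces the two standard errors together with the $n/N < 0.05$ check:

```python
# Minimal sketch: standard error of the sample mean, with the n/N < 0.05
# rule of thumb used in the wage example.
import math

N, sigma = 4_000, 120_000.0
for n in (40, 100):
    se = sigma / math.sqrt(n)
    if n / N >= 0.05:                       # apply the fpc only when needed
        se *= math.sqrt((N - n) / (N - 1))
    print(n, n / N, round(se, 2))
```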
Efficiency (not quite the same as minimum variance)
(Remember that efficiency differs from consistency, because efficiency is based on relative comparison, i.e. comparisons between two estimators.)
Efficiency is a relative property. We say that one estimator is efficient relative to another. This means that the estimator has a smaller variance (and hence standard deviation) than the other. Efficiency is measured in terms of the size of the standard error of the statistic. Since an estimator is a random variable, it is necessarily characterized by a certain amount of variability. This is to say that some estimators may be more variable than others.
Definition
An estimator is efficient if it has a relatively small variance (and standard deviation).
If $\hat{\theta}_1$ and $\hat{\theta}_2$ are two unbiased estimators, $\hat{\theta}_1$ is more efficient than $\hat{\theta}_2$ if $Var(\hat{\theta}_1) < Var(\hat{\theta}_2)$. Usually the estimator is selected based on MSE.
For example
In large samples, the variance of the sample mean is $Var(\bar{x}) = \sigma^2/n$. As the sample size increases, the variance becomes smaller, so the estimate becomes more efficient.
Consider the probability distributions of two estimators A and B.
[Figure: two sampling distributions plotted against $x$; curve A is narrow, curve B is wide.]
Curve A shows the distribution of sample means; it is a more precise estimator as compared to curve B. Estimator A is biased, though it may yield an estimate that will be close to the true value (though it is likely to be wrong). Estimator B, though unbiased, can give estimates that are far away from the true value. As such we would prefer estimator A.
e.g. The sampling distributions of the mean and the median have the same mean, namely the population mean. However, the variance of the sampling distribution of the means is smaller than the variance of the sampling distribution of the medians. As such the sample mean is an efficient estimator of the population mean, while the sample median is an inefficient estimator.
More examples
- The sample mean $\bar{Y}$ is an unbiased estimator for the population mean.
- Given a random sample, the first observation $Y_1$ is an unbiased estimator for the population mean.
- Given that $\bar{Y}$ and $Y_1$ are both unbiased, which estimator is more efficient than the other?
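The question in the last bullet can be answered by simulation. The minimal sketch below (arbitrary population values) shows that $\bar{Y}$ has a far smaller variance than $Y_1$, so $\bar{Y}$ is the more efficient of the two unbiased estimators:

```python
# Minimal sketch: both Y_bar and Y_1 are unbiased for mu, but
# Var(Y_bar) = sigma^2/n is much smaller than Var(Y_1) = sigma^2.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 50.0, 10.0, 30, 100_000
samples = rng.normal(mu, sigma, size=(reps, n))

y_bar, y_1 = samples.mean(axis=1), samples[:, 0]
print("Var(Y_bar) ~", y_bar.var())   # ~ 100/30 = 3.33
print("Var(Y_1)   ~", y_1.var())     # ~ 100
```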
Sufficiency
An estimator is said to be sufficient if it contains all the information in the data about the parameter it estimates.
$\bar{x}$ is a sufficient statistic because it utilizes all the information a sample contains about the parameter to be estimated. We say $\bar{x}$ is a sufficient estimator of the population mean $\mu$. That means no other statistic can provide additional information about $\mu$. Another sufficient statistic is $\hat{p}$.
Cramer-Rao Inequality
Since an estimator with uniformly minimum mean-squared error rarely exists, a reasonable procedure is to restrict the class of estimating functions and look for estimators with uniformly minimum mean-squared error within the restricted class. One way of restricting the class of estimating functions is to consider only unbiased estimators and then, among the class of unbiased estimators, search for an estimator with minimum mean-squared error.
Definition: Uniformly minimum-variance unbiased estimator (UMVUE)
Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta)$. An estimator $\hat{\theta}^*$ is defined to be a uniformly minimum variance unbiased estimator of $\theta$ if and only if:
i. $E(\hat{\theta}^*) = \theta$ (that is, unbiased)
ii. $Var(\hat{\theta}^*) \le Var(\hat{\theta})$ for any other unbiased estimator $\hat{\theta}$
Derivation of a lower bound for the variance of an unbiased estimator
Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta)$, and let $\hat{\theta}$ be an unbiased estimator of $\theta$. We consider $f(x; \theta)$ as a probability density function that satisfies the following assumptions, called regularity conditions:
i. $\dfrac{d}{d\theta} \ln f(x; \theta)$ exists for all $x$ and all $\theta$
ii. $\dfrac{d}{d\theta} \displaystyle\int \cdots \int \prod_{i=1}^{n} f(x_i; \theta)\, dx_1 \cdots dx_n = \int \cdots \int \frac{d}{d\theta} \prod_{i=1}^{n} f(x_i; \theta)\, dx_1 \cdots dx_n$
iii. $\dfrac{d}{d\theta} \displaystyle\int \cdots \int l(x_1, \ldots, x_n) \prod_{i=1}^{n} f(x_i; \theta)\, dx_1 \cdots dx_n = \int \cdots \int l(x_1, \ldots, x_n) \frac{d}{d\theta} \prod_{i=1}^{n} f(x_i; \theta)\, dx_1 \cdots dx_n$
iv. $0 < E\left[\left(\dfrac{d}{d\theta} \ln f(X; \theta)\right)^2\right] < \infty$ for all $\theta$
The above assumptions are stated for a continuous density function; the same applies to a discrete density function, with sums in place of integrals.
Under the assumptions above,
$$Var(\hat{\theta}) \ge \frac{1}{n\,E\left[\left(\dfrac{d}{d\theta} \ln f(x; \theta)\right)^2\right]}$$
The above expression is what is called the Cramer-Rao inequality. The right hand side is called the Cramer-Rao lower bound for the variance of unbiased estimators.
The Cramer-Rao lower bound is a limit to the variance that can be attained by an unbiased estimator of a parameter of a distribution.
Given a certain estimator, we expect it to have a low mean squared error (MSE). But the question is: what is the smallest variance that can be attained by an unbiased estimator? An answer is given by the Cramer-Rao inequality.
Example
Let $X_1, \ldots, X_n$ be a random sample from the Poisson density $f(x; \lambda) = \dfrac{\lambda^x e^{-\lambda}}{x!}$ for $x = 0, 1, 2, \ldots$. Find the Cramer-Rao lower bound for the variance of an unbiased estimator of $\lambda$.
Solution
$$\frac{d}{d\lambda} \ln f(x; \lambda) = \frac{d}{d\lambda} \ln \frac{\lambda^x e^{-\lambda}}{x!} = \frac{d}{d\lambda}\left(x \ln \lambda - \lambda - \ln x!\right) = \frac{x}{\lambda} - 1$$
Therefore,
$$E\left[\left(\frac{d}{d\lambda} \ln f(X; \lambda)\right)^2\right] = E\left[\left(\frac{X}{\lambda} - 1\right)^2\right] = \frac{1}{\lambda^2} E\left[(X - \lambda)^2\right] = \frac{Var(X)}{\lambda^2} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}$$
Hence
$$Var(\hat{\lambda}) \ge \frac{1}{n(1/\lambda)} = \frac{\lambda}{n}$$
This is the Cramer-Rao lower bound.
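The sample mean attains this bound, since $Var(\bar{X}) = \lambda/n$ for a Poisson sample; the minimal simulation sketch below (with an arbitrary $\lambda$) checks this numerically:

```python
# Minimal sketch: for Poisson(lambda), Var(x_bar) should match the
# Cramer-Rao lower bound lambda/n, so x_bar is efficient.
import numpy as np

rng = np.random.default_rng(3)
lam, n, reps = 4.0, 50, 100_000
x_bar = rng.poisson(lam, size=(reps, n)).mean(axis=1)

print("Var(x_bar) ~", x_bar.var())   # ~ 0.08
print("CRLB       =", lam / n)       # 4/50 = 0.08
```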
2.6 Minimum Variance
An unbiased estimator whose variance attains the Cramer-Rao lower bound is a minimum variance (UMVUE) estimator, as defined in the previous section.
APPLICATION OF THE PROPERTIES OF ESTIMATORS
Normally Distributed Population
A normal population is symmetric, so its mean and median coincide.
Unbiasedness
Both the sample mean and the sample median are unbiased estimators of the population mean $\mu$.
Efficiency
The mean is more efficient than the sample median. This is because the variance of the sample median happens to be 1.57 times as large as the variance of the sample mean, i.e.
$$Var(\text{median}) = 1.57\,\frac{\sigma^2}{n}$$
Sufficiency
The sample mean is sufficient because its computation uses the entire data set. The median is not sufficient because it is found as the point in the middle of the data set, regardless of the exact magnitudes of all other data elements.
Consistency
The mean is also consistent.
Proportion
Unbiasedness
The sample proportion $\hat{P}$ is the best estimator of the population proportion: $E(\hat{P}) = p$.
It also has the smallest variance of all unbiased estimators of $p$.
Sample Variance, $S^2$
$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
It seems logical to divide the sum of squared deviations by $n$ rather than $n - 1$, because we are seeking the average squared deviation from the sample mean. The reasoning for using $n - 1$ is explained by the concept of degrees of freedom: if we divide by $n - 1$, $S^2$ is unbiased, while if we divide by $n$, $S^2$ becomes biased.
Note: While $S^2$ is an unbiased estimator of the population variance $\sigma^2$, the sample standard deviation $S$ is not an unbiased estimator of the population standard deviation $\sigma$. There is a small bias, but $S$ is still used as an estimator, relying on the fact that $S^2$ is the unbiased estimator of $\sigma^2$.
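The effect of the divisor is easy to demonstrate by simulation. The minimal sketch below (arbitrary normal population) shows the $n - 1$ divisor averaging to $\sigma^2$ while the $n$ divisor falls short:

```python
# Minimal sketch: E[S^2] with divisor (n-1) is sigma^2 = 9;
# dividing by n instead gives the biased value (n-1)/n * sigma^2.
import numpy as np

rng = np.random.default_rng(4)
sigma2, n, reps = 9.0, 10, 200_000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

print(samples.var(axis=1, ddof=1).mean())   # ~ 9.0 (unbiased)
print(samples.var(axis=1, ddof=0).mean())   # ~ 8.1 (biased downwards)
```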
Degrees of freedom
The number of degrees of freedom is equal to the total number of measurements (these are not always raw data points), less the total number of restrictions on the measurements. A restriction is a quantity computed from the measurements.
e.g. Given 10, 12, 16 and 18, the mean is 14. If the mean is known, only three of the values are free to vary; the last is determined, since we can solve $(10 + 12 + 16 + x)/4 = 14$ for $x$. So the degrees of freedom are $n - 1$.
If we have two samples and we know both of their means, the degrees of freedom become
$$df = (n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$$
Topic 3: TESTING OF HYPOTHESES
Concept of Hypothesis
A hypothesis is a proposition that we want to verify. Collection of relevant information is required; we process it using statistical techniques and then test the proposition. Hypotheses help us make proper decisions and are very helpful in examining the validity of theories. A hypothesis is not always necessary, except for a problem oriented study.
There are two types of hypothesis: null and alternative. Amir (1989) defines a null hypothesis as an assertion about one or more population parameters. This is the assertion we hold as true until we have sufficient statistical evidence to conclude otherwise.
Normally the null hypothesis is denoted by $H_0$. This is the hypothesis of no effect or no difference. Consider the current saga in Bagamoyo District Council, where some officials are accused of fund misuse. If they are taken before a magistrate, then (before the verdict) the persons are considered not to have committed fund misuse. So the statement "Bagamoyo District Council officials did not misuse funds" is the null hypothesis.
The alternative hypothesis, denoted by $H_1$, is the assertion of all situations not covered by the null hypothesis (Amir, 1989). Beri (2003) defines the alternative hypothesis as the opposite of the null hypothesis. From the example of the Bagamoyo saga, the alternative hypothesis is "Bagamoyo District Council officials misused funds".
Generally, whenever a null hypothesis is specified, the alternative hypothesis must also be specified. It should be noted that it is not possible for the null and alternative hypotheses both to be true at once. There are only two ways we can conclude on a proposition: either we do not reject the null hypothesis, which means the alternative hypothesis is untrue, or we reject the null hypothesis while accepting the alternative hypothesis.
It is possible to have two or more alternative hypotheses, but they should be tested one at a time against the null hypothesis.
In both the null and alternative hypotheses, sample statistics such as $\bar{x}$ and $\hat{p}$ are not used. Instead, population parameters such as $\mu$ and $p$ are used.
Example (statistical example)
Consider a drug manufacturing company that has installed a machine that automatically fills 5 grams into a small bottle. State the hypotheses for testing this claim.
Solution
At the beginning we assume that what the company claims is true. Thus:
$$H_0: \mu = 5$$
$$H_1: \mu \ne 5$$
Procedure in hypothesis testing
There are five steps involved in testing a hypothesis.
1. Formulate the hypotheses. This is the first step, where the two hypotheses $H_0$ and $H_1$ are set up.
2. Set up a suitable significance level. In testing the validity of a hypothesis we need a certain level of significance. The confidence with which a null hypothesis is rejected or not rejected depends upon the significance level used for the purpose. E.g. a significance level of 5% means that we run about a 5% risk of rejecting a true null hypothesis.
3. Select a test criterion. Selection of an appropriate statistical technique as a test criterion is the third step. There are many statistical tests, e.g. the z-test for $n > 30$ and the t-test for $n \le 30$. The test statistics normally used in hypothesis testing are $Z$, $t$, $F$ and $\chi^2$.
4. Compute. Computation of the test statistic and other necessary quantities.
5. Make a decision. This is the final step, where the statistical decision is made, involving the acceptance or rejection of the null hypothesis. This depends on whether the computed value of the test criterion falls in the region of acceptance or in the region of rejection at the given level of significance. The statement rejecting the hypothesis is stronger than the statement accepting it; it is much easier to prove something false than to prove it true.
Often, we wish to test the null hypothesis and see whether we can reject it in favour of the alternative hypothesis. In a test of the value of a population parameter we normally employ a test statistic.
A test statistic is a sample statistic computed from the data. The value of the test statistic is used in determining whether or not we may reject the null hypothesis.
We decide whether or not to reject the null hypothesis by following a rule called the decision rule.
The decision rule of a statistical hypothesis test is a rule that specifies the conditions under which the null hypothesis may be rejected.
By Josephat Peter - UDOM
Two types of errors in hypothesis testing
In testing a hypothesis there are four possibilities:
1. The hypothesis is true but our test leads to its rejection
2. The hypothesis is false but our test leads to its acceptance
3. The hypothesis is true and our test leads to its acceptance
4. The hypothesis is false and our test leads to its rejection
The first two lead to an erroneous decision. The first possibility leads to a Type I error ($\alpha$) and the second possibility leads to a Type II error ($\beta$).

                        State of Nature
Decision            $H_0$ is true              $H_0$ is false
Accept $H_0$        Correct decision           Type II error ($\beta$)
Reject $H_0$        Type I error ($\alpha$)    Correct decision

i.e.
$\alpha = P$(Reject $H_0$; $H_0$ is true)
$\beta = P$(Accept $H_0$; $H_0$ is false)
Note
The word "accept" above needs some qualification.
Usually, before carrying out the actual test to try to reject the null hypothesis, the probability that we will make a Type I error is known. This probability ($\alpha$) is preset small, say 0.05. Knowing the probability of making a Type I error, i.e. of rejecting a null hypothesis which should not be rejected, makes our rejection of a null hypothesis a strong conclusion.
We cannot say that we are accepting the null hypothesis, because we do not know the probability of making a Type II error ($\beta$), i.e. of failing to reject a false null hypothesis; this is a weak conclusion.
When we reject the null hypothesis, we feel fairly confident that the hypothesis should indeed be rejected. When we fail to reject the null hypothesis, we feel that we did not have enough evidence to reject it. Either the null hypothesis is indeed true, or more evidence is needed for it to be rejected.
We emphasize, however, that "accept" will mean that there is not enough evidence to reject the null hypothesis.
Note
The level of significance plays a big role in committing either of these two errors. If we choose a level of significance which is very small (we are avoiding a Type I error), we increase the probability of committing a Type II error. Similarly, if the level of significance is high (avoiding a Type II error), there is an increased chance of making a Type I error. The practical compromise is to choose a level of significance which is neither too small nor too big; the only way to reduce both errors at once is to increase the sample size.
Definition
The level of significance of a statistical hypothesis test is $\alpha$, the probability of committing a Type I error.
Definition
The rejection region of a statistical hypothesis test is the range of numbers that will lead us to reject the null hypothesis in case the test statistic falls within this range. The rejection region, also called the critical region, is defined by the critical points. The rejection region is designed so that, before the sampling takes place, our test statistic will have a probability $\alpha$ of falling within the rejection region if the null hypothesis is true.
[Figure: a two-tailed layout with a rejection region in each tail, the acceptance region in between, and the tabulated (critical) values marking the boundaries.]
Definition
The acceptance region is the range of values that will lead us not to reject the null hypothesis if the test statistic should fall within this region. The acceptance region is designed so that, before the sampling takes place, our test statistic will have a probability $(1 - \alpha)$ of falling in the acceptance region if the null hypothesis is true.
Tails of a test
The rejection region in a hypothesis test can be on both sides of the curve, with the non-rejection region in between the two rejection regions.
A hypothesis test with two rejection regions is called a two-tail test, and a test with one rejection region is called a one-tail test. The one rejection region can be on either side: right (right tail test) or left (left tail test).
How do we find out whether a particular test is a two-tail, right tail or left tail test?
Signs in the tails of a test

                    Two tail test     Left tail test      Right tail test
Sign in $H_0$       $=$               $=$ or $\ge$        $=$ or $\le$
Sign in $H_1$       $\ne$             $<$                 $>$
Rejection region    in both tails     in the left tail    in the right tail

e.g.
$$H_0: \mu = 45$$
$$H_1: \mu \ne 45$$
Note: We say that a statistical result is significant at the level of significance $\alpha$ if the result causes us to reject our null hypothesis when we carry out the test using level of significance $\alpha$.
Testing Hypotheses about the Mean (large sample)
Consider the problem of testing the hypothesis that the mean $\mu$ of a population with known variance $\sigma^2$ equals a specified value $\mu_0$, against the two sided alternative that the mean is not equal to $\mu_0$, i.e.
$$H_0: \mu = \mu_0$$
$$H_1: \mu \ne \mu_0$$
An appropriate statistic on which to base our decision criterion is the random variable $\bar{X}$. Using the significance level $\alpha$, it is possible to find two critical values, $\bar{x}_1$ and $\bar{x}_2$, such that the interval $\bar{x}_1 < \bar{x} < \bar{x}_2$ defines the acceptance region, and the two tails of the distribution, $\bar{x} < \bar{x}_1$ and $\bar{x} > \bar{x}_2$, constitute the critical region.
The critical region can be given in terms of z values by means of the transformation
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
Hence, with a given level of significance, the critical values are
$$-z_{\alpha/2} = \frac{\bar{x}_1 - \mu_0}{\sigma/\sqrt{n}}, \qquad z_{\alpha/2} = \frac{\bar{x}_2 - \mu_0}{\sigma/\sqrt{n}}$$
From the population we select a random sample of size n and compute the sample mean $\bar{x}$.
[Figure: normal curve centred at $\mu_0$ with acceptance region of probability $1-\alpha$ between $\bar{x}_1$ and $\bar{x}_2$ (equivalently between $-z_{\alpha/2}$ and $z_{\alpha/2}$) and tail areas of $\alpha/2$ on each side.]
Example
A company manufacturing automobile tyres finds that tyre life is normally distributed with a mean of 40,000 km and standard deviation of 3,000 km. It is believed that a change in the production process will result in a better product, and the company has developed a new tyre. A sample of 100 new tyres has been selected. The company has found that the mean life of these new tyres is 40,900 km. Can it be concluded that the new tyre is significantly better than the old ones, using a significance level of 0.01?
Solution
We are interested in testing whether or not there has been an increase in the mean life of the tyres.
Steps
1. $H_0: \mu = 40{,}000$ km; $H_1: \mu > 40{,}000$ km
2. The significance level is 0.01
3. The test criterion is the Z-test
4. Computation:
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{40{,}900 - 40{,}000}{3000/\sqrt{100}} = 3$$
5. The tabulated value is $z_{0.01} = 2.33$. Comparing with the computed z, we see that the computed z is greater than the tabulated z, so we reject the null hypothesis.
i.e. since $z_{computed} > z_{tabulated}$ we reject the null hypothesis that $\mu = 40{,}000$ km. That means the new tyre is significantly better than the old ones.
[Figure: right tail test with critical value $z = 2.33$.]
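A minimal sketch of this right-tailed z-test in Python (not from the notes; scipy is assumed available):

```python
# Minimal sketch: right-tailed one-sample z-test for the tyre example.
import math
from scipy import stats

mu0, sigma, n, x_bar, alpha = 40_000.0, 3_000.0, 100, 40_900.0, 0.01
z = (x_bar - mu0) / (sigma / math.sqrt(n))      # 3.0
z_crit = stats.norm.ppf(1 - alpha)              # 2.326
p_value = stats.norm.sf(z)                      # right-tail area

print(f"z = {z:.2f}, critical = {z_crit:.3f}, p = {p_value:.4f}")
print("reject H0" if z > z_crit else "do not reject H0")
```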
The power of a statistical test
The power of a statistical test, given as $1 - \beta = P$(reject $H_0$ when $H_0$ is false), measures the ability of the test to perform as required. $1 - \beta$ is called the power of the test.
When $1 - \beta$ is low (a value very close to zero) it is an indication that our hypothesis test is working poorly. In contrast, if $1 - \beta$ is large (very close to 1), we can be sure that our hypothesis test is working quite well.
The power of a statistical hypothesis test depends on the following factors:
- The power depends on the distance between the value of the parameter under the null hypothesis and the true value of the parameter in question. The greater this distance, the greater the power.
- The power depends on the population standard deviation. The smaller the population standard deviation, the greater the power.
- The power depends on the sample size used. The larger the sample, the greater the power.
- The power depends on the level of significance of the test. The smaller the level of significance $\alpha$, the smaller the power.
Testing Hypotheses about the Mean (small sample)
Small sample test statistic for the population mean $\mu$:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
When the population is normally distributed and the null hypothesis is true, the test statistic has a t distribution with $n - 1$ degrees of freedom.
Example
A manufacturer of electric batteries claims that the average capacity of a certain type of battery that the company produces is at least 140 ampere-hours, with a standard deviation of 2.66 ampere-hours. An independent sample of 20 batteries gave a mean of 138.47 ampere-hours. Test at a 5 percent significance level the null hypothesis that the mean life is 140 ampere-hours, against the alternative that it is lower. Can the manufacturer's claim be sustained on the basis of this sample?
Solution
$H_0$: The mean life of batteries is 140 ampere-hours
$H_1$: The mean life of batteries is < 140 ampere-hours
Level of significance: $\alpha = 0.05$
Test statistic: t
Computation:
$$t_{computed} = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{138.47 - 140}{2.66/\sqrt{20}} = \frac{-1.53}{2.66/4.47} = -2.57$$
$$t_{tabulated} = -t_{\alpha, n-1} = -t_{0.05, 19} = -1.729$$
[Figure: left tail test centred at $\mu = 140$, with critical value $-1.729$ and the computed value $-2.57$ falling in the rejection region.]
We reject the null hypothesis, since $t_{computed}$ is within the rejection region. Hence we conclude that the mean life is less than 140 ampere-hours; the manufacturer's claim cannot be sustained.
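A minimal sketch of this left-tailed t-test (not in the notes), reusing the battery figures:

```python
# Minimal sketch: left-tailed one-sample t-test for the battery example.
import math
from scipy import stats

mu0, s, n, x_bar, alpha = 140.0, 2.66, 20, 138.47, 0.05
t = (x_bar - mu0) / (s / math.sqrt(n))        # about -2.57
t_crit = stats.t.ppf(alpha, df=n - 1)         # about -1.729
p_value = stats.t.cdf(t, df=n - 1)            # left-tail area

print(f"t = {t:.2f}, critical = {t_crit:.3f}, p = {p_value:.4f}")
print("reject H0" if t < t_crit else "do not reject H0")
```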
Testing Hypotheses about the DIFFERENCE BETWEEN TWO POPULATION Means
A test about an individual mean is referred to as a one sample test. In some cases we are required to test whether there is any difference between two means; in such a case we need a sample from each group. This is known as a two sample test.
The procedure for testing the hypothesis is similar to that used in one-sample tests. Here we have two populations, and our concern is to test the claim as to a difference in their means.
e.g. the government may claim that there is no difference between the average monthly pension of its central and local government retired employees.
From the example we have the average monthly pension for central government employees $(\mu_1)$ and that of local government $(\mu_2)$. We take random samples of sizes $n_1$ and $n_2$ and determine their means $\bar{x}_1$ and $\bar{x}_2$, along with the sample standard deviations $s_1$ and $s_2$.
When $n > 30$ the Z statistic takes the following form:
$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \approx \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
where the sample variances are used when $\sigma_1$ and $\sigma_2$ are unknown. Under $H_0: \mu_1 = \mu_2$, the term $(\mu_1 - \mu_2)$ vanishes.
Example
A potential buyer wants to decide which of two brands of electric bulbs he should buy, as he has to buy them in bulk. As a specimen, he buys 100 bulbs of each of the two brands A and B. On using these bulbs, he finds that brand A has a mean life of 1,200 hours with a standard deviation of 50 hours, and brand B has a mean life of 1,150 hours with a standard deviation of 40 hours. Do the two brands differ significantly in quality? Use $\alpha = 0.05$.
Solution
Step 1
$$H_0: \mu_1 - \mu_2 = 0, \qquad H_1: \mu_1 \ne \mu_2$$
where $\mu_1$ = mean life of brand A bulbs and $\mu_2$ = mean life of brand B bulbs
Step 2
Level of significance = 0.05
Step 3
Test statistic = Z
Step 4: Computations
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{1200 - 1150}{\sqrt{\dfrac{50^2}{100} + \dfrac{40^2}{100}}} = \frac{50}{\sqrt{25 + 16}} = 7.81$$
Step 5: Decision
This is a two tail test; the critical Z value is $\pm 1.96$. The calculated Z value falls in the rejection region, so we reject the null hypothesis and conclude that the bulbs of the two brands differ significantly in quality.
When $n \le 30$, the t test is used:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\hat{\sigma}\sqrt{\dfrac{n_1 + n_2}{n_1 n_2}}}$$
where the pooled variance is
$$\hat{\sigma}^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}$$
with $n_1 + n_2 - 2$ degrees of freedom. Under $H_0: \mu_1 - \mu_2 = 0$ this reduces to
$$t = \frac{(\bar{x}_1 - \bar{x}_2)\sqrt{n_1 n_2}}{\hat{\sigma}\sqrt{n_1 + n_2}}$$
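A minimal sketch of this pooled small-sample t statistic (the summary numbers are hypothetical; the pooling follows the formula above, with the sample variances taken as given):

```python
# Minimal sketch: pooled two-sample t statistic from summary data
# (hypothetical numbers), following the formula above.
import math
from scipy import stats

n1, x1, s1 = 12, 102.5, 5.0     # hypothetical sample 1 summaries
n2, x2, s2 = 10, 98.0, 6.0      # hypothetical sample 2 summaries

pooled_var = (n1 * s1**2 + n2 * s2**2) / (n1 + n2 - 2)
t = (x1 - x2) / math.sqrt(pooled_var * (n1 + n2) / (n1 * n2))
p = 2 * stats.t.sf(abs(t), df=n1 + n2 - 2)   # two-tailed p-value
print(f"t = {t:.3f}, p = {p:.4f}")
```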
Testing Hypotheses for the population proportion (large sample)
We know that, when the sample size is large, the distribution of the sample proportion $\hat{P}$ may be approximated by a normal distribution with mean $p$ and standard deviation $\sqrt{pq/n}$; recall the conditions $np > 5$ and $nq > 5$. The test statistic we use is Z:
$$Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} \qquad \text{or} \qquad Z = \frac{x - np_0}{\sqrt{np_0 q_0}} \quad \text{(binomial approximation)}$$
We use $p_0$, the hypothesized value of $p$ under the null hypothesis.
Example
A commonly prescribed drug on the market for relieving nervous tension is believed to be only 60% effective. Experimental results with a new drug administered to a random sample of 100 adults who were suffering from nervous tension showed that 70 received relief. Is this sufficient evidence to conclude that the new drug is superior to the one commonly prescribed? Use $\alpha = 0.05$.
Solution
$$H_0: p = 0.6, \qquad H_1: p > 0.6$$
$\alpha = 0.05$
Critical region: $z > 1.645$
Computed:
$$z = \frac{x - np_0}{\sqrt{np_0 q_0}} = \frac{70 - 60}{\sqrt{(100)(0.6)(0.4)}} = 2.04$$
Decision: Reject the null hypothesis and conclude that the new drug is superior ($z_{computed} > z_{tabulated}$).
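The same test takes a few lines of Python. The sketch below (a minimal illustration, not from the notes) uses the binomial-approximation form of Z:

```python
# Minimal sketch: right-tailed z-test for a proportion (drug example).
import math
from scipy import stats

n, x, p0, alpha = 100, 70, 0.6, 0.05
z = (x - n * p0) / math.sqrt(n * p0 * (1 - p0))   # about 2.04
z_crit = stats.norm.ppf(1 - alpha)                # 1.645
print(f"z = {z:.2f}", "reject H0" if z > z_crit else "do not reject H0")
```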
Testing Hypotheses for the DIFFERENCE BETWEEN TWO proportions
The test statistic z for tests concerning the difference between two population proportions is
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}}$$
When $H_0: p_1 - p_2 = 0$, the test statistic z is
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\bar{p}_0 \bar{q}_0 \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where the pooled proportion is
$$\bar{p}_0 = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}, \qquad \bar{q}_0 = 1 - \bar{p}_0$$
e.g.
You obtain a large number of components made to an identical specification from two sources. You notice that some of the components are from the supplier's own plant in Msalato and some are from the plant located in Makuru. You would like to know whether the proportions of defective components are the same or whether there is a difference between the two. You take a random sample of 600 components from each plant and find that the rejection rate $\hat{p}_1$ is 0.015 for Msalato components, as compared to $\hat{p}_2 = 0.017$ for Makuru components. Set up the null hypothesis and test it at the 5 percent level of significance.
Solution
$$H_0: p_1 = p_2, \qquad H_1: p_1 \ne p_2$$
where $p_1$ and $p_2$ are the proportions of defective components from Msalato and Makuru respectively.
This is a two tail test.
The level of significance is 0.05; both samples are large.
Z tabulated is $\pm 1.96$.
Z computed:
$$\bar{p}_0 = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{(600)(0.015) + (600)(0.017)}{600 + 600} = \frac{9 + 10.2}{1200} = 0.016$$
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\bar{p}_0 \bar{q}_0 \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{0.015 - 0.017}{\sqrt{(0.016)(1 - 0.016)\left(\dfrac{1}{600} + \dfrac{1}{600}\right)}} = \frac{-0.002}{\sqrt{0.00005248}} = -0.276$$
We do not reject the null hypothesis, since the computed z does not fall in the rejection region. Thus, there is no difference in the rejection rates of components from Msalato and Makuru.
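A minimal sketch of this pooled two-proportion z-test (not in the notes), with the Msalato/Makuru numbers:

```python
# Minimal sketch: pooled two-proportion z-test (components example).
import math

n1 = n2 = 600
p1_hat, p2_hat = 0.015, 0.017
p0 = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)      # pooled proportion 0.016
se = math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se                        # about -0.28
print(f"z = {z:.3f}", "reject H0" if abs(z) > 1.96 else "do not reject H0")
```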
Testing Hypotheses for a POPULATION VARIANCE
Sometimes we may be interested to draw a conclusion on whether the population variance exceeds some level.
Test statistic for the population variance:
$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$$
where $\sigma_0^2$ is the value of the variance stated in the null hypothesis.
Testing with the chi-square distribution requires the assumption of a normally distributed population.
We reject the null hypothesis if chi-square computed > chi-square tabulated.
Example
A machine makes small metal plates that are used in batteries for electronic games. The diameter of a plate is a random variable with mean 5 mm. As long as the variance of the diameter of the plates is at most 1.00 mm², the production process is under control and the plates are acceptable. If, however, the variance exceeds 1.00, the machine must be repaired. An engineer collects a random sample of 31 plates and finds that the sample variance is $s^2 = 1.62$. Is there evidence that the variance of the production process is above 1.00?
Solution
The quality control engineer wants, therefore, to test the following hypotheses:
$$H_0: \sigma^2 \le 1.00, \qquad H_1: \sigma^2 > 1.00$$
$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{(31-1)(1.62)}{1.00} = 48.6$$
Reading the chi-square table with $\alpha = 0.05$ and 30 df, the value is 43.77: reject the null hypothesis.
For $\alpha = 0.025$, the chi-square tabulated is 46.98: reject the null hypothesis.
For $\alpha = 0.01$, the chi-square tabulated is 50.89: do not reject the null hypothesis.
So there is evidence that the variance of the production process exceeds 1.00 unless we insist on a significance level as small as 0.01; at the 0.05 and 0.025 levels the machine should be stopped and serviced.
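A minimal sketch (not from the notes) comparing the computed chi-square with critical values at the three levels used above:

```python
# Minimal sketch: right-tailed chi-square test for a variance (plates example).
from scipy import stats

n, s2, sigma0_sq = 31, 1.62, 1.00
chi2 = (n - 1) * s2 / sigma0_sq                  # 48.6
for alpha in (0.05, 0.025, 0.01):
    crit = stats.chi2.ppf(1 - alpha, df=n - 1)
    print(alpha, round(crit, 2), "reject" if chi2 > crit else "do not reject")
```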
Note:
As the degrees of freedom increase, $\chi^2$ approaches a normal distribution with mean $df$ and variance $2df$ (from the central limit theorem).
e.g. if we have a chi-square random variable with 150 df, the normal approximation has mean 150 and standard deviation $\sqrt{300}$ (because the variance is twice the df). The critical values for the chi-square statistic are then approximately $150 \pm 1.96\sqrt{300}$ (if it is a two tails test), the tabulated Z being 1.96.
Testing Hypotheses for the difference in two VARIANCES
In measuring whether two independent populations have the same variability, we use the F-test, which is the ratio of the two sample variances. The populations are assumed to be normally distributed.
The F-test statistic for testing the equality of two variances is:
$$F = \frac{s_1^2}{s_2^2}$$
where $s_1^2$ is the variance of sample 1 and $s_2^2$ is the variance of sample 2.
The test statistic F follows an F distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom.
e.g.
Suppose a company manufacturing light bulbs is using two different processes, A and B. The life of the light bulbs of process A has a normal distribution with mean $\mu_1$ and standard deviation $\sigma_1$. Similarly, for process B, it is $\mu_2$ and $\sigma_2$. The data pertaining to the two processes are given below.

Sample A: $n_1 = 16$, $\bar{x}_1 = 1200$ hr, $\sigma_1 = 60$ hr
Sample B: $n_2 = 21$, $\bar{x}_2 = 1300$ hr, $\sigma_2 = 50$ hr

Test that the variability of the two processes is the same.
Solution
$$H_0: \sigma_1^2 = \sigma_2^2, \qquad H_1: \sigma_1^2 \ne \sigma_2^2$$
$\alpha = 0.05$
Test statistic is F
Computations:
$$F = \frac{s_1^2}{s_2^2} = \frac{n_1 \sigma_1^2/(n_1 - 1)}{n_2 \sigma_2^2/(n_2 - 1)} = \frac{16(60)^2/(16-1)}{21(50)^2/(21-1)} = \frac{57600/15}{52500/20} = \frac{3840}{2625} = 1.46$$
This is a two tail test, so 1.46 is compared with $F_{0.05, 15, 20} = 2.20$.
As 2.20 is greater than 1.46, we do not reject the null hypothesis, indicating that there is no significant difference in the variability of the two processes.
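A minimal sketch of this variance-ratio test (not in the notes); it converts the n-divisor standard deviations to sample variances exactly as in the computation above:

```python
# Minimal sketch: F-test for equality of two variances (bulb processes).
from scipy import stats

n1, sd1 = 16, 60.0
n2, sd2 = 21, 50.0
s1_sq = n1 * sd1**2 / (n1 - 1)        # 3840
s2_sq = n2 * sd2**2 / (n2 - 1)        # 2625
F = s1_sq / s2_sq                     # about 1.46
crit = stats.f.ppf(0.95, n1 - 1, n2 - 1)   # F_{0.05}(15, 20), about 2.20
print(f"F = {F:.2f}, critical = {crit:.2f}",
      "reject H0" if F > crit else "do not reject H0")
```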
THE p-VALUE
So far we have been arbitrarily specifying the level of significance. As such, mere acceptance or rejection of a hypothesis fails to show the full strength of the sample evidence. An alternative is to use the p-value approach.
Example
Let $n = 600$ and
$$H_0: p \le 0.0096, \qquad H_1: p > 0.0096$$
$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} = 0.519$$
If we let $\alpha = 0.10$, the critical point is $z = +1.28$, and we do not reject the null hypothesis.
Again, if $\alpha = 0.05$, the critical point is 1.645, and we do not reject the null hypothesis.
Question: would it be possible to reject the null hypothesis at a level even larger than 0.1?
Answer: we can simply compute the smallest possible $\alpha$ at which we may reject the null hypothesis.
That is: at what level of $\alpha$ can we reject the null hypothesis, given that the value of our test statistic is $z = 0.519$, if we insist on rejecting the null hypothesis?
[Figure: standard normal curve showing the test statistic value z = 0.519 and the rejection region (area = 0.10) to the right of 1.28.]
Note: the area to the right of 1.28 is 0.10, which is why we could reject the null hypothesis at the 0.1 level only if our computed test statistic were at least as large as 1.28.
From that concept, let us find the area to the right of the computed value, that is, the area to the right of z = 0.519. This area represents the smallest probability of a Type I error, the smallest possible level $\alpha$ at which we may reject the null hypothesis.
Reading z = 0.519 in the standard normal probabilities table gives 0.3018, i.e. 0.5 − 0.1982 (through interpolation), or 0.3015 if we take the nearest tabulated value. This number is called the p-value. The number 0.3018 means that, assuming the null hypothesis is true, there is a 0.3018 probability of obtaining a test statistic value as extreme as the one we have (0.519) or more extreme, i.e. further to the right (one tail) of z = 0.519. Since 0.3018 is greater than the conventional levels 0.1, 0.05 and 0.01, we accept the null hypothesis.
[Figure: standard normal curve with the p-value = 0.3018 shown as the area to the right of the test statistic z = 0.519 (right hand tail test).]
Definitions
The p-value is the observed level of significance, which is the smallest value at which $H_0$ can be rejected.
Or:
The p-value is the probability of obtaining a value of the test statistic as extreme as, or more extreme than, the actual value obtained, when the null hypothesis is true.
p-value decision rule
1. If the p-value $\ge \alpha$, $H_0$ is not rejected [do not reject if $\alpha \le$ p-value]
2. If the p-value $< \alpha$, $H_0$ is rejected [reject if $\alpha >$ p-value]
Some rules of thumb developed by statisticians as aids in interpreting p-values:
- When the p-value is smaller than 0.01, the result is called very significant.
- When the p-value is between 0.01 and 0.05, the result is called significant.
- When the p-value is between 0.05 and 0.10, the result is considered by some as marginally significant (and by others as not significant).
- When the p-value is greater than 0.10, the result is considered by most as not significant.
The p-value gets smaller as the test statistic falls further away in the tail of the distribution. Thus, even if we are unable to compute the p-value, we may have an idea about its size. A value such as z = 120.97 implies that the p-value is an extremely small number, so we reject the null hypothesis with much conviction.
Conversely, the closer our test statistic is to the centre of the sampling distribution, the larger the p-value; hence we may be more convinced that we do not have enough evidence to reject the null hypothesis and should therefore accept it.
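Computing a p-value directly, rather than reading it from a table, takes one line. The sketch below (a minimal illustration, not from the notes) reproduces the 0.3018 figure for z = 0.519:

```python
# Minimal sketch: one-tailed (right hand) p-value for z = 0.519.
from scipy import stats

z = 0.519
p_one_tail = stats.norm.sf(z)        # area to the right: about 0.3019
p_two_tail = 2 * p_one_tail          # doubled for a two-tailed test
print(round(p_one_tail, 4), round(p_two_tail, 4))
```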
p-value for t, chi-square and other distributions
In the case of statistical tests where the sampling distribution of the test statistic is the t distribution, exact p-values cannot be obtained, because the table contains values for only a few selected standard levels of $\alpha$ such as 0.01 and 0.05.
In such situations we make relative statements about the p-value. For example, if we have a left hand tailed test with test statistic t = −2.4 and df = 15, we find that the value 2.4 for a t random variable with 15 degrees of freedom falls between the two values 2.131 and 2.602, corresponding to one-tail areas of 0.025 and 0.01 respectively. We may therefore conclude that the p-value is between 0.01 and 0.025.
The same applies for other distributions.
Two Tailed tests
In a two-tailed test, we find the p-value by doubling the area in the tail of the distribution beyond the value of the test statistic.
e.g. if the p-value is 0.1131 for one tail, then for two tails it is 2(0.1131) = 0.2262.
Recall the corresponding critical values:

Level of significance, $\alpha$     0.10           0.05           0.01
One-tailed test                     $\pm$1.28      $\pm$1.645     $\pm$2.326
Two-tailed test                     $\pm$1.645     $\pm$1.96      $\pm$2.576
TESTS INVOLVING FINITE POPULATIONS
For a finite population, use the correction factor
$$\sqrt{\frac{N - n}{N - 1}}$$
by multiplying the standard error, so long as the sample size, n, represents 5% or more of the population.
Sample size determination for hypothesis tests:
The minimum required sample size in hypothesis tests of $\mu$ to satisfy a given significance level and a given power is
$$n = \frac{(z_0 + z_1)^2 \sigma^2}{(\mu_1 - \mu_0)^2}$$
where $z_0$ and $z_1$ are the required z values determined by the probabilities $\alpha$ and $\beta$, respectively, and are used in their absolute value form. The values $\mu_0$ and $\mu_1$ are the population mean under the null hypothesis, and the value of the population mean under the alternative hypothesis at which the specified power is needed, respectively.
The minimum required sample size in hypothesis tests of p to satisfy a given significance level and a given power is
$$n = \left(\frac{z_0\sqrt{p_0 q_0} + z_1\sqrt{p_1 q_1}}{p_1 - p_0}\right)^2$$
where $z_0$ and $z_1$ are the required z values determined by the probabilities $\alpha$ and $\beta$, respectively, and are used in their absolute value form. The values $p_0$ and $p_1$ are the null-hypothesized population proportion and the value of p under the alternative hypothesis at which the stated power is needed, respectively.
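A minimal sketch implementing both sample-size formulas (all design values are hypothetical):

```python
# Minimal sketch: minimum sample sizes for given alpha and power
# (hypothetical design values), per the two formulas above.
import math
from scipy import stats

alpha, beta = 0.05, 0.10                 # power = 1 - beta = 0.90
z0 = abs(stats.norm.ppf(1 - alpha))      # one-tailed critical value
z1 = abs(stats.norm.ppf(1 - beta))

# Test of a mean: hypothetical sigma, mu0, mu1
sigma, mu0, mu1 = 12.0, 100.0, 105.0
n_mean = (z0 + z1) ** 2 * sigma ** 2 / (mu1 - mu0) ** 2
print("n for the mean test:", math.ceil(n_mean))

# Test of a proportion: hypothetical p0, p1
p0, p1 = 0.50, 0.60
n_prop = ((z0 * math.sqrt(p0 * (1 - p0)) + z1 * math.sqrt(p1 * (1 - p1)))
          / (p1 - p0)) ** 2
print("n for the proportion test:", math.ceil(n_prop))
```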
GOODNESS OF FIT TEST
So far we have been testing statistical hypotheses about single population parameters such as $\mu$, $\sigma^2$ and $p$. Now we consider testing whether a population has a specified theoretical distribution. The test is based upon how good a fit we have between the frequencies of occurrence of observations in an observed sample and the expected frequencies obtained from the hypothesized distribution. We need to estimate how accurately the fitted function approximates the observed distribution.
For binned data, one typically applies a $\chi^2$ statistic to estimate the fit quality. It should be noted that the application of the $\chi^2$ statistic is limited: the $\chi^2$ test is neither capable of nor expected to detect fit deficiencies for all possible problems. It is a powerful and versatile tool, but it should not be considered the ultimate solution to every goodness of fit problem.
e.g. Consider the tossing of a die. If we hypothesize that the toss is fair (uniform distribution of outcomes) and the die is tossed 120 times, then we expect each face to occur 20 times.

Faces        1    2    3    4    5    6
Observed    20   22   17   18   19   24
Expected    20   20   20   20   20   20

By comparing the observed frequencies with the corresponding expected frequencies, we must decide whether these discrepancies are likely to occur due to sampling fluctuations (the die is balanced), or whether the die is not honest and the distribution of outcomes is not uniform.
The appropriate statistic on which we base our decision criterion for an experiment involving k cells is defined as follows.
A goodness of fit test between observed and expected frequencies is based on the quantity
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
where $\chi^2$ is a value of a random variable whose sampling distribution is approximated very closely by the chi-square distribution. The symbols $O_i$ and $E_i$ represent the observed and expected frequencies, respectively, for the ith cell.
If the observed frequencies are close to the corresponding expected frequencies, the $\chi^2$ value will be small, indicating a good fit (do not reject $H_0$).
If the observed frequencies differ considerably from the expected frequencies, the $\chi^2$ value will be large and the fit is poor (reject $H_0$).
$\chi^2 > \chi^2_\alpha$ constitutes the critical region.
The decision criterion should be used only when each expected frequency is at least equal to 5.
The df depends on two factors: the number of cells in the experiment and the number of quantities obtained from the observed data that are necessary in the calculation of the expected frequencies.
The number of degrees of freedom in a chi-square goodness-of-fit test is equal to the number of cells minus the number of quantities obtained from the observed data which are used in the calculation of the expected frequencies.
E.g. Uniform Distribution example
In a uniform distribution, the probabilities for each expected value are the same. When the data are discrete, the expected value for each category is obtained by dividing the total number of observations by the number of categories. When the data are continuous, a suitable number of classes must first be determined; the total number of observations can then be divided equally among the classes.
Fifty students were randomly selected and asked to state their preference for one of five candy bars. The results are shown below.

Candy Bar    A    B    C    D    E
Number       8   12    9   11   10

Can we conclude that the students do not prefer one candy bar over another? In other words, can we conclude that the preference for candy bars is uniformly distributed?
Solution
If all candy bars were equal in terms of preference, the results from the survey would have been as follows:

Candy Bar    A    B    C    D    E
Number      10   10   10   10   10
Hypotheses
$H_0$: the five candy bars are equally preferred
$H_1$: the five candy bars are not equally preferred

Candy Bar    A    B    C    D    E
Number       8   12    9   11   10
Expected    10   10   10   10   10

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(8-10)^2}{10} + \frac{(12-10)^2}{10} + \cdots + \frac{(10-10)^2}{10} = 1$$
$$\chi^2_{5-1,\,\alpha} = \chi^2_{4,\,0.05} = 9.488$$
Since chi-square computed < chi-square tabulated, we accept the null hypothesis.
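scipy performs the same computation directly. The sketch below (a minimal illustration, not from the notes) reproduces the statistic and adds its p-value:

```python
# Minimal sketch: chi-square goodness-of-fit test for the candy-bar data.
from scipy import stats

observed = [8, 12, 9, 11, 10]
expected = [10, 10, 10, 10, 10]           # uniform preference
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.1f}, p = {p:.3f}")  # chi2 = 1.0, large p: do not reject
```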
Test for Independence
- The chi-square test is also used to test for the independence of two variables.
- The observed frequencies are presented in a contingency table with r rows and c columns.
- The expected frequency for each cell is
$$E = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}}$$
and
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
where the summation extends over all r x c cells in the contingency table. If $\chi^2 > \chi^2_\alpha$ with $v = (r-1)(c-1)$ degrees of freedom, reject the null hypothesis of independence at the $\alpha$ level of significance; otherwise, accept the null hypothesis.
Test for Homogeneity (Several Proportions)
The chi-square statistic for testing independence is also applicable when testing the hypothesis that k binomial populations have the same parameter p. Hence we are interested in testing the hypothesis $H_0: p_1 = p_2 = \cdots = p_k$ against the alternative hypothesis that the population proportions are not all equal.
To perform this test we first select independent random samples of sizes $n_1, n_2, \ldots, n_k$ from the k populations and arrange the data in a 2 x k contingency table.
The expected cell frequencies are calculated as above and substituted, together with the observed frequencies, into the chi-square formula for independence,
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
with $v = (2-1)(k-1) = k-1$ degrees of freedom. The conclusion is reached by rejecting the null hypothesis when $\chi^2 > \chi^2_\alpha$.
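As a closing illustration (not from the notes), scipy's contingency-table helper carries out the whole computation; the 2 x 3 table of defectives below is hypothetical:

```python
# Minimal sketch: chi-square test that k = 3 binomial populations share
# the same proportion, via a 2 x k contingency table (hypothetical counts).
from scipy import stats

table = [[12, 18, 15],      # defective items per supplier
         [188, 182, 185]]   # non-defective items per supplier
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")  # df = k - 1 = 2
```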