
Chapter (II)

Definitions and Notation


This chapter is concerned with some important definitions and notation that will be used in this study. The first section deals with the properties of estimators, the second section is devoted to the different approaches to estimation, the third section reviews in brief some topics in hypothesis testing, the fourth section focuses on goodness-of-fit tests, and finally the fifth section presents some important distributions used in this thesis.

2.1 Estimation Theory


Estimation theory has a vital role in statistical inference, and it is divided into point estimation and interval estimation.
2.1.1 Point Estimation

In statistics, point estimation involves the use of sample data to calculate a single value (known as a statistic) which serves as a best guess for an unknown (fixed or random) population parameter.

Definition (2-1): Any statistic (a known function of the observed random variables) whose values are used to estimate a population parameter is called a point estimator.

Indeed, many statistics can be used to estimate any given population parameter; therefore we will discuss the criteria for preferring one estimator over another and how to obtain the best estimator, if it exists.

Definition (2-2) Unbiased Estimator:

The first criterion by which estimators can be classified is unbiasedness. Suppose θ̂ is a statistic computed from the observed random sample and considered as a point estimator for θ. We call θ̂ an unbiased estimator for θ iff E(θ̂) = θ. If the previous condition holds only for large sample sizes, we call θ̂ an asymptotically unbiased estimator for θ.

Definition (2-3) Relative Efficient Estimator:

Suppose θ̂₁ and θ̂₂ are two estimators for θ. If MSE(θ̂₁)/MSE(θ̂₂) < 1, then θ̂₁ is more efficient than θ̂₂, where MSE denotes the mean squared error. If the previous condition holds only for large sample sizes, then θ̂₁ is asymptotically more efficient than θ̂₂.

Definition (2-4) Consistent Estimator:

The statistic θ̂ is a consistent estimator for θ iff

\lim_{n \to \infty} P\big( |\hat{\theta} - \theta| < \xi \big) = 1 \quad \text{for every } \xi > 0

It is obvious that consistency is an asymptotic property, sometimes called convergence in probability. If θ̂ is an unbiased estimator for θ and Var(θ̂) tends to zero as the sample size grows, then θ̂ is a consistent estimator for θ.

Definition (2-5) Sufficient Statistic:

An estimator θ̂ is said to be a sufficient statistic iff it utilizes all the information in the sample relevant to the estimation of θ; that is, all the knowledge about θ that can be gained from the whole sample can just as well be gained from θ̂ alone. In mathematical form, θ̂ is a sufficient statistic iff the conditional probability distribution of the random sample given θ̂ does not depend on θ. The previous definition does not show how to find a sufficient statistic; the following theorem overcomes this problem.
Definition (2-6): The likelihood function is the joint density function of the whole sample, taking the following form:

l(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)

Theorem (2-1) Factorization Criterion: If the likelihood function of the random sample can be expressed as

l(x_1, \ldots, x_n; \theta) = g(\hat{\theta}, \theta)\, h(x_1, \ldots, x_n)

where h(x_1, ..., x_n) is a nonnegative function that does not depend on θ, then θ̂ is a sufficient statistic for θ.

Theorem (2-2): If the density function of the random sample can be expressed as

f(x; \theta) = a(\theta)\, b(x)\, \exp\big(c(\theta)\, d(x)\big)

meaning that the density belongs to the exponential family, then \sum_{i=1}^{n} d(x_i) is a sufficient statistic for θ. A sufficient statistic for θ is not unique; any one-to-one transformation of a sufficient statistic yields another sufficient statistic.
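
As a brief illustration (a standard textbook example, not taken from the original text), consider the exponential density with rate λ:

f(x; \lambda) = \lambda e^{-\lambda x} = a(\lambda)\, b(x)\, \exp\big(c(\lambda)\, d(x)\big), \qquad a(\lambda) = \lambda,\; b(x) = 1,\; c(\lambda) = -\lambda,\; d(x) = x

so by Theorem (2-2) the statistic \sum_{i=1}^{n} x_i (equivalently the sample mean) is sufficient for λ.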

Definition (2-7) Complete Statistic:

Suppose a random sample is drawn from f(x; θ) and let θ̂ be an estimator for θ. The family of densities of θ̂ is defined to be complete iff E{y(θ̂)} = 0 for all θ implies P{y(θ̂) = 0} = 1; a statistic is complete iff its family of densities is complete.

Definition (2-8) Robust Statistic:

An estimator is considered robust if its sampling distribution is not seriously affected by violations of the assumptions; such violations are often due to outliers.

So far it is not quite obvious how to choose between two estimators when, for instance, one is unbiased but not efficient or not sufficient; the following theorem gives some guidance for such comparisons.

Theorem (2-3) Rao-Blackwell: Assume θ̂₁ is a sufficient statistic for θ and θ̂₂ (not a function of θ̂₁) is an unbiased estimator for θ. If we form

L(\hat{\theta}_1) = E(\hat{\theta}_2 \mid \hat{\theta}_1)

then L(θ̂₁) is an unbiased estimator for θ, it is a function of the sufficient statistic, and it is at least as efficient as θ̂₂.


Theorem (2-3) helps to obtain a better estimator, but Theorem (2-4) tells how to obtain the best estimator, the unique minimum variance unbiased estimator (UMVUE).

Theorem (2-4) Lehmann-Scheffé: If θ̂ is a complete sufficient statistic for θ and the function T(θ̂) is an unbiased estimator for θ, then T(θ̂) is the UMVUE.

Cramér and Rao proposed an inequality that gives a lower bound for the variance of an unbiased estimator and indicates whether the UMVUE exists, as follows. Assume θ̂ is an unbiased estimator for θ; then

V(\hat{\theta}) \ge \frac{1}{E\left[ \left( \frac{d}{d\theta} \ln L(x;\theta) \right)^2 \right]}        (1-1)

If the two sides coincide, then θ̂ is the best estimator for θ. Indeed, the UMVUE for θ exists whenever the score function can be expressed as

\frac{d}{d\theta} \ln L(x;\theta) = b(\theta)\,\big(\hat{\theta} - \theta\big)
From inequality (1-1) we note the following points (a numerical illustration follows this list):
1. The lower bound in (1-1) can be written in another form:

V(\hat{\theta}) \ge \frac{1}{-E\left[ \frac{d^2}{d\theta^2} \ln L(x;\theta) \right]}

2. The denominator of inequality (1-1) is called the Fisher information I(θ); it is an index of the amount of information in the sample about θ. It is obvious that more information leads to more accuracy, that is, less variability.
3. If θ̂ is an unbiased estimator for θ and its variance attains the Cramér-Rao bound, then f(x; θ) belongs to the exponential family.
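
As a numerical illustration (a minimal sketch in Python, not part of the original text, assuming numpy is available and using the exponential distribution of Section 2.5.3 parameterized by its mean μ): the Fisher information of a sample of size n is n/μ², so the Cramér-Rao lower bound for an unbiased estimator of μ is μ²/n, and the simulated variance of the sample mean should be close to this bound.

import numpy as np

# Cramér-Rao illustration: for the exponential distribution parameterized
# by its mean mu, I(mu) = n / mu**2, so the lower bound for the variance
# of an unbiased estimator of mu is mu**2 / n.  The sample mean is an
# unbiased estimator, and its simulated variance should attain the bound.
rng = np.random.default_rng(0)
mu, n, reps = 2.0, 50, 20000
samples = rng.exponential(scale=mu, size=(reps, n))
xbar = samples.mean(axis=1)           # unbiased estimator of mu
print("simulated Var(xbar):", round(xbar.var(), 5))
print("Cramér-Rao bound   :", round(mu**2 / n, 5))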

2.1.2 Interval Estimation

In statistical inference, a confidence interval (CI) is an interval estimate of a population parameter. Instead of estimating the parameter by a single value, an interval likely to include the parameter is given. Thus, confidence intervals are used to indicate the reliability of an estimate; a shorter CI implies a better estimator.

Definition (2-9): Let X₁, X₂, ..., Xₙ be a random sample from f(x; θ), and assume L = t₁(x₁, x₂, ..., xₙ) and U = t₂(x₁, x₂, ..., xₙ) satisfy L < U and P(L < θ < U) = λ, where λ does not depend on θ; then (L, U) is a 100λ percent confidence interval for θ, and L and U are called the lower and upper confidence limits.

Typically, L and U are functions of the point estimator that was preferred for estimating θ. A confidence interval is a range of values that has a high probability of containing the estimated parameter, and it is considered good if it has a short length.
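
A minimal Python sketch (illustrative only, not part of the original text, assuming numpy and scipy are available) of a 95 percent confidence interval for a normal mean with unknown variance, where L and U are functions of the point estimator x̄:

import numpy as np
from scipy import stats

# Confidence-interval sketch: a 95% interval for a normal mean with
# unknown variance, using the t distribution with n - 1 degrees of freedom.
rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=30)
n, xbar, s = len(x), x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)
L = xbar - t_crit * s / np.sqrt(n)
U = xbar + t_crit * s / np.sqrt(n)
print(f"95% CI for the mean: ({L:.3f}, {U:.3f})")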

2.2 Methods of Estimation

In the statistical literature many methods of estimation have been proposed; this section is concerned with the three methods described in the following subsections.

2.2.1 Method of Moments (MOM)

It is difficult to trace back who introduced the MOM, but Johann Bernoulli (1667-1748) was the first to use the method in his work, see Gelder (2001). The idea of this method is to express the unknown parameters in terms of the unobserved population moments, for instance the mean, variance, skewness, kurtosis or coefficient of variation, and then to replace the unobserved moments with the observed sample moments.
Properties of MOM estimation (a short sketch follows this list):
1. The method of moments in general provides estimators which are biased but consistent for large sample sizes, and not efficient; they are often used because they lead to very simple computations.
2. Estimates by MOM may be used as a first approximation or as initial values for other methods that require iteration.
3. It is not unique: instead of using the raw moments we can use the central moments, and thereby obtain different estimates.
4. In some cases the MOM is not applicable, as for the Cauchy distribution.
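
As a short sketch of the method (illustrative, not from the original text, assuming a gamma model with shape k and scale θ and that numpy is available): the population moments give mean = kθ and variance = kθ², so matching them with the sample moments yields k̂ = x̄²/s² and θ̂ = s²/x̄.

import numpy as np

# Method-of-moments sketch for a gamma(k, theta) sample: equate the first
# two population moments (k*theta and k*theta**2) to the sample moments.
rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=1.5, size=500)
xbar, s2 = x.mean(), x.var(ddof=1)
k_hat = xbar**2 / s2          # shape estimate
theta_hat = s2 / xbar         # scale estimate
print("MOM estimates: k =", round(k_hat, 3), ", theta =", round(theta_hat, 3))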
2.2.2 Maximum Likelihood (MLE)

It is difficult to trace who discovered this tool, but Daniel Bernoulli (1700-1782) was the first to report on it, see Gelder (2001). The idea of this method is that the observed sample should be given a high probability of being drawn, so we search for the parameter values that achieve this goal, that is, the values that maximize the likelihood of the observed sample.

The likelihood function is the joint density function of the whole random sample, taking the following form:

l(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)

Equivalently, one may work with the log-likelihood, which makes the calculations easier with no loss of information:

\ln l(x_1, \ldots, x_n; \theta) = \sum_{i=1}^{n} \ln f(x_i; \theta)

The idea of the maximum likelihood method is to estimate θ by finding the value θ̂ that maximizes l(θ); θ̂ is then called the maximum likelihood estimator. In many cases θ̂ is obtained by solving the following equation:

\frac{d \ln l(\theta)}{d\theta} = 0        (1-2)

The maximum likelihood method can also be used to estimate more than one unknown parameter. It can be shown that θ̂ cannot be obtained from equation (1-2) if the following conditions (often called regularity conditions) are not valid:
1. The first and second derivatives of the log-likelihood function must be defined.
2. The range of the X's must not depend on the unknown parameter.
3. The Fisher information matrix must not be zero.
According to Gelder (1995), the properties of MLEs are as follows (a short sketch follows this list):
1. The MLE is asymptotically unbiased and consistent for the true parameter.
2. The MLE has a powerful property called invariance: if θ̂ is the MLE for θ, then g(θ̂) is the MLE for g(θ).
3. MLEs are asymptotically normally distributed.
4. If there is an efficient estimator for θ that achieves the Cramér-Rao lower bound, it must be the MLE.
5. The MLE may not exist; if it exists it is not necessarily in closed form, and it may not be unique.
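
A short sketch of the method (illustrative, not from the original text, assuming an exponential model with rate λ and that numpy and scipy are available): the log-likelihood is n ln λ − λ Σxᵢ, which is maximized numerically below and compared with the closed-form solution λ̂ = 1/x̄ obtained from equation (1-2).

import numpy as np
from scipy.optimize import minimize_scalar

# Maximum-likelihood sketch for an exponential sample: minimize the
# negative log-likelihood numerically and compare with the closed form.
rng = np.random.default_rng(3)
x = rng.exponential(scale=1 / 0.7, size=200)   # true rate 0.7

def neg_log_lik(lam):
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
print("numerical MLE:", round(res.x, 4))
print("closed form  :", round(1 / x.mean(), 4))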

2.2.3 Principle of Maximum Entropy (POME)

The principle of maximum entropy (POME) approach is a flexible and powerful tool for multiple purposes, for instance estimating a distribution's parameters, testing the goodness of fit of a null hypothesis, or deriving a probability distribution subject to given conditions. To understand POME it is necessary to discuss statistical entropy in information theory.

The origin of the entropy concept goes back to Ludwig Boltzmann (1877); the word is of Greek origin, meaning transformation. It was given a probabilistic interpretation in information theory by Shannon (1948), who considered entropy as an index of the uncertainty associated with a random variable. For a discrete distribution the entropy takes the following form:

H(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i)

The properties of Shannon's (1948) entropy are:
1. The quantity H(X) reaches a minimum, equal to zero, when one of the events is a certainty.
2. If some events have zero probability, they can just as well be left out of the entropy when we evaluate the uncertainty.
3. The entropy must be symmetric, that is, it does not depend on the order of the probabilities.
4. The entropy should be maximal when all the outcomes are equally likely.
For a continuous variable, Shannon's (1948) information entropy is defined as

H(X) = -\int_{-\infty}^{\infty} f(x) \ln f(x)\, dx = -E[\ln f(X)]

It is obvious that the differential entropy can be negative, so we cannot guarantee that all of Shannon's (1948) properties remain valid.

Entropy can be considered a measure of uncertainty. In this regard the entropy concept is similar to the variance of a random variable whose values are real numbers. The main difference is that entropy applies to qualitative rather than quantitative values and, as such, depends on the probabilities of the possible events, see Frenken (2003).

The idea of POME is to enlarge the parameter space and maximize the entropy with respect to both the parameters and the Lagrange multipliers; an important consequence of enlarging the parameter space is that the method is applicable to any distribution with any number of parameters.

According to Jaynes (1961) and Singh (1985), estimating parameters by POME requires the following steps:
1. Define the entropy for the available data.
2. Express the given or prior information as independent constraints.
3. Maximize the entropy function subject to the m independent constraints.
In mathematical form, it is required to solve:

\max\; H(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i)

subject to

\sum_{i=1}^{n} p(x_i) = 1

\sum_{i=1}^{n} g_j(x_i)\, p(x_i) = c_j, \qquad j = 1, \ldots, m

To optimize the entropy function subject to the constraints, we use the method of Lagrange multipliers as follows:

L = -\sum_{i=1}^{n} p(x_i) \ln p(x_i) - (\lambda - 1)\left( \sum_{i=1}^{n} p(x_i) - 1 \right) - \sum_{j=1}^{m} \mu_j \left( \sum_{i=1}^{n} g_j(x_i)\, p(x_i) - c_j \right)

\frac{\partial L}{\partial p(x_i)} = -\ln p(x_i) - \lambda - \sum_{j=1}^{m} \mu_j g_j(x_i) = 0

p(x_i) = \exp\left( -\lambda - \sum_{j=1}^{m} \mu_j g_j(x_i) \right)

Hence it remains to determine the relation between the Lagrange multipliers and the parameters of the data's distribution: substituting this p(x_i) back into the entropy function, differentiating H(X) with respect to the Lagrange multipliers, and solving the resulting m independent equations simultaneously yields the estimates. The previous algorithm can also be extended to continuous variables; a numerical sketch of the constrained maximization is given below.
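
A numerical sketch of the constrained maximization (illustrative, not from the original text; it uses the classic die example with a prescribed mean of 4.5 as the single constraint g(x) = x, and assumes numpy and scipy are available):

import numpy as np
from scipy.optimize import minimize

# Maximum-entropy sketch: find probabilities p(1),...,p(6) that maximize
# H = -sum p ln p subject to sum(p) = 1 and sum(i * p_i) = 4.5,
# using a general-purpose constrained solver.
values = np.arange(1, 7)

def neg_entropy(p):
    return np.sum(p * np.log(p))      # minimizing -H maximizes the entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},            # normalization
    {"type": "eq", "fun": lambda p: (values * p).sum() - 4.5}  # mean constraint
]
p0 = np.full(6, 1 / 6)
res = minimize(neg_entropy, p0, method="SLSQP",
               bounds=[(1e-9, 1.0)] * 6, constraints=constraints)
print("maximum-entropy probabilities:", np.round(res.x, 4))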

2.3 Hypotheses Testing

A statistical hypothesis test is a method of making statistical decisions using sample data and is considered a key technique of statistical inference. The aim of hypothesis testing is to decide whether the information in the sample leads us to accept or reject the doubtful hypothesis, called the null hypothesis H₀. In fact there are two types of hypotheses:
1. Parametric hypothesis: concerned with one or more constraints imposed upon the parameters of a certain distribution.
2. Non-parametric hypothesis: a statement about the form of the cumulative distribution function or probability function of the distribution from which the sample is drawn.
Hypotheses can also be classified as follows:
1. Simple hypothesis: the statistical hypothesis specifies the probability distribution completely, e.g.

H_0: f(x) = f_0(x)

2. Composite hypothesis: the statistical hypothesis does not specify the probability distribution completely, e.g.

H_0: f(x) \ne f_0(x), \qquad H_0: f(x) \le f_0(x), \qquad H_0: f(x) \ge f_0(x)

Definition (2-10): A statistical test is a rule or procedure for deciding whether or not to reject the null hypothesis based on the sampling distribution of a test statistic: if the test statistic's value lies in the critical region the decision is to reject the null hypothesis, otherwise the null hypothesis is accepted.

It is obvious that our decision is based on sample data, and therefore the decision is exposed to two kinds of errors:
1. Type I error α: committed when we reject H₀ although it is correct; α is also called the level of significance.
2. Type II error β: committed when we accept H₀ although it is wrong; the complement of this error, 1 − β, is called the power of the test.

Hence we need a statistical test that keeps the errors of the decision as small as possible. Unfortunately, for a fixed sample size, minimizing one of the errors maximizes the other, so there is a negative relation between the two errors.

To overcome this problem we fix the more serious error, the Type I error, and search for the statistical test with the minimum Type II error, the most powerful test.
Theorem (2-5): In the case of a simple hypothesis versus a simple alternative hypothesis, the most powerful test among all tests of size α or less takes the following form:

Reject H_0 if λ < k; accept H_0 if λ > k,

where

\lambda = \prod_{i=1}^{n} f_0(x_i) \Big/ \prod_{i=1}^{n} f_1(x_i)

and k is a positive constant.

The idea is that we calculate the ratio between the likelihood function under H_0 and under H_1: a high value points to accepting H_0, otherwise we reject H_0. This ratio is well known as the simple likelihood ratio or Neyman-Pearson lemma.
Definition (2-11): If it is required to test a simple hypothesis versus a composite alternative hypothesis then, among all tests of size α or less, the test that is most powerful against every alternative hypothesis is called the uniformly most powerful test and takes the following form:

Reject H_0 if Λ < c; accept H_0 if Λ > c,

where

\Lambda = \prod_{i=1}^{n} f_0(x_i) \Big/ \prod_{i=1}^{n} f_\Omega(x_i)

and c is a positive constant.

The idea is that we calculate the ratio between the likelihood function under H_0 and the likelihood over the whole parameter space Ω of θ, which is what f_Ω(x_i) denotes; this ratio is typically called the generalized likelihood ratio.

It is obvious that λ is a special case of Λ. The distribution of Λ corresponding to a particular null and alternative hypothesis is the sampling distribution of the test statistic, which in many cases is not easy to obtain. Fortunately, it has been proved that for any particular null and alternative hypothesis, −2 ln Λ has approximately a χ² distribution with degrees of freedom equal to the number of parameters tested in the null hypothesis.
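
A short sketch of the generalized likelihood ratio in practice (illustrative, not from the original text, assuming an exponential model and that numpy and scipy are available): one parameter is tested in the null hypothesis, so −2 ln Λ is referred to a chi-square distribution with one degree of freedom.

import numpy as np
from scipy import stats

# Likelihood-ratio sketch: testing H0: rate = 0.5 against the unrestricted
# alternative for exponential data; -2 ln(Lambda) is compared with chi2(1).
rng = np.random.default_rng(4)
x = rng.exponential(scale=1 / 0.5, size=100)   # data generated under H0

def log_lik(lam):
    return len(x) * np.log(lam) - lam * x.sum()

lam0, lam_hat = 0.5, 1 / x.mean()              # null value and MLE
lr_stat = -2 * (log_lik(lam0) - log_lik(lam_hat))
p_value = stats.chi2.sf(lr_stat, df=1)
print("-2 ln Lambda =", round(lr_stat, 3), ", p-value =", round(p_value, 3))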

2.4 Goodness of Fit Tests

Goodness-of-fit tests can be classified into two types: the first is based on the differences between the observed and the expected frequencies, and the second is based on the discrepancy between the empirical distribution function (EDF) and the hypothesized distribution function. Typically, applied research needs to test hypotheses related to the population's distribution; the test concerned with the agreement between the distribution of the sample values and a hypothesized distribution is called a "test of goodness of fit". The hypotheses related to goodness of fit are:

H_0: F(x) = F_0(x) for all x   versus   H_1: F(x) ≠ F_0(x) for some x

2.4.1 Chi-Square Goodness of Fit Test

In 1900 Pearson was among the earliest to test whether a drawn sample follows a specific distribution; he proposed squaring the differences between the observed and the expected frequencies to overcome the cancellation between positive and negative differences. The test statistic takes the following form:

\chi^2_{r-1} = \sum_{i=1}^{r} \frac{(O_i - E_i)^2}{E_i}

where O and E refer to the observed and expected frequencies respectively.
The exact distribution of \chi^2_{r-1} has a complicated form, so it has been proved that \chi^2_{r-1} has a limiting chi-square distribution with (r − 1) degrees of freedom, where r refers to the number of classes. In real life it is rare to have information about the distribution's parameters, so we resort to estimating them from the sample; in that case Pearson's test takes the following form:

\chi^2_{r-k} = \sum_{i=1}^{r} \frac{(O_i - \hat{E}_i)^2}{\hat{E}_i}

where O and \hat{E} refer to the observed and estimated expected frequencies respectively.

It has been proved that \chi^2_{r-k} has a limiting chi-square distribution with (r − k) degrees of freedom, where r refers to the number of classes and k to the number of estimated parameters.

Although the applied literature tends to prefer Pearson's test, Steele (2002) presents a selection of alternative chi-square type statistics in Table (2-1).

The conditions that guarantee Pearson's statistic has a limiting chi-square distribution are the following (a short usage sketch follows this list):

1. There is considerable debate about the number of classes that must be used to guarantee the limiting chi-square distribution. Mann and Wald (1942) suggest the following equation:

m = 4\,\big( 2n^2 / z_\alpha \big)^{1/5}

where m is the number of classes, z_α is the upper standard normal quantile and n is the sample size. Moore (1986) recommends that the optimal number of classes can be taken as

m' = 2 n^{2/5}

where m' is the optimal number of classes and n is the sample size.


2. Many researchers have made recommendations about how large or small the expected frequencies need to be for the chi-square approximation to Pearson's statistic to hold. Moore (1986) provides a review of the literature concerning the minimum and average expected frequencies in Table (2-2).
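
A short usage sketch (illustrative, not from the original text, assuming numpy and scipy are available): Pearson's statistic for r = 4 classes with fully specified expected frequencies, so the degrees of freedom are r − 1 = 3.

import numpy as np
from scipy import stats

# Chi-square goodness-of-fit sketch: observed counts in 4 classes against
# hypothesized expected counts; no parameters are estimated from the data.
observed = np.array([18, 27, 30, 25])
expected = np.array([25, 25, 25, 25])
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print("chi-square =", round(chi2_stat, 3), ", p-value =", round(p_value, 3))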
2.4.2 Kolmogorov-Smirnov Goodness of Fit Test (K-S)

One of the most useful and best known tests is the K-S test, introduced by Andrei Nikolayevich Kolmogorov in 1933; Smirnov in 1938 extended the K-S test to two samples, see Tucker (1996). It belongs to the second approach, based on the EDF. Indeed, EDF tests are divided into tests based on the maximum distance between the empirical distribution function and the null distribution function, such as the Kolmogorov-Smirnov (K-S) and Kuiper's V tests, and quadratic tests, such as the Anderson-Darling and Cramér-von Mises tests, see Thode (2002).
According to Thode (2002), the empirical distribution function is a step function calculated from the sample, commonly used to estimate the population's cumulative distribution function, and has the following form:

F_n(x) = \frac{\#\{\text{observations} \le x\}}{n}

It can be written in another well-known form:

F_n(x) = \begin{cases} 0 & x < X_{(1)} \\ i/n & X_{(i)} \le x < X_{(i+1)}, \; i = 1, \ldots, n-1 \\ 1 & x \ge X_{(n)} \end{cases}

where X_{(i)} denotes the i-th order statistic.
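
A minimal Python sketch of the step-function definition above (illustrative, not from the original text, assuming numpy is available):

import numpy as np

# EDF sketch: the proportion of observations not exceeding x.
rng = np.random.default_rng(5)
sample = np.sort(rng.normal(size=10))

def edf(x, data):
    return np.mean(data <= x)      # (# observations <= x) / n

for x in (-1.0, 0.0, 1.0):
    print(f"F_n({x:+.1f}) = {edf(x, sample):.2f}")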


It has been proved that

P\left( \sup_x |F_n(x) - F(x)| \to 0 \text{ as } n \to \infty \right) = 1

where F_n(x) is the empirical distribution function of a sample of size n and F(x) is the population cumulative distribution function.

It is concluded that, with probability one, the convergence of F_n(x) to F(x) is uniform in x. According to Mood (1978), the EDF has a limiting normal distribution with mean F(x) and variance F(x)(1 − F(x))/n; hence D = F_n(x) − F(x) has a limiting normal distribution with zero mean and variance F(x)(1 − F(x))/n. Kolmogorov (1933) therefore proposed his test statistic in the following form:

D_n = \sup_{-\infty < x < \infty} |F_n(x) - F(x)|

The D_n statistic is particularly useful in nonparametric statistical inference, because the probability distribution of D_n does not depend on the hypothesized cumulative distribution but only on the sample size. Smirnov (1948) and Massey (1951) provide the well-known tables for the K-S test, based on Monte Carlo simulation, for various sample sizes and levels of significance. Gibbons (1971) also gives the limiting distribution of this test for various levels of significance.
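
A short usage sketch of the K-S test (illustrative, not from the original text, assuming numpy and scipy are available); the null distribution is fully specified, with no parameters estimated from the sample, in line with point 3 of the comparison below:

import numpy as np
from scipy import stats

# Kolmogorov-Smirnov sketch: D_n for a fully specified standard normal null.
rng = np.random.default_rng(6)
x = rng.normal(loc=0.0, scale=1.0, size=60)
d_n, p_value = stats.kstest(x, "norm", args=(0.0, 1.0))
print("D_n =", round(d_n, 4), ", p-value =", round(p_value, 4))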

Some points mentioned in Massey (1951) about the comparison between the chi-square goodness-of-fit test and the Kolmogorov-Smirnov test are as follows:
1. The K-S test does not require the observations to be grouped, as is the case for the chi-square test, so the K-S test uses all the information in the sample.
2. The K-S test can be used for any sample size, in contrast to the chi-square test, because the exact distribution of the K-S statistic is known.
3. The K-S test is not applicable when parameters have to be estimated from the sample, in contrast to the chi-square test, which can be used in this case after reducing the degrees of freedom as mentioned above.
4. The K-S test is intended for continuous distributions; if it is applied to discrete distributions it becomes a conservative test, see Abdel assis (2005).

2.5 Some Important Distributions

In this section some well-known distributions that will be used in this thesis are briefly presented.

2.5.1 Normal Distribution

The normal distribution, also called the Gaussian distribution, is an important family of continuous probability distributions, applicable in many fields. Each member of the family is defined by two parameters, location and scale: the mean ("average", μ) and the variance (standard deviation squared, σ²) respectively. The standard normal distribution is the normal distribution with a mean of zero and a variance of one. The importance of the normal distribution as a model of quantitative phenomena in the natural and behavioral sciences is due in part to the central limit theorem.

If X has a normal distribution with mean μ and variance σ², the density function takes the following form:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \qquad -\infty < x < \infty

Table (2-4) summarizes the most important characteristics of the normal distribution.

Mean: μ
Median: μ
Mode: μ
Variance: σ²
Skewness: 0
Excess kurtosis: 0
Entropy: ln(σ√(2πe))
Moment-generating function (mgf): M_X(t) = exp(μt + σ²t²/2)

2.5.2 Uniform Distribution

In probability theory and statistics, the continuous uniform distribution is a family of probability distributions such that, for each member of the family, all intervals of the same length on the distribution's support are equally probable. This distribution is defined by two parameters, a and b, which are its minimum and maximum values respectively. It plays an important role in random number generation techniques. The distribution is often abbreviated U(a, b).

If X has a uniform distribution with minimum a and maximum b, the density function takes the following form:

f(x) = \begin{cases} \dfrac{1}{b - a} & a < x < b \\ 0 & \text{otherwise} \end{cases}

Table (2-5) summarizes the most important characteristics of the uniform distribution.

Mean: (a + b)/2
Median: (a + b)/2
Mode: any value in (a, b)
Variance: (b − a)²/12
Skewness: 0
Excess kurtosis: −6/5
Entropy: ln(b − a)
Moment-generating function (mgf): (e^{tb} − e^{ta}) / (t(b − a))

2.5.3 Exponential Distribution

In probability theory and statistics, the exponential distributions are a class of continuous probability distributions. They describe the times between events in a Poisson process; indeed, the exponential distribution is a special case of the gamma distribution. It is widely applied in lifetime models, biology, mechanics and hydrology.

If X has an exponential distribution with rate parameter λ > 0, the density function takes the following form:

f(x) = \begin{cases} \lambda e^{-\lambda x} & 0 < x < \infty \\ 0 & \text{otherwise} \end{cases}

Table (2-6) summarizes the most important characteristics of the exponential distribution.

Mean: 1/λ
Median: ln(2)/λ
Mode: 0
Variance: 1/λ²
Skewness: 2
Excess kurtosis: 6
Entropy: 1 − ln λ
Moment-generating function (mgf): (1 − t/λ)^{-1}
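
As a quick, illustrative sanity check of the tabulated moments (a minimal sketch, not part of the original text; the parameter values below are arbitrary and numpy is assumed to be available), the simulated means and variances of the three distributions can be compared with the formulas in Tables (2-4) to (2-6):

import numpy as np

# Sanity-check sketch: simulated means and variances of the three
# distributions of Section 2.5, compared with the tabulated values.
rng = np.random.default_rng(7)
n = 200_000
mu, sigma = 1.0, 2.0           # normal parameters
a, b = 0.0, 3.0                # uniform parameters
lam = 0.5                      # exponential rate
samples = {
    "normal     ": (rng.normal(mu, sigma, n), mu, sigma**2),
    "uniform    ": (rng.uniform(a, b, n), (a + b) / 2, (b - a) ** 2 / 12),
    "exponential": (rng.exponential(1 / lam, n), 1 / lam, 1 / lam**2),
}
for name, (x, mean_th, var_th) in samples.items():
    print(name, "mean", round(x.mean(), 3), "vs", mean_th,
          "| var", round(x.var(), 3), "vs", round(var_th, 3))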

