
Chapter (II)

Definitions and Notation


This chapter presents the main definitions and notation that will be used in this study. The first section reviews several approaches to estimation, the second section is devoted to some topics in hypothesis testing, the third section focuses on measures of information, the fourth section covers constrained optimization via Lagrange multipliers, and the fifth section briefly describes some important distributions.

2.1 Methods of Estimation


The problem of point estimation for a distribution's parameters plays a vital role in the statistical literature, and many methods of estimation have been proposed. This section is concerned with three of them.

1 Method of Moments
It is difficult to trace who introduced the method of moments (MOM), but Johann Bernoulli (1667-1748) was the first to use the method in his work; see Gelder (1997). The idea of the method is to express the unknown parameters in terms of the unobserved population moments (for instance the mean, variance, skewness, kurtosis, and coefficient of variation), and then to estimate those unobserved moments by the observed sample moments. The observed sample moments can typically take the following forms:
1. The moments about zero (raw moments): E(x^r)

2. The central moments: E((x − x̄)^r)

3. The standardized moments: E(((x − x̄)/σ)^r)

where r = 1, …, k and x̄, σ, and k refer to the mean, the standard deviation, and the number of parameters of the distribution to be estimated, respectively. Hence the method works by simultaneously solving a system of k equations relating the k unknown parameters to the k observed sample moments.
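As an illustration of the matching step, the following sketch equates the first raw moment and the second central moment of a Normal sample to their observed counterparts; the data and the true parameter values are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Hedged sketch: method-of-moments estimation for Normal(mu, sigma^2).
# E(X) and E(X - mu)^2 are equated to the observed sample moments.
rng = np.random.default_rng(0)                 # illustrative data: mu=5, var=4
sample = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu_hat = sample.mean()                         # matches E(X) with the sample mean
sigma2_hat = np.mean((sample - mu_hat) ** 2)   # matches E(X - mu)^2

print(mu_hat, sigma2_hat)                      # close to 5 and 4
```

Because the Normal has k = 2 parameters, two moment equations suffice; distributions with more parameters would require higher-order moments.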

2 Method of Maximum Likelihood
It is difficult to trace who discovered this tool, but Bernoulli, around 1700, was the first to report on it; see Gelder (1997). The idea is that the observed sample should be given a high probability of being drawn, so we search for the parameter values that maximize the likelihood function for that sample.

The likelihood function is the joint density function of a completely random sample and takes the following form:

L(x1…xn; θ) = ∏_{i=1}^{n} f(xi; θ)

The method of maximum likelihood estimates θ by finding the value θ̂ that maximizes L(x1…xn; θ); θ̂ is then called the maximum likelihood estimator (MLE). In many cases θ̂ is obtained by solving the following equation:

dL(x1…xn; θ)/dθ = 0        (2.1.1)

The maximum likelihood method can also be used to estimate k unknown parameters, in which case one solves a system of k equations in the k unknowns. It can be shown that θ̂ defined by (2.1.1) cannot be obtained unless the following conditions (often called regularity conditions) hold:
1. The first and second derivatives of the log-likelihood function must be defined.
2. The range of the X's does not depend on the unknown parameter.
Note: In many situations (2.1.1) cannot be solved easily; one can then apply a monotonic transformation that makes the calculation easier with no loss of information:

d ln{L(x1…xn; θ)}/dθ = d(∑_{i=1}^{n} ln f(xi; θ))/dθ
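As a concrete sketch (the exponential model and the parameter value are assumptions for illustration, not from the text), the log-likelihood of an Exponential(λ) sample is n ln λ − λ∑xi, and setting its derivative to zero gives the closed-form MLE λ̂ = 1/x̄:

```python
import numpy as np

# Sketch: MLE for an Exponential(lam) sample. The log-likelihood is
#   ln L = n*ln(lam) - lam*sum(x),
# and d(ln L)/d(lam) = n/lam - sum(x) = 0 gives lam_hat = 1 / mean(x).
rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / 2.5, size=50_000)  # illustrative true rate 2.5

lam_hat = 1.0 / x.mean()
print(lam_hat)                                   # close to 2.5
```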

3 Method of Least Squares
The method of least squares, or ordinary least squares (OLS), plays a vital role in statistical research, particularly regression analysis; it was proposed by Gauss, see Gelder (1997). Typically OLS is used to estimate the relation between two variables, known as the independent and dependent variables. Least squares problems fall into two categories, linear and non-linear. The linear least squares problem has a closed-form solution, while the non-linear problem does not and is usually solved by an iterative process. Furthermore, OLS can be applied with one or more independent variables; this study will focus on one independent variable.

Suppose Y1, Y2, …, Yn are pairwise uncorrelated random variables representing the dependent variable, and X1, X2, …, Xn represent the fixed independent variables; suppose the relation between the Y's and the X's is expressed as:

Yi = B0 + B1 Xi + Ui        i = 1, …, n

where the U's refer to the residuals of the model. OLS states that one should pick the values of the B's that make the sum of squared residuals as small as possible:

MIN E(Y, B's, X) = ∑_{i=1}^{n} (yi − B0 − B1 xi)²

Differentiating E(Y, B's, X) with respect to the B's yields:

dE(Y, B's, X)/dB0 = ∑_{i=1}^{n} −2 (yi − B0 − B1 xi) = 0
                                                            (2.1.2)
dE(Y, B's, X)/dB1 = ∑_{i=1}^{n} −2 xi (yi − B0 − B1 xi) = 0

It is easy to check that (2.1.2) gives a minimum; hence solving (2.1.2) yields:

b1 = (∑_{i=1}^{n} xi yi − n x̄ ȳ) / (∑_{i=1}^{n} xi² − n x̄²)

b0 = ȳ − b1 x̄
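The closed-form slope and intercept can be checked on simulated data; the line y = 1 + 3x and the noise level below are illustrative assumptions.

```python
import numpy as np

# Sketch: closed-form OLS estimates b1 and b0 for one regressor, checked on
# a noisy line y = 1 + 3x (illustrative values).
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=1_000)
y = 1.0 + 3.0 * x + rng.normal(0.0, 0.5, size=1_000)

n, xbar, ybar = len(x), x.mean(), y.mean()
b1 = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x**2) - n * xbar**2)
b0 = ybar - b1 * xbar
print(b0, b1)   # close to 1 and 3
```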

So far it is not obvious which method is more efficient than the others; to address this question we discuss some topics related to the properties of point estimators and confidence intervals.

Definition (2.1.1): In statistics, point estimation refers to the use of sample data to calculate a single value, known as a statistic: an observed function of the sample, where the function itself does not involve the parameter, which serves as a best guess for an unknown population parameter.

Definition (2.1.2) Unbiased Estimator: The first criterion by which estimators can be classified is unbiasedness. Suppose θ̂ is a statistic computed from an observed random sample and considered as a point estimator for θ; we call θ̂ an unbiased estimator for θ iff E(θ̂) = θ. If this condition holds only as the sample size grows large, θ̂ is called an asymptotically unbiased estimator for θ.

Definition (2.1.3) Relative Efficiency: Suppose θ̂1 and θ̂2 are two estimators for θ. If Var(θ̂1)/Var(θ̂2) < 1, where Var refers to the variance of the estimator, then θ̂1 is more efficient than θ̂2. If this condition holds in large samples, θ̂1 is asymptotically more efficient than θ̂2.

Definition (2.1.4) Consistent Estimator: The statistic θ̂ is a consistent estimator for θ iff:

lim_{n→∞} P(|θ̂ − θ| < ξ) = 1        for every ξ > 0

Consistency is clearly an asymptotic property and is sometimes called convergence in probability. If θ̂ is an unbiased estimator for θ and Var(θ̂) tends to zero as the sample size grows, then θ̂ is a consistent estimator for θ.

Definition (2.1.5) Sufficient Statistic: An estimator θ̂ is said to be a sufficient statistic iff it utilizes all the information in the sample relevant to the estimation of θ; that is, all the knowledge about θ that can be gained from the whole sample can just as well be gained from θ̂ alone. In mathematical form, θ̂ is a sufficient statistic iff the conditional distribution of the random sample given θ̂ is independent of θ. Since this definition does not show how to obtain a sufficient statistic, the following theorem addresses that problem.

Theorem (2.1.1) Factorization Criterion: If the likelihood function of the random sample can be expressed as L(x1…xn; θ) = g(θ̂, θ) h(x1…xn), where h(x1…xn) is a nonnegative function that does not depend on θ, then θ̂ is a sufficient statistic for θ.

Definition (2.1.6) Exponential Family of Densities: The probability density function f(x, θ) is a member of the exponential family if f(x, θ) can be written in the following form:

f(x, θ) = a(θ) b(x) exp{c(θ) h(x)}

One advantage of f(x, θ) belonging to the exponential family is that ∑_{i=1}^{n} h(xi) is a sufficient statistic. Cramér and Rao proposed an inequality that gives a lower bound for the variance of an unbiased estimator and indicates whether a Uniformly Minimum Variance Unbiased Estimator (UMVUE) exists. Assume θ̂ is an unbiased estimator for θ; then:

V(θ̂) ≥ 1 / E[(d ln L(x1…xn; θ)/dθ)²]        (2.1.3)

If the two sides coincide, then θ̂ is the best estimator for θ; indeed the UMVUE for θ exists provided the likelihood function can be expressed as follows:

d ln L(x1…xn; θ)/dθ = b(θ)(θ̂ − θ)
From inequality (2.1.3) we note the following points:
1. The lower bound for the UMVUE can be written in another form:

V(θ̂) ≥ 1 / (−E[d² ln L(x1…xn; θ)/dθ²]) = 1 / E[(d ln L(x1…xn; θ)/dθ)²]

2. The denominator of (2.1.3) is called the Fisher information I(θ), an index of the amount of information about θ contained in the sample. Clearly, more information leads to more accuracy, meaning less variability.
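A small simulation can make the bound concrete. For Normal(μ, σ²) with known σ, the Fisher information of n observations is I(μ) = n/σ², so the Cramér-Rao lower bound is σ²/n; the sample mean is unbiased and attains it. The parameter values below are illustrative assumptions.

```python
import numpy as np

# Sketch: for Normal(mu, sigma^2) with KNOWN sigma, I(mu) = n / sigma^2,
# so the Cramer-Rao lower bound is sigma^2 / n. The sample mean is an
# unbiased estimator that attains this bound.
rng = np.random.default_rng(3)
n, sigma = 25, 2.0
crlb = sigma**2 / n                    # = 0.16

means = rng.normal(0.0, sigma, size=(20_000, n)).mean(axis=1)
print(means.var(), crlb)               # the two should agree
```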

Definition (2.1.8): In statistical inference, a confidence interval is an interval estimate of a population parameter. Instead of estimating the parameter by a single value, an interval likely to include the parameter is given. Let X1, X2, …, Xn be a random sample from f(x; θ), and assume L = t1(x1, x2, …, xn) and U = t2(x1, x2, …, xn) satisfy L < U and P(L < θ < U) = λ, where λ does not depend on θ; then (L, U) is a 100λ percent confidence interval for estimating θ, where L and U are called the lower and upper confidence limits respectively.

Typically a confidence interval can be derived using a suitable pivotal quantity Q(X1, X2, …, Xn; θ), a function of X1, X2, …, Xn and θ whose distribution does not depend on θ. The lower and upper bounds of the confidence interval then depend only on the distribution of the pivotal quantity. A confidence interval is not unique, so our hope is to utilize a pivotal quantity that yields the shortest confidence interval.

Guenther (1969) discussed in depth how to obtain the shortest confidence interval for some well-known distributions. He showed that the shortest confidence interval can be derived by searching for two values (a, b) in the domain of the pivotal quantity, where a < b, which minimize the length of the interval (U − L) while satisfying P(a < Q(X1, X2, …, Xn; θ) < b) = λ.

Furthermore, Guenther (1969) concluded that confidence intervals based on the normal or t-distribution are the shortest, because of the symmetry of these distributions; confidence intervals based on the Chi-square distribution, however, are not, so he recommended using a table proposed by Tate and Klett (1959) for the shortest confidence interval based on the Chi-square distribution for different sample sizes and various levels of significance.

In mathematical form, the problem takes the following form:

Min Length = U − L

Subject to

∫_a^b f_Q(q) dq = λ

where f_Q denotes the density of the pivotal quantity.
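A sketch of the pivotal-quantity construction, under an assumed known-σ normal model with illustrative values: the pivot Q = (x̄ − μ)/(σ/√n) is standard normal whatever μ is, and by symmetry the equal-tailed interval is also the shortest; its coverage can be checked by simulation.

```python
import numpy as np

# Sketch: equal-tailed 95% CI for a normal mean with KNOWN sigma, built from
# the pivotal quantity Q = (xbar - mu) / (sigma / sqrt(n)) ~ N(0, 1).
rng = np.random.default_rng(4)
n, sigma, mu, z = 100, 3.0, 10.0, 1.959964   # z: upper 2.5% point of N(0,1)
half_width = z * sigma / np.sqrt(n)

# Coverage check: xbar +/- half_width should contain mu in about 95% of
# repeated samples.
samples = rng.normal(mu, sigma, size=(20_000, n))
centers = samples.mean(axis=1)
covered = np.mean((centers - half_width < mu) & (mu < centers + half_width))
print(covered)
```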

Following Gelder (1997), the properties of these estimation methods can be summarized as follows:
1. The method of moments generally provides estimators that are biased but consistent in large samples, and not efficient; they are often used because they lead to very simple computations. MOM estimates may be used as first approximations or initial values for methods that require iteration. The method is not unique: instead of the raw moments one can use the central moments and obtain different estimates, and in some cases, such as the Cauchy distribution, MOM is not applicable at all.
2. MLE estimates are asymptotically unbiased and consistent for the parameters. The MLE has a powerful property called invariance: if θ̂ is the MLE for θ, then g(θ̂) is the MLE for g(θ). MLE estimates are asymptotically normally distributed, so confidence intervals derived from MLE estimates can be considered the shortest when the sample size is large. If an efficient estimator for θ achieving the Cramér-Rao lower bound exists, it must be the MLE. On the other hand, the MLE may not exist; if it exists it need not have a closed form, and it may not be unique; see Gelder (1997).
3. The Gauss-Markov theorem states that when the errors have expectation zero conditional on the independent variables, are uncorrelated, and have equal variances, that is Var(Yi | Xi) = Var(Ui | Xi) = σ², the best linear unbiased estimator of any linear combination of the observations is its least-squares estimator. "Best" means that the least squares estimators of the parameters have minimum variance. Furthermore, if Y1, Y2, …, Yn follow a normal distribution, the least squares estimators are also the maximum likelihood estimators; see Mood (1976).

2.2 Hypothesis Testing

A statistical hypothesis test is a method of making a statistical decision using sample data; it is considered a key technique of statistical inference. The aim of hypothesis testing is to determine whether the information in the sample guides us to accept or reject the doubtful hypothesis, called the null hypothesis H0. There are two types of statistical hypotheses:
1. Parametric hypothesis: concerns one or more constraints imposed on the parameters of a certain distribution.
2. Non-parametric hypothesis: a statement about the form of the cumulative distribution function or probability function of the distribution from which the sample is drawn.
Hypotheses can also be classified as follows:
1. Simple hypothesis: the statistical hypothesis specifies the probability distribution completely:

H0: f(x) = f0(x)

2. Composite hypothesis: the statistical hypothesis does not specify the probability distribution completely, for example:

H0: f(x) ≠ f0(x),   H0: f(x) ≤ f0(x),   H0: f(x) ≥ f0(x)

Definition (2.2.1) Statistical Test: a rule or procedure for deciding whether or not to reject the null hypothesis based on its sampling distribution: if the test statistic's value lies in the critical region the decision is to reject the null hypothesis, otherwise to accept it.

Since the decision is based on sample data, it can be affected by two kinds of errors:
1. Type I error α: committed when we reject H0 although it is correct; α is also called the level of significance.
2. Type II error β: committed when we accept H0 although it is wrong; the complement of this error, 1 − β, is called the power of the test.

Hence we need a statistical test that keeps both decision errors as small as possible. Unfortunately, for a fixed sample size, minimizing one error maximizes the other, so there is a trade-off between the two errors.

To overcome this problem we fix the more serious error, the Type I error, and search for the statistical test with the minimum Type II error, the most powerful test.

Theorem (2.2.1): For a simple hypothesis versus a simple alternative, the most powerful test among all tests of level α or less takes the following form:

Reject H0 if λ < k;  accept H0 if λ > k

where λ = ∏_{i=1}^{n} f0(xi) ÷ ∏_{i=1}^{n} f1(xi) and k is a positive constant.

The idea is to calculate the ratio between the likelihood functions under H0 and H1: a high value supports accepting H0, otherwise it indicates rejecting H0. This result is well known as the simple likelihood ratio, or the Neyman-Pearson lemma.
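A small sketch of the Neyman-Pearson ratio, where the two simple normal hypotheses are illustrative assumptions: for H0: N(0,1) against H1: N(1,1), ln λ reduces to n/2 − ∑xi, so rejecting for small λ is the same as rejecting for a large sample mean.

```python
import numpy as np

# Sketch: Neyman-Pearson likelihood ratio for H0: N(0,1) vs H1: N(1,1).
# The (2*pi) normalizing factors cancel, and
#   ln(lambda) = sum(-x^2/2) - sum(-(x-1)^2/2) = n/2 - sum(x),
# so "lambda < k" is equivalent to "sample mean large".
def likelihood_ratio(x):
    return np.exp(np.sum(-0.5 * x**2) - np.sum(-0.5 * (x - 1.0) ** 2))

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=30)          # data drawn under H0 (illustrative)
lam = likelihood_ratio(x)
print(np.log(lam), len(x) / 2 - x.sum())   # the two should match
```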

Definition (2.2.2): When testing a simple hypothesis versus a composite alternative, the test that is most powerful against every alternative hypothesis, among all tests of level α or less, is called the uniformly most powerful test, and takes the following form:

Reject H0 if Λ < c;  accept H0 if Λ > c

where Λ = ∏_{i=1}^{n} f0(xi) ÷ ∏_{i=1}^{n} f_Ω(xi) and c is a positive constant.

Again the idea is to calculate the ratio between the likelihood functions under H0 and H1, where f_Ω(xi) is evaluated over the whole parameter space for θ; this ratio is typically called the generalized likelihood ratio.

It is obvious that λ is a special case of Λ. The distribution of Λ for a particular null and alternative hypothesis is obtained from the sampling distribution of the test statistic, which in many cases is not tractable; fortunately, it has been proved that for any particular null and alternative hypothesis, −2 ln Λ has approximately a χ² distribution with degrees of freedom equal to the number of parameters tested in the null hypothesis.
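The χ² approximation can be illustrated for the simple case H0: μ = 0 in a N(μ, 1) model, where −2 ln Λ works out to n·x̄²; this is an illustrative sketch, not taken from the text, and in this special case the statistic is exactly chi-square with one degree of freedom.

```python
import numpy as np

# Sketch: for H0: mu = 0 with X ~ N(mu, 1), the generalized likelihood ratio
# gives -2 ln(Lambda) = n * xbar^2, which is exactly chi-square(1) here
# because xbar ~ N(0, 1/n) under H0.
rng = np.random.default_rng(6)
n = 40
xbar = rng.normal(0.0, 1.0 / np.sqrt(n), size=50_000)  # sampling dist of xbar
stat = n * xbar**2                                     # -2 ln(Lambda)

# Empirical 95th percentile vs. the chi-square(1) critical value 3.841.
print(np.quantile(stat, 0.95))
```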

2.3 Measures of Information

A great variety of information measures have been proposed in the literature recently, see Esteban (1995). Shannon (1948) made a huge contribution to the development of information theory, so this section deals with Shannon's entropy and some measures related to it.

Definition (2.3.1): The origin of the entropy concept goes back to Ludwig Boltzmann (1877); the word is Greek, meaning transformation, and it was given a probabilistic interpretation in information theory by Shannon (1948). He considered entropy an index of the uncertainty associated with a random variable, expressed in nats, where a nat (sometimes nit or nepit) is a unit of information or entropy based on natural logarithms. Let there be n events with probabilities p1, p2, …, pn adding up to 1; Shannon (1948) stated that the entropy corresponding to these events takes the following form:

H(X) = −∑_{i=1}^{n} p(xi) ln p(xi)        (2.3.1)

Hence, Shannon (1948) claimed that via (2.3.1) one can transform the information in the sample from an invisible form into a numerical, physical form, so that comparisons can easily be made and understood. Frenken (2003) mentioned that (2.3.1) can be regarded as the variance for qualitative data.

To show how Shannon (1948) arrived at (2.3.1), let n1, n2, …, nk be the number of occurrences of each category in an experiment of length n, where:

∑_{i=1}^{k} ni = n   and   pi = ni / n

According to Golan (1996), Shannon (1948) noted that the number of ways of partitioning the n outcomes into k categories of sizes n1, …, nk can serve as an indicator of the accuracy of any decision associated with the sample; this number of combinations can be written as:

W = C(n; n1, n2, …, nk) = n! / (n1! n2! … nk!)        (2.3.2)

Clearly (2.3.2) is always greater than or equal to one; if (2.3.2) equals one the sample has a single category, which corresponds to maximum accuracy and minimum uncertainty. For simplicity Shannon (1948) preferred to work with the logarithm of W:

ln(W) = ln n! − ∑_{i=1}^{k} ln ni!

Using Stirling's approximation, which states:

ln x! ≈ x ln x − x   as x → ∞

ln(W) becomes:

ln(W) ≈ n ln n − n − ∑_{i=1}^{k} ni ln ni + ∑_{i=1}^{k} ni

ln(W) ≈ n ln n − ∑_{i=1}^{k} ni ln ni

≈ n ln n − ∑_{i=1}^{k} ni ln(n pi)

≈ n ln n − ∑_{i=1}^{k} ni (ln n + ln pi)

≈ n ln n − ∑_{i=1}^{k} ni ln n − ∑_{i=1}^{k} ni ln pi

≈ −∑_{i=1}^{k} ni ln pi

n⁻¹ ln(W) ≈ −∑_{i=1}^{k} pi ln pi = H(p)

Therefore Shannon's (1948) entropy can be regarded as a measure of the average accuracy associated with decisions based on the sample. Shannon (1948) showed that (2.3.1) satisfies the following properties:

1. The quantity H(X) reaches a minimum, equal to zero, when one of the events is a certainty, adopting the convention 0 ln(0) = 0; H(X) reaches its maximum when all the probabilities are equal, hence H(X) can be regarded as a concave function.

2. If some events have zero probability, they can just as well be left out of the entropy when we evaluate the uncertainty.

3. Entropy must be symmetric: it does not depend on the order of the probabilities.

For a continuous distribution, (2.3.1) takes the following form:

H(X) = −∫_{−∞}^{∞} f(x, θ) ln f(x, θ) dx
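A minimal sketch of (2.3.1) in nats, checking the two extreme cases named in the properties above (zero entropy for a certain event, maximal entropy ln n for equal probabilities):

```python
import math

# Sketch of Shannon's entropy (2.3.1) in nats, with the convention
# 0 * ln(0) = 0 handled by skipping zero probabilities.
def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

print(entropy([1.0, 0.0, 0.0]))   # a certain event: entropy 0
print(entropy([0.25] * 4))        # equal probabilities: maximal, ln(4)
```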

Definition (2.3.2): The joint entropy measures the uncertainty of two variables together and takes the following form:

H(X, Y) = −∑_{i=1}^{n} p(xi, yi) ln p(xi, yi)

It is obvious that:

H(X, Y) ≤ H(X) + H(Y)

According to Shannon (1948), the uncertainty of a joint event is less than or equal to the sum of the individual uncertainties, with equality only if the events are independent.

Definition (2.3.3): The mutual information measures the information that X and Y share and takes the following form:

M(X, Y) = ∑_{i=1}^{n} p(xi, yi) ln [p(xi, yi) / (p(xi) p(yi))]

It is obvious that M(X, Y) = 0 if the two variables are independent.
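A sketch for a 2×2 joint distribution, where both tables are illustrative assumptions: mutual information is zero when p(x, y) = p(x)p(y) and positive otherwise.

```python
import numpy as np

# Sketch: mutual information of a discrete joint distribution (in nats).
def mutual_info(joint):
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)      # marginal of X
    py = joint.sum(axis=0, keepdims=True)      # marginal of Y
    mask = joint > 0                           # skip zero-probability cells
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask])))

independent = [[0.25, 0.25], [0.25, 0.25]]     # p(x,y) = p(x)*p(y)
dependent = [[0.45, 0.05], [0.05, 0.45]]       # strongly associated
print(mutual_info(independent), mutual_info(dependent))
```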

Definition (2.3.4): The conditional entropy H(X/Y) is a measure of what Y does not say about X, that is, how much information is in X but not in Y; it takes the following form:

H(X/Y) = H(X, Y) − H(Y)

Remark: the definitions above can be extended to continuous variables by replacing the summation symbol with an integral. If the two variables are independent, the conditional entropy H(X/Y) equals H(X). The relations between the measures of information can be depicted as follows:
Venn diagram: relation between the information measures

Definition (2.3.5): Kullback and Leibler (1951) introduced the relative entropy, or information divergence, which measures the distance between two distributions of a random variable. This information measure is also known as the KL-entropy and takes the following form:

KL(X/Y) = ∑_{i=1}^{n} p(xi) ln [p(xi) / q(yi)]        (2.3.3)

Typically (2.3.3) is regarded as the loss incurred by using Y instead of X, since (2.3.3) can be expressed in another form:

KL(X/Y) = ∑_{i=1}^{n} p(xi) ln p(xi) − ∑_{i=1}^{n} p(xi) ln q(yi)

= −H(X) − ∑_{i=1}^{n} p(xi) ln q(yi)

For simplicity, consider the following example. Suppose we have five events in the specified sample with probabilities (.2, .1, .3, .25, .15), and we want to know the divergence between these events and the uniform distribution's probabilities. Substituting in (2.3.3) yields:

KL(X/Y) = ∑_{i=1}^{n} p(xi) ln [p(xi) / q(yi)]

= .2 ln(.2/.2) + .1 ln(.1/.2) + .3 ln(.3/.2) + .25 ln(.25/.2) + .15 ln(.15/.2) = .065

Therefore, if we replace the distribution of the sample with the uniform distribution we lose .065 nat; thus (2.3.3) can be considered a good tool for discriminating between two distributions, Gokhale (1983). One assumes that whenever q(yi) = 0 the corresponding p(xi) = 0, with the convention 0 ln(0/0) = 0, see Dukkipati (2006). Note that the KL-entropy is not symmetric:

KL(X/Y) ≠ KL(Y/X)
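The worked example can be reproduced directly; the helper below is a sketch of (2.3.3), and the probabilities are those given in the text.

```python
import math

# Reproducing the worked example: KL divergence (2.3.3), in nats, between
# the sample probabilities and the uniform distribution on five events.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.2, 0.1, 0.3, 0.25, 0.15]
q = [0.2] * 5
print(round(kl(p, q), 3))    # about 0.065 nat, matching the text
print(kl(p, q) == kl(q, p))  # False: the KL-entropy is not symmetric
```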

Furthermore KL(X/Y) is a non-negative measure, and it equals zero iff X and Y are identical:

KL(X/Y) ≥ 0        (2.3.4)

According to Liu (2007), (2.3.4) can be proved using the following inequality:

x ln(x/y) ≥ x − y   for x, y > 0        (2.3.5)

Hence, one can bound (2.3.3) according to (2.3.5) as:

∑_{i=1}^{n} p(xi) ln [p(xi)/q(yi)] ≥ ∑_{i=1}^{n} p(xi) − ∑_{i=1}^{n} q(yi)   for p(xi), q(yi) > 0

≥ 1 − ∑_{i=1}^{n} q(yi) ≥ 0

KL(X/Y) ≥ 0

Remark: KL can also be applied to continuous variables by replacing the summation symbol with an integral; all the properties remain valid, see Dukkipati (2006).

2.4 Lagrange Multiplier

In mathematical optimization, the method of Lagrange multipliers provides a strategy for finding the maximum or minimum of an objective function subject to constraints. To see this, consider the following example:

Min f(x, y) = 2x² + y²        (2.4.1)

Subject to

x + y = 1
To solve (2.4.1), one can substitute the constraint into the objective function, transforming the restricted optimization into an unrestricted one, and then search for the extreme values as follows:

y = 1 − x        (2.4.2)

Hence (2.4.1) can be written:

Min f(x) = 2x² + (1 − x)²

So the minimum point in x can be obtained as follows:

df(x)/dx = 0

4x − 2(1 − x) = 0

6x − 2 = 0

x = 1/3        (2.4.3)

It is obvious that (2.4.3) is a minimum point since the second derivative is positive; substituting (2.4.3) in (2.4.2) gives the value of y:

y = 2/3

Indeed, the values of x and y can be reached via another route, using the principle of Lagrange multipliers, as follows.

To solve (2.4.1), one writes the Lagrangian function:

Lagr(x, y, λ) = 2x² + y² − λ(x + y − 1)

where the constant λ is the Lagrange multiplier and Lagr denotes the Lagrangian function. The method works as follows:

dLagr(x, y, λ)/dx = 4x − λ = 0
dLagr(x, y, λ)/dy = 2y − λ = 0        (2.4.4)
dLagr(x, y, λ)/dλ = −(x + y − 1) = 0

Since (2.4.4) is a system of three equations in three variables (in general such systems are nonlinear), solving these equations yields the solution of (2.4.1):

x = 1/3    y = 2/3    λ = 4/3        (2.4.5)

One can conclude that transforming (2.4.1) from a constrained optimization into an unconstrained one is equivalent to using the Lagrange multiplier principle. There is yet another approach, known as the dual problem: we replace all the variables in the objective function by expressions in the Lagrange multiplier. From (2.4.4) one concludes:

x = λ/4    y = λ/2        (2.4.6)

Substituting (2.4.6) in (2.4.1), the objective function contains only the Lagrange multiplier; minimizing (2.4.1) with respect to x, y then corresponds to maximizing the objective with respect to λ (since λ enters with a negative sign, there is an opposite relation between the Lagrange multiplier and the objective function). Hence (2.4.1) can be rewritten as the unrestricted problem:

Max  λ²/8 + λ²/4 − λ(λ/4 + λ/2 − 1) = −(3/8)λ² + λ        (2.4.7)

Taking the first derivative of (2.4.7) to find the extreme value:

d(−(3/8)λ² + λ)/dλ = −(3/4)λ + 1 = 0  →  λ = 4/3        (2.4.8)

Substituting (2.4.8) in (2.4.6) yields the same solution as (2.4.5).
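Since the first-order conditions for this quadratic problem are linear in (x, y, λ), they can also be solved numerically as a linear system; the sketch below checks the solution.

```python
import numpy as np

# Sketch: the first-order conditions for
#     min 2x^2 + y^2  subject to  x + y = 1
# are linear, so (x, y, lambda) solve:
#     4x      - lam = 0
#          2y - lam = 0
#      x +  y       = 1
A = np.array([[4.0, 0.0, -1.0],
              [0.0, 2.0, -1.0],
              [1.0, 1.0,  0.0]])
b = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(A, b)
print(x, y, lam)    # 1/3, 2/3, 4/3
```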

According to (later), some remarks should be taken into consideration when searching for a solution using the Lagrange multiplier principle:

1. The number of constraints must be less than or equal to the number of variables.
2. The constraints in the optimization problem must be independent.

In statistical inference there is a well-known test related to the Lagrange multiplier for testing hypotheses about the parameters of a distribution, see Engle (1984). Aitchison and Silvey (1958) proposed the Lagrange multiplier test, which derives from restricted maximum likelihood estimation using a Lagrange multiplier. Suppose it is required to maximize L(x1…xn; θ) with respect to θ subject to the hypothesis that θ = θ0; as above, the Lagrangian function takes the form:

Lagr(θ, λ) = L(x1…xn; θ) − λ(θ − θ0)

Differentiating Lagr(θ, λ) with respect to θ and λ and setting the derivatives to zero yields:

dLagr(θ, λ)/dθ = dL(x1…xn; θ)/dθ − λ = 0        (2.4.9)

dLagr(θ, λ)/dλ = −(θ − θ0) = 0,  i.e.  θ = θ0        (2.4.10)

To solve (2.4.9) and (2.4.10) simultaneously, one takes the derivative of L(x1…xn; θ) and substitutes (2.4.10) into it, obtaining:

λ = dL(x1…xn; θ0)/dθ        (2.4.11)
Expression (2.4.11) is typically known as the score function S(θ0). Since θ is often unknown it is estimated by MLE, see section (2.1); a small value of S(θ0) indicates that θ0 is close to the MLE, leading to acceptance of the null, and otherwise to rejection. Thus the score test measures the discrepancy between the tested value and the MLE. Under the null, zero and the Fisher information I(θ0) represent the mean and the variance of S(θ0) respectively; thus the Lagrange multiplier (LM) statistic can be written as:

LM = (S(θ0))² / I(θ0)

Under the null hypothesis, in large samples LM has a Chi-square distribution with one degree of freedom, for more details see Judge et al. (1982). The LM test can be extended to test k parameters simultaneously:

LM = S(θ)ᵗ I(θ)⁻¹ S(θ)        (2.4.12)

where S(θ) is the score function of the vector θ = (θ1, …, θk) and I(θ)⁻¹ is the inverse of the information matrix of order k. They take the following forms:

S(θ) = ( dL(x1…xn; θ)/dθ1 , … , dL(x1…xn; θ)/dθk )ᵗ

I(θ) = [ I_rs ],   I_rs = E[ (d ln L(x1…xn; θ)/dθr) (d ln L(x1…xn; θ)/dθs) ],   r, s = 1, …, k

Note: (2.4.12) has a Chi-square distribution with k degrees of freedom. For simplicity, consider the following example. Let X1, X2, …, Xn be a random sample of size n from Normal(μ, σ²), see section (2.5), and suppose it is required to test:

H0: μ = μ0,  σ² = σ0²

Using the LM test, the logarithm of the normal distribution's likelihood function is:

ln L(x1…xn; μ, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) ∑_{i=1}^{n} (xi − μ)²

The score function will be:

S_normal(μ, σ²) = ( d ln L/dμ , d ln L/dσ² )ᵗ

d ln L/dμ = (1/σ²) ∑_{i=1}^{n} (xi − μ)

d ln L/dσ² = −n/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^{n} (xi − μ)²

Hence the score function under the null hypothesis is:

S_normal(μ0, σ0²) = ( (1/σ0²) ∑_{i=1}^{n} (xi − μ0) ,  −n/(2σ0²) + (1/(2σ0⁴)) ∑_{i=1}^{n} (xi − μ0)² )ᵗ

The information matrix under the null hypothesis for the normal distribution, and its inverse, are:

I_normal(μ0, σ0²) = [ n/σ0²   0 ;  0   n/(2σ0⁴) ]

I_normal(μ0, σ0²)⁻¹ = [ σ0²/n   0 ;  0   2σ0⁴/n ]

Hence, the LM test can take the following form:

LM_normal = S_normal(θ)ᵗ I_normal(θ)⁻¹ S_normal(θ)

LM_normal = (a − n μ0)² / (n σ0²) + (b − 2a μ0 + n μ0² − n σ0²)² / (2n σ0⁴)

where a = ∑_{i=1}^{n} xi and b = ∑_{i=1}^{n} xi².

Remark: As mentioned above, LM_normal has a Chi-square distribution with 2 degrees of freedom. Suppose that instead of testing the mean and the variance of the normal simultaneously, only the mean is to be tested; then the only change is in the score function:

S_normal(μ0, σ²) = ( (1/σ²) ∑_{i=1}^{n} (xi − μ0) , 0 )ᵗ

Therefore the LM test becomes:

LM_normal = (a − n μ0)² / (n σ²) ≈ χ²(1)
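A sketch of this one-parameter score test, where the sample size, μ0, and σ are illustrative assumptions: the statistic reduces to n·x̄² when μ0 = 0 and σ = 1, and its average over repeated samples should be near 1, the mean of χ²(1).

```python
import numpy as np

# Sketch of the LM (score) test for H0: mu = mu0 with known sigma:
#   LM = (a - n*mu0)^2 / (n*sigma^2),  a = sum(x).
rng = np.random.default_rng(7)
n, mu0, sigma = 30, 0.0, 1.0

x = rng.normal(mu0, sigma, size=n)
lm = (x.sum() - n * mu0) ** 2 / (n * sigma**2)   # equals n * xbar^2 here

# Under H0 the statistic behaves like chi-square(1); check its mean over
# many replications (chi-square(1) has mean 1).
reps = rng.normal(mu0, sigma, size=(50_000, n))
lms = (reps.sum(axis=1) - n * mu0) ** 2 / (n * sigma**2)
print(lm, lms.mean())
```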

2.5 Some Important Distributions

This section briefly presents some well-known distributions that will be used in this thesis.

1 Normal Distribution:

The normal distribution, also called the Gaussian distribution, is an important family of continuous probability distributions, applicable in many fields. Each member of the family is defined by two parameters, location and scale. The standard normal distribution is the normal distribution with mean zero and variance one. The importance of the normal distribution as a model of quantitative phenomena in the natural and behavioral sciences is due in part to the central limit theorem.

If X has a normal distribution with mean μ and variance σ², the density function takes the following form:

f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)),   −∞ < x < ∞

22
The normal distribution has several important properties: the mean, median, and mode are all equal, and the skewness and excess kurtosis are both zero. In fact the normal distribution has the maximum entropy among all distributions with fixed variance, equal to ln(σ√(2πe)), and its moment generating function is M_X(t) = exp(μt + σ²t²/2).

2 Uniform Distribution

In probability theory and statistics, the continuous uniform distribution is a family of probability distributions such that, for each member of the family, all intervals of the same length on the support are equally probable. The distribution is defined by two parameters, a and b, its minimum and maximum values respectively, and it plays an important role in random number generation. The distribution is often abbreviated U(a, b).

If X has a uniform distribution with minimum a and maximum b, the density function takes the following form:

f(x) = 1/(b − a)   for a < x < b,   and 0 otherwise

The mean and median are (a + b)/2, while the mode is not unique, since any point of the support can be considered a mode. The variance is (b − a)²/12, and the moment generating function is (e^{tb} − e^{ta})/((b − a)t). In fact the uniform distribution has the maximum entropy among all distributions defined on the interval (a, b), equal to ln(b − a); the skewness and excess kurtosis are zero and −6/5 respectively.

3 Exponential Distribution

In probability theory and statistics, the exponential distributions are a class of continuous probability distributions. They describe the times between events in a Poisson process; the exponential distribution is a special case of the gamma distribution, and it is widely applied in lifetime models, biology, mechanics, etc.

If X has an exponential distribution with rate parameter λ > 0, the density function takes the following form:

f(x) = λe^{−λx}   for 0 < x < ∞,   and 0 otherwise

The exponential distribution has mean, median, mode, and variance equal to 1/λ, ln(2)/λ, zero, and 1/λ² respectively. This distribution has the maximum entropy among all distributions defined on the positive half-line with fixed mean, equal to 1 − ln λ; the moment generating function, skewness, and excess kurtosis are (1 − t/λ)⁻¹, 2, and 6 respectively.
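The stated entropy value can be checked by Monte Carlo, since H = −E[ln f(X)]; the rate λ = 2 below is an illustrative assumption.

```python
import numpy as np

# Sketch: Monte Carlo check that the entropy of Exponential(lam) is
# 1 - ln(lam) nats, using H = -E[ln f(X)] with f(x) = lam * exp(-lam * x).
rng = np.random.default_rng(8)
lam = 2.0                                # illustrative rate
x = rng.exponential(scale=1 / lam, size=200_000)

h_mc = -np.mean(np.log(lam) - lam * x)   # -E[ln f(X)]
print(h_mc, 1 - np.log(lam))             # the two should agree
```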
