1 Method of Moments
It is difficult to trace back who introduced the method of moments (MOM), but Johann Bernoulli (1667-1748) was the first to use the method in his work; see Gelder (1997). The idea of this method is that we express the unknown parameters in terms of the unobserved population moments (for instance the mean, variance, skewness, kurtosis and coefficient of variation), then estimate those unobserved moments with the observed sample moments. Typically, the observed sample moments can take the following forms:
1. The moments about zero (raw moments): E(X^r), estimated by the sample raw moment m_r = (1/n) ∑_{i=1}^{n} x_i^r
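A minimal sketch of this idea in Python, assuming a hypothetical exponential sample whose population mean 1/λ is matched to the first sample moment:

import numpy as np

# Hypothetical example: X ~ Exponential(lam) has E(X) = 1/lam,
# so matching the first raw moment gives lam_hat = 1 / sample mean.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=1000)   # assumed true lam = 2

m1 = x.mean()          # first sample raw moment
lam_mom = 1.0 / m1     # method-of-moments estimate of lam
print(lam_mom)         # close to 2 for a large sample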
2 Method of Maximum Likelihood
It is difficult to trace who discovered this tool, but Bernoulli in 1700 was the first who reported on it; see Gelder (1997). The idea is that the observed sample should be given a high probability of being drawn, so one searches for the parameter values that maximize the likelihood function for the observed sample.
The likelihood function is the joint density function of a simple random sample and takes the following form:
L(x_1, ..., x_n; θ) = ∏_{i=1}^{n} f(x_i; θ)
The MLE θ̂ is, in many cases, obtained by solving the following equation:
dL(x_1, ..., x_n; θ)/dθ = 0    (2.1.1)
The maximum likelihood method can also be used to estimate k unknown parameters, by solving a system of k simultaneous equations in the k unknowns. It can be shown that θ̂ defined by equation (2.1.1) cannot be obtained if the following conditions (often called regularity conditions) are not valid:
1. The first and second derivatives of the log-likelihood function must be defined.
2. The range of the X's does not depend on the unknown parameter.
Note: In many situations (2.1.1) cannot be solved easily; one can then use a monotonic transformation that makes the calculation easier with no loss of information:
d ln L(x_1, ..., x_n; θ)/dθ = d[∑_{i=1}^{n} ln f(x_i; θ)]/dθ
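A hedged numerical sketch, assuming the same hypothetical exponential sample and SciPy's bounded scalar minimizer applied to the negative log-likelihood:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=1000)   # assumed true lam = 2

def neg_log_lik(lam):
    # ln L = n ln(lam) - lam * sum(x); we minimize its negative
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50), method="bounded")
print(res.x, 1 / x.mean())   # numerical MLE vs. the closed form 1/x̄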
3 Method of Least Squares
The method of least squares, or ordinary least squares (OLS), plays a vital role in statistical research, particularly regression analysis; it was proposed by Gauss, see Gelder (1997). Typically OLS is used to estimate the relation between two variables, known as the independent and dependent variables. Least squares problems fall into two categories, linear and non-linear. The linear least squares problem has a closed-form solution, but the non-linear problem does not and is usually solved by an iterative process. Furthermore, OLS can be applied with one or more independent variables; this study will focus on one independent variable.
Suppose Y_1, Y_2, ..., Y_n are pairwise uncorrelated random variables that represent the dependent variable in the model:
Y_i = B_0 + B_1 X_i + U_i,   i = 1, ..., n
where the U's refer to the residuals of the model. Thus OLS states that one should pick the values of the B's which make the sum of squared residuals as small as possible:
Min S(B_0, B_1) = ∑_{i=1}^{n} (y_i − B_0 − B_1 x_i)²
dS(B_0, B_1)/dB_0 = ∑_{i=1}^{n} −2(y_i − B_0 − B_1 x_i) = 0
dS(B_0, B_1)/dB_1 = ∑_{i=1}^{n} −2 x_i (y_i − B_0 − B_1 x_i) = 0    (2.1.2)
It is easy to check that (2.1.2) gives a minimum; hence solving (2.1.2) yields:
b_1 = (∑_{i=1}^{n} x_i y_i − n x̄ ȳ) / (∑_{i=1}^{n} x_i² − n x̄²)
b_0 = ȳ − b_1 x̄
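A minimal sketch of these closed-form formulas in Python, with assumed illustrative data:

import numpy as np

# Hypothetical data: y = 1 + 0.5 x + noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)

# b1 = (sum(x*y) - n*xbar*ybar) / (sum(x^2) - n*xbar^2), b0 = ybar - b1*xbar
n, xbar, ybar = x.size, x.mean(), y.mean()
b1 = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x ** 2) - n * xbar ** 2)
b0 = ybar - b1 * xbar
print(b0, b1)   # close to the assumed (1.0, 0.5)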
So far, it is not obvious which method is more efficient than the others; to address this question, we discuss some topics related to the properties of point estimators and confidence intervals.
Definition (2.1.1): In statistics, point estimation refers to the use of sample data to calculate a single value, known as a statistic: an observed function of the sample, where the function itself does not depend on the parameter, which serves as a best guess for an unknown population parameter.
Definition (2.1.2) Unbiased Estimator: The first criterion by which estimators can be classified is unbiasedness. Suppose θ̂ is a statistic from an observed random sample; if E(θ̂) = θ, then θ̂ is an unbiased estimator of θ. If this condition holds only as the sample size becomes large, θ̂ is called an asymptotically unbiased estimator of θ.
Definition (2.1.3) Relative Efficiency: Suppose θ̂_1, θ̂_2 are two estimators for θ; iff Var(θ̂_1)/Var(θ̂_2) < 1, then θ̂_1 is more efficient than θ̂_2, where Var refers to the variance of the estimator. If this condition holds in large samples, θ̂_1 is asymptotically more efficient than θ̂_2.
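A small simulation sketch of this definition, assuming a normal population for which the sample mean is known to be more efficient than the sample median:

import numpy as np

# Compare Var(mean) and Var(median) as estimators of a normal mu
rng = np.random.default_rng(2)
samples = rng.normal(loc=0.0, scale=1.0, size=(20000, 30))

var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()
print(var_mean / var_median)   # ratio < 1: the mean is more efficient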
It is obvious that consistency is an asymptotic property, sometimes called convergence in probability. If θ̂ is an unbiased estimator for θ and Var(θ̂) tends to zero as the sample size grows, then θ̂ is a consistent estimator of θ. A statistic of the form ∑_{i=1}^{n} h(x_i) through which the likelihood function factorizes is considered a sufficient statistic for θ. Cramér and Rao proposed an inequality that gives the lower bound for the variance of an unbiased estimator and indicates whether the Uniformly Minimum Variance Unbiased Estimator (UMVUE) exists or not. Assume θ̂ is an unbiased estimator for θ; then:
V(θ̂) ≥ 1 / E[(d ln L(x_1, ..., x_n; θ)/dθ)²]    (2.1.3)
If the two sides coincide, then θ̂ is the best estimator for θ; indeed, the UMVUE for θ exists whenever the likelihood function can be expressed as follows:
d ln L(x_1, ..., x_n; θ)/dθ = b(θ)(θ̂ − θ)
From inequality (2.1.3), we note the following points:
1. The lower bound for the UMVUE can take another form:
V(θ̂) ≥ 1 / (−E[d² ln L(x_1, ..., x_n; θ)/dθ²]) = 1 / E[(d ln L(x_1, ..., x_n; θ)/dθ)²]
2. The denominator of (2.1.3) is called the Fisher information I(θ), an index of the amount of information about θ contained in the sample. Obviously, more information leads to more accuracy, meaning less variability.
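As a worked instance of (2.1.3), consider the textbook exponential case f(x; λ) = λe^{−λx}, used here purely for illustration: ln L = n ln λ − λ ∑_{i=1}^{n} x_i, so −E[d² ln L/dλ²] = n/λ² = I(λ), and therefore any unbiased estimator of λ satisfies V(λ̂) ≥ λ²/n.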
Definition: Let X_1, X_2, ..., X_n be a random sample from f(x; θ); assume L = t_1(x_1, x_2, ..., x_n) and U = t_2(x_1, x_2, ..., x_n) satisfy L < U and P(L < θ < U) = λ, where λ does not depend on θ; then (L, U) is a 100λ percent confidence interval for estimating θ, where L and U are called the lower and upper confidence limits respectively.
Guenther (1969) discussed in depth how to obtain the shortest confidence interval for some famous distributions. He showed that the shortest confidence interval can be derived by searching for two values (a, b) in the domain of the pivotal quantity Q(X_1, X_2, ..., X_n; θ), where a < b, which minimize the length of the interval (U − L) while satisfying P(a < Q(X_1, X_2, ..., X_n; θ) < b) = λ:
Min Length = U − L
subject to
∫_a^b f_Q(q) dq = λ
where f_Q is the density of the pivotal quantity.
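A minimal numerical sketch of this search, assuming a chi-square pivotal quantity for a normal variance and SciPy's constrained minimizer:

import numpy as np
from scipy import stats
from scipy.optimize import minimize

n, lam = 20, 0.95            # assumed sample size and confidence level
chi2 = stats.chi2(n - 1)

# The interval for sigma^2 is ((n-1)s^2/b, (n-1)s^2/a),
# so its length is proportional to 1/a - 1/b.
def length(v):
    a, b = v
    return 1.0 / a - 1.0 / b

cons = {"type": "eq", "fun": lambda v: chi2.cdf(v[1]) - chi2.cdf(v[0]) - lam}
res = minimize(length, x0=[chi2.ppf(0.025), chi2.ppf(0.975)],
               constraints=[cons], bounds=[(1e-6, None), (1e-6, None)])
print(res.x)   # shorter interval than the usual equal-tail (a, b)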
By the invariance property, if θ̂ is the MLE for θ, then g(θ̂) is the MLE for g(θ). The MLE estimates are asymptotically normally distributed, so a confidence interval derived from MLE estimates can be considered the shortest confidence interval when the sample size is large. If there is an efficient estimator for θ that achieves the Cramér-Rao lower bound, it must be the MLE. Furthermore, the MLE may not exist; if it does exist it need not be in closed form, and it also may not be unique; see Gelder (1997).
3. The Gauss–Markov theorem states that in a linear regression model in which the errors have expectation zero conditional on the independent variables, are uncorrelated and have equal variances, the best linear unbiased estimator of the coefficients is the OLS estimator.
2.2 Hypotheses Testing
A statistical hypothesis is a statement about the distribution of one or more random variables; the doubtful hypothesis is called the null hypothesis H_0. In fact there are two types of hypotheses:
1. Parametric hypothesis: concerned with one or more constraints imposed upon the parameters of a certain distribution.
2. Non-parametric hypothesis: a statement about the form of the cumulative distribution function or probability function of the distribution from which the sample is drawn.
Hypotheses can also be classified as follows:
1. Simple hypothesis: the statistical hypothesis specifies the probability distribution completely, for example:
H_0: f(x) = f_0(x)
2. Composite hypothesis: the statistical hypothesis does not specify the distribution completely, for example:
H_0: f(x) ≠ f_0(x),   H_0: f(x) ≤ f_0(x),   H_0: f(x) ≥ f_0(x)
It is obvious that our decision is based on sample data; therefore the decision can be affected by two kinds of errors:
1. Type I error α: committed when we reject H_0 although it is correct; α is also called the level of significance.
2. Type II error β: committed when we accept H_0 although it is wrong.
Hence, we need a statistical test that keeps the decision errors as small as possible. Unfortunately, for a fixed sample size, if one of the errors is minimized the other is maximized, so there is a negative relation between the two errors.
To overcome this problem we fix the more serious error, the Type I error, and search for the statistical test with the minimum Type II error, the most powerful test.
For testing a simple null hypothesis against a simple alternative, the most powerful test rejects H_0 when λ ≤ k, where
λ = ∏_{i=1}^{n} f_0(x_i) ÷ ∏_{i=1}^{n} f_1(x_i)
and k is a positive constant. The idea is that we calculate the ratio between the likelihood function under H_0 and under H_1; this ratio is well known as the simple likelihood ratio, and the result as the Neyman-Pearson lemma.
For composite hypotheses the test rejects H_0 when Λ ≤ c, where
Λ = ∏_{i=1}^{n} f_0(x_i) ÷ ∏_{i=1}^{n} f_Ω(x_i)
and c is a positive constant. The idea is that we calculate the ratio between the likelihood function under H_0 and the likelihood over the whole parameter space Ω of θ (denoted f_Ω(x_i)); this ratio is typically called the generalized likelihood ratio.
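A small sketch of the simple likelihood ratio for two fully specified normal densities, with hypothetical means and SciPy supplying the density values:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=0.5, scale=1.0, size=30)      # assumed data

# Simple H0: N(0, 1) versus simple H1: N(1, 1)
lam = np.prod(stats.norm.pdf(x, 0, 1)) / np.prod(stats.norm.pdf(x, 1, 1))
print(lam)   # reject H0 when lam <= k for a suitably chosen k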
2.3 Measures of Information
Definition (2.3.1): The origin of the entropy concept goes back to Ludwig Boltzmann (1877); the word is Greek, meaning transformation. It was given a probabilistic interpretation in information theory by Shannon (1948), who considered entropy as an index of the uncertainty associated with a random variable, expressed in nats, where a nat (sometimes nit or nepit) is a unit of information or entropy based on the natural logarithm:
H(X) = −∑_{i=1}^{k} p_i ln p_i    (2.3.1)
Hence, Shannon (1948) claimed that via (2.3.1) one can transform the information in the sample from an invisible form to a numerical, physical form, so that comparisons can easily be made and understood. Frenken (2003) mentioned that (2.3.1) can be regarded as the variance of qualitative data.
Here the sample of size n is partitioned into k categories with frequencies n_1, ..., n_k, where ∑_{i=1}^{k} n_i = n and p_i = n_i / n.
According to Golan (1996), Shannon (1948) mentioned that the number of all possible combinations that partition n into k categories of sizes n_1, ..., n_k can be an indicator of the accuracy of any decision associated with this sample; one can express the number of all possible combinations as:
W = C(n; n_1, n_2, ..., n_k) = n! / (n_1! n_2! ... n_k!)    (2.3.2)
It is obvious that (2.3.2) is always greater than or equal to one; if (2.3.2) equals one, the sample has one category, which corresponds to maximum accuracy and minimum uncertainty. For simplicity Shannon (1948) preferred to deal with the logarithm of W as follows:
ln(W) = ln n! − ∑_{i=1}^{k} ln n_i!
Using Stirling's approximation, which states
ln x! ≈ x ln x − x  as x → ∞,
we obtain:
ln(W) ≈ n ln n − n − ∑_{i=1}^{k} n_i ln n_i + ∑_{i=1}^{k} n_i
≈ n ln n − ∑_{i=1}^{k} n_i ln n_i
≈ n ln n − ∑_{i=1}^{k} n_i ln(n p_i)
≈ n ln n − ∑_{i=1}^{k} n_i (ln n + ln p_i)
≈ n ln n − ∑_{i=1}^{k} n_i ln n − ∑_{i=1}^{k} n_i ln p_i
≈ −∑_{i=1}^{k} n_i ln p_i
Hence
n⁻¹ ln(W) ≈ −∑_{i=1}^{k} p_i ln p_i = H(p)
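A minimal check of H(p) in Python, with hypothetical category frequencies:

import numpy as np

counts = np.array([30, 50, 20])   # hypothetical n_i with n = 100
p = counts / counts.sum()
H = -np.sum(p * np.log(p))        # Shannon entropy in nats
print(H)                          # maximal (= ln 3) only when p is uniform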
1. The quantity H(X) reaches a minimum, equal to zero, when one of the probabilities p_i equals one and all the others equal zero.
For a continuous distribution, (2.3.1) takes the following form:
H(X) = −∫_{−∞}^{∞} f(x; θ) ln f(x; θ) dx
For the joint entropy H(X, Y) it is obvious that:
H(X, Y) ≤ H(X) + H(Y)
Definition (2.3.4): The conditional entropy H(X/Y) is a measure of what Y does not say about X, that is, how much information about X is not contained in Y; it takes the following form:
H(X/Y) = H(X, Y) − H(Y)
Remark: definitions (2-10)-(2-12) can be extended to continuous variables if the summation symbol is replaced with the integration symbol.
If the two variables are independent, the conditional entropy H(X/Y) equals H(X). One can see that there is a relation among the measures of information, as the following Venn diagram illustrates.
[Figure: Venn diagram of the relation between the information measures]
The Kullback-Leibler (KL) divergence between the distributions of X and Y takes the following form:
KL(X/Y) = ∑_{i=1}^{n} p(x_i) ln [p(x_i) / q(y_i)]    (2.3.3)
Typically (2.3.3) is also regarded as the relative entropy for using Y instead of X, since (2.3.3) can be expressed in another form:
KL(X/Y) = ∑_{i=1}^{n} p(x_i) ln p(x_i) − ∑_{i=1}^{n} p(x_i) ln q(y_i)
= −H(X) − ∑_{i=1}^{n} p(x_i) ln q(y_i)
For simplicity consider the following example: suppose we have five events in a sample with the probabilities (.2, .1, .3, .25, .15), and we want to know the divergence between these events and the uniform distribution. Substituting in (2.3.3) yields:
KL(X/Y) = .2 ln(.2/.2) + .1 ln(.1/.2) + .3 ln(.3/.2) + .25 ln(.25/.2) + .15 ln(.15/.2) ≈ .065
Therefore, it can be concluded that if we replace the distribution of the sample with the uniform distribution we will lose about .065 nats; thus (2.3.3) can be considered a good tool for discriminating between two distributions, Gohale (1983). One assumes that whenever q(y_i) = 0 the corresponding p(x_i) = 0, with the convention 0 ln(0/0) = 0; see Dukkipati (2006). Indeed, KL-entropy is not symmetric:
KL(X/Y) ≠ KL(Y/X)
Moreover, the KL divergence is non-negative; using the inequality ln(p/q) ≥ 1 − q/p for p(x_i), q(x_i) > 0:
KL(X/Y) = ∑_{i=1}^{n} p(x_i) ln [p(x_i)/q(x_i)] ≥ ∑_{i=1}^{n} p(x_i) − ∑_{i=1}^{n} q(x_i) = 1 − ∑_{i=1}^{n} q(x_i) ≥ 0
Remark: KL can be applied when the variables are continuous by replacing the summation symbol with the integration symbol; furthermore, all the properties remain valid; see Dukkipati (2006).
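A short sketch reproducing the worked example above in Python:

import numpy as np

p = np.array([0.2, 0.1, 0.3, 0.25, 0.15])   # sample probabilities
q = np.full(5, 0.2)                          # uniform reference

kl = np.sum(p * np.log(p / q))
print(kl)   # about 0.065 nats, as in the example above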
2.4 Lagrange Multiplier
Min f(x, y) = 2x² + y²    (2.4.1)
subject to
x + y = 1
To solve (2.4.1), one can insert the constraint into the objective function and transform the restricted optimization into an unrestricted one, then search for the extreme values as follows:
y = 1 − x    (2.4.2)
Min f(x) = 2x² + (1 − x)²
df(x)/dx = 0
4x − 2(1 − x) = 0
6x − 2 = 0
x = 1/3    (2.4.3)
It is obvious that (2.4.3) is a minimum point, since the second derivative is positive; to obtain the value of y, substitute (2.4.3) into (2.4.2), which yields:
y = 2/3
Indeed, the values of x and y can be reached via another route, using the principle of the Lagrange multiplier as follows:
Lagr(x, y, λ) = 2x² + y² − λ(x + y − 1)
where the constant λ refers to the Lagrange multiplier, and Lagr refers to the Lagrangian function. The method works as follows:
dLagr(x, y, λ)/dx = 4x − λ = 0
dLagr(x, y, λ)/dy = 2y − λ = 0    (2.4.4)
dLagr(x, y, λ)/dλ = 1 − x − y = 0
One can conclude that transforming (2.4.1) from a constrained into an unconstrained optimization is equivalent to using the Lagrange multiplier principle. Indeed, there is another approach, known as the dual problem, for solving (2.4.1): the constrained problem is replaced with an unconstrained one by replacing all the variables in the objective function with expressions in the Lagrange multiplier. From (2.4.4) one concludes:
x = λ/4,   y = λ/2    (2.4.6)
Substituting (2.4.6) into (2.4.1) yields an objective function that contains only the Lagrange multiplier; minimizing (2.4.1) with respect to x, y then corresponds to maximizing with respect to λ, since the λ² term has a negative sign, so there is an opposite relation between the Lagrange multiplier and the objective function. Hence (2.4.1) can be rewritten as the unrestricted problem:
Max  λ²/8 + λ²/4 − λ(λ/4 + λ/2 − 1) = −(3/8)λ² + λ    (2.4.7)
Taking the first derivative of (2.4.7) to obtain the extreme value:
d[−(3/8)λ² + λ]/dλ = −(3/4)λ + 1 = 0  →  λ = 4/3    (2.4.8)
Substituting (2.4.8) into (2.4.6) recovers x = 1/3 and y = 2/3; a numerical check appears after the conditions below. The Lagrange multiplier method requires the following conditions:
1. The number of the constraints must be less than or equal to the number of the
variables.
2. The constraints in the optimization problem must be independent.
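A sketch verifying the solution of (2.4.1) numerically, assuming SciPy's general constrained minimizer:

import numpy as np
from scipy.optimize import minimize

# Min 2x^2 + y^2 subject to x + y = 1
obj = lambda v: 2 * v[0] ** 2 + v[1] ** 2
con = {"type": "eq", "fun": lambda v: v[0] + v[1] - 1}

res = minimize(obj, x0=[0.0, 0.0], constraints=[con])
print(res.x)   # close to (1/3, 2/3), matching the Lagrange solution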
Suppose it is required to maximize L(x_1, ..., x_n; θ) with respect to θ subject to the hypothesis that θ = θ_0. As mentioned above, the Lagrangian function can take the form:
Lagr(θ, λ) = L(x_1, ..., x_n; θ) − λ(θ − θ_0)
Differentiating Lagr(θ, λ) with respect to θ and λ and setting the derivatives to zero shows that the multiplier equals the slope of the likelihood at the hypothesized value. Define the score function S(θ) = d ln L(x_1, ..., x_n; θ)/dθ; the score is zero at the unrestricted maximum, which is estimated by the MLE (see section 2.1). Hence a smaller value of S(θ_0) agrees with θ_0 being close to the MLE and we accept the null; otherwise we reject. Thus the score test measures the distance between the tested value and the MLE. It is known that zero and the Fisher information I(θ) represent the mean and the variance of S(θ), and the LM statistic is:
LM = (S(θ_0))² / I(θ_0)
Under the null hypothesis, for large samples LM has a Chi-square distribution with one degree of freedom; for more details see Judge et al. (1982). Indeed, the LM test can be extended to test k parameters simultaneously as follows:
LM = S(θ)ᵀ I(θ)⁻¹ S(θ)    (2.4.12)
where S(θ) refers to the score function of the vector θ and I(θ)⁻¹ refers to the inverse of the information matrix:
S(θ) = (d ln L(x_1, ..., x_n; θ)/dθ_1, ..., d ln L(x_1, ..., x_n; θ)/dθ_k)ᵀ
and I(θ) is the k × k matrix whose (j, l) element is E[(d ln L(x_1, ..., x_n; θ)/dθ_j)(d ln L(x_1, ..., x_n; θ)/dθ_l)].
Note: (2.4.12) also has a Chi-square distribution, with k degrees of freedom. For more clarity consider the following example. Let X_1, X_2, ..., X_n be a random sample of size n from a Normal(μ, σ²) distribution, and test:
H_0: μ = μ_0, σ² = σ_0²
Using the LM test, the logarithm of the normal likelihood function is:
ln L(x_1, ..., x_n; μ, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) ∑_{i=1}^{n} (x_i − μ)²
The score vector is:
S_normal(μ, σ²) = ( d ln L/dμ , d ln L/dσ² )ᵀ = ( (1/σ²) ∑_{i=1}^{n} (x_i − μ) ,  −n/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^{n} (x_i − μ)² )ᵀ
Evaluated under the null hypothesis:
S_normal(μ_0, σ_0²) = ( (1/σ_0²) ∑_{i=1}^{n} (x_i − μ_0) ,  −n/(2σ_0²) + (1/(2σ_0⁴)) ∑_{i=1}^{n} (x_i − μ_0)² )ᵀ
The information matrix under the null hypothesis for the normal distribution, and its inverse, are:
I_normal(μ_0, σ_0²) = diag(n/σ_0², n/(2σ_0⁴))   and   I_normal(μ_0, σ_0²)⁻¹ = diag(σ_0²/n, 2σ_0⁴/n)
Hence the LM statistic reduces to:
LM_normal = (a − n μ_0)² / (n σ_0²) + (n σ_0² − b + 2 a μ_0 − n μ_0²)² / (2 n σ_0⁴)
where a = ∑_{i=1}^{n} x_i and b = ∑_{i=1}^{n} x_i².
Remark: as mentioned above, LM_normal has a Chi-square distribution with 2 degrees of freedom. Suppose that instead of testing the mean and the variance of the normal simultaneously, it is required to test the mean only; the only change will be in the score function, whose second component becomes zero:
S_normal(μ_0, σ²) = ( (1/σ²) ∑_{i=1}^{n} (x_i − μ_0) , 0 )ᵀ
LM_normal = (a − n μ_0)² / (n σ²) ~ χ²(1)
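A minimal sketch of LM_normal in Python, with hypothetical data compared against the chi-square critical value with 2 degrees of freedom:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=0.0, scale=1.0, size=100)   # assumed data
mu0, sig2_0 = 0.0, 1.0                         # H0: mu = 0, sigma^2 = 1

n, a, b = x.size, x.sum(), np.sum(x ** 2)
LM = ((a - n * mu0) ** 2 / (n * sig2_0)
      + (n * sig2_0 - b + 2 * a * mu0 - n * mu0 ** 2) ** 2
      / (2 * n * sig2_0 ** 2))
print(LM, stats.chi2(2).ppf(0.95))   # reject H0 when LM exceeds the quantile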
2.5 Some Famous Distributions
In this section, some famous distributions which will be used in this thesis are shown in brief.
1 Normal Distribution:
f(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)},   −∞ < x < ∞
The normal distribution has important properties: the mean, median and mode are all equal, and the skewness and excess kurtosis equal zero. In fact the normal distribution has the maximum entropy among all distributions with fixed variance, equal to ln(σ√(2πe)), with moment generating function
M_X(t) = exp(μt + σ²t²/2).
2 Uniform Distribution:
It is obvious that the mean and the median are (a + b)/2, with multiple modes, since any value in (a, b) attains the maximum density. This distribution has the maximum entropy among all distributions defined on the interval [a, b], equal to ln(b − a), with moment generating function (e^{bt} − e^{at}) / ((b − a)t) for t ≠ 0.
3 Exponential Distribution:
The exponential distribution describes the time between events in a Poisson process; indeed, the exponential distribution is a special case of the Gamma distribution, and it has wide applications in life models, biology, mechanics, etc. The exponential distribution has mean, median, mode and variance equal to 1/λ, ln(2)/λ, zero and 1/λ² respectively. This distribution has the maximum entropy among all distributions defined on the positive half-line with fixed mean, equal to 1 − ln λ, with moment generating function, skewness and excess kurtosis equal to (1 − t/λ)⁻¹, 2 and 6 respectively.