
DRAFT

The Maxwell Distribution:


Properties, Parameter Estimators and Some Applications
Rev. 3.1

Doug Hollingshead
doug_hollingshead@yahoo.com

November 2009

Abstract

This note is intended to provide some statistical concepts and results related to the Maxwell
(or Maxwell-Boltzmann) distribution. This distribution can be derived from the three degree-
of-freedom Chi-squared distribution. The Maxwell distribution is the three dimensional
counterpart of the Rayleigh distribution when interpreted as sums of physical quantities.
Many standard probability texts discuss statistical properties of the Rayleigh distribution. To
date, the author has not found an expanded discussion of the Maxwell distribution properties,
or formulas pertaining to parameter estimation.

This note is primarily aimed at deriving relevant statistical properties of the one-parameter
Maxwell distribution and presenting formulae for various point estimators for this parameter.
Maximum likelihood, method of moments and quantile point estimation procedures are
discussed. Cramer-Rao lower bound relative efficiency comparison of the three estimators is
included. The discussion of methods proceeds from general concepts to specifics related to
the Maxwell distribution. A summary of important formulae is provided. Appendix A
provides discussion of Bayesian estimation for the Maxwell distribution parameter. Two-
parameter Maxwell distributions are discussed in Appendix B. Additional information
pertaining to the Maxwell distribution statistics is provided in the remaining appendices.

An attempt is made to identify some situations where knowledge of the Maxwell distribution
and its estimators is helpful. These include three dimensional miss distance, and detection of
mismatch between the assumed and actual covariance matrix associated with an alpha-beta
filter. The Mahalanobis distance for tri-variate normally distributed variables is introduced as
the primary analytical tool for this application.


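One Parameter Maxwell Distribution Formula Summary

Density Function:
$$ f(y) = \sqrt{\frac{2}{\pi}} \, \frac{y^2}{a^3} \exp\left( \frac{-y^2}{2a^2} \right), \qquad 0 \le y < \infty, \quad a > 0 $$

CDF:
$$ F(y) = \operatorname{erf}\left( \frac{y}{\sqrt{2}\,a} \right) - \sqrt{\frac{2}{\pi}} \, \frac{y}{a} \exp\left( \frac{-y^2}{2a^2} \right) $$

Mean, Mode and Variance:
$$ \mu_1' = a \sqrt{\frac{8}{\pi}}, \qquad y_{mo} = \sqrt{2}\,a, \qquad \sigma_Y^2 = a^2 \left( \frac{3\pi - 8}{\pi} \right) $$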

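One Parameter Maxwell Statistics for n samples $y_i$, $1 \le i \le n$

Minimum Variance Bound:
$$ CRLB = \frac{a^2}{6n} $$

MLE estimator:
$$ \hat{a}_{MLE} = \sqrt{\frac{1}{3n} \sum_{i=1}^{n} y_i^2}, \qquad MSE(\hat{a}_{MLE}) = CRLB $$

MOM estimator:
$$ \hat{a}_{MOM} = \sqrt{\frac{\pi}{8}} \, \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad MSE(\hat{a}_{MOM}) = CRLB / 0.935818 $$

Quantile Estimator:
$$ \hat{a}_{QUAN} = 0.49344 \cdot Y_{3/4}, \qquad MSE(\hat{a}_{QUAN}) = CRLB / 0.6450, \qquad Y_{3/4} = \text{upper quartile} $$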

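Two Parameter Maxwell Distribution Formula Summary (Appendix B)

$$ f(y; a, b) = \sqrt{\frac{2}{\pi}} \, \frac{(y - b)^2}{a^3} \exp\left( \frac{-(y - b)^2}{2a^2} \right), \qquad b \le y < \infty; \quad a > 0; \quad -\infty < b < \infty $$

$$ F(y; a, b) = \operatorname{erf}\left( \frac{y - b}{\sqrt{2}\,a} \right) - \sqrt{\frac{2}{\pi}} \, \frac{(y - b)}{a} \exp\left( \frac{-(y - b)^2}{2a^2} \right) $$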

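Two Parameter Maxwell Statistics for n samples $y_i$, $1 \le i \le n$

$$ \hat{a}_{MOM} = 1.4849 \sqrt{\frac{1}{n} \sum_{i=1}^{n} y_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} y_i \right)^2}, \qquad MSE(\hat{a}) = 0.5270 \, \frac{a^2}{n} $$

$$ \hat{b}_{MOM} = \frac{1}{n} \sum_{i=1}^{n} y_i - 2.3695 \sqrt{\frac{1}{n} \sum_{i=1}^{n} y_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} y_i \right)^2}, \qquad MSE(\hat{b}) = 1.2737 \, \frac{a^2}{n} $$
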
List of Appendices


Appendix A: Bayes’ Estimators for the Maxwell Distribution

Appendix B: The Two-parameter Maxwell Distribution

Appendix C: Sample Moment Standard Error and the Delta Method

Appendix D: Some Results for Order Statistics of a Sample

Appendix E: Sums of Maxwell IID Random Variables

Appendix F: Goodness of Fit and Outlier Tests for Maxwell Distribution

Appendix G: Probability Plots and MATLAB Code for Plot Generation

Appendix H: Chi Distributions for k > 3 and Nakagami Distributions

Appendix I: Inverse Maxwell Distribution

References


The Maxwell Distribution:


Properties, Parameter Estimators and Some Applications

Purpose and Overview

The Maxwell (or more correctly, the Maxwell-Boltzmann) distribution was originally derived
using statistical mechanics, to explain the most probable energy distribution of molecules in a
system. The kinetic theory of gases rests on this distribution. Later developments in
probability theory showed that this distribution can be derived from the more general Chi-
squared distribution, which in turn is a special case of the Gamma distribution.

The Maxwell distribution is related to the Chi-squared distribution with three degrees of
freedom. It is essentially the “3-dimensional” equivalent of the Rayleigh distribution. The
Chi-squared distribution is the probability distribution of the sum of the squares of
independent random variables drawn from a standard Normal distribution. The Rayleigh and
Maxwell distributions describe a random variable formed as the square root of such a sum (of
two and three squared variables, respectively). It is important to note that the underlying
variables are normally distributed with zero mean and a common variance.

This paper is intended to provide some technical background on the properties of the Maxwell
distribution including distribution function, moments and other properties. A second goal is to
discuss estimation methods for the single parameter of this distribution as well as discuss
mean square error of the various statistical estimators. The statistical efficiency concept for
estimators is introduced. Calculation of efficiency relies on the general expressions for the
variance of the sample moments.

In this regard, three methods of parameter estimation will be discussed: method of moments
(MOM), method of quantiles, and maximum likelihood (MLE). The Cramer-Rao Lower
Bound (CRLB) for this distribution will be derived. The MLE estimator is shown to achieve
this lower bound.

Necessary concepts of order statistics are introduced to aid in development of quantile
estimators. Quantile estimators are generally very easy to calculate and are quite robust when
outliers are present in the data set. They are however less efficient than either MOM or MLE
estimators. A simple expression for the Maxwell minimum variance quantile estimator is
provided.

Bayesian estimators for the Maxwell distribution may be useful when using the Maxwell
distribution but are often limited by lack of knowledge of a suitable “a priori” distribution.
Bayesian estimation is discussed in Appendix A. Because of the structure of the likelihood
function for this distribution, a closed form solution exists only for special prior distributions.
The concept of sufficient statistics is introduced here. One such prior is discussed and the
Bayes’ estimate (i.e., mean) of the posterior distribution is provided. Comparison with MLE
estimators is made. An alternative “maximum a posteriori” or MAP (i.e., mode) estimator is also
mentioned.


The last goal of this paper is to indicate two potential applications of the Maxwell distribution
as applied to models or experiments encountered in missile design and analysis. The
application motivating this paper was the need to provide a simple metric for evaluation of the
output from a specific “alpha-beta” filter. The “capture probability” is defined as the
proportion of “weighted error magnitudes” that lie inside the volume contained within the
“2.5-sigma” covariance ellipsoid.

The capture probability involves calculation of the Mahalanobis distance ( d M ) for each of the
data points. This distance will be introduced, and it will be shown that even when the various
error components are correlated, the Mahalanobis distance still has a Maxwell distribution.

In some cases, the Maxwell distribution may not have the lower bound at zero. The
distribution will then have a location parameter as well as the scale parameter. This two
parameter Maxwell distribution is discussed in Appendix B, along with derivation of
properties of method-of-moments parameter estimators.

Derivation of the MSE of both the MLE and MOM estimates depends on the variance of the raw
moment estimators. Appendix C contains additional derivations pertaining to the variance and
covariance of moment statistics. The approximate variance of functions of the raw moments is
also discussed, and the Delta Method is introduced for this purpose.

Appendix D develops some necessary results for order statistics of a sample, including the
expression for the expected value of the quantiles. The asymptotic mean and variance for the
order statistics themselves are also derived and an application for the Maxwell distribution is
provided.

Appendix E derives the distribution for the sum of two Maxwell random variables. The sum
of N independent Maxwell random variables and the asymptotic normal distribution are also
discussed.

Appendix F provides an introductory discussion of hypothesis testing, goodness-of-fit tests
and tests for outliers. The appendix contains Monte Carlo derived tables for testing goodness
of fit via the Anderson-Darling statistic and using correlation statistics for the Maxwell
distribution. Tables are also provided for testing for k outliers in samples assumed to be
Maxwell distributed. The author is unaware of tables being published elsewhere for either of
these statistics from hypothesized Maxwell distributions.

Appendix G discusses and provides MATLAB code for development of Maxwell distribution
“probability paper.” These plots are useful for initial evaluation of data.

Appendix H provides a brief overview of Chi distributions for k > 3. Here, k is the number of
squared normal variables included in the root-mean-square sum. A generalized form of this
distribution is the Nakagami. Parameter estimation methods for these distributions are also
discussed.


Distribution Derivation

The sum of squares of “n” independent random variables, each with a standard normal
distribution, results in a random variable that has a Chi-squared distribution with “n”
degrees-of-freedom. References [1] or [2] contain details. We are currently interested in the
sum of three independent variables which are normally distributed with zero mean. The three
component variables have a common variance, and each variable is normalized by dividing
the square of each ($X_i$) by the variance ($\sigma^2$). The resulting Chi-squared random
variable, W, with three degrees-of-freedom is distributed as:

$$ f(W) = \frac{W^{1/2} \exp(-W/2)}{\Gamma(3/2) \cdot 2^{3/2}}, \qquad W = \sum_{i=1}^{3} \frac{X_i^2}{\sigma^2} $$

The gamma function is expressed as an integral, but is constant for a fixed number of degrees
of freedom. For three degrees of freedom, $\Gamma(3/2) = \sqrt{\pi}/2$, so that
$\Gamma(3/2) \cdot 2^{3/2} = \sqrt{2\pi}$.

The variable of interest is the square root of W. Consider the transformation $y = a\sqrt{W}$,
where “a” is a scaling factor. It will be shown subsequently that this factor is the standard
deviation of the normal random variables. The distribution of variable y is required.

The inverse of the transformation is $W = y^2 / a^2$. The support of W is the positive real axis
and as such, the support of y is also the positive real axis. Further, a one-to-one transformation
exists between y and W, so the distribution of y can be found from the standard procedure
(Reference [4]):
$$ f_Y(y) = f_W\big( W(y) \big) \left| \frac{dW}{dy} \right| = f_W\left( \frac{y^2}{a^2} \right) \frac{2y}{a^2} $$

This distribution is in the general family of “Chi distributions.” The three degree of freedom
Chi distribution is the Maxwell density function. Substituting the transformed variable, and
the value for $\Gamma(3/2)$, provides the Maxwell probability density function:
$$ f(y) = \sqrt{\frac{2}{\pi}} \, \frac{y^2}{a^3} \exp\left( \frac{-y^2}{2a^2} \right) \qquad \text{Equation (1)} $$
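The Cumulative Distribution Function (CDF) is derived by direct integration of Equation 1 over
the interval from 0 to y, where $0 \le y < \infty$:

$$ F(y) = P[Y \le y] = \operatorname{erf}\left( \frac{y}{\sqrt{2}\,a} \right) - \sqrt{\frac{2}{\pi}} \, \frac{y}{a} \exp\left( \frac{-y^2}{2a^2} \right) \qquad \text{Equation (2)} $$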
The error function is a tabulated integral function defined as:
$$ \operatorname{erf}(u) = \frac{2}{\sqrt{\pi}} \int_0^u \exp(-t^2) \, dt; \qquad \operatorname{erf}(\infty) = 1 \ \text{and} \ \operatorname{erf}(0) = 0 $$
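
Equations (1) and (2) can be checked against each other numerically. The following Python sketch is an illustration only and is not part of the original MATLAB work; the function names (maxwell_pdf, maxwell_cdf, cdf_by_quadrature) are hypothetical helpers introduced here. It integrates the density with a simple midpoint rule and compares the result with the closed-form CDF.

```python
import math


def maxwell_pdf(y, a):
    """Equation (1): Maxwell density with scale parameter a > 0."""
    if y < 0:
        return 0.0
    return math.sqrt(2.0 / math.pi) * y**2 / a**3 * math.exp(-y**2 / (2.0 * a**2))


def maxwell_cdf(y, a):
    """Equation (2): Maxwell distribution function."""
    if y <= 0:
        return 0.0
    return (math.erf(y / (math.sqrt(2.0) * a))
            - math.sqrt(2.0 / math.pi) * (y / a) * math.exp(-y**2 / (2.0 * a**2)))


def cdf_by_quadrature(y, a, steps=20000):
    """Integrate Equation (1) from 0 to y with the midpoint rule."""
    h = y / steps
    return sum(maxwell_pdf((k + 0.5) * h, a) for k in range(steps)) * h


if __name__ == "__main__":
    for a in (1.0, 2.0, 5.0):
        y = 1.5 * a
        # The two columns should agree to several decimal places.
        print(a, maxwell_cdf(y, a), cdf_by_quadrature(y, a))
```

Because y enters Equation (2) only through the ratio y/a, the printed CDF value is the same for every choice of a when y is set to a fixed multiple of a.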

It is noted that if we define z = y / a and dz = dy / a, the parameter “a” is a “scale” parameter
for the Maxwell distribution. This is seen by writing Equation 1 as:

$$ f_Z(z) \, \frac{dz}{dy} = f_Y(z \cdot a) = \frac{1}{a} \sqrt{\frac{2}{\pi}} \, z^2 \exp(-z^2 / 2) $$

Figure 1 shows the Maxwell distribution for various values of the scale parameter:

[Figure 1: plot of f(y) against y over 0 ≤ y ≤ 10 for a = 1, 2, 3 and 5]

Figure 1 – Maxwell Probability Distribution

It is interesting to note that the distribution has an approximate “bell shape” which implies
this distribution is close in some sense to the normal distribution.


Maxwell Distribution Properties

Calculation of the distribution moments (about the origin) can in theory be done using the
moment generating function (MGF). The MGF does exist for this distribution but is quite
complicated. Moments about the origin are more easily evaluated by direct integration. For
the mean (first moment):

$$ \mu_1' = \sqrt{\frac{2}{\pi}} \, \frac{1}{a^3} \int_0^\infty y^3 \exp\left( \frac{-y^2}{2a^2} \right) dy = \sqrt{\frac{2}{\pi}} \, \frac{1}{a^3} \, 2a^4 = a \sqrt{\frac{8}{\pi}} $$

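The second moment about the origin is:

$$ \mu_2' = \sqrt{\frac{2}{\pi}} \, \frac{1}{a^3} \int_0^\infty y^4 \exp\left( \frac{-y^2}{2a^2} \right) dy = \sqrt{\frac{2}{\pi}} \, \frac{1}{a^3} \, 3a^5 \sqrt{\frac{\pi}{2}} = 3a^2 $$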

The third and fourth moments about the origin can be calculated similarly:
$$ \mu_3' = 8a^3 \sqrt{\frac{2}{\pi}}, \qquad \mu_4' = 15a^4 $$
The variance follows from the first two moments:
$$ \sigma_Y^2 = \mu_2 = \mu_2' - (\mu_1')^2 = a^2 \left( \frac{3\pi - 8}{\pi} \right) $$
Skewness of the distribution involves the third moment about the mean. It should be positive
as indicated in Figure 1. Skewness and kurtosis can be shown to be constant:
$$ \mu_3 = \mu_3' - 3\mu_1' \mu_2' + 2(\mu_1')^3 \quad \text{and} \quad \mu_4 = \mu_4' - 4\mu_1' \mu_3' + 6(\mu_1')^2 \mu_2' - 3(\mu_1')^4 $$

$$ \text{Skewness} = \frac{\mu_3}{\sigma^3} = \frac{2\sqrt{2}\,(16 - 5\pi)}{(3\pi - 8)^{3/2}} = 0.4857 > 0 $$
$$ \text{Kurtosis excess} = \frac{\mu_4}{\sigma^4} - 3 = \frac{15\pi^2 + 16\pi - 192}{(3\pi - 8)^2} - 3 = 0.1082 $$
This is not a highly skewed distribution, as Figure 1 confirms. The mode ($y_{mo}$) of the
Maxwell distribution can be found by differentiating Equation 1 and setting the result to zero
(note: a > 0):
$$ \frac{df}{dy} = \sqrt{\frac{2}{\pi}} \, \frac{1}{a^3} \exp\left( \frac{-y^2}{2a^2} \right) \left( 2y - \frac{y^3}{a^2} \right) = 0 \ \Rightarrow \ y_{mo} = \sqrt{2}\,a $$
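
The closed-form moments, skewness and kurtosis above can be reproduced by direct numerical integration of Equation 1. The sketch below is illustrative only; the helper names raw_moment and shape_summary, the integration limit of 12a and the midpoint rule are assumptions made for this example and do not come from the original analysis.

```python
import math


def raw_moment(r, a, upper=12.0, steps=40000):
    """r-th moment about the origin by midpoint quadrature of y^r f(y) on [0, upper * a]."""
    h = upper * a / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h
        pdf = math.sqrt(2.0 / math.pi) * y**2 / a**3 * math.exp(-y**2 / (2.0 * a**2))
        total += y**r * pdf
    return total * h


def shape_summary(a):
    """Mean, variance, skewness and excess kurtosis built from the raw moments."""
    m1, m2, m3, m4 = (raw_moment(r, a) for r in (1, 2, 3, 4))
    var = m2 - m1**2
    mu3 = m3 - 3.0 * m1 * m2 + 2.0 * m1**3
    mu4 = m4 - 4.0 * m1 * m3 + 6.0 * m1**2 * m2 - 3.0 * m1**4
    return {
        "mean": m1,
        "variance": var,
        "skewness": mu3 / var**1.5,
        "excess_kurtosis": mu4 / var**2 - 3.0,
    }


if __name__ == "__main__":
    for a in (1.0, 3.0):
        # Skewness and excess kurtosis should not change with a.
        print(a, shape_summary(a))
```

The mean and variance scale with a and a squared respectively, while the two shape measures stay fixed, which is the sense in which they are "constant" in the text.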


Parameter Estimation and CRLB

CRLB for Maxwell Distribution

Various methods exist to estimate parameter “a” when samples from the Maxwell distribution
are available. Since the estimate $\hat{a}$ depends on this data, it is a random variable and has an
associated sampling distribution, $f_T(t; a)$. The notation implies the estimator $T = \hat{a}$. The
distribution of this random variable depends on the true parameter value “a”. The sampling
distribution generally is unknown, but the variance of the estimator can often be calculated, at
least approximately. Thus, we desire to find the variance of this estimator, denoted $var(\hat{a})$.
In this note the variance of an estimator is referred to as its Mean Square Error (MSE); the two
coincide when the estimator is unbiased. The square root of the MSE is referred to as the
Standard Error (SE) of the estimator.

The first question regarding estimators is: does a theoretical lower bound exist for the MSE,
and if so is there an estimator that achieves this bound? The Cramer-Rao Lower Bound
(CRLB), if it exists, provides an expression for the minimum MSE. It is a function of the
particular distribution and sample size. An estimator that attains this lower bound may not
exist. If the CRLB exists, it is derived from the “likelihood function” of the sample.

The likelihood function of a sample is the product of the probability densities, or joint
distribution, of the sample values. The Maxwell distribution, with sample size n, has the
following likelihood function:
$$ L(y; a) = \prod_{i=1}^{n} f_Y(y_i; a) = \left( \frac{2}{\pi} \right)^{n/2} \left( \prod_{i=1}^{n} \frac{y_i^2}{a^3} \right) \exp\left( \frac{-1}{2a^2} \sum_{i=1}^{n} y_i^2 \right) \qquad \text{Equation (3)} $$
The likelihood function is generally more useful in logarithmic form:
$$ \ln L(y; a) = \sum_{i=1}^{n} \ln f_Y(y_i; a) = \frac{n}{2} \ln\left( \frac{2}{\pi} \right) + \sum_{i=1}^{n} \ln\left( \frac{y_i^2}{a^3} \right) - \frac{1}{2a^2} \sum_{i=1}^{n} y_i^2 \qquad \text{Equation (4)} $$

Note that the logarithmic function is monotonically increasing, so the operations on the
likelihood function can be replaced with operations on the “log likelihood” function in most cases.

The CRLB derivation can be found in References [1] or [2]. In order for the CRLB to exist,
the support of the density function cannot depend on the parameters, and the first two
derivatives of the log likelihood function, with respect to the parameters, must exist. These
“regularity conditions” are both met for the Maxwell distribution. The CRLB is:

$$ CRLB = \frac{1}{E\left[ \left( \dfrac{\partial \ln L}{\partial a} \right)^2 \right]} = \frac{1}{n \cdot E\left[ \left( \dfrac{\partial \ln f_Y(y)}{\partial a} \right)^2 \right]} = \frac{1}{n \cdot I(a)} $$

In the previous expression, E denotes expectation and I(a) is called the “Fisher information.”


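For the Maxwell distribution, evaluation of the Fisher information is straightforward:

$$ \frac{\partial \ln f(y)}{\partial a} = \frac{-3}{a} + \frac{y^2}{a^3} $$

$$ I(a) = \sqrt{\frac{2}{\pi}} \int_0^\infty \left( \frac{9y^2}{a^5} - \frac{6y^4}{a^7} + \frac{y^6}{a^9} \right) \exp\left( \frac{-y^2}{2a^2} \right) dy = \sqrt{\frac{2}{\pi}} \, (I_1 + I_2 + I_3) $$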
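The first integral is evaluated as:

$$ I_1 = \frac{9}{a^5} \left[ a^3 \sqrt{\frac{\pi}{2}} \operatorname{erf}\left( \frac{y}{\sqrt{2}\,a} \right) - a^2 y \exp\left( \frac{-y^2}{2a^2} \right) \right]_0^\infty = \sqrt{\frac{\pi}{2}} \, \frac{9}{a^2} $$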

The remaining two integrals are similar, with one term involving the error function and the
second term involving a polynomial in y, multiplied by the exponential term. The Fisher
information for the Maxwell distribution is:
$$ I(a) = \sqrt{\frac{2}{\pi}} \left( \frac{9}{a^2} - \frac{18}{a^2} + \frac{15}{a^2} \right) \sqrt{\frac{\pi}{2}} = \frac{6}{a^2} $$
The CRLB is:
$$ CRLB = \frac{a^2}{6n} \qquad \text{Equation (5)} $$
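
As a cross-check on Equation 5, the Fisher information can be evaluated numerically as the expected squared score. The following Python sketch is a hedged illustration; fisher_information and crlb are hypothetical names, and the midpoint rule with an upper limit of 12a is an assumption chosen for simplicity.

```python
import math


def fisher_information(a, upper=12.0, steps=40000):
    """E[(d ln f / da)^2] for one observation, by midpoint quadrature."""
    h = upper * a / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h
        pdf = math.sqrt(2.0 / math.pi) * y**2 / a**3 * math.exp(-y**2 / (2.0 * a**2))
        score = -3.0 / a + y**2 / a**3      # derivative of ln f(y) with respect to a
        total += score**2 * pdf
    return total * h


def crlb(a, n):
    """Cramer-Rao lower bound 1 / (n I(a)) for a sample of size n."""
    return 1.0 / (n * fisher_information(a))


if __name__ == "__main__":
    a, n = 1.5, 100
    print(fisher_information(a), 6.0 / a**2)   # numerical versus closed form
    print(crlb(a, n), a**2 / (6.0 * n))
```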

Moment Estimators

Given that the CRLB exists, the second question is: how close does the MSE of a chosen
estimator function get to the minimum variance provided by this lower bound? The ratio of
the CRLB to the MSE for a particular estimator is referred to as the estimator efficiency.

In order to evaluate efficiency of a selected estimator, the variance of that estimator is
required. We will address three kinds of estimators: Maximum Likelihood (MLE), Method of
Moments (MOM) and method of quantiles. All of these methods allow for relatively direct
calculation of the MSE of the estimators. The last one requires some results from order
statistics of a sample. The MSE of quantile estimators is addressed in the next section.

MLE and MOM estimators depend on the sample moments. As such, an expression for the
variance of the sample moments is required. The “r-th” sample moment about the origin is
given by the following computational formula, for random variable Y:
$$ M_r' = \frac{1}{n} \sum_{i=1}^{n} Y_i^r $$
Note that sample moments are random variables, because they depend on the random sample
values. It is shown in Appendix C that the sample moments are unbiased:
$$ E\left[ M_r' \right] = \mu_r' $$


Appendix C contains a general derivation of the variance and covariance of these sample
moments. Because the Maxwell distribution contains only one parameter, only one sample
moment is required and no covariance is involved. The expression of interest is:
$$ var\left[ M_r' \right] = \frac{1}{n} \left[ \mu_{2r}' - \left( \mu_r' \right)^2 \right] \qquad \text{Equation (6)} $$

MLE and MOM estimators are functions of sample moments as will be shown. An expression
for the variance of a function of these sample moments is therefore required. The exact
expression for the variance of a function of these random variables is generally not known. In
such cases, an approximation can be developed via the “Delta Method,” which is discussed in
Appendix C.

In the particular case of the single parameter Maxwell distribution:

$$ \hat{a} = g\left( M_r' \right) \quad \text{where } r = 1 \text{ or } 2 $$

The variance of this estimator is expressed in terms of the derivatives of the function (see
Appendix C):
$$ var(\hat{a}) = \left( \frac{\partial \hat{a}}{\partial M_r'} \right)^2 var\left[ M_r' \right] \qquad \text{Equation (7)} $$

The partial derivative is evaluated at the respective mean value: $M_r' = \mu_r'$. The MSE of
parameter $\hat{a}$ can be calculated using Equations 6 and 7.

Maximum Likelihood Estimate of “a”

The maximum likelihood principle is based on the notion that parameter values occurring in
the likelihood (or log likelihood) function should maximize that function. In this application,
the likelihood function is no longer considered as a joint probability function as it was for the
CRLB development. In this case, the data values are fixed, and the likelihood is viewed as a
function of the parameter value(s). The parameter is chosen such that the probability of
observing the actual data values is maximized.

For the Maxwell distribution, the derivative of Equation 4 is taken and equated to zero for
maximization:
$$ \frac{\partial \ln L(y; a)}{\partial a} = 0 = \frac{1}{a^3} \sum_{i=1}^{n} y_i^2 - 3 \sum_{i=1}^{n} \frac{1}{a} $$
Solving this last equation provides the MLE for the parameter “a”:
$$ \hat{a}_{MLE} = \sqrt{\frac{1}{3n} \sum_{i=1}^{n} y_i^2} \qquad \text{Equation (8)} $$

To calculate the MSE of this estimator, note that the statistic considered is:

$$ T = \hat{a}_{MLE} = \sqrt{\frac{M_2'}{3}} \quad \text{and} \quad \frac{\partial T}{\partial M_2'} = \frac{1}{\sqrt{12 \, \mu_2'}} \qquad \text{Equation (9)} $$

The variance of the second moment is calculated by inserting values for the second and fourth
moments into Equation 6:
$$ var\left[ M_2' \right] = \frac{1}{n} \left[ \mu_4' - \left( \mu_2' \right)^2 \right] = \frac{6a^4}{n} $$

Squaring Equation 9, using the previous result, and plugging into Equation 7 provides the MLE
estimator variance:
$$ var(\hat{a}_{MLE}) = \frac{a^2}{6n} \qquad \text{Equation (10)} $$

The estimator “efficiency” defined previously for this estimator is:

$$ eff(T) = \frac{CRLB}{var(T)} $$
When this ratio is 1.0, the estimator is called “efficient.” The MLE estimator of “a” is
therefore efficient since Equation 5 and Equation 10 are identical.
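
A small simulation can illustrate how closely the spread of the MLE matches Equation 10. The Python sketch below is not the author's MATLAB code; sample_maxwell, a_mle and mle_standard_error are hypothetical helpers, and drawing each Maxwell variate as the root of three squared zero-mean normal variables follows the derivation given earlier in this note.

```python
import math
import random


def sample_maxwell(n, a, rng):
    """Draw n Maxwell(a) variates as the root-sum-square of three N(0, a^2) variables."""
    return [math.sqrt(sum(rng.gauss(0.0, a)**2 for _ in range(3))) for _ in range(n)]


def a_mle(sample):
    """Equation (8): maximum likelihood estimate of a."""
    return math.sqrt(sum(y * y for y in sample) / (3.0 * len(sample)))


def mle_standard_error(a, n, replicates, seed=12345):
    """Mean and standard deviation of the MLE over repeated simulated samples."""
    rng = random.Random(seed)
    estimates = [a_mle(sample_maxwell(n, a, rng)) for _ in range(replicates)]
    mean = sum(estimates) / replicates
    var = sum((e - mean)**2 for e in estimates) / (replicates - 1)
    return mean, math.sqrt(var)


if __name__ == "__main__":
    a, n = 1.5, 100
    mean, se = mle_standard_error(a, n, replicates=200)
    print("simulated mean and SE:", mean, se)
    print("sqrt(CRLB):", math.sqrt(a**2 / (6.0 * n)))
```

With only a few hundred replicates the simulated standard error fluctuates by several percent from run to run, so agreement with the square root of the CRLB should be judged loosely.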

Method of Moments Estimate of “a”

The method of moments is based on the idea that if two distributions have the same moments,
they should be (approximately) the same. The method therefore consists of setting the first
“k” sample moments equal to the corresponding moments of the assumed distribution. The
number of equations then equals the number of parameters being estimated.
For the Maxwell distribution, only one parameter is available; thus, we set the sample mean to
the distribution mean:
$$ M_1' = a \sqrt{\frac{8}{\pi}} \ \Rightarrow \ \hat{a}_{MOM} = \sqrt{\frac{\pi}{8}} \, M_1' $$
Using the previous procedure to derive the MSE of this estimator:
$$ \frac{\partial T}{\partial M_1'} = \sqrt{\frac{\pi}{8}} $$

$$ var\left[ M_1' \right] = \frac{1}{n} \left[ \mu_2' - \left( \mu_1' \right)^2 \right] = \frac{1}{n} \left[ 3a^2 - \frac{8}{\pi} a^2 \right] $$

$$ var(\hat{a}_{MOM}) = 0.178097 \cdot \frac{a^2}{n} \qquad \text{Equation (11)} $$
The efficiency of the MOM estimator is thus:
$$ eff(\hat{a}_{MOM}) = \frac{1/6}{0.178097} = 0.935818 $$
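
The constants in Equation 11 and in the efficiency follow from a few lines of arithmetic, shown here as a Python sketch. The function names are hypothetical and introduced only for this illustration.

```python
import math


def mom_variance_coefficient():
    """Coefficient c in var(a_MOM) = c * a^2 / n from Equations 6 and 7 with r = 1.

    (dT/dM1')^2 = pi/8 and n * var[M1'] / a^2 = 3 - 8/pi.
    """
    return (math.pi / 8.0) * (3.0 - 8.0 / math.pi)


def mom_efficiency():
    """Ratio of the CRLB coefficient (1/6) to the MOM variance coefficient."""
    return (1.0 / 6.0) / mom_variance_coefficient()


def a_mom(sample):
    """Method of moments estimate: sqrt(pi/8) times the sample mean."""
    return math.sqrt(math.pi / 8.0) * sum(sample) / len(sample)


if __name__ == "__main__":
    print(mom_variance_coefficient())   # compare with 0.178097 in Equation (11)
    print(mom_efficiency())             # compare with 0.935818
```

The coefficient simplifies to 3π/8 − 1, which makes clear that it does not depend on the parameter value.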


The MOM estimator is seen to have a variance approximately 7% greater than the MLE
estimator of “a”; consequently it is less efficient. It is of interest to note that if we had used
the second moments and equated the sample and distribution moments, the resulting MOM
estimator would have been identical to the MLE. This shows that in some cases, judicious
selection of the sample moments to use in MOM might have some advantage.

The above examples indicate that MLE estimators are more efficient, and this is generally
true. MLE estimators have several other advantages. The first is that for “large” sample sizes,
the MLE estimator approaches a normal distribution with mean equal to the true parameter,
and the variance equal to the CRLB.

The second advantage of an MLE estimator is that it has an “invariance” property. This means
that if an estimate of some function of a parameter is required, we can insert the MLE
estimate of the parameter into the function to obtain the MLE of that function. This is
expressed mathematically as:
gˆ MLE ( a ) = g ( aˆ MLE )

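It is noted that MLE estimators are often biased. For the Maxwell distribution, the square of
the MLE is an unbiased estimate of $a^2$, and any bias in $\hat{a}_{MLE}$ itself vanishes as the
sample size grows, consistent with the estimator attaining the CRLB in the large-sample
(Delta Method) approximation.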


Quantile Estimates and Order Statistics

Quantile estimators are similar to MOM estimators in that they are not based on a
maximization procedure. Quantile estimators are generally easy to compute, but often the
MSE of such estimators is difficult to ascertain. The efficiency of these estimators is generally
less than that of MOM estimators. One application of such estimators is when the data has been
censored. Censored data implies that not all samples have been evaluated (e.g., in life
testing). The method depends on some results for order statistics of a sample. The pertinent
results are discussed in Appendix D.

In some cases unexpectedly large (or small) values may be present in a sample. These
‘outliers’ can have a disproportionate influence on the parameter estimates. If the outlier
needs to be accounted for in a sample, but should not unduly affect the parameter estimate,
then quantile estimates are quite “robust” in this sense. The relation between quantile and
MLE (or MOM) estimators is analogous to that between the sample median and the sample
mean as estimates of the population central tendency. The order statistics are just the sample
values arranged in ascending order:
$$ Y_{(1)} < Y_{(2)} < \cdots < Y_{(n-1)} < Y_{(n)} $$

The r-th order statistic is denoted as $Y_{(r)}$. Each order statistic of the sample has an
associated probability value, referred to as the quantile of the order statistic:

$$ q_r = P\left[ y \le Y_{(r)} \right] = \frac{r}{n + 1} $$
The idea behind quantile estimators is to pick a number of quantiles equal to the number of
parameters to be estimated, and the associated sample order statistics for those quantiles.
Inserting these values into the distribution function provides a set of equations equal to the
number of unknown parameters. For the Maxwell distribution, there is only one parameter,
hence, only one quantile is needed to estimate the unknown parameter.

Several questions arise. First, does the mean square error of the parameter estimate depend on
which quantile is selected? Secondly, if a particular quantile gives the minimum MSE, how is
it identified? A third question is: does the MSE depend on the parameter value itself? The
following development shows that for the Maxwell distribution, the MSE does depend on the
selected quantile, but the minimum MSE quantile does not depend on the parameter value.

Quantile Estimator for Maxwell Distribution

Quantile estimators in general are functions of the sample order statistics (just as moment
estimators are functions of the sample moments). The delta method procedure can again be
applied, and for a single parameter case the MSE of the estimator is:
$$ var(T \equiv \hat{a}) = \left( \frac{\partial a}{\partial Y_{(r)}} \right)^2 \cdot var\left( Y_{(r)} \right) \qquad \text{Equation (12)} $$

The “large sample” formula for the variance of any order statistic is derived in Appendix D:

$$ var\left( Y_{(r)} \right) = \frac{q_r (1 - q_r)}{n \cdot \left[ f(y_{(r)}; a) \right]^2} \qquad \text{Equation (13)} $$

In this equation, $y_{(r)}$ is the value of the “r-th” order statistic corresponding to the sample
quantile $q_r$. The notation $Y_{(r)}$ implies this quantity is a random variable. The function
$f(\cdot)$ is the underlying density function of the random variable y.
The minimum variance quantile is determined using the following procedure. We first need to
transform $y_{(r)}$ into a more computationally convenient form:
$$ t_r = \frac{y_r}{\sqrt{2}\,a} \qquad \text{Equation (14a)} $$

Equation 2 is then expressed in terms of this variable, which shows that $t_r$ depends only on
$q_r$:

$$ q_r = \operatorname{erf}(t_r) - \frac{2}{\sqrt{\pi}} \, t_r \exp(-t_r^2) \qquad \text{Equation (14b)} $$
The derivative term required in Equation 12 is obtained directly from Equation 14a:
$$ \frac{\partial a}{\partial Y_r} = \frac{1}{\sqrt{2}\, t_r} \qquad \text{Equation (15a)} $$

A plot of the relationship between $q_r$ and $t_r$ from Equation 14b is shown in the left side of
Figure 2. The Maxwell density function can also be expressed in terms of $t_r$:
$$ f(t_r; a) = \sqrt{\frac{8}{\pi}} \, \frac{t_r^2}{a} \exp(-t_r^2) \qquad \text{Equation (15b)} $$

Note that this expression involves parameter “a”. For a selected value of “a”, Equation 12 can
be evaluated over the range of quantiles $0 < q_r < 1$, by calculating $t_r$ via Equation 14b, and
then using Equations 15a and 15b:

$$ var(\hat{a}) = \frac{q_r \cdot (1 - q_r)}{n \cdot \left[ f(t_r; a) \right]^2} \cdot \frac{1}{2 t_r^2} \qquad \text{Equation (15c)} $$

The right hand side of Figure 2 shows the relative variation of Equation 15c as a function of
the quantile value qr for various choices of parameter “a”. The sample size n is a scaling
variable that does not affect the shape of the curves.

For parameter “a” close to unity, the relative variance appears rather insensitive to the
quantile value, but as the parameter increases, a minimum close to qr = 0.75 (upper quartile)
is evident. In fact, the minimum of Equation 15c occurs at qr = 0.7576 , and is independent
of the parameter as can be verified by direct calculation.
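
The "direct calculation" mentioned above can be sketched as a grid search over t. In the Python illustration below, Equation 15c is multiplied by n / a², which removes both the sample size and the parameter and leaves a function of t alone. The names quantile_of_t, relative_variance and best_quantile, as well as the search range and step, are assumptions made for this example.

```python
import math


def quantile_of_t(t):
    """Equation (14b): quantile q as a function of t = y / (sqrt(2) a)."""
    return math.erf(t) - 2.0 / math.sqrt(math.pi) * t * math.exp(-t * t)


def relative_variance(t):
    """Equation (15c) multiplied by n / a^2; the result depends on t only."""
    q = quantile_of_t(t)
    f_times_a = math.sqrt(8.0 / math.pi) * t * t * math.exp(-t * t)  # Equation (15b) times a
    return q * (1.0 - q) / (f_times_a**2 * 2.0 * t * t)


def best_quantile(t_min=0.5, t_max=2.5, step=0.0005):
    """Grid search for the t (and quantile) that minimises the relative variance."""
    count = int(round((t_max - t_min) / step)) + 1
    best_t, best_v = t_min, relative_variance(t_min)
    for k in range(1, count):
        t = t_min + k * step
        v = relative_variance(t)
        if v < best_v:
            best_t, best_v = t, v
    return best_t, quantile_of_t(best_t), best_v


if __name__ == "__main__":
    t_star, q_star, v_star = best_quantile()
    print("minimising t, quantile, n*var/a^2:", t_star, q_star, v_star)
    print("value at the upper quartile (t = 1.433):", relative_variance(1.433))
```

The curve is very flat near its minimum, so using the upper quartile instead of the exact minimising quantile costs very little in variance.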


[Figure 2: two panels plotted against quantile value (0.2 to 1). Left panel: t value. Right
panel: estimate variance (n = 1) for a = 1, 2, 3 and 5.]

Figure 2 – Order Statistics Estimator Variance for Maxwell Distribution


For ease of calculation, the upper quartile $Y_{3/4}$ value is recommended for the Maxwell
distribution estimator (a more exact analysis shows that the minimum variance quantile is
0.7576). At the upper quartile, variable “t” equals 1.4330, and the quantile parameter estimate is:
$$ \hat{a}_{QUAN} = \frac{Y_{q_r = 0.75}}{\sqrt{2}\,(1.433)} = 0.49344 \cdot Y_{3/4} \qquad \text{Equation (16)} $$
The efficiency of this quantile estimator is calculated using these values for the minimum
MSE:

$$ \left[ f(1.433; a) \right]^2 = \frac{8}{\pi} \left( 1.433^4 \right) \cdot \exp\left( -2 \cdot (1.433)^2 \right) \frac{1}{a^2} = \frac{0.1767}{a^2} $$

$$ var(\hat{a}_{QUAN}) = \frac{0.75 \cdot (1 - 0.75) \, a^2}{n \, (0.1767) \cdot 2 \cdot (1.433)^2} = \frac{0.2584 \, a^2}{n} $$

The quantile estimator efficiency can be calculated using the CRLB:

$$ eff(\hat{a}_{QUAN}) = \frac{a^2 / 6n}{a^2 \cdot 0.2584 / n} = 0.6450 \qquad \text{Equation (17)} $$
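
A quantile estimate in the spirit of Equation 16 can be sketched in Python as follows. The rule used to pick the sample quantile (linear interpolation at fractional rank q(n + 1)) is an assumption made here, since the text does not state which rule produced its quartile values; the function names are likewise hypothetical.

```python
import math


def quantile_of_t(t):
    """Equation (14b): quantile q as a function of t = y / (sqrt(2) a)."""
    return math.erf(t) - 2.0 / math.sqrt(math.pi) * t * math.exp(-t * t)


def t_for_quantile(q, lo=0.0, hi=10.0, iterations=100):
    """Invert Equation (14b) by bisection; the CDF is increasing in t."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if quantile_of_t(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sample_quantile(sample, q):
    """Assumed rule: linear interpolation at fractional rank q * (n + 1)."""
    ys = sorted(sample)
    n = len(ys)
    pos = q * (n + 1)
    if pos <= 1:
        return ys[0]
    if pos >= n:
        return ys[-1]
    lower = int(math.floor(pos))
    frac = pos - lower
    return ys[lower - 1] + frac * (ys[lower] - ys[lower - 1])


def a_quan(sample, q=0.75):
    """Quantile estimate of a: sample quantile divided by sqrt(2) * t(q)."""
    return sample_quantile(sample, q) / (math.sqrt(2.0) * t_for_quantile(q))


if __name__ == "__main__":
    # Multiplier of the upper quartile; compare with 0.49344 in Equation (16).
    print(1.0 / (math.sqrt(2.0) * t_for_quantile(0.75)))
```

Solving Equation 14b for t rather than using the rounded value 1.433 changes the multiplier only in the fourth decimal place.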


Estimation Example for Hypothetical Sample

As an example of the use of the parameter estimators, assume a sample of size 100 from a
Maxwell distribution is available. In this example, the parameter value is a = 1.5.
Computation was done using MATLAB functions. The sample is:

data = [ 3.7743 2.6739 1.1872 0.7470 2.7742 1.7389 2.1289 0.3800 1.7851 2.9057
2.2847 2.1543 1.1863 3.3807 1.0166 2.4348 3.0516 2.4945 0.7600 4.1672
4.3509 3.3549 2.2589 0.9332 1.5142 1.4671 2.2576 4.3885 1.7236 1.9492
0.7190 1.6172 1.6940 1.1957 1.9209 2.6369 1.4716 2.4191 3.0460 2.4330
3.5561 1.7868 2.2678 2.2190 2.7345 0.6198 2.3071 3.0629 4.0377 1.8466
4.9367 2.3712 2.6957 1.1412 2.3771 0.6530 1.9543 0.2878 2.0164 3.0145
4.4819 5.7881 1.2017 1.8738 2.7212 1.5086 3.1833 1.7552 0.8323 2.6708
3.3421 4.6603 2.9550 2.0458 1.5950 3.9521 2.4211 1.7068 1.4631 2.7573
4.0250 3.0648 3.5964 1.7052 3.4843 4.1598 2.4427 1.6287 4.2384 1.9250
4.1387 0.9586 3.7827 1.2290 2.8417 1.4812 0.6498 3.3351 3.3007 2.9664 ]

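$$ \sum_{i=1}^{100} y_i = 240.133, \qquad \sum_{i=1}^{100} y_i^2 = 704.720 $$

$$ \hat{a}_{MLE} = \sqrt{\frac{1}{300} \cdot 704.72} = 1.5327, \qquad SE_{MLE} = \sqrt{CRLB} = \sqrt{\frac{1.5^2}{600}} = 0.06123 $$

$$ \hat{a}_{MOM} = \sqrt{\frac{\pi}{8}} \, \frac{240.133}{100} = 1.5048, \qquad SE_{MOM} = \sqrt{\frac{CRLB}{eff(\hat{a}_{MOM})}} = \frac{0.06123}{\sqrt{0.93581}} = 0.06330 $$

$$ \hat{a}_{QUAN} = 0.49344 \cdot Y_{3/4} = 1.5118, \qquad Y_{3/4} = 3.0639, \qquad SE_{QUAN} = \sqrt{\frac{CRLB}{eff(\hat{a}_{QUAN})}} = 0.0762 $$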

A check on the calculated standard errors was made by running 200 replicates (of sample size
100 each with a = 1.5 ). The standard deviation of the 200 MLE, MOM and quantile
estimates was calculated and compared to the theoretical values above. Results are:

$$ SE(\hat{a}_{MLE}) = 0.0631, \qquad SE(\hat{a}_{MOM}) = 0.0650, \qquad SE(\hat{a}_{QUAN}) = 0.0764 $$

In order to check the quantile estimator, estimates and standard errors were made at the lower
quartile (Y1 / 4 ) and median (Y1 / 2 ) also. Results showed the upper quartile estimator does in
fact produce minimum variance among these.
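
The arithmetic of this example can be retraced from the quoted summary statistics alone. The Python sketch below is an illustration and does not reproduce the original MATLAB computation; the constant and function names are hypothetical, and the inputs are the sums, upper quartile and efficiencies stated in the text.

```python
import math

# Summary statistics quoted in the text for the n = 100 sample (true a = 1.5).
N = 100
SUM_Y = 240.133
SUM_Y2 = 704.720
UPPER_QUARTILE = 3.0639
A_TRUE = 1.5


def example_estimates():
    """Point estimates and theoretical standard errors for the worked example."""
    crlb = A_TRUE**2 / (6.0 * N)
    return {
        "a_mle": math.sqrt(SUM_Y2 / (3.0 * N)),           # Equation (8)
        "a_mom": math.sqrt(math.pi / 8.0) * SUM_Y / N,    # MOM estimator
        "a_quan": 0.49344 * UPPER_QUARTILE,               # Equation (16)
        "se_mle": math.sqrt(crlb),
        "se_mom": math.sqrt(crlb / 0.935818),
        "se_quan": math.sqrt(crlb / 0.6450),
    }


if __name__ == "__main__":
    for name, value in example_estimates().items():
        print(name, round(value, 4))
```

The standard errors here use the true parameter value a = 1.5, as in the text; in practice the estimate would be substituted for the unknown parameter.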


Mahalanobis Distance and Maxwell Distribution

The Mahalanobis distance ($d_M$) is a “variance weighted” or statistical distance between two
points in r-dimensional space. We confine discussion to Euclidean three dimensional space.
Mahalanobis distance is employed when the distance between r-dimensional random variables
is required, but the distance is weighted inversely to the “spread” or variance of each variable.
Components of the random vector y may be correlated.

The definition of Mahalanobis distance is:

$$ d_M = \sqrt{(y - \mu)^T C^{-1} (y - \mu)} \qquad \text{Equation (18)} $$

Where:
y = r-dimensional random variable
μ = mean vector
C = covariance matrix

Suppose the three dimensional vector random components are from uncorrelated normal
distributions. The variance of each component is in general different. For simplicity assume
all of the component variables are zero mean. The covariance matrix inverse is just the
inverse of the diagonal terms in this case, and the square of the Mahalanobis distance is:

$$ d_M^2 = \left( \frac{y_1^2}{\sigma_1^2} \right) + \left( \frac{y_2^2}{\sigma_2^2} \right) + \left( \frac{y_3^2}{\sigma_3^2} \right) $$

This squared value is seen to be the sum of squares of three standard normal random
variables. As such, the distribution of $d_M^2$ is Chi-squared with 3 degrees-of-freedom. The
square root of this random variable therefore is Maxwell distributed as shown previously,
where the distribution parameter “a” is equal to unity because $d_M^2$ is the sum of three
squared standard normal variables.

The next issue is what happens to the distribution of $d_M^2$ when the components are
correlated.
In the following discussion, we consider tri-variate jointly distributed random variables.
Assume that all components have different standard deviations and different means. The
vector for the three components is [ x1 x2 x3 ] .

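Examine the general covariance matrix for these variables, where non-zero correlations are
present:

$$C = \begin{bmatrix} \sigma_1^2 & \rho_{12}\,\sigma_1 \sigma_2 & \rho_{13}\,\sigma_1 \sigma_3 \\ \rho_{12}\,\sigma_1 \sigma_2 & \sigma_2^2 & \rho_{23}\,\sigma_2 \sigma_3 \\ \rho_{13}\,\sigma_1 \sigma_3 & \rho_{23}\,\sigma_2 \sigma_3 & \sigma_3^2 \end{bmatrix}$$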


This matrix is positive definite and therefore has positive eigenvalues. The eigenvectors
corresponding to the eigenvalues allow for spectral decomposition of the covariance matrix,
which can be expressed as:

$$C = \Gamma' \Lambda \Gamma = (\Gamma' \Lambda^{1/2} \Gamma)(\Gamma' \Lambda^{1/2} \Gamma)$$

Matrix Γ contains the eigenvectors, Λ is a diagonal matrix of the eigenvalues, and the factoring
shown can be accomplished since the eigenvector matrix is orthonormal. Define the “square
root” of the covariance matrix as:

$$C^{1/2} = \Gamma' \Lambda^{1/2} \Gamma$$

Because the eigenvalues are all positive, the inverse of the above matrix is:

$$C^{-1/2} = \Gamma' \Lambda^{-1/2} \Gamma$$

Now, let Z be a random vector that has a multivariate normal distribution with zero mean and
independent components with unity variance, or $Z \sim N_3(0, I_3)$. The subscript 3 implies Z is
tri-variate. Define a random vector X such that:

$$X = C^{1/2} Z + \mu$$

This linear relation can be inverted such that:

$$Z = C^{-1/2} (X - \mu)$$

Using the previous equation results in the following:

$$W = Z'Z = (X - \mu)' C^{-1} (X - \mu) = d_M^2 = \sum_{i=1}^{3} z_i^2 \sim \chi_3^2$$

The previous equation states that the Mahalanobis distance squared is still Chi-squared, with
three degrees of freedom, as a result of the zi being independent standard normal variables,
even though the underlying variables (x) are correlated. It is again noted that the value of
parameter “a” is unity.

Application of the Mahalanobis distance measure is provided in the second example below.
The fact that the parameter value is unity provides a metric for evaluation of consistency of a
covariance matrix and a set of random variables.
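
As an illustrative numerical check of this result (a sketch only, not part of the original analysis), the following Python fragment draws correlated tri-variate normal vectors and evaluates the squared Mahalanobis distance. It assumes a Cholesky factor in place of the symmetric square root $C^{1/2}$ used above; any factor l with l l′ = C serves the same purpose. All function names are hypothetical.

```python
import math
import random

def cholesky3(c):
    """Lower-triangular factor l with l l' = c, for a 3x3 positive definite c."""
    l = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(c[i][i] - s)
            else:
                l[i][j] = (c[i][j] - s) / l[j][j]
    return l

def covariance(sig, r12, r13, r23):
    """General 3x3 covariance matrix from standard deviations and correlations."""
    s1, s2, s3 = sig
    return [[s1 * s1, r12 * s1 * s2, r13 * s1 * s3],
            [r12 * s1 * s2, s2 * s2, r23 * s2 * s3],
            [r13 * s1 * s3, r23 * s2 * s3, s3 * s3]]

def mahalanobis_sq(x, mu, l):
    """(x - mu)' C^{-1} (x - mu), computed by solving l z = x - mu for z."""
    z = []
    for i in range(3):
        s = sum(l[i][k] * z[k] for k in range(i))
        z.append((x[i] - mu[i] - s) / l[i][i])
    return sum(v * v for v in z)

def simulate_d2(mu, c, n, seed=0):
    """Draw n vectors X = l Z + mu and return their squared Mahalanobis distances."""
    rng = random.Random(seed)
    l = cholesky3(c)
    out = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        x = [mu[i] + sum(l[i][k] * z[k] for k in range(i + 1)) for i in range(3)]
        out.append(mahalanobis_sq(x, mu, l))
    return out
```

For an arbitrary example such as `simulate_d2([1.0, -2.0, 0.5], covariance((1.0, 2.0, 3.0), 0.5, 0.3, -0.2), 20000)`, the sample mean of the squared distances should be close to 3, the mean of a Chi-squared variable with three degrees of freedom, regardless of the correlations chosen.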


Maxwell Distribution Applications

This section provides two examples of missile systems analysis where the Maxwell
distribution applies.

Three Dimensional Miss Distance


One measure of tactical missile performance is the distribution of miss distance, measured as
the Euclidean distance from the missile to the target center, at closest approach of the weapon.
For air-to-ground missiles, the two-dimensional impact point is described in terms of
orthogonal “x-y” coordinates, centered at the target. The usual assumption is that the
coordinate positions are independent and identically distributed normal variables with zero
mean. These assumptions lead to the well known result that the distribution of the miss
distance in the two-dimensional plane follows a Rayleigh distribution.

For air-to-air or surface-to-air missiles, the three-dimensional Euclidean miss distance
distribution is sometimes used to describe performance. If the same assumptions are made
regarding each component (iid, zero mean normal variables), then the resulting joint
probability density is:

$$f(x, y, z; a) = \frac{1}{(2\pi a^2)^{3/2}} \exp\left[ \frac{-(x^2 + y^2 + z^2)}{2a^2} \right] \qquad \text{where } a^2 \equiv \text{common variance}$$

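To obtain Euclidean miss distance r, transform the orthogonal Cartesian coordinates to
spherical coordinates:

$$r = \sqrt{x^2 + y^2 + z^2} \qquad \theta = \tan^{-1}\left(\frac{y}{x}\right) \qquad \phi = \sin^{-1}\left(\frac{z}{r}\right)$$

$$x = r \cos\theta \cos\phi \qquad 0 \le r < \infty$$
$$y = r \sin\theta \cos\phi \qquad -\pi \le \theta \le \pi$$
$$z = r \sin\phi \qquad -\pi/2 \le \phi \le \pi/2$$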

The joint distribution of the transformed variables is:

$$f(r, \theta, \phi; a) = \left| J\left( \frac{x, y, z}{r, \theta, \phi} \right) \right| \frac{1}{(2\pi a^2)^{3/2}} \exp\left( \frac{-r^2}{2a^2} \right)$$

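The Jacobian for the transformation follows from the coordinate transform equations:

$$J\left( \frac{x, y, z}{r, \theta, \phi} \right) = \begin{vmatrix} \partial x/\partial r & \partial x/\partial\theta & \partial x/\partial\phi \\ \partial y/\partial r & \partial y/\partial\theta & \partial y/\partial\phi \\ \partial z/\partial r & \partial z/\partial\theta & \partial z/\partial\phi \end{vmatrix} = \begin{vmatrix} \cos\phi \cos\theta & -r \cos\phi \sin\theta & -r \sin\phi \cos\theta \\ \cos\phi \sin\theta & r \cos\phi \cos\theta & -r \sin\phi \sin\theta \\ \sin\phi & 0 & r \cos\phi \end{vmatrix} = r^2 \cos\phi$$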


$$f(r, \theta, \phi; a) = \frac{r^2 \cos\phi}{(2\pi a^2)^{3/2}} \exp\left( \frac{-r^2}{2a^2} \right)$$   Equation (19)



The marginal distribution of each variable is obtained by integrating out the other two:

$$f(r) = \frac{r^2}{(2\pi a^2)^{3/2}} \exp\left( \frac{-r^2}{2a^2} \right) \int_{-\pi/2}^{\pi/2} \cos\phi \, d\phi \int_{-\pi}^{\pi} d\theta = \sqrt{\frac{2}{\pi}} \, \frac{r^2}{a^3} \exp\left( \frac{-r^2}{2a^2} \right)$$   Equation (20a)

Equation 20a shows that the 3-dimensional miss distance is distributed as a Maxwell random
variable, with the parameter equal to the common standard deviation of the underlying normal
distribution.
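For reference, the other two marginal distributions can be easily calculated, noting that:

$$\int_0^\infty r^2 \exp\left( \frac{-r^2}{2a^2} \right) dr = a^3 \sqrt{\frac{\pi}{2}}$$

$$f(\theta) = \sqrt{\frac{\pi}{2}} \, \frac{a^3}{(2\pi)^{3/2} a^3} \int_{-\pi/2}^{\pi/2} \cos\phi \, d\phi = \frac{1}{2\pi}$$   Equation (20b)

$$f(\phi) = \sqrt{\frac{\pi}{2}} \, \frac{a^3}{(2\pi)^{3/2} a^3} \cos\phi \int_{-\pi}^{\pi} d\theta = \frac{\cos\phi}{2}$$   Equation (20c)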

Thus, the distribution of the azimuth angle (θ) is uniform over an interval of length 2π, while
the elevation angle (φ) has a cosine distribution. It is of interest to note that the product of the
three marginal densities equals the joint distribution of Equation 19. This result shows
these variables are independent of each other.

If a set of miss distance data is available, the standard deviation of the underlying normal
random variables can be estimated using one of the methods discussed previously.
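
A minimal sketch of this estimation step is given below, assuming simulated rather than measured miss distances. The maximum likelihood form used, $\hat{a}_{MLE} = \sqrt{\sum r_i^2 / (3n)}$, follows from the relation $T_s = 3n \cdot \hat{a}_{MLE}^2$ given in Appendix A; the function names and the sample values are hypothetical.

```python
import math
import random

def miss_distances(a, n, seed=0):
    """Simulate n three-dimensional miss distances with iid N(0, a^2) components."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        x, y, z = (rng.gauss(0.0, a) for _ in range(3))
        out.append(math.sqrt(x * x + y * y + z * z))
    return out

def maxwell_mle(r):
    """MLE of the Maxwell parameter: square root of (sum of r_i^2) / (3 n)."""
    return math.sqrt(sum(v * v for v in r) / (3.0 * len(r)))

# Example: recover the common standard deviation from simulated data.
a_hat = maxwell_mle(miss_distances(1.5, 5000, seed=2))
```

With 5000 simulated points the estimate should fall within a few hundredths of the value 1.5 used to generate the data.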

Filter Covariance Consistency Check

In some ground tracking systems, prediction and smoothing of the target track is implemented
via an alpha-beta (α-β) filter. Reference 6 contains an excellent description of these
trackers, and their implementation. Trackers of this type generally filter target position in
Cartesian coordinates, although measurements are made in a different “sensor” coordinate
system. The errors associated with the sensor coordinates (usually range, azimuth and
elevation) are generally independent, but become correlated when they are transformed to the
Cartesian system.

The covariance matrix associated with the target track is used for a variety of purposes,
including uplinks to a missile seeker for search and acquisition by the onboard system. A
potential problem can occur because the alpha-beta (α-β) filter does not directly produce a
covariance matrix estimate, although one can be derived from the particular structure of the
filter. In applications, the covariance matrix is supplied external to the α-β filter and should
be “matched” to the actual sensor error measurement characteristics.

One consistent way of accomplishing this is to develop the respective variance terms from the
implemented filter, in the sensor coordinate system, and transform to the Cartesian system.
This requires that statistics of the error components (assumed normal) be known ahead of
time. It also assumes that the “lag error” of the filter is negligible compared to the noise error.
Another approach is to develop the covariance matrix from some other algorithm, such as a
related sensor or filter, and make necessary adjustments to be compatible with the alpha-beta
filter.

In either case, consistency of the implemented filter with the supplied covariance matrix
should be assessed. One way of doing this is by comparing the theoretical capture ratio of the
measurements with the actual (or simulated) data error values. Capture ratio is defined as the
proportion of “weighted error magnitudes” that lie inside the volume contained within the
“2.5-sigma” error (position) ellipsoid. For equal component variance, the error ellipsoid is a
sphere.

Development of the statistical distribution for Mahalanobis distance showed it is Maxwell
distributed with parameter a = 1. This fact can be used to check consistency of the
measurement error and covariance matrix. The theoretical value is found from Equation 2:

$$\text{Capture Ratio} = P[Y \le 2.5] = \text{erf}\left( \frac{2.5}{\sqrt{2}} \right) - \sqrt{\frac{2}{\pi}} \cdot 2.5 \cdot \exp\left( \frac{-2.5^2}{2} \right) = 0.899939$$

The “data” value of the capture ratio is computed by evaluating the Mahalanobis distance
according to Equation 18 at each update point while the filter is tracking. The number of
calculated points no larger than 2.5, divided by the total number of data points, provides the
data capture ratio. This should be close to the theoretical value if the filter, measurements, and
covariance data are consistent.
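
The theoretical and data capture ratios can be sketched in a few lines of Python. The fragment assumes the Mahalanobis distances have already been computed from Equation 18; the function names and the default gate value argument are hypothetical.

```python
import math

def maxwell_cdf(y, a=1.0):
    """Maxwell CDF: erf(t) - (2/sqrt(pi)) t exp(-t^2), with t = y / (sqrt(2) a)."""
    t = y / (math.sqrt(2.0) * a)
    return math.erf(t) - 2.0 / math.sqrt(math.pi) * t * math.exp(-t * t)

def data_capture_ratio(distances, gate=2.5):
    """Proportion of Mahalanobis distances that lie inside the gate."""
    inside = sum(1 for d in distances if d <= gate)
    return inside / len(distances)

# Theoretical capture ratio for the 2.5-sigma ellipsoid with a = 1.
theory = maxwell_cdf(2.5)
```

Here `theory` reproduces the value 0.899939 quoted above, and `data_capture_ratio` applied to the tracked distances gives the quantity to compare against it.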

A second, equivalent check of consistency can be made by calculating the MLE, and
associated MSE, of the parameter “a” from the data values. If the magnitude of the computed
difference $|\hat{a} - 1|$ is more than twice the square root of the MSE, this is a strong
indication that a mismatch exists.
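
A minimal sketch of this second check follows. It assumes the MSE of the MLE is approximated by the Cramer-Rao lower bound $a^2/(6n)$ evaluated at the hypothesized value a = 1; that substitution, and the function name, are assumptions made for illustration.

```python
import math

def mle_consistency_check(distances):
    """Return (a_hat, mismatch_flag) for a set of Mahalanobis distances.

    a_hat is the Maxwell MLE; the flag is True when |a_hat - 1| exceeds
    twice the root MSE, with the MSE assumed equal to the CRLB 1/(6n).
    """
    n = len(distances)
    a_hat = math.sqrt(sum(d * d for d in distances) / (3.0 * n))
    root_mse = 1.0 / math.sqrt(6.0 * n)  # assumed: sqrt of a^2/(6n) at a = 1
    return a_hat, abs(a_hat - 1.0) > 2.0 * root_mse
```

A flag of True suggests investigating the potential mismatch sources listed next.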

Potential reasons for a mismatch are:

• The supplied covariance matrix is in fact based on a different model than the alpha-beta
filter.
• The actual noise error components have different variances than those assumed in
development of the covariance matrix.
• The actual noise error components are not from normal distributions, or are not zero
mean.
• Large target maneuvers during tracking have introduced large lag error in the filter
output which is not included in the theoretical development.

Miscellaneous Topics

This section covers some additional topics related to the Maxwell distribution and its
estimator properties.

Generation of Maxwell Random Variables


Simulation analysis work may require generation of Maxwell distributed random variables.
This can be accomplished in either of two ways. The first is by direct use of the cumulative
distribution function. As noted previously, we can write this in the form:

$$F(t_r) = q_r = \text{erf}(t_r) - \frac{2}{\sqrt{\pi}} \cdot t_r \cdot \exp\left( -t_r^2 \right) \qquad \text{where } t_r = \frac{y_r}{\sqrt{2} \, a}$$

The $q_r$ variables have a uniform distribution on the unit interval. Given a particular random
value on this interval, the corresponding value of $t_r$ can be found by a numerical procedure
using the above function. For a particular analysis, the parameter value “a” is known, so
values of the random variables $y_r$ are obtained from the $t_r$ values.

Various root finding procedures are available as discussed in Reference [7]. When applicable,
the Newton-Raphson method is desirable since it has quadratic convergence. This method
does require the derivatives to be “well behaved” in that the derivative value does not vanish
anywhere over the range of function values. Unfortunately, the Maxwell CDF has a zero
derivative at zero and the derivative approaches zero as the variable becomes large. As such,
the Newton-Raphson method does not converge in general and some other method, not
involving derivatives, should be used. Bisection is one such method, although it has slow
convergence. Brent’s method (Reference [7]) has “super-linear” convergence and is considered
one of the best “non-derivative” root finding algorithms.
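
The inverse-CDF approach can be sketched as follows. Bisection is used here only because it is short and derivative free; it is a stand-in for, not an implementation of, Brent's method. The function names, bracket expansion and tolerance are assumptions.

```python
import math
import random

def maxwell_cdf_t(t):
    """Maxwell CDF in the scaled variable t = y / (sqrt(2) a)."""
    return math.erf(t) - 2.0 / math.sqrt(math.pi) * t * math.exp(-t * t)

def maxwell_inverse_cdf(q, a, tol=1e-10):
    """Solve F(t) = q for t by bisection, then return y = sqrt(2) a t."""
    lo, hi = 0.0, 1.0
    while maxwell_cdf_t(hi) < q:      # expand the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if maxwell_cdf_t(mid) < q:
            lo = mid
        else:
            hi = mid
    return math.sqrt(2.0) * a * 0.5 * (lo + hi)

def maxwell_by_inversion(a, n, seed=0):
    """Generate n Maxwell deviates by inverting the CDF at uniform random values."""
    rng = random.Random(seed)
    return [maxwell_inverse_cdf(rng.random(), a) for _ in range(n)]
```

Each deviate costs a few dozen CDF evaluations, which is the iterative overhead the second generation method avoids.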

The second way to generate Maxwell random variables is to use the results derived for the
miss distance distribution. In this method, three separate normal random deviates are
generated, each with a standard deviation corresponding to parameter “a”. These three
deviates are squared and summed. The square root of the sum then provides the desired
Maxwell random deviate.

$$Y_{Maxwell} = \sqrt{X_1^2 + X_2^2 + X_3^2} \qquad X_i \sim N(0, a^2) \qquad i = 1, 2, 3$$

This implementation is generally much faster, depending on the software implementation,
because the iterative root finding is not required (although a normal random number algorithm
is needed). The results derived in Appendix F use this generation method and MATLAB
random variate generators. This implementation executes an order of magnitude faster than
does the root finding implementation.
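
A short Python sketch of this second generation method is shown below (the note itself used MATLAB generators; Python's standard `random` module is substituted here, and the function name is hypothetical).

```python
import math
import random

def maxwell_from_normals(a, n, seed=0):
    """Generate n Maxwell deviates as the root sum of squares of three N(0, a^2) deviates."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        x1, x2, x3 = (rng.gauss(0.0, a) for _ in range(3))
        out.append(math.sqrt(x1 * x1 + x2 * x2 + x3 * x3))
    return out

def sample_mean(values):
    """Arithmetic mean, used here to compare against the Maxwell mean a*sqrt(8/pi)."""
    return sum(values) / len(values)
```

As a sanity check, the sample mean of a large batch should be close to the Maxwell mean $a\sqrt{8/\pi}$.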


Non-Linear Least Squares (NLS) Parameter Estimation

The minimum variance quantile estimate of parameter “a” was derived previously. A question
is: “if all of the quantile data points were used to estimate the parameter, by minimizing the
square of the error between estimated and predicted values, would this produce a better
estimate?” This proposed estimation procedure is a least squares problem.

The parameter is related to the quantile values and the data points through the CDF. As such,
the least squares problem can be expressed as:

$$\varepsilon^2 = \sum_{r=1}^{n} \left[ q_r - F(y_r; a) \right]^2 = \sum_{r=1}^{n} \left\{ q_r - \left[ \text{erf}\left( \frac{y_r}{\sqrt{2} \, a} \right) - \sqrt{\frac{2}{\pi}} \, \frac{y_r}{a} \exp\left( \frac{-y_r^2}{2a^2} \right) \right] \right\}^2$$

The non-linear least squares problem is to minimize this squared error by setting the
derivative with respect to the parameter equal to zero:

$$\frac{\partial \varepsilon^2}{\partial a} = 0$$

This derivative still involves the error function and requires numerical solution. A
computationally more direct way is to use the fact that a single parameter is involved. The
function can be minimized directly using an optimization procedure (such as golden section,
Reference [7]).

A difficulty in determining the efficiency of this non-linear estimator is that neither the
distribution nor the mean square error can be calculated analytically. This non-linear least
squares (NLS) estimator will require numerical evaluation for any particular desired case.

As an example, take a Maxwell distribution with a = 1.5, and sample size n = 100. The
approach is to take a large number of Monte Carlo samples of size n and compute the MLE
and the NLS estimator for each sample. The mean and standard deviation of these estimates
are calculated and compared. The MLE estimators should have variance very close to the
CRLB. In this particular example, 5000 samples of size 100 were generated, and the NLS and
MLE estimates were made for each. The quantiles were estimated as:

$$q_r = \frac{r}{n + 1}$$

Figure 3 shows the results for the first 500 samples. The results apply only to the case
considered, but they are indicative of the relative performance of the NLS and MLE
estimators. The figure shows that the two estimates are highly correlated, as would be
expected ($r_{CORR} = 0.8976$).
Calculated results for the means and standard deviations are:

$$\bar{a}_{MLE} = 1.5013 \qquad \bar{a}_{NLS} = 1.5019$$

$$s_{MLE} = 0.0613 \qquad s_{NLS} = 0.0688$$
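
The NLS procedure for a single sample can be sketched as below. This is an illustrative implementation under stated assumptions: a plain golden section search, a search bracket of 0.5 to 2 times the MLE, and hypothetical function names. It is not the code used to produce the results above.

```python
import math
import random

def maxwell_cdf(y, a):
    """Maxwell CDF evaluated at y for parameter a."""
    t = y / (math.sqrt(2.0) * a)
    return math.erf(t) - 2.0 / math.sqrt(math.pi) * t * math.exp(-t * t)

def nls_error(a, y_sorted):
    """Sum of squared differences between q_r = r/(n+1) and F(y_(r); a)."""
    n = len(y_sorted)
    return sum(((r + 1) / (n + 1) - maxwell_cdf(y, a)) ** 2
               for r, y in enumerate(y_sorted))

def golden_section_min(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:                       # minimum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:                             # minimum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)

def nls_estimate(y):
    """NLS estimate of 'a'; the bracket around the MLE is an assumed choice."""
    y_sorted = sorted(y)
    a_mle = math.sqrt(sum(v * v for v in y) / (3.0 * len(y)))
    return golden_section_min(lambda a: nls_error(a, y_sorted),
                              0.5 * a_mle, 2.0 * a_mle)

def sample_maxwell(a, n, seed=0):
    """Maxwell sample via the root sum of squares of three normal deviates."""
    rng = random.Random(seed)
    return [math.sqrt(sum(rng.gauss(0.0, a) ** 2 for _ in range(3)))
            for _ in range(n)]
```

Repeating `nls_estimate(sample_maxwell(1.5, 100, seed=k))` over many seeds k, alongside the MLE for the same samples, gives a Monte Carlo comparison of the kind summarized in this section.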


[Figure 3: scatter plot of the Error Least Squares estimates (vertical axis) versus the MLE estimates (horizontal axis)]

Figure 3 – NLS & MLE Estimators: Empirical Results

The efficiency of the NLS (for this example) is the ratio of the CRLB to the estimator
variance.

$$\text{eff}(a_{NLS}) \approx \frac{a^2 / (6n)}{(0.0682)^2} = 0.8022 \approx r_{CORR}^2$$

Efficiency of the sample derived MLE estimator can be computed also:

$$\text{eff}(a_{MLE}) = \frac{a^2 / 600}{(0.0613)^2} = 0.991 \approx 1.0$$
This example indicates that using all of the quantiles in a sample does provide a higher
relative efficiency than the single quantile estimator. However, both the MLE and MOM
estimators have smaller MSE and are thus superior to this estimator.

The nonlinear least squares estimator is related to the ideas of correlation based
goodness-of-fit tests. Appendix F provides discussion of tests of this type.


Appendix A

Bayes’ Estimators for the Maxwell Distribution

Development

Bayesian estimation is based on a different point of view than are the other point estimation
methods discussed. Whereas MOM, MLE and quantile estimates assume the parameter is
fixed, the Bayesian approach is to treat this quantity as a random variable. The method has
the advantage of incorporating probabilistic information that is known prior to taking a
sample. The method is based on conditional probabilities and is an expression of Bayes’ rule
as applied to density functions. It is expressed as:

$$f(a \mid y) = \frac{f(y \mid a) \cdot h(a)}{g(y)}$$   Equation (A1)

In this equation, $h(a)$ represents statistical knowledge about variable “a” prior to obtaining
a data sample. It is a probability density and is called the “prior density.” The conditional
density $f(y \mid a)$ is the joint density of the sample (y), given a value of the parameter. Since
the data values are independent, this density is the product of the densities of each data
point, which is just the likelihood function of Equation 3.

The function $g(y)$ is the probability density of the data. This density can be calculated in
terms of the likelihood and the prior density using the Theorem of Total Probability.
Integration is over the support of the prior density (i.e., the range of values that parameter “a”
can take on):

$$g(y) = \int f(y \mid a) \cdot h(a) \, da$$   Equation (A2)

This density is essentially a normalization factor, so that Equation A1 satisfies the
requirements of a probability density function.

The conditional density, $f(a \mid y)$, is referred to as the posterior (or “a posteriori”) distribution
and represents the statistical knowledge about variable “a” after the current data sample is
incorporated. This posterior density, if it can be evaluated, provides a complete description of
the estimator (for the given data) since the mean, variance and other parameters can in theory
be calculated.

The estimate $\hat{a}$ can be selected in several ways. The Bayes’ estimator most
commonly used is found as the expectation of the posterior distribution:

$$\hat{a}_{BAYS} = E(a \mid y) = \frac{\int a \cdot f(y \mid a) \cdot h(a) \, da}{g(y)}$$   Equation (A3)


The Bayesian procedure can be thought of as a “weighted average” of the prior distribution
mean and the data sample mean. Most textbooks such as References 1-4 show simple
examples of this for normal and binomial distribution parameters. Generally as the data
sample size n increases, more weight is placed on the sample and less on the prior
distribution. This will be demonstrated in an example for the Maxwell distribution.

There are two issues involved with Bayesian estimation that complicate the procedure. The
first is selection of the prior distribution for the parameter, which is subjective in nature. This
distribution can be based on prior data samples if they are available. This distribution may be
assumed by the analyst, based simply on intuition. The second issue is the evaluation of the
data density (normalization factor) and the integrals resulting in Equation A2 and A3. For
single parameter distributions, the integrals can usually be evaluated numerically.

The likelihood function for the Maxwell distribution data sample is:

$$L(y; a) = f(y \mid a) = \prod_{i=1}^{n} f_Y(y_i; a) = \left( \frac{2}{\pi} \right)^{n/2} \left( \prod_{i=1}^{n} y_i^2 \right) \exp\left( \frac{-1}{2a^2} \sum_{i=1}^{n} y_i^2 \right) \cdot \frac{1}{a^{3n}}$$

It is noted that all terms in the likelihood function that are not functions of the parameter can
be brought outside the integrals, and thus cancel. This includes terms involving only the $y_i$
since these are fixed for a particular sample.

Evaluation of integrals involving this likelihood function presents particular problems because
of its structure. If the sample size n is even moderately large, the product of the negative
exponential and the term ($a^{-3n}$) becomes extremely small. In numerical evaluation
procedures the resulting integrals evaluate to a “0/0” situation (to computational accuracy on
most computers).
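
The underflow can be illustrated with a short sketch. The rescaling shown (working with the logarithm of the parameter-dependent factor and subtracting its maximum before exponentiating) is a common numerical device offered here only as an aside; it is not the approach taken in this note, which proceeds through the sufficient statistic. Function names and the grid are hypothetical.

```python
import math

def log_kernel(a, ts, n):
    """Log of the parameter-dependent factor exp(-Ts/(2 a^2)) / a^(3n)."""
    return -ts / (2.0 * a * a) - 3.0 * n * math.log(a)

def scaled_kernel(grid, ts, n):
    """Kernel on a grid of 'a' values, divided by its largest value.

    The common scale factor cancels in any ratio of integrals, such as
    Equation A3, so the rescaled values can be used in its place.
    """
    logs = [log_kernel(a, ts, n) for a in grid]
    top = max(logs)
    return [math.exp(v - top) for v in logs]

# Example: n = 1000 and sum of squares Ts = 3 n (4^2), so the kernel peaks at a = 4.
n, ts = 1000, 3.0 * 1000 * 16.0
direct = math.exp(log_kernel(4.0, ts, n))      # underflows to 0.0
rescaled = scaled_kernel([3.5, 4.0, 4.5], ts, n)
```

Evaluated directly, the factor underflows to zero even at its peak, while the rescaled values remain representable.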

The likelihood function is the distribution of the data samples. At any fixed value of the
parameter, the likelihood is a function of the data only, and as such it is a statistic. The
question arises as to whether a simpler statistic exists that contains all of the statistical
information of the sample. The answer is affirmative if, and only if, a sufficient statistic exists
for the data sample.

A sufficient statistic is defined in terms of conditional densities. Heuristically, a statistic that
is sufficient summarizes all of the information contained in the sample about the value of the
parameters to be estimated. The direct way of determining if a sufficient statistic exists is via
the Factorization Theorem (Reference 8). This theorem says that if the likelihood function can
be separated into two factors, one involving only data and constants and the other involving
only the parameter and a statistic, then that statistic is sufficient.

The Maxwell likelihood function can be factored as follows:

$$L(y; a) = f(y \mid a) = \left[ \left( \frac{2}{\pi} \right)^{n/2} \prod_{i=1}^{n} y_i^2 \right] \left[ \exp\left( \frac{-1}{2a^2} \sum_{i=1}^{n} y_i^2 \right) \cdot \frac{1}{a^{3n}} \right] = d(y_i) \; t(a; T_s)$$


The function $d(y_i)$ depends only on the data while function $t(a; T_s)$ depends only on the
parameter and the (sufficient) statistic $T_s$ where:

$$T_s = \sum_{i=1}^{n} y_i^2 = n \cdot M_2'$$

In other words, the second moment of the data about the origin is a sufficient statistic for the
Maxwell distribution. It should be noted that any one-to-one function of a sufficient statistic is
also a sufficient statistic. The fact that the second raw moment is sufficient should be no
surprise because this statistic is what determines the MLE. Sufficient statistics that arise from
applying the Factorization Theorem often produce estimators on which an MLE is based.

Equation A1 can be rewritten as:

$$f(a \mid T_s) = \frac{f_T(T_s \mid a) \cdot h(a)}{\int f_T(T_s \mid a) \cdot h(a) \, da}$$   Equation (A4)

The distribution function $f_T(T_s \mid a)$ is the sampling distribution of the sufficient statistic $T_s$,
for a fixed parameter value. To determine this distribution, recall expressions for the mean
and variance of the sample moments for the Maxwell distribution:

$$E(T_s) = E\left[ n \cdot M_2' \right] = n \cdot E\left[ M_2' \right] = 3a^2 \cdot n$$

$$\text{var}(T_s) = \text{var}\left[ n \cdot M_2' \right] = n^2 \cdot \text{var}\left[ M_2' \right] = 6a^4 \cdot n$$

Since the sample moment is the sum of n independent random variables, we can invoke the
Central Limit Theorem or CLT (Reference 9). This theorem states that the distribution of the
sum of a “large” number of random variables approaches a normal distribution. In order to
proceed further, we will assume that the second moment is distributed normally, with the
preceding mean and variance. If the constants and terms involving only the data values are set
to a proportionality factor (“k”), the Bayes estimate is the expectation of Equation A4, over
the support of “a”. The sample data, as contained in $T_s$, is fixed for the Bayes’ estimate.

$$\hat{a}_{BAYS} = \frac{k \cdot \int a \left( \frac{1}{a^2} \right) \exp\left\{ -\frac{1}{2} \left[ \frac{T_s - 3a^2 n}{\sqrt{6n} \, a^2} \right]^2 \right\} h(a) \, da}{k \cdot \int \left( \frac{1}{a^2} \right) \exp\left\{ -\frac{1}{2} \left[ \frac{T_s - 3a^2 n}{\sqrt{6n} \, a^2} \right]^2 \right\} h(a) \, da}$$   Equation (A5)

Equation A5 is as far as the theoretical development can go, other than canceling out common
terms in the exponent. The next step is to arrive at some prior distribution and then perform
the integration as shown. This will be done via a hypothetical example.

Hypothetical Example of Maxwell Parameter Bayesian Estimation


Suppose we have an electronic device that has inputs from three sources. Each signal is first
passed through a squaring circuit, and these are fed into a summing circuit. The circuits are
fast enough that dynamics of the system do not need to be considered. The output signal is the
sum of the three squared signals. The mean value of each input is known. The problem is to
estimate RMS power of input noise signals for a given device.

From previous experience with these devices, the RMS noise is known to vary from 3V to
5V, with the data being approximately bell shaped around 4V. The beta distribution is a quite
flexible statistical model when the data is known to have a finite support (i.e., the values are
within a finite interval of the real axis). The symmetrical beta distribution, for random
variable a, has the following density function over the finite interval $[b, c]$:

$$h(z; \gamma) = \frac{\Gamma(2\gamma)}{\Gamma(\gamma) \cdot \Gamma(\gamma)} \, z^{\gamma - 1} \cdot (1 - z)^{\gamma - 1} \qquad \text{where } z = \frac{a - b}{c - b}$$

If the parameter γ is unity, the beta distribution becomes the uniform distribution. A graph of
the symmetrical beta function for [3, 5] is shown in Figure A1 for various values of γ .

[Figure A1: (symmetrical) beta distribution priors plotted against parameter "a" over the interval 3 to 5, with curves for γ = 2, 3, 4 and 5]

Figure A1 – Symmetrical Beta (prior) Distribution

For this example, assume the beta parameter value has been established at 3. Once a sample is
taken and the second moment of the sample is calculated, the numerical integration of
Equation A5 can be carried out. Note again that constant factors in the prior distribution
cancel out.
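
A minimal numerical sketch of Equation A5 for this example is given below. It assumes a simple midpoint rule over the prior support [3, 5] and expresses the sufficient statistic through the MLE as $T_s = 3n \cdot \hat{a}_{MLE}^2$; the function names, step count and integration rule are assumptions, not the method used to produce the figures in this appendix.

```python
import math

def beta_prior_kernel(a, gamma=3.0, b=3.0, c=5.0):
    """Symmetrical beta prior on [b, c], up to its constant factor (which cancels)."""
    if a <= b or a >= c:
        return 0.0
    z = (a - b) / (c - b)
    return (z * (1.0 - z)) ** (gamma - 1.0)

def ts_kernel(a, ts, n):
    """Normal (CLT) approximation to the density of Ts given a, up to constants."""
    u = (ts - 3.0 * a * a * n) / (math.sqrt(6.0 * n) * a * a)
    return math.exp(-0.5 * u * u) / (a * a)

def bayes_estimate(a_mle, n, gamma=3.0, b=3.0, c=5.0, steps=2000):
    """Ratio of integrals in Equation A5, evaluated with the midpoint rule."""
    ts = 3.0 * n * a_mle ** 2
    h = (c - b) / steps
    num = den = 0.0
    for i in range(steps):
        a = b + (i + 0.5) * h
        w = ts_kernel(a, ts, n) * beta_prior_kernel(a, gamma, b, c)
        num += a * w
        den += w
    return num / den
```

Calling `bayes_estimate` over a range of `a_mle` values for several sample sizes n traces out curves of the Bayes estimate against the MLE.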

We might be interested in how the Bayes estimate varies with different values of the sample
second moment, and with different sample sizes n. We can do the integration of Equation A5
for a range of values of $M_2'$ and n, and display the results graphically. In order to see how
the Bayes’ estimate compares with the MLE estimate, write the sufficient statistic as:

$$T_s = n \cdot M_2' = 3n \cdot \hat{a}_{MLE}^2$$

Figure A2 shows the relationship between the MLE estimate and the Bayesian estimate,
assuming a symmetrical beta prior, with parameter equal to three. Note that n occurs
elsewhere in the integrals of Equation A5, independent of â MLE , and therefore is not a simple
scale factor. Several points are noted regarding these results.

[Figure A2: Bayesian parameter estimate "a_BAY" (vertical axis) versus MLE parameter estimate "a_MLE" (horizontal axis) for n = 10, 25 and 100, with the prior distribution support marked]

Figure A2 – Parameter Estimate Method Comparison


First, if the data indicate the parameter is not included in the support of the assumed prior, the
resulting Bayes estimate should not be used. In the case shown, if the data indicate that the
parameter is either greater than 5 or less than 3, then either this is a different device than the
one the historical data is based on, or the selection of the prior distribution was not adequate.

Second, the MLE and Bayes estimates are equal only at the mean of the prior. If the prior
density were not symmetrical, then in general the two estimates would never be the same.
Also, the variation of the Bayes estimate is approximately linear with the MLE estimate, over
the prior distribution support.

The third observation is that as sample size increases, the slope of the linear relationship
between the MLE and Bayes becomes greater. As mentioned earlier, the Bayes estimate is a
weighted mean of the prior and sample data. The prior tends to “pull” the estimate toward the
mean value. The data tends to “pull” the estimate toward the MLE, which is the best estimate
in the absence of knowledge other than the data itself.

For sample sizes less than about 10, application of the CLT may be questionable. In this case,
use of the likelihood function in Equation A3 generally allows for sufficiently accurate
numerical computation of the required integrals. Substitution of the likelihood function for
the sampling distribution of the statistic does not affect the choice of the prior distribution.

Another “Bayesian” estimate that may be chosen in some cases is the “Maximum a Posteriori”
(MAP) estimate. This corresponds to the maximum value (mode) of the posterior density.
The MAP has some appeal since function $g(y)$ is a fixed constant for the data sample. In
this case, all that is required is maximizing the numerator of Equation A1, and no integration
is involved. In fact, if the prior density is chosen as uniform over an interval (c, d), the MAP is
just the MLE solution, provided the MLE lies within that interval. As with the Bayes estimate,
the MAP estimate must be within the support of the prior.


Appendix B

Two Parameter Maxwell Distribution

Definition and Properties

The left side of Figure B1 shows the projection of a (hypothetical) sample of random data
values which represent three dimensional miss distances from a target. The third dimension is
perpendicular to the page. The miss distance magnitude is shown on the right hand side.
These graphs indicate a bias in miss distance. Assuming the lower bound as zero would not
provide the best estimation of this density function. The miss distance bias can be accounted
for by considering a two-parameter Maxwell distribution as the applicable statistical model.

[Figure B1: left panel, projection of the miss distance sample about the target (Axis 1 Miss Distance versus Axis 2 Miss Distance); right panel, empirical CDF of 200 samples, P(d<D) versus 3D miss distance d]

Figure B1 – Miss Distance Projection and Magnitude

The general two parameter density function is now a “location-scale” type distribution:

$$f(y; a, b) = \sqrt{\frac{2}{\pi}} \, \frac{(y - b)^2}{a^3} \cdot \exp\left[ \frac{-(y - b)^2}{2a^2} \right] \qquad a > 0 \; ; \; -\infty < b < \infty$$   Equation (B1)

The corresponding CDF is again found by direct integration on the interval $[b, \infty)$:

$$F(y; a, b) = \text{erf}\left( \frac{y - b}{\sqrt{2} \, a} \right) - \sqrt{\frac{2}{\pi}} \, \frac{(y - b)}{a} \exp\left[ \frac{-(y - b)^2}{2a^2} \right]$$   Equation (B2)

These function forms allow for a negative location parameter; however, based on physical
considerations, it is expected that $b \ge 0$ for almost all real world problems.
If parameter b is known, then a simple change of variable shows that all of the previous
results apply for estimators. If b is considered as an unknown parameter, a quite different
situation occurs. In this latter case, the support of the random variable y depends on the
parameter itself. This situation violates one of the “regularity conditions” for determining the
CRLB; and in fact a lower bound for the estimators of “a” and “b” cannot be found.

In Appendix A, the concept of a “sufficient statistic” was introduced. The existence of
sufficient statistics is generally the starting point for determining MLE estimators. In general,
estimation of two parameters requires two jointly sufficient statistics. Recall that the
Factorization Theorem provides a convenient way to derive sufficient statistics. The
likelihood function for the two parameter Maxwell distribution is:

$$L(y; a, b) = \left( \frac{2}{\pi} \right)^{n/2} \prod_{i=1}^{n} \frac{(y_i - b)^2}{a^3} \; \exp\left[ -\frac{1}{2a^2} \sum_{i=1}^{n} (y_i - b)^2 \right]$$

Because of the product terms, this function cannot be factored into one term that includes only
statistics and parameters, and a second factor including only data values. What this implies is
that the entire data set is required to define the parameters. This agrees with intuition since we
know that $b \le Y_{(1)}$ where $Y_{(1)}$ is the smallest order statistic. This smallest value is not known
unless all of the data values are known. Hence, the data cannot be summarized into a function
of the data that contains the same information about the parameters.

Method of Moment Estimators

A straightforward procedure for estimation of the parameters for this distribution is the
method of moments. This allows for relatively simple computations and also allows for
computation of the standard error of estimate for the parameters.
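
The computation is short enough to sketch directly. The fragment below applies the moment estimators of Equations B3a and B3b, with constants $c_1 = \sqrt{\pi/(3\pi - 8)}$ and $c_2 = \sqrt{8/(3\pi - 8)}$, to a simulated sample; the function names and the simulated data are hypothetical and serve only to illustrate the arithmetic.

```python
import math
import random

C1 = math.sqrt(math.pi / (3.0 * math.pi - 8.0))   # constant c1 of Equation B3a
C2 = math.sqrt(8.0 / (3.0 * math.pi - 8.0))       # constant c2 of Equation B3b

def mom_two_parameter(y):
    """Method of moments estimates (a_hat, b_hat) from the first two sample moments."""
    n = len(y)
    m1 = sum(y) / n                      # first sample moment about the origin
    m2 = sum(v * v for v in y) / n       # second sample moment about the origin
    root = math.sqrt(m2 - m1 * m1)
    return C1 * root, m1 - C2 * root

def shifted_maxwell_sample(a, b, n, seed=0):
    """Simulated two parameter Maxwell data: b plus a one parameter Maxwell deviate."""
    rng = random.Random(seed)
    return [b + math.sqrt(sum(rng.gauss(0.0, a) ** 2 for _ in range(3)))
            for _ in range(n)]

a_hat, b_hat = mom_two_parameter(shifted_maxwell_sample(1.5, 2.0, 20000, seed=3))
```

For a large simulated sample the two estimates should land close to the values of a and b used to generate it.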

In order to calculate moments about the origin for Equation B1, the simple substitution
u = y − b is made. The first and second moments are:

$$m_1' = \int_0^\infty (u + b) \cdot f(u; a) \, du = \mu_1' + b = \sqrt{\frac{8}{\pi}} \, a + b$$

$$m_2' = \int_0^\infty (u + b)^2 \cdot f(u; a) \, du = \mu_2' + 2b \, \mu_1' + b^2 = 3a^2 + 2b \sqrt{\frac{8}{\pi}} \, a + b^2$$

The following notation will be used in discussion of MOM estimators for the two parameter
Maxwell distribution:

$m_r' \equiv$ r-th population moment
$\mu_r' \equiv$ r-th moment about b (moment of variable u)
$M_r' = \frac{1}{n} \sum_{i=1}^{n} y_i^r \equiv$ r-th sample (data) moment

Note that the moments µr ′ are the same as these moments for the one parameter distribution
(i.e., when b = 0).


Variance of the random variable y is unaffected by location change: $\sigma_y^2 = \frac{3\pi - 8}{\pi} \, a^2$.

The MOM estimates for “a” and “b” are determined by equating first and second sample
moments, $M_1'$ and $M_2'$, to the theoretical expressions and solving the simultaneous
equations.

$$b = M_1' - \mu_1'$$

$$M_2' = \mu_2' + 2\mu_1' \left[ M_1' - \mu_1' \right] + \left[ M_1' \right]^2 - 2 M_1' \cdot \mu_1' + \left[ \mu_1' \right]^2$$

After some algebra, and substituting for the moments $\mu_r'$:

$$\hat{a} = c_1 \cdot \sqrt{M_2' - \left[ M_1' \right]^2}$$   Equation (B3a)

$$\hat{b} = M_1' - c_2 \sqrt{M_2' - \left[ M_1' \right]^2}$$   Equation (B3b)

The constants in these equations are: $c_1 = \left( \frac{\pi}{3\pi - 8} \right)^{1/2}$ and $c_2 = \left( \frac{8}{3\pi - 8} \right)^{1/2}$.

Both of the estimated parameters depend on $M_1'$ and $M_2'$. The mean square error for each
of these MOM estimates can be derived using the formulas developed in Appendix C for the
mean square error of a function of moment statistics. The relevant equations for two
parameter Maxwell distributions are:

$$\text{var}(\hat{a}) = \left( \frac{\partial \hat{a}}{\partial M_1'} \right)^2 \text{var}(M_1') + \left( \frac{\partial \hat{a}}{\partial M_2'} \right)^2 \text{var}(M_2') + 2 \left( \frac{\partial \hat{a}}{\partial M_1'} \right) \left( \frac{\partial \hat{a}}{\partial M_2'} \right) \text{cov}(M_1', M_2')$$   Equation (B4a)

$$\text{var}(\hat{b}) = \left( \frac{\partial \hat{b}}{\partial M_1'} \right)^2 \text{var}(M_1') + \left( \frac{\partial \hat{b}}{\partial M_2'} \right)^2 \text{var}(M_2') + 2 \left( \frac{\partial \hat{b}}{\partial M_1'} \right) \left( \frac{\partial \hat{b}}{\partial M_2'} \right) \text{cov}(M_1', M_2')$$   Equation (B4b)

The partial derivatives in these equations are evaluated at the mean values of the respective sample moments. In order to evaluate the variance of the moment estimators, the third and fourth moments of the density function are required.

$$ m'_3 = \mu'_3 + 3 \mu'_2\, b + 3 \mu'_1\, b^2 + b^3 $$   Equation (B5a)

$$ m'_4 = \mu'_4 + 4 \mu'_3\, b + 6 \mu'_2\, b^2 + 4 \mu'_1\, b^3 + b^4 $$   Equation (B5b)

The variance and covariance of the sample moments are also derived in Appendix C. These quantities are required in terms of the population moments.

$$ \operatorname{var}(M'_1) = \frac{1}{n} \left[ m'_2 - (m'_1)^2 \right] $$

$$ \operatorname{var}(M'_2) = \frac{1}{n} \left[ m'_4 - (m'_2)^2 \right] $$

$$ \operatorname{cov}(M'_1, M'_2) = \frac{1}{n} \left[ m'_3 - m'_1\, m'_2 \right] $$
Plugging the respective population moments into the above expressions and then substituting the parameters “a” and “b” into the expressions for the moments $\mu'_r$ provides (after some algebra) the following:

$$ \operatorname{var}(M'_1) = \frac{1}{n} \frac{a^2}{c_1^2} $$   Equation (B6a)

$$ \operatorname{var}(M'_2) = \frac{a^2}{n} \left[ 6 a^2 + \frac{4 b^2}{c_1^2} + 4 a b\, \frac{c_2}{c_1} \right] $$   Equation (B6b)

$$ \operatorname{cov}(M'_1, M'_2) = \frac{a^2}{n} \left[ \frac{c_2}{c_1}\, a + \frac{2 b}{c_1^2} \right] $$   Equation (B6c)

The partial derivative terms are evaluated by differentiation of Equations B3a and B3b. The last expression in each line results from evaluating the derivatives at the moment mean values.

$$ \frac{\partial \hat{a}}{\partial M'_1} = \frac{-c_1 M'_1}{\sqrt{M'_2 - (M'_1)^2}} = -\frac{c_1^2}{a} \left( \frac{c_2}{c_1}\, a + b \right) $$   Equation (B7a)

$$ \frac{\partial \hat{a}}{\partial M'_2} = \frac{c_1}{2 \sqrt{M'_2 - (M'_1)^2}} = \frac{c_1^2}{2 a} $$   Equation (B7b)

$$ \frac{\partial \hat{b}}{\partial M'_1} = 1 + \frac{c_2 M'_1}{\sqrt{M'_2 - (M'_1)^2}} = 1 + \frac{c_1 c_2 \left( (c_2 / c_1)\, a + b \right)}{a} $$   Equation (B7c)

$$ \frac{\partial \hat{b}}{\partial M'_2} = \frac{-c_2}{2 \sqrt{M'_2 - (M'_1)^2}} = \frac{-c_1 c_2}{2 a} $$   Equation (B7d)

The remaining two terms required for evaluation of Equations B4a and B4b are the products of the respective partial derivatives:

$$ \left( \frac{\partial \hat{a}}{\partial M'_1} \right) \left( \frac{\partial \hat{a}}{\partial M'_2} \right) = \frac{-c_1^2 M'_1}{2 \left[ M'_2 - (M'_1)^2 \right]} = \frac{-c_1^4 \left( (c_2 / c_1)\, a + b \right)}{2 a^2} $$   Equation (B7e)

$$ \left( \frac{\partial \hat{b}}{\partial M'_1} \right) \left( \frac{\partial \hat{b}}{\partial M'_2} \right) = \frac{\partial \hat{b}}{\partial M'_2} - \frac{c_2^2 M'_1}{2 \left[ M'_2 - (M'_1)^2 \right]} = -\frac{c_1 c_2}{2 a} - \frac{c_1^2 c_2^2 \left( (c_2 / c_1)\, a + b \right)}{2 a^2} $$   Equation (B7f)

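Combining expressions in Equations B6 and B7 (after substantial algebra) produces a remarkably simple result for each of the two standard errors.

$$ \operatorname{var}(\hat{a}) = \frac{a^2}{n} \left[ 1.5\, c_1^4 + c_2^2 - (c_1 c_2)^2 \right] $$

$$ \operatorname{var}(\hat{b}) = \frac{a^2}{n} \left[ 1.5\, (c_1 c_2)^2 + \left( 1 + c_2^2 \right) \left\{ \left( 1 + c_2^2 \right) / c_1^2 - c_2^2 \right\} \right] $$

Inserting the constant values results in:

$$ \operatorname{var}(\hat{a}) = 0.5270\, \frac{a^2}{n} $$   Equation (B8a)

$$ \operatorname{var}(\hat{b}) = 1.2737\, \frac{a^2}{n} $$   Equation (B8b)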
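The numerical factors in Equations B8a and B8b can be checked directly from the definitions of the constants. The short Python sketch below is an illustration only; the variable names are chosen for this example.

```python
import math

c1_sq = math.pi / (3 * math.pi - 8)   # c1 squared
c2_sq = 8 / (3 * math.pi - 8)         # c2 squared

# Bracketed factors that multiply a^2 / n in the two variance expressions
k_a = 1.5 * c1_sq ** 2 + c2_sq - c1_sq * c2_sq
k_b = 1.5 * c1_sq * c2_sq + (1 + c2_sq) * ((1 + c2_sq) / c1_sq - c2_sq)

print(round(k_a, 4), round(k_b, 4))
```

Note that (1 + c2^2) / c1^2 equals 3 exactly, which is one reason the final expressions are so compact.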

Neither of the standard errors depends on the parameter “b”. Figure B2 shows plots for each of the mean square errors; sample size is just a scale factor (assumed 1.0 in the figure). The result agrees with our intuition that the “location” of the distribution should not affect the standard errors, but the scale parameter does have a significant effect.

[Figure B2 plot: left panel, variance of â (n = 1) versus parameter a; right panel, variance of b̂ (n = 1) versus parameter a; parameter a ranges from 1 to 5 in both panels]

Figure B2 – Variance of MOM Estimators “a” and “b” (n = 1)

In practical inference applications, we do not know the actual values of the parameters as these are what we are trying to estimate. Usual statistical practice is to first compute the estimates â and b̂, and use these estimates in the standard error formulas. One should note that this introduces some additional errors into these error estimates.

A second issue that can arise in applying Equations B3a and B3b to actual data is that, in some cases, the estimate satisfies $\hat{b} > Y_{(1)}$, where $Y_{(1)}$ is the smallest sample value. If this occurs, the following procedure is recommended:
1. Calculate the standard error of b̂.
2. Determine if the inequality $\hat{b} - Y_{(1)} < 2 \sqrt{\operatorname{var}(\hat{b})}$ is true.
3. If this inequality is true, assume the minimum value is $Y_{(1)}$, as the “two-sigma” error of estimate includes this value.
4. If the inequality is false, a likely reason is that the sample is not from a Maxwell distribution.
5. Check the condition $\hat{b} - Y_{(2)} < 2 \sqrt{\operatorname{var}(\hat{b})}$.
6. If the inequality is also false for the second smallest order statistic, one can assume the sample is not from a Maxwell distribution.
7. If only the smallest order statistic violates the “two-sigma” error, but does so by a large amount, the analyst may choose to treat this value as an outlier.
8. In any case, it is suggested that the minimum of b̂ and $Y_{(1)}$ be used in any calculations involving the accepted distribution.
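The screening steps above can be sketched as a small Python function. This is an illustrative reading of the procedure, not the author's code: it assumes the two-sigma test is applied to the amount by which b̂ exceeds an order statistic, it takes the variance of b̂ from Equation B8b with â substituted for a, and the function name and verdict labels are hypothetical.

```python
import math

K_B = 1.2737  # factor in Equation B8b: var(b_hat) = K_B * a^2 / n

def screen_location_estimate(y, a_hat, b_hat):
    """Return (b_used, verdict) following the two-sigma screening steps."""
    ys = sorted(y)
    n = len(ys)
    two_sigma = 2.0 * math.sqrt(K_B * a_hat ** 2 / n)   # step 1
    if b_hat <= ys[0]:
        return b_hat, "ok"                  # estimate does not exceed the minimum
    b_used = min(b_hat, ys[0])              # step 8
    if b_hat - ys[0] < two_sigma:           # steps 2 and 3
        return b_used, "accept_minimum"
    if b_hat - ys[1] < two_sigma:           # steps 5 and 7: only Y(1) violates
        return b_used, "possible_outlier"
    return b_used, "reject_maxwell"         # step 6

result = screen_location_estimate([1.0, 2.0, 2.5, 3.0], a_hat=1.0, b_hat=1.2)
print(result)
```

In the usage line the estimate exceeds the smallest value by 0.2, well inside the two-sigma band of about 1.13, so the sample minimum is adopted.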


Maximum Likelihood Estimators

As noted previously, the two parameter Maxwell distribution does not have a sufficient statistic and the CRLB cannot be calculated. The ideas behind MLE estimation are still applicable. Recall that the likelihood is a function of the parameters, with data values fixed. MLE estimates are the values of the parameters that maximize this function. The two parameter Maxwell log likelihood is:

$$ \ln L(y; a, b) = \frac{n}{2} \ln\left( \frac{2}{\pi} \right) + 2 \sum_{i=1}^{n} \ln(y_i - b) - 3 n \ln a - \sum_{i=1}^{n} \frac{(y_i - b)^2}{2 a^2} $$

The respective MLE equations are:

$$ \frac{\partial \ln L}{\partial a} = -\frac{3 n}{a} + \frac{1}{a^3} \sum_{i=1}^{n} (y_i - b)^2 = 0 $$   Equation (B9a)

$$ \frac{\partial \ln L}{\partial b} = \left[ -2 \sum_{i=1}^{n} \frac{1}{(y_i - b)} + \frac{1}{a^2} \sum_{i=1}^{n} (y_i - b) \right] = 0 $$   Equation (B9b)

These two equations can be combined:

$$ \frac{2}{3 n} \sum_{i=1}^{n} (y_i - b)^2 \cdot \sum_{i=1}^{n} \frac{1}{(y_i - b)} = \sum_{i=1}^{n} (y_i - b) $$   Equation (B9c)

Equation (B9c) provides the estimate b̂MLE and equation (B9a) then provides âMLE . Note
that the value of b̂MLE is again constrained such that bˆMLE ≤ y(1) where y(1) is the smallest
sample value. The same rules suggested for MOM estimators when bˆMLE >Y(1) should
generally be applicable.
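One way to solve Equation B9c numerically is a bisection search restricted to values of b below the smallest observation, followed by Equation B9a for the scale. The Python sketch below is an assumption-laden illustration rather than the method used for the note's experiments: the bracket construction, the iteration count and the function name are choices made for this example, it presumes the sample values are not all equal and the sample is of moderate size, and it returns one root of B9c without checking for others.

```python
import math
import random

def mle_two_parameter(y, iterations=200):
    """Return (a_hat, b_hat): bisection on Equation B9c, then Equation B9a."""
    n = len(y)
    y_min, y_max = min(y), max(y)

    def g(b):
        # Left side minus right side of Equation B9c; requires b < y_min
        d = [v - b for v in y]
        return (2.0 / (3.0 * n)) * sum(t * t for t in d) * sum(1.0 / t for t in d) - sum(d)

    spread = y_max - y_min
    hi = y_min - 1e-9 * spread           # g is large and positive just below y_min
    lo = y_min - spread
    while g(lo) > 0.0:                   # move left until g changes sign
        lo = y_min - 2.0 * (y_min - lo)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    b_hat = 0.5 * (lo + hi)
    a_hat = math.sqrt(sum((v - b_hat) ** 2 for v in y) / (3.0 * n))   # Equation B9a
    return a_hat, b_hat

rng = random.Random(3)                   # arbitrary seed
sample = [5.0 + 2.0 * math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3)))
          for _ in range(2000)]          # simulated data with a = 2, b = 5
a_mle, b_mle = mle_two_parameter(sample)
print(f"a_mle = {a_mle:.3f}, b_mle = {b_mle:.3f}")
```

Because the search interval ends just below the smallest observation, the returned location estimate always satisfies the constraint that it not exceed the sample minimum.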

Numerical Test Results

Monte Carlo generated experiments have been used to test both the MOM and the MLE
estimators. The basic test set consisted of 10,000 replications of samples of various sizes.
Some conclusions drawn from these experiments are:

1. Both the MOM and MLE estimators can produce estimates b̂ that are greater than the minimum sample value $Y_{MIN}$.
2. The MOM estimate b̂ is virtually always within two standard deviations, where the standard deviation is the square root of the variance provided by Equation B8b (1 exception in 10,000).
3. The MOM estimates of â show very little sensitivity to not replacing $\hat{b}_{MOM}$ with $Y_{MIN}$. It is recommended that the constraint be enforced so that no negative values occur for the random variable $(Y - \hat{b})$. The MSE of these estimates is essentially equal to the values given by Equations B8.
4. MLE estimates show large sensitivity to not replacing $\hat{b}_{MLE}$ with $Y_{MIN}$. If this replacement is not made, both a larger variance and a bias are introduced into the estimates. For MLE estimates it is imperative that the constraint be enforced.
5. Variance of the MLE parameter estimates is less than that of the corresponding MOM estimates, if bias is not considered. The MLE estimates do appear to be biased, however, even when the minimum constraint is enforced. The experiments show $\hat{b}_{MLE}$ is positively biased (i.e., larger than the population value) while $\hat{a}_{MLE}$ is negatively biased.
6. The reason for the large sensitivity of the MLE estimates, when $\hat{b}_{MLE} > Y_{MIN}$, is that the assumption of positive values in the derivation of Equation B9c is violated. The resulting solution does not maximize the likelihood in this case. Enforcing the constraint implies that the log-likelihood of Equation B9a is valid, but the function is not necessarily maximized.
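A reduced version of such an experiment, for the MOM estimators only, might look like the following Python sketch. The sample size, number of replications, seed and helper names are assumptions chosen to keep the run short; they are not the settings of the experiments reported above.

```python
import math
import random
import statistics

C1 = math.sqrt(math.pi / (3 * math.pi - 8))
C2 = math.sqrt(8 / (3 * math.pi - 8))

def mom_estimates(y):
    """Equations B3a and B3b."""
    n = len(y)
    m1 = sum(y) / n
    spread = math.sqrt(sum(v * v for v in y) / n - m1 * m1)
    return C1 * spread, m1 - C2 * spread

def shifted_maxwell_sample(n, a, b, rng):
    """Hypothetical helper: b + a * |three-dimensional standard normal vector|."""
    return [b + a * math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3)))
            for _ in range(n)]

def scaled_variances(n, a, b, reps, rng):
    """Return n*var(a_hat)/a^2 and n*var(b_hat)/a^2 over the replications."""
    a_vals, b_vals = [], []
    for _ in range(reps):
        a_hat, b_hat = mom_estimates(shifted_maxwell_sample(n, a, b, rng))
        a_vals.append(a_hat)
        b_vals.append(b_hat)
    scale = n / a ** 2
    return scale * statistics.variance(a_vals), scale * statistics.variance(b_vals)

k_a, k_b = scaled_variances(n=100, a=2.0, b=5.0, reps=1000, rng=random.Random(7))
print(f"scaled var(a_hat) = {k_a:.3f}, scaled var(b_hat) = {k_b:.3f}")
```

The two printed values can be compared with the factors 0.5270 and 1.2737 of Equations B8a and B8b; agreement is only approximate because of Monte Carlo noise and the finite sample size.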


Appendix C

Sample Moment Standard Error and the Delta Method

Variance and Covariance of Sample Moments

Calculation of standard error formulae for MLE and MOM estimators requires knowing the variance (and covariance for the two parameter distribution) of the sample moments. The r-th sample moments are calculated as:

$$ M'_r = \frac{1}{n} \sum_{i=1}^{n} x_i^r $$

These moments are unbiased, as seen by taking the expectation of this statistic:

$$ E\left[ M'_r \right] = \frac{1}{n} E\left[ \sum_{i=1}^{n} x_i^r \right] = \frac{1}{n} \sum_{i=1}^{n} E\left( x_i^r \right) = \frac{1}{n}\, n\, \mu'_r = \mu'_r $$

Before calculating the variance and covariance, two mathematical facts are reviewed. The first is that the expectation of the product of functions of two independent variables is the product of the expectations. Assume x and y are independent variables, and g(x) and h(y) are functions of these variables, respectively:

$$ E[g(x) \cdot h(y)] = \int_{x, y} g(x)\, h(y)\, f(x, y)\, dx\, dy = \int_x g(x)\, f_x(x)\, dx \int_y h(y)\, f_y(y)\, dy = E[g(x)] \cdot E[h(y)] $$

The second useful fact is algebraic. Consider the product of two expressions that are composed of a sum of n values each:

$$ (a_1 + a_2 + \cdots + a_n) \cdot (b_1 + b_2 + \cdots + b_n) $$

Upon expansion, this expression contains n terms with matching indices, which sum to $\sum_{i=1}^{n} a_i b_i$. There are $n (n - 1)$ remaining cross terms of the form $a_i b_j$, $i \ne j$.

The variance of any sample moment is:

$$ \operatorname{var}(M'_r) = E\left[ (M'_r - \mu'_r)^2 \right] = E\left[ (M'_r)^2 \right] - 2 E\left[ \mu'_r M'_r \right] + (\mu'_r)^2 = E\left[ (M'_r)^2 \right] - (\mu'_r)^2 $$

The term involving the expectation can be expanded:

$$ E\left[ (M'_r)^2 \right] = E\left[ \left( \frac{1}{n} \sum_{i=1}^{n} x_i^r \right)^2 \right] = \frac{1}{n^2} E\left[ (x_1^r + x_2^r + \cdots + x_n^r) \cdot (x_1^r + x_2^r + \cdots + x_n^r) \right] $$

$$ = \frac{1}{n^2} E\left[ \sum_{i=1}^{n} x_i^{2r} + \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} x_i^r x_j^r \right] = \frac{1}{n^2} \left[ n\, E\left( x_i^{2r} \right) + \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} E\left( x_i^r x_j^r \right) \right] $$

Because each sample value is independent, the following holds:

$$ E\left( x_i^r \cdot x_j^r \right) = E\left( x_i^r \right) \cdot E\left( x_j^r \right) = (\mu'_r)^2 , \quad i \ne j $$

Combining the previous three equations results in:

$$ \operatorname{var}(M'_r) = \frac{1}{n^2} \left[ n\, \mu'_{2r} + n (n - 1) (\mu'_r)^2 - n^2 (\mu'_r)^2 \right] = \frac{1}{n} \left[ \mu'_{2r} - (\mu'_r)^2 \right] $$   Equation (C1)

Derivation of the expression for the covariance is similar. The covariance for any two sample moments is:

$$ \operatorname{cov}(M'_r, M'_s) = E\left[ (M'_r - \mu'_r) \cdot (M'_s - \mu'_s) \right] = E\left[ M'_r M'_s \right] - \mu'_r\, \mu'_s $$

Expanding the expectation, and again using independence of the sample values:

$$ E\left[ M'_r M'_s \right] = \frac{1}{n^2} E\left[ (x_1^r + x_2^r + \cdots + x_n^r) \cdot (x_1^s + x_2^s + \cdots + x_n^s) \right] = \frac{1}{n^2} E\left[ \sum_{i=1}^{n} x_i^{r+s} + \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} x_i^r x_j^s \right] $$

$$ = \frac{1}{n^2} \left[ n\, E\left( x_i^{r+s} \right) + \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} E\left( x_i^r x_j^s \right) \right] = \frac{1}{n^2} \left[ n\, \mu'_{r+s} + n (n - 1)\, \mu'_r\, \mu'_s \right] $$

Combining the above expressions results in the formula for the covariance:

$$ \operatorname{cov}(M'_r, M'_s) = \frac{1}{n} \left[ \mu'_{r+s} - \mu'_r\, \mu'_s \right] $$   Equation (C2)

In actual calculation, the population parameters (and thus moments) are generally not known
as these are the quantities being estimated. The sample moments themselves are substituted
into Equation C1 and Equation C2 in these cases. As such, the variance and covariance are
only approximations.
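The plug-in use of Equations C1 and C2 described above can be sketched in a few lines of Python. The function names are illustrative assumptions; the code simply substitutes sample moments for the population moments.

```python
def sample_moment(x, r):
    """r-th sample moment about the origin."""
    return sum(v ** r for v in x) / len(x)

def var_sample_moment(x, r):
    """Plug-in form of Equation C1."""
    return (sample_moment(x, 2 * r) - sample_moment(x, r) ** 2) / len(x)

def cov_sample_moments(x, r, s):
    """Plug-in form of Equation C2."""
    return (sample_moment(x, r + s) - sample_moment(x, r) * sample_moment(x, s)) / len(x)

data = [1.0, 2.0, 3.0]
print(var_sample_moment(data, 1), cov_sample_moments(data, 1, 2))
```

For the three-point data set the first and second sample moments are 2 and 14/3, so the plug-in variance of the first moment is (14/3 − 4)/3.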


The Delta Method

The delta method is a general procedure to find estimates of the mean and variance of a function of random variables, when the means and variances of the random variables are known. The method rests on Taylor’s Theorem for representing functions that are sufficiently differentiable. We will consider a function of two variables only, but the method can be extended to a function of any number of random variables.
Taylor’s Theorem (Reference 2) states that a sufficiently differentiable function can be expanded around a fixed value. For a function of two variables, $z = f(x, y)$, we can expand around the point $(x_0, y_0)$, retaining only linear terms:

$$ f(x, y) = f(x_0, y_0) + \left. \frac{\partial f}{\partial x} \right|_{x = x_0} (x - x_0) + \left. \frac{\partial f}{\partial y} \right|_{y = y_0} (y - y_0) + R_n $$

The term $R_n$ is the remainder, and the key point of Taylor’s theorem is that, for well-behaved functions, it approaches zero as n approaches infinity. In many applications only the linear terms are retained, and this will be the procedure here.
In our statistical applications, we will choose the mean of the random variables and expand around that point:

$$ x_0 = E(x) = \mu_X \qquad y_0 = E(y) = \mu_Y $$

So if we take the expectation of this expansion, noting the derivative values are constants:

$$ E(z) \approx E[f(\mu_X, \mu_Y)] + \left. \frac{\partial f}{\partial x} \right|_{\mu_X} E(x - \mu_X) + \left. \frac{\partial f}{\partial y} \right|_{\mu_Y} E(y - \mu_Y) $$

$$ E(z) \approx f(\mu_X, \mu_Y) $$   (Equation C3)

Hence, to first order, the expected value of a function of random variables is just the function evaluated at the expected values of the variables. Figure C1 provides an illustration of the assumption involved, for a function of one variable.

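[Figure C1 plot: a function f(y), its linear approximation f(y0) + (df/dy)|y=y0 (y − y0) about y0 = µY, and the underlying density pdf(y), all plotted against y]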

Figure C1 –Linear Function Approximation and Underlying Density


The approximation will be very good if the probability density of the underlying variable is
concentrated enough such that most of the “probability” lies within an interval where the
linear approximation does not deviate significantly from the actual function.

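The variance of the function is also required.

$$ E\left[ (z - E(z))^2 \right] = E\left[ \left\{ \left. \frac{\partial f}{\partial x} \right|_{\mu_X} (x - \mu_X) + \left. \frac{\partial f}{\partial y} \right|_{\mu_Y} (y - \mu_Y) \right\}^2 \right] $$

$$ = \left( \left. \frac{\partial f}{\partial x} \right|_{\mu_X} \right)^2 E\left[ (x - \mu_X)^2 \right] + \left( \left. \frac{\partial f}{\partial y} \right|_{\mu_Y} \right)^2 E\left[ (y - \mu_Y)^2 \right] + 2 \left. \frac{\partial f}{\partial x} \right|_{\mu_X} \left. \frac{\partial f}{\partial y} \right|_{\mu_Y} E\left[ (x - \mu_X)(y - \mu_Y) \right] $$

Note that the variables x and y are random variables and in general are correlated. This last expression becomes:

$$ E\left[ (z - E(z))^2 \right] = \left( \left. \frac{\partial f}{\partial x} \right|_{\mu_X} \right)^2 \operatorname{var}(X) + \left( \left. \frac{\partial f}{\partial y} \right|_{\mu_Y} \right)^2 \operatorname{var}(Y) + 2 \left. \frac{\partial f}{\partial x} \right|_{\mu_X} \left. \frac{\partial f}{\partial y} \right|_{\mu_Y} \operatorname{cov}(X, Y) $$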

This basic formula is used often in deriving the MSE of various types of statistics (estimators)
that are functions of other statistics with known variances. Examples are estimators that are
functions of the sample moments. The variance and covariance of sample moments is
calculated, or approximated, as discussed above.
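A generic version of the two-variable variance formula can be written with numerically estimated partial derivatives. The Python sketch below is an illustration under stated assumptions: the central-difference step size, the function name and the example function f(x, y) = x·y are all chosen for this example.

```python
def delta_method_variance(f, mu_x, mu_y, var_x, var_y, cov_xy, h=1e-6):
    """First-order variance of z = f(x, y) about the means (mu_x, mu_y)."""
    # Central-difference estimates of the partial derivatives at the means
    fx = (f(mu_x + h, mu_y) - f(mu_x - h, mu_y)) / (2.0 * h)
    fy = (f(mu_x, mu_y + h) - f(mu_x, mu_y - h)) / (2.0 * h)
    return fx ** 2 * var_x + fy ** 2 * var_y + 2.0 * fx * fy * cov_xy

# Example: z = x * y with means (2, 3); the partial derivatives are 3 and 2
approx = delta_method_variance(lambda x, y: x * y, 2.0, 3.0,
                               var_x=0.1, var_y=0.2, cov_xy=0.05)
print(approx)
```

For this example the formula gives 9(0.1) + 4(0.2) + 2(3)(2)(0.05) = 2.3, up to the small error of the numerical derivatives.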


Appendix D

Some Results for Order Statistics of a Sample

Distribution of Order Statistics

Start by assuming a random sample of size n, from some distribution specified by $f(\cdot)$. The sample values are arranged in ascending order:

$$ u_1 \le u_2 \le u_3 \le \cdots \le u_n $$

Each observation is now the r-th order statistic, $U_{(r)}$, where r refers to the index of the observed value. $U_{(1)}$ is the smallest, and $U_{(n)}$ is the largest value. The process of ordering the observations has introduced dependence between the order statistics.

Reference [5] provides a discussion of order statistics, which includes a derivation of their distribution and other properties of these statistics.
If the underlying density function is $f(u; \theta)$ and the corresponding CDF is $F(u; \theta)$, the density function, $g(u_{(r)}; \theta)$, for the r-th order statistic is:

$$ g(u_{(r)}; \theta) = \frac{n!}{(r - 1)!\, (n - r)!} [F(u; \theta)]^{r-1} \cdot f(u; \theta) \cdot [1 - F(u; \theta)]^{n-r} $$   Equation (D1)

The parameter θ can be either a scalar or vector quantity. Subsequent discussion assumes the support of the underlying distribution is $0 \le u < \infty$. The area under the underlying density to the left of any order statistic $U_{(r)}$ is $F(U_{(r)})$. The expected value of this area (or, equivalently, the probability) is:

$$ E\left[ F(U_{(r)}) \right] = \int_0^\infty F(u_r)\, g(u_r)\, du_r $$

Invoking the change of variable $z = F(u_r)$, noting $dz = f(u_r)\, du_r$, and using the density function of Equation D1 transforms the integral expression:

$$ E\left[ F(U_{(r)}) \right] = \int_0^1 \frac{n!}{(r - 1)!\, (n - r)!}\, z^r (1 - z)^{n-r}\, dz $$

This is a “beta-like” integral, and carrying out the integration results in:

$$ E\left[ F(U_{(r)}) \right] = \frac{n!}{(r - 1)!\, (n - r)!} \cdot \frac{\Gamma(r + 1)\, \Gamma(n - r + 1)}{\Gamma(n + 2)} $$

Invoking the gamma function property for integer arguments provides a simple expression for the expected value of the probability of any order statistic (with a continuous underlying distribution):

$$ E\left[ F(U_{(r)}) \right] = \frac{r}{n + 1} = q_r $$   Equation (D2)

Variance of Order Statistics


We now seek an expression for the variance of the r-th order statistic. Variance of an order statistic will only be an approximation, where the derivation utilizes the “Delta Method” discussed in Appendix C.
Begin by considering a uniform distribution on the unit interval [0,1]. Equation D1 becomes:

$$ g(u_{(r)}; \theta) = \frac{n!}{(r - 1)!\, (n - r)!}\, u^{r-1} (1 - u)^{n-r} $$

The expectation of this order statistic is seen again to involve “beta-like” integrals and, as expected, direct integration provides the same answer as Equation D2 (which applies for any distribution).

$$ E(U_{(r)}) = \frac{r}{n + 1} $$

The variance of the uniformly distributed variable can also be found by direct integration. We obtain:

$$ \operatorname{var}(U_{(r)}) = \frac{1}{n + 2} \cdot \frac{r}{n + 1} \cdot \left( 1 - \frac{r}{n + 1} \right) $$   Equation (D3)
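Equation D3 can be cross-checked against the standard result that the r-th order statistic of n uniform values follows a Beta(r, n − r + 1) distribution. The Python sketch below is illustrative; the function names are assumptions for this example.

```python
def beta_mean_var(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha / total, alpha * beta / (total ** 2 * (total + 1))

def d3_variance(r, n):
    """Variance of the r-th uniform order statistic as written in Equation D3."""
    q = r / (n + 1)
    return q * (1.0 - q) / (n + 2)

r, n = 3, 10
mean, var = beta_mean_var(r, n - r + 1)
print(mean, var, d3_variance(r, n))
```

For r = 3 and n = 10 both routes give a mean of 3/11 and a variance of 24/1452.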

For any continuous density function, $f(y; a)$, and CDF $F(y; a)$, define the random variable u:

$$ u = F(y; a) $$

This random variable is uniform on the interval [0,1]. The order statistic $Y_{(r)}$ is therefore obtained from the uniform order statistic by:

$$ Y_{(r)} = F^{-1}(u_{(r)}; a) $$

Recall that if variable y is a function of variable x, i.e., y = g(x), and the mean and variance of variable x are known, the delta method provides a way to approximate the mean and variance of variable y.

$$ E(Y) \approx g(\mu_X) \qquad \operatorname{var}(Y) \approx \operatorname{var}(X) \left( \left. \frac{d g}{d x} \right|_{x = \mu_X} \right)^2 $$

In the current case, x = u and $g = F^{-1}$. The required derivative is $\frac{d F^{-1}}{du} = \frac{dy}{du}$.
It is useful to note that because y is a monotonic function of variable u:

$$ \frac{du}{dy} = f(y; a) \quad \text{so that} \quad \frac{dy}{du} = \frac{1}{f(y; a)} $$

Since we know both the mean and variance of $U_{(r)}$, the mean and variance of the order statistic $Y_{(r)}$ follow from the above relations:

$$ E(Y_{(r)}) = F^{-1}\left( \frac{r}{n + 1} \right) = y_{(r)} $$

$$ \operatorname{var}(Y_{(r)}) = \frac{1}{n + 2} \cdot \frac{r}{n + 1} \cdot \left( 1 - \frac{r}{n + 1} \right) \cdot \frac{1}{f(y_{(r)}; a)^2} $$

The last equation for the variance of the order statistic implies that for moderately large n,

$$ \operatorname{var}(Y_{(r)}) = \frac{1}{n} \cdot \frac{q_r (1 - q_r)}{f(y_{(r)}; a)^2} $$   Equation (D4)

As n becomes large, the order statistic $Y_{(r)}$ has an asymptotic normal distribution with variance and mean given by the preceding equations. Reference [5] provides a detailed derivation of this theorem.

Relative Efficiency of Maxwell Mean and Median Estimators

An application of Equation D4 is to find the relative efficiency of the sample mean and the
sample median for the one parameter Maxwell distribution.

The sample mean of a Maxwell distributed variable has expected value and variance:

$$ E(\bar{y}) = \sqrt{\frac{8}{\pi}}\, a \qquad \operatorname{var}(\bar{y}) = \frac{\operatorname{var}(y)}{n} = \frac{1}{n} \left( \frac{3\pi - 8}{\pi} \right) a^2 = 0.4535\, \frac{a^2}{n} $$

The median value of the Maxwell distribution can be evaluated from Figure 2 in the main report, where $q_r = 0.5$. At this point $y_{med} = 1.5380\, a$.

$$ f(y_{med}; a) = \sqrt{\frac{2}{\pi}}\, \frac{(1.538\, a)^2}{a^3} \exp\left[ \frac{-(1.538\, a)^2}{2 a^2} \right] = \frac{0.5784}{a} $$

$$ \operatorname{var}(y_{med}) = \frac{0.5^2\, a^2}{n\, (0.5784)^2} = 0.7473\, \frac{a^2}{n} $$

$$ \mathrm{eff}_{REL} = \frac{\operatorname{var}(\bar{y})}{\operatorname{var}(y_{med})} = 0.6069 $$

This shows that the sample mean has about 61% of the variability of the sample median. (For comparison, the relative efficiency for the normal distribution sample mean to sample median is about 0.64.) The sample mean is preferable for estimating the parameter of the Maxwell distribution if one of these simple statistics were used, rather than MLE, MOM or quantile estimators.
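The relative efficiency calculation can be reproduced numerically. The Python sketch below assumes the standard closed form of the one-parameter Maxwell CDF and finds the median by bisection instead of reading it from a figure; the function names and the bisection bracket are assumptions for this example.

```python
import math

def maxwell_pdf(x, a=1.0):
    """One-parameter Maxwell density."""
    return math.sqrt(2.0 / math.pi) * x * x / a ** 3 * math.exp(-x * x / (2.0 * a * a))

def maxwell_cdf(x, a=1.0):
    """One-parameter Maxwell CDF."""
    s = x / a
    return math.erf(s / math.sqrt(2.0)) - math.sqrt(2.0 / math.pi) * s * math.exp(-0.5 * s * s)

def maxwell_quantile(q, a=1.0):
    """Invert the CDF by bisection (0 < q < 1)."""
    lo, hi = 0.0, 10.0 * a
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if maxwell_cdf(mid, a) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

y_med = maxwell_quantile(0.5)                    # median for a = 1
var_mean = (3.0 * math.pi - 8.0) / math.pi       # var of sample mean, times n / a^2
var_median = 0.25 / maxwell_pdf(y_med) ** 2      # Equation D4 with q = 0.5, times n / a^2
eff_rel = var_mean / var_median
print(f"median = {y_med:.4f}, efficiency = {eff_rel:.4f}")
```

Both variances are expressed in units of a²/n, so the ratio does not depend on the scale parameter or the sample size.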


Appendix E

Sums of Maxwell IID Random Variables

It may be of interest in some applications to determine the distribution of the sum of two or more independent identically distributed (IID) random variables drawn from a population that follows the Maxwell distribution. Moment generating functions or characteristic functions are often employed when closed form expressions for these functions exist and when their products can be inverted. This approach does not provide expressions that can be inverted for the Maxwell distribution. A direct integration approach will be used.

The joint distribution of two IID random variables, X and Y, is provided by the product of the density functions. Define the sum of these as random variable Z. Then:

$$ Z = X + Y \quad \text{or} \quad X = Z - Y $$

The distribution of Z can be found by substituting into the joint density, and integrating out variable Y. Because both X and Y are confined to non-negative values, integration is over the interval [0, z], which limits X to non-negative values.

$$ f(z) = \int_0^z f(y) \cdot f(z - y)\, dy $$

For the Maxwell distribution, the integral is:

$$ f(z) = \frac{2}{\pi a^6} \int_0^z y^2 (z - y)^2 \exp\left[ \frac{-y^2 - (z - y)^2}{2 a^2} \right] dy $$

Evaluation is carried out by expanding the terms in the exponent and the polynomial, integrating term by term, and finally collecting terms. The result is:

$$ f(z) = \frac{1}{8 \pi a^5} \left[ \sqrt{\pi}\, \exp\left( \frac{-z^2}{4 a^2} \right) \operatorname{erf}\left( \frac{z}{2 a} \right) \left( 12 a^4 - 4 a^2 z^2 + z^4 \right) + 2 z a \left( z^2 - 6 a^2 \right) \exp\left( \frac{-z^2}{2 a^2} \right) \right] $$   Equation (E1)

As before, erf (•) denotes the error function. The density function of equation E1 cannot be
integrated in closed form to obtain the CDF, or moments of this distribution. This density can
be integrated using numerical techniques to find the desired quantities, for a given value of
the parameter “a”.
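As a check on Equation E1, the closed form can be compared with a direct numerical evaluation of the convolution integral. The Python sketch below is illustrative: the midpoint rule, the number of steps and the function names are assumptions made for this example.

```python
import math

def maxwell_pdf(x, a=1.0):
    """One-parameter Maxwell density."""
    return math.sqrt(2.0 / math.pi) * x * x / a ** 3 * math.exp(-x * x / (2.0 * a * a))

def sum_pdf_e1(z, a=1.0):
    """Density of the sum of two IID Maxwell variables, as written in Equation E1."""
    erf_term = (math.sqrt(math.pi) * math.exp(-z * z / (4.0 * a * a)) * math.erf(z / (2.0 * a))
                * (12.0 * a ** 4 - 4.0 * a * a * z * z + z ** 4))
    exp_term = 2.0 * z * a * (z * z - 6.0 * a * a) * math.exp(-z * z / (2.0 * a * a))
    return (erf_term + exp_term) / (8.0 * math.pi * a ** 5)

def sum_pdf_numeric(z, a=1.0, steps=2000):
    """Midpoint-rule evaluation of the convolution integral over [0, z]."""
    h = z / steps
    return h * sum(maxwell_pdf((k + 0.5) * h, a) * maxwell_pdf(z - (k + 0.5) * h, a)
                   for k in range(steps))

for z in (1.0, 3.0, 5.0):
    print(z, sum_pdf_e1(z), sum_pdf_numeric(z))
```

Near z = 0 the two bracketed terms of Equation E1 nearly cancel, so a numerical implementation loses relative (though not absolute) accuracy for very small z.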

The mean and variance for a sum of Maxwell variables follow from the basic theorem for the addition of independent variables. This theorem can be stated mathematically for the sum of N variables, $Y_i$:

$$ Z_N = Y_1 + Y_2 + \cdots + Y_N $$

$$ E(Z_N) = \sum_N E(Y_i) = N \cdot E(Y_i) = N \cdot \sqrt{\frac{8}{\pi}}\, a $$   Equation (E2)

$$ \operatorname{var}(Z_N) = \sum_N \operatorname{var}(Y_i) = N \cdot \left( \frac{3\pi - 8}{\pi} \right) \cdot a^2 $$   Equation (E3)

Performing numerical integration for various values of the parameter confirms this result for N = 2. Numerical integration also shows that this distribution has constant skewness:

$$ \mathrm{Skew}(z) = \frac{E\left[ (z - \mu_Z)^3 \right]}{(\sigma_z^2)^{3/2}} = 0.3434 $$

Recall that skewness of the Maxwell distribution is 0.4857, so the sum of Maxwell variables
produces a more symmetrical distribution. The Central Limit Theorem implies that as N
becomes “large,” the distribution of the sum should approach the normal distribution with
mean and variance as indicated in equations E2 and E3, respectively. The skewness
calculation confirms this for the sum of two variables.
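The numerical integration mentioned above can be sketched as follows. This Python example is an illustration only: it integrates Equation E1 with a midpoint rule, and the integration limit, step count and function names are assumptions for this example.

```python
import math

def sum_pdf_e1(z, a=1.0):
    """Density of the sum of two IID Maxwell variables, as written in Equation E1."""
    erf_term = (math.sqrt(math.pi) * math.exp(-z * z / (4.0 * a * a)) * math.erf(z / (2.0 * a))
                * (12.0 * a ** 4 - 4.0 * a * a * z * z + z ** 4))
    exp_term = 2.0 * z * a * (z * z - 6.0 * a * a) * math.exp(-z * z / (2.0 * a * a))
    return (erf_term + exp_term) / (8.0 * math.pi * a ** 5)

def sum_moments(a=1.0, upper=15.0, steps=6000):
    """Midpoint-rule mass, mean, variance and skewness of Equation E1 on [0, upper * a]."""
    h = upper * a / steps
    zs = [(k + 0.5) * h for k in range(steps)]
    ws = [sum_pdf_e1(z, a) * h for z in zs]
    mass = sum(ws)
    mean = sum(z * w for z, w in zip(zs, ws)) / mass
    var = sum((z - mean) ** 2 * w for z, w in zip(zs, ws)) / mass
    third = sum((z - mean) ** 3 * w for z, w in zip(zs, ws)) / mass
    return mass, mean, var, third / var ** 1.5

mass, mean, var, skew = sum_moments(a=1.0)
print(f"mass = {mass:.6f}, mean = {mean:.4f}, variance = {var:.4f}, skewness = {skew:.4f}")
```

The computed skewness is also what one would expect from dividing the single-variable skewness of 0.4857 by the square root of 2, the usual scaling for a sum of two IID variables.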

The size required by the CLT for a “large number” of terms will depend on the underlying
distributions involved. This requirement can be investigated for the Maxwell distribution
using the sum of only two variables.

Figure E1 is a plot of equation E1, for the exact distribution of the sum of two Maxwell
variables. Also shown is the normal approximation for this distribution, using Equations E2
and E3, where N = 2. The parameter “a” is 1.0 in both cases. The underlying Maxwell
distribution for each variable is shown for comparison. The figure implies that the normal
distribution is a reasonable approximation of the sum, except for the “tails” of the distribution,
even for two variables.

[Figure E1 plot: probability density versus random variables Z and Y over the range 0 to 7; curves shown: Maxwell Distribution, a = 1; Exact Distribution, N = 2; Normal Approximation, N = 2]

Figure E1 – Maxwell Density and Sum of Two Random Variables

If sums of Maxwell random variables are required, it appears that the normal distribution is
adequate for N ≥ 3 in most cases, with mean and variance given in equations E2 and E3.
Evaluation of the exact distributions is possible only through numerical procedures. If N = 2 ,
equation E1 is recommended along with numerical evaluation of the CDF or other quantities
of interest.


Appendix F

Goodness of Fit and Outlier Tests for Maxwell Distribution

Hypothesis Tests

Discussion of point estimation methods has been based on the assumption that a data sample
is in fact from a Maxwell distribution. Given a set of data, an analyst would probably want to
get some indication of the veracity of this assumption prior to drawing conclusions about the
statistical model. The procedures used to determine if a set of data is from a particular
distribution, or family of distributions, fall under the category of statistical hypothesis tests.

Hypothesis tests are essentially a tool for making decisions under uncertainty. These tests consist of specifying a dichotomous decision defined by a null hypothesis, $H_0$, and an alternative hypothesis, $H_1$. Often, the null hypothesis is specified with the intent to reject it. If the null hypothesis is rejected, this implies the alternative hypothesis is accepted. The converse is not true: failure to reject the null hypothesis does not mean that it is true. Rather, failure to reject the null hypothesis means that the available data do not provide sufficient evidence against it. Reference [11] covers the theory of hypothesis tests in more detail.

In the decision framework, four possible outcomes can occur, depending on which decision is made and on the true “state of nature.” If the null hypothesis is true (i.e., the actual state of nature), and the analyst does not reject it, or if the alternative hypothesis is true and the analyst rejects the null hypothesis, then the correct decision is made. However, if the state of nature is such that the null hypothesis is true, but the analyst rejects it, a Type I error has occurred. If the alternative hypothesis is true, but the analyst fails to reject the null hypothesis, then a Type II error has occurred. Useful statistical tests generally provide the analyst with a quantitative way to assess the Type I error probability, which is referred to as α. The Type II error probability, which is referred to as β, is generally much harder to assess for a given statistical test. In general, if the size of a sample is fixed, the Type I and Type II errors have an inverse relationship. Decreasing both α and β requires increasing sample size. Most hypothesis tests specify the required probability α and do not explicitly specify β. This approach assumes that making a Type I error is more serious than making a Type II error. The Type I error probability is called the “significance level” of the test.

Hypothesis tests require that a relevant statistic be available. Since a statistic depends on the sample, it is a random variable, and has an associated distribution. Different hypothesis tests use different statistics, each with a distribution that can be derived exactly or asymptotically (i.e., large sample approximations). The percentage points for a significance test are the values of the statistic that are exceeded with specified probabilities when the null hypothesis is in fact true.

A simple hypothesis is one where the distribution is completely specified, which includes not only the type of distribution, but also any parameters of that distribution. An example of a simple hypothesis is a Maxwell type distribution with parameter a = 1. A composite hypothesis specifies the type of distribution (e.g., Maxwell) but allows for one or more free parameters that are not specified. Most goodness-of-fit tests require a simple null hypothesis but some do admit composite hypotheses. It is necessary to be aware of the assumptions behind any goodness-of-fit test that is being used to provide support for the analysis.

Before proceeding further, it is noted that some controversy exists regarding hypothesis tests applied to simulation experiments (Reference [14]). The essence of the argument involves the “power” of a test. Power is defined as the probability of rejecting the null hypothesis when it is false. In practical terms, the power is:

P(θ) = 1 − β ; θ = allowable parameters for H0

As the size of the sample increases, the Type II error decreases, and eventually, enough
samples will be taken that the null hypothesis will be rejected. In simulation experiments, this
means that unless we absolutely know the data comes from the null hypothesis distribution,
any distributional hypothesis will be rejected if enough Monte Carlo samples are obtained.
The analyst will need to decide if results based on goodness of fit tests are of value in
defending results based on the assumed distribution.

Power of a particular test is hard to evaluate quantitatively since it requires knowing probabilities of the test statistic under all other alternatives from the null hypothesis. In other words, the power of a test is a function of the alternative hypothesis, as well as the test statistic itself. Statistical theory for hypothesis tests deals with deriving test statistics and “most powerful” tests for given situations. References [1], [2], [5] and [11] discuss this topic in more detail.

In practical application, the power function generally is not known. The approach is to specify the null hypothesis, and then determine the critical region, for a particular statistic, where the null hypothesis is rejected under the condition that the null hypothesis is true. This approach is used in the following to derive statistics relevant to the specification of the null hypothesis as a Maxwell distribution. The critical regions for the specified statistics are derived through Monte Carlo simulation.

Hypothesis tests for distributions, referred to as goodness-of-fit tests, essentially try to show
that the data sample does not come from the distribution of the null hypothesis. In this sense it
is never proved that the data comes from the test distribution. Either the hypothesis is rejected
which “proves” the data does not come from that distribution; or the hypothesis is not
rejected. To emphasize: failure to reject the null hypothesis does not mean the data is from the
specified distribution, rather it merely means that we can’t reject that hypothesis.

Tests for outliers in distributions are similar in approach. The null hypothesis is that data
comes from a given distribution. A test statistic must be chosen in some way, and the critical
region for that statistic is again calculated using Monte Carlo simulation.
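The Monte Carlo derivation of a critical region can be sketched as follows. The statistic used here (largest sample value divided by the one-parameter scale estimate obtained from Equation B9a with b = 0) is a hypothetical choice made only to illustrate the mechanics; it is not claimed to be the statistic adopted in this note, and the sample size, replication count and seed are likewise assumptions.

```python
import math
import random

def maxwell_sample(n, a, rng):
    """Hypothetical helper: a * |three-dimensional standard normal vector|."""
    return [a * math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3)))
            for _ in range(n)]

def max_ratio_statistic(y):
    """Hypothetical outlier statistic: largest value over the ML scale estimate (b = 0)."""
    a_hat = math.sqrt(sum(v * v for v in y) / (3.0 * len(y)))
    return max(y) / a_hat

def percentage_points(n, alphas, reps, rng):
    """Upper-alpha empirical quantiles of the statistic under the null hypothesis."""
    stats = sorted(max_ratio_statistic(maxwell_sample(n, 1.0, rng)) for _ in range(reps))
    points = {}
    for alpha in alphas:
        k = min(reps - 1, int((1.0 - alpha) * reps))   # index of the empirical quantile
        points[alpha] = stats[k]
    return points

points = percentage_points(n=20, alphas=(0.10, 0.05, 0.01), reps=5000, rng=random.Random(11))
for alpha, value in points.items():
    print(f"alpha = {alpha:.2f}: reject if statistic > {value:.3f}")
```

Because this statistic is a ratio of quantities that both scale with the parameter, its null distribution does not depend on the value of a, so the simulation can be run with a = 1.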


Exploratory Data Analysis – Probability Plots

Prior to performing any type of hypothesis test regarding the null distribution, one may want to perform an Exploratory Data Analysis (EDA) of the sample data. EDA is a set of techniques that are used to obtain a qualitative understanding of the data and as an aid in formulating hypothesis tests. EDA is a data driven procedure that is intended to “let the data speak for itself” rather than to impose specified models. The only technique addressed in this paper is the use of probability plots, which are directly related to formulation of hypotheses about probability distributions.

Probability plots consist of taking the sampled values, placing them in ascending order, and calculating the empirical probability associated with each of these order statistics. The empirical probability is sometimes referred to as plotting position. The suggested plotting position formula for a data sample of size n is:

$$ p_i = \frac{i}{n + 1} $$

The empirical probability is plotted against the corresponding ordered sample value on one or more specially constructed “probability papers.” These papers are most easily constructed when the particular distribution can be put into a standardized “location-scale” form. Normal probability paper is the most common example.

A plot of the sample data against the empirical probability should produce an approximate straight line, if the data is in fact from the assumed distribution. For the Maxwell distribution, there is no generally available probability paper. As such, a MATLAB function was written to produce Maxwell probability paper. Appendix G provides more discussion, as well as a listing of the code for this development.
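The idea behind the plot can be illustrated in Python by pairing each ordered value with the unit-scale Maxwell quantile of its plotting position; points from a one-parameter Maxwell sample should then fall near a straight line through the origin with slope a. This quantile-pairing form is an assumed stand-in for probability paper and is not the MATLAB function of Appendix G; the function names and the bisection bracket are assumptions for this example.

```python
import math

def maxwell_cdf(x, a=1.0):
    """One-parameter Maxwell CDF."""
    s = x / a
    return math.erf(s / math.sqrt(2.0)) - math.sqrt(2.0 / math.pi) * s * math.exp(-0.5 * s * s)

def maxwell_quantile(p, a=1.0):
    """Invert the CDF by bisection (0 < p < 1)."""
    lo, hi = 0.0, 10.0 * a
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if maxwell_cdf(mid, a) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def probability_plot_points(sample):
    """Pair each ordered value with the unit-scale quantile of p_i = i / (n + 1)."""
    ys = sorted(sample)
    n = len(ys)
    return [(maxwell_quantile(i / (n + 1)), y) for i, y in enumerate(ys, start=1)]

points = probability_plot_points([2.1, 0.9, 1.6])
for quantile, value in points:
    print(f"{quantile:.4f}  {value:.2f}")
```

With three observations the plotting positions are 0.25, 0.50 and 0.75, so the middle point is paired with the unit-scale median.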

In EDA, the probability plot is often used to obtain a preliminary indication of whether the data is in fact from the selected distribution. If the plotted points deviate markedly from a straight line, then proceeding with a formal goodness-of-fit test is not required. Rather, a different distribution should be considered.

In many cases, a generally good fit can be obtained, but a few of the data points deviate
considerably at either the lower or upper end of the distribution. These values are potential
outliers. The probability plot is quite helpful in indicating if outliers may be present in the
sample. When possible outliers are identified, then some appropriate statistical test for
identification is warranted. Figure F0 shows Maxwell probability plots where goodness-of-fit
is questionable (panel a), and a plot where there appear to be outliers (panel b).

The following sections discuss relevant statistics and hypothesis tests for goodness-of-fit
(GoF). This is followed by a discussion of selecting a relevant statistic for testing for
outliers in the Maxwell distribution, along with tables of percentage points. Unless noted, a
specific test applies to the one-parameter Maxwell distribution.


[Figure F0: two Maxwell probability plots, each showing Probability % (ordinate) against Data Values (abscissa).]

(a) Possible Poor Fit (b) Possible Outliers

Figure F0 – EDA Probability plots for Maxwell Distribution

Goodness-of-Fit Tests - General

A large number of goodness-of-fit tests have been devised over the last century. A good
survey is provided in Reference [12]. This paper discusses only two of these: methods based
on the Empirical Distribution Function (EDF) and those based on correlation.

One of the oldest tests for goodness-of-fit is Pearson’s Chi-squared test. This test does allow
for a composite null hypothesis, where only the distribution type is required. The Chi-squared
test is accurate only when the sample size is large enough that the test statistic is in fact
Chi-squared distributed. Small samples or too few cells (or “bins”) can invalidate this test. The
process of “binning” the observations leads to a loss of information, and this test is
generally less powerful than tests based on order statistics or regression. A discussion and
procedures for implementation of the Chi-squared test are provided in Reference [10].

Empirical Distribution Function (EDF) Tests

EDF statistical tests are divided into two classes: those using supremum statistics and those
using quadratic statistics. The statistic $D_n$ is the supremum (or maximum value) of the
absolute difference between the empirical CDF for a sample of size n, $F_n(y)$, and the
hypothesized CDF, $F(y)$:

$$ D_n = \sup_{y} \left| F_n(y) - F(y) \right| $$


The empirical distribution function of the sample, $F_n(y)$, consists of the probability values
for the sample order statistics, where for the r-th order statistic $y_{(r)}$:

$$ F_n\left(y_{(r)}\right) = \frac{r}{n} $$

Supremum statistics are very useful because they are distribution free. This implies that the
test can be applied to any continuous distribution, without having to derive particular values
of the statistic for every distribution. The non-parametric nature of the test rests on the fact
that a random variable from an arbitrary continuous distribution, when transformed by its own
CDF, is uniformly distributed on the unit interval [0,1].

The most common application of a supremum test is the Kolmogorov-Smirnov (K-S)
procedure. A potential difficulty with the K-S test is that it requires a simple null hypothesis.
As discussed, this requires that not only the distribution form but also the parameters be
specified. Estimating parameters from the data and then applying the K-S test is not correct
and can lead to serious errors, particularly for small samples. The reason is that when the
parameters are estimated from the data, the resulting order statistics may no longer be
uniformly distributed as assumed by the K-S test. References [10] and [12] contain additional
details on use of this test.
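
As a minimal sketch of the supremum statistic for a fully specified (simple) null hypothesis, the following MATLAB fragment evaluates $D_n$ for a simulated Maxwell sample. The parameter value a = 1, the sample size and the random seed are assumptions chosen only for illustration. The two one-sided terms are needed because the empirical CDF is a step function, so the largest deviation can occur just before or just after a jump.

```matlab
% Maxwell CDF in terms of the standardized variable t = y/(sqrt(2)*a)
maxcdf = @(t) erf(t) - (2/sqrt(pi)) .* t .* exp(-t.^2);

a = 1;                                   % assumed KNOWN parameter (simple null hypothesis)
n = 50;                                  % illustrative sample size
rng(1);                                  % arbitrary seed, for repeatability only
y = sort(sqrt(sum(randn(n,3).^2, 2)));   % simulated Maxwell sample with a = 1, ascending
Z = maxcdf(y ./ (sqrt(2)*a));            % hypothesized CDF at each order statistic
i = (1:n)';
Dplus  = max(i/n - Z);                   % largest amount by which the EDF exceeds F
Dminus = max(Z - (i-1)/n);               % largest amount by which F exceeds the EDF
Dn = max(Dplus, Dminus);                 % supremum statistic D_n
```

The value of Dn would then be compared with a tabulated percentage point for the chosen n and significance level.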

The distribution of the K-S statistic is derived in Reference [5]. Tables for percentage points
at selected levels of significance (α) as a function of sample size (n) are provided for this
statistic. The MATLAB statistics toolbox provides an implementation of this test, with any
choice of sample size and level of significance.

The K-S test is very good at assessing a shift between the hypothesized distribution and the
data. This is because the supremum statistic is a measure of distance between the two
distributions. On the other hand, scale differences between distributions often are most
evidenced in the tail of the distribution, and the K-S statistic is least sensitive there. Recall
that the variance of an order statistic is proportional to p ⋅ (1 − p ) , where p is the quantile
probability. This has a maximum at p = 0.5, and decreases to zero in the tail of the sample
distribution. As such, the power of the test is highest when differences occur between central
portions of the distributions. To alleviate this situation, the supremum statistic could be
derived using a weighting function. This concept is discussed in the next section as applied to
quadratic statistics.

Quadratic Tests

Quadratic statistics are formulated on the basis of a weighted average of the squares of the
deviations between the empirical distribution $F_n(y)$ and the hypothesized distribution $F(y)$.
The general form is:

$$ T = n \int_{-\infty}^{\infty} \left[ F_n(y) - F(y) \right]^2 \Psi(y)\, dF(y) $$

The function $F(y)$ is the hypothesized distribution and $\Psi(y)$ is a weighting function. This
approach also requires a simple null hypothesis specification in order to derive an analytic
expression for the distribution of T. If $\Psi(y) = 1$, the resulting statistic is called the
Cramer-von Mises statistic.

The following weight function leads to the Anderson-Darling statistic, Reference [13]:

$$ \Psi(y) = \frac{1}{F(y) \cdot \left[ 1 - F(y) \right]} \qquad \text{Equation (F1)} $$

This weight function tends to place a higher “value” on the tails of the distribution, and as
such gives a more powerful test for detecting changes in scale between two distributions. The
computational formula for this statistic, denoted as $A^2$, is:

$$ A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left[ \ln Z_{(i)} + \ln\left( 1 - Z_{(n+1-i)} \right) \right] \qquad \text{Equation (F2)} $$

The $Z_{(i)}$ values are obtained by evaluating the hypothesized CDF at the ordered data,
$Z_{(i)} = F\left( Y_{(i)}; \hat{a} \right)$, where the $Y_{(i)}$ are the ordered sample data
values.
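
The computational formula of Equation (F2) can be sketched as a one-line MATLAB function handle. The handle name ad_stat and the vector Z below are hypothetical and serve only as an illustration; Z is assumed to be a column vector of CDF values in ascending order, strictly between 0 and 1.

```matlab
% Anderson-Darling statistic from the CDF values Z_(i) at the ordered data.
% flipud(Z) places Z_(n+1-i) in position i, as required by Equation (F2).
ad_stat = @(Z) -numel(Z) - (1/numel(Z)) * ...
    sum((2*(1:numel(Z))' - 1) .* (log(Z) + log(1 - flipud(Z))));

Z  = [0.10; 0.35; 0.50; 0.80; 0.95];   % illustrative values, already sorted
A2 = ad_stat(Z);
```

Because the formula pairs each value with its mirror image from the opposite end of the sample, both tails contribute to every term of the sum.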

Monte Carlo Simulation of Statistics

The idea behind Monte Carlo development for goodness-of-fit tests rests on being able to
generate random samples from the distribution that is being assumed (i.e., the null hypothesis
distribution). Samples drawn from the actual distribution are random variables. If a very large
number of samples are drawn, and the desired statistic is evaluated for each sample,
percentage points can be calculated based on these results. The assumption is that the number
of samples is large enough to adequately approximate the actual population.

This approach applies for either K-S or quadratic type tests, as well as for correlation and
regression goodness-of-fit statistics. Reference [10] discusses this issue and states that when
the distribution is of “location-scale” type, distribution of the test statistic does not depend on
the actual parameter values. Heuristically, this seems reasonable since the location-scale
random variable can be standardized. As noted in the main report, both the one and two
parameter Maxwell distributions are of this type.

Quadratic statistics have the property that they converge very quickly to their asymptotic
limits. For other than small sample sizes, the percentage points therefore depend only on the
level of significance chosen. These statistics have also been found to have higher power than
the Chi-squared or K-S tests as a result of the distribution tails having more weight in the
statistic.

One point to keep in mind when using asymptotic results is that the estimator used for
the unknown parameters must be asymptotically efficient. For the one-parameter Maxwell
distribution, this implies using the maximum likelihood estimator, $\hat{a}_{MLE}$. The MOM
estimates for the two-parameter distribution are not asymptotically efficient. If these moment
estimates were used, the limiting distribution for the quadratic statistic would be a
function of the sample size (as well as a function of the estimation method). The following
results apply only to the single-parameter Maxwell distribution.

Anderson-Darling Statistic for Maxwell One-Parameter Distribution

The methodology provided in Reference [12] was used to derive the Anderson-Darling
asymptotic percentage points for the Maxwell distribution. These results, along with some
very small sample value results, are given in Table F1 for selected levels of significance. The
results are derived for 50,000 Monte Carlo samples, each of the size specified. The
methodology is applied as follows:

1. Generate a Monte Carlo sample (size n) of standardized values “t” by taking the
square root of the sum of the squares of three standard normal random variables. Note: the
sample values are drawn from a population with parameter a = 1. Arrange the values in
ascending order.

2. The MLE estimate of “a” is next calculated using the sample data values. This is
accomplished as:

$$ \hat{a}_{MLE} = \sqrt{2}\, \sqrt{ \frac{\sum_{1}^{n} t_i^2}{3n} } $$

The $\sqrt{2}$ factor is required because the normalization factor in the Maxwell
distribution includes this term ($\sqrt{2}\, a$).

3. The n sample values must reflect that an estimated parameter, rather than the
actual population parameter, is being used. The sample standardized value $t_i$ is then:

$$ t_i = \frac{t_i}{\hat{a}_{MLE}} $$

4. The empirical distribution function for the sample is calculated from the CDF:

$$ F_n(t_i) = \operatorname{erf}(t_i) - \frac{2}{\sqrt{\pi}}\, t_i \exp\left( -t_i^2 \right) $$

5. Equation F2 is then calculated for each sample. This is done 50,000 times for the
current results. The values of the statistic $A^2$ are sorted and quantile values selected
corresponding to the selected significance level.
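
The five steps above can be sketched in MATLAB as follows. This is a minimal illustration under stated assumptions, not the code used to generate Table F1: the number of Monte Carlo samples is reduced to 5,000 for speed, the seed is arbitrary, and the variable names are hypothetical.

```matlab
n      = 10;       % sample size (illustrative)
nMC    = 5000;     % Monte Carlo samples (assumption: reduced from 50,000 for speed)
alpha  = [0.50 0.25 0.10 0.05 0.02 0.01];
maxcdf = @(t) erf(t) - (2/sqrt(pi)) .* t .* exp(-t.^2);
w      = 2*(1:n)' - 1;                       % weights (2i - 1) of Equation (F2)
A2     = zeros(nMC, 1);
rng(1);                                      % arbitrary seed, for repeatability only
for m = 1:nMC
    t = sort(sqrt(sum(randn(n,3).^2, 2)));   % step 1: Maxwell sample, a = 1
    s = sqrt(2) * sqrt(sum(t.^2) / (3*n));   % step 2: scale with the sqrt(2) factor
    Z = maxcdf(t ./ s);                      % steps 3-4: CDF at standardized values
    A2(m) = -n - sum(w .* (log(Z) + log(1 - flipud(Z)))) / n;   % step 5
end
A2  = sort(A2);
pct = A2(ceil((1 - alpha) * nMC));           % empirical upper percentage points
```

With a smaller number of Monte Carlo samples the resulting percentage points will scatter around tabulated values, most noticeably at the smallest significance levels.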

Sample                 Significance Level α
Size        0.50     0.25     0.10     0.05     0.02     0.01
5           .4589    .6646    .9358    1.1493   1.4458   1.70761
10          .4723    .6868    .9728    1.2041   1.5304   1.7822
25          .4744    .6913    .9929    1.2239   1.5488   1.7797
>= 30       .4805    .7070    1.0094   1.2390   1.5643   1.8162

Table F1 – Maxwell Distribution Percentage Points for the A2 Statistic

It is emphasized that when using the percentage points in Table F1, the statistic A2 must be
calculated using the MLE estimate of the parameter. If the value of the parameter “a” is
known (or assumed) independently, a table with different values at each significance level
would result.

Correlation Statistics for Two-Parameter Maxwell Distribution

A somewhat different approach can be taken for testing goodness-of-fit when the distribution
is of “location-scale” type. The idea uses the fact that the quantiles of order statistics from a
sample should be distributed uniformly on the unit interval. The development follows
Reference [12].

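The procedure uses the form of the standard definition of the correlation coefficient, defined
between a random variable $Y_i$ and some set of constants, $T_i$:

$$ R^2 = \frac{ \left[ \frac{1}{n} \sum_{1}^{n} (Y_i - \bar{Y}) \cdot (T_i - \bar{T}) \right]^2 }{ s_Y \cdot s_T } ; \qquad s_Y = \frac{\sum (Y_i - \bar{Y})^2}{n} , \qquad s_T = \frac{\sum (T_i - \bar{T})^2}{n} $$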

The usual correlation coefficient is defined between two random variables. The above
definition extends this definition to the case where one set of variables is not random.

The distribution of $R^2$ is based on a large number of Monte Carlo simulations of random
samples drawn from the selected distribution. In general, the distribution of $R^2$ will depend
on the underlying distribution form, as well as the specific parameters. In the case of
“location-scale” distributions, the random variables can be standardized, and the
resulting correlation results depend only on the distribution form.

The values from each sample are arranged in ascending order (i.e., as order statistics). The
most representative constant associated with each order statistic is the expected value of that
order statistic. Although the exact distribution of the order statistics is known (Appendix D),
the moments of these distributions cannot be calculated in closed form for most probability
densities, including the Maxwell distribution.

An alternative to the mean value of the order statistic is the mean value of the quantile
associated with that order statistic. This value is known in a simple form:

$$ E(q_r) = \frac{r}{n+1} $$

As an approximation, select constants $T_i$ such that:

$$ T_i = F^{-1}\left( \frac{i}{n+1} \right) $$
Using the Delta Method, it can be shown that this is an approximation to the mean of the
actual order statistic, for a one term Taylor expansion of the inverse function. As sample size
increases, the approximation becomes more exact (Reference [5]).

Hence, for sample size n and a specified location-scale distribution, a set $T_i$, for
$1 \le i \le n$, can be calculated using the distribution inverse function. A large number of
(Monte Carlo) samples are drawn from the specified distribution, and the correlation with the
fixed $T_i$ is computed for each sample. The resulting sample statistics are ordered, and
percentage points determined.

As the statistic $R^2$ decreases, it is less likely that the sample belongs to the specified
distribution (null hypothesis). In accordance with standard practice, the following related
statistic is generally used:

$$ Z_n = n \cdot (1 - R^2) $$

This statistic has the property that it increases as the indication of fit decreases.
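
A minimal MATLAB sketch of the correlation statistic for a single sample is given below. It assumes the standardized Maxwell CDF, inverts it numerically with fzero over the bracket [0, 10] (an assumption that covers all probabilities of practical interest), and uses a simulated sample with an arbitrary seed. The squared covariance divided by the product of the two variances is used for $R^2$. None of the variable names come from the original implementation.

```matlab
maxcdf = @(t) erf(t) - (2/sqrt(pi)) .* t .* exp(-t.^2);
n = 25;                                          % illustrative sample size
T = zeros(n, 1);
for i = 1:n
    p = i / (n + 1);
    T(i) = fzero(@(t) maxcdf(t) - p, [0 10]);    % T_i = F^{-1}(i/(n+1))
end
rng(1);                                          % arbitrary seed, for repeatability only
Y = sort(sqrt(sum(randn(n,3).^2, 2)));           % simulated Maxwell sample, ascending
cYT = mean((Y - mean(Y)) .* (T - mean(T)));      % covariance of data and constants
sY  = mean((Y - mean(Y)).^2);                    % variance of the ordered data
sT  = mean((T - mean(T)).^2);                    % variance of the constants
R2  = cYT^2 / (sY * sT);                         % squared correlation coefficient
Zn  = n * (1 - R2);                              % statistic Z_n = n (1 - R^2)
```

Repeating the sampling and the last five lines many times, then sorting the Zn values, gives the Monte Carlo percentage points described above. Because $R^2$ is invariant to location and scale changes in Y, the same constants T can be used for either the one or two parameter distribution.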

A MATLAB implementation of the procedure was developed. As a check, the Type I
Extreme Value distribution correlation results of Reference [12] were recomputed, using
50,000 rather than 10,000 samples. The method was then applied to the Maxwell distribution.
Note that these results can be used for either the one or two parameter distribution. The
distribution of $Z_n$ is quite different from a normal distribution.

Figure F1 shows the distribution of the $Z_n$ statistic for 50,000 Monte Carlo runs and a
sample size of 100, plotted on normal probability paper.

Percentage points for goodness-of-fit correlation testing of a Maxwell distribution are provided
in Table F2, for the indicated levels of significance and sample sizes.


[Figure F1: normal probability plot of the Zn statistic, showing Probability (ordinate) against Data (abscissa).]

Figure F1 – Distribution of Zn for n = 100 (50,000 MC runs)

Sample                 Significance Level α
Size        0.50     0.25     0.10     0.05     0.02     0.01
25          .7601    1.1414   1.6714   2.1169   2.7207   3.2321
50          .8960    1.3301   1.9313   2.4214   3.1496   3.7543
75          .9699    1.4302   2.0813   2.6271   3.4239   4.0950
100         1.0213   1.4964   2.1690   2.7294   3.5346   4.2037
150         1.0894   1.5831   2.2870   2.8972   3.7203   4.4158
200         1.1373   1.6498   2.3622   2.9590   3.8189   4.5451
400         1.2345   1.7753   2.5359   3.1263   4.0064   4.6605
750         1.3251   1.8966   2.6751   3.3299   4.2250   4.8762
1000        1.3611   1.9446   2.7354   3.3592   4.2418   4.9649

Table F2 – Percentage points for the Maxwell Distribution: Zn = n [1 – R2]

Figure F2 is a plot of the percentage points of Table F2. The plot shows that, for a given level
of significance, an asymptote is approached when the sample is sufficiently large. For small
sample sizes, the relationship is quite non-linear. This is a characteristic of all goodness-of-fit
statistics, although the definition of “large sample” differs depending on the test.


[Figure F2: percentage points Z = n [1 - R2] (ordinate) against sample size (abscissa, 0 to 1000), with one curve for each of α = 0.01, 0.02, 0.05, 0.10, 0.25 and 0.5.]

Figure F2 – Percentage Points for Maxwell Correlation Statistic

Regression Statistics

Regression statistics are related to correlation statistics, and can be derived the same way.
For location-scale distributions, the regression uses the ordered sample values and the
constants $T_i$ to perform an ordinary least-squares linear fit:

$$ Y_{(i)} = b_0 + b_1 T_i + \varepsilon_i $$

The test statistic is a function of the residuals of the least-squares fit, so these tests are
closely related to correlation tests. Because the power functions have not been
determined, the choice of regression vs. correlation test is not evident.

For the single-parameter Maxwell distribution, the Anderson-Darling statistic discussed above is
believed to be more powerful than either regression or correlation statistics. The interested
reader can consult Reference [12] for further discussion.


Outlier Tests

Panel (b) of Figure F0 provides an example of a data set that appears to follow a Maxwell
distribution quite closely, except for the three upper values. Applying one of the previous
goodness-of-fit tests may or may not result in rejecting the null hypothesis of the distribution
type. In either case, it may be of interest to determine if these extreme values can be identified
as outliers.

The study of outliers has a long history, and a certain degree of empiricism is involved in
selecting a relevant statistic for identifying and rejecting outliers, as discussed in
Reference [15]. Much of the work done on outliers has assumed that the samples are drawn from
normal distributions. When distributions other than normal are involved, one should approach the
historical criteria for outlier detection with caution.

Detection of outliers is a hypothesis testing problem as noted previously. The null hypothesis
is that the data are from a specified distribution. The alternative hypothesis is that “k” outliers
are present, where often k equals 1. In any case, the alternative hypothesis must state if the
outliers are at the upper tail, the lower tail or possibly at either tail of the distribution. Any
useful test for outliers will depend on this alternative hypothesis. Different decision statistics
are usually required to test different alternatives. The present paper deals only with the null
hypothesis of a Maxwell distribution and an alternative that includes outliers on the upper tail.

As noted in the general discussion, an outlier test involves a composite hypothesis. For a
decision statistic T, the test takes the form:

Null hypothesis H0: All data values come from the same Maxwell distribution

Alternative H1: The data contains k outliers from a different distribution

Test: T > Tα where α is the probability of a Type I error (significance level)


The first question is: how can we come up with a useful statistic for testing for upper outliers?
To this end, we consider basing our choice on the likelihood ratio for the null hypothesis and a
selected alternative. Likelihood ratio tests are a tool used to handle composite hypotheses.
This type of test compares the likelihood of the alternative (composite) hypothesis to the
likelihood of the null hypothesis, given the particular sample data x:

$$ LR = \frac{\text{Alternative Hypothesis Likelihood}}{\text{Null Hypothesis Likelihood}} = \frac{L_1(x; \theta_1)}{L_0(x; \theta_0)} $$

As the likelihood of the alternative increases relative to that of the null hypothesis, the null
hypothesis becomes less plausible. As such, the likelihood ratio increases as the evidence for
rejecting the null hypothesis increases. The idea is to find a suitable statistic T that reflects
this behavior. The approach is again to use Monte Carlo simulation to determine percentage points
at specified significance levels for the statistic, under the assumption that the null hypothesis
is true.


The specific type of alternative considered for the Maxwell outlier test statistic is the
“slippage” alternative. This means that we consider one or more of the data values to have
come from the same type of distribution (i.e., Maxwell), but to have different parameters than
the null distribution.

Although the slippage alternative assumes the same distribution type for outliers, it should
remain reasonable if the outlier data are from a different distribution family (Reference [15]).
In particular, alternative distributions that are of the “exponential family,” such as the Normal
or Gamma distributions, should have essentially the same test statistic. A definition and
discussion of exponential families of distributions is provided in References [1], [2] and [3].

In the development that follows, consideration is restricted to the hypothesis of a single outlier
at the upper tail. Development of a test statistic for k outliers is very similar. Results will be
stated for multiple outliers. The approach is to calculate the likelihood ratio as discussed
above and determine a statistic that increases as the likelihood ratio increases. As discussed in
the main text, we will actually use the log-likelihood for convenience of manipulation.

The log-likelihood function for the null hypothesis, with n data values, is:

$$ \ln L_0(y; a) = \frac{n}{2} \ln\left( \frac{2}{\pi} \right) + 2 \sum_{1}^{n} \ln y_i - 3n \ln a - \frac{\sum_{1}^{n} y_i^2}{2a^2} $$

The alternative hypothesis assumes that (n − 1) data values are from the same Maxwell
distribution, but one observation is from a different distribution. This different distribution is
characterized by a scale parameter (a / λ), where λ is the “slippage parameter.” The
likelihood function takes the form:

$$ L_1(y; a, \lambda) = \left( \frac{2}{\pi} \right)^{(n-1)/2} \left[ \prod_{1}^{n-1} \frac{y_i^2}{a^3} \right] \exp\left( \frac{-\sum_{1}^{n-1} y_i^2}{2a^2} \right) \cdot \left( \frac{2}{\pi} \right)^{1/2} \frac{y_n^2 \cdot \lambda^3}{a^3} \exp\left( \frac{-y_n^2 \cdot \lambda^2}{2a^2} \right) $$

Note that the slippage parameter λ should be less than 1.0 in order for the tail of this “outlier”
distribution to be extended. The data point $y_n$ represents the largest sample value. The
log-likelihood for the alternative hypothesis is:

$$ \ln L_1 = \frac{n}{2} \ln\left( \frac{2}{\pi} \right) + 2 \sum_{1}^{n} \ln y_i - 3(n-1) \ln a - \frac{\sum_{1}^{n-1} y_i^2}{2a^2} + 3 \ln \lambda - 3 \ln a - \frac{y_n^2 \lambda^2}{2a^2} $$

The parameter values that maximize each of these likelihood functions can be found in the
usual way by zeroing the derivative and solving for the various terms. For the null hypothesis,
the MLE solution is the same as derived in previous sections of this paper:

$$ a = \sqrt{ \frac{\sum_{1}^{n} y_i^2}{3n} } = \sqrt{ \frac{\overline{y^2}}{3} } $$


An important notational convention is noted:

$$ \overline{y^2} = \frac{\sum_{1}^{n} y_i^2}{n} \qquad \text{and} \qquad \left( \overline{y^2} \right)' = \frac{\sum_{1}^{n-1} y_i^2}{n-1} $$

Substituting “a” into the log-likelihood for $L_0$ provides the maximum value of the null
hypothesis function, for the given sample data:

$$ \ln L_{0,MAX} = \frac{n}{2} \ln\left( \frac{2}{\pi} \right) + 2 \sum_{1}^{n} \ln y_i - \frac{3n}{2} \left[ \ln \overline{y^2} - \ln 3 \right] - \frac{3n}{2} $$

The alternative likelihood function has two parameters. Note that parameter “a” in the
alternative likelihood has a different value than it does for the null likelihood. Taking
derivatives with respect to λ and a, equating to zero and solving provides:

$$ a^2 = \frac{\sum_{1}^{n-1} y_i^2}{3(n-1)} = \frac{\left( \overline{y^2} \right)'}{3} , \qquad \lambda^2 = \frac{3a^2}{y_n^2} = \frac{\left( \overline{y^2} \right)'}{y_n^2} $$

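Substituting these expressions into the alternative likelihood function, and after some
combining of terms, the maximum function value is:

$$ \ln L_{1,MAX} = \frac{n}{2} \ln\left( \frac{2}{\pi} \right) + 2 \sum_{1}^{n} \ln y_i - \frac{3n}{2} \left[ \ln \left( \overline{y^2} \right)' - \ln 3 \right] + \frac{3}{2} \ln\left[ \frac{\left( \overline{y^2} \right)'}{y_n^2} \right] - \frac{1}{2a^2} \left[ \sum_{1}^{n-1} y_i^2 + \frac{1}{n-1} \sum_{1}^{n-1} y_i^2 \right] $$

Also, substituting for “a” and noting that $y_n^2 \lambda^2 = \left( \overline{y^2} \right)'$, the
last term of this expression becomes:

$$ -\frac{1}{2a^2} \left[ \sum_{1}^{n-1} y_i^2 + \frac{1}{n-1} \sum_{1}^{n-1} y_i^2 \right] = -\frac{3n}{2} $$
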
The maximized log-likelihood ratio is the difference of these functions:

$$ \ln LR = \ln L_{1,MAX} - \ln L_{0,MAX} = \frac{3n}{2} \ln\left[ \frac{\overline{y^2}}{\left( \overline{y^2} \right)'} \right] + \frac{3}{2} \ln\left[ \frac{\left( \overline{y^2} \right)'}{y_n^2} \right] $$

Now define the statistic $T_1 = y_n^2 / \left( \overline{y^2} \right)'$, and the above expression
reduces to:

$$ \ln LR = \frac{3n}{2} \ln\left[ \frac{(n-1) + T_1}{n} \right] + \frac{3}{2} \ln\left[ \frac{1}{T_1} \right] \qquad \text{Equation (F3)} $$

The relevant statistic $T_1$ is the ratio of the square of the largest data value to the average
of the squares of the remaining data values. The following comments apply to this formulation:


1. The statistic $T_1$ is always greater than 1.0. As such, it can be seen from Equation
(F3) that as $T_1$ increases, the log-likelihood ratio increases. This implies that as $T_1$
becomes larger, the evidence for rejecting the null hypothesis increases.

2. The statistic $T_1$ is called an “exclusive” statistic since the average does not include
the extreme value. An alternative “inclusive” statistic could have been chosen as
$U_1 = y_n^2 / \overline{y^2}$. $T_1$ and $U_1$ are functionally related, since
$U_1 = n T_1 / [(n-1) + T_1]$. As such, the power of the test for either statistic is equal.
In the following tables, the “exclusive” form of the statistic has been used.

3. The test statistic for k outliers, based on the slippage alternative, is:

$$ T_k = \frac{ \frac{1}{k} \sum_{j=n-k+1}^{n} y_j^2 }{ \frac{1}{n-k} \sum_{i=1}^{n-k} y_i^2 } \qquad \text{where } y_1, y_2, \ldots, y_{n-k} < y_{n-k+1}, \ldots, y_n $$

4. Strictly speaking, the likelihood ratio as used in the derivation of these statistics is not
correct, because selecting the k largest values violates the assumption of randomness
in the likelihood function. Reference [15] addresses this issue. Essentially, what is
being done is to recognize that there are n! permutations of the sample data. If a
hypothesis test was performed for each permutation, the one resulting in the maximum
of the likelihood ratio would be chosen. Since this is the only one of interest, the
above procedure can be used.
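
As an illustration of the exclusive statistic defined in item 3 above, the following MATLAB fragment evaluates $T_k$ for a small data vector. The data values and the choice k = 1 are hypothetical and are not taken from this paper.

```matlab
y  = [0.9 1.4 1.1 2.0 1.7 0.6 1.3 5.2];   % illustrative data; 5.2 is the suspect value
k  = 1;                                   % number of suspected upper outliers
ys = sort(y(:));                          % order statistics, ascending
n  = numel(ys);
% mean square of the k largest values over mean square of the remaining n-k values
Tk = mean(ys(n-k+1:n).^2) / mean(ys(1:n-k).^2);
```

The computed Tk would be compared with the percentage point for the matching sample size, number of outliers and significance level. The tables in this appendix start at n = 25, so the eight-point vector here serves only to show the arithmetic.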

The following tables provide percentage points for the specified levels of significance.
Each table is for a different sample size. The exclusive form of the statistic is used. For
outlier rejection, the significance level should be small, and the tables reflect this
consideration. All tables reflect results of 50,000 Monte Carlo simulation samples.


Maxwell Outlier percentage points n = 25

                        Significance Level α
Outliers    .10      .05      .02      .01      .005     .001
1           4.8145   5.4052   6.2223   6.8581   7.4441   ----
2           4.4494   4.8802   5.4402   5.8519   6.2644   ----

Maxwell Outlier percentage points n = 50

                        Significance Level α
Outliers    .10      .05      .02      .01      .005     .001
1           5.1436   5.7111   6.4765   6.9823   7.6112   8.9510
2           4.6623   5.0379   5.5728   5.9360   6.3128   7.0560
3           4.4140   4.7270   5.1355   5.4232   5.7602   6.3415
4           4.2591   4.5401   4.8972   5.1420   5.4380   5.9670

Maxwell Outlier percentage points n = 75

                        Significance Level α
Outliers    .10      .05      .02      .01      .005     .001
1           5.3799   5.9265   6.6109   7.1600   7.7189   9.0112
2           4.8484   5.2052   5.6900   6.0244   6.3659   7.1285
3           4.5690   4.8665   5.2383   5.5123   5.7930   6.4035
4           4.3980   4.6556   4.9654   5.2079   5.4433   5.9793

Maxwell Outlier percentage points n = 100

                        Significance Level α
Outliers    .10      .05      .02      .01      .005     .001
1           5.5275   6.0788   6.7843   7.3133   7.8463   9.1031
2           4.9945   5.3513   5.8099   6.1483   6.4609   7.2427
3           4.6902   4.9838   5.3575   5.6042   5.8680   6.5090
4           4.4938   4.7507   5.0644   5.2797   5.4996   5.9863
5           4.3535   4.5879   4.8676   5.0629   5.2502   5.6770


Maxwell Outlier percentage points n = 200

                        Significance Level α
Outliers    .10      .05      .02      .01      .005     .001
1           5.9591   6.4758   7.1402   7.6411   8.1419   9.4174
2           5.3773   5.7278   6.1531   6.4487   6.7565   7.5001
3           5.0492   5.3298   5.6506   5.8891   6.1371   6.6655
4           4.8270   5.0670   5.3452   5.5360   5.7442   6.1896
5           4.6628   4.8747   5.1289   5.2989   5.4698   5.8298

Maxwell Outlier percentage points n = 400

                        Significance Level α
Outliers    .10      .05      .02      .01      .005     .001
1           6.4135   6.9305   7.5920   8.0892   8.5975   9.7696
2           5.8003   6.1378   6.5539   6.8654   7.1519   7.8258
3           5.4567   5.7216   6.0462   6.2671   6.4962   7.0009
4           5.2221   5.4428   5.7136   5.9018   6.0728   6.4955
5           5.0431   5.2376   5.4770   5.6349   5.7883   6.1710


Appendix G

Probability Plots and MATLAB Code for Plot Generation

The first step in many Exploratory Data Analyses (EDA) of a data set is plotting the data on
one or more types of probability paper. This allows for rapid qualitative assessment of whether
the data appears to come from the selected distribution. The probability plot is also useful in
identifying possible outliers as discussed in Appendix F.

MATLAB provides Normal distribution plots directly. Maxwell probability plots can be
generated using appropriate MATLAB code and plotting utilities. The resulting probability
paper can be used for either the one or two parameter distributions because of the “location-
scale” nature of this distribution.

The figure on the next page is the general purpose Maxwell probability paper which can be
copied and used for hand plotting if required. Following this figure is code for a MATLAB
function that takes a data vector as input and develops the probability plot axes, plots the input
data vector, and provides an estimate of the “best fit” Maxwell distribution for the given data.

Essentially, the probability plot is just plotting standardized values on both the ordinate and
abscissa, both being on a linear scale. Various probability values, which correspond to fixed
values of the standardized variable, are superimposed on the ordinate. This axis is always the
same regardless of the data. The abscissa scale is linearly adjusted to fit the actual data input.

The best fit line is generated by obtaining estimates of the parameters â and b̂ using the
method of moments. The MOM estimators are used for simplicity since they can be calculated
directly. The MOM and MLE estimators provide essentially the same information for qualitative
evaluation.


[Figure: blank Maxwell probability paper, showing Probability % (ordinate, tick marks from 10 to 99) against Data Values (abscissa).]

General Purpose Maxwell Probability Paper


MATLAB Code for Probability Plotting Function

function [ahat, bhat] = maxwell_probplt(Y)
%
% This function takes in a data vector and develops a plot of the values on
% probability axes corresponding to a general two parameter Maxwell
% distribution. A fitted line, based on moment parameter estimates, is also
% drawn on the plot.
% The Maxwell CDF in terms of the standardized variable t is:
%
%   F(t) = erf(t) - (2/sqrt(pi))*t*exp(-t^2)
%
%   t = (Y - b)/(sqrt(2)*a)
%
% First sort the data array for plotting purposes
%
Y = sort(Y);
%
% Check for at least four data values. An error is raised (rather than a
% plain return) so that the outputs are never left unassigned.
nsamp = length(Y);
%
if nsamp < 4
    error('Input requires at least 4 data values')
end
%
% Prepare the basic probability plot axes. This amounts to plotting the
% standardized Maxwell variable on both the X and Y axes. The Y axis then
% has tick marks superimposed corresponding to probabilities.
qv = (0:0.01:.99);
T = zeros(size(qv));
for i = 1:length(qv)
    q = qv(i);
    T(i) = fzero(@(t) maxinv(t,q),0.5);
end
T(1) = 0; % rounding error may cause non zero value so set to zero.
%
% Standardized values at which the probability tick marks are drawn
Q = [0 0.01 0.04 .1 .2 .3 .5 .7 .8 .9 .95 .98 .99];
L = zeros(size(Q));
for i = 1:length(Q)
    q = Q(i);
    L(i) = fzero(@(t) maxinv(t,q),0.5);
end
L(1) = 0;
%
plot(T,T,'LineStyle','none'), grid on
set(gca,'YTick',L)
set(gca,'YTickLabel',{'0';'1';'4';'10';'20';'30';'50';'70';'80';'90'; ...
    '95';'98';'99'})
title('Maxwell Probability Plot')
ylabel('Probability %')
xlabel('Data Values')
hold on
%
% Now compute the empirical probability and the standardized variables
% associated with the empirical values.
et = zeros(1,nsamp);
for i = 1:nsamp
    r = i/(1 + nsamp);
    et(i) = fzero(@(t) maxinv(t,r),0.5);
end
%
% Plot the standardized empirical values on the Y-axis (which are
% equivalent to the empirical probability) and the actual values on the
% X-axis.
plot(Y,et,'ro')
%
% Compute MOM estimates of a and b for the Maxwell distribution
% See document Appendix B for derivation
%
m2 = sum(Y.*Y)/nsamp;
m1 = sum(Y)/nsamp;
mdiff = m2 - m1^2;
%
ahat = 1.4849*sqrt(mdiff);
bhat = m1 - 2.3695*sqrt(mdiff);
% The location estimate cannot exceed the smallest data value
if bhat > Y(1)
    bhat = Y(1);
end
% Fitted line: data value corresponding to each standardized tick value
TFit = L*sqrt(2)*ahat + bhat;
%
plot(TFit,L,'LineStyle','--')
hold off
%
return
%
%
% The standardized variates are found by root-finding on the function
% below, which is the standardized Maxwell CDF minus the target probability q
function f = maxinv(t,q)
f = erf(t) - (2/sqrt(pi))*t*exp(-t^2) - q;
return


Appendix H

Chi Distributions for k > 3 and Nakagami Distributions

It has been noted that the Rayleigh and Maxwell distributions can be derived from the Chi-
squared distribution, with 2 and 3 degrees-of-freedom, respectively. These are useful because
certain physical phenomena give rise to such distributions. Two and three dimensional miss
distance distributions are examples. There may be physical situations where more than three
terms are involved in the root-mean-square variable. It was noted in the main report that these
give rise to Chi-distributions.

The Chi distributions can be further generalized so that the shape parameter k is no longer an
integer value, but can take on any continuous value greater than zero. These are called
Nakagami distributions. This distribution has been proposed in physical applications as a
model for fading of communication signals in multi-path.

Basic Properties of Chi Distributions


Consider the following random variable, where the $X_i$ are independent standard normal random
variables and “a” is a common scale parameter:

$$ y = a \sqrt{ \sum_{1}^{k} X_i^2 } = a \sqrt{W_k} $$

The chi-squared distribution of W, with k degrees of freedom, is:

$$ f(W) = \frac{W^{(k-2)/2}}{\Gamma(k/2)\, 2^{k/2}} \exp(-W/2) $$

The distribution of the random variable y follows from using the standard Jacobian
transformation method, with $W = y^2/a^2$:

$$ f_{Y,k}(y) = f_W\big(W(y)\big) \left| \frac{dW}{dy} \right| = f_W\big(W(y)\big)\, \frac{2y}{a^2}
   = \frac{1}{\Gamma(k/2)\, 2^{(k-2)/2}}\, \frac{y^{k-1}}{a^k} \exp\left( -\frac{y^2}{2a^2} \right) ; \quad y \ge 0 $$

Equation (H1)

The cumulative distribution function is found by direct integration:

$$ F_k(y) = \frac{1}{\Gamma(k/2)\, 2^{(k-2)/2}} \int_0^y \frac{u^{k-1}}{a^k} \exp\left( -\frac{u^2}{2a^2} \right) du
   = \frac{1}{\Gamma(k/2)}\, \gamma\left( \frac{k}{2} \,;\, \frac{y^2}{2a^2} \right) $$

Equation (H2)


The last function is the (lower) incomplete gamma function, and is defined as:

$$ \gamma(n; x) = \int_0^x t^{n-1} e^{-t}\, dt $$

The incomplete gamma function has a recursive property that makes evaluation easier
(Reference [16]):

$$ \gamma(k+1; x) = k\, \gamma(k; x) - x^k e^{-x} $$

$$ \gamma(1; x) = 1 - e^{-x} \quad \text{and} \quad \gamma\left( \tfrac{1}{2} ; x \right) = \sqrt{\pi}\, \operatorname{erf}\left( \sqrt{x} \right) $$
These distributions for a given k are collectively referred to as “Chi-k distributions.”
The above relations show that when k is even, the distribution function involves only
exponential terms, whereas when k is odd, the CDF involves the error function.
As an example, the above procedure can be applied when k = 4 to obtain:

$$ F_4(y) = 1 - \left( 1 + \frac{y^2}{2a^2} \right) \exp\left( -\frac{y^2}{2a^2} \right) $$

It is noted that regardless of the value of k, the distribution can be written in terms of the
standardized variable used for the Maxwell distribution:

$$ t = \frac{y}{\sqrt{2}\, a} $$
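As a numerical cross-check of Equation (H2), the following MATLAB sketch compares the incomplete gamma form of the distribution function with the closed form for k = 4. It assumes the built-in `gammainc(x, s)`, which returns the regularized lower incomplete gamma function γ(s; x)/Γ(s). The parameter value and the evaluation grid are arbitrary choices made for illustration.

```matlab
% Chi-k distribution function from Equation (H2), checked against the k = 4 closed form.
a = 1.5;                      % scale parameter (illustrative value)
k = 4;                        % degrees of freedom
y = linspace(0, 6, 25);       % evaluation points (illustrative grid)
x = y.^2 ./ (2*a^2);          % argument of the incomplete gamma function

% gammainc(x, s) is already the regularized form gamma(s; x)/Gamma(s)
Fgam = gammainc(x, k/2);

% Closed form obtained from the recursion for k = 4
Fcls = 1 - (1 + x).*exp(-x);

maxdiff = max(abs(Fgam - Fcls));
fprintf('max |F_gammainc - F_closed| = %g\n', maxdiff);
```

The same call with a non-integer second argument evaluates the distribution function for any k, which avoids working through the recursion by hand.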

The p-th raw moment of the Chi-k distribution is given as:

$$ m_k^p = \frac{1}{\Gamma(k/2)\, 2^{(k-2)/2}} \int_0^\infty \frac{u^{k+p-1}}{a^k} \exp\left( -\frac{u^2}{2a^2} \right) du
   = \frac{2^{p/2}}{\Gamma(k/2)}\, \Gamma\left( \frac{k+p}{2} \right) a^p $$

Equation (H3)

A plot of several Chi distributions (a = 1) is given in Figure H1. As k increases, the mode of
the distribution shifts to the right, as would be expected since the root-sum-square
involves more terms. Differentiating Equation (H1) with respect to y and equating to zero
gives the mode of the distribution as:

$$ y_{k,\mathrm{Mode}} = a \sqrt{k-1} $$
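The differentiation is simplest when carried out on the logarithm of Equation (H1), since the normalizing constant then drops out. A short worked version of that step:

```latex
\frac{d}{dy}\ln f_{Y,k}(y) = \frac{k-1}{y} - \frac{y}{a^{2}} = 0
\quad\Longrightarrow\quad
y^{2} = (k-1)\,a^{2}
```

For k = 3 this reduces to the Maxwell mode, $a\sqrt{2}$.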

The figure also appears to show that the variance and mode probability value change very
slowly as k increases.


[Figure H1: Chi-k probability density curves for k = 3, 4, 6 and 9. Horizontal axis: Data Value
(0 to 5); vertical axis: Probability (0 to 0.7).]

Figure H1 – Chi-k Distributions for Parameter a = 1

Table H1 provides values of the distribution mean, variance, skewness and kurtosis excess for
several of the Chi-k distributions. The raw moments were computed using Equation (H3).

k    Mean        2nd Moment   Variance     Skewness   Kurtosis excess

3    1.5958 a    3 a²         0.4535 a²    0.4857     0.1082
4    1.8800 a    4 a²         0.4657 a²    0.4057     0.0593
5    2.1277 a    5 a²         0.4729 a²    0.3542     0.0370
6    2.3500 a    6 a²         0.4777 a²    0.3179     0.0251
9    2.9180 a    9 a²         0.4854 a²    0.2519     0.0106

Table H1 – Moment Descriptors for Selected Chi-k Distributions
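The table entries can be regenerated directly from Equation (H3). The following MATLAB sketch does this for a = 1, using the standard conversions from raw moments to central moments. The variable names are illustrative, and `gammaln` is used in place of `gamma` only as a precaution against overflow for large k.

```matlab
% Moment descriptors of Chi-k distributions from Equation (H3), with a = 1.
a = 1;
kvals = [3 4 5 6 9];

% Raw moment of order p (Equation H3); gammaln avoids overflow of gamma() for large k
mraw = @(k, p) 2.^(p/2) .* exp(gammaln((k + p)/2) - gammaln(k/2)) .* a.^p;

m1 = mraw(kvals, 1);  m2 = mraw(kvals, 2);
m3 = mraw(kvals, 3);  m4 = mraw(kvals, 4);

% Standard raw-to-central moment conversions
v      = m2 - m1.^2;
mu3    = m3 - 3*m1.*m2 + 2*m1.^3;
mu4    = m4 - 4*m1.*m3 + 6*m1.^2.*m2 - 3*m1.^4;
skew   = mu3 ./ v.^1.5;
kurtEx = mu4 ./ v.^2 - 3;

% Columns: k, mean, 2nd moment, variance, skewness, kurtosis excess
disp([kvals' m1' m2' v' skew' kurtEx']);
```

Changing `kvals` extends the table to other values of k, including non-integer values if the Nakagami generalization is of interest.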

The table confirms that the variance of the distribution changes little with k, while the
skewness and kurtosis excess indicate that, as k increases, the distributions approach the
normal distribution. Approximating the Chi distribution as normal when k becomes “large” may
be adequate for some applications.


MLE Estimator and CRLB

All of the procedures discussed for the Maxwell distribution can be applied to Chi-k
distributions. As an example, the log-likelihood function for a sample of size n is:

$$ L_k(n; a) = -n \ln C(k) + (k-1) \sum_{i=1}^{n} \ln y_i - k \cdot n \ln a - \frac{1}{2a^2} \sum_{i=1}^{n} y_i^2 $$

where $C(k) = \Gamma(k/2) \cdot 2^{(k-2)/2}$. Differentiating and setting to zero produces the MLE
estimate for the parameter:

$$ \hat{a}_{MLE} = \sqrt{ \frac{\sum_{i=1}^{n} y_i^2}{k \cdot n} } = \sqrt{ \frac{M_2'}{k} } $$
The CRLB could be derived as was done for the Maxwell distribution using the appropriate
density function. A simpler alternate argument can be used, however. If we apply the
Factorization theorem discussed in Appendix A to the above likelihood function, it is clear
that the sufficient statistic for all k is the same:

$$ T_s = \sum_{i=1}^{n} y_i^2 $$

The MLE estimate is a function of this sufficient statistic and, as was the case for k = 3, the
MLE is an efficient estimator. Thus, the CRLB is the mean square error for this estimator, and
the necessary relation is (see the main body of this paper):

$$ MSE(\hat{a}_{MLE}) = CRLB(k) = \left( \frac{\partial \hat{a}_{MLE}}{\partial M_2'} \right)^2 \operatorname{var}(M_2')
   = \left( \frac{1}{4\, k\, M_2'} \right) \frac{1}{n} \left[ \mu_4' - \left( \mu_2' \right)^2 \right] $$

The relation between relevant moments turns out to have a simple form:

$$ M_2' \approx \mu_2' = k \cdot a^2 \quad \text{and} \quad \mu_4' - \left( \mu_2' \right)^2 = 2 \cdot k \cdot a^4 $$

Substituting and simplifying results in:

$$ CRLB(k) = \frac{a^2}{2 \cdot k \cdot n} $$

Equation (H4)
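Equation (H4) can be checked by simulation. The MATLAB sketch below draws repeated Chi-k samples, applies the MLE from the preceding equations to each sample, and compares the empirical mean square error with the bound. The values of k, a, the sample size and the number of trials are arbitrary choices for illustration, and agreement is only expected to within Monte Carlo noise.

```matlab
% Monte Carlo check of the Chi-k MLE against CRLB(k) = a^2/(2*k*n).
k = 5;            % degrees of freedom (illustrative)
a = 2;            % true scale parameter (illustrative)
n = 200;          % sample size per trial
ntrial = 2000;    % number of Monte Carlo trials

ahat = zeros(ntrial, 1);
for j = 1:ntrial
    % n Chi-k variates: root-sum-square of k normal components with std dev a
    y = a * sqrt(sum(randn(n, k).^2, 2));
    % MLE of the scale parameter from the sum of squares
    ahat(j) = sqrt(sum(y.^2) / (k*n));
end

mse  = mean((ahat - a).^2);     % empirical mean square error
crlb = a^2 / (2*k*n);           % Equation (H4)
fprintf('MSE = %.3e, CRLB = %.3e, ratio = %.3f\n', mse, crlb, mse/crlb);
```

Repeating the run with a larger k and the same n should show the empirical MSE shrinking in proportion to 1/k.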

The variance of the MLE estimate can be used to compare efficiencies with the MOM or
quantile estimator MSE if desired. It is interesting that the MSE of the estimator decreases
with increasing k. Noting that “a” is the common standard deviation of the underlying normal
variables, it would be expected that as more variables are included in the summing (i.e., as k
increases), more information is available for a given sample size n, and the estimate MSE
would decrease.

The MOM estimate for parameter “a” is the same as the MLE estimate when using the second
moment. MOM estimates are discussed below in general terms for the Nakagami distribution.


Random Number Generation and Inverse CDF

The distribution function of Equation (H2) becomes more complex as k increases. Generation
of random variables from Chi distributions is easily accomplished by taking the
root-sum-square of k independent normal random variables, each with standard deviation “a”.
This is the recommended procedure for developing Monte Carlo samples if statistics for
outliers or goodness of fit are needed.

Calculation of Chi-k random variates at specified quantile values will require inversion of the
distribution function. Examples where this is needed are for construction of probability paper
or Monte Carlo evaluation of the distribution of certain statistics. When k gets large, deriving
the exact distribution function using Equation (H2) and the recursive relations becomes
unwieldy. Depending on the problem requirements, the normal distribution might be an
alternative, with mean and variance calculated from Equation (H3).
It is noted that if the normal component distributions do not have the same underlying
variance, and/or if there are correlations present, then the Mahalanobis distance should be
used. In this case the scale parameter “a” of the resulting Chi-k distribution is unity.
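When quantiles are needed for moderate k, the closed-form recursion can be bypassed by inverting Equation (H2) numerically. The following MATLAB sketch is one way to do this, assuming the built-in `gammainc` and `fzero`. The bracket upper limit of 20a and the chosen probabilities are illustrative assumptions, not values taken from this report.

```matlab
% Numerical inversion of the Chi-k distribution function, Equation (H2).
k = 6;                    % degrees of freedom (illustrative)
a = 1;                    % scale parameter (illustrative)
q = [0.1 0.5 0.9];        % probabilities at which variates are wanted

% Distribution function via the regularized lower incomplete gamma function
Fk = @(y) gammainc(y.^2 ./ (2*a^2), k/2);

yq = zeros(size(q));
for i = 1:numel(q)
    % Bracketed root search; F(0) = 0 and F(20a) is 1 to machine precision here
    yq(i) = fzero(@(y) Fk(y) - q(i), [0, 20*a]);
end
disp([q' yq']);
```

The bracketed form of `fzero` keeps the search on the positive axis, which matters because the distribution function is only defined for y ≥ 0.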

Nakagami Distribution

As noted in the introduction, Chi-k distributions are related to even more general types of
Nakagami distributions. The connection is established in the following discussion.

This distribution has the probability density function:

$$ f(y) = \frac{2}{\Gamma(\alpha)} \left( \frac{\alpha}{\lambda} \right)^{\alpha} y^{2\alpha - 1} \exp\left( -\frac{\alpha}{\lambda}\, y^2 \right) $$

Equation (H5)

Here α is a shape parameter and λ is a scale parameter. Suppose these parameters are chosen as:

$$ \alpha = \frac{k}{2} \quad \text{and} \quad \lambda = k \cdot a^2 $$

With these choices, the Nakagami distribution becomes a Chi-k distribution. It is apparent that
the former distribution is a generalization of the latter, where the shape parameter is not
restricted to integers or an integer plus one-half.
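The correspondence between Equations (H5) and (H1) can be confirmed numerically. The MATLAB sketch below evaluates both densities on a grid, taking α = k/2 and λ = k a², the value for which the exponential terms of the two densities agree. The values of k and a and the grid are illustrative.

```matlab
% Check that the Nakagami density (H5) reduces to the Chi-k density (H1).
k = 5;  a = 0.8;                     % illustrative Chi-k parameters
y = linspace(0.01, 5, 50);           % evaluation grid (illustrative)

alpha  = k/2;                        % Nakagami shape parameter
lambda = k*a^2;                      % Nakagami scale parameter (second moment of y)

% Chi-k density, Equation (H1)
fchi = y.^(k-1) .* exp(-y.^2/(2*a^2)) ./ (gamma(k/2) * 2^((k-2)/2) * a^k);

% Nakagami density, Equation (H5)
fnak = (2/gamma(alpha)) * (alpha/lambda)^alpha .* y.^(2*alpha - 1) ...
       .* exp(-(alpha/lambda) * y.^2);

maxdiff = max(abs(fchi - fnak));
fprintf('max |f_chi - f_nakagami| = %g\n', maxdiff);
```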

If one is dealing with a set of data that is suspected of coming from a Chi-k distribution, but
the value of k is uncertain, this more general form may be useful for estimating the value of
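the shape parameter. The likelihood function for the Nakagami distribution is:

$$ L(y; \alpha, \lambda) = \left( \frac{2}{\Gamma(\alpha)} \right)^{n} \left( \frac{\alpha}{\lambda} \right)^{n\alpha}
   \prod_{i=1}^{n} y_i^{2\alpha - 1}\, \exp\left( -\frac{\alpha}{\lambda} \sum_{i=1}^{n} y_i^2 \right) $$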

Since two parameters need to be estimated, we ask if there are two sufficient statistics for this
likelihood function. We can apply the Factorization theorem for multiple parameters. In the
case of the above likelihood function note that:

$$ \prod_{i=1}^{n} y_i^{2\alpha - 1} = \left( \prod_{i=1}^{n} y_i \right)^{2\alpha - 1} $$

Thus, the data is represented by the two sufficient statistics:

$$ T_{s1} = \sum_{i=1}^{n} y_i^2 = n \cdot M_2' \quad \text{and} \quad T_{s2} = \prod_{i=1}^{n} y_i $$

The first of these is the usual summation (moment) statistic for the squares of the data values.
The second one is a product statistic.

In order to determine the mean square error of the estimator, it would be required to know the
variance of these two statistics. Appendix C and the main report discuss how the variance of
summation statistics can be calculated. Unfortunately, product statistics are much more
difficult to deal with.

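The maximum likelihood estimator of the two parameters can, however, be developed by the
usual method. Consider the log-likelihood function:

$$ \ln L(y; \alpha, \lambda) = n \left[ \ln 2 + \alpha \{ \ln \alpha - \ln \lambda \} - \ln \Gamma(\alpha) \right]
   + (2\alpha - 1) \sum_{i=1}^{n} \ln y_i - \left( \frac{\alpha}{\lambda} \right) \sum_{i=1}^{n} y_i^2 $$

Differentiating with respect to λ and equating to zero produces:

$$ -n \frac{\alpha}{\lambda} + \frac{\alpha}{\lambda^2} \sum_{i=1}^{n} y_i^2 = 0 \quad \text{or} \quad
   \hat{\lambda}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} y_i^2 = M_2' $$

Differentiating with respect to α and equating to zero results in:

$$ n \left[ \ln \alpha - \ln \lambda + 1 - \frac{d \ln \Gamma(\alpha)}{d\alpha} \right]
   + 2 \sum_{i=1}^{n} \ln y_i - \frac{1}{\lambda} \sum_{i=1}^{n} y_i^2 = 0 $$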
Substituting the expression for λ into the last term reduces this term to n. The derivative of
the natural logarithm of the gamma function is called the “psi” function or the digamma
function. This function does not have a simple closed form and is usually represented by
symbol Ψ( x ) .

This last equation can be solved numerically using a MATLAB function which implements
evaluation of the digamma function. The resulting equation for the solution of α is:

$$ \Psi(\alpha) - \ln \alpha = \frac{2}{n} \sum_{i=1}^{n} \ln y_i - \ln \left[ M_2' \right] $$

As a practical note, the equation solver may step to trial values α ≤ 0. Since α must always
be positive, such trial values need to be reset to some small positive value if this
occurs. Generally, we would expect parameter α ≥ 1.
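One way to set up this solution in MATLAB is sketched below. It assumes the built-in `psi` (digamma) and `fzero` functions and uses simulated Chi-k data as a stand-in for measured data. Passing a positive bracket to `fzero` is a design choice made here so that the solver never evaluates α ≤ 0; the bracket limits themselves are arbitrary assumptions.

```matlab
% MLE of the Nakagami parameters for a data vector y (simulated here).
k = 4;  a = 1.5;  n = 5000;                 % illustrative simulation settings
y = a * sqrt(sum(randn(n, k).^2, 2));       % stand-in data: Chi-k variates

M2 = mean(y.^2);                            % second sample moment
lambdaHat = M2;                             % MLE of lambda

% Right-hand side of the alpha equation; it is <= 0 by Jensen's inequality
rhs = (2/n) * sum(log(y)) - log(M2);

% psi(alpha) - log(alpha) increases from -Inf to 0, so a root exists
g = @(al) psi(al) - log(al) - rhs;
alphaHat = fzero(g, [1e-3, 1e3]);           % positive bracket keeps alpha > 0

kHat = round(2 * alphaHat);                 % nearest integer Chi-k shape
aHat = sqrt(M2 / kHat);                     % Chi-k scale using the rounded k
fprintf('alpha = %.4f, lambda = %.4f, k = %d, a = %.4f\n', ...
        alphaHat, lambdaHat, kHat, aHat);
```

For this simulated data the true values are α = 2 and λ = k a² = 9, so the printed estimates should be close to those, subject to sampling error.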

Solving the previous equation gives $\hat{\alpha}_{MLE}$. The analyst could use this, along with the
estimate $\hat{\lambda}_{MLE}$, in the Nakagami distribution of Equation (H5). The mean square error
for the estimate of λ is easily found since this involves only the data second moment. The mean
square error of the estimate of α does not have a simple expression.

The method of moments may be applied to the Nakagami distribution to obtain parameter
estimates in terms of summation statistics. The general expression for the p-th moment of this
distribution is found in terms of the gamma function as:

$$ m_{Nak}^{p} = \frac{\Gamma(\alpha + p/2)}{\Gamma(\alpha)} \left( \frac{\lambda}{\alpha} \right)^{p/2} $$

The second moment using this equation, noting that Γ(α + 1) = α Γ(α), provides:

$$ m_{Nak}^{2} = \frac{\lambda}{\alpha}\, \frac{\Gamma(\alpha + 1)}{\Gamma(\alpha)} = \lambda , \qquad \hat{\lambda}_{MOM} \cong M_2' $$

The MOM estimator for λ is the same as the MLE estimator. The first moment could be
used for the remaining parameter; however, for simplicity the so-called “inverse normalized
variance estimator” is generally used. This results by considering the fourth moment:

$$ m_{Nak}^{4} = \frac{\Gamma(\alpha + 2)}{\Gamma(\alpha)} \left( \frac{\lambda}{\alpha} \right)^{2} = \frac{\lambda^2}{\alpha} (1 + \alpha) $$
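On rearrangement and substitution:

$$ \hat{\alpha}_{MOM} = \frac{ \left( m_{Nak}^{2} \right)^2 }{ m_{Nak}^{4} - \left( m_{Nak}^{2} \right)^2 }
   \cong \frac{ \left( M_2' \right)^2 }{ M_4' - \left( M_2' \right)^2 } $$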
If required, the mean square error of these estimators can be derived using methods discussed
in Appendix C for functions of moment estimators.

If the data is required to have a Chi-k distribution, it seems reasonable to choose k as the
integer closest to (2 α). The estimate of parameter “aChi” would be updated
using this rounded value in the estimator equation: $\hat{a}_{Chi} = \sqrt{M_2' / k_{round}}$.
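The moment-based procedure requires only the second and fourth sample moments, so it needs no root finding. A minimal MATLAB sketch follows, again using simulated Chi-k data as a stand-in for measured data. The simulation settings are illustrative, and a large sample is used because the fourth sample moment is noisy.

```matlab
% Inverse normalized variance (MOM) estimate of the Nakagami shape parameter,
% followed by rounding to the nearest Chi-k distribution.
k = 6;  a = 1;  n = 20000;                  % illustrative simulation settings
y = a * sqrt(sum(randn(n, k).^2, 2));       % stand-in data: Chi-k variates

M2 = mean(y.^2);                            % second sample moment
M4 = mean(y.^4);                            % fourth sample moment

lambdaMOM = M2;                             % MOM estimate of lambda
alphaMOM  = M2^2 / (M4 - M2^2);             % inverse normalized variance estimator

kRound = round(2 * alphaMOM);               % nearest integer shape parameter
aChi   = sqrt(M2 / kRound);                 % updated Chi-k scale estimate
fprintf('alpha = %.4f, k = %d, a = %.4f\n', alphaMOM, kRound, aChi);
```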


Appendix I

Inverse Maxwell Distribution

The Inverse Maxwell (IM) distribution is related to the Maxwell distribution in the following
way. If random variable y is Maxwell distributed, random variable ( x =1 / y ) is inverse
Maxwell distributed.

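The distribution of x can be derived in the usual manner for single variable function
transformation:

$$ f_x(x) = \frac{1}{|J|}\, f_y(y), \qquad x = \frac{1}{y}, \qquad |J| = \left| \frac{dx}{dy} \right| = \frac{1}{y^2}, \qquad
   f_y(y) = \sqrt{\frac{2}{\pi}}\, \frac{y^2}{a^3} \exp\left( -\frac{y^2}{2a^2} \right) $$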
Substituting the transformed variables in the Maxwell density results in the inverse Maxwell
density function:

$$ f_x(x) = \sqrt{\frac{2}{\pi}} \left( \frac{1}{a^3 x^4} \right) \exp\left( \frac{-1}{2a^2x^2} \right) $$

Equation (I-1)



The distribution function is obtained from direct integration of this equation:

$$ F_x(x) = 1 - \operatorname{erf}\left( \frac{1}{\sqrt{2}\, a\, x} \right)
   + \sqrt{\frac{2}{\pi}} \left( \frac{1}{a\, x} \right) \exp\left( \frac{-1}{2a^2x^2} \right) $$

Equation (I-2)
The symbol “erf ” denotes the error function as before. The IM distribution involves only a
“scale” parameter, which leads to some simplification, as noted previously when the Maxwell
distribution was discussed.
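A quick consistency check between Equations (I-1) and (I-2) is to differentiate the distribution function numerically and compare the result with the density. The MATLAB sketch below does this with a central difference. The parameter value, test points and step size are illustrative assumptions.

```matlab
% Check that d/dx of the IM distribution function (I-2) reproduces the density (I-1).
a = 1.3;                                   % scale parameter (illustrative)

% Inverse Maxwell density, Equation (I-1)
fIM = @(x) sqrt(2/pi) ./ (a^3 * x.^4) .* exp(-1 ./ (2*a^2*x.^2));

% Inverse Maxwell distribution function, Equation (I-2)
FIM = @(x) 1 - erf(1 ./ (sqrt(2)*a*x)) ...
         + sqrt(2/pi) ./ (a*x) .* exp(-1 ./ (2*a^2*x.^2));

xs = [0.3 0.8 2.0];                        % test points (illustrative)
h  = 1e-5;                                 % central-difference step
dF = (FIM(xs + h) - FIM(xs - h)) / (2*h);  % numerical derivative of F

maxdiff = max(abs(dF - fIM(xs)));
fprintf('max |dF/dx - f| = %g\n', maxdiff);
```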

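The density function is of indeterminate form as x → 0. This limit can be evaluated by
rewriting the expression and using L’Hôpital’s rule twice.

Let $z \equiv 1/x$; then $z \to \infty$ as $x \to 0$, and

$$ f_x(0) = \lim_{z \to \infty} \sqrt{\frac{2}{\pi}}\, \frac{1}{a^3}\, \frac{z^4}{\exp\left( z^2 / 2a^2 \right)}
   = \sqrt{\frac{2}{\pi}}\, \frac{1}{a} \lim_{z \to \infty} \frac{4 z^2}{\exp\left( z^2 / 2a^2 \right)}
   = \sqrt{\frac{2}{\pi}} \lim_{z \to \infty} \frac{8a}{\exp\left( z^2 / 2a^2 \right)} = 0 $$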

A comparison of the IM and the Maxwell distributions is shown in Figure I1 for a = 1.
Applications for the IM distribution are not readily available, but the shape of the density
function suggests that it may be useful when the data are highly skewed, which often occurs
when considering extreme value events.


[Figure I1: two panels for a = 1, each showing an Inverse Maxwell curve and a Maxwell curve.
Left panel: “Maxwell and Inverse Maxwell Densities a = 1” (vertical axis: Probability Density, 0 to 1.8).
Right panel: “Maxwell and Inverse Maxwell Distributions a = 1” (vertical axis: Probability
Distribution, 0 to 1). Horizontal axis in both panels: y, 0 to 4.]

Figure I1: Maxwell and Inverse Maxwell Probability Functions (a = 1)

Basic Properties of the Inverse Maxwell Distribution


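The mean and variance of the IM distribution are easily obtained by taking the required
expectation operations:

$$ \mu_1' = E[x] = \left[ \frac{1}{a} \sqrt{\frac{2}{\pi}} \exp\left( \frac{-1}{2a^2x^2} \right) \right]_0^\infty = \frac{1}{a} \sqrt{\frac{2}{\pi}} $$

$$ \mu_2' = E[x^2] = \left[ -\frac{1}{a^2} \operatorname{erf}\left( \frac{1}{\sqrt{2}\, a\, x} \right) \right]_0^\infty = \frac{1}{a^2} $$

$$ \sigma^2 = \mu_2' - \left( \mu_1' \right)^2 = \frac{1}{a^2} \left( \frac{\pi - 2}{\pi} \right) $$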

The “inverse” effect of the value of the parameter a can be seen from the respective variances
and means. As “a” increases in the Maxwell distribution, the mean and variance increase;
whereas, the IM distribution mean and variance decrease.

Continuing with the third moment of the distribution:

$$ \mu_3' = E[x^3] = \left[ -\frac{1}{\sqrt{2\pi}\, a^3}\, \mathrm{Ei}\left( \frac{-1}{2a^2x^2} \right) \right]_0^\infty $$

where Ei(•) is the exponential integral. At the upper limit (i.e., x → ∞), the argument
approaches zero, where the exponential integral diverges. Thus, the third moment does not exist.
Proceeding in the same fashion, it can be shown that only the first and second moments exist.
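The same conclusion can be reached for a general order p by writing the moment of x as a negative-order moment of the Maxwell variable y = 1/x. The following is a short worked sketch of that argument:

```latex
E\left[x^{p}\right] = E\left[y^{-p}\right]
  = \sqrt{\frac{2}{\pi}}\,\frac{1}{a^{3}} \int_{0}^{\infty} y^{\,2-p} \exp\left(-\frac{y^{2}}{2a^{2}}\right) dy
  = \frac{2^{\,1-p/2}}{\sqrt{\pi}}\; \Gamma\!\left(\frac{3-p}{2}\right) a^{-p},
  \qquad p < 3
```

The integral converges at y → 0 only when 2 − p > −1, that is, for p < 3. Setting p = 1 and p = 2 reproduces the mean and second moment given above.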

The mode of the inverse Maxwell distribution is found by setting the derivative of the density
function to zero:

$$ \frac{d f_x}{dx} = 0 = \frac{-4}{x^5} + \frac{1}{a^2 x^7} \quad \text{or} \quad x_{mo} = \frac{1}{2a} $$

CRLB and Parameter Estimators


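The CRLB for this distribution can be calculated in a similar manner to that for the Maxwell
distribution. Recall:

$$ CRLB = \frac{1}{ n \cdot E\left[ \left( \dfrac{\partial \ln f_X(x)}{\partial a} \right)^2 \right] } $$

$$ \ln f_X(x) = \frac{1}{2} \ln\left( \frac{2}{\pi} \right) - 3 \ln a - 4 \ln x - \frac{1}{2a^2x^2}
   \quad \text{and} \quad \frac{\partial \ln f_X(x)}{\partial a} = -\frac{3}{a} + \frac{1}{a^3 x^2} $$

$$ E\left[ \left( \frac{\partial \ln f_X(x)}{\partial a} \right)^2 \right]
   = \sqrt{\frac{2}{\pi}} \int_0^\infty \left( \frac{9}{a^5 x^4} - \frac{6}{a^7 x^6} + \frac{1}{a^9 x^8} \right) \exp\left( \frac{-1}{2a^2x^2} \right) dx
   = \sqrt{\frac{2}{\pi}} \left[ I_1 + I_2 + I_3 \right] $$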
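The first integral is (invoking L’Hôpital’s rule for evaluation at the limits):

$$ I_1 = \frac{9}{a^5} \left[ \frac{a^2}{x} \exp\left( \frac{-1}{2a^2x^2} \right)
   - a^3 \sqrt{\frac{\pi}{2}}\, \operatorname{erf}\left( \frac{1}{\sqrt{2}\, a\, x} \right) \right]_0^\infty
   = \sqrt{\frac{\pi}{2}}\, \frac{9}{a^2} $$

Likewise:

$$ I_2 = \sqrt{\frac{\pi}{2}}\, \frac{-18}{a^2} \qquad I_3 = \sqrt{\frac{\pi}{2}}\, \frac{15}{a^2} $$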
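Collecting the three integrals gives the expected squared score directly; the intermediate step is:

```latex
E\left[\left(\frac{\partial \ln f_X(x)}{\partial a}\right)^{2}\right]
  = \sqrt{\frac{2}{\pi}}\;\sqrt{\frac{\pi}{2}}\;\frac{9 - 18 + 15}{a^{2}}
  = \frac{6}{a^{2}}
```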
The CRLB for the inverse Maxwell distribution is:

$$ CRLB = \frac{a^2}{6n} $$

Equation (I-3)

This value is the same as for the Maxwell distribution.

The maximum likelihood estimator for the IM density is found from the log-likelihood:

$$ \ln L(x; a) = \frac{n}{2} \ln\left( \frac{2}{\pi} \right) - 3 n \cdot \ln a - 4 \sum_{i=1}^{n} \ln x_i - \frac{1}{2a^2} \sum_{i=1}^{n} \frac{1}{x_i^2} $$

Differentiating with respect to parameter “a” and equating to zero results in:

$$ \frac{-3n}{a} + \frac{1}{a^3} \sum_{i=1}^{n} \frac{1}{x_i^2} = 0 \quad \text{or} \quad
   \hat{a}_{MLE} = \sqrt{ \frac{1}{3n} \sum_{i=1}^{n} \frac{1}{x_i^2} } $$

For this distribution, the statistic $T_s = \sum_{i=1}^{n} 1/x_i^2$ is seen to be sufficient. It was
proved that for the Maxwell distribution, this estimator is efficient for any sample size “n”.
The method there relied on the fact that the MLE estimator was a function of the second sample
moment of the data. For the inverse Maxwell, this is not the case. The statistic cannot be
expressed in terms of any sample moment.

Although an exact expression for the MSE is not available, likelihood theory shows that
MLE’s are asymptotically efficient. Thus, for large enough sample sizes, the CRLB can be
used as an estimate of the mean square error of this estimator.

Moment Estimator

It is tempting to use the method of moments to estimate the parameter “a”. It would seem
that the second data moment would provide the estimate:

$$ \hat{a}_{MOM} = \frac{1}{\sqrt{M_2'}} \quad \text{and} \quad
   MSE(\hat{a}_{MOM}) = \left( \frac{\partial \hat{a}}{\partial M_2'} \right)^2 \frac{1}{n} \left[ \mu_4' - \left( \mu_2' \right)^2 \right] $$

The MSE of this estimator is infinite because the fourth moment of the density is not defined.

The first moment could be used to derive an estimator. In this case:

$$ \hat{a}_{MOM} = \sqrt{\frac{2}{\pi}}\, \frac{1}{M_1'} \quad \text{and} \quad
   MSE(\hat{a}_{MOM}) = \left( \frac{\partial \hat{a}}{\partial M_1'} \right)^2 \frac{1}{n} \left[ \mu_2' - \left( \mu_1' \right)^2 \right] $$

The MSE does exist for this estimator. Carrying out the evaluation results in:

$$ MSE(\hat{a}_{MOM}) = \frac{2}{\pi}\, \frac{1}{\left( M_1' \right)^4}\, \frac{1}{n}\, \frac{1}{a^2} \left( 1 - \frac{2}{\pi} \right) \approx \frac{a^2}{1.7519\, n} $$

The efficiency of the first moment estimator can now be found:

$$ eff_{MOM} = \frac{CRLB}{MSE(\hat{a}_{MOM})} = \frac{\left( a^2 / 6n \right)}{\left( a^2 / 1.7519\, n \right)} \approx 0.292 $$
This is a very low efficiency estimator. The MLE should be used when estimating the
parameter of an inverse Maxwell distribution.
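The comparison between the two estimators can be illustrated by simulation. The MATLAB sketch below generates inverse Maxwell samples as reciprocals of Maxwell variates, applies the MLE and the first-moment estimator to each sample, and compares the empirical mean square errors with the CRLB of Equation (I-3). The parameter value, sample size and trial count are illustrative assumptions. Because the IM distribution has a heavy upper tail, the first-moment results fluctuate noticeably from run to run.

```matlab
% Monte Carlo comparison of the MLE and first-moment estimators for the IM parameter.
a = 2;  n = 300;  ntrial = 2000;           % illustrative settings

aMLE = zeros(ntrial, 1);
aMOM = zeros(ntrial, 1);
for j = 1:ntrial
    z = a * randn(n, 3);                   % normal components with std dev a
    x = 1 ./ sqrt(sum(z.^2, 2));           % inverse Maxwell variates
    aMLE(j) = sqrt(sum(1 ./ x.^2) / (3*n));% maximum likelihood estimate
    aMOM(j) = sqrt(2/pi) / mean(x);        % first-moment estimate
end

crlb   = a^2 / (6*n);                      % Equation (I-3)
mseMLE = mean((aMLE - a).^2);
mseMOM = mean((aMOM - a).^2);
effMOM = crlb / mseMOM;                    % empirical efficiency of the MOM estimator
fprintf('MSE_MLE/CRLB = %.3f, eff_MOM = %.3f\n', mseMLE/crlb, effMOM);
```

The first printed ratio should be near one and the second near the efficiency derived above, to within sampling error.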

Quantile estimators depend only on the distribution and the relation between quantiles and the
estimator. As such, these estimators could be derived if required. Further, Bayesian parameter
estimation could be done if desired.

Inverse Maxwell Variate Generation

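When a large number of IM distributed random variates are required, it is faster to use the
inverse relationship than to invert the distribution function:

$$ X_{IM} = \frac{1}{\sqrt{Z_1^2 + Z_2^2 + Z_3^2}} $$

where the $Z_i$, i = 1, 2, 3, are independent normal random variables with zero mean and
standard deviation “a”.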


In some applications, inversion of the IM distribution function is required. Because the
distribution function involves only a scale parameter, it can be expressed in standard form by
defining $t \equiv \sqrt{2}\, a\, x$:

$$ F_x(t) = 1 - \operatorname{erf}\left( \frac{1}{t} \right) + \frac{2}{\sqrt{\pi}} \left( \frac{1}{t} \right) \exp\left( \frac{-1}{t^2} \right) $$

A general inversion algorithm for the scaled variates can be developed from this. Random
variates from specific distributions are calculated from the scaled values.
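A minimal MATLAB sketch of such an inversion is given below, in the same spirit as the `fzero`-based solution used for the Maxwell probability plot. The bracket limits and the probabilities are illustrative assumptions. The scaled roots are converted to variates for a specific parameter value through x = t/(√2 a).

```matlab
% Inversion of the standardized inverse Maxwell distribution function.
a = 2;                                     % scale parameter of the target distribution
q = [0.05 0.5 0.95];                       % probabilities to invert (illustrative)

% Standardized distribution function in t = sqrt(2)*a*x
Fstd = @(t) 1 - erf(1./t) + (2/sqrt(pi)) * (1./t) .* exp(-1./t.^2);

tq = zeros(size(q));
for i = 1:numel(q)
    % Fstd rises from 0 to 1 over the bracket, so a sign change is guaranteed
    tq(i) = fzero(@(t) Fstd(t) - q(i), [1e-3, 1e3]);
end

xq = tq / (sqrt(2) * a);                   % variates for the specific parameter a
disp([q' tq' xq']);
```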

The ability to make the distribution function non-dimensional allows for development of
goodness-of-fit tests such as the Anderson-Darling or correlation type if desired. Also, tests
for outlier rejection can be developed using the same likelihood ratio simulation techniques as
applied for the Maxwell distribution.

Probability plotting paper can be developed easily for this distribution, using the same
procedure as applied for the Maxwell distribution.


References
1. Hogg, R. V., J. W. McKean and A. T. Craig, “Introduction to Mathematical Statistics,” 6th
Edition, Prentice-Hall, 2005.
2. Casella, G., and R. Berger, “Statistical Inference,” 2nd Edition, Duxbury Publishing,
2002.
3. Bury, K. V., “Statistical Models in Applied Science,” John Wiley and Sons, 1975.
4. Papoulis, A., and U. Pillai, “Probability, Random Variables and Stochastic Processes,”
4th Edition, McGraw-Hill, 2002.
5. Gibbons, J., and S. Chakraborti, “Nonparametric Statistical Inference,” 3rd Edition,
Dekker Publishing, 1992.
6. Brookner, E., “Tracking and Kalman Filtering Made Easy,” John Wiley and Sons, 1998.
7. Press, W., et al., “Numerical Recipes,” Cambridge University Press, 1986.
8. Wackerly, D., W. Mendenhall and R. Scheaffer, “Mathematical Statistics with
Applications,” 5th Edition, Duxbury Press, 1996.
9. Parzen, E., “Modern Probability and Its Applications,” John Wiley and Sons, 1960.
10. Conover, W., “Practical Nonparametric Statistics,” 2nd Edition, John Wiley and Sons,
1980.
11. Freund, J. E., “Mathematical Statistics,” 2nd Edition, Prentice-Hall, 1971.
12. D’Agostino, R., and M. A. Stephens, “Goodness-of-Fit Techniques,” Dekker
Publishing, 1986.
13. Anderson, T. W., and D. A. Darling, “A Test of Goodness of Fit,” J. American
Statistical Association, Vol. 49, No. 268, December 1954.
14. Schmeiser, B. W., “Some Myths and Common Errors in Simulation Experiments,”
Proceedings of the 2001 Winter Simulation Conference.
15. Barnett, V., and T. Lewis, “Outliers in Statistical Data,” 3rd Edition, John Wiley and
Sons, 1994.
16. Abramowitz, M., and I. Stegun, “Handbook of Mathematical Functions,” Applied
Mathematics Series 55, National Bureau of Standards, U.S. Government Printing
Office, 1964.
