Hugo Hernandez
ForsChem Research, 050030 Medellin, Colombia
hugo.hernandez@forschem.org
doi: 10.13140/RG.2.2.30177.35686
Abstract
Perhaps the best practice for predicting the outcome of any observable process is constructing
robust mathematical models considering the most relevant factors. However, noise and
randomness, caused by the remaining factors not included in the model, will always be present.
The behavior of random and, in a more general sense, randomistic variables can be
mathematically described by the probability density function (PDF). It is therefore desirable to
obtain the PDF of a measured variable after a finite sample of data has been obtained. The
identification of density functions fitting the data sample is denoted here as the reconstruction
of the PDF. Such reconstruction is an inverse problem, since many different PDFs can
satisfactorily describe the sample obtained. Furthermore, sampling always incorporates an
inherent error, given that the behavior of the sample may differ from the behavior of the
population, especially for small samples. Thus, reconstructing PDFs is quite a challenging task.
Different reconstruction methods, based either on the sample cumulative probability
distribution or on the sample moments, are described, and their performance is evaluated
considering six different sets of data. These test examples are samples obtained from
populations with known probability distributions, allowing assessment of the predictive
capability of the reconstruction methods. If the type of distribution is known a priori,
parametric reconstruction methods are found to be the best alternative. However, for
unknown distributions, polynomial reconstruction methods provided good approximations in
all cases considered. A selection of algorithms (in the R language) used in the present work is
included in the Appendix.
Keywords
1. Introduction
Randomistic variables are variables that can be measured several times under identical
conditions.[1] By repeating the measurement of a variable, a particular set of measured values
is obtained. Such set is called a sample, and the procedure used to obtain the set is called
sampling. If the set contains all measurements possible, it is denoted as a population, and the
procedure for obtaining the set is the census. Since some values may be repeated, their relative
occurrence frequencies in the population describe the corresponding probability distribution
of the randomistic variable.
There are, however, two main difficulties with the definition of the probability distribution of a
population:
1) When the number of different possible outcomes is large, the magnitude of the relative
frequency of each outcome value tends to be negligible. This is particularly the case for
continuous variables.
2) When the number of different possible measurements is large, limitless§ or infinite,
then a census is practically impossible. In this case, the probability distribution of the
variable cannot be known with absolute certainty.
It is possible to overcome the first issue by defining the probability density function $f(x)$ of a
randomistic variable $X$, as:

$$f(x) = \lim_{\Delta x \to 0} \frac{P\left(x - \frac{\Delta x}{2} \leq X \leq x + \frac{\Delta x}{2}\right)}{\Delta x}$$
(1.2)
§
Let us consider flipping a coin, for example. The total number of times that anyone can flip (or
could have flipped) a coin is practically limitless. If no one is flipping a coin right now, then the
set of results is already incomplete and cannot be considered a population.
Since

$$\lim_{\Delta x \to 0} P\left(x - \frac{\Delta x}{2} \leq X \leq x + \frac{\Delta x}{2}\right) = 0$$
(1.3)

it can be concluded that the relative frequency of occurrence of a certain value $x$ is
proportional to the non-negligible probability density $f(x)$:

$$P\left(x - \frac{\Delta x}{2} \leq X \leq x + \frac{\Delta x}{2}\right) \approx f(x)\,\Delta x$$
(1.4)
From Eq. (1.1) to (1.4) it can be concluded that:

$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(u)\,du$$
(1.5)
On the other hand, the term in the limit presented in Eq. (1.2) can be considered as a finite
probability density [1] about $x$ with step $\Delta x$:

$$f_{\Delta x}(x) = \frac{P\left(x - \frac{\Delta x}{2} \leq X \leq x + \frac{\Delta x}{2}\right)}{\Delta x}$$
(1.6)

and therefore:

$$f(x) = \lim_{\Delta x \to 0} f_{\Delta x}(x)$$
(1.7)

$$F(x) \approx \sum_{x_i \leq x} f_{\Delta x}(x_i)\,\Delta x$$
(1.8)
The second issue mentioned earlier, regarding the impossibility of determining the probability
distributions of most populations with absolute certainty, is solved by means of estimation.
Thus, the probability distribution is estimated from the data available in a sample of the
population of interest. The estimation procedure will be denoted as a reconstruction of the
probability distribution. The reconstruction of the probability distribution can be considered as
the inverse of sampling. In fact, probability distribution reconstruction is an inverse problem,[2]
where measurements are used to infer the values of parameters that characterize the system.
As with any other inverse problem, the reconstruction of the probability distribution will not
lead to a unique answer. Furthermore, it is possible that the answer obtained is not the correct
one. Thus, different reconstruction methods will be compared in order to assess their accuracy.
In this report, we focus particularly on the reconstruction of probability density functions of
continuous variables. This case is selected because it is more sensitive to the particular set of
values obtained in the sample, and is therefore more challenging.
2. Reconstruction Methods

2.1. Methods Based on the Cumulative Probability Distribution of the Sample

Eq. (1.2) shows the direct relationship between the cumulative probability distribution and the
probability density function. Methods of this type rely on this relationship to determine the
probability density function of a population from the cumulative probability distribution of a
sample.
2.1.1. Derivative of the Cumulative Probability Function

This is perhaps the most direct use of Eq. (1.2). However, it requires a reliable function
describing the cumulative probability distribution of the population. Since such a function is not
available, an estimated differentiable cumulative probability function must be obtained from
the data sample. The main difficulty of this method is that the data sample is discrete (the
number of different outcomes is finite) and the sample size is limited. Thus, the cumulative
probability function should be obtained by regression, or by determining interpolation
polynomials from the data in the sample.
Let the different outcomes observed in the sample be ranked in ascending order:

$$x_1 < x_2 < \cdots < x_m$$
(2.1)

The observed probability of occurrence for each outcome obtained in the sample is:

$$\tilde{P}(x_i) = \frac{n_i}{n}$$
(2.2)

and therefore, the observed cumulative probability of the outcomes in the sample is:

$$\tilde{F}(x_i) = \sum_{j \leq i} \tilde{P}(x_j) = \frac{1}{n}\sum_{j \leq i} n_j$$
(2.3)

where $\tilde{X}$ is a randomistic variable representing any measurement in the sample, $n_i$ is
the number of occurrences of outcome $x_i$, and $n$ is the sample size.
Assuming a non-biased sampling procedure, it is expected that the cumulative probability for
the corresponding population, evaluated at $x_i$, be bounded by:

$$\tilde{F}(x_{i-1}) \leq F(x_i) \leq \tilde{F}(x_i)$$
(2.4)

By choosing the center of the interval as an estimate of the cumulative probability, a set of
cumulative probability estimates $\hat{F}(x_i)$ for each measurement outcome will be obtained,
where:

$$\hat{F}(x_i) = \begin{cases} \dfrac{\tilde{F}(x_1)}{2}, & i = 1 \\[2mm] \dfrac{\tilde{F}(x_{i-1}) + \tilde{F}(x_i)}{2}, & i > 1 \end{cases}$$
(2.5)
The set of cumulative probability estimates $\hat{F}(x_i)$ vs. measurements $x_i$ is then used
for fitting a curve using any suitable method (for example those presented in [3], including
polynomial interpolation, cubic splines, moving least squares, etc.). Thus, the cumulative
probability estimates are approximated by an arbitrary function $G$:

$$\hat{F}(x) \approx G(x)$$
(2.6)

and therefore,

$$\hat{f}(x) = \frac{dG(x)}{dx}$$
(2.7)
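As an illustration of Eqs. (2.5) to (2.7), the following sketch builds midpoint cumulative probability estimates and differentiates a polynomial CDF model fitted by regression. It is written in Python (the algorithms in the Appendix are in R), assumes all outcomes in the sample are distinct, and the function name is ours:

```python
import numpy as np

def cdf_derivative_pdf(sample, degree=3):
    # Rank the sample and build midpoint cumulative probability
    # estimates (Eq. 2.5), assuming distinct outcomes.
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    F_obs = np.arange(1, n + 1) / n   # observed cumulative probabilities
    F_hat = F_obs - 0.5 / n           # center of each probability interval
    # Fit a differentiable CDF model G(x) by polynomial regression (Eq. 2.6)
    G = np.polyfit(x, F_hat, degree)
    dG = np.polyder(G)                # f_hat(x) = dG(x)/dx (Eq. 2.7)
    return lambda xq: np.polyval(dG, xq)

# Roughly uniform sample: a linear CDF model yields a near-constant density
sample = [0.95, 0.61, 0.31, 0.85, 0.71, 0.04, 0.83, 0.26, 0.51, 0.67,
          0.27, 0.75, 0.14, 0.01, 0.58, 0.78, 0.05, 0.32, 0.98, 0.30]
f_hat = cdf_derivative_pdf(sample, degree=1)
```

Higher polynomial degrees make the fitted CDF more flexible but, as shown later in Table 11, do not necessarily improve accuracy.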
2.1.2. Naïve Estimator

This method is based on the definition previously presented in Eq. (1.6). The assumption is that:

$$\hat{f}(x) = f_{\Delta x}(x) \approx \frac{\tilde{F}\left(x + \frac{\Delta x}{2}\right) - \tilde{F}\left(x - \frac{\Delta x}{2}\right)}{\Delta x}$$
(2.8)

where $\hat{f}$ represents an estimate of the probability density function. Thus, by selecting a value
of $\Delta x$, it is possible to obtain a set of estimates $\hat{f}$ for different values of $x$, which can then be
used to fit a curve using any suitable method. Clearly, the choice of $\Delta x$ will have a significant
effect on the results obtained. Although smaller values of $\Delta x$ are desired for a better
estimation of $f$, for small data samples this leads to highly noisy results. Smoother functions
can then be obtained using larger values of $\Delta x$, at the risk of reducing the accuracy of the
estimation. This method is also known as the Naïve Estimator.[4] An algorithm programmed in
R for obtaining Naïve estimators from data samples is presented in Appendix A.7.
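The finite probability density of Eq. (2.8) can be sketched as follows (a Python counterpart of the R algorithm in Appendix A.7; illustrative only, with function names of our own):

```python
import numpy as np

def naive_estimator(sample, dx):
    # Naive estimator (Eq. 2.8): fraction of observations falling in a
    # window of width dx centred at x, divided by dx.
    x = np.asarray(sample, dtype=float)
    n = len(x)
    def f_hat(xq):
        return np.sum(np.abs(x - xq) <= dx / 2) / (n * dx)
    return f_hat

sample = [0.95, 0.61, 0.31, 0.85, 0.71, 0.04, 0.83, 0.26, 0.51, 0.67]
f_small = naive_estimator(sample, dx=0.1)   # local but noisy
f_large = naive_estimator(sample, dx=0.5)   # smoother, possibly less accurate
```

Evaluating both estimators on a grid of $x$ values makes the smoothness/accuracy trade-off controlled by $\Delta x$ directly visible.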
2.1.3. Parametric Estimation based on Standard Random Variables

This method assumes that the variable $X$ can be expressed as a function of an arbitrary
standard random variable $S$, as follows:[5]

$$X = a + b\,S$$
(2.9)

where $a$ and $b$ are transformation parameters. There are three basic types of standard random
variables, summarized in Table 1.
Many different standard random variables can be defined, either non-parametric standard
random variables (for example, Gaussian, Uniform, Exponential, etc.) or parametric standard
random variables (for example, Student’s t, Fisher’s F, Gamma, Weibull, etc.). Each of these
standard random variables has pre-defined probability density $f_S$ and cumulative probability
$F_S$ functions.** Thus, the parameters $a$ and $b$, and additional standard random parameters
$p$ (if any), are estimated by fitting the cumulative probability estimates $\hat{F}$ to the pre-defined
cumulative probability function of the selected standard random variable.
**
See Table 2 in [6] for some examples.
Table 1. Types and Properties of Standard Random Variables (taken from [6]). For each type (I,
II and III) the original table lists the defining properties of the standard density $f_S$, and the
corresponding bounds of the standard variable $S$ and of the transformed variable $X$; see [6]
for the complete table.
Thus,

$$\hat{F}(x) = F_S\!\left(\frac{x - \hat{a}}{\hat{b}}\right)$$
(2.10)

and

$$\hat{f}(x) = \frac{1}{\hat{b}}\,f_S\!\left(\frac{x - \hat{a}}{\hat{b}}\right)$$
(2.11)

where $\hat{a}$, $\hat{b}$ and $\hat{p}$ are the estimated parameters obtained by any suitable optimization
procedure. Furthermore, the best standard reference function can also be optimized for a
particular data sample.[6]
2.1.4. Kernel Density Estimation

The kernel density approach assumes that the cumulative probability function can be described
as the average cumulative probability of different kernels (usually one kernel representing
each measurement in the sample). The kernels are predefined functions, usually but not
necessarily symmetric and unimodal.[7] Gaussian and Epanechnikov [8] (quadratic) kernels are
normally preferred.

$$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{K}\!\left(\frac{x - x_i}{h}\right), \qquad \hat{f}(x) = \frac{1}{n h}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$
(2.12)

where $n$ is the sample size, $K$ represents the kernel function (with cumulative function
$\mathcal{K}$), and $h$ is the smoothing or bandwidth†† coefficient. The bandwidth also determines the
smoothness and accuracy of the estimator, although in opposite directions. Thus, the use of
small values of $h$ improves accuracy but also increases noise. Unfortunately, the selection of
the bandwidth is subjective. However, some rules of thumb are available for specific kernel
functions. For example, for Gaussian kernels the ideal bandwidth is:[4]

$$h = \left(\frac{4}{3n}\right)^{1/5} \hat{\sigma}$$
(2.13)

where $\hat{\sigma}$ is the sample standard deviation.

Kernels are themselves probability density functions. In particular, the Gaussian kernel is
represented by the standard normal probability density function:

$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}$$
(2.14)

On the other hand, Epanechnikov’s optimum kernel is expressed as:[8]

$$K(u) = \begin{cases} \dfrac{3}{4\sqrt{5}}\left(1 - \dfrac{u^2}{5}\right), & |u| \leq \sqrt{5} \\[2mm] 0, & |u| > \sqrt{5} \end{cases}$$
(2.15)
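A minimal Python sketch of Eqs. (2.12) to (2.14), using the rule-of-thumb bandwidth of Eq. (2.13) for Gaussian kernels (the function name is ours; evaluation is pointwise):

```python
import numpy as np

def gaussian_kde(sample):
    # Kernel density estimate with Gaussian kernels (Eq. 2.12) and the
    # rule-of-thumb bandwidth h = (4/(3n))^(1/5) * sigma_hat (Eq. 2.13).
    x = np.asarray(sample, dtype=float)
    n = len(x)
    sigma_hat = np.std(x, ddof=1)
    h = (4.0 / (3.0 * n)) ** 0.2 * sigma_hat
    def f_hat(xq):  # density estimate at a single point xq
        u = (xq - x) / h
        return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)) / h
    return f_hat, h

# Sample loosely resembling standard normal data
sample = [-1.24, 0.45, -0.39, -1.83, 1.34, -0.65, -0.41, 0.68, -1.57, 0.69]
f_hat, h = gaussian_kde(sample)
```

Since each kernel integrates to one, the estimate itself integrates to one regardless of the bandwidth; only its smoothness changes with $h$.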
†† The bandwidth of the kernel is analogous to the bin width in a histogram.

2.2. Methods Based on the Moments of the Sample

As an alternative to the use of the cumulative probability distribution of the data sample, it is
also possible to estimate the probability density function of the population from the particular
distribution moments observed in the sample. These methods are useful, for example, in
randomistic optimization.[9] In this case, the moments of random variables are the decision
variables, and their values are obtained after solving the optimization problem. Thus, it is
necessary to reconstruct the distribution from the moments in order to obtain a complete
description of the randomistic variable of interest. The $k$-th moment of the distribution is
defined as:

$$m_k(X) = E(X^k) = \int_{-\infty}^{\infty} x^k f(x)\,dx$$
(2.16)

where $m_k(\cdot)$ is the $k$-th moment operator and $E(\cdot)$ is the expected value operator.
On the other hand, the moments observed in a sample $\tilde{X}$ of size $n$ can be determined as:

$$m_k(\tilde{X}) = \frac{1}{n}\sum_{i=1}^{n} x_i^k$$
(2.17)

Thus, the basic concept behind this type of methods is that the moments of the sample can be
considered as estimates of the moments of the probability distribution of the population, and
therefore:

$$\hat{m}_k(X) = m_k(\tilde{X})$$
(2.18)

$$\int_{-\infty}^{\infty} x^k \hat{f}(x)\,dx = \frac{1}{n}\sum_{i=1}^{n} x_i^k$$
(2.19)

The problem consists in obtaining an estimate of the probability density function $\hat{f}(x)$ from
the available measurements $x_i$. This inverse problem can be solved using different
approaches.
2.2.1. Parametric Estimation based on Standard Random Variables

This method is analogous to the method presented in Section 2.1.3 for parametric estimation
based on standard random variables. Similarly, the transformation presented in Eq. (2.9) is
used, along with a predefined standard random probability density function $f_S$. The
difference lies in that the parameters are estimated from the sample moments and not
from the cumulative probability of the sample. At least one distribution moment (different
from the zero-th moment‡‡) must be used for each parameter to be identified. By using a larger
number of moments, better estimates of the parameters can be obtained and the assumed
standard distribution can be validated.

$$\int_{-\infty}^{\infty} x^k\,\frac{1}{\hat{b}}\,f_S\!\left(\frac{x - \hat{a}}{\hat{b}}\right) dx = \frac{1}{n}\sum_{i=1}^{n} x_i^k$$
(2.20)

Please notice that, in principle, $k$ can be any non-zero value (integer or not), as long as the
integral on the left-hand side of the equation has a convergent solution and the sum on the
right-hand side of the equation is not indeterminate. That is why only positive values of $k$ are
preferred. The term $\hat{b}$ dividing inside the integral appears by applying the change of variable
theorem.[10] For positive integer $k$, using the change of variable $s = (x - \hat{a})/\hat{b}$ and the
binomial expansion:

$$\int_{-\infty}^{\infty} (\hat{a} + \hat{b}\,s)^k f_S(s)\,ds = \sum_{j=0}^{k} \binom{k}{j}\,\hat{a}^{k-j}\,\hat{b}^{j} \int_{-\infty}^{\infty} s^j f_S(s)\,ds = \sum_{j=0}^{k} \binom{k}{j}\,\hat{a}^{k-j}\,\hat{b}^{j}\, m_j(S(\hat{p}))$$
(2.21)

where $s$ denotes any realization of the standard random variable $S(\hat{p})$.
Therefore,

$$m_k(\tilde{X}) = \sum_{j=0}^{k} \binom{k}{j}\,\hat{a}^{k-j}\,\hat{b}^{j}\, m_j(S(\hat{p}))$$
(2.22)

Eq. (2.22) represents a set of nonlinear algebraic equations where $\hat{a}$, $\hat{b}$ and $\hat{p}$ are the unknowns.
If the number of parameters is the same as the number of different moments considered
(other than the zero-th moment), a single solution is obtained. If more moments than
parameters are used, a cost function is defined (e.g. the sum of squared differences between
observed and predicted moments) and minimized. The estimated density is then:

$$\hat{f}(x) = \frac{1}{\hat{b}}\,f_S\!\left(\frac{x - \hat{a}}{\hat{b}}\right)$$
(2.23)

‡‡ The zero-th moment cannot be used because it yields a trivial, useless result.
In particular, by using the transformation (2.9) with a Type I standard random variable $S$
(with zero mean and unit variance), considering only the first and second moments it is found
that:

$$\int_{-\infty}^{\infty} x\,\frac{1}{\hat{b}}\,f_S\!\left(\frac{x - \hat{a}}{\hat{b}}\right) dx = \hat{a} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
(2.24)

$$\int_{-\infty}^{\infty} x^2\,\frac{1}{\hat{b}}\,f_S\!\left(\frac{x - \hat{a}}{\hat{b}}\right) dx = \hat{a}^2 + \hat{b}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2$$
(2.25)

which can be expressed as:

$$\hat{b} = \sqrt{\left(\frac{1}{n}\sum_{i=1}^{n} x_i^2\right) - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2}$$
(2.26)
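For example, assuming a uniform distribution on $[a, b]$, matching the first two sample moments in the spirit of Eqs. (2.24) to (2.26) yields closed-form parameter estimates. A Python sketch (function name ours; the paper's code is in R):

```python
import numpy as np

def uniform_from_moments(sample):
    # Moment-based parametric reconstruction assuming a uniform density
    # on [a, b]: mean (a+b)/2 and variance (b-a)^2/12 are matched to the
    # first two sample moments.
    x = np.asarray(sample, dtype=float)
    m1 = np.mean(x)        # first sample moment (Eq. 2.17, k = 1)
    m2 = np.mean(x**2)     # second sample moment (Eq. 2.17, k = 2)
    half_width = np.sqrt(3.0 * (m2 - m1**2))
    return m1 - half_width, m1 + half_width

a_hat, b_hat = uniform_from_moments([0.0, 0.25, 0.5, 0.75, 1.0])
```

Note that the estimated support can extend slightly beyond the sample range, since the sample extremes underestimate the population extremes.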
2.2.2. Inverse Laplace Transform

Let us now assume that we use the bilateral Laplace transform (introduced by Laplace [12] in
his study of probabilities) on the true probability density function of $X$:

$$\mathcal{B}\{f(x)\} = \int_{-\infty}^{\infty} e^{-sx} f(x)\,dx$$
(2.27)

Since the exponential term can be expanded as an infinite sum, Eq. (2.27) becomes:

$$\mathcal{B}\{f(x)\} = \int_{-\infty}^{\infty} \sum_{k=0}^{\infty} \frac{(-s)^k x^k}{k!}\, f(x)\,dx = \sum_{k=0}^{\infty} \frac{(-s)^k}{k!} \int_{-\infty}^{\infty} x^k f(x)\,dx = \sum_{k=0}^{\infty} \frac{(-s)^k}{k!}\, m_k(X)$$
(2.28)

Thus, the bilateral Laplace transform of the probability density function can be expressed in
terms of the non-negative integer moments of the distribution. Furthermore,

$$f(x) = \mathcal{B}^{-1}\left\{\sum_{k=0}^{\infty} \frac{(-s)^k}{k!}\, m_k(X)\right\}$$
(2.29)

where $\mathcal{B}^{-1}$ represents the inverse bilateral Laplace transform operator.

By using Eq. (2.18), and truncating the infinite sum at a maximum odd moment $K$, it is
possible to estimate the probability density function as:

$$\hat{f}(x) = \mathcal{B}^{-1}\left\{\sum_{k=0}^{K} \frac{(-s)^k}{k!}\, m_k(\tilde{X})\right\}$$
(2.30)

The main difficulty of this approach is that inverse bilateral Laplace transforms are not as
commonly tabulated as ordinary (unilateral) Laplace transforms. Thus, an expression in terms
of the unilateral Laplace transform would be desirable. The problem is that unilateral Laplace
transforms only consider positive values of $x$.
For that purpose, let us then consider the following variable transformation:§§

$$Y = X - \min(\tilde{X}) + \delta$$
(2.31)

where $\delta > 0$ is a small offset. In this case, $Y$ is always positive and the unilateral Laplace
transform can be used on its probability density function:

$$\mathcal{L}\{f_Y(y)\} = \int_{0}^{\infty} e^{-sy} f_Y(y)\,dy$$
(2.32)
Thus, proceeding similarly as before:

$$\hat{f}_Y(y) = \mathcal{L}^{-1}\left\{\sum_{k=0}^{K} \frac{(-s)^k}{k!}\, m_k(\tilde{Y})\right\}$$
(2.33)

and for negative values of $y$, $\hat{f}_Y(y) = 0$.
§§
The last term in Eq. (2.31) provides a simple estimate of how far the true population minimum
would be from the sample minimum. Alternatively, it would be possible to consider that the
minimum value of the population is exactly the minimum value in the sample, or to estimate
the population minimum by a different method.
Unfortunately, the inverse Laplace transform of the power series in $s$ results in a function of
derivatives of Dirac’s delta, and that is not a suitable solution for our purposes. Therefore, in
this work, Padé approximants will be used:[13]

$$\sum_{k=0}^{K} \frac{(-s)^k}{k!}\, m_k(\tilde{Y}) \approx [N/D](s) = \frac{\sum_{i=0}^{N} p_i\, s^i}{\sum_{j=0}^{D} q_j\, s^j}$$
(2.34)

where $p_i$, $q_j$ are the corresponding coefficients of the Padé approximant, $N$ is the order of the
polynomial in the numerator, and $D$ is the order of the polynomial in the denominator. Then

$$\hat{f}_Y(y) = \mathcal{L}^{-1}\left\{\frac{\sum_{i=0}^{N} p_i\, s^i}{\sum_{j=0}^{D} q_j\, s^j}\right\}$$
(2.35)

Now, from the change of variable theorem:[10]

$$\hat{f}(x) = \hat{f}_Y\big(x - \min(\tilde{X}) + \delta\big)$$
(2.36)
2.2.3. Inverse Mellin Transform

Although Laplace integral transforms can be used, the truncation of the infinite series
expansion of the exponential might lead to significant errors in the estimation of the
probability density function. Thus, it is possible to alternatively consider the Mellin integral
transform [14] of the probability density function $f_Y(y)$ (after the variable change given in
Eq. 2.31):

$$\mathcal{M}\{f_Y(y)\}(s) = \int_{0}^{\infty} y^{s-1} f_Y(y)\,dy$$
(2.37)

Thus, the Mellin transform of the density of the positive variable $Y$ corresponds to the
function describing the moments of the distribution, given that $\mathcal{M}\{f_Y\}(k+1) = m_k(Y)$.

$$\hat{f}_Y(y) = \mathcal{M}^{-1}\{\hat{m}_{s-1}(\tilde{Y})\} = \frac{1}{2\pi i}\int_{c - i\infty}^{c + i\infty} y^{-s}\, \hat{m}_{s-1}(\tilde{Y})\,ds$$
(2.38)

This approach requires finding a suitable function describing the observed moments in the
transformed data sample $\tilde{Y}$, and using the Mellin inversion formula presented in
Eq. (2.38). Then Eq. (2.36) can be used to estimate the probability density function of $X$.
2.2.4. Cubic Splines

This method consists of approximating the probability density function by a cubic spline using
internal nodes*** in the interval $[x_{min}, x_{max}]$, as follows:

$$\hat{f}(x) = \sum_{j=0}^{3} c_{ij}\,(x - x_i)^j, \qquad x_i \leq x < x_{i+1}$$
(2.39)

where $c_{ij}$ are constant coefficients that must satisfy the three continuity conditions of cubic
splines (continuity in the function and its first two derivatives) at the internal nodes. The
moments of the spline are:

$$m_k(X) = \int_{x_{min}}^{x_{max}} x^k \hat{f}(x)\,dx = \sum_{i}\sum_{j=0}^{3} c_{ij} \int_{x_i}^{x_{i+1}} x^k (x - x_i)^j\,dx$$
(2.40)

Thus, by using different moment estimates of the distribution, along with the spline
continuity conditions, it is possible to estimate all coefficients. The zero-th moment, which is
always equal to $1$, can be used.

This method may easily result in negative values of the probability density function and/or ill-
conditioned systems of equations. In those cases, an iterative solution of the problem must be
performed, starting from an initial guess of the probability density function and the
measurements interval. More details on this method and its numerical implementation using
the iterative procedure can be found in [15]. For the present report, the iteration was
performed by changing the estimated values for the probability density at the minimum and
maximum elements of the sample.

*** If real measurements are available, they can be used as nodes, as long as enough moment
estimates are available. Otherwise, an equidistant partition of the interval, considering the
number of available moments, should be performed.
2.2.5. Polynomial Approximation

A variation of the previous method assumes that the probability density function can be
described by a single polynomial expression of order $M$, as follows:

$$\hat{f}(x) = \sum_{j=0}^{M} c_j\, x^j$$
(2.41)

In this case, the moments become:

$$m_k(X) = \int_{x_{min}}^{x_{max}} x^k \hat{f}(x)\,dx = \sum_{j=0}^{M} c_j\, \frac{x_{max}^{\,k+j+1} - x_{min}^{\,k+j+1}}{k+j+1}$$
(2.42)
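The coefficients of Eq. (2.41) can be obtained by solving the linear system implied by Eq. (2.42). A Python sketch, assuming the interval bounds $a = x_{min}$ and $b = x_{max}$ are known and $m_0 = 1$ is included among the given moments (function name ours):

```python
import numpy as np

def polynomial_pdf_from_moments(moments, a, b):
    # Single-polynomial reconstruction (Eqs. 2.41-2.42): find coefficients
    # c_j such that the polynomial's moments over [a, b] match the given
    # moments m_0..m_M (with m_0 = 1). One moment per coefficient.
    M = len(moments) - 1
    A = np.empty((M + 1, M + 1))
    for k in range(M + 1):          # moment index
        for j in range(M + 1):      # coefficient index
            p = k + j + 1
            A[k, j] = (b**p - a**p) / p
    c = np.linalg.solve(A, np.asarray(moments, dtype=float))
    # np.polyval expects the highest-degree coefficient first
    return lambda xq: np.polyval(c[::-1], xq)

# Moments of the standard uniform distribution: m_k = 1/(k+1)
f_hat = polynomial_pdf_from_moments([1.0, 0.5, 1.0 / 3.0], 0.0, 1.0)
```

As noted above for the spline variant, nothing constrains the solution to be non-negative, so the reconstructed polynomial should be checked over $[a, b]$.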
3. Evaluation of the Reconstruction Methods

The performance of the different methods presented in Section 2 for estimating the probability
density function from a data sample is tested using six different test examples and two
benchmark examples. The test examples, sampled from known probability distribution
functions, are used for determining the fitness and accuracy obtained with the different
reconstruction methods. The benchmark examples, taken from the literature,[4] are used only
to assess the fitness and the similitude of the different reconstruction methods.
3.1. Sampling Procedure

Each of the test examples has a known probability density function, which is the reference for
the accuracy assessment. Then, for each probability density function, the cumulative
probability function is determined using Eq. (1.5). Afterwards, two groups of different
uniform random numbers (between 0 and 1) are obtained for each test function. These random
numbers are true random numbers obtained from atmospheric noise using a radio signal at an
unused broadcasting frequency, together with a skew correction algorithm.[16] In particular,
these random numbers are obtained from the service provided at random.org.††† For each of
the uniform random numbers $u_i$ of the first group, a corresponding measurement value $x_i$
is obtained from the cumulative probability distribution, as follows:

$$x_i = F^{-1}(u_i)$$
(3.1)

where $F^{-1}$ represents the inverse cumulative probability function. If the inverse cannot be
explicitly obtained, then the value is obtained by an iterative search method, for example by
minimizing the function $\left(F(x_i) - u_i\right)^2$.

The second group of uniform random numbers is used to sort the measurement values
obtained from the first group of random numbers. Then, the measurements whose random
numbers in the second group have the lowest values represent the data sample. The whole set
of measurements is used to visually verify that the available data follows the pre-
defined probability distribution.
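The inversion in Eq. (3.1) can be sketched with a simple bisection search when $F^{-1}$ has no closed form. This is an illustrative Python sketch of one possible iterative search (a plain alternative to minimizing $(F(x_i) - u_i)^2$; function names are ours):

```python
def inverse_transform_sample(F, u_values, lo, hi, tol=1e-10):
    # For each uniform random number u, find x with F(x) = u by bisection,
    # assuming F is monotonically increasing on [lo, hi].
    out = []
    for u in u_values:
        a, b = lo, hi
        while b - a > tol:
            mid = 0.5 * (a + b)
            if F(mid) < u:
                a = mid     # root lies in the upper half
            else:
                b = mid     # root lies in the lower half
        out.append(0.5 * (a + b))
    return out

# Example: F(x) = x^2 on [0, 1], so F^{-1}(u) = sqrt(u)
xs = inverse_transform_sample(lambda x: x * x, [0.25, 0.81], 0.0, 1.0)
```

Bisection is robust for any monotonic $F$, at the cost of roughly $\log_2((hi - lo)/tol)$ evaluations of $F$ per sample.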
3.2. Evaluation Criteria

The probability density functions of the test examples, presented in Section 3.3, are
reconstructed using the methods described in Section 2. Once a reconstructed probability
density function is obtained, it is compared to the original sample data by means of three
different metrics for fitness assessment, presented in Section 3.2.1. Also, since each test
example was obtained from a reference probability density function, the accuracy of the
reconstructed probability density is evaluated by comparing it to the reference probability
density function. Such comparison is done using the concept of similitude between probability
density functions introduced in a previous report.[6] The determination of the probability
density similitude is presented in Section 3.2.2. The equation used for assessing accuracy is
presented in Section 3.2.3.
†††
Uniform random number generator: https://www.random.org/decimal-fractions/. Number of decimal
places used: 10.
3.2.1. Fitness Assessment

The fitness of a reconstruction method measures the ability of the method to obtain an
estimate of the probability density that satisfactorily describes the sampled data. Such a
comparison is best performed on the cumulative probability of the data. Three different
metrics have been previously proposed for assessing the fitness of a cumulative probability
model to the sample data, as can be seen in Table 2:
Table 2. Fitness error metrics for cumulative probability models (adapted from [6]). The
metrics include the maximum difference in cumulative probability, $\max_i |e_i|$, and the
average difference in cumulative probability, $\langle |e_i| \rangle$, each with a corresponding
model rejection criterion given in [6].

where $e_i$ is the difference, or error, between the cumulative distribution described by the model
($\hat{F}$) and the cumulative distribution observed at each outcome $x_i$ (ranked in ascending
order) in the sample:

$$e_i = \hat{F}(x_i) - \frac{1}{n}\sum_{j \leq i} n_j$$
(3.2)

Usually, if at least two of the criteria proposed in Table 2 do not result in rejection, then the
model can be considered fit to the sample data.
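The maximum and average error metrics of Table 2 can be sketched as follows. This is a simplified Python variant that evaluates the model CDF only at each ranked outcome (the full Eq. 3.2 distinguishes additional cases; function name ours):

```python
import numpy as np

def fitness_errors(sample, F_model):
    # Maximum and average absolute difference between the model CDF and
    # the observed cumulative probabilities of the ranked sample.
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    F_obs = np.arange(1, n + 1) / n        # observed cumulative probability
    e = np.abs(F_model(x) - F_obs)
    return e.max(), e.mean()

# Example: a uniform CDF model against a perfectly uniform sample
max_e, avg_e = fitness_errors([0.2, 0.4, 0.6, 0.8, 1.0],
                              lambda x: np.clip(x, 0.0, 1.0))
```

Both metrics are reported in Table 11 relative to the values obtained with the reference density.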
3.2.2. Similitude Assessment

The similitude $S$ between two probability distribution models can be defined as:[6]

$$S(f_1, f_2) = \int_{-\infty}^{\infty} \min\big(f_1(x), f_2(x)\big)\,dx$$
(3.3)

Another related concept that can be used to assess the similitude between two models is the
relative probability error, given by:[17]

$$\varepsilon_r(f_1, f_2) = \frac{1}{2}\int_{-\infty}^{\infty} \big|f_1(x) - f_2(x)\big|\,dx = 1 - S(f_1, f_2)$$
(3.4)

The algorithm programmed in R for similitude assessment is presented in Appendix A.5.
3.2.3. Accuracy Assessment

The accuracy $A$ of a probability density function reconstruction method for a particular test
example is determined as the similitude between the estimated probability density function $\hat{f}$
and the reference probability density function $f$ defined for the test example. Thus,

$$A(\hat{f}\,|\,f) = S(\hat{f}, f) = \int_{-\infty}^{\infty} \min\big(\hat{f}(x), f(x)\big)\,dx = 1 - \frac{1}{2}\int_{-\infty}^{\infty} \big|\hat{f}(x) - f(x)\big|\,dx$$
(3.5)
3.3. Test Examples

Six different test examples have been considered for the current assessment. For each test
example, the reference probability density function, the reference cumulative probability
function, the sample size, and the sample data are provided.

3.3.1. Test Example 1: Standard Uniform Distribution

Reference probability density function:

$$f(x) = \begin{cases} 1, & 0 \leq x \leq 1 \\ 0, & \text{otherwise} \end{cases}$$
(3.6)

Reference cumulative probability function:

$$F(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \leq x \leq 1 \\ 1, & x > 1 \end{cases}$$
(3.7)

Sample size: $n = 30$
Table 3. Sample Data for the Test Example 1: Standard Uniform Distribution.
0.94984581 0.61068765 0.31245958 0.84835996 0.71142486 0.04495294
0.82666537 0.25544950 0.50851774 0.67086908 0.27409144 0.75048860
0.14488340 0.00616502 0.58209403 0.77794805 0.04842904 0.30529027
0.97782694 0.29575453 0.70165301 0.80645354 0.55693643 0.24583023
0.94960493 0.73219776 0.33442330 0.02234013 0.55732476 0.43998253
Fitness: ,〈 〉 ,
Figure 1. Cumulative Probability Distribution for Test Example 1: Standard Uniform Distribution.
Dotted blue line: Reference cumulative probability function (Eq. 3.7). Green dashed line: Large
set of 10.000 random measurements. Red solid line: Random sample of 30 measurements.
3.3.2. Test Example 2: Standard Normal Distribution

Reference probability density function:

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$
(3.8)

Reference cumulative probability function:

$$F(x) = \frac{1}{2}\left(1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right)$$
(3.9)

Sample size: $n = 50$
Table 4. Sample Data for the Test Example 2: Standard Normal Distribution.
-1.2439 0.4505 -0.3852 -1.8290 1.3407 -0.6469 -0.4056 0.6841 -1.5678 0.6941
-1.8914 -0.4966 0.8662 -0.2750 -0.2041 1.7432 0.1361 0.2965 1.3166 -0.9297
0.4800 -0.1934 1.6562 0.9544 2.1867 0.3798 1.1047 0.7773 0.9502 0.7686
-0.2860 1.1326 -0.7087 -1.0967 -0.4775 0.1193 -0.9110 -0.1606 0.4372 -1.6280
-0.9466 -0.3390 -2.2981 1.5130 0.1238 -0.5792 0.8862 0.6181 0.1883 0.0580
Fitness: ,〈 〉 ,
Figure 2. Cumulative Probability Distribution for Test Example 2: Standard Normal Distribution.
Dotted blue line: Reference cumulative probability function (Eq. 3.9). Green dashed line: Large
set of 10.000 random measurements. Red solid line: Random sample of 50 measurements.
(3.10)
Reference cumulative probability function:
〈 〉 ( 〈 〉)
( ) √ ( 〈 〉)
√
( )
( )
( ) ( )
(3.11)
Sample Size:
Fitness: ,〈 〉 ,
3.3.4. Test Example 4: Scopd Distribution of Time between Molecular Collisions [17]
( ) ( ) ( ) ( )
√
(3.12)
Reference cumulative probability function:
( ) ( ) ( )
√
(3.13)
Sample Size:
Fitness: ,〈 〉 ,
Figure 4. Cumulative Probability Distribution for Test Example 4: Scopd Distribution. Dotted
blue line: Reference cumulative probability function (Eq. 3.13). Green dashed line: Large set of
10.000 random measurements. Red solid line: Random sample of 50 measurements.
3.3.5. Test Example 5: Bimodal Distribution

Reference probability density function:

( ) ( )
( )
√
(3.14)
Reference cumulative probability function:
( ) ( )
( ) √ √
(3.15)
Sample Size:
Fitness: ,〈 〉 ,
Figure 5. Cumulative Probability Distribution for Test Example 5: Bimodal Distribution. Dotted
blue line: Reference cumulative probability function (Eq. 3.15). Green dashed line: Large set of
10.000 random measurements. Red solid line: Random sample of 30 measurements.
( ) {
(3.16)
Reference cumulative probability function:
( ) {
(3.17)
Sample Size:
Fitness: ,〈 〉 ,
3.4. Benchmark Examples

Benchmark examples are sets of measurements or moments reported in the literature whose
true probability density function is considered unknown. Thus, for benchmark examples it is
not possible to test accuracy. The purpose of these examples is to test the similitude of the
results obtained using the different methods.

3.4.1. Benchmark Example 1: Length of Eruptions of Old Faithful Geyser

This data set contains 107 different observations of eruption length (in minutes) of the Old
Faithful geyser in Yellowstone Park. The data shown in Table 9 and Figure 7 were obtained
from Table 2.2 presented in [4].
Table 9. Sample Data for the Benchmark Example 1: Length of eruptions of Old Faithful geyser.
Sample size: $n = 107$
4.37 1.83 4.25 3.83 1.73 4.18 1.67 2.27 4.73
4.70 1.83 3.58 1.85 3.10 4.58 3.50 2.93 3.72
1.68 3.95 3.67 3.80 4.62 3.50 4.20 4.63 4.50
1.75 4.83 1.90 3.80 1.88 4.62 4.43 4.00 4.40
4.35 3.87 4.13 3.33 3.52 4.03 1.90 1.97 4.58
1.77 1.73 4.53 3.73 3.77 1.97 4.08 3.93 3.50
4.25 3.92 4.10 1.67 3.43 4.60 3.43 4.07 1.80
4.10 3.20 4.12 4.63 2.00 4.00 1.77 4.50 4.28
4.05 2.33 4.00 1.83 3.73 3.75 4.50 2.25 4.33
1.90 4.57 4.93 2.03 4.60 4.00 1.80 4.25 4.13
4.00 3.58 3.68 2.72 2.93 4.33 3.70 4.08 1.95
4.42 3.70 1.85 4.03 4.65 1.82 2.50 3.92
3.4.2. Benchmark Example 2: Multimodal Crystal Size Distribution

This example provides the first 10 moments of a crystal size distribution, corresponding to a
batch seeded crystallization process where two seed distributions with different mean sizes are
initially mixed. The moments presented in Table 10 were obtained from Example 2.3 in [15].‡‡‡
The decimal logarithms of the moments are included for increased precision.
Table 10. Moments of Benchmark Example 2: Multimodal Crystal Size [m] Distribution

  k    Moment          Decimal logarithm
  0    1                 0.000
  1    1.743×10⁻³       -2.759
  2    4.062×10⁻⁶       -5.391
  3    1.078×10⁻⁸       -7.967
  4    3.049×10⁻¹¹     -10.516
  5    8.945×10⁻¹⁴     -13.048
  6    2.692×10⁻¹⁶     -15.570
  7    8.261×10⁻¹⁹     -18.083
  8    2.575×10⁻²¹     -20.589
  9    8.134×10⁻²⁴     -23.090
 10    2.598×10⁻²⁶     -25.585
4. Results and Discussion

4.1. Test Example 1: Standard Uniform Distribution

Although simple, this is actually a challenging test example, particularly for methods providing
smooth probability density functions. Table 11 summarizes the performance of all the
reconstruction methods presented in Section 2, using different reconstruction parameters. A
comparison of the probability density functions obtained is presented in Figure 8.

‡‡‡ The moments reported in [15] are not normalized. Therefore, all of them must be divided
by the zero-th moment.
Table 11. Performance of probability density functions estimated for Test Example 1 using
different reconstruction methods. The first three columns are the fitness metrics of Table 2,
relative to the values obtained with the reference density (%); the last column is the accuracy
$A(\hat{f}\,|\,f)$ (%).

Reconstruction Method                                          Relative fitness metrics (%)   Accuracy (%)
Derivative of CDF (CDF: Linear model)                          87.6%    88.5%    82.2%        97.7%
Derivative of CDF (CDF: Quadratic model)                       146.1%   132.1%   179.1%       93.6%
Derivative of CDF (CDF: Cubic model)                           150.1%   138.3%   190.9%       93.6%
Naïve Estimator (30 nodes, Δx=0.1)                             89.0%    140.9%   126.5%       77.0%
Naïve Estimator (30 nodes, Δx=0.3)                             181.7%   264.2%   465.4%       89.7%
Naïve Estimator (30 nodes, Δx=0.5)                             205.1%   331.0%   727.3%       90.8%
Sample-based Parametric Estimation (Uniform Distribution)      94.3%    88.7%    83.7%        98.3%
Sample-based Parametric Estimation (Normal Distribution)       136.8%   114.5%   148.8%       81.4%
Kernel Density Estimation (Gaussian, optimal h=0.16117)        95.8%    96.3%    102.6%       85.5%
Kernel Density Estimation (Epanechnikov, optimal h=0.16117)    99.9%    102.0%   111.6%       85.5%
Kernel Density Estimation (Gaussian, h=0.05)                   64.4%    37.7%    24.2%        83.1%
Kernel Density Estimation (Epanechnikov, h=0.05)               70.2%    42.4%    28.4%        83.5%
Moment-based Parametric Estimation (Uniform Distribution, to ) 88.0%    88.9%    82.0%        97.7%
Moment-based Parametric Estimation (Uniform Distribution, to ) 107.1%   90.3%    91.2%        99.5%
Moment-based Parametric Estimation (Normal Distribution, to )  156.3%   181.2%   286.0%       80.6%
Inverse Laplace Transform ( )                                  475.2%   708.9%   3489.5%      60.1%
Inverse Mellin Transform (using to )                           100.3%   100.3%   100.5%       100%
Cubic Splines (13 nodes, using to )                            175.3%   180.6%   331.8%       84.7%
Polynomial Approach (using M0 to M2)                           79.4%    77.1%    65.9%        96.2%
Polynomial Approach (using M0 to M4)                           94.8%    61.8%    56.3%        92.8%
Figure 8. Probability density functions reconstructed for Test Example 1 using the
reconstruction methods listed in Table 11. Panels: (a) CDF derivative with linear, quadratic
and cubic CDF models; (b) Naïve estimator with 30 nodes and dx = 0.1, 0.3, 0.5; (c) sample-
based parametric estimation with uniform and normal distributions; (d) Gaussian and
Epanechnikov kernels with h = 0.16117 and h = 0.05; (g) cubic splines and polynomial
approach (M0 to M2 and M0 to M4). Each panel also shows the reference density.
Particularly for this test example, all the probability density function reconstruction methods
were evaluated. In some cases, different parameter sets were considered.
Several methods were able to reconstruct the original probability density function with
accuracy greater than 95%. That was the case of the method based on the derivative of the
cumulative probability function fitted by a linear model, the parametric methods (assuming a
uniform distribution), the inverse Mellin transform method, and the moment-based polynomial
method. As expected, the relative fit to the data of the previous methods was close to 100% or
even lower.
Some methods were capable of significantly improving the fit to the data without improving
the accuracy. That was the case for the Kernel density estimators (particularly for small
bandwidths), indicating that overfitting occurred in this type of method. Thus, improving
the fit of the probability density function to the sample data does not guarantee an accurate
reconstruction of the true probability distribution of the population.
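This overfitting effect can be illustrated with a plain Gaussian kernel density estimator: shrinking the bandwidth makes the estimate spike at the observations while the fit to the sample "improves". A minimal Python sketch (the algorithms of this work are written in R; the function name and sample values here are illustrative only):

```python
import math

def gaussian_kde(sample, h):
    """Gaussian kernel density estimate with bandwidth h."""
    n = len(sample)
    c = 1.0 / (n * h * math.sqrt(2 * math.pi))
    def rho(x):
        return c * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
    return rho

sample = [0.1, 0.4, 0.5, 0.9]
wide = gaussian_kde(sample, 0.3)     # smooth estimate
narrow = gaussian_kde(sample, 0.02)  # spiky, overfitted estimate
print(narrow(0.4) > wide(0.4))       # → True
```

With h = 0.02 the estimate at a data point is several times larger than the smooth h = 0.3 estimate, even though both are built from the same four observations.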
Parametric methods considering a uniform distribution were capable of fitting the data better
than the reference function, while at the same time providing high accuracies. The
disadvantage of parametric methods is that the true shape of the probability density function
must be known a priori, which is seldom the case. It then becomes necessary to test pre-
defined probability density functions with a wide variety of shapes in order to successfully
reconstruct the density function.
The inverse Mellin method was particularly interesting, as it was capable of almost exactly
reconstructing the reference density function. It is possible that the inverse Mellin method is
very efficient for uniform distributions. Thus, it is important to analyze the results obtained
with the other test examples. On the opposite side, the inverse Laplace method was the worst
performer, not only in fitness but also in accuracy. This result might indicate that the method is
very sensitive to truncation of the infinite series expansion. Furthermore, it is a cumbersome
method as it involves obtaining Padé approximants and inverse Laplace transforms, requiring
manual analytical solution or symbolic programming. Given that the difficulty involved in this
method is not rewarded by fitness and accuracy, it will not be considered for the remaining test
examples.
Since the moment-based polynomial method seems to be performing better and with a lower
computational load than the closely related moment-based splines method, the latter will not
be tested for the remaining test examples.
It is also evident that the shapes of the probability density functions reconstructed by the
Naïve and Kernel density estimators are highly sensitive to the particular reconstruction
parameters used. The polynomial reconstruction also presented significant changes in the
shape of the density function, particularly for higher-order polynomials. For methods resulting
in noisy (multimodal) probability density functions, it is difficult to assess whether the
predicted behavior is real or just a mathematical artifact.
The second test example is also challenging, for two reasons: 1) the density is a highly non-
linear function of the variable, and 2) the variable takes both positive and negative values.
Table 12 summarizes the performance of the reconstruction methods used. Only one set of
reconstruction parameters is considered for each method. The probability density functions
obtained are presented in Figure 9.
Table 12. Performance of probability density functions estimated for Test Example 2 using
different reconstruction methods.

Reconstruction Method | Relative Max. Error (%) | Relative Mean Error (%) | Relative SSE (%) | Accuracy (%)
Derivative of CDF (CDF: 5th-degree model) | 78.1 | 33.0 | 20.5 | 92.5
Naïve Estimator (50 nodes, dx = 0.5) | 75.6 | 42.2 | 26.2 | 88.4
Sample-based Parametric Estimation (Normal Distribution) | 64.6 | 40.0 | 21.2 | 96.5
Kernel Density Estimation (Gaussian Kernels, optimal h = 0.49257) | 80.9 | 68.8 | 46.4 | 91.9
Kernel Density Estimation (Epanechnikov Kernels, optimal h = 0.49257) | 83.1 | 72.3 | 51.3 | 92.0
Moment-based Parametric Estimation (Normal Distribution) | 65.4 | 54.2 | 33.4 | 98.0
Inverse Mellin Transform (M0 to M10, steps 1/4) | 612.2 | 1029.1 | 6435.0 | 71.1
Polynomial Approach (M0 to M5) | 73.6 | 31.6 | 17.2 | 92.2
The parametric methods (sample- and moment-based) provided the best accuracy (>95%) for
this normal test example. The polynomial methods (sample-based derivative of the CDF and
moment-based polynomial), along with the kernel density methods, scored between 90 and 95%
accuracy. The polynomial methods, in particular, behaved oddly at the extreme values of the
sample, probably because of the cut-off imposed on the density at those extreme values.
The central shape of the distribution, however, remains close to the reference. The Naïve
estimator correctly followed the behavior of the distribution, but it is too noisy. The inverse
Mellin method failed in this case to correctly describe the density function, probably due to
the high sensitivity of the results to the coefficients obtained by integration (when performing
the inverse Mellin transform). On the other hand, the methods providing the best fit to the
sample data were the parametric and polynomial methods (both sample-based and
moment-based).
[Figure 9 appears here: three panels over x ∈ [−3, 3]. (a) reference vs. CDF derivative (5th-degree CDF) and Naïve estimator (dx = 0.5); (b) reference vs. Gaussian and Epanechnikov kernels (h = 0.49257); (c) reference vs. moment-based parametric (normal), inverse Mellin transform (M0 to M10, steps 1/4) and polynomial (M0 to M5).]
Figure 9. Probability density functions reconstructed for Test Example 2 using the
reconstruction methods listed in Table 12
Table 13. Performance of probability density functions estimated for Test Example 3 using
different reconstruction methods.

Reconstruction Method | Relative Max. Error (%) | Relative Mean Error (%) | Relative SSE (%) | Accuracy (%)
Derivative of CDF (CDF: 5th-degree model) | 35.9 | 19.1 | 6.5 | 84.2
Naïve Estimator (40 nodes, dx = 100) | 18.8 | 6.6 | 1.2 | 79.8
Sample-based Parametric Estimation (Maxwell-Boltzmann Distribution) | 46.1 | 29.3 | 12.3 | 92.3
Kernel Density Estimation (Gaussian Kernels, optimal h = 98.96) | 44.7 | 30.2 | 13.1 | 95.2
Kernel Density Estimation (Epanechnikov Kernels, optimal h = 98.96) | 47.4 | 34.1 | 15.9 | 96.6
Moment-based Parametric Estimation (Maxwell-Boltzmann Distribution) | 59.3 | 35.7 | 17.3 | 90.4
Inverse Mellin Transform (M0 to M10, steps 1/4) | 1258 | 1052 | 12993 | 39.3
Polynomial Approach (M0 to M5, scaled variable) | 43.4 | 25.1 | 10.1 | 89.4
For this test example, the best accuracy (95-96%) was obtained using Kernel density estimators,
particularly using Epanechnikov kernels. The parametric methods considering a Maxwell-
Boltzmann distribution only obtained 90-92% accuracy. It is possible that this low accuracy is
due to the large fitness errors between the sample and the reference distribution. Note that
the relative fitness errors reported in Table 13 are low, indicating that the original reference
function itself did not fit the data sample very well.
The inverse Mellin transform again failed to accurately describe the density function or fit the
sample data, as a result of the sensitivity to the coefficients obtained during the integration.
Again, since this method requires additional steps of analytical or symbolic integration and is
not performing well, it will not be considered for the remaining test examples.
Another interesting point in these results is that some methods (i.e. the CDF derivative, Naïve
and Kernel estimators) produced probability density functions with two relevant humps,
resembling the presence of two different populations. This indicates a higher concentration of
sample values around 500 m/s, but also around 1000 m/s.
[Figure 10 appears here: three panels. (a) reference vs. CDF derivative (5th-degree CDF), Naïve estimator (40 nodes, dx = 100) and sample-based parametric (Maxwell-Boltzmann); (b) reference vs. Gaussian and Epanechnikov kernels (h = 98.96); (c) reference vs. moment-based parametric (Maxwell-Boltzmann), inverse Mellin transform (M0 to M10, steps 1/4) and polynomial (M0 to M5).]
Figure 10. Probability density functions reconstructed for Test Example 3 using the
reconstruction methods listed in Table 13
The Scopd distribution of times between molecular collisions is a highly non-linear function
that can be approximated by an exponential distribution, although the two are not identical.
The shape of this distribution is not easily described by low-degree polynomials, which tend to
present larger deviations at the extremes of the distribution. The results obtained for several
reconstruction methods are presented in Table 14 and Figure 11.
It can be seen again that parametric estimations based on the corresponding Scopd function
resulted in higher accuracies (>96%). Parametric exponential distributions and the moment-
based polynomial approach also performed well (94-95%). Particularly, the polynomial
approach achieved low fitness errors, comparable to those obtained with sample-based
parametric Scopd. Kernel density estimators (Gaussian and Epanechnikov) did not perform so
well, especially because negative collision times are predicted.
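For comparison, the sample-based parametric estimation of an exponential density reduces to a single moment match: the rate equals the reciprocal of the sample mean, and the density is zero for negative times by construction. A minimal Python sketch (illustrative names and sample values; the paper's own code is in R):

```python
import math
import statistics

def fit_exponential_pdf(sample):
    """Sample-based parametric estimation of an exponential density:
    rate lam = 1 / sample mean (moment matching)."""
    lam = 1.0 / statistics.mean(sample)
    return lambda x: lam * math.exp(-lam * x) if x >= 0 else 0.0

pdf = fit_exponential_pdf([0.5, 1.0, 1.5, 3.0])  # sample mean = 1.5
print(round(pdf(0.0), 4))  # density at the origin equals lam = 1/1.5
print(pdf(-1.0))           # no probability mass for negative times
```

Unlike the kernel estimators discussed above, this parametric fit cannot predict negative collision times.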
Table 14. Performance of probability density functions estimated for Test Example 4 using
different reconstruction methods.

Reconstruction Method | Relative Max. Error (%) | Relative Mean Error (%) | Relative SSE (%) | Accuracy (%)
Derivative of CDF (CDF: 4th-degree model) | 114.3 | 158.1 | 208.1 | 89.2
Naïve Estimator (50 nodes, dx = 0.5) | 198.7 | 357.2 | 808 | 82.8
Sample-based Parametric Estimation (Scopd Distribution) | 102.7 | 73.8 | 63.8 | 96.7
Sample-based Parametric Estimation (Exponential Distribution) | 115.4 | 108.8 | 119.7 | 95.1
Kernel Density Estimation (Gaussian Kernels, optimal h = 0.46764) | 251.8 | 204.1 | 495.8 | 76.0
Kernel Density Estimation (Epanechnikov Kernels, optimal h = 0.46764) | 275.0 | 222.6 | 600.3 | 73.8
Moment-based Parametric Estimation (Scopd Distribution) | 108.1 | 116.5 | 131.6 | 99.0
Moment-based Parametric Estimation (Exponential Distribution) | 147.5 | 114.8 | 138.3 | 94.0
Polynomial Approach | 101.4 | 65.1 | 57.5 | 94.3
[Figure 11 appears here: four panels over x ∈ [−1, 5] comparing the reconstructions of Table 14 with the reference density.]
Figure 11. Probability density functions reconstructed for Test Example 4 using the
reconstruction methods listed in Table 14
This particular distribution was selected as a test example in order to assess the prediction
capability of the reconstruction methods when sampling from multimodal distributions. The
moment-based parametric estimation assuming a binormal distribution requires the first 5
moments of the distribution and involves analytically solving several integrals of increasing
complexity; this method is therefore not considered for this example. The results obtained are
summarized in Table 15 and Figure 12.
Table 15. Performance of probability density functions estimated for Test Example 5 using
different reconstruction methods.

Reconstruction Method | Relative Max. Error (%) | Relative Mean Error (%) | Relative SSE (%) | Accuracy (%)
Derivative of CDF (CDF: 5th-degree model) | 114.3 | 93.2 | 92.8 | 87.2
Naïve Estimator (60 nodes, dx = 1.5) | 110.0 | 114.8 | 118.0 | 88.0
Sample-based Parametric Estimation (Binormal Distribution) | 72.0 | 33.1 | 20.6 | 91.6
Sample-based Parametric Estimation (Normal Distribution) | 225.7 | 271.7 | 677.5 | 75.8
Kernel Density Estimation (Gaussian Kernels, optimal h = 1.106) | 129.3 | 121.7 | 137.6 | 81.5
Kernel Density Estimation (Epanechnikov Kernels, optimal h = 1.106) | 134.4 | 129.5 | 152.3 | 80.4
Moment-based Parametric Estimation (Normal Distribution) | 217.5 | 272.8 | 702.1 | 76.2
Polynomial Approach (M0 to M6) | 55.4 | 33.2 | 19.3 | 90.1
The parametric approach considering a Normal distribution does not represent the original
density distribution as only one mode is obtained. Furthermore, the mode found does not
correspond to a true mode in the sample. All other methods considered were able to identify
both modes. Each method found different relative frequencies for the modes. The best
accuracy was found for the sample-based parametric estimation assuming a binormal
distribution. Close results were obtained by the polynomial approach, but the extreme tails are
not correctly identified as a result of truncation. Furthermore, the best fit to the data sample
was obtained by the moment-based polynomial fit. The CDF derivative and the Naïve estimator
also performed acceptably. The kernel density estimators did not score so well because they
flattened the region between the modes too much.
[Figure 12 appears here: four panels over x ∈ [−4, 4]. (a) reference vs. CDF derivative (5th-degree CDF) and Naïve estimator (60 nodes, dx = 1.5); (b) reference vs. sample-based parametric (binormal and normal); (c) reference vs. Gaussian and Epanechnikov kernels (h = 1.106); (d) reference vs. moment-based parametric (normal) and polynomial (M0 to M6).]
Figure 12. Probability density functions reconstructed for Test Example 5 using the
reconstruction methods listed in Table 15
The final test example is an arbitrary polynomial distribution. This particular distribution
presents an antimode instead of a mode. The idea was to include a distribution with a
completely different shape compared to the previous examples. The corresponding
reconstruction results can be seen in Table 16 and Figure 13. For this particular case, the
moment-based parametric estimation using a polynomial distribution model is equivalent to
the polynomial estimation method presented in Section 2.2.5.
The most accurate reconstructions were obtained with the polynomial methods (parametric
estimation and derivative of CDF). These methods also provided the best fit to the data sample.
Kernel density estimation methods successfully described the antimode, but they had some
difficulties at the extremes of the distribution. The Naïve estimator again produced a noisy
reconstruction oscillating around the reference function.
Table 16. Performance of probability density functions estimated for Test Example 6 using
different reconstruction methods.

Reconstruction Method | Relative Max. Error (%) | Relative Mean Error (%) | Relative SSE (%) | Accuracy (%)
Derivative of CDF (CDF: 4th-degree model) | 86.7 | 31.3 | 19.3 | 95.3
Naïve Estimator (40 nodes, dx = 0.2) | 129.5 | 198.5 | 283.7 | 87.8
Sample-based Parametric Estimation (4th-degree Polynomial Distribution) | 86.7 | 26.1 | 18.2 | 93.2
Kernel Density Estimation (Gaussian Kernels, optimal h = 0.1618) | 116.5 | 75.0 | 70.0 | 82.2
Kernel Density Estimation (Epanechnikov Kernels, optimal h = 0.1618) | 122.6 | 82.8 | 81.8 | 80.7
Moment-based Parametric Estimation (Polynomial Approach) | 69.3 | 37.1 | 21.6 | 95.6
[Figure 13 appears here: two panels over x ∈ [0.5, 2.5] comparing the reconstructions of Table 16 with the reference density.]
Figure 13. Probability density functions reconstructed for Test Example 6 using the
reconstruction methods listed in Table 16
In this benchmark example, the true probability density function of the data is unknown. Thus,
the idea is to use the different methods to obtain the best possible guess of the probability
distribution of the population. Table 17 and Figure 14 summarize the results obtained for this
data set.
From the results obtained in the previous Test Examples, it was possible to observe that the
Naïve estimator, although noisy, almost always follows the true probability density of the
population when the smoothing step is chosen close to the optimal kernel bandwidth (Eq.
2.13). Thus, it is expected that any smoother density function, presenting a high similitude with
the Naïve estimator, might accurately represent the true distribution of the population. For this
reason, the similitude between each reconstruction and the Naïve estimation has been
included in Table 17.
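The optimal smoothing step mentioned above is the optimal Gaussian-kernel bandwidth of Eq. (2.13), h = σ(4/(3N))^(1/5), which is also the default used in the Appendix A.7 algorithm. A quick Python check of the formula (the sample values are illustrative only):

```python
import statistics

def optimal_bandwidth(sample):
    """Optimal Gaussian-kernel bandwidth: h = sigma * (4/(3N))**(1/5)."""
    sigma = statistics.stdev(sample)  # sample standard deviation
    n = len(sample)
    return sigma * (4.0 / (3.0 * n)) ** 0.2

sample = [1, 2, 3, 4, 5]  # stdev = sqrt(2.5), N = 5
h = optimal_bandwidth(sample)
print(round(h, 4))
```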
Table 17. Performance of probability density functions estimated for Benchmark Example 1
using different CDF-based reconstruction methods.

Reconstruction Method | Max. Error | Mean Error | SSE | Similitude with Naïve (%)
Derivative of CDF (CDF: 4th-degree model) | 0.0403 | 0.0109 | 0.0497 | 89.4
Naïve Estimator (100 nodes, dx = 0.4) | 0.0597 | 0.0259 | 0.1859 | 100
Sample-based Parametric Estimation (4th-degree Polynomial Distribution) | 0.0537 | 0.0116 | 0.0608 | 88.5
Kernel Density Estimation (Gaussian Kernels, optimal h = 0.3759) | 0.1453 | 0.0697 | 1.3159 | 78.9
Kernel Density Estimation (Epanechnikov Kernels, optimal h = 0.3759) | 0.1529 | 0.0745 | 1.4971 | 77.4
Moment-based Polynomial Estimation | 0.0277 | 0.0059 | 0.0170 | 90.0
[Figure 14 appears here: the reconstructions of Table 17 compared over x ∈ [0, 6].]
Figure 14. Probability density functions reconstructed for Benchmark Example 1 using the
reconstruction methods listed in Table 17
Polynomial reconstruction methods achieved the best fit to the data sample while at the same
time presenting the highest similitude with the Naïve estimator at optimal bandwidth.
Particularly, the moment-based polynomial estimation can be considered as the best
description of the true probability density function for this population. The degree of the
polynomial was chosen by minimizing all three fitness error metrics: when at least one of the
metrics does not decrease as the degree of the polynomial increases, the previous degree is
selected as optimal. The polynomial probability density function obtained is:
ρ(x) = a0 + a1·x + … + am·x^m ,  x ∈ [xmin, xmax]
(4.1)
It is also possible to conclude that the data sample contains two groups of observations. The
first group, of short eruption lengths (<2.5 min), resembles an exponential distribution with a
lag (minimum eruption length). The second group, consisting of longer eruptions (>2.5 min),
follows a skewed unimodal distribution.
The final example provides only information about the moments of the distribution. Thus, only
the moment-based methods described in Section 2.2 can be used. Given that there is no data
set, it is not possible to test the fitness of the reconstructions. However, since it is known that
the moments come from a distribution with three modes,[15] this criterion will be used to
assess the effectiveness of each method.
For this example, given the large difference in magnitude between the moments, the following
re-scaling of the variable is proposed for reducing numerical error:
y = λx
(4.2)
where λ is the scale factor. Thus, the moments of the scaled variable are:
M_n(y) = λ^n M_n(x)
(4.3)
The transformed moments are presented in Table 18. When possible, all available moments are
used for the reconstruction of the probability density function. The results are presented in
Figure 15. The moment-based reconstruction method considering a normal distribution was
included for illustrative purposes only: it was known beforehand that the distribution is
multimodal, but it provides a visual reference for the moments used. Clearly, the inverse
transform methods do not offer a reliable representation of the distribution, probably because
of the large error involved in the approximation and integration procedures. On the other
hand, the polynomial approach, while not perfect, successfully predicted a multimodal
distribution from the moment data. The predicted modal points are also close to those
presented in the right plot of Fig. 5 of [15]. The similitude of all other methods to the
polynomial fit was found to be below 75%.
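The moment rescaling used for this example simply multiplies the n-th moment by the n-th power of the scale factor, which is what the `scale^ni` line in the Appendix A.8 code implements. A minimal Python sketch (the scale factor 2 and the uniform-distribution moments are illustrative):

```python
def rescale_moments(moments, lam):
    """Moments of y = lam * x from moments of x: M_n(y) = lam**n * M_n(x).
    'moments' lists M_0, M_1, ... in order of increasing n."""
    return [lam ** n * m for n, m in enumerate(moments)]

# Moments of the uniform distribution on [0, 1]: M_n = 1/(n+1)
scaled = rescale_moments([1.0, 1 / 2, 1 / 3], 2.0)
print(scaled)  # → [1.0, 1.0, 1.3333333333333333]
```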
Table 18. Transformed Moments for Benchmark Example 2 using Eq. (4.3)

n | M_n(y)
0 | 1
1 | 0.4378
2 | 0.2563
3 | 0.1709
4 | 0.1214
5 | 0.0894
6 | 0.0676
7 | 0.0521
8 | 0.0408
9 | 0.0324
10 | 0.0260
[Figure 15 appears here: the reconstructed probability densities plotted against the transformed variable y.]
Figure 15. Probability density functions reconstructed for Benchmark Example 2 using
transform (4.2)
The probability density function obtained in terms of the original variable is presented in Figure
16, neglecting the noise at the upper tail. The corresponding equation is:
ρ(x) = b0 + b1·x + … + bm·x^m ,  x ∈ [xmin, xmax]
(4.4)
[Figure 16 appears here: probability density [1/m] as a function of x [m].]
Figure 16. Probability density functions reconstructed for Benchmark Example 2 without
variable transformation. Function described by Eq. (4.4).
5. Conclusion
Reconstructing the original probability distribution from a data sample or from a set of
distribution moments is a challenging inverse problem. Two main difficulties can be found. On
one hand, there is no unique solution to the problem since different distributions might yield
the same finite sample of data or the same finite set of moments. On the other hand, there is
always an intrinsic error involved in the sampling procedure used to obtain the data and/or the
moments. Three different fitness error metrics were considered for assessing the error in
cumulative probability: Maximum error, mean error and sum of squared error. For six reference
probability density functions of different shapes, random samples of 30 to 60 elements in size
presented sample cumulative probability distributions slightly different from their
corresponding population cumulative probability. The average maximum sampling error was
6.5%, with an average mean sampling error of 2.3%. The average sum of squared error was
0.0457. In general, the sampling error increases as the sample size decreases. Even if a
reconstruction procedure provided a perfect fit to the data sample, it would not necessarily
reflect the true probability distribution of the population. Thus, the challenging goal is
satisfactorily reconstructing the true probability density function of a population from a finite
sample or set of moments.
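The maximum sampling error quoted above is the largest absolute gap between the sample cumulative distribution and the population cumulative distribution. A Python sketch of this metric, evaluated here for a uniform population (function name and random seed are illustrative):

```python
import random

def max_sampling_error(sample, cdf):
    """Largest |ECDF(x) - CDF(x)| over the sample, checked on both sides
    of each ECDF jump, where the maximum must occur."""
    xs = sorted(sample)
    n = len(xs)
    err = 0.0
    for i, x in enumerate(xs):
        # ECDF jumps from i/n to (i+1)/n at x
        err = max(err, abs(cdf(x) - i / n), abs(cdf(x) - (i + 1) / n))
    return err

random.seed(7)
sample = [random.random() for _ in range(30)]  # sample from Uniform(0, 1)
e = max_sampling_error(sample, lambda x: x)    # CDF of Uniform(0, 1) is x
print(round(e, 3))
```

For samples of 30 elements, this maximum error is typically of the order of a few percent to around ten percent, consistent with the averages reported above.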
Different reconstruction methods were presented in Section 2, either based on the cumulative
probability distribution of the sample, or on a finite set of moments of a sample. These
methods were used to reconstruct six test examples described in Section 3. The performance
of these methods was assessed by means of the relative fitness error (maximum, mean and
sum of squares relative to the reference probability density function) comparing the
reconstructed probability function to the cumulative probability distribution of the sample, and
the accuracy obtained comparing the reconstructed probability distribution to the reference
density function. A summary of the performance evaluation for the test examples considered is
presented in Table 19 and Figure 17. Although not all the methods were used for all test
examples, these results allow reaching some interesting conclusions.
Table 19. Summary of performance (relative fitness error and accuracy) of different
reconstruction methods for the test examples considered, sorted by average accuracy

Reconstruction Method | Fitness Error Avg. (%) | Fitness Error Best (%) | Fitness Error Worst (%) | Accuracy Avg. (%) | Accuracy Best (%) | Accuracy Worst (%)
Moment-based Parametric Estimation (known distribution) | 87.2 | 17.3 | 147.5 | 96.4 | 99.5 | 90.4
Sample-based Parametric Estimation (known distribution) | 62.9 | 12.3 | 119.7 | 94.8 | 98.3 | 91.6
Moment-based Polynomial Approach | 53.8 | 10.1 | 101.4 | 92.5 | 96.2 | 89.4
Derivative of CDF | 96.1 | 6.5 | 208.1 | 91.7 | 97.7 | 84.2
Naïve Estimator | 209.2 | 1.2 | 808.0 | 85.5 | 90.8 | 77.0
Kernel Density Estimation | 140.6 | 13.1 | 600.3 | 84.9 | 96.6 | 73.8
Cubic Splines | 229.2 | 175.3 | 331.8 | 84.7 | 84.7 | 84.7
Sample-based Parametric Estimation (unknown distribution - Normal approximation) | 262.5 | 114.5 | 677.5 | 78.6 | 81.4 | 75.8
Moment-based Parametric Estimation (unknown distribution - Normal approximation) | 302.7 | 156.3 | 702.1 | 78.4 | 80.6 | 76.2
Inverse Mellin Transform | 2631.2 | 100.3 | 12993.0 | 70.1 | 100.0 | 39.3
Inverse Laplace Transform | 1557.9 | 475.2 | 3489.5 | 60.1 | 60.1 | 60.1
Clearly, the most efficient methods for reconstructing probability distributions are the
parametric estimation methods (both sample-based and moment-based), as long as the true
type of probability distribution is known. Even though it is sometimes possible to identify the
type of distribution after a careful analysis of the particular problem, this is not always the case.
Figure 17. Performance assessment of different reconstruction methods for the test examples
considered. Top: Average (green dot) and range (blue line) for all three relative fitness error
metrics. Bottom: Average (green dot) and range (blue line) for accuracy.
When the distribution is unknown, using a general distribution (such as the normal distribution)
is not the best solution. In those cases, the moment-based polynomial approach or the sample-
based derivative of the cumulative probability distribution function (CDF) are more accurate.
These results were confirmed by evaluating the benchmark examples (with unknown true
distribution), where these two polynomial methods were most likely the best performers.
The polynomial methods, however, require defining the optimal degree of the polynomial for
fitting the data. Visual inspection is probably the best approach for defining the optimal
degree. However, it is also possible to automate the search for an optimal polynomial degree.
For the moment-based polynomial approach, the maximum degree of the polynomial depends
on the number of moments available. The recommended procedure consists of starting with
only two moments (M0 and M1), fitting the polynomial and predicting the remaining moments,
and then adding moments stepwise until the sum of squared differences between known and
predicted moments stops significantly improving (see algorithm in Appendix A.9).
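The core of the moment-based polynomial approach is a small linear solve: the coefficients of ρ(x) = Σ_j a_j x^(j-1) follow from requiring ∫ x^n ρ(x) dx = M_n over [xmin, xmax], which yields the matrix elements (xmax^(n+j) − xmin^(n+j))/(n+j) used in the Appendix A.8 code. A Python sketch of that solve (the paper's implementation is in R; names here are illustrative):

```python
import numpy as np

def poly_pdf_coefficients(orders, moments, xmin, xmax):
    """Coefficients a_j of rho(x) = sum_j a_j * x**(j-1) matching the
    given moments M_n over [xmin, xmax]."""
    n = len(orders)
    A = np.zeros((n, n))
    for i, ni in enumerate(orders):
        for j in range(1, n + 1):
            # integral of x**ni * x**(j-1) over [xmin, xmax]
            A[i, j - 1] = (xmax ** (ni + j) - xmin ** (ni + j)) / (ni + j)
    return np.linalg.solve(A, np.asarray(moments, dtype=float))

# Moments of the uniform distribution on [0, 1] are M_n = 1/(n+1);
# the recovered density should be rho(x) = 1, i.e. a = [1, 0, 0].
a = poly_pdf_coefficients([0, 1, 2], [1.0, 1 / 2, 1 / 3], 0.0, 1.0)
print(np.allclose(a, [1.0, 0.0, 0.0]))  # → True
```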
For the derivative of the CDF method, the suggested procedure is the following:
1. Construct the Naïve estimator with an optimal smoothing factor (calculated as the
optimal bandwidth for a Gaussian kernel density estimator: Eq. 2.13).
2. Begin with a linear model for the CDF, fit the model parameters and calculate the
density function by differentiating. Calculate the similitude between the Naïve
estimator and the linear model.
3. Increase the degree of the polynomial model for the CDF. Fit the model and calculate
the similitude with respect to the Naïve estimator. Repeat this step until the similitude
stops significantly improving.
Although the Naïve estimator is very noisy, it tends to follow the true probability distribution of
the population. Thus, a polynomial model that is fitted to the sample CDF and remains as close
as possible to the Naïve estimator is expected to provide a good prediction of the probability
density function (see algorithm in Appendix A.9).
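The three-step procedure above can be sketched as follows. This Python version is only an outline of the idea: `numpy.polyfit` stands in for the linear regression of the R code, and the overlapping-area similitude is an assumed stand-in for the similitude metric defined in the report:

```python
import numpy as np

def naive_density(sample, grid, h):
    """Naive estimator: fraction of data within h/2 of each grid point, over h."""
    s = np.asarray(sample, dtype=float)
    return np.array([np.mean(np.abs(s - g) <= h / 2) / h for g in grid])

def similitude(f, g, dx):
    """Overlapping area between two densities on a grid (assumed metric)."""
    return np.sum(np.minimum(f, g)) * dx

def cdf_derivative_pdf(sample, max_degree=8, tol=1e-3):
    xs = np.sort(np.asarray(sample, dtype=float))
    n = len(xs)
    ps = (np.arange(1, n + 1) - 0.5) / n      # sample cumulative probability
    grid = np.linspace(xs[0], xs[-1], 201)
    dx = grid[1] - grid[0]
    h = np.std(xs, ddof=1) * (4 / (3 * n)) ** 0.2   # optimal bandwidth (Eq. 2.13)
    ref = naive_density(xs, grid, h)          # step 1: Naive reference
    best_sim, best_pdf = -np.inf, None
    for deg in range(1, max_degree + 1):      # steps 2-3: grow the CDF degree
        c = np.polyfit(xs, ps, deg)           # polynomial CDF model
        rho = np.clip(np.polyval(np.polyder(c), grid), 0.0, None)
        sim = similitude(rho, ref, dx)
        if sim <= best_sim + tol:
            break                             # similitude stopped improving
        best_sim, best_pdf = sim, rho
    return grid, best_pdf

rng = np.random.default_rng(0)
grid, rho = cdf_derivative_pdf(rng.uniform(0, 1, 40).tolist())
mass = np.sum(rho) * (grid[1] - grid[0])
print(round(mass, 2))  # total probability mass, close to 1
```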
Another important observation is that the methods based on integral transforms (inverse
Laplace transform and inverse Mellin transform methods), which provide a sound theory for
density reconstruction, in practice are poor performers. It is highly likely that the
approximation procedures to obtain the inverse transforms from data samples do not provide
enough accuracy when compared to other methods.
This work was not intended to provide an exhaustive comparison of all possible methods of
probability density reconstruction. Thus, some efficient prediction methods might have been
left aside. However, their performance could be easily assessed and compared to the methods
considered here, by reconstructing the different test examples presented in Section 3.
Acknowledgments
The author gratefully acknowledges Prof. Jaime Aguirre (Universidad Nacional de Colombia)
for helpful discussions on the topic and for reviewing the manuscript.
This research did not receive any specific grant from funding agencies in the public,
commercial, or not-for-profit sectors.
References
[1] Hernandez, H. (2018). The Realm of Randomistic Variables. ForsChem Research Reports
2018-10. doi: 10.13140/RG.2.2.29034.16326.
[2] Tarantola, A. (2005). Inverse Problem Theory and Methods for Model Parameter Estimation.
Philadelphia: SIAM.
[3] Lancaster, P., & Salkauskas, K. (1986). Curve and Surface Fitting: An Introduction. London:
Academic Press.
[4] Silverman, B. W. (1998). Density Estimation for Statistics and Data Analysis. Boca Raton:
Chapman & Hall/CRC.
[11] Hernandez, H. (2018). Expected Value, Variance and Covariance of Natural Powers of
Representative Standard Random Variables. ForsChem Research Reports 2018-08. doi:
10.13140/RG.2.2.15187.07205.
[12] Laplace, P. S. (1814). Théorie Analytique des Probabilités. 2nd Ed. Paris: Courcier.
[13] Karlsson, J., & von Sydow, B. (1976). The Convergence of Padé Approximants to Series of
Stieltjes. Arkiv för Matematik, 14, 43.
[14] Epstein, B. (1948). Some Applications of the Mellin Transform in Statistics. The Annals of
Mathematical Statistics, 19(3), 370-379.
[15] John, V., Angelov, I., Öncül, A. A., & Thévenin, D. (2007). Techniques for the Reconstruction
of a Distribution from a Finite Number of its Moments. Chemical Engineering Science, 62(11),
2890-2904.
[17] Hernandez, H. (2017). Multicomponent Molecular Collision Kinetics: Rigorous Collision Time
Distribution. ForsChem Research Reports 2017-7. doi: 10.13140/RG.2.2.26218.31689.
A.6. Algorithm for Reconstructing the PDF as the Derivative of a Polynomial CDF
CDFderiv<-function(datasample,degree=5,disp=FALSE){
#This function constructs a PD function from the input data sample
#"datasample", considering a polynomial degree ("degree") for fitting the
#sample CDF. If "disp" is set to TRUE, the polynomial coefficients and
#regression R2 will be shown, and PDF and CDF will be plotted. The output of
#this function is the PDF. It uses function "bound" defined in Appendix A.3.
#Extracting information from the sample
xmin=bound(datasample)[1] #Estimation of minimum value in population
xmax=bound(datasample)[2] #Estimation of maximum value in population
x=sort(datasample)-xmin #Sorted transformed values in the sample
N=length(x) #Sample Size
#Estimates of cumulative probability in the sample
phis=length(which(x<=x[1]))/(2*N)
for (i in 2:N){
phis[i]=(length(which(x<=x[i-1]))+length(which(x<=x[i])))/(2*N)
}
#Polynomial model of the CDF
dataf=data.frame(x,phis) #Constructing data frame
model="phis~-1+x" #Definition of initial regression model
if (degree>=2){ #For higher degree polynomials
for (i in 2:degree){
model=paste(model,"+I(x^",toString(i),")") #Increase degree of the model
}
}
phim=lm(model,data=dataf) #Regression
phicoef=phim$coefficients #Polynomial CDF coefficients
R2=1-var(phim$residuals)/var(phis) #Determination of R2
#Definition of CDF function
phif<-function(X){
phi=0 #Initialize Cumulative Probability
for (i in 1:length(phicoef)){ #For each term in the polynomial
phi=phi+phicoef[i]*((X-xmin)^i) #Add term to CP
}
#CP is zero if phi is <0, and 1 if phi is >1
phi=phi*(as.integer(phi>=0&phi<=1)+as.integer(phi>1)/phi)
return(phi)
}
#Derivative of the CDF
i=1:length(phicoef)
rhocoef=i*phicoef #Calculate PDF coefficients
#The closing lines below are a minimal reconstruction (truncated in the source)
rhof<-function(X){ #PDF as the derivative of the polynomial CDF
rho=0 #Initialize density
for (i in 1:length(rhocoef)){ #For each term in the polynomial
rho=rho+rhocoef[i]*((X-xmin)^(i-1)) #Add term to density
}
rho=rho*as.integer(X>=xmin&X<=xmax&rho>0) #Zero outside support or if negative
return(rho)
}
if (disp==TRUE){ #Display results
print(phicoef) #Polynomial CDF coefficients
print(R2) #Regression R2
X=xmin+(0:1000)*(xmax-xmin)/1000
plot(X,rhof(X),type="l",xlab="x",ylab="Probability Density")
}
return(rhof)
}
A.7. Algorithm for Reconstructing the PDF using the Naïve Estimator Method
Naive<-function(datasample,nodes=NULL,delta=NULL,disp=FALSE){
#This function constructs a PD function from the input data sample
#"datasample", using a Naïve estimator with a certain number of "nodes" and a
#smoothing factor "delta". By default, the number of nodes is the number of
#data points and the smoothing factor is the optimal bandwidth for Gaussian
#Kernel estimators. If "disp" is set to TRUE, the PDF will be plotted. The
#output of this function is the PDF.
xmin=min(datasample) #Minimum value in sample
xmax=max(datasample) #Maximum value in sample
N=length(datasample) #Sample size
sigma=sd(datasample) #Sample standard deviation
if (is.null(nodes)==TRUE) nodes=N #By default: nodes=Sample size
ndist=(xmax-xmin)/(nodes-1) #Distance between nodes
x=xmin+(0:(nodes-1))*ndist #Node values
if (is.null(delta)==TRUE) delta=sigma*(4/(3*N))^(1/5) #Default: optimal delta
#The closing lines below are a minimal reconstruction (truncated in the source)
rho=rep(0,nodes) #Initialize density at the nodes
for (i in 1:nodes){ #Fraction of the sample within delta/2 of each node
rho[i]=length(which(abs(datasample-x[i])<=delta/2))/(N*delta)
}
rhof<-function(X){ #Piecewise-constant PDF: density at the nearest node
j=pmin(pmax(round((X-xmin)/ndist)+1,1),nodes)
return(rho[j])
}
if (disp==TRUE) plot(x,rho,type="l",xlab="x",ylab="Probability Density")
return(rhof)
}
A.8. Algorithm for Reconstructing the PDF using the Moment-based Polynomial Fit Method
polyPDF<-function(moments,xmin=NULL,xmax=NULL,scale=1,disp=FALSE){
#This function constructs a PD function from the "moments" input data frame,
#containing the moment order ("n") in one column and their values ("moments")
#in another. The limits of the function ("xmin","xmax") are required inputs.
#A scale factor ("scale") can be used to transform data avoiding singularity.
#By default, "scale" is 1. If "disp" is TRUE, the PDF will be plotted and the
#polynomial coefficients are shown. The output of this function is the PDF.
#Moments can be generated using the "momentset" function given in Appendix A.2.
if (is.null(xmin)&is.null(xmax)){
print("Please input xmin and xmax estimated values")
return(NULL)
} else {
n=length(moments$n) #Number of moments available
degree=n-1 #Degree of polynomial
A=matrix(0,n,n) #Initialize matrix of coefficients
B=matrix(0,n,1) #Initialize vector of independent terms
for (i in 1:n){ #For each moment
ni=moments$n[i] #ni-th moment
for (j in 1:n){ #For each power term
A[i,j]=(((xmax*scale)^(ni+j))-((xmin*scale)^(ni+j)))/(ni+j)
}
B[i]=moments$moments[i]*scale^ni #Scale moments
}
a=as.vector(solve(A,B)) #Find coefficients
#Definition of PDF function
rhof<-function(x){
rho=a[1] #Independent term
for (i in 2:(degree+1)){
rho=rho+a[i]*(x*scale)^(i-1) #Polynomial terms
}
#Closing lines reconstructed: scale back to x and truncate negative values
rho=rho*scale*as.integer(x>=xmin&x<=xmax&rho>0)
return(rho)
}
if (disp==TRUE){ #Show coefficients and plot the PDF
print(a)
X=xmin+(0:1000)*(xmax-xmin)/1000
plot(X,rhof(X),type="l",xlab="x",ylab="Probability Density")
}
return(rhof)
}
}

A.9. Algorithm for Automatic Reconstruction of the PDF (only the tail of this function is preserved here; its opening lines are missing)

optdegree=i
}
}
rhof=CDFderiv(datasample,degree=optdegree,disp=disp)
return(rhof)
}
}
if (method=="moments"){
if (is.null(moments)==TRUE){
if (is.null(datasample)==TRUE){
print("Please input a data sample as a vector or a moments data frame")
return(NULL)
}
moments=momentset(datasample,maxdegree) #Calculate moments from sample
}
if (is.null(xmin)&is.null(xmax)){
print("Please input xmin and xmax estimated values")
return(NULL)
}
SSMbest=Inf #Initialize best sum of squared differences
for (mc in 2:min(nrow(moments),maxdegree)){
testmoments=moments[1:mc,] #Set of test moments
#Identify polynomial PDF
rhof=polyPDF(testmoments,xmin=xmin,xmax=xmax,scale=scale,disp=FALSE)
predM=testmoments$moments #Initialize set of predicted moments
i=1:1001
y=xmin+(xmax-xmin)*(i-1)/1000 #Definition of evaluation points
rho=rhof(y) #Density at evaluation points
SSM=0 #Initialize sum of squared differences
for (i in 1:nrow(moments)){ #For each moment
predM[i]=sum(y^moments$n[i]*rho)*(xmax-xmin)/1000
if (moments$n[i]!=0){
SSM=SSM+(moments$moments[i]^(1/moments$n[i])-
predM[i]^(1/moments$n[i]))^2
}
}
if (SSM<(1-tol)*SSMbest){ #If SSM improves update optimum
SSMbest=SSM
mombest=testmoments
predMbest=predM
}
}
if (disp==TRUE){
plot(moments$n,moments$moments,col="red",
xlab="Moment order",ylab="Moment value")
points(moments$n,predMbest,col="blue")
}
rhof=polyPDF(mombest,xmin=xmin,xmax=xmax,scale=scale,disp=disp)
return(rhof)
} else {
print("Please input a valid method: sample or moments")
return(NULL)
}
}