
2018-12

Comparison of Methods for the Reconstruction of Probability Density Functions from Data Samples

Hugo Hernandez
ForsChem Research, 050030 Medellin, Colombia
hugo.hernandez@forschem.org

doi: 10.13140/RG.2.2.30177.35686

Abstract

Perhaps the best practice for predicting the outcome of any observable process is constructing robust mathematical models considering its most relevant factors. However, noise and randomness, caused by the remaining factors not included in the model, will always be present. The behavior of random and randomistic variables (in a more general sense) can be mathematically described by the probability density function (PDF). It is therefore desirable to obtain the PDF of a measured variable after a finite sample of data has been obtained. The identification of density functions fitting the data sample is denoted here as the reconstruction of the PDF. Such reconstruction is an inverse problem, since many different PDFs can satisfactorily describe the sample obtained. Furthermore, sampling always incorporates an inherent error, given that the behavior of the sample may differ from the behavior of the population, especially for small sample sizes. Thus, reconstructing PDFs is quite a challenging task. Different reconstruction methods, based either on the sample cumulative probability distribution or on the sample moments, are described, and their performance is evaluated considering six different sets of data. These test examples are samples obtained from populations with known probability distributions, allowing the prediction capability of the reconstruction methods to be assessed. If the type of distribution is known a priori, parametric reconstruction methods are found to be the best alternative. However, for unknown distributions, polynomial reconstruction methods provided good approximations for all cases considered. A selection of algorithms (in R language) used in the present work is included in the Appendix.

Keywords

Cumulative Probability, Inverse Problems, Moments, Polynomial Fitting, Probability Density Functions, Random Distributions, Reconstruction, Sampling.


1. Introduction

Randomistic variables are variables that can be measured several times under identical
conditions.[1] By repeating the measurement of a variable, a particular set of measured values
is obtained. Such a set is called a sample, and the procedure used to obtain the set is called sampling. If the set contains all possible measurements, it is denoted as a population, and the
procedure for obtaining the set is the census. Since some values may be repeated, their relative occurrence frequencies in the population describe the corresponding probability distribution $P(x)$ of the randomistic variable.

In addition, it is possible to define the cumulative probability distribution $F(x)$ of a randomistic variable as:

$F(x) = P(X \leq x)$
(1.1)

where $x$ represents any particular measured value of the randomistic variable $X$.

Both the probability distribution function $P(x)$ and the cumulative probability distribution function $F(x)$ can be considered as fingerprints of the corresponding randomistic variable.

There are, however, two main difficulties with the definition of the probability distribution of a
population:

1) When the number of different outcomes is large, the magnitude of the relative frequency of each individual outcome value tends to be negligible. This is particularly the case for continuous variables.
2) When the number of different possible measurements is large, limitless§ or infinite,
then a census is practically impossible. In this case, the probability distribution of the
variable cannot be known with absolute certainty.

It is possible to overcome the first issue by defining the probability density function of a randomistic variable $X$ as:

$f(x) = \lim_{\Delta x \to 0} \dfrac{\Delta F(x)}{\Delta x}$
(1.2)

where $\Delta F(x) = F\!\left(x + \frac{\Delta x}{2}\right) - F\!\left(x - \frac{\Delta x}{2}\right)$ represents the central finite difference of $F$ about $x$ with a step $\Delta x$.

§
Let us consider flipping a coin, for example. The total possible number of times that anyone can flip (or
could have flipped) a coin is practically limitless. If anyone is not flipping a coin right now, then the set of
results is already incomplete and cannot be considered as a population.


Since

$P\!\left(x - \frac{\Delta x}{2} < X \leq x + \frac{\Delta x}{2}\right) = \Delta F(x)$
(1.3)

then it can be concluded that the relative frequency of occurrence of a certain value $x$ is proportional to the non-negligible probability density $f(x)$:

$P\!\left(x - \frac{\Delta x}{2} < X \leq x + \frac{\Delta x}{2}\right) \approx f(x)\,\Delta x$
(1.4)

From Eq. (1.1) to (1.4) it can be concluded that:

$F(x) = \int_{-\infty}^{x} f(u)\,du$
(1.5)

On the other hand, the term inside the limit presented in Eq. (1.2) can be considered as a finite probability density [1] about $x$ with step $\Delta x$:

$f_{\Delta x}(x) = \dfrac{\Delta F(x)}{\Delta x}$
(1.6)

and therefore:

$P\!\left(x - \frac{\Delta x}{2} < X \leq x + \frac{\Delta x}{2}\right) = f_{\Delta x}(x)\,\Delta x$
(1.7)

$F(x) = \sum_{x_i \leq x} f_{\Delta x}(x_i)\,\Delta x$
(1.8)

The second issue mentioned earlier, regarding the impossibility of determining the probability distributions of most populations with absolute certainty, is solved by means of estimation.
Thus, the probability distribution is estimated from the data available in a sample of the
population of interest. The estimation procedure will be denoted as a reconstruction of the
probability distribution. The reconstruction of the probability distribution can be considered as
the inverse of sampling. In fact, probability distribution reconstruction is an inverse problem,[2]
where measurements are used to infer the values of parameters that characterize the system.
As with any other inverse problem, the reconstruction of the probability distribution will not
lead to a unique answer. Furthermore, it is possible that the answer obtained is not the correct
one. Thus, different reconstruction methods will be compared in order to assess their accuracy.
Particularly, we will be focusing in this report on the reconstruction of probability density
functions of continuous variables. This case is selected because it is more sensitive to the
particular set of values obtained in the sample, and therefore, it is more challenging.


2. Description of Methods for the Reconstruction of Probability Density Functions

Two main types of reconstruction methods will be considered:

1) Reconstruction of probability density functions from the cumulative probability distribution of data samples.
2) Reconstruction of probability density functions from the moments of the distribution,
which can also be estimated from data samples.

2.1. Reconstruction Methods based on the Cumulative Probability Distribution

Eq. (1.2) shows the direct relationship between the cumulative probability distribution and the probability density function. Methods of this type use this relationship to determine the probability density function of a population from the cumulative probability distribution of a sample.

2.1.1. Derivative of the Cumulative Probability Distribution Function (CDF)

This is perhaps the most direct use of Eq. (1.2). However, it requires a reliable function describing the cumulative probability distribution of the population. Since such a function is not available, an estimated differentiable cumulative probability function should be obtained from the data sample. The main difficulty of this method is that the data sample is discrete (the number of different outcomes is finite) and the sample size is limited. Thus, the cumulative probability function should be obtained by regression or by determining interpolation polynomials from the data in the sample.

The data sample consists of $N$ measurements representing $m$ different outcomes ($m \leq N$). Each of the outcomes is repeated $n_i$ times, where $i$ is the rank of each outcome ordered from lowest to highest. Thus,

$\sum_{i=1}^{m} n_i = N$
(2.1)

The observed probability of occurrence for each outcome obtained in the sample is:

$\tilde{P}(x_i) = \dfrac{n_i}{N}$
(2.2)


and therefore, the observed cumulative probability of the outcomes in the sample is:

$\tilde{F}(x_i) = \sum_{j=1}^{i} \tilde{P}(x_j) = \dfrac{1}{N}\sum_{j=1}^{i} n_j$
(2.3)

where $\tilde{X}$ is a randomistic variable representing any measurement in the sample.

Assuming a non-biased sampling procedure, it is expected that the cumulative probability $F$ for the corresponding population, evaluated at $x_i$, be bounded by:

$\tilde{F}(x_{i-1}) \leq F(x_i) \leq \tilde{F}(x_i)$
(2.4)

By choosing the center of the interval as an estimate of the cumulative probability, a set of cumulative probability estimates $\hat{F}(x_i)$ for each measurement outcome will be obtained, where:

$\hat{F}(x_i) = \begin{cases} \dfrac{\tilde{F}(x_1)}{2} & i = 1 \\ \dfrac{\tilde{F}(x_{i-1}) + \tilde{F}(x_i)}{2} & i > 1 \end{cases}$
(2.5)

The set of cumulative probability estimates $\hat{F}(x_i)$ vs. measurements $x_i$ is then used for fitting a curve using any suitable method (for example, those presented in [3], including polynomial interpolation, cubic splines, moving least squares, etc.). Thus, the cumulative probability estimates are approximated by an arbitrary function $g$:

$\hat{F}(x) \approx g(x)$
(2.6)

And therefore,

$\hat{f}(x) = \dfrac{dg(x)}{dx}$
(2.7)

Particularly for this comparison, only a polynomial approximation of $g(x)$ will be considered. An algorithm programmed in R language (https://www.r-project.org/) is presented in Appendix A.6 for finding the degree of the polynomial that best fits the data sample by least-squares minimization.
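For illustration, a minimal R sketch of this procedure might look as follows (this is not the exact code of Appendix A.6; the function name and the fixed polynomial degree are assumptions made for this example):

# Estimate f(x) by differentiating a least-squares polynomial fit g(x)
# of the cumulative probability estimates of Eqs. (2.5)-(2.7).
pdf.from.cdf.poly <- function(x, deg = 3) {
  x <- sort(x)
  N <- length(x)
  Ftilde <- (1:N) / N                        # observed cumulative probability
  Fhat <- (c(0, Ftilde[-N]) + Ftilde) / 2    # centered estimates (Eq. 2.5)
  fit <- lm(Fhat ~ poly(x, deg, raw = TRUE)) # least-squares polynomial g(x)
  cf <- coef(fit)
  dcf <- cf[-1] * seq_len(deg)               # coefficients of g'(x)
  function(t) pmax(0, sapply(t, function(ti)
    sum(dcf * ti^(0:(deg - 1)))))            # f.hat(t) = g'(t), truncated at 0
}

# Example: fhat <- pdf.from.cdf.poly(runif(30), deg = 3); fhat(0.5)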


2.1.2. Finite Probability Density Approximation (Naïve Estimator)

This method is based on the definition previously presented in Eq. (1.6). The assumption is that:

$\hat{f}(x) = \dfrac{\Delta \tilde{F}(x)}{\Delta x} \approx f(x)$
(2.8)

where $\hat{f}$ represents an estimate of the probability density function, and $\Delta \tilde{F}(x)$ is the central finite difference of the sample cumulative probability with step $\Delta x$. Thus, by selecting a value of $\Delta x$, it is possible to obtain a set of estimates $\hat{f}$ for different values of $x$, which can then be used to fit a curve using any suitable method. Clearly, the choice of $\Delta x$ will have a significant effect on the results obtained. Although smaller values of $\Delta x$ are desirable for a better estimation of $f$, for small data samples they lead to highly noisy results. Smoother functions can then be obtained using larger values of $\Delta x$, at the risk of reducing the accuracy of the estimation. This method is also known as the Naïve Estimator.[4] An algorithm programmed in R is presented in Appendix A.7 for obtaining Naïve estimators from data samples.
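As an illustration of Eq. (2.8) (a sketch only, not the exact Appendix A.7 code), a naïve estimator can be written in a few lines of R using the empirical cumulative distribution:

# Naive estimator (Eq. 2.8): finite difference of the empirical CDF.
naive.estimator <- function(x, dx, nodes = 30) {
  Fe <- ecdf(x)                                 # empirical cumulative probability
  t <- seq(min(x), max(x), length.out = nodes)  # evaluation nodes
  list(t = t, fhat = (Fe(t + dx / 2) - Fe(t - dx / 2)) / dx)
}

# Example: est <- naive.estimator(runif(30), dx = 0.1); plot(est$t, est$fhat, type = "h")

Larger values of dx smooth the estimate at the expense of resolution, as discussed above.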

2.1.3. Pre-defined Probability Density Functions (Sample-based Parametric Estimation)

This method assumes that the variable $X$ can be expressed as a function of an arbitrary standard random variable $Z$, as follows:[5]

$X = a + b \cdot Z$
(2.9)

where $a$ and $b$ are transformation parameters. There are three basic types of standard random variables, summarized in Table 1.

Many different standard random variables can be defined, either non-parametric standard random variables (for example, Gaussian, Uniform, Exponential, etc.) or parametric standard random variables (for example, Student's t, Fisher's F, Gamma, Weibull, etc.). Each of these standard random variables has pre-defined probability density ($f_Z$) and cumulative probability ($F_Z$) functions.** Thus, the parameters $a$ and $b$, and additional standard random parameters ($p$) if any, are estimated by fitting the cumulative probability estimates $\hat{F}(x_i)$ to the pre-defined cumulative probability function of the selected standard random variable.

**
See Table 2 in [6] for some examples.


Table 1. Types and Properties of Standard Random Variables (taken from [6]). For each type of standard random variable (I, II and III), the table lists its parameters, the defining properties of $Z$, the bounds of $Z$, and the bounds of the transformed variable $X$.

Thus,

$\hat{F}(x) = F_Z\!\left(\dfrac{x - \hat{a}}{\hat{b}}, \hat{p}\right)$
(2.10)

and

$\hat{f}(x) = \dfrac{1}{\hat{b}}\, f_Z\!\left(\dfrac{x - \hat{a}}{\hat{b}}, \hat{p}\right)$
(2.11)

where $\hat{a}$, $\hat{b}$ and $\hat{p}$ are the estimated parameters obtained by any suitable optimization procedure. Furthermore, the best standard reference function can also be optimized for a particular data sample.[6]
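As a sketch of this idea (assuming, for illustration only, a Gaussian standard variable $Z$; the function name and optimizer settings are not taken from the original work), the parameters can be estimated by least squares on the cumulative probability estimates:

# Fit a (location) and b (scale) of X = a + b*Z, Z standard normal,
# by minimizing the squared differences between F.hat (Eq. 2.5) and
# the model CDF (Eq. 2.10).
fit.parametric <- function(x) {
  x <- sort(x); N <- length(x)
  Ftilde <- (1:N) / N
  Fhat <- (c(0, Ftilde[-N]) + Ftilde) / 2
  sse <- function(par) sum((Fhat - pnorm(x, mean = par[1], sd = abs(par[2])))^2)
  opt <- optim(c(mean(x), sd(x)), sse)          # Nelder-Mead search
  c(a.hat = opt$par[1], b.hat = abs(opt$par[2]))
}

# Example: fit.parametric(rnorm(50, mean = 2, sd = 3))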

2.1.4. Kernel Density Estimators

The kernel density approach assumes that the cumulative probability function can be described as the average cumulative probability of different kernels (usually with one kernel representing each measurement in the sample). The kernels are predefined functions, usually but not necessarily symmetric and unimodal.[7] Gaussian and Epanechnikov [8] (quadratic) kernels are normally preferred.

The probability density function is then estimated as:


$\hat{f}(x) = \dfrac{1}{Nh}\sum_{i=1}^{N} K\!\left(\dfrac{x - x_i}{h}\right)$
(2.12)

where $N$ is the sample size, $K$ represents the kernel function, and $h$ is the smoothing or bandwidth†† coefficient. The bandwidth also determines the smoothness and accuracy of the estimator, although in opposite directions. Thus, the use of small values of $h$ improves accuracy but also increases noise. Unfortunately, the selection of the bandwidth is subjective. However, some rules of thumb are available for specific kernel functions. For example, for Gaussian kernels the ideal bandwidth is:[4]

$h = \left(\dfrac{4}{3N}\right)^{1/5} \hat{\sigma}$
(2.13)

where $\hat{\sigma}$ is the sample standard deviation.

Kernels are themselves probability density functions. Particularly, the Gaussian kernel is represented by the standard normal probability density function:

$K(u) = \dfrac{1}{\sqrt{2\pi}}\, e^{-u^2/2}$
(2.14)

On the other hand, Epanechnikov's optimum kernel is expressed as:[8]

$K(u) = \begin{cases} \dfrac{3}{4\sqrt{5}}\left(1 - \dfrac{u^2}{5}\right) & |u| \leq \sqrt{5} \\ 0 & |u| > \sqrt{5} \end{cases}$
(2.15)
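A minimal R sketch of the kernel density estimator of Eq. (2.12), combining the two kernels above with the rule-of-thumb bandwidth of Eq. (2.13), might look as follows (illustrative only, not the implementation used in this work):

# Kernel density estimate at points t, with Gaussian or Epanechnikov kernel.
kde <- function(x, t, h = (4 / (3 * length(x)))^(1 / 5) * sd(x),
                kernel = c("gaussian", "epanechnikov")) {
  kernel <- match.arg(kernel)
  K <- switch(kernel,
              gaussian     = function(u) dnorm(u),                     # Eq. (2.14)
              epanechnikov = function(u) ifelse(abs(u) <= sqrt(5),
                               3 / (4 * sqrt(5)) * (1 - u^2 / 5), 0))  # Eq. (2.15)
  sapply(t, function(ti) mean(K((ti - x) / h)) / h)  # (1/(N*h)) sum K((t-xi)/h)
}

# Example: t <- seq(-3, 3, 0.05); plot(t, kde(rnorm(50), t), type = "l")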

2.2. Reconstruction Methods based on the Moments of the Distribution

As an alternative to the use of the cumulative probability distribution of the data sample, it is
also possible to estimate the probability density function of the population from the particular
distribution moments observed in the sample. These methods are useful for example in
randomistic optimization.[9] In this case, the moments of random variables are the decision
variables, and their values are obtained after solving the optimization problem. Thus, it is

††
The bandwidth of the kernel is analogous to the bin width in a histogram.


necessary to reconstruct the distribution from the moments in order to obtain a complete
description of the randomistic variable of interest.

The moments of the probability distribution of $X$ are defined as follows:[1]

$M_k(X) = E(X^k) = \int_{-\infty}^{\infty} x^k f(x)\,dx$
(2.16)

where $M_k$ is the $k$-th moment operator and $E$ is the expected value operator.

On the other hand, the moments observed in a sample $\tilde{X}$ of size $N$ can be determined as:

$M_k(\tilde{X}) = \dfrac{1}{N}\sum_{i=1}^{N} x_i^k$
(2.17)

Thus, the basic concept behind this type of methods is that the moments of the sample can be considered as estimates of the moments of the probability distribution of the population, and therefore:

$\hat{M}_k(X) = M_k(\tilde{X})$
(2.18)

And by combining Eq. (2.16) to (2.18), it is found that:

$\int_{-\infty}^{\infty} x^k \hat{f}(x)\,dx = \dfrac{1}{N}\sum_{i=1}^{N} x_i^k$
(2.19)

The problem consists in obtaining the estimate of the probability density function $\hat{f}(x)$ from the available measurements $x_i$. This inverse problem can be solved using different approaches.

2.2.1. Moment-based Parametric Estimation

This method is analogous to the method presented in Section 2.1.3 for parametric estimation based on standard random variables. Similarly, the transformation presented in Eq. (2.9) is used, along with a predefined standard random probability density function $f_Z$. The difference lies in that the parameters ($\hat{a}$, $\hat{b}$, $\hat{p}$) are estimated from the sample moments and not


from the cumulative probability of the sample. At least one distribution moment (different from the zero-th moment‡‡) must be used for each parameter to be identified. By using a larger number of moments, better estimates of the parameters can be obtained and the assumed standard distribution can be validated.

Eq. (2.19) becomes in this case:

$\int_{-\infty}^{\infty} \dfrac{x^k}{\hat{b}}\, f_Z\!\left(\dfrac{x - \hat{a}}{\hat{b}}, \hat{p}\right) dx = \dfrac{1}{N}\sum_{i=1}^{N} x_i^k$
(2.20)

Please notice that, in principle, $k$ can be any non-zero value (integer or not), as long as the integral on the left-hand side of the equation has a convergent solution and the sum on the right-hand side of the equation is not indeterminate. That is why only positive values of $k$ are preferred. The term $\hat{b}$ dividing inside the integral appears by applying the change of variable theorem.[10]

The integral in Eq. (2.20) can also be expressed as:

$\int_{-\infty}^{\infty} (\hat{a} + \hat{b}z)^k f_Z(z, \hat{p})\,dz = \sum_{j=0}^{k} \binom{k}{j}\, \hat{a}^{k-j}\, \hat{b}^{j} \int_{-\infty}^{\infty} z^j f_Z(z, \hat{p})\,dz = \sum_{j=0}^{k} \binom{k}{j}\, \hat{a}^{k-j}\, \hat{b}^{j}\, M_j(Z(\hat{p}))$
(2.21)

where $z$ represents any realization of the standard random variable $Z(\hat{p})$.

Therefore,

$M_k(\tilde{X}) = \sum_{j=0}^{k} \binom{k}{j}\, \hat{a}^{k-j}\, \hat{b}^{j}\, M_j(Z(\hat{p}))$
(2.22)

where $M_j(Z(\hat{p}))$ represent the moments of the standard random distribution. These moments can be obtained by integration using the corresponding probability density function of the standard random variable, or they can also be found in tables.[11]

Eq. (2.22) represents a set of nonlinear algebraic equations where $\hat{a}$, $\hat{b}$ and $\hat{p}$ are the unknowns. If the number of parameters is the same as the number of different moments considered (other than the zero-th moment), a single solution is obtained. If more moments than

‡‡
The zero-th moment cannot be used because it yields a trivial, useless result.


parameters are used, a cost function is defined (e.g. the sum of squared differences between $\sqrt[k]{M_k(\tilde{X})}$ and $\sqrt[k]{\sum_{j=0}^{k} \binom{k}{j} \hat{a}^{k-j} \hat{b}^{j} M_j(Z(\hat{p}))}$), and the best set of parameter estimates is obtained by optimization. Then again,

$\hat{f}(x) = \dfrac{1}{\hat{b}}\, f_Z\!\left(\dfrac{x - \hat{a}}{\hat{b}}, \hat{p}\right)$
(2.23)

In particular, by using the transformation (2.9) with a Type I standard random variable $Z$, considering only the first and second moments it can be found that:

$\int_{-\infty}^{\infty} \dfrac{x}{\hat{b}}\, f_Z\!\left(\dfrac{x - \hat{a}}{\hat{b}}\right) dx = \hat{a} = \dfrac{1}{N}\sum_{i=1}^{N} x_i$
(2.24)

$\int_{-\infty}^{\infty} \dfrac{x^2}{\hat{b}}\, f_Z\!\left(\dfrac{x - \hat{a}}{\hat{b}}\right) dx = \hat{a}^2 + \hat{b}^2 = \dfrac{1}{N}\sum_{i=1}^{N} x_i^2$
(2.25)

which can be expressed as:

$\hat{b} = \sqrt{\left(\dfrac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \left(\dfrac{1}{N}\sum_{i=1}^{N} x_i\right)^2}$
(2.26)
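For a Type I standard variable, Eqs. (2.24)-(2.26) reduce parameter estimation to the sample mean and standard deviation. A minimal R sketch (assuming, for illustration, a standard normal reference $Z$; not the exact implementation of this work):

# Moment-based parametric estimation with a Type I reference variable.
moment.fit <- function(x) {
  a.hat <- mean(x)                        # Eq. (2.24)
  b.hat <- sqrt(mean(x^2) - mean(x)^2)    # Eq. (2.26)
  # Reconstructed density (Eq. 2.23), here with a standard normal f_Z:
  fhat <- function(t) dnorm((t - a.hat) / b.hat) / b.hat
  list(a = a.hat, b = b.hat, fhat = fhat)
}

# Example: fit <- moment.fit(rnorm(50, 2, 3)); fit$fhat(2)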

2.2.2. Inverse Laplace Transform Approach

Let us now assume that we apply the bilateral Laplace transform (introduced by Laplace [12] in his study of probabilities) to the true probability density function of $X$:

$\mathcal{L}_B\{f(x)\} = \int_{-\infty}^{\infty} e^{-sx} f(x)\,dx$
(2.27)

Since the exponential term can be expanded as an infinite sum, Eq. (2.27) becomes:

$\mathcal{L}_B\{f(x)\} = \int_{-\infty}^{\infty} \sum_{k=0}^{\infty} \dfrac{(-sx)^k}{k!}\, f(x)\,dx = \sum_{k=0}^{\infty} \dfrac{(-s)^k}{k!} \int_{-\infty}^{\infty} x^k f(x)\,dx = \sum_{k=0}^{\infty} \dfrac{(-1)^k s^k}{k!}\, M_k(X)$
(2.28)


Thus, the bilateral Laplace transform of the probability density function can be expressed in terms of the non-negative integer moments of the distribution. Furthermore,

$f(x) = \mathcal{L}_B^{-1}\left\{\sum_{k=0}^{\infty} \dfrac{(-1)^k s^k}{k!}\, M_k(X)\right\}$
(2.29)

where $\mathcal{L}_B^{-1}$ represents the inverse bilateral Laplace transform operator.

By using Eq. (2.18), and truncating the infinite sum at a maximum odd moment $k_{max}$, it is possible to estimate the probability density function as:

$\hat{f}(x) = \mathcal{L}_B^{-1}\left\{\sum_{k=0}^{k_{max}} \dfrac{(-1)^k s^k}{k!}\, M_k(\tilde{X})\right\}$
(2.30)

The main difficulty of this approach is that inverse bilateral Laplace transforms are not as commonly known as (unilateral) inverse Laplace transforms. Thus, an expression in terms of the unilateral Laplace transform would be desirable. The problem is that unilateral Laplace transforms only consider the positive values of $x$.

For that purpose, let us then consider the following variable transform:§§

$Y = X - \min(\tilde{X}) + \delta$
(2.31)

In this case, $Y$ is always positive and the unilateral Laplace transform can be used on its probability density function:

$\mathcal{L}\{f_Y(y)\} = \int_{0}^{\infty} e^{-sy} f_Y(y)\,dy$
(2.32)

Thus, proceeding similarly as before:

$\hat{f}_Y(y) = \mathcal{L}^{-1}\left\{\sum_{k=0}^{k_{max}} \dfrac{(-1)^k s^k}{k!}\, M_k(\tilde{Y})\right\}, \quad y \geq 0$
(2.33)

and for negative values of $y$, $\hat{f}_Y(y) = 0$.

§§
The last term ($\delta$) in Eq. (2.31) provides a simple estimate of how far the true population minimum would be from the sample minimum. Alternatively, it would be possible to consider that the minimum value of the population is exactly the minimum value in the sample, or to estimate the population minimum by a different method.


Unfortunately, the inverse Laplace transform of the power series of $s$ results in a function of derivatives of Dirac's $\delta$, and that is not a suitable solution for our purposes. Therefore, in this work, Padé approximants will be used:[13]

$\sum_{k=0}^{k_{max}} \dfrac{(-1)^k M_k(\tilde{Y})}{k!}\, s^k \approx [m/n](s) = \dfrac{\sum_{j=0}^{m} p_j s^j}{\sum_{j=0}^{n} q_j s^j}$
(2.34)

where $p_j$, $q_j$ are the corresponding coefficients of the Padé approximant, $m$ is the order of the polynomial in the numerator, and $n$ is the order of the polynomial in the denominator. Then,

$\hat{f}_Y(y) = \mathcal{L}^{-1}\{[m/n](s)\}$
(2.35)

Now, from the change of variable theorem:[10]

$\hat{f}(x) = \hat{f}_Y\!\left(x - \min(\tilde{X}) + \delta\right)$
(2.36)
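The Padé coefficients of Eq. (2.34) can be obtained by solving a small linear system. The following R lines sketch one workable way to do this (a minimal illustration, not necessarily the procedure used in this work; the inverse Laplace transform of the resulting rational function must still be obtained analytically or symbolically):

# [m/n] Pade approximant from series coefficients cc = (c_0, c_1, ...),
# where c_k = (-1)^k M_k(Y) / k!. Normalization q_0 = 1 is used.
pade <- function(cc, m, n) {
  cpad <- function(i) {                       # c_i, zero outside the range
    v <- numeric(length(i)); ok <- i >= 0 & i < length(cc)
    v[ok] <- cc[i[ok] + 1]; v
  }
  # Denominator: sum_{j=1..n} q_j c_{m+k-j} = -c_{m+k}, for k = 1..n
  A <- outer(1:n, 1:n, function(k, j) cpad(m + k - j))
  q <- c(1, solve(A, -cpad(m + (1:n))))
  # Numerator: p_j = sum_{i=0..j} c_i q_{j-i}, for j = 0..m
  qp <- c(q, rep(0, max(0, m + 1 - length(q))))
  p <- sapply(0:m, function(j) sum(cpad(0:j) * rev(qp[seq_len(j + 1)])))
  list(p = p, q = q)
}

# Check with exp(-s): pade(c(1, -1, 1/2, -1/6), 1, 1)
# gives p = (1, -0.5), q = (1, 0.5), i.e. (1 - s/2) / (1 + s/2).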

2.2.3. Inverse Mellin Transform Approach

Although Laplace integral transforms can be used, the truncation of the infinite series expansion of the exponential might lead to significant errors in the estimation of the probability density function. Thus, it is possible to alternatively consider the Mellin integral transform $\mathcal{M}$ [14] of the probability density function $f_Y(y)$ (after the variable change given in Eq. 2.31):

$\mathcal{M}\{f_Y(y)\}(s) = \int_{0}^{\infty} y^{s-1} f_Y(y)\,dy = M_{s-1}(Y)$
(2.37)

Thus, the Mellin transform of the positive variable $Y$ corresponds to the function describing the moments of the distribution, given that $\mathcal{M}\{f_Y\}(s) = E(Y^{s-1}) = M_{s-1}(Y)$.

Then, from Eq. (2.18):

$\hat{f}_Y(y) = \mathcal{M}^{-1}\{M_{s-1}(\tilde{Y})\} = \dfrac{1}{2\pi i}\int_{c - i\infty}^{c + i\infty} y^{-s}\, M_{s-1}(\tilde{Y})\,ds$
(2.38)


where $i$ represents the unit imaginary number, and $c$ is an arbitrary constant.

This approach requires finding a suitable function $M_{s-1}(\tilde{Y})$ describing the observed moments of the transformed data sample $\tilde{Y}$, and using the Mellin inversion formula presented in Eq. (2.38). Then Eq. (2.36) can be used to estimate the probability density function of $X$.

2.2.4. Cubic Splines Approach

This method consists in approximating the probability density function by a cubic spline with internal nodes*** in the interval $[x_{min}, x_{max}]$, as follows:

$f(x) = \sum_{j=0}^{3} c_{ij}\,(x - x_i)^j, \quad x_i \leq x < x_{i+1}$
(2.39)

where the $c_{ij}$ are constant coefficients that must satisfy the three continuity conditions of cubic splines (continuity in the function and its first two derivatives).

Then, the $k$-th moment of the distribution is calculated as:

$M_k(X) = \int_{x_{min}}^{x_{max}} x^k f(x)\,dx = \sum_{i}\sum_{j=0}^{3} c_{ij}\int_{x_i}^{x_{i+1}} x^k (x - x_i)^j\,dx = \sum_{i}\sum_{j=0}^{3} c_{ij}\sum_{l=0}^{j}\binom{j}{l}(-x_i)^{j-l}\,\dfrac{x_{i+1}^{k+l+1} - x_i^{k+l+1}}{k+l+1}$
(2.40)

Thus, by using different moment estimates of the distribution, along with the spline continuity conditions, it is possible to estimate all coefficients. The zero-th moment, which is always equal to 1, can be used.

This method may easily result in negative values of the probability density function, and/or ill-conditioned systems of equations. In those cases, an iterative solution of the problem must be performed, starting from an initial guess of the probability density function and the measurements interval. More details on this method and its numerical implementation using the iterative procedure can be found in [15]. For the present report, the iteration was performed by changing the estimated values for the probability density at the minimum and maximum elements of the sample.

***
If real measurements are available, they can be used as nodes, as long as enough moment estimates are available. Otherwise, an equidistant partition of the interval, considering the number of available moments, should be performed.

2.2.5. General Polynomial Approach

A variation of the previous method assumes that the probability density function can be described by a single polynomial expression of order $n$, as follows:

$f(x) = \sum_{j=0}^{n} c_j\, x^j$
(2.41)

The $k$-th moment will then be given by:

$M_k(X) = \int_{x_{min}}^{x_{max}} x^k \sum_{j=0}^{n} c_j\, x^j\,dx = \sum_{j=0}^{n} c_j\, \dfrac{x_{max}^{k+j+1} - x_{min}^{k+j+1}}{k+j+1}$
(2.42)

If the measurement limits ($x_{min}$, $x_{max}$) are known, then by considering $n+1$ different moments it is possible to estimate all $n+1$ coefficients of the polynomial from the linear system of equations obtained. The algorithm in R for this type of polynomial fit is presented in Appendix A.8, and a minimal sketch of the idea is shown below. If the limits are unknown, additional moments (at least one for each unknown limit) should be considered, and the problem must be solved by optimization, finding both the optimal limits and the polynomial coefficients.
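A minimal R sketch of this linear solve (not the Appendix A.8 code, which additionally allows scaling of the input moments; see Section 4.3):

# Polynomial density from moments (Eqs. 2.41-2.42), known limits.
# mom = c(M0, M1, ..., Mn) gives n+1 equations for c_0..c_n.
poly.from.moments <- function(mom, xmin, xmax) {
  n <- length(mom) - 1
  A <- outer(0:n, 0:n, function(k, j)
    (xmax^(k + j + 1) - xmin^(k + j + 1)) / (k + j + 1))
  cc <- solve(A, mom)                     # polynomial coefficients
  function(t) sapply(t, function(ti) sum(cc * ti^(0:n)))
}

# Example (uniform on [0,1], where M_k = 1/(k+1)):
# fhat <- poly.from.moments(c(1, 1/2, 1/3), 0, 1); fhat(0.5)  # ~1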

3. Evaluation Methodology and Description of Test Examples

The performance of the different methods presented in Section 2 for estimating the probability density function from a data sample is tested using 6 different test examples and 2 benchmark examples. The test examples, sampled from known probability distribution functions, are used for determining the fitness and accuracy obtained with the different reconstruction methods. The benchmark examples, taken from the literature,[4] are used only to assess the fitness and the similitude of the different reconstruction methods.


3.1. Sampling Procedure for Test Examples

Each of the test examples has a known probability density function, which is the reference for the accuracy assessment. Then, for each probability density function, the cumulative probability function is determined using Eq. (1.5). Afterwards, two groups of different uniform random numbers (between 0 and 1) are obtained for each test function. These random numbers are true random numbers obtained from atmospheric noise using a radio signal at an unused broadcasting frequency, together with a skew correction algorithm.[16] Particularly, these random numbers are obtained from the service provided at random.org.††† For each of the uniform random numbers ($u_i$) of the first group, a corresponding measurement value ($x_i$) is obtained from the cumulative probability distribution, as follows:

$x_i = F^{-1}(u_i)$
(3.1)

where $F^{-1}$ represents the inverse cumulative probability function. If the inverse cannot be explicitly obtained, then the value is obtained by an iterative search method, for example by minimizing the function $(F(x_i) - u_i)^2$, as sketched below.

The second group of uniform random numbers is used to sort the measurement values obtained from the first group of random numbers. Then, the measurements whose random numbers in the second group have the lowest values represent the data sample. The whole set of 10.000 measurements is used to visually verify that the available data follows the pre-defined probability distribution.
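A minimal R sketch of this sampling step (using optimize() as one possible iterative search; illustrative, not the exact procedure used here):

# Inverse-CDF sampling (Eq. 3.1). When F has no explicit inverse,
# each x_i is found by minimizing (F(x) - u_i)^2 over a search interval.
sample.from.cdf <- function(Fcdf, u, lower, upper) {
  sapply(u, function(ui)
    optimize(function(x) (Fcdf(x) - ui)^2, c(lower, upper))$minimum)
}

# Example (standard normal): x <- sample.from.cdf(pnorm, runif(50), -10, 10)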

3.2. Performance Assessment

The probability density functions of the test examples, presented in Section 3.3, are
reconstructed using the methods described in Section 2. Once a reconstructed probability
density function is obtained, it is compared to the original sample data by means of three
different metrics for fitness assessment, presented in Section 3.2.1. Also, since each test
example was obtained from a reference probability density function, the accuracy of the
reconstructed probability density is evaluated by comparing it to the reference probability
density function. Such comparison is done using the concept of similitude between probability
density functions introduced in a previous report.[6] The determination of the probability
density similitude is presented in Section 3.2.2. The equation used for assessing accuracy is
presented in Section 3.2.3.

†††
Uniform random number generator: https://www.random.org/decimal-fractions/. Number of decimal
places used: 10.


3.2.1. Fitness Assessment

The fitness of a reconstruction method measures the ability of the method to obtain an
estimate of the probability density that satisfactorily describes the sampled data. Such
comparison is best performed on the cumulative probability of the data.

Three different metrics have been previously proposed for assessing the fitness of a cumulative
probability model to the sample data, as can be seen in Table 2:

Table 2. Fitness error metrics for cumulative probability models (adapted from [6])

Fitness Criteria | Mathematical Expression | Model Rejection Criterion
Maximum difference in cumulative probability | $\epsilon_{max} = \max_i |\epsilon_i|$ | (threshold given in [6])
Average difference in cumulative probability | $\langle\epsilon\rangle = \frac{1}{m}\sum_{i=1}^{m}|\epsilon_i|$ | (threshold given in [6])
Sum of squared differences in cumulative probability | $SSE = \sum_{i=1}^{m}\epsilon_i^2$ | (threshold given in [6])

where $\epsilon_i$ is the difference or error between the cumulative distribution described by the model ($\hat{F}$) and the cumulative distribution observed at each outcome $x_i$ (ranked in ascending order) in the sample, and it is given by:

$\epsilon_i = \hat{F}(x_i) - \tilde{F}(x_i) = \hat{F}(x_i) - \dfrac{1}{N}\sum_{j=1}^{i} n_j$
(3.2)

Usually, if at least two of the criteria proposed in Table 2 do not result in rejection, then the
model can be considered to be fit to the sample data.

The algorithm programmed in R for fitness assessment is presented in Appendix A.4.
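A minimal sketch of these three metrics in R (not the Appendix A.4 code; the rejection thresholds of [6] are not reproduced here):

# Fitness metrics of Table 2, from a model CDF 'Fhat' and a sample 'x'.
fitness <- function(Fhat, x) {
  x <- sort(x); N <- length(x)
  eps <- Fhat(x) - (1:N) / N        # Eq. (3.2): model minus observed CDF
  c(eps.max = max(abs(eps)),
    eps.avg = mean(abs(eps)),
    sse     = sum(eps^2))
}

# Example: fitness(punif, runif(30))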


3.2.2. Similitude Assessment

The similitude ($S$) between two probability distribution models can be defined as:[6]

$S(f_1, f_2) = \int_{-\infty}^{\infty} \min\big(f_1(x), f_2(x)\big)\,dx$
(3.3)

Another related concept that can be used to assess the similitude between two models is the relative probability error, given by:[17]

$\epsilon_P(f_1, f_2) = \dfrac{1}{2}\int_{-\infty}^{\infty} \big|f_1(x) - f_2(x)\big|\,dx = 1 - S(f_1, f_2)$
(3.4)

The algorithm programmed in R for similitude assessment is presented in Appendix A.5.

3.2.3. Accuracy Assessment

The accuracy ($A$) of a probability density function reconstruction method for a particular test example is determined as the similitude between the estimated probability density function and the reference probability density function defined for the test example. Thus,

$A(\hat{f}|f) = S(\hat{f}, f) = \int_{-\infty}^{\infty} \min\big(\hat{f}(x), f(x)\big)\,dx = 1 - \dfrac{1}{2}\int_{-\infty}^{\infty} \big|\hat{f}(x) - f(x)\big|\,dx$
(3.5)
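A minimal R sketch of the similitude/accuracy computation by numerical quadrature (illustrative; not the Appendix A.5 code):

# Similitude (Eqs. 3.3 and 3.5) between two densities on a finite grid.
similitude <- function(f1, f2, lower, upper, n = 10000) {
  t <- seq(lower, upper, length.out = n)
  sum(pmin(f1(t), f2(t))) * (t[2] - t[1])   # Riemann sum of min(f1, f2)
}

# Example: similitude(function(x) dnorm(x, 0, 1.1), dnorm, -8, 8)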

3.3. Test Examples

Six different test examples have been considered for the current assessment. For each test
example, the following information is provided:

 The reference probability density function used to generate the measurements.
 The corresponding cumulative probability distribution obtained from Eq. (1.5).
 The sample size.
 The set of data obtained by the sampling procedure.
 The fitness of the reference probability density function to the data sample.
 A comparative plot showing the cumulative probability of the whole set of 10.000 measurements generated, the cumulative probability of the sample data, and the reference cumulative probability function.


3.3.1. Test Example 1: Standard Uniform Distribution

Reference probability density function:

$f(x) = \begin{cases} 1 & 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases}$
(3.6)

Reference cumulative probability function:

$F(x) = \begin{cases} 0 & x < 0 \\ x & 0 \leq x \leq 1 \\ 1 & x > 1 \end{cases}$
(3.7)

Sample Size: $N = 30$

Table 3. Sample Data for the Test Example 1: Standard Uniform Distribution.
0.94984581 0.61068765 0.31245958 0.84835996 0.71142486 0.04495294
0.82666537 0.25544950 0.50851774 0.67086908 0.27409144 0.75048860
0.14488340 0.00616502 0.58209403 0.77794805 0.04842904 0.30529027
0.97782694 0.29575453 0.70165301 0.80645354 0.55693643 0.24583023
0.94960493 0.73219776 0.33442330 0.02234013 0.55732476 0.43998253

Fitness: ,〈 〉 ,

Figure 1. Cumulative Probability Distribution for Test Example 1: Standard Uniform Distribution.
Dotted blue line: Reference cumulative probability function (Eq. 3.7). Green dashed line: Large
set of 10.000 random measurements. Red solid line: Random sample of 30 measurements.


3.3.2. Test Example 2: Standard Normal Distribution

Reference probability density function:

$f(x) = \dfrac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$
(3.8)

Reference cumulative probability function:

$F(x) = \dfrac{1}{2}\left(1 + \text{erf}\!\left(\dfrac{x}{\sqrt{2}}\right)\right)$
(3.9)

Sample Size: $N = 50$

Table 4. Sample Data for the Test Example 2: Standard Normal Distribution.
-1.2439 0.4505 -0.3852 -1.8290 1.3407 -0.6469 -0.4056 0.6841 -1.5678 0.6941
-1.8914 -0.4966 0.8662 -0.2750 -0.2041 1.7432 0.1361 0.2965 1.3166 -0.9297
0.4800 -0.1934 1.6562 0.9544 2.1867 0.3798 1.1047 0.7773 0.9502 0.7686
-0.2860 1.1326 -0.7087 -1.0967 -0.4775 0.1193 -0.9110 -0.1606 0.4372 -1.6280
-0.9466 -0.3390 -2.2981 1.5130 0.1238 -0.5792 0.8862 0.6181 0.1883 0.0580

Fitness: ,〈 〉 ,

Figure 2. Cumulative Probability Distribution for Test Example 2: Standard Normal Distribution.
Dotted blue line: Reference cumulative probability function (Eq. 3.9). Green dashed line: Large
set of 10.000 random measurements. Red solid line: Random sample of 50 measurements.


3.3.3. Test Example 3: Maxwell-Boltzmann Distribution of Speed of Potassium Molecules [18]

Reference probability density function (the Maxwell-Boltzmann speed distribution, expressed here in terms of the mean speed $\langle v \rangle$):

$f(v) = \dfrac{32\, v^2}{\pi^2 \langle v \rangle^3}\, \exp\!\left(-\dfrac{4 v^2}{\pi \langle v \rangle^2}\right), \quad v \geq 0$
(3.10)

Reference cumulative probability function:

$F(v) = \text{erf}\!\left(\dfrac{2v}{\sqrt{\pi}\,\langle v \rangle}\right) - \dfrac{4v}{\pi \langle v \rangle}\, \exp\!\left(-\dfrac{4 v^2}{\pi \langle v \rangle^2}\right)$
(3.11)

Sample Size: $N = 40$

Table 5. Sample Data for the Test Example 3: Maxwell-Boltzmann Distribution.


490.97 565.82 559.45 801.30 999.41 532.13 945.82 836.09
687.86 1211.51 507.30 879.91 660.70 983.92 617.97 855.48
945.11 657.67 859.03 731.51 610.58 986.85 595.54 690.34
399.89 651.52 506.36 920.04 450.82 735.93 743.05 655.86
504.71 907.56 595.52 588.97 885.92 805.57 1161.54 635.69

Fitness: ,〈 〉 ,

Figure 3. Cumulative Probability Distribution for Test Example 3: Maxwell-Boltzmann Distribution. Dotted blue line: Reference cumulative probability function (Eq. 3.11). Green dashed line: Large set of 10.000 random measurements. Red solid line: Random sample of 40 measurements.


3.3.4. Test Example 4: Scopd Distribution of Time between Molecular Collisions [17]

Reference probability density function: the Scopd probability density function defined in [17].
(3.12)

Reference cumulative probability function: the corresponding Scopd cumulative probability function given in [17].
(3.13)

Sample Size: $N = 50$

Table 6. Sample Data for the Test Example 4: Scopd Distribution.


0.2120 0.5856 2.0385 1.7689 0.5582 1.3222 1.8612 0.5222 0.0991 4.0240
2.0459 0.9928 0.3862 1.7036 0.5141 0.1575 1.2897 0.1132 0.7786 1.1274
0.1958 0.3044 0.3626 0.3554 0.0629 0.8815 0.2076 0.3748 2.5757 2.4896
0.5043 0.0462 0.2548 1.4070 1.2133 0.5162 0.2237 3.3238 0.7086 0.7180
0.4649 0.1643 0.0425 3.1055 0.1944 0.5762 2.3019 0.1909 0.4330 2.2076

Fitness: ,〈 〉 ,

Figure 4. Cumulative Probability Distribution for Test Example 4: Scopd Distribution. Dotted
blue line: Reference cumulative probability function (Eq. 3.13). Green dashed line: Large set of
10.000 random measurements. Red solid line: Random sample of 50 measurements.


3.3.5. Test Example 5: Bimodal Distribution

Reference probability density function: a bimodal mixture of two Gaussian densities.
(3.14)

Reference cumulative probability function: the corresponding mixture cumulative probability function (a combination of two error-function terms).
(3.15)

Sample Size:

Table 7. Sample Data for the Test Example 5: Bimodal Distribution.


-1.166 2.960 1.559 2.005 2.821 -3.900 1.943 1.144 1.285 0.798 -2.340 3.454
1.035 -3.232 -1.911 2.498 -2.586 2.134 -0.614 -2.692 -0.936 -3.173 3.364 -1.160
2.506 -3.613 1.482 -2.482 4.096 -1.496 2.895 1.805 -2.307 -0.574 -0.099 -1.751
-2.410 1.707 1.993 -2.610 -1.690 3.139 -4.141 2.716 -1.382 -0.560 4.333 0.261
1.230 0.474 3.730 -3.709 -0.952 -2.633 -0.926 -2.902 -2.138 1.240 -0.546 -2.227

Fitness: ,〈 〉 ,

Figure 5. Cumulative Probability Distribution for Test Example 5: Bimodal Distribution. Dotted
blue line: Reference cumulative probability function (Eq. 3.15). Green dashed line: Large set of
10.000 random measurements. Red solid line: Random sample of 30 measurements.


3.3.6. Test Example 6: Polynomial Distribution

Reference probability density function: a polynomial density defined on the interval [1, 2] and zero outside of it.
(3.16)

Reference cumulative probability function: the corresponding piecewise polynomial cumulative probability function.
(3.17)

Sample Size: $N = 40$

Table 8. Sample Data for the Test Example 6: Polynomial Distribution.


1.0666 1.9518 1.5207 1.0033 1.7291 1.9025 1.7923 1.1390
1.7720 1.3580 1.4555 1.6355 1.8266 1.3973 1.6731 1.6315
1.1264 1.8666 1.3047 1.5789 1.2626 1.2611 1.0622 1.0230
1.9237 1.0622 1.0457 1.0256 1.2640 1.7642 1.5920 1.7886
1.8970 1.1621 1.9794 1.7502 1.3236 1.1498 1.3196 1.2600

Fitness: ,〈 〉 ,

Figure 6. Cumulative Probability Distribution for Test Example 6: Polynomial Distribution. Dotted blue line: Reference cumulative probability function (Eq. 3.17). Green dashed line: Large set of 10.000 random measurements. Red solid line: Random sample of 40 measurements.


3.4. Benchmark Examples

Benchmark examples are sets of measurements or moments reported in the literature, whose
true probability density function is considered unknown. Thus, for benchmark examples it is
not possible to test accuracy. The purpose of these examples is to test the similitude in the
results obtained using the different methods.

3.4.1. Benchmark Example 1: Old Faithful Geyser Eruptions

This data set contains 107 different observations of the eruption length (in minutes) of the Old Faithful geyser in Yellowstone Park. The data shown in Table 9 and Figure 7 were obtained from Table 2.2 presented in [4].

Table 9. Sample Data for the Benchmark Example 1: Length of eruptions of Old Faithful geyser. Sample Size: $N = 107$
4.37 1.83 4.25 3.83 1.73 4.18 1.67 2.27 4.73
4.70 1.83 3.58 1.85 3.10 4.58 3.50 2.93 3.72
1.68 3.95 3.67 3.80 4.62 3.50 4.20 4.63 4.50
1.75 4.83 1.90 3.80 1.88 4.62 4.43 4.00 4.40
4.35 3.87 4.13 3.33 3.52 4.03 1.90 1.97 4.58
1.77 1.73 4.53 3.73 3.77 1.97 4.08 3.93 3.50
4.25 3.92 4.10 1.67 3.43 4.60 3.43 4.07 1.80
4.10 3.20 4.12 4.63 2.00 4.00 1.77 4.50 4.28
4.05 2.33 4.00 1.83 3.73 3.75 4.50 2.25 4.33
1.90 4.57 4.93 2.03 4.60 4.00 1.80 4.25 4.13
4.00 3.58 3.68 2.72 2.93 4.33 3.70 4.08 1.95
4.42 3.70 1.85 4.03 4.65 1.82 2.50 3.92

Figure 7. Cumulative Probability Distribution for Benchmark Example 1: Length of eruptions of Old Faithful geyser. Red solid line: Data set of 107 measurements.


3.4.2. Benchmark Example 2: Multimodal Crystal Size Distribution

This example provides the first 10 moments of a crystal size distribution, corresponding to a
batch seeded crystallization process where two seed distributions with different mean size are
initially mixed. The moments presented in Table 10 were obtained from Example 2.3 in [15]. ‡‡‡
The decimal logarithms of the moments are included for increased precision.

Table 10. Moments of Benchmark Example 2: Multimodal Crystal Size [m] Distribution

$k$ | $M_k$ | $\log_{10}(M_k)$
0 | 1 | 0.000
1 | 1.743x10^-3 | -2.759
2 | 4.062x10^-6 | -5.391
3 | 1.078x10^-8 | -7.967
4 | 3.049x10^-11 | -10.516
5 | 8.945x10^-14 | -13.048
6 | 2.692x10^-16 | -15.570
7 | 8.261x10^-19 | -18.083
8 | 2.575x10^-21 | -20.589
9 | 8.134x10^-24 | -23.090
10 | 2.598x10^-26 | -25.585

4. Results and Discussion

This section summarizes the performance assessment of the probability density functions estimated for the Test Examples using the different reconstruction methods. The relative fitness error of each probability density function was obtained as the ratio of each metric presented in Table 2 for the estimated density function to the metric for the corresponding reference probability density function. The accuracy of each probability density function is determined using Eq. (3.5).

4.1. Test Example 1: Standard Uniform Distribution

Although simple, this is actually a challenging test example, particularly for methods providing
smooth probability density functions. Table 11 summarizes the performance of all the
reconstruction methods presented in Section 2, using different reconstruction parameters. A
comparison of the probability density functions obtained is presented in Figure 8.

‡‡‡
The moments reported in [15] are not normalized. Therefore, all of them must be divided by the zero-th moment ($M_0$).


Table 11. Performance of probability density functions estimated for Test Example 1 using different reconstruction methods.

Reconstruction Method | Relative $\epsilon_{max}$ (%) | Relative $\langle\epsilon\rangle$ (%) | Relative $SSE$ (%) | Accuracy $A(\hat{f}|f)$ (%)
Derivative of CDF (CDF: Linear model) | 87.6 | 88.5 | 82.2 | 97.7
Derivative of CDF (CDF: Quadratic model) | 146.1 | 132.1 | 179.1 | 93.6
Derivative of CDF (CDF: Cubic model) | 150.1 | 138.3 | 190.9 | 93.6
Naïve Estimator (30 nodes, Δx=0.1) | 89.0 | 140.9 | 126.5 | 77.0
Naïve Estimator (30 nodes, Δx=0.3) | 181.7 | 264.2 | 465.4 | 89.7
Naïve Estimator (30 nodes, Δx=0.5) | 205.1 | 331.0 | 727.3 | 90.8
Sample-based Parametric Estimation (Uniform Distribution) | 94.3 | 88.7 | 83.7 | 98.3
Sample-based Parametric Estimation (Normal Distribution) | 136.8 | 114.5 | 148.8 | 81.4
Kernel Density Estimation (Gaussian Kernels, optimal h=0.16117) | 95.8 | 96.3 | 102.6 | 85.5
Kernel Density Estimation (Epanechnikov Kernels, optimal h=0.16117) | 99.9 | 102.0 | 111.6 | 85.5
Kernel Density Estimation (Gaussian Kernels, h=0.05) | 64.4 | 37.7 | 24.2 | 83.1
Kernel Density Estimation (Epanechnikov Kernels, h=0.05) | 70.2 | 42.4 | 28.4 | 83.5
Moment-based Parametric Estimation (Uniform Distribution, M1 to M2) | 88.0 | 88.9 | 82.0 | 97.7
Moment-based Parametric Estimation (Uniform Distribution, M1 to M4) | 107.1 | 90.3 | 91.2 | 99.5
Moment-based Parametric Estimation (Normal Distribution, M1 to M2) | 156.3 | 181.2 | 286.0 | 80.6
Inverse Laplace Transform | 475.2 | 708.9 | 3489.5 | 60.1
Inverse Mellin Transform | 100.3 | 100.3 | 100.5 | 100
Cubic Splines (13 nodes) | 175.3 | 180.6 | 331.8 | 84.7
Polynomial Approach (using M0 to M2) | 79.4 | 77.1 | 65.9 | 96.2
Polynomial Approach (using M0 to M4) | 94.8 | 61.8 | 56.3 | 92.8


Figure 8. Probability density functions reconstructed for Test Example 1 using the reconstruction methods listed in Table 11. Each panel compares the reference density with: (a) CDF derivative with linear, quadratic and cubic CDF models; (b) Naïve estimator with 30 nodes and Δx = 0.1, 0.3 and 0.5; (c) sample-based parametric estimation with uniform and normal distributions; (d) Gaussian and Epanechnikov kernels with h = 0.16117 and h = 0.05; (e) moment-based parametric estimation, uniform (M1 to M2 and M1 to M4) and normal (M1 to M2); (f) inverse Laplace and inverse Mellin transforms; (g) cubic splines and polynomial approach (M0 to M2 and M0 to M4).


Particularly for this test example, all the probability density function reconstruction methods
were evaluated. In some cases, different parameter sets were considered.

Several methods were able to reconstruct the original probability density function with
accuracy greater than 95%. That was the case of the method based on the derivative of the
cumulative probability function fitted by a linear model, the parametric methods (assuming a
uniform distribution), the inverse Mellin transform method, and the moment-based polynomial
method. As expected, the relative fit to the data of the previous methods was close to 100% or
even lower.

Some methods were capable of significantly improving the fit to the data without improving the accuracy. That was the case of the kernel density estimators (particularly for small bandwidths). This indicates that overfitting occurred in this type of method. Thus, improving the fitness of the probability density function to the data in a sample does not guarantee accurately reconstructing the true probability distribution of the population.

Parametric methods considering a uniform distribution were capable of fitting the data better
than the reference function, while at the same time providing high accuracies. The
disadvantage of parametric methods is that the true shape of the probability density function
should be known a priori. And that is seldom the case. Thus, it is necessary to test different pre-
defined probability density functions with a wide variety of shapes, in order to successfully
reconstruct the density function.

The inverse Mellin method was particularly interesting, as it was capable of almost exactly
reconstructing the reference density function. It is possible that the inverse Mellin method is
very efficient for uniform distributions. Thus, it is important to analyze the results obtained
with the other test examples. On the opposite side, the inverse Laplace method was the worst
performer, not only in fitness but also in accuracy. This result might indicate that the method is
very sensitive to truncation of the infinite series expansion. Furthermore, it is a cumbersome
method as it involves obtaining Padé approximants and inverse Laplace transforms, requiring
manual analytical solution or symbolic programming. Given that the difficulty involved in this
method is not rewarded by fitness and accuracy, it will not be considered for the remaining test
examples.

Since the moment-based polynomial method seems to be performing better and with a lower
computational load than the closely related moment-based splines method, the latter will not
be tested for the remaining test examples.

It is also evident that the shapes of the probability density functions reconstructed by the Naïve and kernel density estimators are highly sensitive to the particular reconstruction parameters used. The polynomial reconstruction also presented significant changes in the


shape of the density function, particularly for higher order polynomials. For methods resulting
in noisy (multimodal) probability density functions, it is difficult to assess if the behavior
predicted is real or if it is just a mathematical artifact.

4.2. Test Example 2: Standard Normal Distribution

The second test example is also challenging for two reasons: 1) It is a highly non-linear
mathematical function, and 2) It involves both positive and negative values. Table 12
summarizes the performance of the reconstruction methods used. Only one set of
reconstruction parameters is considered for each method. The probability density functions
obtained are presented in Figure 9.

Table 12. Performance of probability density functions estimated for Test Example 2 using different reconstruction methods.

Reconstruction Method | Relative $\epsilon_{max}$ (%) | Relative $\langle\epsilon\rangle$ (%) | Relative $SSE$ (%) | Accuracy $A(\hat{f}|f)$ (%)
Derivative of CDF (CDF: 5-th degree model) | 78.1 | 33.0 | 20.5 | 92.5
Naïve Estimator (50 nodes, Δx=0.5) | 75.6 | 42.2 | 26.2 | 88.4
Sample-based Parametric Estimation (Normal Distribution) | 64.6 | 40.0 | 21.2 | 96.5
Kernel Density Estimation (Gaussian Kernels, optimal h=0.49257) | 80.9 | 68.8 | 46.4 | 91.9
Kernel Density Estimation (Epanechnikov Kernels, optimal h=0.49257) | 83.1 | 72.3 | 51.3 | 92.0
Moment-based Parametric Estimation (Normal Distribution, M1 to M2) | 65.4 | 54.2 | 33.4 | 98.0
Inverse Mellin Transform (using M0 to M10, steps of 1/4) | 612.2 | 1029.1 | 6435.0 | 71.1
Polynomial Approach (using M0 to M5) | 73.6 | 31.6 | 17.2 | 92.2

The parametric methods (sample- and moment-based) provided the best accuracy (>95%) for
the normal test example. The polynomial methods (sample-based derivative of CDF and
moment-based polynomial) along with the kernel density methods scored between 90 and 95%
accuracy. Particularly the polynomial methods presented a strange behavior at the extreme
values of the sample. This is probably due to the cut-off in the density distribution performed at
those extreme values. However, the central shape of the distribution is close to the reference
distribution. The Naïve estimator correctly followed the behavior of the distribution, but it is
too noisy. The inverse Mellin method failed in this case to correctly describe the density
function. This is probably due to the high sensitivity of the results to the coefficients obtained


by integration (when performing the inverse Mellin transform). On the other hand, the methods providing the best fit to the sample data were the parametric and the polynomial methods (both sample-based and moment-based).

Figure 9. Probability density functions reconstructed for Test Example 2 using the reconstruction methods listed in Table 12. Each panel compares the reference density with: (a) CDF derivative with a 5-th degree CDF model, Naïve estimator (Δx = 0.5) and sample-based parametric estimation (normal); (b) Gaussian and Epanechnikov kernels with h = 0.49257; (c) moment-based parametric estimation (normal), inverse Mellin transform (M0 to M10, steps of 1/4) and polynomial approach (M0 to M5).

4.3. Test Example 3: Maxwell-Boltzmann Distribution of Speed of Potassium Molecules

The Maxwell-Boltzmann distribution also represents a challenging non-linear function to be described, as can be seen in Eq. (3.10) and (3.11). Furthermore, this test example presented the largest fitness errors with respect to the reference density function. An additional complexity in this particular test example is the presence of large values. The methods based on the solution of systems of equations (e.g. moment-based cubic splines and polynomials) may experience singularity issues when considering higher-order moments. In those cases, it is recommended to scale the values in the sample before the analysis. The algorithm in R for the polynomial fit allowing for scaling of the input moments is presented in Appendix A.8.


Table 13. Performance of probability density functions estimated for Test Example 3 using different reconstruction methods.

Reconstruction Method | Relative $\epsilon_{max}$ (%) | Relative $\langle\epsilon\rangle$ (%) | Relative $SSE$ (%) | Accuracy $A(\hat{f}|f)$ (%)
Derivative of CDF (CDF: 5-th degree model) | 35.9 | 19.1 | 6.5 | 84.2
Naïve Estimator (40 nodes, Δx=100) | 18.8 | 6.6 | 1.2 | 79.8
Sample-based Parametric Estimation (Maxwell-Boltzmann Distribution) | 46.1 | 29.3 | 12.3 | 92.3
Kernel Density Estimation (Gaussian Kernels, optimal h=98.96) | 44.7 | 30.2 | 13.1 | 95.2
Kernel Density Estimation (Epanechnikov Kernels, optimal h=98.96) | 47.4 | 34.1 | 15.9 | 96.6
Moment-based Parametric Estimation (Maxwell-Boltzmann Distribution) | 59.3 | 35.7 | 17.3 | 90.4
Inverse Mellin Transform (using M0 to M10, steps of 1/4) | 1258 | 1052 | 12993 | 39.3
Polynomial Approach (using M0 to M5, with variable scaling) | 43.4 | 25.1 | 10.1 | 89.4

For this test example, the best accuracy (95-96%) was obtained using kernel density estimators, particularly using Epanechnikov kernels. The parametric methods considering a Maxwell-Boltzmann distribution only obtained 90-92% accuracy. It is possible that this low accuracy is due to the large fitness errors between the sample and the reference distribution. Please notice that low relative fitness errors are reported in Table 13, indicating that the original reference function did not fit the data sample so well.

The inverse Mellin transform again failed to accurately describe the density function or fit the sample data, as a result of the sensitivity to the coefficients obtained during the integration. Again, since this method requires additional steps of analytical or symbolic integration and is not performing well, it will not be considered for the remaining test examples.

Another interesting point in these results is that some methods (i.e. CDF derivative, Naïve and kernel estimators) obtained a probability density function with two relevant humps, resembling the presence of two different populations. This indicates that there is a higher concentration of results in the sample around 500 m/s but also around 1000 m/s.


Figure 10. Probability density functions reconstructed for Test Example 3 using the reconstruction methods listed in Table 13. Each panel compares the reference density with: (a) CDF derivative with a 5-th degree CDF model, Naïve estimator (40 nodes, Δx = 100) and sample-based parametric estimation (Maxwell-Boltzmann); (b) Gaussian and Epanechnikov kernels with h = 98.96; (c) moment-based parametric estimation (Maxwell-Boltzmann), inverse Mellin transform (M0 to M10, steps of 1/4) and polynomial approach (M0 to M5).

4.4. Test Example 4: Scopd Distribution of Time between Molecular Collisions

The Scopd distribution of time between molecular collisions is a highly non-linear function that can be approximately represented by an exponential distribution, although they are not the same. The shape of this distribution is not easily described by lower-degree polynomials, which tend to present larger deviations at the extremes of the distribution. The results obtained for several reconstruction methods are presented in Table 14 and Figure 11.

It can be seen again that parametric estimations based on the corresponding Scopd function resulted in the highest accuracies (>96%). Parametric exponential distributions and the moment-based polynomial approach also performed well (94-95%). In particular, the polynomial approach achieved low fitness errors, comparable to those obtained with the sample-based parametric Scopd fit. The kernel density estimators (Gaussian and Epanechnikov) did not perform as well, mainly because they predict non-zero density for negative collision times.
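A minimal sketch of the sample-based exponential fit follows: the rate parameter is the reciprocal of the sample mean, which is its maximum-likelihood estimate. The vector "t" is a placeholder, not the actual collision-time sample.

t <- rexp(40, rate = 1.3) #placeholder sample of collision times
rate <- 1/mean(t) #maximum-likelihood estimate of the exponential rate
rhoExp <- function(x) dexp(x, rate = rate) #fitted exponential PDF
curve(rhoExp, from = 0, to = 5, ylab = "Probability Density")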

Table 14. Performance of probability density functions estimated for Test Example 4 using different reconstruction methods.

Reconstruction Method | Rel. Max. Error (%) | Rel. Mean Error (%) | Rel. SS Error (%) | Accuracy (%)
Derivative of CDF (4th-degree CDF model) | 114.3 | 158.1 | 208.1 | 89.2
Naïve Estimator (50 nodes, dx = 0.5) | 198.7 | 357.2 | 808 | 82.8
Sample-based Parametric Estimation (Scopd distribution) | 102.7 | 73.8 | 63.8 | 96.7
Sample-based Parametric Estimation (Exponential distribution) | 115.4 | 108.8 | 119.7 | 95.1
Kernel Density Estimation (Gaussian kernels, optimal h = 0.46764) | 251.8 | 204.1 | 495.8 | 76.0
Kernel Density Estimation (Epanechnikov kernels, optimal h = 0.46764) | 275.0 | 222.6 | 600.3 | 73.8
Moment-based Parametric Estimation (Scopd distribution) | 108.1 | 116.5 | 131.6 | 99.0
Moment-based Parametric Estimation (Exponential distribution) | 147.5 | 114.8 | 138.3 | 94.0
Polynomial Approach (M0 to M4) | 101.4 | 65.1 | 57.5 | 94.3
[Figure 11 shows four panels of probability density vs. x (-1 to 5): (a) reference vs. CDF derivative (4th-degree CDF) and Naïve estimator (50 nodes, dx = 0.5); (b) reference vs. sample-based parametric Scopd and exponential fits; (c) reference vs. Gaussian and Epanechnikov kernel estimates (h = 0.46764); (d) reference vs. moment-based parametric Scopd and exponential fits and polynomial fit (M0 to M4).]
Figure 11. Probability density functions reconstructed for Test Example 4 using the reconstruction methods listed in Table 14.

4.5. Test Example 5: Bimodal Distribution

This particular distribution was selected as a test example in order to assess the prediction capability of the reconstruction methods when sampling from multimodal distributions. The moment-based parametric estimation assuming a binormal distribution would require the first five moments of the distribution and involves analytically solving several integrals of increasing complexity; this method is therefore not considered for this example. The results obtained are summarized in Table 15 and Figure 12.

Table 15. Performance of probability density functions estimated for Test Example 5 using different reconstruction methods.

Reconstruction Method | Rel. Max. Error (%) | Rel. Mean Error (%) | Rel. SS Error (%) | Accuracy (%)
Derivative of CDF (5th-degree CDF model) | 114.3 | 93.2 | 92.8 | 87.2
Naïve Estimator (60 nodes, dx = 1.5) | 110.0 | 114.8 | 118.0 | 88.0
Sample-based Parametric Estimation (Binormal distribution) | 72.0 | 33.1 | 20.6 | 91.6
Sample-based Parametric Estimation (Normal distribution) | 225.7 | 271.7 | 677.5 | 75.8
Kernel Density Estimation (Gaussian kernels, optimal h = 1.106) | 129.3 | 121.7 | 137.6 | 81.5
Kernel Density Estimation (Epanechnikov kernels, optimal h = 1.106) | 134.4 | 129.5 | 152.3 | 80.4
Moment-based Parametric Estimation (Normal distribution, using M1 and M2) | 217.5 | 272.8 | 702.1 | 76.2
Polynomial Approach (M0 to M6) | 55.4 | 33.2 | 19.3 | 90.1

The parametric approach considering a normal distribution does not represent the original density distribution, as only one mode is obtained; furthermore, the mode found does not correspond to a true mode of the sample. All other methods considered were able to identify both modes, although each method assigned different relative frequencies to them. The best accuracy was obtained by the sample-based parametric estimation assuming a binormal distribution. Close results were obtained by the polynomial approach, although the extreme tails are not correctly identified as a result of truncation. Furthermore, the best fit to the data sample was obtained by the moment-based polynomial fit. The CDF derivative and the Naïve estimator also performed acceptably. The kernel density estimators did not score as well because they flatten the region between the modes too much.
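A hedged sketch of the sample-based binormal fit follows, reading "binormal" as a two-component Gaussian mixture adjusted by maximizing the log-likelihood with optim(). All names below are illustrative; the parameter vector is (logit weight, mu1, log sd1, mu2, log sd2).

binormNLL <- function(p, x) { #negative log-likelihood of the mixture
w <- plogis(p[1]) #mixing weight constrained to (0,1)
-sum(log(w*dnorm(x, p[2], exp(p[3])) + (1-w)*dnorm(x, p[4], exp(p[5]))))
}
x <- c(rnorm(30, -2, 0.8), rnorm(30, 2, 1.2)) #placeholder bimodal sample
fit <- optim(c(0, -1, 0, 1, 0), binormNLL, x = x) #maximum-likelihood fit
p <- fit$par
rhoBN <- function(y) plogis(p[1])*dnorm(y, p[2], exp(p[3])) +
(1 - plogis(p[1]))*dnorm(y, p[4], exp(p[5])) #fitted binormal PDF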

[Figure 12 shows four panels of probability density vs. x (-4 to 4): (a) reference vs. CDF derivative (5th-degree CDF) and Naïve estimator (60 nodes, dx = 1.5); (b) reference vs. sample-based parametric binormal and normal fits; (c) reference vs. Gaussian and Epanechnikov kernel estimates (h = 1.106); (d) reference vs. moment-based parametric normal fit and polynomial fit (M0 to M6).]
Figure 12. Probability density functions reconstructed for Test Example 5 using the reconstruction methods listed in Table 15.

4.6. Test Example 6: Polynomial Distribution

The final test example is an arbitrary polynomial distribution. This particular distribution presents an antimode instead of a mode; the idea was to include a distribution with a completely different shape from the previous examples. The corresponding reconstruction results can be seen in Table 16 and Figure 13. For this particular case, the moment-based parametric estimation using a polynomial distribution model is equivalent to the polynomial estimation method presented in Section 2.2.5.

The most accurate reconstructions were obtained with the polynomial methods (parametric estimation and derivative of the CDF), which also provided the best fit to the data sample. The kernel density estimation methods successfully describe the antimode, but they had some difficulties at the extremes of the distribution. The Naïve estimator was again a noisy reconstruction around the reference function.
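As an illustration of the moment-based polynomial route for this example, the momentset and polyPDF functions of Appendices A.2 and A.8 could be combined as follows; the sample "x" is a placeholder on the example's support, not the actual test data.

x <- runif(40, 0.5, 2.5) #placeholder sample on the example's support
ms <- momentset(x, nmax = 4) #moments M0 to M4 (Appendix A.2)
rho <- polyPDF(ms, xmin = 0.5, xmax = 2.5) #degree-4 polynomial PDF (Appendix A.8)
curve(rho, from = 0.5, to = 2.5, ylab = "Probability Density")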

Table 16. Performance of probability density functions estimated for Test Example 6 using different reconstruction methods.

Reconstruction Method | Rel. Max. Error (%) | Rel. Mean Error (%) | Rel. SS Error (%) | Accuracy (%)
Derivative of CDF (4th-degree CDF model) | 86.7 | 31.3 | 19.3 | 95.3
Naïve Estimator (40 nodes, dx = 0.2) | 129.5 | 198.5 | 283.7 | 87.8
Sample-based Parametric Estimation (4th-degree polynomial distribution) | 86.7 | 26.1 | 18.2 | 93.2
Kernel Density Estimation (Gaussian kernels, optimal h = 0.1618) | 116.5 | 75.0 | 70.0 | 82.2
Kernel Density Estimation (Epanechnikov kernels, optimal h = 0.1618) | 122.6 | 82.8 | 81.8 | 80.7
Moment-based Parametric Estimation (Polynomial approach, M0 to M4) | 69.3 | 37.1 | 21.6 | 95.6

[Figure 13 shows two panels of probability density vs. x (0.5 to 2.5): (a) reference vs. CDF derivative (4th-degree CDF), Naïve estimator (40 nodes, dx = 0.2) and sample-based parametric 4th-degree polynomial fit; (b) reference vs. Gaussian and Epanechnikov kernel estimates (h = 0.1618) and moment-based polynomial fit (M0 to M4).]
Figure 13. Probability density functions reconstructed for Test Example 6 using the reconstruction methods listed in Table 16.

4.7. Benchmark Example 1: Old Faithful Geyser Eruptions

In this benchmark example, the true probability density function of the data is unknown. Thus, the goal is to use the different methods to obtain the best possible estimate of the probability distribution of the population. Table 17 and Figure 14 summarize the results obtained for this data set.

From the results obtained in the previous test examples, it was possible to observe that the Naïve estimator, although noisy, almost always follows the true probability density of the population when the smoothing step is chosen close to the optimal kernel bandwidth (Eq. 2.13). Thus, any smoother density function presenting a high similitude with the Naïve estimator can be expected to represent the true distribution of the population accurately. For this reason, the similitude between each reconstruction and the Naïve estimation has been included in Table 17.

Table 17. Performance of probability density functions estimated for Benchmark Example 1 using different reconstruction methods.

Reconstruction Method | Max. Error | Mean Error | SS Error | Similitude with Naïve (%)
Derivative of CDF (4th-degree CDF model) | 0.0403 | 0.0109 | 0.0497 | 89.4
Naïve Estimator (100 nodes, dx = 0.4) | 0.0597 | 0.0259 | 0.1859 | 100
Sample-based Parametric Estimation (4th-degree polynomial distribution) | 0.0537 | 0.0116 | 0.0608 | 88.5
Kernel Density Estimation (Gaussian kernels, optimal h = 0.3759) | 0.1453 | 0.0697 | 1.3159 | 78.9
Kernel Density Estimation (Epanechnikov kernels, optimal h = 0.3759) | 0.1529 | 0.0745 | 1.4971 | 77.4
Moment-based Polynomial Estimation (using moments M0 to M3) | 0.0277 | 0.0059 | 0.0170 | 90.0
[Figure 14 shows a single panel of probability density vs. x (0 to 6) comparing the CDF derivative (4th-degree CDF), Naïve estimator (100 nodes, dx = 0.4), sample-based parametric 4th-degree polynomial fit, Gaussian and Epanechnikov kernel estimates (h = 0.3759), and moment-based polynomial fit (M0 to M3).]
Figure 14. Probability density functions reconstructed for Benchmark Example 1 using the reconstruction methods listed in Table 17.

Polynomial reconstruction methods achieved the best fit to the data sample, while at the same time presenting the highest similitude with the Naïve estimator at the optimal bandwidth. In particular, the moment-based polynomial estimation can be considered the best description of the true probability density function of this population. The degree of the polynomial was chosen by minimizing all three fitness error metrics: when at least one of the metrics no longer decreases as the degree of the polynomial is increased, the previous degree is selected as optimal. The polynomial probability density function obtained is:

$\rho(x\,[\mathrm{min}]) = a_0 + a_1 x + a_2 x^2 + a_3 x^3, \quad x_{min} \le x \le x_{max}$
(4.1)

where the coefficients $a_k$ are those identified by the moment-based polynomial fit using moments $M_0$ to $M_3$.

It is also possible to conclude that the data sample contains two groups of observations. The first group, of short eruption lengths (<2.5 min), resembles an exponential distribution with a lag (a minimum eruption length). The second group, consisting of longer eruptions (>2.5 min), follows a skewed unimodal distribution.
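As a hedged illustration (assuming, plausibly but without confirmation from the source, that this benchmark corresponds to R's built-in faithful dataset of Old Faithful eruption lengths in minutes), the general inversion algorithm of Appendix A.9 could be applied as:

x <- faithful$eruptions #eruption lengths (min), assumed benchmark sample
rho <- invPDF(x, method = "moments", xmin = min(x), xmax = max(x))
curve(rho, from = min(x), to = max(x), ylab = "Probability Density")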

4.8. Benchmark Example 2: Multimodal Crystal Size Distribution

The final example provides only information about the moments of the distribution. Thus, only
the moment-based methods described in Section 2.2 can be used. Given that there is no data
set, it is not possible to test the fitness of the reconstructions. However, since it is known that
the moments come from a distribution with three modes,[15] this criterion will be used to
assess the effectiveness of each method.

For this example, given the large difference in magnitude between the moments, the following re-scaling of the variable is proposed to reduce numerical error:

$y = \lambda x$
(4.2)

where $\lambda$ is a scaling constant chosen so that the transformed variable lies approximately in $[0,1]$. Thus,

$M_n(Y) = E(Y^n) = \lambda^n M_n(X)$
(4.3)

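A minimal sketch of this rescaling in R (an illustration, not from the original report) maps raw moments of X to moments of Y per Eq. (4.3):

rescaleMoment <- function(Mx, n, lambda) {
#Moment of order "n" of Y = lambda*X, given the raw moment "Mx" of X (Eq. 4.3)
lambda^n * Mx
}
#Hypothetical example: if M1(X) were 4.378e-4 m and lambda were 1000 1/m
#(assumed values, for illustration only), then M1(Y) = 0.4378 as in Table 18:
rescaleMoment(4.378e-4, 1, 1000)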
The transformed moments are presented in Table 18. When possible, all available moments are used for the reconstruction of the probability density function. The results are presented in Figure 15. The moment-based reconstruction considering a normal distribution was included only for illustrative purposes: it was known beforehand that the distribution is multimodal, but it provides a visual reference for the moments used. Clearly, the inverse transform methods do not offer a reliable representation of the distribution, probably because of the large error involved in the approximation and integration procedures. On the other hand, the polynomial approach, while not perfect, successfully predicted a multimodal distribution from the moment data. The predicted modal points are also close to those presented in the right plot of Fig. 5 reported in [15]. The similitude of all other methods to the polynomial fit was found to be below 75%.

Table 18. Transformed moments for Benchmark Example 2 using Eq. (4.3).

n | M_n(Y)
0 | 1
1 | 0.4378
2 | 0.2563
3 | 0.1709
4 | 0.1214
5 | 0.0894
6 | 0.0676
7 | 0.0521
8 | 0.0408
9 | 0.0324
10 | 0.0260
[Figure 15 shows probability density vs. y (0 to 1) for the moment-based parametric normal estimation, the inverse Laplace transform, the inverse Mellin transform, and the moment-based polynomial estimation (M0 to M10).]
Figure 15. Probability density functions reconstructed for Benchmark Example 2 using transform (4.2).

The probability density function obtained in terms of the original variable is presented in Figure 16, neglecting the noise at the upper tail. The corresponding equation follows from Eq. (4.2) by the change of variable:

$\rho_X(x\,[\mathrm{m}]) = \lambda\,\rho_Y(\lambda x) = \lambda \left( a_0 + a_1 (\lambda x) + \cdots + a_{10} (\lambda x)^{10} \right)$
(4.4)

where the $a_k$ are the coefficients of the degree-10 polynomial identified in the transformed variable.

[Figure 16 shows the probability density (1/m, up to about 600) vs. x (m, 0 to 0.003) for the reconstructed distribution.]
Figure 16. Probability density function reconstructed for Benchmark Example 2 without variable transformation. Function described by Eq. (4.4).

5. Conclusion

Reconstructing the original probability distribution from a data sample or from a set of distribution moments is a challenging inverse problem, with two main difficulties. On one hand, there is no unique solution, since different distributions might yield the same finite sample of data or the same finite set of moments. On the other hand, there is always an intrinsic error involved in the sampling procedure used to obtain the data and/or the moments. Three different fitness error metrics were considered for assessing the error in cumulative probability: the maximum error, the mean error and the sum of squared errors. For six reference probability density functions of different shapes, random samples of 30 to 60 elements presented sample cumulative probability distributions slightly different from the corresponding population cumulative probability: the average maximum sampling error was 6.5%, the average mean sampling error was 2.3%, and the average sum of squared errors was 0.0457. In general, the sampling error increases as the sample size is reduced. Hence, even a reconstruction that perfectly fits the data sample will not necessarily reflect the true probability distribution of the population. The challenging goal is therefore to satisfactorily reconstruct the true probability density function of a population from a finite sample or a finite set of moments.
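Consistent with the implementation in Appendix A.4, and denoting by $\epsilon_j$ the deviation of the reconstructed cumulative probability at the $j$-th sample point from the cumulative probability band of the sample, the three metrics can be summarized as:

$\epsilon_{max} = \max_{1 \le j \le N} \epsilon_j, \qquad \langle \epsilon \rangle = \frac{1}{N} \sum_{j=1}^{N} \epsilon_j, \qquad SS_\epsilon = \sum_{j=1}^{N} \epsilon_j^2$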

Different reconstruction methods were presented in Section 2, based either on the cumulative probability distribution of the sample or on a finite set of sample moments. These methods were used to reconstruct the six test examples described in Section 3. Their performance was assessed by means of (i) the relative fitness error (maximum, mean and sum of squares, relative to those of the reference probability density function), comparing the reconstructed distribution to the cumulative probability distribution of the sample, and (ii) the accuracy, comparing the reconstructed distribution to the reference density function. A summary of the performance evaluation for the test examples considered is presented in Table 19 and Figure 17. Although not all methods were used for all test examples, these results allow reaching some interesting conclusions.

Table 19. Summary of performance (relative fitness error and accuracy) of the different reconstruction methods for the test examples considered, sorted by decreasing average accuracy.

Reconstruction Method | Rel. Fitness Error (%): Avg / Best / Worst | Accuracy (%): Avg / Best / Worst
Moment-based Parametric Estimation (known distribution) | 87.2 / 17.3 / 147.5 | 96.4 / 99.5 / 90.4
Sample-based Parametric Estimation (known distribution) | 62.9 / 12.3 / 119.7 | 94.8 / 98.3 / 91.6
Moment-based Polynomial Approach | 53.8 / 10.1 / 101.4 | 92.5 / 96.2 / 89.4
Derivative of CDF | 96.1 / 6.5 / 208.1 | 91.7 / 97.7 / 84.2
Naïve Estimator | 209.2 / 1.2 / 808.0 | 85.5 / 90.8 / 77.0
Kernel Density Estimation | 140.6 / 13.1 / 600.3 | 84.9 / 96.6 / 73.8
Cubic Splines | 229.2 / 175.3 / 331.8 | 84.7 / 84.7 / 84.7
Sample-based Parametric Estimation (unknown distribution, Normal approximation) | 262.5 / 114.5 / 677.5 | 78.6 / 81.4 / 75.8
Moment-based Parametric Estimation (unknown distribution, Normal approximation) | 302.7 / 156.3 / 702.1 | 78.4 / 80.6 / 76.2
Inverse Mellin Transform | 2631.2 / 100.3 / 12993.0 | 70.1 / 100.0 / 39.3
Inverse Laplace Transform | 1557.9 / 475.2 / 3489.5 | 60.1 / 60.1 / 60.1

Clearly, the most efficient methods for reconstructing probability distributions are the parametric estimation methods (both sample-based and moment-based), as long as the true type of probability distribution is known. Although the type of distribution can sometimes be identified after a careful analysis of the particular problem, this is not always the case.

Figure 17. Performance assessment of different reconstruction methods for the test examples
considered. Top: Average (green dot) and range (blue line) for all three relative fitness error
metrics. Bottom: Average (green dot) and range (blue line) for accuracy.

When the true distribution is unknown, assuming a general-purpose distribution (such as the normal distribution) is not the best solution. In those cases, the moment-based polynomial approach or the sample-based derivative of the cumulative probability distribution function (CDF) is more accurate. These results were confirmed by the benchmark examples (with unknown true distributions), where these two polynomial methods were most likely the best performers.

The polynomial methods, however, require defining the optimal degree of the polynomial used to fit the data. Visual inspection is probably the best approach for choosing this degree, but the search can also be automated. For the moment-based polynomial approach, the maximum degree depends on the number of moments available. The recommended procedure consists of starting with only two moments (M0 and M1), fitting the polynomial and predicting the remaining moments; moments are then added stepwise until the sum of squared differences between the observed and predicted moments stops improving significantly (see the algorithm in Appendix A.9).

For the derivative-of-the-CDF method, the suggested procedure is the following:
1. Construct the Naïve estimator with an optimal smoothing factor (calculated as the optimal bandwidth for a Gaussian kernel density estimator, Eq. 2.13).
2. Begin with a linear model for the CDF, fit the model parameters and obtain the density function by differentiation. Calculate the similitude between the Naïve estimator and this model.
3. Increase the degree of the polynomial model for the CDF, fit the model and calculate the similitude with respect to the Naïve estimator. Repeat this step until the similitude stops improving significantly.

Although the Naïve estimator is very noisy, it tends to follow the true probability distribution of the population. Thus, a polynomial model that is fitted to the sample CDF while remaining as close as possible to the Naïve estimator is expected to provide a good prediction of the probability density function (see the algorithm in Appendix A.9).
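For reference, both automated procedures are exposed through the invPDF function of Appendix A.9; a minimal usage sketch for a generic numeric sample "x" could be:

rho1 <- invPDF(x, method = "sample") #CDF-derivative route with degree search
rho2 <- invPDF(x, method = "moments", xmin = min(x), xmax = max(x)) #moment-based polynomial route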

Another important observation is that the methods based on integral transforms (the inverse Laplace and inverse Mellin transform methods), despite resting on a sound theory of density reconstruction, are poor performers in practice. It is highly likely that the approximation procedures used to obtain the inverse transforms from data samples do not provide enough accuracy compared to the other methods.

This work was not intended to provide an exhaustive comparison of all possible methods for probability density reconstruction, so some efficient prediction methods may have been left aside. However, their performance can easily be assessed and compared to the methods considered here by reconstructing the test examples presented in Section 3.

Acknowledgments

The author gratefully acknowledges Prof. Jaime Aguirre (Universidad Nacional de Colombia)
for helpful discussions on the topic and for reviewing the manuscript.

This research did not receive any specific grant from funding agencies in the public,
commercial, or not-for-profit sectors.

References

[1] Hernandez, H. (2018). The Realm of Randomistic Variables. ForsChem Research Reports 2018-10. doi: 10.13140/RG.2.2.29034.16326.

[2] Tarantola, A. (2005). Inverse Problem Theory and Methods for Model Parameter Estimation. Philadelphia: SIAM.

[3] Lancaster, P., & Salkauskas, K. (1986). Curve and Surface Fitting: An Introduction. London: Academic Press.

[4] Silverman, B. W. (1998). Density Estimation for Statistics and Data Analysis. Boca Raton: Chapman & Hall/CRC.

[5] Hernandez, H. (2018). Multidimensional Randomness, Standard Random Variables and Variance Algebra. ForsChem Research Reports 2018-02. doi: 10.13140/RG.2.2.11902.48966.

[6] Hernandez, H. (2018). Parameter Identification using Standard Transformations: An Alternative Hypothesis Testing Method. ForsChem Research Reports 2018-04. doi: 10.13140/RG.2.2.14895.02728.

[7] Altman, N. S. (1992). An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. The American Statistician, 46(3), 175-185.

[8] Epanechnikov, V. A. (1969). Non-parametric Estimation of a Multivariate Probability Density. Theory of Probability & Its Applications, 14(1), 153-158.

[9] Hernandez, H. (2018). Introduction to Randomistic Optimization. ForsChem Research Reports 2018-11. doi: 10.13140/RG.2.2.30110.18246.

[10] Hernandez, H. (2017). Multivariate Probability Theory: Determination of Probability Density Functions. ForsChem Research Reports 2017-13. doi: 10.13140/RG.2.2.28214.60481.

[11] Hernandez, H. (2018). Expected Value, Variance and Covariance of Natural Powers of Representative Standard Random Variables. ForsChem Research Reports 2018-08. doi: 10.13140/RG.2.2.15187.07205.

[12] Laplace, P. S. (1814). Théorie Analytique des Probabilités. 2nd Ed. Paris: Courcier.

[13] Karlsson, J., & von Sydow, B. (1976). The Convergence of Padé Approximants to Series of Stieltjes. Arkiv för Matematik, 14, 43.

[14] Epstein, B. (1948). Some Applications of the Mellin Transform in Statistics. The Annals of Mathematical Statistics, 19(3), 370-379.

[15] John, V., Angelov, I., Öncül, A. A., & Thévenin, D. (2007). Techniques for the Reconstruction of a Distribution from a Finite Number of its Moments. Chemical Engineering Science, 62(11), 2890-2904.

[16] Haahr, M. (2010). Introduction to Randomness and Random Numbers. URL: https://www.random.org/randomness/ [accessed 2018-11-24]

[17] Hernandez, H. (2017). Multicomponent Molecular Collision Kinetics: Rigorous Collision Time Distribution. ForsChem Research Reports 2017-7. doi: 10.13140/RG.2.2.26218.31689.

[18] Hernandez, H. (2017). Standard Maxwell-Boltzmann Distribution: Definition and Properties. ForsChem Research Reports 2017-2. doi: 10.13140/RG.2.2.29888.74244.

Appendix: Selected Algorithms Programmed in R

A.1. Determination of real Moments from a Sample


Mn<-function(datasample,n=0){
#Determination of the "n"-th moment of a sample. n is a real number.
N=length(datasample) #Sample size
Mn=0 #Initialize moment
for (i in 1:N){
Mn=Mn+(datasample[i])^n #Accumulate moment
}
Mn=Mn/N #Averaging over sample size
return(Mn)
}
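#Illustrative sanity check (not part of the original listing): the second raw
#moment of a large standard normal sample should be close to 1 (the variance,
#since the mean is approximately zero).
#Mn(rnorm(10000), n = 2)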

A.2. Determination of the first integer Moments from a Sample


momentset<- function(datasample,nmax=10){ #First 10 moments by default
#Constructing a set of integer moments from 0 to "nmax" for a sample
#Requires the function "Mn" defined in Appendix A.1.
moments=c(1) #Zero-th moment value
n=c(0) #Zero-th moment order
for (i in 1:nmax){ #For each subsequent moment
moments[i+1]=Mn(datasample,i) #n-th moment value
n[i+1]=i #n-th moment order
}
MS=data.frame(n,moments) #Constructing output data frame
return(MS)
}

A.3. Boundaries Estimation from a Sample


bound<-function(datasample){
#Boundaries estimation from a sample
N=length(datasample) #Sample size
xmin=min(datasample) #Minimum sample value
xmax=max(datasample) #Maximum sample value
dx=(xmax-xmin)/(2*N) #Estimation margin
xmin=xmin-dx #Update minimum
xmax=xmax+dx #Update maximum
bounds=c(xmin,xmax) #Output vector
return(bounds)
}

A.4. Error Fitness Assessment


fitness<-function(datasample,PDF,xmin=NULL,xmax=NULL,nsteps=10000){
#Error Fitness Assessment between a given "PDF" function and a "datasample"
#"nsteps" between "xmin" and "xmax" will be used for the assessment
#It uses function "bound" defined in Appendix A.3.
N=length(datasample) #Sample size
s=sort(datasample) #Location of data points
smin=bound(s)[1] #Estimated minimum
smax=bound(s)[2] #Estimated maximum
#Update minimum value if necessary
if (is.null(xmin)==TRUE) xmin=smin
if (smin<xmin) xmin=smin
#Update maximum value if necessary
if (is.null(xmax)==TRUE) xmax=smax
if (smax>xmax) xmax=smax
f=match.fun(PDF) #Definition of pd function
i=1:(nsteps+1) #Step counter
dx=(xmax-xmin)/nsteps #Step size
x=xmin+(i-1)*dx #Location of steps
err=NULL #Initializing error
phi=NULL #Initializing CP
for (j in 1:N){ #For each data point
imax=floor((s[j]-xmin)/dx) #Number of integration steps up to data point
phi[j]=f(xmin)*dx/2 #Initialize CP
for (i in seq_len(imax)){ #Accumulate CP (trapezoidal rule)
phi[j]=phi[j]+(f(x[i])+f(x[i+1]))*dx/2
}
phis_low=length(which(s<s[j]))/N #Lower CP of data point
phis_high=length(which(s<=s[j]))/N #Higher CP of data point
err[j]=0 #Calculate error
if (phi[j]>phis_high) err[j]=phi[j]-phis_high
if (phi[j]<phis_low) err[j]=phis_low-phi[j]
}
maxe=max(err) #Maximum error
avge=mean(err) #Mean error
SSe=sum(err^2) #SS error
OUT=c(maxe,avge,SSe) #Output vector
names(OUT)=c("Maximum fitness error","Mean fitness error","SS fitness error")
return(OUT)
}
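#Illustrative call (not part of the original listing): fitness of the standard
#normal PDF against a small normal sample; errors should be small but nonzero
#due to sampling.
#set.seed(1)
#fitness(rnorm(30), dnorm, xmin = -4, xmax = 4)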

A.5. PDF Similitude Assessment


similitude<-function(PDF1,PDF2,xmin=-10,xmax=10,nsteps=100000){
#PDF Similitude Assessment between a given "PDF1" function and a reference
#"PDF2" function. "nsteps" between "xmin" and "xmax" will be used for the
#assessment. By default the assessment range is from -10 to 10.
i=1:(nsteps+1) #Step counter
x=xmin+(xmax-xmin)*(i-1)/nsteps #Location of steps
f1=match.fun(PDF1) #Definition of pd function 1
f2=match.fun(PDF2) #Definition of pd function 2
rho1=f1(x) #Calculation of pd 1 at each step
rho2=f2(x) #Calculation of pd 2 at each step
rhomin=pmin(rho1,rho2) #Minimum PD
simil=200*sum(rhomin)/(sum(rho1)+sum(rho2)) #Integration and standardization
names(simil)=c("Similitude (%)")
return(simil)
}
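#Illustrative call (not part of the original listing): similitude between two
#normal PDFs that differ only in their means; identical PDFs would give 100%.
#similitude(dnorm, function(x) dnorm(x, mean = 0.5), xmin = -10, xmax = 10)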

A.6. Algorithm for Reconstructing the PDF as the Derivative of a Polynomial CDF
CDFderiv<-function(datasample,degree=5,disp=FALSE){
#This function constructs a PD function from the input data sample
#"datasample", considering a polynomial degree ("degree") for fitting the
#sample CDF. If "disp" is set to TRUE, the polynomial coefficients and
#regression R2 will be shown, and PDF and CDF will be plotted. The output of
#this function is the PDF. It uses function "bound" defined in Appendix A.3.
#Extracting information from the sample
xmin=bound(datasample)[1] #Estimation of minimum value in population
xmax=bound(datasample)[2] #Estimation of maximum value in population
x=sort(datasample)-xmin #Sorted transformed values in the sample
N=length(x) #Sample Size
#Estimates of cumulative probability in the sample
phis=length(which(x<=x[1]))/(2*N)
for (i in 2:N){
phis[i]=(length(which(x<=x[i-1]))+length(which(x<=x[i])))/(2*N)
}
#Polynomial model of the CDF
dataf=data.frame(x,phis) #Constructing data frame
model="phis~-1+x" #Definition of initial regression model
if (degree>=2){ #For higher degree polynomials
for (i in 2:degree){
model=paste(model,"+I(x^",toString(i),")") #Increase degree of the model
}
}
phim=lm(model,data=dataf) #Regression
phicoef=phim$coefficients #Polynomial CDF coefficients
R2=1-var(phim$residuals)/var(phis) #Determination of R2
#Definition of CDF function
phif<-function(X){
phi=0 #Initialize Cumulative Probability
for (i in 1:length(phicoef)){ #For each term in the polynomial
phi=phi+phicoef[i]*((X-xmin)^i) #Add term to CP
}
#CP is clipped to the interval [0,1]
phi=pmin(pmax(phi,0),1)
return(phi)
}
#Derivative of the CDF
i=1:length(phicoef)
rhocoef=i*phicoef #Calculate PDF coefficients

#Definition of PDF function
rhof<-function(X){
rho=0 #Initialize density
for (i in 1:length(rhocoef)){ #For each term in the polynomial
rho=rho+rhocoef[i]*((X-xmin)^(i-1)) #Add term to density
}
#Density is zero when rho is negative or X is beyond boundaries
rho=rho*as.integer(X>=xmin&X<=xmax)*as.integer(rho>0)
return(rho)
}
#Display results
if (disp==TRUE){
print("PDF as derivative of a polynomial CDF")
print(paste("R2 = ",R2*100,"%"))
print(paste("xmin =",toString(xmin)))
print(paste("xmax =",toString(xmax)))
print("Polynomial model of the CDF:")
names(phicoef)[1]="(x-xmin)" #Update coeff. names
if (length(phicoef)>=2){
for (i in 2:length(phicoef)){
names(phicoef)[i]=paste("(x-xmin)^",toString(i)) #Update coeff. names
}
}
print(phicoef)
print("Polynomial model of the PDF:")
#Update coefficient names
names(rhocoef)[1]="(Intercept)"
names(rhocoef)[2:length(rhocoef)]=names(phicoef)[1:(length(phicoef)-1)]
print(rhocoef)
#PDF plot
i=1:1001
y=xmin+(xmax-xmin)*(i-1)/1000
plot(x+xmin,phis,col="red",xlim=c(xmin,xmax),ylim=c(0,1),
xlab="Measurement values",ylab="Cumulative Probability")
lines(y,phif(y),col="blue",lty=2)
rho=rhof(y)
plot(y,rho,type="l",col="blue",xlim=c(xmin,xmax),ylim=c(0,max(rho)),
xlab="Measurement values",ylab="Probability Density")
}
return(rhof)
}

A.7. Algorithm for Reconstructing the PDF using the Naïve Estimator Method
Naive<-function(datasample,nodes=NULL,delta=NULL,disp=FALSE){
#This function constructs a PD function from the input data sample
#"datasample", using a Naïve estimator with a certain number of "nodes" and a
#smoothing factor "dx". By default, the number of nodes is the number of data
#points and the smoothing factor is the optimal bandwidth for Gaussian Kernel
#estimators. If "disp" is set to TRUE, the PDF will be plotted. The output of
#this function is the PDF.
xmin=min(datasample) #Minimum value in sample
xmax=max(datasample) #Maximum value in sample
N=length(datasample) #Sample size
sigma=sd(datasample) #Sample standard deviation
if (is.null(nodes)==TRUE) nodes=N #By default: nodes=Sample size
ndist=(xmax-xmin)/(nodes-1) #Distance between nodes
x=xmin+(0:(nodes-1))*ndist #Node values
if (is.null(delta)==TRUE) delta=sigma*(4/(3*N))^(1/5) #Default: optimal delta

#Estimates of probability density in the sample
rhos=NULL #Initialize density
for (i in 1:nodes){ #For each node calculate finite density
rhos[i]=(length(which(datasample<=(x[i]+delta/2)))-
length(which(datasample<=(x[i]-delta/2))))/(N*delta)
}
#Spline fit of sample finite density
rhosp=splinefun(x,rhos)
#Definition of PDF function
rhof<-function(X){
rho=rhosp(X)
#Density is zero when rho is negative or X is beyond boundaries
rho=rho*as.integer(X>=xmin&X<=xmax)*as.integer(rho>0)
return(rho)
}
#Display option
if (disp==TRUE){
print(paste("xmin =",toString(xmin)))
print(paste("xmax =",toString(xmax)))
i=1:1001
y=xmin+(xmax-xmin)*(i-1)/1000
rho=rhof(y)
plot(y,rho,type="l",col="blue",xlim=c(xmin,xmax),ylim=c(0,max(rho)),
xlab="Measurement values",ylab="Probability Density")
}
return(rhof)
}

A.8. Algorithm for Reconstructing the PDF using the Moment-based Polynomial Fit Method
polyPDF<-function(moments,xmin=NULL,xmax=NULL,scale=1,disp=FALSE){
#This function constructs a PD function from the "moments" input data frame,
#containing the moment order ("n") in one column and their values ("moments")
#in another. The limits of the function ("xmin","xmax") are required inputs.
#A scale factor ("scale") can be used to transform data avoiding singularity.
#By default, "scale" is 1. If "disp" is TRUE, the PDF will be plotted and the
#polynomial coefficients are shown. The output of this function is the PDF.
#Moments can be generated using the "momentset" function given in Appendix A.2.
if (is.null(xmin)|is.null(xmax)){ #Both limits are required
print("Please input xmin and xmax estimated values")
return(NULL)
} else {
n=length(moments$n) #Number of moments available
degree=n-1 #Degree of polynomial
A=matrix(0,n,n) #Initialize matrix of coefficients
B=matrix(0,n,1) #Initialize vector of independent terms
for (i in 1:n){ #For each moment
ni=moments$n[i] #ni-th moment
for (j in 1:n){ #For each power term
A[i,j]=(((xmax*scale)^(ni+j))-((xmin*scale)^(ni+j)))/(ni+j)
}
B[i]=moments$moments[i]*scale^ni #Scale moments
}
a=as.vector(solve(A,B)) #Find coefficients
#Definition of PDF function
rhof<-function(x){
rho=a[1] #Independent term
for (i in 2:(degree+1)){
rho=rho+a[i]*(x*scale)^(i-1) #Polynomial terms
}

#Density is zero when rho is negative or X is beyond boundaries
rho=rho*scale*as.integer(x>=xmin&x<=xmax)*as.integer(rho>0)
return(rho)
}
if (disp==TRUE){
#Set coefficients names
names(a)[1:2]=c("(Intercept)","x")
if (length(a)>2){
for (i in 3:length(a)){
names(a)[i]=paste("x^",toString(i-1))
}
}
print("Polynomial model of the PDF:")
print(a)
#PDF plot
i=1:1001
y=xmin+(xmax-xmin)*(i-1)/1000
rho=rhof(y) #Calculate density
plot(y,rho,type="l",col="blue",xlim=c(xmin,xmax),ylim=c(0,max(rho)),
xlab="Measurement values",ylab="Probability Density")
}
}
return(rhof)
}

A.9. General Inversion Algorithm for Reconstructing a PDF from Data


invPDF<-function(datasample=NULL,moments=NULL,method="sample",maxdegree=5,
xmin=NULL,xmax=NULL,scale=1,tol=0.02,disp=FALSE){
#This function constructs a PD function from the input data, either a data
#sample vector or a data frame with the moments of the distribution. When a
#data sample is used, a "sample"-based or a "moment"-based "method" can be
#used. The limits of the function ("xmin","xmax") are required when only the
#moments are given. A scale factor ("scale") can be used to transform data
#avoiding singularity. A tolerance "tol" is used for evaluating significant
#improvements in the search of the optimum degree. By default, "scale" is 1 and
#"tol" is 0.02. If "disp" is TRUE, the PDF will be plotted and the polynomial
#coefficients are shown. The output of this function is the PDF. This function
#requires the functions CDFderiv, Naive and polyPDF (Appendices A.6, A.7 and
#A.8), and also uses bound, momentset and similitude (Appendices A.3, A.2, A.5).
if (method=="sample"){
if (is.null(datasample)==TRUE){
if (is.null(moments)==TRUE){
print("Please input a data sample as a vector")
return(NULL)
} else {
method="moments"
}
} else {
xmin=min(xmin,bound(datasample)[1]) #Verify minimum value
xmax=max(xmax,bound(datasample)[2]) #Verify maximum value
rhon=Naive(datasample) #Naive estimator
similbest=0 #Initialize best similitude
for (i in 1:maxdegree){
#Estimate PDF by derivative of CDF
rhof=CDFderiv(datasample,degree=i)
#Evaluate similitude to Naive
simil=similitude(rhon,rhof,xmin=xmin,xmax=xmax)
if (simil>(1+tol)*similbest){ #Update optimal model
similbest=simil

optdegree=i
}
}
rhof=CDFderiv(datasample,degree=optdegree,disp=disp)
return(rhof)
}
}
if (method=="moments"){
if (is.null(moments)==TRUE){
if (is.null(datasample)==TRUE){
print("Please input a data sample as a vector or a moments data frame")
return(NULL)
}
moments=momentset(datasample,maxdegree) #Calculate moments from sample
}
if (is.null(xmin)|is.null(xmax)){ #Both limits are required
print("Please input xmin and xmax estimated values")
return(NULL)
}
SSMbest=Inf #Initialize best sum of squared differences
for (mc in 2:min(nrow(moments),maxdegree)){
testmoments=moments[1:mc,] #Set of test moments
#Identify polynomial PDF
rhof=polyPDF(testmoments,xmin=xmin,xmax=xmax,scale=scale,disp=FALSE)
predM=testmoments$moments #Initialize set of predicted moments
i=1:1001
y=xmin+(xmax-xmin)*(i-1)/1000 #Definition of evaluation points
rho=rhof(y) #Density at evaluation points
SSM=0 #Initialize sum of squared differences
for (i in 1:nrow(moments)){ #For each moment
predM[i]=sum(y^moments$n[i]*rho)*(xmax-xmin)/1000
if (moments$n[i]!=0){
SSM=SSM+(moments$moments[i]^(1/moments$n[i])-
predM[i]^(1/moments$n[i]))^2
}
}
if (SSM<(1-tol)*SSMbest){ #If SSM improves update optimum
SSMbest=SSM
mombest=testmoments
predMbest=predM
}
}
if (disp==TRUE){
plot(moments$n,moments$moments,col="red",
xlab="Moment order",ylab="Moment value")
points(moments$n,predMbest,col="blue")
}
rhof=polyPDF(mombest,xmin=xmin,xmax=xmax,scale=scale,disp=disp)
return(rhof)
} else {
print("Please input a valid method: sample or moments")
return(NULL)
}
}
