René Andrae
Max-Planck-Institut für Astronomie, Königstuhl 17, 69117 Heidelberg, Germany
e-mail: andrae@mpia-hd.mpg.de
or graduate) students have to teach such methods for error estimation to themselves when
working scientifically for the first time. This manuscript presents an easy-to-understand
overview of different methods for error estimation that are applicable to both model-based
and model-independent parameter estimates. These methods are not discussed in detail,
but their basics are briefly outlined and their assumptions carefully noted. In particular,
the methods for error estimation discussed are grid search, varying χ2 , the Fisher matrix,
Monte-Carlo methods, error propagation, data resampling, and bootstrapping. Finally, a
method for propagating measurement errors through complex data-reduction pipelines is
outlined.
1 Introduction
This manuscript is intended as a guide to error estimation for parameter estimates in astron-
omy. I try to explain several different approaches to this problem, where the emphasis is on
highlighting the diversity of approaches and their individual assumptions. Making those as-
sumptions explicitly clear is one of the major objectives, because using a certain method in a
situation where its assumptions are not satisfied will result in incorrect error estimates. As this
manuscript is just an overview, the list of methods presented is by no means complete.
A typical task in scientific research is to make measurements of certain data and then to
draw inferences from them. Usually, the inference is not drawn directly from the data but
rather from one or more parameters that are estimated from the data. Here are two examples:
• Apparent magnitude of stars or galaxies. Based on a photometric image, we need to
estimate the parameter “flux” of the desired object, before we can infer its apparent
magnitude.
• Radial velocity of stars. First, we need to take a spectrum of the star and identify appro-
priate emission/absorption lines. We can then estimate the parameter “radial velocity”
from fitting these spectral lines.
Whenever such parameter estimates are involved, it is also crucial to estimate the error of the
resulting parameter.
What does a parameter estimate and its error actually signify? More simply, what is the
meaning of an expression such as 4.3 ± 0.7? This question will be answered in detail in Sects.
2.3 and 2.5, but we want to give a preliminary answer here for the sake of motivation. The
crucial point is that the true result of some parameter estimate is not something like 4.3 ± 0.7,
but rather a probability distribution for all possible values of this parameter. An expression
like 4.3 ± 0.7 is nothing more than an attempt to encode the information contained in this
René Andrae (2010) – Error estimation in astronomy: A guide
probability distribution in a simpler way, where the details of this “encoding” are given
by some general standards (cf. Sect. 2.5). Put simply, the value 4.3 signifies the maximum of
the probability distribution (the most likely value), whereas the “error” 0.7 signifies the width
of the distribution. Hence, the value 4.3 alone contains insufficient information, since it does
not enable us to reconstruct the probability distribution (the true result). More drastically:
a parameter value without a corresponding error estimate is meaningless. Error estimation
is therefore as important an ingredient in scientific work as the parameter estimation itself.
Unfortunately, a profound and compulsory statistical education is missing from many university
curricula. Consequently, when (undergraduate or graduate) students are faced with these
problems for the first time during their research, they need to teach the subject to themselves.
This is often not very efficient, and usually the student focuses on a certain method without
gaining a broader overview. The author’s motivation was to support this process by providing
such an overview.
Where do uncertainties stem from? An important source of uncertainties is of course the
measured dataset itself, but models can also give rise to uncertainties.
We obviously have to differentiate between random and systematic errors, i.e., between vari-
ance/scatter and bias/offset. Systematic errors are usually very hard to identify and to correct
for. However, they are not treated in this manuscript, since individual solutions depend strongly
on the specific problem. Instead, different methods of estimating random errors (variance/scatter)
are considered, i.e., the quantities that determine the size of error bars or, more generally,
error contours. Only error estimation for parameter estimates is described; error
estimation for classification problems is not discussed.
This manuscript will not be submitted to any journal for two reasons: First, its content
is not genuinely new but a compilation of existing methods. Second, its subject is statistical
methodology rather than astronomy. Any comments that may improve this manuscript are
explicitly welcome.
2 Preliminaries
Before diving into the different methods for error estimation, some preliminaries should be
briefly discussed: first, the terminology used throughout this manuscript; second, the errors
of measured data. Third, the basics of parameter estimation are briefly explained, including the
introduction of the concept of a likelihood function. Fourth, the central-limit theorem is discussed.
Finally, the concept of confidence intervals, which are the desired error estimates, is introduced.
2.1 Terminology
This manuscript is about “error estimation for parameter estimates”. The first step is usually
to measure some data and also to measure its error or uncertainty (Sect. 2.2). Given this
measurement, the task is then to estimate some parameter (Sect. 2.3). The estimate of the
parameter θ is denoted by a hat, θ̂, which is common practice in statistics. Here, I want
to introduce the concept of a qualitative difference between “measuring” and “estimating”:
Measurements are outcomes of a real experiment. Conversely, estimates are inferences from
measurements, i.e., they are not directly related to experiments. Although this difference is
not of vital importance, both terms are rigorously differentiated throughout this manuscript in
order to make clear what is being referred to.
Another issue of terminology concerns the words “error” and “uncertainty”. As mentioned
in the introduction, systematic errors are not considered here, and hence both terms may
be used more or less synonymously.1 There is also a third word of the same family, namely
“noise”, which could also be used synonymously. However, personally I would use the word
“noise” only in the context of measurements, whereas in the context of parameter estimates
the word “uncertainty” appears to be most natural.
prob(n|µ, σ) = (1/√(2πσ²)) exp[−(n − µ)²/(2σ²)] ,   (2)
which has much more convenient analytical properties than the Poisson distribution, as we
shall see in Sections 2.3, 3.2, and 3.3. Figure 2 shows a Poisson distribution with mean µ = 10
and a Gaussian with mean µ = 10 and variance σ² = µ = 10. Evidently, for a mean of only
ten photon counts per pixel, the actual Poisson distribution is already nicely approximated by a
Gaussian. This is usually a valid approximation in the optical regime and at longer
wavelengths, whereas in the high-energy regime (UV, X-ray, gamma) fewer than ten photon
counts per pixel are not unheard of.
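The quality of this approximation at µ = 10 can be checked directly; the following Python sketch compares the Poisson probabilities with the Gaussian of Eq. (2):

```python
import math

def poisson_pmf(n, mu):
    """Probability of counting n photons, given a Poisson distribution with mean mu."""
    return mu**n * math.exp(-mu) / math.factorial(n)

def gauss_pdf(n, mu, var):
    """Gaussian approximation of Eq. (2), evaluated at n with variance var = mu."""
    return math.exp(-0.5 * (n - mu)**2 / var) / math.sqrt(2.0 * math.pi * var)

mu = 10.0
# Largest pointwise difference between the two distributions for n = 0..20.
max_diff = max(abs(poisson_pmf(n, mu) - gauss_pdf(n, mu, mu)) for n in range(21))
```

For µ = 10 the largest pointwise difference stays below about 0.01, i.e., less than a tenth of the peak probability; in the few-count regime the two curves diverge markedly.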
xn   σn     xn   σn     xn   σn     xn   σn     xn   σn
 7   3.18   12   3.18   12   3.08   11   2.87    8   3.32
10   3.45    9   3.14   11   3.41    7   3.07   11   2.99
11   2.92   12   3.12   13   3.32    7   2.89    9   3.08
10   3.14   13   3.03    9   3.44   12   3.06    9   3.18
 8   3.43   10   3.12   10   3.31   10   2.93    9   3.40
14   2.85   11   3.07   12   3.21    6   2.90    9   3.01
Table 1: Data sample used as a standard example for all methods. All data points xn are
sampled from a Poisson distribution with mean µ = 10 (cf. Fig. 2). The columns labelled σn
give the Gaussian standard deviations for each data point xn, used for the cases where the
error distribution is assumed to be Gaussian.
of parameters that maximise the likelihood function is called the maximum-likelihood estimate
and it is usually denoted by θ̂.
A somewhat philosophical note: the likelihood function as given by Eq. (3) is actually what
every parameter estimation is aiming for. This function, L(D; M, θ), contains all important
information about the data and the model, a statement known as the likelihood principle.
However, Eq. (3) is just an abstract definition and even a more specific example (e.g. Eqs. (4)
and (5)) usually does not provide more insight. Therefore, one has to extract the information
from Eq. (3) in some way. If the model under consideration has only one or two model
parameters, it is possible to plot the likelihood function directly (e.g. Fig. 8), without involving
any optimisation procedure. Although such a plot is actually the final result of the parameter-
estimation process, people (including myself) are usually happier giving “numbers”. Moreover,
if a model has more than two parameters, the likelihood function cannot be plotted anymore.
Hence, the standard practice of encoding the information contained in the likelihood function
is to identify the point in parameter space where the likelihood function takes its maximum
(the maximum-likelihood estimate) and to infer the “width” of the function at its maximum
(the uncertainty). Unless stated otherwise, these two quantities usually signify the mean
value and the standard deviation of a Gaussian. Consequently, if both values are provided, one
can reconstruct the full likelihood function.
In order to “give some flesh” to the rather abstract concept of a likelihood function, two
simple examples of parameter estimation are now discussed. This allows us to see this concept
and the Poisson and Gaussian distributions “in action”. Table 1 also introduces the data sample
that will be used to demonstrate every method that is discussed using actual numbers.
There are two reasons for maximising log L instead of L. First, log L sometimes takes a much
more convenient mathematical form, enabling us to solve the maximisation problem analyti-
cally, as we shall see immediately. Second, L is a product of N potentially small numbers. If
N is large this can cause a numerical underflow in the computer. As the logarithm is a strictly
monotonic function, the maxima of L and log L will be identical. The logarithmic likelihood
function is given by
log L(D; µ) = Σ_{n=1}^{N} log prob(xn|µ, σn) .   (6)

Inserting the Gaussian of Eq. (2) for prob(xn|µ, σn) yields

log L(D; µ) = −(1/2) Σ_{n=1}^{N} (xn − µ)²/σn² + C ,   (7)

where C encompasses all terms that do not depend on µ and are therefore constants during
the maximisation problem. We can now identify the sum

χ² = Σ_{n=1}^{N} (xn − µ)²/σn² ,   (8)
such that log L(D; µ) = −χ²/2 + C. In other words, maximising the likelihood function in the
case of Gaussian noise is equivalent to minimising χ².6 In order to estimate µ, we now take the
first derivative of Eq. (7) or Eq. (8) with respect to µ, set it equal to zero, and try to solve the
resulting equation for µ. The first derivative of Eq. (7) set to zero then reads
d log L(D; µ)/dµ = Σ_{n=1}^{N} (xn − µ)/σn² = 0 ,   (9)
Solving this equation for µ yields the estimator

µ̂ = [Σ_{n=1}^{N} xn/σn²] / [Σ_{n=1}^{N} 1/σn²] .   (10)

This estimator is a weighted mean which underweights data points with large measurement
errors, i.e., data points that are very uncertain. For the example data set of Table 1 we get
µ̂ ≈ 10.09. This result can be simplified by assuming that all data points have identical standard
deviations, i.e., σn = σ for all xn. Our result then reads

µ̂ = (1/N) Σ_{n=1}^{N} xn ,   (11)

which is simply the arithmetic mean. For the example data set of Table 1 we then get µ̂ ≈ 10.07.
The derivation of the corresponding error estimation of µ̂ is postponed to Sect. 3.3.
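Both numbers are easy to reproduce. The following Python sketch (values read column-pair by column-pair from Table 1) applies the weighted mean of Eq. (10) and the arithmetic mean of Eq. (11):

```python
# Data points x_n and their Gaussian errors sigma_n from Table 1.
x = [7, 10, 11, 10, 8, 14, 12, 9, 12, 13, 10, 11,
     12, 11, 13, 9, 10, 12, 11, 7, 7, 12, 10, 6,
     8, 11, 9, 9, 9, 9]
s = [3.18, 3.45, 2.92, 3.14, 3.43, 2.85, 3.18, 3.14, 3.12, 3.03, 3.12, 3.07,
     3.08, 3.41, 3.32, 3.44, 3.31, 3.21, 2.87, 3.07, 2.89, 3.06, 2.93, 2.90,
     3.32, 2.99, 3.08, 3.18, 3.40, 3.01]

# Eq. (10): weighted mean, down-weighting uncertain data points.
w = [1.0 / sn**2 for sn in s]
mu_weighted = sum(wn * xn for wn, xn in zip(w, x)) / sum(w)

# Eq. (11): arithmetic mean, all weights identical.
mu_plain = sum(x) / len(x)
```

Running this recovers µ̂ ≈ 10.09 for the weighted and µ̂ ≈ 10.07 for the arithmetic mean.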
6 Actually, we should say that minimising χ² provides the correct estimator if and only if the error distribution
of the data is Gaussian. If the error distribution is not Gaussian, then minimising squared residuals may well
be plausible, but it is not justified.
where C again summarises all terms that do not depend on µ. Taking the first derivative with
respect to µ and setting it to zero yields
d log L(D; µ)/dµ = −N + (1/µ) Σ_{n=1}^{N} xn = 0 .   (14)

Solving for µ yields

µ̂ = (1/N) Σ_{n=1}^{N} xn ,   (15)

which is the arithmetic mean, again. Do not be misled by the fact that the result is identical
for the Gaussian and the Poisson distribution in this example. In general, estimating a
certain quantity under different assumed error distributions also results in different estimators.
where C contains everything that does not depend on f . The first derivative of log L w.r.t. f
reads,
∂ log L(D; f)/∂f = n/f − (N − n)/(1 − f) .   (18)
Equating this to zero and solving for f yields the maximum-likelihood estimator
f̂ = n/N ,   (19)
provided that neither n nor f nor N is zero.7 We will consider this example in two flavours:
1. N = 10 and n = 0, i.e., the given sample is very small and contains no special objects.
An error estimate enables us to assess whether this rules out the existence of these special
objects.
2. N = 30 and n = 4.
Figure 3 shows the binomial likelihood functions for both cases. An example where a binomial
distribution shows up in astronomy can, e.g., be found in Cisternas et al. (2010).
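Both likelihood functions are easy to explore without any analytic work; the following Python sketch evaluates the binomial log-likelihood on a brute-force grid and recovers the maximum f̂ = n/N, including the boundary maximum f̂ = 0 of example 3.1:

```python
import math

def binom_log_like(f, n, N):
    """Logarithmic binomial likelihood of the fraction f, up to an f-independent constant."""
    if f == 0.0:
        return 0.0 if n == 0 else -math.inf   # limit of the n*log(f) term
    if f == 1.0:
        return 0.0 if n == N else -math.inf
    return n * math.log(f) + (N - n) * math.log(1.0 - f)

grid = [i / 1000.0 for i in range(1001)]

# Example 3.1: N = 10, n = 0 -- the maximum sits at the boundary f = 0.
f_hat0 = max(grid, key=lambda f: binom_log_like(f, 0, 10))

# Example 3.2: N = 30, n = 4 -- the maximum sits at f = n/N.
f_hat = max(grid, key=lambda f: binom_log_like(f, 4, 30))
```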
Figure 5: Confidence intervals for the Gaussian distribution of mean ⟨θ⟩ and standard deviation
σ. If we draw N values of θ from a Gaussian distribution, 68.3% of the values will be inside
the interval [⟨θ⟩ − σ, ⟨θ⟩ + σ] as shown in panel (a), whereas 95.5% of the values will be inside
the interval [⟨θ⟩ − 2σ, ⟨θ⟩ + 2σ] as shown in panel (b).

Figure 6: Different types of 68.3% confidence intervals for a multimodal likelihood function.
The vertical dashed red line indicates the maximum-likelihood estimate θ̂. The panels are
numbered according to the definitions in the main text.
If we draw a sample value θ from a Gaussian with mean ⟨θ⟩ and standard deviation σ, e.g., by
trying to estimate ⟨θ⟩ from measured data, the deviation |θ − ⟨θ⟩| will be smaller than 1σ with
68.3% probability, and it will be smaller than 2σ with 95.5% probability, etc. In simple words,
if we fit some function to N data points with Gaussian errors, we have to expect that 31.7% of
all data points deviate from this fit by more than one sigma.8
The Gaussian is an almost trivial example, due to its symmetry around the mean. In
general, likelihood functions need not be symmetric, so it should be explained how to define
confidence intervals in such cases. For asymmetric distributions, the mean and the maximum
position do not coincide (e.g. the Poisson distribution). The actual parameter estimate θ̂ is the
maximum-likelihood estimate, i.e., it indicates the maximum of the likelihood function, not its
mean. We define the confidence interval θ− ≤ θ̂ ≤ θ+ for a given distribution function prob(θ)
via (e.g. Barlow 1993)
prob(θ− ≤ θ̂ ≤ θ+) = ∫_{θ−}^{θ+} dθ prob(θ) = C ,   (21)
where usually C = 0.683 in analogy to the one-sigma-interval of the Gaussian. In practice, the
distribution function prob(θ) is usually unknown and only given as a histogram of samples of
θ. In this case, the integral in Eq. (21) reduces to the fraction of all samples θ that are between
θ− and θ+. However, Eq. (21) does not uniquely define the confidence interval; an additional
criterion is required. Possible criteria are (e.g. Barlow 1993):
1. Symmetric interval: θ− and θ+ are symmetric around the parameter estimate, i.e., θ̂−θ− =
θ+ − θ̂.
2. Shortest interval: θ+ − θ− is smallest for all intervals that satisfy Eq. (21).
8 If you are presented with fitted data where the fit passes through all 1σ error bars, you should definitely be sceptical.
Table 2: Confidence cP contained inside the one-sigma contour of a P-dimensional Gaussian, cf. Eq. (23).

P    1       2       3       4       5       6       7       8       9       10
cP   0.6827  0.3935  0.1988  0.1149  0.0715  0.0466  0.0314  0.0217  0.0152  0.0109
3. Central interval: The probabilities above and below the interval are equal, i.e.,
∫_{−∞}^{θ−} dθ prob(θ) = ∫_{θ+}^{∞} dθ prob(θ) = (1 − C)/2.
In case of a symmetric distribution, e.g., a Gaussian, all three definitions are indeed equivalent.
However, in general they lead to different confidence intervals. Figure 6 shows the 68.3%
confidence intervals9 resulting from the three definitions for an example distribution that could
be a likelihood function resulting from a parameter estimate. In practice, there is usually no
preference for any of these definitions;10 it should only be made explicitly clear which one is
used.
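In practice one usually has samples of θ rather than prob(θ) itself. The following Python sketch computes the central and the shortest 68.3% interval from a hypothetical sample, drawn here from a Gaussian so that both should recover roughly [4.3 − 0.7, 4.3 + 0.7]:

```python
import random

random.seed(42)
# Hypothetical samples of theta: a Gaussian of mean 4.3 and width 0.7,
# mimicking a result that would be quoted as "4.3 +/- 0.7".
samples = sorted(random.gauss(4.3, 0.7) for _ in range(100000))
C = 0.683

def central_interval(sorted_samples, C):
    """Central interval: probability (1 - C)/2 lies below and above the interval."""
    n = len(sorted_samples)
    lo = sorted_samples[int(n * (1.0 - C) / 2.0)]
    hi = sorted_samples[int(n * (1.0 + C) / 2.0)]
    return lo, hi

def shortest_interval(sorted_samples, C):
    """Shortest interval containing a fraction C of the samples."""
    n = len(sorted_samples)
    k = int(n * C)
    width, i = min((sorted_samples[j + k] - sorted_samples[j], j) for j in range(n - k))
    return sorted_samples[i], sorted_samples[i + k]

lo_c, hi_c = central_interval(samples, C)
lo_s, hi_s = shortest_interval(samples, C)
```

For an asymmetric or multimodal sample (cf. Fig. 6), the two functions return visibly different intervals.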
provide ellipsoidal error contours, i.e., they are capable of describing linear correlations in the
parameters, but not nonlinear correlations such as “banana-shaped” error contours. However,
even in this simple case, things are complicated. The reason for this is that the one-sigma
contour no longer marks a 68.3% confidence region as in Fig. 5. It is straightforward to
compute that the one-sigma contour of a two-dimensional Gaussian marks a 39.4% confidence
region, whereas in three dimensions it marks just a 19.9% confidence region.11 In general, the
confidence cP contained inside a one-sigma contour of a P-dimensional Gaussian with P > 1 is
given by,
Z 1
1 P −2 2
cP = P/2
2π 2 dr rP −1 er /2 . (23)
(2π) 0
Table 2 gives cP for P -dimensional Gaussians with P ≤ 10, in order to give an impression of
how quickly the confidence declines. Evidently, one needs to be very careful when interpreting
one-sigma contours in more than one dimension.
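Equation (23) is straightforward to evaluate numerically; the following Python sketch uses Simpson's rule for the radial integral and reproduces the one-, two-, and three-dimensional confidences quoted in the text:

```python
import math

def one_sigma_confidence(P, steps=4000):
    """Confidence c_P inside the one-sigma contour of a P-dimensional Gaussian, Eq. (23).
    The radial integral is evaluated with Simpson's rule; steps must be even."""
    prefactor = 2.0 * math.pi**(0.5 * P) / ((2.0 * math.pi)**(0.5 * P) * math.gamma(0.5 * P))
    f = lambda r: r**(P - 1) * math.exp(-0.5 * r * r)
    h = 1.0 / steps
    s = f(0.0) + f(1.0)
    for i in range(1, steps):
        s += (4.0 if i % 2 else 2.0) * f(i * h)
    return prefactor * s * h / 3.0

c1, c2, c3 = one_sigma_confidence(1), one_sigma_confidence(2), one_sigma_confidence(3)
```

This recovers the familiar 68.3% for P = 1; already for P = 3 less than 20% of the probability mass lies inside the one-sigma contour.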
9 Note our terminology: We are talking of a “68.3% confidence interval”, not of a “one-sigma interval”.
10 A symmetric confidence interval may not be sensible in case of a highly asymmetric likelihood function.
As a nice example, consider the likelihood function of example 3.1 shown in Fig. 3. Furthermore, the central
interval would cause the “maximum” at f̂ = 0 to lie outside this confidence interval.
11 In order to obtain the two-dimensional result, solve the integral ∫_0^{2π} dϕ ∫_0^{σ} dr r (1/(2πσ²)) exp[−r²/(2σ²)],
which assumes a spherically symmetric Gaussian given in polar coordinates.
If the central-limit theorem does not apply – e.g., because the number N of measured data
is small or the likelihood function itself is not well-behaved – things get even more involved.
Nonlinear correlations in the parameters, i.e., “banana-shaped” error contours, are an obvious
indicator for this case. The symmetric confidence region can still be defined easily, but it
obviously lacks the ability to describe parameter correlations. Identifying the “shortest region”
or “central region” may be computationally very expensive. Barlow (1993) recommends
defining the confidence region via the contour at which the logarithmic likelihood is 0.5
lower than at its maximum, i.e., where the likelihood function takes e^{−1/2} ≈ 61% of its
maximum value. However, the degree of confidence of the resulting region strongly depends on
the number of dimensions, similarly to the Gaussian case discussed above.
As an alternative approach I now discuss a method that is designed to provide a 68.3%
confidence region. Similarly to the method recommended by Barlow (1993), it employs contours
on which the likelihood function is constant. The recipe is as follows:
3.2 Varying χ²
χ² has already been introduced in Eq. (8). Let us assume that the model parameters were
estimated by minimising χ². We then vary the model parameters around their optimal
values until χ² has increased by 1. In other words, we look for the contour where
χ² = χ²_min + 1, thereby defining the error estimate of the model parameters. The basic idea
here is that if the likelihood function were Gaussian, this would yield the 1σ contour.12
12 This is motivated by the central-limit theorem, but the Gaussian approximation needs to be checked.
The crucial assumption of this method is that the error distribution of the measured data
is indeed Gaussian, because otherwise using χ² does not make sense (Sect. 2.3.1). Moreover,
this method relies on an accurate measurement of the data errors σn.
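For the one-parameter standard example (the mean µ of the Table 1 data with Gaussian errors), the method reduces to a simple grid scan; the following Python sketch locates the χ²_min + 1 interval:

```python
# Standard example (Table 1): data x_n with Gaussian errors sigma_n.
x = [7, 10, 11, 10, 8, 14, 12, 9, 12, 13, 10, 11,
     12, 11, 13, 9, 10, 12, 11, 7, 7, 12, 10, 6,
     8, 11, 9, 9, 9, 9]
s = [3.18, 3.45, 2.92, 3.14, 3.43, 2.85, 3.18, 3.14, 3.12, 3.03, 3.12, 3.07,
     3.08, 3.41, 3.32, 3.44, 3.31, 3.21, 2.87, 3.07, 2.89, 3.06, 2.93, 2.90,
     3.32, 2.99, 3.08, 3.18, 3.40, 3.01]

def chi2(mu):
    """Eq. (8): sum of error-weighted squared residuals."""
    return sum((xn - mu)**2 / sn**2 for xn, sn in zip(x, s))

# Fine grid scan of mu in [8, 12].
grid = [8.0 + 0.001 * i for i in range(4001)]
values = [chi2(mu) for mu in grid]
chi2_min = min(values)
mu_hat = grid[values.index(chi2_min)]

# All mu with chi2 <= chi2_min + 1 define the error interval.
inside = [mu for mu, v in zip(grid, values) if v <= chi2_min + 1.0]
sigma_mu = 0.5 * (inside[-1] - inside[0])
```

This recovers µ̂ ≈ 10.09 with an uncertainty of about 0.57, consistent with the weighted-mean estimate of Eq. (10).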
parameters, but also an uncertainty in the value of χ² itself. The value of χ² is subject to
the so-called χ²-distribution (e.g. see Barlow 1993), whose expectation value is indeed N − P.
However, this distribution is not sharp but has a nonzero variance of 2(N − P). Consequently,
if N − P is small, there is a large relative uncertainty in the value of χ². This means χ² may
deviate substantially from N − P even though the model is linear and correct.
which is a P-dimensional (P-variate) Gaussian with mean θ⃗₀ and covariance matrix Σ. This
covariance matrix is the desired error estimate. On its diagonal it contains the variance
estimates of each individual θp, and the off-diagonal elements are the estimates of the covariances
(see e.g. Barlow 1993 for more information about covariances). Comparing Eqs. (20) and (25), we
identify
Σ̂ = −[∂² log L/(∂θi ∂θj)]⁻¹ .   (26)
Care should be taken with the order of indices and matrix inversion. The matrix of second
derivatives of log L is called “Fisher matrix”. If the second derivatives of log L can be evaluated
analytically, this method may be extremely fast from a computational point of view. However,
if this is impossible, they can usually also be evaluated numerically. By construction, this
method can only describe elliptical error contours. It is impossible to obtain “banana-shaped”
error contours from this method.
Of course, this method also invokes assumptions that have to be checked. Those assumptions
are:
1. The error distribution of the measurements is known, i.e., the likelihood function is defined
correctly.

2. The likelihood function can be approximated by a Gaussian at its maximum.
This second assumption is the actual problem. Although the central-limit theorem ensures this
asymptotically, Fig. 4 shows an example where this assumption of the Fisher matrix breaks
down. There are two simple tests to check the validity of the resulting covariance-matrix
candidate Σ̂. A valid covariance matrix has to be positive definite, i.e., x⃗ᵀ · Σ̂ · x⃗ > 0 for any
nonzero vector x⃗, and both tests try to check this:
The first test is usually easier to perform, whereas the second test is more restrictive. It is
strongly recommended that these tests are applied whenever this method is used. Unfortu-
nately, these tests are only rule-out criteria. If Σ̂ fails any of these tests, it is clearly ruled
out. However, if it passes both tests, we still cannot be sure that Σ̂ is a good approximation,
i.e., that the Gaussian is indeed a decent approximation to the likelihood function at its maxi-
mum. Nevertheless, the major advantage of this method is that it is very fast and efficient, in
particular if we can evaluate the second derivatives of log L analytically.
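For the one-parameter standard example the Fisher "matrix" is a single number, which can also be obtained by finite differences; a minimal Python sketch (the step size h = 10⁻³ is an arbitrary choice):

```python
import math

# Standard example (Table 1): data x_n with Gaussian errors sigma_n.
x = [7, 10, 11, 10, 8, 14, 12, 9, 12, 13, 10, 11,
     12, 11, 13, 9, 10, 12, 11, 7, 7, 12, 10, 6,
     8, 11, 9, 9, 9, 9]
s = [3.18, 3.45, 2.92, 3.14, 3.43, 2.85, 3.18, 3.14, 3.12, 3.03, 3.12, 3.07,
     3.08, 3.41, 3.32, 3.44, 3.31, 3.21, 2.87, 3.07, 2.89, 3.06, 2.93, 2.90,
     3.32, 2.99, 3.08, 3.18, 3.40, 3.01]

def log_like(mu):
    """Gaussian logarithmic likelihood, cf. Eq. (7), including the constant terms."""
    return sum(-0.5 * (xn - mu)**2 / sn**2 - 0.5 * math.log(2.0 * math.pi * sn**2)
               for xn, sn in zip(x, s))

mu_hat = 10.09   # maximum-likelihood estimate from Sect. 2.3
h = 1e-3
# 1x1 Fisher 'matrix': second derivative of log L via central finite differences.
d2 = (log_like(mu_hat + h) - 2.0 * log_like(mu_hat) + log_like(mu_hat - h)) / h**2
# Positive definiteness of the 1x1 case simply means d2 < 0.
sigma_mu = math.sqrt(-1.0 / d2)
```

The resulting σ̂µ ≈ 0.57 agrees with the Δχ² = 1 interval of Sect. 3.2, as it must for this linear model with Gaussian errors.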
There are also situations where the Fisher matrix is definitely correct. This is the case for
Gaussian measurement errors and linear models. In this case L(θ⃗) is truly a Gaussian even
without any approximation. For instance, inspect Eq. (8): χ² is a quadratic function of µ,
and since L(µ) ∝ e^{−χ²/2}, the likelihood function is a Gaussian.
For the weighted mean of Eq. (10), this yields the error estimate

σ̂µ² = [Σ_{n=1}^{N} 1/σn²]⁻¹ .   (28)

If all N data points are again assumed to have identical errors, σn = σ, this simplifies to
σ̂µ² = σ²/N, which is the error estimate for the arithmetic mean in the case of Gaussian
measurement errors.
σ̂µ² = µ̂² / Σ_{n=1}^{N} xn = µ̂/N ,   (30)

where Σ_{n=1}^{N} xn = N µ̂ has been identified according to the estimator of Eq. (15).
d² log L/df² = −n/f² − (N − n)/(1 − f)² = −N³/[n(N − n)] ,   (31)

where we have inserted the maximum-likelihood estimator f̂ = n/N. Hence, the error estimate
pretends to be

σf² = n(N − n)/N³ ,
which yields σf² = 0 in example 3.1 and σf² ≈ 0.004 in example 3.2. Here we are lucky, because
σf² = 0 tells us that we forgot that in example 3 the likelihood function is binomial and not
Gaussian, i.e., this whole calculation was nonsense. Unfortunately, the problem is not necessarily
this obvious, as the result for example 3.2 shows.
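The pathological collapse is easy to reproduce:

```python
def fisher_variance(n, N):
    """Naive Fisher-matrix error estimate for the binomial fraction f = n/N,
    i.e., the negative inverse of Eq. (31)."""
    return n * (N - n) / N**3

var_ex1 = fisher_variance(0, 10)   # example 3.1: collapses to zero
var_ex2 = fisher_variance(4, 30)   # example 3.2: looks deceptively plausible
```

The zero variance of example 3.1 flags the breakdown of the Gaussian approximation; example 3.2 gives no such warning.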
F̂ = Σ_{i=1}^{N} fi .   (32)
As argued in Sect. 2.2, the error distribution in photometric images is, to excellent
approximation, Gaussian if the exposure time was long enough. If we denote the measurement error
of pixel i by σi, we can then estimate the error of F̂ via
σ̂F² = Σ_{i=1}^{N} (∂F̂/∂fi)² σi² = Σ_{i=1}^{N} σi² ,   (33)
which is fairly simple in this case. However, this can become very cumbersome for more general
model-independent parameters. In particular, it is impossible if a model-independent parameter
involves an operation on the measured data that is not differentiable, e.g., selecting certain data
points. In fact, error propagation can also be applied to model-based parameter estimates, if
these estimators can be expressed as differentiable functions of the data, e.g., as is the case
for linear models with Gaussian measurement errors. For instance, Equation (28) can also be
derived from Eq. (10) using this method, giving the same result for the example data of Table
1. However, in general, this is not the case.
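For instance, a minimal sketch of Eqs. (32) and (33) for a hypothetical 3 × 3-pixel aperture (the fluxes and per-pixel errors below are invented for illustration):

```python
import math

def flux_and_error(pixel_fluxes, pixel_errors):
    """Total flux F = sum of f_i (Eq. 32) and its propagated error (Eq. 33)."""
    F = sum(pixel_fluxes)
    # dF/df_i = 1 for every pixel, so the pixel variances simply add.
    sigma_F = math.sqrt(sum(e**2 for e in pixel_errors))
    return F, sigma_F

# Hypothetical 3x3 aperture: per-pixel fluxes and Gaussian errors.
F, sigma_F = flux_and_error([5.0, 8.1, 4.9, 7.7, 12.3, 8.0, 5.2, 7.9, 5.1],
                            [1.0] * 9)
```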
Figure 11: Error estimation via resampling the data, using the standard data. The distributions
of µ resulting from the resampling procedure assuming Poisson errors in panel (a) and Gaussian
errors in panel (b) are both well approximated by a Gaussian. In panel (a) the Gaussian is
given by µ̂ = 10.05 and σ̂µ = 0.57. In panel (b) the Gaussian is given by µ̂ = 10.11 and
σ̂µ = 0.55.

Figure 12: Error estimation via bootstrapping the data, using the standard data. The
distributions of µ result from the bootstrapping procedure and estimating the Poisson mean
via Eq. (15) in panel (a) and estimating the Gaussian mean via Eq. (10) in panel (b). Both
distributions are well approximated by a Gaussian. In panel (a) the Gaussian is given by
µ̂ = 10.06 and σ̂µ = 0.36. In panel (b) the Gaussian is given by µ̂ = 10.08 and σ̂µ = 0.38.
Why does this method only provide an upper limit for the uncertainty? The reason is
that even though we are using the correct error distribution of the measured data, we are
centering this error distribution at the measured value instead of the (unknown) true value.
This introduces additional scatter and leads us to overestimate the uncertainty. Nevertheless,
it may still be acceptable to use the overestimated uncertainty as a conservative estimate,
depending on the precise scientific question.
This method is a very intuitive approach to error estimation, because it simulates repeated
measurements of the data. It can also be applied to error estimation for model-based parameter
estimates.
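A sketch of this recipe for the standard data, assuming Gaussian measurement errors and the weighted-mean estimator of Eq. (10), i.e., the case of panel (b) of Fig. 11:

```python
import random

random.seed(1)
# Standard example (Table 1): data x_n with Gaussian errors sigma_n.
x = [7, 10, 11, 10, 8, 14, 12, 9, 12, 13, 10, 11,
     12, 11, 13, 9, 10, 12, 11, 7, 7, 12, 10, 6,
     8, 11, 9, 9, 9, 9]
s = [3.18, 3.45, 2.92, 3.14, 3.43, 2.85, 3.18, 3.14, 3.12, 3.03, 3.12, 3.07,
     3.08, 3.41, 3.32, 3.44, 3.31, 3.21, 2.87, 3.07, 2.89, 3.06, 2.93, 2.90,
     3.32, 2.99, 3.08, 3.18, 3.40, 3.01]
w = [1.0 / sn**2 for sn in s]

def weighted_mean(values):
    """Eq. (10) applied to one (re)sampled data set."""
    return sum(wn * v for wn, v in zip(w, values)) / sum(w)

estimates = []
for _ in range(5000):
    # 'Alternative' data set: every point redrawn from a Gaussian centred on its measurement.
    resampled = [random.gauss(xn, sn) for xn, sn in zip(x, s)]
    estimates.append(weighted_mean(resampled))

mean_mu = sum(estimates) / len(estimates)
sigma_mu = (sum((e - mean_mu)**2 for e in estimates) / len(estimates)) ** 0.5
```

The scatter of the estimates comes out near the values quoted in Fig. 11; as discussed above, it is a (mildly) conservative upper limit of the true uncertainty.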
4.3 Bootstrapping
Bootstrapping (Efron 1979; Hastie et al. 2009) is another resampling method for error estimation
that can be applied to model-based as well as model-independent parameter estimation. Let
us assume we have N measurements {x1 , x2 , . . . , xN } from which we estimate some parameter.
Again, we resample our data in order to create “alternative” data sets from which we then
repeatedly estimate the parameter of interest, monitoring its distribution as before. However,
the details of the resampling process are different from those in Sect. 4.2. Instead of resampling
each data point from its individual error distribution, we draw new samples from the measured
data set itself. Drawing these “bootstrap” samples is done with replacement, i.e., the same
data point can occur multiple times in our bootstrap sample. To give an example, we consider
a data set {x1 , x2 , x3 , x4 }. Some examples for its bootstrap samples are:
• {x1 , x2 , x1 , x4 },
• {x1 , x2 , x2 , x4 },
• {x1 , x3 , x3 , x3 }.
As the same data point can occur multiple times but ordering is not important, for N data
points the total number of possible different bootstrap samples is
(2N − 1 choose N) = (2N − 1)!/[N!(N − 1)!] .   (34)
In practice, the number of bootstrap samples chosen is set to some useful number, where
“useful” is determined by the trade-off between computational effort and the desire to have as
many samples as possible in order to get a good estimate.
The major advantage of bootstrapping is that the error distribution of the measured data
does not need to be known (unless the parameter estimation itself requires it). The crucial
assumption here is that the measured data sample itself encodes the information about its
error distribution. However, the parameter estimation must be capable of dealing with the
bootstrapped samples which, in general, include certain data points multiple times while com-
pletely lacking other data points. For instance, the flux estimator of Eq. (32) would not be
capable of handling bootstrap samples, since all pixels have to contribute precisely once. Nev-
ertheless, if we know the data’s error distribution, we should really exploit this knowledge by
using, e.g., resampling instead of bootstrapping.
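A sketch of the bootstrap for the standard data, using the arithmetic mean of Eq. (15) as estimator, i.e., the case of panel (a) of Fig. 12:

```python
import random

random.seed(2)
# Standard example (Table 1): the 30 measured values x_n.
x = [7, 10, 11, 10, 8, 14, 12, 9, 12, 13, 10, 11,
     12, 11, 13, 9, 10, 12, 11, 7, 7, 12, 10, 6,
     8, 11, 9, 9, 9, 9]

means = []
for _ in range(5000):
    # Bootstrap sample: N draws from the data itself, with replacement.
    boot = random.choices(x, k=len(x))
    means.append(sum(boot) / len(boot))

mu_hat = sum(means) / len(means)
sigma_mu = (sum((m - mu_hat)**2 for m in means) / len(means)) ** 0.5
```

Note that no measurement errors σn enter here; the scatter of the bootstrap means, near the value quoted in Fig. 12, is driven entirely by the spread of the data sample itself.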
flux. This preprocessing is usually done either using complete instrument-specific pipelines,
more general software routines such as those found in IRAF or MIDAS, or a combination of
both. The question now arises: how do we propagate the errors on the initial measurements
through such complex preprocessing?
In general, an analytic error propagation such as that discussed in Sect. 4.1 is impossible,
either because the data-reduction pipeline is too complex, or because the pipeline is used as
a “black box”. Nevertheless, it is possible to propagate the errors through the pipeline via
resampling (Sect. 4.2), though it may be computationally expensive. Let us outline this
method for our previous example of spectral measurements. We have the raw spectral data,
the measured bias fields, and the flat fields. We resample each of these fields as described in
Sect. 4.2, say N resamplings, assuming the measurement errors are Gaussian (or Poisson, if we
count only a few photons). Then we feed each resampled instance through the pipeline and
monitor the outcome. The result is a set of reduced spectra, which provides an estimate of the
reduced spectrum’s error distribution.
Of course, this method may be computationally expensive in practice.18 However, as we
have argued earlier, an error estimate is inevitably necessary. Therefore, if this method is the
only possibility to get such an error estimate, computational cost is not an argument.19
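A minimal sketch of this recipe, with a deliberately trivial stand-in for the black-box pipeline ((raw − bias)/flat per pixel) and invented frame values; only the structure, resample all inputs and rerun, matters:

```python
import random

random.seed(3)

def reduce_spectrum(raw, bias, flat):
    """Stand-in for a black-box data-reduction pipeline:
    here simply (raw - bias) / flat per pixel."""
    return [(r - b) / f for r, b, f in zip(raw, bias, flat)]

# Hypothetical measured frames (one 5-pixel 'spectrum') and their Gaussian errors.
raw,  raw_err  = [110.0, 130.0, 150.0, 128.0, 112.0], 3.0
bias, bias_err = [10.0] * 5, 1.0
flat, flat_err = [1.0] * 5, 0.02

reduced = []
for _ in range(2000):
    # Resample every input frame from its error distribution, then rerun the pipeline.
    raw_i  = [random.gauss(v, raw_err) for v in raw]
    bias_i = [random.gauss(v, bias_err) for v in bias]
    flat_i = [random.gauss(v, flat_err) for v in flat]
    reduced.append(reduce_spectrum(raw_i, bias_i, flat_i))

# Per-pixel error estimate of the reduced spectrum: scatter over all realisations.
n = len(reduced)
pixel_err = []
for j in range(len(raw)):
    vals = [spec[j] for spec in reduced]
    m = sum(vals) / n
    pixel_err.append((sum((v - m)**2 for v in vals) / n) ** 0.5)
```

The per-pixel scatter automatically combines the raw, bias, and flat uncertainties, including the division that would make analytic propagation tedious for a realistic pipeline.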
6 Summary
I have discussed different methods for error estimation that apply to model-based as well as
model-independent parameter estimates. The methods have been briefly outlined and their
assumptions have been made explicit. Whenever employing one of these methods, all assump-
tions should be checked. Table 3 summarises all the methods discussed here and provides a
brief overview of their applicability. It was beyond the scope of this manuscript to describe
all methods in detail. Where possible I pointed to the literature for examples. Furthermore, I
have also outlined how one can propagate errors through data-reduction pipelines.
My recommendations for error estimation are to use Monte-Carlo methods in case of model-
based parameter estimates and Monte-Carlo-like resampling of the measured data in case of
model-independent parameter estimates. These methods only require knowledge of the mea-
surement errors but do not invoke further assumptions such as Gaussianity of the likelihood
function near its maximum. Bootstrapping may also be an option if sufficient data are available.
I conclude with some recommendations for further reading:
• Barlow (1993): An easy-to-read introduction to the basics of statistics without going
too much into depth. Recommended for getting a first idea about parameter estimation,
error estimation, and the interpretation of uncertainties.
• Press et al. (2002): This book contains some very useful chapters about statistical theory,
e.g., linear least-squares fitting. It is excellent for looking up how a certain method works,
but it is not meant as an introduction to statistics.
• Hastie et al. (2009): A good textbook giving a profound introduction to data analysis.
However, the main focus of this book is on classification problems rather than regression
problems. Given the rather mathematical notation and its compactness, this book requires
a certain level of prior knowledge.
18 Although, once the necessary groundwork has been covered, the batch processing of simple spectra usually
takes no more than a few seconds apiece.
19 In fact, one may argue that computational cost is not an argument anyway: if something is computationally
too expensive, then one should buy more computers! Unfortunately, this approach is usually hampered by the
fact that licensed software is very popular in astronomy (e.g. IDL).
Table 3: Methods for error estimation discussed in this manuscript. This table gives a brief
overview of each method: Specifically, whether a certain method is applicable to model-based
and/or model-independent parameter estimates, whether knowledge about the data’s error
distribution is necessary, and what kind of error contours can be estimated.
• MacKay (2003): The focus of this textbook is again mainly on classification, but it
also provides a broader overview of concepts of data analysis and contains an excellent
introduction to Monte-Carlo methods. This textbook also requires a certain level of prior
knowledge.
Acknowledgements I would like to thank David W. Hogg for reading this manuscript and
providing valuable comments on the technical issues. Furthermore, I would like to thank
Ellen Andrae and Katherine Inskip for helping me to get this rather technical subject down to
something that is hopefully readable. Katherine also helped me to get the examples with spec-
troscopy right. Last but not least, I want to thank Ada Nebot Gomez-Morán, who complained
that she was dearly missing a guide to error estimation during her PhD. I hope she approves
of this manuscript. This is the first revised version of the original manuscript.
References
Barlow, R. 1993, Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences
(Wiley VCH)
Burtscher, L., Jaffe, W., Raban, D., et al. 2009, ApJL, 705, L53
Cisternas, M., Jahnke, K., Inskip, K. J., et al. 2010, ArXiv e-prints
Cowles, M. K. & Carlin, B. P. 1996, Journal of the American Statistical Association, 91, 883
Efron, B. 1979, The Annals of Statistics, 7, 1
Hastie, T., Tibshirani, R., & Friedman, J. 2009, The Elements of Statistical Learning: Data
Mining, Inference, and Prediction (Springer-Verlag)
MacKay, D. 2003, Information Theory, Inference, and Learning Algorithms (Cambridge University Press)
Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. 2002, Numerical Recipes in C++
(Cambridge University Press)