
NIKHEF-06-01

Introduction to Bayesian Inference


M. Botje
NIKHEF, PO Box 41882, 1009DB Amsterdam, the Netherlands
June 21, 2006

Abstract
This is the write-up of a NIKHEF topical lecture series on Bayesian inference.
The topics covered are the definition of probability, elementary probability calculus and assignment, selection of least informative probabilities by the maximum
entropy principle, parameter estimation, systematic error propagation, model selection and the stopping problem in counting experiments.

Contents
1 Introduction
2 Bayesian Probability
  2.1 Plausible inference
  2.2 Probability calculus
  2.3 Exhaustive and exclusive sets of hypotheses
  2.4 Continuous variables
  2.5 Bayesian versus Frequentist inference
3 Posterior Representation
  3.1 Mean, variance and covariance
  3.2 Transformations
  3.3 The covariance matrix
4 Basic Probability Assignment
  4.1 Bernoulli's urn
  4.2 Binomial distribution
  4.3 Multinomial distribution
  4.4 Poisson distribution
  4.5 Gauss distribution
5 Least Informative Probabilities
  5.1 Impact of prior knowledge
  5.2 Symmetry considerations
  5.3 Maximum entropy principle
6 Parameter Estimation
  6.1 Gaussian sampling
  6.2 Least squares
  6.3 Example: no signal
  6.4 Systematic error propagation
  6.5 Model selection
7 Counting
  7.1 Binomial counting
  7.2 The negative binomial
A Solutions to Selected Exercises

1 Introduction

The Frequentist and Bayesian approaches to statistics differ in the definition of probability. For a Frequentist, probability is the relative frequency of the occurrence of an
event in a large set of repetitions of the experiment (or in a large ensemble of identical
systems) and is, as such, a property of a so-called random variable. In Bayesian statistics, on the other hand, probability is not defined as a frequency of occurrence but as
the plausibility that a proposition is true, given the available information. Probabilities
are then, in the Bayesian view, not properties of random variables but a quantitative
encoding of our state of knowledge about these variables. This view has far-reaching
consequences when it comes to data analysis since Bayesians can assign probabilities to
propositions, or hypotheses, while Frequentists cannot.
In these lectures we present the basic principles and techniques underlying Bayesian
statistics or, rather, Bayesian inference. Such inference is the process of determining
the plausibility of a conclusion, or a set of conclusions, which we draw from the available
data and prior information.
Since we derive in this write-up (almost) everything from scratch, little reference is made
to the literature. So let us start by giving below our own small library on the subject.
Perhaps one of the best review articles (from an astronomer's perspective) is T. Loredo,
"From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics" [1]. Very
illuminating case studies are presented in "An Introduction to Parameter Estimation
using Bayesian Probability Theory" [2] and "An Introduction to Model Selection using
Probability Theory as Logic" [3] by G.L. Bretthorst. A nice summary of Bayesian statistics from a particle physics perspective can be found in the article "Bayesian Inference
in Processing Experimental Data" by G. D'Agostini [4].
A good introduction to Bayesian methods is given in the book by Sivia, "Data Analysis: A
Bayesian Tutorial" [5]. More extensive, with many worked-out examples in Mathematica, is the book by P. Gregory (also an astronomer), "Bayesian Logical Data Analysis for
the Physical Sciences" [6] (footnote 1). The ultimate reference, but certainly not for the fainthearted,
is the monumental work by Jaynes, "Probability Theory: The Logic of Science" [7]. Unfortunately Jaynes died before the book was finished so that it is incomplete. It is
available in print (Cambridge University Press) but a free copy can still be found on
the website given in [7]. For those who want to refresh their memory on Frequentist
methods we recommend "Statistical Data Analysis" by G. Cowan [8].
A rich collection of very interesting articles (including those cited above) can be found
on the web site of T. Loredo http://www.astro.cornell.edu/staff/loredo/bayes.
Finally there are of course these lecture notes which can be found, together with the
lectures themselves, on
http://www.nikhef.nl/user/h24/bayes.
Exercise 1.1: The literature list above suggests that Bayesian methods are more popular
in astronomy than in our field of particle physics. Can you give reasons why astronomers
would be more inclined to turn Bayesian?
Footnote 1: The Mathematica notebooks can be downloaded, for free, from www.cambridge.org/052184150X.

2 Bayesian Probability

2.1 Plausible inference

In Aristotelian logic a proposition can be either true or false. In the following we
will denote a proposition by a capital letter like A and represent true or false by the
Boolean values 1 and 0, respectively. The operation of negation (denoted by $\bar{A}$) turns
a true proposition into a false one and vice versa.
Two propositions can be linked together to form a compound proposition. The state
of such a compound proposition depends on the states of the two input propositions
and on the way these are linked together. It is not difficult to see that there are exactly
16 different ways in which two propositions can be combined (footnote 2). All these have specific
names and symbols in formal logic but here we will be concerned with only a few of
these, namely, the tautology ($\top$), the contradiction ($\bot$), the 'and' ($\wedge$) (footnote 3), the 'or' ($+$)
and the implication ($\Rightarrow$). Below we give the truth tables of these binary operations.
    A   B   |   $A \top B$   $A \bot B$   $A \wedge B$   $A + B$   $A \Rightarrow B$
    0   0   |       1            0             0            0             1
    0   1   |       1            0             0            1             1
    1   0   |       1            0             0            1             0
    1   1   |       1            0             1            1             1

(2.1)

Note that the tautology is always true and the contradiction always false, independent
of the value of the input propositions.
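
As a cross-check of table (2.1), the five operations can be tabulated mechanically. The sketch below is only an illustration: the dictionary `OPERATIONS` and the function `truth_table` are names introduced here, and the implication is encoded as "not A, or B", which is the standard Boolean reading assumed for the last column.

```python
from itertools import product

# The five binary operations of truth table (2.1), acting on the Boolean values 0 and 1.
# The names are illustrative; the implication is encoded as (not A) or B.
OPERATIONS = {
    "tautology": lambda a, b: 1,
    "contradiction": lambda a, b: 0,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "implication": lambda a, b: (1 - a) | b,  # false only for A = 1, B = 0
}


def truth_table():
    """Return one row (A, B, outputs...) per input state, in the row order of (2.1)."""
    rows = []
    for a, b in product((0, 1), repeat=2):
        rows.append((a, b) + tuple(op(a, b) for op in OPERATIONS.values()))
    return rows


if __name__ == "__main__":
    print("A B", *OPERATIONS)
    for row in truth_table():
        print(*row)
```

Each row of the output corresponds to one of the four input states, and each column after A and B to one of the 4-bit output words mentioned in footnote 2.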
Exercise 2.1: Show that $A \wedge \bar{A}$ is a contradiction and $A + \bar{A}$ a tautology.

Two important relations between the logical operations 'and' and 'or' are given by
de Morgan's laws

$$\overline{A \wedge B} = \bar{A} + \bar{B} \quad \text{and} \quad \overline{A + B} = \bar{A} \wedge \bar{B}. \tag{2.2}$$

We note here a remarkable duality possessed by logical equations in that they can be
transformed into other valid equations by interchanging the operations $\wedge$ and $+$.
Exercise 2.2: Prove (2.2). Hint: this is most easily done by verifying that the truth tables of
the left and right-hand sides of the equations are the same. Once it is shown that the first
equation in (2.2) is valid, then duality guarantees that the second equation is also valid.

The reasoning process by which conclusions are drawn from a set of input propositions is
called inference. If there is enough input information we apply deductive inference
which allows us to draw firm conclusions, that is, the conclusion can be shown to be
either true or false. Mathematical proofs, for instance, are based on deductive inferences.
Footnote 2: The two input propositions define four possible input states which each give rise to two possible
output states (true or false). These output states can thus be encoded as bits in a 4-bit word. Five out
of the 16 possible output words are listed in the truth table (2.1).
Footnote 3: The conjunction $A \wedge B$ will often be written as the juxtaposition AB since it looks neat in long
expressions or as (A, B) since we are accustomed to that in mathematical notation.

If there is not enough input information we apply inductive inference which does not
allow us to draw a firm conclusion. The difference between deductive and inductive
reasoning can be illustrated by the following simple example:
    P1: Roses are red
    P2: This flower is a rose
    Conclusion: This flower is red (deduction)

    P1: Roses are red
    P2: This flower is red
    Conclusion: This flower is perhaps a rose (induction)

Induction thus leaves us in a state of uncertainty about our conclusion. However, the
statement that the flower is red increases the probability that we are dealing with a rose
as can easily be seen from the fact that, provided all roses are red, the fraction of roses
in the population of red flowers must be larger than that in the population of all flowers.
The process of deductive and inductive reasoning can formally be summarized in terms
of the following two syllogisms (footnote 4):

                    Deduction                      Induction
    Proposition 1:  If A is true then B is true    If A is true then B is true
    Proposition 2:  A is true                      B is true
    Conclusion:     B is true                      A is more probable

The first proposition in both the syllogisms can be recognized as the implication $A \Rightarrow B$
for which the truth table is given in (2.1). It is straightforward to check from this truth
table the validity of the conclusions derived above; in particular it is seen that A can be
true or false in the second (inductive) syllogism.
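
The check against the truth table can also be done by enumeration. The following sketch (with the hypothetical helper `implies`) lists the input states compatible with "if A then B" and then imposes the second proposition of each syllogism.

```python
from itertools import product


def implies(a, b):
    """Truth value of the implication A => B for Boolean inputs."""
    return (not a) or b


# All states (A, B) that are compatible with the first proposition "if A then B".
allowed = [(a, b) for a, b in product((False, True), repeat=2) if implies(a, b)]

# Deduction: add "A is true" and collect the values that B can still take.
b_given_a = {b for a, b in allowed if a}

# Induction: add "B is true" and collect the values that A can still take.
a_given_b = {a for a, b in allowed if b}

print(b_given_a)  # only True remains: B is certain
print(a_given_b)  # both values remain: A is not decided
```

In the deductive case a single value of B survives, while in the inductive case both values of A remain possible, which is the state of uncertainty that the plausibility measure introduced next is meant to quantify.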
Exercise 2.3: (i) Show that it follows from $A \Rightarrow B$ that $\bar{B} \Rightarrow \bar{A}$;
(ii) What are the conclusions in the two syllogisms above if we replace the second proposition by, respectively,
'A is false' and 'B is false'?

In inductive reasoning then, we are in a state of uncertainty about the validity (true or
false) of the conclusion we wish to draw. However, it makes sense to define a measure
P (A|I) of the plausibility (or degree of belief) that a proposition A is true, given the
information I. It may seem quite an arbitrary business to attempt to quantify something
like a degree of belief but this is not so.
Cox (1946) has, in a landmark paper [9], formulated the rules of plausible inference
and plausibility calculus by basing them on several desiderata. These desiderata are so
few that we can summarize them here: (i) boundedness: the degree of plausibility of a
proposition A is a bounded quantity; (ii) transitivity: if A, B and C are propositions
and the plausibility of C is larger than that of B which, in turn, is larger than that
of A then the plausibility of C must be larger than that of A; (iii) consistency: the
plausibility of a proposition A depends only on the relevant information on A and not
on the path of reasoning followed to arrive at A.
Footnote 4: A syllogism is a triplet of related propositions consisting of a major premise, a minor premise and
a conclusion.

It turns out that these desiderata are so restrictive that they completely determine the
algebra of plausibilities. To the surprise of many, this algebra appeared to be identical
to that of classical probability as defined by the axioms of Kolmogorov (see below for
these axioms). Plausibility is thus, for a Bayesian, identical to probability (footnote 5).

2.2 Probability calculus

We will now derive several useful formulas starting from the fundamental axioms of probability calculus, taking the viewpoint of a Homo Bayesiensis when appropriate. First, as
already mentioned above, $P(A|I)$ is a bounded quantity; by convention $P(A|I) = 1$ (0)
when we are certain that the proposition A is true (false). Next, the two Kolmogorov
axioms are the sum rule

$$P(A + B|I) = P(A|I) + P(B|I) - P(AB|I) \tag{2.3}$$

and the product rule

$$P(AB|I) = P(A|BI)\,P(B|I). \tag{2.4}$$

Let us, at this point, spell out the difference between AB (A and B) and A|B (A given
B): In AB, B can be true or false while in A|B, B is assumed to be true and cannot
be false. The following terminology is often used for the probabilities occurring in the
product rule (2.4): P (AB|I) is called the joint probability, P (A|BI) the conditional
probability and P (B|I) the marginal probability.
Probabilities can be represented in a Venn diagram by the (normalized) areas of the
sub-sets A and B of a given set I. The sum rule is then trivially understood from the
following diagram
[Figure: Venn diagram of a set I containing two overlapping sub-sets A and B with intersection AB.]

while the product rule can be seen to give the relation between different normalizations of the area AB: P (AB|I) corresponds to the area AB normalized to I, P (A|BI)
corresponds to AB normalized to B and P (B|I) corresponds to B normalized to I.
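
The area picture can be made concrete with a finite set. In the sketch below the set I is taken to be the six faces of a die, assumed equally likely, and A and B are two arbitrarily chosen sub-sets; these choices and the helper `prob` are illustrative assumptions, not part of the lectures.

```python
from fractions import Fraction

I = set(range(1, 7))               # background set: faces of a die, assumed equally likely
A = {n for n in I if n % 2 == 0}   # proposition A: "the face is even"
B = {n for n in I if n > 3}        # proposition B: "the face exceeds 3"


def prob(event, given=I):
    """Area of `event` inside `given`, normalized to the area of `given`."""
    return Fraction(len(event & given), len(given))


# Sum rule (2.3): the overlap AB is counted twice in P(A) + P(B) and subtracted once.
sum_rule_lhs = prob(A | B)
sum_rule_rhs = prob(A) + prob(B) - prob(A & B)

# Product rule (2.4): AB normalized to I equals AB normalized to B times B normalized to I.
product_rule_lhs = prob(A & B)
product_rule_rhs = prob(A, given=B) * prob(B)

print(sum_rule_lhs, sum_rule_rhs)          # 2/3 2/3
print(product_rule_lhs, product_rule_rhs)  # 1/3 1/3
```

Exact fractions are used so that both sides of each rule can be compared without rounding.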
Because $A + \bar{A}$ is a tautology (always true) and $A\bar{A}$ a contradiction (always false) we
find from (2.3)

$$P(A|I) + P(\bar{A}|I) = 1 \tag{2.5}$$

which sometimes is taken as an axiom instead of (2.3).
Exercise 2.4: Derive the sum rule (2.3) from the axioms (2.4) and (2.5).
Footnote 5: The intimate connection between probability and logic is reflected in the title of Jaynes' book:
"Probability Theory: The Logic of Science".

If A and B are mutually exclusive propositions (they cannot both be true) then,
because AB is a contradiction, $P(AB|I) = 0$ and (2.3) becomes

$$P(A + B|I) = P(A|I) + P(B|I) \quad \text{(A and B exclusive)}. \tag{2.6}$$

If A and B are independent (the knowledge of B does not give us information on A
and vice versa) (footnote 6), then $P(A|BI) = P(A|I)$ and (2.4) becomes

$$P(AB|I) = P(A|I)\,P(B|I) \quad \text{(A and B independent)}. \tag{2.7}$$

Because AB = BA we see from (2.4) that $P(A|BI)P(B|I) = P(B|AI)P(A|I)$. From
this we obtain the rule for conditional probability inversion, also known as Bayes'
theorem:

$$P(H|DI) = \frac{P(D|HI)\,P(H|I)}{P(D|I)}. \tag{2.8}$$
In (2.8) we replaced A and B by D and H to indicate that in the following these
propositions will refer to 'data' and 'hypothesis', respectively. Ignoring for the moment the normalization constant $P(D|I)$ in the denominator, Bayes' theorem tells us
the following: The probability $P(H|DI)$ of a hypothesis, given the data, is equal to
the probability $P(H|I)$ of the hypothesis, given the background information alone (that
is, without considering the data) multiplied by the probability $P(D|HI)$ that the hypothesis, when true, just yields that data. In Bayesian parlance $P(H|DI)$ is called the
posterior probability, $P(D|HI)$ the likelihood, $P(H|I)$ the prior probability and
$P(D|I)$ the evidence.
Bayes' theorem describes a learning process in the sense that it specifies how to update
the knowledge on H when new information D becomes available.
We remark that Bayes' theorem is valid in both the Bayesian and Frequentist worlds
because it follows directly from axiom (2.4) of probability calculus. What differs is the
interpretation of probability: for a Bayesian, probability is a measure of plausibility so
that it makes perfect sense to convert $P(D|HI)$ into $P(H|DI)$ for data D and hypothesis
H. For a Frequentist, however, probabilities are properties of random variables and,
although it makes sense to talk about $P(D|HI)$, it does not make sense to talk about
$P(H|DI)$ because a hypothesis H is a proposition and not a random variable. See
Section 2.5 for more on Bayesian versus Frequentist.
Many people are not always fully aware of the consequences of probability inversion. To
see this, consider the case of Mr. White who goes to a doctor for an AIDS test. This
test is known to be 100% efficient (the test never fails to detect AIDS). A few days later
poor Mr. White learns that he is positive. Does this mean that he has AIDS?
Most people (including, perhaps, Mr. White himself and his doctor) would say yes
because they fail to realize that
$$P(\text{AIDS}|\text{positive}) \neq P(\text{positive}|\text{AIDS}).$$
Footnote 6: We are talking here about a logical dependence which could be defined as follows: A and B are
logically dependent when learning about A implies that we also will learn something about B. Note
that logical dependence does not necessarily imply causal dependence. Causal dependence does, on the
other hand, always imply logical dependence.

That inverted probabilities are, in general, not equal should be even more clear from
the following example:
$$P(\text{rain}|\text{clouds}) \neq P(\text{clouds}|\text{rain}).$$
Right?
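
To put numbers on this inequality, the sketch below uses a purely hypothetical tally of 100 days (the counts are invented for illustration and are not data). Both conditional probabilities are read off the tally, and the inversion (2.8) is verified to connect them.

```python
from fractions import Fraction

# Hypothetical tally of 100 days, invented for illustration only.
days = {
    ("clouds", "rain"): 20,
    ("clouds", "dry"): 40,
    ("clear", "rain"): 0,
    ("clear", "dry"): 40,
}

total = sum(days.values())
cloudy = sum(n for (sky, _), n in days.items() if sky == "clouds")
rainy = sum(n for (_, weather), n in days.items() if weather == "rain")
both = days[("clouds", "rain")]

p_rain_given_clouds = Fraction(both, cloudy)   # 1/3
p_clouds_given_rain = Fraction(both, rainy)    # 1
p_rain = Fraction(rainy, total)
p_clouds = Fraction(cloudy, total)

# Bayes' theorem (2.8) with H = rain and D = clouds.
inverted = p_clouds_given_rain * p_rain / p_clouds

print(p_rain_given_clouds, p_clouds_given_rain, inverted)
```

In this toy tally it never rains from a clear sky, so clouds are certain given rain, while rain given clouds has probability 1/3 only; the two are linked by the ratio of the marginal probabilities.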
In the next section we will learn how to deal with Mr. White's test (and with the opinion
of his doctor).

2.3 Exhaustive and exclusive sets of hypotheses

Let us now consider the important case that H can be decomposed into an exhaustive
set of mutually exclusive hypotheses $\{H_i\}$, that is, into a set of which one and only one
hypothesis is true (footnote 7). Notice that this implies, by definition, that H itself is a tautology.
Trivial properties of such a complete set of hypotheses are (footnote 8)

$$P(H_i, H_j|I) = P(H_i|I)\,\delta_{ij} \tag{2.9}$$

and

$$\sum_i P(H_i|I) = P\Big(\sum_i H_i \Big| I\Big) = 1 \quad \text{(normalization)} \tag{2.10}$$

where we used the sum rule (2.6) in the first equality and the fact that the logical sum of
the $H_i$ is a tautology in the second equality. Eq. (2.10) is the extension of the sum-rule
axiom (2.5).
Similarly it is straightforward to show that

$$\sum_i P(D, H_i|I) = P\Big(D, \sum_i H_i \Big| I\Big) = P(D|I). \tag{2.11}$$

This operation is called marginalization (footnote 9) and plays a very important role in Bayesian
analysis since it allows us to eliminate sets of hypotheses which are necessary in the formulation of a problem but are otherwise of no interest to us ('nuisance parameters').
The inverse of marginalization is the decomposition of a probability: Using the product rule we can re-write (2.11) as

$$P(D|I) = \sum_i P(D, H_i|I) = \sum_i P(D|H_i, I)\,P(H_i|I) \tag{2.12}$$

which states that the probability of D can be written as the weighted sum of the probabilities of a complete set of hypotheses $\{H_i\}$. The weights are just given by the probability
that $H_i$, when true, gives D. In this way we have expanded $P(D|I)$ on a basis of probabilities $P(H_i|I)$ (footnote 10). Decomposition is often used in probability assignment because it
allows us to express a compound probability in terms of known elementary probabilities.
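
A small numerical sketch of the decomposition: assume, purely for illustration, that a ball is drawn from one of three urns (the complete set of hypotheses) and that the data D is "the ball is red". The priors and the red-ball probabilities below are invented numbers.

```python
from fractions import Fraction

# Hypothetical complete set of hypotheses: the ball was drawn from urn 1, 2 or 3.
prior = {"urn1": Fraction(1, 2), "urn2": Fraction(1, 3), "urn3": Fraction(1, 6)}
# Assumed probability of the data D ("the ball is red") under each hypothesis.
likelihood = {"urn1": Fraction(1, 10), "urn2": Fraction(1, 2), "urn3": Fraction(9, 10)}

# The priors of a complete set must satisfy the normalization (2.10).
prior_total = sum(prior.values())

# Product rule: joint probabilities P(D, H_i | I) = P(D | H_i, I) P(H_i | I).
joint = {h: likelihood[h] * prior[h] for h in prior}

# Marginalization (2.11) and decomposition (2.12): sum the joint over the hypotheses.
p_data = sum(joint.values())

print(prior_total, p_data)  # 1 11/30
```

The urn label acts here as a nuisance parameter: it is needed to formulate the problem but is summed out of the final probability of D.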
Footnote 7: A trivial example is the complete set $H_1: x < a$ and $H_2: x \ge a$ with x and a real numbers.
Footnote 8: Here and in the following we write the conjunction AB as A, B.
Footnote 9: A projection of a two-dimensional distribution f(x, y) on the x or y axis is called a marginal
distribution. Because (2.11) is projecting out $P(D|I)$ it is called marginalization.
Footnote 10: Note that (2.12) is similar to the closure relation in quantum mechanics $\langle D|I \rangle = \sum_i \langle D|H_i \rangle \langle H_i|I \rangle$.

Using (2.12), Bayes' theorem (2.8) can, for a complete set of hypotheses, be written as

$$P(H_i|D, I) = \frac{P(D|H_i, I)\,P(H_i|I)}{\sum_i P(D|H_i, I)\,P(H_i|I)}, \tag{2.13}$$

from which it is seen that the denominator is just a normalization constant.


If we calculate with (2.13) the posteriors for all the hypotheses Hi in the set we obtain a
spectrum of probabilities which, in the continuum limit, goes over to a probability density distribution (see Section 2.4). Note that in computing this spectrum the likelihood
P (D|Hi, I) is viewed as a function of the hypotheses for fixed data. In case P (D|Hi, I)
is regarded as a function of the data for fixed hypothesis it is not called a likelihood but,
instead, a sampling distribution.
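
The distinction between likelihood and sampling distribution can be seen in a small example. The sketch assumes five hypotheses about the fraction of red balls in an urn (0, 1/4, ..., 1) and uses the binomial probability for r red balls in n draws with replacement (see Section 4.2) as the model; the function names, the uniform prior and the data (2 red in 3 draws) are illustrative choices.

```python
from fractions import Fraction
from math import comb

FRACTIONS = [Fraction(i, 4) for i in range(5)]   # hypotheses H_i: red fraction i/4


def binom(r, n, f):
    """P(r red balls in n draws with replacement | red fraction f)."""
    return comb(n, r) * f**r * (1 - f) ** (n - r)


def posterior(r, n, prior):
    """Spectrum of posteriors from (2.13); the denominator is the sum (2.12)."""
    joint = [binom(r, n, f) * p for f, p in zip(FRACTIONS, prior)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


prior = [Fraction(1, 5)] * 5                      # assumed uniform prior
post = posterior(2, 3, prior)

# Same function, two readings:
likelihood = [binom(2, 3, f) for f in FRACTIONS]             # varies H, data fixed
sampling = [binom(r, 3, Fraction(1, 2)) for r in range(4)]   # varies data, H fixed

print(post)             # posterior probabilities 0, 3/20, 2/5, 9/20, 0
print(sum(sampling))    # 1: a sampling distribution is normalized over the data
print(sum(likelihood))  # 15/16: a likelihood need not sum to one over the hypotheses
```

Only the sampling distribution is a normalized probability distribution; the likelihood becomes one only after multiplication with the prior and division by the evidence.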
Exercise 2.5: Mr. White is positive on an AIDS test. The probability of a positive
test is 98% for a person who has AIDS (efficiency) and 3% for a person who has not
(contamination). Given that a fraction $\mu$ = 1% of the population is infected, what is the
probability that Mr. White has AIDS? What would be this probability for full efficiency
and for zero contamination?
Exercise 2.6: What would be the probability that Mr. White has AIDS given the prior
information $\mu = 0$ (nobody has AIDS) or $\mu = 1$ (everybody has AIDS)? Note that both
these statements on $\mu$ encode prior certainties. Convince yourself, by studying Bayes'
theorem, of the fact that no amount of data can ever change a prior certainty.

2.4 Continuous variables

The formalism presented above describes the probability calculus of propositions or,
equivalently, of discrete variables (which can be thought of as an index labeling a set
of propositions). To extend this discrete algebra to continuous variables, consider the
propositions
$$A: r < a, \qquad B: r < b, \qquad C: a \le r < b$$

for a real variable r and two fixed real numbers a and b with a < b. Because we have
the Boolean relation B = A + C and because A and C are mutually exclusive we find
from the sum rule (2.6)

$$P(a \le r < b|I) = P(r < b|I) - P(r < a|I) \equiv G(b) - G(a). \tag{2.14}$$

In (2.14) we have introduced the cumulative distribution $G(x) \equiv P(r < x|I)$ which obviously is a
monotonically non-decreasing function of x. The probability density p is defined by

$$p(x|I) = \lim_{\delta \to 0} \frac{P(x \le r < x + \delta|I)}{\delta} = \frac{dG(x)}{dx} \tag{2.15}$$

(note that p is non-negative) so that (2.14) can also be written as

$$P(a \le r < b|I) = \int_a^b p(r|I)\,dr. \tag{2.16}$$
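
The relation between the cumulative distribution and the density can be checked numerically. The sketch below assumes, as an arbitrary example, an exponential model with G(x) = 1 - exp(-x) for x > 0; the midpoint-rule helper `integrate` and the chosen interval are illustrative.

```python
import math


def G(x):
    """Cumulative distribution P(r < x | I) of the assumed exponential model."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0


def p(x):
    """Density dG/dx of the same model."""
    return math.exp(-x) if x >= 0 else 0.0


def integrate(func, a, b, steps=10_000):
    """Midpoint-rule estimate of the integral of func from a to b."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))


a, b = 0.5, 2.0
prob_from_density = integrate(p, a, b)   # right-hand side of (2.16)
prob_from_cumulative = G(b) - G(a)       # right-hand side of (2.14)

# Finite-difference version of the limit in (2.15), evaluated at x = 1.
delta = 1e-6
density_from_limit = (G(1.0 + delta) - G(1.0)) / delta

print(prob_from_density, prob_from_cumulative)
print(density_from_limit, p(1.0))
```

Both routes to the interval probability agree to within the accuracy of the numerical integration, and the finite difference of G reproduces the density.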

In terms of probability densities, the product rule (2.4) can now be written as

$$p(x, y|I) = p(x|y, I)\,p(y|I). \tag{2.17}$$

Likewise, the normalization (2.10) can be written as

$$\int p(x|I)\,dx = 1, \tag{2.18}$$

the marginalization/decomposition (2.12) as

$$p(x|I) = \int p(x, y|I)\,dy = \int p(x|y, I)\,p(y|I)\,dy \tag{2.19}$$

and Bayes' theorem (2.13) as

$$p(y|x, I) = \frac{p(x|y, I)\,p(y|I)}{\int p(x|y, I)\,p(y|I)\,dy}. \tag{2.20}$$
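
In practice (2.20) is often evaluated on a grid. The sketch below is one way to do this, under assumptions made only for illustration: a single observation x = 2 from an exponential model p(x|y) = y exp(-y x), a uniform prior for y on the range 0 to 20, and the trapezoidal rule for the integral in the denominator.

```python
import math


def grid_posterior(likelihood, prior, grid):
    """Evaluate Bayes' theorem (2.20) on an equidistant grid of y values."""
    step = grid[1] - grid[0]
    unnormalized = [likelihood(y) * prior(y) for y in grid]
    # Trapezoidal estimate of the denominator of (2.20).
    evidence = step * (sum(unnormalized) - 0.5 * (unnormalized[0] + unnormalized[-1]))
    return [u / evidence for u in unnormalized], evidence


X_OBS = 2.0    # assumed single observation
Y_MAX = 20.0   # assumed upper edge of the uniform prior
grid = [i * 0.01 for i in range(2001)]

posterior, evidence = grid_posterior(
    likelihood=lambda y: y * math.exp(-y * X_OBS),
    prior=lambda y: 1.0 / Y_MAX,
    grid=grid,
)

# Position of the posterior maximum on the grid.
mode = grid[max(range(len(grid)), key=posterior.__getitem__)]
print(evidence, mode)
```

With a uniform prior the posterior has the shape of the likelihood viewed as a function of y, so its maximum is expected at y = 1/x; the evidence only fixes the normalization.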

Exercise 2.7: A counter produces a yes/no signal S when it is traversed by a pion.
Given are the efficiency $P(S|\pi, I) = \epsilon$ and mis-identification probability $P(S|\bar{\pi}, I) = \delta$.
The fraction of pions in the beam is $P(\pi|I) = \mu$. What is the probability $P(\pi|S, I)$ that
a particle which generates a signal is a pion in case (i) $\mu$ is known and (ii) $\mu$ is unknown?
In the latter case assume a uniform prior distribution for $\mu$ in the range $0 \le \mu \le 1$.

We make four remarks: (1) Probabilities are dimensionless numbers so that the dimension of a density is the reciprocal of the dimension of the variable. This implies
that p(x) transforms when we make a change of variable $x \to f(x)$. The size of the
infinitesimal element dx corresponding to df is given by $dx = |dx/df|\,df$; because the
probability content of this element must be invariant we have

$$p(f|I)\,df = p(x|I)\,dx = p(x|I)\left|\frac{dx}{df}\right| df \quad \text{and thus} \quad p(f|I) = p(x|I)\left|\frac{dx}{df}\right|. \tag{2.21}$$
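
A quick numerical illustration of (2.21), with an example chosen here for convenience: x is taken uniform on the interval from 0 to 1 and the new variable is f = exp(x), so that x = ln f and |dx/df| = 1/f. The helper `integrate` is a plain midpoint rule.

```python
import math


def p_x(x):
    """Assumed density of x: uniform on [0, 1)."""
    return 1.0 if 0.0 <= x < 1.0 else 0.0


def p_f(f):
    """Density of f = exp(x) from (2.21): p(f) = p(x) |dx/df| with x = ln f."""
    x = math.log(f)
    return p_x(x) * abs(1.0 / f)


def integrate(func, a, b, steps=10_000):
    """Midpoint-rule estimate of the integral of func from a to b."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))


# The transformed density is still normalized on the image interval [1, e).
norm_f = integrate(p_f, 1.0, math.e)

# The probability content of corresponding intervals is the same in x and in f.
content_x = integrate(p_x, 0.2, 0.7)
content_f = integrate(p_f, math.exp(0.2), math.exp(0.7))

print(norm_f, content_x, content_f)
```

The density in f is no longer uniform (it falls like 1/f) although the probability content of corresponding intervals is unchanged, which is the point taken up in the second remark.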
Exercise 2.8: A lighthouse at sea is positioned a distance d from the coast.

[Figure: sketch of the lighthouse geometry, with axes x and y and the position x0 marked.]

This lighthouse emits collimated light pulses at random times in random directions, that
is, the distribution of pulses is uniform in the emission angle $\vartheta$. Derive an expression for the probability to
observe a light pulse as a function of the position x along the coast. (From Sivia [5].)

(2) Without prior information it is tempting to choose a uniform distribution for the
prior density p(y|I) in Bayes' theorem (2.20). However, the distribution of a transformed
variable z = f(y) will then, in general, not be uniform. An ambiguity arises in the
Bayesian formalism if our lack of information can be encoded equally well by a uniform
distribution in y as by a uniform distribution in z. How to select an optimal noninformative prior among (perhaps many) alternatives is an important, and sometimes
controversial, issue: we will come back to it in Section 5. Note, however, that in an
iterative learning process the choice of initial prior becomes increasingly less important
because the posterior obtained at one step can be taken as the prior in the next step (footnote 11).
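
The diminishing influence of the initial prior can be illustrated with a toy learning process. The sketch assumes a coin with unknown bias h, discretized on a grid, and two different initial priors (flat, and proportional to h^4); the stand-in data are a fixed pattern with 30% heads rather than real or random flips. After each flip the posterior is used as the prior for the next one.

```python
N_GRID = 201
grid = [i / (N_GRID - 1) for i in range(N_GRID)]   # candidate values of the coin bias h


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def update(prior, flip):
    """One learning step on the grid: posterior proportional to likelihood times prior."""
    return normalize([(h if flip else 1.0 - h) * p for h, p in zip(grid, prior)])


def mean(dist):
    return sum(h * p for h, p in zip(grid, dist))


def mean_gap_after(n_flips):
    """Difference between the posterior means obtained from the two initial priors."""
    flat = normalize([1.0] * N_GRID)
    biased = normalize([h**4 for h in grid])
    for i in range(n_flips):
        flip = 1 if i % 10 < 3 else 0   # deterministic stand-in data, 30% heads
        flat, biased = update(flat, flip), update(biased, flip)
    return abs(mean(flat) - mean(biased))


gap_10 = mean_gap_after(10)
gap_1000 = mean_gap_after(1000)
print(gap_10, gap_1000)
```

The gap between the two posterior means shrinks as more flips are processed, although, as footnote 11 warns, how fast this happens depends on the initial prior.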
Exercise 2.9: We put two pion counters in the beam, both with efficiency $P(S|\pi, I) = \epsilon$
and mis-identification probability $P(S|\bar{\pi}, I) = \delta$. The fraction of pions in the beam is
$P(\pi|I) = \mu$. A particle traverses and both counters give a positive signal. What is the
probability that the particle is a pion? Calculate this probability by first taking the
posterior of the measurement in counter (1) as the prior for the measurement in counter
(2) and second by considering the two responses (S1, S2) as one measurement and using
Bayes' theorem directly. In order to get the same result in both calculations an assumption
has to be made on the measurements in counters (1) and (2). What is this assumption?

(3) The equations (2.19) and (2.20) are, together with their discrete equivalents, all
you need in Bayesian inference. Indeed, apart from the approximations and transformations described in Section 3 and the maximum entropy principle described in Section 5,
the remainder of these lectures will be nothing more than repeated applications of decomposition, probability inversion and marginalization.
(4) Plausible inference is, strictly speaking, always conducted in terms of probabilities
instead of probability densities. A density p(x|I) is turned into a probability by multiplying it with the infinitesimal element dx. For conditional probabilities dx should refer
to the random variable (in front of the vertical bar) and not to the condition (behind
the vertical bar); thus p(x|y, I)dx is a probability but p(x|y, I)dy is not.

2.5 Bayesian versus Frequentist inference

In the previous sections some mention was made of the differences between the Bayesian
and Frequentist approaches. Although it is not the intention of these lectures to make
a detailed comparison of the two methods we think it is appropriate to make a few
remarks. For this we will follow the quasi-historical account of T. Loredo [1] which
helps to put matters in a proper perspective.
First, it is important to realize that the calculus presented in the previous sections
gives us the rules for manipulating probabilities but not how to assign them. Sampling
distributions (or likelihoods) are not much of a problem since they can, at least in
principle, be assigned by deductive reasoning within a suitable model (a Monte Carlo
simulation, for instance). However, the situation is less clear for the assignment of priors.
Bernoulli (1713) was the first to formulate a rule for a probability assignment which he
called the principle of insufficient reason:
If for a set of N exclusive and exhaustive propositions there is no evidence
to prefer any proposition over the others then each proposition must be
assigned an equal probability 1/N.
Footnote 11: The rate of convergence can be much affected by the initial choice of prior, see Section 5.1 for a
nice example taken from the book by Sivia.

While this can be considered as a solid basis for dealing with discrete variables, problems
arise when we try to extend the principle to infinitely large sets, that is, to continuous
variables. This is because a uniform distribution (uninformative) can be turned into a
non-uniform distribution (informative) by a simple coordinate transformation. A solution is provided by not using uniformity but, instead, maximum entropy as a criterion
to select uninformative probability densities, see Section 5.
People were also concerned by the lack of rationale to use the axioms (2.3) and (2.4)
for the calculus of a degree of belief which was, indeed, the concept of probability
for Bernoulli (1713), Bayes (1763) and, above all, Laplace (1812) who was the first to
formulate Bayesian inference as we know it today. In spite of the considerable success
of Laplace in applying Bayesian methods to astronomy, medicine and other fields of
investigation, the concept of plausibility was considered to be too vague, and also too
subjective, for a proper mathematical formulation. This nail in the coffin of Bayesianism
has of course been removed by the work of Cox (1946), see Section 2.1.
An apparent masterstroke to eliminate all subjectiveness was to define probability as the
limiting frequency of the occurrence of an event in an infinite sample of repetitions or,
equivalently, in an infinite ensemble of identical copies of the process under study. Such
a definition of probability is consistent with the rules (2.3) and (2.4) but inconsistent
with the notion of a probability assignment to a hypothesis since in the repetitions of
an experiment such a hypothesis can only be true or false and nothing else (footnote 12). This then
invalidates the inversion of P(D|HI) to P(H|DI) with Bayes' theorem which is quite
convenient since it removes, together with the theorem itself, the need to specify this
disturbing prior probability P(H|I).
However, the problem is now how to judge the validity of a hypothesis without allowing
access via Bayes' theorem. The Frequentist answer is the construction of a so-called
statistic. Such a statistic is a function of the data and thus a random variable with
a distribution which can be derived directly from the sampling distribution. The idea
is now to construct a statistic which is sensitive to the hypothesis being tested and to
compare the data with the long-term behavior of this statistic. Well known examples
of a statistic are the sample mean, variance, $\chi^2$ and so on. A disadvantage is that it
is far from obvious how to construct a statistic when we have to deal with complicated
problems. As guidance many criteria have been invented, like unbiasedness, efficiency,
consistency and sufficiency, but this still does not lead to a unique definition.
Exercise 2.10: The number n of radioactive decays observed during a time interval t
is Poisson distributed (Section 4.4)

$$P(n|Rt) = \frac{(Rt)^n}{n!}\,e^{-Rt},$$

where R is the decay rate. Motivate the use of n/t as a statistic for R (consult a
statistics textbook if necessary).

Another unattractive feature of the Frequentist approach is that conclusions are based
on hypothetical repetitions of the experiment which never took place. This is in stark
contrast to Bayesian methods where the evidence is provided by the available data and
prior information. Many repetitions (events) actually do take place in particle physics
experiments but this is not so in observational sciences like astronomy for instance. Also,
it turns out that the construction of a sampling distribution is not as unambiguous as
one might think. Sampling distributions may depend, in fact, not only on the relevant
information carried by the data but also on how the data were actually obtained. A
well known example of this is the so-called stopping problem which we will discuss in
Section 7 (footnote 13).

Footnote 12: A model parameter thus has for a Frequentist a fixed, but unknown, value. This is also the view
of a Bayesian since the probability distribution that he assigns to the parameter does not describe how
it fluctuates but how uncertain we are of its value.
Guidance in the construction of priors, or of any probability, is provided by sampling
theory (the study of random processes like those occurring in games of chance), group
theory (the study of symmetries and invariants) and, above all, the principle of maximizing information entropy (see Section 5.3). However, one may take as a rule of thumb
that when the prior is wide compared to the likelihood, it does not matter very much
what distribution we choose. On the other hand, when the likelihood is so wide that it
competes with any reasonable prior then this simply means that the experiment does
not carry much information on the problem at hand. In such a case it should not come
as a surprise that answers become dominated by prior knowledge or assumptions. Of
course there is nothing wrong with that, as long as these prior assumptions are clearly
stated. The prior also plays an important role when the likelihood peaks near a physical boundary or, as very well may happen, resides in an unphysical region (likelihoods
related to neutrino mass measurements are a famous example; for another example see
Section 6.3 in this write-up). Note that in case the prior is important, Bayesian and
Frequentist methods might yield different answers to your problem.
With these remarks we leave the Bayesian-Frequentist debate for what it is and refer to
the abundant literature on the subject, see e.g. [10] for recent discussions.
In the previous sections we have explicitly kept the probability densities conditional on
'I' in all expressions as a reminder that Bayesian probabilities are always defined in
relation to some background information and also that this information must be the
same for all probabilities in a given expression; if this is not the case, calculations may
lead to paradoxical results (footnote 14). This background information should not be regarded as
encompassing all that is known but simply as a list of what is needed to unambiguously
define all the probabilities in an expression. In the following we will be a bit liberal and
sometimes omit 'I' when it clutters the notation.

3 Posterior Representation

The full result of Bayesian inference is the posterior distribution. However, instead of
publishing this distribution in the form of a parameterization, table, plot or computer
program it is often more convenient to summarize the posterior in terms of a few parameters. In the following subsection we present the mean, variance and covariance
as a measure of position and width. The remaining two subsections are devoted to
transformations of random variables and to the properties of the covariance matrix.

^13 Many people consider the stopping problem the nail in the coffin of Frequentism.
^14 This happens, for instance, when we calculate a posterior $P_1(H|D,I) \propto P(D|H,I)\,P(H|I)$ and
feed this posterior back into Bayes' theorem as an improved prior to calculate
$P_2(H|D,I) \propto P(D|H,I)\,P_1(H|D,I)$ using the same data D. It is obvious already from the inconsistent notation that
this kind of bootstrapping is not allowed.

3.1

Mean, variance and covariance

The expectation value of a function f(x) is defined by^15

$$ \langle f \rangle = \int f(x)\, p(x|I)\, dx \tag{3.1} $$

where the integration domain is understood to be the definition range of the distribution
p(x|I). The k-th moment of a distribution is the expectation value $\langle x^k \rangle$. From (2.18)
it immediately follows that the zeroth moment $\langle x^0 \rangle = 1$. The first moment is called
the mean of the distribution and is a location measure

$$ \mu = \bar{x} = \langle x \rangle = \int x\, p(x|I)\, dx. \tag{3.2} $$

The variance $\sigma^2$ is the second moment about the mean

$$ \sigma^2 = \langle \Delta x^2 \rangle = \langle (x-\mu)^2 \rangle = \int (x-\mu)^2\, p(x|I)\, dx. \tag{3.3} $$

The square root of the variance is called the standard deviation and is a measure of
the width of the distribution.
Exercise 3.1: Show that the variance is related to the first and second moments by
$\langle \Delta x^2 \rangle = \langle x^2 \rangle - \langle x \rangle^2$.
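
The definitions (3.1) to (3.3) can be checked numerically. The sketch below is only an illustration: the density p(x) = exp(-x) on x >= 0, the truncation at x = 50 and the helper name `expectation` are assumptions made here, not part of the lecture notes.

```python
import math

def expectation(f, p, lo, hi, steps=100_000):
    """Midpoint-rule estimate of <f> = integral of f(x) p(x) dx over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += f(x) * p(x)
    return total * dx

def density(x):
    # Illustrative density p(x) = exp(-x) on [0, infinity); the tail beyond
    # x = 50 is negligible, so the integration range is truncated there.
    return math.exp(-x)

LO, HI = 0.0, 50.0
zeroth = expectation(lambda x: 1.0, density, LO, HI)                 # <x^0>
mean = expectation(lambda x: x, density, LO, HI)                     # (3.2)
second = expectation(lambda x: x * x, density, LO, HI)               # <x^2>
variance = expectation(lambda x: (x - mean) ** 2, density, LO, HI)   # (3.3)

print(zeroth, mean, variance, second - mean ** 2)
```

For this density the zeroth moment, the mean and the variance all come out close to 1, and the last two printed numbers agree, which is the relation asked for in Exercise 3.1.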

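The width of a multivariate distribution is characterized by the covariance matrix:

$$ \mu_i = \bar{x}_i = \langle x_i \rangle = \int \cdots \int x_i\, p(x_1, \ldots, x_n|I)\, dx_1 \cdots dx_n $$

$$ V_{ij} = \langle \Delta x_i \Delta x_j \rangle = \int \cdots \int (x_i - \mu_i)(x_j - \mu_j)\, p(x_1, \ldots, x_n|I)\, dx_1 \cdots dx_n. \tag{3.4} $$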
The covariance matrix is obviously symmetric.
Exercise 3.2: Show that the off-diagonal elements of Vij vanish when x1 , . . . , xn are
independent variables.

A correlation between the variables is better judged from the matrix of correlation
coefficients which is defined by

$$ \rho_{ij} = \frac{V_{ij}}{\sqrt{V_{ii} V_{jj}}} = \frac{V_{ij}}{\sigma_i \sigma_j}. \tag{3.5} $$

It can be shown that $-1 \le \rho_{ij} \le +1$.

^15 We discuss here only continuous variables; the expressions for discrete variables are obtained by
replacing the integrals with sums.
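
As a rough illustration of (3.4) and (3.5), the covariance and correlation matrices can be estimated from samples. The construction below (x1 = a, x2 = a + b with independent unit Gaussians a and b, the seed and the helper names) is an assumption chosen for this sketch; for it one expects V = [[1, 1], [1, 2]] and a correlation coefficient of 1/sqrt(2).

```python
import math
import random

rng = random.Random(1)  # fixed seed, chosen arbitrarily, for reproducibility

# Illustrative correlated pair: x1 = a, x2 = a + b with a, b independent unit Gaussians.
samples = []
for _ in range(100_000):
    a = rng.gauss(0.0, 1.0)
    b = rng.gauss(0.0, 1.0)
    samples.append((a, a + b))

def covariance_matrix(points):
    """Sample estimate of the means mu_i and of V_ij = <dx_i dx_j>, cf. (3.4)."""
    n = len(points)
    dim = len(points[0])
    mu = [sum(pt[i] for pt in points) / n for i in range(dim)]
    V = [[sum((pt[i] - mu[i]) * (pt[j] - mu[j]) for pt in points) / n
          for j in range(dim)] for i in range(dim)]
    return mu, V

def correlation_matrix(V):
    """rho_ij = V_ij / sqrt(V_ii V_jj), cf. (3.5)."""
    dim = len(V)
    return [[V[i][j] / math.sqrt(V[i][i] * V[j][j]) for j in range(dim)]
            for i in range(dim)]

mu, V = covariance_matrix(samples)
rho = correlation_matrix(V)
print(V)
print(rho)
```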

The position of the maximum of a probability density function is called the mode^16
which often is taken as a location parameter (provided the distribution has a single
maximum). For the general case of an n-dimensional distribution one finds the mode
by minimizing the function $L(x) = -\ln p(x|I)$. Expanding L around the point $\hat{x}$ (to
be specified below) we obtain

$$ L(x) = L(\hat{x}) + \sum_{i=1}^{n} \frac{\partial L(\hat{x})}{\partial x_i}\, \Delta x_i + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial^2 L(\hat{x})}{\partial x_i \partial x_j}\, \Delta x_i \Delta x_j + \cdots \tag{3.6} $$

with $\Delta x_i \equiv x_i - \hat{x}_i$. We now take $\hat{x}$ to be the point where L is minimum. With this
choice $\hat{x}$ is a solution of the set of equations

$$ \frac{\partial L(\hat{x})}{\partial x_i} = 0 \tag{3.7} $$

so that the second term in (3.6) vanishes. Up to second order, the expansion can now
be written in matrix notation as

$$ L(x) = L(\hat{x}) + \tfrac{1}{2} (x - \hat{x})\, H\, (x - \hat{x}) + \cdots, \tag{3.8} $$

where the Hessian matrix of second derivatives is defined by

$$ H_{ij} \equiv \frac{\partial^2 L(\hat{x})}{\partial x_i \partial x_j}. \tag{3.9} $$

Exponentiating (3.8) we find for our approximation of the probability density in the
neighborhood of the mode:

$$ p(x|I) \approx C \exp\left[ -\tfrac{1}{2} (x - \hat{x})\, H\, (x - \hat{x}) \right] \tag{3.10} $$

where C is a normalization constant. Now if we identify the inverse of the Hessian with
a covariance matrix V then the approximation (3.10) is just a multivariate Gaussian
in x-space^17

$$ p(x|I) \approx \frac{1}{\sqrt{(2\pi)^n |V|}} \exp\left[ -\tfrac{1}{2} (x - \hat{x})\, V^{-1} (x - \hat{x}) \right], \tag{3.11} $$

where |V| denotes the determinant of V.^18

Sometimes the posterior is such that the mean and covariance matrix (or Hessian) can
be calculated analytically. In most cases, however, minimization programs like minuit
are used to determine numerically $\hat{x}$ and V from L(x) (the function L is calculated in
the subroutine fcn provided by the user).
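
To make the numerical procedure concrete, here is a minimal one-dimensional sketch of what such a minimization does: it locates the minimum of L(x) = -ln p(x) and takes the inverse second derivative as the variance. It is not how minuit works internally; the density p(x) proportional to x^3 exp(-x^2/2), the Newton iteration, the starting point and the step sizes are all assumptions made for illustration.

```python
import math

def neg_log_density(x):
    # L(x) = -ln p(x) up to an additive constant, for the illustrative
    # density p(x) proportional to x^3 exp(-x^2 / 2) on x > 0.
    return 0.5 * x * x - 3.0 * math.log(x)

def first_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of d2f/dx2."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

# Newton iteration on dL/dx = 0, cf. (3.7), from a hypothetical starting point.
x_hat = 1.0
for _ in range(50):
    step = (first_derivative(neg_log_density, x_hat)
            / second_derivative(neg_log_density, x_hat))
    x_hat -= step
    if abs(step) < 1e-9:
        break

hessian = second_derivative(neg_log_density, x_hat)  # (3.9) in one dimension
variance = 1.0 / hessian                             # V = inverse of H

print(x_hat, variance)
```

For this density the analytic result is a mode at sqrt(3) and a Gaussian-approximation variance of 1/2, which the sketch reproduces to within the finite-difference accuracy.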
The approximation (3.11) is easily marginalized. It can be shown that integrating a
multivariate Gaussian over one variable $x_i$ is equivalent to deleting the corresponding
row and column i in the covariance matrix V. This defines a new covariance matrix $V'$ and, by
inversion, a new Hessian $H'$. Replacing V by $V'$ and n by (n − 1) in (3.11)
then gives the integrated Gaussian. It is now easy to see that integration over all but
one $x_i$ gives

$$ p(x_i|I) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x_i - \hat{x}_i}{\sigma_i} \right)^2 \right] \tag{3.12} $$

where $\sigma_i^2$ is the diagonal element $V_{ii}$ of the covariance matrix V.^19

^16 We denote the mode by $\hat{x}$ to distinguish it from the mean $\bar{x}$. For symmetric distributions this
distinction is irrelevant since then $\hat{x} = \bar{x}$.
^17 This is why we have expanded in (3.6) the logarithm instead of the distribution itself: only then
the inverse of the second derivative matrix is equal to the covariance matrix of a multivariate Gaussian.
^18 There is no problem with $\sqrt{|V|}$ since $|V|$ is positive, as will be shown in Section 3.3.
Let us close this section by making the remark that communicating a posterior by
only giving the mean (or mode) and covariance is inappropriate, and perhaps even
misleading, when the distribution is multi-modal, strongly asymmetric or exhibits long
tails.^20 Note also that there are distributions for which mean and variance are ill defined
as is, for instance, the case for a uniform distribution on $[-\infty, +\infty]$. Mean and variance
may not even exist because the integrals (3.2) or (3.3) are divergent. An example of
this is the Cauchy distribution

$$ p(x|I) = \frac{1}{\pi}\, \frac{1}{1 + x^2}. $$

Exercise 3.3: The Cauchy distribution is often called the Breit-Wigner distribution
which usually is parameterized as

$$ p(x|x_0, \Gamma) = \frac{1}{\pi}\, \frac{\Gamma/2}{(\Gamma/2)^2 + (x - x_0)^2}. $$

Use (3.6), in one dimension, to characterize the position and width of the Breit-Wigner
distribution. Show also that $\Gamma$ is the FWHM (full width at half maximum).

If the choice of an optimal value is of strategic importance, like in business or finance,


then this choice is often based on decision theory (not described here) instead of just
taking the mean or mode.

3.2

Transformations

In this section we briefly describe how to construct probability densities of functions of


a (multi-dimensional) random variable. We start by calculating the probability density
p(z|I) of a single function z = f (x) from a given distribution p(x|I) of n variables x.
Decomposition gives
$$ p(z|I) = \int p(z, x|I)\, dx = \int p(z|x, I)\, p(x|I)\, dx = \int \delta[z - f(x)]\, p(x|I)\, dx \tag{3.13} $$

where we have made the trivial assignment $p(z|x, I) = \delta[z - f(x)]$. This assignment
guarantees that the integral only receives contributions from the hyperplane f(x) = z.

^19 The error on a fitted parameter given by minuit is the square root of the diagonal element of the covariance matrix
and is thus the width of the marginal distribution of this parameter.
^20 Without additional information people will assume that the posterior is unimodal and that a one
standard deviation interval contains roughly 68.3% probability as is the case for a Gauss distribution.

As an example consider two independent variables x and y distributed according to
p(x, y|I) = f (x)g(y). Using (3.13) we find that the distribution of the sum z = x + y is
given by the Fourier convolution of f and g
$$ p(z|I) = \int f(x)\, g(z - x)\, dx = \int f(z - y)\, g(y)\, dy. \tag{3.14} $$

Likewise we find that the product z = xy is distributed according to the Mellin convolution of f and g

$$ p(z|I) = \int f(x)\, g(z/x)\, \frac{dx}{|x|} = \int f(z/y)\, g(y)\, \frac{dy}{|y|}, \tag{3.15} $$
provided that the definition ranges do not include x = 0 and y = 0.
Exercise 3.4: Use (3.13) to derive Eqs. (3.14) and (3.15).
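
The convolution (3.14) is easy to evaluate numerically. As an illustrative sketch (the choice of two uniform densities on [0, 1] and the helper names are assumptions made here), the sum of two such variables should follow the triangular density, equal to z for z <= 1 and to 2 - z for z > 1.

```python
def uniform(x):
    """Uniform density on [0, 1] (illustrative choice for both f and g)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def sum_density(f, g, z, lo, hi, steps=10_000):
    """Midpoint-rule estimate of p(z) = integral of f(x) g(z - x) dx, cf. (3.14)."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += f(x) * g(z - x)
    return total * dx

# f vanishes outside [0, 1], so it suffices to integrate over that range.
density_of_sum = {z: sum_density(uniform, uniform, z, 0.0, 1.0)
                  for z in (0.5, 1.0, 1.5)}
print(density_of_sum)
```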

In case of a coordinate transformation it may be convenient to just use the Jacobian


as we have done in (2.21) in Section 2.4. By a coordinate transformation we mean a
mapping $\mathbb{R}^n \to \mathbb{R}^n$ by a set of n functions

$$ z(x) = \{z_1(x), \ldots, z_n(x)\}, $$

for which there exists an inverse transformation

$$ x(z) = \{x_1(z), \ldots, x_n(z)\}. $$

The probability density of the transformed variables is then given by

$$ q(z|I) = p[x(z)|I]\; |J| \tag{3.16} $$

where |J| is the absolute value of the determinant of the Jacobian matrix

$$ J_{ik} = \frac{\partial x_i}{\partial z_k}. \tag{3.17} $$

Exercise 3.5: Let x and y be two independent variables distributed according to p(x, y|I) =
f(x)g(y). Let u = x + y and v = x − y. Use (3.16) to obtain an expression for q(u, v|I)
in terms of f and g and show, by integrating over v, that the marginal distribution of
u is given by (3.14). Likewise define u = xy and v = x/y and show that the marginal
distribution of u is given by (3.15).

The above, although it formally settles the issue of how to deal with functions of random
variables, often gives rise to tedious algebra as can be seen from the following exercise:^21
Exercise 3.6: Two variables $x_1$ and $x_2$ are independently Gaussian distributed:

$$ p(x_i|I) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2 \right], \qquad i = 1, 2. $$

Show, by carrying out the integral in (3.14), that the variable $z = x_1 + x_2$ is Gaussian
distributed with mean $\mu = \mu_1 + \mu_2$ and variance $\sigma^2 = \sigma_1^2 + \sigma_2^2$.

^21 Later on we will use Fourier transforms (characteristic functions) to make life much easier.


An easy way to deal with any function of any distribution is to generate p(x|I) by
Monte Carlo, calculate F (x) at each generation and histogram the result. However,
if we are content with summarizing the distributions by a location parameter and a
covariance matrix, then there is a very simple transformation rule, known as linear
error propagation.
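Let $F_\lambda(x)$ be one of a set of m functions of x. Linear approximation gives

$$ \Delta F_\lambda \equiv F_\lambda(x) - F_\lambda(\hat{x}) = \sum_{i=1}^{n} \frac{\partial F_\lambda(\hat{x})}{\partial x_i}\, \Delta x_i \tag{3.18} $$

with $\Delta x_i = x_i - \hat{x}_i$. Here $\hat{x}$ stands for your favorite location parameter (usually mean
or mode). Now multiplying (3.18) by the expression for $\Delta F_\mu$ and averaging obtains

$$ \langle \Delta F_\lambda \Delta F_\mu \rangle = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial F_\lambda}{\partial x_i} \frac{\partial F_\mu}{\partial x_j}\, \langle \Delta x_i \Delta x_j \rangle. \tag{3.19} $$

Eq. (3.19) can be written in compact matrix notation as

$$ V_F = D\, V_x\, D^T \tag{3.20} $$

where $D^T$ denotes the transpose of the m × n derivative matrix $D_{\lambda i} = \partial F_\lambda / \partial x_i$. Well
known applications are the quadratic addition of errors for a sum of independent variables

$$ \sigma^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2 \quad \text{for} \quad z = x_1 + x_2 + \cdots + x_n \tag{3.21} $$

and the quadratic addition of relative errors for a product of independent variables

$$ \left( \frac{\sigma}{z} \right)^2 = \left( \frac{\sigma_1}{x_1} \right)^2 + \left( \frac{\sigma_2}{x_2} \right)^2 + \cdots + \left( \frac{\sigma_n}{x_n} \right)^2 \quad \text{for} \quad z = x_1 x_2 \cdots x_n. \tag{3.22} $$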
Exercise 3.7: Use (3.19) to derive the two propagation rules (3.21) and (3.22).
Exercise 3.8: A counter is traversed by N particles and fires n times. Since $n \le N$ these
counts are not independent but n and m = N − n are. Assume Poisson errors (Section 4.4)
$\sigma_n = \sqrt{n}$ and $\sigma_m = \sqrt{m}$ and use (3.19) to show that the error on the efficiency $\varepsilon = n/N$
is given by

$$ \sigma_\varepsilon = \sqrt{\frac{\varepsilon (1 - \varepsilon)}{N}} $$

This is known as the binomial error, see Section 4.2.
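
Linear error propagation (3.20) can be sketched in a few lines and compared with the Monte Carlo approach mentioned at the start of this discussion. Everything specific below is an assumption made for illustration: the two functions (a sum and a product), the central values, the errors, the seed and the helper names `jacobian` and `propagate`.

```python
import random

def jacobian(funcs, x, h=1e-6):
    """Numerical derivative matrix D[a][i] = dF_a/dx_i by central differences."""
    D = []
    for F in funcs:
        row = []
        for i in range(len(x)):
            xp, xm = list(x), list(x)
            xp[i] += h
            xm[i] -= h
            row.append((F(xp) - F(xm)) / (2.0 * h))
        D.append(row)
    return D

def propagate(D, V):
    """V_F = D V D^T, cf. (3.20)."""
    m, n = len(D), len(V)
    return [[sum(D[a][i] * V[i][j] * D[b][j] for i in range(n) for j in range(n))
             for b in range(m)] for a in range(m)]

# Hypothetical inputs: x1 = 10 +- 0.2 and x2 = 5 +- 0.1, uncorrelated.
x_hat = [10.0, 5.0]
V_x = [[0.04, 0.0], [0.0, 0.01]]
funcs = [lambda x: x[0] + x[1],   # sum, cf. (3.21)
         lambda x: x[0] * x[1]]   # product, cf. (3.22)

V_F = propagate(jacobian(funcs, x_hat), V_x)
print(V_F)

# Monte Carlo cross-check of the variance of the product.
rng = random.Random(2)
products = [rng.gauss(10.0, 0.2) * rng.gauss(5.0, 0.1) for _ in range(100_000)]
mc_mean = sum(products) / len(products)
mc_var = sum((z - mc_mean) ** 2 for z in products) / len(products)
print(mc_var)
```

The linear result for the product is a variance of 2, that is, a relative error squared of (0.2/10)^2 + (0.1/5)^2. The Monte Carlo estimate agrees within its statistical precision because the relative errors are small; for large relative errors the linear approximation would degrade.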

3.3

The covariance matrix

In this section we investigate in some more detail the properties of the covariance matrix
which, together with the mean, fully characterizes the multivariate Gaussian (3.11).
In Section 3.1 we have already remarked that V is symmetric but not every symmetric
matrix can serve as a covariance matrix. To see this, consider a function f (x) of a set
of Gaussian random variables x. For the variance of f we have according to (3.20)
$$ \sigma^2 = \langle \Delta f^2 \rangle = d\, V\, d, $$

where d is the vector of derivatives $\partial f / \partial x_i$. But since $\sigma^2$ is positive for any function f
it follows that the following inequality must hold:

$$ d\, V\, d > 0 \quad \text{for any nonzero vector } d. $$

(3.23)

A matrix which possesses this property is called positive definite.


A covariance matrix can be diagonalized by a unitary transformation. This can be
seen from Fig. 1 where we show the one standard deviation contour of two correlated
Gaussian variables $(x_1, x_2)$ and two uncorrelated variables $(y_1, y_2)$. It is clear from these
plots that the two error ellipses are related by a simple rotation. The rotation matrix is
unique, provided that the variables x are linearly independent. A pure rotation is not
the only way to diagonalize the covariance matrix since the rotation can be combined
with a scale transformation along $y_1$ or $y_2$.

Figure 1: The one standard deviation contour of a two dimensional Gaussian for (a) correlated
variables $x_1$ and $x_2$ and (b) uncorrelated variables $y_1$ and $y_2$. The marginal distributions of $x_1$ and $x_2$
have a standard deviation of $\sigma_1$ and $\sigma_2$, respectively.

The rotation U which diagonalizes V must, according to the transformation rule (3.19),
satisfy the relation
$$ U V U^T = L \quad \Rightarrow \quad V U^T = U^T L \tag{3.24} $$

where $L = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ denotes a diagonal matrix. In (3.24) we have used the
property $U^{-1} = U^T$ of an orthogonal transformation. Let the columns i of $U^T$ be
denoted by the set of vectors $u^i$, that is,

$$ u^i_j = U^T_{ji} = U_{ij}. \tag{3.25} $$

It is then easy to see that (3.24) corresponds to the set of eigenvalue equations

$$ V u^i = \lambda_i u^i. \tag{3.26} $$

Thus, the rotation matrix U and the vector of diagonal elements $\lambda_i$ are determined by
the complete set of eigenvectors and eigenvalues of the covariance matrix V.
Exercise 3.9: Show that (3.24) is equivalent to (3.26).
Exercise 3.10: (i) Show that for a symmetric matrix V and two arbitrary vectors x and
y the following relation holds: $y\, V x = x\, V y$; (ii) Show that the eigenvectors $u^i$ and $u^j$
of a symmetric matrix V are orthogonal, that is, $u^i \cdot u^j = 0$ for $i \ne j$; (iii) Show that the
eigenvalues of a positive definite symmetric matrix V are all positive.
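
For two variables the diagonalizing rotation can be written down explicitly, which gives a quick numerical check of (3.24) and (3.26). The covariance matrix below (standard deviations 2 and 1, correlation 0.6) and the closed-form rotation angle are assumptions introduced for this sketch.

```python
import math

# Illustrative covariance matrix: sigma_1 = 2, sigma_2 = 1, rho = 0.6.
V = [[4.0, 1.2], [1.2, 1.0]]

# Rotation angle that removes the off-diagonal element of U V U^T.
theta = 0.5 * math.atan2(2.0 * V[0][1], V[0][0] - V[1][1])
c, s = math.cos(theta), math.sin(theta)
U = [[c, s], [-s, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

L = matmul(matmul(U, V), transpose(U))   # (3.24): should be diagonal
eigenvalues = [L[0][0], L[1][1]]

# The rows of U (columns of U^T) should satisfy V u = lambda u, cf. (3.26).
residuals = []
for lam, u in zip(eigenvalues, U):
    Vu = [V[0][0] * u[0] + V[0][1] * u[1], V[1][0] * u[0] + V[1][1] * u[1]]
    residuals.append(max(abs(Vu[k] - lam * u[k]) for k in range(2)))

print(L, eigenvalues, residuals)
```

Both eigenvalues come out positive, their sum equals the trace of V and their product equals its determinant, in line with the properties discussed in this section.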

The normalization factor of the multivariate Gaussian (3.11) contains the
square root of the determinant |V|, which only makes sense if this determinant is positive.
Indeed,

$$ \prod_i \lambda_i = |L| = |U V U^T| = |U|\, |V|\, |U^T| = |V| > 0 $$

where we have used the fact that |U| = 1 and that all the eigenvalues of V are positive.

4 Basic Probability Assignment

In Section 2.3 we have introduced the operation of decomposition which allows us to


assign a compound probability by writing it as a sum of known elementary probabilities.
As a very basic application of decomposition in combination with Bernoulli's principle
of insufficient reason we will discuss, in the next section, drawing balls from an urn.
In passing we will make some observations about probabilities which are very Bayesian
and which you may find quite surprising if you have never encountered them before.
We then proceed by deriving the Binomial distribution in subsection 4.2 using nothing
else but the sum and product rules of probability calculus. Multinomial and Poisson
distributions are introduced in the subsections 4.3 and 4.4. In the last subsection we
will introduce a new tool, the characteristic function, and derive the Gauss distribution
as the limit of a sum of arbitrarily distributed random variables.

4.1

Bernoulli's urn

Consider an experiment where balls are drawn from an urn. Let the urn contain N
balls and let the balls be labeled i = 1, . . . , N. We can now define the exhaustive and
exclusive set of hypotheses
$H_i$ = "this ball has label i", i = 1, . . . , N.

Since we have no information on which ball we will draw we use the principle of insufficient reason to assign the probability to get ball i at the first draw:

$$ P(H_i|N, I) = \frac{1}{N}. \tag{4.1} $$

Next, we consider the case that R balls are colored red and W = N − R are colored
white. We define the exhaustive and exclusive set of hypotheses
$H_R$ = "this ball is red"
$H_W$ = "this ball is white".

We now want to assign the probability that the first ball we draw will be red. To solve
this problem we decompose this probability into the hypothesis space {Hi } which gives
$$ P(H_R|I) = \sum_{i=1}^{N} P(H_R, H_i|I) = \sum_{i=1}^{N} P(H_R|H_i, I)\, P(H_i|I) = \frac{1}{N} \sum_{i=1}^{N} P(H_R|H_i, I) = \frac{R}{N} \tag{4.2} $$

where, in the last step, we have made the trivial probability assignment

$$ P(H_R|H_i, I) = \begin{cases} 1 & \text{if ball } i \text{ is red} \\ 0 & \text{otherwise.} \end{cases} $$

Next, we assign the probability that the second ball will be red. This probability depends
on how we draw the balls:
1. We draw the first ball, put it back in the urn and shake the urn. The latter action
may be called randomization but from a Bayesian point of view the purpose
of shaking the urn is, in fact, to destroy all information we might have on the
whereabouts of this ball after it was put back in the urn (it would most likely
end-up in the top layer of balls). Since this drawing with replacement does
not change the contents of the urn and since the shaking destroys all previously
accumulated information, the probability of drawing a red ball a second time is
equal to that of drawing a red ball the first time:
$$ P(R_2|I) = \frac{R}{N}, $$

where $R_2$ stands for the hypothesis "the second ball is red".
2. We record the color of the first ball, lay it aside, and then draw the second ball.
Obviously the content of the urn has changed after the first draw. Depending on
the color of the first ball we assign:

$$ P(R_2|R_1, I) = \frac{R - 1}{N - 1}, \qquad P(R_2|W_1, I) = \frac{R}{N - 1}. $$

Exercise 4.1: Draw a first ball and put it aside without recording its color. Show that
the probability for the second draw to be red is now P (R2 |I) = R/N .

In the above we have shown that the probability of the second draw may depend on the
outcome of the first draw. We will now show that the probability of the first draw may
depend on the outcome of the second draw! Consider the following situation: A first
ball is drawn and put aside without recording its color. A second ball is drawn and it
turns out to be red. What is the probability that the first ball was red? Bayes' theorem
immediately shows that it is not R/N:

$$ P(R_1|R_2, I) = \frac{P(R_2|R_1, I)\, P(R_1|I)}{P(R_2|R_1, I)\, P(R_1|I) + P(R_2|W_1, I)\, P(W_1|I)} = \frac{R - 1}{N - 1}. $$

If this argument fails to convince you, take the extreme case of an urn containing one
red and one white ball. The probability of a red ball at the first draw is 1/2. Lay the
ball aside and take the second ball. If it is red, then the probability that the first ball
was red is zero and not 1/2. The fact that the second draw influences the probability of
the first draw has of course nothing to do with a causal relationship but, instead, with
a logical relationship.
Exercise 4.2: Draw a first ball and put it back in the urn without recording its color.
The color of a second draw is red. What is the probability that the first draw was red?
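
The urn probabilities for drawing without replacement can be verified by brute-force enumeration. The sketch below assumes a small hypothetical urn (3 red and 7 white balls) and treats every ordered pair of distinct balls as equally probable, which is the principle of insufficient reason applied to pairs; the helper `prob` is introduced here for illustration.

```python
from fractions import Fraction
from itertools import permutations

R, W = 3, 7   # hypothetical urn contents
balls = ["red"] * R + ["white"] * W

# All ordered (first, second) draws without replacement, equally probable.
pairs = list(permutations(range(len(balls)), 2))

def prob(event, given=lambda first, second: True):
    """Exact conditional probability by counting equally probable pairs."""
    selected = [(i, j) for (i, j) in pairs if given(balls[i], balls[j])]
    hits = [(i, j) for (i, j) in selected if event(balls[i], balls[j])]
    return Fraction(len(hits), len(selected))

p_r2 = prob(lambda first, second: second == "red")
p_r2_given_r1 = prob(lambda first, second: second == "red",
                     given=lambda first, second: first == "red")
p_r1_given_r2 = prob(lambda first, second: first == "red",
                     given=lambda first, second: second == "red")

print(p_r2, p_r2_given_r1, p_r1_given_r2)
```

With N = 10 this gives P(R2|I) = 3/10 = R/N, while both P(R2|R1, I) and P(R1|R2, I) equal 2/9 = (R − 1)/(N − 1), illustrating the logical (not causal) symmetry between the two draws.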

4.2

Binomial distribution

We now draw n balls from the urn, putting the ball back after each draw and shaking the
urn. In this way the probability that a draw is red is the same for all draws: p = R/N.
What is the probability that we find r red balls in our sample of n draws? Again, we
seek to decompose this probability into a combination of elementary ones which are easy
to assign. Let us start with the hypothesis
$S_j$ = "the n balls are drawn in the sequence labeled j"
where $j = 1, \ldots, 2^n$ is the index in a list of all possible sequences (of length n) of
white and red draws. The set of hypotheses $\{S_j\}$ is obviously exclusive and exhaustive.
The draws are independent, that is, the probability of the k-th draw does not depend
on the outcome of the other draws (remember that this is only true for drawing with
replacement). Thus we find from the product rule

$$ P(S_j|I) = P(C_1, \ldots, C_n|I) = \prod_{k=1}^{n} P(C_k|I) = p^{r_j} (1 - p)^{n - r_j}, \tag{4.3} $$

where $C_k$ stands for red or white at the k-th draw and where $r_j$ is the number of red
draws in the sequence j. Having assigned the probability of each element in the set
$\{S_j\}$, we now decompose our probability of r red balls into this set:
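$$ P(r|I) = \sum_{j=1}^{2^n} P(r, S_j|I) = \sum_{j=1}^{2^n} P(r|S_j, I)\, P(S_j|I) = \left[ \sum_{j=1}^{2^n} \delta(r - r_j) \right] p^r (1 - p)^{n - r} \tag{4.4} $$

where we have assigned the trivial probability

$$ P(r|S_j, I) = \delta(r - r_j) = \begin{cases} 1 & \text{when the sequence } S_j \text{ contains } r \text{ red draws} \\ 0 & \text{otherwise.} \end{cases} $$
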

The sum inside the square brackets in (4.4) counts the number of sequences in the set
{Sj } which have just r red draws. It is an exercise in combinatorics to show that this
number is given by the binomial coefficient. Thus we obtain
 
$$ P(r|p, n) = \frac{n!}{r!\,(n - r)!}\, p^r (1 - p)^{n - r} = \binom{n}{r}\, p^r (1 - p)^{n - r}. \tag{4.5} $$

This is called the binomial distribution which applies to all processes where the
outcome is binary (red or white, head or tail, yes or no, absent or present etc.), provided
that the probability p of the outcome of a single draw is the same for all draws. In
Fig. 2 we show the distribution of red draws for n = (10, 20, 40) trials for an urn with
p = 0.25.

Figure 2: The binomial distribution to observe r red balls in n = (10, 20, 40) draws from an urn
containing a fraction p = 0.25 of red balls. The three panels show P(r|10, 0.25), P(r|20, 0.25) and P(r|40, 0.25) as a function of r.

The binomial probabilities are just the terms of the binomial expansion
$$ (a + b)^n = \sum_{r=0}^{n} \binom{n}{r}\, a^r b^{n - r} \tag{4.6} $$

with a = p and b = 1 − p. From this it follows immediately that

$$ \sum_{r=0}^{n} P(r|p, n) = 1. $$

The condition of independence of the trials is important and may not be fulfilled: for
instance, suppose we scoop a handful of balls out of the urn and count the number r
of red balls in this sample. Does r follow the binomial distribution? The answer is no
since we did not perform draws with replacement, as required. This can also be seen
from the extreme situation where we take all balls out of the urn. Then r would not be
distributed at all: it would just be R.
The first and second moments and the variance of the binomial distribution are
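$$ \begin{aligned} \langle r \rangle &= \sum_{r=0}^{n} r\, P(r|p, n) = np \\ \langle r^2 \rangle &= \sum_{r=0}^{n} r^2 P(r|p, n) = np(1 - p) + n^2 p^2 \\ \langle \Delta r^2 \rangle &= \langle r^2 \rangle - \langle r \rangle^2 = np(1 - p) \end{aligned} \tag{4.7} $$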

If we now define the ratio $\varepsilon = r/n$ then it follows immediately from (4.7) that

$$ \langle \varepsilon \rangle = p, \qquad \langle \Delta \varepsilon^2 \rangle = \frac{p(1 - p)}{n}. \tag{4.8} $$

The square root of this variance is called the binomial error which we have already
encountered in Exercise 3.8. It is seen that the variance vanishes in the limit of large
n and thus that $\varepsilon$ converges to p in that limit. This fundamental relation between a
probability and a limiting relative frequency was first discovered by Bernoulli and is
called the law of large numbers. This law is, of course, the basis for the Frequentist
definition of probability.
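
The binomial distribution (4.5) and its moments (4.7) are straightforward to tabulate. The following sketch uses n = 20 and p = 0.25 (one of the cases of Fig. 2); the function name `binomial_pmf` is an assumption of this illustration.

```python
import math

def binomial_pmf(r, p, n):
    """P(r|p, n) of (4.5): probability of r red draws in n draws."""
    return math.comb(n, r) * p ** r * (1.0 - p) ** (n - r)

n, p = 20, 0.25
probs = [binomial_pmf(r, p, n) for r in range(n + 1)]

norm = sum(probs)                                      # should be 1
mean = sum(r * P for r, P in enumerate(probs))         # np
second = sum(r * r * P for r, P in enumerate(probs))   # np(1-p) + n^2 p^2
variance = second - mean ** 2                          # np(1-p)

# Variance of the ratio r/n, cf. (4.8); it shrinks like 1/n.
ratio_variance = variance / n ** 2

print(norm, mean, variance, ratio_variance)
```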

4.3

Multinomial distribution

A generalization of the binomial distribution is the multinomial distribution which


applies to N independent trials where the outcome of each trial is among a set of k
alternatives with probability pi . Examples are drawing from an urn containing balls
with k different colors, the throwing of a die (k = 6) or distributing N independent
events over the bins of a histogram.
The multinomial distribution can be written as
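$$ P(n|p, N) = \frac{N!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k} \tag{4.9} $$

where $n = (n_1, \ldots, n_k)$ and $p = (p_1, \ldots, p_k)$ are vectors subject to the constraints

$$ \sum_{i=1}^{k} n_i = N \quad \text{and} \quad \sum_{i=1}^{k} p_i = 1. \tag{4.10} $$

The multinomial probabilities are just the terms of the expansion

$$ (p_1 + \cdots + p_k)^N $$

from which the normalization of P(n|p, N) immediately follows. The average, variance
and covariance are given by

$$ \begin{aligned} \langle n_i \rangle &= N p_i \\ \langle \Delta n_i^2 \rangle &= N p_i (1 - p_i) \\ \langle \Delta n_i \Delta n_j \rangle &= -N p_i p_j \quad \text{for } i \ne j. \end{aligned} \tag{4.11} $$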

Marginalization is achieved by adding in (4.9) two or more variables ni and their corresponding probabilities pi .
Exercise 4.3: Use the addition rule above to show that the marginal distribution of each
ni in (4.9) is given by the binomial distribution P (ni |pi , N ) as defined in (4.5).
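
A small exhaustive check of (4.9) to (4.11), including the negative covariance between different counts, can be done by summing over all outcomes. The values N = 6 and p = (0.5, 0.3, 0.2) and the function name are hypothetical choices for this sketch.

```python
import math

def multinomial_pmf(counts, probs):
    """P(n|p, N) of (4.9) for a tuple of counts and matching probabilities."""
    coeff = math.factorial(sum(counts))
    for c in counts:
        coeff //= math.factorial(c)   # exact integer division at every step
    value = float(coeff)
    for c, q in zip(counts, probs):
        value *= q ** c
    return value

N = 6
p = (0.5, 0.3, 0.2)

# All (n1, n2, n3) with n1 + n2 + n3 = N, cf. the constraint (4.10).
outcomes = [(n1, n2, N - n1 - n2)
            for n1 in range(N + 1) for n2 in range(N + 1 - n1)]
weights = [multinomial_pmf(c, p) for c in outcomes]

norm = sum(weights)
mean1 = sum(c[0] * w for c, w in zip(outcomes, weights))
var1 = sum((c[0] - N * p[0]) ** 2 * w for c, w in zip(outcomes, weights))
cov12 = sum((c[0] - N * p[0]) * (c[1] - N * p[1]) * w
            for c, w in zip(outcomes, weights))

print(norm, mean1, var1, cov12)
```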

The conditional distribution on, say, the count nk is given by


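$$ P(m|n_k, q, M) = \frac{M!}{n_1! \cdots n_{k-1}!}\, q_1^{n_1} \cdots q_{k-1}^{n_{k-1}} $$

where

$$ m = (n_1, \ldots, n_{k-1}), \quad q = \frac{1}{s}\, (p_1, \ldots, p_{k-1}), \quad s = \sum_{i=1}^{k-1} p_i \quad \text{and} \quad M = N - n_k. $$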
Exercise 4.4: Derive the expression for the conditional probability by dividing the joint
probability (4.9) by the marginal (binomial) probability P (nk |pk , N ).

4.4

Poisson distribution

Here we consider events or counts which occur randomly in time (or space). The
counting rate R is supposed to be given, that is, we know the average number of counts
$\mu = R\, \Delta t$ in a given time interval $\Delta t$. There are several ways to derive an expression for
the probability $P(n|\mu)$ to observe n events in a time interval which contains, on average,
$\mu$ events. Our derivation is based on the fact that this probability distribution is a
limiting case of the binomial distribution.
Assume we divide the interval $\Delta t$ in N sub-intervals $\delta t$. The probability to observe an
event in such a sub-interval is then $p = \mu/N$, see (4.2). Now we can always make N
so large and $\delta t$ so small that the number of events in each sub-interval is either one or
zero. The probability to find n events in N sub-intervals is then equal to the (binomial)
probability to find n successes in N trials:

$$ P(n|N) = \frac{N!}{n!\,(N - n)!} \left( \frac{\mu}{N} \right)^n \left( 1 - \frac{\mu}{N} \right)^{N - n}. $$

Taking the limit $N \to \infty$ then yields the desired result

$$ P(n|\mu) = \lim_{N \to \infty} \frac{N!}{(N - n)!\,(N - \mu)^n}\, \frac{\mu^n}{n!} \left( 1 - \frac{\mu}{N} \right)^N = \frac{\mu^n}{n!}\, e^{-\mu}. \tag{4.12} $$

This distribution is known as the Poisson distribution. The normalization, average
and variance are given by, respectively,

$$ \sum_{n=0}^{\infty} P(n|\mu) = 1, \quad \langle n \rangle = \mu \quad \text{and} \quad \langle \Delta n^2 \rangle = \mu. \tag{4.13} $$
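
The limiting procedure that leads to (4.12) can be watched numerically: for a fixed mean, the binomial probabilities with p = mean/N approach the Poisson probabilities as N grows. In this sketch the mean of 3, the values of N and the function names are illustrative assumptions.

```python
import math

def binomial_pmf(n, p, N):
    """Probability of n successes in N trials."""
    return math.comb(N, n) * p ** n * (1.0 - p) ** (N - n)

def poisson_pmf(n, mu):
    """P(n|mu) of (4.12)."""
    return mu ** n * math.exp(-mu) / math.factorial(n)

mu = 3.0

# Largest deviation between binomial and Poisson probabilities for n = 0..10.
max_diff = {}
for N in (10, 100, 10_000):
    max_diff[N] = max(abs(binomial_pmf(n, mu / N, N) - poisson_pmf(n, mu))
                      for n in range(11))
print(max_diff)

# Normalization, mean and variance of the Poisson distribution, cf. (4.13);
# the sum is truncated at n = 99 where the terms are negligible.
terms = [poisson_pmf(n, mu) for n in range(100)]
norm = sum(terms)
mean = sum(n * P for n, P in enumerate(terms))
variance = sum((n - mean) ** 2 * P for n, P in enumerate(terms))
print(norm, mean, variance)
```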

Exercise 4.5: A counter is traversed by beam particles at an average rate of R particles
per second. (i) If we observe n counts in a time interval $\Delta t$, derive an expression for
the posterior distribution of $\mu = R\, \Delta t$, given that the prior for $\mu$ is uniform. Calculate
mean, variance, mode and width (inverse of the Hessian) of this posterior. (ii) Give an
expression for the probability $p(\tau|R, I)\, d\tau$ that the time interval between the passage of
two particles is between $\tau$ and $\tau + d\tau$.

4.5

Gauss distribution

The sum of many small random fluctuations leads to a Gauss distribution, irrespective
of the distribution of each of the terms contributing to the sum. This fact, known as
the central limit theorem, is responsible for the dominant presence of the Gauss
distribution in statistical data analysis. To prove the central limit theorem we first
have to introduce the characteristic function which is nothing else than the Fourier
transform of a probability density.
The Fourier transform, and its inverse, of a distribution p(x) is defined by

$$ \phi(k) = \int_{-\infty}^{\infty} e^{ikx}\, p(x)\, dx \qquad \Longleftrightarrow \qquad p(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-ikx}\, \phi(k)\, dk \tag{4.14} $$

This transformation plays an important role in proving many theorems related to sums
of random variables and moments of probability distributions. This is because a Fourier
transform turns a Fourier convolution in x-space, see (3.14), into a product in k-space.^22
To see this, consider a joint distribution of n independent variables

$$ p(x|I) = f_1(x_1) \cdots f_n(x_n). $$

Using (3.13) we write for the transform of the distribution of the sum $z = \sum x_i$

$$ \begin{aligned} \phi(k) &= \int dz \int \cdots \int dx_1 \cdots dx_n\, \exp(ikz)\, f_1(x_1) \cdots f_n(x_n)\, \delta\Big( z - \sum_{i=1}^{n} x_i \Big) \\ &= \int \cdots \int dx_1 \cdots dx_n\, \exp\Big( ik \sum_{i=1}^{n} x_i \Big)\, f_1(x_1) \cdots f_n(x_n) \\ &= \int \cdots \int dx_1 \cdots dx_n\, \exp(ikx_1) f_1(x_1) \cdots \exp(ikx_n) f_n(x_n) \\ &= \phi_1(k) \cdots \phi_n(k). \end{aligned} \tag{4.15} $$

The transform of a sum of independent random variables is thus the product of the
transforms of each variable.
The moments of a distribution are related to the derivatives of the transform at k = 0:
$$ \frac{d^n \phi(k)}{dk^n} = \int (ix)^n e^{ikx}\, p(x)\, dx \quad \Rightarrow \quad \frac{d^n \phi(0)}{dk^n} = i^n \langle x^n \rangle. \tag{4.16} $$

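The characteristic functions (Fourier transforms) of many distributions can be found in,
for instance, the particle data book [11]. Of importance to us is the Gauss distribution
and its transform

$$ p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right] \qquad \Longleftrightarrow \qquad \phi(k) = \exp\left( i \mu k - \tfrac{1}{2} \sigma^2 k^2 \right) \tag{4.17} $$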
To prove the central limit theorem we consider the sum of a large set of n random
variables $s = \sum x_j$. Each $x_j$ is distributed independently according to $f_j(x_j)$ with mean
$\mu_j$ and a standard deviation $\sigma$ which we take, for the moment, to be the same for all $f_j$.
To simplify the algebra, we do not consider the sum itself but rather

$$ z = \sum_{j=1}^{n} y_j = \sum_{j=1}^{n} \frac{x_j - \mu_j}{\sqrt{n}} = \frac{s - \mu}{\sqrt{n}} \tag{4.18} $$

where we have set $\mu = \sum \mu_j$. Now take the Fourier transform $\phi_j(k)$ of the distribution
of $y_j$ and make a Taylor expansion around k = 0. Using (4.16) we find

$$ \begin{aligned} \phi_j(k) &= \sum_{m=0}^{\infty} \frac{k^m}{m!} \frac{d^m \phi_j(0)}{dk^m} = \sum_{m=0}^{\infty} \frac{(ik)^m \langle y_j^m \rangle}{m!} \\ &= 1 + \sum_{m=2}^{\infty} \frac{(ik)^m \langle (x_j - \mu_j)^m \rangle}{m!\; n^{m/2}} = 1 - \frac{k^2 \sigma^2}{2n} + O(n^{-3/2}) \end{aligned} \tag{4.19} $$

Taking only the first two terms of this expansion we find from (4.15) for the characteristic
function of z

$$ \phi(k) = \left( 1 - \frac{k^2 \sigma^2}{2n} \right)^n \to \exp\left( -\tfrac{1}{2} \sigma^2 k^2 \right) \quad \text{for } n \to \infty. \tag{4.20} $$

But this is just the characteristic function of a Gaussian with mean zero and width $\sigma$.
Transforming back to the sum s we find

$$ p(s) = \frac{1}{\sigma \sqrt{2\pi n}} \exp\left[ -\frac{(s - \mu)^2}{2 n \sigma^2} \right] \tag{4.21} $$

It can be shown that the central limit theorem also applies when the widths of the
individual distributions are different in which case the variance of the Gauss is $\sigma^2 = \sum \sigma_i^2$
instead of $n \sigma^2$ as in (4.21). However, the theorem breaks down when one or more
individual widths are much larger than the others, allowing for one or more variables $x_i$
to occasionally dominate the sum. It is also required that all $\mu_i$ and $\sigma_i$ exist so that the
theorem does not apply to, for instance, a sum of Cauchy distributed variables.

^22 A Mellin transform turns a Mellin convolution (3.15) in x-space into a product in k-space. We will
not discuss Mellin transforms in these notes.
Exercise 4.6: Apply (4.16) to the characteristic function (4.17) to show that the mean and
variance of a Gauss distribution are $\mu$ and $\sigma^2$, respectively. Show also that all cumulants
(the coefficients of $(ik)^m/m!$ in the expansion of $\ln \phi(k)$) beyond the second vanish (the Gauss distribution is the only one which has this property).
Exercise 4.7: In Exercise 3.6 we have derived the distribution of the sum of two Gaussian
distributed variables by explicitly calculating the convolution integral (3.14). Derive the
same result by using the characteristic function (4.17). Convince yourself that the central
limit theorem always applies to sums of Gaussian distributed variables even for a finite
number of terms or large differences in width.
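
The convergence in (4.20) can be illustrated without any simulation by evaluating characteristic functions directly. The sketch below assumes that each term of the sum is uniformly distributed on [0, 1], a choice made here purely for illustration, for which the variance is 1/12 and the characteristic function about the mean is sin(k/2)/(k/2).

```python
import math

def phi_y(k, n):
    """Characteristic function of y = (x - 1/2)/sqrt(n) for x uniform on [0, 1]."""
    u = k / (2.0 * math.sqrt(n))
    return 1.0 if u == 0.0 else math.sin(u) / u

def phi_z(k, n):
    """Product rule (4.15) for the sum z of n identical independent terms y."""
    return phi_y(k, n) ** n

def phi_gauss(k, sigma2):
    """Characteristic function of a zero-mean Gaussian, cf. (4.17)."""
    return math.exp(-0.5 * sigma2 * k * k)

sigma2 = 1.0 / 12.0   # variance of the uniform distribution on [0, 1]
k = 2.0               # arbitrary test point in k-space

diffs = {n: abs(phi_z(k, n) - phi_gauss(k, sigma2)) for n in (1, 10, 100, 1000)}
print(diffs)
```

The difference between the exact characteristic function of z and the Gaussian limit shrinks roughly like 1/n for this symmetric example.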

5 Least Informative Probabilities

5.1

Impact of prior knowledge

The impact of the prior on the outcome of plausible inference is nicely illustrated by
a very instructive example, taken from Sivia [5], where the bias of a coin is determined
from the observation of the number of heads in N throws.

Let us first recapitulate what constitutes a well posed problem so that we can apply
Bayesian inference.
• First, we need to define a complete set of hypotheses. For our coin flipping experiment this will be the value of the probability h to obtain a head in a single
throw. The definition range is $0 \le h \le 1$.
• Second, we need a model which relates the set of hypotheses to the data. In other
words, we need to construct the likelihood P(D|H, I) for all the hypotheses in the
set. In our case this is the binomial probability to observe n heads in N throws
of a coin with bias h

$$ P(n|h, N, I) = \frac{N!}{n!\,(N - n)!}\, h^n (1 - h)^{N - n}. \tag{5.1} $$

• Finally we need to specify the prior probability p(h|I) dh.


If we observe n heads in N throws, the posterior distribution of h is given by Bayes'
theorem

$$ p(h|n, N, I)\, dh = C\, \frac{N!}{n!\,(N - n)!}\, h^n (1 - h)^{N - n}\, p(h|I)\, dh, \tag{5.2} $$
where C is a normalization constant which presently is of no interest to us. In Fig. 3
these posteriors are shown for 10, 100 and 1000 throws of a coin with bias h = 0.25 for a
flat prior (top row of plots), a strong prior preference for h = 0.5 (middle row) and a prior
which excludes the possibility that h < 0.5 (bottom row). It is seen that the flat prior
converges nicely to the correct answer h = 0.25 when the number of throws increases.

Figure 3: The posterior density p(h|n, N) for n heads in N flips of a coin with bias h = 0.25. In
the top row of plots the prior is uniform, in the middle row it is strongly peaked around h = 0.5 while
in the bottom row the region h < 0.5 has been excluded. The posterior densities are scaled to unit
maximum for ease of comparison. Each panel shows p(h) versus h.


The second prior does this too, but more slowly. This is not surprising because we have
encoded, in this case, a quite strong prior belief that the coin is unbiased and it takes
a lot of evidence from the data to change that belief. In the last choice of prior we see
that the posterior cannot go below h = 0.5 since we have excluded this region by setting
the prior to zero. This is an illustration of the fact that no amount of data can change
certainties encoded by the prior as we have already remarked in Exercise 2.6. This can
of course be turned into an advantage since it allows to exclude physically forbidden
regions from the posterior, like a negative mass for instance.
From this example it is clear that unsupported information should not enter into the
prior because it may need a lot of data to converge to the correct result in case this
information turns out to be wrong. The maximum entropy principle provides a means to
construct priors which have the property that they are consistent with given boundary
conditions but are otherwise maximally un-informative. Before we proceed with the
maximum entropy principle let us first, in the next section, make a few remarks on
symmetry considerations in the assignment of probabilities.
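
The coin example of this section can be explored numerically. The following is a minimal
sketch, not the code used to produce Fig. 3: it evaluates the posterior (5.2) on a grid in h
and scales it to unit maximum. The function names, the grid size and in particular the width
0.05 of the peaked prior are assumptions made for illustration; the counts (n, N) are taken
to be the panel labels of Fig. 3.

```python
import math

def posterior_on_grid(n, N, prior, points=1001):
    """Posterior of h for n heads in N throws, scaled to unit maximum."""
    hs = [i / (points - 1) for i in range(points)]
    logp = []
    for h in hs:
        pr = prior(h)
        # Points with zero prior or zero likelihood get log-probability -infinity.
        if pr <= 0.0 or (h == 0.0 and n > 0) or (h == 1.0 and N - n > 0):
            logp.append(-math.inf)
            continue
        # Logarithms avoid underflow of h^n (1-h)^(N-n) for large N.
        ll = 0.0
        if n > 0:
            ll += n * math.log(h)
        if N - n > 0:
            ll += (N - n) * math.log(1.0 - h)
        logp.append(ll + math.log(pr))
    top = max(logp)
    return hs, [math.exp(v - top) for v in logp]

def flat(h):
    return 1.0

def peaked(h):
    # Assumed shape: Gaussian around h = 0.5 with (hypothetical) width 0.05.
    return math.exp(-0.5 * ((h - 0.5) / 0.05) ** 2)

def excluded(h):
    # Prior that forbids h < 0.5.
    return 1.0 if h >= 0.5 else 0.0

def mode(n, N, prior):
    hs, post = posterior_on_grid(n, N, prior)
    return hs[post.index(max(post))]

for n, N in [(3, 10), (21, 100), (255, 1000)]:
    print(N, mode(n, N, flat), mode(n, N, peaked), mode(n, N, excluded))
```

With the flat prior the mode sits at n/N, with the peaked prior it lies between n/N and 0.5,
and with the third prior it stays at the boundary h = 0.5 however many throws are added.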

5.2

Symmetry considerations

In Section 2.5 we have introduced Bernoulli's principle of insufficient reason which states
that, in absence of relevant information, equal probability should be assigned to members
of an enumerable set of hypotheses. While this assignment strikes us as being
very reasonable it is worthwhile to investigate if we can find some deeper, or at least
alternative, reason behind this principle.
Suppose we would plot the probabilities assigned to each hypothesis in a bar chart.
If we are in a state of complete ignorance about these hypotheses then it obviously
should not matter how they would be ordered in such a chart. Stated differently, in
absence of additional information the set of hypotheses is invariant under permutations.
But our bar chart of probabilities can only be invariant under permutations if all the
probabilities are the same. Hence the statement that Bernoulli's principle is due to
our complete ignorance is equivalent to the statement that it is due to permutation
invariance.
Similarly, translation invariance implies that the least informative probability distribution of a so-called location parameter is uniform. If, for instance, we are completely
ignorant of the actual whereabouts of the train from Utrecht to Amsterdam then any
location on the track should be, for us, equally probable. In other words, this probability
should obey the relation
\[
p(x|I)\,dx = p(x + a|I)\,d(x + a) = p(x + a|I)\,dx, \tag{5.3}
\]
which can be satisfied only when p(x|I) is a constant.


A somewhat less intuitive assignment is related to scale parameters. Suppose we
would be completely ignorant of distance scales in the universe so that a star in the
night sky could, for us, very well be light years, light decades, light centuries or further
away. Then each scale would seem equally probable, that is, for r > 0 and α > 0
\[
p(r|I)\,dr = p(\alpha r|I)\,d(\alpha r) = \alpha\, p(\alpha r|I)\,dr. \tag{5.4}
\]
But this is only possible when p(r|I) ∝ 1/r. This probability assignment, which applies
to positive definite scale parameters, is called a Jeffreys prior. Note that a Jeffreys
prior is uniform in ln(r), which means that it assigns equal probability per decade
instead of per unit interval as does a uniform prior.
Both the uniform and the Jeffreys prior cannot be normalized when the variable ranges
are x ∈ [−∞, ∞] or r ∈ [0, ∞]. Such un-normalizable distributions are called improper.
The way to deal with improper distributions is to normalize them on a finite interval
and take the limits to infinity (or zero) at the end of the calculation. This is, by the way,
good practice in any mathematical calculation involving limits. The posterior should,
of course, always remain finite. If not, it is very likely that your problem is either ill
posed or that your data do not carry enough information.
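
The statement that a Jeffreys prior assigns equal probability per decade can be checked
with a few lines of code. This is a small sketch under the assumption that the prior is
normalized on a finite interval [a, b], as recommended above; the function names and the
interval [1, 1000] are chosen for illustration only.

```python
import math

def jeffreys_probability(r1, r2, a, b):
    """P(r1 < r < r2) for p(r) proportional to 1/r, normalized on [a, b]."""
    return math.log(r2 / r1) / math.log(b / a)

def uniform_probability(r1, r2, a, b):
    """P(r1 < r < r2) for a uniform prior normalized on [a, b]."""
    return (r2 - r1) / (b - a)

a, b = 1.0, 1000.0
for lo, hi in [(1.0, 10.0), (10.0, 100.0), (100.0, 1000.0)]:
    # Jeffreys: the same probability for every decade.
    # Uniform: almost all probability sits in the last decade.
    print(lo, hi, jeffreys_probability(lo, hi, a, b), uniform_probability(lo, hi, a, b))
```

Letting a → 0 or b → ∞ makes the normalization ln(b/a) diverge, which is the improper
behaviour mentioned in the text.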
Exercise 5.1: We make an inference on a counting rate R by observing the number of
counts n in a time interval Δt. Assume that the likelihood P(n|RΔt) is Poisson distributed
as defined by (4.12). Assume further a Jeffreys prior for R, defined on the positive interval
R ∈ [a, b]. Show that (i) for Δt = 0 the posterior is equal to the prior and that we cannot
take the limits a → 0 or b → ∞; (ii) when Δt > 0 but still n = 0 we can take the limit
b → ∞ but not a → 0; (iii) that the latter limit can be taken once n > 0. (From Gull [12].)

Finally, let us remark that we have only touched here upon very simple cases where 'no
information', 'equal probability' and 'invariance' are more or less synonymous so that
the above may seem quite trivial. However, in Jaynes [7] you can find several examples
which are far from trivial.

5.3

Maximum entropy principle

In the assignments we have made up to now (mainly in Section 4), there was always
enough information to unambiguously determine the probability distribution. For instance
if we apply the principle of insufficient reason to a fair die then this leads to a
probability assignment of 1/6 for each face i = 1, . . . , 6. Note that this corresponds to
an expectation value of ⟨i⟩ = 3.5. But what probability should we assign to each of
the six faces when no information is given about the die except that, say, ⟨i⟩ = 4.5?
There are obviously an infinite number of probability distributions which satisfy this
constraint so that we have to look elsewhere for a criterion to select one of these.
Jaynes (1957) has proposed to take the distribution which is the least informative by
maximizing the entropy, a concept first introduced by Shannon (1948) in his pioneering
paper on information theory [13]. The entropy carried by a probability distribution is
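defined by
\[
S(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{m_i}\right) \qquad \text{(discrete case)}
\]
\[
S(p) = -\int p(x) \ln\left[\frac{p(x)}{m(x)}\right] dx \qquad \text{(continuous case)} \tag{5.5}
\]
where m_i or m(x) is the so-called Lebesgue measure (see below) which satisfies
\[
\sum_{i=1}^{n} m_i = 1 \qquad \text{or} \qquad \int m(x)\,dx = 1. \tag{5.6}
\]
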

The definition (5.5) is such that larger entropies correspond to smaller information
content of the probability distribution. Note that the Lebesgue measure makes the
entropy invariant under coordinate transformations since both p and m transform in
the same way.
To get some feeling for the meaning of this Lebesgue measure imagine a set of swimming
pools on a rainy day. The amount of water collected by each pool will depend on
the distribution of the falling raindrops and on the surface size of each pool. These
surface sizes then play the role of the Lebesgue measure. Some formal insight can be
gained by maximizing (5.5) imposing only the normalization constraint and nothing
else. Restricting ourselves to the discrete case we find, using the method of Lagrange
multipliers, that the following equation has to be satisfied
\[
\delta\left[\sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{m_i}\right)
  + \lambda\left(\sum_{i=1}^{n} p_i - 1\right)\right] = 0. \tag{5.7}
\]
Differentiation of (5.7) with respect to p_i leads to the equation
\[
\ln\left(\frac{p_i}{m_i}\right) + 1 + \lambda = 0
\quad\Rightarrow\quad p_i = m_i\, e^{-(\lambda+1)}.
\]
Imposing the normalization constraint Σ p_i = 1 we find, using (5.6)
\[
\sum_{i=1}^{n} p_i = \left(\sum_{i=1}^{n} m_i\right) e^{-(\lambda+1)} = e^{-(\lambda+1)} = 1
\]
from which it immediately follows that
\[
p_i = m_i \quad \text{(discrete case)}, \qquad
p(x)\,dx = m(x)\,dx \quad \text{(continuous case)}. \tag{5.8}
\]
The Lebesgue measure m_i or m(x) is thus the least informative probability distribution
in complete absence of information. To specify this Ur-prior we have, again, to look
elsewhere and use symmetry arguments (see Section 5.2) or just common sense. In
practice a uniform or very simple Lebesgue measure is often adequate to describe the
structure of the sampling space.
Let us now suppose that the Lebesgue measure is known (we will simply take it to be
uniform in the following) and proceed by imposing further constraints in the form of
so-called testable information. Such testable information is nothing else than constraints
on the probability distributions themselves like specifying moments, expectation values,
etc. Generically, we can write a set of m such constraints as
\[
\sum_{i=1}^{n} f_{ki}\, p_i = \beta_k \qquad \text{or} \qquad
\int f_k(x)\, p(x)\,dx = \beta_k, \qquad k = 1, \ldots, m. \tag{5.9}
\]
Using Lagrange multipliers we maximize the entropy by solving the equation, in the
discrete case,
\[
\delta\left[\sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{m_i}\right)
  + \lambda_0\left(\sum_{i=1}^{n} p_i - 1\right)
  + \sum_{k=1}^{m} \lambda_k\left(\sum_{i=1}^{n} f_{ki}\, p_i - \beta_k\right)\right] = 0. \tag{5.10}
\]


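Differentiating with respect to p_i gives the equation
\[
\ln\left(\frac{p_i}{m_i}\right) + 1 + \lambda_0 + \sum_{k=1}^{m} \lambda_k f_{ki} = 0
\quad\Rightarrow\quad
p_i = m_i \exp(-1 - \lambda_0)\, \exp\left(-\sum_{k=1}^{m} \lambda_k f_{ki}\right).
\]
Imposing the normalization condition Σ p_i = 1 to solve for λ_0, we find
\[
p_i = \frac{1}{Z}\, m_i \exp\left(-\sum_{k=1}^{m} \lambda_k f_{ki}\right) \tag{5.11}
\]
where we have introduced the partition function
\[
Z(\lambda_1, \ldots, \lambda_m) = \sum_{i=1}^{n} m_i \exp\left(-\sum_{k=1}^{m} \lambda_k f_{ki}\right). \tag{5.12}
\]
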

Such partition functions play a very important role in statistical mechanics and
thermodynamics since they contain all the information on the system under consideration,
somewhat like the Lagrangian in dynamics. Indeed, our constraints (5.9) are encoded
in Z through
\[
-\frac{\partial \ln Z}{\partial \lambda_k} = \beta_k. \tag{5.13}
\]
Exercise 5.2: Prove (5.13) by differentiating the logarithm of (5.12).

The formal solution (5.11) guarantees that the normalization condition is obeyed but is
still void of content since we have not yet determined the unknown Lagrange multipliers
λ_1, . . . , λ_m from the equations (5.9) or, equivalently, from (5.13). These equations often
have to be solved numerically; a few simple cases are presented below.
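
As an illustration of such a numerical solution, the sketch below treats the die of the
introduction to this section: a uniform Lebesgue measure and a single constraint ⟨i⟩ = 4.5,
so that (5.11) reduces to p_i ∝ exp(−λ i). The multiplier λ is found by bisection. The
function name, the bracketing interval and the number of iterations are assumptions made
for this example; any one-dimensional root finder would do.

```python
import math

def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6)):
    """Maximum entropy probabilities p_i ~ exp(-lam * i) with a given mean."""
    def mean_for(lam):
        w = [math.exp(-lam * i) for i in faces]
        return sum(i * wi for i, wi in zip(faces, w)) / sum(w)

    # The mean decreases monotonically with lam, so bisection is sufficient.
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) > target_mean:
            lo = mid      # mean still too large: increase lam
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = [math.exp(-lam * i) for i in faces]
    z = sum(w)            # partition function Z of (5.12) for uniform m_i
    return lam, [wi / z for wi in w]

lam, p = maxent_die(4.5)
print(lam)
print([round(pi, 4) for pi in p])
print(sum(i * pi for i, pi in zip(range(1, 7), p)))
```

For a target mean of 3.5 the same routine returns λ = 0 within rounding, that is, the uniform
assignment of 1/6 per face.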
Assuming a uniform Lebesgue measure, it immediately follows from (5.8) that in absence
of any information we have
pi = p(xi |I) = constant,
in accordance with Bernoulli's principle of insufficient reason. In the continuum limit
this goes over into p(x|I) = constant.
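Let us now consider a continuous distribution defined on [0, ∞] and impose a constraint
on the mean
\[
\langle x \rangle = \int_0^\infty x\, p(x|I)\,dx = \mu.
\]
Taking the continuum limit of (5.11) we have, assuming a uniform Lebesgue measure,
\[
p(x|I) = e^{-\lambda x} \left[\int_0^\infty e^{-\lambda x}\,dx\right]^{-1} = \lambda\, e^{-\lambda x}.
\]
Imposing the constraint on ⟨x⟩ leads to
\[
\int_0^\infty x\, \lambda\, e^{-\lambda x}\,dx = \frac{1}{\lambda} = \mu
\]
from which we find that x follows an exponential distribution
\[
p(x|\mu, I) = \frac{1}{\mu} \exp\left(-\frac{x}{\mu}\right). \tag{5.14}
\]
The moments of this distribution are given by
\[
\langle x^n \rangle = n!\,\mu^n \quad \text{so that} \quad
\langle x \rangle = \mu, \quad \langle x^2 \rangle = 2\mu^2 \quad \text{and} \quad
\langle \Delta x^2 \rangle = \mu^2.
\]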
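Another interesting case is a continuous distribution defined on [−∞, ∞] with a
constraint on the variance
\[
\langle \Delta x^2 \rangle = \int_{-\infty}^{\infty} (x - \mu)^2\, p(x|I)\,dx = \sigma^2.
\]
We find from the continuum limit of (5.11) after normalization
\[
p(x|I) = \sqrt{\frac{\lambda}{\pi}}\, \exp\left[-\lambda (x - \mu)^2\right].
\]
The constraint on the variance gives
\[
\langle \Delta x^2 \rangle = \sqrt{\frac{\lambda}{\pi}} \int_{-\infty}^{\infty}
(x - \mu)^2 \exp\left[-\lambda (x - \mu)^2\right] dx = \frac{1}{2\lambda} = \sigma^2
\]
so that x turns out to be Gaussian distributed
\[
p(x|\mu, \sigma, I) = \frac{1}{\sigma\sqrt{2\pi}}
\exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]. \tag{5.15}
\]
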

This is then the third time we encounter the Gaussian: in Section 3.1 as a convenient
approximation of the posterior in the neighborhood of the mode, in Section 4.5 as
the limiting distribution of a sum of random variables and in this section as the least
informative distribution consistent with a constraint on the variance.
As an example of a non-uniform Lebesgue measure we will now derive the Poisson
distribution from the maximum entropy principle. We want to know the distribution
P(n|I) of n counts in a time interval Δt when there is nothing given but the average
\[
\sum_{n=0}^{\infty} n\, P(n|I) = \mu. \tag{5.16}
\]
To find the Lebesgue measure of the time interval Δt we divide it into a very large
number (M) of intervals δt. A particular distribution of counts over these infinitesimal
boxes is called a micro-state. If the micro-states are independent and equally probable
then it follows from the sum rule that the probability of observing n counts is
proportional to the number of micro-states which have n boxes occupied. It is easy to see that
for M ≫ n this number is given by M^n/n!. Upon normalization we then have for the
Lebesgue measure
\[
m(n) = \frac{M^n}{n!}\, e^{-M}. \tag{5.17}
\]


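Inserting this result in (5.11) we get
\[
P(n|I) = C\, e^{-M}\, \frac{\left(M e^{-\lambda}\right)^n}{n!}
\quad \text{with} \quad
\frac{1}{C} = e^{-M} \sum_{n=0}^{\infty} \frac{\left(M e^{-\lambda}\right)^n}{n!}
= e^{-M} \exp\left(M e^{-\lambda}\right),
\]
so that
\[
P(n|I) = \frac{\left(M e^{-\lambda}\right)^n}{\exp\left(M e^{-\lambda}\right)\, n!}. \tag{5.18}
\]
To calculate the average we observe that
\[
\sum_{n=0}^{\infty} n\, \frac{\left(M e^{-\lambda}\right)^n}{n!}
= -\frac{\partial}{\partial\lambda} \sum_{n=0}^{\infty} \frac{\left(M e^{-\lambda}\right)^n}{n!}
= -\frac{\partial}{\partial\lambda} \exp\left(M e^{-\lambda}\right)
= M e^{-\lambda} \exp\left(M e^{-\lambda}\right).
\]
Combining this with (5.18) we find from the constraint (5.16)
\[
M e^{-\lambda} = \mu \quad\Rightarrow\quad P(n|\mu) = \frac{\mu^n}{n!}\, e^{-\mu} \tag{5.19}
\]
which is the same result as derived in Section 4.4.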

6 Parameter Estimation

In data analysis the measurements are often described by a parameterized model. In


hypothesis testing such a model is called a composite hypothesis (i.e. one with parameters) in contrast to a simple hypothesis (without parameters). Given a composite
hypothesis, the problem is how to extract information on the parameters from the data.
This is called parameter estimation. It is important to realize that the composite
hypothesis is assumed here to be true; investigating the plausibility of the hypothesis
itself, by comparing it to a set of alternatives, is called model selection. This will be
the subject of Section 6.5.
The relation between the model and the data is encoded in the likelihood function
p(d|a, s, I)
where d denotes a vector of data points and a and s are the model parameters which
we have sub-divided in two classes:
1. The class a of parameters of interest;
2. The class s of so-called nuisance parameters which are necessary to model
the data but are otherwise of no interest. These parameters usually describe the
systematic uncertainties due to detector calibration, acceptance corrections and
so on. Input parameters also belong to this class like, for instance, an input value
of the strong coupling constant α_s taken from the literature.

There may also be parameters in the model which have known values. These are, if not
explicitly listed, included in the background information I.
Given a prior distribution for the parameters a and s, Bayes' theorem gives for the joint
posterior distribution
\[
p(a, s|d, I)\,da\,ds = \frac{p(d|a, s, I)\, p(a, s|I)\,da\,ds}
{\int p(d|a, s, I)\, p(a, s|I)\,da\,ds}. \tag{6.1}
\]
The posterior of the parameters a is then obtained by marginalization of the nuisance
parameters s:
\[
p(a|d, I) = \int p(a, s|d, I)\,ds. \tag{6.2}
\]

As we have discussed in Section 5, some care has to be taken in choosing appropriate
priors for the parameters a. However, a very nice feature is that unphysical regions can
be excluded from the posterior by simply setting the prior to zero. In this way it is, to
give an example, impossible to obtain a negative value for the neutrino mass even if that
would be preferred by the likelihood. The priors for s are assumed to be known from
detector studies (Monte Carlo simulations) or, in case of external parameters, from the
literature. Note that the marginalization (6.2) provides a very elegant way to propagate
the uncertainties in the parameters s to the posterior distribution of a (systematic error
propagation, see Section 6.4).
Bayesian parameter estimation is thus fully described by the equations (6.1) and (6.2).
However, these two innocent looking formulas often represent a large amount of
numerical computation to evaluate the integrals. This may be a far from trivial task, in
particular when our estimation problem is multi-dimensional. Considerable simplifications
occur when two or more variables are independent (the probability distributions
then factorize), when the distributions are Gaussian or when the model is linear in the
parameters.
In the following subsections we will discuss a few simple cases which are frequently
encountered in data analysis.

6.1

Gaussian sampling

Suppose we make a series of measurements of the temperature in a room. Let these
measurements be randomly distributed according to a Gaussian with a known width σ
(resolution of the thermometer). We now ask the question what is the best estimate μ̂
of the temperature?
Since the measurements are independent, we can use the product rule (2.7) and write
for the likelihood
\[
p(d|\mu, \sigma) = \prod_{i=1}^{n} p(d_i|\mu, \sigma)
= \frac{1}{(2\pi\sigma^2)^{n/2}}
\exp\left[-\frac{1}{2} \sum_{i=1}^{n} \left(\frac{d_i - \mu}{\sigma}\right)^2\right].
\]
Assuming a uniform prior for μ the posterior becomes
\[
p(\mu|d, \sigma) = C \exp\left[-\frac{1}{2} \sum_{i=1}^{n}
\left(\frac{d_i - \mu}{\sigma}\right)^2\right]. \tag{6.3}
\]


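To calculate the normalization constant it is convenient to write
\[
\sum_{i=1}^{n} (d_i - \mu)^2 = V + n(\bar{d} - \mu)^2
\]
where
\[
\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i \qquad \text{and} \qquad
V = \sum_{i=1}^{n} (d_i - \bar{d})^2.
\]
The constant C is now obtained from
\[
\frac{1}{C} = \exp\left(-\frac{V}{2\sigma^2}\right) \int_{-\infty}^{\infty}
\exp\left[-\frac{n(\bar{d} - \mu)^2}{2\sigma^2}\right] d\mu
= \sqrt{\frac{2\pi\sigma^2}{n}}\, \exp\left(-\frac{V}{2\sigma^2}\right).
\]
Inserting this result in (6.3) we find
\[
p(\mu|\bar{d}, \sigma, n) = \sqrt{\frac{n}{2\pi\sigma^2}}\,
\exp\left[-\frac{n}{2} \left(\frac{\bar{d} - \mu}{\sigma}\right)^2\right]. \tag{6.4}
\]
But this is just a Gaussian with mean d̄ and width σ/√n. Thus we have the well known
result
\[
\hat{\mu} = \bar{d} \pm \frac{\sigma}{\sqrt{n}}. \tag{6.5}
\]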
Exercise 6.1: Derive (6.5) directly from (6.3) by expanding L = −ln p using equations
(3.6), (3.7) and (3.9) in Section 3.1. Calculate the width as the inverse of the Hessian.
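
The result (6.5) and the decomposition of the sum of squares used to obtain it are easy to
check numerically. The sketch below does so for a handful of temperature readings; the
data values, the resolution σ = 0.2 and the function names are hypothetical and serve only
as an illustration.

```python
import math

def estimate_mean(data, sigma):
    """Posterior mean and width of mu for known sigma and a uniform prior, eq. (6.5)."""
    n = len(data)
    dbar = sum(data) / n
    return dbar, sigma / math.sqrt(n)

def sum_of_squares(data, mu):
    return sum((d - mu) ** 2 for d in data)

data = [20.1, 19.8, 20.4, 19.9]   # hypothetical thermometer readings
sigma = 0.2                       # assumed known resolution

dbar, width = estimate_mean(data, sigma)
V = sum_of_squares(data, dbar)

# Check the identity sum (d_i - mu)^2 = V + n (dbar - mu)^2 at an arbitrary mu.
mu = 19.5
lhs = sum_of_squares(data, mu)
rhs = V + len(data) * (dbar - mu) ** 2
print(dbar, width, lhs, rhs)
```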

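Next, suppose that σ is unknown. Assuming a uniform prior
\[
p(\sigma|I) = \begin{cases} 0 & \text{for } \sigma \le 0 \\
\text{constant} & \text{for } \sigma > 0 \end{cases}
\]
we find for the posterior
\[
p(\mu, \sigma|\bar{d}, V, n) = \frac{C}{\sigma^n}
\exp\left[-\frac{V + n(\bar{d} - \mu)^2}{2\sigma^2}\right] \tag{6.6}
\]
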

where, again, C is a normalization constant. In Fig. 4a we show this joint distribution
for four samples (n = 4) drawn from a Gauss distribution of zero mean and unit width.
Here we have set the random variables in (6.6) to d̄ = 0 and V = 4.
The posterior for μ is found by integrating over σ
\[
p(\mu|\bar{d}, V, n) = \int_0^\infty p(\mu, \sigma|\bar{d}, V, n)\,d\sigma
= C \left[\frac{1}{V + n(\bar{d} - \mu)^2}\right]^{(n-1)/2}. \tag{6.7}
\]
This integral converges only for n ≥ 2 which is not surprising since one measurement
cannot carry information on σ. When n = 2, the distribution (6.7) is improper (cannot
be normalized on [−∞, ∞]). Calculating the normalization constant C for n ≥ 3 we
find the Student-t distribution for ν = n − 2 degrees of freedom:
\[
p(\mu|\bar{d}, V, n) = \frac{\Gamma[(n-1)/2]}{\Gamma[(n-2)/2]}\,
\sqrt{\frac{n}{\pi V}}\,
\left[\frac{V}{V + n(\bar{d} - \mu)^2}\right]^{(n-1)/2}. \tag{6.8}
\]
This distribution can be written in a form which has the number of degrees of freedom ν
as the only parameter.


Figure 4: (a) The joint distribution p(μ, σ|d̄, V, n) for d̄ = 0, V = 4 and n = 4; (b) The marginal
distribution p(μ|d̄, V, n) (full curve) compared to a Gaussian with σ = 1/√3 (dotted curve); (c) The
marginal distribution p(σ|V, n).
Exercise 6.2: Transform t² = ν(ν + 2)(d̄ − μ)²/V with ν = n − 2 > 0 and show that
(6.8) can be written as
\[
p(\mu|\bar{d}, V, n)\,d\mu \rightarrow p(t|\nu)\,dt =
\frac{\Gamma[(\nu + 1)/2]}{\sqrt{\nu\pi}\;\Gamma(\nu/2)}
\left(\frac{1}{1 + t^2/\nu}\right)^{(\nu+1)/2} dt.
\]
This is the expression for the Student-t distribution usually found in the literature.

Exercise 6.3: Show by expanding the negative logarithm of (6.7) that the maximum of
the posterior is located at μ̂ = d̄ and that the Hessian is given by H = n(n − 1)/V. Note
that we do not need to know the normalization constant C for this calculation.

Characterizing the posterior by mode and width we find a result similar to (6.5)
\[
\hat{\mu} = \bar{d} \pm \frac{S}{\sqrt{n}} \qquad \text{with} \qquad S^2 = \frac{V}{n-1}. \tag{6.9}
\]
In Fig. 4b we show the marginal distribution (6.8) for d̄ = 0, V = 4 and n = 4 (full
curve) compared to a Gaussian with zero mean and width √(V/[n(n − 1)]) = 1/√3. It is
seen that the Student-t distribution is similar to a Gaussian but has much longer tails.
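
The difference in the tails can be made quantitative with a short calculation. The sketch
below evaluates the marginal posterior (6.8), as read from the text, next to the Gaussian
approximation of (6.9) for d̄ = 0, V = 4 and n = 4; the function names and the chosen
values of μ are illustrative assumptions.

```python
import math

def student_posterior(mu, dbar, V, n):
    """Marginal posterior of mu for unknown sigma, equation (6.8); requires n >= 3."""
    norm = (math.gamma((n - 1) / 2) / math.gamma((n - 2) / 2)
            * math.sqrt(n / (math.pi * V)))
    return norm * (V / (V + n * (dbar - mu) ** 2)) ** ((n - 1) / 2)

def gaussian(mu, mean, width):
    z = (mu - mean) / width
    return math.exp(-0.5 * z * z) / (width * math.sqrt(2.0 * math.pi))

dbar, V, n = 0.0, 4.0, 4
width = math.sqrt(V / (n * (n - 1)))   # Gaussian width of (6.9), here 1/sqrt(3)

for mu in (0.0, 1.0, 2.0, 3.0):
    # Far from the mode the Student-t density exceeds the Gaussian by orders of magnitude.
    print(mu, student_posterior(mu, dbar, V, n), gaussian(mu, dbar, width))
```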
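Likewise we can integrate (6.6) over μ to obtain the posterior for σ:
\[
p(\sigma|\bar{d}, V, n) = \int_{-\infty}^{\infty} p(\mu, \sigma|\bar{d}, V, n)\,d\mu
= \frac{C}{\sigma^{n-1}} \exp\left(-\frac{V}{2\sigma^2}\right). \tag{6.10}
\]
Integrating this equation over σ to calculate the normalization constant C we find the
χ² distribution for ν = n − 2 degrees of freedom
\[
p(\sigma|V, n) = 2 \left(\frac{V}{2}\right)^{\nu/2}
\frac{1}{\Gamma(\nu/2)\,\sigma^{\nu+1}}
\exp\left(-\frac{V}{2\sigma^2}\right). \tag{6.11}
\]
This distribution is shown for V = 4 and n = 4 in Fig. 4c.
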

Exercise 6.4: Transform χ² = V/σ² and show that (6.11) can be written in the more
familiar form with ν as the only parameter
\[
p(\sigma|V, n)\,d\sigma \rightarrow p(\chi^2|\nu)\,d\chi^2 =
\frac{\left(\tfrac{1}{2}\chi^2\right)^{\alpha-1} \exp\left(-\tfrac{1}{2}\chi^2\right)}
{2\,\Gamma(\alpha)}\,d\chi^2
\]
with α = ν/2. Use the definition of the Gamma function
\[
\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt
\]
and the property Γ(z + 1) = zΓ(z) to show that the mean and variance of the χ²
distribution are given by ν and 2ν, respectively.

Exercise 6.5: Show by expanding the negative logarithm of (6.10) that the maximum
of the posterior is located at σ̂ = √(V/(n − 1)) and that the Hessian is given by H =
2(n − 1)²/V.

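The mode and width of p(σ|V, n) are given by
\[
\hat{\sigma} = \sqrt{\frac{V}{n-1}} \pm \frac{1}{n-1} \sqrt{\frac{V}{2}}. \tag{6.12}
\]
More familiar measures of the χ² distribution for ν degrees of freedom are the average
and the variance (see Exercise 6.4)
\[
\langle \chi^2 \rangle = \nu \qquad \text{and} \qquad
\langle (\Delta\chi^2)^2 \rangle = 2\nu.
\]

6.2

Least squares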

In this section we consider the case that the data can be described by a function f(x; a)
of a variable x depending on a set of parameters a. For simplicity we will consider only
functions of one variable; the extension to more dimensions is trivial. Suppose that
we have made a series of measurements {d_i} at the sample points {x_i} and that each
measurement is distributed according to some sampling distribution p_i(d_i|μ_i, σ_i). Here
μ_i and σ_i characterize the position and the width of the sampling distribution p_i of data
point d_i. We parameterize the positions μ_i by the function f:
\[
\mu_i(a) = f(x_i; a).
\]
If the measurements are independent we can write for the likelihood
\[
p(d|a, I) = \prod_{i=1}^{n} p_i[d_i|\mu_i(a), \sigma_i].
\]
Introducing the somewhat more compact notation p_i(d_i|a, I) for the sampling
distributions, we write for the posterior distribution
\[
p(a|d, I) = C \left(\prod_{i=1}^{n} p_i(d_i|a, I)\right) p(a|I) \tag{6.13}
\]


where C is a normalization constant and p(a|I) is the joint prior distribution of the
parameters a. The position and width of the posterior can be found by minimizing
\[
L(a) = -\ln[p(a|d, I)] = -\ln(C) - \ln[p(a|I)] - \sum_{i=1}^{n} \ln[p_i(d_i|a, I)] \tag{6.14}
\]
as described in Section 3.1, equations (3.6)–(3.9). In practice this is often done
numerically by presenting L(a) to a minimization program like minuit. Note that this
procedure can be carried out for any sampling distribution p_i be it Binomial, Poisson,
Cauchy, Gauss or whatever. In case the prior p(a|I) is chosen to be uniform, the second
term in (6.14) is constant and the procedure is called a maximum likelihood fit.
The most common case encountered in data analysis is when the sampling distributions
are Gaussian. For a uniform prior, (6.14) reduces to
\[
L(a) = \text{constant} + \tfrac{1}{2}\chi^2 = \text{constant} + \frac{1}{2} \sum_{i=1}^{n}
\left(\frac{d_i - \mu_i(a)}{\sigma_i}\right)^2. \tag{6.15}
\]
We then speak of χ² minimization or least squares minimization. When the function
f(x; a) is linear in the parameters, the minimization can be reduced to a single
matrix inversion, as we will now show.
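A function which is linear in the parameters can generically be expressed by
\[
f(x; a) = \sum_{\lambda=1}^{m} a_\lambda f_\lambda(x) \tag{6.16}
\]
where the f_λ are a set of functions of x and the a_λ are the coefficients to be determined
from the data. We denote by w_i ≡ 1/σ_i² the weight of each data point and write for the
log posterior
\[
L(a) = \frac{1}{2} \sum_{i=1}^{n} w_i \left[d_i - f(x_i; a)\right]^2
= \frac{1}{2} \sum_{i=1}^{n} w_i \left[d_i - \sum_{\lambda=1}^{m} a_\lambda f_\lambda(x_i)\right]^2. \tag{6.17}
\]
The mode â is found by setting the derivative of L to all parameters to zero
\[
\frac{\partial L(\hat{a})}{\partial a_\lambda} =
-\sum_{i=1}^{n} w_i \left[d_i - \sum_{\mu=1}^{m} \hat{a}_\mu f_\mu(x_i)\right] f_\lambda(x_i) = 0. \tag{6.18}
\]
We can write this equation in vector notation as
\[
b = W \hat{a} \qquad \text{so that} \qquad \hat{a} = W^{-1} b \tag{6.19}
\]
where the (symmetric) matrix W and the vector b are given by
\[
W_{\lambda\mu} = \sum_{i=1}^{n} w_i f_\lambda(x_i) f_\mu(x_i) \qquad \text{and} \qquad
b_\lambda = \sum_{i=1}^{n} w_i d_i f_\lambda(x_i). \tag{6.20}
\]
Differentiating (6.17) twice with respect to a yields an expression for the Hessian
\[
H_{\lambda\mu} = \frac{\partial^2 L(\hat{a})}{\partial a_\lambda\, \partial a_\mu} = W_{\lambda\mu}. \tag{6.21}
\]
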

Higher derivatives vanish so that the quadratic expansion (3.8) in Section 3.1 is exact.
To summarize, when the function to be fitted is linear in the parameters we can build
a vector b and a matrix W as defined by (6.20). The posterior (assuming uniform
priors) is then a multivariate Gaussian with mean â = W⁻¹b and covariance matrix
V ≡ H⁻¹ = W⁻¹. In this way, a fit to the data is reduced to one matrix inversion
and does not need starting values for the parameters, nor iterations, nor convergence
criteria.
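
The procedure summarized above can be sketched in a few lines. The example below builds
W and b according to (6.20) for an arbitrary list of basis functions, inverts W with a small
Gauss-Jordan routine and returns the mode and covariance matrix. The function names, the
straight-line basis and the data points are assumptions made for illustration; in practice
one would use a linear algebra library for the inversion.

```python
def build_normal_equations(xs, ds, sigmas, basis):
    """Matrix W and vector b of equation (6.20)."""
    m = len(basis)
    W = [[0.0] * m for _ in range(m)]
    b = [0.0] * m
    for x, d, s in zip(xs, ds, sigmas):
        w = 1.0 / s ** 2                       # weight w_i = 1 / sigma_i^2
        f = [fk(x) for fk in basis]
        for l in range(m):
            b[l] += w * d * f[l]
            for k in range(m):
                W[l][k] += w * f[l] * f[k]
    return W, b

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (adequate for small matrices)."""
    m = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(m)]
           for i, row in enumerate(matrix)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(m):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[m:] for row in aug]

def linear_fit(xs, ds, sigmas, basis):
    """Mode a_hat = W^-1 b and covariance matrix W^-1 of the posterior."""
    W, b = build_normal_equations(xs, ds, sigmas, basis)
    cov = invert(W)
    a_hat = [sum(cov[l][k] * b[k] for k in range(len(b))) for l in range(len(b))]
    return a_hat, cov

# Hypothetical data lying exactly on d = 1 + 2x, fitted with f(x; a) = a1 + a2 x.
basis = [lambda x: 1.0, lambda x: x]
xs = [0.0, 1.0, 2.0, 3.0]
ds = [1.0, 3.0, 5.0, 7.0]
sigmas = [0.5, 0.5, 0.5, 0.5]
a_hat, cov = linear_fit(xs, ds, sigmas, basis)
print(a_hat, cov)
```

No starting values or iterations are involved: the errors on the parameters follow from the
diagonal elements of the returned covariance matrix.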
Exercise 6.6: Calculate the matrix W and the vector b of (6.20) for a polynomial
parameterization of the data
\[
f(x; a) = a_1 + a_2 x + a_3 x^2 + a_4 x^3 + \cdots.
\]
Exercise 6.7: Show that a fit to a constant results in the weighted average of the data
\[
\hat{a}_1 = \frac{\sum_i w_i d_i}{\sum_i w_i} \pm \frac{1}{\sqrt{\sum_i w_i}}.
\]

6.3

Example: no signal

In this section we describe a typical case where the likelihood prefers a negative value
for a positive definite quantity. Defining confidence limits in such a case is a notoriously
difficult problem in Frequentist statistics. But in our Bayesian approach the solution is
trivial as is illustrated by the following example where a negative counting rate is found
after background subtraction.
A search was made by the NA49 experiment at the CERN SPS for D⁰ production in a
large sample of 4 × 10⁶ Pb-Pb collisions at a beam energy of 158 GeV per nucleon [18].
Since NA49 does not have secondary vertex reconstruction capabilities, all pairs of
positive and negative tracks in the event were accumulated in invariant mass spectra
assuming that the tracks originate from the decays D⁰ → K⁻π⁺ or D̄⁰ → K⁺π⁻. In the
left-hand side plot of Fig. 5 we show the invariant mass spectrum of the D⁰ candidates.
The vertical lines indicate a 90 MeV window around the nominal D⁰ mass. The large
combinatorial background is due to a multiplicity of approximately 1400 charged tracks
per event giving, for each event, about 5 × 10⁵ entries in the histogram. In the right-hand
side plot we show the invariant mass spectrum after background subtraction. Clearly
no signal is observed.
A least squares fit to the data of a Cauchy line shape on top of a polynomial background
yielded a negative value N(D⁰ + D̄⁰) = −0.36 ± 0.74 per event as shown by the full
curve in the right-hand side plot of Fig. 5. As already mentioned above, this is a typical
example of a case where the likelihood favors an outcome which is unphysical.
To calculate an upper limit on the D⁰ yield, Bayesian inference is used as follows. First,
the likelihood of the data d is written as a multivariate Gaussian in the parameters a
which describe the D⁰ yield and the background shape
\[
p(d|a, I) = \frac{1}{\sqrt{(2\pi)^n |V|}}
\exp\left[-\tfrac{1}{2}(a - \hat{a})\, V^{-1} (a - \hat{a})\right] \tag{6.22}
\]

(6.22)

[Figure 5: two panels showing dN/dm (GeV⁻¹) versus the invariant mass m(π, K) (GeV).]

Figure 5: Left: The invariant mass distribution of D⁰ candidates in 158A GeV Pb-Pb collisions at the
CERN SPS. The open (shaded) histograms are before (after) applying cuts to improve the significance.
The vertical lines indicate a 90 MeV window around the nominal D⁰ mass. Right: The D⁰ + D̄⁰
invariant mass spectrum after background subtraction. The full curve is a fit to the data assuming a
fixed signal shape. The other curves correspond to model predictions of the D⁰ yield.

where n is the number of parameters and â and V are the best values and covariance
matrix as obtained from minuit, see also Section 3.1. Taking flat priors for the background
parameters leads to the posterior distribution
\[
p(a|d, I) = \frac{C}{\sqrt{(2\pi)^n |V|}}
\exp\left[-\tfrac{1}{2}(a - \hat{a})\, V^{-1} (a - \hat{a})\right] p(N|I) \tag{6.23}
\]
where C is a normalization constant and p(N|I) is the prior for the D⁰ yield N. The
posterior for N is now obtained by integrating (6.23) over the background parameters.
As explained in Section 3.1 this yields a one-dimensional Gauss with a variance given
by the diagonal element σ² = V_NN of the covariance matrix. Thus we have
\[
p(N|d, I) = \frac{C}{\sigma\sqrt{2\pi}}
\exp\left[-\frac{1}{2}\left(\frac{N - \hat{N}}{\sigma}\right)^2\right] p(N|I)
= C\, g(N; \hat{N}, \sigma)\, p(N|I) \tag{6.24}
\]
where we have introduced the short-hand notation g() for a one-dimensional Gaussian
density. In (6.24), N̂ = −0.36 and σ = 0.74 as obtained from the fit to the data.

As a last step we encode in the prior p(N|I) our knowledge that N is positive definite
\[
P(N|I) \propto \theta(N) \qquad \text{with} \qquad
\theta(N) = \begin{cases} 0 & \text{for } N < 0 \\ 1 & \text{for } N \ge 0. \end{cases} \tag{6.25}
\]
Inserting (6.25) in (6.24) and integrating over N to calculate the constant C we find
\[
p(N|d, I) = g(N; \hat{N}, \sigma)\, \theta(N)
\left[\int_0^\infty g(N; \hat{N}, \sigma)\,dN\right]^{-1}. \tag{6.26}
\]
The posterior distribution is thus a Gaussian with mean and variance as obtained from
the fit. This Gaussian is set to zero for N < 0 and re-normalized to unity for N ≥ 0.


The upper limit (N_max) corresponding to a given confidence level (CL) is then calculated
from
\[
\text{CL} = \int_0^{N_{\max}} p(N|d, I)\,dN
= \int_0^{N_{\max}} g(N; \hat{N}, \sigma)\,dN
\left[\int_0^\infty g(N; \hat{N}, \sigma)\,dN\right]^{-1}. \tag{6.27}
\]
Using the numbers quoted above we find N(D⁰ + D̄⁰) < 1.5 per event at 98% CL.
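
The upper limit can be reproduced from (6.27) with the error function of the standard
library. The sketch below is an illustration only and not the code used in the analysis: the
function names and the bisection bracket are assumptions, while N̂ = −0.36, σ = 0.74 and
the 98% level are the numbers quoted in the text.

```python
import math

def norm_cdf(z):
    """Cumulative distribution of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def credibility(n_max, n_hat, sigma):
    """Posterior probability that 0 <= N <= n_max, equation (6.27)."""
    below_zero = norm_cdf((0.0 - n_hat) / sigma)
    return (norm_cdf((n_max - n_hat) / sigma) - below_zero) / (1.0 - below_zero)

def upper_limit(cl, n_hat, sigma):
    """Solve credibility(n_max) = cl for n_max by bisection."""
    lo, hi = 0.0, max(n_hat, 0.0) + 20.0 * sigma
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if credibility(mid, n_hat, sigma) < cl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

n_max = upper_limit(0.98, -0.36, 0.74)
print(round(n_max, 2))
```

The result lies just below 1.5, in agreement with the limit quoted above.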

6.4

Systematic error propagation

Data are often subject to sources of uncertainty which cause a simultaneous fluctuation
of more than one data point. We will call these correlated uncertainties systematic
errors in contrast to statistical errors which are, by definition, point to point uncorrelated.
To propagate the systematic uncertainties to the parameters a of interest one often
offsets the data by each systematic error in turn, redoes the analysis, and then adds the
deviations from the optimal values â in quadrature. Such an intuitive ad hoc procedure
(offset method) has no sound theoretical foundation and may even spoil your result
by assigning errors which are far too large, see [14] for an illustrative example and also
Exercise 6.10 below.
To take systematic errors into account we will include them in the data model. This can
of course be done in many ways, depending on the experiment being analyzed. Here we
restrict ourselves to a linear parameterization which has the advantage that it is easily
incorporated in any least squares minimization procedure. This model, as it stands,
does not handle asymmetric errors. However, in case we deal with several systematic
sources these asymmetries tend to vanish by virtue of the central limit theorem.
In Fig. 6 we show a systematic distortion of a set of data points
\[
d_i \rightarrow d_i + s\,\Delta_i.
\]
Here Δ_i is a list of systematic deviations and s is an interpolation parameter which
controls the amount of systematic shift applied to the data.

Figure 6: Systematic distortion (black symbols) of a set of data points (gray symbols) for two values
of the interpolation parameter s = +1 (left) and s = −1 (right).

Usually there are several sources of systematic error stemming from uncertainties in the
detector calibration, acceptance, efficiency and so on. For m such sources, the data model
can be written as
\[
d_i = t_i(a) + r_i + \sum_{\lambda=1}^{m} s_\lambda \Delta_{i\lambda}, \tag{6.28}
\]


where t_i(a) = f(x_i; a) is the theory prediction containing the parameters a of interest
and Δ_iλ is the correlated error on point i stemming from source λ. In (6.28), the
uncorrelated statistical fluctuations of the data are described by the independent Gaussian
random variables r_i of zero mean and variance σ_i². The s_λ are independent Gaussian
random variables of zero mean and unit variance which account for the systematic
fluctuations. The joint distribution of r and s is thus given by
\[
p(r, s|I) = \prod_{i=1}^{n} \frac{1}{\sigma_i\sqrt{2\pi}}
\exp\left(-\frac{r_i^2}{2\sigma_i^2}\right) \times
\prod_{\lambda=1}^{m} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{s_\lambda^2}{2}\right).
\]
The covariance matrix of this joint distribution is trivial:
\[
\langle r_i r_j \rangle = \sigma_i^2\, \delta_{ij} \qquad
\langle s_\lambda s_\mu \rangle = \delta_{\lambda\mu} \qquad
\langle r_i s_\lambda \rangle = 0. \tag{6.29}
\]


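Because the data are a linear combination of the Gaussian random variables r and s it
follows that d is also Gaussian distributed
\[
p(d|I) = \frac{1}{\sqrt{(2\pi)^n |V|}}
\exp\left[-\tfrac{1}{2}(d - \bar{d})\, V^{-1} (d - \bar{d})\right]. \tag{6.30}
\]
The mean d̄ is found by taking the average of (6.28)
\[
\bar{d}_i = \langle d_i \rangle = t_i(a) + \langle r_i \rangle +
\sum_{\lambda=1}^{m} \langle s_\lambda \rangle \Delta_{i\lambda} = t_i(a). \tag{6.31}
\]
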

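A transformation of (6.29) by linear error propagation (see Section 3.2) gives for the
covariance matrix
\[
V = \begin{pmatrix}
\sigma_1^2 + S_{11} & S_{12} & \cdots & S_{1n} \\
S_{21} & \sigma_2^2 + S_{22} & \cdots & S_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
S_{n1} & S_{n2} & \cdots & \sigma_n^2 + S_{nn}
\end{pmatrix}
\qquad \text{with} \qquad
S_{ij} = \sum_{\lambda=1}^{m} \Delta_{i\lambda} \Delta_{j\lambda}. \tag{6.32}
\]
Exercise 6.8: Use the propagation formula (3.19) in Section 3.2 to derive (6.32) from
(6.28) and (6.29).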

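It is also easy to calculate this covariance matrix by directly averaging the product
Δd_i Δd_j. Because all the cross terms vanish by virtue of (6.29) we immediately obtain
\[
V_{ij} = \langle \Delta d_i\, \Delta d_j \rangle
= \Big\langle \Big(r_i + \sum_\lambda s_\lambda \Delta_{i\lambda}\Big)
\Big(r_j + \sum_\mu s_\mu \Delta_{j\mu}\Big) \Big\rangle
\]
\[
= \langle r_i r_j \rangle + \sum_\lambda \sum_\mu \Delta_{i\lambda} \Delta_{j\mu}
\langle s_\lambda s_\mu \rangle
= \sigma_i^2\, \delta_{ij} + \sum_\lambda \Delta_{i\lambda} \Delta_{j\lambda}.
\]
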

Inserting (6.31) in (6.30) and assuming a uniform prior, the log posterior of the
parameters a can be written as
\[
L(a) = -\ln[p(a|d)] = \text{Constant} + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
[d_i - t_i(a)]\, V^{-1}_{ij}\, [d_j - t_j(a)]. \tag{6.33}
\]
Minimizing L defined by (6.33) is impractical because we need the inverse of the
covariance matrix (6.32) which can become very large. Furthermore, when the systematic
errors dominate, the matrix might, numerically, be uncomfortably close to a matrix
with the simple structure V_ij = Δ_i Δ_j, which is singular.
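
Both points can be illustrated with a small sketch that builds the covariance matrix (6.32)
from the statistical errors and the systematic shifts. The function names and the numbers
(two data points, one systematic source) are hypothetical and chosen only to keep the
example readable.

```python
def covariance_matrix(sigmas, deltas):
    """V_ij = sigma_i^2 delta_ij + sum over sources of Delta_i Delta_j, eq. (6.32).

    deltas[i][lam] is the shift of data point i caused by systematic source lam.
    """
    n = len(sigmas)
    V = [[sum(di * dj for di, dj in zip(deltas[i], deltas[j])) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        V[i][i] += sigmas[i] ** 2
    return V

def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

deltas = [[0.3], [0.6]]                 # one systematic source, two data points

V_full = covariance_matrix([0.1, 0.1], deltas)
V_syst_only = covariance_matrix([0.0, 0.0], deltas)

# Without statistical errors V has the structure Delta_i Delta_j and is singular.
print(V_full, det2(V_full))
print(V_syst_only, det2(V_syst_only))
```

When the statistical errors are small compared to the systematic shifts the determinant is
close to zero, which is the numerical problem referred to above.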
Fortunately (6.33) can be cast into an alternative form which avoids the inversion of large matrices [15]. Our derivation of this result is based on the standard steps taken in a Bayesian inference: (i) use the data model to write an expression for the likelihood; (ii) define prior probabilities; (iii) calculate posterior probabilities with Bayes' theorem and (iv) integrate over the nuisance parameters.
The likelihood $p(d|a, s)$ is calculated from the decomposition in the variables $r$
\[
p(d|a, s) = \int dr\, p(d, r|a, s) = \int dr\, p(d|r, a, s)\, p(r|a, s). \tag{6.34}
\]
The data model (6.28) is incorporated through the trivial assignment
\[
p(d|r, a, s) = \prod_{i=1}^{n} \delta\Big[ r_i + t_i(a) + \sum_{\lambda=1}^{m} s_\lambda \Delta_{i\lambda} - d_i \Big]. \tag{6.35}
\]

As already discussed above, the distribution of $r$ is taken to be
\[
p(r|a, s) = \prod_{i=1}^{n} \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left(-\frac{r_i^2}{2\sigma_i^2}\right). \tag{6.36}
\]

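Inserting (6.35) and (6.36) in (6.34) yields, after integration over $r$,
\[
p(d|a, s) = \prod_{i=1}^{n} \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left[-\frac{1}{2} \left(\frac{d_i - t_i(a) - \sum_{\lambda} s_\lambda \Delta_{i\lambda}}{\sigma_i}\right)^2\right]. \tag{6.37}
\]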

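Assuming a Gaussian prior for $s$
\[
p(s|I) = (2\pi)^{-m/2} \prod_{\lambda=1}^{m} \exp\left(-\tfrac{1}{2} s_\lambda^2\right)
\]
and a uniform prior for $a$, the joint posterior distribution can be written as
\[
p(a, s|d) = C \exp\left[-\frac{1}{2} \sum_{i=1}^{n} w_i \Big(d_i - t_i(a) - \sum_{\lambda=1}^{m} s_\lambda \Delta_{i\lambda}\Big)^2 - \frac{1}{2} \sum_{\lambda=1}^{m} s_\lambda^2\right] \tag{6.38}
\]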

where $w_i = 1/\sigma_i^2$. The log posterior $L = -\ln p$ can now numerically be minimized (for instance by minuit) with respect to the parameters $a$ and $s$. Marginalization of the nuisance parameters $s$, as described in Section 3.1, then yields the desired result. Clearly we have now got rid of our large covariance matrix (6.32) at the expense of extending the parameter space from $a$ to $a \cup s$. In global data analysis where many experiments are combined the number of systematic sources $s$ can become quite large so that minimizing $L$ of (6.38) may still not be very attractive.
However, the fact that $L$ is quadratic in $s$ (the data model being linear in $s$) allows us to carry out the minimization and marginalization with respect to $s$ analytically. For this, we expand $L$ like in (3.6) but only in $s$ and not in $a$ (it is easy to show that this expansion is exact, i.e. that higher derivatives in $s$ vanish):
\[
L(a, s) = L(a, \hat{s}) + \sum_{\lambda} \frac{\partial L(a, \hat{s})}{\partial s_\lambda} \Delta s_\lambda + \frac{1}{2} \sum_{\lambda} \sum_{\mu} \frac{\partial^2 L(a, \hat{s})}{\partial s_\lambda \partial s_\mu} \Delta s_\lambda \Delta s_\mu ,
\]
where $\Delta s_\lambda = s_\lambda - \hat{s}_\lambda$.

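Solving the equations $\partial L/\partial s_\lambda = 0$ we find
\[
L(a, s) = L(a, \hat{s}) + \tfrac{1}{2} (s - \hat{s})\, S\, (s - \hat{s}) \tag{6.39}
\]
with
\[
\hat{s} = S^{-1} b, \qquad
S_{\lambda\mu} = \delta_{\lambda\mu} + \sum_{i=1}^{n} w_i \Delta_{i\lambda} \Delta_{i\mu}, \qquad
b_\lambda(a) = \sum_{i=1}^{n} w_i [d_i - t_i(a)] \Delta_{i\lambda} \tag{6.40}
\]
and
\[
L(a, \hat{s}) = \frac{1}{2} \sum_{i=1}^{n} w_i \Big(d_i - t_i(a) - \sum_{\lambda=1}^{m} \hat{s}_\lambda \Delta_{i\lambda}\Big)^2 + \frac{1}{2} \sum_{\lambda=1}^{m} \hat{s}_\lambda^2 . \tag{6.41}
\]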

Exercise 6.9: Derive the equations (6.39) to (6.41).

Exponentiating (6.39) we find for the posterior
\[
p(a, s|d) = C \exp[-L(a, \hat{s})]\, \exp\left[-\tfrac{1}{2} (s - \hat{s})\, S\, (s - \hat{s})\right]
\]
which yields upon integrating over the nuisance parameters $s$
\[
p(a|d) = C' \exp[-L(a, \hat{s})]. \tag{6.42}
\]

The log posterior (6.41) can now numerically be minimized with respect to the parameters $a$. Instead of the $n \times n$ covariance matrix $V$ of (6.32) only an $m \times m$ matrix $S$ has to be inverted, with $m$ the number of systematic sources.
The solution (6.40) for $\hat{s}$ can be substituted back into (6.41). This leads after straightforward algebra to the following very compact and elegant representation of the posterior [15]
\[
p(a|d) = C' \exp\left\{-\frac{1}{2} \left[\sum_{i=1}^{n} w_i (d_i - t_i)^2 - b\, S^{-1} b\right]\right\}. \tag{6.43}
\]

The first term in the exponent is the usual $\chi^2$ in absence of correlated errors while the second term takes into account the systematic correlations. Note that $S$ does not depend on $a$ so that $S^{-1}$ can be computed once and for all. The vector $b$, on the other hand, does depend on $a$ so that it has to be recalculated at each step in the minimization loop.
Although the posteriors defined by (6.33) and (6.43) look very different, it can be shown (tedious algebra) that they are mathematically identical [16]. In other words, minimizing (6.33) leads to the same result for $\hat{a}$ as minimizing the negative logarithm of (6.43).
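The equivalence of (6.33) and (6.43) can be illustrated numerically for a fixed set of residuals $d_i - t_i(a)$. The sketch below is an illustration under assumed inputs, not the implementation of [15] or [16]: it evaluates the quadratic form once with the full $n \times n$ matrix $V$ and once with the compact $m \times m$ form. The helper `solve` (plain Gaussian elimination) and all numerical values are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def chi2_full(resid, sigma, delta):
    """Quadratic form of (6.33): resid V^{-1} resid with the n x n matrix V of (6.32)."""
    n, m = len(resid), len(delta[0])
    V = [[(sigma[i] ** 2 if i == j else 0.0)
          + sum(delta[i][k] * delta[j][k] for k in range(m))
          for j in range(n)] for i in range(n)]
    y = solve(V, resid)
    return sum(r * yi for r, yi in zip(resid, y))


def chi2_compact(resid, sigma, delta):
    """Bracket in the exponent of (6.43): sum_i w_i resid_i^2 - b S^{-1} b."""
    n, m = len(resid), len(delta[0])
    w = [1.0 / s ** 2 for s in sigma]
    S = [[(1.0 if k == l else 0.0)
          + sum(w[i] * delta[i][k] * delta[i][l] for i in range(n))
          for l in range(m)] for k in range(m)]
    b = [sum(w[i] * resid[i] * delta[i][k] for i in range(n)) for k in range(m)]
    s_hat = solve(S, b)                                # s_hat = S^{-1} b as in (6.40)
    return sum(w[i] * resid[i] ** 2 for i in range(n)) - sum(bk * sk for bk, sk in zip(b, s_hat))


# Hypothetical residuals and errors: four data points, two systematic sources.
resid = [0.3, -1.2, 0.7, 2.0]
sigma = [1.0, 0.5, 2.0, 1.5]
delta = [[0.5, 0.1], [0.3, -0.2], [1.0, 0.4], [-0.6, 0.8]]
full = chi2_full(resid, sigma, delta)
compact = chi2_compact(resid, sigma, delta)
```

The two numbers agree to rounding error, while the compact form only requires solving an $m \times m$ system. In a fit, `S` would be set up once and only `b` recomputed when the parameters change.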
It is clear that the uncertainty on parameters derived from a given data sample should decrease when new data are added to the sample. This is because additional data will always increase the available information even when these data are very uncertain. From this it follows immediately that the error obtained from an analysis of the total sample can never be larger than the error obtained from an analysis of any sub-sample of the data. It turns out that error estimates based on equations (6.40) to (6.43) do meet this requirement but that the offset method, mentioned in the beginning of this section, does not. This issue is investigated further in the following exercise.
Exercise 6.10: We make $n$ measurements $d_i \pm \sigma_i$ of the temperature in a room. The measurements have a common systematic offset error $\Delta$. Calculate the best estimate $\hat{\mu}$ of the temperature and the total error (statistical $\oplus$ systematic) by: (i) using the offset method mentioned at the beginning of this section and (ii) using (6.43). To simplify the algebra assume that all data and errors have the same value: $d_i = d$, $\sigma_i = \sigma$.
A second set of $n$ measurements is added using another thermometer which has the same resolution but no offset uncertainty. Calculate the best estimate $\hat{\mu}$ and the total error from both data sets using either the offset method or (6.43). Again, assume that $d_i = d$ and $\sigma_i = \sigma$ to simplify the algebra.
Now let $n \to \infty$. Which error estimate makes sense in this limit and which does not?

The systematic errors described in this section were treated as offsets. Another important class are scale or normalization errors. Scale uncertainties are non-linear
and their treatment is beyond the scope of this write-up although it is similar to that
described above for offset errors: include scale parameters s in the data model, calculate the joint posterior p(a, s|d) assuming a normal, or perhaps a lognormal23 prior
distribution centered around unity and, finally, integrate over s. Considerable care
should be taken in correctly formulating the log likelihood (or log posterior) since scale
uncertainties not only affect the position but also the widths of the sampling distributions. Therefore normalization terms, which usually are ignored, suddenly become
dependent on the (scale) parameters so that they should be taken into account in the
minimization. For more on the treatment of normalization uncertainties and possible
large biases in the fit results we refer to [17].

23 When $x$ is normally (i.e. Gaussian) distributed then $y = e^x$ follows the lognormal distribution.

6.5 Model selection

In the previous sections we have derived posterior distributions of model parameters under the assumption that the model hypothesis is true. In this section we enlarge the hypothesis space to allow for several competing models and ask the question which model should be preferred in light of the evidence provided by the data. The Bayesian procedure which provides, at least in principle, an answer to this question is called model selection. As we will see, this model selection is not only based on the quality of the data description ("goodness of fit") but also on a factor which penalizes models which have a larger number of parameters. Bayesian model selection thus automatically applies Occam's razor in preferring, to a certain degree, the more simple models.
As already mentioned several times before, Bayesian inference can only assess the plausibility of a hypothesis when this hypothesis is a member of an exclusive and exhaustive set. One can of course always complement a given hypothesis $H$ by its negation $\bar{H}$ but this does not bring us very far since $\bar{H}$ is usually too vague a condition for a meaningful probability assignment.$^{24}$ Thus, in general, we have to include our model $H$ into a finite set of mutually exclusive and exhaustive alternatives $\{H_k\}$. This obviously restricts the outcome of our selection procedure to one of these alternatives but allows us, on the other hand, to use Bayes' theorem (2.13) to assign posterior probabilities to each of the $H_k$
\[
P(H_k|D, I) = \frac{P(D|H_k, I)\, P(H_k|I)}{\sum_i P(D|H_i, I)\, P(H_i|I)}. \tag{6.44}
\]

To avoid calculating the denominator, one often works with the so-called odds ratio (i.e. a ratio of probabilities)
\[
O_{kj} = \underbrace{\frac{P(H_k|D, I)}{P(H_j|D, I)}}_{\text{Posterior odds}} = \underbrace{\frac{P(H_k|I)}{P(H_j|I)}}_{\text{Prior odds}} \times \underbrace{\frac{P(D|H_k, I)}{P(D|H_j, I)}}_{\text{Bayes factor}}. \tag{6.45}
\]
The first term on the right-hand side is called the prior odds and the second term the Bayes factor.
The selection problem can thus be solved by calculating the odds with (6.45) and accept
hypothesis k if Okj is much larger than one, declare the data to be inconclusive if the
ratio is about unity and reject k in favor of one of the alternatives if Okj turns out to be
much smaller than one. The prior odds are usually set to unity unless there is strong
prior information in favor of one of the hypotheses. However, when the hypotheses in
(6.45) are composite, then not only the prior odds depend on prior information but also
the Bayes factor.
To see this, we follow Sivia [5] in working out an illustrative example where the choice is between a parameter-free hypothesis $H_0$ and an alternative $H_1$ with one free parameter $\lambda$. Let us denote a set of data points by $d$ and decompose the probability density $p(d|H_1)$ into
\[
p(d|H_1) = \int p(d, \lambda|H_1)\, d\lambda = \int p(d|\lambda, H_1)\, p(\lambda|H_1)\, d\lambda. \tag{6.46}
\]
To evaluate (6.46) we assume a uniform prior for $\lambda$ in a finite range $\Delta\lambda$ and write
\[
p(\lambda|H_1) = \frac{1}{\Delta\lambda}. \tag{6.47}
\]

24 We could, for instance, describe a detector response to pions by a probability $P(d|\pi)$. However, it would be very hard to assign something like a "not-pion" probability $P(d|\bar{\pi})$ without specifying the detector response to members of an alternative set of particles like electrons, kaons, protons etc.
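Next, an expansion of the logarithm of the likelihood up to second order in $\lambda$ gives
\[
L(\lambda) = -\ln[p(d|\lambda, H_1)] = L(\hat{\lambda}) + \frac{1}{2} \left[\frac{\lambda - \hat{\lambda}}{\sigma}\right]^2 + \cdots
\]
where $\hat{\lambda}$ is the position of the maximum likelihood and $\sigma$ is the width as characterized by the inverse of the Hessian. By exponentiating $-L$ we obtain
\[
p(d|\lambda, H_1) \approx p(d|\hat{\lambda}, H_1)\, \exp\left[-\frac{1}{2} \left(\frac{\lambda - \hat{\lambda}}{\sigma}\right)^2\right]. \tag{6.48}
\]
Finally, inserting (6.47) and (6.48) in (6.46) gives, upon integration over $\lambda$
\[
p(d|H_1) \approx p(d|\hat{\lambda}, H_1)\, \frac{\sigma\sqrt{2\pi}}{\Delta\lambda}. \tag{6.49}
\]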
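The approximation (6.49) is easy to test numerically when the likelihood really is Gaussian in the parameter and lies well inside the prior range. The sketch below is a hedged illustration with made-up numbers: `evidence_numeric` integrates the Gaussian-shaped likelihood against the uniform prior with a midpoint rule, and `evidence_approx` evaluates the right-hand side of (6.49). Function names and values are assumptions introduced for this example.

```python
import math


def evidence_numeric(like_max, lam_hat, sigma, lo, hi, steps=20_000):
    """Integrate like_max * exp(-0.5 ((lam - lam_hat)/sigma)^2) / (hi - lo) over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        lam = lo + (k + 0.5) * width                   # midpoint of the k-th bin
        total += like_max * math.exp(-0.5 * ((lam - lam_hat) / sigma) ** 2)
    return total * width / (hi - lo)


def evidence_approx(like_max, sigma, lo, hi):
    """Maximum likelihood times sigma * sqrt(2 pi) / (prior range), as in (6.49)."""
    return like_max * sigma * math.sqrt(2.0 * math.pi) / (hi - lo)


# Hypothetical case: peak at 0.3 with width 0.05, uniform prior on [-1, 1].
numeric = evidence_numeric(1.0, 0.3, 0.05, -1.0, 1.0)
approx = evidence_approx(1.0, 0.05, -1.0, 1.0)
```

The agreement deteriorates when the peak approaches the edge of the prior range, because the Gaussian integral is then truncated.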

Exercise 6.11: Derive an expression for the posterior of $\lambda$ by inserting Eqs. (6.47), (6.48) and (6.49) in Bayes' theorem
\[
p(\lambda|d, H_1) = \frac{p(d|\lambda, H_1)\, p(\lambda|H_1)}{p(d|H_1)}.
\]

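Thus we find for the odds ratio (6.45)
\[
\underbrace{\frac{P(H_0|d, I)}{P(H_1|d, I)}}_{\text{Posterior odds}} = \underbrace{\frac{P(H_0|I)}{P(H_1|I)}}_{\text{Prior odds}} \times \underbrace{\frac{p(d|H_0)}{p(d|H_1, \hat{\lambda})}}_{\text{Likelihood ratio}} \times \underbrace{\frac{\Delta\lambda}{\sigma\sqrt{2\pi}}}_{\text{Occam factor}}. \tag{6.50}
\]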

As already mentioned above, the prior odds can be set to unity unless there is information which gives us prior preference for one model over another. The likelihood ratio will, in general, be smaller than unity and therefore favor the model $H_1$. This is because the additional flexibility of an adjustable parameter usually yields a better description of the data. This preference for models with more parameters leads to the well known phenomenon that one can "fit an elephant" with enough free parameters.$^{25}$ This clearly illustrates the inadequacy of using the fit quality as the only criterion in model selection. Indeed, such a criterion alone could never favor a simpler model.
Intuitively we would prefer a model that gives a good description of the data in a wide range of parameter values over one with many fine-tuned parameters, unless the latter would provide a significantly better fit. Such an application of Occam's razor is encoded by the so-called Occam factor in (6.50). This factor tends to favor $H_0$ since it penalizes $H_1$ for reducing a wide parameter range $\Delta\lambda$ to a smaller range $\sigma$ allowed by the fit. Here we immediately face the problem that $H_0$ would always be favored in case $\Delta\lambda$ is set to infinity. As far as we know, there is no other way but setting prior parameter ranges as honestly as possible when dealing with model selection problems.
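To make the interplay between likelihood ratio and Occam factor in (6.50) concrete, here is a small sketch with invented numbers. The function name and its arguments are assumptions made for this illustration; it simply multiplies the three factors on the right-hand side of (6.50).

```python
import math


def odds_h0_over_h1(prior_odds, like_h0, like_h1_max, sigma, delta_lambda):
    """Posterior odds of H0 against H1 following (6.50)."""
    likelihood_ratio = like_h0 / like_h1_max
    occam_factor = delta_lambda / (sigma * math.sqrt(2.0 * math.pi))
    return prior_odds * likelihood_ratio * occam_factor


# Hypothetical case: H1 improves the maximum likelihood by a factor e^2,
# but its parameter is determined to sigma = 0.5 within a prior range of 10.
odds = odds_h0_over_h1(1.0, math.exp(-2.0), 1.0, sigma=0.5, delta_lambda=10.0)
```

In this example the odds come out slightly above one: the better fit of $H_1$ is not enough to overcome the penalty for shrinking the allowed parameter range by a factor of roughly eight. Doubling `delta_lambda` doubles the odds, which shows why the prior range has to be chosen honestly.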
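In case $H_i$ and $H_j$ both have one free parameter ($\lambda$ and $\mu$) the odds ratio becomes
\[
\frac{P(H_i|d, I)}{P(H_j|d, I)} = \frac{P(H_i|I)}{P(H_j|I)} \times \frac{p(d|H_i, \hat{\lambda})}{p(d|H_j, \hat{\mu})} \times \frac{\Delta\mu\, \sigma_\lambda}{\Delta\lambda\, \sigma_\mu}. \tag{6.51}
\]

25 Including as many parameters as data points will cause any model to perfectly describe the data.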

For similar prior ranges $\Delta\lambda$ and $\Delta\mu$ the likelihood ratio has to overcome the penalty factor $\sigma_\lambda/\sigma_\mu$. This factor favors the model for which the likelihood has the largest width. It may seem a bit strange that the less discriminating model is favored but inspection of (6.49) shows that the evidence $P(D|H)$ carried by the data tends to be larger for models with a larger ratio $\sigma/\Delta\lambda$, that is, for models which cause a smaller collapse of the hypothesis space when confronted with the data. Note that the choice of prior range is less critical in (6.51) than in (6.50) so that this poses not much of a problem when we use Bayesian model selection to choose between, say, a Breit-Wigner or a Gaussian peak shape in an invariant mass spectrum.
Finally, let us remark that the Occam factor varies like the power of the number of free parameters so that models with many parameters may get very strongly disfavored by this factor.
Exercise 6.12: Generalize (6.51) to the case where $H_i$ has $n$ free parameters and $H_j$ has $m$ free parameters (with $n \neq m$).

This presentation of model selection had to remain very sketchy since practical applications depend strongly on the details of the selection problem at hand. For many
worked-out examples we refer to Sivia [5], Gregory [6], Bretthorst [3] and Loredo [1]
which also contains a discussion on goodness-of-fit tests used in Frequentist model selection.

7 Counting

It is often thought that Frequentist inference is objective because it is only based on


sampling distributions or on distributions derived thereof. However, in the construction
of such a sampling distribution we need to specify how repetitions of the experiment
are actually carried out. The potential difficulties arising from this are perhaps best
illustrated by the so-called stopping problem, already mentioned in Section 2.5.
The stopping problem occurs in counting experiments because the sampling distribution (or likelihood) depends on the strategy we have adopted to halt the experiment. In a simple coin flipping experiment, for instance, the sampling distribution is different when we choose to stop after a fixed number $N$ of throws or choose to stop after the observation of a fixed number $n$ of heads. In the first case the number of heads is the random variable while in the second case it is the number of throws.
In the following subsections we will have a closer look at these two stopping strategies.
As we will see, Frequentist inference is sensitive to the stopping rules but Bayesian
inference is not.

7.1 Binomial counting

We denote the probability of "heads" in coin flipping by $h$ and that of "tails" by $1 - h$. We assume that the coin flips are independent so that the probability of observing $n$ heads in $N$ throws is given by the binomial distribution
\[
P(n|N, h, I) = \frac{N!}{n!(N - n)!}\, h^n (1 - h)^{N - n}. \tag{7.1}
\]
If we define as our statistic for $h$ the ratio $R = n/N$, the average and variance of $R$ are given by (4.8)
\[
\langle R \rangle = \frac{\langle n \rangle}{N} = h, \qquad \langle \Delta R^2 \rangle = \frac{h(1 - h)}{N}. \tag{7.2}
\]
From Bayes' theorem we obtain for the posterior of $h$
\[
p(h|n, N, I)\, dh = C\, P(n|N, h, I)\, p(h|I)\, dh \tag{7.3}
\]
where $C$ is a normalization constant. Assuming a uniform prior $p(h|I) = 1$ we find by integrating (7.3):
\[
\frac{1}{C} = \int_0^1 \frac{N!}{n!(N - n)!}\, h^n (1 - h)^{N - n}\, dh = \frac{1}{N + 1}
\]
so that we obtain for the normalized posterior
\[
p(h|n, N)\, dh = \frac{(N + 1)!}{n!(N - n)!}\, h^n (1 - h)^{N - n}\, dh. \tag{7.4}
\]

[Figure 7: The posterior density $p(h|n, N)$ for $n$ heads in the first three flips of a coin with bias $h = 0.25$. A uniform prior distribution of $h$ has been assumed. The densities are scaled to unit maximum for ease of comparison. Three panels, each showing $p(h)$ versus $h$.]

In Fig. 7 we plot the evolution of the posterior for the first three throws (for more throws see Fig. 3 in Section 5). From this plot, and also from (7.4), it is seen that the distribution vanishes at the edges of the interval only when at least one head and one tail has been observed. Thus for $1 \le n \le N - 1$ the distribution has its peak inside the interval so that it makes sense to calculate the position and width from $L = -\ln p$. This gives
\[
\hat{h} = \frac{n}{N}, \qquad \sigma = \sqrt{\frac{\hat{h}(1 - \hat{h})}{N}} \qquad (1 \le n \le N - 1). \tag{7.5}
\]
Note, as a curiosity, that the expressions given above are similar to those for $\langle R \rangle$ and $\langle \Delta R^2 \rangle$ given in (7.2).
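As a quick numerical cross-check of (7.4) and (7.5), the sketch below evaluates the normalized posterior, integrates it over the unit interval, and compares the width from (7.5) with the one obtained from a finite-difference estimate of the second derivative of $L = -\ln p$ at the mode. The function names and the choice $n = 3$, $N = 12$ are assumptions made for this illustration.

```python
import math


def posterior(h, n, N):
    """Normalized posterior (7.4) for n heads in N throws and a flat prior."""
    norm = math.factorial(N + 1) / (math.factorial(n) * math.factorial(N - n))
    return norm * h ** n * (1.0 - h) ** (N - n)


def mode_and_width(n, N):
    """Position and width (7.5); meaningful only for 1 <= n <= N - 1."""
    h_hat = n / N
    return h_hat, math.sqrt(h_hat * (1.0 - h_hat) / N)


def width_from_curvature(n, N, eps=1e-4):
    """Width from a central finite difference of L = -ln p at the mode."""
    h_hat = n / N
    L = lambda h: -math.log(posterior(h, n, N))
    second = (L(h_hat + eps) - 2.0 * L(h_hat) + L(h_hat - eps)) / eps ** 2
    return 1.0 / math.sqrt(second)


def integral_of_posterior(n, N, steps=10_000):
    """Midpoint-rule integral of the posterior over 0 <= h <= 1."""
    width = 1.0 / steps
    return sum(posterior((k + 0.5) * width, n, N) for k in range(steps)) * width


h_hat, sigma = mode_and_width(3, 12)
sigma_numeric = width_from_curvature(3, 12)
area = integral_of_posterior(3, 12)
```

For $n = 3$ and $N = 12$ the mode is $0.25$ and the width $0.125$, and the curvature-based estimate reproduces the latter closely.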
Exercise 7.1: Derive (7.5)

Exercise 7.2: A counter is traversed by $N$ particles and fires $N$ times. Calculating the efficiency and error from (7.2) gives $\varepsilon = 1 \pm 0$ which is an unacceptable estimate for the error. Derive from (7.4) an expression for the lower limit $\varepsilon_\alpha$ corresponding to the $\alpha$ confidence interval defined by the equation
\[
\int_{\varepsilon_\alpha}^{1} p(\varepsilon|N, N)\, d\varepsilon = \alpha.
\]
Show that for $N = 4$ and $\alpha = 0.65$ the result on the efficiency can be reported as
\[
\varepsilon = 1^{+0}_{-0.19} \qquad \text{(65\% CL)}
\]

7.2 The negative binomial

Instead of throwing the coin $N$ times we may decide to throw the coin as many times as is necessary to observe $n$ heads. Because the last throw must by definition be a head, and because the probability of this throw does not depend on the previous throws, the probability of $N$ throws is given by
\[
\begin{aligned}
P(N|n, h) &= P(n - 1 \text{ heads in } N - 1 \text{ throws}) \times P(\text{one head in one throw}) \\
&= \frac{(N - 1)!}{(n - 1)!(N - n)!}\, h^n (1 - h)^{N - n}, \qquad n \ge 1,\ N \ge n.
\end{aligned} \tag{7.6}
\]
This distribution is known as the negative binomial. In Fig. 8 we show this distribution for a fair coin ($h = 0.5$) and $n = (3, 9, 20)$ heads.
[Figure 8: The negative binomial distribution $P(N|n, h)$ of the number of trials $N$ needed to observe $n = (3, 9, 20)$ heads in flips of a fair coin ($h = 0.5$). Three panels: $P(N|3, 0.5)$, $P(N|9, 0.5)$ and $P(N|20, 0.5)$ versus $N$.]
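It can be shown that $P(N|n, h)$ is properly normalized
\[
\sum_{N=n}^{\infty} P(N|n, h) = 1.
\]
The first and second moments and the variance of this distribution are
\[
\begin{aligned}
\langle N \rangle &= \sum_{N=n}^{\infty} N\, P(N|n, h) = \frac{n}{h} \\
\langle N^2 \rangle &= \sum_{N=n}^{\infty} N^2 P(N|n, h) = \frac{n(1 - h)}{h^2} + \frac{n^2}{h^2} \\
\langle \Delta N^2 \rangle &= \frac{n(1 - h)}{h^2}.
\end{aligned} \tag{7.7}
\]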

If we define the ratio $Q = N/n$ as our statistic for $z \equiv 1/h$ it follows directly from (7.7) that the average and variance of $Q$ are given by
\[
\langle Q \rangle = z, \qquad \langle \Delta Q^2 \rangle = \frac{z(z - 1)}{n}. \tag{7.8}
\]

In the previous section we took $R = n/N$ as a statistic for $h$ because it had the property that $\langle R \rangle = h$, when averaged over the binomial distribution. But if we average $R$ over the negative binomial (7.6) then $\langle R \rangle \neq h$. This can easily be seen, without any explicit calculation, from the fact that $N$ and not $n$ is the random variable and that the reciprocal of an average is not the average of the reciprocal:
\[
\langle R \rangle = \Big\langle \frac{n}{N} \Big\rangle = n \Big\langle \frac{1}{N} \Big\rangle \neq \frac{n}{\langle N \rangle} = h.
\]
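The inequality can be made explicit by summing over the negative binomial (7.6) directly. The sketch below is an illustration only: the function names are made up, and the infinite sums are truncated at an assumed cut-off `n_max`, which is harmless for the values used here because the tail of the distribution is negligible.

```python
import math


def neg_binomial(N, n, h):
    """P(N|n, h) of (7.6): probability that the n-th head occurs at throw N."""
    return math.comb(N - 1, n - 1) * h ** n * (1.0 - h) ** (N - n)


def average(func, n, h, n_max=500):
    """Average of func(N) over the negative binomial, truncated at n_max."""
    return sum(func(N) * neg_binomial(N, n, h) for N in range(n, n_max + 1))


n, h = 3, 0.5
total = average(lambda N: 1.0, n, h)       # normalization, should be 1
mean_N = average(lambda N: N, n, h)        # should equal n / h
mean_R = average(lambda N: n / N, n, h)    # average of R = n / N
```

For a fair coin and $n = 3$ the average of $N$ reproduces $n/h = 6$, while the average of $R$ comes out larger than $h$, as expected from the convexity of $1/N$.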

To calculate the posterior we again assume a flat prior for $h$ so that we can write
\[
p(h|N, n)\, dh = C\, \frac{(N - 1)!}{(n - 1)!\,(N - n)!}\, h^n (1 - h)^{N - n}\, dh
\]
which gives, upon integration, a value for the normalization constant
\[
C = \frac{N(N + 1)}{n}.
\]
The normalized posterior is thus
\[
p(h|N, n)\, dh = \frac{(N + 1)!}{n!(N - n)!}\, h^n (1 - h)^{N - n}\, dh. \tag{7.9}
\]

This posterior is the same as that given in (7.4), as it should be, since the relevant information on $h$ is carried by the observation of how many heads there are in a certain number of throws and not by how the experiment was halted.
It is worthwhile to have a closer look at the likelihoods (7.1) and (7.6) to understand why the posteriors come out to be the same. It is seen that the dependence of both likelihoods on $h$ is given by the term
\[
h^n (1 - h)^{N - n}.
\]
The terms in front are different because of the different stopping rules but these terms do not enter in the inference on $h$ since they do not depend on $h$. They can thus be absorbed in the normalization constant of the posterior which can in both cases be written as
\[
p(h|n, N)\, dh = C\, h^n (1 - h)^{N - n}\, dh
\]
with, obviously, the same constant $C$.
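The cancellation of the stopping-rule dependent prefactors can also be seen numerically. In the sketch below (illustrative names, an assumed grid of 2000 points in $h$) the binomial likelihood (7.1) and the negative binomial likelihood (7.6) are each multiplied by a flat prior and normalized on the grid; the resulting posteriors coincide.

```python
import math


def binomial_likelihood(h, n, N):
    """Likelihood (7.1): stop after a fixed number N of throws."""
    return math.comb(N, n) * h ** n * (1.0 - h) ** (N - n)


def negative_binomial_likelihood(h, n, N):
    """Likelihood (7.6): stop after a fixed number n of heads."""
    return math.comb(N - 1, n - 1) * h ** n * (1.0 - h) ** (N - n)


def grid_posterior(likelihood, n, N, steps=2000):
    """Posterior density on a midpoint grid in h for a flat prior, normalized numerically."""
    width = 1.0 / steps
    values = [likelihood((k + 0.5) * width, n, N) for k in range(steps)]
    norm = sum(values) * width
    return [v / norm for v in values]


post_fixed_throws = grid_posterior(binomial_likelihood, 3, 12)
post_fixed_heads = grid_posterior(negative_binomial_likelihood, 3, 12)
largest_difference = max(abs(a - b) for a, b in zip(post_fixed_throws, post_fixed_heads))
```

The largest difference between the two posteriors is at the level of floating-point rounding, because the likelihoods differ only by a factor that does not depend on $h$.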

Here we have encountered a very important property of Bayesian inference, namely its ability to discard information which is irrelevant. This is in accordance with Cox's desideratum of consistency which states that conclusions should depend on relevant information only. Frequentist analysis does not possess this property since the stopping rule must be specified in order to construct the sampling distribution and a meaningful statistic. Such inference therefore violates at least one of Cox's desiderata.

Solutions to Selected Exercises

Exercise 1.1
This is because astronomers often have to draw conclusions from the observation of rare events. Bayesian
inference is well suited for this since it is based solely on the evidence carried by the data (and prior
information) instead of being based on hypothetical repetitions of the experiment.

Exercise 2.3
(i) From the truth table (2.1) it is seen that both $\bar{B}$ and $A \Rightarrow B$ are true if and only if $A$ is false. But this is just the implication $\bar{B} \Rightarrow \bar{A}$.
(ii) Negating the minor premises in the syllogisms leads to the conclusions

                 Induction                      Deduction
Proposition 1:   If A is true then B is true    If A is true then B is true
Proposition 2:   A is false                     B is false
Conclusion:      B is less probable             A is false

Exercise 2.4
From de Morgan's law and repeated application of the product and sum rules (2.4) and (2.5) we find
\[
\begin{aligned}
P(A + B) &= 1 - P(\overline{A + B}) = 1 - P(\bar{A}\bar{B}) \\
&= 1 - P(\bar{B}|\bar{A}) P(\bar{A}) = 1 - P(\bar{A})[1 - P(B|\bar{A})] \\
&= 1 - P(\bar{A}) + P(B|\bar{A}) P(\bar{A}) = P(A) + P(\bar{A}B) \\
&= P(A) + P(\bar{A}|B) P(B) = P(A) + P(B)[1 - P(A|B)] \\
&= P(A) + P(B) - P(A|B) P(B) = P(A) + P(B) - P(AB).
\end{aligned}
\]

Exercise 2.5
(i) The probability for Mr. White to have AIDS is
\[
P(A|T) = \frac{P(T|A) P(A)}{P(T|A) P(A) + P(T|\bar{A}) P(\bar{A})} = \frac{0.98 \times 0.01}{0.98 \times 0.01 + 0.03 \times 0.99} = 0.25.
\]
(ii) For full efficiency $P(T|A) = 1$ so that
\[
P(A|T) = \frac{1 \times 0.01}{1 \times 0.01 + 0.03 \times 0.99} = 0.25.
\]
(iii) For zero contamination $P(T|\bar{A}) = 0$ so that
\[
P(A|T) = \frac{0.98 \times 0.01}{0.98 \times 0.01 + 0 \times 0.99} = 1.
\]
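The three cases of this solution differ only in the numbers inserted in Bayes' theorem, which the following small sketch makes explicit. The function name and argument names are chosen for this illustration; the numerical inputs are those quoted in the solution.

```python
def posterior_given_positive(prior, efficiency, contamination):
    """P(A|T) from Bayes' theorem, with efficiency = P(T|A) and contamination = P(T|not A)."""
    signal = efficiency * prior
    background = contamination * (1.0 - prior)
    return signal / (signal + background)


case_i = posterior_given_positive(0.01, 0.98, 0.03)    # quoted test performance
case_ii = posterior_given_positive(0.01, 1.00, 0.03)   # full efficiency
case_iii = posterior_given_positive(0.01, 0.98, 0.00)  # zero contamination
```

Comparing the cases shows that, for a rare condition, improving the efficiency barely changes the posterior, whereas removing the contamination changes it completely.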

Exercise 2.6
(i) If nobody has AIDS then $P(A) = 0$ and thus
\[
P(A|T) = \frac{P(T|A) \times 0}{P(T|A) \times 0 + P(T|\bar{A}) \times 1} = 0.
\]
(ii) If everybody has AIDS then $P(A) = 1$ and thus
\[
P(A|T) = \frac{P(T|A) \times 1}{P(T|A) \times 1 + P(T|\bar{A}) \times 0} = 1.
\]
In both these cases the posterior is thus equal to the prior independent of the likelihood $P(T|A)$.

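Exercise 2.7
(i) When $\mu$ is known we can write for the posterior distribution
\[
P(\pi|S, \mu) = \frac{P(S|\pi) P(\pi)}{P(S|\pi) P(\pi) + P(S|\bar{\pi}) P(\bar{\pi})} = \frac{\varepsilon\mu}{\varepsilon\mu + \delta(1 - \mu)}.
\]
(ii) When $\mu$ is unknown we decompose $P(\pi|S)$ in $\mu$ which gives
\[
P(\pi|S) = \int_0^1 P(\pi, \mu|S)\, d\mu = \int_0^1 P(\pi|S, \mu)\, p(\mu)\, d\mu.
\]
Assuming a uniform prior $p(\mu) = 1$ gives for the probability that the signal $S$ corresponds to a pion
\[
P(\pi|S) = \int_0^1 \frac{\varepsilon\mu\, d\mu}{\varepsilon\mu + \delta(1 - \mu)} = \frac{\varepsilon\,[\varepsilon - \delta + \delta \ln(\delta/\varepsilon)]}{(\varepsilon - \delta)^2},
\]
where we have used the Mathematica program to evaluate the integral.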

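Exercise 2.8
The quantities $x$, $x_0$, $d$ and $\theta$ are related by $x = x_0 + d \tan\theta$. With $p(\theta|I) = 1/\pi$ it follows that $p(x|I)$ is Cauchy distributed
\[
p(x|I) = p(\theta|I) \left|\frac{d\theta}{dx}\right| = \frac{1}{\pi} \frac{\cos^2\theta}{d} = \frac{1}{\pi} \frac{d}{(x - x_0)^2 + d^2}.
\]
The first and second derivatives of $L = -\ln p$ are
\[
\frac{dL}{dx} = \frac{2(x - x_0)}{(x - x_0)^2 + d^2}, \qquad
\frac{d^2 L}{dx^2} = -\frac{4(x - x_0)^2}{[(x - x_0)^2 + d^2]^2} + \frac{2}{(x - x_0)^2 + d^2}.
\]
This gives for the position and width of $p(x)$
\[
\frac{dL(\hat{x})}{dx} = 0 \ \Rightarrow\ \hat{x} = x_0, \qquad
\frac{1}{\sigma^2} = \frac{d^2 L(\hat{x})}{dx^2} = \frac{2}{d^2} \ \Rightarrow\ \sigma = \frac{d}{\sqrt{2}}.
\]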

Exercise 2.9
(i) The posterior distribution of the first measurement is
\[
P(\pi|S_1) = \frac{P(S_1|\pi) P(\pi)}{P(S_1|\pi) P(\pi) + P(S_1|\bar{\pi}) P(\bar{\pi})} = \frac{\varepsilon\mu}{\varepsilon\mu + \delta(1 - \mu)}.
\]
Using this as the prior for the second measurement we have
\[
P(\pi|S_1, S_2) = \frac{P(S_2|\pi, S_1) P(\pi|S_1)}{P(S_2|\pi, S_1) P(\pi|S_1) + P(S_2|\bar{\pi}, S_1) P(\bar{\pi}|S_1)} = \frac{\varepsilon^2\mu}{\varepsilon^2\mu + \delta^2(1 - \mu)}.
\]
Here we have assumed that the two measurements are independent, that is,
\[
P(S_2|\pi, S_1) = P(S_2|\pi) = \varepsilon, \qquad P(S_2|\bar{\pi}, S_1) = P(S_2|\bar{\pi}) = \delta.
\]
(ii) Direct application of Bayes' theorem gives
\[
P(\pi|S_1, S_2) = \frac{P(S_1, S_2|\pi) P(\pi)}{P(S_1, S_2|\pi) P(\pi) + P(S_1, S_2|\bar{\pi}) P(\bar{\pi})} = \frac{\varepsilon^2\mu}{\varepsilon^2\mu + \delta^2(1 - \mu)}.
\]
Here we have again assumed that the two measurements are independent, that is,
\[
P(S_1, S_2|\pi) = P(S_1|\pi) P(S_2|\pi) = \varepsilon^2, \qquad P(S_1, S_2|\bar{\pi}) = P(S_1|\bar{\pi}) P(S_2|\bar{\pi}) = \delta^2.
\]
Both results are thus the same when we assume that the two measurements are independent.

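Exercise 3.1
Because averaging is a linear operation we have
\[
\begin{aligned}
\langle \Delta x^2 \rangle &= \langle (x - \bar{x})^2 \rangle = \langle x^2 \rangle - 2\bar{x} \langle x \rangle + \langle x \rangle^2 \\
&= \langle x^2 \rangle - 2\langle x \rangle^2 + \langle x \rangle^2 = \langle x^2 \rangle - \langle x \rangle^2 .
\end{aligned}
\]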

Exercise 3.2
The covariance matrix can be written as
\[
V_{ij} = \langle (x_i - \bar{x}_i)(x_j - \bar{x}_j) \rangle = \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle .
\]
For independent variables the joint probability factorizes, $p(x_i, x_j|I) = p(x_i|I)\, p(x_j|I)$, so that
\[
\langle x_i x_j \rangle = \int dx_i\, x_i\, p(x_i|I) \int dx_j\, x_j\, p(x_j|I) = \langle x_i \rangle \langle x_j \rangle .
\]
This implies that the off-diagonal elements of $V_{ij}$ vanish.

Exercise 3.3
(i) It is easiest to make the transformation $y = 2(x - x_0)/\Gamma$ so that the Breit-Wigner transforms into
\[
p(x|x_0, \Gamma, I) \to p(y|I) = p(x|I) \left|\frac{dx}{dy}\right| = \frac{1}{\pi} \frac{1}{1 + y^2}.
\]
For this distribution $L = \ln\pi + \ln(1 + y^2)$. The first and second derivatives of $L$ are given by
\[
\frac{dL}{dy} = \frac{2y}{1 + y^2}, \qquad
\frac{d^2 L}{dy^2} = -\frac{4y^2}{(1 + y^2)^2} + \frac{2}{1 + y^2}.
\]
From $dL/dy = 0$ we find $\hat{y} = 0$. It follows that the second derivative of $L$ at $\hat{y}$ is 2 so that the width of the distribution is $\sigma_y = 1/\sqrt{2}$. Transforming back to the variable $x = x_0 + \Gamma y/2$ gives
\[
\hat{x} = x_0 \qquad \text{and} \qquad \sigma_x = \left|\frac{dx}{dy}\right| \sigma_y = \frac{\Gamma}{2\sqrt{2}}.
\]
(ii) Substituting $x = x_0$ in the Breit-Wigner formula gives for the maximum value $2/(\pi\Gamma)$. Substituting $x = x_0 \pm \Gamma/2$ gives a value of $1/(\pi\Gamma)$ which is just half the maximum.

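Exercise 3.4
For $z = x + y$ and $z = xy$ we have
\[
p(z|I) = \iint \delta(z - x - y)\, f(x) g(y)\, dx\, dy = \int f(z - y)\, g(y)\, dy
\]
and
\[
p(z|I) = \iint \delta(z - xy)\, f(x) g(y)\, dx\, dy = \iint \delta(z - w)\, f(w/y)\, g(y)\, \frac{dw}{|y|}\, dy = \int f(z/y)\, g(y)\, \frac{dy}{|y|}.
\]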

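Exercise 3.5
(i) The inverse transformations are
\[
x = \frac{u + v}{2}, \qquad y = \frac{u - v}{2}
\]
so that the determinant of the Jacobian is
\[
|J| = \left| \det \begin{pmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{pmatrix} \right|
= \left| \det \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & -1/2 \end{pmatrix} \right| = \frac{1}{2}.
\]
The joint distribution of $u$ and $v$ is thus
\[
p(u, v) = p(x, y)\, |J| = \frac{1}{2} f(x) g(y) = \frac{1}{2} f\left(\frac{u + v}{2}\right) g\left(\frac{u - v}{2}\right).
\]
Integration over $v$ gives
\[
p(u) = \int p(u, v)\, dv = \frac{1}{2} \int f\left(\frac{u + v}{2}\right) g\left(\frac{u - v}{2}\right) dv = \int f(w)\, g(u - w)\, dw
\]
which is just (3.14).
(ii) Here we have for the inverse transformation and the Jacobian determinant
\[
x = \sqrt{uv}, \qquad y = \sqrt{\frac{u}{v}}, \qquad
|J| = \left| \det \begin{pmatrix} v(4uv)^{-1/2} & u(4uv)^{-1/2} \\ (4uv)^{-1/2} & -u(4uv^3)^{-1/2} \end{pmatrix} \right| = \frac{1}{2v}
\]
which gives for the joint distribution
\[
p(u, v) = \frac{1}{2v}\, f\left(\sqrt{uv}\right) g\left(\sqrt{\frac{u}{v}}\right).
\]
Marginalization of $v$ gives
\[
p(u) = \int p(u, v)\, dv = \int f\left(\sqrt{uv}\right) g\left(\sqrt{\frac{u}{v}}\right) \frac{dv}{2v} = \int f(w)\, g\left(\frac{u}{w}\right) \frac{dw}{w}
\]
which is just (3.15).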

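Exercise 3.6
According to (3.14) the distribution of $z$ is given by
\[
p(z|I) = \frac{1}{2\pi\sigma_1\sigma_2} \int dx\, \exp\left\{-\frac{1}{2} \left[\left(\frac{x - \mu_1}{\sigma_1}\right)^2 + \left(\frac{z - x - \mu_2}{\sigma_2}\right)^2\right]\right\}.
\]
The standard way to deal with such an integral is to "complete the squares", that is, to write the exponent in the form $a(x - b)^2 + c$. This allows us to carry out the integral
\[
\int dx\, \exp\left\{-\tfrac{1}{2} [a(x - b)^2 + c]\right\} = \exp\left(-\tfrac{1}{2} c\right) \int dy\, \exp\left(-\tfrac{1}{2} a y^2\right) = \sqrt{\frac{2\pi}{a}}\, \exp\left(-\tfrac{1}{2} c\right).
\]
Our problem is now reduced to finding the coefficients $a$, $b$ and $c$ such that the following equation holds
\[
p(x - r)^2 + q(x - s)^2 = a(x - b)^2 + c
\]
where the left-hand side is a generic expression for the exponent of the convolution of our two Gaussians. Since the coefficients of the powers of $x$ must be equal at both sides of the equation we have
\[
p + q = a, \qquad pr + qs = ab, \qquad pr^2 + qs^2 = ab^2 + c.
\]
Solving these equations for $a$, $b$ and $c$ gives
\[
a = p + q, \qquad b = \frac{pr + qs}{p + q}, \qquad c = \frac{pq}{p + q} (s - r)^2 .
\]
Then substituting
\[
p = 1/\sigma_1^2, \qquad q = 1/\sigma_2^2, \qquad r = \mu_1, \qquad s = z - \mu_2
\]
yields the desired result
\[
p(z|I) = \frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}} \exp\left[-\frac{(z - \mu_1 - \mu_2)^2}{2(\sigma_1^2 + \sigma_2^2)}\right].
\]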

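Exercise 3.7
For independent random variables $x_i$ with variance $\sigma_i^2$ we have $\langle \Delta x_i \Delta x_j \rangle = \sigma_i \sigma_j \delta_{ij}$.
(i) For the sum $z = \sum x_i$ we have $\partial z/\partial x_i = 1$ so that (3.19) gives
\[
\langle \Delta z^2 \rangle = \sum_i \sum_j \sigma_i \sigma_j \delta_{ij} = \sum_i \sigma_i^2 .
\]
(ii) For the product $z = \prod x_i$ we have $\partial z/\partial x_i = z/x_i$ so that (3.19) gives
\[
\langle \Delta z^2 \rangle = \sum_i \sum_j \frac{z}{x_i} \frac{z}{x_j}\, \sigma_i \sigma_j \delta_{ij} = z^2 \sum_i \left(\frac{\sigma_i}{x_i}\right)^2 .
\]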

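Exercise 3.8
We have
\[
\varepsilon = \frac{n}{N} = \frac{n}{n + m}, \qquad
\frac{\partial\varepsilon}{\partial n} = \frac{m}{(n + m)^2} = \frac{N - n}{N^2}, \qquad
\frac{\partial\varepsilon}{\partial m} = -\frac{n}{(n + m)^2} = -\frac{n}{N^2}
\]
and
\[
\langle \Delta n^2 \rangle = n, \qquad \langle \Delta m^2 \rangle = m = N - n, \qquad \langle \Delta n \Delta m \rangle = 0.
\]
Inserting this in (3.19) gives for the variance of $\varepsilon$
\[
\langle \Delta\varepsilon^2 \rangle = \left(\frac{\partial\varepsilon}{\partial n}\right)^2 \langle \Delta n^2 \rangle + \left(\frac{\partial\varepsilon}{\partial m}\right)^2 \langle \Delta m^2 \rangle
= \left(\frac{N - n}{N^2}\right)^2 n + \left(\frac{n}{N^2}\right)^2 (N - n) = \frac{\varepsilon(1 - \varepsilon)}{N}.
\]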

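Exercise 3.9
Writing out (3.24) in components gives, using the definition (3.25)
\[
\sum_k V_{ik} U^T_{kj} = \sum_k U^T_{ik} \lambda_k \delta_{kj}
\ \Rightarrow\ \sum_k V_{ik} (u_j)_k = \lambda_j (u_j)_i
\ \Rightarrow\ V u_j = \lambda_j u_j .
\]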

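Exercise 3.10
(i) For a symmetric matrix $V$ and two arbitrary vectors $x$ and $y$ we have
\[
y V x = \sum_{ij} y_i V_{ij} x_j = \sum_{ij} x_j V^T_{ji} y_i = \sum_{ij} x_j V_{ji} y_i = x V y.
\]
(ii) Because $V$ is symmetric we have
\[
u_i V u_j = u_j V u_i \ \Rightarrow\ \lambda_i\, u_i u_j = \lambda_j\, u_i u_j \quad \text{or} \quad (\lambda_i - \lambda_j)\, u_i u_j = 0 \ \Rightarrow\ u_i u_j = 0 \ \text{for} \ \lambda_i \neq \lambda_j .
\]
(iii) If $V$ is positive definite we have $u_i V u_i = \lambda_i\, u_i u_i = \lambda_i > 0$.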

Exercise 4.1
Decomposition of $P(R_2|I)$ in the hypothesis space $\{R_1, W_1\}$ gives
\[
\begin{aligned}
P(R_2|I) &= P(R_2, R_1|I) + P(R_2, W_1|I) \\
&= P(R_2|R_1, I) P(R_1|I) + P(R_2|W_1, I) P(W_1|I) \\
&= \frac{R - 1}{N - 1} \frac{R}{N} + \frac{R}{N - 1} \frac{W}{N} = \frac{R}{N}.
\end{aligned}
\]

Exercise 4.2
For draws with replacement we have $P(R_2|R_1, I) = P(R_2|W_1, I) = P(R_1|I) = R/N$ and $P(W_1|I) = W/N$. Inserting this in Bayes' theorem gives
\[
P(R_1|R_2, I) = \frac{P(R_2|R_1, I) P(R_1|I)}{P(R_2|R_1, I) P(R_1|I) + P(R_2|W_1, I) P(W_1|I)} = \frac{R}{N}.
\]

Exercise 4.3
Without loss of generality we can consider marginalization of the multinomial distribution over all but the first probability. According to (4.10) we have
\[
n_2' = \sum_{i=2}^{k} n_i = N - n_1 \qquad \text{and} \qquad p_2' = \sum_{i=2}^{k} p_i = 1 - p_1 .
\]
Inserting this in (4.9) gives the binomial distribution
\[
P(n_1, n_2'|p_1, p_2', N) = \frac{N!}{n_1!(N - n_1)!}\, p_1^{n_1} (1 - p_1)^{N - n_1}.
\]

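Exercise 4.4
From the product rule we have
\[
P(n_1, \ldots, n_k|I) = P(n_1, \ldots, n_{k-1}|n_k, I)\, P(n_k|I).
\]
From this we find for the conditional distribution
\[
\begin{aligned}
P(n_1, \ldots, n_{k-1}|n_k, I) &= \frac{P(n_1, \ldots, n_k|I)}{P(n_k|I)} \\
&= \frac{N!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_{k-1}^{n_{k-1}} p_k^{n_k} \times \frac{n_k!(N - n_k)!}{N!}\, \frac{1}{p_k^{n_k} (1 - p_k)^{N - n_k}} \\
&= \frac{(N - n_k)!}{n_1! \cdots n_{k-1}!} \left(\frac{p_1}{1 - p_k}\right)^{n_1} \cdots \left(\frac{p_{k-1}}{1 - p_k}\right)^{n_{k-1}} .
\end{aligned}
\]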

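Exercise 4.5
(i) The likelihood to observe $n$ counts in a time window $t$ is given by the Poisson distribution
\[
P(n|\mu) = \frac{\mu^n}{n!} \exp(-\mu)
\]
with $\mu = Rt$ and $R$ the average counting rate. Assuming a flat prior for $\mu \in [0, \infty]$ the posterior is
\[
p(\mu|n) = C\, \frac{\mu^n}{n!} \exp(-\mu)
\]
with $C$ a normalization constant which turns out to be unity: the Poisson distribution has the remarkable property that it is normalized with respect to both $n$ and $\mu$:
\[
\sum_{n=0}^{\infty} P(n|\mu) = \sum_{n=0}^{\infty} \frac{\mu^n}{n!} \exp(-\mu) = 1 \qquad \text{and} \qquad \int_0^\infty p(\mu|n)\, d\mu = \int_0^\infty \frac{\mu^n}{n!} \exp(-\mu)\, d\mu = 1.
\]
The mean, second moment and the variance of the posterior are
\[
\langle \mu \rangle = n + 1, \qquad \langle \mu^2 \rangle = (n + 1)(n + 2), \qquad \langle \Delta\mu^2 \rangle = n + 1.
\]
The log posterior and the derivatives are
\[
L = \text{constant} - n \ln(\mu) + \mu, \qquad \frac{dL}{d\mu} = 1 - \frac{n}{\mu}, \qquad \frac{d^2 L}{d\mu^2} = \frac{n}{\mu^2}.
\]
Setting the derivative to zero we find $\hat{\mu} = n$. The square root of the inverse of the Hessian at the mode gives for the width $\sigma = \sqrt{n}$.
(ii) The probability $p(\tau|R, I)\, d\tau$ that the time interval between the passage of two particles is between $\tau$ and $\tau + d\tau$ is given by
\[
\begin{aligned}
p(\tau|R, I)\, d\tau &= P(\text{no particle passes during } \tau) \times P(\text{one particle passes during } d\tau) \\
&= \exp(-R\tau) \times R\, d\tau .
\end{aligned}
\]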

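Exercise 4.6
The derivatives of the characteristic function (4.17) are
\[
\frac{d\phi(k)}{dk} = (i\mu - \sigma^2 k) \exp\left(i\mu k - \tfrac{1}{2} \sigma^2 k^2\right), \qquad
\frac{d^2\phi(k)}{dk^2} = \left[(i\mu - \sigma^2 k)^2 - \sigma^2\right] \exp\left(i\mu k - \tfrac{1}{2} \sigma^2 k^2\right).
\]
From (4.16) we find for the first and second moment
\[
\langle x \rangle = \frac{1}{i} \frac{d\phi(0)}{dk} = \mu, \qquad
\langle x^2 \rangle = \frac{1}{i^2} \frac{d^2\phi(0)}{dk^2} = \mu^2 + \sigma^2
\]
from which it immediately follows that the variance is given by $\langle x^2 \rangle - \langle x \rangle^2 = \sigma^2$.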

Exercise 4.7
From (4.15) and (4.17) we have
\[
\phi(k) = \phi_1(k)\, \phi_2(k) = \exp\left[i(\mu_1 + \mu_2) k - \tfrac{1}{2} (\sigma_1^2 + \sigma_2^2) k^2\right]
\]
which is just the characteristic function of a Gauss with mean $\mu_1 + \mu_2$ and variance $\sigma_1^2 + \sigma_2^2$.

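Exercise 5.2
From differentiating the logarithm of (5.12) we get
\[
\begin{aligned}
-\frac{\partial \ln Z}{\partial \lambda_k} &= -\frac{1}{Z} \frac{\partial Z}{\partial \lambda_k} = -\frac{1}{Z} \sum_{i=1}^{n} m_i \frac{\partial}{\partial \lambda_k} \exp\Big(-\sum_{j=1}^{m} \lambda_j f_{ji}\Big) \\
&= \frac{1}{Z} \sum_{i=1}^{n} f_{ki}\, m_i \exp\Big(-\sum_{j=1}^{m} \lambda_j f_{ji}\Big) = \sum_{i=1}^{n} f_{ki}\, p_i = \beta_k .
\end{aligned}
\]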

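Exercise 6.1
The log posterior of (6.3) is given by
\[
L = \text{Constant} + \frac{1}{2} \sum_{i=1}^{n} \left(\frac{d_i - \mu}{\sigma}\right)^2 .
\]
Setting the first derivative to zero gives
\[
\frac{dL}{d\mu} = -\sum_{i=1}^{n} \frac{d_i - \mu}{\sigma^2} = 0 \ \Rightarrow\ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} d_i .
\]
The Hessian and the square root of its inverse at $\hat{\mu}$ are
\[
\frac{d^2 L}{d\mu^2} = \sum_{i=1}^{n} \frac{1}{\sigma^2} = \frac{n}{\sigma^2} \ \Rightarrow\ \left[\frac{d^2 L(\hat{\mu})}{d\mu^2}\right]^{-1/2} = \frac{\sigma}{\sqrt{n}}.
\]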
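Exercise 6.3
The negative logarithm of the posterior (6.7) is given by
\[
L = \text{Constant} + \frac{n - 1}{2} \ln\left(V + n z^2\right)
\]
where we have set $z = \bar{d} - \mu$. The first derivative with respect to $\mu$ is
\[
\frac{dL}{d\mu} = -\frac{dL}{dz} = -\frac{(n - 1)\, n z}{V + n z^2} = 0 \ \Rightarrow\ \hat{z} = 0 \ \Rightarrow\ \hat{\mu} = \bar{d}.
\]
For the second derivative we find
\[
\frac{d^2 L}{d\mu^2} = \frac{d^2 L}{dz^2} = \frac{(n - 1) n}{V + n z^2} - \frac{2(n - 1) n^2 z^2}{(V + n z^2)^2}
\ \Rightarrow\ H = \frac{d^2 L(\hat{\mu})}{d\mu^2} = \frac{d^2 L(0)}{dz^2} = \frac{(n - 1) n}{V}.
\]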
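Exercise 6.4
By substituting $t = \tfrac{1}{2}\chi^2$ we find for the average (with $\alpha = \nu/2$)
\[
\langle \chi^2 \rangle = \int_0^\infty \chi^2\, p(\chi^2|\nu)\, d\chi^2 = \frac{2}{\Gamma(\alpha)} \int_0^\infty t^{\alpha} e^{-t}\, dt = \frac{2\,\Gamma(\alpha + 1)}{\Gamma(\alpha)} = 2\alpha = \nu .
\]
Likewise, the second moment is found to be
\[
\langle \chi^4 \rangle = \int_0^\infty \chi^4\, p(\chi^2|\nu)\, d\chi^2 = \frac{4}{\Gamma(\alpha)} \int_0^\infty t^{\alpha + 1} e^{-t}\, dt = \frac{4\,\Gamma(\alpha + 2)}{\Gamma(\alpha)} = 4\alpha(\alpha + 1) = \nu(\nu + 2).
\]
Therefore the variance is
\[
\langle \chi^4 \rangle - \langle \chi^2 \rangle^2 = \nu(\nu + 2) - \nu^2 = 2\nu .
\]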

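Exercise 6.5
The negative logarithm of (6.10) is given by
\[
L = \text{Constant} + (n - 1) \ln(\sigma) + \frac{V}{2\sigma^2}
\]
whence we obtain
\[
\begin{aligned}
\frac{dL}{d\sigma} &= \frac{n - 1}{\sigma} - \frac{V}{\sigma^3} = 0 \ \Rightarrow\ \hat{\sigma}^2 = \frac{V}{n - 1} \\
\frac{d^2 L}{d\sigma^2} &= -\frac{n - 1}{\sigma^2} + \frac{3V}{\sigma^4} \ \Rightarrow\ H = \frac{d^2 L(\hat{\sigma})}{d\sigma^2} = \frac{2(n - 1)^2}{V}.
\end{aligned}
\]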

Exercise 6.6
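In case of a polynomial parameterization
\[ f(x; a) = a_1 + a_2 x + a_3 x^2 + a_4 x^3 + \cdots \]
the basis functions are given by $f_\lambda(x) = x^{\lambda-1}$. To give an example, for a quadratic polynomial the equation (6.19) takes the form
\[
\begin{pmatrix}
\sum_i w_i & \sum_i w_i x_i & \sum_i w_i x_i^2 \\
\sum_i w_i x_i & \sum_i w_i x_i^2 & \sum_i w_i x_i^3 \\
\sum_i w_i x_i^2 & \sum_i w_i x_i^3 & \sum_i w_i x_i^4
\end{pmatrix}
\begin{pmatrix} \hat{a}_1 \\ \hat{a}_2 \\ \hat{a}_3 \end{pmatrix}
=
\begin{pmatrix} \sum_i w_i d_i \\ \sum_i w_i d_i x_i \\ \sum_i w_i d_i x_i^2 \end{pmatrix} .
\]

Exercise 6.7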

In case $f(x; a) = a_1$ (fit to a constant) the matrix $W$ and the vector $b$ are one-dimensional:
\[ W = \sum_i w_i, \qquad b = \sum_i w_i d_i . \]
It then immediately follows from Eqs. (6.19) and (6.21) that
\[ \hat{a}_1 = \frac{\sum_i w_i d_i}{\sum_i w_i} \pm \frac{1}{\sqrt{\sum_i w_i}} . \]
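
The normal equations of Exercises 6.6 and 6.7 can be sketched in a few lines of code. The example below is not from the lecture notes: it builds $W$ and $b$ for a polynomial with `n_par` coefficients, assuming weights $w_i = 1/\sigma_i^2$, and solves the linear system with a small Gaussian elimination routine. With `n_par=1` it reduces to the weighted average. The data values and all function names are made up.

```python
import math

def solve_linear(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[r]] for r, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def polynomial_fit(x, d, sigma, n_par):
    """Weighted least-squares fit of a polynomial with n_par coefficients."""
    w = [1.0 / s**2 for s in sigma]   # assumed weights w_i = 1/sigma_i^2
    # W[l][m] = sum_i w_i x_i^(l+m), b[l] = sum_i w_i d_i x_i^l (0-based l, m)
    W = [[sum(wi * xi ** (l + m) for wi, xi in zip(w, x)) for m in range(n_par)]
         for l in range(n_par)]
    b = [sum(wi * di * xi ** l for wi, di, xi in zip(w, d, x)) for l in range(n_par)]
    return solve_linear(W, b), W

x = [0.0, 1.0, 2.0, 3.0]
d = [1.0, 2.0, 5.0, 10.0]             # made-up data lying exactly on 1 + x^2
sigma = [0.1, 0.2, 0.1, 0.2]
coeffs, _ = polynomial_fit(x, d, sigma, n_par=3)

# Fit to a constant: the weighted average and its error
(a1,), W1 = polynomial_fit(x, d, sigma, n_par=1)
a1_err = 1.0 / math.sqrt(W1[0][0])
```

Because the made-up data lie exactly on a parabola, the quadratic fit returns its coefficients regardless of the weights, whereas the constant fit depends on them.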

Exercise 6.8

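The covariance matrix $V$ of $r$ and $s$ is given by (6.29) as
\[ V_{ij} = \sigma_i^2\,\delta_{ij}, \qquad V_{\lambda\mu} = \delta_{\lambda\mu}, \qquad V_{i\lambda} = 0 . \]
Differentiation of (6.28) gives for the derivative matrix $D$
\[ D_{ij} = \frac{\partial d_i}{\partial r_j} = \delta_{ij}, \qquad D_{i\lambda} = \frac{\partial d_i}{\partial s_\lambda} = \Delta_{i\lambda} . \]
Carrying out the matrix multiplication $V_d = D V D^T$ for the covariance matrix $V_d$ of the data, we find
\[
\begin{aligned}
(V_d)_{ij} &= \sum_k \sum_l D_{ik} V_{kl} D^T_{lj} + \sum_\lambda \sum_\mu D_{i\lambda} V_{\lambda\mu} D^T_{\mu j} \\
 &= \sum_k \sum_l \delta_{ik}\,\sigma_k^2\,\delta_{kl}\,\delta_{lj} + \sum_\lambda \sum_\mu \Delta_{i\lambda}\,\delta_{\lambda\mu}\,\Delta_{j\mu} \\
 &= \sigma_i^2\,\delta_{ij} + \sum_\lambda \Delta_{i\lambda}\,\Delta_{j\lambda} .
\end{aligned}
\]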
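
The matrix product can be verified for a small made-up example. The sketch below builds the covariance matrix of the data once from the closed form and once from the explicit product $D V D^T$, where $D$ consists of an identity block followed by the columns $\Delta_{i\lambda}$ and $V$ is diagonal. All names and numbers are illustrative assumptions.

```python
def data_covariance(sigma, delta):
    """Closed form: sigma_i^2 delta_ij + sum_l delta[i][l] * delta[j][l]."""
    n = len(sigma)
    return [[(sigma[i] ** 2 if i == j else 0.0)
             + sum(a * b for a, b in zip(delta[i], delta[j]))
             for j in range(n)] for i in range(n)]

def propagate(sigma, delta):
    """Same matrix from the explicit product D V D^T."""
    n, m = len(sigma), len(delta[0])
    # D = (identity | Delta) has n rows and n + m columns
    D = [[1.0 if i == j else 0.0 for j in range(n)] + list(delta[i]) for i in range(n)]
    # V = diag(sigma_1^2, ..., sigma_n^2, 1, ..., 1)
    diag = [s ** 2 for s in sigma] + [1.0] * m
    return [[sum(D[i][k] * diag[k] * D[j][k] for k in range(n + m))
             for j in range(n)] for i in range(n)]

sigma = [0.5, 0.4, 0.3]                          # made-up statistical errors
delta = [[0.1, 0.0], [0.1, 0.2], [0.1, -0.2]]    # made-up shifts delta[i][lambda]
V_direct = data_covariance(sigma, delta)
V_product = propagate(sigma, delta)
```

The off-diagonal elements are generated entirely by the systematic shifts, which is how correlated errors enter the covariance matrix.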

Exercise 6.9
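From (6.38) we have for the log posterior
\[ L(a, s) = \text{Constant} + \frac{1}{2}\sum_{i=1}^{n} w_i \Big( d_i - t_i(a) - \sum_{\lambda=1}^{m} s_\lambda \Delta_{i\lambda} \Big)^2 + \frac{1}{2}\sum_{\lambda=1}^{m} s_\lambda^2 . \]
Setting the derivative to zero leads to
\[ \frac{\partial L(a, s)}{\partial s_\lambda} = -\sum_{i=1}^{n} w_i (d_i - t_i)\Delta_{i\lambda} + \sum_{\mu=1}^{m} s_\mu \sum_{i=1}^{n} w_i \Delta_{i\lambda}\Delta_{i\mu} + s_\lambda = 0. \]
This equation can be rewritten as
\[ \sum_{\mu=1}^{m} \Big( \delta_{\lambda\mu} + \sum_{i=1}^{n} w_i \Delta_{i\lambda}\Delta_{i\mu} \Big) s_\mu = \sum_{i=1}^{n} w_i (d_i - t_i)\Delta_{i\lambda} . \]
But this is just the matrix equation $S s = b$ as given in (6.40).

Exercise 6.10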

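The best estimate $\hat{\mu}$ of the temperature is, according to Exercise 6.7, given by the weighted average. Setting $d_i = d$ and $w_i = 1/\sigma_i^2 = 1/\sigma^2$, we find for the average and variance of $n$ measurements
\[ \hat{\mu} = \frac{\sum w_i d_i}{\sum w_i} = d \qquad\text{and}\qquad \langle \Delta\hat{\mu}^2 \rangle = \frac{1}{\sum w_i} = \frac{\sigma^2}{n} . \]
1. Offsetting all data points by an amount $\pm\Delta$ gives for the best estimate $\hat{\mu} = d \pm \Delta$. Adding the statistical and systematic deviations in quadrature we thus find from the offset method that
\[ \hat{\mu} = d \pm \sqrt{\frac{\sigma^2}{n} + \Delta^2} \;\rightarrow\; d \pm \Delta \qquad\text{for}\qquad n \rightarrow \infty . \]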

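2. The matrix $S$ and the vector $b$ defined in (6.40) are in this case one-dimensional. We have
\[ S = 1 + n\left(\frac{\Delta}{\sigma}\right)^2, \qquad b = \frac{n(d - \mu)\Delta}{\sigma^2}, \qquad \sum w_i (d_i - t_i)^2 = \frac{n(d - \mu)^2}{\sigma^2} . \]
Inserting this in (6.43) we find, after some algebra,
\[ L(\mu) = \frac{1}{2}\,\frac{n(d - \mu)^2}{\sigma^2 + n\Delta^2}, \qquad \frac{dL(\mu)}{d\mu} = -\frac{n(d - \mu)}{\sigma^2 + n\Delta^2}, \qquad \frac{d^2L(\mu)}{d\mu^2} = \frac{n}{\sigma^2 + n\Delta^2} . \]
Setting the first derivative to zero gives $\hat{\mu} = d$. Taking the error to be the square root of the inverse of the Hessian (the second derivative at $\hat{\mu}$) gives the same result as the offset method:
\[ \hat{\mu} = d \pm \sqrt{\frac{\sigma^2}{n} + \Delta^2} \;\rightarrow\; d \pm \Delta \qquad\text{for}\qquad n \rightarrow \infty . \]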
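We now add a second set of $n$ measurements which do not have a systematic error $\Delta$. The weighted average gives for the best estimate $\hat{\mu}$ and the variance of the combined data
\[ \hat{\mu} = d \qquad\text{and}\qquad \langle \Delta\hat{\mu}^2 \rangle = \frac{\sigma^2}{2n} . \]
1. Offsetting the first set of $n$ data points by an amount $\pm\Delta$ but leaving the second set intact gives for the best estimate $\hat{\mu} = d \pm \Delta/2$. Adding the statistical and systematic deviations in quadrature we thus find from the offset method that
\[ \hat{\mu} = d \pm \sqrt{\frac{\sigma^2}{2n} + \left(\frac{\Delta}{2}\right)^2} \;\rightarrow\; d \pm \frac{\Delta}{2} \qquad\text{for}\qquad n \rightarrow \infty . \]
But for large $n$ this error is larger than the one obtained from the second data set alone:
\[ \hat{\mu} = d \pm \frac{\sigma}{\sqrt{n}} \;\rightarrow\; d \pm 0 \qquad\text{for}\qquad n \rightarrow \infty . \]
In other words, the offset method violates the requirement that the error derived from all available data must always be smaller than that derived from a subset of the data.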
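2. The matrix $\mathbf{S}$ and the vector $\mathbf{b}$ defined in (6.40) are now two-dimensional but with many zero-valued elements, since the systematic error of the second data set is zero. We find
\[ \mathbf{S} = \begin{pmatrix} S & 0 \\ 0 & 1 \end{pmatrix}, \qquad \mathbf{b} = \begin{pmatrix} b \\ 0 \end{pmatrix}, \]
where $S$ and $b$ are defined above. The log posterior of (6.43) is found to be
\[ L(\mu) = \frac{1}{2}\left[ 2n\left(\frac{d - \mu}{\sigma}\right)^2 - b\,S^{-1} b \right] = \frac{1}{2}\, n\left(\frac{d - \mu}{\sigma}\right)^2 + \frac{1}{2}\,\frac{n(d - \mu)^2}{\sigma^2 + n\Delta^2} . \]
Solving the equation $dL(\mu)/d\mu = 0$ immediately yields $\hat{\mu} = d$. The inverse of the second derivative gives an estimate of the squared error. After some straightforward algebra we find
\[ \hat{\mu} = d \pm \sqrt{ \left[ 2 + n\left(\frac{\Delta}{\sigma}\right)^2 \right]^{-1} \left( \frac{\sigma^2}{n} + \Delta^2 \right) } . \]
It is seen that the error vanishes in the limit $n \rightarrow \infty$, as it should.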
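
The difference between the two methods becomes concrete when the errors are tabulated as a function of $n$. The sketch below uses the formulas of this exercise for the combined data (offset method and covariance method) and for the second data set alone; the values of `sigma` and `delta` and the function names are assumptions chosen for illustration.

```python
import math

def offset_error_combined(sigma, delta, n):
    """Offset method: two sets of n points, only the first shifted by +-delta."""
    return math.sqrt(sigma**2 / (2 * n) + (delta / 2) ** 2)

def covariance_error_combined(sigma, delta, n):
    """Square root of the inverse Hessian n/sigma^2 + n/(sigma^2 + n delta^2)."""
    hessian = n / sigma**2 + n / (sigma**2 + n * delta**2)
    return 1.0 / math.sqrt(hessian)

def second_set_only_error(sigma, n):
    """Statistical error of the n measurements without systematic error."""
    return sigma / math.sqrt(n)

sigma, delta = 1.0, 0.5   # made-up values
for n in (1, 10, 100, 10_000):
    print(n,
          offset_error_combined(sigma, delta, n),
          covariance_error_combined(sigma, delta, n),
          second_set_only_error(sigma, n))
```

For large `n` the offset error levels off at `delta / 2`, while the covariance error stays below the error of the second data set alone and tends to zero.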

Exercise 6.11
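In (3.11) we state that the posterior $p(\lambda|d, H_1)$ is Gaussian distributed in the neighborhood of the mode $\hat{\lambda}$. Indeed, that is exactly what we find inserting Eqs. (6.47), (6.48) and (6.49) in Bayes' theorem,
\[ p(\lambda|d, H_1) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\left[ -\frac{1}{2}\left(\frac{\lambda - \hat{\lambda}}{\sigma}\right)^2 \right] . \]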

The approximations made in Eqs. (6.47) and (6.48) are thus consistent with those made in (3.11).

Exercise 6.12
When $H_i$ has $n$ free parameters $\lambda$ and $H_j$ has $m$ free parameters $\mu$, the Occam factor in (6.51) becomes the ratio of multivariate Gaussian normalization factors, see (3.11):
\[ \sqrt{\frac{(2\pi)^m\,|V_\mu|}{(2\pi)^n\,|V_\lambda|}} . \]

Exercise 7.1
For the negative log posterior of the binomial distribution we have
\[ L = \text{Constant} - n\ln(h) - (N - n)\ln(1 - h). \]
The first and second derivatives of $L$ with respect to $h$ are
\[ \frac{dL}{dh} = \frac{hN - n}{h(1 - h)}, \qquad \frac{d^2L}{dh^2} = \frac{h^2 N + n - 2hn}{h^2(1 - h)^2} . \]
From $dL/dh = 0$ we obtain $\hat{h} = n/N$. The Hessian is thus $d^2L(\hat{h})/dh^2 = N/[\hat{h}(1 - \hat{h})]$, from which (7.5) follows.
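
For concrete counts the mode and the width implied by this Hessian are computed as follows. This is a minimal sketch with made-up counts and a hypothetical function name; it requires $0 < n < N$ so that the Hessian is finite.

```python
import math

def binomial_estimate(n, N):
    """Mode h_hat = n/N and width 1/sqrt(H), with H = N / (h_hat (1 - h_hat))."""
    h_hat = n / N
    hessian = N / (h_hat * (1.0 - h_hat))
    return h_hat, 1.0 / math.sqrt(hessian)

h_hat, h_err = binomial_estimate(n=30, N=100)   # made-up counts
```

The width equals $\sqrt{\hat{h}(1 - \hat{h})/N}$, the familiar binomial error.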

Exercise 7.2
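The posterior of $\varepsilon$ for $N$ counts in $N$ events is, from (7.4),
\[ p(\varepsilon|N, N) = (N + 1)\,\varepsilon^N . \]
Integrating the posterior we find for the confidence level $\alpha$
\[ \alpha = \int_a^1 p(\varepsilon|N, N)\, d\varepsilon = 1 - \int_0^a p(\varepsilon|N, N)\, d\varepsilon = 1 - a^{N+1} \quad\Rightarrow\quad a = (1 - \alpha)^{1/(N+1)} . \]
For $N = 4$ and $\alpha = 0.65$ we find $a = 0.81$ so that the result on the efficiency can be reported as
\[ \varepsilon = 1^{\,+0}_{\,-0.19} \qquad (65\%\ \text{CL}). \]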
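
The lower limit is a one-line computation. The sketch below, with a hypothetical function name, evaluates $a = (1 - \alpha)^{1/(N+1)}$ for the values quoted in this exercise.

```python
def lower_limit(N, alpha):
    """Lower limit a such that the posterior (N+1) eps^N integrates to alpha on [a, 1]."""
    return (1.0 - alpha) ** (1.0 / (N + 1))

a = lower_limit(N=4, alpha=0.65)   # values from the exercise; a rounds to 0.81
```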

References
[1] T. Loredo, From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics in Maximum Entropy and Bayesian Methods, ed. P.F. Foug`ere, Kluwer
Academic Publishers, Dordrecht (1990).
[2] G.L. Bretthorst, An Introduction to Parameter Estimation using Bayesian Probability Theory in Maximum Entropy and Bayesian Methods, ed. P.F. Foug`ere,
Kluwer Academic Publishers, Dordrecht (1990).
[3] G.L. Bretthorst, An Introduction to Model Selection using Probability Theory as
Logic in Maximum Entropy and Bayesian Methods, ed. G.L. Heidbreder, Kluwer
Academic Publishers, Dordrecht (1996).
[4] G. DAgostini, Bayesian Inference in Processing Experimental DataPrinciples
and Basic Applications, arXiv:physics/0304102 (2003).
64

[5] D.S. Sivia, Data Analysisa Bayesian Tutorial , Oxford University Press (1997).
[6] P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge
University Press (2005).
[7] E.T. Jaynes, Probability TheoryThe Logic of Science, Cambridge University
Press (2003). See also http://omega.albany.edu:8008/JaynesBook.html (1998).
[8] G. Cowan, Statistical Data Analysis, Oxford University Press (1998).
[9] R.T. Cox, Probability, frequency and reasonable expectation, Am. J. Phys. 14, 1
(1946).
[10] F. James et al. eds., Proc. Workshop on Confidence Limits, CERN Yellow Report
2000-005. See also http://www.cern.ch/cern/divisions/ep/events/clw.
[11] PDG, S. Eidelman et al., Phys. Lett. B592, 1 (2004).
[12] S.F. Gull, Bayesian Inductive Inference and Maximum Entropy in MaximumEntropy and Bayesian Methods in Science and Engineering, G.J. Ericson and
C.R. Smith eds., Vol. I, 53, Kluwer Academic Publishers (1988).
[13] C.E. Shannon, The mathematical theory of communication, Bell Systems Tech. J.
27, 379 (1948).
[14] S.I. Alekhin, Statistical properties of estimators using the covariance matrix , hepex/0005042.
[15] D. Stump et al., Uncertainties of predictions from parton distribution functions,
Phys. Rev. D65 014012 (2002), hep-ph/0101051.
[16] D. Stump, private communication.
[17] G. DAgostini, Nucl. Inst. Meth. A346, 306 (1994);
T. Takeuchi, Prog. Theor. Phys. Suppl. 123, 247 (1996), hep-ph/9603415.
[18] NA49 Collab., C. Alt et al., Upper limit of D 0 production in central Pb-Pb collisions
at 158A GeV , Phys. Rev. C73, 034910 (2006), nucl-ex/0507031.

65
