You are on page 1of 289

w)

Non-Life Insurance:
Mathematics & Statistics

tes

(m

Lecture Notes

NL

no

Mario V. Wthrich
RiskLab Switzerland
Department of Mathematics
ETH Zurich

Version April 14, 2016

Electronic copy available at: http://ssrn.com/abstract=2319328

NL

no

tes

(m

w)

Version April 14, 2016, M.V. Wthrich, ETH Zurich


Electronic copy available at: http://ssrn.com/abstract=2319328

Preface and Terms of Use

tes

(m

w)

Lecture notes. The present lecture notes cover the lecture Non-Life Insurance:
Mathematics & Statistics which is held in the Department of Mathematics at ETH
Zurich. This lecture is a merger of the two lectures Nicht-Leben Versicherungsmathematik and Risk Theory for Insurance. It was taught for its first time in
Spring 2014 at ETH Zurich and in Fall 2014 at University of Bologna (jointly with
Tim Verdonck). The lecture aims at providing a basis in non-life insurance mathematics which forms a core subject of actuarial sciences. After this course, the students are recommended to follow lectures that give a deeper knowledge in different
subjects of non-life insurance mathematics, such as Credibility Theory, Non-Life
Insurance Pricing with Generalized Linear Models, Stochastic Claims Reserving
Methods, Market-Consistent Actuarial Valuation, Quantitative Risk Management,
Data Analytics, etc.

no

Prerequisites. The prerequisites for this lecture are a solid education in mathematics, in particular, in probability theory and statistics.

NL

Terms of Use. These lecture notes are an ongoing project which is continuously
revised and updated. Of course, there may be errors in the notes and there is always
room for improvement. Therefore, I appreciate any comment and/or corrections
that readers may have. However, I would like you to respect the following rules:
These notes are provided solely for educational, personal and non-commercial
use. Any commercial use or reproduction is forbidden.
All rights remain with the author. He may update the manuscript or withdraw the manuscript at any time. There is no right of the availability of any
(old) version of these notes. The author may also change these terms of use
at any time.
The author disclaims all warranties, including but not limited to the use or
the contents of these notes. On using these notes, you fully agree to this.
Citation: please use the SSRN URL.
All pictures and graphs included in these notes are either downloaded from the
internet (open access) or were plotted by the author. If downloaded graphs
3
Electronic copy available at: http://ssrn.com/abstract=2319328

4
violate copyright, I appreciate an immediate note and the corresponding pictures will be removed from these lecture notes.

NL

no

tes

(m

w)

Previous versions.
 September 2, 2013
 December 2, 2013
 August 27, 2014
 June 29, 2015

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Acknowledgment

tes

(m

w)

Writing these notes, I profited greatly from various inspiring as well as ongoing
discussions, concrete contributions and critical comments with and by several people: first of all, the students that have been following our lectures at ETH Zurich
since 2006; furthermore Hans Bhlmann, Christoph Buser, Philippe Deprez, Paul
Embrechts, Farhad Farhadmotamed, Urs Fitze, Markus Gesmann, Alois Gisler,
Laurent Huber, Lukas Meier, Michael Merz, Esbjrn Ohlsson, Gareth Peters, Albert Pinyol i Agelet, Peter Reinhard, Simon Rentzmann, Rodrigo Targino, Teja
Turk, Tim Verdonck, Maximilien Vila, Yitian Yang, Patrick Zchbauer. I especially thank Alois Gisler for providing his lecture notes [54] and the corresponding
exercises.

NL

no

Zurich, April 14, 2016

Mario V. Wthrich

NL

no

tes

(m

w)

Version April 14, 2016, M.V. Wthrich, ETH Zurich

w)

Contents

(m

1 Introduction
1.1 Nature of non-life insurance . . . . . . . . . . . . . . .
1.1.1 Non-life insurance and the law of large numbers
1.1.2 Risk components and premium elements . . . .
1.2 Probability theory and statistics . . . . . . . . . . . . .
1.2.1 Random variables and distribution functions . .
1.2.2 Terminology in statistics . . . . . . . . . . . . .

NL

no

tes

2 Collective Risk Modeling


2.1 Compound distributions . . . . . . . . . . . . .
2.2 Explicit claims count distributions . . . . . . . .
2.2.1 Binomial distribution . . . . . . . . . . .
2.2.2 Poisson distribution . . . . . . . . . . . .
2.2.3 Mixed Poisson distribution . . . . . . . .
2.2.4 Negative-binomial distribution . . . . . .
2.3 Parameter estimation . . . . . . . . . . . . . . .
2.3.1 Method of moments . . . . . . . . . . .
2.3.2 Maximum likelihood estimators . . . . .
2.3.3 Example and 2 -goodness-of-fit analysis

3 Individual Claim Size Modeling


3.1 Data analysis and descriptive statistics . . . . .
3.2 Selected parametric claims size distributions . .
3.2.1 Gamma distribution . . . . . . . . . . .
3.2.2 Weibull distribution . . . . . . . . . . .
3.2.3 Log-normal distribution . . . . . . . . .
3.2.4 Log-gamma distribution . . . . . . . . .
3.2.5 Pareto distribution . . . . . . . . . . . .
3.3 Model selection . . . . . . . . . . . . . . . . . .
3.3.1 Kolmogorov-Smirnov test . . . . . . . .
3.3.2 Anderson-Darling test . . . . . . . . . .
3.3.3 Goodness-of-fit and information criteria .
3.4 Calculating within layers for claim sizes . . . . .
7

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

11
11
11
13
14
14
21

.
.
.
.
.
.
.
.
.
.

23
23
25
26
29
36
37
40
41
45
47

.
.
.
.
.
.
.
.
.
.
.
.

53
53
58
59
64
66
70
73
79
79
82
83
86

Contents
3.4.1
3.4.2

Claim size modeling using layers . . . . . . . . . . . . . . . .


Re-insurance layers and deductibles . . . . . . . . . . . . . .

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

(m

.
.
.
.
.
.

.
.
.
.
.
.

tes

5 Ruin Theory in Discrete Time


5.1 Net profit condition . . . . . . . .
5.2 Lundberg bound . . . . . . . . .
5.3 Pollaczek-Khinchin formula . . .
5.3.1 Ladder epochs . . . . . . .
5.3.2 Cramr-Lundberg process
5.4 Subexponential claim sizes . . . .

no

6 Premium Calculation Principles


6.1 Simple risk-based principles . . . . . . . . . . . . . .
6.2 Advanced premium calculation principles . . . . . . .
6.2.1 Utility theory pricing principles . . . . . . . .
6.2.2 Esscher premium . . . . . . . . . . . . . . . .
6.2.3 Probability distortion pricing principles . . . .
6.2.4 Cost-of-capital principles using risk measures .
6.2.5 Deflator based pricing principles . . . . . . . .
7 Tariffication and Generalized Linear Models
7.1 Simple tariffication methods . . . . . . . . . .
7.2 Gaussian approximation . . . . . . . . . . . .
7.2.1 Maximum likelihood estimation . . . .
7.2.2 Goodness-of-fit analysis . . . . . . . .
7.3 Generalized linear models . . . . . . . . . . .
7.3.1 GLM for Poisson claims counts . . . .
7.3.2 GLM for gamma claim sizes . . . . . .
7.3.3 Variable reduction analysis . . . . . . .
7.3.4 Claims frequency example . . . . . . .

NL

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

w)

4 Approximations for Compound Distributions


4.1 Approximations . . . . . . . . . . . . . . . . . . . . . . . .
4.1.1 Normal approximation . . . . . . . . . . . . . . . .
4.1.2 Translated gamma and log-normal approximations .
4.1.3 Edgeworth approximation . . . . . . . . . . . . . .
4.2 Algorithms for compound distributions . . . . . . . . . . .
4.2.1 Panjer algorithm . . . . . . . . . . . . . . . . . . .
4.2.2 Fast Fourier transform . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

86
88

.
.
.
.
.
.
.

93
93
94
97
100
105
105
116

.
.
.
.
.
.

121
121
125
129
129
131
133

.
.
.
.
.
.
.

141
142
144
144
154
156
159
164

.
.
.
.
.
.
.
.
.

167
170
174
174
177
182
185
186
189
192

8 Bayesian Models and Credibility Theory


201
8.1 Exact Bayesian models . . . . . . . . . . . . . . . . . . . . . . . . . 202
8.1.1 Poisson-gamma model . . . . . . . . . . . . . . . . . . . . . 203
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Contents

8.2

8.1.2
Linear
8.2.1
8.2.2
8.2.3
8.2.4

9
Exponential dispersion family with conjugate priors
credibility estimation . . . . . . . . . . . . . . . . .
Bhlmann-Straub model . . . . . . . . . . . . . . .
Bhlmann-Straub credibility formula . . . . . . . .
Estimation of structural parameters . . . . . . . . .
Prediction error in the Bhlmann-Straub model . .

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

tes

(m

w)

9 Claims Reserving
9.1 Outstanding loss liabilities . . . . . . . . . . . . . . . .
9.2 Claims reserving algorithms . . . . . . . . . . . . . . .
9.2.1 Chain-ladder algorithm . . . . . . . . . . . . . .
9.2.2 Bornhuetter-Ferguson algorithm . . . . . . . . .
9.3 Stochastic claims reserving methods . . . . . . . . . . .
9.3.1 Gamma-gamma Bayesian CL model . . . . . . .
9.3.2 Over-dispersed Poisson model . . . . . . . . . .
9.4 Claims development result . . . . . . . . . . . . . . . .
9.4.1 Definition of the claims development result . . .
9.4.2 One-year uncertainty in the Bayesian CL model
9.4.3 The full picture of run-off uncertainty . . . . . .

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

NL

no

10 Solvency Considerations
10.1 Balance sheet and solvency . . .
10.2 Risk modules . . . . . . . . . .
10.3 Insurance liability variables . .
10.3.1 Market-consistent values
10.3.2 Insurance risk . . . . . .

Version April 14, 2016, M.V. Wthrich, ETH Zurich

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.

207
212
213
214
218
221

.
.
.
.
.
.
.
.
.
.
.

225
226
232
232
236
237
239
247
249
249
251
257

.
.
.
.
.

263
263
267
270
270
271

Contents

NL

no

tes

(m

w)

10

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 1

1.1.1

Nature of non-life insurance

(m

1.1

w)

Introduction

Non-life insurance and the law of large numbers

NL

no

tes

Insurance originates from a general demand of society who asks for protection
against unforeseeable events which might cause serious (financial) damage to individuals and society. Insurance organizes the financial protection against such
unforeseeable (random) events, meaning that it takes care of the financial replacements of the (potential) damage. The general idea is to build a community (collective) to which everybody contributes a certain amount (fixed deterministic premium1 ) and then the (potential) financial damage is financed by the means of this
community.

In special cases, for instance in re-insurance or accident insurance, the premium can also have
a random part. This is not further discussed here.

11

12

Chapter 1. Introduction

The basic features of such communities are that every member faces similar risks.
By building such communities the individual members profit from diversification
benefits in the form of a law of large numbers that applies to the community.
Insurance companies organize the fair distribution within the community.

NL

no

tes

(m

w)

Modern insurance is traced back to the Great


Fire of London in 1666 which has destroyed a
big part of the city of London. This event has
initiated fire insurance protection against such
disastrous events. Today, fire insurance belongs
to the branch of non-life insurance which is also
known as property and casualty insurance in
Great Fire of London 1666
the US and general insurance in the UK and
Australia. Non-life insurance comprises car insurance, liability insurance, property
insurance, accident and health insurance, marine insurance, credit insurance, legal
protection insurance, travel insurance and other similar products. Insurance contracts for these types of products have in common that they specify an insurance
period (typically of one year). Then all insured (random) events that occur within
this insurance period and which are causing financial damage to which the insurance contract applies are indemnified. Such random payments caused by insured
events are called insurance claims.
Typically, the insurance premium for these contracts is paid at the beginning of the insurance
period (upfront). To determine this insurance
premium, the insurance company pools similar
risks whose individual insurance claims can be
described by a sequence Y1 , . . . , Yn , n N, of
random variables. These insurance claims Yi
are random at the beginning of the insurance
J. Bernoulli
period and therefore need to be described with
probability theory. Assume we have a probability space (, F, P) and Y1 , . . . , Yn are uncorrelated and identically distributed
random variables on that probability space with finite mean = E[Y1 ]. In that
case we can apply the weak law of large numbers (LLN) which says that for all
>0

"
#
n

1 X


Yi = 0.
(1.1)
lim P

n
n
i=1
Basically, this means that the total claim amount becomes more predictable with
increasing portfolio size n and, therefore, we can calculate the insurance premium
quite accurately for large portfolio sizes n because this provides the required equal
balance. The weak law of large numbers is therefore considered to be a theoretical cornerstone of insurance. It goes back to the Swiss mathematician Jakob
Bernoulli (1655-1705) of the famous Bernoulli family and was first published in
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 1. Introduction

13

his path-breaking work Ars Conjectandi which has appeared in 1713, eight years
after his death, see Bolthausen-Wthrich [15].

Pn

Y n
i
N (0, 1)
n

A. De Moivre

as

n ,

(1.2)

(m

i=1

w)

For independent and identically distributed random variables


Y1 , Y2 , . . . with finite variances 2 the weak law of large numbers can further be refined by Chebychevs inequality which
provides rates of convergence and by the central limit theorem
(CLT) which provides the asymptotic limit distribution. The
CLT states under the above assumptions that we have the following convergence in distribution

1.1.2

no

tes

i.e. in the limit (and under appropriate scaling) we obtain a


standard Gaussian distribution. The crucial feature is that the

denominator only increases of order n, i.e. it increases at a


slower rate than n. This exactly implies that the total claim
amount of the portfolio becomes predictable in the limit because the relative confidence bounds get narrower the bigger the
portfolio becomes. These are the basics why insurance works.
The CLT goes back to Abraham De Moivre (1667-1754) who
P.-S. Laplace
published a first article on the CLT in 1733 based on coin tossing, this was way ahead of time, and to Pierre-Simon Laplace (1749-1827) who
provided an extension in 1812, see also page 94 below.

Risk components and premium elements

NL

Insurance contracts involve many different risky components. We briefly present


them from the insurance company point of view.
1. Pure randomness: The outcomes of the claims Yi are uncertain/random. This
risk is taken care of by the volume n of the insurance portfolio (as described
in (1.1) and (1.2)). That is, this risk can be controlled in a sufficient way if
the insurance portfolio is large.
2. Model risk: The description of the randomness of variables Yi , described in
the previous item, is always based on a stochastic model, i.e. we describe the
random outcomes in a model world. This modeling should have the minimal
requirement that it characterizes the nature of Yi in a sufficiently accurate
way. However, typically model risk arises because our model description does
not perfectly describe real world behavior. There are different things that
may go wrong in this modeling task:
Version April 14, 2016, M.V. Wthrich, ETH Zurich

14

Chapter 1. Introduction
(a) the model world does not provide an appropriate description of real world
behavior;
(b) the parameters in the chosen model are misspecified;
(c) risk factors change over time so that past observations do not appropriately describe what may happen in the future (non-stationarity), of
course, this is closely related to (a) and (b).

+ pure risk premium = E[Yi ]

(m

w)

In practice, these uncertainties (including pure randomness) ask for a risk loading
(risk margin) beyond the pure risk premium defined by = E[Yi ]. The aim of this
risk loading is to provide financial stability. We will describe this in detail below
in Chapters 5, 6 and 10.
We close this section by describing the premium elements that are considered for
insurance premium calculation:

+ risk margin to protect against the risks mentioned above


+ profit margin

tes

financial gains on investments


+ sales commissions to agents

+ taxes

no

+ other administrative expenses

1.2
1.2.1

NL

The sum of all these items specifies the insurance premium. Non-life insurance
mathematics and statistics typically studies the first two items. This is part of the
program of the subsequent chapters.

Probability theory and statistics


Random variables and distribution functions

In this section we briefly recall the crucial notation and key results of probability
theory used in these notes. We denote the underlying probability space by (, F, P)
and assume throughout that this probability space is sufficiently rich so that it
carries all the objects that we are going to consider.
Random variables on this probability space (, F, P) are denoted by capital letters X, Y, S, N, . . . and the corresponding observations are denoted by small letters
x, y, s, n, . . .. That is, x constitutes a realization of X. Random vectors are denoted by boldface, e.g., X = (X1 , . . . , Xd )0 and the corresponding observation by
x = (x1 , . . . , xd )0 for a given dimension d N. Since there is broad similarity
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 1. Introduction

15

between random variables and random vectors, we restrict to random variables for
introducing the crucial terminology from probability theory.
Random variables X are characterized by (probability, cumulative) distribution
functions F = FX : R [0, 1], meaning that for all x R
F (x) = FX (x) = P [X x] [0, 1]

w)

denotes the probability that X has an outcome less or equal to x. In general, we


drop the subscript in the distribution function F = FX and we simply write X F
for X having distribution function F ; a (cumulative) distribution function is a rightcontinuous, non-decreasing function with limx F (x) = 0 and limx F (x) = 1.

pk = P [X = k] > 0

(m

We distinguish two important types of random variables:


(i) a random variable X F is called discrete if F is a step function with countably
many steps in discrete points k A R. In this case we write
for k A,

with kA pk = 1. We call pk probability weight of X in k A;


(ii) a random variable X F is called absolutely continuous if there exists a
measurable function f 0 with f = F 0 , i.e.
F (x) =

tes

Z x

for all x R.

f (y) dy

no

This function f is called density of X and in that case we also use the terminology
X f.
Assume X F and h : R R is a sufficiently nice measurable function. We
define the expected value of h(X) by

kA h(k) pk

if X is discrete,

if X is absolutely continuous.

h(x) dF (x) =

NL

E [h(X)] =

h(x)f (x) dx

The middle term uses the general framework of the Riemann-Stieltjes integral
R
R h dF (and in fact the second equality is not an identity because the middle
term is more general than the right-hand side). The sufficiently nice refers to the
fact that E [h(X)] is only defined upon existence. The most important functions h
in our analysis define the following moments (based upon existence):
mean, expectation, expected value or first moment of X F
X = E [X] =

x dF (x);

k-th moment of X F
h

E Xk =

xk dF (x);

Version April 14, 2016, M.V. Wthrich, ETH Zurich

16

Chapter 1. Introduction
variance of X F
h

2
X
= Var (X) = E (X E[X])2 = E X 2 E [X]2 0;

standard deviation and coefficient of variation of X F


and

skewness of X F

E (X E[X])3
3
X

for E[X] > 0;

(m
w

X =

X
E[X]

Vco(X) =

X = Var (X)1/2

moment generating function of X F at position r R


MX (r) = E [exp {rX}] =

exp {rx} dF (x).

The moment generating function is crucial to identify the properties of random


variables X, see Lemmas 1.2, 1.3 and 1.4 below.

no

tes

Lemma 1.1. Choose X F and assume that there exists r0 > 0 such that
MX (r) < for all r (r0 , r0 ). Then MX (r) has a power series expansion
for r (r0 , r0 ) with
h
i
X rk
MX (r) =
E Xk .
k0 k!

NL

Proof. Note that it suffices to choose r (r0 , r0 ) with r 6= 0. Since e|rx| erx + erx , the
assumptions imply the following integrability E [exp {|rX|}] < . This implies that E[|X|k ] <
for all k N because |x|k is dominated by e|rx| for sufficiently large |x|. It also implies that
Pm
the partial sums |fm (x)| = | k=0 (rx)k /k!| are uniformly bounded by the integrable (w.r.t. dF )
P
function k0 |rx|k /k! = e|rx| . This allows to apply the dominated convergence theorem which
provides
m
h
i
X
rk  k 
E X = lim E [fm (X)] = E lim fm (X) = MX (r).
lim
m
m
m
k!
k=0

This proves the lemma.

Lemma 1.1 implies that the power series converges for all r (r0 , r0 ) for given
r0 > 0 and, thus, we have a strictly positive radius of convergence 0 > 0. A
standard result from analysis implies that in the interior of the interval [0 , 0 ]
we can differentiate MX () arbitrarily often (term by term of the power series) and
the derivatives at the origin are given by
h
i
dk
k
M
(r)|
=
E
X
<
X
r=0
drk

for k N0 .

Version April 14, 2016, M.V. Wthrich, ETH Zurich

(1.3)

Chapter 1. Introduction

17

Lemma 1.2. Choose a random variable X F and assume that there exists r0 > 0
such that MX (r) < for all r (r0 , r0 ). Then the distribution function F of X
is completely determined by its moment generating function MX .
Proof. The existence of a strictly positive radius of convergence 0 implies that all moments
of X exist and that they are directly determined by the moment generating function via (1.3).
Theorem 30.1 of Billingsley [13] then implies that there is at most one distribution function F
which has the same moments (1.3) for all k N.
2

w)

For one-sided random variables the statement even holds true in general:
Lemma 1.3. Assume X 0, P-a.s. The distribution function F of X is completely
determined by its moment generating function MX .
2

(m

Proof. See Section 22 of Billingsley [13], in particular Theorem 22.2.

Lemma 1.3 gives for two random variables X F and Y G


with X 0 and Y 0, P-a.s., the following implication

(d)

X = Y.

tes

MX MY

This property is often used to identify distribution functions.

no

Lemma 1.4. Assume that the random variables Xn , n N, P.L. Chebychev


and X have finite moment generating functions MXn , n N,
and MX on a common interval (r0 , r0 ) with r0 > 0. Suppose
limn MXn (r) = MX (r) for all r (r0 , r0 ). Then (Xn )n converges in distribution to X, write Xn X for n .

NL

Proof. See Section 30 of Billingsley [13]. Basically, Chebychevs inequality


implies tightness of the underlying probability measures from which the
convergence in distribution is derived.
2

The Pafnuty Lvovich Chebychev (1821-1894) inequality is


sometimes also called Andrey Andreyevich Markov (18561922) inequality. Chebychev was Markovs teacher and the inequality first appeared in the work of Chebychev. Note that
there are different spellings of Chebychev such as Tchebysheff,
etc.

A.A. Markov

Example 1.5 (Gaussian distribution). Assume X N (, 2 ) has a Gaussian


distribution with parameters R and 2 > 0. X is an absolutely continuous
random variable with density f (x) for x R given by
Version April 14, 2016, M.V. Wthrich, ETH Zurich

18

Chapter 1. Introduction
1
1 (x )2
f (x) =
exp
.
2
2
2
(

The moment generating function of X N (, 2 ) is given by


n

MX (r) = exp r + r2 2 /2 <

for r R.

(1.4)

This moment generating function is obtained by direct calculation completing the


square. Observe that MX () is finite on R and, thus, all moments exist and



d
1
= ,
MX (r)|r=0 = exp r + r2 2 + r 2
r=0
dr
2

w)

X = E [X] =

and for the second moment we obtain


E X




d2
1 2 2 
= 2 + 2 .
( + r 2 )2 + 2
= 2 MX (r)|r=0 = exp r + r
r=0
dr
2

(m

This implies for the variance of Gaussian distributions


h

2
X
= Var(X) = E X 2 E [X]2 = 2 .

tes

Moreover, any random variable Y that has moment generating function of the form
(1.4) is Gaussian with mean Y = and variance Y2 = 2 , see Lemma 1.2.

Exercise 1 (Gaussian distribution).

no

(a) Assume X N (0, 1). Prove that a + bX N (a, b2 ) for a, b R.


(b) Assume that Xi are independent and Xi N (i , i2 ). Prove that
P
P
N ( i i , i i2 ).

Xi

NL

(c) Assume X N (0, 1). Prove that E[X 2k+1 ] = 0 for all k N0 .

The Gaussian distribution is named after


Carl Friedrich Gauss (1777-1855).
He was one of the greatest mathematicians and has contributed to many different fields in mathematics and physics.
We recommend the novel of Kehlmann
[65] that fictitiously describes the lives of
Carl Friedrich Gauss and of the natural scientist Alexander von Humboldt (1769-1859).

C.F. Gauss

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 1. Introduction

19

Often we do not directly consider the moment generating function MX of a random


variable X but rather its logarithm. The cumulant generating function of X is given
by
log MX (r) = log E [exp {rX}] .
Assume that MX is finite on (r0 , r0 ) with r0 > 0. We have
MX0 (r)

= E [X] = X ,
MX (r) r=0

d
log MX (r)|r=0 =
dr

d2
MX00 (r)MX (r) (MX0 (r))2
2

= Var (X) = X
,
log
M
(r)|
=
X
r=0

dr2
(MX (r))2
r=0
h
i
d3
3
3
log
M
(r)|
=
E
(X

E[X])
= X X
.
X
r=0
dr3

(1.5)

(m

w)

Lemma 1.6. Assume that MX is finite on (r0 , r0 ) with r0 > 0. Then log MX ()
is a convex function on (r0 , r0 ).
Proof. In order to prove convexity we calculate the second derivative at position r (r0 , r0 )
=
=

00
0
00
MX
(r)MX (r) (MX
(r))2
MX
(r)
=

2
(MX (r))
MX (r)

 !2


E XerX
E X 2 erX

.
E [erX ]
E [erX ]

tes

d2
log MX (r)
dr2

no

Define the new function Fr by

1
Fr (x) =
MX (r)

0
MX
(r)
MX (r)

2

ery dF (y).

(1.6)

Observe that Fr is a distribution function. Thus, we can choose a random variable Xr Fr


whose variance is given by


E X 2 erX

E[Xr ] =
E [erX ]

NL

0 Var(Xr ) =

E[Xr2 ]


 !2
E XerX
d2
= 2 log MX (r).
rX
E [e ]
dr
2

This proves the claim.

Remark. The distribution function Fr defined in (1.6) gives the Esscher measure
of F . The Esscher measure has been introduced by Bhlmann [19] for a new
premium calculation principle. We come back to this in Section 6.2.2, below.

The next formula is often used: Assume that X F is non-negative, P-a.s., and
has finite first moment. Then we have identity
E[X] =

Z
0

x dF (x) =

Z
0

[1 F (x)] dx =

Z
0

Version April 14, 2016, M.V. Wthrich, ETH Zurich

P [X > x] dx.

20

Chapter 1. Introduction

The proof uses integration by parts and the result says that we can calculate
expected values from survival functions F (x) = 1 F (x) = P[X > x]. Survival
functions will be important for the study of the fatness of the tails of distribution
functions. This plays a crucial role for the modeling of large claims.

w)

Often we deal with sequences X1 , X2 , . . . of random variables which are independent


and identically distributed (i.i.d.) with distribution function F . In this case we use
i.i.d.
the notation X1 , X2 , . . . F .
Another property that is going to be used quite frequently is the so-called tower
property, see Williams [97]. It states that for any sub--algebra G F on our
probability space (, F, P) we have for any integrable random variable X F

(m

E [X] = E [E [X| G]] .

(1.7)

In particular, if X and Y are two random variables on (, F, P) we have


E [X] = E [E [X| Y ]] ,

tes

where E[X|Y ] is an abbreviation for E[X|(Y )] with (Y ) F denoting the algebra generated by the random variable Y . Assume that X is square integrable
then tower property (1.7) implies
(1.8)

no

Var(X) = E [Var (X| G)] + Var (E [ X| G]) .

We have mentioned above that distribution functions F are right-continuous and


non-decreasing. This allows to define the left-continuous generalized inverse of F
by
F (p) = inf {x; F (x) p} ,

NL

where we use convention inf = . For p (0, 1), F (p) is often called the
p-quantile of X F . The generalized inverse F is only tricky at places where F
has a discontinuity or where F is not strictly increasing. It satisfies the following
properties, see Proposition A3 in McNeil et al. [77],
1. F is non-decreasing and left-continuous.
2. F is continuous iff F is strictly increasing.
3. F is strictly increasing iff F is continuous.
4. (If F is right-continuous, then) F (x) z iff F (z) x.
5. F (F (x)) x.
6. F (F (z)) z.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 1. Introduction

21

7. If F is strictly increasing, then F (F (x)) = x.


8. If F is continuous, then F (F (z)) = z.

1.2.2

Terminology in statistics

w)

Items 4. to 8. need F (z) < . Note that the first part of item 4. is put in brackets
because distribution functions are right-continuous. However, generalized inverses
can also be defined for functions that are not right-continuous (as long as they are
non-decreasing) and then the condition in the bracket of item 4. is needed.

tes

(m

Often we face the problem that we need to predict the outcome of a random varic For
able X F . This problem is solved by specifying an appropriate predictor X.
c = = E[X]. On the other hand a distriinstance, we can choose as predictor X
X
bution function F often involves unknown parameters. These unknown parameters
need to be estimated, for instance, using past experience and expert opinion. For
example, we can estimate the (unknown) mean X of X by an estimator b X . If we
c=
b X for predicting X, then
b X serves at the same time
now choose predictor X
as estimator for X and as predictor for X. In this sense we obtain an estimation
error which is specified by the difference X b X and we obtain a prediction error
which is characterized by the following difference
c = X
b X = (X X ) + (X
bX ) .
X X

(1.9)

no

The second term on the right-hand side of (1.9) specifies the estimation error and
the first term on the right-hand side of (1.9) is often called pure process error which
is due to the stochastic nature of X, see also Section 9.3.

NL

Statistical tests deal with the problem of making decisions. Assume we have an
observation x of a random vector X F with given but unknown parameter
which lies in a given set of possible parameters. The aim is to test whether
the (true, unknown) parameter that has generated x may belong to some subset
0 . In the simplest case we have a singleton 0 = {0 }. Assume that we
would like to check whether x may have been generated by a given parameter 0 .
Null hypothesis H0 : = 0 .
(Two-sided) alternative hypothesis H1 : 6= 0 .
We then build a test statistics T (X) whose distribution function is known under
the null hypothesis H0 and we consider the question whether T (x) takes an unlikely
value under the null hypothesis. Therefore one chooses a significance level q (0, 1)
(typically 5% or 1%) and for this significance level one chooses a critical region Cq
with P[T (X) Cq ] q (under the null hypothesis). The null hypothesis is then
rejected if T (x) falls into this critical region. In practice, one often calculates the
Version April 14, 2016, M.V. Wthrich, ETH Zurich

22

Chapter 1. Introduction

so-called p-value. This denotes the critical probability at which the null hypothesis
is just rejected (for one-sided unbounded intervals). For instance, if we choose a
significance level of 5% and the resulting p-value of T (x) is less or equal to 5% then
the test rejects the null hypothesis on the 5% significance level.
Exercise 2 (2 -distribution). Assume that Xk has a 2 -distribution with k N
degrees of freedom, i.e. Xk is absolutely continuous with density
1
xk/21 exp {x/2} ,
k/2
2 (k/2)

for x 0.

w)

f (x) =

(a) Prove that f is a density. Hint: see Section 3.3.3 and proof of Proposition
2.20.

(m

(b) Prove
MXk (r) = (1 2r)k/2
(d)

for r < 1/2.

(c) Choose Z N (0, 1) and prove Z 2 = X1 .


i.i.d.

Pk

i=1

(d)

Zi2 = Xk and calculate the first

NL

no

tes

(d) Choose Z1 , . . . , Zk N (0, 1). Prove


two moments of the latter.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2

w)

Collective Risk Modeling

(m

The aim of this chapter is to describe the probability distribution of the total claim
amount S that an insurance company faces within a fixed time period. For the
time period we take one (accounting) year. Assume that N counts all claims that
occur within this fixed accounting year. The total claim amount is then given by
S = Y1 + Y2 + . . . + YN =

N
X

Yi ,

tes

i=1

2.1

NL

no

where Y1 , . . . , YN models the individual claim sizes. If we are at the beginning


of this accounting year then neither the number of claims N nor the individual
claim sizes Y1 , . . . , YN are known. Therefore, we model all these unknowns with
random variables that characterize the possible outcomes of the total claim amount
S (which, of course, then is also a random variable). We call such models for S
collective risk models because we consider the whole portfolio as a collective. We
hope to discover a law of large numbers for the total insurance portfolio so that the
insurance company can benefit from diversification benefits (between individual
risks) that allow to predict possible outcomes of S (more) accurately.

Compound distributions

The starting point of the modeling of S is a compound distribution. This compound


distribution is based on rather strong model assumptions on the one hand, but on
the other hand it already leads to a good description and understanding of the
possible outcomes of the total claim amount S.
Model Assumptions 2.1 (compound distribution). The total claim amount S is
given by the following compound distribution
S = Y1 + Y2 + . . . + YN =

N
X
i=1

with the three standard assumptions


23

Yi ,

24

Chapter 2. Collective Risk Modeling


1. N is a discrete random variable which only takes values in A N0 ;
i.i.d.

2. Y1 , Y2 , . . . G with G(0) = 0;
3. N and (Y1 , Y2 , . . .) are independent.
Remarks.
If S satisfies these three standard assumptions from Model Assumptions 2.1
we say that S has a compound distribution.

(m
w

The first assumption of the compound distribution says that the number of
claims N takes only non-negative integer values. The event {N = 0} means
that no claim occurs which provides a total claim amount of S = 0.

tes

The second assumption means that the individual claims Yi do not affect
each other, for instance, if we face a large first claim Y1 this does not give
any information for the remaining claims Yi , i 2. Moreover, we have
homogeneity in the sense that all claims have the same marginal distribution
function G with
0 = G(0) = P [Y1 0] ,
i.e. the individual claim sizes Yi are strictly positive, P-a.s. We use synonymous the terminology (individual) claim size, (individual) claim and claims
severity for Yi .

no

Finally, the last assumption says that the individual claim sizes are not affected by the number of claims and vice versa, for instance, if we observe
many claims this does not contain any information whether these claims are
of smaller or larger size.

NL

This compound distribution is the base model for collective risk modeling and we
are going to describe different choices for the claims count distribution of N and
for the individual claim size distributions of Yi . We start with the basic recognition
features of compound distributions.
Proposition 2.2. Assume S has a compound distribution. We have
E[S] = E[N ] E[Y1 ],
Var(S) = Var(N ) E[Y1 ]2 + E[N ] Var(Y1 ),
s
1
Vco(S) =
Vco(N )2 +
Vco(Y1 )2 ,
E[N ]
MS (r) = MN (log(MY1 (r)))
for r R,
whenever they exist.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling

25

Proof. Using the tower property (1.7) we obtain for the mean of S
##
"N
"N
#
"N
#
#
" " N
X
X
X
X
E[S] = E
Yi = E E
= E
E [ Yi | N ] = E
E [Yi ]
Yi N

i=1

i=1

i=1

i=1

E [N E [Y1 ]] = E [N ] E [Y1 ] .

i=1

Var

N
X

i=1

!
E [Yi ]

"
+E

i=1

!#


Yi N

= Var (N ) E [Y1 ] + E [N ] Var (Y1 ) .

Var (Yi )

i=1

(m

i=1

N
X

N
X

w)

For the second statement we have, see also (1.8),


#!
"
!
" N
N
X
X
Var(S) = Var
Yi
+ E Var
= Var E
Yi N

i=1
i=1
!
"N
#
N
X
X
= Var
E [ Yi | N ] + E
Var ( Yi | N )

Finally, for the moment generating function we have


##
"N
#
"
( N
)#
" " N

Y
X
Y

= E
E [ exp {rYi }| N ]
MS (r) = E exp r
Yi
= E E
exp {rYi } N

i=1
i=1
i=1


= E MY1 (r)N = E [exp {N log(MY1 (r))}] = MN (log(MY1 (r))).
2

tes

This proves the proposition.

Under Model Assumptions 2.1 the distribution function of S can be written as

" N
X

kA

i=1

" k
X

Yi




x N

#

no

P [S x] =

kA

= k P [N = k]

Yi x P [N = k] =

i=1

(2.1)
Gk (x) P [N = k] ,

kA

NL

where Gk denotes the k-th convolution of the distribution function G. In partici.i.d.


ular, we have for Y1 , Y2 G
P [Y1 + Y2 x] =

G(x y) dG(y) = G2 (x).

With formula (2.1) we obtain a closed form solution for the distribution function
of S. However, in general, this formula is not useful due to the computational
complexity of calculating Gk for too many k A. We present other solutions
for the calculation of the distribution function of S. These involve simulations,
approximations and smart analytic techniques under additional model assumptions.

2.2

Explicit claims count distributions

In this section we give explicit distribution functions for the number of claims N
modeling. The three most commonly used distribution functions are the binomial
Version April 14, 2016, M.V. Wthrich, ETH Zurich

26

Chapter 2. Collective Risk Modeling

(m

w)

distribution, the Poisson distribution and the negative-binomial distribution. Our


aim is to present these three distribution functions, describe the properties of the
resulting compound distributions, and discuss parameter estimation. These three
distribution functions constitute the family of Panjer distributions, see Lemma 4.7
below.
In a non-life insurance context the claims count random variable N should always
be understood in relation to an underlying (deterministic) volume v > 0. Therefore, we consistently use a volume measure to describe N . Often this is not done
in the related literature. The volume measure will become especially important for
the study of diversification benefits, parameter estimation and in the evaluation of
parameter uncertainty. The volume measure can be of different nature depending
on the insurance business considered and one should always choose the most appropriate one. Typical volume measures are: number of insured persons, number
of policies, number of risks. But in health and accident insurance it could also be
the aggregated wages insured or in fire insurance the total insured value. To make
language simple we interpret v > 0 as the number of risks insured. On the other
side, N counts the number claims. The ratio N/v is called claims frequency and
the expected number of claims is given by

tes

E[N ] = v,

where > 0 denotes the expected claims frequency. Under these premises we would
like to describe the probability weights
for k A N0 .

2.2.1

no

pk = P [N = k]

Binomial distribution

NL

For the binomial distribution we choose a fixed volume v N and a fixed default
probability p (0, 1) (expected claims frequency).
We say N has a binomial distribution, write N Binom(v, p), if
v
pk = P [N = k] =
k

pk (1 p)vk

for all k {0, . . . , v} = A.

The binomial formula provides kA pk = 1, see e.g. Section 5.3 in Merz-Wthrich


[79], and, hence, we have a discrete distribution function on the set A = {0, . . . , v}.
The special case v = 1 is called Bernoulli distribution or Bernoulli experiment, write
N Bernoulli(p), and reflects the coin tossing experiment
(

P [N = k] =

1p
p

for k = 0,
for k = 1.

This describes whether a single risk defaults or not.


Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling

27

Proposition 2.3. Assume N Binom(v, p) for fixed v N and p (0, 1). Then
s

Var(N ) = vp(1 p),

E[N ] = vp,

MN (r) = (per + (1 p))v

Vco(N ) =

1p
,
vp

for all r R.

w)

Proof. We calculate the moment generating function and then the first two moments follow from
formula (1.5). For the moment generating function we have
 
X
X v 
k
rk v
k
vk
MN (r) =
e
p (1 p)
=
(per ) (1 p)vk
k
k
kA
kA
k 
vk
X v  
per
1p
r
v
= (pe + (1 p))
.
k
per + (1 p)
per + (1 p)
kA

(m

The last sum is again a summation over probability weights pk , k A, of a binomial distribution
with default probability p = (per )/(per + (1 p)) (0, 1). Therefore it adds up to 1 which
completes the proof.
2

Next we give a second characterization of the binomial distribution which leads to


the interpretation of the binomial distribution.

tes

Corollary 2.4. Assume N Binom(v, p) with given v N and p (0, 1). Choose
i.i.d.
X1 , . . . , Xv Bernoulli(p). Then we have
(d)

no

N =

v
X

Xi .

i=1

Pv
Proof. In view of Lemma 1.3 it suffices to prove that N and X =
i=1 Xi have the same
moment generating function. The moment generating function of the latter is given by
" v
#
v
v
Y
Y

 Y
rXi
(per + (1 p)) = MN (r).
E erXi =
MX (r) = E
e
=
i=1

i=1

NL

i=1

This completes the proof.

Remarks. The corollary states that N describes the number of defaults within
a portfolio of fixed size v N. Every risk in this portfolio has the same default
probability p and defaults between different risks do not influence each other (are
independent). Thus, if N has a binomial distribution then every risk in such a
portfolio can at most default once. This is the case, for instance, for life insurance
policies where an insured can die at most once. In non-life insurance this distribution is less commonly used because for typical non-life insurance policies we can
have more than one claim within a fixed time interval, e.g., a car insurance policy
can suffer two or more accidents within the same accounting year. Therefore, the
binomial distribution is not of central interest in non-life insurance modeling.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

28

Chapter 2. Collective Risk Modeling

Definition 2.5 (compound binomial model). The total claim amount S has a
compound binomial distribution, write
S CompBinom(v, p, G),
if S has a compound distribution with N Binom(v, p) for given v N and
p (0, 1) and individual claim size distribution G.

w)

Proposition 2.6. Assume S CompBinom(v, p, G). We have


E[S] = vp E[Y1 ],


Var(S) = vp E[Y12 ] pE[Y1 ]2 ,


s

1 q
1 p + Vco(Y1 )2 ,
vp
MS (r) = (pMY1 (r) + (1 p))v
for r R,

(m

Vco(S) =

whenever they exist.

tes

Proof. The proof is an immediate consequence of Propositions 2.2 and 2.3.

no

Remark. The coefficient of variation Vco(S) is a measure for the degree of diversification within the portfolio. If S has a compound binomial distribution with
fixed default probability p and fixed claim size distribution G having finite second
moment, then the coefficient of variation converges to zero of order v 1/2 as the
portfolio size v increases.
Corollary 2.7 (aggregation property). Assume S1 , . . . , Sn are independent with
Sj CompBinom(vj , p, G) for all j = 1, . . . , n. The aggregated claim has a compound binomial distribution with

NL
S=

n
X

Sj CompBinom

n
X

vj , p, G .

j=1

j=1

Proof. Exercise. Note here that n describes the (deterministic) number of portfolios and should
not be confused with the binomial random variable N . 2

Exercise 3. Assume S CompBinom(v, p, G) and choose M > 0 such that


G(M ) (0, 1). Define the compound distribution of claims Yi exceeding threshold
M by
Slc =

N
X

Yi 1{Yi >M } .

i=1

Then we have Slc CompBinom(v, p(1 G(M )), Glc ) where the large claims size
distribution satisfies Glc (y) = P [Y1 y|Y1 > M ].

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling

2.2.2

29

Poisson distribution

For defining the Poisson distribution we choose a fixed volume v > 0 and a fixed
expected claims frequency > 0.
We say N has a Poisson distribution, write N Poi(v), if
(v)k
k!

for all k A = N0 .

w)

pk = P [N = k] = ev

(m

The power series expansion of the exponential function ev


P
provides k0 pk = 1 and thus we have a discrete distribution function on the set A = N0 .

The Poisson distribution goes back to Simon Denis Poisson (1781-1840) who has published his work on probability
theory in 1837.

no

tes

Note that parameter v only appears as a product in the


S.D. Poisson
Poisson distribution. Therefore, we could also define c =
v > 0 and work solely with c. This is the way how the Poisson distribution is
typically treated in the literature. We would like to keep the separation of c into
and v because we would like to have the frequency interpretation for which also
allows for the study of diversification benefits. This is exactly one of the statements
in the next proposition.
Proposition 2.8. Assume N Poi(v) for fixed , v > 0. Then
s

NL

E[N ] = v = Var(N ),

1
,
v
for all r R.

Vco(N ) =

MN (r) = exp {v(er 1)}

Proof. We calculate the moment generating function and then the first two moments follow from
formula (1.5). For the moment generating function we have using the power series expansion of
the exponential function
MN (r)

X
k0

erk ev

X (ver )k
(v)k
= ev
= exp {v + ver } .
k!
k!
k0

This completes the proof.

Proposition 2.8 provides the interpretation of the parameter . For given volume
v > 0 the expected claims frequency is


N
= .
v


Version April 14, 2016, M.V. Wthrich, ETH Zurich

30

Chapter 2. Collective Risk Modeling

Moreover, for the coefficient of variation of the claims frequency N/v we obtain


Vco

N
v

= (v)1/2 0

for v .

(2.2)

Next we give a constructive characterization of the Poisson distribution.


Lemma 2.9. Assume Nv Binom(v, p) with v N and p = p(v) (0, 1) such
that limv vp = c (0, ). Then Nv converges in distribution to N Poi(c) as
v .

w)

Proof. In view of Lemma 1.4 we need to prove that the moment generating functions of Nv have
the appropriate convergence property.
h
ivp(v)
1/p(v)
MNv (r) = (per + (1 p))v = (1 + p(v) (er 1))
.

(m

Note that p(v) 0 as v . If we apply this limit to the inner bracket (1 + p(v)(er 1))1/p(v)
we exactly obtain the limit definition of the exponential function exp{er 1}, see Definition 14.30
in Merz-Wthrich [79]. This with the fact that vp(v) c as v provides the proof.
2

tes

Interpretation. Binomially distributed claims counts Nv can be approximated


by a Poisson distribution if the default probability p is very small compared to the
portfolio size v.
Definition 2.10 (compound Poisson model). The total claim amount S has a
compound Poisson distribution, write

no

S CompPoi(v, G),

if S has a compound distribution with N Poi(v) for given , v > 0 and individual
claim size distribution G.

NL

Proposition 2.11. Assume S CompPoi(v, G). We have


E[S] = v E[Y1 ],

Var(S) = v E[Y12 ],
s

1 q
1 + Vco(Y1 )2 ,
v
MS (r) = exp {v(MY1 (r) 1)}

Vco(S) =

for r R,

whenever they exist.


Proof. The proof is an immediate consequence of Propositions 2.2 and 2.8.

Remark. If S has a compound Poisson distribution with fixed expected claims


frequency > 0 and fixed claim size distribution G having finite second moment,
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling

31

then the coefficient of variation converges to zero of order v 1/2 as the portfolio
size v increases.
The compound Poisson distribution has the so-called aggregation property and the
disjoint decomposition property. These are two extremely beautiful and useful
properties which explain part of the popularity of the compound Poisson model.
We first state and prove these two properties and then we give interpretations in
the context of non-life insurance portfolio modeling.

n
X

Sj CompPoi(v, G),

j=1

with
v=

n
X

vj ,

n
X
vj
j=1

j=1

(m

S=

w)

Theorem 2.12 (aggregation of compound Poisson distributions). Assume


S1 , . . . , Sn are independent with Sj CompPoi(j vj , Gj ) for all j = 1, . . . , n. The
aggregated claim has a compound Poisson distribution

G=

and

n
X

j vj
Gj .
j=1 v

tes

Note here that n describes the (deterministic) number of portfolios S1 , . . . , Sn and


should not be confused with the Poisson random variable N .

NL

no

Proof. We have assumed that Gj (0) = 0 for all j = 1, . . . , n which implies that S 0, P-a.s.
From Lemma 1.3 it follows that we only need to identify the moment generating function of S in
order to prove that it is compound Poisson distributed. Observe that MS (r) exists at least for
r 0. Thus, we calculate (using the independence of the Sj s)

n
n
n

X
Y
Y
E [exp {rSj }]
MS (r) = E exp r
Sj = E
exp {rSj } =

j=1
j=1
j=1

n
n

n

o
Y
X
j vj
=
exp j vj MY (j) (r) 1
= exp v
MY (j) (r) 1 ,
1
1

v
j=1

j=1

(j)

where we have assumed Y1 Gj . This is a compound Poisson distribution with expected number of claims v and the claim size distribution G is obtained from the moment generating funcPn j vj
Pn j vj
tion j=1 v
MY (j) (r): note that G = j=1 v
Gj is a distribution function (non-decreasing,
1
right-continuous, limx G(x) = 0 and limx G(x) = 1). We choose Y G and obtain

Z
Z
n
X

v
j
j
MY (r) =
ery dG(y) =
ery d
Gj (y)
v
0
0
j=1
=

Z
n
X
j vj
j=1

ery dGj (y) =

n
X
j vj
j=1

MY (j) (r).
1

Using Lemma 1.3 once more for the claim size distribution proves the theorem.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

32

Chapter 2. Collective Risk Modeling

m
X

G(y) =

p+
j Gj (y)

w)

Next we analyze the disjoint decomposition property. Therefore we slightly extend


the compound Poisson model. Let (p+
j )j=1,...,m be a discrete probability distribution
on the finite set {1, . . . , m}. Assume p+
j > 0 for all j. We can interpret the set
{1, . . . , m} as different sub-portfolios or different lines of business. For instance,
if we have a car insurance portfolio, a property insurance portfolio and a liability
insurance portfolio we set m = 3, with j {1, 2, 3} labeling the portfolios of the
three lines of business. Assume Gj are the corresponding claim size distributions
of the sub-portfolios with Gj (0) = 0. Then, we can define the mixture distribution
by
for y R.

j=1

P [I = j] = p+
j

(m

Theorem 2.12 exactly provides such a mixture distribution with p+


j = j vj /(v) if
we aggregate the sub-portfolios.
The next theorem provides the opposite direction, i.e. it is aiming at decomposing
(mixing) distribution G. We define a discrete random variable I which indicates
to which sub-portfolio a particular claim Y belongs to: define I by
for all j {1, . . . , m}.

(2.3)

tes

This allows to extend the compound Poisson model from Definition 2.10.

no

Definition 2.13 (extended compound Poisson model). The total claim amount
P
S= N
i=1 Yi has a compound Poisson distribution as defined in Definition 2.10. In
addition, we assume that (Yi , Ii )i1 are i.i.d. and independent of N with Yi having
marginal distribution function G with G(0) = 0 and Ii having marginal distribution
function given by (2.3).

NL

Remark. Note that Definition 2.13 gives a well-defined extension, i.e. it fully
respects the assumptions made in Definition 2.10 because (Yi , Ii )i1 are i.i.d. and
independent of N with Yi having the appropriate marginal distribution function
G. Observe that we do not specify the dependence structure between Yi and Ii . If
we choose m = 1 in (2.3) we are back in the classical compound Poisson model.
Therefore, the next theorem especially applies to the compound Poisson model.
Before stating the next theorem we introduce an admissible and measurable disjoint
decomposition (partition) of the total space. The random vector (Y1 , I1 ) takes
values in R+ {1, . . . , m}. On this latter we choose a finite sequence A1 , . . . , An
of (measurable) sets such that Ak Al = for all k 6= l and
n
[

Ak = R+ {1, . . . , m}.

(2.4)

k=1

Such a sequence A1 , . . . , An is called a measurable disjoint decomposition of R+


{1, . . . , m}. This measurable disjoint decomposition is called admissible for (Y1 , I1 )
if for all k = 1, . . . , n
p(k) = P [(Y1 , I1 ) Ak ] > 0.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling


Note that

Pn

k=1

33

p(k) = 1, due to (2.4) and the mutual disjointness.

Theorem 2.14 (disjoint decomposition of compound Poisson distributions). Assume that S fulfills the extended compound Poisson model assumptions of Definition
2.13. We choose an admissible and measurable disjoint decomposition A1 , . . . , An
for (Y1 , I1 ). Define for k = 1, . . . , n the random variables
Sk =

N
X

Yi 1{(Yi ,Ii )Ak } .

w)

i=1

Sk are independent and CompPoi(k vk , Gk ) distributed for k = 1, . . . , n with


k vk = vp(k) > 0

Gk (y) = P [ Y1 y| (Y1 , I1 ) Ak ] .

(m

and

Proof of Theorem 2.14. We prove the theorem using the multivariate extension of the moment generating function. Choose r = (r1 , . . . , rn )0 Rn . The multivariate moment generating
function of random vector S = (S1 , . . . , Sn )0 is given by
"
( n
)#
"
( n
)#
N
X
X X
0
MS (r) = E [exp {r S}] = E exp
rk Sk
= E exp
rk
Yi 1{(Yi ,Ii )Ak }
k=1

i=1

k=1

) ##


= E
E exp
rk Yi 1{(Yi ,Ii )Ak } N

i=1
k=1
"N "
( n
)##
Y
X
= E
E exp
rk Yi 1{(Yi ,Ii )Ak }
.
"

i=1

n
X

tes

"N
Y

k=1

l=1

E exp

n
X

k=1

)
#


rk Yi 1{(Yi ,Ii )Ak } (Yi , Ii ) Al P [(Yi , Ii ) Al ]

NL

k=1
"
n
X

no

Note that N is a Poisson distributed random variable and n denotes the deterministic number of
disjoint sets A1 , . . . , An . We calculate the inner expected values of the last expression.
"
( n
)#
"
( n
)
#
n
X
X
X
E exp
rk Yi 1{(Yi ,Ii )Ak }
=
E exp
rk Yi 1{(Yi ,Ii )Ak } 1{(Yi ,Ii )Al }

l=1
n
X

k=1

n
X

E [ exp {rl Yi }| (Yi , Ii ) Al ] p(l) =

l=1

p(l) MY (l) (rl ),


1

l=1

(l)

where we assume Y1 Gl . Collecting all terms we obtain

!N
"
(
!)#
n
n
X
X
= E exp N log
MS (r) = E
p(l) MY (l) (rl )
p(l) MY (l) (rl )
1

l=1

(
=

exp v

n
X

!)
p

(l)

MY (l) (rl ) 1

n
Y
l=1

(
= exp v

l=1

l=1
n
X
l=1

(l)

MY (l) (rl ) 1

n
n

o
Y
exp vp(l) MY (l) (rl ) 1
=
MSl (rl ).
1

l=1

This proves the theorem because we have obtained a product (i.e. independence) of moment
generating functions of compound Poisson distributed random variables Sl , l = 1, . . . , n.
2

Version April 14, 2016, M.V. Wthrich, ETH Zurich

34

Chapter 2. Collective Risk Modeling

Remarks 2.15 (Aggregation and disjoint decomposition properties).

w)

The aggregation property implies that we can follow a bottom-up modeling


approach for the entire insurance business: we model each sub-portfolio Sj
independently with a compound Poisson distribution. The total portfolio S
is then easily obtained by the aggregation theorem and we stay in the same
family of distributions. This theorem is of special importance when we estimate the frequency parameters j and the individual claim size distributions
Gj on the bottom (sub-portfolio) level.

tes

(m

The disjoint decomposition property implies that we


can also follow a top-down modeling approach: we
model the overall portfolio S by a compound Poisson distribution. The disjoint decomposition property
then easily allows to allocate the total claim amount
to the sub-portfolios. The crucial result here is, at
the first sight surprising, that this allocation results
in independent compound Poisson distributions for Sj .
This independence property does not hold true for
other compound distributions because it essentially
uses the independent space and time decoupling property of Poisson point
processes, see also Section 3.3.2 in Mikosch [81].

no

For I we have chosen a finite (discrete) indicator. Of course, this model can
easily be extended to other indicators. The crucial property is the i.i.d. assumption on the random vectors (Yi , Ii ). We have chosen a finite indicator I
because this has the natural interpretation of sub-portfolios. If I = 1, P-a.s.,
then we can completely drop this indicator.

NL

The choice of the appropriate volume on the sub-portfolios depends on the choice
of the indicator I. If m = 1, i.e. if we only consider one portfolio, and if we apply
a disjoint decomposition of this portfolio as follows
Yi = Yi 1{Yi A1 } + . . . + Yi 1{Yi An } ,

then it is natural to set vk = v and k = p(k) for k = 1, . . . , n. That is, the volume
v > 0 remains constant but the expected claims frequencies k change accordingly
to Ak . This is also called thinning of the Poisson point process.
The second extreme case is m = n > 1 and the disjoint decomposition is given by
{(Yi , Ii ) Ak } = {Ii = k},
i.e. we only consider a decomposition according to different sub-portfolios k =
1, . . . , m. In this case we would rather define vk > 0 by the volume of portfolio k
and k = p(k) v/vk .
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling

35

A = A1 = {Y1 M }

and

w)

Example 2.16 (large claims separation). A very important application of the


disjoint decomposition property of compound Poisson distributions is the separation of large claims from small claims. Often, there does not exist one parametric
distribution function G that applies to the entire range of possible outcomes of
the individual claim sizes Yi . Therefore, these individual claim sizes are divided
into different layers that need to be concatenated. Let us assume that we would
like to model two layers. We choose a large claims threshold M > 0 such that
G(M ) (0, 1), i.e. G(M ) is bounded away from zero and one. We then define the
disjoint decomposition A1 , A2 of R+ by
Ac = A2 = {Y1 > M } .

Ssc =

N
X

Yi 1{Yi M }

and

Slc =

N
X

Yi 1{Yi >M } .

i=1

tes

i=1

(m

Assume that S CompPoi(v, G). We define the total claim Ssc in the small
claims layer and the total claim Slc in the large claims layer by

Theorem 2.14 implies that Ssc and Slc are independent and compound Poisson
distributed with

and

no

Ssc CompPoi (sc v = G(M )v , Gsc (y) = P [Y1 y|Y1 M ]) ,

Slc CompPoi (lc v = (1 G(M ))v , Glc (y) = P [Y1 y|Y1 > M ]) .

NL

In particular, this means that we can model the small and the large claims layers
completely separately and then obtain the total claim amount distribution by a simple convolution of the two resulting distribution functions (due to independence),
see Example 4.11, below.
For the large claims layer we need to determine the expected large claims frequency
lc > 0. The individual large claim sizes Y1 |{Y1 >M } are often modeled with a Pareto
distribution with threshold M and tail parameter > 1, for more details see
Sections 3.2.5 and 3.4.1.
The small claims layer is often approximated by a parametric distribution function: we have seen in (2.1) that compound distributions may lead to rather time
consuming computational complexity when the expected number of claims sc v is
large. Therefore, one typically assumes that the expected number of small claims
is sufficiently large so that we are already in the asymptotic regime of the central
limit theorem and then we approximate this compound distribution by the Gaussian distribution, see Theorem 4.1 below, or maybe by a distribution function that
Version April 14, 2016, M.V. Wthrich, ETH Zurich

36

Chapter 2. Collective Risk Modeling

is slightly skewed, see Sections 4.1.2 and 4.1.3. Note that the small claims layer
cannot be distorted by large claims because they are already sorted out by the
threshold M . We will describe this in more detail in Section 3.4.1, below.


2.2.3

Mixed Poisson distribution

w)

Above we have introduced the binomial and the Poisson distributions. These two
distributions have the following relationship
E [N ] > Var(N ),

Poisson distribution

E [N ] = Var(N ).

However, insurance data often suggests

(m

binomial distribution

E [N ] < Var(N ).

no

tes

Therefore, we present more claims count distributions for N . In particular, the


mixed Poisson distribution enjoys the latter property of a variance dominating the
mean (over-dispersion). We remark that similar constructions could also be done
for the binomial distribution. We refrain from doing so because the Poisson case
is more appropriate for non-life insurance modeling.
The mixed Poisson distribution gives the general principle and a specific example
will be given in the next section. The idea is to attach volatility (or uncertainty)
to the claims frequency parameter , thus, the claims frequency will be modeled
as a latent (random) variable. Based on this latent variable we then choose the
claims count distribution being conditionally Poisson distributed.

NL

Definition 2.17 (mixed Poisson distribution).


Assume H with H(0) = 0, E [] = and Var() > 0.
Conditionally, given , N Poi(v) for a fixed volume v > 0.
Lemma 2.18. Assume N satisfies Definition 2.17. We have E [N ] < Var(N ).
Proof. The tower property implies E[N ] = E[E[N |]] = E[v] = v and
Var(N )

E[Var(N |)] + Var(E[N |]) = vE[] + v 2 Var() > v.

This completes the proof.

In the next section we make an explicit choice for the distribution function H.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling

2.2.4

37

Negative-binomial distribution

In this section we assume that N has a mixed Poisson distribution and we assume
that the latent variable is drawn from a gamma distribution. Therefore, we
briefly introduce the gamma distribution, which is described in more detail in
Section 3.3.3, below.

f (x) =

c
x1 exp {cx}
()

E[X] =

,
c

Var(X) =

c2

for x 0.

(m

The moments of X are given by

w)

We say X has a gamma distribution, write X (, c), with shape parameter


> 0 and scale parameter c > 0 if X is a non-negative, absolutely continuous
random variable with density

and

MX (r) =

c
cr

for r < c.

tes

The gamma distribution has many nice properties and it is used rather frequently
for the modeling of latent variables and for the modeling of individual claim sizes,
see Section 3.3.3.

no

Definition 2.19 (negative-binomial distribution, 1st definition). We say N has


a negative-binomial distribution, write N NegBin(v, ), with volume v > 0,
expected claims frequency > 0 and dispersion parameter > 0, if
(, ), and

NL

conditionally, given , N Poi(v).


Note that for = we are exactly in the context of Definition 2.17 with the first
two moments given by
E[] =

Var() = 2 / > 0.

and

Proposition 2.20 (negative-binomial distribution, 2nd definition). The negative-binomial distribution as defined in Definition
2.19 satisfies for k A = N0
!

k+1
pk = P[N = k] =
k

(1 p) pk ,

where we choose p = v/( + v) (0, 1).


Version April 14, 2016, M.V. Wthrich, ETH Zurich

G. Plya

38

Chapter 2. Collective Risk Modeling

This second representation is the definition often used for the


negative-binomial distribution. It is sometimes also named after
George Plya (1887-1985). In our context, it is simpler to
work with the first definition. Especially, parameter estimation
will give an explicit meaning to the latent variable .

(m

w)

Proof of Proposition 2.20. We apply the tower property which implies




(v)k
P[N = k] = E [P[N = k|]] = E exp{v}
k!
Z
k

(xv)

exp{xv}
x1 exp {x} dx
=
k!
()
0
Z
( + v)+k +k1
(v)k ( + k)
=
x
exp {( + v)x} dx
() k! ( + v)+k 0
( + k)

 
k


( + k)

v
k+1

=
=
(1 p) pk ,
() k!
+ v
+ v
k
notice that the second last inequality follows because we have a gamma density with shape
parameter + k and scale parameter + v under the integral. This trick of completion should
be remembered because it is applied rather frequently.
2

tes

Proposition 2.21. Assume N NegBin(v, ) for fixed , v, > 0. Then


E[N ] = v

Var(N ) = v(1 + v/) > v,


s

1 q
1 + v/ > 1/2 > 0,
v
!
1p
for all r < log p,
1 per

no

Vco(N ) =
MN (r) =

and p = v/( + v) (0, 1).

NL

Proof. The first three statements are a direct consequence of the proof of Lemma 2.18 and the
properties of the gamma distribution. Therefore, it remains to calculate the moment generating
function. The tower property implies

 
MN (r) = E E erN
= E [exp {v (er 1)}] = M (v(er 1)),
2

from which the claim follows for (, ) and 1 p = /( + v).

Proposition 2.21 provides a nice interpretation. For given volume v > 0 the expected claims frequency is
 
N
= .
E
v
Moreover, for the coefficient of variation of the claims frequency N/v we obtain


Vco

N
v

(v)1 + 1 1/2 > 0

Version April 14, 2016, M.V. Wthrich, ETH Zurich

for v .

Chapter 2. Collective Risk Modeling

39

w)

This can be interpreted as follows. The random variable reflects the uncertainty
in the true underlying frequency parameter of the Poisson distribution. This
uncertainty also remains in the portfolio for infinitely large volume v, i.e. this
risk is not diversifiable, and the positive lower bound 1/2 is determined by the
dispersion parameter (0, ). In particular, consider a time series N1 , N2 , . . .
of claims counts in different accounting years 1, 2, . . .. Each of these accounting
years has its own (risk) characteristics 1 , 2 , . . ., like weather conditions, inflation
index, portfolio fluctuations, etc. Since we do not know these characteristics a
priori, i.e. prior to future accounting years, we model these characteristics with a
latent factor (t )t1 which provides the true frequency parameter for accounting
year t, given by t = t . This differs from the Poisson case, see (2.2).

0.010

no

0.015

tes

0.020

binomial
Poisson
negativebinomial

0.000

NL

0.005

probability weights p_k

(m

0.025

Example 2.22 (claims count distributions). We compare the binomial, Poisson


and the negative-binomial distributions. We assume that they have identical means
E[N ] = 500 with v = 1000, p = = 0.5 and = 100.

200

300

400

500

600

700

800

Figure 2.1: Probability weights pk of binomial, Poisson and negative binomial


distributions with identical means (for convenience plotted as lines).

In Figure 2.1 we plot the corresponding probability weights pk . We observe that


the coefficient of variation is increasing from the binomial over the Poisson to the
negative-binomial distribution, which gives successively more uncertainty to claims
counts.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

40

Chapter 2. Collective Risk Modeling

Definition 2.23 (compound negative-binomial model). The total claim amount S


has a compound negative-binomial distribution, write
S CompNB(v, , G),
if S has a compound distribution with N NegBin(v, ) for given , v, > 0 and
individual claim size distribution G.

E[S] = v E[Y1 ],
Var(S) = v E[Y12 ] + (v)2 E[Y1 ]2 /,
Vco(S) =
MS (r) =

(m

w)

Proposition 2.24. Assume S CompNB(v, , G). We have, whenever they


exist,

1 q
1 + Vco(Y1 )2 + v/ > 1/2 ,
v
!
1p
for r R such that MY1 (r) < 1/p,
1 pMY1 (r)

tes

with p = v/( + v) (0, 1).


Proof. The proof is an immediate consequence of Propositions 2.2 and 2.21.

no

Exercise 4. Assume S CompNB(v, , G) and choose M > 0 such that G(M )


(0, 1). Define the compound distribution of claims Yi exceeding threshold M by
Slc =

N
X

Yi 1{Yi >M } .

i=1

2.3

NL

Then we have Slc CompNB((1 G(M ))v, , Glc ) where the large claims size
distribution satisfies Glc (y) = P [Y1 y|Y1 > M ].


Parameter estimation

Once we have specified the distribution functions for N and Yi we still need to
determine their parameters. In the case of the claims count distribution of N these
are (i) the default probability p for the binomial distribution; (ii) the expected
claims frequency for the Poisson distribution; or (iii) the expected claims frequency and the dispersion parameter for the negative-binomial distribution.
Essentially, there are three different common ways to estimate these parameters:
1. method of moments (MM),
2. maximum likelihood estimation (MLE) method,
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling

41

3. Bayesian inference method (inverse probability method).


In this section we describe the first two methods. For the Bayesian inference method
we refer to Chapter 8.

2.3.1

Method of moments

and

2 = 2 (1 , 2 ) = Var(Xt ) < .

(m

= (1 , 2 ) = E[Xt ] <

w)

We start with an example to explain the method of moments. Assume that we have
i.i.d.
an i.i.d. sequence X1 , . . . , XT F , where F is a parametric distribution function
that depends (for simplicity) on a two dimensional (real valued) parameter (1 , 2 ).
Assume that the first two moments of X1 are finite, and thus, for all t = 1, . . . , T
we have mean and variance (as a function of (1 , 2 ))

Remark. For general d-dimensional (real valued) parameters (1 , . . . , d ) we extend the argument to the first d moments of Xt .

b T =

T
1 X
Xt
T t=1

tes

We define the sample mean and sample variance by, T 2 for the latter,
and

bT2 =

T
X
1
(Xt b T )2 .
T 1 t=1

(2.5)

no

A straightforward calculation shows that these are unbiased estimators for and
2 , that is,
E[b T ] = = (1 , 2 )

and

E[bT2 ] = 2 = 2 (1 , 2 ).

(2.6)

NL

This motivates the moment estimator (b1 , b2 ) for (1 , 2 ) by solving the system of
equations
b T = (b1 , b2 )
and
bT2 = 2 (b1 , b2 ).
In our situation the problem is more involved. Assume we have a vector of observations N = (N1 , . . . , NT )0 , where Nt denotes the number of claims in accounting
year t. The difficulty is that Nt , t = 1, . . . , T , are not i.i.d. because they depend
on different volumes vt . That is, in general, the portfolio changes over accounting
years. Therefore, we need to slightly modify the framework described above.
Assumption 2.25. Assume there exist strictly positive volumes v1 , . . . , vT such
that the components of F = (N1 /v1 , . . . , NT /vT )0 are independent with
= E[Nt /vt ]

and

t2 = Var(Nt /vt ) (0, ),

for all t = 1, . . . , T .
Version April 14, 2016, M.V. Wthrich, ETH Zurich

42

Chapter 2. Collective Risk Modeling

Lemma 2.26. We make Assumption 2.25. The unbiased linear (in F) estimator
for with minimal variance is given by
b MV

T
X

1
=
2
t=1 t

!1 T
X

Nt /vt
,
2
t=1 t

the variance of this estimator is given by


b MV

T
X

1
=
2
t=1 t

!1

w)

Var

The upper index MV stands for minimal variance estimator.

(m

Proof. We apply the method of Lagrange, see Section 24.3 in Merz-Wthrich [79]. We define
the mean vector = e = (1, . . . , 1)0 RT and the diagonal positive definite covariance matrix
= diag(12 , . . . , T2 ) of F. Then we would like to solve the following minimization problem
x+ = arg min{xRT ;x0 =}

1 0
x x,
2

thus, we minimize the variance Var(x0 F) = x0 x subject to all unbiased linear combinations of
F which gives the constraint = E[x0 F] = x0 . The Lagrangian for this problem is given by
1 0
x x c(x0 ),
2

tes

L(x, c) =

with Lagrange multiplier c. The optimal value x+ is found by the solution of

L(x, c) = x c = 0
x

and

L(x, c) = x0 + = 0.
c

no

The first requirement implies x = c1 = c1 e. Plugging this into the second requirement
implies = x0 = c2 e0 1 e. If we solve this for the Lagrange multiplier we obtain c =
1 (e0 1 e)1 . This provides
!1
T
X
0
1
1
+
1
x = 0 1 e =
12 , . . . , T2 .
2
e e

t=1 t

NL

bMV = (x+ )0 F and the variance is given by


This implies that
T



bMV = (x+ )0 x+ = (e0 1 e)1 =
Var
T

T
X

!1
t2

t=1

This proves the lemma.

We apply this lemma to the case of the binomial and the Poisson distributions.
Assume that Nt , t = 1, . . . , T , are independent with Nt Binom(vt , p) or Nt
Poi(vt ), respectively. Then we have in the binomial case
E[Nt /vt ] = p

and

Var(Nt /vt ) = p(1 p)/vt = t2 ,

and

Var(Nt /vt ) = /vt = t2 .

and in the Poisson case


E[Nt /vt ] =

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling

43

Note that in both cases the unknown parameter p and , respectively, appears in the
variance. However, the appearance is of multiplicative
nature which implies that
1
2 PT
2
. Therefore, we get the following
it cancels in the weights wt = t
s=1 s
moment estimators in the binomial and the Poisson cases.
Estimator 2.27 (moment estimators in the binomial and Poisson cases).
We have the following unbiased linear minimal variance estimators:

pbMV
= PT
T

T
X

s=1

vs

w)

binomial case for p


T
X

Nt =

t=1

vt

Nt
;
s=1 vs vt

PT

t=1

b MV = P

T
T

T
X

s=1

vs

(m

Poisson case for


T
X

Nt =

t=1

vt

Nt
.
s=1 vs vt

PT

t=1

The variances of these estimators are given by

tes



p(1 p)
Var pbMV
= PT
T
s=1 vs

and

b MV = P
Var
T
T


s=1

vs

These variances (and uncertainties) converge to zero for Ts=1 vs , and they can
be estimated by replacing the unknown parameters p and , respectively, by their
estimators. Note that we can explicitly give these distributions of the estimators
P
P
because in the former case Tt=1 Nt Binom( Tt=1 vt , p) and in the latter case
PT
PT
t=1 vt ).
t=1 Nt Poi(

no

NL

The negative-binomial case is more complex. Assume that Nt , t = 1, . . . , T , are


independent with Nt NegBin(vt , ). For the first two moments we have
E[Nt /vt ] =

and

Var(Nt /vt ) = /vt + 2 / = t2 .

The variance term has two unknown parameters and and we lose the nice multiplicative structure from the binomial and the Poisson case which has allowed to
apply Lemma 2.26 in a straightforward manner. If we drop the condition minimal
variance we obtain the following unbiased linear estimator.
Estimator 2.28 (moment estimator in the negative-binomial case (1/2)).
We have the following unbiased linear estimator for
b NB = P

T
T

T
X

s=1

vs

t=1

Nt =

T
X
t=1

vt

Nt
.
s=1 vs vt

PT

Version April 14, 2016, M.V. Wthrich, ETH Zurich

44

Chapter 2. Collective Risk Modeling

In the last formula we could also take other volume weighted averages. The unbib NB immediately follows from the assumptions of the negative-binomial
asedness of
T
distribution. The variance of this estimator is given by
Var

b NB

!2 T
X

PT

s=1

vs

PT

t=1

Var(Nt ) =

vt + (vt )2 /

P

T
s=1

t=1

vs

2

Vb 2
T

w)

There remains the estimate of . Therefore, we define


T
1 X
Nt b N B
=
vt
T
T 1 t=1
vt

2

(2.7)

E VbT2

(m

Lemma 2.29. In the negative-binomial model VbT2 satisfies


T
T
X
2
vt2
=+
vt Pt=1
T
T 1 t=1
t=1 vt

tes

This motivates the following estimator.

1
.

Estimator 2.30 (moment estimator in the negative-binomial case (2/2)).


The method of moments suggests the following estimator for
T
T
b NB )2
X
(
1
vt2
T
= b 2 b NB
vt Pt=1
,
T
VT T T 1 t=1
t=1 vt

no

bTNB

b NB , otherwise use the Poisson or the binomial model (no over-dispersion


for VbT2 >
T
in data N1 , . . . , NT ).

NL

bN B for we have
Proof of Lemma 2.29. Using the unbiasedness of
T
"
#




T
T
2
h i
X
X
1
Nt b N B
1
Nt bN B
2
b
E VT
=
vt E
T
=
vt Var
T
T 1 t=1
vt
T 1 t=1
vt



 
T


X
1
Nt
Nt b N B
bN B
=
vt Var
2Cov
, T
+ Var
T
T 1 t=1
vt
vt
" T 
#
X vt + (vt )2 /  PT vt + (vt )2 /
1
t=1
=

.
PT
T 1 t=1
vt
s=1 vs
This proves the lemma.

We justify these estimators in the case of vt = v for all t = 1, . . . , T . This uniform


volume case provides
T
X
b NB v = 1

Nt = b T ,
T
T t=1
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling

45

which is the sample mean of i.i.d. random variables Nt . For the estimate we
obtain in the uniform volume case
bTNB =

b NB v)2
(
T
b NB v
VbT2 v
T

VbT2 v =

T 

1 X
bN B v 2 =
b T2 ,
Nt
T
T 1 t=1

with

and

E[bT2 ] = 2 = v + (v)2 /.

(m

E[b T ] = = v

w)

where the latter term is the sample variance of i.i.d. random variables Nt . Or in
other words, the proposed estimators in the uniform volume vt = v case are found
by looking at the system of equations (2.6). In the negative-binomial model this
system is given by

Replacing and 2 by their sample estimators and solving the system of equations
b NB and
bTNB in the uniform volume case.
provides
T

Maximum likelihood estimators

tes

2.3.2

no

The MLE method has been popularized by Sir Ronald Aylmer


Fisher (1890-1962) but it has been used already before by Gauss,
Laplace and others. The philosophy behind MLE is different
compared to the method of moments. For MLE the first objective is not unbiasedness but maximizing the probability of a
given observation. MLE can be done for densities or for probability weights, we formulate it for the latter because at the moment
Sir R.A. Fisher
we are looking at discrete random variables N .

NL

Assume that the components of N = (N1 , . . . , NT )0 are independent with probabil(t)


ity weights pk () = P [Nt = k] = P[Nt = k] that depend on a common unknown
parameter . The independence property of N1 , . . . , NT implies that the
joint likelihood function for observation N is given by
LN () =

T
Y
(t)

pNt (),

t=1

and their joint log-likelihood function is given by


`N () = log LN () =

T
X

(t)

log pNt ().

t=1

Version April 14, 2016, M.V. Wthrich, ETH Zurich

46

Chapter 2. Collective Risk Modeling

The MLE for is based on the rationale that should be chosen such that the
probability of observing N = (N1 , . . . , NT )0 is maximized. The MLE bMLE
for
T
based on the observation N is thus given by (subject to existence and uniqueness)
= arg max LN () = arg max `N ().
bMLE
T

(m
w

X
(t)
`N () =
log pNt () = 0.

t=1

This is solved by a root search algorithm. Under suitable regularity properties and
real valued parameter the MLE bMLE
is found as solution of
T

(t)

If the probability weights pk () are sufficiently regular as a function of in a regis asympular domain which contains the true parameter , then the MLE bMLE
T
totically unbiased for T and under appropriate scaling it has an asymptotic
Gaussian distribution with inverse Fishers information as covariance matrix, for
details see Theorem 4.1 in Lehmann [70].

pbMLE
= PT
T

tes

Estimator 2.31 (MLE in the binomial case). Assume N1 , . . . , NT are independent


and Binom(vt , p). The MLE is given by
T
X

vs

T
X
t=1

t=1

vt

Nt
= pbMV
T .
v
v
t
s=1 s

PT

no

s=1

Nt =

Proof. The log-likelihood function is given by


 
T
X
vt
log
`N (p) =
+ Nt log p + (vt Nt ) log(1 p).
N
t
t=1

NL

Calculating the derivative w.r.t. p provides the requirement


T

X Nt

v t Nt
`N (p) =

= 0.
p
p
1p
t=1
2

Solving this for p proves the claim.

Estimator 2.32 (MLE in the Poisson case). Assume N1 , . . . , NT are independent


and Poi(vt ). The MLE is given by
b MLE = P

T
T

T
X

s=1

vs

Nt =

t=1

T
X
t=1

Nt
b MV .
=
T
v
v
t
s=1 s
vt

PT

Proof. The log-likelihood function is given by


`N () =

T
X

(vt ) + Nt log(vt ) log(Nt !).

t=1

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling

47

Calculating the derivative w.r.t. provides the requirement


T

Nt
`N () =
= 0.
vt +

t=1
2

Solving this for provides the claim.

Estimator 2.33 (MLE in the negative-binomial case). Assume N1 , . . . , NT are


b MLE ,
b MLE ) is the solution of
independent and NegBin(vt , ). The MLE (
T
X

Nt + 1
log
+ log(1 pt ) + Nt log pt = 0,
(, ) t=1
Nt

w)

(m

with pt = vt /( + vt ) (0, 1).

Unfortunately, this system of equations does not have a closed form solution, and
a root search algorithm is needed to find the MLE solution for (, ), see also page
61 below.

Example and 2 -goodness-of-fit analysis

tes

2.3.3

no

We apply the claims count models (Poisson and


negative-binomial) to a real data set. We take
the data set provided in Gisler [54]. This data
set describes the number of claims in an insurance portfolio that protects private households
against water claims. The data is displayed in
Table 2.1 and Figure 2.2.

NL

We observe a strong growth of volume of more than 40% in this insurance portfolio
from v1982 = 2400 755 policies to v1991 = 3440 757 policies. Such a strong growth
might question the stationarity assumption in the expected claims frequency t
because this growth might also reflect a substantial change in the portfolio (and
the underlying product possibly). Nevertheless we assume its validity (because
the observed claims frequencies Nt /vt do not show any structure such as a linear
trend, see Figure 2.2) and we fit the Poisson and, if necessary, the negative-binomial
distribution to this data set.
 Poisson model. We assume that Nt are independent with Nt Poi(vt ). The
linear minimal variance estimator and the MLE for are given by, see Estimator
2.32,
1991
X
b MV =
b MLE = P 1
Nt = 5.43%.

T
T
1991
s=1982 vs t=1982
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling


year
t
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
total

volume
vt
240755
255571
269739
281708
306888
320265
323481
334753
340265
344757
3018182

number of
claims Nt
13153
14186
14207
13461
21261
19934
15796
15157
17483
19185
163823

frequency
Nt /vt
5.46%
5.55%
5.27%
4.78%
6.93%
6.22%
4.88%
4.53%
5.14%
5.56%
5.43%

w)

48

(m

Table 2.1: Private households water insurance: number of policies, claims counts
and observed yearly claims frequencies, source Gisler [54].

The coefficient of variation in the Poisson model is given by, see (2.2),

tes

Vco(Nt /vt ) = (vt )1/2 .


b MV by
This coefficient of variation is estimated using
T

no

d
b MV 1/2 0.8%.
Vco(N
t /vt ) = (T vt )

NL

If we choose 1 standard deviation as confidence bounds, i.e. if we consider the


confidence interval CIt = ( (vt )1/2 ), we obtain estimated confidence intervals
(for any t) of roughly
c = (5.39%, 5.47%).
CI
t
These resulting confidence bounds are very narrow and we observe that most of
the observed yearly claims frequencies Nt /vt in Table 2.1 lie outside of these confidence bounds, see Figure 2.3 (lhs). This clearly rejects the assumption of having
Poisson distributions for the number of claims and suggests that we should study
the negative-binomial model for Nt .
 Negative-binomial model. As described above, the negative-binomial model is
able to model the heterogeneity over different accounting years t. It assumes that
every accounting year t is characterized by a latent (risk) factor t which describes
the nature of that particular accounting year t. A priori all years are similar which
is expressed by the i.i.d. property of t with t (, ) for identical dispersion
parameters > 0. We estimate this dispersion parameter with Estimator 2.30.
b NB =
b MV . We obtain
We expect that it substantially differs from , i.e. VbT2 >
T
T
Version April 14, 2016, M.V. Wthrich, ETH Zurich

0.070

Chapter 2. Collective Risk Modeling

49

0.065

0.060

w)

0.055

observed frequencies

0.050

0.045

1982

1984

1986

(m

1988

1990

tes

Figure 2.2: Observed yearly claims frequencies Nt /vt from t = 1982 to 1991 compared to the overall average frequency of 5.43%, see Table 2.1.
VbT2 = 15.84 > 5.43%. Thus, we have a clear over-dispersion which results in the
estimate
bTNB = 56.23

and

d
Vco(N
t /vt ) =

b NB v )1 + (
bTNB )1 13%.
(
t
T

no

If we calculate the estimated 1 standard deviation confidence bounds we obtain for


all t roughly
c = (4.70%, 6.15%).
CI
t

NL

This makes much more sense in view of the observed frequencies Nt /vt in Table
2.1. We see that 7/10 of the observations are within these confidence bounds, see
Figure 2.3 (rhs).

We close this section with a statistical test: In the previous example it was obvious
that the Poisson model does not fit to the data. In situations where this is less
obvious we can use the following 2 -goodness-of-fit test.
Null hypothesis H0 : Nt are independent and Poi(vt ) distributed for t = 1, . . . , T .
We are going to build a test statistics for the evaluation of this null hypothesis H0 .
We define
T
X
(Nt /vt )2
= (N) =
.
/vt
t=1
It is not straightforward to determine the explicit distribution function of .
Therefore, we give an approximate answer to this request of hypothesis testing.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

0.070

Chapter 2. Collective Risk Modeling

0.070

50

0.065

0.065

0.060
0.055

observed frequencies

0.060
0.055

observed frequencies

0.050

0.050

1984

1986

1988

1990

1982

1984

1986

1988

1990

(m

1982

w)

0.045

0.045

Figure 2.3: Observed yearly claims frequencies Nt /vt from t = 1982 to 1991 compared to the to the estimated overall frequency of 5.43%. (lhs): 1 standard deviation confidence bounds Poisson case; (rhs): 1 standard deviation confidence bounds
negative-binomial case.

tes

The aggregation and disjoint decomposition theorems (Theorems 2.12 and 2.14)
imply that Nt Poi(vt ) can be understood as a sum of vt i.i.d. random variables
Xi Poi(). That is,
(d)

Nt =

vt
X

Xi ,

no

i=1

with E[X1 ] = and Var(X1 ) = . But then the CLT (1.2) applies with
Ze

Nt /vt
q

/vt

Nt vt (d)
=
=
vt

Pvt

X vt
i
N (0, 1)
vt

i=1

as

vt .

NL

This explains that Zet can be approximated in distribution by a standard Gaussian


random variable Zt N (0, 1) for vt sufficiently large.
i.i.d.
Next, if we assume that Z1 , . . . , ZT N (0, 1) then a standard result in statistics
PT
says that t=1 Zt2 has a 2 -distribution with T degrees of freedom, denoted by 2T ,
see also Exercise 2 on page 22. Therefore, we obtain the asymptotic approximation
in distribution
T
T
X
(Nt /vt )2 X
2 (d)
e
=
Zt
Zt2 2T .
= (N) =
/vt
t=1
t=1
t=1

T
X

b MLE .
In the last step we need to replace the unknown parameter by its estimate
T
By doing so, we lose one degree of freedom, thus, we get the test statistics b and
the corresponding distributional approximation

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling

b =

T
X

vt

t=1

51

b MLE
Nt /vt
T

2

b MLE

(d)

2T 1 .

(2.8)

w)

We revisit the previous example. For the data in Table 2.1 we obtain b = 20 627.
The 99%-quantile of the 2 -distribution with T 1 = 9 degrees of freedom is given
by 21.67. Since this is by far smaller than b we reject the null hypothesis H0 on
the significance level of q = 1%. This, of course, is not surprising in view of Figure
2.3 (lhs).

Exercise 5. Consider the data given in Table 2.2. Estimate the parameters for
1
1000
10000

2
997
10000

3
985
10000

4
989
10000

5
1056
10000

6
1070
10000

7
994
10000

8
986
10000

(m

t
Nt
vt

9
1093
10000

10
1054
10000

Table 2.2: Observed claims counts Nt and corresponding volumes vt .

no

tes

the Poisson and the negative-binomial models. Which model is preferred? Does
a 2 -goodness-of-fit test reject the null hypothesis on the 5% significance level of
having Poisson distributions?


Exercise 6. An insurance company decides to offer a no-claims bonus to good car


drivers, namely,
a 10% discount after 3 years of no claim, and

NL

a 20% discount after 6 years of no claim.


How does the base premium need to be adjusted so that this no-claims bonus can
be financed? For simplicity we assume that all risks have been insured for at least
6 years. Answer the question in the following two situations:
(a) Homogeneous portfolio with i.i.d. risks having i.i.d. Poisson claim counts with
frequency parameter = 0.2.
(b) Heterogeneous portfolio with independent risks being characterized by a frequency parameter having a gamma distribution with mean = 0.2 and
Vco() = 1. Conditionally, given , the individual years have i.i.d. Poisson
claim counts with frequency parameter .


Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 2. Collective Risk Modeling

NL

no

tes

(m

w)

52

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3

w)

Individual Claim Size Modeling

(m

In Model Assumptions 2.1 we have introduced the compound distribution


S = Y1 + Y2 + . . . + YN =

N
X

Yi ,

i=1

with the three standard assumptions

tes

1. N is a discrete random variable which takes values in A N0 ;


i.i.d.

2. Y1 , Y2 , . . . G with G(0) = 0;

3. N and (Y1 , Y2 , . . .) are independent.

no

In Chapter 2 we have discussed the modeling of the claims count distribution of


N . In this chapter we concentrate on the modeling of the individual claim sizes Yi .

NL

To get an understanding for the modeling of G we


present a data analysis based on two explicit data sets.
The first data set is a private property (PP) insurance
data set that consists of 72769 claims records. The
second data set is a commercial property (CP) insurance data set that consists of 18285 claims records.
Before presenting sophisticated mathematical modeling methods for G we analyze
these two data sets using tools from descriptive statistics.

3.1

Data analysis and descriptive statistics

The first observation is that the two data sets contain many claims records with
zero claims payments. That is, many of the recorded claims were settled without
any payments. In the case of PP insurance these were about 16% of the reported
claims and in the case of CP insurance we observe about 21% of zero claims. Zero
claims are due to reasons such as: the final claim does not exceed the deductible,
the insurance company is not liable for the claim, another insurance policy covers
53

54

Chapter 3. Individual Claim Size Modeling

the claim, reporting a (small) claim reduces the no-claims-benefit too much so that
the insured decides to withdraw the claim, etc.

(m

w)

We can treat zero claims in two different ways: (i) estimate the proportion of
zero claims separately and add this probability weight to G at 0; (ii) we simply
reduce the expected claims frequency by these zero claims. The first way (i) is
mathematically consistent, but contradicts our model assumption G(0) = 0; the
second way (ii) perfectly fits into the compound Poisson modeling framework due
to the disjoint decomposition Theorem 2.14 (also the binomial and the negativebinomial case can be handled, see Examples 3 and 4). In general, the second
version (ii) is the simpler one to deal with (however, one may lose information by
dropping zero claims). Here, we assume that G(0) = 0 and E[N ] = v, where v > 0
is the portfolio size and N only counts strictly positive claims. Henceforth, after
subtracting these zero claims, we have n = 610 053 strictly positive claims records
in PP insurance and n = 140 532 in CP insurance denoted by Y1 , . . . , Yn .

NL

no

tes

We start with the scatter plots of the data, see Figures 3.1 and 3.2. We plot the
individual claim sizes (ordered by arrival date) both on the original scale (lhs) and
on the log scale (rhs). These scatter plots do not offer much information because

Figure 3.1: Scatter plot of the n = 610 053 strictly positive claims records of PP
insurance ordered by arrival date: original scale (lhs) and log scale (rhs).

they are overloaded, at least they do not show any obvious trends (and therefore
suggest stationarity of the data). We calculate the sample means and the sample
variances of the observations, see also (2.5),

b n =

n
1 X
Yi
n i=1

and

bn2 =

n
1 X
(Yi b n )2 ,
n 1 i=1

Version April 14, 2016, M.V. Wthrich, ETH Zurich

55

(m

w)

Chapter 3. Individual Claim Size Modeling

Figure 3.2: Scatter plot of the n = 140 532 strictly positive claims records of CP
insurance ordered by arrival date: original scale (lhs) and log scale (rhs).
For our data sets we obtain empirical moments

CP :

d = 2.42;
b n = 30 116, bn = 70 534, Vco
n

(3.1)

tes

PP :

d = 4.16.
b n = 6 850, bn = 28 505, Vco
n

(3.2)

Next we give the histogram for PP insurance, see Figure 3.3 (lhs). We see that

50000

100000

150000

claim sizes

200000

12000
10000
8000
count

6000
4000
2000

NL
0

60000
50000
40000
count

30000
20000
10000
0

histogram logged claim sizes PP insurance

no

histogram claim sizes PP insurance

250000

10

12

logged claim sizes

Figure 3.3: Histogram of the n = 610 053 strictly positive claims records of PP
insurance: original scale (lhs) and log scale (rhs).
a few large claims distort the whole picture so that the histogram is not helpful.
We could plot a second one only considering small claims. In Figure 3.3 (rhs) we
plot the histogram for logged claim sizes. In Figure 3.4 we give the corresponding
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

(m

w)

56

Figure 3.4: Box plots of claims records of PP and CP insurance: original scale (lhs)
and log scale (rhs).

tes

box plots. They show positive skewness. The ultimate goal is to model the full
distribution functions G(y) = P[Y y] for the two portfolios PP and CP. Having so
many observations we could directly work with the empirical distribution function
(at least for small claims, see Section 3.4.1) which is given by
b (y) =
G
n

n
1 X
1{Y y} .
n i=1 i

(3.3)

no

The empirical distribution function of logged claim sizes is given in Figure 3.5
(lhs). For a sequence of observations Y1 , . . . , Yn we denote the ordered sample by
Y(1) Y(2) . . . Y(n) . For the next definitions we assume that Y G has finite
mean. We define the loss size index function and its empirical counterpart by
Ry

NL

I(G(y)) =

0 z
R
0 z

Pbnc

dG(z)
dG(z)

and

Ib

n ()

Y(i)
,
i=1 Yi

= Pi=1
n

for [0, 1]. The loss size index function I(G(y)) chooses a claim size threshold
y and then evaluates the relative expected claim that is explained by claim sizes
below this threshold y. The resulting empirical graphs are presented in Figure 3.5
(rhs). Rather typically in non-life insurance we see that the 20% largest claims
roughly cause 75% of the total claim size! This explains that large claims heavily
influence the total claim amount.
We have already seen in the previous figures that large claims may lead to several
modeling challenges. Two plots that especially focus on large claims are the mean
excess plot and the log-log plot. We define the mean excess function and empirical
mean excess function by
Pn

e(u) = E [Yi u|Yi > u]

and

ebn (u) =

i=1 (Yi u)1{Yi >u}


.
Pn
i=1 1{Yi >u}

Version April 14, 2016, M.V. Wthrich, ETH Zurich

57

(m

w)

Chapter 3. Individual Claim Size Modeling

b of PP and CP insurance on log


Figure 3.5: Empirical distribution functions G
n
scale (lhs) and corresponding empirical loss size index functions Ibn (rhs).

tes

The (empirical) mean excess plot is obtained by


u 7 e(u)

and

u 7 ebn (u),

and

y 7

and the (empirical) log-log plot by

b (y)) .
log y, log(1 G
n

NL

no

y 7 (log y, log(1 G(y)))

Figure 3.6: Empirical log-log plot (lhs) and empirical mean excess plot (rhs) of PP
and CP insurance data.
In Figure 3.6 we present the empirical log-log and mean excess plots of the two
Version April 14, 2016, M.V. Wthrich, ETH Zurich

58

Chapter 3. Individual Claim Size Modeling

data sets. Linear decrease in the log-log plot and linear increase in the mean excess
plot will have the interpretation of heavy tailed distributions in the sense that the
= 1 G is regularly varying at infinity, see (3.4) below.
survival function G

3.2

Selected parametric claims size distributions

w)

In this section we introduce popular parametric claim size distributions. We only


consider distribution functions G with unbounded support in R+ . We use the
following notation for a random variable Y G:

MY (r)

(m

density of Y for G being absolutely continuous,

moment generating function of Y in r R, where it exists,

expected value of Y , if it exists,

Y2

variance of Y , if it exists,

Vco(Y )

coefficient of variation of Y , if it exists,


skewness of Y , if it exists,

=1G
G

survival function of Y , i.e. G(y)


= P[Y > y].

tes

no

For analyzing G the following quantities are of interest (assuming Y < ):

E[Y 1{u1 <Y u2 } ]

expected value of Y within layer (u1 , u2 ],


loss size index function for level y,

e(u) = E[Y u|Y > u]

mean excess function of Y above u.

NL

I(G(y)) = E[Y 1{Y y} ]/Y

If G depends on a parameter and we have i.i.d. observations Yi G, then we


can estimate this parameter from the data. The method of moments estimator
is denoted by bMM and the MLE by bMLE , see also Section 2.3. Note that if
one estimates this parameter one should also try to assess the precision of this
parameter estimate (parameter uncertainty).

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

59

For the analysis of the tail of the distribution function we


consider the property of regular variation at infinity. Therefore, we assume that G has infinite support at . Then, we
= 1 G is regularly varying
say that the survival function G
R , if for
at infinity with (tail) index > 0, we write G
all t > 0

G(xt)
1 G(xt)
lim
= t .
= lim
x G(x)
x 1 G(x)

(3.4)

3.2.1

Gamma distribution

(m

w)

is slowly
If the above holds true for = 0 then we say G
R0 ; if the above holds
varying at infinity and we write G
is rapidly varying at infinity and we write G
R .
true for = then we say G
R for some
From an insurance point of view distribution functions G with G
[0, ) are dangerous because they have a large potential for big claims, see
Chapter 3 in Embrechts et al. [39]. Therefore, it is crucial to know this index of
regular variation at infinity, see also Remarks 5.17.

no

tes

Some people refer the gamma distribution to Karl Pearson (1857-1936), however, it seems that already Laplace [69] has used it. We have introduced the gamma
distribution in Section 2.2.4 for the definition of the negative-binomial distribution
and we will also see that this distribution is very useful in the context of generalized
linear models and Bayesian modeling, see Chapters 7, 8 and 9 below.

NL

The gamma distribution has two parameters, a shape parameter > 0 and a scale
parameter c > 0. We write Y (, c). The distribution function of Y has positive
support R+ with density for y 0 given by
g(y) =

c 1
y
exp {cy} .
()

There is no closed form solution for the distribution function G. For y 0 it can
only be expressed as
G(y) =

Z y
0

c 1 cx
1 Z cy 1 z
x e
dx =
z e dz = G(, cy),
()
() 0

where G(, ) is the incomplete gamma function. From this we see that the family
of gamma distributions is closed towards multiplication with a positive constant,
that is, for > 0 we have
Y (, c/).
(3.5)
Version April 14, 2016, M.V. Wthrich, ETH Zurich

60

Chapter 3. Individual Claim Size Modeling

This property is important when we deal with claims inflation and it explains why
c is called scale parameter. For the moment generating function and the moments
we have

MY (r) =

c
cr

for r < c,

,
Y2 = 2 ,
c
c
1/2
Vco(Y ) =
,
Y = 2 1/2 > 0.
=

w)

NL

no

tes

(m

For 0 u1 < u2 and u, y > 0 we obtain

[G( + 1, cu2 ) G( + 1, cu1 )] ,


E[Y 1{u1 <Y u2 } ] =
c
I(G(y)) = G( + 1, cy),
!
1 G( + 1, cu)
e(u) =
u.
c
1 G(, cu)

Figure 3.7: Gamma distribution with mean Y = 1 and shape parameter


{1/2, 1, 3/2, 2}. lhs: density g(y); rhs: log-log plot.
Exercise 7. Assume Y (, c).
Prove the statements of the moment generating function MY and the loss
size index function I(G(y)). Hint: use the trick of the proof of Proposition
2.20.
Prove the statements
e(u) = Y

1 I(G(u))
u,
1 G(u)

E[Y 1{u1 <Y u2 } ] = Y (I(G(u2 )) I(G(u1 ))) .

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

61


w)

The gamma distribution does not have a regularly varying tail at infinity, see

Table 3.4.4 in Embrechts et al. [39]. In fact, G(y)


= 1 G(y) decays roughly as
exp{cy} to 0 as y , because exp{cy} gives an asymptotic lower bound and

exp{(c )y} an asymptotic upper bound for any > 0 on G(y).


Note that the
gamma distribution is also not subexponential due to (5.10), below.
For generating gamma random numbers in R the following code is used (n stands
for the number of random numbers to be generated)
> rgamma(n, shape=, rate=c)

(m

The method of moments estimators (based on the first two empirical moments) are
given by
b n
b 2
cbMM = 2
and
b MM = n2 .
bn
bn
For the MLE we have log-likelihood function, set Y = (Y1 , . . . , Yn )0 ,
log c log () + ( 1) log Yi cYi .

i=1

tes

`Y (, c) =

n
X

The MLE b MLE of is the solution of

n
0 () 1 X
log Yi = 0.
+
()
n i=1

no

log log b n

(3.6)

This is solved numerically, and the MLE for c is then given by

NL

cbMLE =

b MLE
.
b n

For the numerical solution in R one can use the command


> fitdistr(data, gamma)
The numerical fitting does not always work when the range of observations Y is too
large. In such cases it is recommended that in the first step the data is scaled by a
constant factor > 0, this can be done due to (3.5); next parameters are estimated
for scaled data; and in the last step the estimated scale parameter is scaled back by
the same factor. An alternative way is to explicitly program the function given in
(3.6) and then apply the root search command uniroot(). The term 0 ()/()
is calculated with digamma(), see also Section 3.9.5 in Kaas et al. [64].

Version April 14, 2016, M.V. Wthrich, ETH Zurich

62

Chapter 3. Individual Claim Size Modeling

Remark 3.1 (exponential and 2 -distributions). The special case = 1 is referred


to the exponential distribution with parameter c > 0, and denoted by expo(c).
The special case = k/2 and c = 1/2 is the 2 -distribution with k N degrees of
freedom, see Exercise 2 on page 22.

tes

(m

w)

Example 3.2 (gamma distribution for PP data). We fit the PP insurance data
displayed in Figure 3.1 to the gamma distribution.

NL

no

Figure 3.8: Gamma distribution with MM and MLE fits applied to the PP insurance data. lhs: QQ plot; rhs: loss size index function.

Figure 3.9: Gamma distribution with MM and MLE fits applied to the PP insurance data. lhs: log-log plot; rhs: mean excess plot.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

63

(m

w)

From Figures 3.8 and 3.9 we immediately conclude that the gamma model does not
fit to the PP data. The reason is that the data is more heavy tailed. This can be
seen in the QQ plot in Figure 3.8 (lhs): the data at the right end of the distribution
lie substantially above the line. The MM estimators manage to model the data up
to some layer, the MLE estimators, however, are heavily distorted by the small
claims which can be seen in the mean excess plot in Figure 3.9 (rhs). In fact, we
have too many small claims (observations below 1500) to be explained by a gamma
distribution. The MLE is heavily based on these small observations, in Figure 3.8
(rhs) and Figure 3.9 (lhs) we see that MLE fits well for small claims, whereas MM
provides more appropriate results in the upper range of the data. Summarizing, we
should choose more heavy tailed distribution functions to model this data and the
resulting figures are already sufficient for rejecting the gamma model. This first
data example also indicates that probably there is not one distribution that fits all
claims layers. We come back to this in Section 3.4.

Remark 3.3 (inverse Gaussian distribution). A distribution function which is also
found quite often in the actuarial literature is the inverse Gaussian distribution,
see for instance Section 3.9.6 in Kaas et al. [64]. Its density is for y 0 given by
(

3/2
1
2
=
+ cy ,
y
exp
2cy
2
2c

tes

3/2
( cy)2
g(y) =
y
exp
2cy
2c

where > 0 is a shape parameter and c > 0 a scale parameter. Observe that this
density behaves similar as the gamma density for y . For the distribution
function we have a closed form solution in the following (weak) sense
!

no

G(y) = + cy + e2 cy ,
cy
cy

NL

where () is the standard Gaussian distribution. This can be checked by calculating the derivative of the latter. For the moment generating function and the
moments we have
n h

io

MY (r) = exp 1 (1 2r/c)1/2


for r c/2,

Y =
,
Y2 = 2 ,
c
c
Vco(Y ) = 1/2 ,
Y = 31/2 > 0.

b MM and cbMM . The MLE is given by


From this we calculate the MM estimators
"
b MLE =

n
1X
Y 1
n i=1 i

#1

n
1X
Yi 1
n i=1

and

cbMLE =

b MLE

.
b n

The inverse Gaussian distribution leads to an improvement of the fit compared to


the gamma distribution. Overall it is also not convincing, especially in the tails (it
has the same asymptotic behavior as the gamma distribution). Since the inverse
Gaussian distribution is less handy than the ones that will be presented below we
refrain from further discussing this distribution function.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

64

3.2.2

Chapter 3. Individual Claim Size Modeling

Weibull distribution

w)

The Weibull distribution has his name from Ernst Hjalmar


Waloddi Weibull (1887-1979), however it was first identified
by Maurice Frchet (1878-1973) in 1927, but Weibull was
probably the first one who has in 1951 described the distribution
function in detail.

(m

The Weibull distribution has two parameters, a shape parameter


> 0 and a scale parameter c > 0. We write Y Weibull(, c).
The distribution function of Y has positive support R+ with
density for y 0 given by

E.H.W. Weibull

tes

g(y) = (c ) (cy) 1 exp {(cy) } .

no

We are especially interested in (0, 1) because this provides a slower decay of


the survival function compared to the gamma distribution. For y 0 we have
G(y) = 1 exp {(cy) } .

NL

This still does not have a regularly varying tail at infinity but the decay of the
is slower than in the gamma case for < 1, see also Table 3.4.4
survival function G

in Embrechts et al. [39]. In fact, the survival function G(y)


= 1 G(y) decays as

exp{(cy) } to 0 for y . Note that the Weibull distribution is subexponential


for (0, 1), see Example 1.4.7 in Embrechts et al. [39]. We will come back to
subexponentiality in Section 5.4.
The family of Weibull distributions is closed towards multiplication with positive
constants, that is, for > 0 we have
Y Weibull(, c/).
The moment generating function and the moments are given by

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

Y
Y2
Y

does not exist for < 1 and r > 0,


(1 + 1/ )
,
=
c
(1 + 2/ )
=
2Y ,
2
"c
#
1 (1 + 3/ )
2
3
=
3Y Y Y .
Y3
c3

w)

MY (r)

65

For 0 u1 < u2 and u, y > 0 we obtain

(1 + 1/ )
[G(1 + 1/, (cu2 ) ) G(1 + 1/, (cu1 ) )] ,
c
I(G(y)) = G(1 + 1/, (cy) ),
!
(1 + 1/ ) 1 G(1 + 1/, (cu) )
u.
e(u) =
c
exp{(cu) }

NL

no

tes

(m

E[Y 1{u1 <Y u2 } ] =

Figure 3.10: Weibull distribution with mean Y = 1 and shape parameter


{1/4, 1/3, 1/2, 1}. lhs: density g(y); rhs: log-log plot.
(d)

For generating Weibull random numbers observe that we have the identity Y =
(d)
1 1/
Z
with Z expo(1) = (1, 1). The R code for the (1, 1) distribution is
c
> rgamma(n, shape=1, rate=1)
The method of moments estimators are given by
cbMM =

(1 + 1/bMM )
b n

and

bn2
(1 + 2/bMM )
+
1
=
.
b 2n
(1 + 1/bMM )2

Version April 14, 2016, M.V. Wthrich, ETH Zurich

66

Chapter 3. Individual Claim Size Modeling

The latter needs to be solved numerically in R:


> f <- function(x,a){lgamma(1+2/x)-2*lgamma(1+1/x)-log(a+1)}
> tau <- uniroot(f, c(0.001,1), tol=0.001, a=var(data)/mean(data)2)
For the MLE we need to solve the system of equations (numerically)
!1/

and

n
1X
log(cYi ) ((cYi ) 1) = 1.
n i=1

w)

c=

n
1X
Y
n i=1 i

no

tes

(m

Example 3.4 (Weibull distribution for PP data). We fit the PP insurance data
displayed in Figure 3.1. From Figures 3.11 and 3.12 we see that the Weibull model

Figure 3.11: Weibull distribution with MM and MLE fits applied to the PP insurance data. lhs: QQ plot; rhs: loss size index function.

NL

gives a better fit to the PP data compared to the gamma model. The reason is
that it allows for more probability mass in the upper tail of the distribution, the
estimate for is in the interval (0.5, 0.75). The MM estimators manage to model
the data up to some layer. The MLE estimators, however, are still distorted by
the big mass of small claims which can be seen in the mean excess plot in Figure
3.12 (rhs). Summarizing, we should choose even more heavy tailed distributions to
model this data, and we should carefully treat (and probably separate) small and
large claims.


3.2.3

Log-normal distribution

Making the tail of the distribution function heavier than the Weibull distribution
tail leads us to the log-normal distribution. The log-normal distribution has two
parameters, a mean parameter R and a standard deviation parameter >
Version April 14, 2016, M.V. Wthrich, ETH Zurich

67

(m

w)

Chapter 3. Individual Claim Size Modeling

Figure 3.12: Weibull distribution with MM and MLE fits applied to the PP insurance data. lhs: log-log plot; rhs: mean excess plot.

tes

0. We write Y LN(, 2 ). The log-normal distribution has the property that


log Y N (, 2 ). Therefore, almost every crucial property can be obtained from
normal distributions. The distribution of Y has positive support R+ with density
for y 0 given by
(

no

1
1 (log y )2
1
exp
.
g(y) =
2
2
2 y

NL

For y 0 we have distribution function

log y
G(y) =
,

with () denoting the standard Gaussian distribution function. The family of


log-normal distributions is closed towards multiplication with a positive constant,
that is, for > 0 we have
Y LN( + log , 2 ).
We have the following

Version April 14, 2016, M.V. Wthrich, ETH Zurich

68

Chapter 3. Individual Claim Size Modeling

does not exist for r > 0,


n

o

= exp + 2 /2 ,

Y2

= exp 2 + 2

Vco(Y ) =

exp{ 2 } 1 ,

1/2

exp{ 2 } 1
exp{ 2 } + 2



,
1/2

exp{ 2 } 1

For 0 u1 < u2 and u, y > 0 we obtain


"

> 0.

w)

MY (r)

log u(+ 2 )



log u

u.

NL

no

tes

e(u) = Y

!#

(m

log u2 ( + 2 )
log u1 ( + 2 )
E[Y 1{u1 <Y u2 } ] = Y

!
2
log y ( + )
I(G(y)) =
,

Figure 3.13: Log-normal distribution with mean Y = 1 and standard deviation


parameters {0.5, 1, 1.25, 1.5}. lhs: density g(y); rhs: log-log plot.
The log-normal distribution does not have a regularly varying survival function
at infinity, see Table 3.4.4 in Embrechts et al. [39]. Note that the log-normal
distribution is subexponential, see Example 1.4.7 in Embrechts et al. [39]. We will
come back to subexponentiality in Section 5.4.
For generating log-normal random numbers we simply choose standard Gaussian
random numbers Z and then set Y = exp{ + Z}.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

69

The method of moments estimators are given by


"

MM

bn2
+1
= log
b 2n

!#1/2

and

b MM = log b n (b MM )2 /2.

The MLE is given by


n
1X
=
log Yi
n i=1

and

(b MLE )2

n 
2
1X
=
log Yi b MLE .
n i=1

w)

b MLE

no

tes

(m

Example 3.5 (log-normal distribution for PP data). We fit the PP insurance data
displayed in Figure 3.1. In Figures 3.14 and 3.15 we present the results. We observe

Figure 3.14: Log-normal distribution with MM and MLE fits applied to the PP
insurance data. lhs: QQ plot; rhs: loss size index function.

NL

that the log-normal distribution gives quite a good fit. We give some comments on
the plots: The MM estimator looks convincing because the observations match the
lines quite well in the QQ plot. The only things that slightly disturb the picture
are the three largest observations, see QQ plot. It seems that they are less heavy
tailed then the log-normal distribution suggests. This is also the reason why the
empirical mean excess plot deviates from the log-normal distribution, see Figure
3.15 (rhs). A little bit puzzling is the bad performance of the MLE. The reason is
again that more than 50% of the claims are less than 1500. The MLE therefore is
very much based on these small observations and provides a good fit in that range of
observations but it gives a bad fit for larger claims. We conclude from this that the
PP data set should be modeled with different distributions in different layers. The
reason for this heterogeneity is that PP insurance contracts have different modules
such as theft, water damage, fire, etc. and it is recommended (if data allows) to
model each of these modules separately. This may also explain the abnormalities in
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

(m

w)

70

tes

Figure 3.15: Log-normal distribution with MM and MLE fits applied to the PP
insurance data. lhs: log-log plot; rhs: mean excess plot. Note that the small
hump in the empirical distribution is at CHF 3000 which is probably induced by
a maximal cover for a particular risk factor.

3.2.4

no

the log-log plot because these different modules, in general, have different maximal
covers.


Log-gamma distribution

NL

The log-gamma distribution is more heavy tailed than the log-normal distribution
and is obtained by assuming that log Y (, c) for positive parameters and c.
The density for y 1 is given by
c
g(y) =
(log y)1 y (c+1) ,
()

and the distribution function can be written as

G(y) = G(, c log y).

For the moment generating function and the moments we have

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

Y
Y2
Y

does not exist for r > 0,




c
for c > 1,
=
c 1


c
=
2Y
for c > 2,
c2



1
c
3
2
=
3Y Y Y
Y3
c3

For c > 1, 1 u1 < u2 and u, y > 1 we obtain

for c > 3.

w)

MY (r)

71

c
[G(, (c 1) log u2 ) G(, (c 1) log u1 )] ,
c1
I(G(y)) = G(, (c 1) log y),
!


1 G(, (c 1) log u)
c
e(u) =
u.
c1
1 G(, c log u)

NL

no

tes

(m

E[Y 1{u1 <Y u2 } ] =

Figure 3.16: Log-gamma distribution with mean Y = 2 and parameter c


{2, 3, 4, 8}. lhs: density g(y); rhs: log-log plot.
The log-gamma distribution has a regularly varying survival function at infinity
with tail index c > 0, see Table 3.4.2 in Embrechts et al. [39].
The method of moments estimators are given by
b MM

log b n
MM

b
c
log bcMM
1

and

log cbMM log(cbMM 2)


log(bn2 + b 2n )
=
,
log b n
log cbMM log(cbMM 1)

where the latter is solved numerically using, e.g., the R command uniroot().
The MLE is obtained analogously to the MLE for gamma observations by simply
replacing Yi by log Yi .
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

(m

w)

72

NL

no

tes

Figure 3.17: Log-gamma distribution with MM and MLE fits applied to the PP
insurance data. lhs: QQ plot; rhs: loss size index function.

Figure 3.18: Log-gamma distribution with MM and MLE fits applied to the PP
insurance data. lhs: log-log plot; rhs: mean excess plot.

Example 3.6 (log-gamma distribution for PP data). We fit the PP insurance


data displayed in Figure 3.1. From Figures 3.17 and 3.18 we conclude that the
log-gamma model provides the best fit to the data from the models considered so
far. As already commented on in the log-normal example, we see that probably
the only thing that does not entirely fit to the log-gamma distribution are the 3 or
4 largest claims which are less heavy tailed than the log-gamma distribution would
suggest. The tail index of regular variation is about cb = 5.8 in this example.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

3.2.5

73

Pareto distribution

w)

We have seen that large claims often need a special treatment.


Therefore, large claims are often modeled separately with either a Pareto or a generalized Pareto distribution. Here, we
concentrate on the Pareto distribution. The Pareto distribution is named after Vilfredo Federico Damaso Pareto
(1848-1923) who initially used this distribution to describe the
allocation of wealth. Sometimes the Pareto distribution is also
called power law distribution because of the power law decay of
its survival function.

V.F.D. Pareto

(m

The Pareto distribution specifies a (large claims) threshold > 0 and then only
models claims above this threshold, see also Example 2.16. The claims above this
threshold are assumed to have regularly varying tails with tail index > 0. For
Y Pareto(, ), the density for y is given by
 (+1)

tes

y
g(y) =

no

and distribution function can be written as


G(y) = 1

 

NL

We have closedness towards multiplication with a positive constant, that is, for
> 0 we have
Y Pareto(, ).

For the moment generating function and the moments we have

MY (r)
Y
Y2
Y

does not exist for r > 0,

=
for > 1,
1

= 2
for > 2,
( 1)2 ( 2)


2(1 + ) 2 1/2
=
for > 3.
3

Version April 14, 2016, M.V. Wthrich, ETH Zurich

(3.7)

74

Chapter 3. Individual Claim Size Modeling

For > 0, u1 < u2 and u, y > we obtain


"

u2 +1
u1 +1

E[Y 1{u1 <Y u2 } ] =


1

 +1
y
I(G(y)) = 1
for > 1,

1
e(u) =
u
for > 1,
1


for 6= 1,

tes

(m
w

and for = 1 we have E[Y 1{u1 <Y u2 } ] = log(u2 /u1 ).

no

Figure 3.19: Comparison of Pareto, log-gamma, log-normal, Weibull and gamma


distributions all having mean Y = 2 and variance Y2 = 20. lhs: densities g(y);
rhs: log-log plot.

NL

As soon as we only study tails of distributions we should use MLEs for parameter
estimation (the method of moments is not sufficiently robust against outliers).
Since the threshold has a natural meaning we only need to estimate . The MLE
is given by
!1
n
1X
MLE
b

=
log Yi log
.
n i=1
i.i.d.

Lemma 3.7. Assume Y1 , . . . , Yn Pareto(, ). We have


h

b MLE =
E

n1

and

b MLE =
Var

n2
2 .
2
(n 1) (n 2)

(d)

Proof. Choose Z expo() = (1, ). Then, eZ = Y Pareto(, ) (this can be seen


by a change of variables in the corresponding densities). This immediately implies that Zi =
i.i.d.
log Yi log expo(). The sum of these i.i.d. exponential random variables is gamma
distributed with parameters = n and c = . Using the scaling property (3.5) we conclude that
(b
MLE )1 (n, n) .

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

75

This implies for k < n


 MLE k 
E (b

)
=

Z
0

z k

(n)n n1 nz
(n k)
z
e
dz =
(n)k .
(n)
(n)

(3.8)
2

From this the claim follows.

H
b k,n

(m

w)

For the MLE of it was assumed that the threshold


is given in a natural way. If this threshold needs to be
detected from data, the Hill plot can be of help. For the
Hill method we refer to McNeil et al. [77], Section 7.2.4.
We order the claims accordingly to Y(1) Y(2) . . .
Y(n) . The Hill plot explores the stability of the MLEs
when successively dropping the smallest observations.
Therefore we define for k < n the Hill estimator by
n
X
1
log Y(i) log Y(k)
n k + 1 i=k

!1

The Hill estimator is based on the rationale that the Pareto distribution is closed
towards increasing thresholds, i.e. for Y Pareto(0 , ) and 1 > 0 we have for
all y 1
 

tes

y
0
 
1
0

P [ Y > y| Y 1 ] =

y
=
1


no

Therefore, if the data comes from a Pareto distribution we should observe stability
H
b k,n
for changing k. The confidence bounds of the Hill estimators are determined
in
by Lemma 3.7.

NL

Example 3.8 (Pareto for extremes of PP insurance). We start the analysis with
the PP insurance data.
To perform this large claims analysis we choose only the largest
H
b k,n
claims of Figure 3.1. The Hill plot k 7
is given in Figure
3.20 (together with confidence bounds of 1 standard deviation,
estimated by Lemma 3.7). We observe a fairly stable picture
in k around value = 2.5 up to the largest 100 claims. For
larger claims the Hill estimator disappears to 4 or 5 which
(once more) explains that the tail of the largest observations
is not really heavy tailed. This is similar to the log-normal
S.I. Resnick
and the log-gamma fit. Sidney Ira Resnick [86] has called
this phenomenon Hill horror plot and it stems from the difficulty that the Hill
estimator cannot correctly adjust non-Pareto like tails. The right-hand side of
Figure 3.20 gives the log-log plot for = 2.5, in accordance to the Hill plot we
see that the slope of the data is slightly less than this value for smaller claims,
but the data becomes less heavy tailed further out in the tails. This becomes also
obvious from the mean excess plot and the QQ plot in Figure 3.21.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

(m

w)

76

NL

no

tes

H
b k,n
with confidence bounds of
Figure 3.20: PP insurance data; lhs: Hill plot k 7
1 standard deviation; rhs: log-log plot for = 2.5.

Figure 3.21: PP insurance data largest claims only; lhs: QQ plot; rhs: mean excess
plot for = 2.5.
Example 3.9 (Pareto for extremes of CP insurance). In a second analysis we
examine the extremes of the CP claims data of Figure 3.2. The results are presented
in Figure 3.22. At the first sight they look similar to the PP insurance example,
i.e. they begin to destabilize between the 150 and 100 largest claims. However, the
main difference is that the tail index is much smaller in the CP example. That is,
there is a higher potential for large claims for this line of business.


Example 3.10 (nuclear power accident example). We revisit the nuclear power
accident data set studied in Hofert-Wthrich [60], see also Sovacool [94].
Version April 14, 2016, M.V. Wthrich, ETH Zurich

77

(m

w)

Chapter 3. Individual Claim Size Modeling

H
b k,n
with a confidence interval
Figure 3.22: CP insurance data; lhs: Hill plot k 7
of 1 standard deviation; rhs: log-log plot for = 1.4.

tes

In Figure 3.23 we plot all nuclear power accidents that


have occurred until the end of 2011 and which have a
claim size larger than 20 mio. USD (as of 2010). These
events include Three Mile Island (United States, 1979),
Chernobyl (Ukraine, 1986) and Fukushima (Japan, 2011).

no

Fukushima 2011
In Figure 3.24 we provide the Hill plot. We observe that
this data is very heavy tailed. The Hill plot suggests to set the tail index around
empirical distribution
1.0

24

scatter plot logged claim sizes nuclear power accidents

nuclear power accidents

17

10

20

0.0

18

0.2

20

0.4

empirical distribution

NL

21

19

claim sizes (log scale)

22

0.6

0.8

23

30

40

50

60

17

18

19

20

21

22

23

24

logged claim sizes

Figure 3.23: 61 largest nuclear power accidents until 2011; lhs: logged claim sizes
(in chronological order, the last entry is Fukushima); rhs: empirical distribution
function of claim sizes.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

78

Chapter 3. Individual Claim Size Modeling

loglog plot (with alpha = 0.64 for the 61 largest observations)

1.2

Hill plot of nuclear power accidents

0.4

61

51

41

31

21

11

w)

0.6

log (1distribution function)

0.8

Pareto parameter

1.0

Pareto distribution
observations
17

18

20

21

22

23

24

log (claim size)

(m

number of observations

19

H
b k,n
Figure 3.24: 61 largest nuclear power accidents until 2011; lhs: Hill plot k 7
with confidence bounds of 1 standard deviation; rhs: log-log plot for = 0.64.

tes

0.64, which means that we have an infinite mean model. The log-log plot in Figure
3.24 shows that this tail index choice captures the slope quite well.


no

Exercise 8. Natural hazards in Switzerland are covered by the so-called Schweizerische Elementarschaden-Pool (ES-Pool). This is a pool of private Swiss insurance
companies which organizes the diversification of natural hazards in Switzerland.

NL

For pricing of these natural hazards one distinguishes between small events and large events, the latter having a
total claim amount exceeding CHF 50 millions per event.
The following 15 storm and flood events have been observed in years 1986 2005 (these are the events with a
total claim amount exceeding CHF 50 millions).
Storm Lothar 26.12.1999
date
amount in CHF mio.
20.06.1986
52.8
18.08.1986
135.2
18.07.1987
55.9
23.08.1987
138.6
26.02.1990
122.9
21.08.1992
55.8
24.09.1993
368.2
08.10.1993
83.8

date
amount in CHF mio.
18.05.1994
78.5
18.02.1999
75.3
12.05.1999
178.3
26.12.1999
182.8
04.07.2000
54.4
13.10.2000
365.3
20.08.2005
1051.1

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

79

Fit a Pareto distribution with parameters = 50 and > 0 to the observed


claim sizes. Estimate parameter using the unbiased version of the MLE.
We introduce a maximal claims cover of M = 2 billions CHF per event,
i.e. the individual claims are given by Yi M = min{Yi , M }, see also Section
3.4.2. For the yearly claim amount of storm and flood events we assume
a compound Poisson distribution with Pareto claim sizes Yi . What is the
expected total yearly claim amount?

w)

What is the probability that we observe a storm and flood event next year
which exceeds the level of M = 2 billions CHF?

3.3

Model selection

(m

3.3.1

no

tes

In the previous section we have presented different distributions for claim size
modeling and we have debated which one fits best to the observed data. The argumentation was completely based on graphical tools like log-log plots. Graphical
tools are important, but in statistics there are also methodological tools that consider these questions from a more analytical point of view. Two commonly used
tests are the Kolmogorov-Smirnov (KS) test and the Anderson-Darling (AD) test.
These two tests are discussed in Sections 3.3.1 and 3.3.2.
In Section 3.3.3 we give the 2 -goodness-of-fit test and we discuss the Akaike
information criterion (AIC) as well as the Bayesian information criterion (BIC).

Kolmogorov-Smirnov test

NL

Andrey Nikolaevich Kolmogorov (1903-1987) was


the world leading probabilist, in 1933 he gave the modern axiomatic foundations of probability theory, his book
was called Grundbegriffe der Wahrscheinlichkeitsrechnung and has appeared in German, see [66]. Unfortunately, on Nikolai Vasilyevich Smirnov (19001966) less is known.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

A.N. Kolmogorov

80

Chapter 3. Individual Claim Size Modeling

w)

The KS test is a non-parametric test investigating whether


a particular continuous distribution function G0 fits to a
given sample Y1 , . . . , Yn . Therefore, one compares the emb of the sample and the distripirical distribution function G
n
bution function G0 . The argument is based on the GlivenkoCantelli theorem which says that the empirical distribution
function of an i.i.d. sample converges uniformly to the true
underlying distribution function, P-a.s., if the number n of
i.i.d. observations goes to infinity (this result does not require continuity of the distribution function), see Theorem
20.6 in Billingsley [13].

tes

(m

Assume we have an i.i.d. sequence Y1 , Y2 , . . . from an unknown continuous distribution function G and we denote the corresponding
b , see
empirical distribution function of finite sample size n by G
n
also (3.3). We would like to test whether these samples Y1 , Y2 , . . .
may stem from G0 .
Consider the null hypothesis H0 : G = G0 against the two-sided
alternative hypothesis that these distribution functions differ. We
define the KS test statistics by

b G
Dn = Dn (Y1 , . . . , Yn ) = G
n
0

N.V. Smirnov

b (y) G (y) .
= sup G
n
0
y

no

The KS test statistics has the property, see (13.4) in Billingsley [12],

nDn Kolmogorov distribution K

as n .

NL

The Kolmogorov distribution K is for y R+ given by


K(y) = 1 2

(1)j+1 exp 2j 2 y 2 .

j=1

The null hypothesis H0 is rejected on the significance level q (0, 1) if


Dn > n1/2 K (1 q),

where K (1 q) denotes the (1 q)-quantile of the Kolmogorov distribution K.


q
K (1 q)

20%
1.07

10% 5% 2% 1%
1.22 1.36 1.52 1.63

Version April 14, 2016, M.V. Wthrich, ETH Zurich

81

(m

w)

Chapter 3. Individual Claim Size Modeling

Figure 3.24: KS test statistics for method of moments and MLE fits applied to the
PP insurance data; lhs: log-normal distribution; rhs: log-gamma distribution.

NL

no

tes

Example 3.11 (KS test, PP insurance data). We apply the KS test to the lognormal and the log-gamma fits of the PP insurance data, see Examples 3.5 and 3.6.
In the log-normal case we obtain for the MLE fit Dn = 0.05 and for the methods of
moment fit Dn = 0.12. These values are far too large compared to the large sample
size of n = 610 053 and the KS test clearly rejects the null hypothesis of having a
log-normal distribution on the 1% significance level. If we look at Figure 3.24 (lhs)
we see that these big values of the KS test statistics are driven by small claims,
i.e. we obtain a bad fit for small claims, the tails however do not look too badly.
The log-gamma fit looks better than the log-normal fit, see Figure 3.24 (note that
the y-axes have different scales in the two plots). It provides KS test statistics
Dn = 0.04 for the MLE fit and Dn = 0.06 for the method of moments fit. These
values are still far too large to not reject H0 on the 1% significance level.
Conclusion. The claim size modeling should be split into different claim size layers.

Example 3.12 (KS test, tail distribution). In this example we investigate the tail
fits of the Pareto distributions in the CP and the PP examples for the n = 505
largest claims, see Examples 3.8 and 3.9. The results are presented in Figure 3.25.
For the PP insurance data we obtain Dn = 0.027 (for = 2.5) and for the CP
insurance data we receive Dn = 0.061 (for = 1.4). The first value is sufficiently
small so that the null hypothesis cannot be rejected on the 5% significance level,
the CP insurance value reflects just about the critical value on the 5% significance
level, i.e. the resulting p-value is just about 5%. The plot of the point-wise terms of
Dn looks fine for the PP insurance data, however, the graph for the CP insurance
data looks a bit one-sided, suggesting two different regimes (this can also seen from
Figure 3.22).

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

(m

w)

82

Figure 3.25: Point-wise terms of KS test statistics for MLE fits applied to the 505
largest claims; lhs: PP insurance data; rhs: CP insurance data.

3.3.2

Anderson-Darling test

tes

The advantage of the (non-parametric) KS test is that it can be applied to any


situation of continuous distribution functions. The drawback of this large generality, of course, is that it is often not very powerful and, especially, not very good in
detecting particular properties such as tail behavior.

NL

no

The two statisticians Theodore Wilbur Anderson and Donald Allan Darling have developed a modification of the KS
test, the so-called AD test, which gives more weight to the tail
of the distributions. It is therefore more sensitive in detecting
tail fits, but on the other hand it has the disadvantage of not
being non-parametric, and critical values need to be calculated
for every chosen distribution function.
The KS test statistics is modified by the introduction of a weight
T.W. Anderson
function : [0, 1] R+ which then modifies the KS test statistics Dn as follows

q
b
sup Gn (y) G0 (y) (G0 (y)).
y

Different choices of allow to weight different regions of the support of the distribution function differently, the KS test statistics
is obtained by 1. The choice proposed by Anderson and Darling is (t) = (t(1 t))1 in order to investigate the tails of the
distributions.
D.A. Darling
In contrast to the maximal difference between the empirical distribution function
b and the null hypothesis G we could also consider a weighted L2 -distance. This
G
n
0
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

83

leads to the Anderson-Darling modification of the Cramr-von Mises test. The AD


test statistics for (t) = (t(1 t))1 is obtained from


A2n = n

Z
R

2

b (y) G (y)
G
n
0

G0 (y)(1 G0 (y))

dG0 (y).

3.3.3

w)

Anderson-Darling have explicitly identified the asymptotic behavior of An as n


by determining the limiting characteristic function. We do not further elaborate
on this but refer to the literature in statistics.

Goodness-of-fit and information criteria

K. Pearson

(3.9)

no

tes

(m

There are many other criteria that can be applied for testing fits
and distributional choices. Many of them are based on asymptotic normality. For instance, a 2 -goodness-of-fit test splits the
support of the null hypothesis distribution function G0 into K
disjoint intervals Ik = [ck , ck+1 ), k = 1, . . . , K. Then, data is
grouped according to these intervals, i.e. Ok counts the number
of observed realizations Y1 , . . . , Yn in interval Ik and Ek denotes
the expected number of observations in Ik according to the distribution function G0 . The test statistics of n observations is
then defined by
K
X
(Ok Ek )2
2
.
Xn,K =
Ek
k=1

NL

2
If d parameters were estimated in G0 , then Xn,K
is compared to a 2 -distribution
with K 1 d degrees of freedom, see also Exercise 2 on page 22. Often it is
suggested that we should have Ek > 4 for reasonable results. However, these
rules-of-thumbs are not very reliable.
This 2 -goodness-of-fit test is sometimes also called Pearsons -square test, named
after Karl Pearson (1857-1936) who has investigated this test in 1900.

Within the framework of MLE methods the Hirotugu


Akaike (1927-2009) information criterion (AIC) and the
Bayesian information criterion (BIC) are often used, we
refer to Akaike [2] and Section 2.2 in Congdon [28]. These
criteria are used to compare different distribution functions and densities. Assume we want to compare two
different densities g1 and g2 that where fitted to (i.i.d.)
data Y = (Y1 , . . . , Yn )0 . The AIC is defined by
(i)

AIC(i) = 2`Y + 2d(i) ,

Version April 14, 2016, M.V. Wthrich, ETH Zurich

H. Akaike

84

Chapter 3. Individual Claim Size Modeling


(i)

where `Y is the log-likelihood function of density gi for data Y and d(i) denotes
the number of estimated parameters in gi , for i = 1, 2. For MLE we maximize
(i)
`Y and in order to evaluate the AIC we penalize the model for having too many
parameters. The AIC then says that the model with the smallest AIC value should
be preferred.
The BIC uses a different penalty term for the number of parameters (all these
penalty terms are motivated by asymptotic results). It reads as
(i)

w)

BIC(i) = 2`Y + log(n) d(i) ,

no

tes

(m

and the model with the smallest BIC value should be preferred.

Figure 3.26: Akaikes original hand notes on the AIC (lhs) at the Institute of
Statistical Mathematics in Tokyo, Japan (rhs).

NL

Exercise 9 (AIC and BIC). Assume we have i.i.d. claim sizes Y = (Y1 , . . . , Yn )0
with n = 1000 which were generated by a gamma distribution, see Figure 3.27.
The sample mean and sample standard deviation are given by
b n = 0.1039

and

bn = 0.1050.

If we fit the parameters of the gamma distribution we obtain the method of moments estimators and the MLEs
b MM = 0.9794

and

cbMM = 9.4249,

b MLE = 1.0013

and

cbMLE = 9.6360.

This provides the fitted distributions displayed in Figure 3.28. The fits look perfect
and the corresponding log-likelihoods are given by
`Y (b MM , cbMM ) = 1264.013

and

`Y (b MLE , cbMLE ) = 1264.171.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

85

(m

w)

Chapter 3. Individual Claim Size Modeling

NL

no

tes

Figure 3.27: I.i.d. claim sizes Y = (Y1 , . . . , Yn )0 with n = 1000; lhs: observed data;
rhs: empirical distribution function.

Figure 3.28: Fitted gamma distributions; lhs: log-log plot; rhs: QQ plot.
(a) Why is `Y (b MLE , cbMLE ) > `Y (b MM , cbMM ) and which fit should be preferred
according to AIC?
(b) The estimates of are very close to 1 and we could also use an exponential
distribution function. For the exponential distribution function we obtain
MLE cbMLE = 9.6231 and `Y (cbMLE ) = 1264.169. Which model (gamma or
exponential) should be preferred according to the AIC and the BIC?

Version April 14, 2016, M.V. Wthrich, ETH Zurich

86

Chapter 3. Individual Claim Size Modeling

3.4

Calculating within layers for claim sizes

3.4.1

w)

In the previous sections we have experienced that it is difficult to fit one parametric
distribution function to the entire range of possible outcomes of the claim sizes.
Therefore, we often consider claim sizes in different layers. Another reason why
different layers of claim sizes are of interest is that re-insurance can often be bought
for different claims layers. For these reasons we would like to understand how claim
sizes behave in different layers. First we discuss the modeling issue and second we
describe modeling of re-insurance layers.

Claim size modeling using layers

(m

We come back to the issue that the KS test rejects the most popular parametric
i.i.d.
fits, see Example 3.11. We assume that Y1 , Y2 , . . . G and we would like to split
G into different layers. The simplest case is to choose two layers, see Example 2.16,
that is, choose a large claims threshold M > 0 such that G(M ) (0, 1), i.e. G(M )
is bounded away from zero and one. We then define the disjoint decomposition
and

{Y1 > M } .

tes

{Y1 M }

Assume that S CompPoi(v, G). We consider the total claim Ssc in the small
claims layer and the total claim Slc in the large claims layer given by

i=1

Yi 1{Yi M }

and

no

Ssc =

N
X

Slc =

N
X

Yi 1{Yi >M } .

i=1

Theorem 2.14 implies that Ssc and Slc are independent and compound Poisson
distributed with

and

NL

Ssc CompPoi (sc v = G(M )v , Gsc (y) = P [Y1 y|Y1 M ]) ,

Slc CompPoi (lc v = (1 G(M ))v , Glc (y) = P [Y1 y|Y1 > M ]) .
Thus, we can model large claims and small claims separately (independently).
Observe that we have the following decomposition
G(y) = P [ Y1 y| Y1 M ] G(M ) + P [ Y1 y| Y1 > M ] (1 G(M ))
= Gsc (y)G(M ) + Glc (y)(1 G(M )).
Often a successful modeling approach involves 3 steps:

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

87

1. Choose threshold M > 0 sufficiently large so that many of the observations


fall into the lower layer (0, M ]. In this lower layer one either fits a parametric
distribution function to the data or one directly works with the empirical
distribution function (due to the Glivenko-Cantelli theorem). If a distribution function is fitted one should ensure that this distribution function has
compact support (0, M ], for instance, by choosing a truncated gamma distribution.

w)

2. Estimate probability G(M ) of the event {Y1 M } which is typically large.

(m

3. Fit a Pareto distribution to Glc for threshold = M , i.e. estimate the tail
index > 0 from the observations exceeding this threshold M .

NL

no

tes

Example 3.13. We revisit the PP and the CP insurance data set. We choose

Figure 3.29: Empirical fit in small claims layer and Pareto distribution fit in large
claims layer, the gray lines show the large claims threshold; lhs: PP insurance data;
rhs: CP insurance data.
large claims threshold M = 500 000 in both cases. In the PP insurance data set we
b
have 237 observations above this threshold, which provides estimate 1 G(M
)=
0
237/61 053 = 0.39%. For the CP insurance example we have 272 claims above
b
this threshold, which provides estimate 1 G(M
) = 1.87%. Next we calculate
the sample mean and the sample coefficient of variation in the small claims layer
{Yi M }:
PP :

b {Yi M } = 20 805,

d
Vco
{Yi M } = 1.80,

CP :

b {Yi M } = 40 377,

d
Vco
{Yi M } = 1.51.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

88

Chapter 3. Individual Claim Size Modeling

3.4.2

(m

w)

These should be compared to (3.1)-(3.2). We observe a substantial reduction of


the sample coefficient of variation in the small claims layer compared to the entire
range of possible outcomes. This is not surprising because large claims drive the
coefficient of variation. For CP insurance we also see that the sample mean in the
lower claims layer is substantially reduced. This is due to the fact that 1.87% of
claims exceed the threshold M = 500 000 and these claims may get very large and
drive the mean, see also loss size index function in Figure 3.5.
Finally, we fit distribution function G to the data. We choose the empirical distribution functions below the threshold M and Pareto distributions for the tail fit
in the large claims layer, having tail parameters as estimated in Examples 3.8
and 3.9 (this is also supported by the KS tests, see Example 3.12). The results are
presented in Figure 3.29. For PP insurance they look convincing, whereas the CP
insurance fit is not entirely satisfactory in the large claims layer (which might ask
for a bigger large claims threshold M and a slightly bigger tail parameter ). 

Re-insurance layers and deductibles

tes

Above we have calculated expected values in claims layers E[Y 1{u1 <Y u2 } ] for various parametric distribution functions. This is of interest for several reasons. This
we are going to discuss next.

NL

no

(i) The first reason is that insurance contracts often have deductibles. On the one
hand small claims often cause too much administrative costs, and on the other
hand deductibles are also an instrument to prevent from fraud (moral hazard). For
instance, it can become quite expensive for an insurance company if every insured
claims that his umbrella got stolen. Therefore, a deductible d > 0 of, say, 200
CHF is introduced and the insurance company only covers the claim (Y d)+ that
exceeds this deductible d. In this case the pure risk premium for claim Y G is
given by
E [(Y d)+ ] =

Z
d

(y d) dG(y) = E[Y 1{Y >d} ] d P[Y > d]

(3.10)

= P[Y > d] (E[Y |Y > d] d) = P[Y > d] e(d),

under the assumption that P[Y > d] > 0 and that the mean excess function e() of
Y exists.
Remark. Fitting a distribution function to claims data (Y d)+ needs some
care. If the original claims Y G (absolutely continuous with density g), then the
density after deductible is for y d given by
gd (y) =

g(y)
.
1 G(d)

Thus, MLE of parameters becomes more involved.


Version April 14, 2016, M.V. Wthrich, ETH Zurich

(3.11)

Chapter 3. Individual Claim Size Modeling

89

(ii) The second reason is that the insurance company may have a maximal insurance
cover per claim, i.e. it covers claims only up to a maximal size of M > 0 and the
exceedances need to be paid by the insured; or, similarly, it may cover claims
exceeding M but has a re-insurance cover for these exceedances. In that case the
insurance company covers (Y M ) and the pure risk premium for this (bounded)
claim is given by
y dG(y) + M P[Y > M ] = E[Y 1{Y M } ] + M P[Y > M ]

(m
w

= E[Y ] E[Y 1{Y >M } ] M P[Y > M ]

E [Y M ] =

Z M

= E[Y ] P[Y > M ] e(M ) = E[Y ] E [(Y M )+ ] .

tes

If we combine deductibles with maximal covers we obtain excess-of-loss (XL) (re-)


insurance treaties. Assume we have a deductible u1 > 0 (in re-insurance terminology this also called priority or retention). Then the insurance treaty u2 xs u1
covers the claims layer (u1 , u1 + u2 ], that is, this contract covers a maximal excess
of u2 above the priority u1 . The pure risk premium for such contracts is then given
by
E[((Y u1 )+ ) u2 ] = E[(Y u1 )+ ] E[(Y u1 u2 )+ ].

no

An issue, when dealing with layers, is claims inflation. Assume we sell insurance
contracts with a deductible d > 0 and we ask for a pure risk premium E [(Y d)+ ].
Since cash flows have time values this premium has to be revised carefully for later
periods as the following theorem shows.

NL

Theorem 3.14 (leverage effect of claims inflation). Choose a fixed deductible d >
0 and assume that the claim at time 0 is given by Y0 . Assume that there is a
deterministic inflation index i > 0 such that the claim at time 1 can be represented
(d)
by Y1 = (1 + i)Y0 . We have
E[(Y1 d)+ ] (1 + i) E[(Y0 d)+ ].

Proof. We calculate the pure risk premium


Z
E[(Y1 d)+ ]

=
=
=

P[(Y1 d)+ > y] dy =


P[Y1 > y + d] dy

Z0
Z 0
x
P[Y1 > x] dx =
P Y0 >
dx
1+i
d
d
Z
P [Y0 > y] dy,
(1 + i)
d
1+i

Version April 14, 2016, M.V. Wthrich, ETH Zurich

90

Chapter 3. Individual Claim Size Modeling

where we have twice applied a change of variables. The latter is calculated as follows
!
Z d
Z
E[(Y1 d)+ ] = (1 + i)
P [Y0 > y] dy +
P[Y0 > y] dy
d
1+i

Z
=

(1 + i)
d
1+i

P [Y0 > y] dy + (1 + i) E[(Y0 d)+ ].


2

This proves the claim.

E [(Y0 d)+ ] =

w)

Example 3.15 (leverage effect of claims inflation). Assume that Y0 Pareto(, )


with > 1 and choose a deductible d > . In that case we have, see (3.10),
1
d.
1

(d)

(m

Choose inflation index i > 0 such that (1 + i) < d. From (3.7) we obtain
Y1 = (1 + i)Y0 Pareto((1 + i), ).
This provides for > 1 and i > 0

1
d
1

1
d > (1 + i) E [(Y0 d)+ ] .
1

tes

E [(Y1 d)+ ] =

d
(1 + i)

= (1 + i)

no

Observe that we obtain a strict inequality, i.e. the pure risk premium grows faster
than the claim sizes itself. The reason for this faster growth is that claims Y0 d
may entitle for claims payments after claims inflation adjustments, i.e. not only the
claim sizes are growing under inflation but also the number of claims is growing if
one does not adapt the deductible to inflation.


NL

Exercise 10. In Figure 3.30 we display the distribution function of loss Y G and
the distribution function of the loss after applying different re-insurance covers to
Y . Can you explicitly determine the re-insurance covers from the graphs in Figure
3.30.

Exercise 11. Assume claims sizes Yi in a given line of business can be described
by a log-normal distribution with mean E[Yi ] = 30 000 and Vco(Yi ) = 4.
Up to now the insurance company was not offering contracts with deductibles. Now
it wants to offer the following three deductible versions d = 200, 500, 10 000. Answer
the following questions:
1. How does the claims frequency change by the introduction of deductibles?
2. How does the expected claim size change by the introduction of deductibles?
3. By which amount changes the expected total claim amount?


Version April 14, 2016, M.V. Wthrich, ETH Zurich

91

no

tes

(m

w)

Chapter 3. Individual Claim Size Modeling

NL

Figure 3.30: Distribution functions implied by re-insurance contracts.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 3. Individual Claim Size Modeling

NL

no

tes

(m

w)

92

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 4

(m

w)

Approximations for Compound


Distributions

no

tes

In Chapter 2 we have introduced claims count distributions for the modeling of the
number of claims N within a fixed time period. In Chapter 3 we have met several
claim size distribution functions G for the modeling of the claim sizes Y1 , Y2 , . . ..
Ultimately, we always need to calculate the compound distribution of S, see Definition 2.1. As explained in Proposition 2.2, we can easily calculate the moments
and the moment generating function of this compound distribution. On the other
hand the distribution function of S given in (2.1) is a notoriously difficult object
because it involves (too) many convolutions of the claim size distribution function
G. The aim here is to explain how we can circumvent this difficulty.

NL

The most commonly used practice in the insurance industry


to overcome this problem is to apply Monte Carlo simulations
and then consider the resulting empirical distribution function
as a sufficiently good approximation to the true distribution
function. This approach is based on the Glivenko-Cantelli theorem, see Billingsley [13], Chapter 20. Though this is a feasible way we do not recommend it. The issue is that it is
often unclear to asses what sufficiently good means, i.e. the
rates of convergence of the Monte Carlo samples may be very
poor which results in a lot of simulations. This is especially true for heavy tailed
distribution functions of regularly varying type (3.4). Therefore, we recommend
approximations, the Panjer algorithm and fast Fourier transforms (FFT) which are
often more efficient.

4.1

Approximations

In many situations approximations to S are used. These may be justified by the


central limit theorem (CLT) if the expected number of claims is large. Compound
93

94

Chapter 4. Approximations for Compound Distributions

4.1.1

Normal approximation

(m
w

distributions may have two different risk drivers in the tail of the distribution function, namely, the number of claims N may contribute to large values of S or single
large claims in Y1 , . . . , YN may drive extreme values in S. Let us concentrate on
the compound Poisson model, in particular, we would like to use the decomposition theorem in the spirit of Example 2.16. In this case, mostly the claim sizes Yi
contribute to the tail of the distribution (if these are heavy tailed). Therefore, we
emphasize that in the light of the compound Poisson model one should separate
small from large claims resulting in the independent decomposition S = Ssc + Slc .
Next, if the expected number of small claims vsc is large, Ssc can be approximated
by a parametric distribution function and Slc should be modeled explicitly. This
we are going to describe in detail in the remainder of this chapter.

tes

The normal approximation is motivated by the CLT which goes


back to de Moivre (1733) and Laplace (1812), see (1.2). It
was then Aleksandr Mikhailovich Lyapunov (1857-1918)
who stated it in the general version and who discovered the
importance of the CLT.

no

The classical CLT holds for a fixed number of claims. In our


approach the number of claims is not fixed, therefore we need
A.M. Lyapunov
a refinement of the CLT. We do this for a Poissonian number
of claims N by keeping the expected claims frequency fixed and by sending the
volume v .
Theorem 4.1. Assume S CompPoi(v, G) with G having a finite second moment. We have

NL

S vE[Y1 ]
q

vE[Y12 ]

N (0, 1)

as

v .

Observe that we consider a special class of distribution functions G having finite


second moment. As long as we work in the set-up of Ssc this is not a restriction
because claim sizes are bounded by the large claims threshold M and therefore
have finite variance.
Proof. Observe that it is sufficient to consider v N because intermediate volumes v allow for
approximations bvc and dve (and the approximation error is asymptotically negligible). Thus, we
choose v N. Disjoint decomposition Theorem 2.14 then provides
S=

N
X
i=1

(d)

Yi =

N`
v X
X
`=1 i=1

(`)

Yi

v
X

S` ,

`=1

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 4. Approximations for Compound Distributions

95

i.i.d.

where S` CompPoi(, G). The first two moments of these compound Poisson distributions
are given by E[S1 ] = E[Y1 ] and Var(S1 ) = E[Y12 ]. Therefore, the assumptions of the CLT are
fulfilled and the claim follows from (1.2).
2

Theorem 4.1 is the motivation for the following approximation of the distribution
function of S

vE[Y12 ]

x vE[Y1 ]

vE[Y12 ]

x vE[Y1 ]

vE[Y12 ]

(4.1)

w)

P [S x] = P

S vE[Y1 ]

tes

(m

where denotes the standard Gaussian distribution function. This approximation


works well when v is large and if the claim sizes Yi do not have heavy tailed
distribution functions G. Otherwise it under-estimates the true potential of large
outcomes of S (because Theorem 4.1 provides a good approximation solely around
the mean of S, in particular, because the Gaussian distribution has zero skewness).
For rates of convergences we refer to the literature, see for instance Embrechts et
al. [39].
Note that the normal approximation (4.1) also allows for negative claims S, which
under our model assumptions is excluded, thus, it is really an approximation that
needs to be considered carefully.

no

Example 4.2 (Normal approximation for PP insurance). We revisit the PP insurance data of Example 3.13. We consider 3 different examples:

NL

(a) Only small claims: in this example we only consider the claim size distribution
function G(y) = P [Y y|Y M ], i.e. the claims are compactly supported
in (0, M ]. As explicit claim size distribution function we choose the empirical
distribution of Example 3.13, see Figure 3.29 (lhs), with M = 500 000. We
choose portfolio size v such that v = 100.
(b) Claim size distribution function G is chosen as in (a), but this time we choose
portfolio size v such that v = 1000.
(c) In addition to (b) we add the large claims layer modeled by a Pareto distribution with = M = 500 000 and = 2.5 and for the expected number of
large claims we set lc v = 3.9.
For simplicity the true distribution function is evaluated by Monte Carlo simulation, which contradicts our statement above, but is appropriate for sufficiently
large samples (and sufficient patience). We choose 100000 simulations, this will
be further illustrated in Example 4.11 below.
In Figure 4.1 we present the results of the normal approximation (4.1) in case (a).
We observe an appropriately good fit around the mean but the normal approximation clearly under-estimates the tails of the true distribution function, see log-log
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 4. Approximations for Compound Distributions

(m

w)

96

Figure 4.1: Compound Poisson distribution of S and normal approximation (4.1)


in case (a), i.e. no large claims, expected number of claims 100; lhs: distribution
function; rhs: log-log plot.

NL

no

tes

plot in Figure 4.1 (rhs). Moreover, the true distribution function has positive skewness S = 0.43 whereas the normal approximation has zeroqskewness. In the normal
approximation we obtain probability mass (vE[Y1 ]/ vE[Y12 ]) = 6 107 for
a negative total claim amount (which is fairly small).

Figure 4.2: Compound Poisson distribution of S and normal approximation (4.1)


in case (b), i.e. no large claims, expected number of claims 1000; lhs: distribution
function; rhs: log-log plot.
In Figure 4.2 we show situation (b) which is the same as situation (a) the only
difference is that we enlarge the portfolio size by a factor 10. We see better approximation properties due to the fact that we have convergence in distribution for
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 4. Approximations for Compound Distributions

97

(m

w)

portfolio size v . We observe a lower skewness S = 0.15 which improves the


normal approximation, also in the tails.

tes

Figure 4.3: Compound Poisson distribution of S and normal approximation (4.1)


in case (c), i.e. with large claims, total expected number of claims 1003.9; lhs:
distribution function; rhs: log-log plot.

4.1.2

no

Finally, in Figure 4.3 we also include large claims (in contrast to Figure 4.2) having
an expected number of large claims of 3.9 and a Pareto tail parameter of = 2.5.
We see that in this case the normal approximation is useless in the tail, which
strongly favors the large claims separation as suggested in Example 2.16.


Translated gamma and log-normal approximations

NL

In Example 4.2 we have seen that the normal approximation can be useful for large
portfolio sizes v and under the exclusion of large claims. For small portfolio sizes
the approximation may be bad because the true distribution often has substantial
positive skewness. This leads to the idea of approximating the small claims layer
by other distribution functions that enjoy positive skewness.
We choose k R and define the (translated or shifted) random variables
X = k + Z,

where Z (, c)

or

Z LN(, 2 ).

We have in the translated gamma case


E[X] = k + /c,

Var(X) = /c2

and

X = 2 1/2 > 0,

and in the translated log-normal case


E[X] = k + exp{ + 2 /2},
Var(X) = exp{2 + 2 }(exp{ 2 } 1),
2

X = (e + 2)(exp{ 2 } 1)1/2 > 0.


Version April 14, 2016, M.V. Wthrich, ETH Zurich

98

Chapter 4. Approximations for Compound Distributions

The idea now is to do a fit of moments between S and X. Assume that S has finite
third moment and then we choose
where Z (, c)

X = k + Z,

or

Z LN(, 2 ),

such that the three parameters of X fulfill


E[X] = E[S],

Var(X) = Var(S)

and

X = S ,

(4.2)

w)

and then this fitted random variable X is chosen as an approximation to S.


Exercise 12. Assume that S has a compound Poisson distribution with expected
number of claims v > 0 and claim size distribution G having finite third moment.

v E[Y1 ] = k + /c,

(m

1. Prove that the fit of moments approximation (4.2) for a translated gamma
distribution for X provides the following system of equations
v E[Y12 ] = /c2

and

E[Y13 ]
= 2 1/2 .
(v)1/2 E[Y12 ]3/2

tes

2. Solve this system of equations for k R, > 0 and c > 0 and prove that it
has a well-defined solution for G(0) = 0.
3. Why should this approximation not be applied to case (c) of Example 4.2?

no

NL

Example 4.3 (Translated gamma and log-normal approximations). We revisit


cases (a) and (b) of Example 4.2, that is, we only consider the small claims layer
and we would like to approximate the compound Poisson distribution in this small
claims layer by translated gamma and log-normal distributions.
The approximations for expected number of claims v = 100, i.e. case (a), are
presented in Figure 4.4 and the ones for expected number of claims v = 1000,
i.e. case (b), in Figure 4.5. In both cases we see that the translated gamma and lognormal approximations provide remarkably good fits. For this reason, the small
claims layer is often approximated by one of these two parametric distribution
functions.
Observe that for k > v we have a Chernoff type bound of (Stirlings formula
provides asymptotic behavior k! = O(exp{k log(k/e)}) as k )
P [N k] exp {k log k v + k log(ev)} .
This explains that the compound Poisson distribution with bounded claim sizes
Yi M is less heavy tailed compared to the translated gamma and log-normal
distributions.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

99

(m

w)

Chapter 4. Approximations for Compound Distributions

NL

no

tes

Figure 4.4: Compound Poisson distribution of S and normal approximation (4.1),


translated gamma and log-normal approximations (4.2) in case (a), i.e. no large
claims, expected number of claims 100; lhs: distribution function; rhs: log-log plot.

Figure 4.5: Compound Poisson distribution of S and normal approximation (4.1),


translated gamma and log-normal approximations (4.2) in case (b), i.e. no large
claims, expected number of claims 1000; lhs: distribution function; rhs: log-log
plot.

The KS test rejects the null hypothesis on the 5% significance level for the normal
approximation in both cases (a) and (b), whereas this is not the case for the
translated gamma and log-normal approximations in both cases (a) and (b), the
p-values are clearly bigger than 5%; for the exact p-values we refer to Table 4.1,
below. In case (a) the translated gamma approximation is favored, in case (b)
the translated log-normal approximation (though the differences in the latter are
Version April 14, 2016, M.V. Wthrich, ETH Zurich

100

Chapter 4. Approximations for Compound Distributions

negligible).

4.1.3

Edgeworth approximation

(m

w)

The Edgeworth approximation is named after Francis Ysidro


Edgeworth (1845-1926). The approximations presented in the
previous section were rather ad-hoc. We have just chosen a (simple) distribution function that enjoys skewness and then we have
done a fit of moments (with no further argument on the shape of
the approximating distribution function). The Edgeworth approximation starts from the CLT and then tries to adjust higher
order terms in approximation (4.1) by the evaluation of moment F.Y. Edgeworth
generating functions in terms of Taylor expansions.
Assume S is compound Poisson distributed with claim size distribution G having a positive radius of convergence 0 > 0. As in Theorem 4.1 we consider the
normalized random variable

tes

S vE[Y1 ]
Z= q
.
vE[Y12 ]

no

We have E[Z] = 0, Var(Z) = 1 and Z = S . The aim now is to approximate


the moment generating function of Z by comparable terms coming from normal
distributions and argue with Lemma 1.4. Therefore, we first consider the following
Taylor expansion around the origin, choose n 3,
log MZ (r) =

n dk
X
k
dr

k=0

log MZ (r)|r=0 k
r + o(rn )
k!

as r 0.

NL

d
We set ak = dr
k log MZ (r)|r=0 /k! and note that we have a0 = log MZ (0) = 0,
a1 = E[Z] = 0 and a2 = Var(Z)/2! = 1/2. This provides approximation
n
1 2 X
MZ (r) exp
r +
ak r k
2
k=3


n
X
1 2
= exp r exp
ak rk .
2
k=3
(

Using a second Taylor expansion for ex = 1 + x + x2 /2! + . . . applied to the latter


exponential function in the last expression, the moment generating function of Z
is approximated by

MZ (r) er

2 /2

1 +

n
X
k=3

P

ak r k +

n
k=3

ak r k

2!

2

+ . . . .

Depending on the required precision as r 0 we can choose more terms in the


bracket (highlighted by + . . .) and we can take more terms in the summation
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 4. Approximations for Compound Distributions

101

reflected by the upper index n in the summation. Thus, for appropriate constants
bk R we get the approximation (for small r)

MZ (r) e

r2 /2

1 + a3 r 3

bk r k .

(4.3)

k4

Lemma 4.4. Let denote the standard Gaussian distribution function and (k)
its k-th derivative. For k N0 and r R
k r2 /2

r e

= (1)

erx (k+1) (x) dx.

w)

(m

Proof. The proof goes by induction. Choose k = 0, then


Z
Z
2
2
1
rx 0
e (x) dx =
erx ex /2 dx = MX (r) = er /2 ,
2

which is the moment generating function of X N (0, 1).


Induction step k k + 1. Using integration by parts we have
Z
Z
h
i
k+1
rx (k+2)
k+1 rx (k+1)
k+1
(1)
e
(x) dx = (1)
e
(x)
(1)

rerx (k+1) (x) dx.

tes

Note that the first term on the right-hand side is equal to zero because (k+1) (x) goes faster to
zero than erx may possibly converge to infinity. This and the induction assumption for k provides
identity
Z
Z
2
k+1
rx (k+2)
k
(1)
e
(x) dx = r (1)
erx (k+1) (x) dx = r rk er /2 ,

no

which proves the claim.

Lemma 4.4 allows to rewrite approximation (4.3) for small r as follows, set X
N (0, 1),
h

MZ (r) E erX a3

erx (4) (x) dx +

bk (1)k

k4

erx (k+1) (x) dx

NL
=

erx 0 (x) a3 (4) (x) +

bk (1)k (k+1) (x) dx.

k4

Assume that Z has distribution function denoted by FZ , then the latter suggests
approximation, see Lemmas 1.2, 1.3 and 1.4,

dFZ (z) 0 (z) a3 (4) (z) +

bk (1)k (k+1) (z) dz.

k4

Integration provides Edgeworth approximation, set x =

def.

P [S x] = FZ (z) EW(z) = (z) a3 (3) (z) +

vE[Y12 ] z + vE[Y1 ],

X
k4

Version April 14, 2016, M.V. Wthrich, ETH Zurich

bk (1)k (k) (z). (4.4)

102

Chapter 4. Approximations for Compound Distributions

This formula provides the refinement of the normal approximation (4.1), namely we
correct the first order approximation by higher order terms involving skewness
and other higher order terms reflected by a3 and bk in (4.4). The Edgeworth
approximation (4.4) is elegant but its use requires some care as we are just going
to highlight.

1
2
0 (z) = ez /2 ,
2

w)

We first consider the derivatives (k) for k 1. The first derivative is given by

and the higher order derivatives for k 2 are given by



dk1 1 z2 /2
k1 z 2 /2

e
=
O
z
e
dz k1 2

From this we immediately see that


lim EW(z) = 0

and

(m

(k) (z) =

for |z| .

lim EW(z) = 1.

no

tes

Attention. The issue with the Edgeworth approximation EW(z) is that it is not
necessarily a distribution function because it does not need to be monotone in z,
see Example 4.5, below!

Example 4.5. To see the possible non-monotonicity of EW(z) we only take into
account skewness, i.e. a3 = Z Z3 /6 = S /6, and the approximation ez 1 + z in
(4.4). We have

NL

1
0
2
(z) = ez /2 ,
2
1
2
(2) (z) = z ez /2 ,
2
1
1
2
2
(3) (z) = ez /2 + z 2 ez /2 ,
2
2
1
1
1
2
2
2
(4) (z) = z ez /2 + 2z ez /2 z 3 ez /2 .
2
2
2

This implies


d
EW(z) = 0 (z) a3 (4) (z) = 0 (z) 1 3a3 z + a3 z 3 .
dz

(4.5)

Consider the function h(z) = 1 3a3 z + a3 z 3 for positive skewness S > 0. Then
we have
lim h(z) =
and
lim h(z) = ,
z
z

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 4. Approximations for Compound Distributions

103

tes

(m

w)

which explains that the derivative of EW(z) has both signs and therefore EW(z)
is not monotone. However, in the upper tail of the distribution of S, that is, for
z sufficiently large, the Edgeworth approximation (4.5) is monotone and can be
used as an appropriate approximation. We emphasize that these monotonicity
properties should always be carefully checked in the Edgeworth approximation.

no

Figure 4.6: Compound Poisson distribution of S and normal approximation (4.1),


translated gamma approximation (4.2) and Edgeworth approximation (4.4) in case
(a), i.e. no large claims, expected number of claims 100; lhs: distribution function;
rhs: log-log plot.

NL

We revisit the numerical examples given in Examples 4.3. In Figure 4.6 we give
the approximation in case (a), i.e. expected number of claims equal 100, and in
Figure 4.7 we give the approximation in case (b), i.e. expected number of claims
equal 1000. In both cases we only choose the next additional moment, which is
the skewness and refers to term a3 , and we choose approximation ez 1 + z in
(4.4). We see in both cases that the Edgeworth approximation clearly outperforms
the Gaussian approximation. However, the Edgeworth approximation is still lighttailed which can be seen by comparing it to the translated gamma approximation.
In Figure 4.8 we compare the Edgeworth density (4.5) to the Gaussian density.
We clearly see the influence of the skewness parameter a3 and S > 0, respectively.
Moreover, we also see that the influence of the skewness parameter is decreasing
with a higher expected number of claims. Of course, this exactly reflects the CLT,
see Theorem 4.1.
If we calculate the minimal value of the Edgeworth density (4.5) we obtain in
case (a) the value 9.8 104 and in case (b) the value 4.1 105 . This exactly
explains that the Edgeworth density is not a proper probability density because
it violates the positivity property. However, this only occurs in the range of very
small claims and therefore it can be used as an approximation in the range of large
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 4. Approximations for Compound Distributions

(m

w)

104

NL

no

tes

Figure 4.7: Compound Poisson distribution of S and normal approximation (4.1),


translated gamma approximation (4.2) and Edgeworth approximation (4.4) in case
(b), i.e. no large claims, expected number of claims 1000; lhs: distribution function;
rhs: log-log plot.

Figure 4.8: We compare the Edgeworth density (4.5) to the Gaussian density;
lhs: in case (a), i.e. expected number of claims 100; rhs: in case (b), i.e. expected
number of claims 1000.
claims.
Finally, in Table 4.1 we present the p-values resulting from the KS test of the
different approximations, see Section 3.3.1. In this particular case we see that the
translated gamma distribution is preferred in case (a), whereas in case (b) the
approximations are very similar. For this reason, one often chooses a translated
gamma distribution in practice (and also because it can easily be handled). Note
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 4. Approximations for Compound Distributions

normal approximation
translated gamma approximation
translated log-normal approximation
Edgeworth approximation

105

case (a) case (b)


0%
0%
51%
57%
8%
59%
13%
58%

Table 4.1: p-values of the KS test of Section 3.3.1.

w)

that the Edgeworth approximation can be refined and improved by considering


more terms in the Taylor expansion. This closes the example.


(m

We finally remark that there exist similar approximations as the Edgeworth approximation, for instance, the Gram-Charlier expansion, the Laguerre-gamma expansion or the Jacobi-beta expansion. These expansions are quite popular in engineering but they have similar weaknesses as the Edgeworth approximation and we
will not further discuss them.

4.2.1

Algorithms for compound distributions

tes

4.2

Panjer algorithm

no

The Panjer algorithm (also known as Panjers recursion) goes back


to Harry H. Panjer [83]. The Panjer algorithm assumes a specific property for the claims count distribution and then it applies
this property in a clever way to develop a recursive algorithm for
the calculation of the distribution function of S.

NL

Throughout this section we assume that N is a claims count disH.H. Panjer


tribution that is supported in a possibly infinite interval A N0
containing 0. The corresponding probability weights are denoted by pk for k N0
and we set pk = 0 for k
/ A.
Definition 4.6 (Panjer distribution). N has a Panjer distribution if there exist
constants a, b R such that for all k N we have the recursion
pk = pk1 (a + b/k) .

Version April 14, 2016, M.V. Wthrich, ETH Zurich

106

Chapter 4. Approximations for Compound Distributions

Note that Panjer distributions require p0 > 0, otherwise the recursion for k 1 will not provide a well-defined distribution function.
Bjrn Sundt and William S. Jewell (1932-2003) have characterized the Panjer distributions. This is stated in the following
lemma.

B. Sundt

w)

Lemma 4.7 (Sundt-Jewell [95]). Assume N has a non-degenerate Panjer distribution. N is either binomially, Poisson or negative-binomially distributed.

b
>0
k

for all k N.

W.S. Jewell

tes

pk = pk1

(m

Proof. In order for N to have a non-degenerate distribution function we


need to have |A| > 1. Thus, we may and will choose as initialization for
the recursion k = 1 A (A is an interval containing at least 0 and 1). The
Panjer distribution then provides for this k the identity p1 = p0 (a + b). To
have a well-defined distribution function we need to have a + b 0, otherwise
p1 < 0. The case a + b = 0 provides a degenerate distribution function, thus
we even need to have a + b > 0.
Case (i). Assume a = 0. This implies b > 0 and

no

This is exactly the Poisson distribution with parameters a = 0 and b = v > 0 for A = N0
because for the Poisson distribution we have, see Section 2.2.2, pk /pk1 = v/k.
Case (ii). Assume a < 0. To have positive probabilities we need to make sure that a + b/k
remains positive for all k A. This requires |A| < . We denote the maximal value in A
by v N (assuming it has pv > 0). The positivity constraint then provides b/v > a and
a + b/(v + 1) = 0. The latter implies that pk = 0 for all k > v and is equivalent to the requirement
v = (a + b)/a > 0. We set p = a/(1 a) (0, 1) which provides






v+1
b
a(v + 1)
p
pk = pk1 a +
= pk1 a
= pk1
1
.
k
k
1p
k

NL

For the binomial distribution we have on A, see Section 2.2.1,


pk

pk1

p vk+1
p
p v+1
=
+
.
1p
k
1p 1p k

This is exactly the binomial distribution with parameters a = p/(1 p) and b = (v + 1)p/(1 p)
and A = {0, . . . , v}.
Case (iii). Assume a > 0. In this case we define = (a + b)/a > 0. This provides b = a( 1)
and




b
1
pk = pk1 a +
= pk1 a 1 +
.
k
k
Since the latter should be summable in order to obtain a well-defined distribution function we
need to have a < 1. For the negative-binomial distribution we have, see Proposition 2.20,
pk
p(k + 1)
p( 1)
=
=p+
.
pk1
k
k
This is exactly the negative-binomial distribution with parameters a = p and b = p( 1) and
A = N0 . This proves the lemma.
2

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 4. Approximations for Compound Distributions

107

The previous lemma shows that the (important) claims count distributions that we have considered in Chapter 2 are Panjer distributions, and the corresponding choices a, b ∈ R are provided in the proof of Lemma 4.7. We restate this in the next corollary.

Corollary 4.8. Assume N has a non-degenerate Panjer distribution. For a = -p/(1-p) and b = (v+1)p/(1-p) we have the binomial distribution, for a = 0 and b = λv we have the Poisson distribution, and for a = p and b = p(γ-1) we obtain the negative-binomial distribution with p = λv/(γ + λv).

Theorem 4.9 (Panjer algorithm [83]). Assume S has a compound distribution according to Model Assumptions 2.1 with N having a Panjer distribution with parameters a, b ∈ R and the claim size distribution G is discrete with support N. Denote g_m = P[Y_1 = m] for m ∈ N. Then we have for r ∈ N_0

f_r = P[S = r] =
  p_0                                       for r = 0,
  Σ_{k=1}^r (a + b k/r) g_k f_{r-k}         for r > 0.

Proof of Theorem 4.9. We will prove a more general result in Theorem 4.9(B) below. □

Remarks.

The Panjer algorithm requires a Panjer distribution for N and strictly positive and discrete claim sizes Y_i ∈ N, P-a.s. Then it provides an algorithm that allows to calculate the compound distribution without doing the involved convolutions (2.1): assume N ∼ Poi(λv), henceforth, a = 0, b = λv and for r ∈ N

f_r = Σ_{k=1}^r (λv k/r) g_k f_{r-k}.   (4.6)

Theorem 4.9 allows to apply recursion (4.6) as follows:

f_0 = p_0 = e^{-λv},
f_1 = λv g_1 f_0,
f_2 = (1/2) λv g_1 f_1 + λv g_2 f_0,
f_3 = (1/3) λv g_1 f_2 + (2/3) λv g_2 f_1 + λv g_3 f_0,
...

Observe that f_r only depends on f_0, . . . , f_{r-1}; a minimal implementation is sketched below.
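A minimal R sketch of recursion (4.6) reads as follows (the function name and arguments are our own illustrative choices; g[k] = g_k for k = 1, . . . , n and lv = λv):

# compound Poisson Panjer recursion (4.6), returns f_0,...,f_n
> panjer.poisson <- function(lv, g) {
    n <- length(g)
    f <- numeric(n + 1)          # f[r+1] stores f_r = P[S = r]
    f[1] <- exp(-lv)             # f_0 = p_0 = exp(-lambda*v)
    for (r in 1:n) {
      k <- 1:r
      f[r+1] <- lv / r * sum(k * g[k] * f[r-k+1])
    }
    f
  }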

In practical applications it may happen that the initial value f_0 cannot be represented on IT systems. This has to do with the fact that IT systems represent numbers only up to some numerical precision. Let us explain this using the compound Poisson distribution and the corresponding Panjer algorithm (4.6). If the expected number of claims λv is very large, then on IT systems the initial value f_0 = p_0 = e^{-λv} may be interpreted as zero, and thus the algorithm cannot start due to the lack of a meaningful (strictly positive) starting value. We call this numerical underflow.

In this case we can modify the Panjer algorithm as follows: choose any strictly positive starting value f̃_0 > 0 and develop the iteration

f̃_1 = λv g_1 f̃_0,
f̃_2 = (1/2) λv g_1 f̃_1 + λv g_2 f̃_0,
f̃_3 = (1/3) λv g_1 f̃_2 + (2/3) λv g_2 f̃_1 + λv g_3 f̃_0,
...

Observe that this provides a multiplicative shift from f_r to f̃_r. The true probability weights are then found by

f_r = exp{ log f̃_r + log f_0 - log f̃_0 },

where we go over to the log-scale to avoid another multiplication with missing numerical precision.
This multiplicative shift may lead to a numerical overflow, which might require shifting the algorithm forward and backward several times to get sensible values. Important at the end is a final check to see whether

Σ_{r=0}^n f_r → 1   as n → ∞,

in order to have total probability mass 1; a minimal sketch of this rescaling is given below.
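A minimal sketch of this rescaling (reusing n, g and lv from the sketch above, and choosing the starting value f̃_0 = 1, which is our own illustrative choice):

# underflow-safe variant: start at ftilde_0 = 1 instead of exp(-lv)
> ftilde <- numeric(n + 1)
> ftilde[1] <- 1
> for (r in 1:n) {
    k <- 1:r
    ftilde[r+1] <- lv / r * sum(k * g[k] * ftilde[r-k+1])
  }
# recover f_r = exp(log ftilde_r + log f_0 - log ftilde_0); here log ftilde_0 = 0
> f <- exp(log(ftilde) - lv)
> sum(f)    # final check: total probability mass should be close to 1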


We need to have discrete claim sizes Y_i ∈ N. Of course, this can be modified to any other span d > 0, i.e. Y_i ∈ dN, because for r ∈ N

P[S = dr] = P[ Σ_{i=1}^N Y_i = dr ] = P[ Σ_{i=1}^N Y_i/d = r ] = P[ Σ_{i=1}^N Ỹ_i = r ],

with Ỹ_i = Y_i/d ∈ N.

For non-discrete claim sizes Y_i we need to discretize them in order to apply the Panjer algorithm. Choose span size d > 0 and consider for k ∈ N_0

G((k+1)d) - G(kd) = P[kd < Y_1 ≤ (k+1)d].

These probabilities can now either be shifted to the left or to the right endpoint of the interval [kd, (k+1)d]. We define the two new discrete distribution functions for k ∈ N_0

g_k^- = P[Y_1^- = kd] = G((k+1)d) - G(kd),   (4.7)

g_{k+1}^+ = P[Y_1^+ = (k+1)d] = G((k+1)d) - G(kd).   (4.8)

This provides the following stochastic ordering (stochastic dominance)

Y_1^- ≤_sd Y_1 ≤_sd Y_1^+,

where the latter means P[Y_1^- > x] ≤ P[Y_1 > x] ≤ P[Y_1^+ > x]. This implies

S^- = Σ_{i=1}^N Y_i^- ≤_sd S = Σ_{i=1}^N Y_i ≤_sd S^+ = Σ_{i=1}^N Y_i^+,

for Y_i^- being i.i.d. copies of Y_1^- and Y_i^+ being i.i.d. copies of Y_1^+ (also independent of N). Thus, we get lower and upper bounds S^- ≤_sd S ≤_sd S^+, which become more narrow the smaller we choose the span d. In most applications, especially for small λv, these bounds/approximations are sufficient compared to the other uncertainties involved in the prediction process (parameter estimation uncertainty, etc.).

To S^+ we can directly apply the Panjer algorithm. S^- is more subtle because it may happen that g_0^- > 0 and, thus, the Panjer algorithm cannot be applied in its classical form of Theorem 4.9. In the case of the compound Poisson distribution this problem is easily circumvented due to the disjoint decomposition theorem, Theorem 2.14, which says that

S^- = Σ_{i=1}^N Y_i^- = Σ_{i=1}^N Y_i^- 1_{{Y_i^- > 0}} =(d) S̃

has again a compound Poisson distribution with parameters λṽ = λv (1 - g_0^-) and weights of the claim sizes g̃_k = g_k^-/(1 - g_0^-) for k ∈ N. Finally, we apply the Panjer algorithm to the compound Poisson distributed random variable S̃ to get the second bound. We prefer to give a more general version of the Panjer algorithm that also allows to treat the case g_0 > 0, see Theorem 4.9(B) below.

There are more sophisticated discretization methods, but often our proposal (4.7)-(4.8) is sufficient. Moreover, it provides explicit upper and lower bounds, which is an advantage if one tries to quantify the precision of the approximation; a sketch of this discretization for the Pareto case is given below.

Theorem 4.9(B) (modified Panjer algorithm). Assume S has a compound distribution according to Model Assumptions 2.1 with N having a Panjer distribution with parameters a, b ∈ R and the claim size distribution G is discrete with support N_0 (we allow for g_0 = P[Y_1 = 0] > 0). Then we have for r ∈ N_0

f_r = P[S = r] =
  Σ_{k∈N_0} p_k g_0^k                                      for r = 0,
  (1 - a g_0)^{-1} Σ_{k=1}^r (a + b k/r) g_k f_{r-k}        for r > 0.

Proof of Theorem 4.9(B). Note that we have k p_k = (ak + b) p_{k-1} = a(k-1) p_{k-1} + (a+b) p_{k-1}. We multiply this equation with (M_{Y_1}(x))^{k-1} M'_{Y_1}(x) and sum over k ∈ N. This provides the identity

Σ_{k∈N} k p_k (M_{Y_1}(x))^{k-1} M'_{Y_1}(x) = Σ_{k∈N} ( a(k-1) p_{k-1} + (a+b) p_{k-1} ) (M_{Y_1}(x))^{k-1} M'_{Y_1}(x).

The left-hand side is the derivative w.r.t. x of

M_S(x) = E[ E[ exp{ x Σ_{i=1}^N Y_i } | N ] ] = E[ (M_{Y_1}(x))^N ] = Σ_{k∈N_0} p_k (M_{Y_1}(x))^k,

whereas the right-hand side fulfills, again using the derivative of M_S(x) in the second step,

Σ_{k∈N} ( a(k-1) p_{k-1} + (a+b) p_{k-1} ) (M_{Y_1}(x))^{k-1} M'_{Y_1}(x)
  = Σ_{k∈N_0} ( a k p_k + (a+b) p_k ) (M_{Y_1}(x))^k M'_{Y_1}(x) = a M'_S(x) M_{Y_1}(x) + (a+b) M_S(x) M'_{Y_1}(x).

Thus, we have just proved that the moment generating function for compound Panjer distributions satisfies the following differential equation

M'_S(x) = a M_{Y_1}(x) M'_S(x) + (a+b) M'_{Y_1}(x) M_S(x).

Each side of the above identity can be expanded as powers of e^x

Σ_{r≥1} f_r r e^{xr} = a ( Σ_{k≥0} g_k e^{xk} ) ( Σ_{l≥1} f_l l e^{xl} ) + (a+b) ( Σ_{k≥1} g_k k e^{xk} ) ( Σ_{l≥0} f_l e^{xl} ).

Comparing the terms with the same powers r ≥ 1 of e^x we obtain

r f_r = a Σ_{k=0}^{r-1} g_k (r-k) f_{r-k} + (a+b) Σ_{k=1}^r k g_k f_{r-k}
      = a r g_0 f_r + a Σ_{k=1}^r g_k (r-k) f_{r-k} + (a+b) Σ_{k=1}^r k g_k f_{r-k}
      = a r g_0 f_r + a r Σ_{k=1}^r g_k f_{r-k} + b Σ_{k=1}^r k g_k f_{r-k} = a r g_0 f_r + Σ_{k=1}^r (ar + bk) g_k f_{r-k}.

Dividing both sides by r ≥ 1 and bringing the first term on the right-hand side of the last equality to the other side we obtain

(1 - a g_0) f_r = Σ_{k=1}^r (a + b k/r) g_k f_{r-k}.

This proves the claim for r > 0. For r = 0 we have

P[S = 0] = p_0 + Σ_{k∈N} p_k P[ Σ_{i=1}^k Y_i = 0 ] = p_0 + Σ_{k∈N} p_k P[Y_1 = . . . = Y_k = 0] = p_0 + Σ_{k∈N} p_k g_0^k = Σ_{k∈N_0} p_k g_0^k,

where in the second last step we have used the independence property of the claim sizes Y_i. This finishes the proof. Note that we have (implicitly) assumed that there exists a positive radius of convergence for the moment generating functions, see also Lemma 1.1. We can do this w.l.o.g. because in order to calculate f_r = P[S = r] we may replace the claim sizes Y_i by bounded claim sizes Y_i ∧ (r+1), and the resulting probability weight f_r will be the same. □
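In the compound Poisson case the initial value of Theorem 4.9(B) is explicit, f_0 = Σ_{k∈N_0} p_k g_0^k = M_N(log g_0) = exp{-λv(1 - g_0)}, and since a = 0 the correction factor (1 - a g_0)^{-1} equals 1. A minimal sketch (our own illustrative code, with g[k+1] = g_k for k = 0, . . . , n):

# compound Poisson Panjer recursion allowing for g_0 = P[Y1 = 0] > 0
> panjer.poisson0 <- function(lv, g) {
    n <- length(g) - 1
    f <- numeric(n + 1)
    f[1] <- exp(-lv * (1 - g[1]))   # f_0 = exp(-lambda*v*(1 - g_0))
    for (r in 1:n) {
      k <- 1:r
      f[r+1] <- lv / r * sum(k * g[k+1] * f[r-k+1])
    }
    f
  }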

Example 4.10 (Panjer algorithm, compound Poisson distribution). We choose a compound Poisson model with expected number of claims λv = 1 and i.i.d. Pareto claim sizes Y_i ∼ Pareto(θ, α) with θ = 500'000 and α = 2.5. In a first step we need to discretize the claim sizes. We calculate the distributions of Y_i^- ≤_sd Y_i ≤_sd Y_i^+ according to (4.7) and (4.8) with, for kd ≥ θ,

g_k^- = g_{k+1}^+ = G((k+1)d) - G(kd) = (kd/θ)^{-α} - ((k+1)d/θ)^{-α}.

Figure 4.9: Discretized claim size distributions (g_k^-)_k and (g_k^+)_k; lhs: case (i) with span d = 100'000; rhs: case (ii) with span d = 10'000.
As span size we choose two different values: (i) d = 100'000 and (ii) d = 10'000. In Figure 4.9 we plot the resulting probability weights (g_k^-)_k and (g_k^+)_k. We see that the discretization error disappears for decreasing span d.

We then implement the Panjer algorithm in R. The implementation is rather straightforward. In a first step we invert the ordering in the claim size distributions (g_k^-)_k and (g_k^+)_k so that in the second step we can apply matrix multiplications. This looks as follows:

# Note that we shift indexes by 1 (because arrays start at 1)
> for (k in 0:(Kmax-1)) { g[2,Kmax-k] <- g[1,k+1]*k }
> f[1] <- exp(-lambda * v)
> for (r in 1:(Kmax-1)) {
    f[r+1] <- g[2,(Kmax-r):(Kmax-1)] %*% f[1:r] * lambda * v / r
  }

The results are presented in Figures 4.10 and 4.11.

Figure 4.10: Discrete probability weights of the compound Poisson distribution with λv = 1 from the Panjer algorithm; lhs: case (i) with span d = 100'000; rhs: case (ii) with span d = 10'000.
In Figure 4.10 we plot the resulting probability weights of the (discretized) compound Poisson distribution; the left-hand side gives the picture for span d = 100'000 and the right-hand side for d = 10'000. The observation is that span d = 100'000 gives quite some differences between the lower and upper bounds reflected by (g_k^-)_k and (g_k^+)_k; for span d = 10'000 they are sufficiently close so that we obtain appropriate approximations to the continuous Pareto distribution case. We also observe that the resulting distribution has two obvious modes, see Figure 4.10 (rhs); these reflect the cases of having N = 1 claim and N = 2 claims, the cases N ≥ 3 only give smaller discontinuities.
Finally, in Figure 4.11 we show the log-log plots of the distribution functions. The straight blue line reflects the Pareto distribution Y_1 ∼ Pareto(θ, α), i.e. of having exactly one claim with tail parameter α = 2.5 (which corresponds to the negative slope of the blue line).

Figure 4.11: Log-log plot of the compound Poisson distribution with λv = 1 from the Panjer algorithm; lhs: case (i) with span d = 100'000; rhs: case (ii) with span d = 10'000.

We observe that asymptotically the compound Poisson distribution with λv = 1 coincides with the Pareto claim size distribution.



Example 4.11. We revisit case (c) of Example 4.2. For large claims S_lc we assume a compound Poisson distribution with expected number of claims λ_lc v = 3.9 and Pareto(θ, α) claim size distribution with θ = 500'000 and α = 2.5. We choose the same two discretizations as in Example 4.10, see Figure 4.9, and then we apply the Panjer algorithm to the large claims layer as explained above. The results for the distributions of S_lc^± are presented in Figures 4.12 and 4.13.

The results are in line with the ones of Example 4.10, and we should prefer span size d = 10'000, which gives a sufficiently good approximation to the continuous Pareto claim size distribution. Observe that due to λ_lc v = 3.9 the resulting compound Poisson distribution has more modes now, see Figure 4.12 (rhs). In Figure 4.13 we see that the asymptotic behavior is sandwiched between the Pareto distribution Pareto(θ, α) with tail parameter α = 2.5 and this Pareto distribution stretched with the expected number of claims λ_lc v = 3.9 (blue lines in Figure 4.13). We observe a rather slow convergence to the asymptotic slope, which tells us that parameter estimation for Pareto distributions is a very difficult (if not impossible) task if only few observations are available.

Finally, we convolute the large claims layer S_lc of case (c) in Example 4.2 with the corresponding small claims layer S_sc, see case (b) of Example 4.2. For the small claims layer we choose a translated gamma distribution as approximation to the true distribution function of S_sc, i.e. we set

S = S_sc + S_lc ≈ X_sc + S_lc^±,   (4.9)

Figure 4.12: Discrete probability weights of the compound Poisson distribution with λ_lc v = 3.9 from the Panjer algorithm; lhs: case (i) with span d = 100'000; rhs: case (ii) with span d = 10'000.

Figure 4.13: Log-log plot of the compound Poisson distribution with λ_lc v = 3.9 from the Panjer algorithm; lhs: case (i) with span d = 100'000; rhs: case (ii) with span d = 10'000.
where X_sc is the translated gamma approximation to S_sc (see Example 2.16 and (4.2)) and S_lc^± are the discretized versions of S_lc which model the large claims layer having a compound Poisson distribution with Pareto claim sizes.

In order to calculate the compound Poisson random variable S_lc^± we apply the Panjer algorithm with span d = 10'000. The disjoint decomposition theorem, see Theorem 2.14 and Example 2.16, implies that in the compound Poisson case we may and will assume that the large claims separation leads to an independent decoupling of S_sc and S_lc^±, and of X_sc and S_lc^±, respectively, see (4.9). Therefore, the aggregate distribution of X_sc + S_lc^± is obtained by a simple convolution of the marginal distributions of X_sc and S_lc^±. Using also a discretization of the distribution function of X_sc to the same span d = 10'000 as in the Panjer algorithm for S_lc^±, denoted by X_sc^±, the convolution of X_sc^± + S_lc^± can easily be calculated analytically. That is, no Monte Carlo simulation is needed. Namely, denote the discrete probability weights of X_sc^- by (f_k^(1))_{k≥0} and the discrete probability weights of S_lc^- by (f_k^(2))_{k≥0}, i.e. set

f_k^(1) = P[X_sc^- = kd]   and   f_k^(2) = P[S_lc^- = kd].

Then, due to independence, we have for all r ∈ N_0 the discrete probability weights

f_r = P[X_sc^- + S_lc^- = rd] = Σ_{k=0}^r f_k^(1) f_{r-k}^(2).   (4.10)

Figure 4.14: Case (c) of Example 4.2: exact discretized distribution X_sc^- + S_lc^- for span d = 10'000, Monte Carlo approximation and normal approximation (only rhs); lhs: discrete probability weights (upper and lower bounds); rhs: log-log plot (see also Figure 4.3 (rhs)).

# Note that we shift indexes by 1 (because arrays start at 1)
> for (k in 0:(Kmax-1)) { f2[2,Kmax-k] <- f2[1,k+1] }
> for (r in 1:Kmax) { f[r] <- f2[2,(Kmax-r+1):Kmax] %*% f1[1:r] }
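The key figures in Table 4.2 below can be read off directly from these discrete weights; for instance (a sketch, with the weights f on the grid d·(0, 1, 2, . . .) as above), the 99.5%-quantile is the smallest grid point where the cumulative distribution function reaches 99.5%:

# 99.5%-quantile (99.5%-VaR) from discrete probability weights f with span d
> VaR995 <- d * (min(which(cumsum(f) >= 0.995)) - 1)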
The results are presented in Figure 4.14. On the left-hand side we present the probability weights (f_r)_{r≥0} and on the right-hand side the log-log plot of the resulting distribution function. We observe that the Monte Carlo approximation (100'000 simulations) has bad properties in the tail of the distribution, see Figure 4.14 (rhs), and one should avoid the simulation approach if possible. Especially for heavy tailed distribution functions the Monte Carlo simulation approach has a slow rate of convergence. Note that convolution (4.10) is exact, and in some sense this discretized version can be interpreted as an optimal Monte Carlo sample with equidistant observations.

We conclude that approximation (4.9), with a translated gamma distribution for the small claims layer and a compound Poisson distribution with Pareto tails for the large claims layer, is often a good model for total claim amount modeling in non-life insurance. Moreover, using a discretization with appropriate span size d, the resulting discrete distribution function can be calculated analytically (and we obtain upper and lower bounds which can be controlled).

expected claim amount E[S]                       3'131'397
standard deviation Var(S)^{1/2}                    338'819
coefficient of variation Vco(S)                      10.8%
99.5%-VaR upper bound (from discretization)      4'046'500
99.5%-VaR lower bound (from discretization)      4'038'500
99.5%-VaR - E[S]                                   912'500

Table 4.2: Resulting key figures; the 99.5%-VaR corresponds to the 99.5%-quantile of S, see Example 6.25 below. The 99.5%-VaR is calculated with the discretized version with span d = 10'000, therefore we obtain upper and lower bounds resulting from the discretization error in X_sc^± + S_lc^±.

Finally, in Table 4.2 we present the resulting key figures. We observe that the resulting distribution function is substantially more heavy tailed than the Gaussian distribution, which is not surprising in view of Figure 4.14 (rhs).


4.2.2 Fast Fourier transform

We briefly sketch the fast Fourier transform (FFT) to explain the main idea. We follow Embrechts-Frei [38] and Section 6.7 in Panjer [84], and we also recommend Černý [27] as a reference.

In Chapter 1 we have introduced the moment generating function of X given by M_X(r) = E[e^{rX}]. The beauty of such transforms is that they allow to treat independent random variables in an elegant way, in the sense that convolutions turn into products, i.e. for X and Y independent we have (whenever they exist)

M_{X+Y}(r) = M_X(r) M_Y(r).

For compound distributed random variables S we have, see Proposition 2.2,

M_S(r) = M_N(log M_{Y_1}(r)).   (4.11)

If we manage to identify the right-hand side of the latter equation, that is, find Z such that M_N(log M_{Y_1}(r)) = M_Z(r), then Lemma 1.2 explains that S and Z have the same distribution function and we do not need to perform the convolutions (if Z is sufficiently explicit). This is also the idea behind this section.

In the sequel of this section the moment generating function is replaced by the (discrete) Fourier transform, which is named after Jean Baptiste Joseph Fourier (1768-1830). The reason for this replacement is that the Fourier transform has a nice inversion formula that is crucial in this section (and which allows to identify the right-hand side of (4.11) in a straightforward manner). We present the discretized case as it is usually used in practice.

Assume we have finite support A = {0, . . . , n-1} and that (f_l)_{l∈A} is a discrete distribution function on A. The discrete Fourier transform of (f_l)_l is defined by

f̂_z = Σ_{l=0}^{n-1} f_l exp{2πi zl/n}   for z ∈ A.   (4.12)

Assume S ∼ (f_l)_l; then we have, by a slight abuse of notation,

f̂_z = M_S(2πi z/n) = E[ exp{2πi zS/n} ].

The discrete Fourier transform has the following nice inversion formula

f_l = (1/n) Σ_{z=0}^{n-1} f̂_z exp{-2πi zl/n}   for l ∈ A.   (4.13)

This provides the first part of the idea of the algorithm: if we are able to explicitly calculate the discrete Fourier transform (f̂_z)_z, then the inversion formula provides the wanted probability weights (f_l)_l. Note that this idea also applies if (f_l)_l are weights that do not necessarily add up to 1.

Remarks. In the literature one also finds another definition of the discrete Fourier transform, namely in (4.12) the factor 2πi is sometimes replaced by -2πi. This implies that we also need a switch of sign in the inversion formula (4.13). Similarly, the scaling n^{-1} in (4.13) may be shifted to (4.12). Note that the discrete Fourier transform acts on the cyclic group Z/nZ.

The above gives the following recipe:

Step 1. Choose a threshold n ∈ N up to which we would like to determine the distribution function of S, i.e. we are interested in P[S ≤ n-1].


Step 2. Discretize the claim severity distribution G to obtain weights (g_k)_{k∈A}. For discretization we refer to the last section on the Panjer algorithm, see the remarks on page 107. Note that typically we have Σ_{k∈A} g_k < 1, because claims Y_i may exceed the threshold n-1 with positive probability.

Step 3. Calculate the discrete Fourier transform (ĝ_z)_{z∈A} of (g_k)_{k∈A}.

Step 4. Calculate the discrete Fourier transform (f̂_z)_{z∈A} of S ∼ (f_l)_{l∈A} using identity (4.11) with r = 2πi z/n and (ĝ_z)_{z∈A}, respectively, that is, set

f̂_z = M_S(2πi z/n) = M_N( log M_{Y_1}(2πi z/n) ) = M_N(log ĝ_z).   (4.14)

Step 5. Apply the inversion formula to obtain (f_l)_{l∈A} from (f̂_z)_{z∈A}.
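For a compound Poisson distribution this recipe condenses into a few lines, because M_N(log ĝ_z) = exp{λv(ĝ_z - 1)}. A minimal sketch in R (our own illustrative code; gminus denotes a discretized claim size distribution as in the sketch above, and n should exceed the length of gminus):

# FFT recipe for N ~ Poi(lambda*v): Steps 3-5 with f_hat = exp(lv*(g_hat - 1))
> lv <- 1                                        # lambda*v as in Example 4.10
> n <- 2^14                                      # threshold, a power of 2
> gpad <- c(gminus, numeric(n - length(gminus))) # weights on 0,...,n-1
> f_hat <- exp(lv * (fft(gpad) - 1))
> f <- Re(fft(f_hat, inverse = TRUE)) / n        # inversion formula (4.13)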

The remaining part of the FFT explains how to calculate the discrete Fourier transform (ĝ_z)_{z∈A} of Y_1 ∼ (g_l)_{l∈A} efficiently. There is a nice recursive algorithm that allows to calculate these discrete Fourier transforms for the choices n = 2^d, d ∈ N_0. The discrete Fourier transform of (g_l)_l for n = 2^d is given by

ĝ_z = Σ_{l=0}^{2^d - 1} g_l exp{2πi zl/2^d}
    = Σ_{l=0}^{2^{d-1} - 1} g_{2l} exp{2πi z 2l/2^d} + Σ_{l=0}^{2^{d-1} - 1} g_{2l+1} exp{2πi z(2l+1)/2^d}
    = Σ_{l=0}^{2^{d-1} - 1} g_{2l} exp{2πi zl/2^{d-1}} + exp{2πi z/2^d} Σ_{l=0}^{2^{d-1} - 1} g_{2l+1} exp{2πi zl/2^{d-1}}
    = ĝ_z^(0) + exp{2πi z/2^d} ĝ_z^(1),

where ĝ_z^(0) is the discrete Fourier transform of (g_l^(0))_{l=0,...,m-1} = (g_{2l})_{l=0,...,m-1} and ĝ_z^(1) is the discrete Fourier transform of (g_l^(1))_{l=0,...,m-1} = (g_{2l+1})_{l=0,...,m-1} for m = 2^{d-1}. Note that this step reduces length 2^d to length 2^{d-1}, and iterating this until we have reduced the total length 2^d to 2^0 = 1 calculates the discrete Fourier transform of (g_l)_l in an efficient way.

Observe that the total length of (f̂_z)_z is also n = 2^d. Therefore, exactly the same recursive algorithm is applied for the calculation of the inversion formula to obtain (f_l)_l.

In R there is a command for the FFT. Use the following lines to transform a discrete, finite distribution g = (g_l)_l:

# Check normalizations in (4.12)-(4.13) (depending on implementation they
# may be different)
> g_hat <- fft(g)
> g <- fft(g_hat, inverse = TRUE)/length(g)

For more information on the FFT and calculation with complex numbers we refer to Černý [27].

We conclude this section with remarks on compound distributions; for details we refer to Embrechts-Frei [38]. First, we compare the efficiency of the proposed methods. We assume that we calculate the compound distribution of S with discrete claim sizes Y ∼ (g_l)_{l∈A} of length n for n → ∞.

method                   operations    claims counts N        precision
full convolution (2.1)   O(n^3)        any distribution       exact
Panjer algorithm         O(n^2)        Panjer distributions   exact
FFT                      O(n log n)    any distribution       not exact

Observe that we have hidden one issue when applying the FFT to compound distributions. As mentioned above, the discrete Fourier transform acts on the cyclic group Z/nZ. But transformation (4.14) does not respect this cyclic structure, and compound claims that exceed n-1 are wrapped around. This wrap-around error (also called aliasing error) can be substantial and needs careful consideration. If it is too large, then n should be increased so that less probability mass exceeds the threshold n-1; an example is provided in Figure 4.15.

Figure 4.15: Panjer algorithm versus FFT for the compound Poisson distribution with λv = 1 and discrete claim size distribution (g_ℓ)_ℓ with g_ℓ = 1/10 for ℓ = 1, . . . , 10, with (lhs) n = 12, (middle) n = 15, and (rhs) n = 20.


Chapter 5

Ruin Theory in Discrete Time

Ruin theory has its origin in the early twentieth century when Ernst Filip Oskar Lundberg (1876-1965) [71] wrote his famous Uppsala PhD thesis in 1903. It was later the distinguished Swedish mathematician and actuary Harald Cramér (1893-1985) [29, 30] who developed the cornerstones in collective risk and ruin theory and made many of Lundberg's ideas mathematically rigorous. Therefore, the underlying process studied in ruin theory is called Cramér-Lundberg process. For the collected work of Cramér we refer to [31]. Since then a vast literature has developed in this field; important contributions are Feller [45], Bühlmann [19], Rolski et al. [89], Asmussen-Albrecher [7], Dickson [36], Kaas et al. [64] and many scientific papers by Hans-Ulrich Gerber and Elias S.W. Shiu. Therefore, this theory is sometimes also called Gerber-Shiu risk theory, see Kyprianou [68].

Because it is not our intention to write another textbook on ruin theory, we keep this chapter rather short and only give some key ideas and results. In particular, we investigate the importance of the tail of the claim size distribution. Our short summary is mainly based on Schmidli [91] and Rolski et al. [89]; for a more comprehensive overview we refer to the literature.

5.1 Net profit condition

We consider time series of premium payments Π_t and total claim amount payments S_t over several accounting years t ∈ N. In this set-up we study the question under which circumstances the premia Π_t suffice to pay all claims S_t (instantaneously when they occur, allowing for carry-over of possible gains). In order to do this, we define the following (discrete time) surplus process (C_t)_{t∈N_0}.

Definition 5.1 (surplus process). Choose t ∈ N. The surplus at time t is given by

C_t = C_t^{(c_0)} = c_0 + Σ_{u=1}^t (Π_u - S_u),

for initial capital C_0 = c_0 ≥ 0 at time 0 and an i.i.d. sequence (Π_t, S_t)_{t∈N} with:

• the premium Π_t received for accounting year t satisfies Π_t > 0, P-a.s.;
• the total claim amount S_t in accounting year t satisfies S_t ≥ 0, P-a.s.;
• Π_t and S_t are independent for all t ∈ N.

The last assumption in the previous definition is not really necessary, but it may simplify calculations. The surplus process (C_t)_{t∈N_0} models the equity or the net asset value process of an insurance company which starts with (deterministic) initial capital C_0 = c_0 ≥ 0, collects every year a premium Π_t and pays for the corresponding (non-negative) claim S_t. At first sight it looks artificial to model the premium Π_t stochastically; the reason is that it may be advantageous in some situations to have randomized premia. The ultimate goal is to achieve

C_t ≥ 0   for all t ≥ 0,

otherwise the company cannot fulfill its liabilities at any point in time t ∈ N_0. In the present set-up we look at a homogeneous surplus process (having independent and stationary increments X_t = Π_t - S_t). Moreover, no financial return on assets is considered. Of course, this is a rather synthetic situation. For the present purpose it is sufficient because it already highlights crucial issues, and it will be refined for solvency considerations in Chapter 10.
Definition 5.2 (ruin time and finite horizon ruin probability). We define the ruin time of the surplus process (C_t)_{t∈N_0} by

τ = inf {s ∈ N_0 ; C_s < 0}.

The finite horizon ruin probability up to time t ∈ N and for initial capital c_0 ≥ 0 is defined by

ψ_t(c_0) = P[τ ≤ t | C_0 = c_0] = P[ inf_{s=0,...,t} C_s^{(c_0)} < 0 ].

Remark on the notation. Below we use that for c_0 = 0 the stochastic process (C_t)_{t∈N_0} = (C_t^{(0)})_{t∈N_0} is a random walk on the probability space (Ω, F, P) starting at zero. The general surplus process can then be described by (C_t^{(c_0)})_{t∈N_0} = (C_t^{(0)} + c_0)_{t∈N_0} under P and, as stated in Definition 5.2, we can indicate the initial capital by using the notation P[·|C_0 = c_0]. In Markov process theory it has become standard to write the latter as P_{c_0}[·], meaning that (C_t)_{t∈N_0} under P_{c_0} is equal in law to (C_t^{(0)} + c_0)_{t∈N_0} under P.

The event {τ ≤ t} can be written as follows

{τ ≤ t} = { inf {s ∈ N_0 ; C_s < 0} ≤ t } = ∪_{s=0,...,t} {C_s < 0},

and therefore τ is a stopping time w.r.t. the filtration generated by (C_t)_{t∈N_0}. To consider the limiting case t → ∞ we need to extend the positive real line by an additional point {∞} because τ is not necessarily finite, P-a.s. We use the notation R̄_+ for the extended positive real line [0, ∞].

The finite horizon ruin probability ψ_t(c_0) is non-decreasing in t and it is bounded by 1 (because it is a probability). This immediately implies convergence for t → ∞, and we can define the ultimate ruin probability by the following limit

ψ(c_0) = lim_{t→∞} ψ_t(c_0) ∈ [0, 1].   (5.1)

Lemma 5.3 (ultimate ruin probability). The ultimate ruin probability for initial capital c_0 ≥ 0 is given by

ψ(c_0) = P_{c_0}[τ < ∞] = P_{c_0}[ inf_{t∈N_0} C_t < 0 ] ∈ [0, 1].

Proof. The second equality is a direct consequence of the definition; note that

{τ < ∞} = ∪_{t∈N_0} {τ ≤ t} = ∪_{t∈N_0} ∪_{s=0,...,t} {C_s < 0} = ∪_{t∈N_0} {C_t < 0} = { inf_{t∈N_0} C_t < 0 }.

For the first equality we use the monotone convergence property of probability measures; note {τ ≤ t} ⊂ {τ ≤ t+1},

P_{c_0}[τ < ∞] = P_{c_0}[ ∪_{t∈N_0} {τ ≤ t} ] = lim_{t→∞} P_{c_0}[τ ≤ t] = lim_{t→∞} ψ_t(c_0) = ψ(c_0). □

We analyze this ultimate ruin probability in various situations. Therefore, we modify the surplus process (C_t^{(c_0)})_{t∈N_0}. We define Z_0 = 0 and for t ∈ N

Z_t = C_t^{(c_0)} - c_0 = C_t^{(0)} = Σ_{u=1}^t (Π_u - S_u) = Σ_{u=1}^t X_u,   (5.2)

where we define the i.i.d. sequence (X_t)_{t∈N} by X_t = Π_t - S_t. In probability theory the process (Z_t)_{t∈N_0} is called a general random walk. A main object of interest of random walk theory is the study of its long time behavior. The key theorem is the following statement:

Theorem 5.4 (random walk theorem). Assume the X_t are i.i.d. with P[X_1 = 0] < 1 and E[|X_1|] < ∞. The random walk (Z_t)_{t∈N_0} defined in (5.2) has one of the following three behaviors:

• if E[X_1] > 0 then lim_{t→∞} Z_t = ∞, P-a.s.;
• if E[X_1] < 0 then lim_{t→∞} Z_t = -∞, P-a.s.;
• if E[X_1] = 0 then lim inf_{t→∞} Z_t = -∞ and lim sup_{t→∞} Z_t = ∞, P-a.s.

Proof. See, e.g., Proposition 7.2.3 in Resnick [87]. □

From now on we exclude the trivial case P[Π_1 - S_1 = 0] = 1, and we assume that Π_1 and S_1 have finite first moments. The random walk theorem immediately gives the following crucial corollary for our context:

Corollary 5.5 (ultimate ruin with probability one). Assume E[Π_1] ≤ E[S_1]. Then ψ(c_0) ≡ 1 for any initial capital c_0 ≥ 0.

Proof. The random walk theorem implies for E[X_1] = E[Π_1] - E[S_1] ≤ 0 that lim inf_{t→∞} Z_t = -∞, P-a.s., and thus lim inf_{t→∞} C_t = -∞, P_{c_0}-a.s. (for any c_0 ≥ 0). But this means that we have ultimate ruin with probability 1. □

Henceforth, to avoid ultimate ruin with probability one we need to charge an (expected) annual premium E[Π_1] which exceeds the expected annual claim E[S_1]. This gives rise to the following standard assumption.

Assumption 5.6 (net profit condition). The surplus process satisfies the net profit condition (NPC) given by

E[Π_1] > E[S_1].

Corollary 5.7. Assume (NPC). Then ψ(0) < 1.

Proof. The assumption E[Π_1] > E[S_1] implies E[X_1] > 0 and, thus, lim_{t→∞} Z_t = ∞, P-a.s. This implies that P[lim inf_{t→∞} Z_t = -∞] = 0. The latter is equivalent to P[inf_{t∈N_0} Z_t ≥ 0] > 0, see for instance Proposition 7.2.1 in Resnick [87]. But then the proof follows. □


Moreover, observe that ψ(c_0) is non-increasing in c_0 (this can be seen path by path because C_t^{(c_0)} = Z_t + c_0 is strictly increasing in the initial capital c_0). This implies that ψ(c_0) ≤ ψ(0) < 1 under (NPC). Our next goal is to find more explicit bounds on the ruin probability as a function of the initial capital c_0 ≥ 0.

5.2 Lundberg bound

We start with a lemma which gives the renewal property of the surplus process. We define the distribution function F by S_1 - Π_1 ∼ F; thus, we have -X_t ∼ F. Note that from S_1 ∼ F_S, -Π_1 ∼ F_{-Π} and independence of S_1 and Π_1 it follows that F = F_S * F_{-Π}.

Lemma 5.8. The finite horizon ruin probability and the ultimate ruin probability satisfy the following equations for t ∈ N_0 and initial capital c_0 ≥ 0

ψ_{t+1}(c_0) = 1 - F(c_0) + ∫_{-∞}^{c_0} ψ_t(c_0 - y) dF(y),

ψ(c_0) = 1 - F(c_0) + ∫_{-∞}^{c_0} ψ(c_0 - y) dF(y).

Proof. We start with the finite horizon ruin probability. Observe that we have the partition for c_0 ≥ 0

{τ ≤ t+1} = {τ ≤ 1} ∪ {1 < τ ≤ t+1} = {S_1 - Π_1 > c_0} ∪ {1 < τ ≤ t+1}.

The i.i.d. property of (Π_t, S_t)_t implies

ψ_{t+1}(c_0) = P_{c_0}[τ ≤ t+1] = P[S_1 - Π_1 > c_0] + P_{c_0}[1 < τ ≤ t+1]
  = P[S_1 - Π_1 > c_0] + ∫_{-∞}^{c_0} P_{c_0}[ 1 < τ ≤ t+1 | S_1 - Π_1 = y ] dF(y)
  = P[S_1 - Π_1 > c_0] + ∫_{-∞}^{c_0} P_{c_0}[ 1 < τ ≤ t+1 | C_1 = c_0 - y ] dF(y)
  = P[S_1 - Π_1 > c_0] + ∫_{-∞}^{c_0} P_{c_0-y}[τ ≤ t] dF(y)
  = 1 - F(c_0) + ∫_{-∞}^{c_0} ψ_t(c_0 - y) dF(y).

The ultimate ruin probability statement is a direct consequence of the finite horizon statement: using that we have point-wise convergence (5.1) and that ψ_t is bounded by 1, which is integrable w.r.t. dF, we can apply the dominated convergence theorem to the finite horizon ruin probability statement, which provides the claim for the ultimate ruin probability as t → ∞. □


Definition 5.9 (Lundberg coefficient, adjustment coefficient). Assume there exists an R > 0 such that

M_{-X_1}(R) = M_{S_1 - Π_1}(R) = 1.

Then, this R > 0 is called the Lundberg coefficient.

Lemma 5.10 (uniqueness of the Lundberg coefficient). Assume that (NPC) holds and that a Lundberg coefficient R > 0 exists. Then, R is unique.

Figure 5.1: Lundberg coefficient R of the function r ↦ M_{S_1-Π_1}(r).

Proof. Due to the existence of a Lundberg coefficient R > 0 and due to the independence between S_1 and Π_1, the following function is well-defined for all r ∈ [0, R]:

r ↦ h(r) = log M_{S_1-Π_1}(r) = log( M_{S_1}(r) M_{-Π_1}(r) ) = log E[e^{rS_1}] + log E[e^{-rΠ_1}].

Similar to Lemma 1.6 we see that h(r) is a convex function on [0, R] with h(0) = 0 and h'(0) = E[S_1 - Π_1] < 0 under (NPC). But then there is at most one R > 0 with h(R) = 0. This proves the uniqueness of the Lundberg coefficient. □

Theorem 5.11 (Lundberg's exponential bound). Assume (NPC) holds and R > 0 exists. Then

ψ(c_0) ≤ e^{-Rc_0}   for all c_0 ≥ 0.

Proof. It suffices to prove that ψ_t(c_0) ≤ e^{-Rc_0} for all t ∈ N because ψ_t(c_0) → ψ(c_0) for t → ∞. We apply Lemma 5.8 to the finite horizon ruin probability ψ_t(c_0) to obtain the following proof by induction.

t = 1: We apply Chebychev's inequality to obtain for the Lundberg coefficient R > 0 and any c_0 ≥ 0

ψ_1(c_0) = P_{c_0}[τ ≤ 1] = P[S_1 - Π_1 > c_0] = P[ e^{R(S_1-Π_1)} > e^{Rc_0} ] ≤ e^{-Rc_0} M_{S_1-Π_1}(R) = e^{-Rc_0}.

t → t+1: We assume that the claim holds true for ψ_t(c_0) and any c_0 ≥ 0. Then with Lemma 5.8

ψ_{t+1}(c_0) = ∫_{c_0}^∞ dF(y) + ∫_{-∞}^{c_0} ψ_t(c_0 - y) dF(y)
  ≤ ∫_{c_0}^∞ e^{-R(c_0-y)} dF(y) + ∫_{-∞}^{c_0} e^{-R(c_0-y)} dF(y)
  = e^{-Rc_0} M_{S_1-Π_1}(R) = e^{-Rc_0},

due to the choice of the Lundberg coefficient R > 0 (in the first integral we use that e^{-R(c_0-y)} ≥ 1 for y > c_0). This proves the Lundberg bound. □

Remarks on Lundberg's exponential bound.

Under (NPC) and the existence of the Lundberg coefficient R > 0 we have an exponentially decaying bound on the ultimate ruin probability as the initial capital c_0 → ∞, i.e.

ψ(c_0) ≤ e^{-Rc_0}.

Set ε > 0 (small). There exists c_0 = c_0(R, ε) ≥ 0 such that ψ(c_0) ≤ ε. This means that in the Lundberg case we can specify a maximal admissible ruin probability as tolerance ε, and then we can choose an appropriate initial capital c_0 which implies that the ultimate ruin probability ψ(c_0) is bounded by this tolerance.

The existence of the Lundberg coefficient R > 0 implies that M_{S_1}(R) < ∞ and, using Chebychev's inequality,

P[S_1 > x] = P[ e^{RS_1} > e^{Rx} ] ≤ e^{-Rx} M_{S_1}(R) ∝ e^{-Rx}   as x → ∞.

This means that the claims S_1 have exponentially decaying tails, which are so-called light tailed claims.
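In concrete models the Lundberg coefficient can be computed numerically. The following R sketch uses our own illustrative choices: a deterministic premium Π_1 ≡ π_0 and exponential claims S_1 ∼ Expo(β), for which M_{S_1-Π_1}(r) = e^{-rπ_0} β/(β - r) for r < β.

# Lundberg coefficient for S1 ~ Expo(beta) and deterministic premium pi0;
# (NPC) requires pi0 > E[S1] = 1/beta
> beta <- 1; pi0 <- 1.5
> h <- function(r) -r * pi0 + log(beta/(beta - r))   # h(r) = log M_{S1-Pi1}(r)
> uniroot(h, lower = 1e-8, upper = beta - 1e-8)$root # Lundberg coefficient R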
A main question is whether this exponential bound can be improved in the case where the Lundberg coefficient exists. The difficulty in most cases is that the ultimate ruin probability cannot be calculated explicitly. An exception is the Bernoulli case.

Proposition 5.12 (Bernoulli random walk). Assume that the X_t are i.i.d. with P[X_t = 1] = p and P[X_t = -1] = 1-p for given p > 1/2. For all c_0 ∈ N_0 we have

ψ(c_0) = ((1-p)/p)^{c_0+1}.

Note that this model is obtained by assuming Π_t ≡ 1 and S_t ∈ {0, 2}, with probability p of having a zero claim.
Proof. We choose a finite interval (-1, a) for a ∈ N and define for fixed c_0 ∈ [0, a) ∩ N_0 the stopping time

τ_a = inf { s ∈ N_0 ; C_s = c_0 + Z_s ∉ (-1, a) }.

The random walk theorem implies τ_a < ∞, P-a.s., because the interval (-1, a) is finite. We define the random variable

Y_t = ((1-p)/p)^{C_t} = ((1-p)/p)^{c_0 + Z_t}.

It satisfies

E[Y_t | Y_{t-1}] = Y_{t-1} E[ ((1-p)/p)^{X_t} | Y_{t-1} ] = Y_{t-1} ( p ((1-p)/p)^{1} + (1-p) ((1-p)/p)^{-1} ) = Y_{t-1} ( (1-p) + p ) = Y_{t-1},

thus (Y_t)_{t≥0} is a martingale. Then also the stopped process (Y_{τ_a ∧ t})_{t≥0} is a martingale. Moreover, the latter martingale is bounded, and since the stopping time is finite, P-a.s., we can apply the stopping theorem (uniform integrability), see Section 10.10 in Williams [97], which provides

((1-p)/p)^{c_0} = E[Y_0] = E[Y_{τ_a}]
  = ((1-p)/p)^{-1} P_{c_0}[C_{τ_a} = -1] + ((1-p)/p)^{a} P_{c_0}[C_{τ_a} = a]
  = ((1-p)/p)^{-1} P_{c_0}[C_{τ_a} = -1] + ((1-p)/p)^{a} ( 1 - P_{c_0}[C_{τ_a} = -1] ),

where the last step follows because (C_t)_{t∈N_0} leaves the interval (-1, a), P_{c_0}-a.s., either at -1 or at a. This provides the identity

P_{c_0}[C_{τ_a} = -1] = [ ((1-p)/p)^{c_0} - ((1-p)/p)^{a} ] / [ ((1-p)/p)^{-1} - ((1-p)/p)^{a} ].

Finally, note that {C_{τ_a} = -1} is increasing in a, and thus

ψ(c_0) = P_{c_0}[τ < ∞] = lim_{a→∞} P_{c_0}[C_{τ_a} = -1] = ((1-p)/p)^{c_0+1},

because p > 1-p. This proves the proposition. □

The Lundberg coefficient for the Bernoulli random walk is found by the positive solution of

M_{-X_1}(R) = (1-p) e^{R} + p e^{-R} = 1,   i.e.   R = log( p/(1-p) ) > 0.

This together with Proposition 5.12 provides in the Bernoulli case

ψ(c_0) = ((1-p)/p) e^{-Rc_0}.

That is, the Lundberg bound is optimal in the sense that we cannot improve the exponential order of decay: the Lundberg coefficient R already provides the optimal order; a small simulation check is sketched below.
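Proposition 5.12 can also be checked by simulation; the following small sketch (our own illustrative choices, with a finite time horizon that approximates the ultimate ruin probability from below) reproduces ψ(c_0) = ((1-p)/p)^{c_0+1} quite accurately.

# Monte Carlo check of the Bernoulli random walk ruin probability
> set.seed(1)
> p <- 0.6; c0 <- 2; Tmax <- 1000; nsim <- 10000
> ruin <- replicate(nsim, {
    Z <- cumsum(sample(c(1, -1), Tmax, replace = TRUE, prob = c(p, 1 - p)))
    any(c0 + Z < 0)
  })
> mean(ruin)              # simulated finite horizon ruin probability
> ((1 - p)/p)^(c0 + 1)    # exact ultimate ruin probability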

5.3 Pollaczek-Khinchin formula

In most cases we cannot explicitly calculate the ultimate ruin probability ψ(c_0). Exceptions are the Bernoulli random walk of Proposition 5.12 and the Cramér-Lundberg process in continuous time with an exponential claim size distribution, see (5.3.8) in Rolski et al. [89]. In other cases where the Lundberg coefficient exists we apply Lundberg's exponential bound of Theorem 5.11, or refined versions thereof. But the following question remains: what can we do if the Lundberg coefficient does not exist, i.e. if the tail probability of S_t does not decay exponentially? The latter is quite typical in non-life insurance modeling.

5.3.1 Ladder epochs

The practically oriented reader may skip this section.

We assume (NPC) throughout this section; thus we know that C_t → ∞, P_{c_0}-a.s., and ψ(0) < 1. Under these assumptions we can study the (local) minima of the surplus process. This study is done by looking at the ladder heights that define these minima. We follow Bühlmann [18], Section 6.2.6, Feller [45], Chapter XII, and Rolski et al. [89], Chapter 6. We define the stopping times τ̃_0 = 0 and for k ∈ N

τ̃_k = inf { t > τ̃_{k-1} ; Z_t < Z_{τ̃_{k-1}} }   if τ̃_{k-1} < ∞,   and τ̃_k = ∞ otherwise.

τ̃_k is called the k-th strong descending ladder epoch, see (6.3.6) in Rolski et al. [89]. These stopping times form an increasing sequence that records the arrivals of new ladder heights (descending records). For their distribution functions we have, under the i.i.d. property of the X_t's (independent and stationary increments),

P[ τ̃_k < ∞ | τ̃_{k-1} < ∞ ] = P[ inf { t > τ̃_{k-1} ; Z_t < Z_{τ̃_{k-1}} } < ∞ | τ̃_{k-1} < ∞ ]
  = P[ inf { t > 0 ; Z_t < Z_0 } < ∞ | τ̃_0 < ∞ ] = P[ inf { t > 0 ; Z_t < 0 } < ∞ ] = ψ(0) < 1.

The probability of a finite ladder epoch is exactly equal to the ultimate ruin probability ψ(0) with initial capital c_0 = 0.
Note that we could have Π_t - S_t ≥ 0, P-a.s., which would imply that the ultimate ruin probability ψ(0) = 0 because the premium collected is bigger than the maximal claim, P-a.s. We exclude this situation as it is not interesting for ruin probability considerations, and because the insured will (hopefully) never pay a premium that exceeds his maximal loss in any situation. Henceforth, under (NPC) we throughout assume that ψ(0) ∈ (0, 1) (where the upper bound follows from (NPC)).

We define the random variable

K^+ = sup { k ∈ N_0 ; τ̃_k < ∞ }.

K^+ counts the total number of finite ladder epochs, i.e. the total number of strong descending records. We have (applying the tower property several times)

P[K^+ = k] = P[ τ̃_k < ∞, τ̃_{k+1} = ∞ ] = ψ(0)^k (1 - ψ(0)),

that is, the total number of finite ladder epochs has a geometric distribution with success probability 1 - ψ(0) ∈ (0, 1) under (NPC). On the set {K^+ = k}, k ≥ 1, we study the ladder heights, which are for l ≤ k given by

Z_l^+ = Z_{τ̃_{l-1}} - Z_{τ̃_l} > 0,   P-a.s.

The random variable Z_l^+ measures by which amount the old local minimum Z_{τ̃_{l-1}} is improved. Due to the i.i.d. property of the X_t's we have

P[ ∩_{l=1}^k {Z_l^+ ≤ x_l} | K^+ = k ] = Π_{l=1}^k P[ Z_l^+ ≤ x_l | τ̃_l < ∞ ] = Π_{l=1}^k H(x_l),   (5.3)

where the distribution function H neither depends on k nor on l. Thus, the ladder heights (Z_l^+)_{l=1,...,k} are i.i.d. on the set {K^+ = k}. Finally, we consider the maximal ladder height achieved by (Z_t)_{t∈N_0}; this is the global minimum of the random walk (Z_t)_{t∈N_0}:

M^+ = Σ_{l=1}^{K^+} Z_l^+ = Z_0 - Z_{τ̃_{K^+}} = -Z_{τ̃_{K^+}} = sup_{t∈N_0} (-Z_t) = -inf_{t∈N_0} Z_t.

This now allows to study the ultimate ruin probability as follows. Choose initial capital c_0 ≥ 0. The ultimate ruin probability is given by

ψ(c_0) = P_{c_0}[ inf_{t∈N_0} C_t < 0 ] = P_{c_0}[ inf_{t∈N_0} C_t - c_0 < -c_0 ] = P[ inf_{t∈N_0} Z_t < -c_0 ]
  = P[M^+ > c_0] = Σ_{k∈N_0} P[K^+ = k] P[ Σ_{l=1}^{K^+} Z_l^+ > c_0 | K^+ = k ]
  = (1 - ψ(0)) Σ_{k∈N} ψ(0)^k ( 1 - P[ Σ_{l=1}^{K^+} Z_l^+ ≤ c_0 | K^+ = k ] )
  = (1 - ψ(0)) Σ_{k∈N} ψ(0)^k ( 1 - H^{*k}(c_0) ).

This proves Spitzer's formula, which is Corollary 6.3.1 in Rolski et al. [89]:

Theorem 5.13 (Spitzer's formula). Assume ψ(0) ∈ (0, 1). Then for c_0 ≥ 0

ψ(c_0) = (1 - ψ(0)) Σ_{k∈N} ψ(0)^k ( 1 - H^{*k}(c_0) ).


The previous theorem goes back to Frank Ludvig Spitzer (1926-1992). It gives us another description of the ruin probability under (NPC). The main difficulty is the determination of the ladder height distribution H defined in (5.3). In special cases it can be calculated explicitly. We give the Cramér-Lundberg case below; for further details we also refer to Rolski et al. [89], Section 6.4.3. The random walk is given by, see (5.2),

Z_t = Σ_{u=1}^t (Π_u - S_u) = Σ_{u=1}^t X_u.

In the next section we consider a special case thereof.

5.3.2 Cramér-Lundberg process

In classical (continuous time) ruin theory one starts with a homogeneous Poisson point process (N_t)_{t∈R_+} having constant intensity λv > 0 for the arrival of claims. The premium income is modeled proportionally to time with constant premium rate πv > 0. The continuous time surplus process is then defined by C_0 = c_0 ≥ 0 and for t > 0

C_t = c_0 + πvt - Σ_{u=1}^{N_t} S_u,   (5.4)

with i.i.d. claim amounts S_u satisfying S_u > 0, P-a.s., and with these claim amounts being independent of the claims arrival process (N_t)_{t∈R_+}. This continuous time surplus process (C_t)_{t∈R_+} is called Cramér-Lundberg process. Definition 5.2 of the ruin time is then extended to continuous time, namely

τ̄ = inf { s ∈ R_+ ; C_s < 0 }.

Note that ruin can only occur at time points where claims happen; otherwise the continuous time surplus process (C_t)_{t∈R_+} is strictly increasing with constant slope πv > 0 (in fact, the continuous time surplus process is a spectrally negative Lévy process, see Chapter 1 in Kyprianou [68]). We define the inter-arrival times between two claims by W_u, u ∈ N. For the homogeneous Poisson point process (N_t)_{t∈R_+} these inter-arrival times are i.i.d. exponentially distributed with parameter λv > 0. Therefore, we can rewrite the continuous time surplus process in these claims arrival times by, setting V_n = Σ_{u=1}^n W_u,

C_n = C_{V_n} = c_0 + πv V_n - Σ_{u=1}^{N_{V_n}} S_u = c_0 + Σ_{u=1}^n (πv W_u - S_u).

This is exactly in the set-up of Definition 5.1 with i.i.d. premia Π_u = πv W_u, u ∈ N. A crucial thing that has changed is time, moving from t ∈ R_+ to operational time n ∈ N_0, and therefore

P[ τ̄ < ∞ | C_0 = c_0 ] = P_{c_0}[τ < ∞] = ψ(c_0),   (5.5)

with Π_u = πv W_u. For (NPC) we require a premium rate πv > 0 such that

0 < E[X_1] = πv E[W_1] - E[S_1] = π/λ - E[S_1]   ⟺   π > λ E[S_1].

The exponential distribution has the lack-of-memory property, which means that the waiting time for the next claim does not depend on how long we have already been waiting for it. It is this property which allows to calculate H explicitly in the Cramér-Lundberg/compound Poisson case (5.4), namely, for x ≥ 0

H(x) = E[S_1]^{-1} ∫_0^x P[S_1 > y] dy.   (5.6)

We do not prove this statement; it uses the Wiener-Hopf factorization, for details we refer to Theorem 6.4.4 in Rolski et al. [89]. Note that H is a distribution function on R_+ because ∫_0^∞ P[S_1 > y] dy = E[S_1]. This allows to state the following theorem, which gives the Félix Pollaczek (1892-1981) and Aleksandr Yakovlevich Khinchin (1894-1959) formula.

Theorem 5.14 (Pollaczek-Khinchin formula). Assume we have the compound Poisson model (5.4) with (NPC) given by ρ = λ E[S_1]/π ∈ (0, 1). The ultimate ruin probability for initial capital c_0 ≥ 0 is given by

ψ(c_0) = (1 - ρ) Σ_{k∈N} ρ^k ( 1 - H^{*k}(c_0) ),

with distribution function H given by (5.6).

Proof. See Rolski et al. [89], Theorem 6.4.4. □

Remark. In the compound Poisson model (5.4) with (NPC) one can also prove an integral equation for the ultimate ruin probability, given by

ψ(c_0) = (λ/π) [ ∫_{c_0}^∞ (1 - F_S(x)) dx + ∫_0^{c_0} ψ(c_0 - x)(1 - F_S(x)) dx ],

with distribution function S_1 ∼ F_S. We do not prove this statement because the Pollaczek-Khinchin formula is sufficient for our purposes. The exact assumptions and a proof of this integral equation are, for instance, provided in Rolski et al. [89], Theorem 5.3.2.

We conclude that for the compound Poisson case (5.4) we have three different descriptions of the ultimate ruin probability: (i) the probabilistic description, (ii) the Pollaczek-Khinchin formula from renewal theory, and (iii) the integral equation. Depending on the problem one then chooses the most convenient one, i.e. we can apply different techniques coming from different fields to solve the questions.
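Note that the Pollaczek-Khinchin formula identifies 1 - ψ(c_0) as the distribution function of the compound geometric random variable M^+ = Σ_{l=1}^{K^+} Z_l^+. Since the geometric distribution is itself a Panjer distribution (p_k/p_{k-1} = ρ, i.e. a = ρ and b = 0), ψ(c_0) can be evaluated numerically with the Panjer algorithm of Section 4.2.1 applied to a discretization of the ladder height distribution H. A minimal sketch (our own illustrative code; h[k] denotes the discretized weight of H at kd for k = 1, . . . , n):

# psi via Panjer recursion for the compound geometric M^+ (a = rho, b = 0)
> pk.ruin <- function(rho, h) {
    n <- length(h)
    f <- numeric(n + 1)
    f[1] <- 1 - rho                        # P[M^+ = 0] = P[K^+ = 0]
    for (r in 1:n) {
      k <- 1:r
      f[r+1] <- rho * sum(h[k] * f[r-k+1])
    }
    1 - cumsum(f)                          # psi at grid points 0, d, 2d, ...
  }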
5.4 Subexponential claim sizes

A distribution function F supported on R_+, i.e. F(0) = 0, is called subexponential if

lim_{x→∞} ( 1 - F^{*2}(x) ) / ( 1 - F(x) ) = 2.

We start with a technical lemma that gives properties of subexponential distribution functions and a characterization. We follow the proofs in Rolski et al. [89], Section 2.5.2.

Lemma 5.15 (subexponential distribution functions). Assume F is subexponential. Then the following statements hold true:

1. For all n ∈ N

lim_{x→∞} ( 1 - F^{*n}(x) ) / ( 1 - F(x) ) = n.

In fact, this is an if and only if statement.

2. For all r > 0

lim_{x→∞} e^{rx} ( 1 - F(x) ) = ∞.

3. For all ε > 0 there exists D < ∞ such that for all n ≥ 2 and all x ≥ 0

( 1 - F^{*n}(x) ) / ( 1 - F(x) ) ≤ D (1 + ε)^n.

Proof of Lemma 5.15. We start with the following statement for subexponential distribution functions F: for all t ∈ R

lim_{x→∞} ( 1 - F(x-t) ) / ( 1 - F(x) ) = 1.   (5.7)

We first prove (5.7). Choose t ≥ 0; then we have for x > t, using the monotonicity of F,

( 1 - F^{*2}(x) )/( 1 - F(x) ) - 1 = ( F(x) - F^{*2}(x) )/( 1 - F(x) ) = ∫_0^x ( 1 - F(x-y) )/( 1 - F(x) ) dF(y)
  = ∫_0^t ( 1 - F(x-y) )/( 1 - F(x) ) dF(y) + ∫_t^x ( 1 - F(x-y) )/( 1 - F(x) ) dF(y)
  ≥ F(t) + ( 1 - F(x-t) )/( 1 - F(x) ) ( F(x) - F(t) ).

This implies (the sandwich is for lim inf_{x→∞} and lim sup_{x→∞})

1 ≤ lim_{x→∞} ( 1 - F(x-t) )/( 1 - F(x) ) ≤ lim sup_{x→∞} ( F(x) - F(t) )^{-1} [ ( 1 - F^{*2}(x) )/( 1 - F(x) ) - 1 - F(t) ] = ( 2 - 1 - F(t) )/( 1 - F(t) ) = 1.

For t < 0 note that

lim_{x→∞} ( 1 - F(x-t) )/( 1 - F(x) ) = lim_{x→∞} [ ( 1 - F(x) )/( 1 - F(x-t) ) ]^{-1} = lim_{y→∞} [ ( 1 - F(y-(-t)) )/( 1 - F(y) ) ]^{-1} = 1.

This proves (5.7). The second auxiliary statement is

lim_{x→∞} ∫_0^x ( 1 - F(x-y) )/( 1 - F(x) ) dF(y) = 1.   (5.8)

This is an immediate consequence of

( 1 - F^{*2}(x) )/( 1 - F(x) ) - 1 = ( F(x) - F^{*2}(x) )/( 1 - F(x) ) = ∫_0^x ( 1 - F(x-y) )/( 1 - F(x) ) dF(y).   (5.9)

We now turn to the proof of the first statement of Lemma 5.15. We prove the claim by induction. For n = 1, 2 the statement holds true by definition. Thus, we assume that it holds true for n ≥ 2 and we would like to prove it for n+1. Choose ε > 0; then there exists x_0 such that for all x > x_0

| ( 1 - F^{*n}(x) )/( 1 - F(x) ) - n | < ε.

This implies for x > x_0

( 1 - F^{*(n+1)}(x) )/( 1 - F(x) ) - 1 = ∫_0^x ( 1 - F^{*n}(x-y) )/( 1 - F(x) ) dF(y)
  = ∫_0^{x-x_0} [ ( 1 - F^{*n}(x-y) )/( 1 - F(x-y) ) ] [ ( 1 - F(x-y) )/( 1 - F(x) ) ] dF(y) + ∫_{x-x_0}^x ( 1 - F^{*n}(x-y) )/( 1 - F(x) ) dF(y).

The second integral is non-negative, and using (5.7) we obtain

lim sup_{x→∞} ∫_{x-x_0}^x ( 1 - F^{*n}(x-y) )/( 1 - F(x) ) dF(y) ≤ lim sup_{x→∞} ∫_{x-x_0}^x ( 1 - F(x) )^{-1} dF(y)
  = lim sup_{x→∞} ( F(x) - F(x-x_0) )/( 1 - F(x) ) = -1 + lim sup_{x→∞} ( 1 - F(x-x_0) )/( 1 - F(x) ) = 0.

For the first integral we have for x > x_0, using the triangle inequality,

| ∫_0^{x-x_0} [ ( 1 - F^{*n}(x-y) )/( 1 - F(x-y) ) ] [ ( 1 - F(x-y) )/( 1 - F(x) ) ] dF(y) - n ∫_0^{x-x_0} ( 1 - F(x-y) )/( 1 - F(x) ) dF(y) |
  ≤ ∫_0^{x-x_0} | ( 1 - F^{*n}(x-y) )/( 1 - F(x-y) ) - n | ( 1 - F(x-y) )/( 1 - F(x) ) dF(y) ≤ ε ∫_0^{x-x_0} ( 1 - F(x-y) )/( 1 - F(x) ) dF(y).

Finally observe

∫_0^{x-x_0} ( 1 - F(x-y) )/( 1 - F(x) ) dF(y) = ∫_0^x ( 1 - F(x-y) )/( 1 - F(x) ) dF(y) - ∫_{x-x_0}^x ( 1 - F(x-y) )/( 1 - F(x) ) dF(y);

the first integral converges to 1, see (5.8), and the second integral converges to 0 because it is non-negative with

lim sup_{x→∞} ∫_{x-x_0}^x ( 1 - F(x-y) )/( 1 - F(x) ) dF(y) ≤ lim sup_{x→∞} ( F(x) - F(x-x_0) )/( 1 - F(x) ) = -1 + lim sup_{x→∞} ( 1 - F(x-x_0) )/( 1 - F(x) ) = 0.

This proves that for all ε > 0 there exists x_1 ≥ x_0 such that for all x > x_1 we have

| ( 1 - F^{*(n+1)}(x) )/( 1 - F(x) ) - (n+1) | ≤ 4ε.

This proves the first statement of Lemma 5.15. We now turn to the second statement of the lemma. Note that for 0 < y < x

e^{rx} ( 1 - F(x) ) = [ ( 1 - F(x) )/( 1 - F(x-y) ) ] ( 1 - F(x-y) ) e^{r(x-y)} e^{ry}.

Choose ε > 0 and y > (1/r) log( 3/(1-ε) ) > 0. With (5.7) there exists x_0 such that for all x > x_0

e^{rx} ( 1 - F(x) ) ≥ (1-ε)( 1 - F(x-y) ) e^{r(x-y)} e^{ry} > 3 ( 1 - F(x-y) ) e^{r(x-y)}.

This implies that the function x ↦ e^{rx}(1 - F(x)) is eventually strictly increasing along steps of size y, with limit +∞. So there remains the proof of the last statement of Lemma 5.15. Define α_n = sup_{x≥0} ( 1 - F^{*n}(x) )/( 1 - F(x) ). Note that the first assertion of the lemma implies that α_n < ∞. Moreover, we have 1 - F^{*(n+1)}(x) = 1 - F * F^{*n}(x) = 1 - F(x) + ∫_0^x ( 1 - F^{*n}(x-y) ) dF(y). This implies for any x_0 ∈ (0, ∞)

α_{n+1} = sup_{x≥0} [ 1 - F(x) + ∫_0^x ( 1 - F^{*n}(x-y) ) dF(y) ] / ( 1 - F(x) )
  = 1 + sup_{0≤x≤x_0} ∫_0^x ( 1 - F^{*n}(x-y) )/( 1 - F(x) ) dF(y) + sup_{x>x_0} ∫_0^x ( 1 - F^{*n}(x-y) )/( 1 - F(x) ) dF(y)
  ≤ 1 + ( 1 - F(x_0) )^{-1} + sup_{x>x_0} ∫_0^x [ ( 1 - F^{*n}(x-y) )/( 1 - F(x-y) ) ] [ ( 1 - F(x-y) )/( 1 - F(x) ) ] dF(y)
  ≤ 1 + ( 1 - F(x_0) )^{-1} + α_n sup_{x>x_0} ∫_0^x ( 1 - F(x-y) )/( 1 - F(x) ) dF(y)
  = 1 + ( 1 - F(x_0) )^{-1} + α_n sup_{x>x_0} [ ( 1 - F^{*2}(x) )/( 1 - F(x) ) - 1 ],

where we have used (5.9) in the last step. The subexponentiality of F implies that for all ε > 0 there exists x_0 such that

α_{n+1} ≤ 1 + ( 1 - F(x_0) )^{-1} + α_n (1 + ε).

Iteration provides

α_{n+1} ≤ ( 1 + ( 1 - F(x_0) )^{-1} ) Σ_{k=0}^{n-1} (1 + ε)^k + (1 + ε)^n
  ≤ ( 1 + ( 1 - F(x_0) )^{-1} ) Σ_{k=0}^{n} (1 + ε)^k ≤ ( 1 + ( 1 - F(x_0) )^{-1} ) ε^{-1} (1 + ε)^{n+1},

which proves the claim for D = ( 1 + ( 1 - F(x_0) )^{-1} )/ε ∈ (0, ∞). This proves Lemma 5.15. □

Statements 1 and 3 of Lemma 5.15 will be important in the analysis of the Pollaczek-Khinchin formula. Statement 2 of Lemma 5.15 says that for subexponential distributions the moment generating function does not exist for any r > 0: choose X ∼ F with F subexponential, then

E[e^{rX}] = ∫_0^∞ P[e^{rX} > y] dy = ∫_0^∞ P[X > log(y)/r] dy ≥ r ∫_0^∞ e^{rx} P[X > x] dx = ∞.   (5.10)

We conclude that for any r > 0 the moment generating function of a subexponential distribution does not exist, and therefore there is no Lundberg coefficient in this case. We call such subexponential distributions heavy tailed distributions.


Theorem 2.5.5 in Rolski et al. [89] gives an important sufficient condition for having a subexponential distribution.

Lemma 5.16 (regularly varying survival function). Assume that F is supported on R_+ and has a regularly varying survival function at infinity with index α ∈ (0, ∞), i.e. for all y > 0

lim_{x→∞} ( 1 - F(xy) )/( 1 - F(x) ) = y^{-α}.

Then F is subexponential.
Proof. Assume that X_1 and X_2 are two i.i.d. random variables with regularly varying survival functions with parameter α ∈ (0, ∞). Note that we have for all δ ∈ (0, 1)

{X_1 + X_2 > x} ⊂ {X_1 > (1-δ)x} ∪ {X_2 > (1-δ)x} ∪ {X_1 > δx, X_2 > δx}.

The i.i.d. property implies

P[X_1 + X_2 > x] ≤ 2 P[X_1 > (1-δ)x] + P[X_1 > δx]^2.

Thus, we have

lim sup_{x→∞} ( 1 - F^{*2}(x) )/( 1 - F(x) ) ≤ inf_{δ∈(0,1)} lim sup_{x→∞} [ 2(1 - F((1-δ)x)) + (1 - F(δx))^2 ] / ( 1 - F(x) ) ≤ inf_{δ∈(0,1)} 2(1-δ)^{-α} = 2,

where we use that (1 - F(δx))^2/(1 - F(x)) → 0 by regular variation. On the other hand we have for any positively supported distribution function F, see also (5.9),

( 1 - F^{*2}(x) )/( 1 - F(x) ) = 1 + ( F(x) - F^{*2}(x) )/( 1 - F(x) ) = 1 + ∫_0^x ( 1 - F(x-y) )/( 1 - F(x) ) dF(y) ≥ 1 + ∫_0^x dF(y) = 1 + F(x),

since by assumption F(0) = 0. This immediately implies that

lim inf_{x→∞} ( 1 - F^{*2}(x) )/( 1 - F(x) ) ≥ 2.

Note that the lower bound holds true for any distribution function supported on R_+. □

Remarks 5.17. Lemma 5.16 gives the connection to classical extreme value theory. In extreme value theory one distinguishes three different domains of attraction for tail behavior, see Section 3.3 in Embrechts et al. [39]: (i) the Weibull case, which are distribution functions with finite right endpoint of their support; (ii) the Gumbel case, which are light tailed to moderately heavy tailed distribution functions; (iii) the Fréchet case, which are heavy tailed distribution functions. The Fréchet case is exactly characterized by regularly varying survival functions with (tail) index α ∈ (0, ∞), see Theorem 3.3.7 in Embrechts et al. [39]. This index has already been met in Section 3.2, see formula (3.4). Lemma 5.16 now says that every distribution function that belongs to the Fréchet domain of attraction is also subexponential. However, the class of subexponential distribution functions is larger than the class of distribution functions with regularly varying survival functions: the Weibull distribution of Section 3.2.2 with shape parameter 0 < τ < 1 is subexponential but does not have a regularly varying survival function (see Example 1.4.3 in Embrechts et al. [39]), and also the log-normal distribution is subexponential but does not have a regularly varying survival function (see Example 1.4.7 in Embrechts et al. [39]).

                                  subexponential   regularly varying at ∞
gamma distribution                no               no
Weibull distribution with τ < 1   yes              no
log-normal distribution           yes              no
log-gamma distribution            yes              yes
Pareto distribution               yes              yes

Table 5.1: Subexponentiality and regular variation at infinity.

We apply the Pollaczek-Khinchin formula, see Theorem 5.14, to obtain the following result in the subexponential case.

Theorem 5.18 (subexponential case, Embrechts-Veraverbeke). Assume we have the compound Poisson model (5.4) with (NPC) given by ρ = λ E[S_1]/π ∈ (0, 1). Moreover, we assume that the ladder height distribution function H given by (5.6) is subexponential. Then we have

lim_{c_0→∞} ψ(c_0) / ( 1 - H(c_0) ) = ρ / (1 - ρ).

Proof. Our aim is to apply Lemma 5.15 to the Pollaczek-Khinchin formula. The latter provides

lim_{c_0→∞} ψ(c_0)/( 1 - H(c_0) ) = (1 - ρ) lim_{c_0→∞} Σ_{k∈N} ρ^k ( 1 - H^{*k}(c_0) )/( 1 - H(c_0) ).

Our aim is to exchange the limit c_0 → ∞ and the infinite summation. Note that Lemma 5.15 provides point-wise convergence of the last terms to k as c_0 → ∞; therefore our aim is to find a uniformly integrable upper bound so that we can apply the dominated convergence theorem. To this end we choose ε ∈ (0, 1/ρ - 1). Then Lemma 5.15 implies that there exists D < ∞ such that for all k ≥ 1 and c_0 ≥ 0 we have the uniform integrable upper bound

(1 - ρ) Σ_{k∈N} ρ^k ( 1 - H^{*k}(c_0) )/( 1 - H(c_0) ) ≤ (1 - ρ) Σ_{k∈N} ρ^k D(1+ε)^k = (1 - ρ) D Σ_{k∈N} ( ρ(1+ε) )^k < ∞,

because ρ(1+ε) < 1. Thus, we have found a uniform integrable upper bound, and this allows to exchange the two limits. This provides

lim_{c_0→∞} ψ(c_0)/( 1 - H(c_0) ) = (1 - ρ) Σ_{k∈N} ρ^k lim_{c_0→∞} ( 1 - H^{*k}(c_0) )/( 1 - H(c_0) ) = (1 - ρ) Σ_{k∈N} ρ^k k.

The last term equals the value ρ/(1-ρ), the expected value of the geometric distribution with success probability 1-ρ. This proves the theorem. □

Example 5.19 (Pareto claim sizes). We assume that we are in the compound Poisson model of Theorem 5.14. The claim size distribution of S_1 is given by a Pareto distribution with threshold θ > 0 and tail parameter α > 1. Under these assumptions we calculate the ladder height distribution H. For x ≥ θ

H(x) = 1 - E[S_1]^{-1} ∫_x^∞ P[S_1 > y] dy = 1 - ((α-1)/(θα)) ∫_x^∞ (y/θ)^{-α} dy = 1 - (1/α) (x/θ)^{-α+1}.

This implies that H has a regularly varying survival function with tail index α - 1 > 0. Therefore, Lemma 5.16 implies that H is subexponential, and we can apply Theorem 5.18 to obtain

lim_{c_0→∞} ψ(c_0) / [ (1/α)(c_0/θ)^{-α+1} ] = ρ/(1 - ρ).

That is, we have found in the Pareto (subexponential) case for α > 1

ψ(c_0) ∼ ρ/((1-ρ)α) (c_0/θ)^{-α+1}   as c_0 → ∞.

NL

Conclusions. We conclude that the heavy tailed case may
lead to a much more dangerous ruin behavior. In Example
5.19 we obtain for the asymptotic ruin behavior a power law
decay as the initial capital goes to infinity, whereas in the light
tailed case we obtain the exponentially decaying Lundberg
bound, see Theorem 5.11. This is an impressive example that
heavy tailed claims require careful risk management practice.
For instance, an excess-of-loss reinsurance cover with retention level M > 0 would completely change the ruin behavior of a company facing
Pareto distributed claims S_t, t ∈ N. Also the triggers of ruin are very different in
the two cases. In the light tailed case it is the big mass of claims that causes ruin,
whereas in the heavy tailed case it is the single large claim event that causes ruin.
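
The contrast between the two decay regimes can also be made tangible numerically. The following sketch (in Python, with purely hypothetical illustration parameters γ = 0.8, α = 2, θ = 1 and a hypothetical Lundberg-type coefficient R = 0.05; none of these values are calibrated) evaluates the power law approximation of Example 5.19 next to an exponential bound exp{−R c0} of Lundberg type:

    import math

    # hypothetical illustration parameters (not calibrated to anything)
    gamma, alpha, theta = 0.8, 2.0, 1.0   # NPC constant, Pareto tail parameter, threshold
    R = 0.05                              # a hypothetical Lundberg coefficient

    def psi_pareto(c0):
        # power law approximation of Example 5.19, valid for large c0
        return gamma / ((1.0 - gamma) * alpha) * (c0 / theta) ** (-alpha + 1.0)

    def lundberg_bound(c0):
        # exponential decay of the light tailed case (Theorem 5.11)
        return math.exp(-R * c0)

    for c0 in [10.0, 100.0, 1000.0]:
        print(c0, psi_pareto(c0), lundberg_bound(c0))

For c0 = 1000 the power law still gives a ruin probability of order 10⁻³, while the exponential bound is of order 10⁻²², which is exactly the risk management message above.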
The most general version of the asymptotic ruin behavior in the subexponential case
goes back to Paul Embrechts and Noël Veraverbeke [41]. However, an important missing piece in the argument was provided by Charles M. Goldie.
The Pareto case had previously been solved by Bengt von Bahr [8].

Chapter 6

Premium Calculation Principles
From the random walk Theorem 5.4 and from Assumption 5.6 we see that we need
to charge an (expected) premium that exceeds the expected claim amount E[S_t],
otherwise there is ultimate ruin, P-a.s. This is referred to as the net profit condition
(NPC). In the present chapter we assume that the premium π_t is deterministic; then
(NPC) reads as π_t > E[S_t]. For simplicity (because we consider a fixed accounting
year in this chapter) we drop the time index t, and then (NPC) is given by

π > E[S],    (6.1)

with total (annual) claim amount S ∼ F_S. In this chapter

• we justify why the insurance company can charge a premium π that exceeds the
average claim amount E[S], i.e. why the insured is willing to pay a premium π
that exceeds his expected claim amount E[S]; and

• we give different pricing principles to calculate premium loadings π − E[S] > 0.

Simple solution (expected value principle). Choose a fixed constant α > 0 and
charge (to everyone) the premium

π = (1 + α) E[S].    (6.2)

Are we happy with this solution?

Example 6.1 (expected value principle). We consider two different portfolios with
claims S1 and S2 having the same mean E[S1] = E[S2]. Under the previous simple
solution both insured pay the same insurance premium

π = (1 + α) E[S1] = (1 + α) E[S2] > E[S2] = E[S1].

We give an explicit distributional example.
• Assume S1 ∼ Γ(γ, c) with mean E[S1] = γ/c, and

• S2 ≡ γ/c is a constant.

Observe that there is absolutely no uncertainty in portfolio S2, that is, we can
perfectly predict claim S2 (and, of course, also the insured can perfectly predict
his claim). But then it is natural that the insured is not willing to pay a premium
that exceeds his (maximal possible) loss S2 = γ/c, i.e. (hopefully) he refuses to pay
insurance premium π > E[S2] = S2. Moreover, the risk characteristics are rather
different between the two portfolios S1 and S2.  ∎


Conclusion. The premium loading should be risk-based! That is, the loading
π − E[S] > 0 should reflect the risk of fluctuations of S around its mean E[S].

6.1 Simple risk-based principles

The first notion of risk is usually described by the variance of a random variable.
Therefore, we assume in this section that the second moment of S exists.

Variance loading principle. Choose a fixed constant α > 0 and define the
insurance premium by

π = E[S] + α Var(S).

Revisiting Example 6.1 we obtain insurance premia using variance loadings

π1 = E[S1] + α Var(S1) = γ/c + α γ/c² > π2 = E[S2] + α Var(S2) = γ/c.

That is, for the risky position S1 we now charge a premium that strictly exceeds the
expected value and the loading is zero for the deterministic claim S2. An unpleasant
feature of the variance loading principle is that the calibration is difficult because
the loading constant α is not scaling invariant, that is, the principle is not invariant
under scalings such as changes of currencies, etc. Let us give an example. Assume
that r_fx > 0 is the (deterministic) exchange rate between two different currencies.
Assume r_fx ≠ 1, then we obtain

π_fx = E[r_fx S] + α Var(r_fx S) = r_fx E[S] + α r_fx² Var(S) ≠ r_fx π.

This non-linearity of the variance implies that the premium cannot easily be scaled
with exchange rates and inflation indexes. Therefore, one often studies modifications of the variance principle, which brings us to the next principle.
Standard deviation loading principle. Choose a fixed constant α > 0 and
define the insurance premium by

π = E[S] + α Var(S)^{1/2} = E[S] (1 + α Vco(S)),

where the last equality requires that E[S] > 0.


This principle gives an explicit meaning to the loading constant α in (6.2), namely
it says that the loading constant should be proportional to the coefficient of variation of S, or the corresponding confidence bounds measured in terms of standard
deviations. If we revisit Example 6.1 we obtain premia

π1 = E[S1] + α Var(S1)^{1/2} = γ/c + α √γ/c > π2 = E[S2] + α Var(S2)^{1/2} = γ/c.

For the risky position S1 we charge a premium that strictly exceeds the expected
claim and the loading is zero for the deterministic claim S2. The standard deviation
loading principle is usually better understood than the variance loading principle
because practitioners often have a good feeling for appropriate ranges of the coefficient of variation. For instance, they know that for certain lines of business
it should be around 10%. Moreover, this principle is invariant under changes of
currencies. Assume that r_fx > 0 is again the (deterministic) exchange rate between
two different currencies. Then we obtain the identity

π_fx = E[r_fx S] + α Var(r_fx S)^{1/2} = r_fx E[S] + α r_fx Var(S)^{1/2} = r_fx π.
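
This scaling behavior is easy to verify by simulation. The following sketch (Python; the gamma claim sample, the loading constants and the exchange rate are hypothetical choices for illustration only) shows that the variance loaded premium does not commute with a currency change while the standard deviation loaded premium does:

    import numpy as np

    rng = np.random.default_rng(0)
    S = rng.gamma(shape=2.0, scale=500.0, size=1_000_000)  # hypothetical claim sample
    r_fx = 1.2                                             # hypothetical exchange rate

    var_prem = lambda X, a: X.mean() + a * X.var()   # variance loading principle
    sd_prem  = lambda X, a: X.mean() + a * X.std()   # standard deviation loading principle

    # variance loading: premium of r_fx * S differs from r_fx times the premium of S
    print(var_prem(r_fx * S, 1e-6), r_fx * var_prem(S, 1e-6))
    # standard deviation loading: the two quantities agree
    print(sd_prem(r_fx * S, 0.1), r_fx * sd_prem(S, 0.1))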

The previous examples consider rather simple premium loading principles and there
are more principles of this type such as the modified variance principle. In the next
section we describe more sophisticated principles which are motivated by economic
behavior of financial agents and give risk measurement and risk management perspectives. These more advanced principles try to describe decision making and
include:

• utility theory pricing principles

• Esscher premium principle

• probability distortion pricing principles

• cost-of-capital principles based on risk measures

• deflator pricing principles
Exercise 13. We would like to insure the following car fleet:

i               v_i    λ_i    E[Y_1^{(i)}]    Vco(Y_1^{(i)})
passenger car   40     25%    2000            2.5
delivery van    30     23%    1700            2.0
truck           10     19%    4000            3.0

Assume that the car fleet can be modeled by a compound Poisson distribution.

1. Calculate the expected claim amount of the car fleet.

2. Calculate the premium for the car fleet using the variance loading principle
with α = 3 · 10⁻⁶.
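
A minimal sketch of how the two quantities can be computed (Python; the code only restates the exercise data and uses the compound Poisson moment formulas E[S_i] = λ_i v_i E[Y^{(i)}] and Var(S_i) = λ_i v_i E[(Y^{(i)})²] of Chapter 2, together with E[Y²] = E[Y]² (1 + Vco(Y)²)); the printed numbers are, of course, not an official solution:

    # fleet data: (number of vehicles v_i, frequency lambda_i, E[Y], Vco(Y))
    fleet = {
        "passenger car": (40, 0.25, 2000.0, 2.5),
        "delivery van":  (30, 0.23, 1700.0, 2.0),
        "truck":         (10, 0.19, 4000.0, 3.0),
    }
    alpha = 3e-6  # variance loading constant of the exercise

    mean = var = 0.0
    for v, lam, ey, vco in fleet.values():
        ey2 = ey ** 2 * (1.0 + vco ** 2)   # E[Y^2] from mean and coefficient of variation
        mean += v * lam * ey               # E[S_i] = lambda_i v_i E[Y^{(i)}]
        var  += v * lam * ey2              # Var(S_i) = lambda_i v_i E[(Y^{(i)})^2]

    print("expected claim amount:", mean)
    print("variance loaded premium:", mean + alpha * var)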


6.2 Advanced premium calculation principles

In this section we consider more advanced principles for the calculation of premium
loadings. These considerations can also be viewed as an introduction to economic
decision making, risk measurement and risk management.

6.2.1 Utility theory pricing principles

Utility theory aims at modeling the happiness index of financial agents making economic decisions. That is, for a financial
agent holding a position X, we try to evaluate an index that
quantifies his happiness generated by this position X.
Utility theory can be introduced in a rather general framework
using preference ordering. If this system of preference ordering
is sufficiently regular then there exists a so-called numerical representation for the preference ordering; for details we refer to the
book of Föllmer-Schied [47].

We always start from the latter and assume that there exists a
John von Neumann (1903-1957) and Oskar Morgenstern
(1902-1977) representation for the preference ordering on a given
set

X ⊂ L¹(Ω, F, P).

The set X describes the (risky) positions X ∈ X of interest.
In this set-up the X's reflect gains. Thus, we restrict ourselves to a set X of
available risky positions X and among these positions we would like to choose the
position which makes us as happy as possible. The von Neumann-Morgenstern
representation equips us with a utility function u with the following properties:

u : I → R is strictly increasing on a non-empty interval I ⊂ R, where we assume that X ∈ I,
P-a.s., for all X ∈ X.

Two examples for u are given in Figure 6.1.
In general, we are interested in risk-averse utility functions u : I → R which make
the additional assumption that u is strictly concave on I, see Figure 6.1. This
risk-averse utility function now allows us to define a preference ordering on the set
of all risky positions in X.

Definition 6.2. Assume u : I → R is strictly increasing and strictly concave on
the non-empty interval I ⊂ R (with X ∈ I, P-a.s., for all X ∈ X). Then we prefer
the position X ∈ X over the position Y ∈ X, write X ⪰ Y, if

E[u(X)] ≥ E[u(Y)].
Figure 6.1: lhs: exponential utility function with α = 0.05 and I = R, see (6.6);
rhs: power utility function with γ ∈ {0.5, 1, 1.5} and I = R₊, see (6.7).

Colloquially speaking this means that holding position X makes us at least as
happy as holding position Y, therefore we prefer position X over position Y. Thus,
Definition 6.2 introduces a preference ordering ⪰ on X. If E[u(X)] > E[u(Y)] we
strictly prefer X over Y and we write X ≻ Y; if E[u(X)] = E[u(Y)] we are
indifferent between X and Y and we write X ∼ Y.

For u ∈ C² strictly increasing and strictly concave means

u′ > 0 and u″ < 0 on I, respectively.

Strict increasing property. Strictly increasing implies that for X ≥ Y, P-a.s.,
and X > Y with positive P-probability we have

E[u(X)] > E[u(Y)],    (6.3)

i.e. we strictly prefer X over Y. In this context, X has always the interpretation of
a gain and if the gain of position X dominates the gain of position Y (in the above
sense) we have strict preference X ≻ Y. We conclude: u introduces a preference
ordering on X where positive outcomes of X ∈ X describe gains and negative
outcomes losses.
Strict concavity property. Strict concavity implies that we can apply Jensen's
inequality which provides for all X ∈ X

E[u(X)] ≤ u(E[X]),    (6.4)

and if X ∈ X is non-deterministic we even have a strict inequality in (6.4). Thus,
for non-deterministic positions, strict concavity of u implies E[X] ≻ X. The interpretation of this preference ordering is that under risk-aversion we try to avoid
uncertainties which results in the fact that we always prefer the mean value E[X]
over the corresponding random outcome X.

This latter property is exactly the argument why policyholders are willing to pay
an insurance premium that exceeds their average claim amount E[Y], and hence
finance (NPC). Assume that a policyholder has (deterministic) initial wealth c0 and
he faces a risk that may reduce his wealth by (the random amount) Y. Hence, he
holds a risky position X = c0 − Y and his happiness index of this position is given
by E[u(c0 − Y)] if u describes the (risk-averse) utility function of this policyholder.
The strict concavity and increasing properties now imply the following preference

E[u(c0 − Y)] < u(c0 − E[Y]).

The left-hand side describes the present happiness and the right-hand side describes
the happiness that he would achieve if he could exchange Y by E[Y]. Therefore,
any deterministic premium π > E[Y] such that

E[u(c0 − Y)] < u(c0 − π) < u(c0 − E[Y]),

would make him more happy than his current position c0 − Y. Thus, the strict concavity
and increasing property of u imply that he is willing to pay any premium π in
the (non-empty) interval

(E[Y], c0 − u^{-1}(E[u(c0 − Y)])),    (6.5)

to improve his happiness position. The lower bound of this interval is the (NPC)
and the upper bound is the maximal price that the policyholder will just tolerate
according to his risk-averse utility function u (this bound may also be infinite).
The less risk-averse he is the narrower the interval will get. The extreme case of
risk-neutrality, which corresponds to the linear function u(x) = x, will just provide
that the upper bound is equal to the lower bound in (6.5), and no insurance is
necessary.

The two most popular utility functions are, see also Figure 6.1:

• exponential utility function, constant absolute risk-aversion (CARA) utility
function: for α > 0 (defined on I = R)

u(x) = (1/α)(1 − exp{−αx});    (6.6)

• power utility function, constant relative risk-aversion (CRRA) utility function, isoelastic utility function: for γ > 0 (defined on I = R₊)

u(x) = x^{1−γ}/(1 − γ) for γ ≠ 1, and u(x) = log x for γ = 1.    (6.7)

Example 6.3 (exponential utility function). Assume that the policyholder has
exponential utility function (6.6), he has initial wealth c0 and he faces a risky
position Y ∈ L¹(Ω, F, P) with Var(Y) > 0 and Y ≥ 0, P-a.s. This implies that
the expected claim is given by E[Y] > 0. The exponential utility function has the
following properties

u′(x) = exp{−αx} > 0 and u″(x) = −α exp{−αx} < 0.

Therefore, it is strictly increasing and concave on R, see Figure 6.1 (lhs). Its inverse
is given by

u^{-1}(y) = −(1/α) log(1 − αy).

This implies that acceptable premia π lie in the non-empty interval, see (6.5),

(E[Y], (1/α) log E[exp{αY}]),

where the upper bound is infinite if the moment generating function of Y does not
exist in α. The important observation in this example is that the price tolerance
in π does not depend on the initial wealth c0 of the policyholder. We will see that
this property uniquely holds true for the exponential utility function, and we may
ask the question how realistic this property is in real world decision making?  ∎
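
As a numeric illustration of this interval (a sketch with hypothetical parameters, not part of the text): for Y ∼ Γ(γ, c) and α < c the upper bound equals α^{-1} log M_Y(α) = (γ/α) log(c/(c − α)):

    import math

    gam, c = 2.0, 0.01     # hypothetical gamma claim: shape gam, rate c, mean gam/c = 200
    alpha = 0.005          # risk-aversion parameter; needs alpha < c for a finite M_Y(alpha)

    mean_claim = gam / c
    upper = (gam / alpha) * math.log(c / (c - alpha))  # (1/alpha) log M_Y(alpha)
    print(mean_claim, upper)   # interval (200.0, approx 277.3) of acceptable premia

Increasing α widens this interval, and for α ≥ c the upper bound becomes infinite, in line with the remark on the moment generating function above.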
Example 6.4 (power utility function). Assume that the policyholder has power
utility function (6.7), he has initial wealth c0 > 1 and he faces a risky position Y ∼
Bernoulli(p = 1/2). This implies that the expected claim is given by E[Y] = 1/2.
The power utility function has the following properties

u′(x) = x^{−γ} > 0 and u″(x) = −γ x^{−γ−1} < 0.

Therefore, it is strictly increasing and concave on I = R₊, see Figure 6.1 (rhs).
For our example we choose γ = 1. In this case the inverse of the utility function is
given by

u^{-1}(y) = exp{y}.

We calculate the upper bound in (6.5),

c0 − u^{-1}(E[u(c0 − Y)]) = c0 − exp{E[log(c0 − Y)]}
= c0 − exp{(1/2) log(c0) + (1/2) log(c0 − 1)}
= c0 − √(c0(c0 − 1)) =: b(c0).

This implies that any possible premium π lies in the non-empty interval, see (6.5),

(1/2, c0 − √(c0(c0 − 1))).

The important observation in this example is that the price tolerance in π depends
on the initial wealth c0 > 1 of the policyholder.

The function b is defined on (1, ∞) and we have

lim_{c0 → 1} b(c0) = 1 and lim_{c0 → ∞} b(c0) = 1/2;

the second statement can be seen by applying l'Hôpital's
rule to b(c0) = c0(1 − √(1 − 1/c0)). These limits say that
if the policyholder is very poor, i.e. c0 is close to 1, he is
willing to pay almost the maximal possible claim size 1
as premium; on the other hand if he is very rich, i.e. c0
is close to ∞, he is only willing to pay for the average
claim amount E[Y] = 1/2 because basically he can do the risk bearing himself.

[Figure: the function b(c0).]
The derivative of b is given by

b′(c0) = 1 − (1/2)(2c0 − 1)/√(c0(c0 − 1)) = (√(c0² − c0) − (c0 − 1/2))/√(c0(c0 − 1)) < 0,

since c0² − c0 < c0² − c0 + 1/4 = (c0 − 1/2)².
This shows that we have strict monotonicity in the initial capital c0 > 1, i.e. the
richer the policyholder the narrower the price tolerance interval (6.5), see also
Example 6.14, below.
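
These properties of b are immediate to check numerically; a minimal sketch:

    import math

    def b(c0):
        # upper premium bound of Example 6.4 for the log-utility policyholder
        return c0 - math.sqrt(c0 * (c0 - 1.0))

    for c0 in [1.001, 1.1, 2.0, 10.0, 1000.0]:
        print(c0, b(c0))
    # the output decreases monotonically from values close to 1 towards 1/2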

Definition 6.5 (utility indifference price). The utility indifference price π =
π(u, F_S, c0) ∈ R for utility function u, initial capital c0 ∈ I and risky position
S ∼ F_S is given by the solution of (subject to existence)

u(c0) = E[u(c0 + π − S)].

Of course, π and S need to be such that c0 + π − S ∈ I, P-a.s. This may give rise
to restrictions on the range of S if I is a bounded interval, see also Example 6.4.
Note that if the utility indifference price π exists, it is unique. This follows from
the strict monotonicity of u.
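
In general there is no closed form for π, but the strict monotonicity just mentioned makes a numerical solution straightforward: the map π ↦ E[u(c0 + π − S)] is strictly increasing, so a bisection on a Monte Carlo estimate finds the unique root of u(c0) = E[u(c0 + π − S)]. A sketch under hypothetical assumptions (log-utility, i.e. (6.7) with γ = 1, and a bounded uniform claim so that c0 + π − S stays in I = R₊):

    import numpy as np

    rng = np.random.default_rng(1)
    S = rng.uniform(0.0, 5.0, size=400_000)  # hypothetical bounded claim, E[S] = 2.5
    c0 = 10.0                                # initial capital
    u = np.log                               # power utility (6.7) with gamma = 1

    def happiness(pi):
        # Monte Carlo estimate of E[u(c0 + pi - S)]; strictly increasing in pi
        return u(c0 + pi - S).mean()

    lo, hi = S.mean(), S.max()   # the indifference price lies in [E[S], max claim]
    for _ in range(60):          # bisection on the monotone map pi -> happiness(pi)
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if happiness(mid) < u(c0) else (lo, mid)

    print("indifference price:", 0.5 * (lo + hi), "  E[S]:", S.mean())

In line with Corollary 6.6 below, the computed price exceeds E[S].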

The utility indifference price given in Definition 6.5 gives the insurance company's
point of view. It is assumed that the insurance company has initial capital c0 ∈ I,
similar to the surplus process given in Definition 5.1. It will then only accept an
insurance contract S at price π if the resulting utility does not decrease, i.e. if it is
indifferent between accepting S at price π and not selling such a contract.

Jensen's inequality and the strict increasing property of u immediately provide the
following corollary.

Corollary 6.6. The utility indifference price π = π(u, F_S, c0) for initial capital c0,
risk-averse utility function u and risky position S ∼ F_S satisfies

π = π(u, F_S, c0) > E[S].

Proof. Exercise. □

Example 6.7 (exponential utility function). Assume we have initial capital c0 ∈ R,
exponential utility function (6.6) with risk-aversion parameter α > 0, and we would
like to insure a risky position S ∼ N(μ, σ²). Thus, we need to solve

(1/α)(1 − exp{−αc0}) = E[(1/α)(1 − exp{−α(c0 + π − S)})].

This is equivalent to solving

exp{απ} = E[exp{αS}] = exp{αμ + α²σ²/2}.

Therefore we obtain utility indifference price for S

π = π(u, F_S, c0) = μ + ασ²/2 > μ.
Remarks.

• We obtain an insurance premium π > μ = E[S] (Jensen's inequality) and
therefore (NPC) is fulfilled.

• The loading is of the form ασ²/2 = α Var(S)/2. That is, for the exponential
utility function we get a variance loading. This is exact for S ∼ N(μ, σ²)
and it is approximately true for other distribution functions (using a Taylor
approximation).

• The utility indifference price does not depend on the initial capital c0.  ∎
Exercise 14. Choose the exponential utility function (6.6).

• Calculate the utility indifference price for S ∼ Γ(γ, c).

• Calculate the utility indifference price for S ∼ Pareto(θ, α).

Proposition 6.8. Assume u ∈ C² is a risk-averse utility function on R. The
following two are equivalent:

• the utility indifference prices π = π(u, F_S, c0) do not depend on c0 for all S;

• the utility function u is of the form

u(x) = a − b exp{−cx},

for a ∈ R and b, c > 0.
Remark. Note that the utility function u(x) = a − b exp{−cx} gives the same
preference ordering as the exponential utility function (6.6) with c = α: if we have
two different utility functions u(·) and v(·) with v = a + bu for a ∈ R and b ∈ R₊
(positive affine transformation) then they generate the same preference ordering.

Proof of Proposition 6.8. Note that assumption u ∈ C² is not necessary because concavity
implies that u is differentiable almost everywhere, and this is sufficient to prove the result; for
details on this we refer to Lemma 1.8 in Schmidli [91].
Direction ⇐ is immediately clear just by evaluating Definition 6.5. So we prove direction ⇒.
The following proof is borrowed from Schmidli [91]. Choose S ∼ Bernoulli(p). Definition 6.5
implies for this Bernoulli claim S the identity

u(c0) = E[u(c0 + π − S)] = p u(c0 + π − 1) + (1 − p) u(c0 + π),

for utility indifference price π = π(p) = π(u, p, c0) depending on p ∈ (0, 1) only. We now consider
the derivatives w.r.t. c0 and p. The former provides

u′(c0) = [p u′(c0 + π − 1) + (1 − p) u′(c0 + π)] (∂/∂c0)(c0 + π) = p u′(c0 + π − 1) + (1 − p) u′(c0 + π),

where in the last step we have used the assumption that the premium π does not depend on
c0. The derivative w.r.t. p is given by (the implicit function theorem provides existence of the
derivative of π w.r.t. p, denoted by π′(p))

0 = u(c0 + π − 1) + p u′(c0 + π − 1) π′(p) − u(c0 + π) + (1 − p) u′(c0 + π) π′(p).

Merging the last two identities provides

u′(c0) π′(p) = u(c0 + π) − u(c0 + π − 1).    (6.8)

Strict increasing property of u implies that π′(p) > 0. Next we calculate the derivatives of (6.8)
w.r.t. c0 and p (again using the implicit function theorem for the latter). This provides the two
identities

u″(c0) π′(p) = u′(c0 + π) − u′(c0 + π − 1),

and

u′(c0) π″(p) = [u′(c0 + π) − u′(c0 + π − 1)] π′(p).

Merging these identities implies

u″(c0)/u′(c0) = π″(p)/(π′(p))² = −c < 0,

for some constant c > 0. The last identity follows because the left-hand side is independent of p
and the middle term is independent of c0. This last identity is a differential equation for utility
function u whose (unique) solution is exactly given by the exponential function. □

The proof of Proposition 6.8 provides insights into risk-aversion. Define the absolute and the relative risk-aversions of a twice differentiable utility function u
by

λ_u^{ARA}(x) = −u″(x)/u′(x) and λ_u^{RRA}(x) = −x u″(x)/u′(x).

Example 6.9 (exponential utility function). The exponential utility function (6.6)
with risk-aversion parameter α > 0 satisfies for all x ∈ R

λ^{ARA}(x) = α.

This explains the terminology constant absolute risk-aversion (CARA) utility.  ∎

Example 6.10 (power utility function). The power utility function (6.7) with
risk-aversion parameter γ > 0 satisfies for all x ∈ R₊

λ^{RRA}(x) = γ.

This explains the terminology constant relative risk-aversion (CRRA) utility.  ∎

Assume that u and v are two utility functions that are defined on the same interval
I. Then, u is more risk-averse than v on I if for any X with range I we have

u^{-1}(E[u(X)]) ≤ v^{-1}(E[v(X)]).

Proposition 6.11. Assume that u, v ∈ C²(I) are two utility functions defined on
the same interval I ⊂ R. The following are equivalent:

• u is more risk-averse than v on I;

• λ_u^{ARA}(x) ≥ λ_v^{ARA}(x) for all x ∈ I.
Proof. We first prove direction ⇒. The proof goes by contradiction. Assume that the claim
does not hold true. Due to the twice continuous differentiability property of the utility functions
on I there exists a non-empty open interval O ⊂ I such that

λ_u^{ARA}(x) = −u″(x)/u′(x) < −v″(x)/v′(x) = λ_v^{ARA}(x)    for all x ∈ O.

We consider the function u(v^{-1}(·)) on the non-empty open interval v(O) (note that v is continuous
and strictly increasing). We calculate

(d/dz) u(v^{-1}(z)) = u′(v^{-1}(z)) (d/dz) v^{-1}(z) = u′(v^{-1}(z))/v′(v^{-1}(z)) > 0,

because both u and v are strictly increasing, and

(d²/dz²) u(v^{-1}(z)) = u″(v^{-1}(z))/(v′(v^{-1}(z)))² − u′(v^{-1}(z)) v″(v^{-1}(z))/(v′(v^{-1}(z)))³
= (u′(v^{-1}(z))/(v′(v^{-1}(z)))²) (λ_v^{ARA}(v^{-1}(z)) − λ_u^{ARA}(v^{-1}(z))) > 0    for all z ∈ v(O).

This implies that u(v^{-1}(·)) is a risk-seeking (convex) utility function on the non-empty interval
v(O). Choose a non-deterministic random variable Y such that Y ∈ O, P-a.s. Since O is a
non-empty open interval such a random variable can be chosen (i.e. no concentration in a single
point). This implies that Z = v(Y) is a non-deterministic random variable with range in v(O)
and the strict convexity of u(v^{-1}(·)) on v(O) implies, using Jensen's inequality,

u^{-1}(E[u(Y)]) = u^{-1}(E[u(v^{-1}(v(Y)))]) > u^{-1}(u(v^{-1}(E[v(Y)]))) = v^{-1}(E[v(Y)]).    (6.9)

This is a contradiction and proves direction ⇒.

For direction ⇐ we consider the function u(v^{-1}(·)) on v(I). This is a strictly increasing
function because u and v are utility functions, see above. Moreover, we have

(d²/dz²) u(v^{-1}(z)) = (u′(v^{-1}(z))/(v′(v^{-1}(z)))²) (λ_v^{ARA}(v^{-1}(z)) − λ_u^{ARA}(v^{-1}(z))) ≤ 0    for all z ∈ v(I).

The proof then follows similar to (6.9) using Jensen's inequality. □

The above result has a nice interpretation.

Corollary 6.12. Assume u is more risk-averse than v. We have for the utility
indifference prices

π(u, F_S, c0) ≥ π(v, F_S, c0).

Proof. We have the following:

c0 = u^{-1}(E[u(c0 + π(u, F_S, c0) − S)]) ≤ v^{-1}(E[v(c0 + π(u, F_S, c0) − S)]).

Since both v^{-1} and v are strictly increasing we see that π(u, F_S, c0) ≥ π(v, F_S, c0). □

The last corollary also explains that the price elasticity interval (6.5) becomes more
narrow for decreasing risk-aversion.
Theorem 6.13. Assume u ∈ C³(I) is a risk-averse utility function on I. The
following are equivalent:

• π(u, F_S, c0) is decreasing in c0 for all S;

• λ_u^{ARA}(x) is decreasing for all x ∈ I.

Proof of Theorem 6.13. We start with direction ⇒. Calculating the derivative of the defining
equation u(c0) = E[u(c0 + π(u, F_S, c0) − S)] w.r.t. c0, and using that π is decreasing in c0, we obtain

u′(c0) = E[u′(c0 + π(u, F_S, c0) − S)] (∂/∂c0)(c0 + π(c0)) ≤ E[u′(c0 + π(u, F_S, c0) − S)].

Observe that v = −u′ is a utility function on I. For this v the previous inequality reads as
v(c0) ≥ E[v(c0 + π(u, F_S, c0) − S)], and hence

u^{-1}(E[u(c0 + π(u, F_S, c0) − S)]) = c0 ≥ v^{-1}(E[v(c0 + π(u, F_S, c0) − S)]).

Since this holds for any c0 and S we obtain that v is more risk-averse than u, and Proposition
6.11 implies that λ_v^{ARA}(x) ≥ λ_u^{ARA}(x) for all x ∈ I. From this we obtain

−u‴/u″ ≥ −u″/u′, i.e. u‴/u″ ≤ u″/u′ ≤ 0,

and thus for all x ∈ I

(d/dx) λ_u^{ARA}(x) = −(d/dx)(u″/u′)(x) = −u‴/u′ + (u″)²/(u′)² = −(u″/u′)(u‴/u″ − u″/u′) ≤ 0.

This proves the first direction of the equivalence. The proof of direction ⇐ is obtained by just
reading the above proof in the other direction (all the statements are equivalences). □

Example 6.14 (power utility function). The power utility function (6.7) with
risk-aversion parameter γ > 0 satisfies for all x ∈ R₊

λ^{ARA}(x) = γ x^{-1}.

This is a strictly decreasing function in x ∈ R₊. Therefore the utility indifference
price π(u, F_S, c0) becomes a decreasing function in c0, see Theorem 6.13. This is the
property that economists consider to be reasonable for financial decision making.
This was already explored in Example 6.4.  ∎

Exercise 15. Choose the exponential utility function (6.6).

• Assume Y1, ..., Yn i.i.d. ∼ Γ(γ, c). Calculate the utility indifference price for
Σ_{i=1}^n Yi.

• Assume S ∼ CompPoi(λv = n, G = Γ(γ, c)). Calculate the utility indifference
price for S.

• Compare the two results of the previous items.

• What can be said about diversification benefits?
Exercise 16. Choose the car fleet example from Exercise 13. Assume
that this car fleet can be modeled by an appropriate compound Poisson distribution
having gamma claim sizes.

1. Calculate the expected claim amount of the car fleet.

2. Calculate the premium for the car fleet using the utility indifference price
principle for the exponential utility function with parameter α = 1.5 · 10⁻⁶.

3. Compare Exercises 13 and 16. What happens if we replace the compound
Poisson distribution by a Gaussian distribution with the same first two moments?

6.2.2 Esscher premium

Choose a random variable S ∼ F with finite first moment given by

E[S] = ∫_R s dF(s).

For utility indifference pricing we have modified the payments s by introducing a
happiness index u(c0 + π − s), see Definition 6.5. The Esscher premium takes a
different approach: instead of acting on the payments s it aims at modifying the
probability distribution F of S.

NL

no

Classical actuarial practice calculates premium loadings by giving more weight to bad events compared to good events. Basically, this means that one does a change of measure towards a
less favorable probability measure. Hans Bhlmann [19] introduces this idea in the actuarial literature by constructing the
Esscher measure.
Define for > 0 the Esscher (probability) distribution F of F
H. Bhlmann
as follows:
Z s
1
ex dF (x),
F (s) =
MS ()
under the additional assumption that the moment generating function MS () of S
exists in . Note that this defines a (normalized) distribution function F .
Definition 6.15 (Esscher premium). Choose S ∼ F and assume that there exists
r0 > 0 such that M_S(r) < ∞ for all r ∈ (−r0, r0). The Esscher premium of S
in α ∈ (0, r0) is defined by

π_α = E_α[S] = ∫_R s dF_α(s).

Corollary 6.16. Under the assumptions in Definition 6.15 we have

π_α = (d/dr) log M_S(r)|_{r=α} ≥ E[S],

where the inequality is strict for non-deterministic S.

Proof. Note that Lemma 1.1 implies for α ∈ (0, r0)

π_α = (1/M_S(α)) ∫_R s e^{αs} dF(s) = M_S′(α)/M_S(α) = (d/dr) log M_S(r)|_{r=α}.

The claim then follows from Lemma 1.6. □

Example 6.17 (Esscher premium for Gaussian distributions). Choose α > 0 and
assume that S ∼ N(μ, σ²). Then we have

π_α = (d/dr) log M_S(r)|_{r=α} = μ + ασ² > μ = E[S].

In the Gaussian case we obtain the variance loading. Thus, the variance loading,
the exponential utility function and the Esscher premium principles provide exactly
the same insurance premium in the Gaussian case.  ∎

Exercise 17 (Esscher premium for gamma distributions). Assume that S ∼ Γ(γ, c)
with γ, c > 0. Calculate the Esscher premium π_α of S for α ∈ (0, c).

Conclusions.

• The Esscher premium can easily be calculated from the moment generating
function M_S(r).

• The Esscher premium can only be calculated for light tailed claims, see also
Section 5.2 on the Lundberg coefficient. Towards all more heavy tailed claims
the Esscher premium reacts so sensitively that it becomes infinite. In the next
section we study probability distortion principles that allow for more heavy
tailed distributions in premium calculations still leading to finite premia.

• In classical economic theory, prices are often derived by the assumption of
market clearing in a risk exchange economy. That is, if we assume that
we have (i) an economy with risky positions S1, ..., SK; (ii) market participants who have an exponential utility function with risk aversion parameters
αi > 0; and (iii) market clearing in the sense that all risky positions are allocated to the market participants, then one can prove that the risky positions
are exactly priced with the Esscher measure of the aggregate market capitalization. This is in the spirit of Bühlmann [19] and is, for instance, found in
Tsanakas-Christofides [96].
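
The first conclusion can be illustrated with a few lines of code: given log M_S(r), the Esscher premium is just its derivative at α (Corollary 6.16). A sketch for a gamma claim (hypothetical parameter values; for S ∼ Γ(γ, c) one has log M_S(r) = γ log(c/(c − r)) for r < c, so the numerical derivative can be checked against γ/(c − α)):

    import math

    gam, c, alpha = 2.0, 0.01, 0.004   # hypothetical gamma claim and Esscher parameter

    def log_mgf(r):
        # log M_S(r) for S ~ Gamma(gam, c); only finite for r < c (light tails!)
        return gam * math.log(c / (c - r))

    h = 1e-7
    esscher = (log_mgf(alpha + h) - log_mgf(alpha - h)) / (2.0 * h)  # central difference
    print(esscher, gam / (c - alpha), gam / c)  # premium, closed form, E[S] = 200

The premium γ/(c − α) explodes as α ↑ c, a small-scale illustration of the sensitivity to heavy tails mentioned in the second conclusion.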
6.2.3 Probability distortion pricing principles
In the previous section we have met a pricing principle that was based on probability
distortions. In this first case it was only possible to calculate insurance prices for
light tailed claims because the distortion reacted very sensitively to heavy tails. In
the present section we look at probability distortions from a different angle which
will allow for more flexibility. Assume that S ∼ F with S ≥ 0, P-a.s. Then using
integration by parts the expected claim is calculated as

E[S] = ∫_0^∞ x dF(x) = ∫_0^∞ P[S > x] dx.

In this section we directly distort the survival function F̄(x) = P[S > x]. Therefore,
we introduce a distortion function h : [0, 1] → [0, 1] which is a continuous, increasing
and concave function with h(0) = 0 and h(1) = 1; in Figure 6.2 we give two
examples.

Figure 6.2: Distortion functions h of Examples 6.19 and 6.20, below, with γ = 1/2
and q = 0.1, respectively.

• h(p) distorts the probability p with the property that h(p) ≥ p for all p ∈ [0, 1]
because h is increasing and concave with h(0) = 0 and h(1) = 1.

• The concavity of h reflects risk aversion, similar to the utility functions used
in Section 6.2.1.

• Note that the existence of p ∈ (0, 1) with h(p) > p implies that h(p) > p for
all p ∈ (0, 1). Therefore, we assume under strict risk-aversion that h(p) > p
for all p ∈ (0, 1).
Definition 6.18. Assume that h : [0, 1] → [0, 1] is a continuous, increasing and
concave function with h(0) = 0, h(1) = 1 and h(p) > p for all p ∈ (0, 1). The
probability distorted price π_h of S ≥ 0 is defined by (subject to existence)

π_h = E_h[S] = ∫_0^∞ h(P[S > x]) dx.

We obtain a risk loading that provides

E[S] = ∫_0^∞ P[S > x] dx ≤ ∫_0^∞ h(P[S > x]) dx = E_h[S] = π_h,

where the inequality is strict for non-deterministic S.


Remarks.

• Similar to the Esscher premium we modify the probability distribution function of the claims S (in contrast to the utility theory approach where we
modify the claim sizes).

• The probability distortion approach is a technique to
construct coherent risk measures for bounded random
variables. For a detailed outline we refer to Freddy
Delbaen [32], in particular, to the corresponding Example 4.7 and Corollary 7.6 which relates convex games
to coherent risk measures.

• This probability distortion approach is similar to life
insurance pricing where one constructs first order life
tables out of second order life tables (expected mortality
rates) in order to have a security and profit margin, see also Denneberg [34].
Example 6.19 (probability distortion for Pareto distribution). Choose claim S ∼
Pareto(θ, α) with α > 1 and θ > 0, and probability distortion function, see Example
4.5 in Delbaen [32] and Figure 6.2,

h(p) = p^γ    for γ ∈ (0, 1).    (6.10)

The probability distorted price of S is given by

π_h = ∫_0^∞ h(P[S > x]) dx = ∫_0^θ 1 dx + ∫_θ^∞ ((θ/x)^α)^γ dx = ∫_0^θ 1 dx + ∫_θ^∞ (θ/x)^{γα} dx = ∫_0^∞ P[S̃ > x] dx,

where S̃ ∼ Pareto(θ, γα). This immediately implies

π_h = γαθ/(γα − 1) > αθ/(α − 1) = E[S]    for γ ∈ (1/α, 1).

In contrast to the Esscher premium we can calculate the probability distorted
premium also for heavy tailed claims as long as the risk aversion (concavity of h)
is not too large, i.e. in our case γ ∈ (1/α, 1).  ∎
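
The closed form above can be double-checked by numerically integrating the distorted survival function; a sketch with hypothetical parameter values:

    import numpy as np

    theta, a, g = 1000.0, 2.0, 0.7   # threshold, tail parameter, distortion exponent
    assert g > 1.0 / a               # finiteness condition of Example 6.19

    def surv(x):
        # survival function of Pareto(theta, a): 1 on [0, theta], (theta/x)^a beyond
        return (theta / np.maximum(x, theta)) ** a

    # fine grid below theta, geometric grid for the heavy tail
    x = np.concatenate([np.linspace(0.0, theta, 1001)[:-1],
                        np.geomspace(theta, 1e12, 200_001)])
    y = surv(x) ** g                                           # h(P[S > x]) = P[S > x]^g
    pi_h = float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))  # trapezoidal rule

    print(pi_h)                    # approx g*a*theta/(g*a - 1) = 3500
    print(a * theta / (a - 1.0))   # E[S] = 2000 < pi_h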

Exercise 18. Choose the power distortion function (6.10). Calculate the probability
distorted price of S ∼ Γ(1, c) and of S ∼ Bernoulli(p).

Example 6.20 (expected shortfall). Choose distortion function h : [0, 1] → [0, 1]
as follows, see Remark 7.7 in Delbaen [32] and Figure 6.2: fix q ∈ (0, 1) and define

h(x) = x/q for x ≤ q, and h(x) = 1 otherwise.    (6.11)

Choose S ∼ F with S ≥ 0, P-a.s. The left-continuous generalized inverse of F for
α ∈ (0, 1) is given by, see Chapter 1,

F^{←}(α) = inf{x ∈ R; F(x) ≥ α}.

For simplicity we assume that F is continuous and strictly increasing. This simplifies considerations because then also F^{←} is continuous and strictly increasing and
we have F^{←}(F(x)) = x and F(F^{←}(α)) = α, see Chapter 1 (the strictly increasing
property of F would not be necessary for getting the full flavor of this example).
Consider the survival function of S given by F̄(x) = 1 − F(x) = P[S > x]. Note
that under our assumptions

{x < F^{←}(1 − q)} = {F(x) < 1 − q} = {F̄(x) > q}.

This identity implies

π_h = ∫_0^∞ h(P[S > x]) dx = ∫_{{F̄(x) > q}} 1 dx + (1/q) ∫_{{F̄(x) ≤ q}} F̄(x) dx
= ∫_{{x < F^{←}(1−q)}} 1 dx + (1/q) ∫_{{x ≥ F^{←}(1−q)}} F̄(x) dx
= (1/q) ∫_{F^{←}(1−q)}^∞ P[S > x] dx + F^{←}(1 − q).

Note that these identities need more care if F is not strictly increasing. The
continuity and strictly increasing property of F also imply

P[S ≥ F^{←}(1 − q)] = 1 − P[S < F^{←}(1 − q)] = 1 − F(F^{←}(1 − q)) = q.


This provides, using continuity and the strictly increasing property of F,

π_h = (1/P[S ≥ F^{←}(1 − q)]) ∫_{F^{←}(1−q)}^∞ P[S > x] dx + F^{←}(1 − q)
= ∫_{F^{←}(1−q)}^∞ P[S > x | S ≥ F^{←}(1 − q)] dx + F^{←}(1 − q)
= ∫_0^∞ P[S > x | S ≥ F^{←}(1 − q)] dx = E[S | S ≥ F^{←}(1 − q)].

The latter is exactly the so-called Tail-Value-at-Risk (TVaR) or the conditional tail
expectation (CTE) of the random variable S at the 1 − q security level. Moreover,
F^{←}(1 − q) is the Value-at-Risk (VaR) of the random variable S at the 1 − q security
level. The continuity of F implies that this TVaR is equal to the expected shortfall
(ES) of S at the security level 1 − q, that is,

π_h = E[S | S ≥ F^{←}(1 − q)] = (1/q) ∫_{1−q}^1 F^{←}(u) du = ES_{1−q}(S),

see Artzner et al. [5, 6], Acerbi-Tasche [1] and Lemma 2.16 in McNeil et al. [77].
The proof again uses the fact that for continuous distribution functions F we have
F(F^{←}(α)) = α and then the left-hand side of the above statement can be obtained
by a change of variables from the right-hand side.

We conclude that under continuity assumptions the risk measure ES_{1−q}(S) can be
obtained via probability distortion (6.11), and following Delbaen [32], it is therefore
a coherent risk measure, see also the next section.  ∎
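
Both representations of the expected shortfall can be compared numerically. A sketch for a unit exponential claim (a hypothetical choice; here F̄(x) = e^{−x}, and both routes should reproduce ES_{1−q}(S) = 1 − log q):

    import numpy as np

    q = 0.01
    rng = np.random.default_rng(2)
    S = rng.exponential(1.0, size=2_000_000)   # unit exponential claims

    # (i) tail average beyond the empirical VaR
    var_q = np.quantile(S, 1.0 - q)
    es_tail = S[S >= var_q].mean()

    # (ii) distortion integral int_0^infty h(P[S > x]) dx with h from (6.11)
    x = np.linspace(0.0, 50.0, 500_001)
    y = np.minimum(np.exp(-x) / q, 1.0)        # h(F_bar(x)) for the exponential tail
    es_dist = float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

    print(es_tail, es_dist, 1.0 - np.log(q))   # all three close to 5.605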
Exercise 19. Choose probability distortion (6.11) for q = 1% and calculate the
probability distorted price for

• S ∼ LN(μ, σ²),

• S ∼ Pareto(θ, α) with α > 1,

• Sn = Σ_{i=1}^n Yi with Yi i.i.d. ∼ Γ(1, 1), and study the diversification benefit of the
probability distorted price of Sn as a function of n ∈ N.

6.2.4 Cost-of-capital principles using risk measures

Denote by X ⊂ L¹(Ω, F, P) the set of (risky) positions X of interest; importantly
for this section: X denotes losses. This is different from the utility theory set-up of Section 6.2.1, where the positions described gains!

A risk measure ϱ on X is a mapping

ϱ : X → R with X ↦ ϱ(X).

Remarks.
• A risk measure ϱ attaches to each (risky) position X a value ϱ(X) ∈ R.

• If the risk measure ϱ is the regulatory risk measure then ϱ(X) ∈ R reflects the
necessary risk bearing capital that needs to be available within the insurance
company to run business X. This is the minimal equity the insurance company needs to hold to balance possible shortfalls in the insurance portfolio.
This is going to be explained in more detail below.

• By a change of sign in X we can observe the similarities to the expected
utility framework of Section 6.2.1.

• For having a good risk measure one requires additional properties for ϱ
such as monotonicity, coherence, etc. This is described below.

• The most commonly used risk measures are: variance, Value-at-Risk (VaR),
expected shortfall (ES), already met in Example 6.20. We further discuss
them below.

Assume a (regulatory) risk measure ϱ : X → R with X ↦ ϱ(X) is given. We would
like to price an insurance portfolio S under the assumption X = S − E[S] ∈ X.
That is, we study the possible losses beyond the best-estimate prediction E[S] of
S. The regulatory capital requirement then prescribes that the insurance company
needs to hold at least risk bearing capital ϱ(S − E[S]). This risk bearing capital
ϱ(S − E[S]) quantifies the necessary financial strength of the insurance company
so that it is able to finance shortfalls beyond the pure risk premium E[S] exactly
up to the amount ϱ(S − E[S]).

We assume ϱ(S − E[S]) > 0 for non-deterministic positions S. Then the insurance
company needs to find shareholders who are willing to provide this risk bearing
capital ϱ(S − E[S]) > 0. The shareholders will provide this capital as soon as the
promised expected return on this (invested) capital is sufficiently high. We call the
expected rate of return on this shareholder capital the cost-of-capital rate r_CoC > 0.
Thus, the shareholders'/investors' expected return is

r_CoC · ϱ(S − E[S]) > 0,

on their investment ϱ(S − E[S]) > 0.

Definition 6.21. The cost-of-capital pricing principle is given by

π_CoC = E[S] + r_CoC · ϱ(S − E[S]).

Interpretation.

• For outcomes S ≤ E[S]: the claim can be financed by the pure risk premium
E[S], solely.

• For outcomes S > E[S]: the pure risk premium E[S] is not sufficient and the
shortfall S − E[S] > 0 needs to be paid from ϱ(S − E[S]). Thus, the investor's
capital ϱ(S − E[S]) is at risk, and he may lose (part of) it. Therefore, he will
ask for a cost-of-capital rate

r_CoC > r0,

if r0 denotes the risk-free rate (he receives on a risk-free bank account with
the same time to maturity as his investment).
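
A minimal sketch of Definition 6.21 in code, with the (coherent) expected shortfall of the examples below as risk measure ϱ and a simulated log-normal portfolio (all parameter choices are hypothetical):

    import numpy as np

    rng = np.random.default_rng(3)
    S = rng.lognormal(mean=7.0, sigma=1.0, size=1_000_000)  # hypothetical portfolio
    r_coc, q = 0.06, 0.01                                   # cost-of-capital rate, tail level

    var_q = np.quantile(S, 1.0 - q)            # empirical VaR_{1-q}(S)
    rho = S[S >= var_q].mean() - S.mean()      # empirical ES_{1-q}(S - E[S])

    pi_coc = S.mean() + r_coc * rho            # cost-of-capital pricing principle
    print(S.mean(), rho, pi_coc)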
We give desired properties of risk measures. For details we refer to Artzner et
al. [5, 6], McNeil et al. [77] and Föllmer-Schied [47]. The first assumption is that
X is a convex cone containing R, i.e. it satisfies

(1) c ∈ X for all c ∈ R,

(2) X + Y ∈ X for all X, Y ∈ X, and

(3) λX ∈ X for all X ∈ X and λ > 0.

Then we state the following axioms for risk measures ϱ on X.
Then we state the following axioms for risk measures % on X .

Axioms 6.22 (axioms for risk measures ϱ). Assume ϱ is a risk measure on the
convex cone X containing R. Then we define for X, Y ∈ X, c ∈ R and λ > 0:

(a) normalization: ϱ(0) = 0;

(b) monotonicity: for X, Y with X ≤ Y, P-a.s., we have ϱ(X) ≤ ϱ(Y);

(c) translation invariance: for all X and every c we have ϱ(X + c) = ϱ(X) + c;

(d) positive homogeneity: for all X and for every λ > 0 we have ϱ(λX) = λϱ(X);

(e) subadditivity: for all X, Y we have ϱ(X + Y) ≤ ϱ(X) + ϱ(Y).

Observe that some of the axioms imply others, e.g. positive homogeneity implies
normalization: since ϱ(0) = ϱ(λ0) = λϱ(0) for all λ > 0, this immediately says
ϱ(0) = 0. For a detailed analysis of such implications we refer to Section 6.1 in
McNeil et al. [77] and Section 9.1 in Wüthrich-Merz [101].

For our analysis we require (at least) normalization (a) and translation invariance
(c). We briefly comment on this.

Translation invariance. If we hold a risky position X and if we inject capital
c > 0 then the loss is reduced to X − c. This implies for risk measure ϱ that the
reduced position satisfies

ϱ(X − c) = ϱ(X) − c.

This justifies the definition of the regulatory risk measure as stated above. Namely,
if we sell a risky portfolio S and we collect pure risk premium E[S] then the risk
of the residual loss S − E[S] is given by

ϱ(S − E[S]) = ϱ(S) − E[S].

Normalization and translation invariance. A balance sheet of an insurance
company is called acceptable if its (future) surplus C1 ∈ X satisfies ϱ(−C1) ≤ 0, see
also Wüthrich [98]. Assume that the insurance company sells a policy S at price
π ≥ E[S] and at the same time it has initial capital c0 = ϱ(S − E[S]) ≥ 0. Then
the future surplus of the company is given by C1 = c0 + π − S. The regulator then
checks the acceptability condition which reads as

ϱ(−C1) = ϱ(−(c0 + π − S)) = −c0 − π + ϱ(S) = −π + E[S] ≤ 0.    (6.12)

Thus, we have an acceptable position. Coming back to the cost-of-capital pricing
principle given in Definition 6.21 this needs to be interpreted as follows: assume
that the initial capital c0 > 0 is provided by an investor who expects cost-of-capital rate r_CoC > r0 on his investment. Then, the insurance company also needs
to finance the cost-of-capital cash flow r_CoC c0 = r_CoC ϱ(S − E[S]) to the investor.
This can exactly be done with the cost-of-capital premium π_CoC and the insurance
company keeps its acceptable position in (6.12) if r_CoC c0 is also considered as a
liability of the insurance company.

Monotonicity and normalization imply that more risky positions are charged
with higher capital requirements and, in particular, if we have only downside risks,
i.e. X ≥ 0, P-a.s., then we will have positive capital charges ϱ(X) ≥ ϱ(0) = 0.

Definition 6.23 (coherent risk measure). The risk measure ϱ is called coherent if
it satisfies Axioms 6.22.

Coherent risk measures were introduced by Artzner et al. [5, 6]
and the properties of coherent risk measures are often regarded
as useful in practice. In particular, the subadditivity property
means that if we merge two portfolios we expect diversification
benefits in the sense of a release of necessary risk bearing capital.

We close this section with a discussion of the three most popular
risk measures.
Example 6.24. The standard deviation risk measure is for S
with finite second moment given by

ϱ(S) = α σ(S) = α Var(S)^{1/2},

for a given parameter α > 0. This risk measure is normalized, positive homogeneous, and subadditive. But it is neither translation invariant nor monotone. Note
that for the standard deviation risk measure the cost-of-capital pricing principle
coincides with the standard deviation loading principle presented in Section 6.1.  ∎
Example 6.25 (Value-at-Risk, VaR). The VaR of S ∼ F at security level 1 − q ∈
(0, 1) is given by the left-continuous generalized inverse of F at 1 − q, i.e.

ϱ(S) = VaR_{1−q}(S) = F^{←}(1 − q).

The VaR is normalized, monotone, translation invariant and positive homogeneous,
but it is not subadditive, and hence not coherent. There are many examples in the
literature showing this non-coherence, see, for instance, Artzner et al. [5, 6], McNeil
et al. [77] and Embrechts et al. [40].  ∎


Example 6.26 (expected shortfall). The expected shortfall has already been introduced in Example 6.20, where we have stated that the expected shortfall is equal
to the TVaR for continuous distribution functions F. Instead of introducing it via
probability distortion functions we can also directly define it. Assume that S ∼ F
with F continuous. Then we have

ϱ(S) = TVaR_{1−q}(S) = E[S | S ≥ VaR_{1−q}(S)] = (1/q) ∫_{1−q}^1 VaR_u(S) du = ES_{1−q}(S).

ES_{1−q}(S) is a coherent risk measure on L¹(Ω, F, P). The cost-of-capital pricing
principle is then given by

π = E[S] + r_CoC ES_{1−q}(S − E[S]) = E[S] + r_CoC (ES_{1−q}(S) − E[S]).

This cost-of-capital pricing principle can also be obtained with probability distortion functions: choose h as in Example 6.20 and define the distortion function
h̃ : [0, 1] → [0, 1] as follows

h̃(x) = (1 − r_CoC) x + r_CoC h(x),

for fixed r_CoC ∈ (0, 1), see Figure 6.3. For a non-negative random variable S ≥
0 with continuous (and strictly increasing) distribution function we obtain, see
Example 6.20,

π_{h̃} = ∫_0^∞ h̃(P[S > x]) dx = (1 − r_CoC) ∫_0^∞ P[S > x] dx + r_CoC ∫_0^∞ h(P[S > x]) dx
= (1 − r_CoC) E[S] + r_CoC ES_{1−q}(S)
= E[S] + r_CoC ES_{1−q}(S − E[S]),

which proves the claim.  ∎

Remarks.

• Solvency II considers VaR_{1−q}(S − E[S]) for 1 − q = 99.5% as the regulatory
risk measure.

• The Swiss Solvency Test considers ES_{1−q}(S − E[S]) for 1 − q = 99% as the
regulatory risk measure.

• For r_CoC one often sets 6% above the risk-free rate. However, this is a heavily
debated number because in stress periods this rate should probably be higher.

Figure 6.3: Distortion functions h of Example 6.20 (expected shortfall) and the
corresponding h̃ for the expected shortfall cost-of-capital loading.

Exercise 20. Assume that S ∼ N(μ, σ²) has a Gaussian distribution. Choose
1 − q = 99% and r_CoC = 6%. The cost-of-capital pricing principle for the expected
shortfall risk measure gives price

π = μ + r_CoC σ (1/q) (1/√(2π)) exp{−(Φ^{-1}(1 − q))²/2}.

(a) Prove this statement.

(b) Calibrate the security level for the VaR risk measure such that the cost-of-capital insurance price is the same as for the expected shortfall risk measure.

(c) Calibrate the standard deviation risk measure loading parameter α > 0 such
that the price is the same as for the expected shortfall risk measure.

Remark. This parameter calibration only holds true under the Gaussian model
assumption.
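
Without spoiling part (a), the closed form can at least be sanity-checked by simulation; a sketch with hypothetical Gaussian parameters (NormalDist from Python's standard library supplies Φ^{-1}):

    import math
    import numpy as np
    from statistics import NormalDist

    mu, sigma, q, r_coc = 1000.0, 200.0, 0.01, 0.06   # hypothetical Gaussian claim
    z = NormalDist().inv_cdf(1.0 - q)                 # Phi^{-1}(1 - q)
    loading = sigma / q * math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
    pi_closed = mu + r_coc * loading                  # the stated price

    rng = np.random.default_rng(4)
    S = rng.normal(mu, sigma, size=2_000_000)
    es = S[S >= np.quantile(S, 1.0 - q)].mean() - mu  # empirical ES_{1-q}(S - E[S])
    print(pi_closed, mu + r_coc * es)                 # the two prices nearly agree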


6.2.5 Deflator based pricing principles

Up to now we have completely neglected that cash flows also
have time values, i.e., in general, future cash flows need to be
discounted for valuation purposes. In a financial mathematics setting insurance cash flow valuation can be considered
as a pricing problem in an incomplete financial market setting. The pricing in such a financial market setting can be
done either by risk neutral measures or, equivalently, by using
(state price) deflators. This provides pricing systems that are free of arbitrage, also
known as the Fundamental Theorem of Asset Pricing, see Delbaen-Schachermayer
[33]. Deflators were introduced in the actuarial literature by Bühlmann [20, 21, 22]
and heavily used in Wüthrich et al. [99] and Wüthrich-Merz [101]. The terminology
deflator was introduced by James Darrell Duffie [37].

Assume that φ is an integrable and strictly positive random variable with

E[φ] = d0 = 1/(1 + r0) ∈ (0, 1].

Then, d0 can be seen as deterministic discount factor and r0 ≥ 0 can be seen
as deterministic risk-free rate. This is the general version of a deflator φ. To
make deflator pricing comparable to the previously introduced pricing principles
we assume that d0 = 1, i.e. no time values are added to cash flows.

Fix φ ∈ L¹(Ω, F, P) strictly positive with d0 = 1 and assume that φ and S are
positively correlated. Then we can define the deflator based price by (subject to
existence)

π^{(0)} = E[φS] ≥ E[φ]E[S] = E[S].

We use the upper index in π^{(0)} to indicate that we set d0 = 1.


Thus, all random variables S which are positively correlated with receive a positive premium loading. The next example shows that this is a generalization of the
Esscher premium, or more generally, it can be understood as a probability distortion principle because allows to define the equivalent probability measure P by
the Radon-Nikodym derivative as follows
dP
= ,
dP

because is a strictly positive density w.r.t. P for d0 = 1. Then, we price S under


the equivalent probability measure P by
(0) = E[S] = E [S].
Example 6.27 (Esscher premium). Choose a random variable S and define φ =
M_S(α)^{-1} exp{αS} for given α > 0 with M_S(α) < ∞. It follows that φ is strictly
positive, P-a.s., and normalized. That is, φ is a deflator with d0 = 1. Due to
the FKG inequality, see Fortuin et al. [48], it follows that φ and S are positively
correlated, and thus

π^{(0)} = E[φS] ≥ E[S].

Observe the identity

π^{(0)} = E[φS] = (1/M_S(α)) E[e^{αS} S] = π_α,

which is exactly the Esscher premium π_α, and P* is the Esscher measure corresponding to F, see Section 6.2.2.  ∎


(m

The previous example shows that the deflator approach is a


generalization of the Esscher premium. The crucial point is
that and S are positively correlated so that we obtain a
positive premium loading. Moreover, this deflator approach
also allows for stochastic discounting by choosing a deflator
with E[] (0, 1), and generalizations to multiperiod problems
are easily possible and straightforward. For more details we
refer to Wthrich-Merz [101] and Wthrich et al. [99].

Example 6.28 (cost-of-capital loading with expected shortfall). This example treats the expected shortfall risk measure. Assume S ∼ F
with continuous distribution function F. The VaR on security level 1 − q ∈ (0, 1)
is then given by VaR_{1−q}(S) = F^{←}(1 − q), see Example 6.25. Note again that
F(F^{←}(1 − q)) = 1 − q, see Chapter 1. Choose r_CoC ∈ (0, 1) and define the probability distortion

φ = (1 − r_CoC) + (r_CoC/q) 1_{{S ≥ VaR_{1−q}(S)}} > 0,    P-a.s.

This choice and the continuity of F imply

E[φ] = (1 − r_CoC) + (r_CoC/q) P[S ≥ VaR_{1−q}(S)] = (1 − r_CoC) + (r_CoC/q) q = 1,

that is, we obtain the required normalization. The premium is then given by

π^{(0)} = E[((1 − r_CoC) + (r_CoC/q) 1_{{S ≥ VaR_{1−q}(S)}}) S]
= (1 − r_CoC) E[S] + (r_CoC/q) E[1_{{S ≥ VaR_{1−q}(S)}} S]
= (1 − r_CoC) E[S] + r_CoC E[S | S ≥ VaR_{1−q}(S)]
= E[S] + r_CoC (E[S | S ≥ VaR_{1−q}(S)] − E[S])
= E[S] + r_CoC ES_{1−q}(S − E[S]).

We conclude that we exactly obtain the cost-of-capital loading principle with expected shortfall as risk measure, see Example 6.26.  ∎
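
The chain of identities above can also be verified by simulation: weight every scenario with the deflator φ and compare E[φS] with the cost-of-capital premium of Example 6.26. A sketch with a hypothetical log-normal claim:

    import numpy as np

    rng = np.random.default_rng(5)
    S = rng.lognormal(mean=7.0, sigma=1.0, size=2_000_000)  # hypothetical claim
    r_coc, q = 0.06, 0.01

    var_q = np.quantile(S, 1.0 - q)                    # empirical VaR_{1-q}(S)
    phi = (1.0 - r_coc) + (r_coc / q) * (S >= var_q)   # deflator of Example 6.28
    print(phi.mean())                                  # normalization: E[phi] approx 1

    pi_deflator = (phi * S).mean()                     # pi^(0) = E[phi S]
    es = S[S >= var_q].mean() - S.mean()               # ES_{1-q}(S - E[S])
    print(pi_deflator, S.mean() + r_coc * es)          # agree up to Monte Carlo error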

Chapter 7

Tariffication and Generalized Linear Models

Assume we have v ∈ N insurance policies denoted by l = 1, ..., v. These insurance
policies should be sufficiently similar such that we obtain a homogeneous insurance
portfolio to which the law of large numbers (LLN) applies, see (1.1). The ideal case
of i.i.d. risks justifies to charge the same premium to every policy. If there is no
perfect homogeneity (and there never is) then there are two different possibilities
of charging a premium: (a) everyone pays the same premium which reflects more
the aspect of social insurance, where one tries to achieve a balance between the
rich and the poor; (b) the individual premium should reflect the quality of the
specific insurance policy, i.e. we try to calculate risk adjusted premia. In the present
chapter we try to achieve (b). We explain this with the compound Poisson model at
hand. The aggregation and the disjoint decomposition properties of the compound
Poisson model S ∼ CompPoi(λv, G), see Theorems 2.12 and 2.14, suggest the
consideration of the following decomposition

S = Σ_{i=1}^N Yi = Σ_{l=1}^v Σ_{i=1}^{N^{(l)}} Y_i^{(l)} = Σ_{l=1}^v S_l,

where S_l = Σ_{i=1}^{N^{(l)}} Y_i^{(l)} describes the total claim amount of policy l = 1, ..., v. This
decoupling provides independent compound Poisson distributions S_l. That is, we
have S_l ∼ CompPoi(λ_l, G_l), where we set volume v_l = 1, λ_l > 0 is the expected
number of claims of policy l and Y_i^{(l)} ∼ G_l describes the claim size distribution of
policy l. This provides the following decomposition of the mean

E[S] = Σ_{l=1}^v E[S_l] = Σ_{l=1}^v λ_l E[Y_1^{(l)}] = μ Σ_{l=1}^v (λ_l E[Y_1^{(l)}])/μ = μ Σ_{l=1}^v χ^{(l)},

where μ = E[S]/v = λ E[Y_1] is the average claim over all policies and χ^{(l)} > 0 reflects
the contribution of policy l = 1, ..., v. This means that in the case of heterogeneity
we should determine these risk characteristics χ^{(l)} for every policy l to obtain
risk adjusted premia because these risk characteristics χ^{(l)} describe the differences
between the policies. This would require to model v different parameters. To avoid
over-parametrization and to have sufficient volume(s) for a LLN one chooses a fixed
(finite) number, say K, of tariff criteria (like age, type of car, kilometers yearly
driven, place of living, etc.) such that the total portfolio is divided into sufficiently
homogeneous sub-portfolios (risk classes, risk cells). These tariff criteria play the
role of covariates in regression theory.

Then we try to modify the overall average claim μ = E[S]/v = λ E[Y_1] to these risk
classes such that their prices become a function of the risk characteristics in the
K tariff criteria. This way we may substantially reduce the number of parameters
and estimation can be done.

For this exposition we assume that we only have two tariff criteria (covariates), i.e. K =
2, and we would like to set up a multiplicative tariff structure.
The generalization to K > 2 is then straightforward.

Assume we have K = 2 tariff criteria. The first criterion (covariate) has I risk
characteristics i ∈ {1, ..., I} and the second criterion (covariate) has J risk characteristics j ∈ {1, ..., J}. Thus, we have M = I · J different risk classes (risk cells),
see Table 7.1 for an illustration.

Table 7.1: K = 2 tariff criteria with I and J risk characteristics, respectively; the
cells of the I × J array are the risk classes (i, j).


We assume that policy l belongs to risk class (i, j), and we assume for the corresponding risk characteristics χ^{(l)} = χ^{(i,j)}. This provides decomposition

E[S] = μ Σ_{i,j} v_{i,j} χ^{(i,j)},

where v_{i,j} denotes the number of policies belonging to risk class (i, j) and χ^{(i,j)}
describes the quality of that risk class. Our aim is to set up


a multiplicative tariff structure for these K = 2 tariff criteria, i.e. we assume

χ^{(i,j)} = χ_{1,i} χ_{2,j},    (7.1)

where χ_{k,l_k} describes the specifics of criterion k if it has risk characteristics l_k.
In particular, this means that a multiplicative tariff structure (which is the model
assumption here) may reflect the quality of each risk class (i, j).

w)

Example 7.1 (multiplicative tariff). A classical example in car insurance is the


following: choose as tariff criteria the kilometers yearly driven and the years
driven without an accident.

(m

 1st tariff criterion 1,i : kilometers yearly driven


 2nd tariff criterion 2,j : years driven without an accident (bonus-malus level)

yearly km

2,
1,

0-10000
10-15000
15-20000
20-25000
25000+

0.8
0.9
1.0
1.1
1.2

0 years
1.2

1 year
1.1

2 years
1.0

3 years
0.9

4 years
0.8

5 years
0.7

6+ years
0.5

no

no accident

tes

Observe that the 1st tariff criterion is continuous, but typically it is discretized for
having finitely many risk characteristics, see Table 7.2 for an example.

(4,5) = 1,4 2,5 = 1.1 0.8 = 0.88

NL

Table 7.2: Tariffication scheme for K = 2 tariff criteria.

We have K = 2 tariff criteria. Criterion k = 1 has I = 5 risk characteristics and


criterion k = 2 has J = 7 risk characteristics. This gives M = I J = 35 risk
classes (i, j) for i {1, . . . , I} and j {1, . . . , J}.
The general aim is to determine tariff criteria such that they give sufficiently large
homogeneous risk classes. These risk classes are then priced by choosing appropriate multiplicative pricing factors k,lk (under the assumption that a multiplicative
tariff structure (7.1) fits the problem).

Remarks.
A prior choice of tariff criteria should be done using expert opinion. Statistical analysis should then select as few as possible significant criteria. However,
Version April 14, 2016, M.V. Wthrich, ETH Zurich

170

Chapter 7. Tariffication and Generalized Linear Models


also market specifications of competitors are important to avoid adverse selection.

Related to the first item: the aim should be to build homogeneous risk classes
of sufficient volume such that a LLN applies and we get statistical significance.

w)

Variable reduction techniques and multivariate statistical analysis need to be


applied to avoid an over-correction of dependent factors, e.g. in the rather
trivial example above, the relation between the factors is not immediately
clear: it could be that kilometers yearly driven is strongly related to years
driven without an accident. If this is the case we might correct twice for the
same factor.

(m

We consider a bivariate model using simple methods for categorical risk


classes and will then go over to more sophisticated models using generalized
linear model (GLM) techniques.

i,j

tes

Assume we have two tariff criteria (covariates) i and j which give M = I J


risk classes. Our aim is to find appropriate multiplicative pricing factors 1,i , i
{1, . . . , I}, and 2,j , j {1, . . . , J}, which describe the risk classes (i, j) according
to the multiplicative tariff structure (7.1).
We define by Si,j the total claim of risk class (i, j) and by vi,j the corresponding
volume with
X
X
vi,j = v
and
Si,j = S.
i,j

no

This implies that we need to study


E[Si,j ] = vi,j

E[S] (i,j)

= vi,j 1,i 2,j ,


v

(7.2)

7.1

NL

where = E[Y1 ] is the average claim per policy over the whole portfolio v,
i.e. E[S] = v, and (i,j) = 1,i 2,j describes the multiplicative tariff structure
for two tariff criteria.

Simple tariffication methods

L.J. Simon (right)

We start with the method of Robert A. Bailey & LeRoy


J. Simon [10] which was introduced in 1960 for rate-making.
The method of Bailey & Simon is rather simple and it is
not directly motivated by a stochastic model which considers the claim Si,j of risk class (i, j) in a consistent way.
It specifies parameters , 1,i and 2,j > 0 such that the
following expression is minimized

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 7. Tariffication and Generalized Linear Models

X2 =

X
i,j

171

(Si,j vi,j 1,i 2,j )2


.
vi,j 1,i 2,j

w)

The motivation behind this approach is that X 2 describes the


test statistics of the 2 -goodness-of-fit test, see (3.9). This test
rejects a model if X 2 exceeds the quantile of a 2 -distributed
random variable on a certain significance level. Therefore, the
aim is to choose the parameters such that X 2 becomes as small
as possible.
Note that this approach is not based on a stochastic model it
is just based on a statistical argument. Moreover, it has the
following unpleasant feature.

(7.3)

R.A. Bailey

(m

Lemma 7.2. The minimizers of (7.3) have a (systematic) positive bias.

Proof. We denote the minimizers of (7.3) by


b,
b1,i and
b2,j . We would like to prove that
X
X
vi,j
b
b1,i
b2,j
Si,j = S.
i,j

i,j

tes

This can either be done by first summing over rows i or columns j. Note that
b2,j is found by
the solution of
X (Si,j vi,j 1,i 2,j )2
X 2
=
.
2,j
2,j
vi,j 1,i 2,j
i

This provides estimates

no

b2,j =

S 2 /(vi,j
b
b1,i )
Pi,j
b
b1,i
i vi,j

!1/2
.

If we sum over i and plug in the estimates


b2,j we obtain
X

vi,j
b
b1,i
b2,j

vi,j
b
b1,i

!1/2
X
i

2
Si,j
vi,j
b
b1,i

!1/2
.

NL

Next we apply the Schwarz inequality to the terms on the right-hand side which provides the
following lower bound

!1/2
2
X
X
X
S
i,j
1/2
=
vi,j
b
b1,i
b2,j
(vi,j
b
b1,i )
Si,j .
vi,j
b
b1,i
i
i
i
2

This proves the claim.

Example 7.3 (method of Bailey & Simon). We choose an example with two tariff
criteria. The first one specifies whether the car is owned or leased, the second
one specifies the age of the driver. For simplicity we set vi,j 1 and we aim to
determine the tariff factors , 1,i and 2,j . The method of Bailey & Simon then
requires minimization of
2

X =

X
i,j

(Si,j 1,i 2,j )2


.
1,i 2,j

Version April 14, 2016, M.V. Wthrich, ETH Zurich

172

Chapter 7. Tariffication and Generalized Linear Models

Note that we need to initialize the estimators for obtaining a unique solution. We
set b = 1 and b1,1 = 1. The observations Si,j are given by, see also Figure 7.1,
21-30y
owned 1300
leased 1800

31-40y
1200
1300

41-50y
1000
1300

51-60y
1200
1500

w)

L leased
O owned

(m

1600

1400

claim amount

1800

2000

scatter plot

1200

tes

1000

2130y

3140y

4150y

5160y

age classes

no

Figure 7.1: Observations Si,j .

NL

We have M = I J = 2 4 = 8 risk classes (i, j) and observations Si,j . The number


def.
of parameters to be estimated are r + 1 = I + J 1 = 5 (taking into account
the initialization b = b1,1 = 1). Minimizing X 2 numerically provides the following
multiplicative tariff structure for 1,i , i {1, 2}, and 2,j , j {1, . . . , 4}.
21-30y
owned 1376
leased 1727
b2,j
1376

31-40y
1112
1395
1112

41-50y
1020
1280
1020

51-60y
b1,i
1197 1.0000
1503 1.2548
1197

In this example we have (systematic) positive bias as stated in Lemma 7.2, i.e.
X
i,j

b1,i b2,j = 100 611 > 100 600 =

Si,j = S.

i,j


Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 7. Tariffication and Generalized Linear Models

J
X

vi,j 1,i 2,j =

j=1

j=1

vi,j 1,i 2,j =

I
X

Si,j ,

(7.4)

Si,j .

(7.5)

(m

I
X

J
X

i=1

J. Jung

w)

The method of Robert A. Bailey & Jan Jung (1922-2005)


[9, 63] intends to improve the weakness of the positive bias of the
previous method, see Lemma 7.2 and Example 7.3. But it is still
a simple method that is not directly motivated by a stochastic
model. However, we will see below that it has its groundings in a
stochastic model. It imposes unbiasedness of rows and columns
by definition: Choose , 1,i and 2,j > 0 such that the rows i
and columns j satisfy

173

i=1

Remarks.

tes

This method is also called method of total marginal sums.


It is more robust than the method of Bailey & Simon.

no

If Si,j are independent Poisson distributed with cross-classified means, then


the above system is exactly the MLE system that needs to be solved. We
will discuss this in Section 7.3.1 below.

NL

Both the method of Bailey & Simon and the method of Bailey & Jung are
rather pragmatic methods because they are not directly based on a stochastic
model. Therefore, in the remainder of this chapter we are going to describe
more sophisticated methods which are motivated by a probabilistic model.
Example 7.4 (method of Bailey & Jung, method of total marginal sums). We
revisit the data of Example 7.3. This time we determine the parameters by solving
the system (7.4)-(7.5). This needs to be done numerically and provides the following
multiplicative tariff structure:
21-30y
owned 1375
leased 1725
b2,j
1375

31-40y
1108
1392
1108

41-50y
1020
1280
1020

51-60y
b1,i
1197 1.0000
1503 1.2553
1197

We conclude that both methods give similar results for this example.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

174

Chapter 7. Tariffication and Generalized Linear Models

7.2
7.2.1

Gaussian approximation
Maximum likelihood estimation

Ri,j = Si,j /vi,j .

w)

In the previous section we have presented two pragmatic tariffication methods. In


this section we give a more advanced method, in the sense that we use an explicit
stochastic model. However, the approach is still pragmatic because the stochastic
model is assumed to be a good approximation to the true tariffication problem.
We consider the claims ratio in risk class (i, j) defined by

The expected value of this claim ratio is given by, see (7.2),

(m

E[Ri,j ] = 1,i 2,j .


We use two simple facts:

1. The simplest absolutely continuous distribution is the Gaussian one.


2. Taking logarithms turns products into sums.

def.

tes

Combining this two items implies that we plan to consider the following model


Xi,j = log Ri,j N 0 + 1,i + 2,j , 2 .

no

Thus, taking logarithms may turn the multiplicative tariff structure into an additive
structure. If this logarithm Xi,j of Ri,j has a Gaussian distribution we have nice
mathematical properties. Therefore, we assume a log-normal distribution for Ri,j
which hopefully gives a good approximation to the true tariffication problem. These
choices imply for the first two moments
2 /2

e1,i e2,j

and

Var(Ri,j ) = E[Ri,j ]2 (e 1).

NL

E[Ri,j ] = e0 +

Observe that the mean has the right multiplicative structure, set = e0 + /2 ,
1,i = e1,i and 2,j = e2,j . However, the distributional properties are rather
different from compound models, and the underlying volumes vi,j are also not
considered in an appropriate way. Nevertheless, this log-linear additive Gaussian
structure is often used because of its nice mathematical structure and because
popular statistical methods can be applied.
Set M = I J and define for Xi,j = log Ri,j = log(Si,j /vi,j ) the vector
X = (X1 , . . . , XM )0 = (X1,1 , . . . , X1,J , . . . , XI,1 , . . . , XI,J )0 RM .

(7.6)

Note that we change the labeling of the observations because this is going to be
simpler in the sequel. Index m always refers to
m = m(i, j) = (i 1)J + j {1, . . . , M = I J}.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

(7.7)

Chapter 7. Tariffication and Generalized Linear Models

175

We assume that X has a multivariate Gaussian distribution


X N (Z, ) ,

(7.8)

with diagonal covariance matrix = 2 diag(w1 , . . . , wM ), parameter vector


= (0 , 1,2 , . . . , 1,I , 2,2 , . . . , 2,J )0 Rr+1 ,
set r + 1 = I + J 1, and design matrix Z RM (r+1) such that for m = m(i, j)

w)

E[Xi,j ] = (Z)m = 0 + 1,i + 2,j .

Xi,j

Var(Ri,j ) = Var(e

(m

Throughout we assume that Z has full rank. We initialize 1,1 = 2,1 = 0 and
0 plays the role of the intercept. At the moment the weights wm do not have a
1
natural meaning, often one sets wm = vi,j
(inversely proportional to the underlying
volume) because in this case one has
2 /vi.j

) = E[Ri,j ] (e

2
E[Ri,j ]2 ,
1)
vi,j

leased
0
0
0
0
1
1
1
1

21-30y
1
0
0
0
1
0
0
0

31-40y
0
1
0
0
0
1
0
0

no

owned
1
1
1
1
0
0
0
0

NL

1
2

tes

for vi,j large. Thus, the variance of the claims ratio Ri,j is roughly inversely proportional to the underlying volume vi,j . In view of Example 7.3 this gives the following
table where the 1s show to which class the observations belong to:
41-50y
0
0
1
0
0
0
1
0

51-60y
0
0
0
1
0
0
0
1

X = log R
7.17
7.09
6.91
7.09
7.50
7.17
7.17
7.31

This table needs to be turned into the appropriate form so that it fits to (7.8).
Therefore we need to drop the columns owned and 21-30y because of the
chosen normalization 1,1 = 2,1 = 0. This provides the following table:

Z =

intercept
1
1
1
1
1
1
1
1

leased
0
0
0
0
1
1
1
1

31-40y
0
1
0
0
0
1
0
0

41-50y
0
0
1
0
0
0
1
0

51-60y
0
0
0
1
0
0
0
1

Version April 14, 2016, M.V. Wthrich, ETH Zurich

0
1,2
2,2
2,3
2,4

176

Chapter 7. Tariffication and Generalized Linear Models

Under assumption (7.8) we know that X has density


f (x) =

(2)M/2 ||1/2

exp

1
(x Z)0 1 (x Z) .
2


b MLE of the parameter vector :


This allows for the calculation of the MLE
MLE

= Z 0 1 Z

1

Z 0 1 X.

(7.9)

w)

The tariff factors can then be estimated by (avoiding the variance correction term
which is appropriate for 2 wm /2  0 )
o

MLE
b1,i = exp b1,i

and

MLE
b2,j = exp b2,j
.

(m

b = exp b0MLE ,

If we have homoscedasticity, i.e. if we assume identical weights wm w and =


b MLE = (Z 0 Z)1 Z 0 X.
2 w1, then the estimator of is given by

tes

Example 7.5 (log-linear model). We use the data Si,j from Example 7.3. Assume
1
wm = vi,j
1 and initialize b = 1 and b1,1 = 1. The log-linear MLE formula (7.9)
provides the following multiplicative tariff structure:
31-40y
1117
1396
1117

no

21-30y
owned 1368
leased 1710
b2,j
1368

41-50y
1020
1274
1020

51-60y
b1,i
1200 1.0000
1500 1.2495
1200

NL

We compare the results from the method of Bailey & Simon, the method of total
marginal sums (Bailey & Jung) and the log-linear MLE method.

We see that in this example all three methods provide similar results.
Observe: the risk class (owned, 21-30y) is punished by the bad performance
of (leased, 21-30y) and vice verse. A similar remark holds true for risk class
(leased, 31-40y).


Remarks.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 7. Tariffication and Generalized Linear Models

177

The multiplicative tariff construction above has used the design matrix Z =
(zm,k )m,k RM (r+1) which was generated by categorical variables. Categorical variables allow to group observations into disjoint risk categories.

w)

Binary variables are a special case of categorical variables that can only have
two specifications, 1 for true and 0 for false. Recall that all our zm,k {0, 1}.
E.g., the observation Si,j either belongs to the class owned or to the class
leased.

tes

(m

Often the linear regression model X = Z + with


N (0, ) is introduced for continuous variables (zm,k )m,k
which generate the design matrix Z. E.g. if there is a
(clear) functional relationship between age and tariff criterion 2 , for instance if 2 is a linear function of age,
then variable zm,k R+ modeling age is directly reflecting this relationship (linear regression). For more on this
subject we refer to Frees [49]. For the present discussion
we concentrate on binary variables, also because often it is difficult to find a
clear functional relationship, see also example in Section 7.3.4, below.

7.2.2

NL

no

A serious drawback of the log-linear model is that we need to have observations in all risk classes because otherwise Xi,j = log(Si,j /vi,j ) is not welldefined. In practice, it may happen that one has a risk class with positive
volume vi,j > 0 but there is no claim in that risk class. This results in Si,j = 0.
In this case one should use the more sophisticated models presented below,
see for instance Section 7.3.4 for a claims count example. Moreover, volumes
vi,j should be large in order to have the right relationship for the resulting
variances of the claims ratios.

Goodness-of-fit analysis

Compared to the methods in the previous section, the log-linear MLE formula (7.9)
has the advantage that we can apply classical statistical methods for a goodness-offit test and for variable selection/reduction techniques. We introduce this statistical
language. For this discussion we assume homoscedasticity, i.e. identical weights
wm = 1

and

= 2 1,

MLE

b
which simplifies the MLE to
= (Z 0 Z)1 Z 0 X. The general case is treated in
the next section. We introduce the total sum of squares (the first and last equalities
are definitions)

Version April 14, 2016, M.V. Wthrich, ETH Zurich

178

Chapter 7. Tariffication and Generalized Linear Models

SStot =

X

Xm X

2

X

c X
X
m

2

X

c
Xm X
m

2

= SSreg + SSerr ,

(7.10)
with X =

1
M

PM

m=1

Xm and

c
X

b MLE .
Z

SStot is the total difference between observations Xm and the sample mean
X without knowing the explaining variables Z.

w)

SSreg is the difference explained by the explaining variables Z.


SSerr is the residual difference not explained by the regression.

(m

Proof of (7.10). We rewrite the total sums of squares SStot in vector notation. Therefore we
define
b MLE = X X
c
= X (1, . . . , 1)0 .
b
= X Z
and
X
(7.11)
We calculate
X 0X

c+b
c+b
cX
c + 2X
cb
+b
0 b
.
(X
)0 (X
) = X

b MLE minimizes in the homoscedastic case (X Z)0 (X Z) and thus we have


The MLE
MLE

tes

b
0 = Z 0 (X Z
and as a consequence

) = Z 0b
,

(7.12)

0
b MLE )0 b
cb
X
= (Z
= 0.

no

This implies

X 0X

cX
c+b
X
0 b
.

0X
to obtain
We subtract on both sides X

0X
=b
cX
cX
0X
= SSerr + SSreg ,
SStot = X 0 X X
0 b
+X

NL

where for the last step we need to observe that the intercept 0 is contained in every row of the
design matrix Z, therefore the first column in Z is equal to (1, . . . , 1)0 . This and (7.12) imply
0
P
P b
cc 0
0 = (1, . . . , 1)0 b
=
Xm X
m . This treats the cross-product terms leading to X X X X =
SSreg . This proves (7.10).
2

We define and consider the coefficient of determination R2 given by


R2 =

SSerr
SSreg
=1
[0, 1].
SStot
SStot

This is the ratio of explaining variables SSreg and the total sum of squares SStot . If
the model explains well the structure in the observations then R2 should be close
c is able to explain the underlying structure.
to 1, because X

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 7. Tariffication and Generalized Linear Models

179

For Example 7.5 we obtain R2 = 0.9202 which is in favor for this model explaining
the data Si,j .
Residual standard deviation : For further analysis we also need the residual
standard deviation . It is estimated (in the homoscedastic case) by

b0 b
1 X
c 2 = = SSerr ,
Xm X
m
M m
M
M

w)

b 2 =

(m

where b was defined in (7.11). Set r = I +J 2, i.e. the dimension of parameter is r +1. b 2 is the MLE for 2 and
M b 2 is distributed as 2 2M r1 , see, for instance, Section
7.4 in Johnson-Wichern [62]. Often, one also considers the
M
unbiased variance parameter estimator sb2 = M r1
b 2 .

tes

Revisiting Example 7.5, we have M = 8 observations,


r + 1 = 5 parameters and hence df = M r 1 = 3
degrees of freedom. In our case we obtain sb = 0.07447.

Likelihood ratio test: Finally, we would like to see whether we need to include
a specific parameter k,lk .

no

We have a r + 1 = I + J 1 dimensional parameter vector given by


= (0 , 1,2 , . . . , 1,I , 2,2 , . . . , 2,J )0 Rr+1 .

NL

Note that the model is, of course, invariant under permutation of parameters and
components. Therefore, we can choose any specific ordering and to simplify notation we define
= (0 , 1 , . . . , r )0 Rr+1 ,

(7.13)

so that we have the ordering of components that is appropriate for the next layout.
Null hypothesis H0 : 0 = . . . = p1 = 0 for given p < r + 1.
b in the full model with r + 1
1. Calculate the residual differences SSfull
err and
r+1
dimensional parameter vector R .
0
r+1p
0
2. Calculate residual differences SSH
.
err in the reduced model (p , . . . , r ) R

We calculate the likelihood ratio . Therefore, we denote the design matrix of the
Version April 14, 2016, M.V. Wthrich, ETH Zurich

180

Chapter 7. Tariffication and Generalized Linear Models

reduced model by Z0 . Then it is given by




fbH0 (X)
=
fbfull (X)

SSerr0
M
SSfull
err
M

bH0
bfull

M/2

M

exp

2b12
H

(X Z0

b MLE )0 (X

H0

Z0


MLE
b

)
H0

MLE

MLE

b
b
0
exp 2b12 (X Z
full ) (X Z full )
full
0
SSH
err
SSfull
err

!M/2

SSH0 SSfull
1 + err full err
SSerr

!M/2

. (7.14)

w)

The likelihood ratio test rejects the null hypothesis H0 for small values of . This
full
full
0
is equivalent to rejection for large values of (SSH
err SSerr )/SSerr .
This motivates to consider the test statistics

(m

full
full
0
0
SSH
SSH
err SSerr M r 1
err SSerr
F=
=
.
p
p sb2full
SSfull
err

(7.15)

tes

F has an F -distribution with degrees of freedom given by df 1 = p and df 2 =


M r 1, see Result 7.6 in Johnson-Wichern [62] or (4.2) in Frees [49]. Therefore,
we reject the null hypothesis H0 on the significance level 1 if

F > Fp,M
r1 (),

(7.16)

no

where the latter denotes the quantile of the F -distribution with degrees of freedom df 1 and df 2 . The heteroscedastic case is given in (7.22), below.
Example 7.6 (regression model, revisited). We revisit Example 7.5.
In Figure 7.2 we give the R output of the command lm.
The lines Call give the MLE problem to be solved.

NL

The lines Residuals display b .


The lines Coefficients give the MLEs for the parameters 0 (intercept),
1,2 (leased) and 2,2 , . . . , 2,4 . For these parameters a standard estimation
error is calculated and a t-test is applied to each parameter individually,
whether they are different from zero, see formula (7.14) in Johnson-Wichern
[62]. From this analysis we see that we might only question 2,4 because of
the large p-value of 0.1675, the other parameters are well justified by the
observations.
The bottom lines then display the residual standard error sb = 0.07447 on
df = 3 degrees of freedom, the coefficient of determination R2 = 0.9202, the
adjusted coefficient of determination Ra2 corrects for the degrees of freedom
Ra2 = 1

SSerr M 1
.
SStot M r 1

Version April 14, 2016, M.V. Wthrich, ETH Zurich

181

(m

w)

Chapter 7. Tariffication and Generalized Linear Models

Figure 7.2: R output of Example 7.5 using R command lm.

tes

The final line displays an F test statistics (7.15) of value 8.653 for df1 = 4
and df2 = 3 for dropping all variables except of the intercept 0 . This gives
a p-value of 5.36% which says that the null hypothesis is just about to be
rejected on the 5% significance level and we stay with the full model.

no

For the reduction of the variable owned or leased. We obtain an F test


statistics of 18.36 for df1 = 1 and df2 = 3. This gives a p-value of 2.34%
which says that we reject the null hypothesis of setting 1,2 = 0 on the 5%
significance level.

NL

In the reduced model 1,2 = 0 we obtain an F test statistics of 1.071 for


df1 = 3 and df2 = 3 for dropping all remaining variables variables 2,2 =
. . . = 2,4 = 0. This gives a p-value of 45.52% which says that we cannot
reject this null hypothesis on the 5% significance level.
We conclude that we need the variable to distinguish between owned and leased.
The classification in age classes 21-30y, . . ., 51-60y can be discussed. This
discussion will also depend on whether we want such a tariffication criterion and
whether our competitors consider similar variables.

Exercise 21. Provide design matrix Z for the pricing problem specified by the
following risk class specification (assuming a multiplicative tariff structure).
passenger car
delivery van
truck

21-30y
2000
2200
2500

31-40y
1800
1600
2000

41-50y
1500
1400
1700

51-60y
1600
1400
1600

Version April 14, 2016, M.V. Wthrich, ETH Zurich

182

Chapter 7. Tariffication and Generalized Linear Models

Calculate a tariff using the different tariffication methods introduced above.

7.3

Generalized linear models

(1)

E[Si,j ] = E[Ni,j ] E[Yi,j ],

w)

In the previous section we have taken a log-normal approximation for the total claim
amounts Si,j in risk classes (i, j). Taking logarithms has then led to a multiplicative
structure in a natural way. In the present section we express the expected claim of
risk class (i, j) as expected number of claims times the average claim, i.e.

(l)

(l)

We now analyze Ni,j and Yi,j separately.

(m

where Ni,j describes the number of claims in risk class (i, j) and Yi,j the corresponding i.i.d. claim sizes for l = 1, . . . , Ni,j in risk class (i, j). Note that we suppose a
compound distribution for this decoupling.

tes

Definition 7.7 (exponential dispersion family). X fX belongs to the exponential


dispersion family if fX is of the form
(

x b()
fX (x; , ) = exp
+ c(x, , w) ,
/w

w>0
>0

no

write X EDF(, , w, b()), where

is the dispersion parameter,


is the (unknown) parameter of the distribution,
is an open set of possible parameters ,

NL

is a given weight,

b:R
c(, , )

is the cumulant function,

is the normalization, not depending on .

fX can either be a density in the absolutely continuous sense, it can be probability


weights in the discrete case or it can be a mixture thereof. Moreover, depending
on the choice of the cumulant function b() and of the possible parameters the
support of X may need to be restricted to subsets of R.
Lemma 7.8. Choose a fixed cumulant function b() and assume that the exponential dispersion family EDF(, , w, b()) gives well-defined densities with identical
supports for all parameters in an open (non-empty) set . Assume that
for any there exists a neighborhood of zero such that the moment generating
function MX (r) of X EDF(, , w, b()) is finite in this neighborhood of zero (for
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 7. Tariffication and Generalized Linear Models

183

r). Then we have for all and r sufficiently close to zero


)

b( + r/w) b()
.
MX (r) = exp
/w

w)

Proof. Choose and r in the neighborhood of zero such that MX (r) exists. Then we have


Z
x b()
rx
MX (r) =
e exp
+ c(x, , w) dx
/w


Z
x( + r/w) b()
=
exp
+ c(x, , w) dx
/w


Z

b( + r/w) b()
x( + r/w) b( + r/w)
= exp
exp
+ c(x, , w) dx.
/w
/w

(m

We have assumed that is an open set. Therefore, for any we have that r = +r/w
for r sufficiently close to zero. Therefore, the last integral is the density that corresponds to
EDF(r , , w, b()) and since this is a well-defined density with identical support for all r
this last integral is equal to 1. This proves the claim.
2

Corollary 7.9. We make the same assumptions as in Lemma 7.8 and in addition
we assume that b C 2 in the interior of . Then we have
and

Var(X) =

tes

E[X] = b0 ()

00
b ().
w

and


d2

M
(r)
X

dr2
r=0

no

Proof. In view of (1.3) we only need to calculate the first and second derivatives at zero of the
moment generating function. We have from Lemma 7.8






b( + r/w) b() 0
d

= exp
= b0 (),
MX (r)
b ( + r/w)
dr
/w
r=0
r=0




00
0
2
(b ( + r/w)) + b ( + r/w)
w
r=0

NL

b( + r/w) b()
exp
/w

(b0 ())2 + b00 ().


w

This proves the claim.

Example 7.10 (exponential dispersion family). In Chapters 2 and 3 we have


met several examples that belong to the exponential dispersion family. We revisit
these examples and explain how they fit into the exponential dispersion family
framework. These considerations also lead to an explicit explanation of the weight
w > 0. We start with the discrete case assuming X fX .
Binomial distribution: Choose = R, b() = log(1 + e ), = 1 and w = v.
In this case we obtain for x {0, 1/v, 2/v, . . . , 1}
fX (x; , 1)
exp{c(x, 1, v)}

=
=

 

 

exp v x log(1 + e )
= exp v x log e log(1 + e )




e
1
exp vx log
exp
v(1

x)
log
= pvx (1 p)vvx ,
1 + e
1 + e

Version April 14, 2016, M.V. Wthrich, ETH Zurich

184

Chapter 7. Tariffication and Generalized Linear Models


for p = e /(1 + e ) (0, 1). The first two moments are obtained by
E[X] = b0 () =

e
= p,
1 + e

and
Var(X) =

1 00
1 e
1
1
b () =
= p(1 p).

v
v 1+e 1+e
v

w)

From this we see that N = vX Binom(v, p).


Poisson distribution: Choose = R, b() = exp{}, = 1 and w = v. In
this case we obtain for x N0 /v

(m

n 
o
fX (x; , 1)
= exp v x e
= vx ev ,
exp{c(x, 1, v)}

for = e > 0. The first two moments are obtained by


E[X] = b0 () = e =

Var(X) =

1 00
1
1
b () = e = .
v
v
v

tes

and

From this we see that N = vX Poi(v).

In the absolutely continuous case we have the following examples.

no

Gaussian distribution: Choose = R and b() = 2 /2. In this case we have


for x R
(

fX (x; , )
x 2 /2
= exp
exp {c(x, , w)}
/w

1 2 2x
= exp
,
2 /w

NL

which is the Gaussian density with mean and variance /w.


Gamma distribution: Choose = R+ and b() = log(). In this case
we have for x R+
(

fX (x; , )
x + log()
= exp
exp {c(x, , w)}
/w

= ()

w/

w
exp
x ,

this is a gamma density with shape parameter = w/ > 0 and scale


parameter c = w/ = > 0. The first two moments are obtained by
E[X] = b0 () = 1/ =

and

Var(X) =

= 2.
2
w
c

For more examples we refer to Table 13.8 in Frees [49] on page 379.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 7. Tariffication and Generalized Linear Models

185

These examples show that several popular distribution functions belong to the
exponential dispersion family. In the present notes we concentrate on the Poisson
and the gamma distributions for pricing the two components number of claims
and claims severities. However, the theory holds true in more generality. Our aim
is to consider compound Poisson models and to express the expected claim of risk
class (i, j) as expected number of claims times the average claim, i.e.
(1)

E[Si,j ] = E[Ni,j ] E[Yi,j ],


(l)

(m

w)

where Ni,j describes the number of claims in risk class (i, j) and Yi,j the corresponding i.i.d. claim sizes for l = 1, . . . , Ni,j in risk class (i, j). We then aim for
calculating a multiplicative tariff which considers risk characteristics s for both
the number of claims and the claims severities of risk class (i, j).

7.3.1

tes

We assume that Ni,j are independent with Ni,j Poi(i,j vi,j ) and vi,j counting the
number of policies in risk class (i, j). Under these assumptions we derive a multiplicative tariff structure for the characteristics of the expected claims frequency
i,j . For the claim sizes we will do a similar construction by making a gamma distributional assumption. Since the latter is slightly more involved than the former
we start with the Poisson case.

GLM for Poisson claims counts

no

We assume that Ni,j are independent with Ni,j Poi(i,j vi,j ) where vi,j denotes the
number of policies in risk class (i, j). In view of the exponential dispersion family
we make the following Ansatz for the expected claims frequency, see Example 7.10,
"

i,j

Ni,j
=E
= b0 (i,j ) = exp{i,j } = exp{(Z)m },
vi,j

(7.17)

NL

where in the last step we assume having a multiplicative tariff structure which
provides an additive structure on the log-scale reflected by the linear term Z.
The index m = m(i, j) was defined in (7.7), matrix Z RM (r+1) denotes the
design matrix and Rr+1 is the parameter vector. Thus, we assume that
Xi,j = Ni,j /vi,j N0 /vi,j are independent with
Xi,j EDF(i,j = (Z)m , = 1, vi,j , b() = exp{}).

Our aim is to estimate the parameter vector Rr+1 . Identity (7.17) immediately explains that the natural link function g in this problem (between mean
and parameter) is the so-called log-link function g() = log(), because this turns
the multiplicative tariff structure into an additive form. The joint log-likelihood
function of X RM
+ is given by (we use independence here)
`X ()

X
m

X Xm (Z)m exp{(Z)m }
Xm m exp{m }
=
,
1/vm
1/vm
m

Version April 14, 2016, M.V. Wthrich, ETH Zurich

186

Chapter 7. Tariffication and Generalized Linear Models

where we have applied the relabeling of the components of X and vi,j such that
b MLE for is found by
they fit to the design matrix Z, see also (7.6). The MLE
the solution of

`X () = 0.
(7.18)

We calculate the partial derivatives of the log-likelihood function

w)

X Xm exp{m } m

X Xm m exp{m }
`X () =
=
l
l m
1/vm
1/vm
l
m
X Xm exp{(Z)m } (Z)m
X Xm exp{(Z)m }
=
=
zm,l ,
1/vm
l
1/vm
m
m

(m

where Z = (zm,l )m,l RM (r+1) . If we define the weight matrix V = diag(v1 , . . . , vM )


then we have just proved the following proposition:
Proposition 7.11. The solution to the MLE problem (7.18) in the Poisson case
is given by the solution of
Z 0 V exp{Z} = Z 0 V X.

b
Z 0 1 Z

MLE

tes

Remarks. One should observe the similarities between the Gaussian case (7.9)
and the Poisson case of Proposition 7.11 given by, respectively,
= Z 0 1 X

and

b
Z 0 V exp{Z

MLE

} = Z 0 V X.

no

The Gaussian case is solved analytically (assuming full rank of Z), the Poisson case
can only be solved numerically, due to the presence of the exponential function.
The Poisson case can be rewritten as
Z 0 V exp{Z MLE } Z 0 N = 0.

7.3.2

NL

Observe that the latter exactly provides the solution to the method of total marginal
sums by Bailey & Jung [9, 63] given by (7.4)-(7.5).

GLM for gamma claim sizes

The analysis of the gamma claim sizes is more involved because it needs more
(l)
transformations. We denote by ni,j the number of observations Yi,j in risk class
(i, j), this plays the role of the volume in the exponential dispersion family. We
assume that
(l) i.i.d.
Yi,j (i,j , ci,j )
for l = 1, . . . , ni,j .
From the moment generating function given in Section 3.3.3 we immediately see
that for given ni,j the convolution is given by
Yi,j =

ni,j
X

(l)

Yi,j (i,j ni,j , ci,j ).

l=1

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 7. Tariffication and Generalized Linear Models

187

Thus, the total claim amount Yi,j in risk class (i, j) for given ni,j has a gamma
distribution (which belongs to the exponential dispersion family). We define the
normalized random variable Xm = Yi,j /ni,j , where we again use the relabeling
defined in (7.7). Observe that the family of gamma distributions is closed towards
multiplication, see (3.5). Therefore, the density of Xm is given by
fXm (x) =

(cm nm )m nm m nm 1
x
exp{cm nm x}.
(m nm )

(7.19)

w)

Next we do a re-parametrization similar to Example 7.10 so that we obtain the


parametrization of the exponential dispersion family. Set m = 1/m > 0 and
cm = m /m > 0. This provides gamma density
(m nm /m )nm /m nm /m 1
m nm
fXm (x) =
x
exp
x .
(nm /m )
m

(m

Finally, define cumulant function b() = log() for < 0, see Example 7.10.
The density of Xm = Yi,j /ni,j in risk class (i, j) is then given by
(

m x b(m )
fXm (x) = exp
m /nm

1
(nm /m )

nm
m

!nm /m

xnm /m 1 .

tes

Thus, we have for m = R+

Xm EDF(m , m , nm , b() = log()).

no

The first two moments are given by, see Corollary 7.9,
1
E[Xm ] = m

and

Var(Xm ) =

m 2
.
nm m

Analogous to the Poisson case we assume a multiplicative structure in the mean.


Using again the log-link function g() = log() we obtain additive structure

NL

1
log E[Xm ] = log(m
) = log(m ) = (Z)m ,

(7.20)

with design matrix Z RM (r+1) and parameter vector Rr+1 . This gives
relationship
m = exp {(Z)m } .
For the joint log-likelihood function of X RM
+ we then obtain (assuming independence between the components of X)
`X ()

+ log(m ) X nm
=
[Xm exp{(Z)m } (Z)m ] .
m /nm
m m

X m Xm
m

Note that this excludes risk classes (i, j) with no observation ni,j = 0. The MLE
b MLE for is found by the solution of

`X () = 0.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

(7.21)

188

Chapter 7. Tariffication and Generalized Linear Models

We calculate the partial derivatives of the log-likelihood


X nm
X nm

`X () =
[Xm exp{(Z)m } 1] zm,l =
[Xm m 1] zm,l ,
l
m m
m m

where Z = (zm,l )m,l RM (r+1) . For rewriting the previous equation in matrix
form we define the weight matrix V = diag(1 n1 /1 , . . . , M nM /M ). The last
equation is then written as

We have just proved the following proposition:

w)

`X () = Z 0 V X Z 0 V exp{Z}.

(m

Proposition 7.12. The solution to the MLE problem (7.21) in the gamma case is
given by the solution of
Z 0 V exp{Z} = Z 0 V X.
Remarks.

tes

Proposition 7.12 for the gamma case looks very promising because it has
the same structure as Proposition 7.11 for the Poisson case. However, this
similarity is only at the first sight: parameter vector determines which
is also integrated into the weight matrix V = V() . Therefore, the MLE
MLE

no

is only found numerically, using either Fishers scoring method or the


Newton-Raphson algorithm.

NL

Note that the parameter vector acts on the scale parameter cm because
cm = m /m with m = exp {(Z)m }. The shape parameter m is
determined through the dispersion parameter, i.e. m = 1/m .
For the general case within the exponential dispersion family with link function g we refer to Section 2.3.2 in Ohlsson-Johansson [82].
We have seen that the weights wi,j are given by the number of policies vi,j in
the Poisson case and by the number of claims ni,j in the gamma case.
In the log-linear Gaussian model there was the difficulty that we could not
handle risk classes without claims, see page 177. For the Poisson model, this
is not a difficulty because Xm = Nm /vm = 0 is a valid observation. For the
gamma claim sizes risk classes without an observation are naturally excluded
in the MLE.
We summarize the 3 cases considered:

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 7. Tariffication and Generalized Linear Models

189

 Gaussian case:
b
Z 0 1 Z

MLE

Z 0 1 X = 0.

 Poisson case:
b
Z 0 V exp{Z

MLE

} Z 0 V X = 0.

MLE

} Z 0 Vb X = 0,

b
Z 0 Vb exp{Z

b MLE }.
with b = exp{Z

Variable reduction analysis

(m

7.3.3

w)

 Gamma case:

no

tes

In this section, we consider variable reduction for the exponential dispersion family
under the assumption of choosing the log-link function. In the Gaussian case we
have calculated the F statistics given in (7.15). This F statistics was based on
the classical (unscaled) Pearsons residuals b which measure the difference between
the observations and the (estimated) mean, see (7.11). In the general case of the
exponential dispersion family it is more appropriate to replace Pearsons residuals
by the deviance residuals which measure the contributions of residual differences
to the log-likelihood. This we are going to explain next.
Having observations X = (X1 , . . . , XM )0 with independent components, we deterb MLE for Rr+1 within the exponential dispersion family with
mine the MLE
log-link function and design matrix Z RM (r+1) as described above. This then
provides the estimate for the mean given by, see (7.17) and (7.20),
b (b

m)

= exp

b MLE )
(Z

NL

b m =

We define the inverse function h = (b0 )1 which implies bm = h(b m ). The logb is then given by
likelihood function at this estimate
b =
`X ()

X
m

Xm h(b m ) b(h(b m ))
+ c(Xm , , wm ),
/wm

where we assume that m = for all m = 1, . . . , M . Observe that this maximizes


the likelihood function over all possible choices of (under given design matrix
Z, cumulant function b and log-link function). Similar to the likelihood ratio test
(7.14) in the Gaussian model we do a likelihood ratio test for this model within the
exponential dispersion family. Therefore, we consider the model Z and compare
it to the saturated model which has as many parameters as observations:
`X (X) =

X
m

Xm h(Xm ) b(h(Xm ))
+ c(Xm , , wm ).
/wm

Version April 14, 2016, M.V. Wthrich, ETH Zurich

190

Chapter 7. Tariffication and Generalized Linear Models

The scaled deviance is then defined by


b
b
D (X, )
= 2 (`X (X) `X ())
h
i
2X
=
wm Xm h(Xm ) b(h(Xm )) Xm h(b m ) + b(h(b m )) .
m

The deviance statistics is defined by

w)

b = D (X, )
b = 2 (`X (X) `X ())
b .
D(X, )

(m

Observe that these deviance statistics play the role of the residual differences SSerr
(Pearsons residuals) which were used in the likelihood ratio given in (7.14).
This deviance statistics measure the contribution of the residual differences to the
log-likelihood.
Similar to Section 7.2.2 we would now like to see whether we can reduce the number
of parameters in Rr+1 .

tes

Null hypothesis H0 : 0 = . . . = p1 = 0 for given p < r + 1.


b full ) in the full model Rr+1 .
1. Calculate the deviance statistics D(X,

no

b H0 ) under the null hypothesis H0 .


2. Calculate the deviance statistics D(X,

Define the test statistics, see also (7.15),

b H0 ) D(X,
b full ) M r 1
D(X,
0.
b full )
D(X,
p

(7.22)

NL

F=

The test statistics F has approximately an F -distribution with degrees of freedom


given by df 1 = p and df 2 = M r 1. Therefore, we apply the same criterion as
in (7.16). Note that in the homogeneous Gaussian case we exactly obtain identity
(7.15), see also Example 7.13, below.
A second test statistics considered is, see Lemma 3.1 in Ohlsson-Johansson [82],
b H0 ) D (X,
b full ) 0.
X 2 = D (X,

(7.23)

The test statistics X 2 is approximately 2 -distributed with df = p degrees of


freedom. In order to calculate this latter test statistics we need to estimate the
dispersion parameter . For the Poisson case it is assumed to be 1, in the other
cases we have two different options for the estimation of . Assume that m was
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 7. Tariffication and Generalized Linear Models

191

estimated by bm (under the assumption that m = for all m and thus cancels
in the MLE). Then, we can estimate from Pearsons (classical) residuals by
bP =

X
1
(Xm b0 (bm ))2
.
wm
M r1 m
b00 (bm )

An alternative approach is to use the deviances which provide estimate


b full )
D(X,
.
M r1

bD =

w)

We can also calculate bP and bD in the Poisson case and if they are substantially
different from 1, then we either have under- or over-dispersion, and a different
model should be used.

(m

Finally, to check the accuracy of the model and the fit one should also plot the
residuals. Again, we have two options. We can either study Pearsons residuals
given by
Xm b0 (bm )
q
rP,m =
,
b00 (bm )/wm
or the deviance residuals given by
r

2wm Xm h(Xm ) bm b(h(Xm )) + b(bm ) , (7.24)

tes

rD,m = sgn(Xm b0 (bm ))

no

for m = 1, . . . , M . These residuals should not show any structure because the
Xm s were assumed to be independent and the observed residuals should roughly
be centered having similar variances. We come back to this in Section 7.3.4, below.
Example 7.13. Assume that X1 , . . . , XM are independent with
Xm EDF(m , , wm , b() = ()2 /2).

(7.25)

From Example 7.10 we know that these Xm s have a Gaussian distribution, i.e. their
densities are given by
(

NL

1 (xm m )2
f (xm ; m , ) = q
exp
.
2
/wm
2/wm
1

b = ,
b
b = b0 ()
The scaled deviance is given by, set

2
1X
b =
D (X, )
wm Xm bm ,
m

and the deviance statistics is given by


b =
D(X, )

wm Xm bm

2

(7.26)

Compare this to the residual difference SSerr of Section 7.2.2. Compare (7.22) and
(7.15) for the Gaussian model (7.25).

Exercise 22. Calculate the deviance statistics for the Poisson and the gamma
model, see also (3.4) in Ohlsson-Johansson [82].

Version April 14, 2016, M.V. Wthrich, ETH Zurich

192

7.3.4

Chapter 7. Tariffication and Generalized Linear Models

Claims frequency example

w)

In this section we consider a real data example


for tariffication of claims frequencies. We use
the GLM method for Poisson claims counts presented in Section 7.3.1. The data comes from
motor third party liability (MTPL) car insurance. For confidentiality reasons we do not explicitly provide the underlying volume measures
vm which correspond to the number of policies
in risk classes m. For this MTPL car insurance example we choose K = 4 tariff
criteria which provide for risk classes m = m(l1 , l2 , l3 , l4 ) the model

(m

Nm = Nl1 ,l2 ,l3 ,l4 Poi (l1 ,l2 ,l3 ,l4 vl1 ,l2 ,l3 ,l4 ) ,

tes

with vm = vl1 ,l2 ,l3 ,l4 being the number of policies in risk class m and l1 ,l2 ,l3 ,l4 the
expected claims frequency in the corresponding risk class. We assume independence
between different risk classes and we choose a multiplicative tariff structure for the
expected claims frequency, see also (7.1) and (7.17),
m = l1 ,l2 ,l3 ,l4 = exp {l1 ,l2 ,l3 ,l4 } = exp {0 + 1,l1 + 2,l2 + 3,l3 + 4,l4 } ,

(7.27)

no

with intercept 0 and tariff factors k,lk for the tariff criteria k = 1, . . . , 4. The 4
tariff criteria reflect weight category of car, age of driver, kilometers yearly
driven and local region (canton) in Switzerland. We define the relative volume
measures for the 4 different tariff factors as follows
vlweight
1 ,

vl1 ,l2 ,l3 ,l4


[0, 1],
l1 ,l2 ,l3 ,l4 vl1 ,l2 ,l3 ,l4

=P

l2 ,l3 ,l4

NL

, vlkm
and vlcanton
. Moreover, for all tariff criteria k =
and analogously for vlage
3 ,
4 ,
2 ,
1, . . . , 4 we can consider the marginal MLEs. These are given by, see also Estimator
2.32,
X
1
b weight = P

Nl1 ,l2 ,l3 ,l4 ,


l1
l2 ,l3 ,l4 vl1 ,l2 ,l3 ,l4 l2 ,l3 ,l4
b age ,
b km and
b canton .
and analogously we define the marginal MLEs
l3
l4
l2

 k = 1. The first tariff criterion is the weight category of the car. We have the
following 7 risk characteristics for l1 {1, . . . , 7}:
l1
in kg
label
vlweight
,

1
1-500
W1-500
<1%

2
501-1000
W501-1000
8%

3
1001-1500
W1001-1500
56%

4
1501-2000
W1501-2000
30%

5
2001-2500
W2001-2500
4%

6
2501-3000
W2501-3000
1%

7
3001-3500
W3001-3500
1%

b
weight
l

15.4%

7.1%

6.7%

7.3%

11.0%

13.3%

21.4%

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 7. Tariffication and Generalized Linear Models

193

 k = 2. The second tariff criterion is the age of the driver. We have the following
8 risk characteristics l2 {1, . . . , 8}:
l2
age
label
vlage
,

1
18-20
Y18-20
6%

2
21-25
Y21-25
5%

3
26-30
Y26-30
6%

4
31-40
Y31-40
17%

5
41-50
Y41-50
22%

6
51-60
Y51-60
20%

7
61-70
Y61-70
14%

8
71-99
Y71-99
10%

b
age
l

19.8%

8.8%

7.7%

6.6%

6.2%

5.8%

5.4%

6.7%

1
1-5
K1-5
1%

2
6-10
K6-10
52%

3
11-15
K11-15
30%

b
km
l

7.4%

6.6%

7.3%

4
16-20
K16-20
14%

5
21-25
K21-25
1%

6
26-30
K26-30
1%

(m

l3
in 10 000 km
label
vlkm,

w)

 k = 3. The third tariff criterion is the kilometers yearly driven (in 10 000 km).
We have the following 7 risk characteristics l3 {1, . . . , 7}:

8.2%

12.3%

12.4%

7
31-99
K31-99
1%
13.1%

NL

no

tes

 k = 4. The fourth tariff criterion is the Swiss canton the car is registered in
(according to its license plate). There are 26 different cantons in Switzerland which
implies l4 {1, . . . , 26}.

label
AG
AI
BE
AR
BL
BS
GE
FR
GL
GR
JU
LU
NE
NW
OW
SG
SH
SO
SZ
TG
TI
UR
VD
VS
ZG
ZH

Figure 7.3: Fourth tariff criterion: cantons of Switzerland the car is registered in,
i.e. l4 {AG, AI, . . . , ZH}.

 k = 1. We observe that the light weight category W1-500 and the heavy weight
categories W2001-2500, W2501-3000 and W3001-3500 have a much higher claims
frequencies than the middle weight classes, see Figure 7.4 (lhs). The straight horizontal line is the overall sample claims frequency. Figure 7.4 (lhs) also indicates
that light and heavy weight categories have much less volume then the middle
weight categories, this is also reflected in the values vlweight
, l1 = 1, . . . , 7.
1 ,
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 7. Tariffication and Generalized Linear Models

(m

w)

194

NL

no

tes

b weight and (rhs)


Figure 7.4: Marginal MLEs (lhs) for the different weight categories
l1
b age .
for the different age classes
l2

Figure 7.5: Marginal MLEs (lhs) for the different kilometers yearly driven cateb km and (rhs) for the different cantons
b canton .
gories
l3
l4

 k = 2. From the marginal claims frequencies for the different age classes mainly
young drivers are conspicuous, see Figure 7.4 (rhs). The average claims frequency
of drivers between 18 and 20 is more than twice as large as the average claims
frequency of all other drivers.
 k = 3. Figure 7.5 (lhs) shows that frequent long-distance drivers have a much
higher claims frequency than other drivers. But frequent long-distance drivers are
only a small proportion of the total MTPL portfolio, see also vlkm
values.
3 ,
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 7. Tariffication and Generalized Linear Models

195

tes

(m

w)

 k = 4. Figure 7.5 (rhs) shows that we expect substantial differences between


different cantons. Probably mountain regions are different from urban regions, and
it is also noticeable that there seem to be differences between the linguistic areas in
Switzerland. The high frequency observation in Appenzell Innerrhoden (AI) is also
conspicuous, it comes from the fact that rental companies get good deals for license
plates in AI and therefore many rental cars are registered in AI (which obviously
cause higher claims frequencies).

no

b
Figure 7.6: (lhs) Tukey-Anscombe plot which shows the fitted means E[N
m ] versus
the deviance residuals rD,m for m = 1, . . . , M ; (rhs) QQ plot of the deviance
residuals rD,m versus the theoretical (estimated) quantiles qbm for m = 1, . . . , M .

NL

Observe that in this example we have 7 8 7 26 = 100 192 (potential) risk classes.
However, in only M = 60 146 risk classes we have a positive volume vm > 0 (in the
other risk classes we have not sold any policies). Introducing the multiplicative
tariff structure (7.27) with K = 4 tariff criteria reduces the complexity to r + 1 =
7 + 8 + 7 + 26 3 = 45 parameters. We apply the GLM estimation method for
Poisson claims counts, i.e. we evaluate (7.18) using Proposition 7.11. This is done
with R command
> d.glm <- glm(counts W1-500 + W501-1000 + ...,
data=input, offset=log(volumes), family=poisson())
where input contains the counts Nm , the volumes vm as well as the corresponding design matrix Z {0, 1}M (r+1) that consists of binary variables only. The
summary of the results, similar to Figure 7.2, is obtained by R command

Version April 14, 2016, M.V. Wthrich, ETH Zurich

196

Chapter 7. Tariffication and Generalized Linear Models

> summary(d.glm)
MLE

w)

b
This R command provides the MLE
Rr+1 with corresponding standard
b = 30 761.8 on degrees of freedom
errors and p-values, the deviance statistics D(X, )
M r 1 = 60 146 45 = 60 101 and the AIC value of 13519. Furthermore, it
provides the so-called Null Deviance which corresponds to a model which only has
an intercept 0 . This Null Deviance corresponds to the total difference SStot in the
Gaussian model. In our example the Null Deviance is 80 025.5 on 60 146 1 = 60 145
degrees of freedom. The R command

> glm.fitted <- fitted(d.glm)

m]

b
E[N

= vm

b MLE

(m

then calculates the estimated expected claims numbers for m = 1, . . . , M




= vm exp

b MLE )
(Z

Next we determine the deviance residuals rD,m , see (7.24). In the Poisson case they
take a rather simple form
m ])

"

b
b
E[N
E[N
m]
m]
+
1 ,
log
Nm
Nm

tes

rD,m = sgn(Nm

b
E[N

v
u
u
t2N

no

b
for Nm = 0 the deviance residual reduces to rD,m = 2E[N
m ]. These deviance
b
residuals and the corresponding theoretical quantiles qm are obtained from the R
command

> glm.dev <- qqplot.glmRob(input$counts, glm.fitted, 1)

NL

This R command provides the deviance residuals glm.dev$deviance.residuals


and the corresponding theoretical quantiles glm.dev$quantiles. Figure 7.6 (lhs)
b
gives the Tukey-Anscombe plot which plots the fitted means E[N
m ] versus the
deviance residuals rD,m for m = 1, . . . , M . This plot should not show any structure
in order to support the Poisson model for claims counts. Figure 7.6 (lhs) is not
completely convincing but still acceptable.
Figure 7.6 (rhs) gives the QQ plot of the deviance residuals rD,m versus the theoretical (estimated) quantiles qbm for m = 1, . . . , M , for more background we also
refer to Garcia Ben-Yohai [51]. This QQ plot is also not too convincing for the
Poisson model choice. This can also be seen from the estimates of the dispersion
parameter
bD =

b
30 761.8
D(X, )
= 0
= 0.62
M r1
6 101

and

bP = 0.89.

Both estimates bD and bP suggest under-dispersion in the data (a 2 -goodnessVersion April 14, 2016, M.V. Wthrich, ETH Zurich

197

(m

w)

Chapter 7. Tariffication and Generalized Linear Models

NL

no

tes

Figure 7.7: Marginal MLEs and the GLM fitted values: (lhs) for the different
weight categories and (rhs) for the different age classes.

Figure 7.8: (lhs) Marginal MLEs and the GLM fitted values for the different kilometers yearly driven categories; (rhs) GLM fitted values for the different cantons.
of-fit test, see (2.8) for the test statistics, would reject the Poisson assumption on
the 1% significance level). However, as long as we are only interested into tariff
segmentation for different risk classes we may still use the GLM fit as relative tariff
factors unless we have drastic changes in the portfolio mix. Finally, in Figures
7.7 and 7.8 we provide the fitted tariff factors compared to the marginal MLE
estimates.
 k = 1. We see that the GLM fit punishes the light weight cars even slightly more
whereas heavy weight cars are relieved, see Figure 7.7 (lhs). From a practical point
Version April 14, 2016, M.V. Wthrich, ETH Zurich

198

Chapter 7. Tariffication and Generalized Linear Models

of view the former seems a bit unreasonable. It probably has to do with the fact
weight
in the lowest weight class is very small (this weight class
that the volume v1,
should probably be merged with the next one). The relieve for the heavy weight
cars might be compensated by the fact that these heavy weight cars are typically
driven by frequent long-distant drivers.
 k = 2. The marginal estimates for different age classes are very much in line
with the corresponding GLM fits, see Figure 7.7 (rhs).

w)

 k = 3. Figure 7.8 (lhs) suggests that we can probably merge the three kilometers
yearly driven classes K21-25, K26-30 and K31-99, also due to their small volumes.

(m

 k = 4. Figure 7.8 (rhs) shows that we might be able to merge the different
cantons into 4 or 5 different tariff regions to reduce the complexity of the tariff
structure. This is what we will analyze next using the variable reduction technique
of Section 7.3.3.

no

tes

In the last step we present the reduction of variables technique presented in Section
7.3.3. We have performed this for all tariff criteria: the weight category criterion
and the age classes cannot further be reduced. This is a bit surprising for the lowest
weight class W1-500 because the resulting estimate seems a bit unreasonable and
weight
. But the tests clearly reject the null
this risk factor has a very low volume v1,
hypothesis of a merger with the next weight class. Therefore, we only present the
analysis for the kilometers yearly driven tariff criterion and for the canton tariff
criterion.

NL

Null hypothesis H0 : The three kilometers yearly driven classes K21-25, K26-30 and
K31-99 are merged, i.e. 3,5 = 3,6 = 3,7 .
We calculate the test statistics F given in (7.22), the test statistics X 2 given in
(7.23) and the AIC. These values are given in Table 7.3.

AIC
deviance statistics
test statistics F
test statistics X 2

full model
13519
3761.8

under H0
13516
3762.0

test statistics

p-value

0.23
0.28

79%
87%

Table 7.3: Parameter reduction analysis for the tariff criterion kilometers yearly
driven.
The AIC supports the model with merged classes K21-25, K26-30 and K31-99
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 7. Tariffication and Generalized Linear Models

199

and both, the F test statistics and the X 2 test statistics, do not reject the null
hypothesis H0 on a 5% significance level. Therefore, we consider a merger of these
risk classes to one risk factor according to null hypothesis H0 .

(m

w)

Null hypothesis H0 :
(i) The three kilometers yearly driven classes K21-25, K26-30 and K31-99 are
merged, i.e. 3,5 = 3,6 = 3,7 ;
(ii) the following cantons are merged:
(a) 4,AG = 4,BE = 4,LU ,
(b) 4,AI = 4,AR ,
(c) 4,GR = 4,SG ,
(d) 4,GL = 4,NW = 4,OW = 4,SZ = 4,UR = 4,ZG ,
(e) 4,FR = 4,GE = 4,JU = 4,NE = 4,TI = 4,VD = 4,VS .
We again calculate the test statistics F given in (7.22), the test statistics X 2 given
in (7.23) and the AIC. The results are presented in Table 7.4.
under H0
13516
3792.4

test statistics

tes

AIC
deviance statistics
test statistics F
test statistics X 2

full model
13519
3761.8

2.92
30.6

p-value

<1%
2.2%

no

Table 7.4: Parameter reduction analysis for the tariff criteria kilometers yearly
driven and for the cantons according to null hypothesis H0 .

NL

The AIC supports the reduced model of null hypothesis H0 , the F test statistics
rejects H0 on a 1% significance level, whereas the X 2 test statistics does not reject
the null hypothesis H0 on a 1% significance level. Thus, if we want to reduce
the complexity of the tariff structure, we could choose the reduced model of null
hypothesis H0 , see Figure 7.9 for the resulting regional tariff factors.
In the end this tariff decision is a strategic business decision (which is supported by statistical analysis). This business decision will also depend on the tariff structure applied in the previous year: in these considerations, in particular when introducing a new tariff structure, one should always keep in mind that the individual premia on single policies should not change too much from one year to the next. Otherwise loyal customers will be very upset about the new pricing policy of the insurance company and they will think that the company's business is not under control. Therefore, transitions should always be made as smoothly as possible. Another reason for such business decisions is that the prices should (hopefully) be competitive in many segments. Therefore, it is also important that these business decisions take into account what competitors are doing.
Figure 7.9: Tariff factors for cantons: (lhs) full model; (rhs) reduced model of null hypothesis H0.

Chapter 8

Bayesian Models and Credibility Theory

In the previous chapter we have done tariffication using GLM. This was done by splitting the total portfolio into different homogeneous risk classes (i, j). The volume measures in these risk classes (i, j) were given by v_{i,j} in Section 7.3.1 (Poisson case) and by n_{i,j} in Section 7.3.2 (gamma case), respectively. There might occur the situation where a risk class (i, j) has only a small volume v_{i,j} and n_{i,j}, respectively, i.e. only a few policies or claims fall into that risk class. In that case an observation N_{i,j} and S_{i,j} may not be very informative and single outliers may disturb the whole picture, see Figure 7.7 (lhs). Credibility theory aims at dealing with such situations in that it specifies a tariff of the following type

$$\widehat{\mu}_{i,j} = \alpha_{i,j}\,\frac{S_{i,j}}{v_{i,j}} + (1-\alpha_{i,j})\,\mu,$$

i.e. the tariff μ̂_{i,j} for the next accounting year is calculated as a credibility weighted average between the individual past observation S_{i,j}/v_{i,j} and the overall average μ with credibility weight α_{i,j} ∈ [0, 1]. For α_{i,j} = 1 we completely believe in the past observation S_{i,j}/v_{i,j}, for α_{i,j} = 0 we only believe in the overall average μ. Credibility theory makes this approach rigorous and specifies the credibility weights α_{i,j}.
Credibility theory belongs to the field of Bayesian statistics:
- There are exact Bayesian methods which allow for analytical solutions.
- There are simulation methods such as the Markov chain Monte Carlo (MCMC) method which allow for numerical solutions of Bayesian models.
- There are approximations such as linear credibility methods which give optimal solutions in sub-spaces of possible solutions.
Central to these methods is Bayes' rule.

8.1 Exact Bayesian models
We start by explaining Bayes' rule. The basic idea of Bayes' rule goes back to Reverend Thomas Bayes (1701-1761) who discovered the rule during the 1740s. It was then Richard Price (1723-1791) who devoted much of his time to clean and prepare Bayes' essay on the probability of causes, and he submitted "An essay toward solving a problem in the doctrine of chances" to the Royal Society's Philosophical Transactions. In 1774 Pierre-Simon Laplace discovered the rule on his own and he brought it into today's form. Therefore, Bayes' rule should rather be called the Bayes-Price-Laplace rule. For a historical review we refer to McGrayne [76]. As we will see, Bayes' rule is the mathematical tool to combine prior knowledge and observations into posterior knowledge. Technically it exchanges probabilities, therefore it is also known under the name method of inverse probabilities.

Assume we have an observation X that has density f_θ(x). Often the difficulty is that the parameter θ is not known/specified. In previous chapters we have estimated this parameter with the MLE method and with the method of moments. These methods are purely observation based. What can we do if we have no past observations or only scarce past observations? This is the question we would like to answer in this chapter. It will lead to a new attitude and to a new estimation method.

Figure 8.1: (lhs) Grave of the Bayes family at Bunhill Fields Burial Ground, London
UK; (rhs) historical review of McGrayne [76].
We specify a prior distribution/density π(θ) for the (unknown) parameter θ. We will explain below how this prior distribution is specified. The joint density of observation X and parameter θ is then given by

$$f(x, \theta) = f_\theta(x)\,\pi(\theta).$$

Bayes' rule allows to calculate the posterior distribution of θ, given observation x,

$$\pi(\theta|x) = \frac{f_\theta(x)\,\pi(\theta)}{\int f_\theta(x)\,\pi(\theta)\,d\theta} \;\propto\; f_\theta(x)\,\pi(\theta).$$

This means that we start with a prior distribution π(θ). This prior distribution either expresses expert knowledge or is determined from a portfolio of similar business. Having observed x, we modify the prior belief π(θ) to obtain the posterior distribution π(θ|x) that reflects both the prior knowledge π(θ) about θ and the experience x. That is, the prior belief π(θ) is improved by the arriving observation x. The general idea then is to update our (prior) knowledge about θ whenever an observation arrives. These updates constantly improve our estimation of the unknown model parameter θ.
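To make this updating mechanism tangible, here is a minimal numerical sketch (not part of the original text): it approximates a posterior on a discretized grid of θ values; the gamma prior and Poisson likelihood are hypothetical choices, anticipating Section 8.1.1 below.

```python
import numpy as np
from scipy.stats import gamma, poisson

# hypothetical prior and observation, chosen for illustration only
theta = np.linspace(1e-4, 1.0, 2001)          # grid for the unknown parameter
d = theta[1] - theta[0]                        # grid spacing
prior = gamma.pdf(theta, a=2.0, scale=0.1)     # prior pi(theta)
x, v = 12, 100                                 # observed claims count and volume

# Bayes' rule on the grid: posterior proportional to likelihood times prior
posterior = poisson.pmf(x, v * theta) * prior
posterior /= posterior.sum() * d               # normalize to a density

print((theta * posterior).sum() * d)           # posterior mean, approx 0.127
```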

This is exactly what Bayesian and credibility theory is about. We start with an explicit example to show how this mechanism works.

8.1.1 Poisson-gamma model

In this section we present one of the most popular Bayesian models which has a closed form solution for the posterior distribution. As mentioned in Bühlmann-Gisler [24], this mathematical model can be traced back in the actuarial literature to Fritz Bichsel (1921-1999) [11]. He introduced it in the 1960s to calculate a bonus-malus tariff system for Swiss motor third party liability insurance. The aim was to punish bad drivers and to reward good drivers according to the collected individual claims experience. This has led to bonus-malus considerations.

Definition 8.1 (Poisson-gamma model). Assume fixed volumes v_t > 0 are given for t ∈ N.
- Conditionally, given Θ, the components of N = (N_1, ..., N_T) are independent with N_t ~ Poi(Θ v_t).
- Θ ~ Γ(γ, c) with prior parameters γ > 0 and c > 0.
Remark. Observe that there is a fundamental difference to the negative-binomial distribution considered in Section 2.2.4. Here, we assume that N_1, ..., N_T all belong to the same Θ, whereas for having independent negative-binomial distributions N_1, ..., N_T every component belongs to another independent latent factor Θ_1, ..., Θ_T. In the latter case the components of N are independent, whereas in the former case they are dependent (and only conditionally independent, given Θ).

Theorem 8.2. Assume N = (N_1, ..., N_T) follows the Poisson-gamma model of Definition 8.1. The posterior distribution of Θ, conditional on N, is given by

$$\Theta\,\big|\,\{N\} \;\sim\; \Gamma\Big(\gamma + \sum_{t=1}^T N_t,\; c + \sum_{t=1}^T v_t\Big).$$

Proof. The posterior is given by

$$\pi(\theta|N) \;\propto\; f_\theta(N)\,\pi(\theta) = \prod_{t=1}^T e^{-\theta v_t}\,\frac{(\theta v_t)^{N_t}}{N_t!}\;\frac{c^\gamma}{\Gamma(\gamma)}\,\theta^{\gamma-1}\,e^{-c\theta} \;\propto\; \theta^{\gamma+\sum_{t=1}^T N_t - 1}\; e^{-\theta\left(c+\sum_{t=1}^T v_t\right)}.$$

This is a gamma density with the required properties. ∎

Remarks 8.3.
- The posterior distribution is again a gamma distribution but with modified parameters. For the parameters we obtain the updates

$$\gamma \mapsto \gamma_T^{\rm post} = \gamma + \sum_{t=1}^T N_t \qquad\text{and}\qquad c \mapsto c_T^{\rm post} = c + \sum_{t=1}^T v_t.$$

Often γ and c are called prior parameters and γ_T^post and c_T^post posterior parameters (at time T).
- Note that this update has a recursive structure

$$\gamma_T^{\rm post} = \gamma_{T-1}^{\rm post} + N_T \qquad\text{and}\qquad c_T^{\rm post} = c_{T-1}^{\rm post} + v_T.$$

- The remarkable property in the Poisson-gamma model is that the posterior distribution stays in the same family of distributions as the prior distribution. There are more examples of this kind as we will see below. Many of these examples belong to the exponential dispersion family with conjugate priors.
For the estimation of the unknown parameter Θ we obtain the following prior and posterior estimators

$$\mu_0 = E[\Theta] = \frac{\gamma}{c}, \qquad\qquad \widehat{\mu}_T^{\rm post} = E[\Theta|N] = \frac{\gamma_T^{\rm post}}{c_T^{\rm post}} = \frac{\gamma + \sum_{t=1}^T N_t}{c + \sum_{t=1}^T v_t}.$$

We analyze the posterior estimator μ̂_T^post in more detail below, which will provide the basic credibility formula.

Corollary 8.4. Assume N = (N_1, ..., N_T) follows the Poisson-gamma model of Definition 8.1. The posterior estimator μ̂_T^post has the following credibility form

$$\widehat{\mu}_T^{\rm post} = \alpha_T\,\widehat{\mu}_T + (1-\alpha_T)\,\mu_0,$$

with credibility weight α_T and observation based estimator μ̂_T given by

$$\alpha_T = \frac{\sum_{t=1}^T v_t}{c + \sum_{t=1}^T v_t} \in (0,1) \qquad\text{and}\qquad \widehat{\mu}_T = \frac{1}{\sum_{t=1}^T v_t}\sum_{t=1}^T N_t.$$

The (mean square error) uncertainty of this estimator is given by

$$E\left[\left(\widehat{\mu}_T^{\rm post} - \Theta\right)^2\Big|\,N\right] = \frac{\gamma_T^{\rm post}}{(c_T^{\rm post})^2} = (1-\alpha_T)\,\frac{1}{c}\,\widehat{\mu}_T^{\rm post}.$$

Proof. In view of Theorem 8.2 we have for the posterior mean

$$\widehat{\mu}_T^{\rm post} = \frac{\gamma + \sum_{t=1}^T N_t}{c + \sum_{t=1}^T v_t} = \frac{\sum_{t=1}^T v_t}{c + \sum_{t=1}^T v_t}\,\frac{\sum_{t=1}^T N_t}{\sum_{t=1}^T v_t} + \frac{c}{c + \sum_{t=1}^T v_t}\,\frac{\gamma}{c} = \alpha_T\,\widehat{\mu}_T + (1-\alpha_T)\,\mu_0.$$

This proves the first claim. For the estimation uncertainty we have

$$E\left[\left(\widehat{\mu}_T^{\rm post} - \Theta\right)^2\Big|\,N\right] = {\rm Var}(\Theta|N) = \frac{\gamma_T^{\rm post}}{(c_T^{\rm post})^2} = (1-\alpha_T)\,\frac{1}{c}\,\widehat{\mu}_T^{\rm post}.$$

This proves the claim. ∎

Remarks 8.5.
- Corollary 8.4 shows that the posterior estimator μ̂_T^post is a credibility weighted average between the prior guess μ0 and the purely observation based estimator μ̂_T with credibility weight α_T ∈ (0, 1).
- The credibility weight α_T has the following properties:
1. for the number of observed years T → ∞: α_T → 1 (since v_t ≥ 1 for all t if v_t counts the number of policies);
2. for the volume v_t → ∞: α_T → 1;
3. for the prior uncertainty going to infinity, i.e. c → 0: α_T → 1;
4. for the prior uncertainty going to zero, i.e. c → ∞: α_T → 0.
Note that

$${\rm Var}(\Theta) = \frac{\gamma}{c^2} = \frac{1}{c}\,\mu_0.$$

For c large we have an informative prior distribution, for c small we have a vague prior distribution and for c = 0 we have a non-informative or improper prior distribution. The latter means that we have no prior parameter knowledge (this has to be understood in an asymptotic sense).

- The observation based estimator satisfies, see Estimators 2.27 and 2.32,

$$\widehat{\mu}_T^{\rm MV} = \widehat{\mu}_T^{\rm MLE} = \widehat{\mu}_T.$$

- The posterior estimator μ̂_T^post has the nice property of a recursive update structure which is important in many situations, see the next corollary.

Corollary 8.6. Assume N = (N_1, ..., N_T) follows the Poisson-gamma model of Definition 8.1. Let μ̂_T^post denote the posterior estimator and μ̂_{T−1}^post the posterior estimator in the sub-model where we only have observed (N_1, ..., N_{T−1}). The posterior estimator μ̂_T^post has the following recursive update structure

$$\widehat{\mu}_T^{\rm post} = \beta_T\,\frac{N_T}{v_T} + (1-\beta_T)\,\widehat{\mu}_{T-1}^{\rm post},$$

with credibility weight

$$\beta_T = \frac{v_T}{c + \sum_{t=1}^T v_t} \in (0,1).$$
Proof. In view of Corollary 8.4 we have for the posterior mean

$$\widehat{\mu}_T^{\rm post} = \frac{1}{c+\sum_{t=1}^{T} v_t}\left(\sum_{t=1}^{T-1} N_t + N_T\right) + \frac{c}{c+\sum_{t=1}^{T} v_t}\,\frac{\gamma}{c}.$$

For the first term we have

$$\frac{1}{c+\sum_{t=1}^{T} v_t}\left(\sum_{t=1}^{T-1} N_t + N_T\right) = \frac{v_T}{c+\sum_{t=1}^{T} v_t}\,\frac{N_T}{v_T} + \frac{c+\sum_{t=1}^{T-1} v_t}{c+\sum_{t=1}^{T} v_t}\,\frac{\sum_{t=1}^{T-1} v_t}{c+\sum_{t=1}^{T-1} v_t}\,\frac{\sum_{t=1}^{T-1} N_t}{\sum_{t=1}^{T-1} v_t} = \beta_T\,\frac{N_T}{v_T} + (1-\beta_T)\,\alpha_{T-1}\,\widehat{\mu}_{T-1},$$

and for the second term, using 1 − β_T = (c + Σ_{t=1}^{T−1} v_t)/(c + Σ_{t=1}^{T} v_t),

$$\frac{c}{c+\sum_{t=1}^{T} v_t}\,\mu_0 = (1-\beta_T)(1-\alpha_{T-1})\,\mu_0.$$

Collecting all terms provides the claim. ∎

Conclusions. For pricing such a portfolio, we need to have prior information μ0 about the premium. This prior information can come from experts, from similar portfolios, from market information or from a combination thereof. If we have no observations we charge the premium μ0. When we start to collect observations N_1, N_2, ..., we constantly update the premium by the rule

$$\widehat{\mu}_t^{\rm post} = \beta_t\,\frac{N_t}{v_t} + (1-\beta_t)\,\widehat{\mu}_{t-1}^{\rm post},$$

for t ≥ 1, where we set μ̂_0^post = μ0. The prior information has an uncertainty parameter c for the credibility weighting of μ0. The bigger the prior uncertainty the faster the prior knowledge will disappear as t → ∞. In the limit (as t → ∞) we have a premium that is completely based on the observations and which, in this Poisson-gamma case, coincides with the MLE.
However, the credibility formula of Corollary 8.4 is of special interest when we only have a few observations, i.e. t small, and these few observations are only based on a small portfolio, i.e. v_s small for all s ≤ t. In such cases the credibility weight α_t may be around, say, 60% and therefore the prior mean μ0 substantially smooths the purely observation based estimator μ̂_t. This way we get much more stability and reliability in the premium calculation because we add an additional source of information to the premium calculation problem (prior choice).
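For illustration (not part of the original text; the prior parameters and data are hypothetical), the following Python sketch implements the posterior update of Theorem 8.2, the credibility form of Corollary 8.4 and the recursion of Corollary 8.6:

```python
import numpy as np

# hypothetical prior: Theta ~ Gamma(gamma0, c), so mu0 = gamma0 / c
gamma0, c = 2.0, 25.0
mu0 = gamma0 / c

# hypothetical yearly volumes v_t and claims counts N_t
v = np.array([100., 120., 150.])
N = np.array([9, 11, 16])

# posterior parameters of Theorem 8.2
gamma_post = gamma0 + N.sum()
c_post = c + v.sum()

# credibility form of Corollary 8.4
alpha = v.sum() / (c + v.sum())
mu_hat = N.sum() / v.sum()
mu_post = alpha * mu_hat + (1 - alpha) * mu0
assert np.isclose(mu_post, gamma_post / c_post)

# recursive update of Corollary 8.6
mu_rec = mu0
for t in range(len(v)):
    beta = v[t] / (c + v[:t + 1].sum())
    mu_rec = beta * N[t] / v[t] + (1 - beta) * mu_rec
assert np.isclose(mu_rec, mu_post)
print(mu_post)
```

The two assertions confirm numerically that the credibility weighted average and the recursive update both reproduce the posterior mean γ_T^post/c_T^post.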

Next we study a larger class of distribution functions for which we can explicitly solve the pricing problem in a Bayesian context.

8.1.2 Exponential dispersion family with conjugate priors

The crucial property of the Poisson-gamma model is that the prior and the posterior distributions belong to the same family of parametric distributions, only the parameters change from prior parameters to posterior parameters. There are many examples of this type. The best known examples belong to the exponential dispersion family with conjugate priors. We have already met the exponential dispersion family in Definition 7.7, X ~ EDF(θ, φ, w, b(·)) has (generalized) density

$$f_X(x; \theta, \phi) = \exp\left\{\frac{x\theta - b(\theta)}{\phi/w} + c(x, \phi, w)\right\},$$

for an (unknown) parameter θ in an open set Θ̃ ⊆ R. In the Bayesian case we will model this parameter θ = Θ with a prior distribution π on Θ̃ and then try to determine the posterior distribution after we have collected (independent) observations X_1, ..., X_T that belong to this EDF(Θ, φ, w, b(·)).

Model Assumptions 8.7 (exponential dispersion family with conjugate priors). Assume fixed volumes w_t > 0, t = 1, ..., T, a dispersion parameter φ > 0 and a cumulant function b : Θ̃ → R on an open set Θ̃ ⊆ R are given.
- Assume the random variable Θ has the following density on Θ̃

$$\pi_{x_0,\tau}(\theta) = \exp\left\{\frac{x_0\theta - b(\theta)}{\tau^2} + d(x_0, \tau)\right\},$$

with fixed prior parameters x_0 ∈ I and τ ∈ (0, c*), where d(·,·) describes the normalization. I ⊆ R denotes the possible choices of x_0 so that π_{x_0,τ} is a well-defined density on Θ̃ for all τ ∈ (0, c*) for a fixed given constant c* > 0.
- Conditionally, given Θ, the components of X = (X_1, ..., X_T) are independent with X_t ~ EDF(Θ, φ, w_t, b(·)), having well-defined densities with supports not depending on Θ.

Theorem 8.8. We make Model Assumptions 8.7 and assume that the domain I of possible prior choices x_0 is an open interval which contains the range of X_t for all Θ and t = 1, ..., T. The posterior distribution of Θ, given X, is given by the density π_{x̂_T^post, τ_post}(θ) with

$$\tau_{\rm post} = \left[\sum_{t=1}^T \frac{w_t}{\phi} + \frac{1}{\tau^2}\right]^{-1/2} < \tau, \qquad\text{with}\qquad \tau_{\rm post} \in (0, c^*),$$

and

$$\widehat{x}_T^{\rm post} = \alpha_T\,\widehat{x}_T^{\rm MV} + (1-\alpha_T)\,x_0 \in I,$$

with credibility weight α_T and (minimum variance) estimator x̂_T^MV

$$\alpha_T = \frac{\sum_{t=1}^T w_t}{\sum_{t=1}^T w_t + \frac{\phi}{\tau^2}} \qquad\text{and}\qquad \widehat{x}_T^{\rm MV} = \frac{1}{\sum_{t=1}^T w_t}\sum_{t=1}^T w_t X_t,$$

where for the minimum variance statement we additionally assume that the second moments of X_t|{Θ} exist and the cumulant function b ∈ C² in the interior of Θ̃.
Proof. The Bayes rule gives for the posterior distribution of Θ, conditionally given X,

$$\pi(\theta|X) \;\propto\; \prod_{t=1}^T f_{X_t}(X_t;\theta,\phi)\;\pi_{x_0,\tau}(\theta) \;\propto\; \prod_{t=1}^T \exp\left\{\frac{X_t\theta - b(\theta)}{\phi/w_t}\right\}\,\exp\left\{\frac{x_0\theta - b(\theta)}{\tau^2}\right\}$$
$$= \exp\left\{\left[\sum_{t=1}^T \frac{X_t w_t}{\phi} + \frac{x_0}{\tau^2}\right]\theta - \left[\sum_{t=1}^T \frac{w_t}{\phi} + \frac{1}{\tau^2}\right] b(\theta)\right\} = \exp\left\{\frac{\widehat{x}_T^{\rm post}\,\theta - b(\theta)}{\tau_{\rm post}^2}\right\},$$

with

$$\widehat{x}_T^{\rm post} = \left[\sum_{t=1}^T \frac{w_t}{\phi} + \frac{1}{\tau^2}\right]^{-1}\left[\sum_{t=1}^T \frac{X_t w_t}{\phi} + \frac{x_0}{\tau^2}\right] = \alpha_T\,\widehat{x}_T^{\rm MV} + (1-\alpha_T)\,x_0 \in I.$$

Observe that 0 < τ_post < τ < c*. The membership x̂_T^post ∈ I holds true because I is (by assumption) an open interval that contains x_0 and the range of all possible outcomes X_t for all Θ and t = 1, ..., T. Therefore, we obtain the posterior density π_{x̂_T^post, τ_post} which is a well-defined density on Θ̃ by assumption. There remains the proof of the minimum variance statement. For fixed parameter Θ we know that X = (X_1, ..., X_T) are independent with X_t ~ EDF(Θ, φ, w_t, b(·)). Corollary 7.9 (or its generalization) implies

$$E[X_t|\Theta] = b'(\Theta) \qquad\text{and}\qquad {\rm Var}(X_t|\Theta) = \frac{\phi}{w_t}\,b''(\Theta). \tag{8.1}$$

Note that Θ does not depend on t, therefore the statement of the minimum variance estimator follows from Lemma 2.26. This closes the proof. ∎

Theorem 8.9 (credibility estimator). We make the assumptions of Theorem 8.8. In addition we assume that exp{(x_0θ − b(θ))/τ²} disappears on the boundary of Θ̃ for all x_0 ∈ I and τ ∈ (0, c*) and that b ∈ C¹ in the interior of Θ̃. We have

$$E[b'(\Theta)] = x_0 \qquad\text{and}\qquad E[b'(\Theta)|X] = \widehat{x}_T^{\rm post} = \alpha_T\,\widehat{x}_T^{\rm MV} + (1-\alpha_T)\,x_0,$$

see Theorem 8.8 for notation.

Proof. In view of Theorem 8.8 it suffices to prove the first statement for all x_0 ∈ I and τ ∈ (0, c*). We have

$$E[b'(\Theta)] = \int_{\widetilde\Theta} b'(\theta)\,\exp\left\{\frac{x_0\theta - b(\theta)}{\tau^2} + d(x_0,\tau)\right\}d\theta = x_0 - \tau^2\int_{\widetilde\Theta} \frac{x_0 - b'(\theta)}{\tau^2}\,\exp\left\{\frac{x_0\theta - b(\theta)}{\tau^2} + d(x_0,\tau)\right\}d\theta$$
$$= x_0 - \tau^2\,\exp\{d(x_0,\tau)\}\left[\exp\left\{\frac{x_0\theta - b(\theta)}{\tau^2}\right\}\right]_{\partial\widetilde\Theta} = x_0,$$

where the last step uses that the integrand of the second integral is the θ-derivative of the prior density and that this density disappears on the boundary of Θ̃. This proves the claim. ∎

Example 8.10 (exact credibility model). We make the assumptions of Theorem 8.9 but we extend the random vector to (X_1, ..., X_T, X_{T+1}), i.e. we add one additional component X_{T+1} to the random vector, and we assume that conditionally, given Θ, these components are all independent satisfying Model Assumptions 8.7. Our aim is to price X_{T+1} based on the observations X_1, ..., X_T and on the prior knowledge π_{x_0,τ}. Therefore we calculate the conditional expectation of X_{T+1}, given observations X_1, ..., X_T, by applying the tower property. This provides

$$E[X_{T+1}|X_1,\ldots,X_T] = E\big[E[X_{T+1}|\Theta, X_1,\ldots,X_T]\,\big|\,X_1,\ldots,X_T\big] = E\big[E[X_{T+1}|\Theta]\,\big|\,X_1,\ldots,X_T\big]$$
$$= E[b'(\Theta)|X_1,\ldots,X_T] = \widehat{x}_T^{\rm post} = \alpha_T\,\widehat{x}_T^{\rm MV} + (1-\alpha_T)\,x_0. \tag{8.2}$$

Thus, we get a credibility weighted average for the premium of X_{T+1} which is based on the prior knowledge π_{x_0,τ} and on the past experience X_1, ..., X_T. Similar to Corollary 8.6 we obtain a recursive update structure for this experience premium, which allows to express the premium more and more accurately as time passes (under the above stationarity assumptions, of course).

Remarks 8.11.
- Examples that belong to the exponential dispersion family with conjugate priors are: the Poisson-gamma model, the gamma-gamma model, the (log-)normal-normal model. For detailed information we refer to Chapter 2 in Bühlmann-Gisler [24].
- All models that have been studied in the GLM Chapter 7 can also be studied in the Bayesian sense as illustrated above.
- Theorem 8.8 gives an additional way of parameter estimation within the exponential dispersion family. In contrast to the MLEs and the minimum variance estimators, this Bayesian way also allows to include prior information, which may come from experts or from similar business. Moreover, parameter uncertainty is quantified by the posterior distribution.
- This Bayesian idea can be extended to other families of distributions, for example the Pareto-gamma case is treated in Section 2.6 of Bühlmann-Gisler [24].

Example 8.12 (gamma-gamma model). We close this section with the example of the gamma-gamma model. We recall Example 7.10. Choose fixed volumes w_t > 0, t = 1, ..., T, and dispersion parameter φ = 1/γ > 0. Assume that conditionally, given Θ > 0, X_1, ..., X_T are independent gamma distributed with densities

$$f_{X_t}(x;\Theta,\phi) = \frac{(w_t\Theta/\phi)^{w_t/\phi}}{\Gamma(w_t/\phi)}\;x^{w_t/\phi-1}\,\exp\left\{-\frac{w_t\Theta}{\phi}\,x\right\} \qquad\text{for } x \in \mathbb{R}_+.$$

This is the form used in (7.19) with scale parameter c = Θ/φ > 0. Observe that the range of the random variables X_t is R_+ and that we obtain well-defined gamma densities on R_+ for all Θ ∈ R_+ and all t = 1, ..., T. This motivates the choice of the open set Θ̃ = −R_+ for the possible parameter choices θ = −Θ.
Thus, we need to show two things: (i) the density f_{X_t}(x; Θ, φ) belongs to the exponential dispersion family for a particular cumulant function b : Θ̃ → R; (ii) this will allow to define the conjugate prior density π_{x_0,τ} for which we would like to show that we can apply Theorem 8.9.

Item (i) was already done in Example 7.10, however we will do it once more because the signs need a careful treatment.

$$f_{X_t}(x;\Theta,\phi) = \Theta^{w_t/\phi}\,\exp\left\{-\frac{w_t\Theta}{\phi}\,x\right\}\exp\{c(x,\phi,w_t)\} = \exp\left\{\frac{w_t}{\phi}\log\Theta - \frac{w_t}{\phi}\,\Theta x\right\}\exp\{c(x,\phi,w_t)\}$$
$$= \exp\left\{\frac{x(-\Theta) - (-\log(\Theta))}{\phi/w_t}\right\}\exp\{c(x,\phi,w_t)\}.$$

The last formula seems to be a waste of minus signs, but with the definitions θ = −Θ and b(θ) = −log(−θ) for θ < 0 we see that the gamma density belongs to the exponential dispersion family, that is, by a slight abuse of notation in f_{X_t},

$$f_{X_t}(x;\theta,\phi) = \exp\left\{\frac{x\theta - b(\theta)}{\phi/w_t}\right\}\exp\{c(x,\phi,w_t)\}.$$

Moreover, we set Θ̃ = −R_+ for the domain of b. Corollary 7.9 then implies for all t = 1, ..., T

$$E[X_t|\Theta] = b'(\theta) = -\theta^{-1} = \Theta^{-1} \in \mathbb{R}_+.$$

This completes task (i).

(ii) The prior density on Θ̃ is then chosen by

$$\pi_{x_0,\tau}(\theta) = \exp\left\{\frac{x_0\theta - b(\theta)}{\tau^2} + d(x_0,\tau)\right\} \;\propto\; (-\theta)^{\frac{1}{\tau^2}+1-1}\,\exp\left\{-\frac{x_0}{\tau^2}\,(-\theta)\right\}.$$

This is a gamma density, set ϑ = −θ, with shape parameter 1 + 1/τ² > 0 and scale parameter x_0/τ². This implies that we should choose I = R_+ and τ > 0. In view of Theorem 8.8 the assumptions are fulfilled because I is an open interval containing all possible observations X_t, and thus Theorem 8.8 can be applied.
Next we observe that this density disappears on the boundary of Θ̃ given by the set {−∞} ∪ {0}. Therefore, we have from Theorem 8.9 (we also perform the whole calculation to back test the result)

$$x_0 = E[b'(\Theta)] = \int_{\mathbb{R}_+} \vartheta^{-1}\,\frac{(x_0/\tau^2)^{1+1/\tau^2}}{\Gamma(1+1/\tau^2)}\,\vartheta^{\frac{1}{\tau^2}+1-1}\,\exp\left\{-\frac{x_0}{\tau^2}\,\vartheta\right\}d\vartheta$$
$$= \frac{(x_0/\tau^2)^{1+1/\tau^2}\,\Gamma(1/\tau^2)}{\Gamma(1+1/\tau^2)\,(x_0/\tau^2)^{1/\tau^2}}\int_{\mathbb{R}_+}\frac{(x_0/\tau^2)^{1/\tau^2}}{\Gamma(1/\tau^2)}\,\vartheta^{\frac{1}{\tau^2}-1}\,\exp\left\{-\frac{x_0}{\tau^2}\,\vartheta\right\}d\vartheta = x_0.$$

Moreover, the posterior mean is given by

$$E\left[\Theta^{-1}\,\middle|\,X\right] = \widehat{x}_T^{\rm post} = \alpha_T\,\widehat{x}_T^{\rm MV} + (1-\alpha_T)\,x_0,$$

with credibility weight

$$\alpha_T = \frac{\sum_{t=1}^T w_t}{\sum_{t=1}^T w_t + \frac{\phi}{\tau^2}}.$$

Here τ > 0 describes the degree of information contained in the prior distribution.
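As a quick numerical sanity check (not part of the original text; all inputs are hypothetical), the following sketch verifies that the conjugate posterior mean of 1/Θ coincides with the credibility form of Theorem 8.9, using the posterior gamma parameters obtained above:

```python
import numpy as np

# hypothetical inputs (for illustration only)
phi, tau2, x0 = 0.5, 0.04, 1.2        # dispersion phi, prior tau^2, prior mean x0
w = np.array([10., 20., 15.])          # volumes w_t
X = np.array([1.1, 1.4, 0.9])          # observed gamma claims X_t

# by conjugacy the posterior of Theta is Gamma(shape a, rate r):
a = 1 + 1/tau2 + w.sum()/phi
r = x0/tau2 + (w * X).sum()/phi
post_mean = r / (a - 1)                # E[1/Theta | X] = r/(a-1)

# credibility form of Theorem 8.9
alpha = w.sum() / (w.sum() + phi/tau2)
x_mv = (w * X).sum() / w.sum()
print(post_mean, alpha * x_mv + (1 - alpha) * x0)   # both approx 1.1739
```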

In this section we have considered examples for which we can explicitly calculate the posterior distribution. The next section will give approximations where this is not the case.

8.2 Linear credibility estimation
In Model Assumptions 8.7 we have studied Bayesian models which were based on the exponential dispersion family with conjugate priors. As a result we were able to explicitly calculate the posterior distribution in these models and, moreover, this posterior distribution belonged to the same class of distributions as the prior itself, see Theorem 8.8. In many applied modeling problems we do not face such an ideal situation. Nowadays there are powerful simulation techniques that can handle more complicated models and problems. In the case of Bayesian analysis we can use Markov chain Monte Carlo (MCMC) methods, such as Gibbs sampling, the Metropolis-Hastings algorithm and sequential Monte Carlo samplers, which will provide the posterior distribution in almost any situation where we can write down the posterior density up to the normalizing constant. That is, whenever we have an explicit posterior density of the following crucial form

$$\pi(\theta|x) \;\propto\; f_\theta(x)\,\pi(\theta),$$

and the right-hand side of this proportionality is explicit as a function of θ, we can use an acceptance-rejection simulation algorithm (within MCMC methods) which allows to approximate π(θ|x) empirically. For MCMC methods we refer to the related literature, see for instance Congdon [28], Gilks et al. [53], Green [56, 57], Johansen et al. [61] or Robert [88].
Linear credibility theory is not based on simulation methods but it tries to approximate the posterior mean by the best linear estimator. This we are going to describe more explicitly in this section. The key model for this analysis is the Hans Bühlmann and Erwin Straub (1938-2004) model [25]. This model was mainly used in an insurance pricing context but, of course, possible applications are much more widespread. For literature we refer to Bühlmann-Gisler [24].
8.2.1 Bühlmann-Straub model

Model 8.13 (Bühlmann-Straub (BS) model [25]). Assume we have I risk classes and T random variables per risk class. Assume fixed volumes w_{i,t} > 0, i = 1, ..., I and t = 1, ..., T, are given.
- Conditionally, given Θ_i, the components of X_i = (X_{i,1}, ..., X_{i,T}) are independent with the first two conditional moments given by

$$E[X_{i,t}|\Theta_i] = \mu(\Theta_i), \qquad\qquad {\rm Var}(X_{i,t}|\Theta_i) = \frac{\sigma^2(\Theta_i)}{w_{i,t}}.$$

- The pairs (Θ_1, X_1), ..., (Θ_I, X_I) are independent and Θ_1, ..., Θ_I are i.i.d.

Throughout, we assume that the second moments are finite, i.e. E[X_{i,t}²] < ∞ for all i, t.

Remarks 8.14.
- We assume that each risk class i is characterized by a risk characteristics Θ_i with range Θ̃. A priori (before having any observations X_{i,t}) all risk classes are considered to be similar which is expressed by the i.i.d. property of Θ_1, ..., Θ_I. This describes our prior knowledge about the risk classes.
- The conditional mean and variance are characterized by the two functions μ : Θ̃ → R and σ² : Θ̃ → R_+; θ ↦ μ(θ) and θ ↦ σ²(θ).
- If we set I = 1, i.e. we only have one risk class, then an explicit example of the BS Model 8.13 is given by the exponential dispersion family with conjugate priors, Model Assumptions 8.7. The conditional mean and variance are then modeled by, see (8.1),

$$\mu(\theta) = b'(\theta) \qquad\text{and}\qquad \sigma^2(\theta) = \phi\,b''(\theta),$$

for the corresponding (sufficiently smooth) cumulant function b : Θ̃ → R.


For the BS credibility estimator we define the following structural parameters:

$$\mu_0 = E[\mu(\Theta_1)] \qquad \text{(collective mean)}, \tag{8.3}$$
$$\tau^2 = {\rm Var}(\mu(\Theta_1)) \qquad \text{(volatility between risk classes)}, \tag{8.4}$$
$$\sigma^2 = E[\sigma^2(\Theta_1)] \qquad \text{(expected volatility within risk classes)}. \tag{8.5}$$
8.2.2 Bühlmann-Straub credibility formula

The Bayesian estimator for the (unknown) mean μ(Θ_i) of risk class i is given by

$$\widehat{\mu(\Theta_i)} = E[\mu(\Theta_i)|X_1,\ldots,X_I]. \tag{8.6}$$

In the exponential dispersion family with conjugate priors this posterior mean can be calculated explicitly, see Theorem 8.9. In most other situations, however, this is not the case. Therefore, we approximate this posterior mean. We briefly describe how this approximation is done. Assume that all considered random variables are square integrable, thus we work on the Hilbert space L²(Ω, F, P) of square integrable random variables, where the inner product is given by

$$\langle X, Y\rangle = E[XY] \qquad\text{for } X, Y \in L^2(\Omega, \mathcal{F}, P).$$

In this Hilbert space the random vectors X_1, ..., X_I generate the subspace G(X) of all σ(X_1, ..., X_I)-measurable random variables. The posterior mean (8.6) is the element of the subspace G(X) that minimizes the L²-distance between this subspace G(X) and μ(Θ_i); in the Hilbert space this estimate corresponds to the orthogonal projection of μ(Θ_i) onto G(X). In general, this minimization and orthogonal projection to G(X), respectively, has a too complicated form. To reduce this complexity we restrict the orthogonal projection to simpler subsets L of G(X). This will provide approximations to the posterior mean in the more restricted subsets L ⊂ G(X). We define the following two subsets

$$L(X,1) = \Big\{\widehat{\mu} = a_0 + \sum_{i,t} a_{i,t} X_{i,t};\; a_0 \in \mathbb{R},\; a_{i,t} \in \mathbb{R} \text{ for all } i,t\Big\} \subset G(X),$$

$$L_0(X) = \Big\{\widehat{\mu} = \sum_{i,t} a_{i,t} X_{i,t};\; a_{i,t} \in \mathbb{R} \text{ for all } i,t \text{ and } E[\widehat{\mu}] = \mu_0\Big\} \subset G(X).$$

The first subset L(X, 1) includes the constants which will imply unbiasedness of the estimators, whereas in the second case L_0(X) we need to enforce unbiasedness by a side constraint.
Definition 8.15 (inhomogeneous and homogeneous credibility estimator). We assume that the BS Model 8.13 is fulfilled with collective mean μ0 ∈ R.
The inhomogeneous (linear) credibility estimator of μ(Θ_i) based on X_1, ..., X_I is defined by

$$\widehat{\widehat{\mu(\Theta_i)}} = \underset{\widehat{\mu}\in L(X,1)}{\arg\min}\; E\left[\big(\widehat{\mu} - \mu(\Theta_i)\big)^2\right].$$

The homogeneous (linear) credibility estimator of μ(Θ_i) based on X_1, ..., X_I is defined by

$$\widehat{\widehat{\mu(\Theta_i)}}^{\rm hom} = \underset{\widehat{\mu}\in L_0(X)}{\arg\min}\; E\left[\big(\widehat{\mu} - \mu(\Theta_i)\big)^2\right].$$
Remark 8.16. The inhomogeneous credibility estimator is the best approximation to μ(Θ_i) (in the L²-sense) among all linear estimators given by L(X, 1). Because L(X, 1) is a subset of G(X), we immediately obtain for the mean square error with the Pythagorean theorem for successive orthogonal projections

$$E\left[\Big(\widehat{\widehat{\mu(\Theta_i)}} - \mu(\Theta_i)\Big)^2\right] = E\left[\Big(\widehat{\mu(\Theta_i)} - \mu(\Theta_i)\Big)^2\right] + E\left[\Big(\widehat{\widehat{\mu(\Theta_i)}} - \widehat{\mu(\Theta_i)}\Big)^2\right]. \tag{8.7}$$

The left-hand side describes the error of the inhomogeneous credibility estimator which can be split (right-hand side) into the error of the (best) Bayesian estimator and the approximation error of the inhomogeneous credibility estimator to the Bayesian estimator, see Theorem 3.14 in Bühlmann-Gisler [24]. In a similar spirit the homogeneous credibility estimator is the best approximation to the Bayesian estimator and to μ(Θ_i) within L_0(X) (here we additionally use unbiasedness).

Theorem 8.17 (inhomogeneous and homogeneous BS estimator). We assume that the BS Model 8.13 is fulfilled with parameters μ0, σ² and τ² given by (8.3)-(8.5). The inhomogeneous credibility estimator is given by

$$\widehat{\widehat{\mu(\Theta_i)}} = \alpha_{i,T}\,\widehat{X}_{i,1:T} + (1-\alpha_{i,T})\,\mu_0,$$

with credibility weight α_{i,T} and observation based estimator X̂_{i,1:T}

$$\alpha_{i,T} = \frac{\sum_{t=1}^T w_{i,t}}{\sum_{t=1}^T w_{i,t} + \frac{\sigma^2}{\tau^2}} \qquad\text{and}\qquad \widehat{X}_{i,1:T} = \frac{1}{\sum_{t=1}^T w_{i,t}}\sum_{t=1}^T w_{i,t} X_{i,t}.$$

The homogeneous credibility estimator is given by

$$\widehat{\widehat{\mu(\Theta_i)}}^{\rm hom} = \alpha_{i,T}\,\widehat{X}_{i,1:T} + (1-\alpha_{i,T})\,\widehat{\mu}_T,$$

with estimate

$$\widehat{\mu}_T = \frac{1}{\sum_{i=1}^I \alpha_{i,T}}\sum_{i=1}^I \alpha_{i,T}\,\widehat{X}_{i,1:T}.$$

Proof of Theorem 8.17. The theorem can be proved by brute force doing convex optimization (using the method of Lagrange in the latter case) or we can apply Hilbert space techniques using projection properties, see Chapters 3 and 4 in Bühlmann-Gisler [24]. We do the brute force calculation because it is quite straightforward. We minimize

$$h(a) = E\left[\Big(a_0 + \sum_{l,t} a_{l,t} X_{l,t} - \mu(\Theta_i)\Big)^2\right]$$

over all possible choices a_0, a_{l,t} ∈ R. This requires that we calculate all derivatives w.r.t. these parameters and set them equal to zero:

$$\frac{\partial}{\partial a_0}\,h(a) = 2\,E\left[a_0 + \sum_{l,t} a_{l,t} X_{l,t} - \mu(\Theta_i)\right] \overset{!}{=} 0, \tag{8.8}$$

$$\frac{\partial}{\partial a_{j,s}}\,h(a) = 2\,E\left[X_{j,s}\Big(a_0 + \sum_{l,t} a_{l,t} X_{l,t} - \mu(\Theta_i)\Big)\right] \overset{!}{=} 0. \tag{8.9}$$

Equation (8.8) immediately implies unbiasedness of the inhomogeneous credibility estimator, and moreover that

$$a_0 = \mu_0\Big(1 - \sum_{l,t} a_{l,t}\Big).$$

Plugging this into (8.9) and using (8.8) once more immediately gives for all j, s the requirement

$${\rm Cov}\Big(X_{j,s},\,\sum_{l,t} a_{l,t} X_{l,t} - \mu(\Theta_i)\Big) \overset{!}{=} 0.$$

Using the uncorrelatedness between different risk classes (which is implied by the independence) we obtain the following (normal) equations, see Corollary 3.17 and Section 4.3 in Bühlmann-Gisler [24],

$$a_0 = \mu_0\Big(1 - \sum_{l,t} a_{l,t}\Big), \tag{8.10}$$

$${\rm Cov}(X_{j,s}, \mu(\Theta_i)) = \sum_{t=1}^T a_{j,t}\,{\rm Cov}(X_{j,s}, X_{j,t}) \qquad\text{for all } j, s. \tag{8.11}$$

We calculate these last covariance terms

$${\rm Cov}(X_{j,s}, X_{j,t}) = E[{\rm Cov}(X_{j,s}, X_{j,t}|\Theta_j)] + {\rm Cov}(E[X_{j,s}|\Theta_j], E[X_{j,t}|\Theta_j]) = E[\sigma^2(\Theta_j)]\,\frac{1}{w_{j,s}}\,1_{\{t=s\}} + {\rm Var}(\mu(\Theta_j)) = \frac{\sigma^2}{w_{j,s}}\,1_{\{t=s\}} + \tau^2 > 0.$$

The first covariance term is given by

$${\rm Cov}(X_{j,s}, \mu(\Theta_i)) = {\rm Var}(\mu(\Theta_i))\,1_{\{j=i\}} = \tau^2\,1_{\{j=i\}}.$$

This implies that the left-hand side of (8.11) is equal to 0 for j ≠ i, and because Cov(X_{j,s}, X_{j,t}) ≥ τ² > 0 it follows that a_{j,s} = 0 for all j ≠ i. This is not surprising because we have assumed that the different risk classes are independent. Therefore (8.10)-(8.11) reduce to

$$a_0 = \mu_0\Big(1 - \sum_{t=1}^T a_{i,t}\Big) \overset{\rm def.}{=} \mu_0\,(1-\alpha_{i,T}), \tag{8.12}$$

$$\tau^2 = \sum_{t=1}^T a_{i,t}\Big(\frac{\sigma^2}{w_{i,s}}\,1_{\{t=s\}} + \tau^2\Big) = a_{i,s}\,\frac{\sigma^2}{w_{i,s}} + \tau^2\,\alpha_{i,T} \qquad\text{for all } s. \tag{8.13}$$

This defines α_{i,T} = Σ_{t=1}^T a_{i,t} and we still need to see that this credibility weight has the claimed form. Requirement (8.13) implies for all s

$$a_{i,s} = \frac{\tau^2}{\sigma^2}\,(1-\alpha_{i,T})\,w_{i,s}.$$

If we sum this over s we obtain

$$\alpha_{i,T} = \sum_{s=1}^T a_{i,s} = \frac{\tau^2}{\sigma^2}\,(1-\alpha_{i,T})\sum_{s=1}^T w_{i,s}.$$

Solving this for α_{i,T} gives the following credibility weights

$$\alpha_{i,T} = \frac{\frac{\tau^2}{\sigma^2}\sum_{s=1}^T w_{i,s}}{\frac{\tau^2}{\sigma^2}\sum_{s=1}^T w_{i,s} + 1} = \frac{\sum_{t=1}^T w_{i,t}}{\sum_{t=1}^T w_{i,t} + \frac{\sigma^2}{\tau^2}},$$

and the a_{i,s} are given by

$$a_{i,s} = \frac{\tau^2}{\sigma^2}\,(1-\alpha_{i,T})\,w_{i,s} = \alpha_{i,T}\,\frac{w_{i,s}}{\sum_{t=1}^T w_{i,t}}.$$

If we collect all the terms we have found the following inhomogeneous credibility estimator

$$\widehat{\widehat{\mu(\Theta_i)}} = \alpha_{i,T}\,\frac{1}{\sum_{t=1}^T w_{i,t}}\sum_{s=1}^T w_{i,s} X_{i,s} + (1-\alpha_{i,T})\,\mu_0 = \alpha_{i,T}\,\widehat{X}_{i,1:T} + (1-\alpha_{i,T})\,\mu_0.$$

This proves the first claim, and an important observation is that this credibility estimator is unbiased for μ0. Therefore, it coincides with the estimator if we would have projected onto

$$L_0(X,1) = L(X,1) \cap \big\{\widehat{\mu}\in L^2(\Omega,\mathcal{F},P):\; E[\widehat{\mu}] = \mu_0\big\}.$$

The proof of the homogeneous credibility estimator goes along the same lines as the inhomogeneous one, using the method of Lagrange for replacing (8.8) by the side constraint

$$\mu_0 = E[\widehat{\mu}] = E\Big[\sum_{i,t} a_{i,t} X_{i,t}\Big] = \sum_{i,t} a_{i,t}\,E[X_{i,t}] = \sum_{i,t} a_{i,t}\,\mu_0,$$

which implies Σ_{i,t} a_{i,t} = 1. An alternative proof would go by using the iterative property and the linearity of orthogonal projections on subspaces. For details we refer to Section 4.6 in Bühlmann-Gisler [24]. This closes the proof of Theorem 8.17. ∎

Remarks 8.18 (interpretation of the BS formula of Theorem 8.17).
- The BS formula provides the best linear approximations to the true premium μ(Θ_i) and to the Bayesian estimator (8.6) in the L²-sense, see also (8.7).
- The inhomogeneous and the homogeneous credibility estimators are somewhat different which may also lead to different interpretations.
- For the inhomogeneous credibility estimator we assume that there is prior knowledge on μ(Θ_i) in the form of the prior mean parameter μ0. This prior knowledge has uncertainty described by the variance parameter τ² and the resulting estimator is the classical credibility weighted average between the portfolio experience X_i and the prior knowledge μ0 which leads to the credibility weights α_{i,T}. To calculate this estimator it is sufficient to have one risk class only.
- The homogeneous credibility estimator can be interpreted as the modified version of the inhomogeneous one if we do not have prior knowledge. In this case we extract additional information from similar portfolios. That is, we consider all risk classes simultaneously to obtain μ̂_T which replaces the prior knowledge μ0. The precision that is given to this overall knowledge μ̂_T depends on the volatility between the risk classes, i.e. on the significance of particular observations.

- The so-called credibility coefficient is defined by, see Bühlmann-Gisler [24], page 84,

$$\kappa = \frac{\sigma^2}{\tau^2}. \tag{8.14}$$

It describes the ratio of the volatilities within risk classes and between risk classes. This is the crucial ratio that determines the credibility weights

$$\alpha_{i,T} = \frac{\sum_{t=1}^T w_{i,t}}{\sum_{t=1}^T w_{i,t} + \kappa}.$$

This latter case can now be used for tariffication of risk factors on different risk classes, similar to the GLM Chapter 7. The overall premium is given by μ̂_T, the experience of risk class i is given by X̂_{i,1:T} and the credibility weight α_{i,T} ∈ (0, 1) explains how this information needs to be combined to obtain the risk adjusted premium of risk class i.

8.2.3 Estimation of structural parameters
In order to apply the credibility estimators there remains the specification of the structural parameters σ² and τ². We make the same choice as in Bühlmann-Gisler [24]. We define the sample estimators of risk class i

$$\widehat{s}_i^{\,2} = \frac{1}{T-1}\sum_{t=1}^T w_{i,t}\Big(X_{i,t} - \widehat{X}_{i,1:T}\Big)^2.$$

A straightforward calculation shows that this is an unbiased estimator for σ²(Θ_i), conditionally given Θ_i. But this immediately implies that ŝ_i² is an unbiased estimator for σ² for all i. Therefore, we set

$$\widehat{\sigma}_T^2 = \frac{1}{I}\sum_{i=1}^I \widehat{s}_i^{\,2}, \tag{8.15}$$

with E[σ̂_T²] = σ². Observe that one risk class is sufficient to get an estimate for σ² if T > 1.
If we have prior knowledge μ0 then τ² should be calibrated such that it quantifies the reliability of this prior knowledge. If we use the homogeneous credibility estimator then τ² is estimated from the volatility between the risk classes (here we need more than one risk class i). We define the weighted sample mean over all observations
$$\bar{X} = \frac{1}{\sum_{i,t} w_{i,t}}\sum_{i,t} w_{i,t} X_{i,t} = \frac{1}{\sum_{i,t} w_{i,t}}\sum_i \Big(\sum_t w_{i,t}\Big)\widehat{X}_{i,1:T}.$$

In analogy to (2.7) we define

$$\widehat{v}_T^2 = \frac{I}{I-1}\sum_i \frac{\sum_t w_{i,t}}{\sum_{j,s} w_{j,s}}\Big(\widehat{X}_{i,1:T} - \bar{X}\Big)^2.$$

Similar to Lemma 2.29 we can calculate the expected value of v̂_T² which then shows that we need to define

$$\widehat{t}_T^{\,2} = c_w\left(\widehat{v}_T^2 - \frac{I\,\widehat{\sigma}_T^2}{\sum_{j,s} w_{j,s}}\right),$$

with constant

$$c_w = \left[\frac{I}{I-1}\sum_i \frac{\sum_t w_{i,t}}{\sum_{j,s} w_{j,s}}\left(1 - \frac{\sum_t w_{i,t}}{\sum_{j,s} w_{j,s}}\right)\right]^{-1}.$$

This estimator has the unbiasedness property E[t̂_T²] = τ², we refer to Section 4.8 in Bühlmann-Gisler [24]. The only difficulty is that it might become negative which, of course, is nonsense for estimating τ² ≥ 0. Therefore, we set for the final estimator

$$\widehat{\tau}_T^2 = \max\big\{\widehat{t}_T^{\,2},\,0\big\}. \tag{8.16}$$
Example 8.19. We do Exercise 4.1 of Bühlmann-Gisler [24]. We have I = 5 risk classes and for every risk class we have T = 5 observations. The data is provided in Table 8.1. We have claims S_{i,t} and corresponding numbers of policies v_{i,t}. In order to apply the BS model we choose volumes w_{i,t} = v_{i,t}, i.e. the volumes w_{i,t} are determined by the number of policies in the corresponding cell (i, t), and we define the claims ratios X_{i,t} = S_{i,t}/v_{i,t}. Our aim is to apply the BS model to (X_{i,t})_{i,t}. Observe that the application of the BS model is motivated by the fact that some cells have small volumes and volatile claims ratios. Therefore, Bayesian methods are applied to smooth the calculated premia.
We would like to calculate the homogeneous credibility estimators of Theorem 8.17 for the claims ratios of the risk classes i = 1, ..., 5. Therefore, we first need to estimate the structural parameters. With formulas (8.15) and (8.16) we obtain σ̂_T² = 261.2 and τ̂_T² = 0.1021. This gives the estimated credibility coefficient κ̂_T = σ̂_T²/τ̂_T² = 2558 and from this we can estimate the credibility weights α_{i,T}. The estimates are provided in Table 8.2. We see that in risk class 4 we have big volumes v_{4,t} which results in a high credibility weight estimate of α̂_{4,T} = 87.8%. In risk class 5 we have small volumes v_{5,t} which results in a low credibility weight estimate of α̂_{5,T} = 45.2%.
                       t = 1     t = 2     t = 3     t = 4     t = 5
risk class 1  v_{1,t}    729       786       872       951      1019
              S_{1,t}    583      1100       262       837      1630
              X_{1,t}  80.0%    139.9%     30.0%     88.0%    160.0%
risk class 2  v_{2,t}   1631      1802      2090      2300      2368
              S_{2,t}     99      1298       326       463       895
              X_{2,t}   6.1%     72.0%     15.6%     20.1%     37.8%
risk class 3  v_{3,t}    796       827       874       917       944
              S_{3,t}   1433       496       699      1742      1038
              X_{3,t} 180.0%     60.0%     80.0%    190.0%    110.0%
risk class 4  v_{4,t}   3152      3454      3715      3859      4198
              S_{4,t}   1765      4145      3121      4129      3358
              X_{4,t}  56.0%    120.0%     84.0%    107.0%     80.0%
risk class 5  v_{5,t}    400       420       422       424       440
              S_{5,t}     40         0       169      1018        44
              X_{5,t}  10.0%      0.0%     40.0%    240.1%     10.0%

Table 8.1: Observed claims S_{i,t} and corresponding numbers of policies v_{i,t}.

                        risk class 1  risk class 2  risk class 3  risk class 4  risk class 5
α̂_{i,T}                       63.0%         79.9%         63.0%         87.8%         45.2%
X̂_{i,1:T}                    101.3%         30.2%        124.1%         89.9%         60.4%
hom. cred. estimator          93.5%         40.3%        107.9%         88.7%         71.3%

Table 8.2: Estimated credibility weights α̂_{i,T}, observation based estimates X̂_{i,1:T} and homogeneous credibility estimates of the claims ratio at time T = 5.
From this we calculate the credibility weighted overall claims ratio μ̂_T = 80.4% (which should be compared to the sample mean X̄ = 77.9%) and from this we finally calculate the homogeneous credibility estimators for the claims ratios, see Table 8.2. We observe smoothing of X̂_{i,1:T} towards μ̂_T according to the credibility weights α̂_{i,T}.
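The whole calculation of Example 8.19 is easily reproduced. The following Python sketch (not part of the original text) implements the structural parameter estimators (8.15)-(8.16) and the homogeneous BS estimator of Theorem 8.17 for the data of Table 8.1; up to rounding it returns σ̂² ≈ 261.2, τ̂² ≈ 0.1021 and the last row of Table 8.2:

```python
import numpy as np

# data of Table 8.1: rows are risk classes i, columns are years t
v = np.array([[ 729,  786,  872,  951, 1019],
              [1631, 1802, 2090, 2300, 2368],
              [ 796,  827,  874,  917,  944],
              [3152, 3454, 3715, 3859, 4198],
              [ 400,  420,  422,  424,  440]], dtype=float)
S = np.array([[ 583, 1100,  262,  837, 1630],
              [  99, 1298,  326,  463,  895],
              [1433,  496,  699, 1742, 1038],
              [1765, 4145, 3121, 4129, 3358],
              [  40,    0,  169, 1018,   44]], dtype=float)
X = S / v                                   # claims ratios X_{i,t}
I, T = X.shape
w_i = v.sum(axis=1)                         # volumes per risk class
X_i = (v * X).sum(axis=1) / w_i             # weighted means X_hat_{i,1:T}

# sigma^2 estimator (8.15)
s2_i = (v * (X - X_i[:, None])**2).sum(axis=1) / (T - 1)
sigma2 = s2_i.mean()                        # approx 261.2

# tau^2 estimator (8.16)
w = w_i.sum()
X_bar = (w_i * X_i).sum() / w
v2 = I / (I - 1) * (w_i / w * (X_i - X_bar)**2).sum()
c_w = 1 / (I / (I - 1) * (w_i / w * (1 - w_i / w)).sum())
tau2 = max(c_w * (v2 - I * sigma2 / w), 0)  # approx 0.1021

# homogeneous BS estimator of Theorem 8.17
alpha = w_i / (w_i + sigma2 / tau2)         # credibility weights
mu_T = (alpha * X_i).sum() / alpha.sum()    # approx 80.4%
print(alpha * X_i + (1 - alpha) * mu_T)     # last row of Table 8.2
```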


Exercise 23.
(a) Choose the data of Table 8.1 and calculate the inhomogeneous credibility estimators of μ(Θ_i) for the claims ratios under the assumption that the collective mean is given by μ0 = 90% and the variance between risk classes is given by τ² = 0.20.
(b) What changes if the variance between risk classes is given by τ² = 0.05?

8.2.4 Prediction error in the Bühlmann-Straub model

Observe that the credibility estimator of Theorem 8.17 is used to estimate μ(Θ_i) and to predict next year's claim X_{i,T+1}. Similar to (1.9) we can analyze the total prediction error

$$X_{i,T+1} - \widehat{\widehat{\mu(\Theta_i)}} = \big(X_{i,T+1} - \mu(\Theta_i)\big) + \Big(\mu(\Theta_i) - \widehat{\widehat{\mu(\Theta_i)}}\Big).$$

If we assume that X_{i,1}, ..., X_{i,T+1} are independent, conditionally given Θ_i, then we obtain from unbiasedness

$$E\left[\Big(X_{i,T+1} - \widehat{\widehat{\mu(\Theta_i)}}\Big)^2\right] = E\left[\big(X_{i,T+1} - \mu(\Theta_i)\big)^2\right] + E\left[\Big(\mu(\Theta_i) - \widehat{\widehat{\mu(\Theta_i)}}\Big)^2\right]$$
$$= E[{\rm Var}(X_{i,T+1}|\Theta_i)] + E\left[\Big(\mu(\Theta_i) - \widehat{\widehat{\mu(\Theta_i)}}\Big)^2\right] = \frac{\sigma^2}{w_{i,T+1}} + (1-\alpha_{i,T})\,\tau^2, \tag{8.17}$$

see Theorem 4.3 in Bühlmann-Gisler [24]. The first term on the right-hand side of (8.17) is called process variance and the second term parameter uncertainty. Similarly we obtain for the homogeneous credibility estimator, see Theorem 4.6 in Bühlmann-Gisler [24],

$$E\left[\Big(X_{i,T+1} - \widehat{\widehat{\mu(\Theta_i)}}^{\rm hom}\Big)^2\right] = \frac{\sigma^2}{w_{i,T+1}} + (1-\alpha_{i,T})\,\tau^2\left(1 + \frac{1-\alpha_{i,T}}{\sum_i \alpha_{i,T}}\right). \tag{8.18}$$

The expressions in (8.17) and (8.18) are called mean square error of prediction (MSEP). We will come back to this notion in Section 9.3 and for a comprehensive treatment we refer to Section 3.1 in Wüthrich-Merz [100].
Exercise 24. Estimate the prediction uncertainty $E[(X_{i,T+1} - \widehat{\widehat{\mu(\Theta_i)}}^{\rm hom})^2]$ for the data of Example 8.19 under the assumption that the volume grows by 5% in each risk class.

Exercise 25. We consider Example 4.1 of Bühlmann-Gisler [24]. The observed numbers of policies v_i and claims counts N_i in 21 different regions are given in Table 8.3.

region i       v_i      N_i
   1         50061     3880
   2         10135      794
   3        121310     8941
   4         35045     3448
   5         19720     1672
   6         39092     5186
   7          4192      314
   8         19635     1934
   9         21618     2285
  10         34332     2689
  11         11105      661
  12         56590     4878
  13         13551     1205
  14         19139     1646
  15         10242      850
  16         28137     2229
  17         33846     3389
  18         61573     5937
  19         17067     1530
  20          8263      671
  21        148872    15014
total       763525    69153

Table 8.3: Observed volumes v_i and claims counts N_i in regions i = 1, ..., 21.

Calculate the inhomogeneous credibility estimators for each region i under the assumption that N_i|Θ_i has a Poisson distribution with mean μ(Θ_i) v_i = Θ_i λ_0 v_i and E[Θ_i] = 1. The prior frequency parameter is given by λ_0 = 8.8% and the prior uncertainty by τ² = 2.4 · 10⁻⁴.

Hint: For the estimation of the credibility coefficient κ = σ²/τ² one should use that N_i|Θ_i is Poisson distributed, which has direct consequences for the corresponding variance σ²(Θ_i), see also Proposition 2.8.

Example 8.20 (MTPL frequencies). We revisit the MTPL example of Section 7.3.4. In this example we have observed that some of the risk classes m have a rather small volume v_m which, of course, is in favor of applying credibility methods. For this analysis we only consider a risk classification by cantons. This exactly corresponds to the marginal consideration in Figure 7.5 (rhs). We assume that the regional data fulfills the BS model assumptions with w_m = v_m and, henceforth, we can use Theorem 8.17. We do this under the additional assumption of having conditional Poisson distributions for N_m, m ∈ {AG, AI, ..., ZH}. The latter implies for X_m = N_m/v_m that, see also Exercise 25 above,

$$\frac{\sigma^2(\Theta_m)}{w_m} = {\rm Var}(X_m|\Theta_m) = \frac{E[X_m|\Theta_m]}{v_m} = \frac{\mu(\Theta_m)}{v_m}.$$
Figure 8.2: (lhs) MLEs λ̂_m^canton and (rhs) homogeneous credibility estimators for the different cantons m ∈ {AG, AI, ..., ZH}.

Therefore, we have σ²(Θ_m) = μ(Θ_m), note that w_m = v_m, and

$$\sigma^2 = E[\sigma^2(\Theta_m)] = E[\mu(\Theta_m)] = \mu_0,$$

if μ0 > 0 denotes the prior frequency parameter (collective mean). Therefore, we do not need to estimate σ² for given prior frequency parameter μ0, but we only need to estimate the parameter τ² for applying the homogeneous credibility estimator. The additional nice feature about the conditional Poisson model is that this can be done with only one period of observations. Applying the iterative algorithm of Bühlmann-Gisler [24], pages 102-103, we find the estimate τ̂² for τ² and μ̂0 = 7.48% (we have obtained sufficient convergence after 4 iterations). From this we calculate the homogeneous BS credibility estimators

$$\widehat{\widehat{\mu(\Theta_m)}}^{\rm hom} = \alpha_m\,X_m + (1-\alpha_m)\,\widehat{\mu}_0,$$

with credibility weights

$$\alpha_m = \frac{v_m}{v_m + \widehat{\mu}_0/\widehat{\tau}^2}.$$

The resulting credibility weights α_m are within the interval (35%, 98%) depending on having a small or large canton. Remarkable is that the estimate μ̂0 = 7.48% is substantially higher than the sample mean of 7.15%. In Figure 8.2 we present the MLEs λ̂_m^canton = X_m and the homogeneous credibility estimators for the different cantons m ∈ {AG, AI, ..., ZH}. We observe that the MLEs are smoothed out towards the collective mean estimate μ̂0. This applies in particular to small cantons such as m = AI, whereas large cantons are only marginally affected by the collective mean estimate.
This picture could now be further refined using methods from spatial statistics (based on the intuition that neighboring cantons behave more similarly, etc.). This has, for instance, been done in Fringeli [50].


Chapter 9

Claims Reserving

This chapter will give a completely new perspective on non-life insurance business which has not been covered in these notes, yet. Until now we have assumed that the total claim amount for a fixed accounting year can be described by a compound distribution of the form

$$S_t = \sum_{i=1}^{N_t} Y_i,$$

where t = 1, ..., T denote the different accounting years, N_t counts the number of claims in accounting year t and Y_i describes the claim size of claim i. This was the base model for collective risk modeling in Chapter 2, it was used for the study of the surplus process (C_t)_{t∈N_0} in Chapter 5 (see Definition 5.1) and it was also the base assumption for parameter estimation (based on past claims experience) for the prediction of future claims. This model suggests that we have N_t claims in accounting year t and their claim sizes Y_1, ..., Y_{N_t} describe the total payouts to the insured. The main issue in practice is that a typical non-life insurance claim cannot be settled immediately at occurrence. That is, if Y_i describes the claim amount of claim i = 1, ..., N_t in accounting year t then, in general, this claim amount is not observable at time t due to a settlement delay that allows for a final assessment of the claim only later. Likewise N_t is not observable at the end of accounting year t because there might be claims that have occurred in year t but which are reported only later. We describe reasons for such reporting and settlement delays in the next section. As a consequence, in order to have a sound basis for pricing future insurance contracts, we need to predict future cash flows of claims that have occurred in the past and are only settled in the future. This task is exactly known as the claims reserving problem and it assesses the outstanding loss liabilities for past claims. The predictions of these outstanding loss liabilities constitute the claims reserves. Importantly, these claims reserves typically are the largest position on the liability side of the balance sheet of a non-life insurance company, see Table 9.1, and are crucial for the financial strength of the company. Therefore, we aim at describing the claims reserving process in this chapter and we would also like to describe the uncertainties involved.

assets as of 31/12/2013         mio. CHF    liabilities as of 31/12/2013          mio. CHF
debt securities                     6374    claims reserves                           7189
equity securities                   1280    provisions for annuities                  1178
loans & mortgages                   1882    other liabilities and provisions          2481
real estate                          908    share capital                              169
participation                       2101    legal reserve                              951
short term investments               693    free reserve, forwarded gains             1966
other assets                         696
total assets                       13934    total liabilities & equity               13934

Table 9.1: Source: Annual Report 2013 of AXA Versicherungen AG, Switzerland.

9.1 Outstanding loss liabilities

A claim in non-life insurance is triggered by an accident which is an event that


causes (financial) damage covered by an insurance contract. The date of claims
occurrence is called accident date. Typically, time elapses until such a claim is in
the administrative system of the insurance company and is available for statistical analysis. The time lag between the accident date and the registration at the
insurance company is called reporting delay and the date of registration is termed
reporting date.
The reporting delay can be small, say days, but it can also be very large, for
example several years. Reasons for such reporting delays are that claims are not
immediately reported to the insurance company, for instance, a stolen bike is only
reported once it is clear that it will not reappear, but of course the accident
date is the day the bike was stolen. Large reporting delays are typically caused by
claims which are not immediately noticed. A common example is an asbestos claim
which is typically caused a long time before cancer is diagnosed and reported. The
accident date refers to the event when there was contact with asbestos, the trigger
of the cancer, and not to the date of the breakout of the asbestos disease.
Once a claim is reported to the insurance company it typically cannot be settled
immediately. The insurance company starts an investigation, monitors the recovery
process, waits for external information, external bills, court decisions, etc. This
process may last for several years for more involved claims. Of course, the insurance
company cannot wait with claims payments until there is a final assessment of the
claim but it will continuously pay for justified claims benefits. Therefore, insurance
claims trigger a whole sequence of cash flows after the reporting date. This period
is called settlement period and the final assessment of a claim is called settlement
date or closing date.
Thus, we have three important (ordered) dates for non-life insurance claims:

accident date T_1 ≤ reporting date T_2 ≤ settlement date T_3.
In addition, there are the following two important dates: the beginning of the insurance period U_1 and the ending of the insurance period U_2 > U_1; we always assume U_2 < ∞. Typically, the insurance company is only liable for a claim if T_1 ∈ [U_1, U_2], thus, we only consider claims that have accident dates T_1 that fall into the insurance period [U_1, U_2] specified in the insurance contract.

Figure 9.1: Non-life insurance run-off showing the insurance period [U_1, U_2] and a claim with accident date T_1 ∈ [U_1, U_2], reporting date T_2 > U_2 > T_1 and settlement date T_3 > T_2. Moreover, we have claims payments during the settlement period [T_2, T_3].

If we denote today's time point by t ≥ U_1 we can have four different situations:

1. t < T_1. Such (potential) claims have not yet occurred. If the company is lucky then T_1 > U_2. This means that it is not liable for this particular claim under the current insurance policy because the contract has already terminated at claims occurrence. Be careful, the company may still be liable for this particular claim, namely, if the contract is renewed and T_1 falls into the renewed insurance period, but renewals are not of interest for the present (claims reserving) discussion because they correspond to insurance exposures only sold in the future.


In this first case t < T1 the only information available at the insurance company is the insurance contract signed, i.e. the exposure for which it is liable in
case of a claims occurrence T1 [U1 , U2 ]. Therefore, one often speaks about
unearned premium if the exposure has not yet expired, i.e. if t < U2 .
2. T_1 ≤ t < T_2 and T_1 ∈ [U_1, U_2]. In this case the insurance claim has occurred
but it has not yet been reported to the insurance company. These claims
are called Incurred But Not Yet Reported (IBNYR) claims. For such claims
we do not have any individual claims information (because it is IBNYR) but
we already have external information, like economic environment (e.g. unemployment rate, inflation rate, financial distress), weather conditions and
natural catastrophes (storm, flood, earthquake, etc.), nuclear power accident,
flu epidemic, and so on. This external information already gives us a hint
whether we should expect more or less claims reportings.
3. T_2 ≤ t < T_3 and T_1 ∈ [U_1, U_2]. These claims are reported at the company but
the final assessment is still missing. Typically, we are in the situation of more
and more information becoming available about the claim, i.e. the prediction
uncertainty in the final claim assessment decreases. However, these claims
are not completely resolved and therefore they are called Reported But Not
Settled (RBNS) claims. The settlement period [T2 , T3 ] is also the period
within which claims payments are done, see Figure 9.1.

During the settlement period we receive more and more information about the individual claim, like accident date, cause of accident, type of accident, line-of-business and contracts involved, claims assessments and predictions by claims adjusters, payments already done, etc.

4. T_3 < t and T_1 ∈ [U_1, U_2]. The claim is settled, the file is closed and stored and we
expect no further payments for that claim. Under some circumstances, it
may be necessary that a claim file is re-opened due to unexpected further
claims development. If this happens too often then the files are probably
closed too early and the claims settlement philosophy should be reviewed in
that particular company. If there is a systematic re-opening it may also ask
for a special provision for unexpected re-openings, for example, for contracts
which have a timely unlimited cover for relapses.

To give statistical statements about insurance contracts and claims behavior, insurance companies build homogeneous groups and sub-portfolios to which a LLN
applies. In non-life insurance, contracts are often grouped into different business
lines such as private property, commercial property, private liability, commercial
liability, accident insurance, health insurance, motor third party liability insurance,
motor hull insurance, etc. If this classification is too rough it can further be divided
into sub-portfolios, for example, private property can be divided by hazard categories like fire, water, theft, etc. Often such sub-classes are built by geographical
markets and for different jurisdictions.
Once these (hopefully) homogeneous risk classes are built we study all claims that
belong to such a sub-portfolio. These claims are further categorized by the accident date. Claims that fall into the same accident period are triggered by similar
external factors like weather conditions, economic environment; therefore such a
classification is reasonable. Since the usual time scale for insurance contracts and
business consolidation is years, claims are typically gathered on the yearly time
scale. Therefore, we consider accounting years denoted by k ∈ N. All claims that have accident dates T_1 ∈ [1/1/k, 31/12/k] are called claims with accident year k.
We abbreviate the latter interval by [k, k + 1). These claims generate cash flows
which are also considered on the consolidated yearly level, i.e. all payments that
are done in the same accounting year are aggregated. This motivates the classical
claims reserving notation for fixed i ∈ N and j ∈ N_0:

X_{i,j} = all payments done for claims with accident year i and paid in accounting year k = i + j ∈ N.
Thus, we consider all claims (of a given sub-portfolio) which have accident dates T_1 ∈ [i, i+1) = [1/1/i, 31/12/i], i.e. have the same accident year i. For these claims we consider aggregate cash flows which are further sub-divided by their payment delays denoted by j ∈ N_0 and called development years. For instance,

X_{i,0} = payments in year [i, i+1) for claims with accident year i;
X_{i,1} = payments in year [i+1, i+2) for claims with accident year i;
X_{i,j} = payments in year [i+j, i+j+1) for claims with accident year i.

Moreover, a common assumption is that there is a fixed maximal settlement delay J ∈ N, i.e. X_{i,j} ≡ 0 for all development years j > J. Of course, this maximal settlement delay J depends on the business line considered, typically it is smaller for property insurance and larger for liability insurance. At time t ∈ N, with t > J, this motivates the graphical representation given in Table 9.2 (note that we identify t with 31/12/t).

accident |            development years j
 year i  |    0           1          ...         J
---------+-----------------------------------------------
   1     | X_{1,0}     X_{1,1}       ...      X_{1,J}
  ...    |   ...         ...         ...        ...       <- observations D_t
  t-J    | X_{t-J,0}   X_{t-J,1}     ...      X_{t-J,J}      (upper trapezoid)
  ...    |   ...         ...
  t-1    |   ...                                           <- to be predicted D_t^c
   t     | X_{t,0}                                            (lower triangle)

Table 9.2: Claims development triangle/trapezoid D_t at time t > J.

This table displays three time axes: (1) the accident year axis i ∈ {1, ..., t} (vertical axis); (2) the development year axis j ∈ {0, ..., J} (horizontal axis); and (3) the accounting year axis k = i + j ∈ {1, ..., t + J} (diagonal axis). In claims reserving all three time axes are important: (1) i collects the claims with the same accident year; (2) j describes the payments with the same payment delay (relative to the accident year); and (3) k = i + j describes the payments that are done in the same accounting year (and hence are influenced by the same external factors like inflation). Therefore, we denote the accounting year payments by

$$X_k = \sum_{i+j=k} X_{i,j} = \sum_{i=1\vee(k-J)}^{t\wedge k} X_{i,k-i} = \sum_{j=0\vee(k-t)}^{J\wedge(k-1)} X_{k-j,j}. \tag{9.1}$$

(m

w)

At time t N we are liable for all claims that have occurred in accident years i t.
We call these claims past exposure claims. Some of these past exposure claims are
already settled (if the settlement date T3 t), others belong to either the class of
RBNS claims (if the reporting date T2 t but the settlement date T3 > t) or the
class of IBNYR claims (if the reporting date T2 > t).
On the aggregate level we have the following payment information at time t N
for past exposure claims
Dt = {Xi,j ; i + j t, 1 i t, 0 j J} .

(9.2)

tes

This information exactly corresponds to the upper triangle (if t = J + 1) or the


upper trapezoid (if t > J +1) of Table 9.2. These past exposure claims will generate
cash flows in future accounting years given by
Dtc = {Xi,j ; i + j > t, 1 i t, 0 j J} .

NL

no

This corresponds to the lower triangle in Table 9.2. This lower triangle Dtc is
called outstanding loss liabilities and it is the major object of interest. Namely,
these outstanding loss liabilities constitute the liabilities of the insurance company
originating from past premium exposures. In particular, the company needs to
build appropriate provisions so that it is able to fulfill these future cash flows.
These provisions are called claims reserves and they should satisfy the following
requirements:
the claims reserves should be evaluated such that it considers all relevant
(past) information;
the claims reserves should be a best-estimate for the outstanding loss liabilities adjusted for time value of money.
Basically, this means that we need to predict the lower triangle Dtc based on all
available information Ft Dt at time t. In particular, we need to define a stochastic
model on a filtered probability space (, F, P, F) (i) that allows to incorporate
past information Ft F through a filtration F = (Ft )t ; (ii) that reflects the
characteristics of past observations Dt ; (iii) that is able to predict future payments
of the outstanding loss liabilities Dtc ; and (vi) that is able to attach time values
of money to these future cash flows Xi,j , i + j > t. Of course, this is a rather
ambitious program and we will build such a stochastic model step-by-step.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 9. Claims Reserving

231

For the time-being we skip the task of attaching time values of money to cash flows
and we only consider nominal payments. The total nominal claims payments for
accident year i are given by
J
X

Xi,j =

j=0

Ni
X

Yl = Si .

(9.3)

l=1

(m

w)

Thus, for assessing the total claim amount Si of accounting year i we need to
describe the claims settlement process Xi,0 , . . . , Xi,J . In particular, we need to
predict the (unobserved) future claims cash flows of the outstanding loss liabilities
to quantify the total claim Si of accounting year i. Here, Si is measured on a
nominal basis, therefore we use the symbol = in the above identity (9.3),
see also Wthrich [98]. Moreover, we see that the total claim amount of a fixed
accounting year i is by far more complex than a compound distribution.
We assume that the latest observed accident/accounting year is t = I and we do
all considerations based on this (fixed) accounting year.

R=

X
i+j>I

tes

The (nominal) best-estimate reserves at time t = I > J for past exposure claims
are (under these model assumptions) defined by
X

E [Xi,j | FI ] =

E [Xi,j | FI ] ,

(i,j)IIc

no

where we define the sets I, II and IIc of indexes as follows


I = {1, . . . , I} {0, . . . , J},
II = {(i, j) I; i + j I}

and

IIc = I \ II .

NL

The set IIc exactly corresponds to the lower triangle DIc . (Ft )t0 is a filtration on
(, F, P) assuming that Xi,j is Fi+j -measurable for all (i, j) I.
The best-estimate reserves R are a predictor for the (nominal) outstanding loss
liabilities of past exposure claims at time t = I
X

Xi,j .

(i,j)IIc

This predictor R is based on the available information FI at time I. Often Ft and


Dt are identified, i.e. one assumes that there is no other information available than
the claims themselves.
The next question of interest is the uncertainty in this prediction called prediction
uncertainty. That is, we want to investigate the possible fluctuation of the true
cash flows around their best-estimate reserves. If the confidence interval is narrow,
we can predict the outstanding loss liabilities rather accurately. If we obtain wide
Version April 14, 2016, M.V. Wthrich, ETH Zurich

232

Chapter 9. Claims Reserving

confidence bounds, an additional risk margin is necessary which protects against


possible shortfalls in the outstanding loss liability cash flows. We will discuss this
below.

9.2

Claims reserving algorithms

9.2.1

no

tes

(m

w)

The title of this section contains the term algorithms. Initially, in the insurance industry, actuaries have designed algorithms that enable to determine claims
reserves R. These algorithms should be understood as mechanical guidelines to
obtain claims reserves. Only much later actuaries started to think about stochastic
models underlying these algorithms. In this section we present claims reserving
from this algorithmic point of view and in the next section we present stochastic
models that support these algorithms.
The two most popular algorithms are the so-called chain-ladder (CL) algorithm
and the Bornhuetter-Ferguson (BF) algorithm [16]. These two algorithms take
different viewpoints. The CL algorithm takes the position that the observations DI
are extrapolated into the lower triangle, the BF algorithm takes the position that
the lower triangle DIc is extrapolated independently of the observations DI using
expert knowledge. Depending on the line of business considered and the progress
of the claims development process one or the other method may provide better
predictions. Only actuarial experience may tell which one should be preferred in
which particular situation. Therefore, we are going to present both algorithms
from a rather mechanical point of view, because we cannot provide applied insight
to a given data set.

Chain-ladder algorithm

NL

For the study of the CL algorithm we need to define (nominal) cumulative payments
Ci,j =

j
X

Xi,l .

l=0

That is, we sum all payments Xi,l , l 0, for a fixed accident year i so that
ultimately we obtain Ci,J = Si , if Si denotes the total (nominal) claim amount
that corresponds to accident year i, see also (9.3). We call Ci,J ultimate (nominal)
claim of accident year i.
CL idea. All accident years i {1, . . . , I} behave similarly and for cumulative
payments we have approximately
Ci,j+1 fj Ci,j ,

(9.4)

for given factors fj > 0. These factors fj are called CL factors, age-to-age factors
or link ratios.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 9. Claims Reserving

233

Structure (9.4) immediately provides the intuition for predicting the ultimate claim
Ci,J based on observations DI , namely, choose for every accident year i the observation on the last observed diagonal, that is Ci,Ii , and multiply this observation
with the successive CL factors fIi , . . . , fJ1 .

fbCL

PIj1

X
Ci,j+1 Ij1
Ci,j
Ci,j+1
= Pi=1
=
.
PIj1
Ij1
C
C
C
i,j
i,j
n,j
i=1
n=1
i=1

(9.5)

(m

w)

The remaining difficulty is that, in general, the CL factors fj are not known and,
henceforth, need to be estimated. Assuming that a volume weighted average provides the most reliable results we set in view of (9.4)

tes

This formula (9.5) expresses that we should divide the sums


of observed successive columns by each other which exactly
reflects link ratio structure (9.4). Thus, we calculate a volume weighted average of the individual loss development ratios Fi,j+1 = Ci,j+1 /Ci,j which have been observed in DI . In
Table 9.4 we provide as example the claims reserving example
of Wthrich-Merz [100].
Equipped with these CL factor estimators we predict the ultimate (nominal) claim Ci,J for i + J > I at time t = I by
J1
Y

no

CL
Cbi,J
= Ci,Ii

fbjCL ,

(9.6)

j=Ii

and, in general, we set

n1
Y

CL
Cbi,n
= Ci,Ii

fbjCL ,

j=Ii

NL

for i + n > I.

The CL reserves at time t = I for accident years i > I J are given by

cCL = C
b CL C

R
i,Ii = Ci,Ii
i,J
i

J1
Y

fbjCL 1 ,

j=Ii

and aggregated over all accident years we predict the (nominal) outstanding loss
liabilities of past exposure claims by the CL predictor
cCL =
R

I
X

cCL .
R
i

i=IJ+1

A numerical example is presented in Tables 9.3, 9.4 and 9.5, below.


Version April 14, 2016, M.V. Wthrich, ETH Zurich

Version April 14, 2016, M.V. Wthrich, ETH Zurich


1.4925

fbjCL

206704
67824
152703
132976
230288
104957

tes

207760
151797
262768
190653
273395
244899
225517

62124
36603
65444
88340
105224

development years j
3
4
5
6
65813
52752
53545
43329

7
14850
11186
8924

8
11130
11646

1.0778

9668212
9593162
9245313
8546239
8524114
9013132
8493391
7728169
7648729

1.0229

10563929
10316383
10092366
9268771
9178009
9585897
9056505
8256211

1.0148

10771690
10468180
10355134
9459424
9451404
9830796
9282022

1.0070

10978394
10536004
10507837
9592399
9681692
9935753

1.0051

11040518
10572608
10573282
9680740
9786916

1.0011

11106331
10625360
10626827
9724068

7
11121181
10636546
10635751

1.0010

1.0014

11132310
10648192

w)

(m

development years j
4
5

9
15813

9
11148124

Table 9.4: Observed cumulative payments Ci,j with (i, j) II and estimated CL factors fbjCL .

5946975
6346756
6269090
5863015
5778885
6184793
5600184
5288066
5290793
5675568

1
2
3
4
5
6
7
8
9
10

accident
year i

895717
723222
847053
722532
653894
572765
563114
528043

no

3721237
3246406
2976223
2683224
2745229
2828338
2893207
2440103
2357936

Table 9.3: Observed payments Xi,j with (i, j) II with I = J + 1 = 10.

5946975
6346756
6269090
5863015
5778885
6184793
5600184
5288066
5290793
5675568

NL

1
2
3
4
5
6
7
8
9
10

accident
year i

234
Chapter 9. Claims Reserving

1
2
3
4
5
6
7
8
9
10

accident
year i
2

8470989

8243496
9129696

NL

9837277
10056528
9534279
8674568
8661208
9592313

tes

9419776
8570389
8557190
9477113

10005044
9485469
8630159
8616868
9543206

development years j
4
5

no

8445057
8432051
9338521

9734574
9847906
10067393
9544580
8683940
8670566
9602676

10646884
9744764
9858214
10077931
9554571
8693030
8679642
9612728

8
10663318
10662008
9758606
9872218
10092247
9568143
8705378
8691971
9626383

total

Version April 14, 2016, M.V. Wthrich, ETH Zurich


11653101
11367306
10962965
10616762
11044881
11480700
11413572
11126527
10986548
11618437

1
2
3
4
5
6
7
8
9
10

100.0%
99.9%
99.8%
99.6%
99.1%
98.4%
97.0%
94.8%
88.0%
59.0%

CL
bIi

11148124
10664316
10662749
9761643
9882350
10113777
9623328
8830301
8967375
10443953

BF
bi,J
C

11148124
10663318
10662008
9758606
9872218
10092247
9568143
8705378
8691971
9626383

CL
bi,J
C

total

(m

15126
26257
34538
85302
156494
286121
449167
1043242
3950815
6047061

CL reserves
bCL
R
i

w)

16124
26998
37575
95434
178024
341305
574089
1318646
4768384
7356580

BF reserves
bBF
R
i

Table 9.6: Claims reserves from the BF method and the CL method.

prior
estimate
bi

accident
year i

CL
cCL .
Table 9.5: CL predicted cumulative payments Cbi,j
, (i, j) IIc , and estimated CL reserves R
i

0
0
15126
26257
34538
85302
156494
286121
449167
1043242
3950815
6047061

bCL
R
i

Chapter 9. Claims Reserving


235

236

9.2.2

Chapter 9. Claims Reserving

Bornhuetter-Ferguson algorithm

w)

The Ronald Bornhuetter and Ronald E. Ferguson (BF)


method [16] is based on the assumption of having prior information b i for the expected ultimate claim of accident year i. This
prior information then allows to predict DIc as soon as we have a
so-called claims development pattern (j )j=0,...,J which describes
the proportions paid in each development year. Thus, the BF
method extrapolates prior knowledge b i into the lower triangle
R. Bornhuetter
DIc by using a development pattern (j )j=0,...,J .

(m

BF idea. All accident years i {1, . . . , I} behave similarly and payments approximately behave as
Xi,j j b i ,
(9.7)
for given prior information b i and given development pattern (j )j=0,...,J with norP
malization Jj=0 j = 1.

bCL
j

no

tes

The prior value b i should reflect an estimate for the total expected
ultimate claim E[Ci,J ] of accounting year i. It is assumed that this
prior value is given externally by expert opinion which, in theory,
should not be based on DI . There only remains to estimate the
development pattern (j )j . In view of the CL method, one defines
the following estimates for the development pattern:
=

J1
Y

l=j

fblCL

Qj1 bCL
l=0 fl
QJ1 bCL .
l=0

R.E. Ferguson

fl

NL

This ratio exactly reflects the proportion paid after the first j development periods
(according to the estimated CL pattern). Therefore, we define estimates
b0CL = b0CL ,
CL
bjCL = bjCL bj1

bJCL

for j = 1, . . . , J 1,

1 bCL

J1 .

Equipped with these estimates we predict the ultimate claim Ci,J for i + J > I in
the BF method by
J
X

BF
Cbi,J
= Ci,Ii + b i

CL
.
bjCL = Ci,Ii + b i 1 bIi

j=Ii+1

The BF reserves at time t = I for accident years i > I J are given by


cBF =
bi
R
i

J
X

CL
bjCL = b i 1 bIi
,

j=Ii+1

Version April 14, 2016, M.V. Wthrich, ETH Zurich

(9.8)

Chapter 9. Claims Reserving

237

and aggregated over all accident years we predict the (nominal)


outstanding loss liabilities of past exposure claims by
cBF =
R

I
X

cBF =
R
i

b i bjCL .

X
(i,j)IIc

i=IJ+1

An example is provided in Table 9.6.

fbjCL 1

j=Ii

This gives the following comparison




J1
Y

j=Ii

1
.

fbjCL

(m

CL
Cbi,J
= Ci,Ii + Ci,Ii

J1
Y

w)

We conclude this section with a comparison of the BF and CL


predictors. Therefore, we modify CL formula (9.6) as follows

CL
CL
CL
Cbi,J
= Ci,Ii + 1 bIi
Cbi,J
,

tes

BF
CL
Cbi,J
= Ci,Ii + 1 bIi
b i .

Stochastic claims reserving methods

NL

9.3

no

Thus, we see that we have the same structure. The only difference is that for
the BF method we use the external prior estimate b i for the ultimate claim and
CL
. Therefore, we have two
in the CL method the observation based estimate Cbi,J
complementing prediction positions, which exactly gives the explanation mentioned
in the introduction to Section 9.2. For further remarks (also detailed remarks on
the example in Tables 9.3-9.6) we refer to Wthrich-Merz [100].

In the previous section we have presented algorithms for the calculation of the
claims reserves R. Of course, we should also estimate the precision of these preP
dictions, i.e. by how much the true payouts (i,j)IIc Xi,j may deviate from these
predictions R, see also (1.9) and Smith-Thaper [93]. This brings us back to the
notion of risk measures of Section 6.2.4. In claims reserving, the most popular
risk measure is the conditional mean square error of prediction (MSEP) because it
c is a D can be calculated or estimated explicitly in many examples. Assume X
I
measurable predictor for the random variable X. The conditional MSEP is defined
by
msepX|DI


c
X

=E




2
c

X DI .

Version April 14, 2016, M.V. Wthrich, ETH Zurich

(9.9)

238

Chapter 9. Claims Reserving

The conditional MSEP is an L2 -distance measure. This conditional MSEP can


be decoupled into two parts, the so-called process uncertainty and the parameter
estimation error as follows, see also (1.9),
c = Var (X| D ) + E [ X| D ] X
c 2.
msepX|DI X
I
I


(9.10)

(m

w)

If all parameters are known and if we can calculate E [ X| DI ] explicitly then we


c = E [ X| D ] because this minimizes the conditional MSEP in (9.10).
should set X
I
In any other case we try to estimate E [ X| DI ] as accurately as possible and then
we try to determine the possible sources of parameter error and uncertainty in this
estimation. In order to analyze this prediction uncertainty we need to put the
claims reserving algorithms into a stochastic framework.
For the CL method there are different stochastic models that provide the CL reserves as predictors:
distribution-free CL model of Thomas Mack [73],

tes

over-dispersed Poisson (ODP) model of Renshaw and Verrall


[85] and of England and Verrall [42] with MLE parameter estimates,
Bayesian CL model of Gisler and Wthrich [55] and of Bhlmann, T. Mack
De Felice, Gisler, Moriconi and Wthrich [23].

NL

no

Macks distribution-free CL model [73] is probably the most popular stochastic


claims reserving model. It is straightforward from a stochastic point of view and
it is easy to implement. The crucial contribution by Mack was the derivation
of an estimate for the parameter estimation error term. In the present text we
do not consider Macks distribution-free CL model, but we provide the gammagamma Bayesian CL model in detail. This model belongs to the family of Bayesian
CL models for which the conditional MSEP can be calculated explicitly. We will
compare the conditional MSEP formula of the gamma-gamma Bayesian CL model
to the famous Mack formula.
For the BF method there are different approaches such as:
BF ODP model of Alai, Merz and Wthrich [3, 4],
BF model of Mack [74],
BF model of Saluz, Gisler and Wthrich [90],
Bayesian BF model of England, Verrall and Wthrich [43].
Some of these models also use estimates of j different from the ones previously
suggested. In the present text we are not going to consider stochastic models for
the BF method and we refer to specialized lectures on stochastic claims reserving.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 9. Claims Reserving

9.3.1

239

Gamma-gamma Bayesian CL model

In this section we consider an explicit distributional Bayesian model that belongs


to the exponential dispersion family with conjugate priors. The advantage of such
an explicit distributional model is that we can calculate the posterior distribution
analytically. This allows us to determine the quantities of interest in closed form.
Model Assumptions 9.1 (gamma-gamma Bayesian CL model). Assume that
j > 0, j = 0, . . . , J 1, are given fixed constants.

w)

(a) Conditionally, given vector = (0 , . . . , J1 ), (Ci,j )j=0,...,J are independent


(in i) Markov processes (in j) with conditional distributions


(m

Ci,j+1 |Ci,j , Ci,j j2 , j j2 .

(b) j are independent and (j , fj (j 1))-distributed with given prior parameters fj > 0 and j > 1 for j = 0, . . . , J 1.
(c) and C1,0 , . . . , CI,0 are independent and Ci,0 > 0, P-a.s., for all i = 1, . . . , I.

tes

For given parameters we have conditional means

E [ Ci,j+1 | Ci,j , ] = 1
j Ci,j .

no

From this we see that 1


j plays the role of the CL factor introduced in (9.4). We
have
h
i
1
E 1
=
fj (j 1) = fj .
j
j 1
This explains the choices of the prior parameters of the distribution of j : fj
corresponds to the prior mean of 1
j and j is used to calibrate prior uncertainty.
For the conditional variance we have

NL

Var (Ci,j+1 | Ci,j , ) = Ci,j j2 2


j .

(9.11)

The joint likelihood function of observations DI and parameters is given by




h(DI , ) =

(i,j)II ,j1

j1
2
j1

 Ci,j1
2
j1

Ci,j1
2
j1

g(C1,0 , . . . , CI,0 )


J1
Y
j=0

Ci,j1
1
2
j1

j1
exp 2 Ci,j
j1

Ci,j

(fj (j 1))j j 1
j
exp {j fj (j 1)} .
(j )

g(C1,0 , . . . , CI,0 ) denotes the density of the first column j = 0. Applying Bayes
rule provides for the posterior distribution of , conditionally given DI ,
PIj1

h(|DI )

J1
Y j +

i=1

Ci,j
2
j

PIj1

1 j fj (j 1)+

j=0

Version April 14, 2016, M.V. Wthrich, ETH Zurich

i=1

Ci,j+1
2
j

240

Chapter 9. Claims Reserving

We have just proved the following lemma:


Lemma 9.2. Under Model Assumptions 9.1, the posteriors of 0 , . . . , J1 are
conditionally, given DI , independent with

j |DI j +

Ij1
X
i=1

Ci,j
, fj (j 1) +
j2

Ij1
X
i=1

Ci,j+1
.
j2

w)

Corollary 9.3. Under Model Assumptions 9.1, the posterior Bayesian CL factors
are given by

h
i
def.

bCL + (1 )f ,
fbjBCL = E 1

D
I = j f j
j j
j

PIj1

j = PIj1
i=1

Ci,j
(0, 1).
+ j2 (j 1)

i=1

Ci,j

(m

with CL factor estimate fbjCL given by (9.5) and credibility weight

(9.12)

tes

Proof. The proof is a straightforward application of the gamma distributional properties, namely
"
#
Ij1
X Ci,j+1
 1 
1
=
E j DI
fj (j 1) +
PIj1 Ci,j
j2
j + i=1
2 1
i=1
j

This proves the claim.

Remarks 9.4.

PIj1

Ci,j
j2

fj +

i=1

j 1 +

Ci,j
j2

PIj1
i=1

no

j 1
PIj1
j 1 + i=1

PIj1
i=1

Ci,j
j2

Ci,j+1
j2

PIj1
i=1

Ci,j
j2

.
2

NL

Lemma 9.2 and Corollary 9.3 are the key for the derivation of the reserves.
The result says that in the gamma-gamma Bayesian CL model the Bayesian
CL factors should be estimated by a credibility weighted average between the
classical CL estimate fbjCL and the prior estimate fj with credibility weight
j (0, 1). Moreover, for j 0, we can consider the product of these
estimates fbjBCL due to posterior independence, this will be highlighted in
more detail in Theorem 9.5, below.
The parameter j describes the degree of information contained in the prior
distribution of j . If we let j 1 (non-informative priors) we obtain
j 1. In this case we give full credibility to the observation based estimate,
i.e. we have fbjBCL fbjCL in the non-informative limit j 1.
Observe that the individual development factors Fi,j+1 = Ci,j+1 /Ci,j satisfy
the Bhlmann-Straub (BS) model, see Model 8.13: conditionally given j
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 9. Claims Reserving

241

and C1,j , . . . , CI,j , the Fi,j+1 are independent with


E [ Fi,j+1 | Ci,j , j ] = (j ) = 1
j ,
Var (Fi,j+1 | Ci,j , j ) =

j2 (j )
Ci,j

j2 2
j
Ci,j

(9.13)
.

(9.14)

plays
Thus, Ci,j plays the role of the volume measure and j2 () = j2 2
j
the role of the variance function. We calculate, see (8.4) and (8.5),
1
,
j 2
2 2 j 1
= E[j2 2
.
j ] = j fj
j 2

ej2

w)

j2 = Var((j )) = fj2

j =

(m

This implies for the credibility coefficient, see (8.14),


ej2
= j2 (j 1).
j2

Therefore, Corollary 9.3 provides the classical BS formula and the structure
of the credibility weights is given by, see Theorem 8.17 and (9.12),
PIj1

Ci,j
.
Ci,j + j

tes

i=1
j = PIj1
i=1

no

Note that the BS formula requires j > 2 otherwise the credibility coefficient
j cannot be calculated. However, (9.12) is more general in this sense because
the second prior moment of 1
j does not need to exist for Corollary 9.3.
Theorem 9.5. Under Model Assumptions 9.1, the Bayesian CL predictor for Ci,J
with i + J > I is given by

NL

BCL
Cbi,J
= E [ Ci,J | DI ] = Ci,Ii

J1
Y

fbjBCL .

j=Ii

Proof. We use conditional independence between different accident years, the conditional Markov
property and the tower property to obtain


J1
Y

1
BCL
b
Ci,J
= E [ E [ Ci,J | Ci,0 , . . . , Ci,Ii , ]| DI ] = Ci,Ii E
j DI .

j=Ii
Using the posterior independence of Lemma 9.2 and Corollary 9.3 proves the claim.

Remark 9.6. Theorem 9.5 explains that our Model Assumptions 9.1 give the CL
reserves if we let the prior distributions of 1
become non-informative, i.e. for
j
j 1, j = I i, . . . , J 1, we have
BCL
CL
Cbi,J
.
Cbi,J

Version April 14, 2016, M.V. Wthrich, ETH Zurich

(9.15)

242

Chapter 9. Claims Reserving

For this reason we can use the (non-informative prior) gamma-gamma Bayesian
CL model as a stochastic representation of the CL algorithm (9.6). This analogy
allows to study prediction uncertainty within Model Assumptions 9.1 for the CL
algorithm in an asymptotic sense.
For the conditional MSEP we obtain, see (9.10),


BCL
msepCi,J |DI Cbi,J

BCL
= Var (Ci,J | DI ) + E [ Ci,J | DI ] Cbi,J

2

= Var (Ci,J | DI ) .

j =

(m
w

BCL
This shows the optimality of the Bayesian CL predictor Cbi,J
within our model
assumptions and there remains the calculation of the conditional variance of the
ultimate claim Ci,J . We define (subject to being well-defined)

j2
j2 (j 2) +

PIj1
l=1

Cl,j

Note that j is DI1 -measurable, i.e. observable at time t = I 1.

BCL
msepCi,J |DI Cbi,J

tes

Theorem 9.7. Under Model Assumptions 9.1 the Bayesian CL predictor satisfies
for i > I J
J1
X

BCL
= Cbi,J

j2

no

j=Ii

J1
Y

BCL
Cbi,J

n=j

fbnBCL (1 + n )

2

J1
Y

(1 + j ) 1 ,

j=Ii

NL

In1
Cl,n /n2 > 2 for all I i n
under the additional assumption that n + l=1
J 1; otherwise the second moment is infinite. The aggregated conditional MSEP
is given by

msepP

C |DI
i i,J

BCL
Cbi,J

BCL
msepCi,J |DI Cbi,J

+2

BCL b BCL
Cl,J
Cbi,J

J1
Y

(1 + j ) 1 ,

j=Ii

i<l

where the summations run over I J + 1 i I and I J + 1 i < l I,


respectively.
Proof. We first decouple accident years
!
X
BCL
P
b
Ci,J
= Var
msep
Ci,J |DI
i

X
i

!

X

Ci,J DI
=
Cov (Ci,J , Cl,J | DI ) .

i,l

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 9. Claims Reserving

243

We calculate these covariance terms. Applying the tower property for conditional expectations
implies for i, l > I J
Cov (Ci,J , Cl,J | DI )

= E [ Cov (Ci,J , Cl,J | DI , )| DI ]

(9.16)

+ Cov (E [ Ci,J | DI , ] , E [ Cl,J | DI , ]| DI ) .


We start with the first term on the right-hand side of (9.16). Observe that this term is zero for
i 6= l because of the conditional independence between different accident years. Therefore, we
only need to consider the case i = l > I J. For this case we have, applying the tower property
and using conditional independence and the conditional Markov property,

E [ Var ( Ci,J | Ci,J1 , )| DI , ] + Var ( E [ Ci,J | Ci,J1 , ]| DI , )







1
2

E Ci,J1 J1
2
J1 DI , + Var Ci,J1 J1 DI ,

Ci,Ii

J2
Y

w)

Var (Ci,J | DI , )

2
2
1
J1
2
j
J1 + J1 Var ( Ci,J1 | DI , ) .

(m

j=Ii

Hence, we obtain the well-known recursive formula for the process variance in the CL method
(see Section 3.2.2 in Wthrich-Merz [100]). By iterating the recursion we find for given (see
also Lemma 3.6 in Wthrich-Merz [100])
Var (Ci,J | DI , )

J1
X

= Ci,Ii

j1
Y

2 2
1
m j j

J1
Y

2
n ,

(9.17)

n=j+1

tes

j=Ii m=Ii

where empty products are set equal to 1. Applying operator E[|DI ] to (9.17) and using the
posterior independence of the random variables j we obtain
J1
X

E [ Var (Ci,J | DI , )| DI ] = Ci,Ii

j1
Y

BCL 2
fbm
j

Ci,Ii

no

J1
X

j1
Y

BCL 2
fbm
j

j=Ii m=Ii

BCL
bi,J
C

J1
X

J1
Y

(fbnBCL )2

n=j

n 1 +
n 2 +

Cl,n
2
l=1
n
PIn1 Cl,n
2
l=1
n

PIn1


fbnBCL (1 + n ) .

n=j

NL

j=Ii

j2

J1
Y



DI
E 2
n

n=j

j=Ii m=Ii

J1
Y

PIn1
Note that in the second step we need n + l=1 Cl,n /n2 > 2 for all I i n J 1 so
that these conditional expectations are finite. For the second term in (9.16) we have, w.l.o.g. we
assume l i > I J,


J1
J1
Y
Y

1
1

Cov (E [ Ci,J | DI , ] , E [ Cl,J | DI , ]| DI ) = Ci,Ii Cl,Il Cov


j ,
j D I

j=Ii
j=Il






Ii1
J1
J1
J1
Y
Y
Y
Y



1
2
1
1

= Ci,Ii Cl,Il E
j
j DI E
j DI E
j DI



j=Ii
j=Ii
j=Il
j=Il

J1
Y
BCL b BCL
bi,J
= C
Cl,J
(1 + j ) 1 .
j=Ii

This proves the statements.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

244

Chapter 9. Claims Reserving

We analyze the terms of Theorem 9.7 involving j . Under assumption


j2

Ij1
X

(9.18)

Cl,j ,

l=1

we obtain
0 j  1.
Note that assumption (9.18) is stronger than j + Ij1
Cl,j /j2 > 2 which provides
l=1
finiteness of conditional variances in Theorem 9.7. Assumption (9.18) for all j then
implies for the first term in Theorem 9.7
J1
X

Cb BCL
i,J

j2

J1
Y

fbBCL (1 +
n

n)

Cb BCL

i,J

n=j

J1
X

J1
Y

j2

BCL
Cbi,J

j2

J1
X

2

fbnBCL

n=j

j=Ii

(m

j=Ii

w)

j=Ii

BCL
Cbi,j

In fact, the right-hand side is a lower bound for the left-hand side for any j > 1
(where the second posterior moment exists). For the second term in Theorem 9.7
we have under (9.18)
BCL
Cbi,J

2

J1
Y

(1 + j ) 1

j=Ii

BCL
Cbi,J

tes

2 J1
X

j .

j=Ii

no

In fact, the right-hand side is again a lower bound for the left-hand side for any
j > 1.
This implies that under assumption (9.18) for all I i j J 1 we have
approximation


BCL
BCL
msepCi,J |DI Cbi,J
Cbi,J

2

J1
X

j2

BCL
Cbi,j

+ j ,

(9.19)

NL

j=Ii

where the right-hand side is a lower bound for the left-hand side for any j > 1.
Since the latter formula applies to any j > 1 (it can even be made uniform in
j 1) we can consider its non-informative limit j 1. In this case the Bayesian
CL predictor converges to the classical CL predictor, see (9.15). For the j -terms
we obtain in the non-informative limit
lim j = lim

j2

j2

j2
PIj1

,
Cl,j
(9.20)
where in the last step we have again used (9.18). In fact the last approximation is
again a lower bound. This motivates in the non-informative prior case j 1 the
following approximation and lower bound to (9.19) and Theorem 9.7, respectively,
j 1

j 1

j2 (j 2) +

PIj1
l=1

Cl,j

j2 +

PIj1
l=1

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Cl,j

l=1

Chapter 9. Claims Reserving

245

b CL = (C
b CL )2
msepMack
Ci,J |DI Ci,J
i,J

J1
X

s2j /(fbjCL )2

CL
Cbi,j

j=Ii

s2j /(fbjCL )2

+ PIj1
l=1

Cl,j

(9.21)

where we set j2 = s2j /(fbjCL )2 . The conditional MSEP formula (9.21) is exactly the
famous Mack formula [73]. We emphasis important remarks and differences:
Remarks 9.8.

(m

w)

Mack [73] has derived formula (9.21) in Macks distribution-free CL model,


we have derive formula (9.21) as approximation (and lower bound) to the
non-informative prior case of the gamma-gamma Bayesian CL model. These
two stochastic models are different and, therefore, our derivation cannot be
considered as a conditional MSEP formula in Macks distribution-free CL
model.

tes

Both models, Macks distribution-free CL model and the non-informative


prior gamma-gamma Bayesian CL model, have in common that they provide
the CL reserves. Therefore, both models can be used to derive a conditional
MSEP formula for the CL algorithm of Section 9.2.1.

no

This implies that Macks model and our model are different and the derivations of the corresponding conditional MSEP formulas are different. However,
we have proved that under assumption (9.18) we expect that the numerical
results of the two approaches are very close. This will be justified in Example 9.9, below. Assumption (9.18) is fulfilled in many applied data sets
and, therefore, it is a relief because both methods come to similar conclusions
about prediction uncertainties in many applied situations.

NL

We use variance parameters j2 , Mack [73] uses variance parameters s2j . Their
relationship is justified by identity (9.11), see also (9.14).
The blue terms are the process uncertainty terms and the red terms are the
parameter estimation error terms in Macks formula. For more interpretation
we refer to Section 9.4, below, and to Merz-Wthrich [80].
For aggregated accident years, one has under assumption (9.18) approximation and
lower bound to Theorem 9.7 given by

!
Mack
msepP
C |DI
i i,J

Cb CL
i,J

b CL
msepMack
Ci,J |DI Ci,J

(9.22)

+2

X
i<l

J1
X

CL b CL
Cbi,J
Cl,J

j=Ii

Version April 14, 2016, M.V. Wthrich, ETH Zurich

s2j /(fbjCL )2
PIj1
n=1

Cn,j

246

Chapter 9. Claims Reserving

Again, the red term describes the parameter estimation error in Macks formula
and for interpretation we refer to Merz-Wthrich [80].

tes

(m

w)

Example 9.9 (gamma-gamma Bayesian CL model and Macks formula). We come


back to the claims reserving example presented in Table 9.4. We consider the
gamma-gamma Bayesian CL model with non-informative priors j 1 in Theorem
9.7. In this non-informative prior case we have j = 1, see (9.12), and therefore
BCL
CL
we obtain fbjBCL = fbjCL and Cbi,J
= Cbi,J
in the non-informative prior limit. This
immediately implies that the claims reserves for the outstanding loss liabilities in
this non-informative prior gamma-gamma Bayesian CL model are given by Tables
9.5 and 9.6.
There remains the calculation of the prediction uncertainty in this non-informative
prior Bayesian CL model. In order to do this we need an estimate for j2 . From
which is compared to s2j = j2 (fbjCL )2 . If we
(9.14) we see that j2 () = j2 2
j
estimate 2
by (fbjCL )2 then we can find estimates bj2 = sb2j /(fbjCL )2 once we have
j
estimated s2j . The estimation of the latter is done rather ad-hoc by the classical
estimates of Macks distribution-free CL model, see Lemma 3.5 in Wthrich-Merz
[100],
!2
Ij1
X
1
Ci,j+1
2
CL
b
sbj =
Ci,j
fj
.
(9.23)
I j 2 i=1
Ci,j

no

For triangles I = J + 1 the variance parameter s2J1 cannot be estimated because


we do not have sufficiently many observations in this last column. In practice, one
therefore often uses Macks [73] estimate (which is based on exponential decay),
see also (3.13) in Wthrich-Merz [100],
n

sb2J1 = min sb4J2 /sb2J3 ; sb2J2 ; sb2J3 .

(9.24)

NL

This provides the estimates in Table 9.7.

sbj
b
j

135.25
90.62

33.80
31.36

15.76
15.41

19.85
19.56

9.34
9.27

2.00
1.99

0.82
0.82

0.22
0.22

0.06
0.06

Table 9.7: Estimated standard deviation parameters in the non-informative priors


gamma-gamma Bayesian model, where we set bj = sbj /fbjCL .

These parameters provide the results for the square-rooted conditional MSEPs
given in Table 9.8. We observe that for the total claims reserves the 1 standard
deviation confidence bounds are about 7.7% of the total claims reserves. These
confidence bounds should now be put in relation to the point estimator in the
balance sheet of Table 9.1 for the claims reserves.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 9. Claims Reserving

1
2
3
4
5
6
7
8
9
10
covariance1/2
total

15126
26257
34538
85302
156494
286121
449167
1043242
3950815
6047061

msep1/2 msep1/2
Bayes
Mack
267
914
3058
7628
33341
73467
85399
134338
410850
116811
462990

in %
reserves

267
914
3058
7628
33341
73467
85398
134337
410817
116810
462960

1.8%
3.5%
8.9%
8.9%
21.3%
25.7%
19.0%
12.9%
10.4%

w)

CL reserves
cCL
R
i

(m

accident
year i

247

7.7%

tes

Table 9.8: Claims reserves and prediction uncertainty in the non-informative priors
gamma-gamma Bayesian CL model (see Theorem 9.7) and Macks formula (9.21)(9.22).

9.3.2

no

We also observe that the exact formula given by Theorem 9.7 with non-informative
priors and Macks formula (9.21)-(9.22) are very close, i.e. 462990 versus 462960.
This observation holds true for many typical non-life insurance data sets and it
says that both models (though being different) come to the same conclusion about
prediction uncertainty.


Over-dispersed Poisson model

NL

Another stochastic model that has attracted a lot of attention


in the insurance industry is the so-called over-dispersed Poisson
(ODP) model. It goes back to Renshaw-Verrall [85]. Peter
D. England and Richard J. Verrall [42] have popularized
the model a lot. It belongs to the family of GLM models and
it is quite attractive because bootstrap simulation can easily be
applied.
R.J. Verrall
Model Assumptions 9.10 (over-dispersed Poisson model).
Assume there exist positive parameters 1 , . . . , I , 0 , . . . , J and such that all
Xi,j are independent (in i and j) with
Xi,j
Poi(i j /).

Version April 14, 2016, M.V. Wthrich, ETH Zurich

248

Chapter 9. Claims Reserving

Observe that
E[Xi,j ] = i j ,
Var(Xi,j ) = i j .

1 = 1

or

J
X

P.D. England

w)

We have a cross-classified mean with i modeling the exposure


of accident year i and j the development pattern of the payout
delay j, see also (9.7). In order to make the parameters i
and j uniquely identifiable we need a side constraint. The two
commonly used side constraints are either
j = 1.

(m

j=0

The first option is more convenient in the application of GLM methods, the second
option gives an explicit meaning to the pattern (j )j , namely, it corresponds to the
cash flow pattern.
The best-estimate reserves at time I are given by
X

E [ Xi,j | DI ] =

tes

R=

i j .

(i,j)IIc

(i,j)IIc

`DI (, , ) =

no

Hence, we need to estimate the parameters i and j . This is done with MLE
methods. We assume J + 1 = I which simplifies notation. Having observations DI
allows to estimate the parameters. The log-likelihood function for = (1 , . . . , I ),
= (0 , . . . , J ) and is given by
X

i j / + (Xi,j /) log(i j /) log((Xi,j /)!).

NL

(i,j)II

Calculating the derivatives w.r.t. and and setting them equal to zero implies
that we need to solve the following system of equations to find the MLEs
j

Ij
X

i=1
Ii
X

j=0

i =

j =

Ij
X
i=1
Ii
X

Xi,j

for all j = 0, . . . , J,

(9.25)

Xi,j

for all i = 1, . . . , I,

(9.26)

j=0

w.l.o.g. under side constraint Jj=0 j = 1. The remarkable fact about the MLE
system (9.25)-(9.26) is that it can be solved explicitly and that it provides the CL
reserves. Moreover, the constant dispersion parameter cancels and is not relevant
for estimating the reserves.
P

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 9. Claims Reserving

249

Theorem 9.11. Under Model Assumptions 9.10, the MLEs for and under
P
side constraint Jj=0 j = 1, given DI , are given by
CL
b MLE
= Cbi,J
i

and

bjMLE =

J1
Y
k=j

1
1
1 bCL ,
CL
b
fk
fj1

for i = 1, . . . , I and j = 1, . . . , J (an empty product is set equal to 1). Moreover,


Q
bCL
b0MLE = J1
k=0 1/fk . For the estimated reserves we have
J
X

cCL .
bjMLE = R
i

j=Ii+1

w)

cODP =
b MLE
R
i
i

(m

Proof. For the proof we refer to Lemma 2.16, Corollary 2.18 and Remarks 2.19 in Wthrich-Merz
[100]. Basically, the proof goes by induction along the last observed diagonal in DI .
2

Remarks 9.12.

tes

Theorem 9.11 goes back to Hachemeister-Stanard [58], Kremer [67] and Mack
[72].

no

Theorem 9.11 explains the popularity of the ODP model for claims reserving
because it provides exactly the CL reserves. Thus, we have found a second
stochastic model (besides the non-informative prior gamma-gamma Bayesian
CL model) that can be used to explain the CL algorithm from a stochastic
point of view.

NL

In this ODP model we can also give an estimate for the conditional MSEP.
This estimate uses that MLEs can be approximated by standard Gaussian
asymptotic results for GLM. For details we refer England-Verrall [42] and
Wthrich-Merz [100], Section 6.4.3. Another way to assess prediction uncertainty is to use bootstrap simulation.
The ODP framework also allows to give an estimate for the conditional MSEP
P
in the BF method, and it justifies the choice bjCL = jk=0 bkMLE . For details
we refer to Alai et al. [3, 4].

9.4
9.4.1

Claims development result


Definition of the claims development result

In the previous sections we have given a static point of view of claims reserving.
However, claims reserving should be understood as a dynamic process, where more
and more information becomes available over time and prediction is continuously
Version April 14, 2016, M.V. Wthrich, ETH Zurich

250

Chapter 9. Claims Reserving

adapted to this new knowledge. This is also the viewpoint that needs to be taken
for solvency considerations.
We consider the run-off situation, and thus the last accident year I is kept fixed.
In the run-off situation the flow of information (9.2) is changed to (we do a slight
abuse of notation here)
Dt = {Xi,j ; i + j t, 1 i I, 0 j J} .

w)

This generates a filtration denoted by (Dt )t0 on (, F, P) that describes the flow
of information (we abbreviate Dt = (Dt )). At time t I the ultimate claim of
accident year i is predicted by the best-estimate
(t)

Cbi,J = E [ Ci,J | Dt ] .

(9.27)

(t)

(t)

(m

This is the predictor that minimizes the conditional MSEP at time t. The bestestimate reserves at time t I for accident year i > t J are provided by
Ri = Cbi,J Ci,ti .

(9.28)

(t)

tes

In accounting year t + 1 we then collect new information resulting in Dt+1 and we


do payments Xi,ti+1 = Ci,ti+1 Ci,ti . This allows to define the so-called claims
development result (CDR) of accident year i > t J in accounting year t + 1 by,
see Michael Merz and Wthrich [78],


(t+1)

(t)

(t+1)

= Cbi,J Cbi,J

(9.29)

no

CDRi,t+1 = Ri Xi,ti+1 + Ri

NL

The claims development result CDRi,t+1 explains how we change


the prediction of the ultimate claim when new information becomes available. If the claims development result is negative
we have a loss in the P&L statement because we have underestimated the outstanding loss liabilities at time t, otherwise we
have a gain. This is exactly the classical earning statement view
in order to understand the risk that derives from the development of the outstanding loss liabilities.
The tower property immediately gives the following crucial statement:

M. Merz

Corollary 9.13. Assume Ci,J has finite first moment and i + J > t I. Then we
have
E [ CDRi,t+1 | Dt ] = 0.
This corollary explains that in average we neither expect losses nor gains in the
claims development result but the prediction is just unbiased. Note that (9.27)
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 9. Claims Reserving

251

defines a martingale in t (under integrability) and remark that martingales have


uncorrelated innovations (claims development results). Our aim is to study the
uncertainty in this position measured by the conditional MSEP. For simplicity we
set t = I and assume i + J > I. Then we define

msepCDRi,I+1 |DI (0) = E (CDRi,I+1 0)2 DI = Var (CDRi,I+1 | DI )






DI .

(I+1)

= Var Cbi,J

(9.30)

9.4.2

w)

We aim to study the volatility of this one-period update. We do this in the gammagamma Bayesian CL Model 9.1.

One-year uncertainty in the Bayesian CL model

(m

Firstly, observe that Lemma 9.2 easily extends to the following lemma (the proof
is an exercise).
Lemma 9.14. Choose t I. Under Model Assumptions 9.1, the posteriors of
0 , . . . , J1 are independent, conditionally given Dt , with

(tj1)I

tes

j |Dt j +

(tj1)I
X
Ci,j
Ci,j+1
,
f
(

1)
+
.
j
j
2
j
j2
i=1

i=1

The Bayesian CL predictor for Ci,J , i + J > t I, is given by


(t)
Cb

= E [ Ci,J | Dt ] = Ci,ti

no

i,J

J1
Y

(t)

fbj ,

j=ti

(t)
with posterior expected Bayesian CL factors given by fbj = E[1
j |Dt ].

NL

Here, we slightly change notation, the upper index now indicates the time point t
of the available information Dt .
Next we exploit the recursive structure of credibility estimators, see for instance
Corollary 8.6. This holds true in quite some generality, for the current exposition
we restrict to t {I, I + 1} because these are the only indexes of interest for the
analysis of (9.30). For t = I + 1 and j 0 we have (in the last step we use the
calculation of the proof of Corollary 9.3)
(I+1)

fbj

= E


i

1
j DI+1

fj (j 1) +
=

CIj,j+1
j2

j 1 +
(I)

= j

PIj Ci,j
i=1 2
j

j 1 +

PIj Ci,j+1
i=1

i=1 2
j

fj (j 1) +
+

j2

PIj Ci,j

j 1 +

PIj1 Ci,j+1
i=1

PIj Ci,j


CIj,j+1 
(I)
(I)
+ 1 j fbj ,
CIj,j

Version April 14, 2016, M.V. Wthrich, ETH Zurich

i=1 2
j

j2

252

Chapter 9. Claims Reserving

with DI -measurable credibility weight

(I)

j =

j2

CIj,j
(0, 1).
P
(j 1) + Ij
i=1 Ci,j
(I+1)

w)

The important observation is that there is only one random term in fbj
, conditionally given DI . This is crucial in the calculation of the conditional MSEP of the
claims development result prediction. We start with a technical lemma.

Var (Ci,Ii+1 | DI ) =

(m

Lemma 9.15. Under Model Assumptions 9.1 we have for I i + 1 J


(I)

Cbi,Ii+1

Pi1
l=1

(I)

(Ii )1 Ii ,

2
> 2; otherwise the
Cl,Ii /Ii

tes

under the additional assumption that Ii +


second moment is infinite.

2

Proof. In the first step we apply Theorem 9.7 for J 1 = I i and then we derive
(I)
(I)
2
2
= Ci,Ii Ii
(fbIi )2 (1 + Ii ) + Ci,Ii
(fbIi )2 Ii


2 2

Ii
b (I)
C
=
(1 + Ii ) + Ii
i,Ii+1
Ci,Ii



2  2
Ii
b (I)
+
1
(1
+

1
.
=
C
Ii
i,Ii+1
Ci,Ii

no

Var (Ci,Ii+1 | DI )

NL

We calculate the square bracket. It is given by


!

 2

2
2
Ii
Ii
Ii
+ 1 (1 + Ii ) 1 =
+1
1+
1
Pi1
2 (
Ci,Ii
Ci,Ii
Ii
Ii 2) +
l=1 Cl,Ii
=

2
2
2
2
Ii
Ii
Ii
Ii
+
+
P
Pi1
i1
2 (
2 (
Ci,Ii
Ci,Ii Ii
Ii
Ii 2) +
Ii 2) +
l=1 Cl,Ii
l=1 Cl,Ii

Pi1
2
2 (
2) + l=1 Cl,Ii + Ci,Ii + Ii
2
Ii Ii 


Ii
P
i1
2 (
C
Ci,Ii Ii
Ii 2) +
l,Ii
l=1
(I)

2
Ii

(Ii )1
(I)
= (Ii )1 Ii .
Pi1
2 (
Ii

2)
+
C
Ii
l=1 l,Ii

This proves the claim.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 9. Claims Reserving

253

Theorem 9.16. Under Model Assumptions 9.1 the Bayesian CL predictor satisfies
i>I J
msepCDRi,I+1 |DI (0) =

(I)
(Cbi,J )2



1 + ( (I) )1 Ii

J1
Y

Ii

1+

(I)
j j

1 ,

j=Ii+1

where we assume that j + Ij1


Cl,j /j2 > 2 for all I i j J 1; otherwise
l=1
the conditional MSEP is infinite. For aggregated accident years we have
msepP

CDRi,I+1 |DI (0)


i

msepCDRi,I+1 |DI (0)

J1
Y

(m
w

+2

(I)
(I)
Cbi,J Cbl,J (1 + Ii )

(I)

1 + j j 1 ,

j=Ii+1

i<l

the summations run over I J + 1 i I and I J + 1 i < l I, respectively.


Proof. We first decouple accident years

CDRi,I+1 |DI (0)

Var

b (I+1) DI
C
i,J



b (I+1) , C
b (I+1) DI .
Cov C
i,J
l,J

i,l

tes

msepP

We calculate these covariance terms. Observe


b (I+1) = Ci,Ii+1
C
i,J

J1
Y

J1
Y

(I+1)
fbj
= Ci,Ii+1

j=Ii+1

j=Ii+1





(I) b(I)
(I) CIj,j+1
+ 1 j
fj
.
j
CIj,j

NL

no

The only random terms under the measure P[|DI ] are Ci,Ii+1 , Ci1,Ii+2 , . . . , CIJ+1,J . All
these random variables belong to different accident years i and to different development periods
j. Therefore, they are independent given DI , this follows from Model Assumptions 9.1 and
Lemma 9.2. Moreover, we have the following unbiasedness of successive estimations (use the
tower property)


i


h
(I) CIj,j+1
(I) b(I)
(I+1)
(I)
E j
+ 1 j
fj DI = E fbj
DI = fbj .
CIj,j
In the first step we decouple the covariance as follows

i

h
b (I+1) , C
b (I+1) DI = E C
b (I+1) C
b (I+1) DI C
b (I) C
b (I)
Cov C
i,J
i,J
i,J l,J ,
l,J
l,J
with

i
h
b (I+1) C
b (I+1) DI = E Ci,Ii+1
E C
i,J

J1
Y

l,J

j=Ii+1

(I+1)
fbj




(I+1)

Cl,Il+1
fbm
DI .

m=Il+1
J1
Y

We first treat the variance case i = l. In that case we have using conditional independence

2

J1
i


h
Y

(I) b(I)
(I) CIj,j+1
b (I+1) )2 DI
DI
+ 1 j
fj
E (C
= E (Ci,Ii+1 )2
j
i,J

CIj,j

j=Ii+1
#
"


2


Y
 J1


(I) CIj,j+1
(I) b(I)
= E (Ci,Ii+1 )2 DI
E
j
+ 1 j
fj
DI ,

CIj,j
j=Ii+1

Version April 14, 2016, M.V. Wthrich, ETH Zurich

254

Chapter 9. Claims Reserving

w)

which allows to calculate each term individually. Unbiasedness and Lemma 9.15 for i = I j
imply for these individual terms


i
h


(I) CIj,j+1
(I+1) 2
(I) b(I)
(I)
= Var j
E (fbj
) DI
+ 1 j
fj DI + (fbj )2
CIj,j
!2
(I)
j
(I)
=
Var (CIj,j+1 | DI ) + (fbj )2
CIj,j
!2
(I)

2
j
(I)
(I)
b (I)
=
C
(j )1 j + (fbj )2
Ij,j+1
CIj,j


(I)
(I)
= (fbj )2 j j + 1 .
Similarly we have for the first term


2
b (I)
Var (Ci,Ii+1 | DI ) + C
i,Ii+1

2 

(I) 1
b (I)
C
(
)

+
1
.
Ii
i,Ii+1
Ii

(m



E (Ci,Ii+1 )2 DI
=

= Cl,Il

Ii1
Y

tes

Collecting all the terms proves the statement for i = l. There remains the case of different
accident years. W.l.o.g. we assume i < l which implies I i + 1 > I l + 1. This and conditional
independence, given DI , imply for the covariance between these accident years


J1
J1
i
h
Y
Y

b(I+1) Cl,Il+1
b(I+1) DI
b (I+1) DI
b (I+1) C
Ci,Ii+1
=
E
E C
f
f
m
j
i,J
l,J


j=Ii+1
m=Il+1
i
h
(I+1) 2
E (fbj
) DI

j=Ii+1

m=Il


h

i
(I+1)
(I)
Cov Ci,Ii+1 , fbIi DI + Ci,Ii (fbIi )2

no

b (I)
= C
l,Ii

J1
Y

i
h
(I+1)
(I)
fbm
E Ci,Ii+1 fbIi DI

b (I) [Ii + 1]
b (I) C
= C
i,J
l,J

J1
Y

(I)

J1
Y


(I)
(I)
j j + 1 (fbj )2

j=Ii+1

j j + 1 .

j=Ii+1

NL

This completes the proof.

We study the conditional MSEP formula of the claims development result under
assumption (9.18). This assumption implies again that 0 j  1. Moreover, we
(I)
have j (0, 1) from which we see that (9.18) implies
(I)

0 j j  1.

The other term in Theorem 9.16 is more sophisticated. We have from the proof of
Lemma 9.15
(I)
(Ii )1 Ii

2
Ii
+ 1 (1 + Ii ) 1.
Ci,Ii

If in addition to (9.18) we assume


2
Ii
 Ci,Ii ,

Version April 14, 2016, M.V. Wthrich, ETH Zurich

(9.31)

Chapter 9. Claims Reserving

255

then we also obtain


(I)

0 (j )1 j  1.
Moreover, we get approximation (and lower bound) under (9.18) and (9.31)
2
Ii
+ Ii .
Ci,Ii

(I)

(Ii )1 Ii

This implies that under assumptions (9.18) and (9.31) we obtain approximation

w)

J1
X
2
(I)
(I)
msepCDRi,I+1 |DI (0) (Cbi,J )2 Ii + Ii +
j j ,
Ci,Ii
j=Ii+1

(9.32)

(m

where the right-hand side is a lower bound for the left-hand side for any j > 1.

tes

This formula should be compared to (9.19). We will give interpretations below,


after formula (9.34). Since (9.32) applies to any j > 1 we can again consider its
non-informative limit j 1: the Bayesian CL predictor converges to the classical
(I)
CL
CL predictor, Cbi,J Cbi,J
, see (9.15), and for the j -terms we obtain, see (9.20),
j2
lim j PIj1
1
j

l=1

Cl,j

For the credibility weights we have

j 1

CIj,j def. e(I)


CIj,j
= j .
= PIj
PIj
(j 1) + i=1 Ci,j
l=1 Cl,j

no

(I)

lim j = lim

j 1 2
j

(9.33)

NL

This motivates in the non-informative prior case j 1 the following approximation and lower bound to (9.32) and Theorem 9.16, respectively,

msepMW
CDRi,I+1 |DI

"

(0) =

(Cb CL )2
i,J

CL 2
s2Ii /(fbIi
)
Ci,Ii

(9.34)

2
J1
bCL 2
CL 2
X
s2Ii /(fbIi
)
e(I) sj /(fj )
+ P
+

,
P
j
i1
Ij1
Cl,j
l=1 Cl,Ii
j=Ii+1
l=1

where we set j2 = s2j /(fbjCL )2 . This is the Merz-Wthrich (MW) formula, see (3.17)
in [78]. We also refer to Bhlmann et al. [23] and Merz-Wthrich [80].
Remarks 9.17.
Concerning derivation and stochastic model choice for the MW formula (9.34)
the same Remarks 9.8 apply as for Macks formula (9.21).
Version April 14, 2016, M.V. Wthrich, ETH Zurich

256

Chapter 9. Claims Reserving

w)

Macks formula (9.21) is often called total run-off uncertainty and the MW
formula (9.34) corresponds to the one-year run-off uncertainty. Comparing
these two formulas we observe that from the total run-off uncertainty the first
blue term with index j = I i also appears in the one-year run-off uncertainty.
This is the process variance in period j = I i. From the red terms, the
first red term with index j = I i appears (parameter uncertainty) and the
remaining red terms j I i + 1 of the summation in (9.21) are scaled
(I)
with factor ej (0, 1) to obtain the one-year run-off uncertainty . These
scalings reflect the release of parameter uncertainty when new information (a
new diagonal in the claims development triangle) arrives.
The same interpretation applies to (9.32) versus (9.19).

P
msepMW
CDRi,I+1 |DI (0) =
i

msepMW
CDRi,I+1 |DI (0)

"
X

Cb CL Cb CL
i,J

l,J

i<l

(9.35)

2
J1
bCL )2
CL 2
X
)
s2Ii /(fbIi
(I) s /(f
ej PjIj1j
.
+
Pi1
Cn,j
n=1 Cn,Ii
j=Ii+1
n=1

tes

+2

(m

For aggregated accident years, one has estimate

Example 9.18. We revisit claims reserving Example 9.9 and calculate the claims
development result uncertainty. We consider the non-informative prior case and

15126
26257
34538
85302
156494
286121
449167
1043242
3950815
6047061

NL

2
3
4
5
6
7
8
9
10
total

total msep1/2
Mack (9.21)

CDR msep1/2
MW (9.34)

CDR/total
msep1/2

267
914
3058
7628
33341
73467
85398
134337
410817
462960

267
884
2948
7018
32470
66178
50296
104311
385773
420220

100%
97%
96%
92%
97%
90%
59%
78%
94%
91%

no

CL reserves
b CL
R
i

Table 9.9: Claims reserves and prediction uncertainty: Macks formula (9.21)-(9.22)
for the total run-off uncertainty and MW formula (9.34)-(9.35) for the one-year
claims development uncertainty.

we choose the same parameter estimates as in Example 9.9. Moreover, we consider


the MW formula (9.34)-(9.35).
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 9. Claims Reserving

257

w)

The results are presented in Table 9.9. We see that in this example the one-year
claims development result uncertainty measured by the square-rooted conditional
MSEP results in 91% of the total run-off uncertainty. The reason for this high
value is that knowing the next diagonal in the claims development triangle already
releases a major part of the claims run-off risks. For the next accounting year
we predict payments of 3873205 which is almost 2/3 of the total claims reserves,
i.e. we expect a rather fast claims settlement in this example and a fast decrease
of run-off uncertainties. Typically, the square-rooted conditional MSEP of the
claims development result is in the range of 50% to 95% relative to the total runoff uncertainty, the former relates to liability insurance and the latter to property
insurance.


The full picture of run-off uncertainty

no

9.4.3

tes

(m

Exercise 26 (Italian motor third party liability insurance example). We revisit the
Italian motor third party liability insurance example of Bhlmann et al. [23]. The
field study considers 12 12 run-off triangles of 37 Italian insurance companies at
the end of 2006. For these data the claims reserves and the corresponding squarerooted conditional MSEPs for the total run-off uncertainty and for the one-year
claims development result uncertainty using Macks formula (9.22) and the MW
formula (9.35), respectively, were calculated. The results are presented in Table
9.10. Note that for confidentiality reasons the volumes of the 4 biggest companies
were all set equal to 100.0 and the order of these 4 companies is arbitrary.
Give interpretations to these results.


NL

Note that in Theorem 9.16 and in the MW formula (9.34)-(9.35) we have only
derived the uncertainties in the next accounting year I + 1. A natural question is
what can we say about the individual uncertainties in all future accounting years?
This is exactly the question answered in Merz-Wthrich [80]. We would like to
briefly summarize these results (without proofs) because they give further insight
in the run-off of risk behavior of claims development triangles.
We consider the total prediction error as a telescoping sum of successive claims
(i+J)
development results. Note that we have Cbi,J = Ci,J , P-a.s., because this ultimate claim is observable at time t = i + J. This and the definition of the claims
development result imply for the total prediction error at time t = I
(I)

Cbi,J Ci,J =

i+J
X
k=I+1

(k1)

Cbi,J

(k)

Cbi,J =

i+JI
X

CDRi,I+k ,

k=1

for i > I J, see (9.29). This telescoping sum describes all innovations of the
claims development process. These innovations have mean zero (martingale), see
Corollary 9.13. This immediately implies that they are uncorrelated. Under the
assumption that the second moment exists, uncorrelatedness provides the following
Version April 14, 2016, M.V. Wthrich, ETH Zurich

258

Chapter 9. Claims Reserving


total msep1/2

CDR msep1/2

CDR msep1/2
total msep1/2

volume in %

(in % reserves)

(in % reserves)

(in %)

100.0
100.0
100.0
100.0
61.8
56.9
53.0
49.4
46.2
41.6
..
.

4.03
2.90
2.41
3.45
3.66
5.54
4.52
4.60
5.61
5.32
..
.

3.24
2.36
1.98
2.85
3.04
4.50
3.70
3.82
4.59
4.36
..
.

80.4
81.4
82.3
82.6
82.9
81.2
81.8
83.1
81.8
82.0
..
.

3.5
3.4
2.6
2.5
2.2
2.0
1.8
1.8

18.02
17.23
18.73
23.11
20.83
17.01
26.16
27.79
0.96

14.78
13.92
14.89
19.10
17.53
13.87
21.54
22.25
0.78

82.0
80.8
79.5
82.6
84.2
81.5
82.4
80.1
81.8

30
31
32
33
34
35
36
37
total

(m

1
2
3
4
5
6
7
8
9
10
..
.

w)

business

tes

company

no

Table 9.10: Italian motor third party liability insurance example of Bhlmann et
al. [23]. Prediction uncertainties: Macks formula (9.22) for the total run-off uncertainty and MW formula (9.35) for the one-year claims development uncertainty.

NL

decoupling property of the total prediction uncertainty




(I)

msepCi,J |DI Cbi,J

=
=
=

i+JI
X
k=1
i+JI
X
k=1
i+JI
X

Var (CDRi,I+k | DI )
msepCDRi,I+k |DI (0)
h

(9.36)

E msepCDRi,I+k |DI+k1 (0) DI .

k=1

The first line of (9.36) describes the total run-off uncertainty over the entire settlement period of the claims; the second line considers the claims development
result volatilities based on todays knowledge DI ; and the third line considers the
expected one-year run-off uncertainties of all future periods. Thus, formula (9.36)
exactly explains how the total run-off uncertainty needs to be split (dynamically)
across all future development periods. In Theorem 9.16 and the MW formula (9.34)
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 9. Claims Reserving

259

we have only derived the first term with index k = 1 of this sum on the right-hand
side (in the gamma-gamma Bayesian CL model).
In the non-informative prior gamma-gamma Bayesian CL model all terms k =
1, . . . , i + J I can be estimated, see Merz-Wthrich [80], and these estimates have
exactly the same structure as the MW formula (9.34). They are estimated by


MW
b
msepMW
CDRi,I+k |DI (0) = E msepCDRi,I+k |DI+k1 (0) DI

(9.37)

CL
2
bCL
 s2
Y 
)2 k1
/(fbIi+k1
s2
Ii+k1 /(fIi+k1 )
e(I)
Ii+k1
+
1

P
Ii+m
ik
CL
Cbi,Ii+k1
m=1
l=1 Cl,Ii+k1


CL 2
Cbi,J

CL
+ Cbi,J

2

J1
X

e(I)

k2
Y

jk+1
m=0

 s2 /(fbCL )2
(I)
j
.
1 ejm PjIj1
l=1

Cl,j

(m

j=Ii+k

w)

def.

no

tes

Note that this latter formula is an approximation in the non-informative prior


b on the first
gamma-gamma Bayesian CL model. This is indicated by the symbol E
line of (9.37). For its derivation we refer to Merz-Wthrich [80]. The coloring in
formula (9.37) is exactly the same as in the MW formula (9.34), and also the same
interpretations apply. Note that (9.37) provides the natural split of Macks formula
(9.21) across all future accounting years.
For aggregated accident years we have estimation for k 1

MW
P
MW
I+k|I = msep I

CDRi,I+k |DI
i=IJ+k
k1
Y

I
X

(0) =

msepMW
CDRi,I+k |DI (0)

i=IJ+k

2
bCL
Ii+k1 /(fIi+k1 )
(9.38)
Pik
n=1 Cn,Ii+k1
m=1
i<l

J1
k2

 s2 /(fbCL(I) )2
X
X
Y
j
(I)
CL b CL
e(I)
,
2
Cbi,J
Cl,J
1 ejm Pj Ij1
jk+1
Cn,j
m=0
n=1
i<l
j=Ii+k

CL b CL
Cbi,J
Cl,J

(I)

1 eIi+m

 s2

NL

+2
+

where the last two summations run over I J + k i < l I.


These formulas are all implemented in the R package ChainLadder, see [52]. Let
us describe the relevant code, for more details we refer to [52].
# bringing data in appropriate triangular form and labeling axes
> tri <- as.triangle(as.matrix(data.cumulative))
> dimnames(tri)=list(origin=1:nrow(tri),dev=1:ncol(tri))
# illustrating data using standard plots in R
Version April 14, 2016, M.V. Wthrich, ETH Zurich

260

Chapter 9. Claims Reserving

> plot(tri,ylab="",main="")
> plot(tri,lattice=TRUE,ylab="",main="")
# calculating the CL reserves and the corresponding MSEPs
> M <- MackChainLadder(tri,est.sigma="Mack")
CL reserves and Macks formula (9.21)-(9.22) including illustrations
M
plot(M)
plot(M,lattice=TRUE)

#
>
>
>
>

split of (9.21)-(9.22) into process variance and parameter error


M$Mack.ProcessRisk[,ncol(tri)]
M$Total.ProcessRisk[ncol(tri)]
M$Mack.ParameterRisk[,ncol(tri)]
M$Total.ParameterRisk[ncol(tri)]

(m

w)

#
>
>
>

tes

# CL reserves and the MW formula (9.34)-(9.35)


> CDR(M)
# full uncertainty picture (9.37)-(9.38)
> CDR(M,dev="all")

no

Example 9.19. We revisit Example 9.9 and calculate for this example the full runoff uncertainty picture using (9.37)-(9.38). We start by illustrating the data using
the above R commands. This provides Figure 9.2. The graphs show that the data
2

1
2
3

6
5
4

NL

1
2
3

1
2
3

1
2

3
6

4
5
7

4
5
7

3
2

5
4

3
2

3
2

10

10

10

8
7

8
11

in 1'000'000

4
5
7

11

8
9

10

8
7

in 1'000'000

10

11

10

11

2
3
6
1
4
5
0
7

10
9
8

9
8

10

10

dev. period

dev. period

Figure 9.2: Illustration of the data of Table 9.4 (the labeling of the development
year axis is shifted by 1).

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 9. Claims Reserving

261

is rather regular, with a small decrease of volume over accident years. Moreover,
most of the payments are done in the first two development years j = 0, 1.
10

Chain ladder developments by origin period


10 11

Mack Chain Ladder Results

Forecast
Latest

9
8

Amount

6
4
5
7

1
3
2

1
3
2

5
4

1
2

1
3
2

Chain ladder developments by origin period

Chain ladder dev.


2

8
9

Amount

1
2
3
6
5
4

1
2
3
6
4
5
7

1
2
3

1
2
3
6
4
5
7

2
3
6
1
4
5
0
7
9
8

10

Mack's S.E.

10

10

11
10
2

Origin period

10

Development period

9.0

10.0

10.5

2
1

11.0

Fitted

10

11

Amount

9.5

w)

8.5

8.0

1
0

Standardised residuals

Standardised residuals

Origin period

10

7
6

Calendar period

(m

Standardised residuals

10

11

Standardised residuals

10

Development period

Development period

tes

Figure 9.3: Predicted claims development including 1 standard deviation confidence


bounds (the labeling of the development year axis is shifted by 1) and observed
residuals in upper triangle.

NL

no

In Figure 9.3 the graphs of Figure 9.2 are complemented by the predicted payments
in the lower triangle. These graphs also include the 1 standard deviation confidence
bounds (top left and right-hand side). Moreover, Figure 9.3 (lhs) provides residuals
in the direction of all three time axes and ordered by the size of the observations.
These residuals should not show any trends in one of the (time) axis. We see that
there might be some problem in the accident year direction. The decrease in the
accounting/calendar year direction should not be overstated because the first two
years contain rather scarce information.
Finally, in Table 9.11 we provide the full run-off picture. This table summarizes in
the 5th column the expected future accounting year cash flows for t > I
X

i+j=t

E [Xi,j | DI ] =

Ci,Ii

i+j=t

j2
Y

(I)

fbl

(I)

fbj1 1 ,

l=Ii

and in the 2nd column the corresponding expected run-off of the claims reserves
for t I

h
i
X
E R(t) DI =
E [Xi,j | DI ] .
i+jt+1

Moreover, the table provides in the 6th column the square-rooted expected one1/2
year uncertainties (MW
for t I and in the 3rd column the expected run-off
t+1|I )
of the total uncertainty calculated as

1/2
X
MW

,
s+1|I

st

Version April 14, 2016, M.V. Wthrich, ETH Zurich

262

Chapter 9. Claims Reserving

6047061
2173856
1048144
570584
293063
148951
67824
36036
13655
0

462960
194285
122813
79758
32397
7739
2906
769
191
0

in %
reserves

expected
cash flows
P
E[Xi,j |DI ]

1/2
(MW
t+1|I )

3873205
1125712
477560
277521
144112
81127
31788
22381
13655

420220
150544
93390
72882
31459
7172
2803
744
191
0

i+j=t

8%
9%
12%
14%
11%
5%
4%
2%
1%

(m

10
11
12
13
14
15
16
17
18
19

rooted exp.
run-off of
MSEP

w)

accounting
years t

exp. run-off
of reserves
E[R(t) |DI ]

Table 9.11: Full run-off picture of Example 9.9.

NL

no

tes

where the first term t = I corresponds to the square-rooted Mack formula (9.22).
We conclude that we now have the full run-off picture, the 2nd column displays the
expected run-off of the claims reserves and the 3rd column provides the expected
run-off of the prediction uncertainty (measured by the square-rooted remaining
conditional MSEPs). This is in particular of interest for risk margin calculations
in solvency considerations.


Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 10

w)

Solvency Considerations

NL

no

tes

(m

In the previous chapters we have mainly discussed the modeling of insurance contracts, the related liability cash flows and
the implications for tariffication. If we remind of the discussion
in Chapter 1, we recall that the insurance company organizes
the equal balance within the community. That is, it issues insurance contracts at a fixed premium and in return it promises
to cover all (financial) claims that fall under these contracts.
Of course, we need to make sure that the insurance company
can keep its promises. This is exactly the crucial task of supervision (regulation) and sound risk management practice. Regulation aims to
protect the policyholder in that it enforces (by law) the insurance company to
follow good risk management practice. Companies should be sufficiently well capitalized so that they can fulfill their promises also under certain stress scenarios.
This is exactly what we would like (and need) to study in the present chapter.
We have already touched this issue in Chapter 5 on ruin theory. The main purpose
of Chapter 5 was to explain that there is a huge difference in ruin behavior between
light tailed and heavy tailed claims. Beyond that insight the random walk model of
Chapter 5 is much too simple to reflect real world insurance problems. Therefore,
we modify the ultimate ruin probability considerations so that they reflect the
current risk management task. In a first step we will discuss more general risk
management views, for a comprehensive discussion we refer to Wthrich-Merz [101],
and in a second step we discuss more explicitly the solvency and risk management
implementations used in the insurance industry.

10.1

Balance sheet and solvency

In Chapter 1 of Wthrich-Merz [101] we have provided the balance sheet of an


insurance company. It may look as follows (we only provide the positions that are
relevant for non-life insurance companies):
263

264

Chapter 10. Solvency Considerations


liabilities

cash and cash equivalents


debt securities
bonds
loans
mortgages
real estate
equity
equity securities
private equity
investments in associates
hedge funds
derivatives
futures, swaptions, equity options
insurance and other receivables
reinsurance assets
property and equipment
intangible assets
goodwill
deferred acquisition costs
income tax assets
other assets

deposits
policyholder deposits
reinsurance deposits
borrowings
money market
hybrid debt
convertible debt
insurance liabilities
claims reserves
premium reserves
annuities
derivatives

(m

w)

assets

insurance and other payables


reinsurance liabilities
employee benefit plan
other provisions

tes

income tax liabilities


other liabilities

no

Table 10.1: Balance sheet of a non-life insurance company at a fixed point in time.

NL

Table 10.1 presents a snap shot of a non-life insurance companys balance sheet,
that is, it reflects all positions at a certain moment in time t R+ . The left
hand side shows the assets at time point t and the right hand side should show the
liabilities at the same time point t. We denote the value of the assets at time t by
At , and Lt denotes the value of the liabilities at time t.
In the language of Chapter 5, we can think of At denoting all asset values in the
company at time t. These comprise the initial capital, all premia received and all
other amounts received minus the payments done up to time t. These amounts
are invested at the financial market and, thus, are allocated to the different asset
classes displayed in Table 10.1. On the other hand, the liabilities Lt reflect the
value of all obligations accepted by the insurance company that are still open at
time t.
In a similar context to the ruin theory Chapter 5, we should have At Lt in
order to cover the liabilities by asset values at time t. In fact, we may study the
continuous time surplus process (Cet )tR+ , given by Cet = At Lt , which should
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 10. Solvency Considerations

265

fulfill for a given large probability 1 p (0, 1)





P inf Cet 0 Ce0 = c0 = Pc0 inf At Lt 0 1 p.
tR+
tR+

(10.1)

Since an insurance company cannot continuously verify the solvency situation,


condition (10.1) is only checked on a discrete time grid t N0 , this is similar to
(5.5). But in fact, one even goes beyond that which we are just going to describe.
This will be done in several steps, see also Wthrich [98].

(m

w)

 Step 1 (one-period problem). Let us assume that we are at time t = 0 and


we would like to check a solvency condition (no ruin condition) similar to (10.1).
Moreover, we assume that at time 0 we have only sold one-year contracts (one-year
risk exposures) for which we receive a premium at time 0 and for which the claim
is paid at the end of the year, i.e. at time t = 1.
The total asset value at time 0 is given by A0 . This value is invested at the financial
market and generates value A1 at time 1. Thus, for this one-period problem the
no ruin condition reads as follows:

tes

for a given large probability 1p (0, 1) the initial capital c0 and the asset strategy
should be chosen such that
Pc0 [A1 L1 ] = Pc0 [L1 A1 0] 1 p.

(10.2)

no

This means that we need to choose the initial capital c0 and the asset strategy, which
maps value A0 at time 0 to value A1 at time 1, such that the (given stochastic)
liabilities L1 can be covered with large probability at time 1. Note that A1 and L1
are, in general, not independent.

NL

 Step 2 (risk measure). The no ruin condition in (10.2) is described under the
Value-at-Risk risk measure VaR1p (L1 A1 ) on security level 1 p (0, 1), see
Example 6.25. Assume we have a normalized, monotone and translation invariant
risk measure %, see (6.12), then more generally
the initial capital c0 and the asset strategy should be chosen such that
% (L1 A1 ) 0.

(10.3)

Solvency II uses the VaR risk measure on the 1 p = 99.5% security level and the
Swiss Solvency Test (SST) uses the TVaR risk measure on the 1p = 99% security
level, see also Examples 6.25, 6.20 and 6.26. The main aspect is now concerned
with the stochastic modeling of position L1 A1 .

Version April 14, 2016, M.V. Wthrich, ETH Zurich

266

Chapter 10. Solvency Considerations

w)

 Step 3 (market(-consistent) values). The main difficulty is the stochastic modeling of L1 A1 . Some positions in this difference are traded at active financial
markets. For these positions we need to stochastically model their market prices
at time 1 (viewed from time 0). However, most positions (on the liability) side
of the balance sheet are not traded at active markets. For these positions we
need to determine market-consistent values of their stochastic developments in a
marked-to-model approach, see also Happ et al. [59]. Let us explain the rationale
behind this with the liabilities L1 at hand and using the claims reserving context
of Chapter 9.
Assume we can split the liabilities L1 into two elements:

(m

(i) payments X1 done at time 1 (similar to Section 9.1 we map all payments in
accounting year [1, 2) = [1/1/1, 31/12/1] to its endpoint);
(ii) outstanding loss liabilities L+
1 at time 1 (at the end of accounting year 1).
The liabilities at time 1 are then given by

tes

L1 = X 1 + L+
1.

no

The easier part is the modeling of X1 . We need to find a stochastic model that is
able to predict the payments X1 and capture the dependencies with A1 and L+
1.
+
The more complicated part is L1 . This amount should reflect a market-consistent
value for the outstanding loss liabilities at time 1. Observe that it differs from the
best-estimate reserves R(1) given in (9.28) in two crucial ways:
(1) The best-estimate reserves R(1) were calculated on a nominal basis, i.e. the
time value of money was not considered because no discounting was applied
to R(1) .

NL

(2) The best-estimate reserves R(1) are conditional expectations, conditioned on


the information F1 . That is, these are expected payouts and we should add
a (risk, market-value) margin/loading to obtain market-consistent values.
Otherwise risk averse financial agents are not willing to do the run-off of
these liabilities at price L+
1 , see Chapter 6 and Happ et al. [59].
The aim in these two items (1) and (2) is motivated by the fact that L+
1 should
reflect a price at which another insurance company is willing to take over the liabilities at time 1 and to complete the run-off of the outstanding loss liabilities (reflected
by an appropriated marked-to-model approach price, sometimes also called transfer
value).
 Step 4 (acceptability and solvency). As described above, we have three building
blocks A1 , X1 and L+
1 that we need to model stochastically (note that these building
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 10. Solvency Considerations

267

blocks are not independent). In the last step, we need to evaluate risk measure
condition (10.3). If this condition is fulfilled we have an acceptable balance sheet
and the company is solvent at time 0 w.r.t. the chosen risk measure %. If (10.3)
is not fulfilled we have an unacceptable balance sheet and it needs to be modified
to achieve solvency. Options for modification are the following: change the asset
strategy so that it matches better the liabilities; reduce liabilities and mitigate
uncertainties in liabilities (if possible); inject more initial capital c0 .

w)

In the remainder of this chapter we discuss the modeling of the asset deficit at time
t = 1, where the asset deficit is for t N0 defined by
def.

ADt = Lt At = Xt + L+
t At .

(10.4)

(m

Thus, the insurance company is solvent at time 0 (w.r.t. the risk measure %) if
% (AD1 ) = % (L1 A1 ) 0.

Risk modules

tes

10.2

NL

no

Typically the modeling of the asset deficit AD1 at time t = 1, defined in (10.4), is
split into different modules that reflect different risk classes. In a first step each
risk class is studied individually and in a second step the results are aggregated to
obtain the overall picture.

Figure 10.1: lhs: Swiss Solvency Test risk modules; rhs: Solvency II risk modules
(sources [26] and [44]).
One may question whether this modeling approach is smart. Modeling individual risk classes may still be fine, but aggregation of risk classes is rather nonstraightforward because it is very difficult to capture the interaction between the
different risk classes. Nevertheless we would like to describe the approach used in
practice (and also the short cuts applied).
Version April 14, 2016, M.V. Wthrich, ETH Zurich

268

Chapter 10. Solvency Considerations

In Figure 10.1 we show the individual risk modules used in the Swiss Solvency Test
[26] and in Solvency II [44]. Overall they are rather similar though some differences
exist. Often one considers the following 4 risk classes that are driven by the risk
factors that we will just describe:

w)

1. Market risk. We cite SCR.5.1. of QIS5 [44]: Market risk arises from the level
or volatility of market prices of financial instruments. Exposure to market
risk is measured by the impact of movements in the level of financial variables
such as stock prices, interest rates, real estate prices and exchange rates.

(m

2. Insurance risk. Insurance risk is typically split into the different insurance branches: non-life insurance, life insurance, health insurance and reinsurance. Here we concentrate on non-life insurance risk. This is further
subdivided into (i) reserve risk which describes outstanding loss liabilities of
past exposure claims; and (ii) premium risk which describes the risk deriving
from newly sold contracts that give an exposure over the next accounting period. Additionally, there is often an annuity portfolio deriving from liability
insurance covering disability claims of third party.

no

tes

3. Credit risk. We cite SCR.6.1. of QIS5 [44]: The counterparty default risk
module should reflect possible losses due to unexpected default, or deterioration in the credit standing, of the counterparties and debtors of undertakings
over the forthcoming twelve months. The scope of the counterparty default
risk module includes risk-mitigating contracts, such as reinsurance arrangements, securitisations and derivatives, and receivables from intermediaries,
as well as any other credit exposures which are not covered in the spread risk
sub-module.

NL

4. Operational risk. We cite SCR.3.1. of QIS5 [44]: Operational risk is the risk
of loss arising from inadequate or failed internal processes, or from personnel
and systems, or from external events. Operational risk should include legal
risks, and exclude risks arising from strategic decisions, as well as reputation
risks. The operational risk module is designed to address operational risks to
the extent that these have not been explicitly covered in other risk modules.
Let us formalize these risk factors and classes. Therefore, we first consider the
beginning of accounting year 1. At time t = 0 the asset deficit is given by
AD0 = L0 A0 .
+
We assume that X0 = 0 which implies that L0 = L+
0 , thus, L0 is the value of all
liabilities that need to be settled after t = 0. For simplification we assume that
the liabilities consist of insurance liabilities only. In this case, L+
0 describes the
liabilities stemming from claims with accident date prior to t = 0 (these are the
liabilities of past exposure claims; we denote them by previous year (PY) claims,

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 10. Solvency Considerations

269

see also Chapter 9), and of claims with accident date in accounting year 1 (these
are all liabilities of the new premium exposure if we assume one-year contracts
only; we denote them by current year (CY) claims). Summarizing, this implies on
the liability side of the balance sheet at time t = 0 (with the obvious notation)
PY
CY
L0 = L+
0 = L0 + L0 .

On the asset side of the balance sheet we have (this is also a simplified version)

w)

CY
,
A0 = c0 + APY
0 +

L1 = X1 + L+
1 =

(m

CY
where APY
are the provisions to cover the PY liabilities LPY
is the premium
0
0 ,
CY
received for the CY claims L0 and c0 is the initial capital. As described above,
this amount A0 is invested at the financial market and provides value A1 at time
t = 1. This value needs to be compared to

X1PY + X1CY + X1Op + L+,PY


+ L+,CY
,
1
1

AD1 =

no

tes

where X1PY are the payments for PY claims, X1CY are the payments for CY claims,
L+,PY
is the value of the outstanding loss liabilities at time t = 1 for claims with
1
is the value of the outstanding loss liabilities
accident year prior to t = 0, and L+,CY
1
at time t = 1 for CY claims (i.e. accident date in year 1). Thus, if we merge these
+,PY
we obtain the new outstanding loss liabilities for
+ L+,CY
two values L+
1
1 = L1
past exposure claims with accident date prior to t = 1. Finally, X1Op denotes
the operational risk loss payment where, for simplicity, we assume that this can
immediately be settled. We conclude that the asset deficit at time 1 is given by


X1PY + X1CY + X1Op + L+,PY


+ L+,CY
A1
1
1


X1PY + L+,PY
+ X1CY + L+,CY
+ X1Op A1 .
1
1

(10.5)
(10.6)

NL

Let us comment on (10.5)-(10.6).


Formula (10.5) gives the split into payments and outstanding loss liabilities.
This view is crucial for doing asset-and-liability management, i.e. to compare
the structure of the asset portfolio to the maturities of the liabilities.
Formula (10.6) provides the split into PY risk and CY risk. The PY risk is
mainly described by the claims development result described in Section 9.4.
The CY risk is described by a compound distribution as, for instance, seen
in Example 4.11. However, both these descriptions only consider nominal
claims and in order to get values we still need to add time values for cash
flow payments and a risk margin for bearing the run-off risks. Therefore,
these values also depend on financial market movements. This second view is
important for profitability analysis because it allows to match liabilities with
the corresponding insurance premium.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

270

Chapter 10. Solvency Considerations

Coming back to the risk modules: market risk affects all variables in (10.5);
; credit risk
, X1CY and L+,CY
insurance risk is mainly reflected in X1PY , L+,PY
1
1
is a main risk driver in A0 (if we assume that liabilities are considered before
re-insurance is applied (gross)); and operational risk is reflected in X1Op .
In the remainder we concentrate on the modeling of insurance liabilities.

Insurance liability variables

10.3.1

Market-consistent values

w)

10.3

We still face the difficulty of attaching market-consistent values to the insurance


liabilities which provides value at time t = 1 given by
def.

(m

+ X1CY + L+,CY
.
LIns
= X1PY + L+,PY
1
1
1

tes

Op
Note that in our terminology L1 = LIns
1 + X1 . Assume that X = (X1 , . . . , Xn )
denotes the (random) cash flow that is generated by the insurance liabilities, see
also (9.1). We assume that this cash flow is adapted to the filtration F = (Fs )s1 .
In analogy to Wthrich-Merz [100] we need to choose an appropriate (state price)
deflator = (1 , . . . , n ) (which is F-adapted and strictly positive, P-a.s.) and
then
1 X
1 X
LIns
=
E [ s Xs | F1 ] = X1 +
E [ s Xs | F1 ]
1
1 s1
1 s2

X
1
E [ s | F1 ] E [ Xs | F1 ] =
P (1, s) E [ Xs | F1 ]
s1 1
s1
X

NL

LIns
=
1

no

provides a market-consistent value in an arbitrage-free pricing system described by


the triple (P, F, ). For a one-period problem this was already described in Section
6.2.5. Under the assumption of uncorrelatedness of s and Xs , conditionally given
F1 , we can rewrite the market-consistent value of the insurance liabilities as

= X1PY + X1CY +

(10.7)

P (1, s) E [ Xs | F1 ] ,

s2

where P (1, s) denotes the price at time 1 of the zero-coupon bond that matures
at time s 2 (and P (1, 1) = 1). Note that viewed from time 0 both P (1, s) and
E [ Xs | F1 ] are F1 -measurable random variables in (10.7) and (expected) insurance
cash flows are adjusted for time value of money.
Under all the previous assumptions (in particular the uncorrelatedness assumption
(10.7)) the acceptability requirement (10.3) reads as:
The initial capital c0 and the asset strategy should be chosen such that

%(AD1 ) = % X1PY + X1CY + X1Op +

P (1, s) E [ Xs | F1 ] A1 0.

s2

Version April 14, 2016, M.V. Wthrich, ETH Zurich

(10.8)

Chapter 10. Solvency Considerations

271

Since the asset deficit still has a rather involved form the model is further simplified.
Denote the expected values
p(1, s) = E [ P (1, s)| F0 ]

xs = E [ Xs | F0 ] .

and

Then, we use the following linear approximation


P (1, s) E [ Xs | F1 ] p(1, s)xs + (P (1, s) p(1, s)) xs + p(1, s) (E [ Xs | F1 ] xs ) .

Z1 =

(m

w)

The first term p(1, s)xs is the expected value (viewed from time 0) of the time-1price P (1, s)E [ Xs | F1 ]. The term (P (1, s) p(1, s)) xs coins uncertainties in financial discounting and p(1, s) (E [ Xs | F1 ] xs ) describes volatilities in the insurance
cash flows. The cross term of the uncertainties was dropped in this approximation. Typically, the above terms are assumed to be independent so that they can
be studied individually and aggregation is obtained by simply convoluting their
marginal distributions.
This approximation implies that for (10.8) we study the following three terms
p(1, s)xs + (P (1, s) p(1, s)) xs A1 ,

Z2 =
Z3 =

tes

s1

p(1, s) (E [ Xs | F1 ] xs ) ,

s1
X1Op .

10.3.2

NL

no

Z1 describes market and credit risks, Z2 describes insurance risk and Z3 describes
operational risk. In non-life insurance one often assumes that these three random variables are independent (which may be problematic in particular w.r.t. reinsurance).
In the remainder of this chapter we describe the insurance liability variable Z2 . For
the other terms we refer to the related solvency literature QIS5 [44], Swiss Solvency
Test [46] and Wthrich-Merz [101].

Insurance risk

We study insurance risk given by


Z2 =

p(1, s) (E [ Xs | F1 ] xs ) .

s1

As already mentioned the insurance variables are separated into PY variables and
CY variables w.r.t. the valuation date t = 0. This provides the split
Z2

=
def.

Z2PY + Z2CY
X
s1

p(1, s) E XsPY F1 xPY


+
s

p(1, s) E XsCY F1 xCY


.
s

s1

Version April 14, 2016, M.V. Wthrich, ETH Zurich

272

Chapter 10. Solvency Considerations

The final simplification is that we assume that there are deterministic payout patterns (sPY )s1 and (sCY )s1 , for instance, obtained by the CL method, see Theorem 9.11 (and the estimation errors in these patterns are neglected). Then the last
expressions can be modified to

Z2PY =

p(1, s)sPY X1PY + R(1) R(0) ,

s1

p(1, s)sCY [E [ S1 | F1 ] E [ S1 | F0 ]] .

w)

Z2CY =

X
s1

Claims development result

tes

(m

The first line Z2PY reflects the study of the claims development result, see (9.29).
The second line Z2CY describes the total nominal claim S1 of accident year 1 that
is caused by the premium exposure CY . The terms in the round brackets are the
deterministic discount factors that respect the underlying maturities of the cash
flows; the terms in the square brackets are the random terms that need further
modeling and analysis.

The claims development result for PY claims, given by


h

no

CDR1 = X1PY + R(1) R(0) ,

has expected value 0, see Corollary 9.13, if the claims reserves are defined by
conditional expectations in a Bayesian model. Therefore, there remains the study
of higher moments. In practice, one restricts to the second moment:

NL

Calculate for every line of business the conditional MSEP of the claims development result prediction, for instance using MW formula (9.35). This
provides a variance estimate for every line of business.
Specify a correlation matrix between the different lines of business, see for
instance SCR.9.34. in QIS5 [44].
The previous two items allow to aggregate the uncertainties of the individual
lines of business to obtain the overall variance over the sum of all lines of
business.
Fit a translated gamma or log-normal distribution to these first two moments assuming that the mean is exactly given by R(0) . This provides an
approximation to the distribution of CDR1 .
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Chapter 10. Solvency Considerations

273

Premium liability risk

(m

w)

The claim E [ S1 | F1 ] resulting from the premium exposure CY is split into two
independent random variables Ssc and Slc , where Ssc reflects all small claims below
a given threshold M and Slc the claims above that threshold, see Examples 2.16
and 4.11.
The large claim Ssc is modeled per line of business (or per peril) by independent
compound Poisson distributions with Pareto claims severities and aggregation is
done using the aggregation Theorem 2.12 resulting in a compound Poisson distribution. The latter can be determined, for instance, with the Panjer algorithm,
see Theorem 4.9, or the fast Fourier transform FFT, see Section 4.2.2.
The small claim Ssc is treated similarly to the claims development result, i.e. estimate per line of business the first two moments. Aggregate these moments using
an appropriate correlation matrix, see for instance Section 8.4.2 in the technical
Swiss Solvency Test document [46], and fit a gamma or a log-normal distribution
to this first two moments.
Remarks.

no

tes

In the Swiss Solvency Test one distinguishes between pure process risk and
parameter uncertainty for the small claims layer, too. Process risk is diversifiable with increasing volume, whereas parameter uncertainty is not. As a
result the coefficient of variation per line of business has a similar form as has
been found for the compound negative-binomial distribution, see Proposition
2.24. That is, for volume v the coefficient of variation does not vanish
but stays strictly positive.

NL

In the Swiss Solvency Test one aggregates in addition so-called scenarios. The
motivation for this is that the present model cannot reflect all uncertainties
and therefore it is disturbed by scenarios. Basically, these scenarios are claims
of Bernoulli type, i.e. they occur with a certain probability and if they occur
they have a given amount.
For the aggregation between PY and CY claims it is either assumed that they
are independent, or that the claims development result uncertainty CDR1 and
the small claim CY Ssc are again aggregated via a correlation matrix and then
an overall distribution is fitted to the resulting first two moments.
In summary we see that many approximations are used (as described above)
and, also crucially, that aggregation is done over correlation matrices. The
latter may be quite problematic because correlations typically also depend on
underlying volumes which is neglected in actual solvency implementations.
Therefore, this needs to be revised carefully in each individual case.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

274

Chapter 10. Solvency Considerations

Market-value margin

NL

no

tes

(m

w)

The careful reader will have noticed that we have lost the risk margin somewhere
on the way to the final result. We will not further discuss the risk and marketvalue margin here, we only want to mention that the current calculation of the
market-value margin is quite ad-hoc, see Chapter 6 in Swiss Solvency Test [46] and
Section 10.3 in Wthrich-Merz [100], and further refinements are necessary. The
crucial point is that the conditional uncorrelatedness in (10.7) does not hold true
in general, see Wthrich-Merz [100], and for a more general discussion we also refer
to Happ et al. [59] and Wthrich [98].

Version April 14, 2016, M.V. Wthrich, ETH Zurich

w)

Appendix
Derivations from Gaussian distributions
Assume Z0 , Z1 , . . .

i.i.d.

N (0, 1). We can derive the following distributions.

Xk =

k
X
i=1

(m

 2 -distribution. Define for k N the random variable


Zi2 .

f (x) =

tes

Xk has a 2 -distribution with k degrees of freedom, see Example 2 on page 22. Its
density is given by
1
xk/21 exp {x/2}
2k/2 (k/2)

for x 0,

no

and the corresponding moment generating function is


MXk (r) = (1 2r)k/2

for r < 1/2.

Moreover, we have E[Xk ] = k and Var(Xk ) = 2k.

NL

 t-distribution. Define for k N the random variable


Xk = q P
k

Z0

i=1

Zi2 /k

Xk has a t-distribution with k degrees of freedom. Its density is given by


((k + 1)/2)
f (x) =
(1 + x2 /k)(k+1)/2
k (k/2)

for x R.

The moment generating function MXk (r) does not exist for r > 0, and we have
E[Xk ] = 0, for k > 1, and Var(Xk ) = k/(k 2), for k > 2.
 F -distribution. Define for k, m N the random variable
Pk

Zi2 /k
.
2
i=k+1 Zi /m

i=1
Xk,m = Pk+m

275

276

Chapter 10. Solvency Considerations

Xk,m has an F -distribution with k, m degrees of freedom. Its density is given by


f (x) =

((k + m)/2) k/2 m/2 k/21


k m
x
(m + kx)(k+m)/2
(k/2)(m/2)

for x 0.

The moment generating function MXk,m (r) does not exist for r > 0, and we have
E[Xk,m ] = m/(m 2), for m > 2.
i.i.d.

k
1 X
Z =
Zj
k j=1

and

S2 =

w)

Lemma A.1. Assume Z1 , . . . , Zk N (, 2 ). Define the sample estimators


k 
2
X
1
Zj Z .
k 1 j=1

Z N (, 2 /k)

and

(m

Then Z and S 2 are independent with


(d)

S2 =

2
2 ,
k 1 k1

where 2k1 is 2 -distributed with k 1 degrees of freedom.

tes

The previous lemma implies that


T =



k Z

NL

no

has a t-distribution with k 1 degrees of freedom.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Bibliography

w)

[1] Acerbi, C., Tasche, D. (2002). On the coherence of expected shortfall. Journal
Banking and Finance 26/7, 1487-1503.
[2] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19/6, 716-723.

(m

[3] Alai, D.H., Merz, M., Wthrich, M.V. (2009). Mean square error of prediction in
the Bornhuetter-Ferguson claims reserving method. Annals of Actuarial Science
4/1, 7-31.

tes

[4] Alai, D.H., Merz, M., Wthrich, M.V. (2010). Prediction uncertainty in the
Bornhuetter-Ferguson claims reserving method: revisited. Annals of Actuarial Science 5/1, 7-17.
[5] Artzner, P., Delbaen, F., Eber, J.M., Heath, D. (1997). Thinking coherently. Risk
10/11, 68-71.

no

[6] Artzner, P., Delbaen, F., Eber, J.M., Heath, D. (1999). Coherent measures of risk.
Mathematical Finance 9/3, 203-228.
[7] Asmussen, S., Albrecher, H. (2010). Ruin Probabilities. 2nd edition. World Scientific.

NL

[8] Bahr, von B. (1975). Asymptotic ruin probabilities when exponential moments do
not exist. Scandinavian Actuarial Journal 1975, 6-10.
[9] Bailey, R.A. (1963). Insurance rates with minimum bias. Proceedings CAS 50, 4-11.
[10] Bailey, R.A., Simon, L.J. (1960). Two studies on automobile insurance ratemaking.
ASTIN Bulletin 1, 192-217.
[11] Bichsel, F. (1964). Erfahrungstarifierung in der Motorfahrzeug-HaftpflichtVersicherung. Bulletin of the Swiss Association of Actuaries 1964, 119-130.
[12] Billingsley, P. (1968). Probability and Measure. Wiley.
[13] Billingsley, P. (1995). Probability and Measure. 3rd edition. Wiley.
[14] Boland, P.J. (2007). Statistical and Probabilistic Methods in Actuarial Science.
Chapman & Hall/CRC.
[15] Bolthausen, E., Wthrich, M.V. (2013). Bernoullis law of large numbers. ASTIN
Bulletin 43/2, 73-79.

277

278

Bibliography

[16] Bornhuetter, R.L., Ferguson, R.E. (1972). The actuary and IBNR. Proceedings
CAS 59, 181-195.
[17] Boyd, S., Vandenberghe, L. (2004). Convex Optimization. Cambridge University
Press.
[18] Bhlmann, H. (1970). Mathematical Methods in Risk Theory. Springer.
[19] Bhlmann, H. (1980). An economic premium principle. ASTIN Bulletin 11/1, 5260.

w)

[20] Bhlmann, H. (1992). Stochastic discounting. Insurance: Mathematics and Economics 11/2, 113-127.

(m

[21] Bhlmann, H. (1995). Life insurance with stochastic interest rates. In: Financial
Risk in Insurance, G. Ottaviani (ed.), Springer, 1-24.
[22] Bhlmann, H. (2004). Multidimensional valuation. Finance 25, 15-29.
[23] Bhlmann, H., De Felice, M., Gisler, A., Moriconi, F., Wthrich, M.V. (2009).
Recursive credibility formula for chain ladder factors and the claims development
result. ASTIN Bulletin 39/1, 275-306.

tes

[24] Bhlmann, H., Gisler, A. (2005). A Course in Credibility Theory and its Applications. Springer.
[25] Bhlmann, H., Straub, E. (1970). Glaubwrdigkeit fr Schadenstze. Bulletin of
the Swiss Association of Actuaries 1970, 111-131.

no

[26] Bundesamt fr Privatversicherungen (2004). Weissbuch des Schweizer Solvenztests.


November 2004.

[27] Cern,
A. (2006). Introduction to fast Fourier transform in finance. SSRN
manuscript ID 559416.

NL

[28] Congdon, P. (2006). Bayesian Statistical Modelling. 2nd edition. Wiley.


[29] Cramr, H. (1930). On the Mathematical Theory of Risk. Skandia Jubilee Volume,
Stockholm.
[30] Cramr, H. (1955). Collective Risk Theory. Skandia Jubilee Volume, Stockholm.
[31] Cramr, H. (1994). Collected Works. Volumes I & II. Edited by A. Martin-Lf.
Springer.
[32] Delbaen, F. (2000). Coherent Risk Measures. Cattedra Galileiana. Pisa.
[33] Delbaen, F., Schachermayer, W. (1994). A general version of the fundamental theorem of asset pricing. Mathematische Annalen 300, 463-520.
[34] Denneberg, D. (1989). Verzerrte Wahrscheinlichkeiten in der Versicherungsmathematik, quantilabhngige Prmienprinzipien. Mathematik Arbeitspapiere 34, University of Bremen.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Bibliography

279

[35] Denuit, M., Marchal, X., Pitrebois, S., Walhin, J.-F. (2007). Actuarial Modelling
of Claims Count. Wiley.
[36] Dickson, D.C.M. (2005). Insurance Risk and Ruin. Cambridge University Press.
[37] Duffie, D. (2001). Dynamic Asset Pricing Theory. 3rd edition. Princeton University
Press.

w)

[38] Embrechts, P., Frei, M. (2009). Panjer recursion versus FFT for compound distributions. Mathematical Methods of Operations Research 69/3, 497-508.
[39] Embrechts, P., Klppelberg, C., Mikosch, T. (2003). Modelling Extremal Events
for Insurance and Finance. 4th printing. Springer.

(m

[40] Embrechts, P., Nelehov, J., Wthrich, M.V. (2009). Additivity properties for
Value-at-Risk under Archimedean dependence and heavy-tailedness. Insurance:
Mathematics and Economics 44/2, 164-169.
[41] Embrechts, P., Veraverbeke, N. (1982). Estimates for the probability of ruin with
special emphasis on the possibility of large claims. Insurance: Mathematics and
Economics 1/1, 55-72.

tes

[42] England, P.D., Verrall, R.J. (2002). Stochastic claims reserving in general insurance. British Actuarial Journal 8/3, 443-518.

no

[43] England, P.D., Verrall, R.J., Wthrich, M.V. (2012). Bayesian overdispersed Poisson model and the Bornhuetter-Ferguson claims reserving method. Annals of Actuarial Science 6/2, 258-283.
[44] European Commission (2010). QIS 5 Technical Specifications, Annex to Call for
Advice from CEIOPS on QIS5.

NL

[45] Feller, W. (1966). An Introduction to Probability Theory and its Applications. Volume II. Wiley.
[46] FINMA (2006). Swiss Solvency Test. FINMA SST Technisches Dokument, Version
2. October 2006.
[47] Fllmer, H., Schied, A. (2004). Stochastic Finance, An Introduction in Discrete
Time. 2nd edition. de Gruyter.
[48] Fortuin, C.M., Kasteleyn, P.W., Ginibre, J. (1971). Correlation inequalities on
some partially ordered sets. Communication Mathematical Physics 22/2, 89-103.
[49] Frees, E.W. (2010). Regression Modeling with Actuarial and Financial Applications.
Cambridge University Press.
[50] Fringeli, M. (2005). Credibility fr Probleme mit rumlicher Abhngigkeit. Diploma
Thesis, ETH Zurich.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

280

Bibliography

[51] Garcia Ben, M., Yohai, V.J. (2004). Quantile-quantile plot for deviance residuals
in the generalized linear model. Journal of Computational and Graphical Statistics
13/1, 36-47.
[52] Gesmann, M., Murphy, D., Zhang, W., Carrato, A., Crupi G., Wthrich, M.V.
(2015). ChainLadder: statistical methods and models for the calculation of outstanding claims reserves in general insurance. R package version 0.2.0.

w)

[53] Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo
in Practice. Chapman & Hall.
[54] Gisler, A. (2011). Nicht-Leben Versicherungsmathematik. Lecture Notes, ETH
Zurich.

(m

[55] Gisler, A., Wthrich, M.V. (2008). Credibility for the chain ladder reserving
method. ASTIN Bulletin 38/2, 565-600.
[56] Green, P.J. (1995). Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination. Biometrika 82/4, 711-732.

tes

[57] Green, P.J. (2003). Trans-dimensional Markov chain Monte Carlo. In: Highly Structured Stochastic Systems, P.J. Green, N.L. Hjort, S. Richardson (eds.), Oxford
Statistical Science Series, 179-206. Oxford University Press.
[58] Hachemeister, C.A., Stanard, J.N. (1975). IBNR claims count estimation with
static lag functions. ASTIN Colloquium 1975, Portugal.

no

[59] Happ, S., Merz, M., Wthrich, M.V. (2015). Best-estimate claims reserves in incomplete markets. European Actuarial Journal 5/1, 55-77.
[60] Hofert, M., Wthrich, M.V. (2013). Statistical review of nuclear power accidents.
Asia-Pacific Journal of Risk and Insurance 7/1, Article 1.

NL

[61] Johansen, A.M., Evers, L., Whiteley, N. (2010). Monte Carlo Methods. Lecture
Notes, Department of Mathematics, University of Bristol.
[62] Johnson, R.A., Wichern, D.W. (1998). Applied Multivariate Statistical Analysis.
4th edition. Prentice-Hall.
[63] Jung, J. (1968). On automobile insurance ratemaking. ASTIN Bulletin 5, 41-48.
[64] Kaas, R., Goovaerts, M., Dhaene, J., Denuit, M. (2008). Modern Actuarial Risk
Theory, Using R. 2nd edition. Springer.
[65] Kehlmann, D. (2005). Die Vermessung der Welt. Rowohlt Verlag.
[66] Kolmogoroff, A. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer.
[67] Kremer, E. (1985). Einfhrung in die Versicherungsmathematik. Vandenhoek &
Ruprecht, Gttingen.
[68] Kyprianou, A. (2014). Gerber-Shiu Risk Theory. Springer.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

Bibliography

281

[69] Laplace, P.S. (1812). Thorie analytique des probabilits. Suppl. to 3rd edition
Courcier, Paris 1820.
[70] Lehmann, E.L. (1983). Theory of Point Estimation. Wiley.
[71] Lundberg, F. (1903). Approximerad framstllning av sannolikhetsfunktionen. terfrskering av kolletivrisker. Almqvist & Wiksell, Uppsala.
[72] Mack, T. (1991). A simple parametric model for rating automobile insurance or
estimating IBNR claims reserves. ASTIN Bulletin 21/1, 93-109.

w)

[73] Mack, T. (1993). Distribution-free calculation of the standard error of chain ladder
reserve estimates. ASTIN Bulletin 23/2, 213-225.

(m

[74] Mack, T. (2008). The prediction error of Bornhuetter/Ferguson. ASTIN Bulletin


38/1, 87-103.
[75] McCullagh, P., Nelder, J.A. (1983). Generalized Linear Models. Chapman & Hall.
[76] McGrayne, S.B. (2011). The Theory That Would Not Die. Yale University Press.

tes

[77] McNeil, A.J., Frey, R., Embrechts, P. (2005). Quantitative Risk Management: Concepts, Techniques and Tools. Princeton University Press.
[78] Merz, M., Wthrich, M.V. (2008). Modelling the claims development result for
solvency purposes. CAS E-Forum Fall 2008, 542-568.

no

[79] Merz, M., Wthrich, M.V. (2013). Mathematik fr Wirtschaftswissenschaftler.


Vahlen.
[80] Merz, M., Wthrich, M.V. (2014). Claims run-off uncertainty: the full picture.
SSRN manuscript ID 2524352.
[81] Mikosch, T. (2006). Non-Life Insurance Mathematics. Springer.

NL

[82] Ohlsson, E., Johansson, B. (2010). Non-Life Insurance Pricing with Generalized
Linear Models. Springer.
[83] Panjer, H.H. (1981). Recursive evaluation of a family of compound distributions.
ASTIN Bulletin 12/1, 22-26.
[84] Panjer, H.H. (2006). Operational Risk: Modeling Analytics. Wiley.
[85] Renshaw, A.E., Verrall, R.J. (1998). A stochastic model underlying the chainladder technique. British Actuarial Journal 4/4, 903-923.
[86] Resnick, S.I. (1997). Heavy tail modeling of teletraffic data. Annals of Statistics
25/5, 1805-1869.
[87] Resnick, S.I. (2002). Adventures in Stochastic Processes. 3rd printing. Birkhuser.
[88] Robert, C.P. (2001). The Bayesian Choice. 2nd edition. Springer.
Version April 14, 2016, M.V. Wthrich, ETH Zurich

282

Bibliography

[89] Rolski, T., Schmidli, H., Schmidt, V., Teugels, J. (1999). Stochastic Processes for
Insurance and Finance. Wiley.
[90] Saluz, A., Gisler, A., Wthrich, M.V. (2011). Development pattern and prediction error for the stochastic Bornhuetter-Ferguson claims reserving model. ASTIN
Bulletin 41/2, 279-317.
[91] Schmidli, H. (2007). Risk Theory. Lecture Notes, University of Cologne.

w)

[92] Schweizer, M. (2009). Stochastic Processes and Stochastic Analysis. Lecture Notes,
ETH Zurich.
[93] Smith, A., Thaper, S. (2014). Making uncertainty explicit: stochastic modelling.
Actuarial Post, February 12, 2014, 12-15.

(m

[94] Sovacool, B.K. (2008). The costs of failure: a preliminary assessment of major
energy accidents, 19072007. Energy Policy 36/5, 1802-1820.
[95] Sundt, B., Jewell, W.S. (1981). Further results of recursive evaluation of compound
distributions. ASTIN Bulletin 12/1, 27-39.

tes

[96] Tsanakas, A., Christofides, N. (2006). Risk exchange with distorted probabilities.
ASTIN Bulletin 36/1, 219-243.
[97] Williams, D. (1991). Probability with Martingales. Cambridge University Press.

no

[98] Wthrich, M.V. (2013). From ruin theory to solvency in non-life insurance. To
appear in Scandinavian Actuarial Journal.
[99] Wthrich, M.V., Bhlmann, H., Furrer, H. (2010). Market-Consistent Actuarial
Valuation. 2nd edition. Springer.
[100] Wthrich, M.V., Merz, M. (2008). Stochastic Claims Reserving Methods in Insurance. Wiley.

NL

[101] Wthrich, M.V., Merz, M. (2013). Financial Modeling, Actuarial Valuation and
Solvency in Insurance. Springer.
[102] Wthrich, M.V., Merz, M. (2015). Stochastic claims reserving manual: advances
in dynamic modeling. SSRN Manuscript ID 2649057.

Version April 14, 2016, M.V. Wthrich, ETH Zurich

(m
tes

NL

no

Exercise 1, page 18
Exercise 2, page 22
Corollary 2.7, page 28
Exercise 3, page 28
Exercise 4, page 40
Exercise 5, page 51
Exercise 6, page 51
Exercise 7, page 60
Exercise 8, page 78
Exercise 9, page 84
Exercise 10, page 90
Exercise 11, page 90
Exercise 12, page 98
Exercise 13, page 143
Corollary 6.6, page 149
Exercise 14, page 150
Exercise 15, page 153
Exercise 16, page 154
Exercise 17, page 155
Exercise 18, page 158
Exercise 19, page 159
Exercise 20, page 164
Exercise 21, page 181
Exercise 22, page 191
Exercise 23, page 221
Exercise 24, page 221
Exercise 25, page 222
Exercise 26, page 257

w)

List of exercises

283

Index

NL

(m

no

tes

absolutely continuous distribution, 15


acceptable, 161, 266
accident
date, 226
year, 228
AD (asset deficit), 267
AD test, 82
adjustment coefficient, 125
admissible, 32
age-to-age factor, 232
aggregation property, 31
AIC, 83
Akaike information criterion, 83
Akaike, Hirotugu, 83
alternative hypothesis, 21
Anderson, Theodore Wilbur, 82
Anderson-Darling test, 82
approximation
Edgeworth, 100
normal, 94
translated gamma, 97
translated log-normal, 97
arbitrage-free pricing, 270
asset deficit, 267

Bayes, Thomas, 202


Bayesian
inference, 41, 201
Bayesian CL
factor, 240
predictor, 241
Bayesian information criterion, 83
Bernoulli
distribution, 26
experiment, 26
random walk, 127
Bernoulli, Jakob, 12
best-estimate reserves, 231
BF
method, 232, 236
reserves, 236
BIC, 83
Bichsel, Fritz, 203
binary variable, 177
binomial distribution, 26, 183
definition, 26
moments, 27
Bornhuetter, Ronald, 236
Bornhuetter-Ferguson method, 232, 236
BS model, 213

w)

F -distribution, 180
2 -distribution, 22, 62
2 -goodness-of-fit test, 49, 83
p-value, 22

Bhlmann, Hans, 154, 212


Bhlmann-Straub model, 213
Bahr, von Bengt, 139
Bailey, Robert A., 170
balance sheet, 264
Bayes rule, 203

CARA utility function, 147


categorical variable, 177
CDR, 250, 272
uncertainty, 255
central limit theorem, 13, 94
chain-ladder method, 232
chain-ladder model
distribution-free, 238
Chebychevs inequality, 127
Chebychev, Pafnuty Lvovich, 17
284

Index

285

NL

no

tes

(m

w)

decomposition property, 33
chi-square distribution, 22, 62
chi-square-goodness-of-fit test, 49, 83
definition, 30
CL factor, 232
moments, 30
Bayes, 240
concave, 144
estimate, 240
conditional tail expectation, 159
CL method, 232
conjugate prior, 208
CL model
constant absolute risk-aversion, 147
gamma-gamma Bayes, 239
constant relative risk-aversion, 147
MSEP, 242
continuous variable, 177
CL reserves, 233
convergence in distribution, 17
claims
convex cone, 161
counts, 23
convolution, 25
frequency, 26
cost-of-capital, 160, 163
claims development
rate, 160, 164
result, 250, 272
Cramr, Harald, 121
triangle, 229
Cramr-Lundberg process, 121
claims inflation, 89
credibility coefficient, 218, 241
claims reserves, 230, 231
credibility estimator, 209
claims reserving, 225
homogeneous, 214
algorithm, 232
inhomogeneous, 214
method stochastic, 237
credibility weight, 201, 205, 208
closing date, 226
credit risk, 268, 271
CLT, 13, 94
CRRA utility function, 147
CoC, 160
CTE, 159
rate, 160
cumulant function, 182
coefficient of determination, 178
cumulant generating function, 19
coefficient of variation, 16, 58
current year claim, 269
coherent risk measure, 157, 162
CY claim, 269
collective mean, 213
CY risk, 272
collective risk model, 23
Darling, Donald Allan, 82
compound binomial distribution, 27
De Moivre, Abraham, 13, 94
definition, 27
decomposition property, 33
moments, 28
deductible, 88
compound distribution, 23
deflator, 165, 270
definition, 23
Delbaen, Freddy, 157
moments, 24
compound negative-binomial distribution, density, 15, 58
descending ladder epoch, 129
40
design matrix, 175
definition, 40
development year, 229
moments, 40
deviance statistics, 190
compound Poisson distribution, 30
aggregation property, 31
discrete distribution, 15
Version April 14, 2016, M.V. Wthrich, ETH Zurich

286

Index

w)

gamma distribution, 37, 83, 184


gamma-gamma Bayes CL model, 239
gamma-gamma model, 210
Gauss, Carl Friedrich, 18
Gaussian distribution, 17, 184
generalized inverse, 20
generalized linear model, 167, 182
Gerber, Hans-Ulrich, 121
Gerber-Shiu risk theory, 121
Gisler, Alois, 215
Glivenko-Cantelli theorem, 80, 93
GLM, 170, 182
Goldie, Charles M., 139
goodness-of-fit, 83, 177
happiness index, 144
heavy tailed, 135
Hill
estimator, 75
plot, 75
histogram, 55
homogeneous credibility estimator, 214

NL

no

tes

EDF, 182
Edgeworth approximation, 100
Edgeworth, Francis Ysidro, 100
Embrechts, Paul, 139
Embrechts-Veraverbeke theorem, 137
empirical
distribution function, 56
loss size index function, 56
mean excess function, 56
England, Peter D., 247
ES, 159
Esscher
measure, 154
premium, 154, 165
estimation error, 21
estimator, 21
expectation, 15
expected claims frequency, 26
expected shortfall, 158, 159, 163
expected value, 15, 58
expected value principle, 141
exponential dispersion family, 182, 208
exponential distribution, 62
exponential utility function, 147

Fourier, Jean Baptiste Joseph, 117


Frchet, Maurice, 64

(m

discretization, 108
disjoint decomposition, 32
property, 33
dispersion, 182
distortion function, 156
distribution function, 15
distribution-free CL model, 238
Duffie, James Darrell, 165

F-distribution, 180
fast Fourier transform, 116
Ferguson, Ronald E., 236
FFT, 116
finite horizon ruin probability, 122
first moment, 15
Fisher, Sir Ronald Aylmer, 45
Fourier transform
discrete, 117

i.i.d., 20
IBNYR, 227
incomplete gamma function, 59
independent and identically distributed,
20
individual claim size, 23, 53
informative prior, 206
inhomogeneous credibility estimator, 214
insurance risk, 268, 271
inverse Gaussian distribution, 63
inversion formula, 117
isoelastic utility function, 147
Jewell, William S., 106
Jung, Jan, 173
Khinchin, Aleksandr Yakovlevich, 132
Kolmogorov distribution, 80

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Index

287

NL

no

w)

tes

ladder
epoch, 129
height, 129
Laplace, Pierre-Simon, 13, 94, 202
large claims separation, 35
law of large numbers, 12
layer, 58, 86
leverage effect, 89
likelihood function, 45
likelihood ratio test, 179
linear credibility, 201, 212
link ratio, 232
LLN, 12
log-gamma distribution, 70
log-likelihood function, 45
log-linear model, 176
log-link function, 185
log-log plot, 57
log-normal distribution, 66
loss size index function, 56, 58
Lundberg
bound, 125, 126
coefficient, 125
Lundberg, Ernst Filip Oskar, 121
Lyapunov, Aleksandr Mikhailovich, 94

mean, 15, 58
mean excess function, 56, 58
mean excess plot, 57
mean square error of prediction
conditional, 237
Merz, Michael, 250
Merz-Wthrich formula, 255
method of
Bailey & Jung, 173
Bailey & Simon, 170
total marginal sums, 173
method of moments, 40
minimal variance estimator, 42
mixed Poisson distribution, 36
definition, 36
MLE, 40, 45
MM, 40
model risk, 13
model world, 14
moment estimator, 41
moment generating function, 16, 19, 58
moments, 15
monotonicity, 161
Monte Carlo simulation, 93
Morgenstern, Oskar, 144
MSEP, 237, 242
multiplicative tariff, 168
multivariate Gaussian distribution, 175
density, 176
MV, 42
MW formula, 255

(m

Kolmogorov, Andrey Nikolaevich, 79


Kolmogorov-Smirnov test, 79
KS test, 79

Mack CL model, 238


Mack formula, 245
Mack, Thomas, 238
margin, 266
market risk, 268, 271
market-consistent, 266
value, 270
market-value margin, 266, 274
Markov chain Monte Carlo, 201
Markov, Andrey Andreyevich, 17
maximum likelihood estimator, 45
maximum likelihood method, 40
MCMC, 201

negative-binomial distribution, 37
definition, 37
moments, 38
net profit condition, 124
Neumann, von John, 144
non-informative prior, 206
normal approximation, 94
normalization, 161
NPC, 124
null hypothesis, 21

Version April 14, 2016, M.V. Wthrich, ETH Zurich

288

Index

radius of convergence, 16
Radon-Nikodym derivative, 165
random variables, 14
random walk theorem, 124
rapidly varying, 59
RBNS, 228
re-insurance, 88
real world, 14
regularly varying, 59, 136
renewal property, 125
reporting
date, 226
delay, 226
reserve risk, 268
reserves, 230, 231
residual standard deviation, 179
Resnick, Sidney Ira, 75
Riemann-Stieltjes integral, 15
risk
averse, 144
bearing capital, 159
characteristics, 168
class, 168
components, 13
margin, 266, 274
measure, 159, 265
modules, 267

NL

no

tes

p-value, 22
Plya, George, 38
Panjer
algorithm, 105, 107
distribution, 105
recursion, 105
Panjer, Harry H., 105
parameter estimation
claims count distribution, 40
error, 238
Pareto distribution, 73
Pareto, Vilfredo Federico Damaso, 73
past exposure claim, 230
Pearsons residuals, 191
Pearson, Karl, 59, 83
Poisson distribution, 29, 184
definition, 29
moments, 29
Poisson, Simon Denis, 29
Poisson-gamma model, 203
Pollaczek, Flix, 132
Pollaczek-Khinchin formula, 129, 132
positive homogeneity, 161
posterior
distribution, 203
parameter, 204
power law distribution, 73
power utility function, 147
prediction error, 21, 221
predictor, 21, 237
premium
calculation principle, 141
CY, 269

w)

ODP model, 247


one-period problem, 265
one-year uncertainty, 256
operational risk, 268, 271
loss, 269
outstanding loss liabilities, 226, 230
over-dispersed Poisson model, 247

elements, 13
liability risk, 268, 273
previous year claim, 268
Price, Richard, 202
prior
distribution, 202
parameter, 204
probability distortion, 156
probability space, 14
process uncertainty, 238
provisions, 230, 269
pure randomness, 13
PY claim, 268
PY risk, 272

(m

number of claims, 23

Version April 14, 2016, M.V. Wthrich, ETH Zurich

Index

289
surplus process, 121, 264
survival function, 20, 58
Swiss Solvency Test, 265

w)

ultimate ruin probability, 123


utility function
exponential, 147
power, 147
utility indifference price, 148
utility theory, 144

NL

no

tes

sample
estimators, 41
mean, 41, 54
variance, 41, 54
saturated model, 189
scale parameter, 59
scaled deviance, 190
scatter plot, 54
settlement
date, 226
delay, 229
period, 226
shape parameter, 59
Shiu, Elias S.W., 121
significance level, 21
Simon, LeRoy J., 170
skewness, 16, 58
slowly varying, 59
Smirnov, Nikolai Vasilyevich, 79
solvency, 266
Solvency II, 265
Spitzers formula, 130
Spitzer, Frank Ludvig, 131
SST, 265
standard assumptions for compound distributions, 23
standard deviation, 16
standard deviation loading principle, 142
stochastic claims reserving method, 237
stochastic dominance, 109
stopping time, 129
Straub, Erwin, 212
structural parameter, 218
subadditivity, 161
subexponential, 133, 135, 137
Sundt, Bjrn, 106

tail index, 59, 136


Tail-Value-at-Risk, 159
tariff criterion, 168
tariffication, 167
total claim amount, 23
total uncertainty, 256
tower property, 20
translated gamma approximation, 97
translated log-normal approximation, 97
translation invariance, 161
TVaR, 159, 265

(m

ruin probability
finite horizon, 122
ultimate, 123
ruin theory, 121
ruin time, 122

vague prior, 206


value
assets, 264
liabilities, 264
Value-at-Risk, 159, 162
VaR, 159, 162, 265
Var, 16
variable reduction analysis, 189
variance, 16, 58
variance loading principle, 142
Vco, 16
Veraverbeke, Nol, 139
Verrall, Richard J., 247
volume, 26
Weibull distribution, 64
Weibull, Ernst Hjalmar Waloddi, 64
zero claim, 53
zero-coupon bond, 270

Version April 14, 2016, M.V. Wthrich, ETH Zurich

You might also like