
Non-Life Insurance:
Mathematics & Statistics
Lecture Notes

Mario V. Wüthrich
RiskLab Switzerland
Department of Mathematics
ETH Zurich

Version March 14, 2017

Electronic copy available at: https://ssrn.com/abstract=2319328



Preface and Terms of Use

Lecture notes. The present lecture notes cover the lecture Non-Life Insurance: Mathematics & Statistics which is held in the Department of Mathematics at ETH Zurich. This lecture is a merger of the two lectures Nicht-Leben Versicherungsmathematik and Risk Theory for Insurance. It was taught for the first time in the present form in Spring 2014 at ETH Zurich and in Fall 2014 at the University of Bologna (jointly with Tim Verdonck). The lecture aims at providing a basis in non-life insurance mathematics, which forms a core subject of actuarial science. After this course, students are recommended to follow lectures that give deeper knowledge in different subjects of non-life insurance mathematics, such as Credibility Theory, Non-Life Insurance Pricing with Generalized Linear Models, Stochastic Claims Reserving Methods, Market-Consistent Actuarial Valuation, Quantitative Risk Management, Data Analytics & Machine Learning, etc.

Prerequisites. The prerequisites for this lecture are a solid education in mathe-
matics, in particular, in probability theory and statistics.

Terms of Use. These lecture notes are an ongoing project which is continuously revised and updated. Of course, there may be errors in the notes and there is always room for improvement. Therefore, I appreciate any comments and/or corrections that readers may have. However, I would like you to respect the following rules:

- These notes are provided solely for educational, personal and non-commercial use. Any commercial use or reproduction is forbidden.

- All rights remain with the author. He may update the manuscript or withdraw the manuscript at any time. There is no right to the availability of any (old) version of these notes. The author may also change these terms of use at any time.

- The author disclaims all warranties, including but not limited to warranties concerning the use or the contents of these notes. By using these notes, you fully agree to this.

- Citation: please use the SSRN URL https://ssrn.com/abstract=2319328.

- All pictures and graphs included in these notes are either downloaded from the internet (open access) or were plotted by the author. If downloaded graphs violate copyright, I appreciate an immediate note and the corresponding pictures will be removed from these lecture notes.

Previous versions.
- September 2, 2013
- December 2, 2013
- August 27, 2014
- June 29, 2015
- April 14, 2016



Acknowledgment

Writing these notes, I profited greatly from various inspiring as well as ongoing discussions, concrete contributions and critical comments with and by several people: first of all, the students that have been following our lectures at ETH Zurich since 2006; furthermore Hans Bühlmann, Christoph Buser, Philippe Deprez, Paul Embrechts, Farhad Farhadmotamed, Urs Fitze, Markus Gesmann, Alois Gisler, Laurent Huber, Lukas Meier, Michael Merz, Esbjörn Ohlsson, Gareth Peters, Albert Pinyol i Agelet, Peter Reinhard, Simon Rentzmann, Rodrigo Targino, Teja Turk, Tim Verdonck, Maximilien Vila, Yitian Yang, Patrick Zöchbauer. I especially thank Alois Gisler for providing his lecture notes [54] and the corresponding exercises.
tes
Zurich, March 14, 2017    Mario V. Wüthrich

Contents

1 Introduction
1.1 Nature of non-life insurance
1.1.1 Non-life insurance and the law of large numbers
1.1.2 Risk components and premium elements
1.2 Probability theory and statistics
1.2.1 Random variables and distribution functions
1.2.2 Terminology in statistics

2 Collective Risk Modeling
2.1 Compound distributions
2.2 Explicit claims count distributions
2.2.1 Binomial distribution
2.2.2 Poisson distribution
2.2.3 Mixed Poisson distribution
2.2.4 Negative-binomial distribution
2.3 Parameter estimation
2.3.1 Method of moments
2.3.2 Maximum likelihood estimators
2.3.3 Example and χ²-goodness-of-fit analysis

3 Individual Claim Size Modeling
3.1 Data analysis and descriptive statistics
3.2 Selected parametric claims size distributions
3.2.1 Gamma distribution
3.2.2 Weibull distribution
3.2.3 Log-normal distribution
3.2.4 Log-gamma distribution
3.2.5 Pareto distribution
3.3 Model selection
3.3.1 Kolmogorov-Smirnov test
3.3.2 Anderson-Darling test
3.3.3 Goodness-of-fit and information criteria
3.4 Calculating within layers for claim sizes
3.4.1 Claim size modeling using layers
3.4.2 Re-insurance layers and deductibles

4 Approximations for Compound Distributions
4.1 Approximations
4.1.1 Normal approximation
4.1.2 Translated gamma and log-normal approximations
4.1.3 Edgeworth approximation
4.2 Algorithms for compound distributions
4.2.1 Panjer algorithm
4.2.2 Fast Fourier transform

5 Ruin Theory in Discrete Time
5.1 Net profit condition
5.2 Lundberg bound
5.3 Pollaczek-Khinchin formula
5.3.1 Ladder epochs
5.3.2 Cramér-Lundberg process
5.4 Subexponential claim sizes

6 Premium Calculation Principles
6.1 Simple risk-based principles
6.2 Advanced premium calculation principles
6.2.1 Utility theory pricing principles
6.2.2 Esscher premium
6.2.3 Probability distortion pricing principles
6.2.4 Cost-of-capital principles using risk measures
6.2.5 Deflator based pricing principles

7 Tariffication and Generalized Linear Models
7.1 Simple tariffication methods
7.2 Gaussian approximation
7.2.1 Maximum likelihood estimation
7.2.2 Goodness-of-fit analysis
7.3 Generalized linear models
7.3.1 GLM for Poisson claims counts
7.3.2 GLM for gamma claim sizes
7.3.3 Variable reduction analysis
7.3.4 Claims frequency example

8 Bayesian Models and Credibility Theory
8.1 Exact Bayesian models
8.1.1 Poisson-gamma model
8.1.2 Exponential dispersion family with conjugate priors
8.2 Linear credibility estimation
8.2.1 Bühlmann-Straub model
8.2.2 Bühlmann-Straub credibility formula
8.2.3 Estimation of structural parameters
8.2.4 Prediction error in the Bühlmann-Straub model

9 Claims Reserving
9.1 Outstanding loss liabilities
9.2 Claims reserving algorithms
9.2.1 Chain-ladder algorithm
9.2.2 Bornhuetter-Ferguson algorithm
9.3 Stochastic claims reserving methods
9.3.1 Gamma-gamma Bayesian CL model
9.3.2 Over-dispersed Poisson model
9.4 Claims development result
9.4.1 Definition of the claims development result
9.4.2 One-year uncertainty in the Bayesian CL model
9.4.3 The full picture of run-off uncertainty

10 Solvency Considerations
10.1 Balance sheet and solvency
10.2 Risk modules
10.3 Insurance liability variables
10.3.1 Market-consistent values
10.3.2 Insurance risk



Chapter 1

Introduction

1.1 Nature of non-life insurance
1.1.1 Non-life insurance and the law of large numbers
Insurance originates from a general demand of society for protection against unforeseeable events which might cause serious (financial) damage to individuals and society. Insurance organizes the financial protection against such unforeseeable (random) events, meaning that it takes care of the financial replacement of the (potential) damage. The general idea is to build a community (collective) to which everybody contributes a certain amount (a fixed deterministic premium¹) and then the financial damage is financed by the means of this community.

¹In special cases, for instance in re-insurance or accident insurance, the premium can also have a random part. This is not further discussed here.


The basic feature of such communities is that every member faces similar risks. By building such communities the individual members profit from diversification benefits in the form of a law of large numbers that applies to the community. Insurance companies organize the fair distribution within the community.
Modern insurance is traced back to the Great Fire of London in 1666, which destroyed a big part of the city of London. This event initiated fire insurance as protection against such disastrous events. Today, fire insurance belongs to the branch of non-life insurance, which is also known as property and casualty insurance in the US and general insurance in the UK and Australia. Non-life insurance comprises car insurance, liability insurance, property insurance, accident and health insurance, marine insurance, credit insurance, legal protection insurance, travel insurance and other similar products. Insurance contracts for these types of products have in common that they specify an insurance period (typically of one year). All insured (random) events that occur within this insurance period and cause financial damage to which the insurance contract applies are then indemnified. Such random payments caused by insured events are called insurance claims.

[Figure: Great Fire of London 1666]
Typically, the insurance premium for these contracts is paid at the beginning of the insurance period (upfront). To determine this insurance premium, the insurance company pools similar risks whose individual insurance claims can be described by a sequence $Y_1, \ldots, Y_n$, $n \in \mathbb{N}$, of random variables. These insurance claims $Y_i$ are random at the beginning of the insurance period and therefore need to be described with probability theory. Assume we have a probability space $(\Omega, \mathcal{F}, P)$ and $Y_1, \ldots, Y_n$ are uncorrelated and identically distributed random variables on that probability space with finite mean $\mu = E[Y_1]$. In that case we can apply the weak law of large numbers (LLN) which says that for all $\varepsilon > 0$

$$\lim_{n \to \infty} P\left[ \left| \frac{1}{n} \sum_{i=1}^{n} Y_i - \mu \right| > \varepsilon \right] = 0. \qquad (1.1)$$
Basically, this means that the total claim amount becomes more predictable with increasing portfolio size $n$ and, therefore, we can calculate the insurance premium quite accurately for large portfolio sizes $n$ because this provides the required equal balance. The weak law of large numbers is therefore considered to be a theoretical cornerstone of insurance. It goes back to the Swiss mathematician Jakob Bernoulli (1655-1705) of the famous Bernoulli family and was first published in his path-breaking work Ars Conjectandi, which appeared in 1713, eight years after his death, see Bolthausen-Wüthrich [15].
For independent and identically distributed random variables $Y_1, Y_2, \ldots$ with finite variance $\sigma^2$ the weak law of large numbers can further be refined by Chebychev's inequality, which provides rates of convergence, and by the central limit theorem (CLT), which provides the asymptotic limit distribution. The CLT states under the above assumptions that we have the following convergence in distribution

$$\frac{\sum_{i=1}^{n} Y_i - n\mu}{\sqrt{n}\,\sigma} \;\Longrightarrow\; \mathcal{N}(0, 1) \qquad \text{as } n \to \infty, \qquad (1.2)$$

i.e. in the limit (and under appropriate scaling) we obtain a standard Gaussian distribution. The crucial feature is that the denominator only increases of order $\sqrt{n}$, i.e. it increases at a slower rate than $n$. This exactly implies that the total claim amount of the portfolio becomes predictable in the limit because the relative confidence bounds get narrower the bigger the portfolio becomes. These are the basics of why insurance works. The CLT goes back to Abraham De Moivre (1667-1754), who published a first article on the CLT in 1733 based on coin tossing, way ahead of its time, and to Pierre-Simon Laplace (1749-1827), who provided an extension in 1812, see also page 94 below.
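To make the diversification effect in (1.1)-(1.2) concrete, the following minimal simulation sketch (Python; the gamma claim size distribution and all parameters are illustrative assumptions, not taken from these notes) estimates how the relative fluctuation of the average claim shrinks as the portfolio size $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# illustrative claim size distribution (an assumption for this sketch):
# Y_i ~ Gamma(shape, scale) with mean shape * scale = 2.0
shape, scale = 2.0, 1.0
n_sim = 100_000  # number of simulated portfolios per portfolio size n

for n in [10, 100, 1_000, 10_000]:
    # the sum of n i.i.d. Gamma(shape, scale) claims is Gamma(n * shape, scale),
    # so we can simulate the portfolio average directly
    averages = rng.gamma(n * shape, scale, size=n_sim) / n
    vco = averages.std() / averages.mean()  # relative fluctuation of the average
    print(f"n = {n:6d}: Vco of average claim = {vco:.4f}")
```

The printed coefficients of variation decrease roughly by a factor $\sqrt{10}$ from line to line, in agreement with the $n^{-1/2}$ rate behind (1.2).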

1.1.2 Risk components and premium elements


Insurance contracts involve many different risky components. We briefly present
them from the insurance company point of view.
NL

1. Pure randomness: The outcomes of the claims Yi are uncertain/random. This


risk is taken care of by the volume n of the insurance portfolio (as described
in (1.1) and (1.2)). That is, this risk can be controlled in a sufficient way if
the insurance portfolio is large.

2. Model risk: The description of the randomness of the variables $Y_i$, described in the previous item, is always based on a stochastic model, i.e. we describe the random outcomes in a model world. This modeling should have the minimal requirement that it characterizes the nature of $Y_i$ in a sufficiently accurate way. However, typically model risk arises because our model description does not perfectly describe real world behavior. There are different things that may go wrong in this modeling task:

(a) the model world does not provide an appropriate description of real world behavior;
(b) the parameters in the chosen model are misspecified;
(c) risk factors change over time so that past observations do not appropriately describe what may happen in the future (non-stationarity); of course, this is closely related to (a) and (b).

In practice, these uncertainties (including pure randomness) ask for a risk loading (risk margin) beyond the pure risk premium defined by $\mu = E[Y_i]$. The aim of this risk loading is to provide financial stability. We will describe this in detail below in Chapters 5, 6 and 10.

We close this section by describing the premium elements that are considered for insurance premium calculation:

+ pure risk premium $\mu = E[Y_i]$
+ risk margin to protect against the risks mentioned above
+ profit margin
- financial gains on investments
+ sales commissions to agents
+ other administrative expenses
+ taxes

The sum of all these items specifies the insurance premium. Non-life insurance mathematics and statistics typically studies the first two items. This is part of the program of the subsequent chapters.

1.2 Probability theory and statistics


1.2.1 Random variables and distribution functions
In this section we briefly recall the crucial notation and key results of probability theory used in these notes. We denote the underlying probability space by $(\Omega, \mathcal{F}, P)$ and assume throughout that this probability space is sufficiently rich so that it carries all the objects that we are going to consider.

Random variables on this probability space $(\Omega, \mathcal{F}, P)$ are denoted by capital letters $X, Y, S, N, \ldots$ and the corresponding observations are denoted by small letters $x, y, s, n, \ldots$. That is, $x$ constitutes a realization of $X$. Random vectors are denoted by boldface, e.g., $\boldsymbol{X} = (X_1, \ldots, X_d)'$ and the corresponding observation by $\boldsymbol{x} = (x_1, \ldots, x_d)'$ for a given dimension $d \in \mathbb{N}$. Since there is broad similarity between random variables and random vectors, we restrict ourselves to random variables for introducing the crucial terminology from probability theory.
Random variables $X$ are characterized by (probability, cumulative) distribution functions $F = F_X : \mathbb{R} \to [0,1]$, meaning that for all $x \in \mathbb{R}$

$$F(x) = F_X(x) = P[X \le x] \in [0,1]$$

denotes the probability that $X$ has an outcome less than or equal to $x$. In general, we drop the subscript in the distribution function $F = F_X$ and we simply write $X \sim F$ for $X$ having distribution function $F$; a (cumulative) distribution function is a right-continuous, non-decreasing function with $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.

We distinguish two important types of random variables:

(i) a random variable $X \sim F$ is called discrete if $F$ is a step function with countably many steps in discrete points $k \in A \subset \mathbb{R}$. In this case we write

$$p_k = P[X = k] > 0 \qquad \text{for } k \in A,$$

with $\sum_{k \in A} p_k = 1$. We call $p_k$ the probability weight of $X$ in $k \in A$;

(ii) a random variable $X \sim F$ is called absolutely continuous if there exists a measurable function $f \ge 0$ with $f = F'$, i.e.

$$F(x) = \int_{-\infty}^{x} f(y) \, dy \qquad \text{for all } x \in \mathbb{R}.$$

This function $f$ is called the density of $X$ and in that case we also use the terminology $X \sim f$.
no

Assume $X \sim F$ and $h : \mathbb{R} \to \mathbb{R}$ is a sufficiently nice measurable function. We define the expected value of $h(X)$ by

$$E[h(X)] = \int_{\mathbb{R}} h(x) \, dF(x) = \begin{cases} \sum_{k \in A} h(k)\, p_k & \text{if } X \text{ is discrete}, \\ \int_{\mathbb{R}} h(x) f(x) \, dx & \text{if } X \text{ is absolutely continuous}. \end{cases}$$

The middle term uses the general framework of the Riemann-Stieltjes integral $\int_{\mathbb{R}} h \, dF$ (and in fact the second equality is not an identity because the middle term is more general than the right-hand side). The "sufficiently nice" refers to the fact that $E[h(X)]$ is only defined upon existence. The most important functions $h$ in our analysis define the following moments (based upon existence):

mean, expectation, expected value or first moment of $X \sim F$

$$\mu_X = E[X] = \int_{\mathbb{R}} x \, dF(x);$$

$k$-th moment of $X \sim F$

$$E[X^k] = \int_{\mathbb{R}} x^k \, dF(x);$$

variance of $X \sim F$

$$\sigma_X^2 = \mathrm{Var}(X) = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2 \ge 0;$$

standard deviation and coefficient of variation of $X \sim F$

$$\sigma_X = \mathrm{Var}(X)^{1/2} \qquad \text{and} \qquad \mathrm{Vco}(X) = \frac{\sigma_X}{E[X]} \quad \text{for } E[X] > 0;$$

skewness of $X \sim F$

$$\varsigma_X = \frac{E\left[(X - E[X])^3\right]}{\sigma_X^3};$$

moment generating function of $X \sim F$ at position $r \in \mathbb{R}$

$$M_X(r) = E[\exp\{rX\}] = \int_{\mathbb{R}} \exp\{rx\} \, dF(x).$$

The moment generating function is crucial to identify the properties of random variables $X$, see Lemmas 1.2, 1.3 and 1.4 below.

Lemma 1.1. Choose $X \sim F$ and assume that there exists $r_0 > 0$ such that $M_X(r) < \infty$ for all $r \in (-r_0, r_0)$. Then $M_X(r)$ has a power series expansion for $r \in (-r_0, r_0)$ with

$$M_X(r) = \sum_{k \ge 0} \frac{r^k}{k!} \, E[X^k].$$

Proof. Note that it suffices to choose $r \in (-r_0, r_0)$ with $r \ne 0$. Since $e^{|rx|} \le e^{rx} + e^{-rx}$, the assumptions imply the integrability $E[\exp\{|rX|\}] < \infty$. This implies that $E[|X|^k] < \infty$ for all $k \in \mathbb{N}$ because $|x|^k$ is dominated by $e^{|rx|}$ for sufficiently large $|x|$. It also implies that the partial sums $|f_m(x)| = |\sum_{k=0}^{m} (rx)^k / k!|$ are uniformly bounded by the integrable (w.r.t. $dF$) function $\sum_{k \ge 0} |rx|^k / k! = e^{|rx|}$. This allows us to apply the dominated convergence theorem which provides

$$\lim_{m \to \infty} \sum_{k=0}^{m} \frac{r^k}{k!} \, E[X^k] = \lim_{m \to \infty} E[f_m(X)] = E\left[\lim_{m \to \infty} f_m(X)\right] = M_X(r).$$

This proves the lemma. □

Lemma 1.1 implies that the power series converges for all $r \in (-r_0, r_0)$ for given $r_0 > 0$ and, thus, we have a strictly positive radius of convergence $\rho_0 > 0$. A standard result from analysis implies that in the interior of the interval $[-\rho_0, \rho_0]$ we can differentiate $M_X(\cdot)$ arbitrarily often (term by term of the power series) and the derivatives at the origin are given by

$$\frac{d^k}{dr^k} M_X(r) \Big|_{r=0} = E[X^k] < \infty \qquad \text{for } k \in \mathbb{N}_0. \qquad (1.3)$$

Lemma 1.2. Choose a random variable $X \sim F$ and assume that there exists $r_0 > 0$ such that $M_X(r) < \infty$ for all $r \in (-r_0, r_0)$. Then the distribution function $F$ of $X$ is completely determined by its moment generating function $M_X$.

Proof. The existence of a strictly positive radius of convergence $\rho_0$ implies that all moments of $X$ exist and that they are directly determined by the moment generating function via (1.3). Theorem 30.1 of Billingsley [13] then implies that there is at most one distribution function $F$ which has the same moments (1.3) for all $k \in \mathbb{N}$. □

For one-sided random variables the statement even holds true in general:

Lemma 1.3. Assume $X \ge 0$, $P$-a.s. The distribution function $F$ of $X$ is completely determined by its moment generating function $M_X$.

Proof. See Section 22 of Billingsley [13], in particular Theorem 22.2. □

Lemma 1.3 gives for two random variables $X \sim F$ and $Y \sim G$ with $X \ge 0$ and $Y \ge 0$, $P$-a.s., the following implication:

$$M_X \equiv M_Y \quad \Longrightarrow \quad X \stackrel{(d)}{=} Y.$$

This property is often used to identify distribution functions.

Lemma 1.4. Assume that the random variables $X_n$, $n \in \mathbb{N}$, and $X$ have finite moment generating functions $M_{X_n}$, $n \in \mathbb{N}$, and $M_X$ on a common interval $(-r_0, r_0)$ with $r_0 > 0$. Suppose $\lim_{n \to \infty} M_{X_n}(r) = M_X(r)$ for all $r \in (-r_0, r_0)$. Then $(X_n)_n$ converges in distribution to $X$, write $X_n \Rightarrow X$ for $n \to \infty$.

Proof. See Section 30 of Billingsley [13]. Basically, Chebychev's inequality implies tightness of the underlying probability measures from which the convergence in distribution is derived. □

The Pafnuty Lvovich Chebychev (1821-1894) inequality is sometimes also called the Andrey Andreyevich Markov (1856-1922) inequality. Chebychev was Markov's teacher and the inequality first appeared in the work of Chebychev. Note that there are different spellings of Chebychev, such as Tchebysheff, etc.

Example 1.5 (Gaussian distribution). Assume $X \sim \mathcal{N}(\mu, \sigma^2)$ has a Gaussian distribution with parameters $\mu \in \mathbb{R}$ and $\sigma^2 > 0$. $X$ is an absolutely continuous random variable with density $f(x)$ for $x \in \mathbb{R}$ given by

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2} \right\}.$$

The moment generating function of $X \sim \mathcal{N}(\mu, \sigma^2)$ is given by

$$M_X(r) = \exp\left\{ r\mu + r^2\sigma^2/2 \right\} < \infty \qquad \text{for } r \in \mathbb{R}. \qquad (1.4)$$

This moment generating function is obtained by a direct calculation completing the square. Observe that $M_X(\cdot)$ is finite on $\mathbb{R}$ and, thus, all moments exist and

$$\mu_X = E[X] = \frac{d}{dr} M_X(r) \Big|_{r=0} = \exp\left\{ r\mu + \frac{1}{2} r^2\sigma^2 \right\} \left( \mu + r\sigma^2 \right) \Big|_{r=0} = \mu,$$

and for the second moment we obtain

$$E[X^2] = \frac{d^2}{dr^2} M_X(r) \Big|_{r=0} = \exp\left\{ r\mu + \frac{1}{2} r^2\sigma^2 \right\} \left( (\mu + r\sigma^2)^2 + \sigma^2 \right) \Big|_{r=0} = \mu^2 + \sigma^2.$$

This implies for the variance of Gaussian distributions

$$\sigma_X^2 = \mathrm{Var}(X) = E[X^2] - E[X]^2 = \sigma^2.$$

Moreover, any random variable $Y$ that has a moment generating function of the form (1.4) is Gaussian with mean $\mu_Y = \mu$ and variance $\sigma_Y^2 = \sigma^2$, see Lemma 1.2. ∎
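As a quick sanity check of these derivative computations, one can differentiate the moment generating function (1.4) symbolically; the following minimal sketch (Python with sympy, an illustrative tool choice) reproduces the first two moments.

```python
import sympy as sp

r = sp.Symbol('r', real=True)
mu = sp.Symbol('mu', real=True)
sigma = sp.Symbol('sigma', positive=True)

# moment generating function (1.4) of X ~ N(mu, sigma^2)
M = sp.exp(r * mu + r**2 * sigma**2 / 2)

# first moment: d/dr M(r) at r = 0
first_moment = sp.diff(M, r).subs(r, 0)                   # -> mu
# second moment: d^2/dr^2 M(r) at r = 0
second_moment = sp.diff(M, r, 2).subs(r, 0)               # -> mu^2 + sigma^2
# variance: E[X^2] - E[X]^2
variance = sp.simplify(second_moment - first_moment**2)   # -> sigma^2

print(first_moment, second_moment, variance)
```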

Exercise 1 (Gaussian distribution).

(a) Assume $X \sim \mathcal{N}(0, 1)$. Prove that $a + bX \sim \mathcal{N}(a, b^2)$ for $a, b \in \mathbb{R}$.

(b) Assume that the $X_i$ are independent and $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$. Prove that $\sum_i X_i \sim \mathcal{N}(\sum_i \mu_i, \sum_i \sigma_i^2)$.

(c) Assume $X \sim \mathcal{N}(0, 1)$. Prove that $E[X^{2k+1}] = 0$ for all $k \in \mathbb{N}_0$.
The Gaussian distribution is named after Carl Friedrich Gauss (1777-1855). He was one of the greatest mathematicians and has contributed to many different fields in mathematics and physics. We recommend the novel of Kehlmann [65] that fictitiously describes the lives of Carl Friedrich Gauss and of the natural scientist Alexander von Humboldt (1769-1859).


Often we do not directly consider the moment generating function $M_X$ of a random variable $X$ but rather its logarithm. The cumulant generating function of $X$ is given by

$$\log M_X(r) = \log E[\exp\{rX\}].$$

Assume that $M_X$ is finite on $(-r_0, r_0)$ with $r_0 > 0$. We have

$$\frac{d}{dr} \log M_X(r) \Big|_{r=0} = \frac{M_X'(r)}{M_X(r)} \Big|_{r=0} = E[X] = \mu_X,$$

$$\frac{d^2}{dr^2} \log M_X(r) \Big|_{r=0} = \frac{M_X''(r) M_X(r) - (M_X'(r))^2}{(M_X(r))^2} \Big|_{r=0} = \mathrm{Var}(X) = \sigma_X^2, \qquad (1.5)$$

$$\frac{d^3}{dr^3} \log M_X(r) \Big|_{r=0} = E\left[(X - E[X])^3\right] = \varsigma_X \, \sigma_X^3.$$
Lemma 1.6. Assume that $M_X$ is finite on $(-r_0, r_0)$ with $r_0 > 0$. Then $\log M_X(\cdot)$ is a convex function on $(-r_0, r_0)$.

Proof. In order to prove convexity we calculate the second derivative at position $r \in (-r_0, r_0)$:

$$\frac{d^2}{dr^2} \log M_X(r) = \frac{M_X''(r) M_X(r) - (M_X'(r))^2}{(M_X(r))^2} = \frac{M_X''(r)}{M_X(r)} - \left( \frac{M_X'(r)}{M_X(r)} \right)^2 = \frac{E[X^2 e^{rX}]}{E[e^{rX}]} - \left( \frac{E[X e^{rX}]}{E[e^{rX}]} \right)^2.$$

Define the new function $F_r$ by

$$F_r(x) = \frac{1}{M_X(r)} \int_{-\infty}^{x} e^{ry} \, dF(y). \qquad (1.6)$$

Observe that $F_r$ is a distribution function. Thus, we can choose a random variable $X_r \sim F_r$ whose variance is given by

$$0 \le \mathrm{Var}(X_r) = E[X_r^2] - E[X_r]^2 = \frac{E[X^2 e^{rX}]}{E[e^{rX}]} - \left( \frac{E[X e^{rX}]}{E[e^{rX}]} \right)^2 = \frac{d^2}{dr^2} \log M_X(r).$$

This proves the claim. □

Remark. The distribution function $F_r$ defined in (1.6) gives the Esscher measure of $F$. The Esscher measure has been introduced by Bühlmann [19] for a new premium calculation principle. We come back to this in Section 6.2.2 below.

The next formula is often used: assume that $X \sim F$ is non-negative, $P$-a.s., and has finite first moment. Then we have the identity

$$E[X] = \int_0^\infty x \, dF(x) = \int_0^\infty [1 - F(x)] \, dx = \int_0^\infty P[X > x] \, dx.$$

The proof uses integration by parts and the result says that we can calculate expected values from survival functions $\bar{F}(x) = 1 - F(x) = P[X > x]$. Survival functions will be important for the study of the fatness of the tails of distribution functions. This plays a crucial role for the modeling of large claims.
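For completeness, here is a one-line derivation of this identity (a sketch using Fubini's theorem, which amounts to the same computation as the integration by parts mentioned above): writing $x = \int_0^\infty 1_{\{t < x\}} \, dt$ and interchanging the order of integration gives

$$E[X] = \int_0^\infty x \, dF(x) = \int_0^\infty \int_0^\infty 1_{\{t < x\}} \, dt \, dF(x) = \int_0^\infty \int_0^\infty 1_{\{t < x\}} \, dF(x) \, dt = \int_0^\infty P[X > t] \, dt.$$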

Often we deal with sequences $X_1, X_2, \ldots$ of random variables which are independent and identically distributed (i.i.d.) with distribution function $F$. In this case we use the notation $X_1, X_2, \ldots \stackrel{\text{i.i.d.}}{\sim} F$.

Another property that is going to be used quite frequently is the so-called tower property, see Williams [97]. It states that for any sub-$\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$ on our probability space $(\Omega, \mathcal{F}, P)$ we have for any integrable random variable $X$

$$E[X] = E[E[X | \mathcal{G}]]. \qquad (1.7)$$

In particular, if $X$ and $Y$ are two random variables on $(\Omega, \mathcal{F}, P)$ we have

$$E[X] = E[E[X | Y]],$$

where $E[X|Y]$ is an abbreviation for $E[X|\sigma(Y)]$ with $\sigma(Y) \subset \mathcal{F}$ denoting the $\sigma$-algebra generated by the random variable $Y$. Assume that $X$ is square integrable; then the tower property (1.7) implies

$$\mathrm{Var}(X) = E[\mathrm{Var}(X | \mathcal{G})] + \mathrm{Var}(E[X | \mathcal{G}]). \qquad (1.8)$$

We have mentioned above that distribution functions $F$ are right-continuous and non-decreasing. This allows to define the left-continuous generalized inverse of $F$ by

$$F^{\leftarrow}(p) = \inf\{x;\; F(x) \ge p\},$$

where we use the convention $\inf \emptyset = \infty$. For $p \in (0,1)$, $F^{\leftarrow}(p)$ is often called the $p$-quantile of $X \sim F$. The generalized inverse $F^{\leftarrow}$ is only tricky at places where $F$ has a discontinuity or where $F$ is not strictly increasing. It satisfies the following properties, see Proposition A3 in McNeil et al. [77]:

1. $F^{\leftarrow}$ is non-decreasing and left-continuous.

2. $F^{\leftarrow}$ is continuous iff $F$ is strictly increasing.

3. $F^{\leftarrow}$ is strictly increasing iff $F$ is continuous.

4. (If $F$ is right-continuous, then) $F^{\leftarrow}(x) \le z$ iff $F(z) \ge x$.

5. $F^{\leftarrow}(F(x)) \le x$.

6. $F(F^{\leftarrow}(z)) \ge z$.

7. If $F$ is strictly increasing, then $F^{\leftarrow}(F(x)) = x$.

8. If $F$ is continuous, then $F(F^{\leftarrow}(z)) = z$.

Items 4. to 8. need $F^{\leftarrow}(z) < \infty$. Note that the first part of item 4. is put in brackets because distribution functions are right-continuous. However, generalized inverses can also be defined for functions that are not right-continuous (as long as they are non-decreasing) and then the condition in the brackets of item 4. is needed.
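The following minimal sketch (Python; the discrete example distribution is an illustrative assumption) implements $F^{\leftarrow}$ for a discrete distribution and shows the flat and jump behavior at the tricky places mentioned above.

```python
import numpy as np

# discrete example: P[X=0] = 0.3, P[X=1] = 0.5, P[X=2] = 0.2
values = np.array([0.0, 1.0, 2.0])
cum_probs = np.cumsum([0.3, 0.5, 0.2])   # F at the jump points: 0.3, 0.8, 1.0

def F_inv(p):
    """Left-continuous generalized inverse: inf{x : F(x) >= p}."""
    idx = np.searchsorted(cum_probs, p, side='left')
    return np.inf if idx == len(values) else values[idx]

for p in [0.1, 0.3, 0.31, 0.8, 0.95]:
    print(f"F_inv({p}) = {F_inv(p)}")
# F_inv is constant (value 1.0) on the interval (0.3, 0.8] and jumps
# at p = 0.3 and p = 0.8, where F has flat pieces
```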

1.2.2 Terminology in statistics

Often we face the problem that we need to predict the outcome of a random variable $X \sim F$. This problem is solved by specifying an appropriate predictor $\widehat{X}$. For instance, we can choose as predictor $\widehat{X} = \mu_X = E[X]$. On the other hand, a distribution function $F$ often involves unknown parameters. These unknown parameters need to be estimated, for instance, using past experience and expert opinion. For example, we can estimate the (unknown) mean $\mu_X$ of $X$ by an estimator $\widehat{\mu}_X$. If we now choose predictor $\widehat{X} = \widehat{\mu}_X$ for predicting $X$, then $\widehat{\mu}_X$ serves at the same time as estimator for $\mu_X$ and as predictor for $X$. In this sense we obtain an estimation error which is specified by the difference $\mu_X - \widehat{\mu}_X$ and we obtain a prediction error which is characterized by the following difference

$$X - \widehat{X} = X - \widehat{\mu}_X = (X - \mu_X) + (\mu_X - \widehat{\mu}_X). \qquad (1.9)$$

The second term on the right-hand side of (1.9) specifies the estimation error and the first term on the right-hand side of (1.9) is often called pure process error, which is due to the stochastic nature of $X$, see also Section 9.3.

Statistical tests deal with the problem of making decisions. Assume we have an observation $\boldsymbol{x}$ of a random vector $\boldsymbol{X} \sim F_\theta$ with given but unknown parameter $\theta$ which lies in a given set $\Theta$ of possible parameters. The aim is to test whether the (true, unknown) parameter $\theta$ that has generated $\boldsymbol{x}$ may belong to some subset $\Theta_0 \subset \Theta$. In the simplest case we have a singleton $\Theta_0 = \{\theta_0\}$. Assume that we would like to check whether $\boldsymbol{x}$ may have been generated by a given parameter $\theta_0$:

Null hypothesis $H_0$: $\theta = \theta_0$.

(Two-sided) alternative hypothesis $H_1$: $\theta \ne \theta_0$.

We then build a test statistic $T(\boldsymbol{X})$ whose distribution function is known under the null hypothesis $H_0$ and we consider the question whether $T(\boldsymbol{x})$ takes an unlikely value under the null hypothesis. Therefore, one chooses a significance level $q \in (0,1)$ (typically 5% or 1%) and for this significance level one chooses a critical region $C_q$ with $P[T(\boldsymbol{X}) \in C_q] \le q$ (under the null hypothesis). The null hypothesis is then rejected if $T(\boldsymbol{x})$ falls into this critical region. In practice, one often calculates the so-called $p$-value. This denotes the critical probability at which the null hypothesis is just rejected (for one-sided unbounded intervals). For instance, if we choose a significance level of 5% and the resulting $p$-value of $T(\boldsymbol{x})$ is less than or equal to 5%, then the test rejects the null hypothesis at the 5% significance level.
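As a concrete illustration of this recipe (a minimal sketch in Python; the Gaussian sample, the choice $\theta_0 = 0$ and the z-statistic are illustrative assumptions, not taken from these notes), consider testing the mean of i.i.d. $\mathcal{N}(\theta, 1)$ observations:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=7)

# observations: i.i.d. N(theta, 1) with true theta = 0.3 (unknown to the tester)
x = rng.normal(loc=0.3, scale=1.0, size=50)

# H0: theta = theta_0 = 0 versus the two-sided H1: theta != 0
theta_0 = 0.0
# test statistic: under H0, T(X) = sqrt(n) * (mean - theta_0) ~ N(0, 1)
T = np.sqrt(len(x)) * (x.mean() - theta_0)

# two-sided p-value: probability under H0 of a value at least as extreme as T
p_value = 2 * (1 - norm.cdf(abs(T)))
print(f"T = {T:.3f}, p-value = {p_value:.4f}")
# reject H0 at the 5% significance level iff p_value <= 0.05
```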

Exercise 2 ($\chi^2$-distribution). Assume that $X_k$ has a $\chi^2$-distribution with $k \in \mathbb{N}$ degrees of freedom, i.e. $X_k$ is absolutely continuous with density

$$f(x) = \frac{1}{2^{k/2} \Gamma(k/2)} \, x^{k/2 - 1} \exp\{-x/2\}, \qquad \text{for } x \ge 0.$$

(a) Prove that $f$ is a density. Hint: see Section 3.3.3 and the proof of Proposition 2.20.

(b) Prove

$$M_{X_k}(r) = (1 - 2r)^{-k/2} \qquad \text{for } r < 1/2.$$

(c) Choose $Z \sim \mathcal{N}(0, 1)$ and prove $Z^2 \stackrel{(d)}{=} X_1$.

(d) Choose $Z_1, \ldots, Z_k \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$. Prove $\sum_{i=1}^{k} Z_i^2 \stackrel{(d)}{=} X_k$ and calculate the first two moments of the latter. ∎


Chapter 2

Collective Risk Modeling

The aim of this chapter is to describe the probability distribution of the total claim amount $S$ that an insurance company faces within a fixed time period. For the time period we take one (accounting) year. Assume that $N$ counts all claims that occur within this fixed accounting year. The total claim amount $S$ is given by

$$S = Y_1 + Y_2 + \ldots + Y_N = \sum_{i=1}^{N} Y_i,$$

where $Y_1, \ldots, Y_N$ model the individual claim sizes. If we are at the beginning of this accounting year then neither the number of claims $N$ nor the individual claim sizes $Y_1, \ldots, Y_N$ are known. Therefore, we model all these unknowns with random variables that characterize the possible outcomes of the total claim amount $S$ (which, of course, then is also a random variable). We call such models for $S$ collective risk models because we consider the whole portfolio as a collective. Collective risk models are based on the law of large numbers which allows the insurance company to benefit from diversification benefits (between individual risks).

2.1 Compound distributions


The starting point of the modeling of S is a compound distribution. This compound
distribution is based on rather strong model assumptions on the one hand, but on
the other hand it already leads to a good description and understanding of the
possible outcomes of the total claim amount S.

Model Assumptions 2.1 (compound distribution). The total claim amount $S$ is given by the following compound distribution

$$S = Y_1 + Y_2 + \ldots + Y_N = \sum_{i=1}^{N} Y_i,$$

with the three standard assumptions:

1. $N$ is a discrete random variable which only takes values in $A \subset \mathbb{N}_0$;

2. $Y_1, Y_2, \ldots \stackrel{\text{i.i.d.}}{\sim} G$ with $G(0) = 0$;

3. $N$ and $(Y_1, Y_2, \ldots)$ are independent.

Remarks.

- If $S$ satisfies these three standard assumptions from Model Assumptions 2.1 we say that $S$ has a compound distribution.

- The first assumption of the compound distribution says that the number of claims $N$ takes only non-negative integer values. The event $\{N = 0\}$ means that no claim occurs, which provides a total claim amount of $S = 0$.

- The second assumption means that the individual claim sizes $Y_i$ do not affect each other; for instance, if we face a large first claim $Y_1$ this does not give us any information for the remaining claims $Y_i$, $i \ge 2$. Moreover, we have homogeneity in the sense that all claims have the same marginal distribution function $G$ with

$$0 = G(0) = P[Y_1 \le 0],$$

i.e. the individual claim sizes $Y_i$ are strictly positive, $P$-a.s. We use synonymously the terminology (individual) claim size, (individual) claim and claims severity for $Y_i$.

- Finally, the last assumption says that the individual claim sizes are not affected by the number of claims and vice versa; for instance, if we observe many claims this does not contain any information on whether these claims are of smaller or larger size.

This compound distribution is the base model for collective risk modeling and we are going to describe different choices for the claims count distribution of $N$ and for the individual claim size distribution of $Y_i$. We start with the basic recognition features of compound distributions.

Proposition 2.2. Assume $S$ has a compound distribution. We have

$$E[S] = E[N] \, E[Y_1],$$

$$\mathrm{Var}(S) = \mathrm{Var}(N) \, E[Y_1]^2 + E[N] \, \mathrm{Var}(Y_1),$$

$$\mathrm{Vco}(S) = \sqrt{\mathrm{Vco}(N)^2 + \frac{\mathrm{Vco}(Y_1)^2}{E[N]}},$$

$$M_S(r) = M_N(\log(M_{Y_1}(r))) \qquad \text{for } r \in \mathbb{R},$$

whenever they exist.

Proof. Using the tower property (1.7) we obtain for the mean of $S$

$$E[S] = E\left[\sum_{i=1}^{N} Y_i\right] = E\left[E\left[\left.\sum_{i=1}^{N} Y_i \,\right|\, N\right]\right] = E\left[\sum_{i=1}^{N} E[Y_i]\right] = E[N \, E[Y_1]] = E[N] \, E[Y_1].$$

For the second statement we have, see also (1.8),

$$\mathrm{Var}(S) = \mathrm{Var}\left(E\left[\left.\sum_{i=1}^{N} Y_i \,\right|\, N\right]\right) + E\left[\mathrm{Var}\left(\left.\sum_{i=1}^{N} Y_i \,\right|\, N\right)\right] = \mathrm{Var}\left(\sum_{i=1}^{N} E[Y_i]\right) + E\left[\sum_{i=1}^{N} \mathrm{Var}(Y_i)\right] = \mathrm{Var}(N) \, E[Y_1]^2 + E[N] \, \mathrm{Var}(Y_1).$$

Finally, for the moment generating function we have

$$M_S(r) = E\left[\exp\left\{r \sum_{i=1}^{N} Y_i\right\}\right] = E\left[E\left[\left.\prod_{i=1}^{N} \exp\{r Y_i\} \,\right|\, N\right]\right] = E\left[\prod_{i=1}^{N} E[\exp\{r Y_i\} \,|\, N]\right] = E\left[M_{Y_1}(r)^N\right] = E[\exp\{N \log(M_{Y_1}(r))\}] = M_N(\log(M_{Y_1}(r))).$$

This proves the proposition. □


Under Model Assumptions 2.1 the distribution function of $S$ can be written as

$$P[S \le x] = \sum_{k \in A} P\left[\left.\sum_{i=1}^{N} Y_i \le x \,\right|\, N = k\right] P[N = k] = \sum_{k \in A} P\left[\sum_{i=1}^{k} Y_i \le x\right] P[N = k] = \sum_{k \in A} G^{*k}(x) \, P[N = k], \qquad (2.1)$$

where $G^{*k}$ denotes the $k$-th convolution of the distribution function $G$. In particular, we have for $Y_1, Y_2 \stackrel{\text{i.i.d.}}{\sim} G$

$$P[Y_1 + Y_2 \le x] = \int_{\mathbb{R}} G(x - y) \, dG(y) = G^{*2}(x).$$

With formula (2.1) we obtain a closed form solution for the distribution function of $S$. However, in general, this formula is not useful due to the computational complexity of calculating $G^{*k}$ for too many $k \in A$. We present other solutions for the calculation of the distribution function of $S$. These involve simulations, approximations and smart analytic techniques under additional model assumptions.
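Since simulation is the most direct of these alternatives, here is a minimal Monte Carlo sketch (Python; the Poisson/gamma choices and all parameters are illustrative assumptions anticipating the compound Poisson model below) that estimates $P[S \le x]$ and the first moment from Proposition 2.2:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

lam_v = 100.0               # expected number of claims E[N] (assumption)
shape, scale = 2.0, 500.0   # illustrative gamma claim sizes with E[Y1] = 1000
n_sim = 10_000              # number of simulated accounting years

# simulate S = Y_1 + ... + Y_N for each year
N = rng.poisson(lam_v, size=n_sim)
S = np.array([rng.gamma(shape, scale, size=n).sum() for n in N])

# compare with Proposition 2.2: E[S] = E[N] * E[Y1]
print("empirical E[S]:", S.mean(), " theory:", lam_v * shape * scale)
# empirical distribution function of S at a chosen threshold x
x = 110_000.0
print("empirical P[S <= x]:", (S <= x).mean())
```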

2.2 Explicit claims count distributions


In this section we give explicit distribution functions for the modeling of the number of claims $N$. The three most commonly used distribution functions are the binomial distribution, the Poisson distribution and the negative-binomial distribution. Our aim is to present these three distribution functions, describe the properties of the resulting compound distributions, and discuss parameter estimation. These three distribution functions constitute the family of Panjer distributions, see Lemma 4.7 below.
In a non-life insurance context the claims count random variable $N$ should always be understood in relation to an underlying (deterministic) volume $v > 0$. Therefore, we consistently use a volume measure to describe $N$; often this is not done in the related literature. The volume measure will become especially important for the study of diversification benefits, parameter estimation and the evaluation of parameter uncertainty. The volume measure can be of different nature depending on the insurance business considered and one should always choose the most appropriate one. Typical volume measures are: number of insured persons, number of policies, number of risks or time insured. In health and accident insurance it could also be the aggregated wages insured or in fire insurance the total insured value. To make language simple we interpret $v > 0$ as the number of risks insured. On the other side, $N$ counts the number of claims. The ratio $N/v$ is called claims frequency and the expected number of claims is given by

$$E[N] = \lambda v,$$

where $\lambda > 0$ denotes the expected claims frequency. Under these premises we would like to describe the probability weights

$$p_k = P[N = k] \qquad \text{for } k \in A \subset \mathbb{N}_0.$$

2.2.1 Binomial distribution


For the binomial distribution we choose a fixed volume v N and a fixed default
probability p (0, 1) (expected claims frequency).
NL

We say N has a binomial distribution, write N Binom(v, p), if


!
v
pk = P [N = k] = pk (1 p)vk for all k {0, . . . , v} = A.
k

P
The binomial formula provides kA pk = 1, see e.g. Section 5.3 in Merz-Wthrich
[79], and, hence, we have a discrete distribution function on the set A = {0, . . . , v}.
The special case v = 1 is called Bernoulli distribution or Bernoulli experiment, write
N Bernoulli(p), and reflects the coin tossing experiment
(
1p for k = 0,
P [N = k] =
p for k = 1.
This describes whether a single risk defaults or not.


Proposition 2.3. Assume $N \sim \mathrm{Binom}(v, p)$ for fixed $v \in \mathbb{N}$ and $p \in (0,1)$. Then

$$E[N] = vp, \qquad \mathrm{Var}(N) = vp(1-p), \qquad \mathrm{Vco}(N) = \sqrt{\frac{1-p}{vp}},$$

$$M_N(r) = (pe^r + (1-p))^v \qquad \text{for all } r \in \mathbb{R}.$$

Proof. We calculate the moment generating function and then the first two moments follow from formula (1.5). For the moment generating function we have

$$M_N(r) = \sum_{k \in A} e^{rk} \binom{v}{k} p^k (1-p)^{v-k} = \sum_{k \in A} \binom{v}{k} (pe^r)^k (1-p)^{v-k} = (pe^r + (1-p))^v \sum_{k \in A} \binom{v}{k} \left(\frac{pe^r}{pe^r + (1-p)}\right)^k \left(\frac{1-p}{pe^r + (1-p)}\right)^{v-k}.$$

The last sum is again a summation over probability weights $p_k$, $k \in A$, of a binomial distribution with default probability $\tilde{p} = (pe^r)/(pe^r + (1-p)) \in (0,1)$. Therefore it adds up to 1, which completes the proof. □

Next we give a second characterization of the binomial distribution which leads to the interpretation of the binomial distribution.

Corollary 2.4. Assume $N \sim \mathrm{Binom}(v, p)$ with given $v \in \mathbb{N}$ and $p \in (0,1)$. Choose $X_1, \ldots, X_v \stackrel{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(p)$. Then we have

$$N \stackrel{(d)}{=} \sum_{i=1}^{v} X_i.$$

Proof. In view of Lemma 1.3 it suffices to prove that $N$ and $X = \sum_{i=1}^{v} X_i$ have the same moment generating function. The moment generating function of the latter is given by

$$M_X(r) = E\left[e^{rX}\right] = E\left[\prod_{i=1}^{v} e^{rX_i}\right] = \prod_{i=1}^{v} E\left[e^{rX_i}\right] = \prod_{i=1}^{v} (pe^r + (1-p)) = M_N(r).$$

This completes the proof. □

Remarks. The corollary states that $N$ describes the number of defaults within a portfolio of fixed size $v \in \mathbb{N}$. Every risk in this portfolio has the same default probability $p$ and defaults of different risks do not influence each other (they are independent). Thus, if $N$ has a binomial distribution then every risk in such a portfolio can default at most once. This is the case, for instance, for life insurance policies where an insured person can die at most once. In non-life insurance this distribution is less commonly used because for typical non-life insurance policies we can have more than one claim within a fixed time interval, e.g., a car insurance policy can suffer two or more accidents within the same accounting year. Therefore, the binomial distribution is not of central interest in non-life insurance modeling.

Definition 2.5 (compound binomial model). The total claim amount $S$ has a compound binomial distribution, write

$$S \sim \mathrm{CompBinom}(v, p, G),$$

if $S$ has a compound distribution with $N \sim \mathrm{Binom}(v, p)$ for given $v \in \mathbb{N}$ and $p \in (0,1)$ and individual claim size distribution $G$.

Proposition 2.6. Assume $S \sim \mathrm{CompBinom}(v, p, G)$. We have

$$E[S] = vp \, E[Y_1],$$

$$\mathrm{Var}(S) = vp \left( E[Y_1^2] - p \, E[Y_1]^2 \right),$$

$$\mathrm{Vco}(S) = \sqrt{\frac{1 - p + \mathrm{Vco}(Y_1)^2}{vp}},$$

$$M_S(r) = (p M_{Y_1}(r) + (1-p))^v \qquad \text{for } r \in \mathbb{R},$$

whenever they exist.

Proof. The proof is an immediate consequence of Propositions 2.2 and 2.3. □
tes
Remark. The coefficient of variation $\mathrm{Vco}(S)$ is a measure for the degree of diversification within the portfolio. If $S$ has a compound binomial distribution with fixed default probability $p$ and fixed claim size distribution $G$ having finite second moment, then the coefficient of variation converges to zero of order $v^{-1/2}$ as the portfolio size $v$ increases.

Corollary 2.7 (aggregation property). Assume $S_1, \ldots, S_n$ are independent with $S_j \sim \mathrm{CompBinom}(v_j, p, G)$ for all $j = 1, \ldots, n$. The aggregated claim has a compound binomial distribution with

$$S = \sum_{j=1}^{n} S_j \sim \mathrm{CompBinom}\left(\sum_{j=1}^{n} v_j, \, p, \, G\right).$$

Proof. Exercise. Note here that $n$ describes the (deterministic) number of portfolios and should not be confused with the binomial random variable $N$. □

Exercise 3. Assume $S \sim \mathrm{CompBinom}(v, p, G)$ and choose $M > 0$ such that $G(M) \in (0,1)$. Define the compound distribution of claims $Y_i$ exceeding threshold $M$ by

$$S_{lc} = \sum_{i=1}^{N} Y_i \, 1_{\{Y_i > M\}}.$$

Then we have $S_{lc} \sim \mathrm{CompBinom}(v, p(1 - G(M)), G_{lc})$ where the large claims size distribution satisfies $G_{lc}(y) = P[Y_1 \le y \,|\, Y_1 > M]$. ∎


2.2.2 Poisson distribution


For defining the Poisson distribution we choose a fixed volume $v > 0$ and a fixed expected claims frequency $\lambda > 0$.

We say $N$ has a Poisson distribution, write $N \sim \mathrm{Poi}(\lambda v)$, if

$$p_k = P[N = k] = e^{-\lambda v} \frac{(\lambda v)^k}{k!} \qquad \text{for all } k \in A = \mathbb{N}_0.$$

The power series expansion of the exponential function $e^{\lambda v}$ provides $\sum_{k \ge 0} p_k = 1$ and thus we have a discrete distribution function on the set $A = \mathbb{N}_0$.

The Poisson distribution goes back to Siméon Denis Poisson (1781-1840) who published his work on probability theory in 1837.

Note that the parameters $\lambda$ and $v$ only appear as a product in the Poisson distribution. Therefore, we could also define $c = \lambda v > 0$ and work solely with $c$. This is the way the Poisson distribution is usually treated in the literature. Here we insist on keeping the separation of $c$ into $\lambda$ and $v$ because this provides the expected frequency interpretation for $\lambda$ and it will allow us to study diversification benefits. The latter is exactly one of the statements in the next proposition.
no

Proposition 2.8. Assume N Poi(v) for fixed , v > 0. Then


s
1
E[N ] = v = Var(N ), Vco(N ) = ,
v
NL

MN (r) = exp {v(er 1)} for all r R.

Proof. We calculate the moment generating function and then the first two moments follow from
formula (1.5). For the moment generating function we have using the power series expansion of
the exponential function
X (v)k X (ver )k
MN (r) = erk ev = ev = exp {v + ver } .
k! k!
k0 k0

This completes the proof. 2

Proposition 2.8 provides the interpretation of the parameter $\lambda$. For given volume $v > 0$ the expected claims frequency is

$$E\left[\frac{N}{v}\right] = \lambda.$$

Moreover, for the coefficient of variation of the claims frequency $N/v$ we obtain

$$\mathrm{Vco}\left(\frac{N}{v}\right) = (\lambda v)^{-1/2} \to 0 \qquad \text{for } v \to \infty. \qquad (2.2)$$
Next we give a constructive characterization of the Poisson distribution.

Lemma 2.9. Assume $N_v \sim \mathrm{Binom}(v, p)$ with $v \in \mathbb{N}$ and $p = p(v) \in (0,1)$ such that $\lim_{v \to \infty} vp = c \in (0, \infty)$. Then $N_v$ converges in distribution to $N \sim \mathrm{Poi}(c)$ as $v \to \infty$.

Proof. In view of Lemma 1.4 we need to prove that the moment generating functions of $N_v$ have the appropriate convergence property. We have

$$M_{N_v}(r) = (pe^r + (1-p))^v = \left[\left(1 + p(v)(e^r - 1)\right)^{1/p(v)}\right]^{vp(v)}.$$

Note that $p(v) \to 0$ as $v \to \infty$. If we apply this limit to the inner bracket $(1 + p(v)(e^r - 1))^{1/p(v)}$ we exactly obtain the limit definition of the exponential function $\exp\{e^r - 1\}$, see Definition 14.30 in Merz-Wüthrich [79]. This together with the fact that $vp(v) \to c$ as $v \to \infty$ provides the proof. □

Interpretation. Binomially distributed claims counts $N_v$ can be approximated by a Poisson distribution if the default probability $p$ is very small compared to the portfolio size $v$.
portfolio size v.
tes
Definition 2.10 (compound Poisson model). The total claim amount $S$ has a compound Poisson distribution, write

$$S \sim \mathrm{CompPoi}(\lambda v, G),$$

if $S$ has a compound distribution with $N \sim \mathrm{Poi}(\lambda v)$ for given $\lambda, v > 0$ and individual claim size distribution $G$.

Proposition 2.11. Assume $S \sim \mathrm{CompPoi}(\lambda v, G)$. We have

$$E[S] = \lambda v \, E[Y_1],$$

$$\mathrm{Var}(S) = \lambda v \, E[Y_1^2],$$

$$\mathrm{Vco}(S) = \sqrt{\frac{1 + \mathrm{Vco}(Y_1)^2}{\lambda v}},$$

$$M_S(r) = \exp\{\lambda v (M_{Y_1}(r) - 1)\} \qquad \text{for } r \in \mathbb{R},$$

whenever they exist.

Proof. The proof is an immediate consequence of Propositions 2.2 and 2.8. □

Remark. If $S$ has a compound Poisson distribution with fixed expected claims frequency $\lambda > 0$ and fixed claim size distribution $G$ having finite second moment, then the coefficient of variation converges to zero of order $v^{-1/2}$ as the portfolio size $v$ increases.

The compound Poisson distribution has the so-called aggregation property and the
disjoint decomposition property. These are two extremely beautiful and useful
properties which explain part of the popularity of the compound Poisson model.
We first state and prove these two properties and then we give interpretations in
the context of non-life insurance portfolio modeling.

Theorem 2.12 (aggregation of compound Poisson distributions). Assume $S_1, \ldots, S_n$ are independent with $S_j \sim \mathrm{CompPoi}(\lambda_j v_j, G_j)$ for all $j = 1, \ldots, n$. The aggregated claim has a compound Poisson distribution

$$S = \sum_{j=1}^{n} S_j \sim \mathrm{CompPoi}(\lambda v, G),$$

with

$$v = \sum_{j=1}^{n} v_j, \qquad \lambda = \sum_{j=1}^{n} \frac{v_j}{v} \lambda_j \qquad \text{and} \qquad G = \sum_{j=1}^{n} \frac{\lambda_j v_j}{\lambda v} G_j.$$

Note that $n$ describes the (deterministic) number of portfolios $S_1, \ldots, S_n$ here, and it should not be confused with the Poisson random variable $N$.

Proof. We have assumed that $G_j(0) = 0$ for all $j = 1, \ldots, n$, which implies that $S \ge 0$, $P$-a.s. From Lemma 1.3 it follows that we only need to identify the moment generating function of $S$ in order to prove that it is compound Poisson distributed. Observe that $M_S(r)$ exists at least for $r \le 0$. Thus, we calculate (using the independence of the $S_j$'s)

$$M_S(r) = E\left[\exp\left\{r \sum_{j=1}^{n} S_j\right\}\right] = \prod_{j=1}^{n} E[\exp\{r S_j\}] = \prod_{j=1}^{n} \exp\left\{\lambda_j v_j \left(M_{Y_1^{(j)}}(r) - 1\right)\right\} = \exp\left\{\lambda v \left(\sum_{j=1}^{n} \frac{\lambda_j v_j}{\lambda v} M_{Y_1^{(j)}}(r) - 1\right)\right\},$$

where we have assumed $Y_1^{(j)} \sim G_j$. This is a compound Poisson distribution with expected number of claims $\lambda v$ and the claim size distribution $G$ is obtained from the moment generating function $\sum_{j=1}^{n} \frac{\lambda_j v_j}{\lambda v} M_{Y_1^{(j)}}(r)$: note that $G = \sum_{j=1}^{n} \frac{\lambda_j v_j}{\lambda v} G_j$ is a distribution function (non-decreasing, right-continuous, $\lim_{x \to -\infty} G(x) = 0$ and $\lim_{x \to \infty} G(x) = 1$). We choose $Y \sim G$ and obtain

$$M_Y(r) = \int_0^\infty e^{ry} \, dG(y) = \int_0^\infty e^{ry} \, d\left(\sum_{j=1}^{n} \frac{\lambda_j v_j}{\lambda v} G_j\right)(y) = \sum_{j=1}^{n} \frac{\lambda_j v_j}{\lambda v} \int_0^\infty e^{ry} \, dG_j(y) = \sum_{j=1}^{n} \frac{\lambda_j v_j}{\lambda v} M_{Y_1^{(j)}}(r).$$

Using Lemma 1.3 once more for the claim size distribution proves the theorem. □

Next we analyze the disjoint decomposition property. Therefore we slightly extend the compound Poisson model. Let $(p_j^+)_{j=1,\ldots,m}$ be a discrete probability distribution on the finite set $\{1, \ldots, m\}$. Assume $p_j^+ > 0$ for all $j$. We can interpret the set $\{1, \ldots, m\}$ as different sub-portfolios or different lines of business. For instance, if we have a car insurance portfolio, a property insurance portfolio and a liability insurance portfolio we set $m = 3$, with $j \in \{1, 2, 3\}$ labeling the portfolios of the three lines of business. Assume $G_j$ are the corresponding claim size distributions of the sub-portfolios with $G_j(0) = 0$. Then, we can define the mixture distribution by

$$G(y) = \sum_{j=1}^{m} p_j^+ G_j(y) \qquad \text{for } y \in \mathbb{R}.$$

Theorem 2.12 exactly provides such a mixture distribution with $p_j^+ = \lambda_j v_j / (\lambda v)$ if we aggregate the sub-portfolios.

The next theorem provides the opposite direction, i.e. it is aiming at decomposing the (mixing) distribution $G$. We define a discrete random variable $I$ which indicates to which sub-portfolio a particular claim $Y$ belongs: define $I$ by

$$P[I = j] = p_j^+ \qquad \text{for all } j \in \{1, \ldots, m\}. \qquad (2.3)$$

This allows us to extend the compound Poisson model from Definition 2.10.
tes
Definition 2.13 (extended compound Poisson model). The total claim amount $S = \sum_{i=1}^{N} Y_i$ has a compound Poisson distribution as defined in Definition 2.10. In addition, we assume that $(Y_i, I_i)_{i \ge 1}$ are i.i.d. and independent of $N$, with $Y_i$ having marginal distribution function $G$ with $G(0) = 0$ and $I_i$ having marginal distribution function given by (2.3).

Remark. Note that Definition 2.13 gives a well-defined extension, i.e. it fully respects the assumptions made in Definition 2.10 because $(Y_i, I_i)_{i \ge 1}$ are i.i.d. and independent of $N$ with $Y_i$ having the appropriate marginal distribution function $G$. Observe that we do not specify the dependence structure between $Y_i$ and $I_i$. If we choose $m = 1$ in (2.3) we are back in the classical compound Poisson model. Therefore, the next theorem especially applies to the compound Poisson model.

Before stating the next theorem we introduce an admissible and measurable disjoint decomposition (called partition) of the total space. The random vector $(Y_1, I_1)$ takes values in $\mathbb{R}_+ \times \{1, \ldots, m\}$. On this latter set we choose a finite sequence $A_1, \ldots, A_n$ of (measurable) sets such that $A_k \cap A_l = \emptyset$ for all $k \ne l$ and

$$\bigcup_{k=1}^{n} A_k = \mathbb{R}_+ \times \{1, \ldots, m\}. \qquad (2.4)$$

Such a sequence $A_1, \ldots, A_n$ is called a measurable disjoint decomposition or partition of $\mathbb{R}_+ \times \{1, \ldots, m\}$. This partition is called admissible for $(Y_1, I_1)$ if for all $k = 1, \ldots, n$

$$p^{(k)} = P[(Y_1, I_1) \in A_k] > 0.$$

Note that $\sum_{k=1}^{n} p^{(k)} = 1$, due to (2.4) and the mutual disjointness.

Theorem 2.14 (disjoint decomposition of compound Poisson distributions). Assume that $S$ fulfills the extended compound Poisson model assumptions of Definition 2.13. We choose an admissible partition $A_1, \ldots, A_n$ for $(Y_1, I_1)$. Define for $k = 1, \ldots, n$ the random variables

$$S_k = \sum_{i=1}^{N} Y_i \, 1_{\{(Y_i, I_i) \in A_k\}}.$$

The $S_k$ are independent and $\mathrm{CompPoi}(\lambda_k v_k, G_k)$ distributed for $k = 1, \ldots, n$ with

$$\lambda_k v_k = \lambda v \, p^{(k)} > 0 \qquad \text{and} \qquad G_k(y) = P[Y_1 \le y \,|\, (Y_1, I_1) \in A_k].$$

(m
Proof of Theorem 2.14. We prove the theorem using the multivariate extension of the mo-
ment generating function. Choose r = (r1 , . . . , rn )0 Rn . The multivariate moment generating
function of random vector S = (S1 , . . . , Sn )0 is given by
" ( n )# " ( n N
)#
X X X
0
MS (r) = E [exp {r S}] = E exp rk Sk = E exp rk Yi 1{(Yi ,Ii )Ak }
k=1 k=1 i=1
"N " ( n
) ##
Y X
tes
= E E exp rk Yi 1{(Yi ,Ii )Ak } N


i=1 k=1
"N " ( n )##
Y X
= E E exp rk Yi 1{(Yi ,Ii )Ak } .
i=1 k=1

Note that N is a Poisson distributed random variable and n denotes the deterministic number of
no

disjoint sets A1 , . . . , An . We calculate the inner expected values of the last expression.
" ( n )# n
" ( n ) #
X X X
E exp rk Yi 1{(Yi ,Ii )Ak } = E exp rk Yi 1{(Yi ,Ii )Ak } 1{(Yi ,Ii )Al }
k=1 l=1 k=1
n
" ( n
) #
X X
= E exp rk Yi 1{(Yi ,Ii )Ak } (Yi , Ii ) Al P [(Yi , Ii ) Al ]

NL


l=1 k=1
Xn n
X
= E [ exp {rl Yi }| (Yi , Ii ) Al ] p(l) = p(l) MY (l) (rl ),
1
l=1 l=1

(l)
where we assume Y1 Gl . Collecting all terms we obtain
!N " ( !)#
Xn n
X
MS (r) = E p(l) MY (l) (rl ) = E exp N log p(l) MY (l) (rl )
1 1
l=1 l=1
( n
!) ( n
)
X X  
(l) (l)
= exp v p MY (l) (rl ) 1 = exp v p MY (l) (rl ) 1
1 1
l=1 l=1
n
Y n  o n
Y
= exp vp(l) MY (l) (rl ) 1 = MSl (rl ).
1
l=1 l=1

This proves the theorem because we have obtained a product (i.e. independence) of moment
generating functions of compound Poisson distributed random variables Sl , l = 1, . . . , n. 2

Version March 14, 2017, M.V. Wthrich, ETH Zurich


34 Chapter 2. Collective Risk Modeling

Remarks 2.15 (Aggregation and disjoint decomposition properties).

- The aggregation property implies that we can follow a bottom-up modeling approach for the entire insurance business: we model each sub-portfolio $S_j$ independently with a compound Poisson distribution. The total portfolio $S$ is then easily obtained by the aggregation theorem and we stay in the same family of distributions. This theorem is of special importance when we estimate the frequency parameters $\lambda_j$ and the individual claim size distributions $G_j$ on the bottom (sub-portfolio) level.

- The disjoint decomposition property implies that we can also follow a top-down modeling approach: we model the overall portfolio $S$ by a compound Poisson distribution. The disjoint decomposition property then easily allows us to allocate the total claim amount to the sub-portfolios. The crucial result here is, at first sight surprisingly, that this allocation results in independent compound Poisson distributions for the $S_j$. This independence property does not hold true for other compound distributions because it essentially uses the independent space and time decoupling property of Poisson point processes, see also Section 3.3.2 in Mikosch [81].

- For $I$ we have chosen a finite (discrete) indicator. Of course, this model can easily be extended to other indicators. The crucial property is the i.i.d. assumption on the random vectors $(Y_i, I_i)$. We have chosen a finite indicator $I$ because this has the natural interpretation of sub-portfolios. If $I = 1$, $P$-a.s., then we can completely drop this indicator.
The choice of the appropriate volume on the sub-portfolios depends on the choice of the indicator $I$. If $m = 1$, i.e. if we only consider one portfolio, and if we apply a disjoint decomposition of this portfolio as follows,

$$Y_i = Y_i \, 1_{\{Y_i \in A_1\}} + \ldots + Y_i \, 1_{\{Y_i \in A_n\}},$$

then it is natural to set $v_k = v$ and $\lambda_k = \lambda p^{(k)}$ for $k = 1, \ldots, n$. That is, the volume $v > 0$ remains constant but the expected claims frequencies $\lambda_k$ change according to $A_k$. This is also called thinning of the Poisson point process.

The second extreme case is $m = n > 1$ and the disjoint decomposition is given by

$$\{(Y_i, I_i) \in A_k\} = \{I_i = k\},$$

i.e. we only consider a decomposition according to the different sub-portfolios $k = 1, \ldots, m$. In this case we may define $v_k > 0$ by the volume of portfolio $k$ and $\lambda_k = p^{(k)} \lambda v / v_k$.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 2. Collective Risk Modeling 35

Example 2.16 (large claims separation). A very important application of the


disjoint decomposition property of compound Poisson distributions is the separa-
tion of large claims from small claims. Often, there does not exist one parametric
distribution function G that applies to the entire range of possible outcomes of
the individual claim sizes Yi . Therefore, these individual claim sizes are divided
into different layers that need to be concatenated. Let us assume that we would
like to model two layers. We choose a large claims threshold M > 0 such that
G(M ) (0, 1), i.e. G(M ) is bounded away from zero and one. We then define the
partition A1 , A2 of R+ by

w)
A = A1 = {Y1 M } and Ac = A2 = {Y1 > M } .

Assume that S CompPoi(v, G). We define the total claim Ssc in the small
claims layer and the total claim Slc in the large claims layer by

Ssc =
N
X

i=1
Yi 1{Yi M } and
(m
Slc =
N
X

i=1
Yi 1{Yi >M } .
tes
Theorem 2.14 implies that Ssc and Slc are independent and compound Poisson
distributed with

Ssc CompPoi (sc v = G(M )v , Gsc (y) = P [Y1 y|Y1 M ]) ,


no

and

Slc CompPoi (lc v = (1 G(M ))v , Glc (y) = P [Y1 y|Y1 > M ]) .

In particular, this means that we can model the small and the large claims layers
completely separately and then obtain the total claim amount distribution by a sim-
NL

ple convolution of the two resulting distribution functions (due to independence),


see Example 4.11, below.
For the large claims layer we need to determine the expected large claims frequency
lc > 0. The individual large claim sizes Yi |{Yi >M } are often modeled by a Pareto
distribution with threshold M and tail parameter > 1, for more details see
Sections 3.2.5 and 3.4.1.
The small claims layer is often approximated by a parametric distribution func-
tion: we have seen in (2.1) that compound distributions may lead to rather time
consuming computational complexity when the expected number of claims sc v is
large. Therefore, one typically assumes that the expected number of small claims
is sufficiently large so that we are already in the asymptotic regime of the central
limit theorem and then we approximate this compound distribution by the Gaus-
sian distribution, see Theorem 4.1 below, or maybe by a distribution function that

Version March 14, 2017, M.V. Wthrich, ETH Zurich


36 Chapter 2. Collective Risk Modeling

is slightly skewed, see Sections 4.1.2 and 4.1.3. Note that the small claims layer
cannot be distorted by large claims because they are already sorted out by the
threshold M . We will describe this in more detail in Section 3.4.1, below. 

2.2.3 Mixed Poisson distribution


Above we have introduced the binomial and the Poisson distributions. These two
distributions have the following relationship

w)
binomial distribution E [N ] > Var(N ),
Poisson distribution E [N ] = Var(N ).

However, insurance data often suggests

(m E [N ] < Var(N ).

Therefore, we present more claims count distributions for N . In particular, the


mixed Poisson distribution enjoys the latter property of a variance dominating the
mean (over-dispersion). We remark that similar constructions could also be done
tes
for the binomial distribution. We refrain from doing so because the Poisson case
is more appropriate for non-life insurance modeling.
The mixed Poisson distribution gives the general principle and a specific example
will be given in the next section. The idea is to attach volatility (or uncertainty)
no

to the claims frequency parameter , thus, the claims frequency will be modeled
as a latent (random) variable. Based on this latent variable we then choose the
claims count distribution being conditionally Poisson distributed.

Definition 2.17 (mixed Poisson distribution).


NL

Assume H with H(0) = 0, E [] = and Var() > 0.

Conditionally, given , N Poi(v) for a fixed volume v > 0.

Lemma 2.18. Assume N satisfies Definition 2.17. We have E [N ] < Var(N ).

Proof. The tower property implies E[N ] = E[E[N |]] = E[v] = v and

Var(N ) = E[Var(N |)] + Var(E[N |]) = vE[] + v 2 Var() > v.

This completes the proof. 2

In the next section we make an explicit choice for the distribution function H.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 2. Collective Risk Modeling 37

2.2.4 Negative-binomial distribution


In this section we assume that N has a mixed Poisson distribution with latent
variable drawn from a gamma distribution. Therefore, we briefly introduce the
gamma distribution, which is described in more detail in Section 3.3.3, below.

We say X has a gamma distribution, write X (, c), with shape parameter


> 0 and scale parameter c > 0 if X is a non-negative, absolutely continuous
random variable with density
c

w)
f (x) = x1 exp {cx} for x 0.
()

The moments of X are given by



c


(m
E[X] = , Var(X) = 2 and MX (r) = for r < c.
c c cr

The gamma distribution has many nice properties and it is used rather frequently
for the modeling of latent variables and for the modeling of individual claim sizes,
see Section 3.3.3.
tes
Definition 2.19 (negative-binomial distribution, 1st definition). We say N has
a negative-binomial distribution, write N NegBin(v, ), with volume v > 0,
expected claims frequency > 0 and dispersion parameter > 0, if

(, ), and
no

conditionally, given , N Poi(v).

Note that for = we are exactly in the context of Definition 2.17 with the first
two moments given by
NL

E[] = and Var() = 2 / > 0.

Proposition 2.20 (negative-binomial distribution, 2nd defini-


tion). The negative-binomial distribution as given in Definition
2.19 satisfies for k A = N0
!
k+1
pk = P[N = k] = (1 p) pk ,
k
where we choose p = v/( + v) (0, 1). G. Plya
This second representation is often used as the definition of the
negative-binomial distribution. It is sometimes also named after

Version March 14, 2017, M.V. Wthrich, ETH Zurich


38 Chapter 2. Collective Risk Modeling

George Plya (1887-1985). In our context, it is simpler to


work with the first definition. Especially, parameter estimation
will give an explicit meaning to the latent variable .

Proof of Proposition 2.20. We apply the tower property which implies

(v)k
 
P[N = k] = E [P[N = k|]] = E exp{v}
k!
Z k
(xv)
= exp{xv} x1 exp {x} dx
0 k! ()
Z
(v)k ( + k) ( + v)+k +k1

w)
= x exp {( + v)x} dx
() k! ( + v)+k 0 ( + k)
   k  
( + k) v k+1
= = (1 p) pk ,
() k! + v + v k
notice that the second last inequality follows because we have a gamma density with shape

(m
parameter + k and scale parameter + v under the integral. This trick of completion should
be remembered because it is applied very frequently. 2

Proposition 2.21. Assume N NegBin(v, ) for fixed , v, > 0. Then

E[N ] = v
Var(N ) = v(1 + v/) > v,
tes
s
1 q
Vco(N ) = 1 + v/ > 1/2 > 0,
v
!
1p
MN (r) = for all r < log p,
1 per
no

and p = v/( + v) (0, 1).


Proof. The first three statements are a direct consequence of the proof of Lemma 2.18 and the
properties of the gamma distribution. Therefore, it remains to calculate the moment generating
function. The tower property implies
NL

MN (r) = E E erN = E [exp {v (er 1)}] = M (v(er 1)),


  

from which the claim follows for (, ) and 1 p = /( + v). 2

Proposition 2.21 provides a nice interpretation. For given volume v > 0 the ex-
pected claims frequency is
N
 
E = .
v
Moreover, for the coefficient of variation of the claims frequency N/v we obtain
N
  q
Vco = (v)1 + 1 1/2 > 0 for v .
v
This can be interpreted as follows. The random variable reflects the uncertainty
in the true underlying frequency parameter of the Poisson distribution. This

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 2. Collective Risk Modeling 39

uncertainty also remains in the portfolio for infinitely large volume v, i.e. this
risk is not diversifiable, and the positive lower bound 1/2 is determined by the
dispersion parameter (0, ). In particular, consider a time series N1 , N2 , . . .
of claims counts in different accounting years 1, 2, . . .. Each of these accounting
years has its own (risk) characteristics 1 , 2 , . . ., like weather conditions, inflation
index, portfolio fluctuations, etc. Since we do not know these characteristics a
priori, i.e. prior to future accounting years, we model these characteristics with
a latent factor process (t )t1 which provides the true frequency parameter for
accounting years t, given by t = t . This differs from the Poisson case, see (2.2).

w)
Example 2.22 (claims count distributions). We compare the binomial, Poisson
and the negative-binomial distributions. We assume that they have identical means
E[N ] = 500 with v = 1000, p = = 0.5 and = 100.

(m
0.025

binomial
Poisson
negativebinomial
0.020

tes
probability weights p_k

0.015
0.010

no
0.005

NL 0.000

200 300 400 500 600 700 800

Figure 2.1: Probability weights pk of binomial, Poisson and negative binomial


distributions with identical means (for convenience plotted as lines).

In Figure 2.1 we plot the corresponding probability weights pk . We observe that


the coefficient of variation is increasing from the binomial over the Poisson to the
negative-binomial distribution, which gives increasingly more uncertainty to claims
counts. 

Version March 14, 2017, M.V. Wthrich, ETH Zurich


40 Chapter 2. Collective Risk Modeling

Definition 2.23 (compound negative-binomial model). The total claim amount S


has a compound negative-binomial distribution, write

S CompNB(v, , G),

if S has a compound distribution with N NegBin(v, ) for given , v, > 0 and


individual claim size distribution G.

Proposition 2.24. Assume S CompNB(v, , G). We have, whenever they

w)
exist,

E[S] = v E[Y1 ],
Var(S) = v E[Y12 ] + (v)2 E[Y1 ]2 /,
s

(m
1 q
Vco(S) = 1 + Vco(Y1 )2 + v/ > 1/2 ,
v
!
1p
MS (r) = for r R such that MY1 (r) < 1/p,
1 pMY1 (r)

with p = v/( + v) (0, 1).


tes
Proof. The proof is an immediate consequence of Propositions 2.2 and 2.21. 2

Exercise 4. Assume S CompNB(v, , G) and choose M > 0 such that G(M )


(0, 1). Define the compound distribution of claims Yi exceeding threshold M by
no

N
X
Slc = Yi 1{Yi >M } .
i=1

Then we have Slc CompNB((1 G(M ))v, , Glc ) where the large claims size
distribution satisfies Glc (y) = P [Y1 y|Y1 > M ]. 
NL

2.3 Parameter estimation


Once we have specified the distribution functions for N and Yi we still need to
determine their parameters. In the case of the claims count distribution of N these
are (i) the default probability p for the binomial distribution; (ii) the expected
claims frequency for the Poisson distribution; or (iii) the expected claims fre-
quency and the dispersion parameter for the negative-binomial distribution.
Essentially, there are three different common ways to estimate these parameters:

1. method of moments (MM),

2. maximum likelihood estimation (MLE) method,

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 2. Collective Risk Modeling 41

3. Bayesian inference method (inverse probability method).

In this section we describe the first two methods. The Bayesian inference method
is deferred to Chapter 8.

2.3.1 Method of moments


We start with an example to explain the method of moments. Assume that we have
i.i.d.
an i.i.d. sequence X1 , . . . , XT F , where F is a parametric distribution function
that depends (for simplicity) on a two dimensional (real valued) parameter (1 , 2 ).

w)
Assume that the first two moments of X1 are finite, and thus, for all t = 1, . . . , T
we have mean and variance (as a function of (1 , 2 ))

= (1 , 2 ) = E[Xt ] < and 2 = 2 (1 , 2 ) = Var(Xt ) < .

(m
Remark. For general d-dimensional (real valued) parameters (1 , . . . , d ) we ex-
tend the argument to the first d moments of Xt .

We define the sample mean and sample variance by, T 2 for the latter,
tes
T T
1 X 1
bT2 = (Xt b T )2 .
X
b T = Xt and (2.5)
T t=1 T 1 t=1

A straightforward calculation shows that these are unbiased estimators for and
2 , that is,
no

E[b T ] = = (1 , 2 ) and E[bT2 ] = 2 = 2 (1 , 2 ). (2.6)

This motivates the moment estimator (b1 , b2 ) for (1 , 2 ) by solving the system of
equations
NL

b T = (b1 , b2 ) and bT2 = 2 (b1 , b2 ).

In our situation the problem is more involved. Assume we have a vector of obser-
vations N = (N1 , . . . , NT )0 , where Nt denotes the number of claims in accounting
year t. The difficulty is that Nt , t = 1, . . . , T , are not i.i.d. because they depend on
different volumes vt . That is, in general, the portfolio changes over the accounting
years. Therefore, we need to slightly modify the framework described above.

Assumption 2.25. Assume there exist strictly positive volumes v1 , . . . , vT such


that the components of F = (N1 /v1 , . . . , NT /vT )0 are independent with

= E[Nt /vt ] and t2 = Var(Nt /vt ) (0, ),

for all t = 1, . . . , T .

Version March 14, 2017, M.V. Wthrich, ETH Zurich


42 Chapter 2. Collective Risk Modeling

Lemma 2.26. We make Assumption 2.25. The unbiased linear (in F) estimator
for with minimal variance is given by
T
!1 T
b MV
X1 X Nt /vt
T = 2 2
,
t=1 t t=1 t

the variance of this estimator is given by


T
!1

b MV
 X1
Var T = 2
.
t=1 t

w)
The upper index MV stands for minimal variance estimator.

Proof. We apply the method of Lagrange, see Section 24.3 in Merz-Wthrich [79]. We define
the mean vector = e = (1, . . . , 1)0 RT and the diagonal positive definite covariance matrix
= diag(12 , . . . , T2 ) of F. Then, we would like to solve the following minimization problem

(m
1 0
x+ = arg min{xRT ;x0 =} x x,
2
thus, we minimize the variance Var(x0 F) = x0 x subject to all unbiased linear combinations of
F which gives the constraint = E[x0 F] = x0 . The Lagrangian for this problem is given by
1 0
L(x, c) = x x c(x0 ),
2
tes
with Lagrange multiplier c. The optimal value x+ is found by the solution of

L(x, c) = x c = 0 and L(x, c) = x0 + = 0.
x c
The first requirement implies x = c1 = c1 e. Plugging this into the second requirement
implies = x0 = c2 e0 1 e. If we solve this for the Lagrange multiplier we obtain c =
no

1 (e0 1 e)1 . This provides


T
!1
1 X 1 0
+
x = 0 1 e =1
2 12 , . . . , T2 .
e e
t=1 t

bMV = (x+ )0 F and the variance is given by


This implies T
NL

T
!1
  X
bMV = (x+ )0 x+ = (e0 1 e)1 =
Var T t2 .
t=1

This proves the lemma. 2

We apply this lemma to the case of the binomial and the Poisson distributions.
Assume that Nt , t = 1, . . . , T , are independent with Nt Binom(vt , p) or Nt
Poi(vt ), respectively. Then we have in the binomial case

E[Nt /vt ] = p and Var(Nt /vt ) = p(1 p)/vt = t2 ,

and in the Poisson case

E[Nt /vt ] = and Var(Nt /vt ) = /vt = t2 .

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 2. Collective Risk Modeling 43

Note that in both cases the unknown parameter p and , respectively, appears in the
variance. However, the appearance is of multiplicative
1
nature which implies that
2 PT 2
it cancels in the weights wt = t s=1 s . Therefore, we get the following
minimal variance estimators in the binomial and the Poisson cases.

Estimator 2.27 (method of moment estimators in the binomial and Poisson


cases).
We have the following unbiased linear minimal variance estimators:

binomial case for p

w)
T T
1 vt Nt
pbMV
X X
T = PT Nt = PT ;
s=1 vs t=1 t=1 s=1 vs vt

(m
Poisson case for
T T
b MV = P 1 X X vt Nt
T T Nt = PT .
s=1 vs t=1 t=1 s=1 vs vt

The variances of these estimators are given by


tes
  p(1 p) b MV = P
 
Var pbMV
T = PT and Var T T .
s=1 vs s=1 vs
These variances (and uncertainties) converge to zero for Ts=1 vs , and they can
P

be estimated by replacing the unknown parameters p and , respectively, by their


no

estimators. Note that we can explicitly give these distributions of the estimators
because in the former case Tt=1 Nt Binom( Tt=1 vt , p) and in the latter case
P P
PT PT
t=1 Nt Poi( t=1 vt ).

The negative-binomial case is more involved. Assume that Nt , t = 1, . . . , T , are


NL

independent with Nt NegBin(vt , ). For the first two moments we have


E[Nt /vt ] = and Var(Nt /vt ) = /vt + 2 / = t2 .
The variance term has two unknown parameters and and we lose the nice
multiplicative structure from the binomial and the Poisson case which has allowed
us to apply Lemma 2.26 in a straightforward manner. If we drop the condition
minimal variance we obtain the following unbiased linear estimator.

Estimator 2.28 (moment estimator in the negative-binomial case (1/2)).


We have the following unbiased linear estimator for
T T
b NB = P 1 X X vt Nt
T T Nt = PT .
s=1 vs t=1 t=1 s=1 vs vt

Version March 14, 2017, M.V. Wthrich, ETH Zurich


44 Chapter 2. Collective Risk Modeling

In the last formula we could also take other volume weighted averages. The unbi-
asedness of b NB immediately follows from the assumptions of the negative-binomial
T
distribution. The variance of this estimator is given by
!2 T PT

b NB
 1 X t=1 vt + (vt )2 /
Var T = PT Var(Nt ) = 2 .
vs
P
T
s=1 t=1 s=1 vs

There remains the estimate of . Therefore, we define

w)
T 2
1 X Nt b N B

VbT2 = vt T . (2.7)
T 1 t=1 vt

Lemma 2.29. In the negative-binomial model VbT2 satisfies

(m
T T
!
2 vt2
P
h i 1
E VbT2 vt Pt=1
X
=+ T .
T 1 t=1 t=1 vt
.

This motivates the following estimator.


tes
Estimator 2.30 (moment estimator in the negative-binomial case (2/2)).
The method of moments suggests the following estimator for
T T
!
(b NB )2 1 vt2
P
bTNB = b 2 T b NB vt Pt=1
X
,
no

T
VT T T 1 t=1 t=1 vt

b NB , otherwise use the Poisson or the binomial model (no over-dispersion


for VbT2 > T
in data N1 , . . . , NT ).
NL

bN B for we have
Proof of Lemma 2.29. Using the unbiasedness of T

T
" 2 # T  
h i 1 X Nt bN B 1 X Nt b N B
E VbT2 = vt E T = vt Var T
T 1 t=1 vt T 1 t=1 vt
T      
1 X Nt Nt b N B 
NB
= vt Var 2Cov , + Var T b
T 1 t=1 vt vt T

T T
1 X 1 1 1 1 X
= vt 2 Var (Nt ) 2 PT Var (Nt ) + P Var (Ns )

2
T 1 t=1 vt vt s=1 vs T
v s=1
s=1 s
" T  PT #
vt + (vt )2 / vt + (vt )2 /

1 X
t=1
= PT .
T 1 t=1
vt s=1 vs

This proves the lemma. 2

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 2. Collective Risk Modeling 45

We justify these estimators in the case of vt = v for all t = 1, . . . , T . This uniform


volume case provides
T
b NB v = 1
X
T Nt = b T ,
T t=1
which is the sample mean of i.i.d. random variables Nt . For the estimate we
obtain in the uniform volume case
(b NB v)2
T
bTNB = 2 b NB v
with
VT v
b
T
T 
1 X bN B v 2 =

VbT2 v = Nt b T2 ,

w)
T
T 1 t=1
where the latter term is the sample variance of i.i.d. random variables Nt . Or in
other words, the proposed estimators in the uniform volume vt = v case are found
by looking at the system of equations (2.6). In the negative-binomial model this

(m
system is given by
E[b T ] = = v and E[bT2 ] = 2 = v + (v)2 /.
Replacing and 2 by their sample estimators and solving the system of equations
b NB and
provides bTNB in the uniform volume case.
T
tes
2.3.2 Maximum likelihood estimators
The MLE method has been popularized by Sir Ronald Aylmer
Fisher (1890-1962) but it has been used already before by Gauss,
Laplace and others. The philosophy behind MLE is different
no

compared to the method of moments. For MLE the first ob-


jective is not unbiasedness but maximizing the probability of a
given observation. MLE can be done for densities or for probabil-
ity weights, we formulate it for the latter because at the moment
Sir R.A. Fisher
we are looking at discrete random variables N .
NL

Assume that the components of N = (N1 , . . . , NT )0 are independent with probabil-


(t)
ity weights pk () = P [Nt = k] = P[Nt = k] that depend on a common unknown
parameter . The independence property of N1 , . . . , NT implies that the

joint likelihood function for observation N is given by


T
Y (t)
LN () = pNt (),
t=1

and their joint log-likelihood function is given by


T
X (t)
`N () = log LN () = log pNt ().
t=1

Version March 14, 2017, M.V. Wthrich, ETH Zurich


46 Chapter 2. Collective Risk Modeling

The MLE for is based on the rationale that should be chosen such that the
probability of observing N = (N1 , . . . , NT )0 is maximized. The MLE bMLE
T for
based on the observation N is thus given by (subject to existence and uniqueness)

bMLE
T = arg max LN () = arg max `N ().

)
This is solved by a root search algorithm. Under suitable regularity properties and
real valued parameter the MLE bMLE is found as solution of

w
T

T
X (t)
`N () = log pNt () = 0.
t=1

(m
(t)
If the probability weights pk () are sufficiently regular as a function of in a reg-
ular domain which contains the true parameter , then the MLE bMLE T is asymp-
totically unbiased for T and under appropriate scaling it has an asymptotic
Gaussian distribution with inverse Fishers information as covariance matrix, for
details see Theorem 4.1 in Lehmann [70].
tes
Estimator 2.31 (MLE in the binomial case). Assume N1 , . . . , NT are independent
and Binom(vt , p). The MLE is given by
T T
1 vt Nt
no

pbMLE = pbMV
X X
T = PT Nt = PT T .
s=1 vs t=1 t=1 s=1 vs vt

Proof. The log-likelihood function is given by


T  
X vt
`N (p) = log + Nt log p + (vt Nt ) log(1 p).
NL

t=1
Nt

Calculating the derivative w.r.t. p provides the requirement


T
X Nt v t Nt
`N (p) = = 0.
p t=1
p 1p

Solving this for p proves the claim. 2

Estimator 2.32 (MLE in the Poisson case). Assume N1 , . . . , NT are independent


and Poi(vt ). The MLE is given by
T T
b MLE = P 1 X X vt Nt b MV .
T T Nt = PT = T
s=1 vs t=1 t=1 s=1 vs v t

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 2. Collective Risk Modeling 47

Proof. The log-likelihood function is given by


T
X
`N () = (vt ) + Nt log(vt ) log(Nt !).
t=1

Calculating the derivative w.r.t. provides the requirement


T
X Nt
`N () = vt + = 0.
t=1

Solving this for provides the claim. 2

w)
Estimator 2.33 (MLE in the negative-binomial case). Assume N1 , . . . , NT are
b MLE ,
independent and NegBin(vt , ). The MLE ( b MLE ) is the solution of

T
!
X Nt + 1
log + log(1 pt ) + Nt log pt = 0,

(m
(, ) t=1 Nt

with pt = vt /( + vt ) (0, 1).

Unfortunately, this system of equations does not have a closed form solution, and
a root search algorithm is needed to find the MLE solution for (, ), see also page
tes
61 below.

2.3.3 Example and 2 -goodness-of-fit analysis


no

We apply the claims count models (Poisson and


negative-binomial) to a real data set. We take
the data set provided in Gisler [54]. This data
set describes the number of claims in an insur-
ance portfolio that protects private households
against water claims. The data is displayed in
NL

Table 2.1 and Figure 2.2.

We observe a strong growth of volume of more than 40% in this insurance portfolio
from v1982 = 2400 755 policies to v1991 = 3440 757 policies. Such a strong growth
might question the stationarity assumption in the expected claims frequency t
because this growth might also reflect a substantial change in the portfolio (and
the underlying product, possibly). Nevertheless we assume its validity (because
the observed claims frequencies Nt /vt do not show any structure such as a linear
trend, see Figure 2.2) and we fit the Poisson and, if necessary, the negative-binomial
distribution to this data set.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


48 Chapter 2. Collective Risk Modeling

year volume number of frequency


t vt claims Nt Nt /vt
1982 240755 13153 5.46%
1983 255571 14186 5.55%
1984 269739 14207 5.27%
1985 281708 13461 4.78%
1986 306888 21261 6.93%
1987 320265 19934 6.22%
1988 323481 15796 4.88%
1989 334753 15157 4.53%

w)
1990 340265 17483 5.14%
1991 344757 19185 5.56%
total 3018182 163823 5.43%

Table 2.1: Private households water insurance: number of policies, claims counts

(m
and observed yearly claims frequencies, source Gisler [54].

 Poisson model. We assume that Nt are independent with Nt Poi(vt ). The


linear minimal variance estimator and the MLE for are given by, see Estimator
tes
2.32,
1991
b MV = b MLE = P 1
X
T T 1991 Nt = 5.43%.
s=1982 vs t=1982

The coefficient of variation in the Poisson model is given by, see (2.2),
no

Vco(Nt /vt ) = (vt )1/2 .


b MV by
This coefficient of variation is estimated using T

Vco(N b MV 1/2 0.8%.


t /vt ) = (T vt )
d
NL

If we choose 1 standard deviation as confidence bounds, i.e. if we consider the


confidence interval CIt = ( (vt )1/2 ), we obtain estimated confidence intervals
(for any t) of roughly
CI
c = (5.39%, 5.47%).
t

These resulting confidence bounds are very narrow and we observe that most of
the observed yearly claims frequencies Nt /vt in Table 2.1 lie outside of these con-
fidence bounds, see Figure 2.3 (lhs). This clearly rejects the assumption of having
Poisson distributions for the number of claims and suggests that we should study
the negative-binomial model for Nt .

 Negative-binomial model. As described above, the negative-binomial model is


able to model the heterogeneity over different accounting years t. It assumes that
every accounting year t is characterized by a latent (risk) factor t which describes

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 2. Collective Risk Modeling 49

0.070

0.065

observed frequencies

0.060
0.055

w)

0.050


0.045

(m
1982 1984 1986 1988 1990

Figure 2.2: Observed yearly claims frequencies Nt /vt from t = 1982 to 1991 com-
pared to the overall average frequency of 5.43%, see Table 2.1.
tes
the nature of that particular accounting year t. A priori all years are similar which
is expressed by the i.i.d. property of t with t (, ) for identical dispersion
parameters > 0. We estimate this dispersion parameter with Estimator 2.30.
We expect that it substantially differs from , i.e. VbT2 > b NB =
T
b MV . We obtain
T
VbT2 = 15.84 > 5.43%. Thus, we have a clear over-dispersion which results in the
no

estimate
q
bTNB = 56.23 and Vco(N
d
t /vt ) = b NB v )1 + (
( T t bTNB )1 13%.

If we calculate the estimated 1 standard deviation confidence bounds we obtain for


NL

all t roughly
CI
c = (4.70%, 6.15%).
t

This makes much more sense in view of the observed frequencies Nt /vt in Table
2.1. We see that 7/10 of the observations are within these confidence bounds, see
Figure 2.3 (rhs). 

We close this section with a statistical test: In the previous example it was obvious
that the Poisson model does not fit to the data. In situations where this is less
obvious we can use the following 2 -goodness-of-fit test.

Null hypothesis H0 : Nt are independent and Poi(vt ) distributed for t = 1, . . . , T .

Version March 14, 2017, M.V. Wthrich, ETH Zurich


50 Chapter 2. Collective Risk Modeling

0.070

0.070

0.065

0.065

observed frequencies

observed frequencies
0.060

0.060
0.055

0.055


0.050

0.050

w)
0.045

0.045

1982 1984 1986 1988 1990 1982 1984 1986 1988 1990

(m
Figure 2.3: Observed yearly claims frequencies Nt /vt from t = 1982 to 1991 com-
pared to the to the estimated overall frequency of 5.43%. (lhs): 1 standard devia-
tion confidence bounds Poisson case; (rhs): 1 standard deviation confidence bounds
negative-binomial case.

We are going to build a test statistics for the evaluation of this null hypothesis H0 .
tes
We define

XT
(Nt /vt )2
= (N) = .
t=1 /vt
It is not straightforward to determine the explicit distribution function of .
Therefore, we give an approximate answer to this request of hypothesis testing.
no

The aggregation and disjoint decomposition theorems (Theorems 2.12 and 2.14)
imply that Nt Poi(vt ) can be understood as a sum of vt i.i.d. random variables
Xi Poi(). That is,
vt
(d) X
Nt = Xi ,
NL

i=1

with E[X1 ] = and Var(X1 ) = . But then the CLT (1.2) applies with
Pvt
Nt /vt Nt vt (d) X vt
Ze t = q = = i=1
i N (0, 1) as vt .
/vt vt vt

This explains that Zet can be approximated in distribution by a standard Gaussian


random variable Zt N (0, 1) for vt sufficiently large.
i.i.d.
Next, if we assume that Z1 , . . . , ZT N (0, 1) then a standard result in statistics
says that Tt=1 Zt2 has a 2 -distribution with T degrees of freedom, denoted by 2T ,
P

see also Exercise 2 on page 22. Therefore, we obtain the asymptotic approximation
in distribution
T
(Nt /vt )2 X T T
(d) X 2
= (N) = Zet2 Zt 2T .
X
=
t=1 /vt t=1 t=1

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 2. Collective Risk Modeling 51

In the last step we need to replace the unknown parameter by its estimate b MLE .
T
By doing so, we lose one degree of freedom, thus, we get the test statistics b and
the corresponding distributional approximation
 2
T b MLE
Nt /vt T (d)
b = 2T 1 .
X
vt b MLE
(2.8)
t=1 T

We revisit the previous example. For the data in Table 2.1 we obtain b = 20 627.
The 99%-quantile of the 2 -distribution with T 1 = 9 degrees of freedom is given

w)
by 21.67. Since this is by far smaller than b we reject the null hypothesis H0 on
the significance level of q = 1%. This, of course, is not surprising in view of Figure
2.3 (lhs). 

(m
Exercise 5. Consider the data given in Table 2.2. Estimate the parameters for

t 1 2 3 4 5 6 7 8 9 10
Nt 1000 997 985 989 1056 1070 994 986 1093 1054
vt 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000
tes
Table 2.2: Observed claims counts Nt and corresponding volumes vt .

the Poisson and the negative-binomial models. Which model is preferred? Does
a 2 -goodness-of-fit test reject the null hypothesis on the 5% significance level of
no

having Poisson distributions? 

Exercise 6. An insurance company decides to offer a no-claims bonus to good car


drivers, namely,
NL

a 10% discount after 3 years of no claim, and

a 20% discount after 6 years of no claim.


How does the base premium need to be adjusted so that this no-claims bonus can
be financed? For simplicity we assume that all risks have been insured for at least
6 years. Answer the question in the following two situations:

(a) Homogeneous portfolio with i.i.d. risks having i.i.d. Poisson claim counts with
frequency parameter = 0.2.

(b) Heterogeneous portfolio with independent risks being characterized by a fre-


quency parameter having a gamma distribution with mean = 0.2 and
Vco() = 1. Conditionally, given , the individual years have i.i.d. Poisson
claim counts with frequency parameter .

Version March 14, 2017, M.V. Wthrich, ETH Zurich


52 Chapter 2. Collective Risk Modeling

w)
(m
tes
no
NL

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3

Individual Claim Size Modeling

w)
In Model Assumptions 2.1 we have introduced the compound distribution

(m
N
X
S = Y1 + Y2 + . . . + YN = Yi ,
i=1

with the three standard assumptions

1. N is a discrete random variable which takes values in A N0 ;


tes
i.i.d.
2. Y1 , Y2 , . . . G with G(0) = 0;

3. N and (Y1 , Y2 , . . .) are independent.

In Chapter 2 we have discussed the modeling of the claims count distribution of


no

N . In this chapter we concentrate on the modeling of the individual claim sizes Yi .


To get an understanding of the modeling of G we
present a data analysis based on two explicit data sets.
The first data set is a private property (PP) insurance
data set that consists of 72769 claims records. The
NL

second data set is a commercial property (CP) insur-


ance data set that consists of 18285 claims records.
Before presenting sophisticated mathematical modeling methods for G we analyze
these two data sets using tools from descriptive statistics.

3.1 Data analysis and descriptive statistics


The first observation is that the two data sets contain many claims records with
zero claims payments. That is, many of the recorded claims were settled without
any payments. In the case of PP insurance these were about 16% of the reported
claims and in the case of CP insurance we observe about 21% of zero claims. Zero
claims are due to reasons such as: the final claim does not exceed the deductible,
the insurance company is not liable for the claim, another insurance policy covers

53
54 Chapter 3. Individual Claim Size Modeling

the claim, reporting a (small) claim reduces the no-claims-benefit too much so that
the insured decides to withdraw the claim, etc.
We can treat zero claims in two different ways: (i) estimate the proportion of
zero claims separately and add this probability weight to G at 0; (ii) we simply
reduce the expected claims frequency by these zero claims. The first way (i) is
mathematically consistent, but contradicts our model assumption G(0) = 0; the
second way (ii) perfectly fits into the compound Poisson modeling framework due
to the disjoint decomposition Theorem 2.14 (also the binomial and the negative-
binomial case can be handled, see Examples 3 and 4). In general, the second

w)
version (ii) is the simpler one to deal with (however, one may lose information by
dropping zero claims). Here, we assume that G(0) = 0 and E[N ] = v, where v > 0
is the portfolio size and N only counts strictly positive claims. Henceforth, after
subtracting these zero claims, we have n = 610 053 strictly positive claims records
in PP insurance and n = 140 532 in CP insurance denoted by Y1 , . . . , Yn .

(m
We start with the scatter plots of the data, see Figures 3.1 and 3.2. We plot the
individual claim sizes (ordered by arrival date) both on the original scale (lhs) and
on the log scale (rhs). These scatter plots do not offer much information because
tes
no
NL

Figure 3.1: Scatter plot of the n = 610 053 strictly positive claims records of PP
insurance ordered by arrival date: original scale (lhs) and log scale (rhs).

they are overloaded, at least they do not show any obvious trends (and therefore
suggest stationarity of the data). We calculate the sample means and the sample
variances of the observations, see also (2.5),

n n
1 X 1 X
b n = Yi and bn2 = (Yi b n )2 ,
n i=1 n 1 i=1

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 55

w)
(m
Figure 3.2: Scatter plot of the n = 140 532 strictly positive claims records of CP
insurance ordered by arrival date: original scale (lhs) and log scale (rhs).

For our data sets we obtain empirical moments

PP : b n = 30 116, bn = 70 534, Vco


d = 2.42;
n (3.1)
tes
0 0
CP : b n = 6 850, bn = 28 505, Vco
d = 4.16.
n (3.2)

Next we give the histogram for PP insurance, see Figure 3.3 (lhs). We see that

histogram claim sizes PP insurance histogram logged claim sizes PP insurance


no
60000

12000
50000

10000
40000

8000
30000
count

count

6000
NL
20000

4000
10000

2000
0

0 50000 100000 150000 200000 250000 4 6 8 10 12

claim sizes logged claim sizes

Figure 3.3: Histogram of the n = 610 053 strictly positive claims records of PP
insurance: original scale (lhs) and log scale (rhs).

a few large claims distort the whole picture so that the histogram is not helpful.
We could plot a second one only considering small claims. In Figure 3.3 (rhs) we
plot the histogram for logged claim sizes. In Figure 3.4 we give the corresponding

Version March 14, 2017, M.V. Wthrich, ETH Zurich


56 Chapter 3. Individual Claim Size Modeling

w)
(m
Figure 3.4: Box plots of claims records of PP and CP insurance: original scale (lhs)
and log scale (rhs).

box plots. They show positive skewness. The ultimate goal is to model the full
distribution functions G(y) = P[Y y] for the two portfolios PP and CP. Having so
many observations we could directly work with the empirical distribution function
tes
(at least for small claims, see Section 3.4.1) which is given by
n
b (y) = 1 X
Gn 1{Y y} . (3.3)
n i=1 i
The empirical distribution function of logged claim sizes is given in Figure 3.5
no

(lhs). For a sequence of observations Y1 , . . . , Yn we denote the ordered sample by


Y(1) Y(2) . . . Y(n) . For the next definitions we assume that Y G has finite
mean. We define the loss size index function and its empirical counterpart by
Ry Pbnc
0 z dG(z) Y(i)
I(G(y)) = R and Ib
n () = Pi=1
n ,
0 z dG(z) i=1 Yi
NL

for [0, 1]. The loss size index function I(G(y)) is also called exposure curve (in
re-insurance). It chooses a claim size threshold y and then evaluates the relative
expected claim that is explained by claim sizes below this threshold y. The resulting
empirical graphs are presented in Figure 3.5 (rhs). Rather typically in non-life
insurance we see that the 20% largest claims roughly cause 75% of the total claim
size! This explains that large claims heavily influence the total claim amount.
We have already seen in the previous figures that large claims may lead to several
modeling challenges. Two plots that especially focus on large claims are the mean
excess plot and the log-log plot. We define the mean excess function and empirical
mean excess function by
Pn
i=1 (Yi u)1{Yi >u}
e(u) = E [Yi u|Yi > u] and ebn (u) = Pn .
i=1 1{Yi >u}

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 57

w)
(m
Figure 3.5: Empirical distribution functions G b of PP and CP insurance on log
n
scale (lhs) and corresponding empirical loss size index functions Ibn (rhs).

The (empirical) mean excess plot is obtained by


tes
u 7 e(u) and u 7 ebn (u),

and the (empirical) log-log plot by


 
y 7 (log y, log(1 G(y))) and y 7 log y, log(1 G
b (y)) .
n
no
NL

Figure 3.6: Empirical log-log plot (lhs) and empirical mean excess plot (rhs) of PP
and CP insurance data.

In Figure 3.6 we present the empirical log-log and mean excess plots of the two

Version March 14, 2017, M.V. Wthrich, ETH Zurich


58 Chapter 3. Individual Claim Size Modeling

data sets. Linear decrease in the log-log plot and linear increase in the mean excess
plot will have the interpretation of heavy tailed distributions in the sense that the
survival function G = 1 G is regularly varying at infinity, see (3.4) below.

3.2 Selected parametric claims size distributions


In this section we introduce popular parametric claim size distributions. We only
consider distribution functions G with unbounded support in R+ . We use the
following notation for a random variable Y G:

w)
g density of Y for G being absolutely continuous,

(m
MY (r) moment generating function of Y in r R, where it exists,
Y expected value of Y , if it exists,
Y2 variance of Y , if it exists,
Vco(Y ) coefficient of variation of Y , if it exists,
Y skewness of Y , if it exists,
tes
G = 1 G survival function of Y , i.e. G(y) = P[Y > y].

For analyzing G the following quantities are of interest (assuming Y < ):


no

E[Y 1{u1 <Y u2 } ] expected value of Y within layer (u1 , u2 ],


I(G(y)) = E[Y 1{Y y} ]/Y loss size index function for level y,
e(u) = E[Y u|Y > u] mean excess function of Y above u.
NL

If G depends on a parameter and we have i.i.d. observations Yi G, then we


can estimate this parameter from the data. The method of moments estimator
is denoted by bMM and the MLE by bMLE , see also Section 2.3. Note that if
one estimates this parameter one should also try to assess the precision of this
parameter estimate (parameter uncertainty).

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 59

For the analysis of the tail of the distribution function we


consider the property of regular variation at infinity. There-
fore, we assume that G has infinite support at +. Then,
we say that the survival function G = 1G is regularly vary-
ing at infinity with (tail) index > 0, we write G R , if
for all t > 0
G(xt) 1 G(xt)
lim = lim = t . (3.4)
x G(x) x 1 G(x)

If the above holds true for = 0 then we say G is slowly

w)
varying at infinity and we write G R0 ; if the above holds
true for = then we say G is rapidly varying at infinity and we write G R .
From an insurance point of view, distribution functions G with G R for some
[0, ) are dangerous because they have a large potential for big claims, see

(m
Chapter 3 in Embrechts et al. [39]. Therefore, it is crucial to know this index of
regular variation at infinity, see also Remarks 5.17.

3.2.1 Gamma distribution


Some people refer the gamma distribution to Karl Pearson (1857-1936), how-
tes
ever, it seems that already Laplace [69] has used it. We have introduced the gamma
distribution in Section 2.2.4 for the definition of the negative-binomial distribution
and we will also see that this distribution is very useful in the context of generalized
linear models and Bayesian modeling, see Chapters 7, 8 and 9 below.
no

The gamma distribution has two parameters, a shape parameter > 0 and a scale
parameter c > 0. We write Y (, c). The distribution function of Y has positive
support R+ with density for y 0 given by

c 1
NL

g(y) = y exp {cy} .


()

There is no closed form solution for the distribution function G. For y 0 it can
only be expressed as
Z y
c 1 cx 1 Z cy 1 z
G(y) = x e dx = z e dz = G(, cy),
0 () () 0

where G(, ) is the incomplete gamma function. From this we see that the family
of gamma distributions is closed towards multiplication with a positive constant,
that is, for > 0 we have
Y (, c/). (3.5)

Version March 14, 2017, M.V. Wthrich, ETH Zurich


60 Chapter 3. Individual Claim Size Modeling

This property is important when we deal with claims inflation and it explains why
c is called scale parameter. For the moment generating function and the moments
we have


c

MY (r) = for r < c,
cr

Y = , Y2 = 2 ,
c c
1/2
Vco(Y ) = , Y = 2 1/2 > 0.

w)
For 0 u1 < u2 and u, y > 0 we obtain

E[Y 1{u1 <Y u2 } ] = [G( + 1, cu2 ) G( + 1, cu1 )] ,
c

(m
I(G(y)) = G( + 1, cy),
!
1 G( + 1, cu)
e(u) = u.
c 1 G(, cu)
tes
no
NL

Figure 3.7: Gamma distribution with mean Y = 1 and shape parameter


{1/2, 1, 3/2, 2}. lhs: density g(y); rhs: log-log plot.

Exercise 7. Assume Y (, c).


Prove the statements of the moment generating function MY and the loss
size index function I(G(y)). Hint: use the trick of the proof of Proposition
2.20.

Prove the statements


1 I(G(u))
e(u) = Y u, E[Y 1{u1 <Y u2 } ] = Y (I(G(u2 )) I(G(u1 ))) .
1 G(u)

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 61

The gamma distribution does not have a regularly varying tail at infinity, see
Table 3.4.4 in Embrechts et al. [39]. In fact, G(y) = 1 G(y) decays roughly as
exp{cy} to 0 as y , because exp{cy} gives an asymptotic lower bound and
exp{(c )y} an asymptotic upper bound for any > 0 on G(y). Note that the
gamma distribution is also not subexponential due to (5.10), below.
For generating gamma random numbers in R the following code is used (n stands
for the number of random numbers to be generated)

w)
> rgamma(n, shape=, rate=c)

The method of moments estimators (based on the first two empirical moments) are

(m
given by
b n b 2
cbMM = 2 and b MM = n2 .
bn bn
For the MLE we have log-likelihood function, set Y = (Y1 , . . . , Yn )0 ,
n
X
`Y (, c) = log c log () + ( 1) log Yi cYi .
tes
i=1

The MLE b MLE of is the solution of


n
0 () 1 X
log log b n + log Yi = 0. (3.6)
no

() n i=1

This is solved numerically, and the MLE for c is then given by

b MLE
cbMLE = .
b n
NL

For the numerical solution in R one can use the command

> fitdistr(data, gamma)

The numerical fitting does not always work when the range of observations Y is too
large. In such cases it is recommended that in the first step the data is scaled by a
constant factor > 0, this can be done due to (3.5); next parameters are estimated
for scaled data; and in the last step the estimated scale parameter is scaled back by
the same factor. An alternative way is to explicitly program the function given in
(3.6) and then apply the root search command uniroot(). The term 0 ()/()
is calculated with digamma(), see also Section 3.9.5 in Kaas et al. [64].

Version March 14, 2017, M.V. Wthrich, ETH Zurich


62 Chapter 3. Individual Claim Size Modeling

Remark 3.1 (exponential and 2 -distributions). The special case = 1 is referred


to the exponential distribution with parameter c > 0, and denoted by expo(c).
The special case = k/2 and c = 1/2 is the 2 -distribution with k N degrees of
freedom, see Exercise 2 on page 22.

Example 3.2 (gamma distribution for PP data). We fit the PP insurance data
displayed in Figure 3.1 to the gamma distribution.

w)
(m
tes
Figure 3.8: Gamma distribution with MM and MLE fits applied to the PP insur-
no

ance data. lhs: QQ plot; rhs: loss size index function.


NL

Figure 3.9: Gamma distribution with MM and MLE fits applied to the PP insur-
ance data. lhs: log-log plot; rhs: mean excess plot.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 63

From Figures 3.8 and 3.9 we immediately conclude that the gamma model does not
fit to the PP data. The reason is that the data is more heavy tailed. This can be
seen in the QQ plot in Figure 3.8 (lhs): the data at the right end of the distribution
lie substantially above the line. The MM estimators manage to model the data up
to some layer, the MLE estimators, however, are heavily distorted by the small
claims which can be seen in the mean excess plot in Figure 3.9 (rhs). In fact, we
have too many small claims (observations below 1500) to be explained by a gamma
distribution. The MLE is heavily based on these small observations, in Figure 3.8
(rhs) and Figure 3.9 (lhs) we see that MLE fits well for small claims, whereas MM

w)
provides more appropriate results in the upper range of the data. Summarizing, we
should choose more heavy tailed distribution functions to model this data and the
resulting figures are already sufficient for rejecting the gamma model. This first
data example also indicates that probably there is not one distribution that fits all
claims layers simultaneously. We come back to this in Section 3.4.

(m


Remark 3.3 (inverse Gaussian distribution1 ). A distribution function which is also


found quite often in the actuarial literature is the inverse Gaussian distribution,
see for instance Section 3.9.6 in Kaas et al. [64]. Its density is for y 0 given by
( ) ( )
3/2 ( cy)2 3/2 2 1
g(y) = y exp = y exp + cy ,
tes
2c 2cy 2c 2cy 2
where > 0 is a shape parameter and c > 0 a scale parameter. Observe that this
density behaves similar as the gamma density for y . For the distribution
function we have a closed form solution in the following (weak) sense
no

! !

G(y) = + cy + e2 cy ,
cy cy

where () is the standard Gaussian distribution. This can be checked by calcu-


lating the derivative of the latter. For the moment generating function and the
moments we have
NL

n h io
MY (r) = exp 1 (1 2r/c)1/2 for r c/2,

Y = , Y2 = 2 ,
c c
1/2
Vco(Y ) = , Y = 31/2 > 0.

b MM and cbMM . The MLE is given by


From this we calculate the MM estimators
" n
! n
! #1
1X 1X b MLE


b MLE
= Yi1 Yi 1 and cbMLE = .
n i=1 n i=1 b n

The inverse Gaussian distribution leads to an improvement of the fit compared to


the gamma distribution. Overall it is also not convincing, especially in the tails (it
1
This is for further reading.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


64 Chapter 3. Individual Claim Size Modeling

has the same asymptotic behavior as the gamma distribution). Since the inverse
Gaussian distribution is less handy than the ones that will be presented below we
refrain from further discussing this distribution function. 

3.2.2 Weibull distribution

The Weibull distribution2 has his name from Ernst Hjalmar


Waloddi Weibull (1887-1979), however it was first identi-
fied by Maurice Frchet (1878-1973) in 1927, but Weibull

w)
was probably the first one who has described the distribution
function in detail (in 1951).

The Weibull distribution has two parameters, a shape parameter

(m
> 0 and a scale parameter c > 0. We write Y Weibull(, c).
E.H.W. Weibull
The distribution function of Y has positive support R+ with
density for y 0 given by

g(y) = (c ) (cy) 1 exp {(cy) } .


tes
We are especially interested in (0, 1) because this provides a slower decay of
the survival function compared to the gamma distribution. For y 0 we have
no

G(y) = 1 exp {(cy) } .

This still does not have a regularly varying tail at infinity but the decay of the
survival function G is slower than in the gamma case for < 1, see also Table 3.4.4
NL

in Embrechts et al. [39]. In fact, the survival function G(y) = 1 G(y) decays as
exp{(cy) } to 0 for y . Note that the Weibull distribution is subexponential
for (0, 1), see Example 1.4.7 in Embrechts et al. [39]. We will come back to
subexponentiality in Section 5.4.
The family of Weibull distributions is closed towards multiplication with positive
constants, that is, for > 0 we have

Y Weibull(, c/).

The moment generating function and the moments are given by

2
This section is recommended for further reading.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 65

MY (r) does not exist for < 1 and r > 0,


(1 + 1/ )
Y = ,
c
(1 + 2/ )
Y2 = 2
2Y ,
"c #
1 (1 + 3/ ) 2 3
Y = 3Y Y Y .
Y3 c3

w)
For 0 u1 < u2 and u, y > 0 we obtain
(1 + 1/ )
E[Y 1{u1 <Y u2 } ] = [G(1 + 1/, (cu2 ) ) G(1 + 1/, (cu1 ) )] ,
c
I(G(y)) = G(1 + 1/, (cy) ),

(m
(1 + 1/ ) 1 G(1 + 1/, (cu) )
!
e(u) = u.
c exp{(cu) }
tes
no
NL

Figure 3.10: Weibull distribution with mean Y = 1 and shape parameter


{1/4, 1/3, 1/2, 1}. lhs: density g(y); rhs: log-log plot.

(d)
For generating Weibull random numbers observe that we have the identity Y =
1 1/ (d)
c
Z with Z expo(1) = (1, 1). The R code for the (1, 1) distribution is

> rgamma(n, shape=1, rate=1)

The method of moments estimators are given by


(1 + 1/bMM ) bn2 (1 + 2/bMM )
cbMM = and + 1 = .
b n b 2n (1 + 1/bMM )2

Version March 14, 2017, M.V. Wthrich, ETH Zurich


66 Chapter 3. Individual Claim Size Modeling

The latter needs to be solved numerically in R:

> f <- function(x,a){lgamma(1+2/x)-2*lgamma(1+1/x)-log(a+1)}


> tau <- uniroot(f, c(0.001,1), tol=0.001, a=var(data)/mean(data)2)

For the MLE we need to solve the system of equations (numerically)


n
!1/ n
1X 1X
c= Y and log(cYi ) ((cYi ) 1) = 1.
n i=1 i n i=1

w)
Example 3.4 (Weibull distribution for PP data). We fit the PP insurance data
displayed in Figure 3.1. From Figures 3.11 and 3.12 we see that the Weibull model

(m
tes
no

Figure 3.11: Weibull distribution with MM and MLE fits applied to the PP insur-
ance data. lhs: QQ plot; rhs: loss size index function.

gives a better fit to the PP data compared to the gamma model. The reason is
NL

that it allows for more probability mass in the upper tail of the distribution, the
estimate for is in the interval (0.5, 0.75). The MM estimators manage to model
the data up to some layer. The MLE estimators, however, are still distorted by
the big mass of small claims which can be seen in the mean excess plot in Figure
3.12 (rhs). Summarizing, we should choose even more heavy tailed distributions to
model this data, and we should carefully treat (and probably separate) small and
large claims. 

3.2.3 Log-normal distribution


Making the tail of the distribution function more heavy tailed than the Weibull
distribution tails leads us to the log-normal distribution. The log-normal distri-
bution has two parameters, a mean parameter R and a standard deviation

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 67

w)
(m
Figure 3.12: Weibull distribution with MM and MLE fits applied to the PP insur-
ance data. lhs: log-log plot; rhs: mean excess plot.

parameter > 0. We write Y LN(, 2 ). The log-normal distribution has the


property that log Y N (, 2 ). Therefore, almost every crucial feature can be
tes
obtained from normal distributions. The distribution of Y has positive support R+
with density for y 0 given by

( )
1 1 1 (log y )2
g(y) = exp .
no

2 y 2 2

For y 0 we have distribution function


NL

!
log y
G(y) = ,

with () denoting the standard Gaussian distribution function. The family of


log-normal distributions is closed towards multiplication with a positive constant,
that is, for > 0 we have

Y LN( + log , 2 ).

We have the following

Version March 14, 2017, M.V. Wthrich, ETH Zurich


68 Chapter 3. Individual Claim Size Modeling

MY (r) does not exist for r > 0,


n o
Y = exp + 2 /2 ,
n o 
Y2 = exp 2 + 2 exp{ 2 } 1 ,
 1/2
Vco(Y ) = exp{ 2 } 1 ,
  1/2
Y = exp{ 2 } + 2 exp{ 2 } 1 > 0.

w)
For 0 u1 < u2 and u, y > 0 we obtain
" ! !#
log u2 ( + 2 ) log u1 ( + 2 )
E[Y 1{u1 <Y u2 } ] = Y ,

(m
!
2
log y ( + )
I(G(y)) = ,

log u(+ 2 )
 
1
e(u) = Y   u.
log u
1
tes
no
NL

Figure 3.13: Log-normal distribution with mean Y = 1 and standard deviation


parameters {0.5, 1, 1.25, 1.5}. lhs: density g(y); rhs: log-log plot.

The log-normal distribution does not have a regularly varying survival function
at infinity, see Table 3.4.4 in Embrechts et al. [39]. Note that the log-normal
distribution is subexponential, see Example 1.4.7 in Embrechts et al. [39]. We will
come back to subexponentiality in Section 5.4.
For generating log-normal random numbers we simply choose standard Gaussian
random numbers Z and then set Y = exp{ + Z}.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 69

The method of moments estimators are given by


" !#1/2
MM bn2
b = log +1 and b MM = log b n (b MM )2 /2.
b 2n
The MLE is given by
n n 
1X 1X 2
b MLE = log Yi and (b MLE )2 = log Yi b MLE .
n i=1 n i=1

w)
Example 3.5 (log-normal distribution for PP data). We fit the PP insurance data
displayed in Figure 3.1. In Figures 3.14 and 3.15 we present the results. We observe

(m
tes
no

Figure 3.14: Log-normal distribution with MM and MLE fits applied to the PP
insurance data. lhs: QQ plot; rhs: loss size index function.

that the log-normal distribution gives quite a good fit. We give some comments on
NL

the plots: The MM estimator looks convincing because the observations match the
lines quite well in the QQ plot. The only things that slightly disturb the picture
are the three largest observations, see QQ plot. It seems that they are less heavy
tailed then the log-normal distribution suggests. This is also the reason why the
empirical mean excess plot deviates from the log-normal distribution, see Figure
3.15 (rhs). A little puzzling is the bad performance of the MLE. The reason is again
that more than 50% of the claims are less than 1500. The MLE therefore is very
much based on these small observations and provides a good fit in that range of
observations but it gives a bad fit for larger claims. We conclude from this that the
PP data set should be modeled with different distributions in different layers. The
reason for this heterogeneity is that PP insurance contracts have different modules
such as theft, water damage, fire, etc. and it is recommended (if data allows) to
model each of these modules separately. This may also explain the abnormalities in

Version March 14, 2017, M.V. Wthrich, ETH Zurich


70 Chapter 3. Individual Claim Size Modeling

w)
(m
Figure 3.15: Log-normal distribution with MM and MLE fits applied to the PP
insurance data. lhs: log-log plot; rhs: mean excess plot. Note that the small
hump in the empirical distribution is at CHF 3000 which is probably induced by
a maximal cover for a particular risk factor.

the log-log plot because these different modules, in general, have different maximal
tes
covers. 

3.2.4 Log-gamma distribution


no

The log-gamma distribution3 is more heavy tailed than the log-normal distribution
and is obtained by assuming that log Y (, c) for positive parameters and c.
The density for y 1 is given by
NL

c
g(y) = (log y)1 y (c+1) ,
()

and the distribution function can be written as

G(y) = G(, c log y).

For the moment generating function and the moments we have

3
This section is recommended for further reading

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 71

MY (r) does not exist for r > 0,



c

Y = for c > 1,
c 1

c
Y2 = 2Y for c > 2,
c2

1 c
 
2 3
Y = 3Y Y Y for c > 3.
Y3 c3

w)
For c > 1, 1 u1 < u2 and u, y > 1 we obtain
c  
E[Y 1{u1 <Y u2 } ] = [G(, (c 1) log u2 ) G(, (c 1) log u1 )] ,
c1
I(G(y)) = G(, (c 1) log y),

(m
 !
c 1 G(, (c 1) log u)

e(u) = u.
c1 1 G(, c log u)
tes
no
NL

Figure 3.16: Log-gamma distribution with mean Y = 2 and parameter c


{2, 3, 4, 8}. lhs: density g(y); rhs: log-log plot.

The log-gamma distribution has a regularly varying survival function at infinity


with tail index c > 0, see Table 3.4.2 in Embrechts et al. [39].
The method of moments estimators are given by
log b n log(bn2 + b 2n ) log cbMM log(cbMM 2)
b MM = MM
and = ,
c
log bcMM
b log b n log cbMM log(cbMM 1)
1

where the latter is solved numerically using, e.g., the R command uniroot().
The MLE is obtained analogously to the MLE for gamma observations by simply
replacing Yi by log Yi .

Version March 14, 2017, M.V. Wthrich, ETH Zurich


72 Chapter 3. Individual Claim Size Modeling

w)
(m
Figure 3.17: Log-gamma distribution with MM and MLE fits applied to the PP
insurance data. lhs: QQ plot; rhs: loss size index function.
tes
no
NL

Figure 3.18: Log-gamma distribution with MM and MLE fits applied to the PP
insurance data. lhs: log-log plot; rhs: mean excess plot.

Example 3.6 (log-gamma distribution for PP data). We fit the PP insurance


data displayed in Figure 3.1. From Figures 3.17 and 3.18 we conclude that the
log-gamma model provides the best fit to the data from the models considered so
far. As already commented on in the log-normal example, we see that probably
the only thing that does not entirely fit to the log-gamma distribution are the 3 or
4 largest claims which are less heavy tailed than the log-gamma distribution would
suggest. The tail index of regular variation is about cb = 5.8 in this example. 

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 73

3.2.5 Pareto distribution

We have seen that large claims often need a special treatment.


Therefore, large claims are often modeled separately with ei-
ther a Pareto or a generalized Pareto distribution. Here, we
concentrate on the Pareto distribution. The Pareto distribu-
tion is named after Vilfredo Federico Damaso Pareto
(1848-1923) who initially used this distribution to describe the
allocation of wealth. Sometimes the Pareto distribution is also

w)
called power law distribution because of the power law decay of
its survival function. V.F.D. Pareto

The Pareto distribution specifies a (large claims) threshold > 0 and then only

(m
models claims above this threshold, see also Example 2.16. The claims above this
threshold are assumed to have regularly varying tails with tail index > 0. For
Y Pareto(, ), the density for y is given by

 (+1)
y
g(y) = ,

tes
and distribution function can be written as (y )

 
y
no

G(y) = 1 .

We have closedness towards multiplication with a positive constant, that is, for
> 0 we have
NL

Y Pareto(, ). (3.7)

For the moment generating function and the moments we have

MY (r) does not exist for r > 0,



Y = for > 1,
1

Y2 = 2 for > 2,
( 1)2 ( 2)
2(1 + ) 2 1/2
 
Y = for > 3.
3

Version March 14, 2017, M.V. Wthrich, ETH Zurich


74 Chapter 3. Individual Claim Size Modeling

For > 0, u1 < u2 and u, y > we obtain


" #
u1 +1 u2 +1
  
E[Y 1{u1 <Y u2 } ] = for 6= 1,
1
 +1
y
I(G(y)) = 1 for > 1,

1
e(u) = u for > 1,
1
and for = 1 we have E[Y 1{u1 <Y u2 } ] = log(u2 /u1 ).

w )
(m
tes
Figure 3.19: Comparison of Pareto, log-gamma, log-normal, Weibull and gamma
no

distributions all having mean Y = 2 and variance Y2 = 20. lhs: densities g(y);
rhs: log-log plot.

As soon as we only study tails of distributions we should use MLEs for parameter
estimation (the method of moments is not sufficiently robust against outliers).
Since the threshold has a natural meaning we only need to estimate . The MLE
NL

is given by !1
n
MLE 1X

b = log Yi log .
n i=1
i.i.d.
Lemma 3.7. Assume Y1 , . . . , Yn Pareto(, ). We have
h i n   n2
b MLE =
E and b MLE =
Var 2 .
n1 (n 1) (n 2)
2

(d)
Proof. Choose Z expo() = (1, ). Then, eZ = Y Pareto(, ) (this can be seen
by a change of variables in the corresponding densities). This immediately implies that Zi =
i.i.d.
log Yi log expo(). The sum of these i.i.d. exponential random variables is gamma
distributed with parameters = n and c = . Using the scaling property (3.5) we conclude that

MLE )1 (n, n) .
(b

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 75

This implies for k < n



(n)n n1 nz (n k)
Z
z k
 MLE k 

E (b ) = z e dz = (n)k . (3.8)
0 (n) (n)
From this the claim follows. 2

For the MLE of it was assumed that the threshold


is given in a natural way. If this threshold needs to be
detected from data, the Hill plot can be of help. For the
Hill method we refer to McNeil et al. [77], Section 7.2.4.

w)
We order the claims accordingly to Y(1) Y(2) . . .
Y(n) . The Hill plot explores the stability of the MLEs
when successively dropping the smallest observations.
Therefore we define for k < n the Hill estimator by

(m
n
!1
H 1 X

b k,n = log Y(i) log Y(k) .
n k + 1 i=k
The Hill estimator is based on the rationale that the Pareto distribution is closed
towards increasing thresholds, i.e. for Y Pareto(0 , ) and 1 > 0 we have for
all y 1
y
 
tes

y

0
P [ Y > y| Y 1 ] =   = .
1 1
0
Therefore, if the data comes from a Pareto distribution we should observe stability
H
in
b k,n for changing k. The confidence bounds of the Hill estimators are determined
no

by Lemma 3.7.
Example 3.8 (Pareto for extremes of PP insurance). We start the analysis with
the PP insurance data.
To perform this large claims analysis we choose only the largest
H
claims of Figure 3.1. The Hill plot k 7 b k,n is given in Figure
NL

3.20 (together with confidence bounds of 1 standard deviation,


estimated by Lemma 3.7). We observe a fairly stable picture
in k around value = 2.5 up to the largest 100 claims. For
larger claims the Hill estimator disappears to 4 or 5 which
(once more) explains that the tail of the largest observations
is not really heavy tailed. This is similar to the log-normal
and the log-gamma fit. Sidney Ira Resnick [86] has called S.I. Resnick
this phenomenon Hill horror plot and it stems from the difficulty that the Hill
estimator cannot correctly adjust non-Pareto like tails. The right-hand side of
Figure 3.20 gives the log-log plot for = 2.5, in accordance to the Hill plot we
see that the slope of the data is slightly less than this value for smaller claims,
but the data becomes less heavy tailed further out in the tails. This becomes also
obvious from the mean excess plot and the QQ plot in Figure 3.21. 

Version March 14, 2017, M.V. Wthrich, ETH Zurich


76 Chapter 3. Individual Claim Size Modeling

w)
(m
H
Figure 3.20: PP insurance data; lhs: Hill plot k 7 b k,n with confidence bounds of
1 standard deviation; rhs: log-log plot for = 2.5.
tes
no
NL

Figure 3.21: PP insurance data largest claims only; lhs: QQ plot; rhs: mean excess
plot for = 2.5.

Example 3.9 (Pareto for extremes of CP insurance). In a second analysis we


examine the extremes of the CP claims data of Figure 3.2. The results are presented
in Figure 3.22. At the first sight they look similar to the PP insurance example,
i.e. they begin to destabilize between the 150 and 100 largest claims. However, the
main difference is that the tail index is much smaller in the CP example. That is,
there is a higher potential for large claims for this line of business.


Example 3.10 (nuclear power accident example). We revisit the nuclear power
accident data set studied in Hofert-Wthrich [60], see also Sovacool [94].

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 77

w)
(m
H
Figure 3.22: CP insurance data; lhs: Hill plot k 7 b k,n with a confidence interval
of 1 standard deviation; rhs: log-log plot for = 1.4.

In Figure 3.23 we plot all nuclear power accidents that


have occurred until the end of 2011 and which have a
claim size larger than 20 mio. USD (as of 2010). These
tes
events include Three Mile Island (United States, 1979),
Chernobyl (Ukraine, 1986) and Fukushima (Japan, 2011).
Fukushima 2011
In Figure 3.24 we provide the Hill plot. We observe that
no

this data is very heavy tailed. The Hill plot suggests to set the tail index around

scatter plot logged claim sizes nuclear power accidents empirical distribution
1.0
24

nuclear power accidents










23



0.8



NL





22





claim sizes (log scale)


empirical distribution



0.6


21










20



0.4







19







0.2




18











17


0.0

0 10 20 30 40 50 60 17 18 19 20 21 22 23 24

logged claim sizes

Figure 3.23: 61 largest nuclear power accidents until 2011; lhs: logged claim sizes
(in chronological order, the last entry is Fukushima); rhs: empirical distribution
function of claim sizes.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


78 Chapter 3. Individual Claim Size Modeling

Hill plot of nuclear power accidents loglog plot (with alpha = 0.64 for the 61 largest observations)

0









1.2









log (1distribution function)



1.0




Pareto parameter

2





0.8

3











0.6

w)
Pareto distribution
0.4

observations

61 51 41 31 21 11 17 18 19 20 21 22 23 24

number of observations log (claim size)

(m
H
Figure 3.24: 61 largest nuclear power accidents until 2011; lhs: Hill plot k 7
b k,n
with confidence bounds of 1 standard deviation; rhs: log-log plot for = 0.64.

0.64, which means that we have an infinite mean model. The log-log plot in Figure
3.24 shows that this tail index choice captures the slope quite well. 
tes
Exercise 8. Natural hazards in Switzerland are covered by the so-called Schweiz-
erische Elementarschaden-Pool (ES-Pool). This is a pool of private Swiss insurance
companies which organizes the diversification of natural hazards in Switzerland.
no

For pricing of these natural hazards one distinguishes be-


tween small events and large events, the latter having a
total claim amount exceeding CHF 50 millions per event.
The following 15 storm and flood events have been ob-
served in years 1986 2005 (these are the events with a
NL

total claim amount exceeding CHF 50 millions). Storm Lothar 26.12.1999

date amount in CHF mio. date amount in CHF mio.


20.06.1986 52.8 18.05.1994 78.5
18.08.1986 135.2 18.02.1999 75.3
18.07.1987 55.9 12.05.1999 178.3
23.08.1987 138.6 26.12.1999 182.8
26.02.1990 122.9 04.07.2000 54.4
21.08.1992 55.8 13.10.2000 365.3
24.09.1993 368.2 20.08.2005 1051.1
08.10.1993 83.8

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 79

Fit a Pareto distribution with parameters = 50 and > 0 to the observed


claim sizes. Estimate parameter using the unbiased version of the MLE.

We introduce a maximal claims cover of M = 2 billions CHF per event,


i.e. the individual claims are given by Yi M = min{Yi , M }, see also Section
3.4.2. For the yearly claim amount of storm and flood events we assume
a compound Poisson distribution with Pareto claim sizes Yi . What is the
expected total yearly claim amount?

What is the probability that we observe a storm and flood event next year

w)
which exceeds the level of M = 2 billions CHF?

(m
3.3 Model selection
In the previous section we have presented different distributions for claim size
modeling and we have debated which one fits best to the observed data. The ar-
gumentation was completely based on graphical tools like log-log plots. Graphical
tes
tools are important, but in statistics there are also methodological tools that con-
sider these questions from a more analytical point of view. Two commonly used
tests are the Kolmogorov-Smirnov (KS) test and the Anderson-Darling (AD) test.
These two tests are discussed in Sections 3.3.1 and 3.3.2.
In Section 3.3.3 we give the 2 -goodness-of-fit test and we discuss the Akaike
no

information criterion (AIC) as well as the Bayesian information criterion (BIC).

3.3.1 Kolmogorov-Smirnov test

Andrey Nikolaevich Kolmogorov (1903-1987) was


NL

the world leading probabilist. In 1933 he gave the mod-


ern axiomatic foundations of probability theory, his book
was called Grundbegriffe der Wahrscheinlichkeitsrech-
nung and has appeared in German, see [66]. Unfortu-
nately, there is less known on Nikolai Vasilyevich
Smirnov (1900-1966).
A.N. Kolmogorov

Version March 14, 2017, M.V. Wthrich, ETH Zurich


80 Chapter 3. Individual Claim Size Modeling

The KS test is a non-parametric test investigating whether


a particular continuous distribution function G0 fits to a
given sample Y1 , . . . , Yn . Therefore, one compares the em-
pirical distribution function G b of the sample and the distri-
n
bution function G0 . The argument is based on the Glivenko-
Cantelli theorem which says that the empirical distribution
function of an i.i.d. sample converges uniformly to the true
underlying distribution function, P-a.s., if the number n of
i.i.d. observations goes to infinity (this result does not re-

w)
quire continuity of the distribution function), see Theorem
20.6 in Billingsley [13].
Assume we have an i.i.d. sequence Y1 , Y2 , . . . from an unknown con-

(m
tinuous distribution function G and we denote the corresponding
empirical distribution function of finite sample size n by G b , see
n
also (3.3). We would like to test whether these samples Y1 , Y2 , . . .
may stem from G0 .
Consider the null hypothesis H0 : G = G0 against the two-sided
N.V. Smirnov
alternative hypothesis that these distribution functions differ. We
tes
define the KS test statistics by

b G
Dn = Dn (Y1 , . . . , Yn ) = G
b (y) G (y) .
= sup G

n 0 n 0
y

The KS test statistics has the property, see (13.4) in Billingsley [12],
no


nDn Kolmogorov distribution K as n .

The Kolmogorov distribution K is for y R+ given by



NL

n o
(1)j+1 exp 2j 2 y 2 .
X
K(y) = 1 2
j=1

The null hypothesis H0 is rejected on the significance level q (0, 1) if

Dn > n1/2 K (1 q),

where K (1 q) denotes the (1 q)-quantile of the Kolmogorov distribution K.

q 20% 10% 5% 2% 1%

K (1 q) 1.07 1.22 1.36 1.52 1.63

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 81

w)
(m
Figure 3.24: KS test statistics for method of moments and MLE fits applied to the
PP insurance data; lhs: log-normal distribution; rhs: log-gamma distribution.

Example 3.11 (KS test, PP insurance data). We apply the KS test to the log-
normal and the log-gamma fits of the PP insurance data, see Examples 3.5 and 3.6.
In the log-normal case we obtain for the MLE fit Dn = 0.05 and for the methods of
tes
moment fit Dn = 0.12. These values are far too large compared to the large sample
size of n = 610 053 and the KS test clearly rejects the null hypothesis of having a
log-normal distribution on the 1% significance level. If we look at Figure 3.24 (lhs)
we see that these big values of the KS test statistics are driven by small claims,
no

i.e. we obtain a bad fit for small claims, the tails however do not look too badly.
The log-gamma fit looks better than the log-normal fit, see Figure 3.24 (note that
the y-axes have different scales in the two plots). It provides KS test statistics
Dn = 0.04 for the MLE fit and Dn = 0.06 for the method of moments fit. These
values are still far too large to not reject H0 on the 1% significance level.
Conclusion. The claim size modeling should be split into different claim size layers.
NL

Example 3.12 (KS test, tail distribution). In this example we investigate the tail
fits of the Pareto distributions in the CP and the PP examples for the n = 505
largest claims, see Examples 3.8 and 3.9. The results are presented in Figure 3.25.
For the PP insurance data we obtain Dn = 0.027 (for = 2.5) and for the CP
insurance data we receive Dn = 0.061 (for = 1.4). The first value is sufficiently
small so that the null hypothesis cannot be rejected on the 5% significance level,
the CP insurance value reflects just about the critical value on the 5% significance
level, i.e. the resulting p-value is just about 5%. The plot of the point-wise terms of
Dn looks fine for the PP insurance data, however, the graph for the CP insurance
data looks a bit one-sided, suggesting two different regimes (this can also seen from
Figure 3.22). 

Version March 14, 2017, M.V. Wthrich, ETH Zurich


82 Chapter 3. Individual Claim Size Modeling

w)
(m
Figure 3.25: Point-wise terms of KS test statistics for MLE fits applied to the 505
largest claims; lhs: PP insurance data; rhs: CP insurance data.

3.3.2 Anderson-Darling test


The advantage of the (non-parametric) KS test is that it can be applied to any
tes
situation of continuous distribution functions. The drawback of this large general-
ity, of course, is that it is often not very powerful and, especially, not very good in
detecting particular properties such as tail behavior.
The two statisticians Theodore Wilbur Anderson and Don-
ald Allan Darling have developed a modification of the KS
no

test, the so-called AD test, which gives more weight to the tail
of the distributions. It is therefore more sensitive in detecting
tail fits, but on the other hand it has the disadvantage of not
being non-parametric, and critical values need to be calculated
for every chosen distribution function.
NL

The KS test statistics is modified by the introduction of a weight


T.W. Anderson
function : [0, 1] R+ which then modifies the KS test statis-
tics Dn as follows
q
sup Gn (y) G0 (y) (G0 (y)).
b
y

Different choices of allow to weight different regions of the sup-


port of the distribution function differently, the KS test statistics
is obtained by 1. The choice proposed by Anderson and Dar-
ling is (t) = (t(1 t))1 in order to investigate the tails of the
distributions. D.A. Darling
In contrast to the maximal difference between the empirical distribution function
Gb and the null hypothesis G we could also consider a weighted L2 -distance. This
n 0

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 83

leads to the Anderson-Darling modification of the Cramr-von Mises test. The AD


test statistics for (t) = (t(1 t))1 is obtained from
 2
Z b (y) G (y)
Gn 0
A2n = n dG0 (y).
R G0 (y)(1 G0 (y))

Anderson-Darling have explicitly identified the asymptotic behavior of An as n


by determining the limiting characteristic function. We do not further elaborate
on this but refer to the literature in statistics.

w)
3.3.3 Goodness-of-fit and information criteria
There are many other criteria that can be applied for testing fits
and distributional choices. Many of them are based on asymp-

(m
totic normality. For instance, a 2 -goodness-of-fit test splits the
support of the null hypothesis distribution function G0 into K
disjoint intervals Ik = [ck , ck+1 ), k = 1, . . . , K. Then, data is
grouped according to these intervals, i.e. Ok counts the number
of observed realizations Y1 , . . . , Yn in interval Ik and Ek denotes
the expected number of observations in Ik according to the dis- K. Pearson
tes
tribution function G0 . The test statistics of n observations is
then defined by
2
XK
(Ok Ek )2
Xn,K = . (3.9)
k=1 Ek
no

2
If d parameters were estimated in G0 , then Xn,K is compared to a 2 -distribution
with K 1 d degrees of freedom, see also Exercise 2 on page 22. Often it is
suggested that we should have Ek > 4 for reasonable results. However, these
rules-of-thumbs are not very reliable.
This 2 -goodness-of-fit test is sometimes also called Pearsons -square test, named
NL

after Karl Pearson (1857-1936) who has investigated this test in 1900.
Within the framework of MLE methods the Hirotugu
Akaike (1927-2009) information criterion (AIC) and the
Bayesian information criterion (BIC) are often used, we
refer to Akaike [2] and Section 2.2 in Congdon [28]. These
criteria are used to compare different distribution func-
tions and densities. Assume we want to compare two
different densities g1 and g2 that where fitted to (i.i.d.) H. Akaike
data Y = (Y1 , . . . , Yn )0 . The AIC is defined by

(i)
AIC(i) = 2`Y + 2d(i) ,

Version March 14, 2017, M.V. Wthrich, ETH Zurich


84 Chapter 3. Individual Claim Size Modeling

(i)
where `Y is the log-likelihood function of density gi for data Y and d(i) denotes
the number of estimated parameters in gi , for i = 1, 2. For MLE we maximize
(i)
`Y and in order to evaluate the AIC we penalize the model for having too many
parameters. The AIC then says that the model with the smallest AIC value should
be preferred.
The BIC uses a different penalty term for the number of parameters (all these
penalty terms are motivated by asymptotic results). It reads as

(i)
BIC(i) = 2`Y + log(n) d(i) ,

w)
and the model with the smallest BIC value should be preferred.

(m
tes
no

Figure 3.26: Akaikes original hand notes on the AIC (lhs) at the Institute of
Statistical Mathematics in Tokyo, Japan (rhs).

Exercise 9 (AIC and BIC). Assume we have i.i.d. claim sizes Y = (Y1 , . . . , Yn )0
NL

with n = 1000 which were generated by a gamma distribution, see Figure 3.27.
The sample mean and sample standard deviation are given by
b n = 0.1039 and bn = 0.1050.
If we fit the parameters of the gamma distribution we obtain the method of mo-
ments estimators and the MLEs
b MM = 0.9794 and cbMM = 9.4249,
b MLE = 1.0013 and cbMLE = 9.6360.
This provides the fitted distributions displayed in Figure 3.28. The fits look perfect
and the corresponding log-likelihoods are given by
`Y (b MM , cbMM ) = 1264.013 and `Y (b MLE , cbMLE ) = 1264.171.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 85

w)
(m
Figure 3.27: I.i.d. claim sizes Y = (Y1 , . . . , Yn )0 with n = 1000; lhs: observed data;
rhs: empirical distribution function.
tes
no
NL

Figure 3.28: Fitted gamma distributions; lhs: log-log plot; rhs: QQ plot.

(a) Why is `Y (b MLE , cbMLE ) > `Y (b MM , cbMM ) and which fit should be preferred
according to AIC?

(b) The estimates of are very close to 1 and we could also use an exponential
distribution function. For the exponential distribution function we obtain
MLE cbMLE = 9.6231 and `Y (cbMLE ) = 1264.169. Which model (gamma or
exponential) should be preferred according to the AIC and the BIC?

Version March 14, 2017, M.V. Wthrich, ETH Zurich


86 Chapter 3. Individual Claim Size Modeling

3.4 Calculating within layers for claim sizes


In the previous sections we have experienced that it is difficult to fit one parametric
distribution function to the entire range of possible outcomes of the claim sizes.
Therefore, we often consider claim sizes in different layers. Another reason why
different layers of claim sizes are of interest is that re-insurance can often be bought
for different claims layers. For these reasons we would like to understand how claim
sizes behave in different layers. First we discuss the modeling issue and second we
describe modeling of re-insurance layers.

w)
3.4.1 Claim size modeling using layers
We come back to the issue that the KS test rejects the most popular parametric
i.i.d.
fits, see Example 3.11. We assume that Y1 , Y2 , . . . G and we would like to split

(m
G into different layers. The simplest case is to choose two layers, see Example 2.16,
that is, choose a large claims threshold M > 0 such that G(M ) (0, 1), i.e. G(M )
is bounded away from zero and one. We then define the partition

{Y1 M } and {Y1 > M } .


tes
Assume that S CompPoi(v, G). We consider the total claim Ssc in the small
claims layer and the total claim Slc in the large claims layer given by

N
X N
X
Ssc = Yi 1{Yi M } and Slc = Yi 1{Yi >M } .
no

i=1 i=1

Theorem 2.14 implies that Ssc and Slc are independent and compound Poisson
distributed with

Ssc CompPoi (sc v = G(M )v , Gsc (y) = P [Y1 y|Y1 M ]) ,


NL

and

Slc CompPoi (lc v = (1 G(M ))v , Glc (y) = P [Y1 y|Y1 > M ]) .

Thus, we can model large claims and small claims separately (independently).
Observe that we have the following decomposition

G(y) = P [ Y1 y| Y1 M ] G(M ) + P [ Y1 y| Y1 > M ] (1 G(M ))


= Gsc (y)G(M ) + Glc (y)(1 G(M )).

Often a successful modeling approach involves 3 steps:

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 87

1. Choose threshold M > 0 sufficiently large so that many of the observations


fall into the lower layer (0, M ]. In this lower layer one either fits a parametric
distribution function to the data or one directly works with the empirical
distribution function (due to the Glivenko-Cantelli theorem). If a distribu-
tion function is fitted one should ensure that this distribution function has
compact support (0, M ], for instance, by choosing a truncated gamma distri-
bution.

2. Estimate probability G(M ) of the event {Y1 M } which is typically large.

w)
3. Fit a Pareto distribution to Glc for threshold = M , i.e. estimate the tail
index > 0 from the observations exceeding this threshold M .

(m
Example 3.13. We revisit the PP and the CP insurance data set. We choose
tes
no
NL

Figure 3.29: Empirical fit in small claims layer and Pareto distribution fit in large
claims layer, the gray lines show the large claims threshold; lhs: PP insurance data;
rhs: CP insurance data.

large claims threshold M = 500 000 in both cases. In the PP insurance data set we
have 237 observations above this threshold, which provides estimate 1 G(M
b )=
0
237/61 053 = 0.39%. For the CP insurance example we have 272 claims above
this threshold, which provides estimate 1 G(M
b ) = 1.87%. Next we calculate
the sample mean and the sample coefficient of variation in the small claims layer
{Yi M }:

PP : b {Yi M } = 20 805, Vco


d
{Yi M } = 1.80,

CP : b {Yi M } = 40 377, Vco


d
{Yi M } = 1.51.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


88 Chapter 3. Individual Claim Size Modeling

These should be compared to (3.1)-(3.2). We observe a substantial reduction of


the sample coefficient of variation in the small claims layer compared to the entire
range of possible outcomes. This is not surprising because large claims drive the
coefficient of variation. For CP insurance we also see that the sample mean in the
lower claims layer is substantially reduced. This is due to the fact that 1.87% of
claims exceed the threshold M = 500 000 and these claims may get very large and
drive the mean, see also loss size index function in Figure 3.5.
Finally, we fit distribution function G to the data. We choose the empirical dis-
tribution functions below the threshold M and Pareto distributions for the tail fit

w)
in the large claims layer, having tail parameters as estimated in Examples 3.8
and 3.9 (this is also supported by the KS tests, see Example 3.12). The results are
presented in Figure 3.29. For PP insurance they look convincing, whereas the CP
insurance fit is not entirely satisfactory in the large claims layer (which might ask
for a bigger large claims threshold M and a slightly bigger tail parameter ). 

3.4.2 Re-insurance layers and deductibles

(m
Above we have calculated expected values in claims layers E[Y 1{u1 <Y u2 } ] for var-
ious parametric distribution functions. This is of interest for several reasons. This
we are going to discuss next.
tes
(i) The first reason is that insurance contracts often have deductibles. On the one
hand small claims often cause too much administrative costs, and on the other
hand deductibles are also an instrument to prevent from fraud (moral hazard). For
no

instance, it can become quite expensive for an insurance company if every insured
claims that his umbrella got stolen. Therefore, a deductible d > 0 of, say, 200
CHF is introduced and the insurance company only covers the claim (Y d)+ that
exceeds this deductible d. In this case the pure risk premium for claim Y G is
given by
NL

Z
E [(Y d)+ ] = (y d) dG(y) = E[Y 1{Y >d} ] d P[Y > d] (3.10)
d
= P[Y > d] (E[Y |Y > d] d) = P[Y > d] e(d),

under the assumption that P[Y > d] > 0 and that the mean excess function e() of
Y exists.
Remark. Fitting a distribution function to claims data (Y d)+ needs some
care. If the original claims Y G (absolutely continuous with density g), then the
density after deductible is for y d given by

g(y)
gd (y) = . (3.11)
1 G(d)

Thus, MLE of parameters becomes more involved.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 89

(ii) The second reason is that the insurance company may have a maximal insurance
cover per claim, i.e. it covers claims only up to a maximal size of M > 0 and the
exceedances need to be paid by the insured; or, similarly, it may cover claims
exceeding M but has a re-insurance cover for these exceedances. In that case the
insurance company covers (Y M ) and the pure risk premium for this (bounded)
claim is given by

Z M
E [Y M ] = y dG(y) + M P[Y > M ] = E[Y 1{Y M } ] + M P[Y > M ]

)
0
 
= E[Y ] E[Y 1{Y >M } ] M P[Y > M ]

w
= E[Y ] P[Y > M ] e(M ) = E[Y ] E [(Y M )+ ] .

(m
If we combine deductibles with maximal covers we obtain excess-of-loss (XL) (re-)
insurance treaties. Assume we have a deductible u1 > 0 (in re-insurance terminol-
ogy this also called priority or retention). Then the insurance treaty u2 xs u1
covers the claims layer (u1 , u1 + u2 ], that is, this contract covers a maximal excess
of u2 above the priority u1 . The pure risk premium for such contracts is then given
by
tes
E[((Y u1 )+ ) u2 ] = E[(Y u1 )+ ] E[(Y u1 u2 )+ ].

An issue, when dealing with layers, is claims inflation. Assume we sell insurance
no

contracts with a deductible d > 0 and we ask for a pure risk premium E [(Y d)+ ].
Since cash flows have time values this premium has to be revised carefully for later
periods as the following theorem shows.

Theorem 3.14 (leverage effect of claims inflation). Choose a fixed deductible d >
NL

0 and assume that the claim at time 0 is given by Y0 . Assume that there is a
deterministic inflation index i > 0 such that the claim at time 1 can be represented
(d)
by Y1 = (1 + i)Y0 . We have

E[(Y1 d)+ ] (1 + i) E[(Y0 d)+ ].

Proof. We calculate the pure risk premium


Z Z
E[(Y1 d)+ ] = P[(Y1 d)+ > y] dy = P[Y1 > y + d] dy
Z0 Z 0 
x
= P[Y1 > x] dx = P Y0 > dx
d d 1+i
Z
= (1 + i) P [Y0 > y] dy,
d
1+i

Version March 14, 2017, M.V. Wthrich, ETH Zurich


90 Chapter 3. Individual Claim Size Modeling

where we have twice applied a change of variables. The latter is calculated as follows
!
Z d Z
E[(Y1 d)+ ] = (1 + i) P [Y0 > y] dy + P[Y0 > y] dy
d
1+i d
Z d
= (1 + i) P [Y0 > y] dy + (1 + i) E[(Y0 d)+ ].
d
1+i

This proves the claim. 2

Example 3.15 (leverage effect of claims inflation). Assume that Y0 Pareto(, )

w)
with > 1 and choose a deductible d > . In that case we have, see (3.10),
!
d 1
E [(Y0 d)+ ] = d.
1

(m
Choose inflation index i > 0 such that (1 + i) < d. From (3.7) we obtain
(d)
Y1 = (1 + i)Y0 Pareto((1 + i), ).

This provides for > 1 and i > 0


tes
!
d 1
E [(Y1 d)+ ] = d
(1 + i) 1
!
d 1
= (1 + i) d > (1 + i) E [(Y0 d)+ ] .
1
no

Observe that we obtain a strict inequality, i.e. the pure risk premium grows faster
than the claim sizes itself. The reason for this faster growth is that claims Y0 d
may entitle for claims payments after claims inflation adjustments, i.e. not only the
claim sizes are growing under inflation but also the number of claims is growing if
one does not adapt the deductible to inflation. 
NL

Exercise 10. In Figure 3.30 we display the distribution function of loss Y G and
the distribution function of the loss after applying different re-insurance covers to
Y . Can you explicitly determine the re-insurance covers from the graphs in Figure
3.30. 

Exercise 11. Assume claims sizes Yi in a given line of business can be described
by a log-normal distribution with mean E[Yi ] = 30 000 and Vco(Yi ) = 4.

Up to now the insurance company was not offering contracts with deductibles. Now
it wants to offer the following three deductible versions d = 200, 500, 10 000. Answer
the following questions:

1. How does the claims frequency change by the introduction of deductibles?

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 3. Individual Claim Size Modeling 91

w)
(m
Figure 3.30: Distribution functions implied by re-insurance contracts.

2. How does the expected claim size change by the introduction of deductibles?

3. By which amount changes the expected total claim amount?


tes
Note that this consideration is in line with the compulsory social health insurance
in Switzerland where customers have the choice between different deductibles. 
no
NL

Version March 14, 2017, M.V. Wthrich, ETH Zurich


92 Chapter 3. Individual Claim Size Modeling

w)
(m
tes
no
NL

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 4

Approximations for Compound

w)
Distributions

(m
In Chapter 2 we have introduced claims count distributions for the modeling of the
number of claims N within a fixed time period. In Chapter 3 we have met several
claim size distribution functions G for the modeling of the claim sizes Y1 , Y2 , . . ..
Ultimately, we always need to calculate the compound distribution of S, see Defini-
tion 2.1. As explained in Proposition 2.2, we can easily calculate the moments and
tes
the moment generating function of this compound distribution. But the distribu-
tion function of S given in (2.1) is a notoriously difficult object because it involves
(too) many convolutions of the claim size distribution function G. The aim here is
to explain how we can circumvent this difficulty.
no

The most commonly used approach in insurance practice to


overcome this problem is to apply Monte Carlo simulations
and then consider the resulting empirical distribution function
as a sufficiently good approximation to the true distribution
function. This approach is based on the Glivenko-Cantelli the-
orem, see Billingsley [13], Chapter 20. Though this is a feasible
NL

way we do not recommend it. The issue is that the assessment


of sufficiently good is often unclear, i.e. the rates of conver-
gence of the Monte Carlo samples may be very poor which
results in a lot of simulations. This is especially true for heavy tailed distribution
functions of regularly varying type (3.4). Therefore, we recommend approxima-
tions like the Panjer algorithm and fast Fourier transforms (FFT) which are often
more efficient.

4.1 Approximations
In many situations approximations to S are used. These may be justified by the
central limit theorem (CLT) if the expected number of claims is large. Compound

93
94 Chapter 4. Approximations for Compound Distributions

distributions may have two different risk drivers in the tail of the distribution func-
tion, namely, the number of claims N may contribute to large values of S or single
large claims in Y1 , . . . , YN may drive extreme values in S. Let us concentrate on
the compound Poisson model, in particular, we would like to use the decomposi-
tion theorem in the spirit of Example 2.16. In this case, mostly the claim sizes Yi
contribute to the tail of the distribution (if these are heavy tailed). Therefore, we
emphasize that in the light of the compound Poisson model one should separate
small from large claims resulting in the independent decomposition S = Ssc + Slc .
Next, if the expected number of small claims vsc is large, Ssc can be approximated
by a parametric distribution function and Slc should be modeled explicitly. This

)
we are going to describe in detail in the remainder of this chapter.

w
4.1.1 Normal approximation

(m
The normal approximation is motivated by the CLT which goes
back to de Moivre (1733) and Laplace (1812), see (1.2). It
was then Aleksandr Mikhailovich Lyapunov (1857-1918)
who stated it in the general version and who discovered the
importance of the CLT.
tes
The classical CLT holds for a fixed number of claims. In our
approach the number of claims is not fixed, therefore we need
A.M. Lyapunov
a refinement of the CLT. We do this for a Poissonian number
of claims N by keeping the expected claims frequency fixed and by sending the
no

volume v .

Theorem 4.1. Assume S CompPoi(v, G) with G having a finite second mo-


ment. We have
S vE[Y1 ]
NL

q N (0, 1) as v .
vE[Y12 ]

Observe that we consider a special class of distribution functions G having finite


second moment. As long as we work in the set-up of Ssc this is not a restriction
because claim sizes are bounded by the large claims threshold M and therefore
have finite variance.

Proof. Observe that it is sufficient to consider v N because intermediate volumes v allow for
approximations bvc and dve (and the approximation error is asymptotically negligible). Thus, we
choose v N. Disjoint decomposition Theorem 2.14 then provides
N N
v X v
X (d) X (`)
X
S= Yi = Yi = S` ,
i=1 `=1 i=1 `=1

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 4. Approximations for Compound Distributions 95

i.i.d.
where S` CompPoi(, G). The first two moments of these compound Poisson distributions
are given by E[S1 ] = E[Y1 ] and Var(S1 ) = E[Y12 ]. Therefore, the assumptions of the CLT are
fulfilled and the claim follows from (1.2). 2

Theorem 4.1 is the motivation for the following approximation of the distribution
function of S

S vE[Y1 ] x vE[Y1 ] x vE[Y1 ]
P [S x] = P q q q , (4.1)
vE[Y12 ] vE[Y12 ] vE[Y12 ]

w)
where denotes the standard Gaussian distribution function. This approximation
works well when v is large and if the claim sizes Yi do not have heavy tailed
distribution functions G. Otherwise it under-estimates the true potential of large

(m
outcomes of S (because Theorem 4.1 provides a good approximation solely around
the mean of S, in particular, because the Gaussian distribution has zero skewness).
For rates of convergences we refer to the literature, see for instance Embrechts et
al. [39].
Note that the normal approximation (4.1) also allows for negative claims S, which
under our model assumptions is excluded, thus, it is really an approximation that
tes
needs to be considered carefully.
Example 4.2 (Normal approximation for PP insurance). We revisit the PP insur-
ance data of Example 3.13. We consider 3 different examples:
no

(a) Only small claims: in this example we only consider the claim size distribution
function G(y) = P [Y y|Y M ], i.e. the claims are compactly supported
in (0, M ]. As explicit claim size distribution function we choose the empirical
distribution of Example 3.13, see Figure 3.29 (lhs), with M = 500 000. We
choose portfolio size v such that v = 100.
NL

(b) Claim size distribution function G is chosen as in (a), but this time we choose
portfolio size v such that v = 1000.

(c) In addition to (b) we add the large claims layer modeled by a Pareto distri-
bution with = M = 500 000 and = 2.5 and for the expected number of
large claims we set lc v = 3.9.
For simplicity the true distribution function is evaluated by Monte Carlo simu-
lation, which contradicts our statement above, but is appropriate for sufficiently
large samples (and sufficient patience). We choose 100000 simulations, this will
be further illustrated in Example 4.11 below.
In Figure 4.1 we present the results of the normal approximation (4.1) in case (a).
We observe an appropriately good fit around the mean but the normal approxima-
tion clearly under-estimates the tails of the true distribution function, see log-log

Version March 14, 2017, M.V. Wthrich, ETH Zurich


96 Chapter 4. Approximations for Compound Distributions

w)
(m
Figure 4.1: Compound Poisson distribution of S and normal approximation (4.1)
in case (a), i.e. no large claims, expected number of claims 100; lhs: distribution
function; rhs: log-log plot.

plot in Figure 4.1 (rhs). Moreover, the true distribution function has positive skew-
ness S = 0.43 whereas the normal approximation has zeroqskewness. In the normal
tes
approximation we obtain probability mass (vE[Y1 ]/ vE[Y12 ]) = 6 107 for
a negative total claim amount (which is fairly small).
no
NL

Figure 4.2: Compound Poisson distribution of S and normal approximation (4.1)


in case (b), i.e. no large claims, expected number of claims 1000; lhs: distribution
function; rhs: log-log plot.

In Figure 4.2 we show situation (b) which is the same as situation (a) the only
difference is that we enlarge the portfolio size by a factor 10. We see better ap-
proximation properties due to the fact that we have convergence in distribution for

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 4. Approximations for Compound Distributions 97

portfolio size v . We observe a lower skewness S = 0.15 which improves the


normal approximation, also in the tails.

w)
(m
Figure 4.3: Compound Poisson distribution of S and normal approximation (4.1)
in case (c), i.e. with large claims, total expected number of claims 1003.9; lhs:
distribution function; rhs: log-log plot.
tes
Finally, in Figure 4.3 we also include large claims (in contrast to Figure 4.2) having
an expected number of large claims of 3.9 and a Pareto tail parameter of = 2.5.
We see that in this case the normal approximation is useless in the tail, which
strongly favors the large claims separation as suggested in Example 2.16. 
no

4.1.2 Translated gamma and log-normal approximations


In Example 4.2 we have seen that the normal approximation can be useful for large
portfolio sizes v and under the exclusion of large claims. For small portfolio sizes
the approximation may be bad because the true distribution often has substantial
NL

positive skewness. This leads to the idea of approximating the small claims layer
by other distribution functions that enjoy positive skewness.
We choose k R and define the (translated or shifted) random variables
X = k + Z, where Z (, c) or Z LN(, 2 ).
We have in the translated gamma case
E[X] = k + /c, Var(X) = /c2 and X = 2 1/2 > 0,
and in the translated log-normal case
E[X] = k + exp{ + 2 /2},
Var(X) = exp{2 + 2 }(exp{ 2 } 1),
2
X = (e + 2)(exp{ 2 } 1)1/2 > 0.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


98 Chapter 4. Approximations for Compound Distributions

The idea now is to do a fit of moments between S and X. Assume that S has finite
third moment and then we choose

X = k + Z, where Z (, c) or Z LN(, 2 ),
such that the three parameters of X fulfill

E[X] = E[S], Var(X) = Var(S) and X = S , (4.2)

and then this fitted random variable X is chosen as an approximation to S.

w)
Exercise 12. Assume that S has a compound Poisson distribution with expected
number of claims v > 0 and claim size distribution G having finite third moment.
1. Prove that the fit of moments approximation (4.2) for a translated gamma

(m
distribution for X provides the following system of equations
E[Y13 ]
v E[Y1 ] = k + /c, v E[Y12 ] = /c2 and = 2 1/2 .
(v)1/2 E[Y12 ]3/2

2. Solve this system of equations for k R, > 0 and c > 0 and prove that it
tes
has a well-defined solution for G(0) = 0.

3. Why should this approximation not be applied to case (c) of Example 4.2?

no

Example 4.3 (Translated gamma and log-normal approximations). We revisit


cases (a) and (b) of Example 4.2, that is, we only consider the small claims layer
and we would like to approximate the compound Poisson distribution in this small
claims layer by translated gamma and log-normal distributions.
NL

The approximations for expected number of claims v = 100, i.e. case (a), are
presented in Figure 4.4 and the ones for expected number of claims v = 1000,
i.e. case (b), in Figure 4.5. In both cases we see that the translated gamma and log-
normal approximations provide remarkably good fits. For this reason, the small
claims layer is often approximated by one of these two parametric distribution
functions.
Observe that for k > v we have a Chernoff type bound of (Stirlings formula
provides asymptotic behavior k! = O(exp{k log(k/e)}) as k )

P [N k] exp {k log k v + k log(ev)} .

This explains that the compound Poisson distribution with bounded claim sizes
Yi M is less heavy tailed compared to the translated gamma and log-normal
distributions.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 4. Approximations for Compound Distributions 99

w)
(m
Figure 4.4: Compound Poisson distribution of S and normal approximation (4.1),
translated gamma and log-normal approximations (4.2) in case (a), i.e. no large
claims, expected number of claims 100; lhs: distribution function; rhs: log-log plot.
tes
no
NL

Figure 4.5: Compound Poisson distribution of S and normal approximation (4.1),


translated gamma and log-normal approximations (4.2) in case (b), i.e. no large
claims, expected number of claims 1000; lhs: distribution function; rhs: log-log
plot.

The KS test rejects the null hypothesis on the 5% significance level for the normal
approximation in both cases (a) and (b), whereas this is not the case for the
translated gamma and log-normal approximations in both cases (a) and (b), the
p-values are clearly bigger than 5%; for the exact p-values we refer to Table 4.1,
below. In case (a) the translated gamma approximation is favored, in case (b)
the translated log-normal approximation (though the differences in the latter are

Version March 14, 2017, M.V. Wthrich, ETH Zurich


100 Chapter 4. Approximations for Compound Distributions

negligible). 

4.1.3 Edgeworth approximation


The Edgeworth approximation1 is named after Francis Ysidro
Edgeworth (1845-1926). The approximations presented in the
previous section were rather ad-hoc. We have just chosen a (sim-
ple) distribution function that enjoys skewness and then we have
done a fit of moments (with no further argument on the shape of

w)
the approximating distribution function). The Edgeworth ap-
proximation starts from the CLT and then tries to adjust higher
order terms in approximation (4.1) by the evaluation of moment F.Y. Edgeworth
generating functions in terms of Taylor expansions.
Assume S is compound Poisson distributed with claim size distribution G hav-

(m
ing a positive radius of convergence 0 > 0. As in Theorem 4.1 we consider the
normalized random variable
S vE[Y1 ]
Z= q .
vE[Y12 ]
tes
We have E[Z] = 0, Var(Z) = 1 and Z = S . The aim now is to approximate
the moment generating function of Z by comparable terms coming from normal
distributions and argue with Lemma 1.4. Therefore, we first consider the following
Taylor expansion around the origin, choose n 3,
no

n dk
k log MZ (r)|r=0 k
r + o(rn )
X
dr
log MZ (r) = as r 0.
k=0 k!
 k

d
We set ak = dr k log MZ (r)|r=0 /k! and note that we have a0 = log MZ (0) = 0,

a1 = E[Z] = 0 and a2 = Var(Z)/2! = 1/2. This provides approximation


NL

n n
( ) ( )
1 2 X 1 2
 
ak r k ak rk .
X
MZ (r) exp r + = exp r exp
2 k=3 2 k=3

Using a second Taylor expansion for ex = 1 + x + x2 /2! + . . . applied to the latter


exponential function in the last expression, the moment generating function of Z
is approximated by
P 2
n
2 /2
n
k=3 ak r k
MZ (r) er ak r k +
X
1 + + . . .
.

k=3 2!

Depending on the required precision as r 0 we can choose more terms in the


bracket (highlighted by + . . .) and we can take more terms in the summation
1
This section is recommended for further reading.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 4. Approximations for Compound Distributions 101

reflected by the upper index n in the summation. Thus, for appropriate constants
bk R we get the approximation (for small r)

r2 /2 1 + a3 r 3 bk r k .
X
MZ (r) e + (4.3)
k4

Lemma 4.4. Let denote the standard Gaussian distribution function and (k)
its k-th derivative. For k N0 and r R
Z
k r2 /2 k
r e = (1) erx (k+1) (x) dx.

w)
Proof. The proof goes by induction. Choose k = 0, then
Z Z
rx 0 1 2 2
e (x) dx = erx ex /2 dx = MX (r) = er /2 ,
2

(m
which is the moment generating function of X N (0, 1).
Induction step k k + 1. Using integration by parts we have
Z h i Z
k+1 rx (k+2) k+1 rx (k+1) k+1
(1) e (x) dx = (1) e (x) (1) rerx (k+1) (x) dx.

Note that the first term on the right-hand side is equal to zero because (k+1) (x) goes faster to
zero than erx may possibly converge to infinity. This and the induction assumption for k provides
tes
identity
Z Z
2
k+1 rx (k+2) k
(1) e (x) dx = r (1) erx (k+1) (x) dx = r rk er /2 ,

which proves the claim. 2


no

Lemma 4.4 allows us to rewrite approximation (4.3) for small r as follows, set
X N (0, 1),
h i Z Z
MZ (r) E erX a3 erx (4) (x) dx + bk (1)k erx (k+1) (x) dx
X
k4

NL

Z
erx 0 (x) a3 (4) (x) + bk (1)k (k+1) (x) dx.
X
=
k4

Assume that Z has distribution function denoted by FZ , then the latter suggests
approximation, see Lemmas 1.2, 1.3 and 1.4,

dFZ (z) 0 (z) a3 (4) (z) + bk (1)k (k+1) (z) dz.


X

k4
q
Integration provides Edgeworth approximation, set x = vE[Y12 ] z + vE[Y1 ],

def.
P [S x] = FZ (z) EW(z) = (z) a3 (3) (z) + bk (1)k (k) (z). (4.4)
X

k4

Version March 14, 2017, M.V. Wthrich, ETH Zurich


102 Chapter 4. Approximations for Compound Distributions

This formula provides the refinement of the normal approximation (4.1), namely we
correct the first order approximation by higher order terms involving skewness
and other higher order terms reflected by a3 and bk in (4.4). The Edgeworth
approximation (4.4) is elegant but its use requires some care as we are just going
to highlight.

We first consider the derivatives (k) for k 1. The first derivative is given by
1 2
0 (z) = ez /2 ,

w)
2
and the higher order derivatives for k 2 are given by
dk1 1 z2 /2 
k1 z 2 /2

(k) (z) = e = O z e for |z| .

(m
dz k1 2
From this we immediately see that

lim EW(z) = 0 and lim EW(z) = 1.


z
z

Attention. The issue with the Edgeworth approximation EW(z) is that it is not
tes
necessarily a distribution function because it does not need to be monotone in z,
see Example 4.5, below!
no

Example 4.5. To see the possible non-monotonicity of EW(z) we only take into
account skewness, i.e. a3 = Z Z3 /6 = S /6, and the approximation ez 1 + z in
(4.4). We have
0 1 2
(z) = ez /2 ,
2
NL

1 2
(2) (z) = z ez /2 ,
2
1 2 1 2
(3) (z) = ez /2 + z 2 ez /2 ,
2 2
1 2 1 2 1 2
(4) (z) = z ez /2 + 2z ez /2 z 3 ez /2 .
2 2 2
This implies
d  
EW(z) = 0 (z) a3 (4) (z) = 0 (z) 1 3a3 z + a3 z 3 . (4.5)
dz
Consider the function h(z) = 1 3a3 z + a3 z 3 for positive skewness S > 0. Then
we have
lim h(z) = and lim h(z) = ,
z
z

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 4. Approximations for Compound Distributions 103

which explains that the derivative of EW(z) has both signs and therefore EW(z)
is not monotone. However, in the upper tail of the distribution of S, that is, for
z sufficiently large, the Edgeworth approximation (4.5) is monotone and can be
used as an appropriate approximation. We emphasize that these monotonicity
properties should always be carefully checked in the Edgeworth approximation.

w)
(m
tes
Figure 4.6: Compound Poisson distribution of S and normal approximation (4.1),
translated gamma approximation (4.2) and Edgeworth approximation (4.4) in case
(a), i.e. no large claims, expected number of claims 100; lhs: distribution function;
rhs: log-log plot.
no

We revisit the numerical examples given in Examples 4.3. In Figure 4.6 we give
the approximation in case (a), i.e. expected number of claims equal 100, and in
Figure 4.7 we give the approximation in case (b), i.e. expected number of claims
equal 1000. In both cases we only choose the next additional moment, which is
the skewness and refers to term a3 , and we choose approximation ez 1 + z in
NL

(4.4). We see in both cases that the Edgeworth approximation clearly outperforms
the Gaussian approximation. However, the Edgeworth approximation is still light-
tailed which can be seen by comparing it to the translated gamma approximation.
In Figure 4.8 we compare the Edgeworth density (4.5) to the Gaussian density.
We clearly see the influence of the skewness parameter a3 and S > 0, respectively.
Moreover, we also see that the influence of the skewness parameter is decreasing
with a higher expected number of claims. Of course, this exactly reflects the CLT,
see Theorem 4.1.
If we calculate the minimal value of the Edgeworth density (4.5) we obtain in
case (a) the value 9.8 104 and in case (b) the value 4.1 105 . This exactly
explains that the Edgeworth density is not a proper probability density because
it violates the positivity property. However, this only occurs in the range of very
small claims and therefore it can be used as an approximation in the range of large

Version March 14, 2017, M.V. Wthrich, ETH Zurich


104 Chapter 4. Approximations for Compound Distributions

w)
(m
Figure 4.7: Compound Poisson distribution of S and normal approximation (4.1),
translated gamma approximation (4.2) and Edgeworth approximation (4.4) in case
(b), i.e. no large claims, expected number of claims 1000; lhs: distribution function;
rhs: log-log plot.
tes
no
NL

Figure 4.8: We compare the Edgeworth density (4.5) to the Gaussian density;
lhs: in case (a), i.e. expected number of claims 100; rhs: in case (b), i.e. expected
number of claims 1000.

claims.
Finally, in Table 4.1 we present the p-values resulting from the KS test of the
different approximations, see Section 3.3.1. In this particular case we see that the
translated gamma distribution is preferred in case (a), whereas in case (b) the
approximations are very similar. For this reason, one often chooses a translated
gamma distribution in practice (and also because it can easily be handled). Note

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 4. Approximations for Compound Distributions 105

case (a) case (b)


normal approximation 0% 0%
translated gamma approximation 51% 57%
translated log-normal approximation 8% 59%
Edgeworth approximation 13% 58%

Table 4.1: p-values of the KS test of Section 3.3.1.

that the Edgeworth approximation can be refined and improved by considering

w)
more terms in the Taylor expansion. This closes the example. 

We finally remark that there exist similar approximations as the Edgeworth ap-
proximation, for instance, the Gram-Charlier expansion, the Laguerre-gamma ex-

(m
pansion or the Jacobi-beta expansion. These expansions are quite popular in engi-
neering but they have similar weaknesses as the Edgeworth approximation and we
will not further discuss them.

4.2 Algorithms for compound distributions


tes
4.2.1 Panjer algorithm

The Panjer algorithm (also known as Panjers recursion) goes back


no

to Harry H. Panjer [83]. The Panjer algorithm assumes a spe-


cific property for the claims count distribution and then it applies
this property in a clever way to develop a recursive algorithm for
the calculation of the distribution function of S.

Throughout this section we assume that N is a claims count dis-


NL

H.H. Panjer
tribution that is supported in a possibly infinite interval A N0
containing 0. The corresponding probability weights are denoted by pk for k N0
and we set pk = 0 for k
/ A.

Definition 4.6 (Panjer distribution). N has a Panjer distribution if there exist


constants a, b R such that for all k N we have the recursion

pk = pk1 (a + b/k) .

Version March 14, 2017, M.V. Wthrich, ETH Zurich


106 Chapter 4. Approximations for Compound Distributions

Note that Panjer distributions require p0 > 0, otherwise the recur-


sion for k 1 will not provide a well-defined distribution function.

Bjrn Sundt and William S. Jewell (1932-2003) have char-


acterized the Panjer distributions. This is stated in the following
lemma. B. Sundt

Lemma 4.7 (Sundt-Jewell [95]). Assume N has a non-degenerate Panjer distri-


bution. N is either binomially, Poisson or negative-binomially distributed.

w)
Proof. In order for N to have a non-degenerate distribution function we
need to have |A| > 1. Thus, we may and will choose as initialization for
the recursion k = 1 A (A is an interval containing at least 0 and 1). The
Panjer distribution then provides for this k the identity p1 = p0 (a + b). To

(m
have a well-defined distribution function we need to have a + b 0, otherwise
p1 < 0. The case a + b = 0 provides a degenerate distribution function, thus
we even need to have a + b > 0.
Case (i). Assume a = 0. This implies b > 0 and

b W.S. Jewell
pk = pk1 >0 for all k N.
k
tes
This is exactly the Poisson distribution with parameters a = 0 and b = v > 0 for A = N0
because for the Poisson distribution we have, see Section 2.2.2, pk /pk1 = v/k.
Case (ii). Assume a < 0. To have positive probabilities we need to make sure that a + b/k
remains positive for all k A. This requires |A| < . We denote the maximal value in A
by v N (assuming it has pv > 0). The positivity constraint then provides b/v > a and
a + b/(v + 1) = 0. The latter implies that pk = 0 for all k > v and is equivalent to the requirement
no

v = (a + b)/a > 0. We set p = a/(1 a) (0, 1) which provides


     
b a(v + 1) p v+1
pk = pk1 a + = pk1 a = pk1 1 .
k k 1p k

For the binomial distribution we have on A, see Section 2.2.1,


NL

pk p vk+1 p p v+1
= = + .
pk1 1p k 1p 1p k

This is exactly the binomial distribution with parameters a = p/(1 p) and b = (v + 1)p/(1 p)
and A = {0, . . . , v}.
Case (iii). Assume a > 0. In this case we define = (a + b)/a > 0. This provides b = a( 1)
and    
b 1
pk = pk1 a + = pk1 a 1 + .
k k
Since the latter should be summable in order to obtain a well-defined distribution function we
need to have a < 1. For the negative-binomial distribution we have, see Proposition 2.20,

pk p(k + 1) p( 1)
= =p+ .
pk1 k k

This is exactly the negative-binomial distribution with parameters a = p and b = p( 1) and


A = N0 . This proves the lemma. 2


The previous lemma shows that the (important) claims count distributions that we have considered in Chapter 2 are Panjer distributions and the corresponding choices $a, b \in \mathbb{R}$ are provided in the proof of Lemma 4.7. We restate this in the next corollary.

Corollary 4.8. Assume $N$ has a non-degenerate Panjer distribution. For $a = -p/(1-p)$ and $b = (v+1)p/(1-p)$ we have the binomial distribution, for $a = 0$ and $b = \lambda v$ we have the Poisson distribution, and for $a = p$ and $b = p(\gamma - 1)$ we obtain the negative-binomial distribution with $p = v/(\gamma + v)$.

Theorem 4.9 (Panjer algorithm [83]). Assume $S$ has a compound distribution according to Model Assumptions 2.1 with $N$ having a Panjer distribution with parameters $a, b \in \mathbb{R}$ and the claim size distribution $G$ is discrete with support $\mathbb{N}$. Denote $g_m = P[Y_1 = m]$ for $m \in \mathbb{N}$. Then we have for $r \in \mathbb{N}_0$
\[ f_r \stackrel{\text{def.}}{=} P[S = r] = \begin{cases} p_0 & \text{for } r = 0, \\ \sum_{k=1}^{r} \left( a + b \frac{k}{r} \right) g_k f_{r-k} & \text{for } r \geq 1. \end{cases} \]

Proof of Theorem 4.9. We will prove a more general result in Theorem 4.9(B) below. □

Remarks.

• The Panjer algorithm requires a Panjer distribution for $N$ and strictly positive and discrete claim sizes $Y_i \in \mathbb{N}$, $P$-a.s. Then it provides an algorithm that allows us to calculate the compound distribution without doing the involved convolutions (2.1): Assume $N \sim \text{Poi}(\lambda v)$, henceforth, $a = 0$, $b = \lambda v$ and for $r \in \mathbb{N}$
\[ f_r = \sum_{k=1}^{r} \lambda v \, \frac{k}{r} \, g_k f_{r-k}. \tag{4.6} \]
Theorem 4.9 allows us to apply recursion (4.6) as follows:
\[ f_0 = p_0 = e^{-\lambda v}, \qquad f_1 = \lambda v g_1 f_0, \qquad f_2 = \tfrac{1}{2} \lambda v g_1 f_1 + \lambda v g_2 f_0, \qquad f_3 = \tfrac{1}{3} \lambda v g_1 f_2 + \tfrac{2}{3} \lambda v g_2 f_1 + \lambda v g_3 f_0, \qquad \ldots \]
Observe that $f_r$ only depends on $f_0, \ldots, f_{r-1}$; a minimal implementation is sketched below.
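The following is a minimal R sketch of recursion (4.6); the function name panjer.poisson, its arguments and the example values are illustrative and not part of the original text:

# A minimal sketch of the Panjer recursion (4.6) for the compound Poisson case
> panjer.poisson <- function(lambda.v, g, rmax) {
    f <- numeric(rmax + 1)                # f[r+1] stores f_r = P[S = r]
    f[1] <- exp(-lambda.v)                # f_0 = p_0 = exp(-lambda*v)
    for (r in 1:rmax) {
      k <- 1:min(r, length(g))            # g[k] stores the claim size weight g_k
      f[r + 1] <- lambda.v * sum((k / r) * g[k] * f[r - k + 1])
    }
    f
  }
> f <- panjer.poisson(lambda.v = 1, g = rep(1/10, 10), rmax = 100)
> sum(f)                                  # close to 1 (S > 100 has negligible mass)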


• In practical applications there might occur the situation where the initial value $f_0$ is nonsensical on IT systems. This has to do with the fact that IT systems can represent numbers only up to some numerical precision. Let us explain this using the compound Poisson distribution providing Panjer algorithm (4.6). If the expected number of claims $\lambda v$ is very large, then on IT systems the initial value $f_0 = p_0 = e^{-\lambda v}$ may be interpreted as zero and thus the algorithm cannot start due to missing numerical precision and a meaningful starting value. We call this numerical underflow.

In this case we can modify the Panjer algorithm as follows: choose any strictly positive starting value $\widetilde{f}_0 > 0$ and develop the iteration
\[ \widetilde{f}_1 = \lambda v g_1 \widetilde{f}_0, \qquad \widetilde{f}_2 = \tfrac{1}{2} \lambda v g_1 \widetilde{f}_1 + \lambda v g_2 \widetilde{f}_0, \qquad \widetilde{f}_3 = \tfrac{1}{3} \lambda v g_1 \widetilde{f}_2 + \tfrac{2}{3} \lambda v g_2 \widetilde{f}_1 + \lambda v g_3 \widetilde{f}_0, \qquad \ldots \]
Observe that this provides a multiplicative shift from $f_r$ to $\widetilde{f}_r$. The true probability weights are then found by
\[ f_r = \exp\left\{ \log \widetilde{f}_r + \log f_0 - \log \widetilde{f}_0 \right\}, \]
where we go over to the log-scale to avoid another multiplication with missing numerical precision.

This multiplicative shift may lead to a numerical overflow, which might require to shift the algorithm forward and backward several times to get sensible values. Important at the end is a final check to see whether
\[ \sum_{r=0}^{n} f_r \to 1 \qquad \text{as } n \to \infty, \]
in order to have total probability mass 1.
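In R this rescaling may be sketched as follows; the start value 1, the claim size weights and the renormalization threshold are illustrative assumptions (note that $\log f_0 = -\lambda v$ is available analytically even when $f_0$ itself underflows):

# Sketch: Panjer recursion with rescaling on the log-scale
> lambda.v <- 800                         # exp(-800) underflows to 0 in double precision
> g <- rep(1/10, 10); rmax <- 6000
> f.tilde <- numeric(rmax + 1)
> f.tilde[1] <- 1                         # arbitrary strictly positive start value
> log.shift <- -lambda.v                  # log f_0 - log f.tilde_0, known analytically
> for (r in 1:rmax) {
    k <- 1:min(r, length(g))
    f.tilde[r + 1] <- lambda.v * sum((k / r) * g[k] * f.tilde[r - k + 1])
    if (f.tilde[r + 1] > 1e300) {         # forward shift to avoid numerical overflow
      f.tilde <- f.tilde * 1e-300
      log.shift <- log.shift + 300 * log(10)
    }
  }
> f <- exp(log(f.tilde) + log.shift)      # back-transformation on the log-scale
> sum(f)                                  # final check: total probability mass close to 1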

• We need to have discrete claim sizes $Y_i \in \mathbb{N}$. Of course, this can be modified to any other span $d > 0$, i.e. $Y_i \in d\mathbb{N}$, because for $r \in \mathbb{N}$
\[ P[S = dr] = P\left[ \sum_{i=1}^{N} Y_i = dr \right] = P\left[ \sum_{i=1}^{N} Y_i / d = r \right] = P\left[ \sum_{i=1}^{N} \widetilde{Y}_i = r \right], \]
with $\widetilde{Y}_i = Y_i / d \in \mathbb{N}$.

• For non-discrete claim sizes $Y_i$ we need to discretize them in order to apply the Panjer algorithm. Choose span size $d > 0$ and consider for $k \in \mathbb{N}_0$
\[ G((k+1)d) - G(kd) = P\left[ kd < Y_1 \leq (k+1)d \right]. \]


These probabilities can now either be shifted to the left or to the right endpoint of the interval $[kd, (k+1)d]$. We define the two new discrete distribution functions for $k \in \mathbb{N}_0$
\[ g^+_{k+1} = P\left[ Y_1^+ = (k+1)d \right] = G((k+1)d) - G(kd), \tag{4.7} \]
and
\[ g^-_k = P\left[ Y_1^- = kd \right] = G((k+1)d) - G(kd). \tag{4.8} \]
This provides the following stochastic ordering (stochastic dominance)
\[ Y_1^- \leq_{sd} Y_1 \leq_{sd} Y_1^+, \]
where the latter means $P[Y_1^- > x] \leq P[Y_1 > x] \leq P[Y_1^+ > x]$. This implies
\[ S^- = \sum_{i=1}^{N} Y_i^- \; \leq_{sd} \; S = \sum_{i=1}^{N} Y_i \; \leq_{sd} \; S^+ = \sum_{i=1}^{N} Y_i^+, \]
for $Y_i^-$ being i.i.d. copies of $Y_1^-$ and $Y_i^+$ being i.i.d. copies of $Y_1^+$ (also independent of $N$). Thus, we get lower and upper bounds $S^- \leq_{sd} S \leq_{sd} S^+$ which become more narrow the smaller we choose the span $d$. In most applications, especially for small $d$, these bounds/approximations are sufficient compared to the other uncertainties involved in the prediction process (parameter estimation uncertainty, etc.); an R sketch of this discretization is given right after these remarks.

To $S^+$ we can directly apply the Panjer algorithm. $S^-$ is more subtle because it may happen that $g_0^- > 0$ and, thus, the Panjer algorithm cannot be applied in its classical form of Theorem 4.9. In the case of the compound Poisson distribution this problem is easily circumvented due to the disjoint decomposition theorem, Theorem 2.14, which says that
\[ S^- = \sum_{i=1}^{N} Y_i^- \stackrel{(d)}{=} \sum_{i=1}^{N} Y_i^- \, 1_{\{Y_i^- > 0\}} = \widetilde{S} \]
has again a compound Poisson distribution with expected number of claims $\lambda v (1 - g_0^-)$ and weights of the claim sizes $\widetilde{g}_k = g_k^- / (1 - g_0^-)$ for $k \in \mathbb{N}$. Finally, we apply the Panjer algorithm to the compound Poisson distributed random variable $\widetilde{S}$ to get the second bound. We prefer to give a more general version of the Panjer algorithm that also allows to treat the case $g_0 > 0$, see Theorem 4.9(B) below.

• There are more sophisticated discretization methods, but often our proposal (4.7)-(4.8) is sufficient. Moreover, it provides explicit upper and lower bounds which is an advantage if one tries to quantify the precision of the approximation.
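In R the discretization (4.7)-(4.8) may be sketched as follows; the Pareto parameter values anticipate Example 4.10 below, the variable names are illustrative:

# Sketch of discretization (4.7)-(4.8) for Pareto(theta, alpha) claim sizes
> theta <- 500000; alpha <- 2.5; d <- 10000; kmax <- 1000
> G <- function(y) pmax(0, 1 - (y / theta)^(-alpha))  # Pareto df, G(y) = 0 for y < theta
> k <- 0:(kmax - 1)
> g <- G((k + 1) * d) - G(k * d)          # mass of the interval (kd, (k+1)d]
# reading g[k+1] as P[Y1^+ = (k+1)d] yields the upper bound Y1^+, reading it
# as P[Y1^- = kd] yields the lower bound Y1^- (note that g[1] = g_0^- may be > 0)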


Theorem 4.9(B) (modified Panjer algorithm). Assume $S$ has a compound distribution according to Model Assumptions 2.1 with $N$ having a Panjer distribution with parameters $a, b \in \mathbb{R}$ and the claim size distribution $G$ is discrete with support $\mathbb{N}_0$ (we allow for $g_0 = P[Y_1 = 0] > 0$). Then we have for $r \in \mathbb{N}_0$
\[ f_r = P[S = r] = \begin{cases} \sum_{k \in \mathbb{N}_0} p_k \, g_0^k & \text{for } r = 0, \\ \frac{1}{1 - a g_0} \sum_{k=1}^{r} \left( a + b \frac{k}{r} \right) g_k f_{r-k} & \text{for } r \geq 1. \end{cases} \]

Proof of Theorem 4.9(B). Note that we have $k p_k = (ak + b) p_{k-1} = a(k-1)p_{k-1} + (a+b)p_{k-1}$. We multiply this equation with $(M_{Y_1}(x))^{k-1} M'_{Y_1}(x)$ and sum over $k \in \mathbb{N}$. This provides the identity
\[ \sum_{k \in \mathbb{N}} k p_k \, (M_{Y_1}(x))^{k-1} M'_{Y_1}(x) = \sum_{k \in \mathbb{N}} \left( a(k-1)p_{k-1} + (a+b)p_{k-1} \right) (M_{Y_1}(x))^{k-1} M'_{Y_1}(x). \]
The left-hand side is the derivative w.r.t. $x$ of
\[ M_S(x) = E\left[ E\left[ \exp\left\{ x \sum_{i=1}^{N} Y_i \right\} \bigg| N \right] \right] = E\left[ M_{Y_1}(x)^N \right] = \sum_{k \in \mathbb{N}_0} p_k \, (M_{Y_1}(x))^k, \]
whereas the right-hand side fulfills, again using the derivative of $M_S(x)$ in the second step,
\[ \sum_{k \in \mathbb{N}} \left( a(k-1)p_{k-1} + (a+b)p_{k-1} \right) (M_{Y_1}(x))^{k-1} M'_{Y_1}(x) = \sum_{k \in \mathbb{N}_0} \left( a k p_k + (a+b) p_k \right) (M_{Y_1}(x))^{k} M'_{Y_1}(x) = a M'_S(x) M_{Y_1}(x) + (a+b) M_S(x) M'_{Y_1}(x). \]
Thus, we have just proved that the moment generating function of compound Panjer distributions satisfies the following differential equation
\[ M'_S(x) = a M_{Y_1}(x) M'_S(x) + (a+b) M'_{Y_1}(x) M_S(x). \]
Each side of the above identity can be expanded as powers of $e^x$
\[ \sum_{r \geq 1} f_r \, r e^{xr} = a \sum_{k \geq 0} g_k e^{xk} \sum_{l \geq 1} f_l \, l e^{xl} + (a+b) \sum_{k \geq 1} g_k \, k e^{xk} \sum_{l \geq 0} f_l \, e^{xl}. \]
Comparing the terms with the same powers $r \geq 1$ of $e^x$ we obtain
\[ r f_r = a \sum_{k=0}^{r-1} g_k (r-k) f_{r-k} + (a+b) \sum_{k=1}^{r} k g_k f_{r-k} = a r g_0 f_r + a \sum_{k=1}^{r} g_k (r-k) f_{r-k} + (a+b) \sum_{k=1}^{r} k g_k f_{r-k} = a r g_0 f_r + \sum_{k=1}^{r} (ar + bk) \, g_k f_{r-k}. \]
Dividing both sides by $r \geq 1$ and bringing the first term on the right-hand side of the last equality to the other side we obtain
\[ (1 - a g_0) f_r = \sum_{k=1}^{r} \left( a + b \frac{k}{r} \right) g_k f_{r-k}. \]
This proves the claim for $r > 0$. For $r = 0$ we have
\[ P[S = 0] = p_0 + \sum_{k \in \mathbb{N}} p_k \, P\left[ \sum_{i=1}^{k} Y_i = 0 \right] = p_0 + \sum_{k \in \mathbb{N}} p_k \, P[Y_1 = \ldots = Y_k = 0] = p_0 + \sum_{k \in \mathbb{N}} p_k g_0^k = \sum_{k \in \mathbb{N}_0} p_k g_0^k, \]
where in the second last step we have used the independence property of the claim sizes $Y_i$. This finishes the proof. Note that we have (implicitly) assumed that there exists a positive radius of convergence for the moment generating functions, see also Lemma 1.1. We can do this w.l.o.g. because in order to calculate $f_r = P[S = r]$ we may replace the claim sizes $Y_i$ by bounded claim sizes $Y_i \wedge (r+1)$ and the resulting probability weight $f_r$ will be the same. □
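For the compound Poisson case the modified algorithm is easily sketched in R: here $f_0 = \sum_k p_k g_0^k = \exp\{\lambda v (g_0 - 1)\}$ is the probability generating function of $N \sim \text{Poi}(\lambda v)$ evaluated at $g_0$ (the function and variable names are again illustrative):

# Sketch of Theorem 4.9(B) for N ~ Poi(lambda.v), allowing for g[1] = g_0 > 0;
# since a = 0 for the Poisson case, the factor 1/(1 - a*g_0) equals 1
> panjer.poisson.g0 <- function(lambda.v, g, rmax) {
    f <- numeric(rmax + 1)
    f[1] <- exp(lambda.v * (g[1] - 1))   # f_0 = E[g_0^N] = exp(lambda*v*(g_0 - 1))
    for (r in 1:rmax) {
      k <- 1:min(r, length(g) - 1)       # g[k+1] stores g_k for k = 0, 1, ...
      f[r + 1] <- lambda.v * sum((k / r) * g[k + 1] * f[r - k + 1])
    }
    f
  }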

Example 4.10 (Panjer algorithm compound Poisson distribution). We choose a compound Poisson model with expected number of claims $\lambda v = 1$ and Pareto claim size distribution $Y_i \stackrel{\text{i.i.d.}}{\sim} \text{Pareto}(\theta, \alpha)$ with $\theta = 500\,000$ and $\alpha = 2.5$. In a first step we need to discretize the claim sizes. We calculate the distributions of $Y_i^- \leq_{sd} Y_i \leq_{sd} Y_i^+$ according to (4.7) and (4.8) with
\[ g^-_k = g^+_{k+1} = G((k+1)d) - G(kd) = \left( \frac{kd}{\theta} \right)^{-\alpha} - \left( \frac{(k+1)d}{\theta} \right)^{-\alpha}. \]

Figure 4.9: Discretized claim size distributions $(g_k^-)_k$ and $(g_k^+)_k$; lhs: case (i) with span $d = 100\,000$; rhs: case (ii) with span $d = 10\,000$.

As span size we choose two different values: (i) $d = 100\,000$ and (ii) $d = 10\,000$. In Figure 4.9 we plot the resulting probability weights $(g_k^-)_k$ and $(g_k^+)_k$. We see that the discretization error disappears for decreasing span $d$. We then implement the Panjer algorithm in R. The implementation is rather straightforward. In a first step we invert the ordering in the claim size distributions

$(g_k^-)_k$ and $(g_k^+)_k$ so that in the second step we can apply matrix multiplications. This looks as follows:

# Note that we shift indexes by 1 (because arrays start at 1):
# g[1,k+1] stores g_k; g[2,Kmax-k] stores k*g_k in reversed order, so that the
# inner sum of recursion (4.6) becomes a vectorized matrix product
> for (k in 0:(Kmax-1)) { g[2,Kmax-k] <- g[1,k+1]*k }
> f[1] <- exp(-lambda * v)                 # f_0 = exp(-lambda*v)
> for (r in 1:(Kmax-1)) {
    f[r+1] <- g[2,(Kmax-r):(Kmax-1)] %*% f[1:r] * lambda * v / r
  }

The results are presented in Figures 4.10 and 4.11.

Figure 4.10: Discrete probability weights of the compound Poisson distribution with $\lambda v = 1$ from the Panjer algorithm; lhs: case (i) with span $d = 100\,000$; rhs: case (ii) with span $d = 10\,000$.

In Figure 4.10 we plot the resulting probability weights of the (discretized) compound Poisson distribution; the left-hand side gives the picture for span $d = 100\,000$ and the right-hand side for $d = 10\,000$. The observation is that span $d = 100\,000$ gives quite some differences between the lower and upper bounds reflected by $(g_k^-)_k$ and $(g_k^+)_k$, whereas for span $d = 10\,000$ they are sufficiently close so that we obtain appropriate approximations to the continuous Pareto distribution case. We also observe that the resulting distribution has two obvious modes, see Figure 4.10 (rhs); these reflect the cases of having $N = 1$ claim and $N = 2$ claims, while the cases $N \geq 3$ only give smaller discontinuities.
Finally, in Figure 4.11 we show the log-log plots of the distribution functions. The straight blue line reflects the Pareto distribution $Y_1 \sim \text{Pareto}(\theta, \alpha)$, i.e. of having exactly one claim with tail parameter $\alpha = 2.5$ (which corresponds to the negative slope of the blue line). We observe that asymptotically the compound Poisson distribution with $\lambda v = 1$ coincides with the Pareto claim size distribution. □

Figure 4.11: Log-log plot of the compound Poisson distribution with $\lambda v = 1$ from the Panjer algorithm; lhs: case (i) with span $d = 100\,000$; rhs: case (ii) with span $d = 10\,000$.
Example 4.11. We revisit case (c) of Example 4.2. For large claims $S_{lc}$ we assume a compound Poisson distribution with expected number of claims $\lambda_{lc} v = 3.9$ and $\text{Pareto}(\theta, \alpha)$ claim size distribution with $\theta = 500\,000$ and $\alpha = 2.5$. We choose the same two discretizations as in Example 4.10, see Figure 4.9, and then we apply the Panjer algorithm to the large claims layer as explained above. The results for the distributions of $S_{lc}^\pm$ are presented in Figures 4.12 and 4.13.

The results are in line with the ones of Example 4.10 and we should prefer span size $d = 10\,000$ which gives a sufficiently good approximation to the continuous Pareto claim size distribution. Observe that due to $\lambda_{lc} v = 3.9$ the resulting compound Poisson distribution has more modes now, see Figure 4.12 (rhs). In Figure 4.13 we see that the asymptotic behavior is sandwiched between the Pareto distribution $\text{Pareto}(\theta, \alpha)$ with tail parameter $\alpha = 2.5$ and this Pareto distribution stretched with the expected number of claims $\lambda_{lc} v = 3.9$ (blue lines in Figure 4.13). We observe a rather slow convergence to the asymptotic slope which tells us that parameter estimation for Pareto distributions is a very difficult (if not impossible) task if only few observations are available.

Finally, we convolute the large claims layer $S_{lc}$ of case (c) in Example 4.2 with the corresponding small claims layer $S_{sc}$, see case (b) of Example 4.2. For the small claims layer we choose a translated gamma distribution as approximation to the true distribution function of $S_{sc}$, i.e. we set
\[ S = S_{sc} + S_{lc} \approx X_{sc} + S_{lc}^\pm, \tag{4.9} \]


Figure 4.12: Discrete probability weights of a compound Poisson distribution with $\lambda_{lc} v = 3.9$ from the Panjer algorithm; lhs: case (i) with span $d = 100\,000$; rhs: case (ii) with span $d = 10\,000$.

Figure 4.13: Log-log plot of the compound Poisson distribution with $\lambda_{lc} v = 3.9$ from the Panjer algorithm; lhs: case (i) with span $d = 100\,000$; rhs: case (ii) with span $d = 10\,000$.

where $X_{sc}$ is the translated gamma approximation to $S_{sc}$ (see Example 2.16 and (4.2)) and $S_{lc}^\pm$ are the discretized versions of $S_{lc}$ which model the large claims layer having a compound Poisson distribution with Pareto claim sizes.

In order to calculate the compound Poisson random variable $S_{lc}^\pm$ we apply the Panjer algorithm with span $d = 10\,000$. The disjoint decomposition theorem, see Theorem 2.14 and Example 2.16, implies that in the compound Poisson case we may and will assume that the large claims separation leads to an independent decoupling of $S_{sc}$


Figure 4.14: Case (c) of Example 4.2: exact discretized distribution $X_{sc}^d + S_{lc}^\pm$ for span $d = 10\,000$, Monte Carlo approximation and normal approximation (only rhs); lhs: discrete probability weights (upper and lower bounds); rhs: log-log plot (see also Figure 4.3 (rhs)).

and $S_{lc}$, and $X_{sc}$ and $S_{lc}$, respectively, see (4.9). Therefore, the aggregate distribution of $X_{sc} + S_{lc}^\pm$ is obtained by a simple convolution of the marginal distributions of $X_{sc}$ and $S_{lc}^\pm$. Using also a discretization of the distribution function of $X_{sc}$ to the same span $d = 10\,000$ as in the Panjer algorithm for $S_{lc}^\pm$, denoted by $X_{sc}^d$, the convolution of $X_{sc}^d + S_{lc}^\pm$ can easily be calculated analytically. That is, no Monte Carlo simulation is needed. Namely, denote the discrete probability weights of $X_{sc}^d$ by $(f_k^{(1)})_{k \geq 0}$ and the discrete probability weights of $S_{lc}^\pm$ by $(f_k^{(2)})_{k \geq 0}$, i.e. set
\[ f_k^{(1)} = P\left[ X_{sc}^d = kd \right] \qquad \text{and} \qquad f_k^{(2)} = P\left[ S_{lc}^\pm = kd \right]. \]
Then, due to independence, we have for all $r \in \mathbb{N}_0$ discrete probability weights
\[ f_r \stackrel{\text{def.}}{=} P\left[ X_{sc}^d + S_{lc}^\pm = rd \right] = \sum_{k=0}^{r} f_k^{(1)} f_{r-k}^{(2)}. \tag{4.10} \]

# Note that we shift indexes by 1 (because arrays start at 1):
# f2[2,·] stores the weights of S_lc in reversed order, so that the convolution
# (4.10) becomes a vectorized matrix product; f[r] then stores f_{r-1}
> for (k in 0:(Kmax-1)) { f2[2,Kmax-k] <- f2[1,k+1] }
> for (r in 1:Kmax) { f[r] <- f2[2,(Kmax-r+1):Kmax] %*% f1[1:r] }

The results are presented in Figure 4.14. On the left-hand side we present the probability weights $(f_r)_{r \geq 0}$ and on the right-hand side the log-log plot of the resulting distribution function. We observe that the Monte Carlo approximation (100000 simulations) has bad properties in the tail of the distribution, see Figure 4.14 (rhs), and one should avoid the simulation approach if possible. Especially for heavy tailed distribution functions the Monte Carlo simulation approach has a slow speed of convergence. Note that convolution (4.10) is exact, and in some sense this discretized version can be interpreted as an optimal Monte Carlo sample with equidistant observations.

We conclude that approximation (4.9) with a translated gamma distribution for the small claims layer and a compound Poisson distribution with Pareto tails for the large claims layer is often a good model for total claim amount modeling in non-life insurance. Moreover, using a discretization with appropriate span size $d$ the resulting discrete distribution function can be calculated analytically (and we obtain upper and lower bounds which can be controlled).

expected claim amount E[S]                       3'131'397
standard deviation Var(S)^{1/2}                    338'819
coefficient of variation Vco(S)                      10.8%
99.5%-VaR upper bound (from discretization)      4'046'500
99.5%-VaR lower bound (from discretization)      4'038'500
99.5%-VaR − E[S]                                   912'500

Table 4.2: Resulting key figures; the 99.5%-VaR corresponds to the 99.5%-quantile of $S$, see Example 6.25 below. The 99.5%-VaR is calculated with the discretized version with span $d = 10\,000$, therefore we obtain upper and lower bounds resulting from the discretization error in $X_{sc}^d + S_{lc}^\pm$.
no

Finally, in Table 4.2 we present the resulting key figures. We observe that the
resulting distribution function is substantially more heavy tailed than the Gaussian
distribution which is not surprising in view of Figure 4.14 (rhs). 

4.2.2 Fast Fourier transform

We briefly sketch the fast Fourier transform (FFT) to explain the main idea. We follow Embrechts-Frei [38], Section 6.7 in Panjer [84], and we also recommend Černý [27] as a reference.

In Chapter 1 we have introduced the moment generating function of $X$ given by $M_X(r) = E[e^{rX}]$. The beauty of such transforms is that they allow us to treat independent random variables in an elegant way, in the sense that convolutions turn into products, i.e. for $X$ and $Y$ independent we have (whenever they exist)
\[ M_{X+Y}(r) = M_X(r) \, M_Y(r). \]
For compound distributed random variables $S$ we have, see Proposition 2.2,
\[ M_S(r) = M_N(\log M_{Y_1}(r)). \tag{4.11} \]


If we manage to identify the right-hand side of the latter equation, that is, find $Z$ such that $M_N(\log M_{Y_1}(r)) = M_Z(r)$, then Lemma 1.2 explains that $S$ and $Z$ have the same distribution function and we do not need to perform the convolutions (if $Z$ is sufficiently explicit). This is also the idea behind this section.

In the sequel of this section the moment generating function is replaced by the (discrete) Fourier transform, which is named after Jean Baptiste Joseph Fourier (1768-1830). The reason for this replacement is that the Fourier transform has a nice inversion formula that is crucial in this section (and which allows us to identify the right-hand side of (4.11) in a straightforward manner). We present the discretized case as it is usually used in practice.

Assume we have finite support $A = \{0, \ldots, n-1\}$ and that $(f_l)_{l \in A}$ is a discrete distribution function on $A$. The discrete Fourier transform of $(f_l)_l$ is defined by
\[ \widehat{f}_z = \sum_{l=0}^{n-1} f_l \exp\left\{ 2\pi i \, \frac{zl}{n} \right\} \qquad \text{for } z \in A. \tag{4.12} \]
Assume $S \sim (f_l)_l$, then we have, by a slight abuse of notation,
\[ \widehat{f}_z = M_S\left( 2\pi i \, \frac{z}{n} \right) = E\left[ \exp\left\{ 2\pi i \, \frac{zS}{n} \right\} \right]. \]
The discrete Fourier transform has the following nice inversion formula
\[ f_l = \frac{1}{n} \sum_{z=0}^{n-1} \widehat{f}_z \exp\left\{ -2\pi i \, \frac{zl}{n} \right\} \qquad \text{for } l \in A. \tag{4.13} \]
This provides the first part of the idea of the algorithm: if we are able to explicitly calculate the discrete Fourier transform $(\widehat{f}_z)_z$, then the inversion formula provides the wanted probability weights $(f_l)_l$. Note that this idea also applies if $(f_l)_l$ are weights that do not necessarily add up to 1.

Remarks. In the literature one also finds another definition of the discrete Fourier transform, namely in (4.12) the factor $2\pi i$ is sometimes replaced by $-2\pi i$. This implies that we also need a switch of sign in the inversion formula (4.13). Similarly, the scaling $n^{-1}$ in (4.13) may be shifted to (4.12). Note that the discrete Fourier transform acts on the cyclic group $\mathbb{Z}/n\mathbb{Z}$.

The above gives the following recipe:

Step 1. Choose a threshold $n \in \mathbb{N}$ up to which we would like to determine the distribution function of $S$, i.e. we are interested in $P[S \leq n-1]$.

Step 2. Discretize the claim severity distribution $G$ to obtain weights $(g_k)_{k \in A}$. For discretization we refer to the last section on the Panjer algorithm, see the remarks on page 107. Note that typically we have $\sum_{k \in A} g_k < 1$, because claims $Y_i$ may exceed threshold $n-1$ with positive probability.

Step 3. Calculate the discrete Fourier transform $(\widehat{g}_z)_{z \in A}$ of $(g_k)_{k \in A}$.

Step 4. Calculate the discrete Fourier transform $(\widehat{f}_z)_{z \in A}$ of $S \sim (f_l)_{l \in A}$ using identity (4.11) with $r = 2\pi i z/n$ and $(\widehat{g}_z)_{z \in A}$, respectively, that is, set
\[ \widehat{f}_z = M_S\left( 2\pi i \, \frac{z}{n} \right) = M_N\left( \log M_{Y_1}\left( 2\pi i \, \frac{z}{n} \right) \right) = M_N(\log \widehat{g}_z). \tag{4.14} \]

Step 5. Apply the inversion formula to obtain $(f_l)_{l \in A}$ from $(\widehat{f}_z)_{z \in A}$.
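For a compound Poisson distribution we have $M_N(\log \widehat{g}_z) = \exp\{\lambda v (\widehat{g}_z - 1)\}$, so the recipe can be sketched in a few lines of R; the chosen $\lambda v$ and claim size weights correspond to Figure 4.15 below, while $n = 2^{14}$ is an illustrative choice:

# Sketch of the FFT recipe for a compound Poisson distribution N ~ Poi(lambda.v)
> lambda.v <- 1
> n <- 2^14                               # step 1: threshold (a power of 2 for the FFT)
> g <- numeric(n); g[2:11] <- 1/10        # step 2: g[l+1] = g_l, here g_1 = ... = g_10 = 1/10
> g_hat <- fft(g)                         # step 3: discrete Fourier transform of (g_k)_k
> f_hat <- exp(lambda.v * (g_hat - 1))    # step 4: (4.14), since M_N(r) = exp(lambda*v*(e^r - 1))
> f <- Re(fft(f_hat, inverse = TRUE)) / n # step 5: inversion formula (4.13)
> sum(f)                                  # equals 1 up to rounding (mass is wrapped around)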

The remaining part of the FFT explains how to calculate the discrete Fourier transform $(\widehat{g}_z)_{z \in A}$ of $Y_1 \sim (g_l)_{l \in A}$ efficiently. There is a nice recursive algorithm that allows us to calculate these discrete Fourier transforms for the choices $n = 2^d$, $d \in \mathbb{N}_0$. The discrete Fourier transform of $(g_l)_l$ for $n = 2^d$ is given by
\[ \widehat{g}_z = \sum_{l=0}^{2^d - 1} g_l \exp\left\{ 2\pi i \, \frac{zl}{2^d} \right\} = \sum_{l=0}^{2^{d-1}-1} g_{2l} \exp\left\{ 2\pi i \, \frac{2zl}{2^d} \right\} + \sum_{l=0}^{2^{d-1}-1} g_{2l+1} \exp\left\{ 2\pi i \, \frac{z(2l+1)}{2^d} \right\} \]
\[ = \sum_{l=0}^{2^{d-1}-1} g_{2l} \exp\left\{ 2\pi i \, \frac{zl}{2^{d-1}} \right\} + \exp\left\{ 2\pi i \, \frac{z}{2^d} \right\} \sum_{l=0}^{2^{d-1}-1} g_{2l+1} \exp\left\{ 2\pi i \, \frac{zl}{2^{d-1}} \right\} = \widehat{g}_z^{(0)} + \exp\left\{ 2\pi i \, \frac{z}{2^d} \right\} \widehat{g}_z^{(1)}, \]
where $\widehat{g}_z^{(0)}$ is the discrete Fourier transform of $(g_l^{(0)})_{l=0,\ldots,m-1} = (g_{2l})_{l=0,\ldots,m-1}$ and $\widehat{g}_z^{(1)}$ is the discrete Fourier transform of $(g_l^{(1)})_{l=0,\ldots,m-1} = (g_{2l+1})_{l=0,\ldots,m-1}$ for $m = 2^{d-1}$. Note that this step reduces length $2^d$ to length $2^{d-1}$, and iterating this until we have reduced the total length $2^d$ to $2^0 = 1$ calculates the discrete Fourier transform of $(g_l)_l$ in an efficient way.

Observe that the total length of $(\widehat{f}_z)_z$ is also $n = 2^d$. Therefore, exactly the same recursive algorithm is applied for the calculation of the inversion formula to obtain $(f_l)_l$.

In R there is a command for the FFT. Use the following lines to transform a discrete, finite distribution $g = (g_l)_l$:

# Check normalizations in (4.12)-(4.13) (depending on the implementation they
# may be different)
> g_hat <- fft(g)
> g <- fft(g_hat, inverse = TRUE)/length(g)

For more information on the FFT and calculation with complex numbers we refer to Černý [27].

We conclude this section with remarks on compound distributions; for details we refer to Embrechts-Frei [38]. First, we compare the efficiency of the proposed methods. We assume that we calculate the compound distribution of $S$ with discrete claim sizes $Y \sim (g_l)_{l \in A}$ of length $n$ for $n \to \infty$.

method                   operations    claims counts N         precision
full convolution (2.1)   O(n^3)        any distribution        exact
Panjer algorithm         O(n^2)        Panjer distributions    exact
FFT                      O(n log n)    any distribution        not exact

Observe that we have hidden one issue when applying the FFT to compound distributions. As mentioned above, the discrete Fourier transform acts on the cyclic group $\mathbb{Z}/n\mathbb{Z}$. But transformation (4.14) does not respect this cyclic structure and compound claims that exceed $n-1$ are wrapped around. This wrap-around error (also called aliasing error) can be substantial and needs careful consideration. If it is too large, then $n$ should be increased so that less probability mass exceeds the threshold $n-1$; an example is provided in Figure 4.15.

Figure 4.15: Panjer algorithm versus FFT for the compound Poisson distribution with $\lambda v = 1$ and discrete claim size distribution $(g_\ell)_\ell$ with $g_\ell = 1/10$ for $\ell = 1, \ldots, 10$ with (lhs) $n = 12$, (middle) $n = 15$, and (rhs) $n = 20$.


Chapter 5

Ruin Theory in Discrete Time

Ruin theory$^1$ has its origin in the early twentieth century when Ernst Filip Oskar Lundberg (1876-1965) [71] wrote his famous Uppsala PhD thesis in 1903. It was later the distinguished Swedish mathematician and actuary Harald Cramér (1893-1985) [29, 30] who developed the cornerstones in ruin theory and made many of Lundberg's ideas mathematically rigorous. Therefore, the underlying process studied in ruin theory is called the Cramér-Lundberg process. For the collected work of Cramér we refer to [31]. Since then a vast literature has developed in this field; important contributions are Feller [45], Bühlmann [19], Rolski et al. [89], Asmussen-Albrecher [7], Dickson [36], Kaas et al. [64] and many scientific papers by Hans-Ulrich Gerber and Elias S.W. Shiu.

Because it is not our intention to write another textbook on ruin theory we keep this chapter rather short and only give some key ideas and results. In particular, we investigate the importance of the tail of the claim size distribution. Our short summary is mainly based on Schmidli [91] and Rolski et al. [89]; for a more comprehensive overview we refer to the literature.

5.1 Net profit condition

We consider time series of premium payments $\pi_t$ and total claim amount payments $S_t$ over several accounting years $t \in \mathbb{N}$. In this set-up we study the question under which circumstances the premia $\pi_t$ suffice to pay all claims $S_t$ (instantaneously when they occur, allowing us to carry-over possible gains). In order to do this, we define the following (discrete time) surplus process $(C_t)_{t \in \mathbb{N}_0}$.

$^1$The fast reader may skip this chapter. In this chapter we justify the premium loading from a ruin theory point of view and the notion of subexponential distributions is introduced.


Definition 5.1 (surplus process). Choose $t \in \mathbb{N}$. The surplus at time $t$ is given by
\[ C_t = C_t^{(c_0)} = c_0 + \sum_{u=1}^{t} (\pi_u - S_u), \]
for initial capital $C_0 = c_0 \geq 0$ at time 0 and an i.i.d. sequence $(\pi_t, S_t)_{t \in \mathbb{N}}$ with:

• the premium $\pi_t$ received for accounting year $t$ satisfies $\pi_t > 0$, $P$-a.s., and has a positive radius of convergence for its moment generating function;

• the total claim amount $S_t$ in accounting year $t$ satisfies $S_t \geq 0$, $P$-a.s.;

• $\pi_t$ and $S_t$ are independent for all $t \in \mathbb{N}$.

The last assumption in the previous definition is not really necessary but it may simplify calculations. The surplus process $(C_t)_{t \in \mathbb{N}_0}$ models the equity or the net asset value process of an insurance company which starts with (deterministic) initial capital $C_0 = c_0 \geq 0$, collects every year a premium $\pi_t$ and pays for the corresponding (non-negative) claim $S_t$. At first sight it looks artificial to model the premium $\pi_t$ stochastically. The reason therefore is that it may be advantageous in some situations to have randomized premia. The ultimate goal is to achieve
\[ C_t \geq 0 \qquad \text{for all } t \geq 0, \]
otherwise the company cannot fulfill its liabilities at any point in time $t \in \mathbb{N}_0$. In the present set-up we look at a homogeneous surplus process (having independent and stationary increments $X_t = \pi_t - S_t$). Moreover, no financial return on assets is considered. Of course, this is a rather synthetic situation. For the present purpose it is sufficient because it already highlights crucial issues and it will be refined for solvency considerations in Chapter 10.

Definition 5.2 (ruin time and finite horizon ruin probability). We define the ruin time of the surplus process $(C_t)_{t \in \mathbb{N}_0}$ by
\[ \tau = \inf\{ s \in \mathbb{N}_0 ; \, C_s < 0 \}. \]
The finite horizon ruin probability up to time $t \in \mathbb{N}$ and for initial capital $c_0 \geq 0$ is defined by
\[ \psi_t(c_0) = P[\tau \leq t \,|\, C_0 = c_0] = P\left[ \inf_{s=0,\ldots,t} C_s^{(c_0)} < 0 \right]. \]


Remark on the notation. Below we use that for $c_0 = 0$ the stochastic process $(C_t^{(0)})_{t \in \mathbb{N}_0} = (C_t)_{t \in \mathbb{N}_0}$ is a random walk on the probability space $(\Omega, \mathcal{F}, P)$ starting at zero. The general surplus process can then be described by $(C_t^{(c_0)})_{t \in \mathbb{N}_0} = (C_t^{(0)} + c_0)_{t \in \mathbb{N}_0}$ under $P$ and, as stated in Definition 5.2, we can indicate the initial capital by using the notation $P[\cdot \,|\, C_0 = c_0]$. In Markov process theory it has become customary to write the latter as $P_{c_0}[\cdot]$, meaning that $(C_t)_{t \in \mathbb{N}_0}$ under $P_{c_0}$ is equal in law to $(C_t^{(0)} + c_0)_{t \in \mathbb{N}_0}$ under $P$.

The event $\{\tau \leq t\}$ can be written as follows
\[ \{\tau \leq t\} = \left\{ \inf\{s \in \mathbb{N}_0 ; \, C_s < 0\} \leq t \right\} = \bigcup_{s=0,\ldots,t} \{C_s < 0\}, \]
and therefore $\tau$ is a stopping time w.r.t. the filtration generated by $(C_t)_{t \in \mathbb{N}_0}$. To consider the limiting case $t \to \infty$ we need to extend the positive real line by an additional point $\{\infty\}$ because $\tau$ is not necessarily finite, $P$-a.s. We use the notation $\overline{\mathbb{R}}_+$ for the extended positive real line $[0, \infty]$.

The finite horizon ruin probability $\psi_t(c_0)$ is non-decreasing in $t$ and it is bounded by 1 (because it is a probability). This immediately implies convergence for $t \to \infty$ and we can define the ultimate ruin probability by the following limit
\[ \psi(c_0) = \lim_{t \to \infty} \psi_t(c_0) \in [0,1]. \tag{5.1} \]

Lemma 5.3 (ultimate ruin probability). The ultimate ruin probability for initial capital $c_0 \geq 0$ is given by
\[ \psi(c_0) = P_{c_0}[\tau < \infty] = P_{c_0}\left[ \inf_{t \in \mathbb{N}_0} C_t < 0 \right] \in [0,1]. \]

Proof. The second equality is a direct consequence of the definition; note that
\[ \{\tau < \infty\} = \bigcup_{t \in \mathbb{N}_0} \{\tau \leq t\} = \bigcup_{t \in \mathbb{N}_0} \bigcup_{s=0,\ldots,t} \{C_s < 0\} = \bigcup_{t \in \mathbb{N}_0} \{C_t < 0\} = \left\{ \inf_{t \in \mathbb{N}_0} C_t < 0 \right\}. \]
For the first equality we use the monotone convergence property of probability measures; note $\{\tau \leq t\} \subseteq \{\tau \leq t+1\}$,
\[ P_{c_0}[\tau < \infty] = P_{c_0}\left[ \bigcup_{t \in \mathbb{N}_0} \{\tau \leq t\} \right] = \lim_{t \to \infty} P_{c_0}[\tau \leq t] = \lim_{t \to \infty} \psi_t(c_0) = \psi(c_0). \qquad \Box \]

We analyze this ultimate ruin probability in various situations. Therefore, we modify the surplus process $(C_t^{(c_0)})_{t \in \mathbb{N}_0}$. We define $Z_0 = 0$ and for $t \in \mathbb{N}$
\[ Z_t = C_t^{(c_0)} - c_0 = C_t^{(0)} = \sum_{u=1}^{t} (\pi_u - S_u) = \sum_{u=1}^{t} X_u, \tag{5.2} \]
where we define the i.i.d. sequence $(X_t)_{t \in \mathbb{N}}$ by $X_t = \pi_t - S_t$. In probability theory the process $(Z_t)_{t \in \mathbb{N}_0}$ is called a general random walk. A main object of interest of random walk theory is the study of its long time behavior. The key theorem is the following statement:

Theorem 5.4 (random walk theorem). Assume $X_t$ are i.i.d. with $P[X_1 = 0] < 1$ and $E[|X_1|] < \infty$. The random walk $(Z_t)_{t \in \mathbb{N}_0}$ defined in (5.2) has one of the following three behaviors:

• if $E[X_1] > 0$ then $\lim_{t \to \infty} Z_t = \infty$, $P$-a.s.;

• if $E[X_1] < 0$ then $\lim_{t \to \infty} Z_t = -\infty$, $P$-a.s.;

• if $E[X_1] = 0$ then $\liminf_{t \to \infty} Z_t = -\infty$ and $\limsup_{t \to \infty} Z_t = \infty$, $P$-a.s.

Proof. See, e.g., Proposition 7.2.3 in Resnick [87]. □

From now on we exclude the trivial case $P[\pi_1 - S_1 = 0] = 1$ and we assume that $\pi_1$ and $S_1$ have finite first moments. The random walk theorem immediately gives the following crucial corollary for our context:

Corollary 5.5 (ultimate ruin with probability one). Assume $E[\pi_1] \leq E[S_1]$. Then $\psi(c_0) \equiv 1$ for any initial capital $c_0 \geq 0$.

Proof. The random walk theorem implies for $E[X_1] = E[\pi_1] - E[S_1] \leq 0$ that $\liminf_{t \to \infty} Z_t = -\infty$, $P$-a.s., and thus $\liminf_{t \to \infty} C_t = -\infty$, $P_{c_0}$-a.s. (for any $c_0 \geq 0$). But this means that we have ultimate ruin with probability 1. □

Henceforth, for avoiding ultimate ruin with positive probability we need to charge an (expected) annual premium $E[\pi_1]$ which exceeds the expected annual claim $E[S_1]$. This gives rise to the following standard assumption.

Assumption 5.6 (net profit condition). The surplus process satisfies the net profit condition (NPC) given by
\[ E[\pi_1] > E[S_1]. \]

Corollary 5.7. Assume (NPC), then $\psi(0) < 1$.

Proof. The assumption $E[\pi_1] > E[S_1]$ implies $E[X_1] > 0$ and, thus, $\lim_{t \to \infty} Z_t = \infty$, $P$-a.s. This implies that $P[\liminf_{t \to \infty} Z_t = -\infty] = 0$. The latter is equivalent to $P[\inf_{t \in \mathbb{N}_0} Z_t \geq 0] > 0$, see for instance Proposition 7.2.1 in Resnick [87]. But then the proof follows. □


Moreover, observe that $\psi(c_0)$ is non-increasing in $c_0$ (this can be seen path by path because $C_t^{(c_0)} = Z_t + c_0$ is strictly increasing in the initial capital $c_0$). This implies that $\psi(c_0) \leq \psi(0) < 1$ under (NPC).

Our next goal is to find more explicit bounds on the ruin probability as a function of the initial capital $c_0 \geq 0$.

5.2 Lundberg bound

We start with a lemma which gives the renewal property of the surplus process. We define the distribution function $F$ by $S_1 - \pi_1 \sim F$. Thus, we have $-X_t \stackrel{\text{i.i.d.}}{\sim} F$. Note that from $S_1 \sim F_S$, $-\pi_1 \sim F_{-\pi}$ and independence of $S_1$ and $\pi_1$ it follows
\[ F = F_S * F_{-\pi}. \]

Lemma 5.8. The finite horizon ruin probability and the ultimate ruin probability satisfy the following equations for $t \in \mathbb{N}_0$ and initial capital $c_0 \geq 0$
\[ \psi_{t+1}(c_0) = 1 - F(c_0) + \int_{-\infty}^{c_0} \psi_t(c_0 - y) \, dF(y), \]
\[ \psi(c_0) = 1 - F(c_0) + \int_{-\infty}^{c_0} \psi(c_0 - y) \, dF(y). \]

Proof. We start with the finite horizon ruin probability. Observe that we have the partition for $c_0 \geq 0$
\[ \{\tau \leq t+1\} = \{\tau \leq 1\} \cup \{1 < \tau \leq t+1\} = \{S_1 - \pi_1 > c_0\} \cup \{1 < \tau \leq t+1\}. \]
The i.i.d. property of $(\pi_t, S_t)_t$ implies
\[ \psi_{t+1}(c_0) = P_{c_0}[\tau \leq t+1] = P[S_1 - \pi_1 > c_0] + P_{c_0}[1 < \tau \leq t+1] \]
\[ = P[S_1 - \pi_1 > c_0] + \int_{-\infty}^{c_0} P_{c_0}[1 < \tau \leq t+1 \,|\, S_1 - \pi_1 = y] \, dF(y) \]
\[ = P[S_1 - \pi_1 > c_0] + \int_{-\infty}^{c_0} P_{c_0}[1 < \tau \leq t+1 \,|\, C_1 = c_0 - y] \, dF(y) \]
\[ = P[S_1 - \pi_1 > c_0] + \int_{-\infty}^{c_0} P_{c_0 - y}[\tau \leq t] \, dF(y) = 1 - F(c_0) + \int_{-\infty}^{c_0} \psi_t(c_0 - y) \, dF(y). \]
The ultimate ruin probability statement is a direct consequence of the finite horizon statement. Using that we have point-wise convergence (5.1) and that $\psi_t$ is bounded by 1, which is integrable w.r.t. $dF$, we can apply the dominated convergence theorem to the finite horizon ruin probability statement, which provides the claim for the ultimate ruin probability as $t \to \infty$. □


Definition 5.9 (Lundberg coefficient, adjustment coefficient). Assume there exists an $R > 0$ such that
\[ M_{-X_1}(R) = M_{S_1 - \pi_1}(R) = 1. \]
Then, this $R > 0$ is called Lundberg coefficient.

Lemma 5.10 (uniqueness of the Lundberg coefficient). Assume that (NPC) holds and that a Lundberg coefficient $R > 0$ exists. Then, $R$ is unique.

Figure 5.1: Lundberg coefficient $R$ of the function $r \mapsto M_{-X_1}(r)$.

Proof. Due to the existence of a Lundberg coefficient $R > 0$ and due to the independence between $S_1$ and $\pi_1$ the following function is well-defined for all $r \in [0, R]$ and satisfies
\[ r \mapsto h(r) = \log M_{S_1 - \pi_1}(r) = \log\left( M_{S_1}(r) \, M_{\pi_1}(-r) \right) = \log E\left[ e^{r S_1} \right] + \log E\left[ e^{-r \pi_1} \right]. \]
Similar to Lemma 1.6 we see that $h(r)$ is a convex function on $[0, R]$ with $h(0) = 0$ and $h'(0) = E[S_1 - \pi_1] < 0$ under (NPC). But then there is at most one $R > 0$ with $h(R) = 0$. This proves the uniqueness of the Lundberg coefficient. □
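Numerically, the Lundberg coefficient is the positive root of $h$. A short R sketch for a deterministic premium $\pi_1 \equiv$ prem and, as an illustrative assumption, exponentially distributed total claims $S_1$ with rate beta (all names and values are ours, not from the text):

# Sketch: Lundberg coefficient via uniroot for exponential claims
> beta <- 1; prem <- 1.5                  # (NPC): prem > E[S1] = 1/beta
> h <- function(r) log(beta / (beta - r)) - r * prem  # h(r) = log M_{S1 - pi_1}(r)
> R <- uniroot(h, lower = 1e-6, upper = beta - 1e-6)$root
> R                                       # unique positive root of h, here approx. 0.58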

Theorem 5.11 (Lundberg's exponential bound). Assume (NPC) holds and $R > 0$ exists. Then
\[ \psi(c_0) \leq e^{-R c_0} \qquad \text{for all } c_0 \geq 0. \]

Proof. It suffices to prove that $\psi_t(c_0) \leq e^{-R c_0}$ for all $t \in \mathbb{N}$ because $\psi_t(c_0) \to \psi(c_0)$ for $t \to \infty$. We apply Lemma 5.8 to the finite horizon ruin probability $\psi_t(c_0)$ to obtain the following proof by induction.

$t = 1$: We apply Chebychev's inequality to obtain for the Lundberg coefficient $R > 0$ and any $c_0 \geq 0$
\[ \psi_1(c_0) = P_{c_0}[\tau \leq 1] = P[S_1 - \pi_1 > c_0] = P\left[ e^{R(S_1 - \pi_1)} > e^{R c_0} \right] \leq e^{-R c_0} M_{S_1 - \pi_1}(R) = e^{-R c_0}. \]

$t \to t+1$: We assume that the claim holds true for $\psi_t(c_0)$ and any $c_0 \geq 0$. Then with Lemma 5.8
\[ \psi_{t+1}(c_0) = \int_{c_0}^{\infty} dF(y) + \int_{-\infty}^{c_0} \psi_t(c_0 - y) \, dF(y) \leq \int_{c_0}^{\infty} e^{-R(c_0 - y)} \, dF(y) + \int_{-\infty}^{c_0} e^{-R(c_0 - y)} \, dF(y) = e^{-R c_0} M_{S_1 - \pi_1}(R) = e^{-R c_0}, \]
due to the choice of the Lundberg coefficient $R > 0$ (note that $e^{-R(c_0 - y)} \geq 1$ for $y \geq c_0$). This proves the Lundberg bound. □

Remarks on Lundberg's exponential bound.

• Under (NPC) and the existence of the Lundberg coefficient $R > 0$ we have an exponentially decaying bound on the ultimate ruin probability as the initial capital $c_0 \to \infty$, i.e.
\[ \psi(c_0) \leq e^{-R c_0}. \]

• Set $\varepsilon > 0$ (small). There exists $c_0 = c_0(R, \varepsilon) \geq 0$ such that $\psi(c_0) \leq \varepsilon$. This means that in the Lundberg case we can specify a maximal admissible ruin probability as tolerance $\varepsilon$, and then we can choose an appropriate initial capital $c_0$ which implies that the ultimate ruin probability $\psi(c_0)$ is bounded by this tolerance.

• The existence of the Lundberg coefficient $R > 0$ implies that $M_{S_1}(R) < \infty$ and, using Chebychev's inequality,
\[ P[S_1 > x] = P\left[ e^{R S_1} > e^{R x} \right] \leq e^{-R x} M_{S_1}(R) \propto e^{-R x} \qquad \text{as } x \to \infty. \]
This means that the claims $S_1$ have exponentially decaying tails, i.e. they are so-called light tailed claims.

A main question is whether this exponential bound can be improved in the case where the Lundberg coefficient exists. The difficulty in most selected cases is that the ultimate ruin probability cannot be calculated explicitly. An exception is the Bernoulli case.

Proposition 5.12 (Bernoulli random walk). Assume that $X_t$ are i.i.d. with $P[X_t = 1] = p$ and $P[X_t = -1] = 1-p$ for given $p > 1/2$. For all $c_0 \in \mathbb{N}_0$ we have
\[ \psi(c_0) = \left( \frac{1-p}{p} \right)^{c_0 + 1}. \]


Note that this model is obtained by assuming $\pi_t \equiv 1$ and $S_t \in \{0, 2\}$ with probability $p$ of having a zero claim.

Proof. We choose a finite interval $(-1, a)$ for $a \in \mathbb{N}$ and define for fixed $c_0 \in [0, a) \cap \mathbb{N}_0$ the stopping time
\[ \tau_a = \inf\{ s \in \mathbb{N}_0 ; \, C_s = c_0 + Z_s \notin (-1, a) \}. \]
The random walk theorem implies $\tau_a < \infty$, $P$-a.s., because the interval $(-1, a)$ is finite. We define the random variable
\[ Y_t = \left( \frac{1-p}{p} \right)^{c_0 + Z_t} = \left( \frac{1-p}{p} \right)^{C_t}. \]
It satisfies
\[ E[Y_t \,|\, Y_{t-1}] = Y_{t-1} \, E\left[ \left( \frac{1-p}{p} \right)^{X_t} \bigg| Y_{t-1} \right] = Y_{t-1} \left[ p \, \frac{1-p}{p} + (1-p) \left( \frac{1-p}{p} \right)^{-1} \right] = Y_{t-1}, \]
thus $(Y_t)_{t \geq 0}$ is a martingale. Then also the stopped process $(Y_{\tau_a \wedge t})_{t \geq 0}$ is a martingale. Moreover, the latter martingale is bounded and since the stopping time is finite, $P$-a.s., we can apply the stopping theorem (uniform integrability), see Section 10.10 in Williams [97], which provides
\[ \left( \frac{1-p}{p} \right)^{c_0} = E[Y_0] = E[Y_{\tau_a}] = P_{c_0}[C_{\tau_a} = -1] \left( \frac{1-p}{p} \right)^{-1} + P_{c_0}[C_{\tau_a} = a] \left( \frac{1-p}{p} \right)^{a} \]
\[ = P_{c_0}[C_{\tau_a} = -1] \left( \frac{1-p}{p} \right)^{-1} + \left( 1 - P_{c_0}[C_{\tau_a} = -1] \right) \left( \frac{1-p}{p} \right)^{a}, \]
where the last step follows because $(C_t)_{t \in \mathbb{N}_0}$ leaves the interval $(-1, a)$, $P_{c_0}$-a.s., either at $-1$ or at $a$. This provides the identity
\[ P_{c_0}[C_{\tau_a} = -1] = \frac{ \left( \frac{1-p}{p} \right)^{c_0} - \left( \frac{1-p}{p} \right)^{a} }{ \left( \frac{1-p}{p} \right)^{-1} - \left( \frac{1-p}{p} \right)^{a} }. \]
Finally, note that $\{C_{\tau_a} = -1\}$ is increasing in $a$ and thus
\[ \psi(c_0) = P_{c_0}[\tau < \infty] = \lim_{a \to \infty} P_{c_0}[C_{\tau_a} = -1] = \left( \frac{1-p}{p} \right)^{c_0 + 1}, \]
because $p > 1-p$. This proves the proposition. □

The Lundberg coefficient for the Bernoulli random walk is found by the positive solution of
\[ M_{-X_1}(R) = (1-p) e^{R} + p e^{-R} = 1, \qquad \text{i.e.} \qquad R = \log\left( \frac{p}{1-p} \right) > 0. \]
This together with Proposition 5.12 provides in the Bernoulli case
\[ \psi(c_0) = \left( \frac{1-p}{p} \right) e^{-R c_0}. \]
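This explicit case is convenient for validating a Monte Carlo simulation of the surplus process; a sketch in R (the parameter choices are illustrative, and the finite horizon t.max approximates the ultimate ruin probability from below):

# Monte Carlo check of Proposition 5.12 for the Bernoulli random walk
> set.seed(1)
> p <- 0.6; c0 <- 3; t.max <- 1000; n.sim <- 10000
> ruined <- replicate(n.sim, {
    Z <- cumsum(sample(c(1, -1), t.max, replace = TRUE, prob = c(p, 1 - p)))
    any(c0 + Z < 0)
  })
> mean(ruined)                            # Monte Carlo estimate of psi(c0)
> ((1 - p) / p)^(c0 + 1)                  # exact value: (2/3)^4 = 0.1975...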


That is, the Lundberg bound is optimal in the sense that we cannot improve the exponential order of decay: the Lundberg coefficient $R$ already provides the optimal order.

In most cases we cannot explicitly calculate the ultimate ruin probability $\psi(c_0)$. Exceptions are the Bernoulli random walk of Proposition 5.12 and the Cramér-Lundberg process in continuous time with an exponential claim size distribution, see (5.3.8) in Rolski et al. [89]. In other cases where the Lundberg coefficient exists we apply Lundberg's exponential bound of Theorem 5.11, or refined versions thereof. But the following question remains: what can we do if the Lundberg coefficient does not exist, i.e. if the tail probability of $S_t$ does not necessarily decay exponentially? The latter is quite typical in non-life insurance modeling.

5.3 Pollaczek-Khinchin formula

5.3.1 Ladder epochs

The practically oriented reader may skip this section. We assume (NPC) throughout this section, thus we know that $C_t \to \infty$, $P_{c_0}$-a.s., and $\psi(0) < 1$. Under these assumptions we can study the (local) minima of the surplus process. This study is done by looking at the ladder heights that define these minima. We follow Bühlmann [18], Section 6.2.6, Feller [45], Chapter XII, and Rolski et al. [89], Chapter 6. We define the stopping times $\tau_0^- = 0$ and for $k \in \mathbb{N}$
\[ \tau_k^- = \begin{cases} \inf\left\{ t > \tau_{k-1}^- ; \, Z_t < Z_{\tau_{k-1}^-} \right\} & \text{if } \tau_{k-1}^- < \infty, \\ \infty & \text{otherwise.} \end{cases} \]
$\tau_k^-$ is called the $k$-th strong descending ladder epoch, see (6.3.6) in Rolski et al. [89]. These stopping times form an increasing sequence that records the arrivals of new ladder heights (descending records). For their distribution functions we have under the i.i.d. property of the $X_t$'s (independent and stationary increments)
\[ P\left[ \tau_k^- < \infty \,\big|\, \tau_{k-1}^- < \infty \right] = P\left[ \inf\left\{ t > \tau_{k-1}^- ; \, Z_t < Z_{\tau_{k-1}^-} \right\} < \infty \,\Big|\, \tau_{k-1}^- < \infty \right] \]
\[ = P\left[ \inf\{ t > 0 ; \, Z_t < Z_0 \} < \infty \,\big|\, \tau_0^- < \infty \right] = P\left[ \inf\{ t > 0 ; \, Z_t < 0 \} < \infty \right] = \psi(0) < 1. \]
The probability of a finite ladder epoch is exactly equal to the ultimate ruin probability $\psi(0)$ with initial capital $c_0 = 0$.

Note that we could have $\pi_t - S_t \geq 0$, $P$-a.s., which would imply that the ultimate ruin probability $\psi(0) = 0$ because the premium collected is bigger than the maximal claim, $P$-a.s. We exclude this situation as it is not interesting for ruin probability considerations and because the insured will (hopefully) never pay a premium that exceeds his maximal loss in any situation. Henceforth, under (NPC) we throughout assume that $\psi(0) \in (0,1)$ (where the upper bound follows from (NPC)).
We define the random variable
\[ K^+ = \sup\{ k \in \mathbb{N}_0 ; \, \tau_k^- < \infty \}. \]
$K^+$ counts the total number of finite ladder epochs, i.e. the total number of strong descending records. We have (applying the tower property several times)
\[ P[K^+ = k] = P\left[ \tau_k^- < \infty, \, \tau_{k+1}^- = \infty \right] = \psi(0)^k (1 - \psi(0)), \]
that is, the total number of finite ladder epochs has a geometric distribution with success probability $1 - \psi(0) \in (0,1)$ under (NPC). On the set $\{K^+ = k\}$, $k \geq 1$, we study the ladder heights which are for $l \leq k$ given by
\[ Z_l^+ = Z_{\tau_{l-1}^-} - Z_{\tau_l^-} > 0, \qquad P\text{-a.s.} \]
The random variable $Z_l^+$ measures by which amount the old local minimum $Z_{\tau_{l-1}^-}$ is improved. Due to the i.i.d. property of the $X_t$'s we have
\[ P\left[ \bigcap_{l=1}^{k} \{Z_l^+ \leq x_l\} \,\Big|\, K^+ = k \right] = \prod_{l=1}^{k} P\left[ Z_l^+ \leq x_l \,\big|\, \tau_l^- < \infty \right] = \prod_{l=1}^{k} H(x_l), \tag{5.3} \]
where the distribution function $H$ neither depends on $k$ nor on $l$. Thus, the ladder heights $(Z_l^+)_{l=1,\ldots,k}$ are i.i.d. on the set $\{K^+ = k\}$. Finally, we consider the maximal height achieved by $(-Z_t)_{t \in \mathbb{N}_0}$; this is the global minimum of the random walk $(Z_t)_{t \in \mathbb{N}_0}$,
\[ M^+ = \sum_{l=1}^{K^+} Z_l^+ = Z_0 - Z_{\tau_{K^+}^-} = -Z_{\tau_{K^+}^-} = \sup_{t \in \mathbb{N}_0} (-Z_t) = -\inf_{t \in \mathbb{N}_0} Z_t. \]
This now allows us to study the ultimate ruin probability as follows. Choose initial capital $c_0 \geq 0$. The ultimate ruin probability is given by
\[ \psi(c_0) = P_{c_0}\left[ \inf_{t \in \mathbb{N}_0} C_t < 0 \right] = P_{c_0}\left[ \inf_{t \in \mathbb{N}_0} C_t - c_0 < -c_0 \right] = P\left[ \inf_{t \in \mathbb{N}_0} Z_t < -c_0 \right] \]
\[ = P\left[ M^+ > c_0 \right] = \sum_{k \in \mathbb{N}_0} P[K^+ = k] \, P\left[ \sum_{l=1}^{K^+} Z_l^+ > c_0 \,\Big|\, K^+ = k \right] \]
\[ = (1 - \psi(0)) \sum_{k \in \mathbb{N}} \psi(0)^k \left( 1 - P\left[ \sum_{l=1}^{K^+} Z_l^+ \leq c_0 \,\Big|\, K^+ = k \right] \right) = (1 - \psi(0)) \sum_{k \in \mathbb{N}} \psi(0)^k \left( 1 - H^{*k}(c_0) \right). \]
This proves Spitzer's formula, which is Corollary 6.3.1 in Rolski et al. [89]:

Theorem 5.13 (Spitzer's formula). Assume $\psi(0) \in (0,1)$. Then for $c_0 \geq 0$
\[ \psi(c_0) = (1 - \psi(0)) \sum_{k \in \mathbb{N}} \psi(0)^k \left( 1 - H^{*k}(c_0) \right). \]


The previous theorem goes back to Frank Ludvig Spitzer (1926-1992). It gives us another description of the ruin probability under (NPC). The main difficulty is the determination of the ladder height distribution $H$ defined in (5.3). In special cases this can be calculated explicitly. We give the Cramér-Lundberg case below; for further details we also refer to Rolski et al. [89], Section 6.4.3. The random walk is given by, see (5.2),
\[ Z_t = \sum_{u=1}^{t} (\pi_u - S_u) = \sum_{u=1}^{t} X_u. \]
In the next section we consider a special case thereof.

5.3.2 Cramér-Lundberg process

In classical (continuous time) ruin theory one starts with a homogeneous Poisson point process $(N_t)_{t \in \mathbb{R}_+}$ having constant intensity $\lambda v > 0$ for the arrival of claims. The premium income is modeled proportionally to time with constant premium rate $\pi v > 0$. The continuous time surplus process is then defined by $C_0 = c_0 \geq 0$ and for $t > 0$
\[ C_t = c_0 + \pi v t - \sum_{u=1}^{N_t} S_u, \tag{5.4} \]
with i.i.d. claim amounts $S_u$ satisfying $S_u > 0$, $P$-a.s., and with these claim amounts being independent of the claims arrival process $(N_t)_{t \in \mathbb{R}_+}$. This continuous time surplus process $(C_t)_{t \in \mathbb{R}_+}$ is called the Cramér-Lundberg process. Definition 5.2 of the ruin time is then extended to continuous time, namely
\[ \widetilde{\tau} = \inf\{ s \in \mathbb{R}_+ ; \, C_s < 0 \}. \]
Note that ruin can only occur at time points where claims happen, otherwise the continuous time surplus process $(C_t)_{t \in \mathbb{R}_+}$ is strictly increasing with constant slope $\pi v > 0$ (in fact, the continuous time surplus process is a spectrally negative Lévy process, see Chapter 1 in Kyprianou [68]). We define the inter-arrival times between two claims by $W_u$, $u \in \mathbb{N}$. For the homogeneous Poisson point process $(N_t)_{t \in \mathbb{R}_+}$ these inter-arrival times are i.i.d. exponentially distributed with parameter $\lambda v > 0$. Therefore, we can rewrite the continuous time surplus process in these claims arrival times by, define $V_n = \sum_{u=1}^{n} W_u$,
\[ C_n \stackrel{\text{def.}}{=} C_{V_n} = c_0 + \pi v V_n - \sum_{u=1}^{N_{V_n}} S_u = c_0 + \sum_{u=1}^{n} (\pi v W_u - S_u). \]
This is exactly in the set-up of Definition 5.1 with i.i.d. premia $\pi_u = \pi v W_u$, $u \in \mathbb{N}$. A crucial thing that has changed is time, moving from $t \in \mathbb{R}_+$ to operational time $n \in \mathbb{N}_0$, and therefore
\[ P[\widetilde{\tau} < \infty \,|\, C_0 = c_0] = P_{c_0}[\tau < \infty] = \psi(c_0), \tag{5.5} \]


with $\pi_u = \pi v W_u$. For (NPC) we require a premium rate $\pi v > 0$ such that
\[ 0 < E[X_1] = \pi v \, E[W_1] - E[S_1] = \frac{\pi v}{\lambda v} - E[S_1], \qquad \text{i.e.} \qquad \pi v > \lambda v \, E[S_1]. \]
The exponential distribution has the lack-of-memory property, which means that the waiting time for the next claim does not depend on how long we have already been waiting for it. It is this property which allows us to calculate $H$ explicitly in the Cramér-Lundberg/compound Poisson case (5.4), namely, for $x \geq 0$
\[ H(x) = 1 - E[S_1]^{-1} \int_x^{\infty} P[S_1 > y] \, dy. \tag{5.6} \]
We do not prove this statement; it uses the Wiener-Hopf factorization, for details we refer to Theorem 6.4.4 in Rolski et al. [89]. Note that $H$ is a distribution function on $\mathbb{R}_+$ because $\int_0^{\infty} P[S_1 > y] \, dy = E[S_1]$. This then allows us to state the following theorem, which gives the Félix Pollaczek (1892-1981) and Aleksandr Yakovlevich Khinchin (1894-1959) formula.
(1892-1981) and Aleksandr Yakovlevich Khinchin (1894-1959) formula.

Theorem 5.14 (Pollaczek-Khinchin formula). Assume we have the compound Poisson model (5.4) with (NPC) given by $\rho = \lambda v \, E[S_1]/(\pi v) \in (0,1)$. The ultimate ruin probability for initial capital $c_0 \geq 0$ is given by
\[ \psi(c_0) = (1 - \rho) \sum_{k \in \mathbb{N}} \rho^k \left( 1 - H^{*k}(c_0) \right), \]
with distribution function $H$ given by (5.6).

Proof. See Rolski et al. [89], Theorem 6.4.4. □

Remark. In the compound Poisson model (5.4) with (NPC) one can also prove an integral equation for the ultimate ruin probability, given by
\[ \psi(c_0) = \frac{\lambda v}{\pi v} \left( \int_{c_0}^{\infty} (1 - F_S(x)) \, dx + \int_0^{c_0} \psi(c_0 - x)(1 - F_S(x)) \, dx \right), \]
with distribution function $S_1 \sim F_S$. We do not prove this statement because the Pollaczek-Khinchin formula is sufficient for our purposes. The exact assumptions and a proof of this integral equation are, for instance, provided in Rolski et al. [89], Theorem 5.3.2.

We conclude that for the compound Poisson case (5.4) we have three different descriptions for the ultimate ruin probability: (i) the probabilistic description, (ii) the Pollaczek-Khinchin formula from renewal theory, and (iii) the integral equation. Depending on the problem one then chooses the most convenient one, i.e. we can apply different techniques coming from different fields to solve the questions.


5.4 Subexponential claim sizes

A distribution function $F$ supported on $\mathbb{R}_+$, i.e. $F(0) = 0$, is called subexponential if
\[ \lim_{x \to \infty} \frac{1 - F^{*2}(x)}{1 - F(x)} = 2. \]
We start with a technical lemma that gives properties of subexponential distribution functions and a characterization. We follow the proofs in Rolski et al. [89], Section 2.5.2.

Lemma 5.15 (subexponential distribution functions). Assume $F$ is subexponential. Then the following statements hold true:

1. For all $n \in \mathbb{N}$
\[ \lim_{x \to \infty} \frac{1 - F^{*n}(x)}{1 - F(x)} = n. \]
In fact, this is an if and only if statement.

2. For all $r > 0$
\[ \lim_{x \to \infty} e^{rx} (1 - F(x)) = \infty. \]

3. For all $\varepsilon > 0$ there exists $D < \infty$ such that for all $n \geq 2$ and all $x \geq 0$
\[ \frac{1 - F^{*n}(x)}{1 - F(x)} \leq D (1 + \varepsilon)^n. \]

Proof of Lemma 5.15. We start with the following statement for subexponential distribution functions $F$: for all $t \in \mathbb{R}$
\[ \lim_{x \to \infty} \frac{1 - F(x-t)}{1 - F(x)} = 1. \tag{5.7} \]
We first prove (5.7). Choose $t \geq 0$, then we have for $x > t$, using the monotonicity of $F$,
\[ \frac{1 - F^{*2}(x)}{1 - F(x)} - 1 = \frac{F(x) - F^{*2}(x)}{1 - F(x)} = \int_0^x \frac{1 - F(x-y)}{1 - F(x)} \, dF(y) \]
\[ = \int_0^t \frac{1 - F(x-y)}{1 - F(x)} \, dF(y) + \int_t^x \frac{1 - F(x-y)}{1 - F(x)} \, dF(y) \geq F(t) + (F(x) - F(t)) \, \frac{1 - F(x-t)}{1 - F(x)}. \]
This implies (the sandwich is for $\liminf_{x \to \infty}$ and $\limsup_{x \to \infty}$)
\[ 1 \leq \liminf_{x \to \infty} \frac{1 - F(x-t)}{1 - F(x)} \leq \limsup_{x \to \infty} \frac{1 - F(x-t)}{1 - F(x)} \leq \limsup_{x \to \infty} \frac{1}{F(x) - F(t)} \left( \frac{1 - F^{*2}(x)}{1 - F(x)} - 1 - F(t) \right) = \frac{1 - F(t)}{1 - F(t)} = 1. \]
For $t < 0$ note that, substituting $y = x - t$ and using the case $-t > 0$ already proved,
\[ \lim_{x \to \infty} \frac{1 - F(x-t)}{1 - F(x)} = \lim_{x \to \infty} \left( \frac{1 - F(x)}{1 - F(x-t)} \right)^{-1} = \lim_{y \to \infty} \left( \frac{1 - F(y-(-t))}{1 - F(y)} \right)^{-1} = 1. \]
This proves (5.7). The second auxiliary statement is
\[ \lim_{x \to \infty} \int_0^x \frac{1 - F(x-y)}{1 - F(x)} \, dF(y) = 1. \tag{5.8} \]
This is an immediate consequence of
\[ \frac{1 - F^{*2}(x)}{1 - F(x)} - 1 = \frac{F(x) - F^{*2}(x)}{1 - F(x)} = \int_0^x \frac{1 - F(x-y)}{1 - F(x)} \, dF(y). \tag{5.9} \]
We now turn to the proof of the first statement of Lemma 5.15. We prove the claim by induction. For $n = 1, 2$ the statement holds true by definition. Thus, we assume that it holds true for $n \geq 2$ and we would like to prove it for $n+1$. Choose $\varepsilon > 0$, then there exists $x_0$ such that for all $x > x_0$
\[ \left| \frac{1 - F^{*n}(x)}{1 - F(x)} - n \right| < \varepsilon. \]
This implies for $x > x_0$
\[ \frac{1 - F^{*(n+1)}(x)}{1 - F(x)} - 1 = \frac{F(x) - F^{*(n+1)}(x)}{1 - F(x)} = \int_0^x \frac{1 - F^{*n}(x-y)}{1 - F(x)} \, dF(y) \]
\[ = \int_0^{x-x_0} \frac{1 - F^{*n}(x-y)}{1 - F(x-y)} \, \frac{1 - F(x-y)}{1 - F(x)} \, dF(y) + \int_{x-x_0}^x \frac{1 - F^{*n}(x-y)}{1 - F(x)} \, dF(y). \]
The second integral is non-negative and using (5.7) we obtain
\[ \limsup_{x \to \infty} \int_{x-x_0}^x \frac{1 - F^{*n}(x-y)}{1 - F(x)} \, dF(y) \leq \limsup_{x \to \infty} \frac{F(x) - F(x-x_0)}{1 - F(x)} = -1 + \limsup_{x \to \infty} \frac{1 - F(x-x_0)}{1 - F(x)} = 0. \]
For the first integral we have for $x > x_0$, using the triangle inequality,
\[ \left| \int_0^{x-x_0} \frac{1 - F^{*n}(x-y)}{1 - F(x-y)} \, \frac{1 - F(x-y)}{1 - F(x)} \, dF(y) - n \int_0^{x-x_0} \frac{1 - F(x-y)}{1 - F(x)} \, dF(y) \right| \leq \varepsilon \int_0^{x-x_0} \frac{1 - F(x-y)}{1 - F(x)} \, dF(y). \]
Finally observe
\[ \int_0^{x-x_0} \frac{1 - F(x-y)}{1 - F(x)} \, dF(y) = \int_0^x \frac{1 - F(x-y)}{1 - F(x)} \, dF(y) - \int_{x-x_0}^x \frac{1 - F(x-y)}{1 - F(x)} \, dF(y); \]
the first integral converges to 1, see (5.8), and the second integral converges to 0 because it is non-negative with
\[ \limsup_{x \to \infty} \int_{x-x_0}^x \frac{1 - F(x-y)}{1 - F(x)} \, dF(y) \leq \limsup_{x \to \infty} \frac{F(x) - F(x-x_0)}{1 - F(x)} = -1 + \limsup_{x \to \infty} \frac{1 - F(x-x_0)}{1 - F(x)} = 0. \]
Collecting all the terms, this proves that for all $\varepsilon > 0$ there exists $x_1 \geq x_0$ such that for all $x > x_1$ we have
\[ \left| \frac{1 - F^{*(n+1)}(x)}{1 - F(x)} - (n+1) \right| \leq 4\varepsilon. \]
This proves the first statement of Lemma 5.15. We now turn to the second statement of the lemma. Note that for $0 < y < x$
\[ e^{rx} (1 - F(x)) = (1 - F(x-y)) \, e^{r(x-y)} \, e^{ry} \, \frac{1 - F(x)}{1 - F(x-y)}. \]
Choose $\varepsilon > 0$ and $y > r^{-1} \log(3/(1-\varepsilon)) > 0$. With (5.7) there exists $x_0$ such that for all $x > x_0$
\[ e^{rx} (1 - F(x)) \geq (1-\varepsilon)(1 - F(x-y)) \, e^{r(x-y)} \, e^{ry} > 3 (1 - F(x-y)) \, e^{r(x-y)}. \]
This implies that the function is strictly increasing with limit $+\infty$. So there remains the proof of the last statement of Lemma 5.15. Define $\alpha_n = \sup_{x \geq 0} (1 - F^{*n}(x))/(1 - F(x))$. Note that the first assertion of the lemma implies that $\alpha_n < \infty$. Moreover, we have $1 - F^{*(n+1)}(x) = 1 - F * F^{*n}(x) = 1 - F(x) + \int_0^x (1 - F^{*n}(x-y)) \, dF(y)$. This implies for any $x_0 \in (0, \infty)$
\[ \alpha_{n+1} = \sup_{x \geq 0} \frac{1 - F(x) + \int_0^x (1 - F^{*n}(x-y)) \, dF(y)}{1 - F(x)} \leq 1 + \sup_{0 \leq x \leq x_0} \int_0^x \frac{1 - F^{*n}(x-y)}{1 - F(x)} \, dF(y) + \sup_{x > x_0} \int_0^x \frac{1 - F^{*n}(x-y)}{1 - F(x)} \, dF(y) \]
\[ \leq 1 + \frac{1}{1 - F(x_0)} + \sup_{x > x_0} \int_0^x \frac{1 - F^{*n}(x-y)}{1 - F(x-y)} \, \frac{1 - F(x-y)}{1 - F(x)} \, dF(y) \leq 1 + \frac{1}{1 - F(x_0)} + \alpha_n \sup_{x > x_0} \int_0^x \frac{1 - F(x-y)}{1 - F(x)} \, dF(y) \]
\[ = 1 + \frac{1}{1 - F(x_0)} + \alpha_n \sup_{x > x_0} \left( \frac{1 - F^{*2}(x)}{1 - F(x)} - 1 \right), \]
where we have used (5.9) in the last step. The subexponentiality of $F$ implies that for all $\varepsilon > 0$ there exists $x_0$ such that
\[ \alpha_{n+1} \leq 1 + \frac{1}{1 - F(x_0)} + \alpha_n (1 + \varepsilon). \]
Iteration provides
\[ \alpha_{n+1} \leq \left( 1 + \frac{1}{1 - F(x_0)} \right) \sum_{k=0}^{n-1} (1+\varepsilon)^k + (1+\varepsilon)^n \alpha_1 \leq \left( 1 + \frac{1}{1 - F(x_0)} \right) \sum_{k=0}^{n} (1+\varepsilon)^k \leq \left( 1 + \frac{1}{1 - F(x_0)} \right) \frac{1}{\varepsilon} \, (1+\varepsilon)^{n+1}, \]
which proves the claim for $D = (1 + (1 - F(x_0))^{-1})/\varepsilon \in (0, \infty)$. This proves Lemma 5.15. □

Statements 1 and 3 of Lemma 5.15 will be important in the analysis of the Pollaczek-Khinchin formula. Statement 2 of Lemma 5.15 says that for subexponential distributions the moment generating function for $r > 0$ does not exist: choose $X \sim F$ with $F$ subexponential, then
\[ E[e^{rX}] = \int_0^{\infty} P\left[ e^{rX} > y \right] dy = \int_0^{\infty} P[X > \log(y)/r] \, dy = r \int_{-\infty}^{\infty} e^{rx} \, P[X > x] \, dx = \infty. \tag{5.10} \]
We conclude that for any $r > 0$ the moment generating function of subexponential distributions does not exist, and therefore there is no Lundberg coefficient in this case. We call such subexponential distributions heavy tailed distributions.


Theorem 2.5.5 in Rolski et al. [89] gives an important sufficient condition for having a subexponential distribution.

Lemma 5.16 (regularly varying survival function). Assume that $F$ is supported on $\mathbb{R}_+$ and has a regularly varying survival function at infinity with index $\alpha \in (0, \infty)$, i.e. for all $y > 0$
\[ \lim_{x \to \infty} \frac{1 - F(xy)}{1 - F(x)} = y^{-\alpha}, \]
then $F$ is subexponential.

Proof. Assume that $X_1$ and $X_2$ are two i.i.d. random variables with regularly varying survival functions with parameter $\alpha \in (0, \infty)$. Note that we have for all $\delta \in (0,1)$
\[ \{X_1 + X_2 > x\} \subseteq \{X_1 > (1-\delta)x\} \cup \{X_2 > (1-\delta)x\} \cup \{X_1 > \delta x, \, X_2 > \delta x\}. \]
The i.i.d. property implies
\[ P[X_1 + X_2 > x] \leq 2 \, P[X_1 > (1-\delta)x] + P[X_1 > \delta x]^2. \]
Thus, we have
\[ \limsup_{x \to \infty} \frac{1 - F^{*2}(x)}{1 - F(x)} \leq \inf_{\delta \in (0,1)} \limsup_{x \to \infty} \frac{2(1 - F((1-\delta)x)) + (1 - F(\delta x))^2}{1 - F(x)} = \inf_{\delta \in (0,1)} 2 (1-\delta)^{-\alpha} = 2. \]
On the other hand we have for any positively supported distribution function $F$, see also (5.9),
\[ \frac{1 - F^{*2}(x)}{1 - F(x)} = 1 + \frac{F(x) - F^{*2}(x)}{1 - F(x)} = 1 + \int_0^x \frac{1 - F(x-y)}{1 - F(x)} \, dF(y) \geq 1 + \int_0^x dF(y) = 1 + F(x), \]
since by assumption $F(0) = 0$. This immediately implies that
\[ \liminf_{x \to \infty} \frac{1 - F^{*2}(x)}{1 - F(x)} \geq 2. \]
Note that the lower bound holds true for any distribution function supported on $\mathbb{R}_+$. □

Remarks 5.17. Lemma 5.16 gives the connection to classical extreme value theory. In extreme value theory one distinguishes three different domains of attraction for tail behavior, see Section 3.3 in Embrechts et al. [39]: (i) the Weibull case, which are distribution functions with finite right endpoint of their support; (ii) the Gumbel case, which are light tailed to moderately heavy tailed distribution functions; (iii) the Fréchet case, which are heavy tailed distribution functions. The Fréchet case is exactly characterized by regularly varying survival functions with (tail) index $\alpha \in (0, \infty)$, see Theorem 3.3.7 in Embrechts et al. [39]. This index has already been met in Section 3.2, see formula (3.4). Lemma 5.16 now says that every distribution function that belongs to the Fréchet domain of attraction is also subexponential. However, the class of subexponential distribution functions is larger than the class of distribution functions with regularly varying survival functions: the Weibull distribution of Section 3.2.2 with shape parameter $0 < \tau < 1$ is subexponential but does not have a regularly varying survival function (see Example 1.4.3 in Embrechts et al. [39]), and also the log-normal distribution is subexponential but does not have a regularly varying survival function (see Example 1.4.7 in Embrechts et al. [39]).

                                     subexponential   regularly varying at ∞
gamma distribution                   no               no
Weibull distribution with τ < 1      yes              no
log-normal distribution              yes              no
log-gamma distribution               yes              yes
Pareto distribution                  yes              yes

Table 5.1: Subexponentiality and regular variation at infinity.

We apply the Pollaczek-Khinchin formula, see Theorem 5.14, to obtain the following result in the subexponential case.

Theorem 5.18 (subexponential case, Embrechts-Veraverbeke). Assume we have the compound Poisson model (5.4) with (NPC) given by $\rho = \lambda v \, E[S_1]/(\pi v) \in (0,1)$. Moreover, we assume that the ladder height distribution function $H$ given by (5.6) is subexponential. Then we have
\[ \lim_{c_0 \to \infty} \frac{\psi(c_0)}{1 - H(c_0)} = \frac{\rho}{1 - \rho}. \]

Proof. Our aim is to apply Lemma 5.15 to the Pollaczek-Khinchin formula. The latter provides
\[ \lim_{c_0 \to \infty} \frac{\psi(c_0)}{1 - H(c_0)} = (1 - \rho) \lim_{c_0 \to \infty} \sum_{k \in \mathbb{N}} \rho^k \, \frac{1 - H^{*k}(c_0)}{1 - H(c_0)}. \]
Our aim is to exchange the limit $c_0 \to \infty$ and the infinite summation. Note that Lemma 5.15 provides point-wise convergence of the last terms to $k$ as $c_0 \to \infty$; therefore our aim is to find a uniform integrable upper bound so that we can apply the dominated convergence theorem. To this end we choose $\varepsilon \in (0, 1/\rho - 1)$. Then Lemma 5.15 implies that there exists $D < \infty$ such that for all $k \geq 1$ and $c_0 \geq 0$ we have a uniform integrable upper bound given by
\[ (1 - \rho) \sum_{k \in \mathbb{N}} \rho^k \, \frac{1 - H^{*k}(c_0)}{1 - H(c_0)} \leq (1 - \rho) \sum_{k \in \mathbb{N}} \rho^k D (1+\varepsilon)^k = (1 - \rho) D \sum_{k \in \mathbb{N}} (\rho(1+\varepsilon))^k < \infty, \]
because $\rho(1+\varepsilon) < 1$. Thus, we have found a uniform integrable upper bound and this allows us to exchange the two limits. This provides
\[ \lim_{c_0 \to \infty} \frac{\psi(c_0)}{1 - H(c_0)} = (1 - \rho) \sum_{k \in \mathbb{N}} \rho^k \lim_{c_0 \to \infty} \frac{1 - H^{*k}(c_0)}{1 - H(c_0)} = (1 - \rho) \sum_{k \in \mathbb{N}} \rho^k \, k. \]
The last term is the expected value of the geometric distribution, which is given by $\rho/(1-\rho)$. This proves the theorem. □

Example 5.19 (Pareto claim sizes). We assume that we are in
the compound Poisson model of Theorem 5.14. The claim size
distribution of S₁ is given by a Pareto distribution with threshold
θ > 0 and tail parameter α > 1. Under these assumptions we
calculate the ladder height distribution H. For x ≥ θ

H(x) = 1 − E[S₁]^{−1} ∫_x^∞ P[S₁ > y] dy = 1 − ((α − 1)/(α θ)) ∫_x^∞ (y/θ)^{−α} dy = 1 − (1/α) (x/θ)^{−α+1}.

This implies that H has a regularly varying survival function
with tail index α − 1 > 0. Therefore, Lemma 5.16 implies that H is subexponential
and we can apply Theorem 5.18 to obtain

lim_{c₀→∞} ψ(c₀) (c₀/θ)^{α−1} = γ/((1 − γ) α).

That is, we have found in the Pareto (subexponential) case for α > 1

ψ(c₀) ∼ (γ/((1 − γ) α)) (c₀/θ)^{−α+1} as c₀ → ∞.
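To get a feeling for the orders of magnitude involved, the following small Python sketch (not part of the original notes; the parameter values for θ, α and γ are purely illustrative) evaluates this power law approximation for growing initial capital.

def pareto_ruin_asymptotic(c0, theta=1.0, alpha=2.5, gamma=0.9):
    """Asymptotic ruin probability approximation for Pareto annual claims."""
    return gamma / ((1.0 - gamma) * alpha) * (c0 / theta) ** (-(alpha - 1.0))

for c0 in [10.0, 100.0, 1000.0]:
    # note the slow power law decay c0**(-(alpha-1)), in contrast to the
    # exponentially decaying Lundberg bound of the light tailed case
    print(c0, pareto_ruin_asymptotic(c0))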

Conclusions. We conclude that the heavy tailed case may
lead to a much more dangerous ruin behavior. In Example
5.19 we obtain for the asymptotic ruin behavior a power law
decay as the initial capital goes to infinity, whereas in the light
tailed case we obtain the exponentially decaying Lundberg
bound, see Theorem 5.11. This is an impressive example showing that
heavy tailed claims require careful risk management practice.
For instance, an excess-of-loss reinsurance cover with retention level M > θ would completely change the ruin behavior of a company facing
Pareto distributed claims S_t, t ∈ ℕ. Also the triggers of ruin are very different in
the two cases. In the light tailed case it is the big mass of claims that causes ruin,
whereas in the heavy tailed case it is the single large claim event that causes ruin.


The most general version of asymptotic ruin behavior in the subexponential case
goes back to Paul Embrechts and Noël Veraverbeke [41]. However, an important
missing piece in the argument was provided by Charles M. Goldie.
The Pareto case had previously been solved by Bengt von Bahr [8].



Chapter 6

Premium Calculation Principles

From the random walk Theorem 5.4 and from Assumption 5.6 we see that we need
to charge an (expected) premium that exceeds the expected claim amount E[S_t],
otherwise there is ultimate ruin, P-a.s. This is referred to as the net profit condition
(NPC). In the present chapter we assume that the premium π_t is deterministic; then
(NPC) reads as π_t > E[S_t]. For simplicity (because we consider a fixed accounting
year in this chapter) we drop the time index t and then (NPC) is given by

π > E[S],    (6.1)

with total (annual) claim amount S ~ F_S. In this chapter

• we justify why the insurance company can charge a premium π that exceeds
  the average claim amount E[S], i.e. why the insured is willing to pay a premium
  that exceeds his expected claim amount E[S]; and

• we give different pricing principles to calculate premium loadings π − E[S] > 0.

Simple solution (expected value principle). Choose a fixed constant α > 0 and
charge (to everyone) the premium

π = (1 + α) E[S].    (6.2)

Are we happy with this solution?

Example 6.1 (expected value principle). We consider two different portfolios with
claims S₁ and S₂ having the same mean E[S₁] = E[S₂]. Under the previous simple
solution both insured pay the same insurance premium

π = (1 + α) E[S₁] = (1 + α) E[S₂] > E[S₂] = E[S₁].

We give an explicit distributional example.

• Assume S₁ ~ Γ(γ, c) with mean E[S₁] = γ/c, and

• S₂ ≡ γ/c is a constant.

Observe that there is absolutely no uncertainty in portfolio S₂, that is, we can
perfectly predict claim S₂ (and, of course, also the insured can perfectly predict
his claim). But then it is natural that the insured is not willing to pay a premium
that exceeds his (maximal possible) loss S₂ = γ/c, i.e. (hopefully) he refuses to pay
insurance premium π > E[S₂] = S₂. Moreover, the risk characteristics are rather
different between the two portfolios S₁ and S₂. ∎

Conclusion. The premium loading should be risk-based! That is, the loading
π − E[S] > 0 should reflect the risk of fluctuations of S around its mean E[S].

6.1 Simple risk-based principles

The first notion of risk is usually described by the variance of a random variable.
Therefore, we assume in this section that the second moment of S exists.

Variance loading principle. Choose a fixed constant α > 0 and define the
insurance premium by

π = E[S] + α Var(S).
Revisiting Example 6.1 we obtain insurance premia using variance loadings

π₁ = E[S₁] + α Var(S₁) = γ/c + α γ/c² > γ/c = E[S₂] + α Var(S₂) = π₂.

That is, for the risky position S₁ we now charge a premium that strictly exceeds the
expected value and the loading is zero for the deterministic claim S₂. An unpleasant
feature of the variance loading principle is that the calibration is difficult because
the loading constant α is not scaling invariant, that is, the principle is not invariant
under scalings such as changes of currencies, etc. Let us give an example. Assume
that r_fx > 0 is the (deterministic) exchange rate between two different currencies.
Assume r_fx ≠ 1, then we obtain

π_fx = E[r_fx S] + α Var(r_fx S) = r_fx E[S] + r_fx² α Var(S) ≠ r_fx π.

This non-linearity of the variance implies that the premium cannot easily be scaled
with exchange rates and inflation indexes. Therefore, one often studies modifications
of the variance principle which brings us to the next principle.

Standard deviation loading principle. Choose a fixed constant α > 0 and
define the insurance premium by

π = E[S] + α Var(S)^{1/2} = E[S] (1 + α Vco(S)),

where the last equality requires that E[S] > 0.

This principle gives an explicit meaning to the loading constant in (6.2), namely
it says that the loading constant should be proportional to the coefficient of variation
of S, or the corresponding confidence bounds measured in terms of standard
deviations. If we revisit Example 6.1 we obtain premia

π₁ = E[S₁] + α Var(S₁)^{1/2} = γ/c + α γ^{1/2}/c > γ/c = E[S₂] + α Var(S₂)^{1/2} = π₂.

For the risky position S₁ we charge a premium that strictly exceeds the expected
claim and the loading is zero for the deterministic claim S₂. The standard deviation
loading principle is usually better understood than the variance loading principle
because practitioners often have a good feeling for appropriate ranges of the
coefficient of variation. For instance, they know that for certain lines of business
it should be around 10%. Moreover, this principle is invariant under changes of
currencies. Assume that r_fx > 0 is again the (deterministic) exchange rate between
two different currencies. Then we obtain the identity

π_fx = E[r_fx S] + α Var(r_fx S)^{1/2} = r_fx (E[S] + α Var(S)^{1/2}) = r_fx π.
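The different scaling behavior of the two principles is easy to check numerically. The following minimal Python sketch is not part of the original notes; the moments, the loading constants and the exchange rate are hypothetical choices.

import math

E_S, Var_S = 1000.0, 250_000.0   # illustrative first two moments of S
alpha_var, alpha_std = 1e-4, 0.05
r_fx = 1.2                       # hypothetical deterministic exchange rate

pi_var = E_S + alpha_var * Var_S
pi_std = E_S + alpha_std * math.sqrt(Var_S)

# premia computed after converting the claim r_fx * S to the other currency:
pi_var_fx = r_fx * E_S + alpha_var * r_fx**2 * Var_S
pi_std_fx = r_fx * E_S + alpha_std * r_fx * math.sqrt(Var_S)

print(pi_var_fx, r_fx * pi_var)  # differ: variance loading is not scale invariant
print(pi_std_fx, r_fx * pi_std)  # agree: standard deviation loading is scale invariant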
The previous examples consider rather simple premium loading principles and there
are more principles of this type such as the modified variance principle. In the next
section we describe more sophisticated principles which are motivated by the economic
behavior of financial agents and give risk measurement and risk management
perspectives. These more advanced principles try to describe decision making and
include:

• utility theory pricing principles

• Esscher premium principle

• probability distortion pricing principles

• cost-of-capital principles based on risk measures

• deflator pricing principles


Exercise 13. We would like to insure the following car fleet:

  i               vᵢ    λᵢ     E[Y₁⁽ⁱ⁾]   Vco(Y₁⁽ⁱ⁾)
  passenger car   40    25%    2000       2.5
  delivery van    30    23%    1700       2.0
  truck           10    19%    4000       3.0

Assume that the car fleet can be modeled by a compound Poisson distribution.

1. Calculate the expected claim amount of the car fleet.

2. Calculate the premium for the car fleet using the variance loading principle
   with α = 3 · 10⁻⁶.



6.2 Advanced premium calculation principles


In this section we consider more advanced principles for the calculation of premium
loadings. These considerations can also be viewed as an introduction to economic
decision making, risk measurement and risk management.

6.2.1 Utility theory pricing principles

Utility theory aims at modeling the happiness index of financial
agents making economic decisions. That is, for a financial
agent holding a position X, we try to evaluate an index that
quantifies the happiness generated by this position X.
Utility theory can be introduced in a rather general framework
using preference orderings. If this system of preference ordering
is sufficiently regular then there exists a so-called numerical representation
for the preference ordering; for details we refer to the
book of Föllmer-Schied [47].
We always start from the latter and assume that there exists a
John von Neumann (1903-1957) and Oskar Morgenstern
(1902-1977) representation for the preference ordering on a given
set

𝒳 ⊂ L¹(Ω, F, P).

The set 𝒳 describes the (risky) positions X ∈ 𝒳 of interest.
In this set-up the X's reflect gains. Thus, we restrict ourselves to a set 𝒳 of
available risky positions X and among these positions we would like to choose the
position which makes us as happy as possible. The von Neumann-Morgenstern
representation equips us with a utility function u with the following properties:

u : I → ℝ

is strictly increasing on a non-empty interval I ⊂ ℝ, where we assume that X ∈ I,
P-a.s., for all X ∈ 𝒳. Two examples for u are given in Figure 6.1.
In general, we are interested in risk-averse utility functions u : I → ℝ which make
the additional assumption that u is strictly concave on I, see Figure 6.1. This
risk-averse utility function now allows us to define a preference ordering on the set
of all risky positions in 𝒳.

Definition 6.2. Assume u : I → ℝ is strictly increasing and strictly concave on
the non-empty interval I ⊂ ℝ (with X ∈ I, P-a.s., for all X ∈ 𝒳). Then we prefer
the position X ∈ 𝒳 over the position Y ∈ 𝒳, write X ≽ Y, if

E[u(X)] ≥ E[u(Y)].


Figure 6.1: lhs: exponential utility function with α = 0.05 and I = ℝ, see (6.6);
rhs: power utility function with γ ∈ {0.5, 1, 1.5} and I = ℝ₊, see (6.7).

Colloquially speaking this means that holding position X makes us at least as
happy as holding position Y, therefore we prefer position X over position Y. Thus,
Definition 6.2 introduces a preference ordering ≽ on 𝒳. If E[u(X)] > E[u(Y)] we
strictly prefer X over Y and we write X ≻ Y; if E[u(X)] = E[u(Y)] we are
indifferent between X and Y and we write X ∼ Y.

For u ∈ C², strictly increasing and strictly concave means

u′ > 0 and u″ < 0 on I, respectively.

Strict increasing property. Strictly increasing implies that for X ≥ Y, P-a.s.,
and X > Y with positive P-probability we have

E[u(X)] > E[u(Y)],    (6.3)

i.e. we strictly prefer X over Y. In this context, X always has the interpretation of
a gain and if the gain of position X dominates the gain of position Y (in the above
sense) we have strict preference X ≻ Y. We conclude: u introduces a preference
ordering on 𝒳 where positive outcomes of X ∈ 𝒳 describe gains and negative
outcomes losses.

Strict concavity property. Strict concavity implies that we can apply Jensen's
inequality which provides for all X ∈ 𝒳

E[u(X)] ≤ u(E[X]),    (6.4)

and if X ∈ 𝒳 is non-deterministic we even have a strict inequality in (6.4). Thus,
for non-deterministic positions, strict concavity of u implies E[X] ≻ X. The
interpretation of this preference ordering is that under risk-aversion we try to avoid
uncertainties, which results in the fact that we always prefer the (deterministic)
mean value E[X] over the corresponding random outcome X.

This latter property is exactly the argument why policyholders are willing to pay
an insurance premium that exceeds their average claim amount E[Y], and hence
finance (NPC). Assume that a policyholder has (deterministic) initial wealth c₀ and
he faces a risk that may reduce his wealth by (the random amount) Y. Hence, he
holds a risky position X = c₀ − Y and his happiness index of this position is given
by E[u(c₀ − Y)] if u describes the (risk-averse) utility function of this policyholder.
The strict concavity and increasing properties now imply the following preference

E[u(c₀ − Y)] < u(c₀ − E[Y]).

The left-hand side describes the present happiness and the right-hand side describes
the happiness that he would achieve if he could exchange Y by E[Y]. Therefore,
any deterministic premium π > E[Y] such that

E[u(c₀ − Y)] < u(c₀ − π) < u(c₀ − E[Y]),

would make him happier than his current position c₀ − Y. Thus, strict concavity
and the increasing property of u imply that he is willing to pay any premium π in
the (non-empty) interval

( E[Y], c₀ − u⁻¹(E[u(c₀ − Y)]) ),    (6.5)

to improve his happiness position. The lower bound of this interval is the (NPC)
and the upper bound is the maximal price that the policyholder will just tolerate
according to his risk-averse utility function u (this bound may also be infinite).
The less risk-averse he is the narrower the interval will get. The extreme case of
risk-neutrality, which corresponds to the linear function u(x) = x, will just provide
that the upper bound is equal to the lower bound in (6.5), and no insurance is
necessary.


The two most popular utility functions are, see also Figure 6.1:

• exponential utility function, constant absolute risk-aversion (CARA) utility
  function: for α > 0 (defined on I = ℝ)

  u(x) = (1/α)(1 − exp{−αx});    (6.6)

• power utility function, constant relative risk-aversion (CRRA) utility function,
  isoelastic utility function (defined on I = ℝ₊): for γ > 0

  u(x) = x^{1−γ}/(1 − γ) for γ ≠ 1, and u(x) = log x for γ = 1.    (6.7)

Example 6.3 (exponential utility function). Assume that the policyholder has
exponential utility function (6.6), he has initial wealth c₀ and he faces a risky
position Y ∈ L¹(Ω, F, P) with Var(Y) > 0 and Y ≥ 0, P-a.s. This implies that
the expected claim is given by E[Y] > 0. The exponential utility function has the
following properties

u′(x) = exp{−αx} > 0 and u″(x) = −α exp{−αx} < 0.

Therefore, it is strictly increasing and concave on ℝ, see Figure 6.1 (lhs). Its inverse
is given by

u⁻¹(y) = −(1/α) log(1 − αy).

This implies that acceptable premia lie in the non-empty interval, see (6.5),

( E[Y], (1/α) log E[exp{αY}] ),

where the upper bound is infinite if the moment generating function of Y does not
exist in α. The important observation in this example is that the price tolerance
in π does not depend on the initial wealth c₀ of the policyholder. We will see that
this property holds true only for the exponential utility function, and we may
ask how realistic this property is in real world decision making. ∎
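For a concrete feel of this interval, the following hedged Python sketch (not from the notes) evaluates it for a gamma-distributed claim Y ~ Γ(g, c), whose moment generating function is M_Y(r) = (c/(c − r))^g for r < c; all parameter values are hypothetical.

import math

g, c = 2.0, 0.01      # illustrative gamma shape and rate, mean g/c = 200
alpha = 0.002         # risk aversion; we need alpha < c for the mgf to exist

mean_Y = g / c
upper = (g / alpha) * math.log(c / (c - alpha))  # = (1/alpha) * log M_Y(alpha)
print(f"acceptable premia lie in ({mean_Y:.2f}, {upper:.2f})")

With these numbers the upper bound is roughly 223, i.e. the policyholder tolerates a loading of about 12% above the pure risk premium.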

Example 6.4 (power utility function). Assume that the policyholder has power
utility function (6.7), he has initial wealth c₀ > 1 and he faces a risky position Y ~
Bernoulli(p = 1/2). This implies that the expected claim is given by E[Y] = 1/2.
The power utility function has the following properties

u′(x) = x^{−γ} > 0 and u″(x) = −γ x^{−γ−1} < 0.

Therefore, it is strictly increasing and concave on I = ℝ₊, see Figure 6.1 (rhs).
For our example we choose γ = 1. In this case the inverse of the utility function is
given by

u⁻¹(y) = exp{y}.

We calculate the upper bound in (6.5),

c₀ − u⁻¹(E[u(c₀ − Y)]) = c₀ − exp{E[log(c₀ − Y)]}
= c₀ − exp{(1/2) log(c₀) + (1/2) log(c₀ − 1)}
= c₀ − (c₀(c₀ − 1))^{1/2} =: b(c₀).

This implies that any possible premium lies in the non-empty interval, see (6.5),

( 1/2, c₀ − (c₀(c₀ − 1))^{1/2} ).

The important observation in this example is that the price tolerance in π depends
on the initial wealth c₀ > 1 of the policyholder.
The function b is defined on (1, ∞) and we have

lim_{c₀→1} b(c₀) = 1 and lim_{c₀→∞} b(c₀) = 1/2;

the second statement can be seen by applying l'Hôpital's
rule to b(c₀) = c₀(1 − (1 − 1/c₀)^{1/2}). These limits say that
if the policyholder is very poor, i.e. c₀ is close to 1, he is
willing to pay almost the maximal possible claim size 1
as premium; on the other hand if he is very rich, i.e. c₀
is close to ∞, he is only willing to pay the average
claim amount E[Y] = 1/2 because, basically, he can do the risk bearing himself.

[Plot: the function b(c₀).]

The derivative of b is given by

b′(c₀) = 1 − (2c₀ − 1)/(2 (c₀(c₀ − 1))^{1/2}) = ((c₀(c₀ − 1))^{1/2} − (c₀ − 1/2))/(c₀(c₀ − 1))^{1/2}
= ((c₀² − c₀)^{1/2} − (c₀² − c₀ + 1/4)^{1/2})/(c₀(c₀ − 1))^{1/2} < 0.

This shows that we have strict monotonicity in the initial capital c₀ > 1, i.e. the
richer the policyholder the narrower the price tolerance interval (6.5), see also
Example 6.14, below. ∎
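A quick numerical check of the two limits and of the monotonicity of b (an illustrative sketch, not part of the notes):

import math

def b(c0: float) -> float:
    """Upper premium bound b(c0) = c0 - sqrt(c0*(c0-1)) from Example 6.4."""
    return c0 - math.sqrt(c0 * (c0 - 1.0))

for c0 in [1.0001, 1.5, 2.0, 5.0, 50.0, 5000.0]:
    print(f"c0 = {c0:>9}: b(c0) = {b(c0):.6f}")
# the output decreases from values close to 1 towards the limit 1/2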

Definition 6.5 (utility indifference price). The utility indifference price π =
π(u, F_S, c₀) ∈ ℝ for utility function u, initial capital c₀ ∈ I and risky position
S ~ F_S is given by the solution of (subject to existence)

u(c₀) = E[u(c₀ + π − S)].

Of course, π and S need to be such that c₀ + π − S ∈ I, P-a.s. This may give rise
to restrictions on the range of S if I is a bounded interval, see also Example 6.4.
Note that if the utility indifference price π exists, it is unique. This follows from
the strict monotonicity of u.

The utility indifference price given in Definition 6.5 gives the insurance company's
point of view. It is assumed that the insurance company has initial capital c₀ ∈ I,
similar to the surplus process given in Definition 5.1. It will then only accept an
insurance contract S at price π if the resulting utility does not decrease, i.e. if it is
at least indifferent between accepting S at price π and not selling such a contract.

Jensen's inequality and the strict increasing property of u immediately provide the
following corollary.

Corollary 6.6. The utility indifference price π = π(u, F_S, c₀) for initial capital c₀,
risk-averse utility function u and risky position S ~ F_S satisfies

π = π(u, F_S, c₀) > E[S].

Proof. Exercise. □
tes
Example 6.7 (exponential utility function). Assume we have initial capital c₀ ∈ ℝ,
exponential utility function (6.6) with risk-aversion parameter α > 0, and we would
like to insure a risky position S ~ N(μ, σ²). Thus, we need to solve

(1/α)(1 − exp{−αc₀}) = E[(1/α)(1 − exp{−α(c₀ + π − S)})].

This is equivalent to solving

exp{απ} = E[exp{αS}] = exp{αμ + α²σ²/2}.

Therefore we obtain utility indifference price for S

π = π(u, F_S, c₀) = μ + ασ²/2 > μ.

Remarks.

• We obtain an insurance premium π > μ = E[S] (Jensen's inequality) and
  therefore (NPC) is fulfilled.

• The loading is of the form ασ²/2 = α Var(S)/2. That is, for the exponential
  utility function we get a variance loading. This is exact for S ~ N(μ, σ²)
  and it is approximately true for other distribution functions (using a Taylor
  approximation).

• The utility indifference price does not depend on the initial capital c₀. ∎
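A Monte Carlo sanity check of this closed form (a hedged sketch, not from the notes; the parameters μ, σ and α are arbitrary): the indifference condition is equivalent to π = (1/α) log E[exp{αS}], which we estimate empirically.

import math, random

mu, sigma, alpha = 100.0, 20.0, 0.01
random.seed(1)
samples = [random.gauss(mu, sigma) for _ in range(200_000)]

pi_closed = mu + alpha * sigma**2 / 2.0
# empirical certainty-equivalent premium (1/alpha) * log E[exp(alpha*S)]
pi_mc = math.log(sum(math.exp(alpha * s) for s in samples) / len(samples)) / alpha
print(pi_closed, pi_mc)  # should agree up to Monte Carlo error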


Exercise 14. Choose the exponential utility function (6.6).

• Calculate the utility indifference price for S ~ Γ(γ, c).

• Calculate the utility indifference price for S ~ Pareto(θ, α).

Proposition 6.8. Assume u ∈ C² is a risk-averse utility function on ℝ. The
following two are equivalent:

• the utility indifference prices π = π(u, F_S, c₀) do not depend on c₀ for all S;

• the utility function u is of the form

  u(x) = a − b exp{−cx},

  for a ∈ ℝ and b, c > 0.

Remark. Note that the utility function u(x) = a − b exp{−cx} gives the same
preference ordering as the exponential utility function (6.6) with c = α: if we have
two different utility functions u(·) and v(·) with v = a + bu for a ∈ ℝ and b ∈ ℝ₊
(positive affine transformation) then they generate the same preference ordering.

Proof of Proposition 6.8. Note that the assumption u ∈ C² is not necessary because concavity
implies that u is differentiable almost everywhere, and this is sufficient to prove the result; for
details on this we refer to Lemma 1.8 in Schmidli [91].
Direction ⇐ is immediately clear just by evaluating Definition 6.5. So we prove direction ⇒.
The following proof is borrowed from Schmidli [91]. Choose S ~ Bernoulli(p). Definition 6.5
implies for this Bernoulli claim S the identity

u(c₀) = E[u(c₀ + π − S)] = p u(c₀ + π − 1) + (1 − p) u(c₀ + π),

for utility indifference price π = π(p) = π(u, p, c₀) depending on p ∈ (0, 1) only. We now consider
the derivatives w.r.t. c₀ and p. The former provides

u′(c₀) = [p u′(c₀ + π − 1) + (1 − p) u′(c₀ + π)] (∂/∂c₀)(c₀ + π) = p u′(c₀ + π − 1) + (1 − p) u′(c₀ + π),

where in the last step we have used the assumption that the premium π does not depend on
c₀. The derivative w.r.t. p is given by (the implicit function theorem provides existence of the
derivative of π w.r.t. p, denoted by π′(p))

0 = u(c₀ + π − 1) + p u′(c₀ + π − 1) π′(p) − u(c₀ + π) + (1 − p) u′(c₀ + π) π′(p).

Merging the last two identities provides

u′(c₀) π′(p) = u(c₀ + π) − u(c₀ + π − 1).    (6.8)

The strict increasing property of u implies that π′(p) > 0. Next we calculate the derivatives of (6.8)
w.r.t. c₀ and p (again using the implicit function theorem for the latter). This provides the two
identities

u″(c₀) π′(p) = u′(c₀ + π) − u′(c₀ + π − 1),

and

u′(c₀) π″(p) = [u′(c₀ + π) − u′(c₀ + π − 1)] π′(p).

Merging these identities implies

u″(c₀)/u′(c₀) = π″(p)/(π′(p))² = −c < 0,

for some constant c > 0. The last identity follows because the left-hand side is independent of p
and the middle term is independent of c₀. This last identity is a differential equation for the utility
function u whose (unique) solution is exactly given by the exponential function. □

The proof of Proposition 6.8 provides insights into risk-aversion. Define the absolute
and the relative risk-aversions of a twice differentiable utility function u
by

η^{ARA}(x) = η_u^{ARA}(x) = −u″(x)/u′(x) and η^{RRA}(x) = η_u^{RRA}(x) = −x u″(x)/u′(x).
tes
Example 6.9 (exponential utility function). The exponential utility function (6.6)
with risk-aversion parameter α > 0 satisfies for all x ∈ ℝ

η^{ARA}(x) = α.

This explains the terminology constant absolute risk-aversion (CARA) utility. ∎

Example 6.10 (power utility function). The power utility function (6.7) with
risk-aversion parameter γ > 0 satisfies for all x ∈ ℝ₊

η^{RRA}(x) = γ.

This explains the terminology constant relative risk-aversion (CRRA) utility. ∎

Assume that u and v are two utility functions that are defined on the same interval
I. Then, u is more risk-averse than v on I if for any X with range in I we have

u⁻¹(E[u(X)]) ≤ v⁻¹(E[v(X)]).

Proposition 6.11. Assume that u, v ∈ C²(I) are two utility functions defined on
the same interval I ⊂ ℝ. The following are equivalent:

• u is more risk-averse than v on I;

• η_u^{ARA}(x) ≥ η_v^{ARA}(x) for all x ∈ I.


Proof. We first prove direction ⇒. The proof goes by contradiction. Assume that the claim
does not hold true. Due to the twice continuous differentiability of the utility functions
on I there exists a non-empty open interval O ⊂ I such that

η_u^{ARA}(x) = −u″(x)/u′(x) < −v″(x)/v′(x) = η_v^{ARA}(x) for all x ∈ O.

We consider the function u(v⁻¹(·)) on the non-empty open interval v(O) (note that v is continuous
and strictly increasing). We calculate

(d/dz) u(v⁻¹(z)) = u′(v⁻¹(z)) (d/dz) v⁻¹(z) = u′(v⁻¹(z))/v′(v⁻¹(z)) > 0,

because both u and v are strictly increasing, and

(d²/dz²) u(v⁻¹(z)) = u″(v⁻¹(z))/(v′(v⁻¹(z)))² − u′(v⁻¹(z)) v″(v⁻¹(z))/(v′(v⁻¹(z)))³
= (u′(v⁻¹(z))/(v′(v⁻¹(z)))²) [u″(v⁻¹(z))/u′(v⁻¹(z)) − v″(v⁻¹(z))/v′(v⁻¹(z))] > 0 for all z ∈ v(O).

This implies that u(v⁻¹(·)) is a risk-seeking (strictly convex) utility function on the non-empty interval
v(O). Choose a non-deterministic random variable Y such that Y ∈ O, P-a.s. Since O is a
non-empty open interval such a random variable can be chosen (i.e. no concentration in a single
point). This implies that Z = v(Y) is a non-deterministic random variable with range in v(O)
and the strict convexity of u(v⁻¹(·)) on v(O) implies, using Jensen's inequality,

u⁻¹(E[u(Y)]) = u⁻¹(E[u(v⁻¹(v(Y)))]) > u⁻¹(u(v⁻¹(E[v(Y)]))) = v⁻¹(E[v(Y)]).    (6.9)

This is a contradiction and proves direction ⇒.

For the direction ⇐ we consider the function u(v⁻¹(·)) on v(I). This is a strictly increasing
function because u and v are utility functions, see above. Moreover, we have

(d²/dz²) u(v⁻¹(z)) = (u′(v⁻¹(z))/(v′(v⁻¹(z)))²) [η_v^{ARA}(v⁻¹(z)) − η_u^{ARA}(v⁻¹(z))] ≤ 0 for all z ∈ v(I).

The proof then follows similarly to (6.9) using Jensen's inequality. □


The above result has a nice interpretation.

Corollary 6.12. Assume u is more risk-averse than v. We have for the utility
indifference prices

π(u, F_S, c₀) ≥ π(v, F_S, c₀).

Proof. We have the following:

c₀ = u⁻¹(E[u(c₀ + π(u, F_S, c₀) − S)]) ≤ v⁻¹(E[v(c₀ + π(u, F_S, c₀) − S)]).

Since both v⁻¹ and v are strictly increasing we see that π(u, F_S, c₀) ≥ π(v, F_S, c₀). □

The last corollary also explains that the price elasticity interval (6.5) becomes
narrower for decreasing risk-aversion.


Theorem 6.13. Assume u ∈ C³(I) is a risk-averse utility function on I. The
following are equivalent:

• π(u, F_S, c₀) is decreasing in c₀ for all S;

• η_u^{ARA}(x) is decreasing for all x ∈ I.


Proof of Theorem 6.13. We start with direction ⇒. Calculating the derivative w.r.t. c₀ of the
identity in Definition 6.5, using that π = π(u, F_S, c₀) is decreasing in c₀, we obtain

u′(c₀) = E[u′(c₀ + π − S)] (∂/∂c₀)(c₀ + π(c₀)) ≤ E[u′(c₀ + π − S)].

Observe that v = −u′ is a utility function on I. The above inequality then rewrites as
v(c₀) ≥ E[v(c₀ + π − S)], which implies

u⁻¹(E[u(c₀ + π(u, F_S, c₀) − S)]) = c₀ ≥ v⁻¹(E[v(c₀ + π(u, F_S, c₀) − S)]).

Since this holds for any c₀ and S we obtain that v is more risk-averse than u, and Proposition
6.11 implies that η_v^{ARA}(x) ≥ η_u^{ARA}(x) for all x ∈ I. From this we obtain

−u‴/u″ = −v″/v′ ≥ −u″/u′,

and thus for all x ∈ I

(d/dx) η_u^{ARA}(x) = (d/dx)(−u″/u′) = −u‴/u′ + (u″)²/(u′)² = (−u″/u′)(u‴/u″ − u″/u′) ≤ 0.

This proves the first direction of the equivalence. The proof of direction ⇐ is obtained by just
reading the above proof in the other direction (all the statements are equivalences). □
Example 6.14 (power utility function). The power utility function (6.7) with
risk-aversion parameter γ > 0 satisfies for all x ∈ ℝ₊

η^{ARA}(x) = γ x⁻¹.

This is a strictly decreasing function in x ∈ ℝ₊. Therefore the utility indifference
price π(u, F_S, c₀) is a decreasing function in c₀, see Theorem 6.13. This is the
property that economists consider to be reasonable for financial decision making.
This was already explored in Example 6.4. ∎

Exercise 15. Choose the exponential utility function (6.6).

• Assume Y₁, …, Yₙ are i.i.d. ~ Γ(γ, c). Calculate the utility indifference price for
  Σᵢ₌₁ⁿ Yᵢ.

• Assume S ~ CompPoi(λv = n, G = Γ(γ, c)). Calculate the utility indifference
  price for S.

• Compare the two results of the previous items.

• What can be said about diversification benefits?


Exercise 16. Choose the car fleet example from Exercise 13 on page 143. Assume
that this car fleet can be modeled by an appropriate compound Poisson distribution
having gamma claim sizes.

1. Calculate the expected claim amount of the car fleet.

2. Calculate the premium for the car fleet using the utility indifference price
   principle for the exponential utility function with parameter α = 1.5 · 10⁻⁶.

3. Compare Exercises 13 and 16. What happens if we replace the compound
   Poisson distribution by a Gaussian distribution with the same first two moments?


6.2.2 Esscher premium

Choose a random variable S ~ F with finite first moment given by

E[S] = ∫_ℝ s dF(s).

For utility indifference pricing we have modified the payments s by introducing a
happiness index u(c₀ + π − s), see Definition 6.5. The Esscher premium takes a
different approach: instead of acting on the payments s it aims at modifying the
probability distribution F of S.
Classical actuarial practice calculates premium loadings by giving
more weight to bad events compared to good events. Basically,
this means that one does a change of measure towards a
less favorable probability measure. Hans Bühlmann [19] introduced
this idea into the actuarial literature by constructing the
Esscher measure.
Define for α > 0 the Esscher (probability) distribution F_α of F
as follows:

F_α(s) = (1/M_S(α)) ∫_{−∞}^s e^{αx} dF(x),

under the additional assumption that the moment generating function M_S(α) of S
exists in α. Note that this defines a (normalized) distribution function F_α.

Definition 6.15 (Esscher premium). Choose S ~ F and assume that there exists
r₀ > 0 such that M_S(r) < ∞ for all r ∈ (−r₀, r₀). The Esscher premium of S
in α ∈ (0, r₀) is defined by

π_α = E_α[S] = ∫_ℝ s dF_α(s).


Corollary 6.16. Under the assumptions in Definition 6.15 we have

π_α = (d/dr) log M_S(r)|_{r=α} ≥ E[S],

where the inequality is strict for non-deterministic S.

Proof. Note that Lemma 1.1 implies for α ∈ (0, r₀)

π_α = (1/M_S(α)) ∫_ℝ s e^{αs} dF(s) = M_S′(α)/M_S(α) = (d/dr) log M_S(r)|_{r=α}.

The claim then follows from Lemma 1.6. □

Example 6.17 (Esscher premium for Gaussian distributions). Choose α > 0 and
assume that S ~ N(μ, σ²). Then we have

π_α = (d/dr) log M_S(r)|_{r=α} = μ + ασ² > μ = E[S].

In the Gaussian case we obtain the variance loading. Thus, the variance loading,
the exponential utility function and the Esscher premium principles provide exactly
the same insurance premium in the Gaussian case. ∎
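The representation π_α = (d/dr) log M_S(r)|_{r=α} is convenient for computation. The following illustrative Python sketch (not from the notes; parameter values are hypothetical) estimates the Esscher premium by numerically differentiating the empirical log-moment generating function, and compares it with the closed form μ + ασ² of the Gaussian case.

import math, random

mu, sigma, alpha, h = 100.0, 20.0, 0.01, 1e-4
random.seed(2)
xs = [random.gauss(mu, sigma) for _ in range(200_000)]

def log_mgf(r: float) -> float:
    """Empirical log-moment generating function log E[exp(r*S)]."""
    return math.log(sum(math.exp(r * x) for x in xs) / len(xs))

pi_esscher = (log_mgf(alpha + h) - log_mgf(alpha - h)) / (2 * h)  # central difference
print(pi_esscher, mu + alpha * sigma**2)  # should agree up to sampling error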

Exercise 17 (Esscher premium for gamma distributions). Assume that S ~ Γ(γ, c)
with γ, c > 0. Calculate the Esscher premium of S for α ∈ (0, c). ∎

Conclusions.

• The Esscher premium can easily be calculated from the moment generating
  function M_S(r).

• The Esscher premium can only be calculated for light tailed claims, see also
  Section 5.2 on the Lundberg coefficient. Towards all more heavy tailed claims
  the Esscher premium reacts so sensitively that it becomes infinite. In the next
  section we study probability distortion principles that allow for more heavy
  tailed distributions in premium calculations still leading to finite premia.

• In classical economic theory, prices are often derived by the assumption of
  market clearing in a risk exchange economy. That is, if we assume that
  we have (i) an economy with risky positions S₁, …, S_K; (ii) market participants
  who have an exponential utility function with risk aversion parameters
  αᵢ > 0; and (iii) market clearing in the sense that all risky positions are allocated
  to the market participants, then one can prove that the risky positions
  are exactly priced with the Esscher measure of the aggregate market capitalization.
  This is in the spirit of Bühlmann [19] and is, for instance, found in
  Tsanakas-Christofides [96].


6.2.3 Probability distortion pricing principles

In the previous section we have met a pricing principle that was based on probability
distortions. In this first case it was only possible to calculate insurance prices for
light tailed claims because the distortion reacted very sensitively to heavy tails. In
the present section we look at probability distortions from a different angle which
will allow for more flexibility. Assume that S ~ F with S ≥ 0, P-a.s. Then using
integration by parts the expected claim is calculated as

E[S] = ∫₀^∞ x dF(x) = ∫₀^∞ P[S > x] dx.

In this section we directly distort the survival function F̄(x) = P[S > x]. Therefore,
we introduce a distortion function h : [0, 1] → [0, 1] which is a continuous, increasing
and concave function with h(0) = 0 and h(1) = 1; in Figure 6.2 we give two
examples.

Figure 6.2: Distortion functions h of Examples 6.19 and 6.20, below, with γ = 1/2
(power distortion) and q = 0.1 (expected shortfall), respectively.

• h(p) distorts the probability p with the property that h(p) ≥ p for all p ∈ [0, 1]
  because h is increasing and concave with h(0) = 0 and h(1) = 1.

• The concavity of h reflects risk aversion, similar to the utility functions used
  in Section 6.2.1.

• Note that the existence of p ∈ (0, 1) with h(p) > p implies that h(p) > p for
  all p ∈ (0, 1). Therefore, we assume under strict risk-aversion that h(p) > p
  for all p ∈ (0, 1).


Definition 6.18. Assume that h : [0, 1] → [0, 1] is a continuous, increasing and
concave function with h(0) = 0, h(1) = 1 and h(p) > p for all p ∈ (0, 1). The
probability distorted price π_h of S ≥ 0 is defined by (subject to existence)

π_h = E_h[S] = ∫₀^∞ h(P[S > x]) dx.

We obtain a risk loading that provides

E[S] = ∫₀^∞ P[S > x] dx ≤ ∫₀^∞ h(P[S > x]) dx = E_h[S] = π_h,

where the inequality is strict for non-deterministic S.

Remarks.

• Similar to the Esscher premium we modify the probability distribution function
  of the claims S (in contrast to the utility theory approach where we
  modify the claim sizes).

• The probability distortion approach is a technique to
  construct coherent risk measures for bounded random
  variables. For a detailed outline we refer to Freddy
  Delbaen [32], in particular, to the corresponding Example
  4.7 and Corollary 7.6 which relate convex games
  to coherent risk measures.

• This probability distortion approach is similar to life
  insurance pricing where one constructs first order life
  tables out of second order life tables (expected mortality
  rates) in order to have a security and profit margin, see also Denneberg [34].
Example 6.19 (probability distortion for Pareto distribution). Choose claim S ~
Pareto(θ, α) with α > 1 and θ > 0, and probability distortion function, see Example
4.5 in Delbaen [32] and Figure 6.2,

h(p) = p^γ for γ ∈ (0, 1).    (6.10)

The probability distorted price of S is given by

π_h = ∫₀^∞ h(P[S > x]) dx = ∫₀^θ 1 dx + ∫_θ^∞ ((x/θ)^{−α})^γ dx = ∫₀^∞ P[S̃ > x] dx,

where S̃ ~ Pareto(θ, γα). This immediately implies

π_h = θ γα/(γα − 1) > θ α/(α − 1) = E[S] for γ ∈ (1/α, 1).


In contrast to the Esscher premium we can calculate the probability distorted
premium also for heavy tailed claims as long as the risk aversion (concavity of h)
is not too large, i.e. in our case γ ∈ (1/α, 1). ∎
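A small numerical sanity check of this closed form (an illustrative Python sketch, not part of the notes; θ, α and γ are arbitrary choices with γα > 1): we integrate the distorted survival function by a simple midpoint rule with a truncated upper limit.

theta, a, g = 1.0, 2.0, 0.75   # Pareto(theta, alpha) and h(p) = p^gamma
ga = g * a                     # distorted tail index, must satisfy ga > 1

# pi_h = theta + integral_theta^upper (x/theta)^(-ga) dx (tail beyond is ignored)
n, upper = 1_000_000, 1e5
step = (upper - theta) / n
tail = step * sum(((theta + (i + 0.5) * step) / theta) ** (-ga) for i in range(n))
pi_h_numeric = theta + tail
print(pi_h_numeric, theta * ga / (ga - 1.0))  # close to the closed form 3.0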

Exercise 18. Choose the power distortion function (6.10). Calculate the probability
distorted price of S ~ Γ(1, c) and of S ~ Bernoulli(p). ∎

Example 6.20 (expected shortfall). Choose distortion function h : [0, 1] → [0, 1]
as follows, see Remark 7.7 in Delbaen [32] and Figure 6.2: fix q ∈ (0, 1) and define

h(x) = x/q for x ≤ q, and h(x) = 1 otherwise.    (6.11)

Choose S ~ F with S ≥ 0, P-a.s. The left-continuous generalized inverse of F for
γ ∈ (0, 1) is given by, see Chapter 1,

F^{←}(γ) = inf{x ∈ ℝ; F(x) ≥ γ}.

For simplicity we assume that F is continuous and strictly increasing. This simplifies
considerations because then also F^{←} is continuous and strictly increasing and
we have F(F^{←}(γ)) = γ and F^{←}(F(x)) = x, see Chapter 1 (the strictly increasing
property of F would not be necessary for getting the full flavor of this example).
tes
Consider the survival function of S given by F̄(x) = 1 − F(x) = P[S > x]. Note
that under our assumptions

{x < F^{←}(1 − q)} = {F(x) < 1 − q} = {F̄(x) > q}.

This identity implies

π_h = ∫₀^∞ h(P[S > x]) dx = (1/q) ∫_{{F̄(x) ≤ q}} F̄(x) dx + ∫_{{F̄(x) > q}} 1 dx
= (1/q) ∫_{{x ≥ F^{←}(1−q)}} F̄(x) dx + ∫_{{x < F^{←}(1−q)}} 1 dx
= (1/q) ∫_{F^{←}(1−q)}^∞ P[S > x] dx + F^{←}(1 − q).

Note that these identities need more care if F is not strictly increasing. The
continuity and strictly increasing property of F also imply

P[S ≥ F^{←}(1 − q)] = 1 − P[S < F^{←}(1 − q)] = 1 − F(F^{←}(1 − q)) = q.

This provides, using continuity and the strictly increasing property of F,

π_h = (1/P[S ≥ F^{←}(1 − q)]) ∫_{F^{←}(1−q)}^∞ P[S > x] dx + F^{←}(1 − q)
= ∫_{F^{←}(1−q)}^∞ P[S > x | S ≥ F^{←}(1 − q)] dx + F^{←}(1 − q)
= ∫₀^∞ P[S > x | S ≥ F^{←}(1 − q)] dx = E[S | S ≥ F^{←}(1 − q)].

The latter is exactly the so-called Tail-Value-at-Risk (TVaR) or the conditional tail
expectation (CTE) of the random variable S at the 1 − q security level. Moreover,
F^{←}(1 − q) is the Value-at-Risk (VaR) of the random variable S at the 1 − q security
level. The continuity of F implies that this TVaR is equal to the expected shortfall
(ES) of S at the security level 1 − q, that is,

π_h = E[S | S ≥ F^{←}(1 − q)] = (1/q) ∫_{1−q}^1 F^{←}(u) du = ES_{1−q}(S),

see Artzner et al. [5, 6], Acerbi-Tasche [1] and Lemma 2.16 in McNeil et al. [77].
The proof again uses the fact that for continuous distribution functions F we have
F(F^{←}(γ)) = γ, and then the left-hand side of the above statement can be obtained
by a change of variables from the right-hand side. ∎

We conclude that under continuity assumptions the risk measure ES_{1−q}(S) can be
obtained via probability distortion (6.11), and following Delbaen [32], it is therefore
a coherent risk measure, see also the next section.
a coherent risk measure, see also next section.
Exercise 19. Choose probability distortion (6.11) for q = 1% and calculate the
probability distorted price for
S LN(, 2 ),
tes
S Pareto(, ) with > 1,
i.i.d.
Sn = ni=1 Yi with Yi (1, 1) and study the diversification benefit of the
P

probability distorted price of Sn as a function of n N.


no

6.2.4 Cost-of-capital principles using risk measures


Denote by X L1 (, F, P) the set of (risky) positions X of interest, importantly
for this section: X denotes losses. This is different from page 144!
NL

A risk measure % on X is a mapping

%:X R with X 7 %(X).

Remarks.

• A risk measure ϱ attaches to each (risky) position X a value ϱ(X) ∈ ℝ.

• If the risk measure ϱ is the regulatory risk measure then ϱ(X) ∈ ℝ reflects the
  necessary risk bearing capital that needs to be available within the insurance
  company to run business X. This is the minimal equity the insurance company
  needs to hold to balance possible shortfalls in the insurance portfolio.
  This is going to be explained in more detail below.

• By a change of sign in X we can observe the similarities to the expected
  utility framework of Section 6.2.1.

• For having a good risk measure one requires additional properties for ϱ
  such as monotonicity, coherence, etc. This is described below.

• The most commonly used risk measures are: variance, Value-at-Risk (VaR) and
  expected shortfall (ES), already met in Example 6.20. We further discuss
  them below.
Assume a (regulatory) risk measure ϱ : 𝒳 → ℝ with X ↦ ϱ(X) is given. We would
like to price an insurance portfolio S under the assumption X = S − E[S] ∈ 𝒳.
That is, we study the possible losses beyond the best-estimate prediction E[S] of
S. The regulatory capital requirement then prescribes that the insurance company
needs to hold at least risk bearing capital ϱ(S − E[S]). This risk bearing capital
ϱ(S − E[S]) quantifies the necessary financial strength of the insurance company
so that it is able to finance shortfalls beyond the pure risk premium E[S] exactly
up to the amount ϱ(S − E[S]).

We assume ϱ(S − E[S]) > 0 for non-deterministic positions S. Then the insurance
company needs to find shareholders who are willing to provide this risk bearing
capital ϱ(S − E[S]) > 0. The shareholders will provide this capital as soon as the
promised expected return on this (invested) capital is sufficiently high. We call the
expected rate of return on this shareholder capital the cost-of-capital rate r_CoC > 0.
Thus, the shareholders'/investors' expected return is

r_CoC · ϱ(S − E[S]) > 0,

on their investment ϱ(S − E[S]) > 0.

Definition 6.21. The cost-of-capital pricing principle is given by

π_CoC = E[S] + r_CoC · ϱ(S − E[S]).


NL

Interpretation.

• For outcomes S ≤ E[S]: the claim can be financed by the pure risk premium
  E[S], solely.

• For outcomes S > E[S]: the pure risk premium E[S] is not sufficient and the
  shortfall S − E[S] > 0 needs to be paid from ϱ(S − E[S]). Thus, the investor's
  capital ϱ(S − E[S]) is at risk, and he may lose (part of) it. Therefore, he will
  ask for a cost-of-capital rate

  r_CoC > r₀,

  if r₀ denotes the risk-free rate (which he receives on a risk-free bank account with
  the same time to maturity as his investment).


We give desired properties of risk measures. For details we refer to Artzner et
al. [5, 6], McNeil et al. [77] and Föllmer-Schied [47]. The first assumption is that
𝒳 is a convex cone containing ℝ, i.e. it satisfies

(1) c ∈ 𝒳 for all c ∈ ℝ,

(2) X + Y ∈ 𝒳 for all X, Y ∈ 𝒳, and

(3) λX ∈ 𝒳 for all X ∈ 𝒳 and λ > 0.

Then we state the following axioms for risk measures ϱ on 𝒳.

Axioms 6.22 (axioms for risk measures ϱ). Assume ϱ is a risk measure on the
convex cone 𝒳 containing ℝ. Then we define for X, Y ∈ 𝒳, c ∈ ℝ and λ > 0:

(a) normalization: ϱ(0) = 0;

(b) monotonicity: for X, Y with X ≤ Y, P-a.s., we have ϱ(X) ≤ ϱ(Y);

(c) translation invariance: for all X and every c we have ϱ(X + c) = ϱ(X) + c;

(d) positive homogeneity: for all X and for every λ > 0 we have ϱ(λX) = λ ϱ(X);

(e) subadditivity: for all X, Y we have ϱ(X + Y) ≤ ϱ(X) + ϱ(Y).

Observe that some of the axioms imply others, e.g. positive homogeneity implies
normalization: since ϱ(0) = ϱ(λ · 0) = λ ϱ(0) for all λ > 0 this immediately says
ϱ(0) = 0. For a detailed analysis of such implications we refer to Section 6.1 in
McNeil et al. [77] and Section 9.1 in Wüthrich-Merz [103].
For our analysis we require (at least) normalization (a) and translation invariance
(c). We briefly comment on this.
NL

Translation invariance. If we hold a risky position X and if we inject capital
c > 0 then the loss is reduced to X − c. This implies for risk measure ϱ that the
reduced position satisfies

ϱ(X − c) = ϱ(X) − c.

This justifies the definition of the regulatory risk measure as stated above. Namely,
if we sell a risky portfolio S and we collect pure risk premium E[S] then the risk
of the residual loss S − E[S] is given by

ϱ(S − E[S]) = ϱ(S) − E[S].

Normalization and translation invariance. A balance sheet of an insurance
company is called acceptable if its (future) surplus C₁ ∈ 𝒳 satisfies ϱ(−C₁) ≤ 0, see
also Wüthrich [98]. Assume that the insurance company sells a policy S at price
π ≥ E[S] and at the same time it has initial capital c₀ = ϱ(S − E[S]) ≥ 0. Then
the future surplus of the company is given by C₁ = c₀ + π − S. The regulator then
checks the acceptability condition which reads as

ϱ(−C₁) = ϱ(−(c₀ + π − S)) = −c₀ + ϱ(S) − π = −π + E[S] ≤ 0.    (6.12)

Thus, we have an acceptable position. Coming back to the cost-of-capital pricing
principle given in Definition 6.21 this needs to be interpreted as follows: assume
that the initial capital c₀ > 0 is provided by an investor who expects the cost-of-capital
rate r_CoC > r₀ on his investment. Then, the insurance company also needs
to finance the cost-of-capital cash flow r_CoC c₀ = r_CoC ϱ(S − E[S]) to the investor.
This can exactly be done with the cost-of-capital premium π_CoC and the insurance
company keeps its acceptable position in (6.12) if r_CoC c₀ is also considered as a
liability of the insurance company.

(m
Monotonicity and normalization imply that more risky positions are charged
with higher capital requirements and, in particular, if we have only downside risks,
i.e. X ≥ 0, P-a.s., then we will have positive capital charges ϱ(X) ≥ ϱ(0) = 0.

Definition 6.23 (coherent risk measure). The risk measure ϱ is called coherent if
it satisfies Axioms 6.22.

Coherent risk measures were introduced by Artzner et al. [5, 6]
and the properties of coherent risk measures are often regarded
as useful in practice. In particular, the subadditivity property
means that if we merge two portfolios we expect diversification
benefits in the sense of a release of necessary risk bearing capital.

We close this section with a discussion of the three most popular
risk measures.
NL

Example 6.24. The standard deviation risk measure is for S
with finite second moment given by

ϱ(S) = α σ(S) = α Var(S)^{1/2},

for a given parameter α > 0. This risk measure is normalized, positive homogeneous,
and subadditive. But it is neither translation invariant nor monotone. Note
that for the standard deviation risk measure the cost-of-capital pricing principle
coincides with the standard deviation loading principle presented in Section 6.1. ∎

Example 6.25 (Value-at-Risk, VaR). The VaR of S ~ F at security level 1 − q ∈
(0, 1) is given by the left-continuous generalized inverse of F at 1 − q, i.e.

ϱ(S) = VaR_{1−q}(S) = F^{←}(1 − q).

The VaR is normalized, monotone, translation invariant and positive homogeneous,
but it is not subadditive, and hence not coherent. There are many examples in the
literature showing this non-coherence, see, for instance, Artzner et al. [5, 6], McNeil
et al. [77] and Embrechts et al. [40]. ∎

Example 6.26 (expected shortfall). The expected shortfall has already been introduced
in Example 6.20, where we have stated that the expected shortfall is equal
to the TVaR for continuous distribution functions F. Instead of introducing it via
probability distortion functions we can also directly define it. Assume that S ~ F
with F continuous. Then we have

ϱ(S) = TVaR_{1−q}(S) = E[S | S ≥ VaR_{1−q}(S)] = (1/q) ∫_{1−q}^1 VaR_u(S) du = ES_{1−q}(S).

ES_{1−q}(S) is a coherent risk measure on L¹(Ω, F, P). The cost-of-capital pricing
principle is then given by

π = E[S] + r_CoC ES_{1−q}(S − E[S]) = E[S] + r_CoC (ES_{1−q}(S) − E[S]).

This cost-of-capital pricing principle can also be obtained with probability distortion
functions: choose h as in Example 6.20 and define the distortion function
h̃ : [0, 1] → [0, 1] as follows

h̃(x) = (1 − r_CoC) x + r_CoC h(x),

for fixed r_CoC ∈ (0, 1), see Figure 6.3.

for fixed rCoC (0, 1), see Figure 6.3. For a non-negative random variable S
no

probability distortions
1.0
0.8
NL 0.6
0.4
0.2

expected shortfall (ES)


0.0

ES costofcapital loading

0.0 0.2 0.4 0.6 0.8 1.0

Figure 6.3: Distortion functions h of Example 6.20 (expected shortfall) and corre-
sponding he for expected shortfall cost-of-capital loading.


For a non-negative random variable S ≥ 0 with continuous (and strictly increasing)
distribution function we obtain, see Example 6.20,

π_h̃ = ∫₀^∞ h̃(P[S > x]) dx
= (1 − r_CoC) ∫₀^∞ P[S > x] dx + r_CoC ∫₀^∞ h(P[S > x]) dx
= (1 − r_CoC) E[S] + r_CoC ES_{1−q}(S)
= E[S] + r_CoC ES_{1−q}(S − E[S]),

which proves the claim. ∎
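The same identity can be checked by simulation. The following illustrative Python sketch (not from the notes; q, r_CoC and the lognormal sample are hypothetical) applies the cost-of-capital distortion h̃ to the empirical survival function and compares the result with the direct cost-of-capital formula.

import bisect, math, random

q, r_coc = 0.1, 0.06
random.seed(4)
xs = sorted(math.exp(random.gauss(0.0, 0.5)) for _ in range(200_000))

mean_s = sum(xs) / len(xs)
k = int((1 - q) * len(xs))
es = sum(xs[k:]) / (len(xs) - k)           # empirical ES_{1-q}(S)

def h_tilde(p: float) -> float:            # h~(x) = (1-r)x + r*min(x/q, 1)
    return (1 - r_coc) * p + r_coc * min(p / q, 1.0)

def survival(x: float) -> float:           # empirical P[S > x]
    return 1.0 - bisect.bisect_right(xs, x) / len(xs)

n = 20_000
step = xs[-1] / n
pi_coc = step * sum(h_tilde(survival((i + 0.5) * step)) for i in range(n))
print(pi_coc, mean_s + r_coc * (es - mean_s))   # should nearly coincide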

Remarks.

• Solvency II considers VaR_{1−q}(S − E[S]) for 1 − q = 99.5% as the regulatory
  risk measure.

• The Swiss Solvency Test considers ES_{1−q}(S − E[S]) for 1 − q = 99% as the
  regulatory risk measure.

• For r_CoC one often sets 6% above the risk-free rate. However, this is a heavily
  debated number because in stress periods this rate should probably be higher.

Exercise 20. Assume that S ~ N(μ, σ²) has a Gaussian distribution. Choose
1 − q = 99% and r_CoC = 6%. The cost-of-capital pricing principle for the expected
shortfall risk measure gives price

π = μ + r_CoC (σ/q) (2π)^{−1/2} exp{−(Φ⁻¹(1 − q))²/2},

where Φ denotes the standard Gaussian distribution function.

(a) Prove this statement.

(b) Calibrate the security level for the VaR risk measure such that the cost-of-capital
    insurance price is the same as for the expected shortfall risk measure.

(c) Calibrate the standard deviation risk measure loading parameter α > 0 such
    that the price is the same as for the expected shortfall risk measure.

Remark. This parameter calibration only holds true under the Gaussian model
assumption. ∎

6.2.5 Deflator based pricing principles

Up to now¹ we have completely neglected that cash flows also have time values,
i.e., in general, future cash flows need to be discounted for valuation purposes. In

¹This section is for further reading and is treated in detail in the lecture Market-Consistent
Actuarial Valuation, see [99].


a financial mathematics setting insurance cash flow valuation can be considered
as a pricing problem in an incomplete financial market setting. The pricing in
such a financial market setting can be done either by risk neutral measures or,
equivalently, by using (state price) deflators. This provides pricing systems that
are free of arbitrage, also known as the Fundamental Theorem of Asset Pricing, see
Delbaen-Schachermayer [33]. Deflators were introduced in the actuarial literature
by Bühlmann [20, 21, 22] and heavily used in Wüthrich [99] and Wüthrich-Merz
[103]. The terminology deflator was introduced by James Darrell Duffie [37].

Assume that φ is an integrable and strictly positive random
variable with

E[φ] = d₀ = 1/(1 + r₀) ∈ (0, 1].

Then, d₀ can be seen as a deterministic discount factor and
r₀ ≥ 0 can be seen as a deterministic risk-free rate. This is
the general version of a deflator φ. To make deflator pricing comparable to the
previously introduced pricing principles we assume that d₀ = 1, i.e. no time values
are added to cash flows.

Fix φ ∈ L¹(Ω, F, P) strictly positive with d₀ = 1 and assume that φ and S are
positively correlated. Then we can define the deflator based price by (subject to
existence)

π^{(0)} = E[φS] ≥ E[φ] E[S] = E[S].

We use the upper index in π^{(0)} to indicate that we set d₀ = 1.

Thus, all random variables S which are positively correlated with φ receive a positive
premium loading. The next example shows that this is a generalization of the
Esscher premium, or more generally, it can be understood as a probability distortion
principle because φ allows us to define the equivalent probability measure P* by
the Radon-Nikodym derivative as follows

dP*/dP = φ,

because φ is a strictly positive density w.r.t. P for d₀ = 1. Then, we price S under
the equivalent probability measure P* by

π^{(0)} = E[φS] = E*[S].

Example 6.27 (Esscher premium). Choose a random variable S and define φ =
M_S(α)⁻¹ exp{αS} for given α > 0 with M_S(α) < ∞. It follows that φ is strictly
positive, P-a.s., and normalized. That is, φ is a deflator with d₀ = 1. Due to
the FKG inequality, see Fortuin et al. [48], it follows that φ and S are positively
correlated, and thus

π^{(0)} = E[φS] ≥ E[S].

Observe the identity

π^{(0)} = E[φS] = (1/M_S(α)) E[e^{αS} S] = π_α,

which is exactly the Esscher premium π_α, and P* is the Esscher measure corresponding
to F_α, see Section 6.2.2. ∎

The previous example shows that the deflator approach is a
generalization of the Esscher premium. The crucial point is
that φ and S are positively correlated so that we obtain a
positive premium loading. Moreover, this deflator approach
also allows for stochastic discounting by choosing a deflator
with E[φ] ∈ (0, 1), and generalizations to multiperiod problems
are easily possible and straightforward. For more details we
refer to Wüthrich-Merz [103] and Wüthrich [99].

Example 6.28 (cost-of-capital loading with expected shortfall). This example
treats the expected shortfall risk measure. Assume S ~ F
with continuous distribution function F. The VaR at security level 1 − q ∈ (0, 1)
is then given by VaR_{1−q}(S) = F^{←}(1 − q), see Example 6.25. Note again that
F(F^{←}(1 − q)) = 1 − q, see Chapter 1. Choose r_CoC ∈ (0, 1) and define the probability
distortion

φ = (1 − r_CoC) + (r_CoC/q) 1_{{S ≥ VaR_{1−q}(S)}} > 0, P-a.s.

This choice and the continuity of F imply

E[φ] = (1 − r_CoC) + (r_CoC/q) P[S ≥ VaR_{1−q}(S)] = (1 − r_CoC) + (r_CoC/q) q = 1,

that is, we obtain the required normalization. The premium is then given by

π^{(0)} = E[((1 − r_CoC) + (r_CoC/q) 1_{{S ≥ VaR_{1−q}(S)}}) S]
= (1 − r_CoC) E[S] + (r_CoC/q) E[1_{{S ≥ VaR_{1−q}(S)}} S]
= (1 − r_CoC) E[S] + r_CoC E[S | S ≥ VaR_{1−q}(S)]
= E[S] + r_CoC (E[S | S ≥ VaR_{1−q}(S)] − E[S])
= E[S] + r_CoC ES_{1−q}(S − E[S]).

We conclude that we exactly obtain the cost-of-capital loading principle with expected
shortfall as risk measure, see Example 6.26. ∎
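As a final hedged illustration (not from the notes; the exponential sample and the values of q and r_CoC are arbitrary), the following Python sketch builds this deflator empirically, checks the normalization E[φ] = 1 and verifies that E[φS] reproduces the cost-of-capital price.

import random

q, r_coc = 0.05, 0.06
random.seed(5)
xs = sorted(random.expovariate(1.0) for _ in range(200_000))

k = int((1 - q) * len(xs))
var_q = xs[k]                           # empirical VaR_{1-q}(S)
phis = [(1 - r_coc) + (r_coc / q) * (1.0 if x >= var_q else 0.0) for x in xs]

print(sum(phis) / len(phis))            # ~ 1, the normalization E[phi] = 1
price = sum(p * x for p, x in zip(phis, xs)) / len(xs)   # deflator price E[phi*S]
mean_s = sum(xs) / len(xs)
es = sum(xs[k:]) / (len(xs) - k)
print(price, mean_s + r_coc * (es - mean_s))  # deflator price vs CoC formula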



Chapter 7

Tariffication and Generalized Linear Models

Assume we have v ∈ ℕ insurance policies denoted by l = 1, …, v. These insurance
policies should be sufficiently similar such that we obtain a homogeneous insurance
portfolio to which the law of large numbers (LLN) applies, see (1.1). The ideal case
of i.i.d. risks justifies charging the same premium to every policy. If there is no
perfect homogeneity (and there never is) then there are two different possibilities
of charging a premium: (a) everyone pays the same premium, which reflects more
the aspect of social insurance, where one tries to achieve a balance between the
rich and the poor; (b) the individual premium should reflect the quality of the
specific insurance policy, i.e. we try to calculate risk adjusted premia. In the present
chapter we try to achieve (b). We explain this with the compound Poisson model at
hand. The aggregation and the disjoint decomposition properties of the compound
Poisson model S ~ CompPoi(λv, G), see Theorems 2.12 and 2.14, suggest the
following decomposition

S = Σ_{i=1}^{N} Yᵢ = Σ_{l=1}^{v} Σ_{i=1}^{N^{(l)}} Yᵢ^{(l)} = Σ_{l=1}^{v} S_l,

where S_l = Σ_{i=1}^{N^{(l)}} Yᵢ^{(l)} describes the total claim amount of policy l = 1, …, v. This
decoupling provides independent compound Poisson distributions S_l. That is, we
have S_l ~ CompPoi(λ_l, G_l), where we set volume v_l = 1, λ_l > 0 is the expected
number of claims of policy l and Yᵢ^{(l)} ~ G_l describes the claim size distribution of
policy l. This provides the following decomposition of the mean

E[S] = Σ_{l=1}^{v} E[S_l] = Σ_{l=1}^{v} λ_l E[Y₁^{(l)}] = Σ_{l=1}^{v} λE[Y₁] (λ_l E[Y₁^{(l)}])/(λE[Y₁]) = Σ_{l=1}^{v} μ χ^{(l)},

where μ = E[S]/v = λE[Y₁] is the average claim over all policies and χ^{(l)} > 0 reflects
the contribution of policy l = 1, …, v. This means that in the case of heterogeneity
we should determine these risk characteristics χ^{(l)} for every policy l to obtain


risk adjusted premia because these risk characteristics χ^{(l)} describe the differences
between the policies. This would require modeling v different parameters; that is,
every policy is characterized by its covariate z_l (also called feature) and the expected
claim frequency and the expected claim size need to be understood as the following two
regression functions

z_l ↦ λ_l = λ(z_l) and z_l ↦ m_l = m(z_l) = E[Y₁^{(l)}].

The corresponding risk characteristics are then obtained by

z_l ↦ χ^{(l)} = λ(z_l) m(z_l)/μ.

To avoid over-parametrization and to have sufficient volume for a LLN one reduces
the (potential) complexity in z_l. One therefore chooses a fixed (finite) number,
say K, of tariff criteria (like age, type of car, kilometers yearly driven, etc.) and
one divides the total portfolio into sufficiently homogeneous sub-portfolios (categorical
risk classes, risk cells). Then one tries to modify the overall average claim
μ = E[S]/v to these risk classes such that their prices become a function of the
corresponding risk characteristics; for more details we also refer to Wüthrich-Buser
[101]. This way we substantially reduce the number of parameters and statistical
estimation can be done.

For this exposition we assume to only have two tariff criteria (covariates), i.e. K =
2, and we would like to set up a multiplicative tariff structure.
The generalization to K > 2 is then straightforward.

Assume we have K = 2 tariff criteria. The first criterion (covariate) has I risk
characteristics i ∈ {1, …, I} and the second criterion (covariate) has J risk
characteristics j ∈ {1, …, J}. Thus, we have M = I · J different categorical risk classes
(risk cells), see Table 7.1 for an illustration.

        1   ···   j   ···   J
  1
  ...
  i          risk classes (i, j)
  ...
  I

Table 7.1: K = 2 tariff criteria with I and J risk characteristics, respectively.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 169

We assume that policy l belongs to risk class (i, j), and we assume for the corre-
sponding risk characteristics (l) = (i,j) . This provides decomposition

vi,j (i,j) ,
X
E[S] =
i,j

where vi,j denotes the number of policies belonging to risk class (i, j) and (i,j)
describes the quality of that risk class. Our aim is to set up

a multiplicative tariff structure for these K = 2 tariff criteria, i.e. we assume

w)
(i,j) = 1,i 2,j , (7.1)

where k,lk describes the specifics of criterion k if it has risk characteristics lk .

(m
In particular, this means that a multiplicative tariff structure (which is the model
assumption here) may reflect the quality of each risk class (i, j).

Example 7.1 (multiplicative tariff). A classical example in car insurance is the


following: choose as tariff criteria the kilometers yearly driven and the years
driven without an accident.
tes
 1st tariff criterion 1,i : kilometers yearly driven
 2nd tariff criterion 2,j : years driven without an accident (bonus-malus level)

Observe that the 1st tariff criterion is continuous, but typically it is discretized
no

(and categorized) for having finitely many risk characteristics, see Table 7.2 for an
example.

no accident 0 years 1 year 2 years 3 years 4 years 5 years 6+ years


NL

2, 1.2 1.1 1.0 0.9 0.8 0.7 0.5


yearly km 1,
0-10000 0.8
10-15000 0.9
15-20000 1.0
20-25000 1.1 (4,5) = 1,4 2,5 = 1.1 0.8 = 0.88
25000+ 1.2

Table 7.2: Tariffication scheme for K = 2 tariff criteria.

We have K = 2 tariff criteria. Criterion k = 1 has I = 5 risk characteristics


and criterion k = 2 has J = 7 risk characteristics. This gives M = I J = 35
(categorical) risk classes (i, j) for i {1, . . . , I} and j {1, . . . , J}.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


170 Chapter 7. Tariffication and Generalized Linear Models

The general aim is to determine tariff criteria such that they give sufficiently large
homogeneous risk classes. These risk classes are then priced by choosing appropri-
ate multiplicative pricing factors k,lk (under the assumption that a multiplicative
tariff structure (7.1) fits the problem). 

Remarks.

A prior choice of tariff criteria should be done using expert opinion. Statisti-
cal analysis should then select as few as possible significant criteria. However,
also market specifications of competitors are important to avoid adverse se-

w)
lection. If we have categorical covariates like cantons of Switzerland then
machine learning methods may also be used to build appropriate risk cells,
see Wthrich-Buser [101].

(m
Related to the first item: the aim should be to build homogeneous risk classes
of sufficient volume such that a LLN applies and we get statistical significance.

Variable reduction techniques and multivariate statistical analysis need to be


applied to avoid an over-correction of dependent factors, e.g. in the rather
trivial example above, the relation between the factors is not immediately
tes
clear: it could be that kilometers yearly driven is strongly related to years
driven without an accident. If this is the case we might correct twice for the
same factor.

We consider a bivariate model using simple methods for categorical risk


no

classes and will then go over to more sophisticated models using generalized
linear model (GLM) techniques.

Assume we have two tariff criteria (covariates) i and j which give M = I J


categorical risk classes. Our aim is to find appropriate multiplicative pricing factors
NL

1,i , i {1, . . . , I}, and 2,j , j {1, . . . , J}, which describe the risk classes (i, j)
according to the multiplicative tariff structure (7.1).
We define by Si,j the total claim of risk class (i, j) and by vi,j the corresponding
volume with
X X
vi,j = v and Si,j = S.
i,j i,j

This implies that we need to study

E[S] (i,j)
E[Si,j ] = vi,j = vi,j 1,i 2,j , (7.2)
v
where = E[Y1 ] is the average claim per policy over the whole portfolio v,
i.e. E[S] = v, and (i,j) = 1,i 2,j describes the multiplicative tariff structure
for two tariff criteria selected.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 171

7.1 Simple tariffication methods


We start with the method of Robert A. Bailey & LeRoy
J. Simon [10] which was introduced in 1960 for rate-making.
The method of Bailey & Simon is rather simple and it is
not directly motivated by a stochastic model which con-
siders the claim Si,j of risk class (i, j) in a consistent way.
L.J. Simon (right) It specifies parameters , 1,i and 2,j > 0 such that the
following expression is minimized

w)
(Si,j vi,j 1,i 2,j )2
X2 =
X
. (7.3)
i,j vi,j 1,i 2,j

The motivation behind this approach is that X 2 describes the

(m
test statistics of the 2 -goodness-of-fit test, see (3.9). This test
rejects a model if X 2 exceeds the quantile of a 2 -distributed
random variable on a certain significance level. Therefore, the
aim is to choose the parameters such that X 2 becomes as small
as possible.
Note that this approach is not based on a stochastic model it
tes
is just based on a statistical argument. Moreover, it has the R.A. Bailey
following unpleasant feature.
Lemma 7.2. The minimizers of (7.3) have a (systematic) positive bias.
Proof. We denote the minimizers of (7.3) by
b,
b1,i and
b2,j . We would like to prove that
no

X X
vi,j
b b2,j
b1,i Si,j = S.
i,j i,j

This can either be done by first summing over rows i or columns j. Note that
b2,j is found by
the solution of
! X 2 X (Si,j vi,j 1,i 2,j )2
0 = = .
NL

2,j i
2,j vi,j 1,i 2,j
This provides estimates
!1/2
S 2 /(vi,j
P
bb1,i )

b2,j = i
Pi,j .
i vi,j
bb1,i
If we sum over i and plug in the estimates
b2,j we obtain
!1/2 !1/2
2
X X X Si,j
vi,j
bb1,i
b2,j = vi,j
bb1,i .
i i i
vi,j
b b1,i

Next we apply the Schwarz inequality kxk2 kyk2 |hx, yi| to the terms on the right-hand side
which provides the following lower bound
!1/2
2
X X 1/2 Si,j X
vi,j
b b2,j
b1,i (vi,j
bb1,i ) = Si,j .
i i
vi,j
b b1,i i

This proves the claim. 2

Version March 14, 2017, M.V. Wthrich, ETH Zurich


172 Chapter 7. Tariffication and Generalized Linear Models

Example 7.3 (method of Bailey & Simon). We choose an example with two tariff
criteria. The first one specifies whether the car is owned or leased, the second
one specifies the age of the driver. For simplicity we set vi,j 1 and we aim to
determine the tariff factors , 1,i and 2,j . The method of Bailey & Simon then
requires minimization of

2
X (Si,j 1,i 2,j )2
X = .
i,j 1,i 2,j

Note that we need to initialize the estimators for obtaining a unique solution. We

w)
set b = 1 and b1,1 = 1. The observations Si,j are given by, see also Figure 7.1,

21-30y 31-40y 41-50y 51-60y


owned 1300 1200 1000 1200
leased 1800 1300 1300 1500

(m
scatter plot
2000

L leased
O owned
1800

L
tes
1600
claim amount

L
1400

no

O L L
1200

O O
1000

2130y 3140y 4150y 5160y


NL

age classes

Figure 7.1: Observations Si,j .

We have M = I J = 2 4 = 8 risk classes (i, j) and observations Si,j . The number


def.
of parameters to be estimated are r + 1 = I + J 1 = 5 (taking into account
the initialization b = b1,1 = 1). Minimizing X 2 numerically provides the following
multiplicative tariff structure for 1,i , i {1, 2}, and 2,j , j {1, . . . , 4}.

21-30y 31-40y 41-50y 51-60y b1,i


owned 1376 1112 1020 1197 1.0000
leased 1727 1395 1280 1503 1.2548
b2,j 1376 1112 1020 1197

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 173

In this example we have (systematic) positive bias as stated in Lemma 7.2, i.e.
b1,i b2,j = 100 611 > 100 600 =
X X
Si,j = S.
i,j i,j


The method of Robert A. Bailey & Jan Jung (1922-2005)


[9, 63] intends to improve the weakness of the positive bias of the
previous method, see Lemma 7.2 and Example 7.3. But it is still
a simple method that is not directly motivated by a stochastic
model. However, we will see below that it has its groundings in a

w)
stochastic model. It imposes unbiasedness of rows and columns
by definition: Choose , 1,i and 2,j > 0 such that the rows i
and columns j satisfy
J. Jung

(m
J
X J
X
vi,j 1,i 2,j = Si,j , (7.4)
j=1 j=1
I
X I
X
vi,j 1,i 2,j = Si,j . (7.5)
i=1 i=1
tes
Remarks.
This method is also called method of total marginal sums.

It is more robust than the method of Bailey & Simon.


no

If Si,j are independent Poisson distributed with cross-classified means, then


the above system is exactly the MLE system that needs to be solved. We
will discuss this in Section 7.3.1 below.

Both the method of Bailey & Simon and the method of Bailey & Jung are
NL

rather pragmatic methods because they are not directly based on a stochastic
model. Therefore, in the remainder of this chapter we are going to describe
more sophisticated methods which are motivated by a probabilistic model.
Example 7.4 (method of Bailey & Jung, method of total marginal sums). We
revisit the data of Example 7.3. This time we determine the parameters by solving
the system (7.4)-(7.5). This needs to be done numerically and provides the following
multiplicative tariff structure:
21-30y 31-40y 41-50y 51-60y b1,i
owned 1375 1108 1020 1197 1.0000
leased 1725 1392 1280 1503 1.2553
b2,j 1375 1108 1020 1197
We conclude that both methods give similar results for this example. 

Version March 14, 2017, M.V. Wthrich, ETH Zurich


174 Chapter 7. Tariffication and Generalized Linear Models

7.2 Gaussian approximation


7.2.1 Maximum likelihood estimation
In the previous section we have presented two pragmatic tariffication methods. In
this section we give a more advanced method, in the sense that we use an explicit
stochastic model. However, the approach is still pragmatic because the stochastic
model is assumed to be a good approximation to the true tariffication problem.
We consider the claims ratio in risk class (i, j) defined by

w)
Ri,j = Si,j /vi,j .

The expected value of this claim ratio is given by, see (7.2),

E[Ri,j ] = 1,i 2,j .

(m
We use two simple facts:

1. The simplest absolutely continuous distribution is the Gaussian one.

2. Taking logarithms turns products into sums.

Combining this two items implies that we plan to consider the following model
tes
def.
 
Xi,j = log Ri,j N 0 + 1,i + 2,j , 2 .

Thus, taking logarithms may turn the multiplicative tariff structure into an additive
structure. If this logarithm Xi,j of Ri,j has a Gaussian distribution we have nice
no

mathematical properties. Therefore, we assume a log-normal distribution for Ri,j


which hopefully gives a good approximation to the true tariffication problem. These
choices imply for the first two moments
2 /2 2
E[Ri,j ] = e0 + e1,i e2,j and Var(Ri,j ) = E[Ri,j ]2 (e 1).
NL

2
Observe that the mean has the right multiplicative structure, set = e0 + /2 ,
1,i = e1,i and 2,j = e2,j . However, the distributional properties are rather
different from compound models, and the underlying volumes vi,j are also not
considered in an appropriate way. Nevertheless, this log-linear additive Gaussian
structure is often used because of its nice mathematical structure and because
popular statistical methods can be applied.
Set M = I J and define for Xi,j = log Ri,j = log(Si,j /vi,j ) the vector

X = (X1 , . . . , XM )0 = (X1,1 , . . . , X1,J , . . . , XI,1 , . . . , XI,J )0 RM . (7.6)

Note that we change the labeling of the observations because this is going to be
simpler in the sequel. Index m always refers to

m = m(i, j) = (i 1)J + j {1, . . . , M = I J}. (7.7)

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 175

We assume that X has a multivariate Gaussian distribution

X N (Z, ) , (7.8)

with diagonal covariance matrix = 2 diag(w1 , . . . , wM ), parameter vector

= (0 , 1,2 , . . . , 1,I , 2,2 , . . . , 2,J )0 Rr+1 ,

set r + 1 = I + J 1, and design matrix Z RM (r+1) such that for m = m(i, j)

E[Xi,j ] = (Z)m = 0 + 1,i + 2,j .

w)
Throughout we assume that Z has full rank. We initialize 1,1 = 2,1 = 0 and
0 plays the role of the intercept. At the moment the weights wm do not have a
1
natural meaning, often one sets wm = vi,j (inversely proportional to the underlying

(m
volume) because in this case one has

Xi,j 2 2 /vi.j 2
Var(Ri,j ) = Var(e ) = E[Ri,j ] (e 1) E[Ri,j ]2 ,
vi,j

for vi,j large. Thus, the variance of the claims ratio Ri,j is roughly inversely propor-
tional to the underlying volume vi,j . In view of Example 7.3 this gives the following
tes
table where the 1s show to which class the observations belong to:
owned leased 21-30y 31-40y 41-50y 51-60y X = log R
1 1 0 1 0 0 0 7.17
2 1 0 0 1 0 0 7.09
no

1 0 0 0 1 0 6.91
1 0 0 0 0 1 7.09
m 0 1 1 0 0 0 7.50
0 1 0 1 0 0 7.17
0 1 0 0 1 0 7.17
M 0 1 0 0 0 1 7.31
NL

This table needs to be turned into the appropriate form so that it fits to (7.8).
Therefore we need to drop the columns owned and 21-30y because of the
chosen normalization 1,1 = 2,1 = 0. This provides the following table:

intercept leased 31-40y 41-50y 51-60y


1 0 0 0 0

1 0 1 0 0 0
1 0 0 1 0 1,2



Z = 1 0 0 0 1
2,2

1 1 0 0 0 2,3


1 1 1 0 0 2,4
1 1 0 1 0
1 1 0 0 1

Version March 14, 2017, M.V. Wthrich, ETH Zurich


176 Chapter 7. Tariffication and Generalized Linear Models

Under assumption (7.8) we know that X has density


1 1
 
f (x) = exp (x Z)0 1 (x Z) .
(2)M/2 ||1/2 2

b MLE of the parameter vector :


This allows us for the calculation of the MLE

MLE  1

b = Z 0 1 Z Z 0 1 X. (7.9)

w)
The tariff factors can then be estimated by (avoiding the variance correction term
which is appropriate for 2 wm /2  0 )
n o n o n o
b = exp b0MLE , MLE
b1,i = exp b1,i and MLE
b2,j = exp b2,j .

(m
If we have homoscedasticity, i.e. if we assume identical weights wm w and =
b MLE = (Z 0 Z)1 Z 0 X.
2 w1, then the estimator of is given by

Example 7.5 (log-linear model). We use the data Si,j from Example 7.3. Assume
1
wm = vi,j 1 and initialize b = 1 and b1,1 = 1. The log-linear MLE formula (7.9)
provides the following multiplicative tariff structure:
tes
21-30y 31-40y 41-50y 51-60y b1,i
owned 1368 1117 1020 1200 1.0000
leased 1710 1396 1274 1500 1.2495
b2,j 1368 1117 1020 1200
no

We compare the results from the method of Bailey & Simon, the method of total
marginal sums (Bailey & Jung) and the log-linear MLE method.
NL

We see that in this example all three methods provide similar results.

Observe: the risk class (owned, 21-30y) is punished by the bad performance
of (leased, 21-30y) and vice verse. A similar remark holds true for risk class
(leased, 31-40y).

Remarks.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 177

The multiplicative tariff construction above has used the design matrix Z =
(zm,k )m,k RM (r+1) which was generated by categorical variables. Categor-
ical variables allow to group observations into disjoint risk categories.

Binary variables are a special case of categorical variables that can only have
two specifications, 1 for true and 0 for false. Recall that all our zm,k {0, 1}.
E.g., the observation Si,j either belongs to the class owned or to the class
leased.

w)
Often the linear regression model X = Z + with
N (0, ) is introduced for continuous variables (zm,k )m,k
which generate the design matrix Z. E.g. if there is a
(clear) linear relationship between the age and tariff cri-

(m
terion 2 , then variable zm,k (modeling age) is directly
reflecting this relationship (linear regression). For more
on this subject we refer to Frees [49] and Section 2.3.3
in Wthrich-Buser [101]. For the present discussion we
concentrate on categorical variables, also because often it is difficult to find
a clear functional relationship, see also example in Section 7.3.4, below.
tes
A serious drawback of the log-linear model is that we need to have claims
observations in all risk classes because otherwise Xi,j = log(Si,j /vi,j ) is not
well-defined. In practice, it may happen that one has a risk class with positive
volume vi,j > 0 but there is no claim in that risk class. This results in Si,j = 0.
no

In this case one should use the more sophisticated models presented below,
see for instance Section 7.3.4 for a claims count example. Moreover, volumes
vi,j should be large in order to have the right relationship for the resulting
variances of the claims ratios.
NL

7.2.2 Goodness-of-fit analysis


Compared to the methods in the previous section, the log-linear MLE formula (7.9)
has the advantage that we can apply classical statistical methods for a goodness-of-
fit test and for variable selection/reduction techniques. We introduce this statistical
language. For this discussion we assume homoscedasticity, i.e. identical weights

wm = 1 and = 2 1,

MLE
which simplifies the MLE to b = (Z 0 Z)1 Z 0 X. The general case is treated in
the next section. We introduce the total sum of squares (the first and last equalities
are definitions)

Version March 14, 2017, M.V. Wthrich, ETH Zurich


178 Chapter 7. Tariffication and Generalized Linear Models

X 2 X 2 X 2
SStot = Xm X = c X
X m + Xm X
c
m = SSreg + SSerr ,
m m m
(7.10)
with X = 1 PM
Xm and X
c = b MLE .
Z
M m=1

SStot is the total difference between observations Xm and the sample mean
X without knowing the explaining variables Z (regression model structure).

SSreg is the difference explained by the model structure Z.

w)
SSerr is the residual difference not explained by the model structure.

Proof of (7.10). We rewrite the total sums of squares SStot in vector notation. Therefore we
define

(m
= X Z
b b MLE = X X c and X = X (1, . . . , 1)0 . (7.11)
We calculate
0 0
X 0X = (X )0 (X
c+b ) = X
c+b cXc + 2X
cb 0 b
+b .

b MLE minimizes in the homoscedastic case (X Z)0 (X Z) and thus we have


The MLE
MLE
tes
0 = Z 0 (X Z
b ) = Z 0b
, (7.12)

and as a consequence
0
X
cb b MLE )0 b
= (Z = 0.
This implies
no

0
X 0X = X
cX 0 b
c+b .
0
We subtract on both sides X X to obtain
0 0 0
0 b
SStot = X 0 X X X = b +X
cXc X X = SSerr + SSreg ,

where for the last step we need to observe that the intercept 0 is contained in every row of the
NL

design matrix Z, therefore the first column in Z is equal to (1, . . . , 1)0 . This and (7.12) imply
0 0
0 = (1, . . . , 1)0 b
P P b
= Xm X m . This treats the cross-product terms leading to X X X X =
cc
SSreg . This proves (7.10). 2

We define and consider the coefficient of determination R2 given by

SSreg SSerr
R2 = =1 [0, 1].
SStot SStot

This is the ratio of explaining variables SSreg and the total sum of squares SStot . If
the regression model explains well the structure in the observations then R2 should
be close to 1, because Xc is able to explain the underlying structure.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 179

For Example 7.5 we obtain R2 = 0.9202 which is in favor for this model explaining
the data Si,j .

Residual standard deviation : For further analysis we also need the residual
standard deviation . It is estimated (in the homoscedastic case) by

1 X b0 b
c 2 = = SSerr ,

b 2 = Xm X m
M m M M

w)
where b was defined in (7.11). Set r = I +J 2, i.e. the di-
mension of parameter is r +1. b 2 is the MLE for 2 and
M b 2 is distributed as 2 2M r1 , see, for instance, Section
7.4 in Johnson-Wichern [62]. Often, one also considers the

(m
M
unbiased variance parameter estimator sb2 = M r1 b 2 .

Revisiting Example 7.5, we have M = 8 observations,


r + 1 = 5 parameters and hence df = M r 1 = 3
degrees of freedom. In our case we obtain sb = 0.07447.
tes
Likelihood ratio test: Finally, we would like to see whether we need to include
a specific parameter k,lk .

We have a r + 1 = I + J 1 dimensional parameter vector given by


no

= (0 , 1,2 , . . . , 1,I , 2,2 , . . . , 2,J )0 Rr+1 .

Note that the model is, of course, invariant under permutation of parameters and
components. Therefore, we can choose any specific ordering and to simplify nota-
NL

tion we define
= (0 , 1 , . . . , r )0 Rr+1 , (7.13)

so that we have the ordering of components that is appropriate for the next layout.

Null hypothesis H0 : 0 = . . . = p1 = 0 for given p < r + 1.

1. Calculate the residual differences SSfull


err and
b in the full model with r + 1
r+1
dimensional parameter vector R .

2. Calculate residual differences SSH 0


err in the reduced model (p , . . . , r ) R
0 r+1p
.

We calculate the likelihood ratio . Therefore, we denote the design matrix of the

Version March 14, 2017, M.V. Wthrich, ETH Zurich


180 Chapter 7. Tariffication and Generalized Linear Models

reduced model by Z0 . Then it is given by


 
M exp 2b12 (X Z0 b MLE )0 (X
Z0
b MLE
)
fbH0 (X)

bH0 H0 H0
H 0
= = 
MLE MLE

fbfull (X) bfull exp 2b12 (X Z 0
full ) (X Z full )
b b
full
H M/2 !M/2 !M/2
SSerr0
SSH 0
err SSH0 SSfull
= M
SSfull
= = 1 + err full err . (7.14)
err SSfull
err SSerr
M

The likelihood ratio test rejects the null hypothesis H0 for small values of . This

w)
is equivalent to rejection for large values of (SSH full full
err SSerr )/SSerr .
0

This motivates to consider the test statistics

SSH full
err SSerr M r 1
0
SSH
err SSerr
0 full
F= = . (7.15)
SSfull

(m
err p p sb2full

F has an F -distribution with degrees of freedom given by df 1 = p and df 2 =


M r 1, see Result 7.6 in Johnson-Wichern [62] or (4.2) in Frees [49]. Therefore,
we reject the null hypothesis H0 on the significance level 1 if
tes

F > Fp,M r1 (), (7.16)

where the latter denotes the quantile of the F -distribution with degrees of free-
dom df 1 and df 2 . The heteroscedastic case is given in (7.23), below.
no

Example 7.6 (regression model, revisited). We revisit Example 7.5.


In Figure 7.2 we give the R output of the command lm.

The lines Call give the MLE problem to be solved.


MLE
The lines Residuals display b = X Z
b .
NL

The lines Coefficients give the MLEs for the parameters 0 (intercept),
1,2 (leased) and 2,2 , . . . , 2,4 . For these parameters a standard estimation
error (inverse of the estimated Fisher information matrix) is calculated and
a t-test is applied to each parameter individually, whether they are different
from zero, see formula (7.14) in Johnson-Wichern [62]. From this analysis we
see that we might only question 2,4 because of the large p-value of 0.1675,
the other parameters are well justified by the observations.

The bottom lines then display the residual standard error sb = 0.07447 on
df = 3 degrees of freedom, the coefficient of determination R2 = 0.9202, the
adjusted coefficient of determination Ra2 corrects for the degrees of freedom
SSerr M 1
Ra2 = 1 .
SStot M r 1

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 181

w)
(m
Figure 7.2: R output of Example 7.5 using R command lm.

The final line displays an F test statistics (7.15) of value 8.653 for df1 = 4
tes
and df2 = 3 for dropping all variables except of the intercept 0 . This gives
a p-value of 5.36% which says that the null hypothesis is just about to be
rejected on the 5% significance level and we stay with the full model.

For the reduction of the variable owned or leased. We obtain an F test


no

statistics of 18.36 for df1 = 1 and df2 = 3. This gives a p-value of 2.34%
which says that we reject the null hypothesis of setting 1,2 = 0 on the 5%
significance level.

In the reduced model 1,2 = 0 we obtain an F test statistics of 1.071 for


NL

df1 = 3 and df2 = 3 for dropping all remaining variables variables 2,2 =
. . . = 2,4 = 0. This gives a p-value of 45.52% which says that we cannot
reject this null hypothesis on the 5% significance level.
We conclude that we need the variable to distinguish between owned and leased.
The classification in age classes 21-30y, . . ., 51-60y can be discussed. This
discussion will also depend on whether we want such a tariffication criterion and
whether our competitors consider similar variables. 

Exercise 21. Provide design matrix Z for the pricing problem specified by the
following risk class specification (assuming a multiplicative tariff structure).
21-30y 31-40y 41-50y 51-60y
passenger car 2000 1800 1500 1600
delivery van 2200 1600 1400 1400
truck 2500 2000 1700 1600

Version March 14, 2017, M.V. Wthrich, ETH Zurich


182 Chapter 7. Tariffication and Generalized Linear Models

Calculate a tariff using the different tariffication methods introduced above. 

7.3 Generalized linear models


In the previous section we have taken a log-normal approximation for the total claim
amounts Si,j in risk classes (i, j). Taking logarithms has then led to a multiplicative
structure in a natural way. In the present section we express the expected claim of
risk class (i, j) as expected number of claims times the average claim, i.e.

w)
(1)
E[Si,j ] = E[Ni,j ] E[Yi,j ],

(l)
where Ni,j describes the number of claims in risk class (i, j) and Yi,j the correspond-
ing i.i.d. claim sizes for l = 1, . . . , Ni,j in risk class (i, j). Note that we suppose a

(m
compound distribution for this decoupling.

(l)
We now analyze Ni,j and Yi,j separately.

Definition 7.7 (exponential dispersion family). X fX belongs to the exponential


dispersion family if fX is of the form
tes
( )
x b()
fX (x; , ) = exp + c(x, , w) ,
/w

write X EDF(, , w, b()), where


no

w>0 is a given weight,


>0 is the dispersion parameter,
is the (unknown) parameter of the distribution,
R is an open set of possible parameters ,
NL

b:R is the cumulant function,


c(, , ) is the normalization, not depending on .

fX can either be a density in the absolutely continuous sense, it can be probability


weights in the discrete case or it can be a mixture thereof. Moreover, depending
on the choice of the cumulant function b() and of the possible parameters the
support of X may need to be restricted to subsets of R.

Lemma 7.8. Choose a fixed cumulant function b() and assume that the exponen-
tial dispersion family EDF(, , w, b()) gives well-defined densities with identical
supports for all parameters in an open (non-empty) set . Assume that
for any there exists a neighborhood of zero such that the moment generating
function MX (r) of X EDF(, , w, b()) is finite in this neighborhood of zero (for

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 183

r). Then we have for all and r sufficiently close to zero


( )
b( + r/w) b()
MX (r) = exp .
/w
Proof. Choose and r in the neighborhood of zero such that MX (r) exists. Then we have
 
x b()
Z
rx
MX (r) = e exp + c(x, , w) dx
/w
 
x( + r/w) b()
Z
= exp + c(x, , w) dx
/w
 Z  
b( + r/w) b() x( + r/w) b( + r/w)

w)
= exp exp + c(x, , w) dx.
/w /w

We have assumed that is an open set. Therefore, for any we have that r = +r/w
for r sufficiently close to zero. Therefore, the last integral is the density that corresponds to
EDF(r , , w, b()) and since this is a well-defined density with identical support for all r

(m
this last integral is equal to 1. This proves the claim. 2

Corollary 7.9. We make the same assumptions as in Lemma 7.8 and in addition
we assume that b C 2 in the interior of . Then we have
00
E[X] = b0 () and Var(X) = b ().
w
tes
Proof. In view of (1.3) we only need to calculate the first and second derivatives at zero of the
moment generating function. We have from Lemma 7.8
 
d b( + r/w) b() 0
= b0 (),

MX (r) = exp b ( + r/w)
dr r=0 /w r=0
no

and
d2
  
b( + r/w) b() 0 2 00
M X (r) = exp (b ( + r/w)) + b ( + r/w)
dr2
r=0 /w w r=0

= (b0 ())2 + b00 ().
w
2
NL

This proves the claim.

Example 7.10 (exponential dispersion family). In Chapters 2 and 3 we have


met several examples that belong to the exponential dispersion family. We revisit
these examples and explain how they fit into the exponential dispersion family
framework. These considerations also lead to an explicit explanation of the weight
w > 0. We start with the discrete case assuming X fX .
Binomial distribution: Choose = R, b() = log(1 + e ), = 1 and w = v.
In this case we obtain for x {0, 1/v, 2/v, . . . , 1}
fX (x; , 1)
exp v x log(1 + e ) = exp v x log e log(1 + e )
     
=
exp{c(x, 1, v)}
e
   
1
= exp vx log exp v(1 x) log = pvx (1 p)vvx ,
1 + e 1 + e

Version March 14, 2017, M.V. Wthrich, ETH Zurich


184 Chapter 7. Tariffication and Generalized Linear Models

for p = e /(1 + e ) (0, 1). The first two moments are obtained by

e
E[X] = b0 () = = p,
1 + e

and
1 00 1 e 1 1
Var(X) = b () =
= p(1 p).
v v 1+e 1+e v
From this we see that N = vX Binom(v, p).

w)
Poisson distribution: Choose = R, b() = exp{}, = 1 and w = v. In
this case we obtain for x N0 /v

fX (x; , 1) n  o
= exp v x e = vx ev ,
exp{c(x, 1, v)}

(m
for = e > 0. The first two moments are obtained by

1 00 1 1
E[X] = b0 () = e = and Var(X) = b () = e = .
v v v
tes
From this we see that N = vX Poi(v).

In the absolutely continuous case we have the following examples.

Gaussian distribution: Choose = R and b() = 2 /2. In this case we have


no

for x R
( ) ( )
fX (x; , ) x 2 /2 1 2 2x
= exp = exp ,
exp {c(x, , w)} /w 2 /w

which is the Gaussian density with mean and variance /w.


NL

Gamma distribution: Choose = R+ and b() = log(). In this case


we have for x R+
( ) ( )
fX (x; , ) x + log() w/ w
= exp = () exp x ,
exp {c(x, , w)} /w

this is a gamma density with shape parameter = w/ > 0 and scale


parameter c = w/ = > 0. The first two moments are obtained by


E[X] = b0 () = 1/ = and Var(X) = 2
= 2.
c w c

For more examples we refer to Table 13.8 in Frees [49] on page 379. 

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 185

These examples show that several popular distribution functions belong to the
exponential dispersion family. In the present notes we concentrate on the Poisson
and the gamma distributions for pricing the two components number of claims
and claims severities. However, the theory holds true in more generality. Our aim
is to consider compound Poisson models and to express the expected claim of risk
class (i, j) as expected number of claims times the average claim, i.e.
(1)
E[Si,j ] = E[Ni,j ] E[Yi,j ],
(l)
where Ni,j describes the number of claims in risk class (i, j) and Yi,j the corre-

w)
sponding i.i.d. claim sizes for l = 1, . . . , Ni,j in risk class (i, j). We then aim for
calculating a multiplicative tariff which considers risk characteristics s for both
the number of claims and the claims severities of risk class (i, j).

(m
We assume that Ni,j are independent with Ni,j Poi(i,j vi,j ) and vi,j counting the
number of policies in risk class (i, j). Under these assumptions we derive a mul-
tiplicative tariff structure for the characteristics of the expected claims frequency
i,j . For the claim sizes we will do a similar construction by making a gamma dis-
tributional assumption. Since the latter is slightly more involved than the former
we start with the Poisson case.
tes
7.3.1 GLM for Poisson claims counts
We assume that Ni,j are independent with Ni,j Poi(i,j vi,j ) where vi,j denotes the
number of policies in risk class (i, j). In view of the exponential dispersion family
no

we make the following Ansatz for the expected claims frequency, see Example 7.10,
" #
Ni,j
i,j =E = b0 (i,j ) = exp{i,j } = exp{(Z)m }, (7.17)
vi,j
where in the last step we assume having a multiplicative tariff structure which
NL

provides an additive structure on the log-scale reflected by the linear term Z.


The index m = m(i, j) was defined in (7.7), matrix Z RM (r+1) denotes the
design matrix and Rr+1 is the parameter vector. Thus, we assume that
Xi,j = Ni,j /vi,j N0 /vi,j are independent with

Xi,j EDF(i,j = (Z)m , = 1, vi,j , b() = exp{}).

Our aim is to estimate the parameter vector Rr+1 . Identity (7.17) immedi-
ately explains that the appropriate link function g in this problem (between mean
and parameter) is the so-called log-link function g() = log(), because this turns
the multiplicative tariff structure into an additive form. The joint log-likelihood
function of X RM + is given by (we use independence here)

X Xm m exp{m } X Xm (Z)m exp{(Z)m }


`X () = , (7.18)
m 1/vm m 1/vm

Version March 14, 2017, M.V. Wthrich, ETH Zurich


186 Chapter 7. Tariffication and Generalized Linear Models

where we have applied the relabeling of the components of X and vi,j such that
b MLE for is found by
they fit to the design matrix Z, see also (7.6). The MLE
the solution of

`X () = 0. (7.19)

We calculate the partial derivatives of the log-likelihood function

X Xm m exp{m } X Xm exp{m } m
`X () = =
l l m 1/vm m 1/vm l
X Xm exp{(Z)m } (Z)m X Xm exp{(Z)m }

w)
= = zm,l ,
m 1/vm l m 1/vm

where Z = (zm,l )m,l RM (r+1) . If we define the weight matrix V = diag(v1 , . . . , vM )


then we have just proved the following proposition:

(m
Proposition 7.11. The solution to the MLE problem (7.19) in the Poisson case
is given by the solution of

Z 0 V exp{Z} = Z 0 V X.
Remarks. One should observe the similarities between the Gaussian case (7.9)
tes
and the Poisson case of Proposition 7.11 given by, respectively,
MLE MLE
Z 0 1 Z
b = Z 0 1 X and Z 0 V exp{Z
b } = Z 0 V X.

The Gaussian case is solved analytically (assuming full rank of Z), the Poisson case
no

can only be solved numerically, due to the presence of the exponential function.
The Poisson case can be rewritten as
MLE
Z 0 V exp{Z
b } Z 0 N = 0.

Observe that the latter exactly provides the solution to the method of total marginal
NL

sums by Bailey & Jung [9, 63] given by (7.4)-(7.5).


The second remark concerns the log-likelihood given in (7.18). Note that it is
calculated on the risk class (i, j) level. We could do the same calculation on an
individual policy level, however, this is not recommend because it increases com-
putational complexity, i.e. always work with sufficient statistics.

7.3.2 GLM for gamma claim sizes


The analysis of the gamma claim sizes is more involved because it needs more
(l)
transformations. We denote by ni,j the number of observations Yi,j in risk class
(i, j), this plays the role of the volume in the exponential dispersion family. We
assume that
(l) i.i.d.
Yi,j (i,j , ci,j ) for l = 1, . . . , ni,j .

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 187

From the moment generating function given in Section 3.3.3 we immediately see
that for given ni,j the convolution is given by
ni,j
X (l)
Ym = Yi,j = Yi,j (i,j ni,j , ci,j ).
l=1

Thus, the total claim amount Yi,j in risk class (i, j) for given ni,j has a gamma
distribution (which belongs to the exponential dispersion family). We define the
normalized random variable Xm = Yi,j /ni,j , where we again use the relabeling
defined in (7.7). Observe that the family of gamma distributions is closed towards

w)
multiplication, see (3.5). Therefore, the density of Xm is given by
(cm nm )m nm m nm 1
fXm (x) = x exp{cm nm x}. (7.20)
(m nm )
Next we do a re-parametrization similar to Example 7.10 so that we obtain the

(m
parametrization of the exponential dispersion family. Set m = 1/m > 0 and
cm = m /m > 0. This provides gamma density

(m nm /m )nm /m nm /m 1
( )
m nm
fXm (x) = x exp x .
(nm /m ) m
Finally, define cumulant function b() = log() for < 0, see Example 7.10.
tes
The density of Xm = Yi,j /ni,j in risk class (i, j) is then given by
( ) !nm /m
m x b(m ) 1 nm
fXm (x) = exp xnm /m 1 .
m /nm (nm /m ) m
no

Thus, we have for m = R+

Xm EDF(m , m , nm , b() = log()).

The first two moments are given by, see Corollary 7.9,
1 m 2
E[Xm ] = m
NL

and Var(Xm ) = .
nm m
Analogous to the Poisson case we assume a multiplicative structure in the mean.
Using again the log-link function g() = log() we obtain additive structure
1
log E[Xm ] = log(m ) = log(m ) = (Z)m , (7.21)

with design matrix Z RM (r+1) and parameter vector Rr+1 . This gives
relationship
m = exp {(Z)m } .
For the joint log-likelihood function of X RM
+ we then obtain (assuming inde-
pendence between the components of X)
X m Xm + log(m ) X nm
`X () = [Xm exp{(Z)m } (Z)m ] .
m m /nm m m

Version March 14, 2017, M.V. Wthrich, ETH Zurich


188 Chapter 7. Tariffication and Generalized Linear Models

Note that this excludes risk classes (i, j) with no observation ni,j = 0. The MLE
b MLE for is found by the solution of


`X () = 0. (7.22)

We calculate the partial derivatives of the log-likelihood
X nm X nm
`X () = [Xm exp{(Z)m } 1] zm,l = [Xm m 1] zm,l ,
l m m m m

where Z = (zm,l )m,l RM (r+1) . For rewriting the previous equation in matrix
form we define the weight matrix V = diag(1 n1 /1 , . . . , M nM /M ). The last

w)
equation is then written as

`X () = Z 0 V X Z 0 V exp{Z}.

We have just proved the following proposition:

(m
Proposition 7.12. The solution to the MLE problem (7.22) in the gamma case is
given by the solution of
Z 0 V exp{Z} = Z 0 V X.
tes
Remarks.
Proposition 7.12 for the gamma case looks very promising because it has
the same structure as Proposition 7.11 for the Poisson case. However, this
similarity is only at the first sight: parameter vector determines which
is also integrated into the weight matrix V = V() . Therefore, the MLE
no

MLE

b is only found numerically, using either Fishers scoring method or the
Newton-Raphson algorithm.

Note that the parameter vector acts on the scale parameter cm because
cm = m /m with m = exp {(Z)m }. The shape parameter m is
NL

determined through the dispersion parameter, i.e. m = 1/m .

For the general case within the exponential dispersion family with link func-
tion g we refer to Section 2.3.2 in Ohlsson-Johansson [82].

We have seen that the weights wi,j are given by the number of policies vi,j in
the Poisson case and by the number of claims ni,j in the gamma case.

In the log-linear Gaussian model there was the difficulty that we could not
handle risk classes without claims, see page 177. For the Poisson model, this
is not a difficulty because Xm = Nm /vm = 0 is a valid observation. For the
gamma claim sizes risk classes without an observation are naturally excluded
in the MLE.

We summarize the 3 cases considered:

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 189

 Gaussian case:
MLE
Z 0 1 Z
b Z 0 1 X = 0.

 Poisson case:
MLE
Z 0 V exp{Z
b } Z 0 V X = 0.

 Gamma case:
MLE
Z 0 Vb exp{Z
b } Z 0 Vb X = 0,

w)
b MLE }.
with b = exp{Z

7.3.3 Variable reduction analysis

(m
In this section, we consider variable reduction for the exponential dispersion family
under the assumption of choosing the log-link function. In the Gaussian case we
have calculated the F statistics given in (7.15). This F statistics was based on
the classical (unscaled) Pearsons residuals b which measure the difference between
the observations and the (estimated) mean, see (7.11). In the general case of the
tes
exponential dispersion family it is more appropriate to replace Pearsons residuals
by the deviance residuals which measure the contributions of residual differences
to the log-likelihood. This we are going to explain next.
Having observations X = (X1 , . . . , XM )0 with independent components, we deter-
mine the MLE b MLE for Rr+1 within the exponential dispersion family with
no

log-link function and design matrix Z RM (r+1) as described above. This then
provides the estimate for the mean given by, see (7.17) and (7.21),
 
b m = 0
b (b b MLE )
m) = exp (Z m .
NL

We define the inverse function h = (b0 )1 which implies bm = h(b m ). The log-
likelihood function at this estimate
b is then given by

X Xm h(b m ) b(h(b m ))
`X ()
b = + c(Xm , , wm ),
m /wm
where we assume that m = for all m = 1, . . . , M . Observe that this maximizes
the likelihood function over all possible choices of (under given design matrix
Z, cumulant function b and log-link function). Similar to the likelihood ratio test
(7.14) in the Gaussian model we do a likelihood ratio test for this model within the
exponential dispersion family. Therefore, we consider the model Z and compare
it to the saturated model which has as many parameters as observations:
X Xm h(Xm ) b(h(Xm ))
`X (X) = + c(Xm , , wm ).
m /wm

Version March 14, 2017, M.V. Wthrich, ETH Zurich


190 Chapter 7. Tariffication and Generalized Linear Models

The scaled deviance is then defined by

D (X, )
b = 2 (`X (X) `X ())
b
2X h i
= wm Xm h(Xm ) b(h(Xm )) Xm h(b m ) + b(h(b m )) .
m

The deviance statistics is defined by

b = D (X, )
D(X, ) b = 2 (`X (X) `X ())
b .

w)
Observe that these deviance statistics play the role of the residual differences SSerr
(Pearsons residuals) which were used in the likelihood ratio given in (7.14).
This deviance statistics measure the contribution of the residual differences to the

(m
log-likelihood and they may serve as loss functions that need to be minimized, see
Wthrich-Buser [101].

Similar to Section 7.2.2 we would now like to see whether we can reduce the number
of parameters in Rr+1 .
tes
Null hypothesis H0 : 0 = . . . = p1 = 0 for given p < r + 1.
b full ) in the full model Rr+1 .
1. Calculate the deviance statistics D(X,

2. Calculate the deviance statistics D(X,


b H0 ) under the null hypothesis H0 .
no

Define the test statistics, see also (7.15),

b H0 ) D(X,
D(X, b full ) M r 1
F= 0. (7.23)
D(X, b full ) p
NL

The test statistics F has approximately an F -distribution with degrees of freedom


given by df 1 = p and df 2 = M r 1. Therefore, we apply the same criterion as
in (7.16). Note that in the homogeneous Gaussian case we exactly obtain identity
(7.15), see also Example 7.13, below.
A second test statistics considered is, see Lemma 3.1 in Ohlsson-Johansson [82],

X 2 = D (X,
b H0 ) D (X,
b full ) 0. (7.24)

The test statistics X 2 is approximately 2 -distributed with df = p degrees of


freedom. In order to calculate this latter test statistics we need to estimate the
dispersion parameter . For the Poisson case it is assumed to be 1, in the other
cases we have two different options for the estimation of . Assume that m was

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 191

estimated by bm (under the assumption that m = for all m and thus cancels
in the MLE). Then, we can estimate from Pearsons (classical) residuals by

1 X (Xm b0 (bm ))2


bP = wm .
M r1 m b00 (bm )

An alternative approach is to use the deviances which provide estimate

b full )
D(X,
bD = .
M r1

w)
We can also calculate bP and bD in the Poisson case and if they are substantially
different from 1, then we either have under- or over-dispersion, and a different model
(or parametrization) should be used. For the relationship between Pearsons and
deviance dispersion estimates in the Poisson case we also refer to Figure 2.1 in

(m
Wthrich-Buser [101].

Remark. We may also compare the 2 -test to the AIC presented in Section 3.3.3.
We have

X 2 = D (X,
b H0 ) D (X,
b full )
tes
= 2`X ( b full ) = AICH0 AICfull + 2p.
b H0 ) + 2`X (

Thus, we have relationship

X 2 + 2p = AICfull AICH0 .
no

Finally, to check the accuracy of the model and the fit one should also plot the
residuals. Again, we have two options. We can either study Pearsons residuals
given by
NL

Xm b0 (bm )
rP,m = q ,
b00 (bm )/wm
or the deviance residuals given by (note that the square roots are well-defined)
r h   i
0
rD,m = sgn(Xm b (bm )) 2wm Xm h(Xm ) bm b(h(Xm )) + b(bm ) , (7.25)

for m = 1, . . . , M . These residuals should not show any structure because the
Xm s were assumed to be independent and the observed residuals should roughly
be centered having similar variances. We come back to this in Section 7.3.4, below.

Example 7.13. Assume that X1 , . . . , XM are independent with

Xm EDF(m , , wm , b() = ()2 /2). (7.26)

Version March 14, 2017, M.V. Wthrich, ETH Zurich


192 Chapter 7. Tariffication and Generalized Linear Models

From Example 7.10 we know that these Xm s have a Gaussian distribution, i.e. their
densities are given by
( )
1 1 (xm m )2
f (xm ; m , ) = q exp .
2/wm 2 /wm

b = b0 ()
The scaled deviance is given by, set b = ,
b

1X  2
D (X, )
b = wm Xm bm ,
m

w)
and the deviance statistics is given by
X  2
b =
D(X, ) wm Xm bm . (7.27)
m

(m
Compare this to the residual difference SSerr of Section 7.2.2. Compare (7.23) and
(7.15) for the Gaussian model (7.26). 

Exercise 22. Calculate the deviance statistics for the Poisson and the gamma
model, see also (3.4) in Ohlsson-Johansson [82]. 
tes
7.3.4 Claims frequency example
In this section we consider a real data exam-
ple for tariffication of claims frequencies. We
use the GLM method for Poisson claims counts
no

presented in Section 7.3.1. The data comes


from motor third party liability (MTPL) car
insurance and comprises 303381 policies. For
confidentiality reasons we do not explicitly pro-
vide the underlying volume measures vm which
NL

correspond to the number of policies in risk classes m.1 For this MTPL car in-
surance example we choose K = 4 tariff criteria which provide for risk classes
m = m(l1 , l2 , l3 , l4 ) the model

Nm = Nl1 ,l2 ,l3 ,l4 Poi (l1 ,l2 ,l3 ,l4 vl1 ,l2 ,l3 ,l4 ) ,

with vm = vl1 ,l2 ,l3 ,l4 being the number of policies in risk class m and l1 ,l2 ,l3 ,l4 the
expected claims frequency in the corresponding risk class. We assume independence
between different risk classes and we choose a multiplicative tariff structure for the
expected claims frequency, see also (7.1) and (7.17),

m = l1 ,l2 ,l3 ,l4 = exp {l1 ,l2 ,l3 ,l4 } = exp {0 + 1,l1 + 2,l2 + 3,l3 + 4,l4 } , (7.28)
1
An own synthetic example can be generated from the model provided in Appendix A of
Wthrich-Buser [101].

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 193

with intercept 0 and tariff factors k,lk for the tariff criteria k = 1, . . . , 4. The 4
tariff criteria reflect weight category of car, age of driver, kilometers yearly
driven and local region (canton) in Switzerland. We define the relative volume
measures for the 4 different tariff factors as follows
P
l2 ,l3 ,l4 vl1 ,l2 ,l3 ,l4
vlweight
1 ,
=P [0, 1],
l1 ,l2 ,l3 ,l4 vl1 ,l2 ,l3 ,l4

and analogously for vlage 2 ,


, vlkm
3 ,
and vlcanton4 ,
. Moreover, for all tariff criteria k =
1, . . . , 4 we can consider the marginal MLEs. These are given by, see also Estimator
2.32,

w)
b weight = P 1 X
l1 Nl1 ,l2 ,l3 ,l4 ,
l2 ,l3 ,l4 vl1 ,l2 ,l3 ,l4 l2 ,l3 ,l4

b age ,
and analogously we define the marginal MLEs b km and
b canton .
l2 l3 l4

(m
 k = 1. The first tariff criterion is the weight category of the car. We have the
following 7 risk characteristics for l1 {1, . . . , 7}:

l1 1 2 3 4 5 6 7
in kg 1-500 501-1000 1001-1500 1501-2000 2001-2500 2501-3000 3001-3500
label W1-500 W501-1000 W1001-1500 W1501-2000 W2001-2500 W2501-3000 W3001-3500
tes
vlweight
,
<1% 8% 56% 30% 4% 1% 1%
1
weight
bl
15.4% 7.1% 6.7% 7.3% 11.0% 13.3% 21.4%
1

 k = 2. The second tariff criterion is the age of the driver. We have the following
8 risk characteristics l2 {1, . . . , 8}:
no

l2 1 2 3 4 5 6 7 8
age 18-20 21-25 26-30 31-40 41-50 51-60 61-70 71-99
label Y18-20 Y21-25 Y26-30 Y31-40 Y41-50 Y51-60 Y61-70 Y71-99
vlage
,
6% 5% 6% 17% 22% 20% 14% 10%
2
age
bl
19.8% 8.8% 7.7% 6.6% 6.2% 5.8% 5.4% 6.7%
2
NL

 k = 3. The third tariff criterion is the kilometers yearly driven (in 10 000 km).
We have the following 7 risk characteristics l3 {1, . . . , 7}:
l3 1 2 3 4 5 6 7
in 10 000 km 1-5 6-10 11-15 16-20 21-25 26-30 31-99
label K1-5 K6-10 K11-15 K16-20 K21-25 K26-30 K31-99
vlkm, 1% 52% 30% 14% 1% 1% 1%
3
km
bl 7.4% 6.6% 7.3% 8.2% 12.3% 12.4% 13.1%
3

 k = 4. The fourth tariff criterion is the Swiss canton the car is registered in
(according to its license plate). There are 26 different cantons in Switzerland which
implies l4 {1, . . . , 26}.

 k = 1. We observe that the light weight category W1-500 and the heavy weight
categories W2001-2500, W2501-3000 and W3001-3500 have a much higher claims

Version March 14, 2017, M.V. Wthrich, ETH Zurich


194 Chapter 7. Tariffication and Generalized Linear Models

label
AG AI
AR BE
BL BS
FR GE
GL GR
JU LU
NE NW
OW SG
SH SO
SZ TG
TI UR

w)
VD VS
ZG ZH

(m
Figure 7.3: Fourth tariff criterion: cantons of Switzerland the car is registered in,
i.e. l4 {AG, AI, . . . , ZH}.
tes
no
NL

b weight and (rhs)


Figure 7.4: Marginal MLEs (lhs) for the different weight categories l1
age
for the different age classes l2 .
b

frequencies than the middle weight classes, see Figure 7.4 (lhs). The straight hor-
izontal line is the overall sample claims frequency. Figure 7.4 (lhs) also indicates
that light and heavy weight categories have much less volume then the middle
weight categories, this is also reflected in the values vlweight
1 ,
, l1 = 1, . . . , 7.

 k = 2. From the marginal claims frequencies for the different age classes mainly
young drivers are conspicuous, see Figure 7.4 (rhs). The average claims frequency
of drivers between 18 and 20 is more than twice as large as the average claims
frequency of all other drivers.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 195

w)
(m
Figure 7.5: Marginal MLEs (lhs) for the different kilometers yearly driven cate-
b km and (rhs) for the different cantons
gories b canton .
l3 l4

 k = 3. Figure 7.5 (lhs) shows that frequent long-distance drivers have a much
higher claims frequency than other drivers. But frequent long-distance drivers are
tes
only a small proportion of the total MTPL portfolio, see also vlkm
3 ,
values.

 k = 4. Figure 7.5 (rhs) shows that we expect substantial differences between


different cantons. Probably mountain regions are different from urban regions, and
no

it is also noticeable that there seem to be differences between the linguistic areas in
Switzerland. The high frequency observation in Appenzell Innerrhoden (AI) is also
conspicuous, it comes from the fact that rental companies get good deals for license
plates in AI and therefore many rental cars are registered in AI (which obviously
cause higher claims frequencies).
NL

Observe that in this example we have 7 8 7 26 = 100 192 (potential) risk classes.
However, in only M = 60 146 risk classes we have a positive volume vm > 0 (in the
other risk classes we have not sold any policies). Introducing the multiplicative
tariff structure (7.28) with K = 4 tariff criteria reduces the complexity to r + 1 =
7 + 8 + 7 + 26 3 = 45 parameters. We apply the GLM estimation method for
Poisson claims counts, i.e. we evaluate (7.19) using Proposition 7.11. This is done
with R command2

> d.glm <- glm(counts W1-500 + W501-1000 + ...,


data=input, offset=log(volumes), family=poisson())

2
For a leaner representation the covariates should have the appropriate format in R.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


196 Chapter 7. Tariffication and Generalized Linear Models

w)
(m
Figure 7.6: (lhs) Tukey-Anscombe plot which shows the fitted means E[N b
m ] versus
the deviance residuals rD,m for m = 1, . . . , M ; (rhs) QQ plot of the deviance
residuals rD,m versus the theoretical (estimated) quantiles qbm for m = 1, . . . , M .

where input contains the counts Nm , the volumes vm as well as the correspond-
ing design matrix Z {0, 1}M (r+1) that consists of binary variables only. The
tes
summary of the results, similar to Figure 7.2, is obtained by R command

> summary(d.glm)
MLE
Rr+1 with corresponding standard
no

This R command provides the MLE b


b = 30 761.8 on degrees of freedom
errors and p-values, the deviance statistics D(X, )
M r 1 = 60 146 45 = 60 101 and the AIC value of 13519. Furthermore, it
provides the so-called Null Deviance which corresponds to a model which only has
an intercept 0 . This Null Deviance corresponds to the total difference SStot in the
Gaussian model. In our example the Null Deviance is 80 025.5 on 60 146 1 = 60 145
NL

degrees of freedom. The R command

> glm.fitted <- fitted(d.glm)

then calculates the estimated expected claims numbers for m = 1, . . . , M


 
MLE
b MLE = v exp (Z
E[Nm ] = vm m )m .
b b
m

Next we determine the deviance residuals rD,m , see (7.25). In the Poisson case they
take a rather simple form
v " ! #
u
E[Nm] E[Nm]
u b b
rD,m = sgn(Nm E[N
b
m ])
t2N
m log + 1 ,
Nm Nm

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 197

q
for Nm = 0 the deviance residual reduces to rD,m = 2E[N b
m ]. These deviance
residuals and the corresponding theoretical quantiles qbm are obtained from the R
command

> glm.dev <- qqplot.glmRob(input$counts, glm.fitted, 1)

This R command provides the deviance residuals glm.dev$deviance.residuals


and the corresponding theoretical quantiles glm.dev$quantiles. Figure 7.6 (lhs)
gives the Tukey-Anscombe plot which plots the fitted means E[N b
m ] versus the

w)
deviance residuals rD,m for m = 1, . . . , M . This plot should not show any structure
in order to support the Poisson model for claims counts. Figure 7.6 (lhs) is not
completely convincing but still acceptable.
Figure 7.6 (rhs) gives the QQ plot of the deviance residuals rD,m versus the the-

(m
oretical (estimated) quantiles qbm for m = 1, . . . , M , for more background we also
refer to Garcia Ben-Yohai [51]. This QQ plot is also not too convincing for the
Poisson model choice. This can also be seen from the estimates of the dispersion
parameter

D(X, )
b 30 761.8
bD = = 0 = 0.62 and bP = 0.89.
M r1 6 101
tes
Both estimates bD and bP suggest under-dispersion in the data (a 2 -goodness-
no
NL

Figure 7.7: Marginal MLEs and the GLM fitted values: (lhs) for the different
weight categories and (rhs) for the different age classes.

of-fit test, see (2.8) for the test statistics, would reject the model assumption on
the 1% significance level). However, as long as we are only interested into tariff
segmentation for different risk classes we may still use the GLM fit as relative tariff
factors unless we have drastic changes in the portfolio mix. Finally, in Figures

Version March 14, 2017, M.V. Wthrich, ETH Zurich


198 Chapter 7. Tariffication and Generalized Linear Models

w)
(m
Figure 7.8: (lhs) Marginal MLEs and the GLM fitted values for the different kilo-
meters yearly driven categories; (rhs) GLM fitted values for the different cantons.

7.7 and 7.8 we provide the fitted tariff factors compared to the marginal MLE
estimates.
tes
 k = 1. We see that the GLM fit punishes the light weight cars even slightly more
whereas heavy weight cars are relieved, see Figure 7.7 (lhs). From a practical point
of view the former seems a bit unreasonable. It probably has to do with the fact
weight
that the volume v1, in the lowest weight class is very small (this weight class
no

should probably be merged with the next one). The relieve for the heavy weight
cars might be compensated by the fact that these heavy weight cars are typically
driven by frequent long-distant drivers.

 k = 2. The marginal estimates for different age classes are very much in line
NL

with the corresponding GLM fits, see Figure 7.7 (rhs).

 k = 3. Figure 7.8 (lhs) suggests that we can probably merge the three kilometers
yearly driven classes K21-25, K26-30 and K31-99, also due to their small volumes.

 k = 4. Figure 7.8 (rhs) shows that we might be able to merge the different
cantons into 4 or 5 different tariff regions to reduce the complexity of the tariff
structure. This is what we will analyze next using the variable reduction technique
of Section 7.3.3.

In the last step we present the reduction of variables technique presented in Section
7.3.3. We have performed this for all tariff criteria: the weight category criterion
and the age classes cannot further be reduced. This is a bit surprising for the lowest

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 7. Tariffication and Generalized Linear Models 199

weight class W1-500 because the resulting estimate seems a bit unreasonable and
weight
this risk factor has a very low volume v1, . But the tests clearly reject the null
hypothesis of a merger with the next weight class. Therefore, we only present the
analysis for the kilometers yearly driven tariff criterion and for the canton tariff
criterion.

Null hypothesis H0 : The three kilometers yearly driven classes K21-25, K26-30 and
K31-99 are merged, i.e. 3,5 = 3,6 = 3,7 .

w)
We calculate the test statistics F given in (7.23), the test statistics X 2 given in
(7.24) and the AIC. These values are given in Table 7.3.

full model under H0 test statistics p-value

(m
AIC 13519 13516
deviance statistics 3761.8 3762.0
test statistics F 0.23 79%
test statistics X 2 0.28 87%

Table 7.3: Parameter reduction analysis for the tariff criterion kilometers yearly
tes
driven.

The AIC supports the model with merged classes K21-25, K26-30 and K31-99, and both the $F$ test statistic and the $X^2$ test statistic do not reject the null hypothesis $H_0$ at the 5% significance level. Therefore, we merge these risk classes into one risk factor according to null hypothesis $H_0$.

Null hypothesis $H_0^\ast$:

(i) The three kilometers yearly driven classes K21-25, K26-30 and K31-99 are merged, i.e. $\beta_{3,5} = \beta_{3,6} = \beta_{3,7}$;

(ii) the following cantons are merged:
(a) $\beta_{4,AG} = \beta_{4,BE} = \beta_{4,LU}$,
(b) $\beta_{4,AI} = \beta_{4,AR}$,
(c) $\beta_{4,GR} = \beta_{4,SG}$,
(d) $\beta_{4,GL} = \beta_{4,NW} = \beta_{4,OW} = \beta_{4,SZ} = \beta_{4,UR} = \beta_{4,ZG}$,
(e) $\beta_{4,FR} = \beta_{4,GE} = \beta_{4,JU} = \beta_{4,NE} = \beta_{4,TI} = \beta_{4,VD} = \beta_{4,VS}$.

We again calculate the test statistic $F$ given in (7.23), the test statistic $X^2$ given in (7.24) and the AIC. The results are presented in Table 7.4. The AIC supports the reduced model of null hypothesis $H_0^\ast$, the $F$ test statistic rejects $H_0^\ast$ at the 1% significance level, whereas the $X^2$ test statistic does not reject the null hypothesis $H_0^\ast$ at the 1% significance level. Thus, if we want to reduce the complexity of the tariff structure, we could choose the reduced model of null hypothesis $H_0^\ast$, see Figure 7.9 for the resulting regional tariff factors.

                          full model   under H0*   test statistic   p-value
AIC                            13519      13516
deviance statistics           3761.8     3792.4
test statistic F                                             2.92       <1%
test statistic X²                                            30.6      2.2%

Table 7.4: Parameter reduction analysis for the tariff criteria kilometers yearly driven and cantons according to null hypothesis $H_0^\ast$.

Figure 7.9: Tariff factors for cantons: (lhs) full model; (rhs) reduced model of null hypothesis $H_0^\ast$.
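The $X^2$ p-values of Tables 7.3 and 7.4 can be re-derived numerically. A minimal sketch in Python, assuming the degrees of freedom equal the number of dropped parameters (2 for merging the three kilometers classes, and 2 + 15 = 17 under $H_0^\ast$); scipy provides the $\chi^2$ survival function:

    from scipy.stats import chi2

    # H0: merging K21-25, K26-30, K31-99 removes 2 regression parameters
    print(chi2.sf(0.28, df=2))   # approx. 0.87, the 87% of Table 7.3

    # H0*: 2 (kilometers) + 15 (cantons) = 17 removed parameters
    print(chi2.sf(30.6, df=17))  # approx. 0.022, the 2.2% of Table 7.4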
At the end this tariff decision is a strategic business decision (which is supported by statistical analysis). This business decision will also depend on the tariff structure applied in the previous year: in these considerations, in particular when introducing a new tariff structure, one should always keep in mind that the individual premia on single policies should not change too much from one year to the next. Otherwise loyal customers will be very upset about the new pricing policy of the insurance company and they will think that the company's business is not under control. Therefore, transitions should always be done as smoothly as possible. Another reason for such business decisions is that the prices should be competitive (hopefully) in many segments. Therefore, it is also important that these business decisions take into account what competitors are doing. 

Outlook. For more complex tariffication problems and high-dimensional covariate spaces we refer to the lecture notes Wüthrich-Buser [101].



Chapter 8

Bayesian Models and Credibility Theory

In the previous chapter we have done tariffication using GLM. This was done by splitting the total portfolio into different homogeneous risk classes $(i, j)$. The volume measures in these risk classes $(i, j)$ were given by $v_{i,j}$ in Section 7.3.1 (Poisson case) and by $n_{i,j}$ in Section 7.3.2 (gamma case), respectively. There might occur the situation where a risk class $(i, j)$ has only small volume $v_{i,j}$ and $n_{i,j}$, respectively, i.e. only a few policies or claims fall into that risk class. In that case an observation $N_{i,j}$ and $S_{i,j}$ may not be very informative and single outliers may disturb the whole picture, see Figure 7.7 (lhs). Credibility theory aims at dealing with such situations in that it specifies a tariff of the following type
\[
\widehat{\lambda}_{i,j} \;=\; \alpha_{i,j}\, \frac{S_{i,j}}{v_{i,j}} + (1-\alpha_{i,j})\, \lambda,
\]
i.e. the tariff $\widehat{\lambda}_{i,j}$ for the next accounting year is calculated as a credibility weighted average between the individual past observation $S_{i,j}/v_{i,j}$ and the overall average $\lambda$ with credibility weight $\alpha_{i,j} \in [0, 1]$. For $\alpha_{i,j} = 1$ we completely believe in the past observation $S_{i,j}/v_{i,j}$, for $\alpha_{i,j} = 0$ we only believe in the overall average $\lambda$. Credibility theory makes this approach rigorous and specifies the credibility weights $\alpha_{i,j}$.
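In code this tariff is a one-liner; a small sketch with hypothetical numbers ($S$, $v$, the weight $\alpha$ and the overall average $\lambda$ are illustrative only):

    def credibility_premium(S, v, alpha, lam):
        # credibility weighted average of individual experience S/v and overall average lam
        return alpha * (S / v) + (1.0 - alpha) * lam

    # hypothetical risk class: claims S = 12 on volume v = 100, overall average 8%
    print(credibility_premium(12.0, 100.0, alpha=0.6, lam=0.08))  # 0.6*0.12 + 0.4*0.08 = 0.104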

Credibility theory belongs to the field of Bayesian statistics:

• There are exact Bayesian methods which allow for analytical solutions.

• There are simulation methods such as the Markov chain Monte Carlo (MCMC) method which allow for numerical solutions of Bayesian models.

• There are approximations such as linear credibility methods which give optimal solutions in sub-spaces of possible solutions.

Central to these methods is the Bayes rule.


8.1 Exact Bayesian models


We start by explaining Bayes' rule. The basic idea of Bayes' rule goes back to Reverend Thomas Bayes (1701-1761) who discovered the rule during the 1740s. It was then Richard Price (1723-1791) who devoted much of his time to clean and prepare Bayes' essay on the probability of causes and who submitted "An essay toward solving a problem in the doctrine of chances" to the Royal Society's Philosophical Transactions. In 1774 Pierre-Simon Laplace discovered the rule on his own and brought it into today's form. Therefore, the Bayes rule should be called the Bayes-Price-Laplace rule. For a historical review we refer to McGrayne [76]. As we will see, Bayes' rule is the mathematical tool to combine prior knowledge and observations into posterior knowledge. Technically it exchanges probabilities, therefore it is also known under the name method of inverse probabilities.

Assume we have an observation $X$ that has density $f_\theta(x)$. Often the difficulty is that the parameter $\theta$ is not known/specified. In previous chapters we have estimated this parameter with the MLE method and with the method of moments. These methods are purely observation based. What can we do if we have no past observations or only scarce past observations? This is the question we would like to answer in this chapter. It will lead to a new attitude and to a new estimation method.


Figure 8.1: (lhs) Grave of the Bayes family at Bunhill Fields Burial Ground, London, UK; (rhs) historical review of McGrayne [76].

We specify a prior distribution/density $\pi$ for the (unknown) parameter $\theta$. We will explain below how this prior distribution is specified. The joint density of observation $X$ and parameter $\theta$ is then given by
\[
f(x, \theta) = f_\theta(x)\, \pi(\theta).
\]
Bayes' rule allows to calculate the posterior distribution of $\theta$, given observation $x$,
\[
\pi(\theta|x) = \frac{f_\theta(x)\, \pi(\theta)}{\int f_{\widetilde{\theta}}(x)\, \pi(\widetilde{\theta})\, d\widetilde{\theta}} \;\propto\; f_\theta(x)\, \pi(\theta).
\]

This means that we start with a prior distribution $\pi(\theta)$. This prior distribution either expresses expert knowledge or is determined from a portfolio of similar business. Having observed $x$, we modify the prior belief $\pi(\theta)$ to obtain the posterior distribution $\pi(\theta|x)$ that reflects both the prior knowledge $\pi(\theta)$ about $\theta$ and the experience $x$. That is, the prior belief $\pi(\theta)$ is improved by the arriving observation $x$. The general idea then is to update our (prior) knowledge about $\theta$ whenever an observation arrives. These updates constantly improve our estimation of the unknown model parameter $\theta$.

This is exactly what Bayesian and credibility theory is about.
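Numerically, this update is just "prior times likelihood, then renormalize". A minimal Python sketch on a discretized parameter grid (the grid, the flat prior and the Poisson observation are hypothetical choices):

    import numpy as np
    from scipy.stats import poisson

    theta = np.linspace(0.01, 0.30, 300)      # grid for a claims frequency parameter
    prior = np.ones_like(theta) / len(theta)  # flat prior pi(theta)
    lik = poisson.pmf(7, mu=theta * 50.0)     # likelihood of observing 7 claims on volume 50

    posterior = prior * lik                   # Bayes' rule up to the normalizing constant
    posterior /= posterior.sum()              # renormalize

    print(theta[np.argmax(posterior)])        # posterior mode, close to the MLE 7/50 = 0.14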


We start with an explicit example to show how this mechanism works.

8.1.1 Poisson-gamma model


In this section we present one of the most popular Bayesian models which has a closed form solution for the posterior distribution. As mentioned in Bühlmann-Gisler [24], this mathematical model can be traced back in the actuarial literature to Fritz Bichsel (1921-1999) [11]. He introduced it in the 1960s to calculate a bonus-malus tariff system for Swiss motor third party liability insurance. The aim was to punish bad drivers and to reward good drivers according to the collected individual claims experience. This has led to bonus-malus considerations.

Definition 8.1 (Poisson-gamma model). Assume fixed volumes $v_t > 0$ are given for $t \in \mathbb{N}$.

• Conditionally, given $\Lambda$, the components of $\boldsymbol{N} = (N_1, \ldots, N_T)$ are independent with $N_t \sim \text{Poi}(\Lambda v_t)$.

• $\Lambda \sim \Gamma(\gamma, c)$ with prior parameters $\gamma > 0$ and $c > 0$.


Remark. Observe that there is a fundamental difference to the negative-binomial distribution considered in Section 2.2.4. Here, we assume that $N_1, \ldots, N_T$ all belong to the same $\Lambda$, whereas for having independent negative-binomial distributions $N_1, \ldots, N_T$ every component belongs to another independent latent factor $\Lambda_1, \ldots, \Lambda_T$. In the latter case the components of $\boldsymbol{N}$ are independent, whereas in the former case they are dependent (and only conditionally independent, given $\Lambda$).

Theorem 8.2. Assume $\boldsymbol{N} = (N_1, \ldots, N_T)$ follows the Poisson-gamma model of Definition 8.1. The posterior distribution of $\Lambda$, conditional on $\boldsymbol{N}$, is given by
\[
\Lambda \big|_{\{\boldsymbol{N}\}} \;\sim\; \Gamma\left(\gamma + \sum_{t=1}^T N_t,\; c + \sum_{t=1}^T v_t\right).
\]

Proof. The posterior is given by
\[
\pi(\lambda|\boldsymbol{N}) \;\propto\; f_\lambda(\boldsymbol{N})\, \pi(\lambda)
= \prod_{t=1}^T e^{-\lambda v_t}\, \frac{(\lambda v_t)^{N_t}}{N_t!}\;\; \frac{c^\gamma}{\Gamma(\gamma)}\, \lambda^{\gamma-1}\, e^{-c\lambda}
\;\propto\; \lambda^{\gamma + \sum_{t=1}^T N_t - 1}\, e^{-\left(c + \sum_{t=1}^T v_t\right)\lambda}.
\]
This is a gamma density with the required properties. □

Remarks 8.3.

• The posterior distribution is again a gamma distribution but with modified parameters. For the parameters we obtain the updates
\[
\gamma \;\mapsto\; \gamma_T^{\rm post} = \gamma + \sum_{t=1}^T N_t \qquad\text{and}\qquad c \;\mapsto\; c_T^{\rm post} = c + \sum_{t=1}^T v_t.
\]
Often $\gamma$ and $c$ are called prior parameters and $\gamma_T^{\rm post}$ and $c_T^{\rm post}$ posterior parameters (at time $T$).

• Note that this update has a recursive structure
\[
\gamma_T^{\rm post} = \gamma_{T-1}^{\rm post} + N_T \qquad\text{and}\qquad c_T^{\rm post} = c_{T-1}^{\rm post} + v_T.
\]

• The remarkable property in the Poisson-gamma model is that the posterior distribution stays in the same family of distributions as the prior distribution. There are more examples of this kind as we will see below. Many of these examples belong to the exponential dispersion family with conjugate priors.


• For the estimation of the unknown parameter $\Lambda$ we obtain the following prior and posterior estimators
\[
\widehat{\lambda}_0 = E[\Lambda] = \frac{\gamma}{c},
\qquad\qquad
\widehat{\lambda}_T^{\rm post} = E[\Lambda|\boldsymbol{N}] = \frac{\gamma_T^{\rm post}}{c_T^{\rm post}} = \frac{\gamma + \sum_{t=1}^T N_t}{c + \sum_{t=1}^T v_t}.
\]
We analyze the posterior estimator $\widehat{\lambda}_T^{\rm post}$ in more detail below, which will provide the basic credibility formula.

Corollary 8.4. Assume $\boldsymbol{N} = (N_1, \ldots, N_T)$ follows the Poisson-gamma model of Definition 8.1. The posterior estimator $\widehat{\lambda}_T^{\rm post}$ has the following credibility form
\[
\widehat{\lambda}_T^{\rm post} \;=\; \alpha_T\, \widehat{\lambda}_T + (1-\alpha_T)\, \widehat{\lambda}_0,
\]
with credibility weight $\alpha_T$ and observation based estimator $\widehat{\lambda}_T$ given by
\[
\alpha_T = \frac{\sum_{t=1}^T v_t}{c + \sum_{t=1}^T v_t} \in (0, 1)
\qquad\text{and}\qquad
\widehat{\lambda}_T = \frac{1}{\sum_{t=1}^T v_t} \sum_{t=1}^T N_t.
\]
The (mean square error) uncertainty of this estimator is given by
\[
E\left[\left(\widehat{\lambda}_T^{\rm post} - \Lambda\right)^2 \,\middle|\, \boldsymbol{N}\right]
= \frac{\gamma_T^{\rm post}}{(c_T^{\rm post})^2}
= (1-\alpha_T)\, \frac{\widehat{\lambda}_T^{\rm post}}{c}.
\]
Proof. In view of Theorem 8.2 we have for the posterior mean
\[
\widehat{\lambda}_T^{\rm post}
= \frac{\gamma + \sum_{t=1}^T N_t}{c + \sum_{t=1}^T v_t}
= \frac{\sum_{t=1}^T v_t}{c + \sum_{t=1}^T v_t}\, \frac{1}{\sum_{t=1}^T v_t} \sum_{t=1}^T N_t + \frac{c}{c + \sum_{t=1}^T v_t}\, \frac{\gamma}{c}
= \alpha_T\, \widehat{\lambda}_T + (1-\alpha_T)\, \widehat{\lambda}_0.
\]
This proves the first claim. For the estimation uncertainty we have
\[
E\left[\left(\widehat{\lambda}_T^{\rm post} - \Lambda\right)^2 \,\middle|\, \boldsymbol{N}\right]
= {\rm Var}\left(\Lambda \,\middle|\, \boldsymbol{N}\right)
= \frac{\gamma_T^{\rm post}}{(c_T^{\rm post})^2}
= (1-\alpha_T)\, \frac{\widehat{\lambda}_T^{\rm post}}{c}.
\]
This proves the claim. □

Remarks 8.5.

• Corollary 8.4 shows that the posterior estimator $\widehat{\lambda}_T^{\rm post}$ is a credibility weighted average between the prior guess $\widehat{\lambda}_0$ and the purely observation based estimator $\widehat{\lambda}_T$ with credibility weight $\alpha_T \in (0, 1)$.

• The credibility weight $\alpha_T$ has the following properties:

1. for the number of observed years $T \to \infty$: $\alpha_T \to 1$ (since $v_t \geq 1$ for all $t$ if $v_t$ counts the number of policies);
2. for the volume $v_t \to \infty$: $\alpha_T \to 1$;
3. for the prior uncertainty going to infinity, i.e. $c \to 0$: $\alpha_T \to 1$;
4. for the prior uncertainty going to zero, i.e. $c \to \infty$: $\alpha_T \to 0$.

• Note that
\[
{\rm Var}(\Lambda) = \frac{\gamma}{c^2} = \frac{1}{c}\, \widehat{\lambda}_0.
\]
For $c$ large we have an informative prior distribution, for $c$ small we have a vague prior distribution and for $c = 0$ we have a non-informative or improper prior distribution. The latter means that we have no prior parameter knowledge (this has to be understood in an asymptotic sense).

• The observation based estimator satisfies, see Estimators 2.27 and 2.32,
\[
\widehat{\lambda}_T^{\rm MV} = \widehat{\lambda}_T^{\rm MLE} = \widehat{\lambda}_T.
\]

• The posterior estimator $\widehat{\lambda}_T^{\rm post}$ has the nice property of a recursive update structure which is important in many situations, see the next corollary.

Corollary 8.6. Assume $\boldsymbol{N} = (N_1, \ldots, N_T)$ follows the Poisson-gamma model of Definition 8.1. Let $\widehat{\lambda}_T^{\rm post}$ denote the posterior estimator and $\widehat{\lambda}_{T-1}^{\rm post}$ the posterior estimator in the sub-model where we have only observed $(N_1, \ldots, N_{T-1})$. The posterior estimator $\widehat{\lambda}_T^{\rm post}$ has the following recursive update structure
\[
\widehat{\lambda}_T^{\rm post} \;=\; \beta_T\, \frac{N_T}{v_T} + (1-\beta_T)\, \widehat{\lambda}_{T-1}^{\rm post},
\]
with credibility weight
\[
\beta_T = \frac{v_T}{c + \sum_{t=1}^T v_t} \in (0, 1).
\]
Proof. In view of Corollary 8.4 we have for the posterior mean
\[
\widehat{\lambda}_T^{\rm post}
= \frac{\sum_{t=1}^T v_t}{c + \sum_{t=1}^T v_t}\, \frac{1}{\sum_{t=1}^T v_t} \sum_{t=1}^T N_t + \frac{c}{c + \sum_{t=1}^T v_t}\, \widehat{\lambda}_0
= \frac{1}{c + \sum_{t=1}^T v_t} \left(\sum_{t=1}^{T-1} N_t + N_T\right) + (1-\beta_T)(1-\alpha_{T-1})\, \widehat{\lambda}_0,
\]
where we used
\[
\frac{c}{c + \sum_{t=1}^T v_t} = \frac{c + \sum_{t=1}^{T-1} v_t}{c + \sum_{t=1}^T v_t}\, \frac{c}{c + \sum_{t=1}^{T-1} v_t} = (1-\beta_T)(1-\alpha_{T-1}).
\]
For the first term we have
\[
\frac{1}{c + \sum_{t=1}^T v_t} \left(\sum_{t=1}^{T-1} N_t + N_T\right)
= \frac{v_T}{c + \sum_{t=1}^T v_t}\, \frac{N_T}{v_T}
+ \frac{c + \sum_{t=1}^{T-1} v_t}{c + \sum_{t=1}^T v_t}\, \frac{\sum_{t=1}^{T-1} v_t}{c + \sum_{t=1}^{T-1} v_t}\, \frac{1}{\sum_{t=1}^{T-1} v_t} \sum_{t=1}^{T-1} N_t
= \beta_T\, \frac{N_T}{v_T} + (1-\beta_T)\, \alpha_{T-1}\, \widehat{\lambda}_{T-1}.
\]
Collecting all terms provides the claim. □

Conclusions. For pricing such a portfolio, we need to have prior information $\widehat{\lambda}_0$ about the premium. This prior information can come from experts, from similar portfolios, from market information or from a combination thereof. If we have no observations we charge the premium $\widehat{\lambda}_0$. When we start to collect observations $N_1, N_2, \ldots$, we constantly update the premium by the rule
\[
\widehat{\lambda}_t^{\rm post} \;=\; \beta_t\, \frac{N_t}{v_t} + (1-\beta_t)\, \widehat{\lambda}_{t-1}^{\rm post},
\]
for $t \geq 1$, where we set $\widehat{\lambda}_0^{\rm post} = \widehat{\lambda}_0$. The prior information has an uncertainty parameter $c$ for the credibility weighting of $\widehat{\lambda}_0$. The bigger the prior uncertainty the faster the prior knowledge will disappear as $t \to \infty$. In the limit (as $t \to \infty$) we have a premium that is completely based on the observations and which, in this Poisson-gamma case, coincides with the MLE.
However, the credibility formula of Corollary 8.4 is of special interest when we only have a few observations, i.e. $t$ small, and these few observations are only based on a small portfolio, i.e. $v_s$ small for all $s \leq t$. In such cases the credibility weight $\alpha_t$ may be around, say, 60% and therefore the prior mean $\widehat{\lambda}_0$ substantially smooths the purely observation based estimator $\widehat{\lambda}_t$. This way we get much more stability and reliability in the premium calculation because we add an additional source of information to the premium calculation problem (the prior choice).
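A short simulation sketch of this recursive premium update in Python (prior parameters and volumes are hypothetical); it also verifies that the recursion of Corollary 8.6 reproduces the direct posterior mean of Corollary 8.4:

    import numpy as np

    rng = np.random.default_rng(1)
    gamma, c = 2.0, 25.0                 # hypothetical prior: E[Lambda] = gamma/c = 8%
    v = np.full(10, 120.0)               # hypothetical volumes v_1, ..., v_T
    lam = rng.gamma(gamma, 1.0 / c)      # draw the (unobservable) true frequency Lambda
    N = rng.poisson(lam * v)             # conditionally independent claims counts

    lam_hat = gamma / c                  # start with the prior estimator
    for t in range(len(v)):
        beta = v[t] / (c + v[: t + 1].sum())          # credibility weight beta_t
        lam_hat = beta * N[t] / v[t] + (1.0 - beta) * lam_hat

    direct = (gamma + N.sum()) / (c + v.sum())        # posterior mean of Corollary 8.4
    print(lam_hat, direct)               # the two values coincide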

Next we study a larger class of distribution functions for which we can explicitly solve the pricing problem in a Bayesian context.

8.1.2 Exponential dispersion family with conjugate priors

The crucial property of the Poisson-gamma model is that the prior and the posterior distributions belong to the same family of parametric distributions, only the parameters change from prior parameters to posterior parameters. There are many examples of this type. The best known examples belong to the exponential dispersion family with conjugate priors. We have already met the exponential dispersion family in Definition 7.7, $X \sim {\rm EDF}(\theta, \phi, w, b(\cdot))$ has (generalized) density
\[
f_X(x; \theta, \phi) = \exp\left\{\frac{x\theta - b(\theta)}{\phi/w} + c(x, \phi, w)\right\},
\]
for an (unknown) parameter $\theta$ in the open set $\Theta$. In the Bayesian case we will model this parameter $\theta$ with a prior distribution $\pi$ on $\Theta$ and then try to determine the posterior distribution after we have collected (independent) observations $X_1, \ldots, X_T$ that belong to this ${\rm EDF}(\theta, \phi, w, b(\cdot))$.


Model Assumptions 8.7 (exponential dispersion family with conjugate priors). Assume fixed volumes $w_t > 0$, $t = 1, \ldots, T$, a dispersion parameter $\phi > 0$ and a cumulant function $b : \Theta \to \mathbb{R}$ on an open set $\Theta \subseteq \mathbb{R}$ are given.

• Assume the random variable $\theta$ has the following density on $\Theta$
\[
\pi_{x_0, \tau}(\theta) = \exp\left\{\frac{x_0\theta - b(\theta)}{\tau^2} + d(x_0, \tau)\right\},
\]
with fixed prior parameters $x_0 \in I$ and $\tau \in (0, c_+)$, and $d(\cdot, \cdot)$ describes the normalization. $I \subseteq \mathbb{R}$ denotes the possible choices of $x_0$ so that $\pi_{x_0, \tau}$ is a well-defined density on $\Theta$ for all $\tau \in (0, c_+)$, for a fixed given constant $c_+ > 0$.

• Conditionally, given $\theta$, the components of $\boldsymbol{X} = (X_1, \ldots, X_T)$ are independent with $X_t \sim {\rm EDF}(\theta, \phi, w_t, b(\cdot))$, having well-defined densities with supports not depending on $\theta$.

Theorem 8.8. We make Model Assumptions 8.7 and assume that the domain $I$ of possible prior choices $x_0$ is an open interval which contains the range of $X_t$ for all $\theta \in \Theta$ and $t = 1, \ldots, T$. The posterior distribution of $\theta$, given $\boldsymbol{X}$, is given by the density $\pi_{\widehat{x}_T^{\rm post}, \tau_{\rm post}}$ with
\[
\tau_{\rm post} = \left[\sum_{t=1}^T \frac{w_t}{\phi} + \frac{1}{\tau^2}\right]^{-1/2} < \tau, \qquad\text{with } \tau_{\rm post} \in (0, c_+),
\]
\[
\widehat{x}_T^{\rm post} = \alpha_T\, \widehat{x}_T^{\rm MV} + (1-\alpha_T)\, x_0 \in I,
\]
with credibility weight $\alpha_T$ and (minimum variance) estimator $\widehat{x}_T^{\rm MV}$
\[
\alpha_T = \frac{\sum_{t=1}^T w_t}{\sum_{t=1}^T w_t + \phi/\tau^2}
\qquad\text{and}\qquad
\widehat{x}_T^{\rm MV} = \frac{1}{\sum_{t=1}^T w_t} \sum_{t=1}^T w_t X_t,
\]
where for the minimum variance statement we additionally assume that the second moments of $X_t|_{\{\theta\}}$ exist and the cumulant function $b \in C^2$ in the interior of $\Theta$.

Proof. The Bayes rule gives for the posterior distribution of $\theta$, conditionally given $\boldsymbol{X}$,
\[
\pi(\theta|\boldsymbol{X}) \;\propto\; \prod_{t=1}^T f_{X_t}(X_t; \theta, \phi)\; \pi_{x_0, \tau}(\theta)
\;\propto\; \prod_{t=1}^T \exp\left\{\frac{X_t\theta - b(\theta)}{\phi/w_t}\right\} \exp\left\{\frac{x_0\theta - b(\theta)}{\tau^2}\right\}
\]
\[
= \exp\left\{\left[\sum_{t=1}^T \frac{X_t w_t}{\phi} + \frac{x_0}{\tau^2}\right]\theta - \left[\sum_{t=1}^T \frac{w_t}{\phi} + \frac{1}{\tau^2}\right] b(\theta)\right\}
= \exp\left\{\frac{1}{\tau_{\rm post}^2}\left(\left[\sum_{t=1}^T \frac{w_t}{\phi} + \frac{1}{\tau^2}\right]^{-1}\left[\sum_{t=1}^T \frac{X_t w_t}{\phi} + \frac{x_0}{\tau^2}\right]\theta - b(\theta)\right)\right\}.
\]
Observe that $0 < \tau_{\rm post} < \tau < c_+$ and
\[
\left[\sum_{t=1}^T \frac{w_t}{\phi} + \frac{1}{\tau^2}\right]^{-1}\left[\sum_{t=1}^T \frac{X_t w_t}{\phi} + \frac{x_0}{\tau^2}\right]
= \alpha_T\, \frac{1}{\sum_{t=1}^T w_t} \sum_{t=1}^T w_t X_t + (1-\alpha_T)\, x_0 \;\in\; I.
\]
The latter holds true because $I$ is (by assumption) an open interval that contains $x_0$ and the range of all possible outcomes $X_t$ for all $\theta \in \Theta$ and $t = 1, \ldots, T$. Therefore, we obtain the posterior density $\pi_{\widehat{x}_T^{\rm post}, \tau_{\rm post}}$, which is a well-defined density on $\Theta$ by assumption. There remains the proof of the minimum variance statement. For fixed parameter $\theta$ we know that $\boldsymbol{X} = (X_1, \ldots, X_T)$ are independent with $X_t \sim {\rm EDF}(\theta, \phi, w_t, b(\cdot))$. Corollary 7.9 (or its generalization) implies
\[
E[X_t|\theta] = b'(\theta) \qquad\text{and}\qquad {\rm Var}(X_t|\theta) = \frac{\phi}{w_t}\, b''(\theta). \qquad (8.1)
\]
Note that $\theta$ does not depend on $t$, therefore the statement of the minimum variance estimator follows from Lemma 2.26. This closes the proof. □

Theorem 8.9 (credibility estimator). We make the assumptions of Theorem 8.8. In addition we assume that $\exp\{(x_0\theta - b(\theta))/\tau^2\}$ disappears on the boundary of $\Theta$ for all $x_0 \in I$ and $\tau \in (0, c_+)$, and that $b \in C^1$ in the interior of $\Theta$. We have
\[
E[b'(\theta)] = x_0
\qquad\text{and}\qquad
E[b'(\theta)|\boldsymbol{X}] = \widehat{x}_T^{\rm post} = \alpha_T\, \widehat{x}_T^{\rm MV} + (1-\alpha_T)\, x_0,
\]
see Theorem 8.8 for the notation.

Proof. In view of Theorem 8.8 it suffices to prove the first statement for all $x_0 \in I$ and $\tau \in (0, c_+)$.
\[
E[b'(\theta)] = \int_\Theta b'(\theta)\, \exp\left\{\frac{x_0\theta - b(\theta)}{\tau^2} + d(x_0, \tau)\right\} d\theta
= \int_\Theta \left(x_0 - \tau^2\, \frac{x_0 - b'(\theta)}{\tau^2}\right) \exp\left\{\frac{x_0\theta - b(\theta)}{\tau^2} + d(x_0, \tau)\right\} d\theta
\]
\[
= x_0 - \tau^2\, \exp\{d(x_0, \tau)\} \left[\exp\left\{\frac{x_0\theta - b(\theta)}{\tau^2}\right\}\right]_{\partial\Theta} = x_0.
\]
This proves the claim. □

Example 8.10 (exact credibility model). We make the assumptions of Theorem 8.9 but we extend the random vector to $(X_1, \ldots, X_T, X_{T+1})$, i.e. we add one additional component $X_{T+1}$ to the random vector, and we assume that conditionally, given $\theta$, these components are all independent satisfying Model Assumptions 8.7. Our aim is to price $X_{T+1}$ based on the observations $X_1, \ldots, X_T$ and on the prior knowledge $\pi_{x_0, \tau}$. Therefore we calculate the conditional expectation of $X_{T+1}$, given the observations $X_1, \ldots, X_T$, by applying the tower property. This provides
\[
E[X_{T+1}|X_1, \ldots, X_T] = E\big[E[X_{T+1}|\theta, X_1, \ldots, X_T]\,\big|\, X_1, \ldots, X_T\big]
= E\big[E[X_{T+1}|\theta]\,\big|\, X_1, \ldots, X_T\big]
\]
\[
= E[b'(\theta)|X_1, \ldots, X_T] = \widehat{x}_T^{\rm post}
= \alpha_T\, \widehat{x}_T^{\rm MV} + (1-\alpha_T)\, x_0. \qquad (8.2)
\]
Thus, we get a credibility weighted average for the premium of $X_{T+1}$ which is based on the prior knowledge $\pi_{x_0, \tau}$ and on the past experience $X_1, \ldots, X_T$. Similar to Corollary 8.6 we obtain a recursive update structure for this experience premium, which allows to express the premium more and more accurately as time passes (under the above stationarity assumptions, of course). 
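As a consistency check (a sketch in the notation of Theorem 8.8), the Poisson-gamma model of Section 8.1.1 is recovered as a special case: choose $\theta = \log\lambda$, $b(\theta) = e^\theta$, $\phi = 1$ and $w_t = v_t$, so that $X_t = N_t/v_t$. The conjugate prior then transforms into a gamma prior on $\lambda = e^\theta$:
\[
\pi_{x_0, \tau}(\theta)\, d\theta = \exp\left\{\frac{x_0\theta - e^\theta}{\tau^2} + d(x_0, \tau)\right\} d\theta
\;\propto\; \lambda^{x_0/\tau^2 - 1}\, e^{-\lambda/\tau^2}\, d\lambda,
\]
i.e. $\Lambda \sim \Gamma(\gamma, c)$ with $\gamma = x_0/\tau^2$ and $c = 1/\tau^2$. In particular, $x_0 = \gamma/c = \widehat{\lambda}_0$ and $\alpha_T = \sum_t v_t/(\sum_t v_t + c)$, in agreement with Corollary 8.4.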

Remarks 8.11.

• Examples that belong to the exponential dispersion family with conjugate priors are: the Poisson-gamma model, the gamma-gamma model, the (log-)normal-normal model. For detailed information we refer to Chapter 2 in Bühlmann-Gisler [24].

• All models that have been studied in the GLM Chapter 7 can also be studied in the Bayesian sense as illustrated above.

• Theorem 8.8 gives an additional way of parameter estimation within the exponential dispersion family. In contrast to the MLEs and the minimum variance estimators, this Bayesian way also allows to include prior information, which may come from experts or from similar business. Moreover, parameter uncertainty is quantified by the posterior distribution.

• This Bayesian idea can be extended to other families of distributions, for example the Pareto-gamma case is treated in Section 2.6 of Bühlmann-Gisler [24].

Example 8.12 (gamma-gamma model). We close this section with the example of the gamma-gamma model. We recall Example 7.10. Choose fixed volumes $w_t > 0$, $t = 1, \ldots, T$, and dispersion parameter $\phi = 1/\gamma > 0$. Assume that conditionally, given $\Lambda > 0$, $X_1, \ldots, X_T$ are independent gamma distributed with densities
\[
f_{X_t}(x; \Lambda, \phi) = \frac{(\Lambda\, w_t/\phi)^{w_t/\phi}}{\Gamma(w_t/\phi)}\; x^{w_t/\phi - 1} \exp\left\{-\Lambda\, w_t/\phi\; x\right\}
\qquad\text{for } x \in \mathbb{R}_+.
\]
This is the form used in (7.20) with scale parameter $c = \Lambda/\phi > 0$. Observe that the range of the random variables $X_t$ is $\mathbb{R}_+$ and that we obtain well-defined gamma densities on $\mathbb{R}_+$ for all $\Lambda \in \mathbb{R}_+$ and all $t = 1, \ldots, T$. This motivates the choice of the open set $\widetilde{\Theta} = \mathbb{R}_-$ for the possible parameter choices $\theta$.

Thus, we need to show two things: (i) the density $f_{X_t}(x; \Lambda, \phi)$ belongs to the exponential dispersion family for a particular cumulant function $b : \Theta \to \mathbb{R}$; (ii) this will allow to define the conjugate prior density $\pi_{x_0, \tau}$ for which we would like to show that we can apply Theorem 8.9.

Item (i) was already done in Example 7.10, however we will do it once more because the signs need a careful treatment.
\[
f_{X_t}(x; \Lambda, \phi)
= \Lambda^{w_t/\phi}\, \exp\left\{-\Lambda\, w_t/\phi\; x\right\} \exp\left\{c(x, \phi, w_t)\right\}
= \exp\left\{\log\Lambda\; w_t/\phi - \Lambda\, w_t/\phi\; x\right\} \exp\left\{c(x, \phi, w_t)\right\}
\]
\[
= \exp\left\{\frac{x(-\Lambda) - \left(-\log(-(-\Lambda))\right)}{\phi/w_t}\right\} \exp\left\{c(x, \phi, w_t)\right\}.
\]
The last formula seems to be a waste of minus signs, but with the definitions $\theta = -\Lambda$ and $b(\theta) = -\log(-\theta)$ for $\theta < 0$ we see that the gamma density belongs to the exponential dispersion family, that is, by a slight abuse of notation in $f_{X_t}$,
\[
f_{X_t}(x; \theta, \phi) = \exp\left\{\frac{x\theta - b(\theta)}{\phi/w_t}\right\} \exp\left\{c(x, \phi, w_t)\right\}.
\]
Moreover, we set $\Theta = \widetilde{\Theta} = \mathbb{R}_-$ for the domain of $b$. Corollary 7.9 then implies for all $t = 1, \ldots, T$
\[
E[X_t|\Lambda] = b'(\theta) = -\frac{1}{\theta} = \Lambda^{-1} \in \mathbb{R}_+.
\]
This completes task (i).

(ii) The prior density on $\Theta$ is then chosen by
\[
\pi_{x_0, \tau}(\theta) = \exp\left\{\frac{x_0\theta - b(\theta)}{\tau^2} + d(x_0, \tau)\right\}
\;\propto\; (-\theta)^{1/\tau^2 + 1 - 1} \exp\left\{\theta\, \frac{x_0}{\tau^2}\right\}.
\]
This is a gamma density, set $\Lambda = -\theta$, with shape parameter $1 + 1/\tau^2 > 0$ and scale parameter $x_0/\tau^2$. This implies that we should choose $I = \mathbb{R}_+$ and $\tau > 0$. In view of Theorem 8.8 the assumptions are fulfilled because $I$ is an open interval containing all possible observations $X_t$, and thus Theorem 8.8 can be applied.

Next we observe that this density disappears on the boundary of $\Theta = \mathbb{R}_-$ given by the set $\{-\infty\} \cup \{0\}$. Therefore, we have from Theorem 8.9 (we also perform the whole calculation to back-test the result)
\[
x_0 = E[b'(\theta)] = E\left[\Lambda^{-1}\right]
= \int_{\mathbb{R}_+} \lambda^{-1}\, \frac{(x_0/\tau^2)^{1+1/\tau^2}}{\Gamma(1+1/\tau^2)}\, \lambda^{1/\tau^2 + 1 - 1} \exp\left\{-\lambda\, \frac{x_0}{\tau^2}\right\} d\lambda
\]
\[
= \frac{(x_0/\tau^2)^{1+1/\tau^2}}{\Gamma(1+1/\tau^2)}\, \frac{\Gamma(1/\tau^2)}{(x_0/\tau^2)^{1/\tau^2}}
\int_{\mathbb{R}_+} \frac{(x_0/\tau^2)^{1/\tau^2}}{\Gamma(1/\tau^2)}\, \lambda^{1/\tau^2 - 1} \exp\left\{-\lambda\, \frac{x_0}{\tau^2}\right\} d\lambda = x_0.
\]
Moreover, the posterior mean is given by
\[
E\left[\Lambda^{-1}\,\middle|\, \boldsymbol{X}\right] = \widehat{x}_T^{\rm post} = \alpha_T\, \widehat{x}_T^{\rm MV} + (1-\alpha_T)\, x_0,
\]
with credibility weight
\[
\alpha_T = \frac{\sum_{t=1}^T w_t}{\sum_{t=1}^T w_t + \phi/\tau^2}.
\]
$\tau > 0$ describes the degree of information contained in the prior distribution. 

In this section we have considered examples for which we can explicitly calculate
the posterior distribution. The next section will give approximations where this is
not the case.

8.2 Linear credibility estimation
In Model Assumptions 8.7 we have studied Bayesian models which were based on the exponential dispersion family with conjugate priors. As a result we were able to explicitly calculate the posterior distribution in these models and, moreover, this posterior distribution belonged to the same class of distributions as the prior itself, see Theorem 8.8. In many applied modeling problems we do not face such an ideal situation. Nowadays there are powerful simulation techniques that can handle more complicated models and problems. In the case of Bayesian analysis we can use Markov chain Monte Carlo (MCMC) methods, such as Gibbs sampling, the Metropolis-Hastings algorithm and sequential Monte Carlo samplers, which will provide the posterior distribution in almost any situation where we can write down the posterior density up to the normalizing constant. That is, whenever we have an explicit posterior density of the following crucial form
\[
\pi(\theta|x) \;\propto\; f_\theta(x)\, \pi(\theta),
\]
and the right-hand side of this proportionality is explicit as a function of $\theta$, we can use an acceptance-rejection simulation algorithm (within MCMC methods) which allows to approximate $\pi(\theta|x)$ empirically. For MCMC methods we refer to the related literature, see for instance Congdon [28], Gilks et al. [53], Green [56, 57], Johansen et al. [61] or Robert [88].
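To make this concrete, here is a minimal random-walk Metropolis-Hastings sketch in Python for exactly this situation, where only $f_\theta(x)\pi(\theta)$ is available. The Poisson likelihood, the $\Gamma(2, 25)$ prior and the tuning constants are hypothetical choices; the posterior here is known to be $\Gamma(9, 75)$, which makes the output easy to check:

    import numpy as np

    rng = np.random.default_rng(0)
    x, v = 7, 50.0                                   # hypothetical data: 7 claims on volume 50

    def log_post(theta):
        # log of f_theta(x) * pi(theta), up to the normalizing constant
        if theta <= 0.0:
            return -np.inf
        log_lik = x * np.log(theta * v) - theta * v  # Poisson log-likelihood (without x! term)
        log_prior = np.log(theta) - 25.0 * theta     # Gamma(2, 25) prior, up to constants
        return log_lik + log_prior

    theta, chain = 0.1, []
    for _ in range(20000):
        prop = theta + 0.02 * rng.standard_normal()  # symmetric random-walk proposal
        if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
            theta = prop                             # accept, otherwise keep current state
        chain.append(theta)

    print(np.mean(chain[5000:]))                     # close to the exact posterior mean 9/75 = 0.12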
Linear credibility theory is not based on simulation methods but it tries to approximate the posterior mean by the best linear estimator. This is what we are going to describe more explicitly in this section. The key model for this analysis is the model of Hans Bühlmann and Erwin Straub (1938-2004) [25]. This model was mainly used in an insurance pricing context but, of course, possible applications are much more widespread. For literature we refer to Bühlmann-Gisler [24].


8.2.1 Bühlmann-Straub model


Model 8.13 (Bühlmann-Straub (BS) model [25]). Assume we have $I$ risk classes and $T$ random variables per risk class. Assume fixed volumes $w_{i,t} > 0$, $i = 1, \ldots, I$ and $t = 1, \ldots, T$, are given.

• Conditionally, given $\Theta_i$, the components of $\boldsymbol{X}_i = (X_{i,1}, \ldots, X_{i,T})$ are independent with the first two conditional moments given by
\[
E[X_{i,t}|\Theta_i] = \mu(\Theta_i),
\qquad\qquad
{\rm Var}(X_{i,t}|\Theta_i) = \frac{\sigma^2(\Theta_i)}{w_{i,t}}.
\]

• The pairs $(\Theta_1, \boldsymbol{X}_1), \ldots, (\Theta_I, \boldsymbol{X}_I)$ are independent and $\Theta_1, \ldots, \Theta_I$ are i.i.d.

Throughout, we assume that the second moments are finite, i.e. $E[X_{i,t}^2] < \infty$ for all $i, t$.

Remarks 8.14.

• We assume that each risk class $i$ is characterized by a risk characteristics $\Theta_i$ with range $\Theta$. A priori (before having any observations $X_{i,t}$) all risk classes are considered to be similar, which is expressed by the i.i.d. property of the $\Theta_i$. This describes our prior knowledge about the risk classes.

• The conditional mean and variance are characterized by the two functions $\mu : \Theta \to \mathbb{R}$ and $\sigma^2 : \Theta \to \mathbb{R}_+$; $\theta \mapsto \mu(\theta)$ and $\theta \mapsto \sigma^2(\theta)$.

• If we set $I = 1$, i.e. we only have one risk class, then an explicit example of the BS Model 8.13 is given by the exponential dispersion family with conjugate priors, Model Assumptions 8.7. The conditional mean and variance are then modeled by, see (8.1),
\[
\mu(\theta) = b'(\theta)
\qquad\text{and}\qquad
\sigma^2(\theta) = \phi\, b''(\theta),
\]
for the corresponding (sufficiently smooth) cumulant function $b : \Theta \to \mathbb{R}$.

For the BS credibility estimator we define the following structural parameters:
\[
\mu_0 = E[\mu(\Theta_1)] \qquad\text{collective mean}, \qquad (8.3)
\]
\[
\tau^2 = {\rm Var}(\mu(\Theta_1)) \qquad\text{volatility between risk classes}, \qquad (8.4)
\]
\[
\sigma^2 = E[\sigma^2(\Theta_1)] \qquad\text{(expected) volatility within risk classes}. \qquad (8.5)
\]


8.2.2 Bühlmann-Straub credibility formula

The Bayesian estimator for the (unknown) mean $\mu(\Theta_i)$ of risk class $i$ is given by
\[
\widehat{\mu(\Theta_i)} = E\left[\mu(\Theta_i)\,\middle|\, \boldsymbol{X}_1, \ldots, \boldsymbol{X}_I\right]. \qquad (8.6)
\]
In the exponential dispersion family with conjugate priors this posterior mean can be calculated explicitly, see Theorem 8.9. In most other situations, however, this is not the case. Therefore, we approximate this posterior mean. We briefly describe how this approximation is done. Assume that all considered random variables are square integrable, thus we work on the Hilbert space $L^2(\Omega, \mathcal{F}, P)$ of square integrable random variables, where the inner product is given by
\[
\langle X, Y \rangle = E[XY] \qquad\text{for } X, Y \in L^2(\Omega, \mathcal{F}, P).
\]
In this Hilbert space the random vectors $\boldsymbol{X}_1, \ldots, \boldsymbol{X}_I$ generate the subspace $G(\boldsymbol{X})$ of all $(\boldsymbol{X}_1, \ldots, \boldsymbol{X}_I)$-measurable random variables. The posterior mean $\widehat{\mu(\Theta_i)}$ given by (8.6) is the element of the subspace $G(\boldsymbol{X})$ that minimizes the $L^2$-distance between this subspace $G(\boldsymbol{X})$ and $\mu(\Theta_i)$. In the Hilbert space this estimate $\widehat{\mu(\Theta_i)}$ corresponds to the orthogonal projection of $\mu(\Theta_i)$ onto $G(\boldsymbol{X})$. In general, this minimization and orthogonal projection onto $G(\boldsymbol{X})$, respectively, has a too complicated form. To reduce this complexity we restrict the orthogonal projection to simpler subsets $L$ of $G(\boldsymbol{X})$. This will provide approximations to $\widehat{\mu(\Theta_i)} \in G(\boldsymbol{X})$ in the more restricted subsets $L \subseteq G(\boldsymbol{X})$. We define the following two subsets
\[
L(\boldsymbol{X}, 1) = \left\{\widehat{\mu} = a_0 + \sum_{i,t} a_{i,t} X_{i,t};\; a_0 \in \mathbb{R},\, a_{i,t} \in \mathbb{R} \text{ for all } i, t\right\} \subseteq G(\boldsymbol{X}),
\]
\[
L_0(\boldsymbol{X}) = \left\{\widehat{\mu} = \sum_{i,t} a_{i,t} X_{i,t};\; a_{i,t} \in \mathbb{R} \text{ for all } i, t \text{ and } E[\widehat{\mu}] = \mu_0\right\} \subseteq G(\boldsymbol{X}).
\]
The first subset $L(\boldsymbol{X}, 1)$ includes the constants, which will imply unbiasedness of the estimators, whereas in the second case $L_0(\boldsymbol{X})$ we need to enforce unbiasedness by a side constraint.

Definition 8.15 (inhomogeneous and homogeneous credibility estimator). We assume that the BS Model 8.13 is fulfilled with collective mean $\mu_0 \in \mathbb{R}$.
The inhomogeneous (linear) credibility estimator of $\mu(\Theta_i)$ based on $\boldsymbol{X}_1, \ldots, \boldsymbol{X}_I$ is defined by
\[
\widehat{\widehat{\mu(\Theta_i)}} = \underset{\widehat{\mu} \in L(\boldsymbol{X}, 1)}{\arg\min}\; E\left[(\widehat{\mu} - \mu(\Theta_i))^2\right].
\]
The homogeneous (linear) credibility estimator of $\mu(\Theta_i)$ based on $\boldsymbol{X}_1, \ldots, \boldsymbol{X}_I$ is defined by
\[
\widehat{\widehat{\mu(\Theta_i)}}^{\rm hom} = \underset{\widehat{\mu} \in L_0(\boldsymbol{X})}{\arg\min}\; E\left[(\widehat{\mu} - \mu(\Theta_i))^2\right].
\]


Remark 8.16. The inhomogeneous credibility estimator $\widehat{\widehat{\mu(\Theta_i)}}$ is the best approximation to $\mu(\Theta_i)$ (in the $L^2$-sense) among all linear estimators given by $L(\boldsymbol{X}, 1)$. Because $L(\boldsymbol{X}, 1)$ is a subset of $G(\boldsymbol{X})$, we immediately obtain for the mean square error with the Pythagorean theorem for successive orthogonal projections
\[
E\left[\left(\widehat{\widehat{\mu(\Theta_i)}} - \mu(\Theta_i)\right)^2\right]
= E\left[\left(\widehat{\mu(\Theta_i)} - \mu(\Theta_i)\right)^2\right]
+ E\left[\left(\widehat{\widehat{\mu(\Theta_i)}} - \widehat{\mu(\Theta_i)}\right)^2\right]. \qquad (8.7)
\]
The left-hand side describes the error of the inhomogeneous credibility estimator which can be split (right-hand side) into the error of the (best) Bayesian estimator and the approximation error of the inhomogeneous credibility estimator to the Bayesian estimator, see Theorem 3.14 in Bühlmann-Gisler [24]. In a similar spirit the homogeneous credibility estimator $\widehat{\widehat{\mu(\Theta_i)}}^{\rm hom}$ is the best approximation to $\mu(\Theta_i)$ and $\widehat{\mu(\Theta_i)}$ within $L_0(\boldsymbol{X})$ (here we additionally use unbiasedness).
tes
Theorem 8.17 (inhomogeneous and homogeneous BS estimator). We assume that the BS Model 8.13 is fulfilled with parameters $\mu_0$, $\sigma^2$ and $\tau^2$ given by (8.3)-(8.5). The inhomogeneous credibility estimator is given by
\[
\widehat{\widehat{\mu(\Theta_i)}} = \alpha_{i,T}\, \widehat{X}_{i,1:T} + (1 - \alpha_{i,T})\, \mu_0,
\]
with credibility weight $\alpha_{i,T}$ and observation based estimator $\widehat{X}_{i,1:T}$
\[
\alpha_{i,T} = \frac{\sum_{t=1}^T w_{i,t}}{\sum_{t=1}^T w_{i,t} + \sigma^2/\tau^2}
\qquad\text{and}\qquad
\widehat{X}_{i,1:T} = \frac{1}{\sum_{t=1}^T w_{i,t}} \sum_{t=1}^T w_{i,t} X_{i,t}.
\]
The homogeneous credibility estimator is given by
\[
\widehat{\widehat{\mu(\Theta_i)}}^{\rm hom} = \alpha_{i,T}\, \widehat{X}_{i,1:T} + (1 - \alpha_{i,T})\, \widehat{\mu}_T,
\]
with estimate
\[
\widehat{\mu}_T = \frac{1}{\sum_{i=1}^I \alpha_{i,T}} \sum_{i=1}^I \alpha_{i,T}\, \widehat{X}_{i,1:T}.
\]

Proof of Theorem 8.17. The theorem can be proved by brute force doing convex optimization (using the method of Lagrange in the latter case) or we can apply Hilbert space techniques using projection properties, see Chapters 3 and 4 in Bühlmann-Gisler [24]. We do the brute force calculation because it is quite straightforward. We minimize
\[
h(a) = E\left[\left(a_0 + \sum_{l,t} a_{l,t} X_{l,t} - \mu(\Theta_i)\right)^2\right]
\]
over all possible choices $a_0, a_{l,t} \in \mathbb{R}$. This requires that we calculate all derivatives w.r.t. these parameters and set them equal to zero:
\[
\frac{\partial}{\partial a_0}\, h(a) = 2\, E\left[a_0 + \sum_{l,t} a_{l,t} X_{l,t} - \mu(\Theta_i)\right] \stackrel{!}{=} 0, \qquad (8.8)
\]
\[
\frac{\partial}{\partial a_{j,s}}\, h(a) = 2\, E\left[X_{j,s}\left(a_0 + \sum_{l,t} a_{l,t} X_{l,t} - \mu(\Theta_i)\right)\right] \stackrel{!}{=} 0. \qquad (8.9)
\]
Equation (8.8) immediately implies unbiasedness of the inhomogeneous credibility estimator, and moreover that
\[
a_0 = \mu_0 \left(1 - \sum_{l,t} a_{l,t}\right).
\]
Plugging this into (8.9) and using (8.8) once more immediately gives for all $j, s$ the requirement
\[
{\rm Cov}\left(X_{j,s},\; \sum_{l,t} a_{l,t} X_{l,t} - \mu(\Theta_i)\right) = 0.
\]
Using the uncorrelatedness between different risk classes (which is implied by the independence) we obtain the following (normal) equations, see Corollary 3.17 and Section 4.3 in Bühlmann-Gisler [24],
\[
a_0 = \mu_0 \left(1 - \sum_{l,t} a_{l,t}\right), \qquad (8.10)
\]
\[
{\rm Cov}(X_{j,s}, \mu(\Theta_i)) = \sum_{t=1}^T a_{j,t}\, {\rm Cov}(X_{j,s}, X_{j,t}) \qquad\text{for all } j, s. \qquad (8.11)
\]
We calculate these last covariance terms
\[
{\rm Cov}(X_{j,s}, X_{j,t})
= E[{\rm Cov}(X_{j,s}, X_{j,t}|\Theta_j)] + {\rm Cov}(E[X_{j,s}|\Theta_j], E[X_{j,t}|\Theta_j])
= \frac{1}{w_{j,s}}\, E[\sigma^2(\Theta_j)]\, 1_{\{t=s\}} + {\rm Var}(\mu(\Theta_j))
= \frac{\sigma^2}{w_{j,s}}\, 1_{\{t=s\}} + \tau^2 > 0.
\]
The first covariance term is given by
\[
{\rm Cov}(X_{j,s}, \mu(\Theta_i)) = {\rm Var}(\mu(\Theta_i))\, 1_{\{j=i\}} = \tau^2\, 1_{\{j=i\}}.
\]
This implies that the left-hand side of (8.11) is equal to 0 for $j \neq i$ and because ${\rm Cov}(X_{j,s}, X_{j,t}) \geq \tau^2 > 0$ it follows that $a_{j,s} = 0$ for all $j \neq i$. This is not surprising because we have assumed that the different risk classes are independent. Therefore (8.10)-(8.11) reduce to
\[
a_0 = \mu_0 \left(1 - \sum_{t=1}^T a_{i,t}\right) \stackrel{\rm def.}{=} \mu_0\, (1 - \alpha_{i,T}), \qquad (8.12)
\]
\[
\tau^2 = a_{i,s}\, \frac{\sigma^2}{w_{i,s}} + \tau^2 \sum_{t=1}^T a_{i,t} = a_{i,s}\, \frac{\sigma^2}{w_{i,s}} + \tau^2\, \alpha_{i,T} \qquad\text{for all } s. \qquad (8.13)
\]
This defines $\alpha_{i,T} = \sum_{t=1}^T a_{i,t}$ and we still need to see that this credibility weight has the claimed form. Requirement (8.13) then implies for all $s$
\[
a_{i,s} = (1 - \alpha_{i,T})\, \frac{\tau^2}{\sigma^2}\, w_{i,s}.
\]
If we sum this over $s$ we obtain
\[
\alpha_{i,T} = \sum_{s=1}^T a_{i,s} = (1 - \alpha_{i,T})\, \frac{\tau^2}{\sigma^2} \sum_{s=1}^T w_{i,s}.
\]
Solving this for $\alpha_{i,T}$ gives the following credibility weights
\[
\alpha_{i,T} = \frac{\frac{\tau^2}{\sigma^2} \sum_{s=1}^T w_{i,s}}{\frac{\tau^2}{\sigma^2} \sum_{s=1}^T w_{i,s} + 1} = \frac{\sum_{t=1}^T w_{i,t}}{\sum_{t=1}^T w_{i,t} + \sigma^2/\tau^2},
\]
and the $a_{i,s}$ are given by
\[
a_{i,s} = \frac{\tau^2}{\sigma^2}\left(1 - \frac{\sum_{t=1}^T w_{i,t}}{\sum_{t=1}^T w_{i,t} + \sigma^2/\tau^2}\right) w_{i,s}
= \frac{\tau^2}{\sigma^2}\, \frac{\sigma^2/\tau^2}{\sum_{t=1}^T w_{i,t} + \sigma^2/\tau^2}\, w_{i,s}
= \alpha_{i,T}\, \frac{w_{i,s}}{\sum_{t=1}^T w_{i,t}}.
\]
If we collect all the terms we have found the following inhomogeneous credibility estimator
\[
\widehat{\widehat{\mu(\Theta_i)}} = \alpha_{i,T}\, \frac{1}{\sum_{t=1}^T w_{i,t}} \sum_{s=1}^T w_{i,s} X_{i,s} + (1 - \alpha_{i,T})\, \mu_0 = \alpha_{i,T}\, \widehat{X}_{i,1:T} + (1 - \alpha_{i,T})\, \mu_0.
\]
This proves the first claim and an important observation is that this credibility estimator is unbiased for $\mu_0$. Therefore, it coincides with the estimator if we would have projected onto
\[
L_0(\boldsymbol{X}, 1) = L(\boldsymbol{X}, 1) \cap \left\{\widehat{\mu} \in L^2(\Omega, \mathcal{F}, P) :\, E[\widehat{\mu}] = \mu_0\right\}.
\]
The proof of the homogeneous credibility estimator goes along the same lines as the inhomogeneous one, using the method of Lagrange for replacing (8.8) by the side constraint
\[
\mu_0 = E[\widehat{\mu}] = E\left[\sum_{i,t} a_{i,t} X_{i,t}\right] = \sum_{i,t} a_{i,t}\, E[X_{i,t}] = \sum_{i,t} a_{i,t}\, \mu_0,
\]
which implies $\sum_{i,t} a_{i,t} = 1$. An alternative proof would go by using the iterative property and the linearity of orthogonal projections on subspaces. For details we refer to Section 4.6 in Bühlmann-Gisler [24]. This closes the proof of Theorem 8.17. □
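The two estimators of Theorem 8.17 translate into a few lines of code. A sketch in Python (the function name and array layout are my own; the structural parameters $\sigma^2$, $\tau^2$ are assumed to be given, their estimation follows in Section 8.2.3):

    import numpy as np

    def bs_estimators(X, w, sigma2, tau2, mu0=None):
        # X, w: arrays of shape (I, T) with observations X_{i,t} and volumes w_{i,t}
        w_i = w.sum(axis=1)                           # volumes per risk class
        X_bar = (w * X).sum(axis=1) / w_i             # observation based estimators
        alpha = w_i / (w_i + sigma2 / tau2)           # credibility weights alpha_{i,T}
        if mu0 is None:                               # homogeneous case: mu_T replaces mu_0
            mu0 = (alpha * X_bar).sum() / alpha.sum()
        return alpha, alpha * X_bar + (1.0 - alpha) * mu0

Passing a prior mean mu0 gives the inhomogeneous estimator; with mu0=None the collective mean is estimated from the data, which yields the homogeneous estimator.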
NL

Remarks 8.18 (interpretation of the BS formula of Theorem 8.17). The BS formula provides the best linear approximations to the true premium $\mu(\Theta_i)$ and the Bayesian estimator $\widehat{\mu(\Theta_i)}$ in the $L^2$-sense, see also (8.7). The inhomogeneous and the homogeneous credibility estimators are somewhat different which may also lead to different interpretations.

• For the inhomogeneous credibility estimator we assume that there is prior knowledge on $\mu(\Theta_i)$ in the form of the prior mean parameter $\mu_0$. This prior knowledge has uncertainty described by the variance parameter $\tau^2$ and the resulting estimator is the classical credibility weighted average between the portfolio experience $\widehat{X}_{i,1:T}$ and the prior knowledge $\mu_0$, which leads to the credibility weights $\alpha_{i,T}$. To calculate this estimator it is sufficient to have one risk class only.

• The homogeneous credibility estimator can be interpreted as the modified version of the inhomogeneous one if we do not have prior knowledge. In this case we extract additional information from similar portfolios. That is, we consider all risk classes simultaneously to obtain $\widehat{\mu}_T$ which replaces the prior knowledge $\mu_0$. The precision that is given to this overall knowledge $\widehat{\mu}_T$ depends on the volatility between the risk classes, i.e. on the significance of particular observations.

• The so-called credibility coefficient is defined by, see Bühlmann-Gisler [24] page 84,
\[
\kappa = \frac{\sigma^2}{\tau^2}. \qquad (8.14)
\]
It describes the ratio of the volatilities within risk classes and between risk classes. This is the crucial ratio that determines the credibility weights
\[
\alpha_{i,T} = \frac{\sum_{t=1}^T w_{i,t}}{\sum_{t=1}^T w_{i,t} + \kappa}.
\]

This latter case can now be used for the tariffication of risk factors on different risk classes, similar to the GLM Chapter 7. The overall premium is given by $\widehat{\mu}_T$, the experience of risk class $i$ is given by $\widehat{X}_{i,1:T}$ and the credibility weight $\alpha_{i,T} \in (0, 1)$ explains how this information needs to be combined to obtain the risk adjusted premium of risk class $i$.
no

8.2.3 Estimation of structural parameters

In order to apply the credibility estimators there remains the specification of the structural parameters $\sigma^2$ and $\tau^2$. We make the same choice as in Bühlmann-Gisler [24]. We define the sample estimators of risk class $i$
\[
\widehat{s}_i^2 = \frac{1}{T-1} \sum_{t=1}^T w_{i,t} \left(X_{i,t} - \widehat{X}_{i,1:T}\right)^2.
\]
A straightforward calculation shows that this is an unbiased estimator for $\sigma^2(\Theta_i)$, conditionally given $\Theta_i$. But this immediately implies that $\widehat{s}_i^2$ is an unbiased estimator for $\sigma^2$ for all $i$. Therefore, we set
\[
\widehat{\sigma}_T^2 = \frac{1}{I} \sum_{i=1}^I \widehat{s}_i^2, \qquad (8.15)
\]
with $E[\widehat{\sigma}_T^2] = \sigma^2$. Observe that one risk class is sufficient to get an estimate for $\sigma^2$ if $T > 1$.
If we have prior knowledge $\mu_0$ then $\tau^2$ should be calibrated such that it quantifies the reliability of this prior knowledge. If we use the homogeneous credibility estimator then $\tau^2$ is estimated from the volatility between the risk classes (here we need more than one risk class $i$). We define the weighted sample mean over all observations
\[
\bar{X} = \frac{1}{\sum_{i,t} w_{i,t}} \sum_{i,t} w_{i,t} X_{i,t} = \frac{1}{\sum_{i,t} w_{i,t}} \sum_i \left(\sum_t w_{i,t}\right) \widehat{X}_{i,1:T}.
\]
In analogy to (2.7) we define
\[
\widehat{v}_T^2 = \frac{I}{I-1} \sum_i \frac{\sum_t w_{i,t}}{\sum_{j,s} w_{j,s}} \left(\widehat{X}_{i,1:T} - \bar{X}\right)^2.
\]
Similar to Lemma 2.29 we can calculate the expected value of $\widehat{v}_T^2$ which then shows that we need to define
\[
\widehat{t}_T^2 = c_w \left(\widehat{v}_T^2 - \frac{I\, \widehat{\sigma}_T^2}{\sum_{j,s} w_{j,s}}\right),
\]
with constant
\[
c_w = \left[\frac{I}{I-1} \sum_i \frac{\sum_t w_{i,t}}{\sum_{j,s} w_{j,s}} \left(1 - \frac{\sum_t w_{i,t}}{\sum_{j,s} w_{j,s}}\right)\right]^{-1}.
\]
This estimator has the unbiasedness property $E[\widehat{t}_T^2] = \tau^2$, we refer to Section 4.8 in Bühlmann-Gisler [24]. The only difficulty is that it might become negative which, of course, is nonsensical for estimating $\tau^2$. Therefore, we set for the final estimator
\[
\widehat{\tau}_T^2 = \max\left\{\widehat{t}_T^2,\, 0\right\}. \qquad (8.16)
\]
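These structural parameter estimators are easy to code; a sketch in Python under the same (I, T) array layout as in the snippet after Theorem 8.17:

    import numpy as np

    def structural_estimators(X, w):
        # estimate sigma^2 via (8.15) and tau^2 via (8.16)
        I, T = X.shape
        w_i = w.sum(axis=1)                          # row volumes
        w_tot = w_i.sum()                            # total volume
        X_bar = (w * X).sum(axis=1) / w_i            # observation based estimators

        s2 = (w * (X - X_bar[:, None]) ** 2).sum(axis=1) / (T - 1)
        sigma2 = s2.mean()                           # (8.15)

        X_all = (w_i * X_bar).sum() / w_tot          # weighted overall sample mean
        v2 = I / (I - 1) * ((w_i / w_tot) * (X_bar - X_all) ** 2).sum()
        c_w = 1.0 / (I / (I - 1) * ((w_i / w_tot) * (1.0 - w_i / w_tot)).sum())
        tau2 = max(c_w * (v2 - I * sigma2 / w_tot), 0.0)   # (8.16)
        return sigma2, tau2

Applied to the claims ratios $X_{i,t} = S_{i,t}/v_{i,t}$ and volumes $w_{i,t} = v_{i,t}$ of Table 8.1 in Example 8.19 below, these two sketches should reproduce (up to rounding) the values $\widehat{\sigma}_T^2 = 261.2$, $\widehat{\tau}_T^2 = 0.1021$ and the estimates of Table 8.2.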

Example 8.19. We do Exercise 4.1 of Bühlmann-Gisler [24]. We have $I = 5$ risk classes and for every risk class we have $T = 5$ observations. The data is provided in Table 8.1. We have claims $S_{i,t}$ and corresponding numbers of policies $v_{i,t}$. In order to apply the BS model we choose volumes $w_{i,t} = v_{i,t}$, i.e. the volumes $w_{i,t}$ are determined by the number of policies in the corresponding cell $(i, t)$, and we define the claims ratios $X_{i,t} = S_{i,t}/v_{i,t}$. Our aim is to apply the BS model to $(X_{i,t})_{i,t}$. Observe that the application of the BS model is motivated by the fact that some cells have small volumes and volatile claims ratios. Therefore, Bayesian methods are applied to smooth the calculated premia.

We would like to calculate the homogeneous credibility estimator $\widehat{\widehat{\mu(\Theta_i)}}^{\rm hom}$ for the claims ratios of the risk classes $i = 1, \ldots, 5$, see Theorem 8.17. Therefore, we first need to estimate the structural parameters. With formulas (8.15) and (8.16) we obtain $\widehat{\sigma}_T^2 = 261.2$ and $\widehat{\tau}_T^2 = 0.1021$. This gives the estimated credibility coefficient $\widehat{\kappa}_T = \widehat{\sigma}_T^2/\widehat{\tau}_T^2 = 2558$ and from this we can estimate the credibility weights $\alpha_{i,T}$. The estimates are provided in Table 8.2. We see that in risk class 4 we have big volumes $v_{4,t}$ which results in a high credibility weight estimate of $\widehat{\alpha}_{4,T} = 87.8\%$. In risk class 5 we have small volumes $v_{5,t}$ which results in a low credibility weight estimate of $\widehat{\alpha}_{5,T} = 45.2\%$. From this we calculate the credibility weighted overall claims ratio $\widehat{\mu}_T = 80.4\%$ (which should be compared to the sample mean $\bar{X} = 77.9\%$) and from this we finally calculate the homogeneous credibility estimators for the claims ratios, see Table 8.2. We observe smoothing of $\widehat{X}_{i,1:T}$ towards $\widehat{\mu}_T$ according to the credibility weights $\widehat{\alpha}_{i,T}$. 

                            t=1       t=2       t=3       t=4       t=5
risk class 1   v_{1,t}      729       786       872       951      1019
               S_{1,t}      583      1100       262       837      1630
               X_{1,t}    80.0%    139.9%     30.0%     88.0%    160.0%
risk class 2   v_{2,t}     1631      1802      2090      2300      2368
               S_{2,t}       99      1298       326       463       895
               X_{2,t}     6.1%     72.0%     15.6%     20.1%     37.8%
risk class 3   v_{3,t}      796       827       874       917       944
               S_{3,t}     1433       496       699      1742      1038
               X_{3,t}   180.0%     60.0%     80.0%    190.0%    110.0%
risk class 4   v_{4,t}     3152      3454      3715      3859      4198
               S_{4,t}     1765      4145      3121      4129      3358
               X_{4,t}    56.0%    120.0%     84.0%    107.0%     80.0%
risk class 5   v_{5,t}      400       420       422       424       440
               S_{5,t}       40         0       169      1018        44
               X_{5,t}    10.0%      0.0%     40.0%    240.1%     10.0%

Table 8.1: Observed claims $S_{i,t}$ and corresponding numbers of policies $v_{i,t}$.

                                 risk class 1  risk class 2  risk class 3  risk class 4  risk class 5
$\widehat{\alpha}_{i,T}$                63.0%         79.9%         63.0%         87.8%         45.2%
$\widehat{X}_{i,1:T}$                  101.3%         30.2%        124.1%         89.9%         60.4%
$\widehat{\widehat{\mu(\Theta_i)}}^{\rm hom}$   93.5%         40.3%        107.9%         88.7%         71.3%

Table 8.2: Estimated credibility weights $\widehat{\alpha}_{i,T}$, observation based estimates $\widehat{X}_{i,1:T}$ and homogeneous credibility estimates $\widehat{\widehat{\mu(\Theta_i)}}^{\rm hom}$ of the claims ratios at time $T = 5$.

Exercise 23.
(a) Choose the data of Table 8.1 and calculate the inhomogeneous credibility estimators $\widehat{\widehat{\mu(\Theta_i)}}$ for the claims ratios under the assumption that the collective mean is given by $\mu_0 = 90\%$ and the variance between the risk classes is given by $\tau^2 = 0.20$.
(b) What changes if the variance between the risk classes is given by $\tau^2 = 0.05$?

8.2.4 Prediction error in the Bühlmann-Straub model

Observe that the credibility estimator $\widehat{\widehat{\mu(\Theta_i)}}$ is used to estimate $\mu(\Theta_i)$ and to predict next year's claim $X_{i,T+1}$. Similar to (1.9) we can analyze the total prediction error
\[
X_{i,T+1} - \widehat{\widehat{\mu(\Theta_i)}} = \left(X_{i,T+1} - \mu(\Theta_i)\right) + \left(\mu(\Theta_i) - \widehat{\widehat{\mu(\Theta_i)}}\right).
\]
If we assume that $X_{i,1}, \ldots, X_{i,T+1}$ are independent, conditionally given $\Theta_i$, then we obtain from unbiasedness
\[
E\left[\left(X_{i,T+1} - \widehat{\widehat{\mu(\Theta_i)}}\right)^2\right]
= E\left[\left(X_{i,T+1} - \mu(\Theta_i)\right)^2\right] + E\left[\left(\mu(\Theta_i) - \widehat{\widehat{\mu(\Theta_i)}}\right)^2\right]
\]
\[
= E\left[{\rm Var}(X_{i,T+1}|\Theta_i)\right] + E\left[\left(\mu(\Theta_i) - \widehat{\widehat{\mu(\Theta_i)}}\right)^2\right]
= \frac{\sigma^2}{w_{i,T+1}} + (1 - \alpha_{i,T})\, \tau^2, \qquad (8.17)
\]
see Theorem 4.3 in Bühlmann-Gisler [24]. The first term on the right-hand side of (8.17) is called process variance and the second term parameter uncertainty. Similarly we obtain for the homogeneous credibility estimator, see Theorem 4.6 in Bühlmann-Gisler [24],
\[
E\left[\left(X_{i,T+1} - \widehat{\widehat{\mu(\Theta_i)}}^{\rm hom}\right)^2\right]
= \frac{\sigma^2}{w_{i,T+1}} + (1 - \alpha_{i,T})\, \tau^2 \left(1 + \frac{1 - \alpha_{i,T}}{\sum_i \alpha_{i,T}}\right). \qquad (8.18)
\]
The expressions in (8.17) and (8.18) are called mean square error of prediction (MSEP). We will come back to this notion in Section 9.3 and for a comprehensive treatment we refer to Section 3.1 in Wüthrich-Merz [102].
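Both prediction error formulas are one-liners in code; a sketch (array names as in the earlier snippets; w_next denotes the next year's volumes $w_{i,T+1}$):

    import numpy as np

    def msep_bs(alpha, sigma2, tau2, w_next, homogeneous=True):
        # MSEP of next year's claim X_{i,T+1}, see (8.17) and (8.18)
        correction = 1.0 + (1.0 - alpha) / alpha.sum() if homogeneous else 1.0
        return sigma2 / w_next + (1.0 - alpha) * tau2 * correction

For Exercise 24 below one would, under the stated growth assumption, plug in w_next = 1.05 * w[:, -1].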
Exercise 24. Estimate the prediction uncertainty $E[(X_{i,T+1} - \widehat{\widehat{\mu(\Theta_i)}}^{\rm hom})^2]$ for the data of Example 8.19 under the assumption that the volume grows by 5% in each risk class. 


Exercise 25. We consider Example 4.1 of Bühlmann-Gisler [24]. The observed numbers of policies $v_i$ and claims counts $N_i$ in 21 different regions are given in Table 8.3.

region i        v_i       N_i
1             50061      3880
2             10135       794
3            121310      8941
4             35045      3448
5             19720      1672
6             39092      5186
7              4192       314
8             19635      1934
9             21618      2285
10            34332      2689
11            11105       661
12            56590      4878
13            13551      1205
14            19139      1646
15            10242       850
16            28137      2229
17            33846      3389
18            61573      5937
19            17067      1530
20             8263       671
21           148872     15014
total        763525     69153

Table 8.3: Observed volumes $v_i$ and claims counts $N_i$ in regions $i = 1, \ldots, 21$.

Calculate the inhomogeneous credibility estimators for each region $i$ under the assumption that $N_i|_{\Theta_i}$ has a Poisson distribution with mean $\mu(\Theta_i) v_i = \Theta_i \lambda_0 v_i$ and $E[\Theta_i] = 1$. The prior frequency parameter is given by $\lambda_0 = 8.8\%$ and the prior uncertainty by $\tau^2 = 2.4 \cdot 10^{-4}$.

Hint: For the estimation of the credibility coefficient $\kappa = \sigma^2/\tau^2$ one should use that $N_i|_{\Theta_i}$ is Poisson distributed, which has direct consequences for the corresponding variance $\sigma^2(\Theta_i)$, see also Proposition 2.8. 

Example 8.20 (MTPL frequencies). We revisit the MTPL example of Section 7.3.4. In this example we have observed that some of the risk classes $m$ have a rather small volume $v_m$ which, of course, is in favor of applying credibility methods. For this analysis we only consider a risk classification by cantons. This exactly corresponds to the marginal consideration in Figure 7.5 (rhs). We assume that the regional data fulfills the BS model assumptions with $w_m = v_m$ and, henceforth, we can use Theorem 8.17. We do this under the additional assumption of having conditional Poisson distributions for $N_m$, $m \in \{AG, AI, \ldots, ZH\}$. The latter implies for $X_m = N_m/v_m$ that, see also Exercise 25 above,
\[
\frac{\sigma^2(\Theta_m)}{w_m} = {\rm Var}(X_m|\Theta_m) = \frac{E[X_m|\Theta_m]}{v_m} = \frac{\mu(\Theta_m)}{v_m}.
\]
Therefore, we have $\sigma^2(\Theta_m) = \mu(\Theta_m)$, note that $w_m = v_m$, and
\[
\sigma^2 = E[\sigma^2(\Theta_m)] = E[\mu(\Theta_m)] = \mu_0,
\]
if $\mu_0 > 0$ denotes the prior frequency parameter (collective mean). Therefore, we do not need to estimate $\sigma^2$ for given prior frequency parameter $\mu_0$, but we only need to estimate the parameter $\tau^2$ for applying the homogeneous credibility estimator. The additional nice feature about the conditional Poisson model is that this can be done with only one period of observations. Applying the iterative algorithm of Bühlmann-Gisler [24], pages 102-103, we find the estimate $\widehat{\tau}^2$ for $\tau^2$ and $\widehat{\mu}_0 = 7.48\%$ (we have obtained sufficient convergence after 4 iterations). From this we calculate the homogeneous BS credibility estimators
\[
\widehat{\widehat{\mu(\Theta_m)}}^{\rm hom} = \alpha_m\, X_m + (1 - \alpha_m)\, \widehat{\mu}_0,
\]
with credibility weights
\[
\alpha_m = \frac{v_m}{v_m + \widehat{\mu}_0/\widehat{\tau}^2}.
\]
The resulting credibility weights $\alpha_m$ are within the interval $(35\%, 98\%)$, depending on having a small or a large canton. Remarkable is that the estimate $\widehat{\mu}_0 = 7.48\%$ is substantially higher than the sample mean of 7.15%. In Figure 8.2 we present the MLEs $\widehat{\lambda}_m^{\rm canton} = X_m$ and the credibility estimators $\widehat{\widehat{\mu(\Theta_m)}}^{\rm hom}$ for the different cantons $m \in \{AG, AI, \ldots, ZH\}$. We observe that the MLEs are smoothed towards the collective mean estimate $\widehat{\mu}_0$. This applies in particular to small cantons such as $m = AI$, whereas large cantons are only marginally affected by the collective mean estimate.
This picture could now be further refined using methods from spatial statistics (based on the intuition that neighboring cantons behave more similarly, etc.). This has, for instance, been done in Fringeli [50]. 

Figure 8.2: (lhs) MLEs $\widehat{\lambda}_m^{\rm canton}$ and (rhs) credibility estimators $\widehat{\widehat{\mu(\Theta_m)}}^{\rm hom}$ for the different cantons $m \in \{AG, AI, \ldots, ZH\}$.



Chapter 9

Claims Reserving

This chapter will give a completely new perspective on non-life insurance business which has not been covered in these notes yet. Until now we have assumed that the total claim amount for a fixed accounting year can be described by a compound distribution of the form
\[
S_t = \sum_{i=1}^{N_t} Y_i,
\]
where $t = 1, \ldots, T$ denote the different accounting years, $N_t$ counts the number of claims in accounting year $t$ and $Y_i$ describes the claim size of claim $i$. This was the base model for collective risk modeling in Chapter 2, it was used for the study of the surplus process $(C_t)_{t \in \mathbb{N}_0}$ in Chapter 5 (see Definition 5.1) and it was also the base assumption for parameter estimation (based on past claims experience) for the prediction of future claims. This model suggests that we have $N_t$ claims in accounting year $t$ and their claim sizes $Y_1, \ldots, Y_{N_t}$ describe the total payouts to the insured. The main issue in practice is that a typical non-life insurance claim cannot be settled immediately at occurrence. That is, if $Y_i$ describes the claim amount of claim $i = 1, \ldots, N_t$ in accounting year $t$ then, in general, this claim amount is not observable at time $t$ due to a settlement delay that allows for a final assessment of the claim only later. Likewise $N_t$ is not observable at the end of accounting year $t$ because there might be claims that have occurred in year $t$ but which are reported only later. We describe reasons for such reporting and settlement delays in the next section. As a consequence, we need to predict the future cash flows of claims that have occurred in the past and are only settled in the future in order to have a sound basis for pricing future insurance contracts. This task is exactly known as the claims reserving problem and it assesses the outstanding loss liabilities for past claims. The prediction of these outstanding loss liabilities constitutes the claims reserves. Importantly, these claims reserves are typically the largest position on the liability side of the balance sheet of a non-life insurance company, see Table 9.1, and are crucial for the financial strength of the company. Therefore, we aim at describing the claims reserving process in this chapter and we would also like to describe the uncertainties involved.


assets as of 31/12/2013         mio. CHF    liabilities as of 31/12/2013        mio. CHF
debt securities                     6374    claims reserves                         7189
equity securities                   1280    provisions for annuities                1178
loans & mortgages                   1882    other liabilities and provisions        2481
real estate                          908    share capital                            169
participation                       2101    legal reserve                            951
short term investments               693    free reserve, forwarded gains           1966
other assets                         696    total liabilities & equity             13934
total assets                       13934

Table 9.1: Source: Annual Report 2013 of AXA Versicherungen AG, Switzerland.

9.1 Outstanding loss liabilities

A claim in non-life insurance is triggered by an accident which is an event that causes (financial) damage covered by an insurance contract. The date of claims occurrence is called accident date. Typically, time elapses until such a claim is in the administrative system of the insurance company and is available for statistical analysis. The time lag between the accident date and the registration at the insurance company is called reporting delay and the date of registration is termed reporting date.
The reporting delay can be small, say days, but it can also be very large, for example several years. Reasons for such reporting delays are that claims are not immediately reported to the insurance company, for instance, a stolen bike is only reported once it is clear that it will not reappear, but of course the accident date is the day the bike was stolen. Large reporting delays are typically caused by claims which are not immediately noticed. A common example is an asbestos claim which is typically caused a long time before cancer is diagnosed and reported. The accident date refers to the event when there was contact with asbestos, the trigger of the cancer, and not to the date of the breakout of the asbestos disease.
Once a claim is reported to the insurance company it typically cannot be settled immediately. The insurance company starts an investigation, monitors the recovery process, waits for external information, external bills, court decisions, etc. This process may last for several years for more involved claims. Of course, the insurance company cannot wait with claims payments until there is a final assessment of the claim but it will continuously pay for justified claims benefits. Therefore, insurance claims trigger a whole sequence of cash flows after the reporting date. This period is called settlement period and the final assessment of a claim is called settlement date or closing date.
Thus, we have three important (ordered) dates for non-life insurance claims:
\[
\text{accident date } T_1 \;\leq\; \text{reporting date } T_2 \;\leq\; \text{settlement date } T_3.
\]
In addition, there are the following two important dates: beginning of insurance period $U_1$ and ending of insurance period $U_2 > U_1$; we always assume $U_2 < \infty$. Typically, the insurance company is only liable for a claim if $T_1 \in [U_1, U_2]$, thus, we only consider claims that have accident dates $T_1$ that fall into the insurance period $[U_1, U_2]$ specified in the insurance contract.

Figure 9.1: Non-life insurance run-off showing the insurance period $[U_1, U_2]$ and a claim with accident date $T_1 \in [U_1, U_2]$, reporting date $T_2 > U_2 > T_1$ and settlement date $T_3 > T_2$. Moreover, we have claims payments during the settlement period $[T_2, T_3]$.

If we denote today's time point by $t \geq U_1$ we can have four different situations:

1. $t < T_1$. Such (potential) claims have not yet occurred. If the company is lucky then $T_1 > U_2$. This means that it is not liable for this particular claim with the actual insurance policy because the contract is already terminated at claims occurrence. Be careful, the company may still be liable for this particular claim, namely, if the contract is renewed and $T_1$ falls into the renewed insurance period, but renewals are not of interest for the present (claims reserving) discussion because they correspond to insurance exposures only sold in the future.
In this first case $t < T_1$ the only information available at the insurance company is the insurance contract signed, i.e. the exposure for which it is liable in case of a claims occurrence $T_1 \in [U_1, U_2]$. Therefore, one often speaks about unearned premium if the exposure has not yet expired, i.e. if $t < U_2$.

2. $T_1 \leq t < T_2$ and $T_1 \in [U_1, U_2]$. In this case the insurance claim has occurred but it has not yet been reported to the insurance company. These claims are called Incurred But Not Yet Reported (IBNYR) claims. For such claims we do not have any individual claims information (because it is IBNYR) but we already have external information, like the economic environment (e.g. unemployment rate, inflation rate, financial distress), weather conditions and natural catastrophes (storm, flood, earthquake, etc.), nuclear power accidents, flu epidemics, and so on. This external information already gives us a hint whether we should expect more or fewer claims reportings.

3. $T_2 \leq t < T_3$ and $T_1 \in [U_1, U_2]$. These claims are reported at the company but the final assessment is still missing. Typically, we are in the situation of more and more information becoming available about the claim, i.e. the prediction uncertainty in the final claim assessment decreases. However, these claims are not completely resolved and therefore they are called Reported But Not Settled (RBNS) claims. The settlement period $[T_2, T_3]$ is also the period within which claims payments are done, see Figure 9.1.
During the settlement period we receive more and more information about the individual claim like accident date, cause of accident, type of accident, line-of-business and contracts involved, claims assessments and predictions by claims adjusters, payments already done, etc.

4. $T_3 < t$ and $T_1 \in [U_1, U_2]$. The claim is settled, the file is closed and stored and we expect no further payments for that claim. Under some circumstances, it may be necessary that a claim file is re-opened due to unexpected further claims development. If this happens too often then the files are probably closed too early and the claims settlement philosophy should be reviewed in that particular company. If there is a systematic re-opening it may also ask for a special provision for unexpected re-openings, for example, for contracts which have a timely unlimited cover for relapses.

To give statistical statements about insurance contracts and claims behavior, insurance companies build homogeneous groups and sub-portfolios to which a LLN applies. In non-life insurance, contracts are often grouped into different business lines such as private property, commercial property, private liability, commercial liability, accident insurance, health insurance, motor third party liability insurance, motor hull insurance, etc. If this classification is too rough it can further be divided into sub-portfolios, for example, private property can be divided by hazard categories like fire, water, theft, etc. Often such sub-classes are built by geographical markets and for different jurisdictions.
Once these (hopefully) homogeneous risk classes are built we study all claims that belong to such a sub-portfolio. These claims are further categorized by the accident date. Claims that fall into the same accident period are triggered by similar external factors like weather conditions and the economic environment; therefore such a classification is reasonable. Since the usual time scale for insurance contracts and business consolidation is years, claims are typically gathered on the yearly time scale. Therefore, we consider accounting years denoted by $k \in \mathbb{N}$. All claims that have accident dates $T_1 \in [1/1/k, 31/12/k]$ are called claims with accident year $k$. We abbreviate the latter interval by $[k, k+1)$. These claims generate cash flows which are also considered on the consolidated yearly level, i.e. all payments that are done in the same accounting year are aggregated. This motivates the classical claims reserving notation for fixed $i \in \mathbb{N}$ and $j \in \mathbb{N}_0$:
\[
X_{i,j} = \text{all payments done for claims with accident year } i \text{ and paid in accounting year } k = i + j \in \mathbb{N}.
\]

Thus, we consider all claims (of a given sub-portfolio) which have accident dates
T1 [i, i+1) = [1/1/i, 31/12/i], i.e. have the same accident year i. For these claims
we consider aggregate cash flows which are further sub-divided by their payment
delays denoted by j N0 and called development years. For instance,

w)
Xi,0 = payments in year [i, i + 1) for claims with accident year i;
Xi,1 = payments in year [i + 1, i + 2) for claims with accident year i;
Xi,j = payments in year [i + j, i + j + 1) for claims with accident year i.

(m
Moreover, a common assumption is that there is fixed maximal settlement delay
J N, i.e. Xi,j 0 for all development years j > J. Of course, this maximal
settlement delay J depends on the business line considered, typically it is smaller
for property insurance and larger for liability insurance. At time t N, with
t > J, this motivates the graphical representation given in Table 9.2 (note that we
identify t with 31/12/t). This table displays three time axes: (1) accident year axis
tes
accident development years j
year i 0 1 ... j ... J
1 X1,0 X1,1 X1,J
.. .. .. ..
. . . .
no

tJ XtJ,0 XtJ,1 XtJ,J

..
.

.. ..
i . observations Dt .
NL

..
. to be predicted Dtc

t1
t Xt,0 Xt,1 Xt,J

Table 9.2: Claims development triangle/trapezoid Dt at time t > J.

i {1, . . . , t} (vertical axis); (2) development year axis j {0, . . . , J} (horizontal


axis); and (3) accounting year axis k = i + j {1, . . . , t + J} (diagonal axis). In
claims reserving all three time axes are important: (1) i collects the claims with
the same accident year; (2) j describes payments with the same payment delay
(relative to the accident year); and (3) k = i + j describes the payments that are

Version March 14, 2017, M.V. Wthrich, ETH Zurich


230 Chapter 9. Claims Reserving

done in the same accounting year (and hence are influenced by the same external
factors like inflation). Therefore, we denote the accounting year payments by

tk J(k1)
X X X
Xk = Xi,j = Xi,ki = Xkj,j . (9.1)
i+j=k i=1(kJ) j=0(kt)

At time t N we are liable for all claims that have occurred in accident years i t.
We call these claims past exposure claims. Some of these past exposure claims are
already settled (if the settlement date T3 t), others belong to either the class of
RBNS claims (if the reporting date T2 t but the settlement date T3 > t) or the

w)
class of IBNYR claims (if the reporting date T2 > t).
On the aggregate level we have the following payment information at time t N
for past exposure claims

(m
Dt = {Xi,j ; i + j t, 1 i t, 0 j J} . (9.2)

This information exactly corresponds to the upper triangle (if t = J + 1) or the


upper trapezoid (if t > J +1) of Table 9.2. These past exposure claims will generate
cash flows in future accounting years given by
tes
Dtc = {Xi,j ; i + j > t, 1 i t, 0 j J} .

This corresponds to the lower triangle in Table 9.2. This lower triangle Dtc is
called outstanding loss liabilities and it is the major object of interest. Namely,
these outstanding loss liabilities constitute the liabilities of the insurance company
no

originating from past premium exposures. In particular, the company needs to


build appropriate provisions so that it is able to fulfill these future cash flows.
These provisions are called claims reserves and they should satisfy the following
requirements:
NL

the claims reserves should be evaluated such that it considers all relevant
(past) information;

the claims reserves should be a best-estimate for the outstanding loss liabili-
ties adjusted for time value of money.

Basically, this means that we need to predict the lower triangle Dtc based on all
available information Ft Dt at time t. In particular, we need to define a stochastic
model on a filtered probability space (, F, P, F) (i) that allows to incorporate
past information Ft F through a filtration F = (Ft )t ; (ii) that reflects the
characteristics of past observations Dt ; (iii) that is able to predict future payments
of the outstanding loss liabilities Dtc ; and (vi) that is able to attach time values
of money to these future cash flows Xi,j , i + j > t. Of course, this is a rather
ambitious program and we will build such a stochastic model step-by-step.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 9. Claims Reserving 231

For the time-being we skip the task of attaching time values of money to cash flows
and we only consider nominal payments. The total nominal claims payments for
accident year i are given by
J
X Ni
X
Xi,j = Yl = Si . (9.3)
j=0 l=1

Thus, for assessing the total claim amount Si of accounting year i we need to
describe the claims settlement process Xi,0 , . . . , Xi,J . In particular, we need to
predict the (unobserved) future claims cash flows of the outstanding loss liabilities

w)
to quantify the total claim Si of accounting year i. Here, Si is measured on a
nominal basis, therefore we use the symbol = in the above identity (9.3),
see also Wthrich [98]. Moreover, we see that the total claim amount of a fixed
accounting year i is by far more complex than a compound distribution.

(m
We assume that the latest observed accident/accounting year is t = I and we do
all considerations based on this (fixed) accounting year.

The (nominal) best-estimate reserves at time t = I > J for past exposure claims
are (under these model assumptions) defined by
tes
X X
R= E [Xi,j | FI ] = E [Xi,j | FI ] ,
i+j>I (i,j)IIc

where we define the sets I, II and IIc of indexes as follows


no

I = {1, . . . , I} {0, . . . , J},


II = {(i, j) I; i + j I} and IIc = I \ II .

The set IIc exactly corresponds to the lower triangle DIc . (Ft )t0 is a filtration on
(, F, P) assuming that Xi,j is Fi+j -measurable for all (i, j) I.
NL

The best-estimate reserves R are a predictor for the (nominal) outstanding loss
liabilities of past exposure claims at time t = I
X
Xi,j .
(i,j)IIc

This predictor R is based on the available information FI at time I. Often Ft and


Dt are identified, i.e. one assumes that there is no other information available than
the claims themselves.
The next question of interest is the uncertainty in this prediction called prediction
uncertainty. That is, we want to investigate the possible fluctuation of the true
cash flows around their best-estimate reserves. If the confidence interval is narrow,
we can predict the outstanding loss liabilities rather accurately. If we obtain wide

Version March 14, 2017, M.V. Wthrich, ETH Zurich


232 Chapter 9. Claims Reserving

confidence bounds, an additional risk margin is necessary which protects against


possible shortfalls in the outstanding loss liability cash flows. We will discuss this
below.

9.2 Claims reserving algorithms


The title of this section contains the term algorithms. Initially, in the insur-
ance industry, actuaries have designed algorithms that enable to determine claims
reserves R. These algorithms should be understood as mechanical guidelines to

w)
obtain claims reserves. Only much later actuaries started to think about stochastic
models underlying these algorithms. In this section we present claims reserving
from this algorithmic point of view and in the next section we present stochastic
models that support these algorithms.

(m
The two most popular algorithms are the so-called chain-ladder (CL) algorithm
and the Bornhuetter-Ferguson (BF) algorithm [16]. These two algorithms take
different viewpoints. The CL algorithm takes the position that the observations DI
are extrapolated into the lower triangle, the BF algorithm takes the position that
the lower triangle DIc is extrapolated independently of the observations DI using
expert knowledge. Depending on the line of business considered and the progress
tes
of the claims development process one or the other method may provide better
predictions. Only actuarial experience may tell which one should be preferred in
which particular situation. Therefore, we are going to present both algorithms
from a rather mechanical point of view, because we cannot provide applied insight
to a given data set.
no

9.2.1 Chain-ladder algorithm


For the study of the CL algorithm we need to define (nominal) cumulative payments
j
NL

X
Ci,j = Xi,l .
l=0
That is, we sum all payments Xi,l , l 0, for a fixed accident year i so that
ultimately we obtain Ci,J = Si , if Si denotes the total (nominal) claim amount
that corresponds to accident year i, see also (9.3). We call Ci,J ultimate (nominal)
claim of accident year i.

CL idea. All accident years i {1, . . . , I} behave similarly and for cumulative
payments we have approximately

Ci,j+1 fj Ci,j , (9.4)

for given factors fj > 0. These factors fj are called CL factors, age-to-age factors
or link ratios.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 9. Claims Reserving 233

Structure (9.4) immediately provides the intuition for predicting the ultimate claim
Ci,J based on observations DI , namely, choose for every accident year i the obser-
vation on the last observed diagonal, that is Ci,Ii , and multiply this observation
with the successive CL factors fIi , . . . , fJ1 .

The remaining difficulty is that, in general, the CL factors fj are not known and,
henceforth, need to be estimated. Assuming that a volume weighted average pro-
vides the most reliable results we set in view of (9.4)

w)
PIj1
Ci,j+1 Ij1 Ci,j Ci,j+1
fbCL = Pi=1
X
j Ij1 = PIj1 . (9.5)
i=1 C i,j i=1 n=1 C n,j C i,j

(m
This formula (9.5) expresses that we should divide the sums
of observed successive columns by each other which exactly
reflects link ratio structure (9.4). Thus, we calculate a vol-
ume weighted average of the individual loss development ra-
tios Fi,j+1 = Ci,j+1 /Ci,j which have been observed in DI . In
Table 9.4 we provide as example the claims reserving example
tes
of Wthrich-Merz [102].
Equipped with these CL factor estimators we predict the ulti-
mate (nominal) claim Ci,J for i + J > I at time t = I by
J1
CL
fbjCL ,
Y
no

Cbi,J = Ci,Ii (9.6)


j=Ii

and, in general, we set


n1
CL
fbjCL ,
Y
Cbi,n = Ci,Ii
j=Ii
for i + n > I.
NL

The CL reserves at time t = I for accident years i > I J are given by



J1
cCL = C
b CL C fbjCL 1 ,
Y
Ri i,J i,Ii = Ci,Ii

j=Ii

and aggregated over all accident years we predict the (nominal) outstanding loss
liabilities of past exposure claims by the CL predictor
I
cCL = cCL .
X
R Ri
i=IJ+1

A numerical example is presented in Tables 9.3, 9.4 and 9.5, below.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


234

accident development years j


year i 0 1 2 3 4 5 6 7 8 9
1 5946975 3721237 895717 207760 206704 62124 65813 14850 11130 15813
2 6346756 3246406 723222 151797 67824 36603 52752 11186 11646
3 6269090 2976223 847053 262768 152703 65444 53545 8924
4 5863015 2683224 722532 190653 132976 88340 43329
5 5778885 2745229 653894 273395 230288 105224
6 6184793 2828338 572765 244899 104957
7 5600184 2893207 563114 225517
8 5288066 2440103 528043
9 5290793 2357936
NL
10 5675568

Table 9.3: Observed payments Xi,j with (i, j) II with I = J + 1 = 10.


no
accident development years j
year i 0 1 2 3 4 5 6 7 8 9
1 5946975 9668212 10563929 10771690 10978394 11040518 11106331 11121181 11132310 11148124
2 6346756 9593162 10316383 10468180 10536004 10572608 10625360 10636546 10648192
tes
3 6269090 9245313 10092366 10355134 10507837 10573282 10626827 10635751
4 5863015 8546239 9268771 9459424 9592399 9680740 9724068
5 5778885 8524114 9178009 9451404 9681692 9786916

Version March 14, 2017, M.V. Wthrich, ETH Zurich


6 6184793 9013132 9585897 9830796 9935753
7 5600184 8493391 9056505 9282022
(m
8 5288066 7728169 8256211
9 5290793 7648729
10 5675568
fbjCL 1.4925 1.0778 1.0229 1.0148 1.0070 1.0051 1.0011 1.0010 1.0014
w)
Table 9.4: Observed cumulative payments Ci,j with (i, j) II and estimated CL factors fbjCL .
Chapter 9. Claims Reserving
accident development years j
year i 0 1 2 3 4 5 6 7 8 9 Ri
bCL
1 0
2 10663318 15126
3 10646884 10662008 26257
4 9734574 9744764 9758606 34538
5 9837277 9847906 9858214 9872218 85302
6 10005044 10056528 10067393 10077931 10092247 156494
7 9419776 9485469 9534279 9544580 9554571 9568143 286121
8 8445057 8570389 8630159 8674568 8683940 8693030 8705378 449167
9 8243496 8432051 8557190 8616868 8661208 8670566 8679642 8691971 1043242
NL
10 8470989 9129696 9338521 9477113 9543206 9592313 9602676 9612728 9626383 3950815
Chapter 9. Claims Reserving

total 6047061

CL cCL .
Table 9.5: CL predicted cumulative payments Cbi,j , (i, j) IIc , and estimated CL reserves Ri
no
accident prior BF reserves CL reserves
year i estimate CL C BF C CL R R
i i
bi bIi bi,J bi,J bBF bCL
1 11653101 100.0% 11148124 11148124
tes
2 11367306 99.9% 10664316 10663318 16124 15126
3 10962965 99.8% 10662749 10662008 26998 26257
4 10616762 99.6% 9761643 9758606 37575 34538
5 11044881 99.1% 9882350 9872218 95434 85302

Version March 14, 2017, M.V. Wthrich, ETH Zurich


6 11480700 98.4% 10113777 10092247 178024 156494
7 11413572 97.0% 9623328 9568143
(m 341305 286121
8 11126527 94.8% 8830301 8705378 574089 449167
9 10986548 88.0% 8967375 8691971 1318646 1043242
10 11618437 59.0% 10443953 9626383 4768384 3950815
total 7356580 6047061
w)
Table 9.6: Claims reserves from the BF method and the CL method.
235
236 Chapter 9. Claims Reserving

9.2.2 Bornhuetter-Ferguson algorithm


The Ronald Bornhuetter and Ronald E. Ferguson (BF)
method [16] is based on the assumption of having prior informa-
tion b i for the expected ultimate claim of accident year i. This
prior information then allows to predict DIc as soon as we have a
so-called claims development pattern (j )j=0,...,J which describes
the proportions paid in each development year. Thus, the BF
method extrapolates prior knowledge b i into the lower triangle
DIc by using a development pattern (j )j=0,...,J . R. Bornhuetter

w)
BF idea. All accident years i {1, . . . , I} behave similarly and payments approx-
imately behave as
Xi,j j b i , (9.7)

(m
for given prior information b i and given development pattern (j )j=0,...,J with nor-
malization Jj=0 j = 1.
P

The prior value b i should reflect an estimate for the total expected
ultimate claim E[Ci,J ] of accounting year i. It is assumed that this
tes
prior value is given externally by expert opinion which, in theory,
should not be based on DI . There only remains to estimate the
development pattern (j )j . In view of the CL method, one defines
the following estimates for the development pattern:
no

J1 Qj1 bCL
1 l=0 fl
bCL
Y
j = = QJ1 bCL . R.E. Ferguson
l=j fblCL l=0 fl
This ratio exactly reflects the proportion paid after the first j development periods
(according to the estimated CL pattern). Therefore, we define estimates
NL

b0CL = b0CL ,
bjCL = bjCL bj1
CL
for j = 1, . . . , J 1,
bJCL = 1 bCL J1 .

Equipped with these estimates we predict the ultimate claim Ci,J for i + J > I in
the BF method by
J  
BF
bjCL = Ci,Ii + b i 1 bIi
CL
X
Cbi,J = Ci,Ii + b i . (9.8)
j=Ii+1

The BF reserves at time t = I for accident years i > I J are given by


J  
cBF = bjCL = b i 1 bIi
CL
X
Ri bi ,
j=Ii+1

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 9. Claims Reserving 237

and aggregated over all accident years we predict the (nominal)


outstanding loss liabilities of past exposure claims by
I
cBF = cBF = b i bjCL .
X X
R Ri
i=IJ+1 (i,j)IIc

An example is provided in Table 9.6.

We conclude this section with a comparison of the BF and CL

w)
predictors. Therefore, we modify CL formula (9.6) as follows

J1 J1
CL 1
fbjCL 1
Y Y
Cbi,J = Ci,Ii + Ci,Ii .
j=Ii j=Ii fbjCL

(m
This gives the following comparison
 
CL CL CL
Cbi,J = Ci,Ii + 1 bIi Cbi,J ,
 
BF CL
Cbi,J = Ci,Ii + 1 bIi b i .
tes
Thus, we see that we have the same structure. The only difference is that for
the BF method we use the external prior estimate b i for the ultimate claim and
CL
in the CL method the observation based estimate Cbi,J . Therefore, we have two
complementing prediction positions, which exactly gives the explanation mentioned
in the introduction to Section 9.2. For further remarks (also detailed remarks on
no

the example in Tables 9.3-9.6) we refer to Wthrich-Merz [102].

9.3 Stochastic claims reserving methods


NL

In the previous section we have presented algorithms for the calculation of the
claims reserves R. Of course, we should also estimate the precision of these pre-
P
dictions, i.e. by how much the true payouts (i,j)IIc Xi,j may deviate from these
predictions R, see also (1.9) and Smith-Thaper [93]. This brings us back to the
notion of risk measures of Section 6.2.4. In claims reserving, the most popular
risk measure is the conditional mean square error of prediction (MSEP) because it
can be calculated or estimated explicitly in many examples. Assume X c is a D -
I
measurable predictor for the random variable X. The conditional MSEP is defined
by

   2 
msepX|DI X
c =E X X DI .
c (9.9)

Version March 14, 2017, M.V. Wthrich, ETH Zurich


238 Chapter 9. Claims Reserving

The conditional MSEP is an L2 -distance measure. This conditional MSEP can


be decoupled into two parts, the so-called process uncertainty and the parameter
estimation error as follows, see also (1.9),

c 2.
   
c = Var (X| D ) + E [ X| D ] X
msepX|DI X (9.10)
I I

If all parameters are known and if we can calculate E [ X| DI ] explicitly then we


c = E [ X| D ] because this minimizes the conditional MSEP in (9.10).
should set X I

w)
In any other case we try to estimate E [ X| DI ] as accurately as possible and then
we try to determine the possible sources of parameter error and uncertainty in this
estimation. In order to analyze this prediction uncertainty we need to put the
claims reserving algorithms into a stochastic framework.

(m
For the CL method there are different stochastic models that provide the CL re-
serves as predictors:
distribution-free CL model of Thomas Mack [73],

over-dispersed Poisson (ODP) model of Renshaw and Verrall


[85] and of England and Verrall [42] with MLE parameter esti-
mates,
tes
Bayesian CL model of Gisler and Wthrich [55] and of Bhlmann, T. Mack
De Felice, Gisler, Moriconi and Wthrich [23].
Macks distribution-free CL model [73] is probably the most popular stochastic
no

claims reserving model. It is straightforward from a stochastic point of view and


it is easy to implement. The crucial contribution by Mack was the derivation
of an estimate for the parameter estimation error term. In the present text we
do not consider Macks distribution-free CL model, but we provide the gamma-
gamma Bayesian CL model in detail. This model belongs to the family of Bayesian
NL

CL models for which the conditional MSEP can be calculated explicitly. We will
compare the conditional MSEP formula of the gamma-gamma Bayesian CL model
to the famous Mack formula.
For the BF method there are different approaches such as:
BF ODP model of Alai, Merz and Wthrich [3, 4],

BF model of Mack [74],

BF model of Saluz, Gisler and Wthrich [90],

Bayesian BF model of England, Verrall and Wthrich [43].


Some of these models also use estimates of j different from the ones previously
suggested. In the present text we are not going to consider stochastic models for
the BF method and we refer to specialized lectures on stochastic claims reserving.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 9. Claims Reserving 239

9.3.1 Gamma-gamma Bayesian CL model


In this section we consider an explicit distributional Bayesian model that belongs
to the exponential dispersion family with conjugate priors. The advantage of such
an explicit distributional model is that we can calculate the posterior distribution
analytically. This allows us to determine the quantities of interest in closed form.

Model Assumptions 9.1 (gamma-gamma Bayesian CL model). Assume that


j > 0, j = 0, . . . , J 1, are given fixed constants.

(a) Conditionally, given vector = (0 , . . . , J1 ), (Ci,j )j=0,...,J are independent

w)
(in i) Markov processes (in j) with conditional distributions
 
Ci,j+1 |Ci,j , Ci,j j2 , j j2 .

(m
(b) j are independent and (j , fj (j 1))-distributed with given prior param-
eters fj > 0 and j > 1 for j = 0, . . . , J 1.

(c) and C1,0 , . . . , CI,0 are independent and Ci,0 > 0, P-a.s., for all i = 1, . . . , I.

For given parameters we have conditional means


tes
E [ Ci,j+1 | Ci,j , ] = 1
j Ci,j .

From this we see that 1j plays the role of the CL factor introduced in (9.4). We
have
1
no

h i
E 1 j = fj (j 1) = fj .
j 1
This explains the choices of the prior parameters of the distribution of j : fj
corresponds to the prior mean of 1
j and j is used to calibrate prior uncertainty.
For the conditional variance we have
Var (Ci,j+1 | Ci,j , ) = Ci,j j2 2
NL

j . (9.11)
The joint likelihood function of observations DI and parameters is given by
  Ci,j1
j1 2 Ci,j1
j1
1
( )
Y 2
j1 2
j1 j1
h(DI , ) =   Ci,j exp 2 Ci,j
(i,j)II ,j1 Ci,j1 j1
2
j1
J1
Y (fj (j 1))j j 1
g(C1,0 , . . . , CI,0 ) j exp {j fj (j 1)} .
j=0 (j )
g(C1,0 , . . . , CI,0 ) denotes the density of the first column j = 0. Applying Bayes
rule provides for the posterior distribution of , conditionally given DI ,
 
PIj1 Ci,j PIj1 Ci,j+1
Y j +
J1 1 j fj (j 1)+
i=1 2 i=1 2
j
h(|DI ) j e j
.
j=0

Version March 14, 2017, M.V. Wthrich, ETH Zurich


240 Chapter 9. Claims Reserving

We have just proved the following lemma:

Lemma 9.2. Under Model Assumptions 9.1, the posteriors of 0 , . . . , J1 are


conditionally, given DI , independent with

Ij1 Ij1
X Ci,j X Ci,j+1
j |DI j + , fj (j 1) + .
i=1 j2 i=1 j2

Corollary 9.3. Under Model Assumptions 9.1, the posterior Bayesian CL factors

w)
are given by
def.
h i
fbjBCL = E 1 D bCL + (1 )f ,
I = j f j

j j j

with CL factor estimate fbjCL given by (9.5) and credibility weight

(m
PIj1
i=1 Ci,j
j = PIj1 (0, 1). (9.12)
i=1 Ci,j + j2 (j 1)

Proof. The proof is a straightforward application of the gamma distributional properties, namely
Ij1
" #
 1  1 X Ci,j+1
E j DI fj (j 1) +
tes
=
j2
PIj1 Ci,j
j + i=1 2 1
j i=1
PIj1 Ci,j PIj1 Ci,j+1
j 1 i=1 j2 i=1 j2
= PIj1 Ci,j
fj + PIj1 Ci,j PIj1 Ci,j
.
j 1 + i=1 j2
j 1 + i=1 j2 i=1 j2

2
no

This proves the claim.

Remarks 9.4.

Lemma 9.2 and Corollary 9.3 are the key for the derivation of the reserves.
NL

The result says that in the gamma-gamma Bayesian CL model the Bayesian
CL factors should be estimated by a credibility weighted average between the
classical CL estimate fbjCL and the prior estimate fj with credibility weight
j (0, 1). Moreover, for j 0, we can consider the product of these
estimates fbjBCL due to posterior independence, this will be highlighted in
more detail in Theorem 9.5, below.

The parameter j describes the degree of information contained in the prior


distribution of j . If we let j 1 (non-informative priors) we obtain
j 1. In this case we give full credibility to the observation based estimate,
i.e. we have fbjBCL fbjCL in the non-informative limit j 1.

Observe that the individual development factors Fi,j+1 = Ci,j+1 /Ci,j satisfy
the Bhlmann-Straub (BS) model, see Model 8.13: conditionally given j

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 9. Claims Reserving 241

and C1,j , . . . , CI,j , the Fi,j+1 are independent with

E [ Fi,j+1 | Ci,j , j ] = (j ) = 1
j , (9.13)
j2 (j ) j2 2
j
Var (Fi,j+1 | Ci,j , j ) = = . (9.14)
Ci,j Ci,j
Thus, Ci,j plays the role of the volume measure and j2 () = j2 2 j plays
the role of the variance function. We calculate, see (8.4) and (8.5),
1
j2 = Var((j )) = fj2 ,
j 2

w)
2 2 j 1
ej2 = E[j2 2
j ] = j fj .
j 2
This implies for the credibility coefficient, see (8.14),

(m
ej2
j = = j2 (j 1).
j2
Therefore, Corollary 9.3 provides the classical BS formula and the structure
of the credibility weights is given by, see Theorem 8.17 and (9.12),
PIj1
i=1 Ci,j
tes
j = PIj1 .
i=1 Ci,j + j
Note that the BS formula requires j > 2 otherwise the credibility coefficient
j cannot be calculated. However, (9.12) is more general in this sense because
the second prior moment of 1
j does not need to exist for Corollary 9.3.
no

Theorem 9.5. Under Model Assumptions 9.1, the Bayesian CL predictor for Ci,J
with i + J > I is given by
J1
BCL
fbjBCL .
Y
Cbi,J = E [ Ci,J | DI ] = Ci,Ii
NL

j=Ii

Proof. We use conditional independence between different accident years, the conditional Markov
property and the tower property to obtain

J1
Y
BCL 1

Ci,J
b = E [ E [ Ci,J | Ci,0 , . . . , Ci,Ii , ]| DI ] = Ci,Ii E j DI .
j=Ii

Using the posterior independence of Lemma 9.2 and Corollary 9.3 proves the claim. 2

Remark 9.6. Theorem 9.5 explains that our Model Assumptions 9.1 give the CL
reserves if we let the prior distributions of 1
j become non-informative, i.e. for
j 1, j = I i, . . . , J 1, we have
BCL CL
Cbi,J Cbi,J . (9.15)

Version March 14, 2017, M.V. Wthrich, ETH Zurich


242 Chapter 9. Claims Reserving

For this reason we can use the (non-informative prior) gamma-gamma Bayesian
CL model as a stochastic representation of the CL algorithm (9.6). This analogy
allows to study prediction uncertainty within Model Assumptions 9.1 for the CL
algorithm in an asymptotic sense.

For the conditional MSEP we obtain, see (9.10),


   2
BCL BCL
msepCi,J |DI Cbi,J = Var (Ci,J | DI ) + E [ Ci,J | DI ] Cbi,J
= Var (Ci,J | DI ) .

)
BCL
This shows the optimality of the Bayesian CL predictor Cbi,J within our model

w
assumptions and there remains the calculation of the conditional variance of the
ultimate claim Ci,J . We define (subject to being well-defined)

(m
j2
j = PIj1 .
j2 (j 2) + l=1 Cl,j

Note that j is DI1 -measurable, i.e. observable at time t = I 1.

Theorem 9.7. Under Model Assumptions 9.1 the Bayesian CL predictor satisfies
tes
for i > I J
  J1 J1
Y 
BCL BCL
j2 fbnBCL (1 + n )
X
msepCi,J |DI Cbi,J = Cbi,J
j=Ii n=j

no

 2 J1
BCL
Y
+ Cbi,J (1 + j ) 1 ,
j=Ii

In1
Cl,n /n2 > 2 for all I i n
P
under the additional assumption that n + l=1
J 1; otherwise the second moment is infinite. The aggregated conditional MSEP
is given by
NL

!
 
BCL BCL
X X
msepP C |DI Cbi,J = msepCi,J |DI Cbi,J
i i,J
i i

J1
BCL b BCL
X Y
+2 Cbi,J Cl,J (1 + j ) 1 ,
i<l j=Ii

where the summations run over I J + 1 i I and I J + 1 i < l I,


respectively.

Proof. We first decouple accident years


! !
X X X
BCL
msep Ci,J = Var Ci,J DI = Cov (Ci,J , Cl,J | DI ) .
P b
Ci,J |DI
i
i i i,l

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 9. Claims Reserving 243

We calculate these covariance terms. Applying the tower property for conditional expectations
implies for i, l > I J

Cov (Ci,J , Cl,J | DI ) = E [ Cov (Ci,J , Cl,J | DI , )| DI ] (9.16)


+ Cov (E [ Ci,J | DI , ] , E [ Cl,J | DI , ]| DI ) .

We start with the first term on the right-hand side of (9.16). Observe that this term is zero for
i 6= l because of the conditional independence between different accident years. Therefore, we
only need to consider the case i = l > I J. For this case we have, applying the tower property
and using conditional independence and the conditional Markov property,

Var (Ci,J | DI , ) = E [ Var ( Ci,J | Ci,J1 , )| DI , ] + Var ( E [ Ci,J | Ci,J1 , ]| DI , )

w)
2
2 1
  
= E Ci,J1 J1 J1 DI , + Var Ci,J1 J1 DI ,

J2
Y
= Ci,Ii 1
j
2
J1 2 2
J1 + J1 Var ( Ci,J1 | DI , ) .
j=Ii

(m
Hence, we obtain the well-known recursive formula for the process variance in the CL method
(see Section 3.2.2 in Wthrich-Merz [102]). By iterating the recursion we find for given (see
also Lemma 3.6 in Wthrich-Merz [102])

J1
X j1
Y J1
Y
2 2
Var (Ci,J | DI , ) = Ci,Ii 1
m j j 2
n , (9.17)
j=Ii m=Ii n=j+1
tes
where empty products are set equal to 1. Applying operator E[|DI ] to (9.17) and using the
posterior independence of the random variables j we obtain

J1
X j1
Y J1
Y
BCL 2
E 2
 
E [ Var (Ci,J | DI , )| DI ] = Ci,Ii fbm j n
DI
j=Ii m=Ii n=j
no


J1 j1 J1
PIn1 Cl,n
X Y Y n 1 + l=1 2
n
BCL 2 (fbnBCL )2
= Ci,Ii fbm j PIn1 Cl,n
j=Ii m=Ii n=j n 2 + l=1 2
n
J1
X J1
Y 
BCL
= C
bi,J j2 fbnBCL (1 + n ) .
j=Ii n=j
NL

PIn1
Note that in the second step we need n + l=1 Cl,n /n2 > 2 for all I i n J 1 so
that these conditional expectations are finite. For the second term in (9.16) we have, w.l.o.g. we
assume l i > I J,

J1
Y J1
Y

1 1

Cov (E [ Ci,J | DI , ] , E [ Cl,J | DI , ]| DI ) = Ci,Ii Cl,Il Cov j , j D I
j=Ii j=Il

Ii1
Y J1
Y
J1
Y
J1
Y

1 2 1 1

= Ci,Ii Cl,Il E j j DI E
j DI E
j DI
j=Il j=Ii j=Ii j=Il

J1
Y
BCL b BCL
= C bi,J Cl,J (1 + j ) 1 .
j=Ii

This proves the statements. 2

Version March 14, 2017, M.V. Wthrich, ETH Zurich


244 Chapter 9. Claims Reserving

We analyze the terms of Theorem 9.7 involving j . Under assumption


Ij1
j2
X
 Cl,j , (9.18)
l=1

we obtain
0 j  1.
Note that assumption (9.18) is stronger than j + Ij1 Cl,j /j2 > 2 which provides
P
l=1
finiteness of conditional variances in Theorem 9.7. Assumption (9.18) for all j then
implies for the first term in Theorem 9.7

w)
J1 J1
Y  J1 J1
Cb BCL j2 fbBCL (1 + Cb BCL j2 fbnBCL
X X Y
i,J n n) i,J
j=Ii n=j j=Ii n=j

BCL
2 J1
X j2
= Cbi,J .

(m
BCL
j=Ii Cbi,j
In fact, the right-hand side is a lower bound for the left-hand side for any j > 1
(where the second posterior moment exists). For the second term in Theorem 9.7
we have under (9.18)

 2 J1  2 J1
BCL BCL
Y X
tes
Cbi,J (1 + j ) 1 Cbi,J j .
j=Ii j=Ii

In fact, the right-hand side is again a lower bound for the left-hand side for any
j > 1.
no

This implies that under assumption (9.18) for all I i j J 1 we have


approximation

BCL

BCL
  2 J1
X j2
msepCi,J |DI Cbi,J Cbi,J
BCL
+ j , (9.19)
j=Ii Cbi,j
NL

where the right-hand side is a lower bound for the left-hand side for any j > 1.

Since the latter formula applies to any j > 1 (it can even be made uniform in
j 1) we can consider its non-informative limit j 1. In this case the Bayesian
CL predictor converges to the classical CL predictor, see (9.15). For the j -terms
we obtain in the non-informative limit
j2 j2 j2
lim j = lim PIj1 = , PIj1 PIj1
j 1 j 1 j2 (j 2) + l=1 Cl,j j2 +
Cl,j l=1 Cl,j l=1
(9.20)
where in the last step we have again used (9.18). In fact the last approximation is
again a lower bound. This motivates in the non-informative prior case j 1 the
following approximation and lower bound to (9.19) and Theorem 9.7, respectively,

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 9. Claims Reserving 245


 
b CL = (C
b CL )2
J1
s2j /(fbjCL )2 s2j /(fbjCL )2
msepMack
X
Ci,J |DI Ci,J i,J

CL
+ PIj1 , (9.21)
j=Ii Cbi,j l=1 Cl,j

where we set j2 = s2j /(fbjCL )2 . The conditional MSEP formula (9.21) is exactly the
famous Mack formula [73]. We emphasis important remarks and differences:
Remarks 9.8.
Mack [73] has derived formula (9.21) in Macks distribution-free CL model,

w)
we have derive formula (9.21) as approximation (and lower bound) to the
non-informative prior case of the gamma-gamma Bayesian CL model. These
two stochastic models are different and, therefore, our derivation cannot be
considered as a conditional MSEP formula in Macks distribution-free CL

(m
model.

Both models, Macks distribution-free CL model and the non-informative


prior gamma-gamma Bayesian CL model, have in common that they provide
the CL reserves. Therefore, both models can be used to derive a conditional
MSEP formula for the CL algorithm of Section 9.2.1.
tes
This implies that Macks model and our model are different and the deriva-
tions of the corresponding conditional MSEP formulas are different. However,
we have proved that under assumption (9.18) we expect that the numerical
results of the two approaches are very close. This will be justified in Ex-
no

ample 9.9, below. Assumption (9.18) is fulfilled in many applied data sets
and, therefore, it is a relief because both methods come to similar conclusions
about prediction uncertainties in many applied situations.

We use variance parameters j2 , Mack [73] uses variance parameters s2j . Their
relationship is justified by identity (9.11), see also (9.14).
NL

The blue terms are the process uncertainty terms and the red terms are the
parameter estimation error terms in Macks formula. For more interpretation
we refer to Section 9.4, below, and to Merz-Wthrich [80].
For aggregated accident years, one has under assumption (9.18) approximation and
lower bound to Theorem 9.7 given by

!
 
Mack
Cb CL msepMack b CL
X X
msepP C |DI i,J = Ci,J |DI Ci,J (9.22)
i i,J
i i
X
CL b CL
J1
X s2j /(fbjCL )2
+2 Cbi,J Cl,J PIj1 .
i<l j=Ii n=1 Cn,j

Version March 14, 2017, M.V. Wthrich, ETH Zurich


246 Chapter 9. Claims Reserving

Again, the red term describes the parameter estimation error in Macks formula
and for interpretation we refer to Merz-Wthrich [80].

Example 9.9 (gamma-gamma Bayesian CL model and Macks formula). We come


back to the claims reserving example presented in Table 9.4. We consider the
gamma-gamma Bayesian CL model with non-informative priors j 1 in Theorem
9.7. In this non-informative prior case we have j = 1, see (9.12), and therefore
we obtain fbjBCL = fbjCL and Cbi,JBCL CL
= Cbi,J in the non-informative prior limit. This

w)
immediately implies that the claims reserves for the outstanding loss liabilities in
this non-informative prior gamma-gamma Bayesian CL model are given by Tables
9.5 and 9.6.
There remains the calculation of the prediction uncertainty in this non-informative

(m
prior Bayesian CL model. In order to do this we need an estimate for j2 . From
(9.14) we see that j2 () = j2 2 j which is compared to s2j = j2 (fbjCL )2 . If we
estimate 2
j by (fbjCL )2 then we can find estimates bj2 = sb2j /(fbjCL )2 once we have
estimated s2j . The estimation of the latter is done rather ad-hoc by the classical
estimates of Macks distribution-free CL model, see Lemma 3.5 in Wthrich-Merz
[102],
tes
Ij1 !2
2 1 X Ci,j+1 CL
sbj = Ci,j fj
b . (9.23)
I j 2 i=1 Ci,j
For triangles I = J + 1 the variance parameter s2J1 cannot be estimated because
we do not have sufficiently many observations in this last column. In practice, one
no

therefore often uses Macks [73] estimate (which is based on exponential decay),
see also (3.13) in Wthrich-Merz [102],
n o
sb2J1 = min sb4J2 /sb2J3 ; sb2J2 ; sb2J3 . (9.24)

This provides the estimates in Table 9.7.


NL

0 1 2 3 4 5 6 7 8 9
sbj 135.25 33.80 15.76 19.85 9.34 2.00 0.82 0.22 0.06
j
b 90.62 31.36 15.41 19.56 9.27 1.99 0.82 0.22 0.06

Table 9.7: Estimated standard deviation parameters in the non-informative priors


gamma-gamma Bayesian model, where we set bj = sbj /fbjCL .

These parameters provide the results for the square-rooted conditional MSEPs
given in Table 9.8. We observe that for the total claims reserves the 1 standard
deviation confidence bounds are about 7.7% of the total claims reserves. These
confidence bounds should now be put in relation to the point estimator in the
balance sheet of Table 9.1 for the claims reserves.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 9. Claims Reserving 247

accident CL reserves msep1/2 msep1/2 in %


year i RcCL Bayes Mack reserves
i
1
2 15126 267 267 1.8%
3 26257 914 914 3.5%
4 34538 3058 3058 8.9%
5 85302 7628 7628 8.9%
6 156494 33341 33341 21.3%
7 286121 73467 73467 25.7%

w)
8 449167 85399 85398 19.0%
9 1043242 134338 134337 12.9%
10 3950815 410850 410817 10.4%
covariance1/2 116811 116810

(m
total 6047061 462990 462960 7.7%

Table 9.8: Claims reserves and prediction uncertainty in the non-informative priors
gamma-gamma Bayesian CL model (see Theorem 9.7) and Macks formula (9.21)-
(9.22).
tes
We also observe that the exact formula given by Theorem 9.7 with non-informative
priors and Macks formula (9.21)-(9.22) are very close, i.e. 462990 versus 462960.
This observation holds true for many typical non-life insurance data sets and it
says that both models (though being different) come to the same conclusion about
no

prediction uncertainty. 

9.3.2 Over-dispersed Poisson model


Another stochastic model that has attracted a lot of attention
NL

in the insurance industry is the so-called over-dispersed Poisson


(ODP) model. It goes back to Renshaw-Verrall [85]. Peter
D. England and Richard J. Verrall [42] have popularized
the model a lot. It belongs to the family of GLM models and
it is quite attractive because bootstrap simulation can easily be
applied.
R.J. Verrall
Model Assumptions 9.10 (over-dispersed Poisson model).
Assume there exist positive parameters 1 , . . . , I , 0 , . . . , J and such that all
Xi,j are independent (in i and j) with

Xi,j
Poi(i j /).

Version March 14, 2017, M.V. Wthrich, ETH Zurich


248 Chapter 9. Claims Reserving

Observe that

E[Xi,j ] = i j ,
Var(Xi,j ) = i j .

We have a cross-classified mean with i modeling the exposure


of accident year i and j the development pattern of the payout P.D. England
delay j, see also (9.7). In order to make the parameters i
and j uniquely identifiable we need a side constraint. The two

w)
commonly used side constraints are either

J
X
1 = 1 or j = 1.
j=0

(m
The first option is more convenient in the application of GLM methods, the second
option gives an explicit meaning to the pattern (j )j , namely, it corresponds to the
cash flow pattern.
The best-estimate reserves at time I are given by
X X
R= E [ Xi,j | DI ] = i j .
tes
(i,j)IIc (i,j)IIc

Hence, we need to estimate the parameters i and j . This is done with MLE
methods. We assume J + 1 = I which simplifies notation. Having observations DI
allows to estimate the parameters. The log-likelihood function for = (1 , . . . , I ),
no

= (0 , . . . , J ) and is given by
X
`DI (, , ) = i j / + (Xi,j /) log(i j /) log((Xi,j /)!).
(i,j)II
NL

Calculating the derivatives w.r.t. and and setting them equal to zero implies
that we need to solve the following system of equations to find the MLEs

Ij
X Ij
X
j i = Xi,j for all j = 0, . . . , J, (9.25)
i=1 i=1
Ii
X Ii
X
i j = Xi,j for all i = 1, . . . , I, (9.26)
j=0 j=0

w.l.o.g. under side constraint Jj=0 j = 1. The remarkable fact about the MLE
P

system (9.25)-(9.26) is that it can be solved explicitly and that it provides the CL
reserves. Moreover, the constant dispersion parameter cancels and is not relevant
for estimating the reserves.

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 9. Claims Reserving 249

Theorem 9.11. Under Model Assumptions 9.10, the MLEs for and under
side constraint Jj=0 j = 1, given DI , are given by
P


J1
CL 1 1
b MLE bjMLE =
Y
i = Cbi,J and CL
1 bCL ,
k=j fk
b fj1

for i = 1, . . . , I and j = 1, . . . , J (an empty product is set equal to 1). Moreover,


b0MLE = J1 bCL
Q
k=0 1/fk . For the estimated reserves we have

J
cODP =
b MLE cCL .
bjMLE = R
X
R

w)
i i i
j=Ii+1

Proof. For the proof we refer to Lemma 2.16, Corollary 2.18 and Remarks 2.19 in Wthrich-Merz
[102]. Basically, the proof goes by induction along the last observed diagonal in DI . 2

Remarks 9.12.

(m
Theorem 9.11 goes back to Hachemeister-Stanard [58], Kremer [67] and Mack
[72].
tes
Theorem 9.11 explains the popularity of the ODP model for claims reserving
because it provides exactly the CL reserves. Thus, we have found a second
stochastic model (besides the non-informative prior gamma-gamma Bayesian
CL model) that can be used to explain the CL algorithm from a stochastic
no

point of view.

In this ODP model we can also give an estimate for the conditional MSEP.
This estimate uses that MLEs can be approximated by standard Gaussian
asymptotic results for GLM. For details we refer England-Verrall [42] and
Wthrich-Merz [102], Section 6.4.3. Another way to assess prediction uncer-
NL

tainty is to use bootstrap simulation.

The ODP framework also allows to give an estimate for the conditional MSEP
in the BF method, and it justifies the choice bjCL = jk=0 bkMLE . For details
P

we refer to Alai et al. [3, 4].

9.4 Claims development result


9.4.1 Definition of the claims development result
In the previous sections we have given a static point of view of claims reserving.
However, claims reserving should be understood as a dynamic process, where more
and more information becomes available over time and prediction is continuously

Version March 14, 2017, M.V. Wthrich, ETH Zurich


250 Chapter 9. Claims Reserving

adapted to this new knowledge. This is also the viewpoint that needs to be taken
for solvency considerations.
We consider the run-off situation, and thus the last accident year I is kept fixed.
In the run-off situation the flow of information (9.2) is changed to (we do a slight
abuse of notation here)

Dt = {Xi,j ; i + j t, 1 i I, 0 j J} .

This generates a filtration denoted by (Dt )t0 on (, F, P) that describes the flow
of information (we abbreviate Dt = (Dt )). At time t I the ultimate claim of

w)
accident year i is predicted by the best-estimate
(t)
Cbi,J = E [ Ci,J | Dt ] . (9.27)

This is the predictor that minimizes the conditional MSEP at time t. The best-

(m
estimate reserves at time t I for accident year i > t J are provided by
(t) (t)
Ri = Cbi,J Ci,ti . (9.28)

In accounting year t + 1 we then collect new information resulting in Dt+1 and we


do payments Xi,ti+1 = Ci,ti+1 Ci,ti . This allows to define the so-called claims
development result (CDR) of accident year i > t J in accounting year t + 1 by,
tes
see Michael Merz and Wthrich [78],

 
(t) (t+1) (t) (t+1)
CDRi,t+1 = Ri Xi,ti+1 + Ri = Cbi,J Cbi,J . (9.29)
no

The claims development result CDRi,t+1 explains how we change


the prediction of the ultimate claim when new information be-
comes available. If the claims development result is negative
we have a loss in the P&L statement because we have under-
NL

estimated the outstanding loss liabilities at time t, otherwise we


have a gain. This is exactly the classical earning statement view
in order to understand the risk that derives from the develop-
ment of the outstanding loss liabilities.
M. Merz
The tower property immediately gives the following crucial state-
ment:

Corollary 9.13. Assume Ci,J has finite first moment and i + J > t I. Then we
have
E [ CDRi,t+1 | Dt ] = 0.

This corollary explains that in average we neither expect losses nor gains in the
claims development result but the prediction is just unbiased. Note that (9.27)

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 9. Claims Reserving 251

defines a martingale in t (under integrability) and remark that martingales have


uncorrelated innovations (claims development results). Our aim is to study the
uncertainty in this position measured by the conditional MSEP. For simplicity we
set t = I and assume i + J > I. Then we define
h i
msepCDRi,I+1 |DI (0) = E (CDRi,I+1 0)2 DI = Var (CDRi,I+1 | DI )

 
(I+1)
= Var Cbi,J DI . (9.30)

We aim to study the volatility of this one-period update. We do this in the gamma-
gamma Bayesian CL Model 9.1.

w)
9.4.2 One-year uncertainty in the Bayesian CL model
Firstly, observe that Lemma 9.2 easily extends to the following lemma (the proof

(m
is an exercise).

Lemma 9.14. Choose t I. Under Model Assumptions 9.1, the posteriors of


0 , . . . , J1 are independent, conditionally given Dt , with

(tj1)I (tj1)I
X Ci,j X Ci,j+1
j |Dt j + , fj (j 1) + .
tes
2
i=1 j i=1 j2

The Bayesian CL predictor for Ci,J , i + J > t I, is given by


J1
(t) Y (t)
Cbi,J = E [ Ci,J | Dt ] = Ci,ti fbj ,
no

j=ti

(t)
with posterior expected Bayesian CL factors given by fbj = E[1
j |Dt ].

Here, we slightly change notation, the upper index now indicates the time point t
of the available information Dt .
NL

Next we exploit the recursive structure of credibility estimators, see for instance
Corollary 8.6. This holds true in quite some generality, for the current exposition
we restrict to t {I, I + 1} because these are the only indexes of interest for the
analysis of (9.30). For t = I + 1 and j 0 we have (in the last step we use the
calculation of the proof of Corollary 9.3)
PIj Ci,j+1
h i fj (j 1) + i=1 j2
(I+1)
fbj = E 1
j DI+1 =

PIj Ci,j
j 1 + i=1 2
j
CIj,j+1 PIj1 Ci,j+1
j2
fj (j 1) + i=1 j2
= PIj Ci,j + PIj Ci,j
j 1 + i=1 2 j 1 + i=1 2
j j

(I) CIj,j+1  (I)



(I)
= j + 1 j fbj ,
CIj,j

Version March 14, 2017, M.V. Wthrich, ETH Zurich


252 Chapter 9. Claims Reserving

with DI -measurable credibility weight

(I) CIj,j
j = (0, 1).
j2 (j 1) + Ij
P
i=1 Ci,j

(I+1)
The important observation is that there is only one random term in fbj , condi-
tionally given DI . This is crucial in the calculation of the conditional MSEP of the
claims development result prediction. We start with a technical lemma.

w)
Lemma 9.15. Under Model Assumptions 9.1 we have for I i + 1 J

(m
 2
(I) (I)
Var (Ci,Ii+1 | DI ) = Cbi,Ii+1 (Ii )1 Ii ,

Pi1 2
under the additional assumption that Ii + l=1 Cl,Ii /Ii > 2; otherwise the
second moment is infinite.
tes
Proof. In the first step we apply Theorem 9.7 for J 1 = I i and then we derive

2 (I) (I)
Var (Ci,Ii+1 | DI ) = Ci,Ii Ii (fbIi )2 (1 + Ii ) + Ci,Ii
2
(fbIi )2 Ii
no

 2 2
 
= b (I)
C Ii
(1 + Ii ) + Ii
i,Ii+1
Ci,Ii
 2  2  
= b (I)
C Ii
+ 1 (1 + Ii ) 1 .
i,Ii+1
Ci,Ii

We calculate the square bracket. It is given by


NL

!
 2   2  2
Ii Ii Ii
+ 1 (1 + Ii ) 1 = +1 1+ Pi1 1
Ci,Ii Ci,Ii 2 (
Ii Ii 2) + l=1 Cl,Ii
2 2 2 2
Ii Ii Ii Ii
= + i1
+ Pi1
Ci,Ii 2 ( Ci,Ii Ii
2 (
P
Ii Ii 2) + l=1 Cl,Ii Ii 2) + l=1 Cl,Ii

Pi1
2 ( 2) + l=1 Cl,Ii + Ci,Ii + Ii 2
= 2
Ii Ii Ii  
2 (
P i1
Ci,Ii Ii Ii 2) + l=1 Cl,Ii

(I)
(Ii )1 (I)
= 2
Ii Pi1 = (Ii )1 Ii .
2 (
Ii 2) + C
Ii l=1 l,Ii

This proves the claim. 2

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 9. Claims Reserving 253

Theorem 9.16. Under Model Assumptions 9.1 the Bayesian CL predictor satisfies
i>I J

  J1  
(I) 1 + ( (I) )1 Ii (I)
(Cbi,J )2
Y
msepCDRi,I+1 |DI (0) = Ii 1+ j j 1 ,
j=Ii+1

where we assume that j + Ij1 Cl,j /j2 > 2 for all I i j J 1; otherwise
P
l=1
the conditional MSEP is infinite. For aggregated accident years we have
X
msepP CDRi,I+1 |DI (0) = msepCDRi,I+1 |DI (0)

)
i
i

J1

w
 
X (I) (I) Y (I)
+2 Cbi,J Cbl,J (1 + Ii ) 1 + j j 1 ,
i<l j=Ii+1

the summations run over I J + 1 i I and I J + 1 i < l I, respectively.

Proof. We first decouple accident years

msepP
i
CDRi,I+1 |DI (0) = Var
X

i
i,J


b (I+1) DI
C
(m !
=
X

i,l

Cov Cb (I+1) , C
i,J l,J

b (I+1) DI .
tes
We calculate these covariance terms. Observe
J1 J1  
(I) CIj,j+1
 
b (I+1) = Ci,Ii+1 (I+1) (I) b(I)
Y Y
Ci,J fbj = Ci,Ii+1 j + 1 j fj .
CIj,j
j=Ii+1 j=Ii+1

The only random terms under the measure P[|DI ] are Ci,Ii+1 , Ci1,Ii+2 , . . . , CIJ+1,J . All
no

these random variables belong to different accident years i and to different development periods
j. Therefore, they are independent given DI , this follows from Model Assumptions 9.1 and
Lemma 9.2. Moreover, we have the following unbiasedness of successive estimations (use the
tower property)
 
(I) CIj,j+1
  h i
(I) b(I) (I+1) (I)
E j + 1 j fj DI = E fbj DI = fbj .
CIj,j
NL

In the first step we decouple the covariance as follows


  h i
Cov C b (I+1) , C
b (I+1) DI = E Cb (I+1) C
b (I+1) DI C
b (I) C
b (I)
i,J l,J i,J l,J i,J l,J ,

with

h i J1 J1
b (I+1) C
b (I+1) DI = E Ci,Ii+1 (I+1)
Y Y
(I+1)

E C fbj Cl,Il+1 fbm DI .

i,J l,J
j=Ii+1 m=Il+1

We first treat the variance case i = l. In that case we have using conditional independence

J1  2
(I) CIj,j+1
h i  
b (I+1) )2 DI (I) b(I)
Y
= E (Ci,Ii+1 )2

E (C i,J j + 1 j fj DI
CIj,j
j=Ii+1
" #
 J1 2
 
(I) CIj,j+1
 
(I) b(I)
Y
= E (Ci,Ii+1 )2 DI

j + 1 j fj DI ,

E
CIj,j
j=Ii+1

Version March 14, 2017, M.V. Wthrich, ETH Zurich


254 Chapter 9. Claims Reserving

which allows to calculate each term individually. Unbiasedness and Lemma 9.15 for i = I j
imply for these individual terms
 
(I) CIj,j+1
h i  
(I+1) 2 (I) b(I) (I)
E (fbj ) DI = Var j + 1 j fj DI + (fbj )2
CIj,j
(I)
!2
j (I)
= Var (CIj,j+1 | DI ) + (fbj )2
CIj,j
(I)
!2
j  2
= b (I)
C
(I) (I)
(j )1 j + (fbj )2
Ij,j+1
CIj,j
 
(I) (I)
= (fbj )2 j j + 1 .

w)
Similarly we have for the first term
 2
E (Ci,Ii+1 )2 DI b (I)
 
= Var (Ci,Ii+1 | DI ) + C i,Ii+1
 2  

(m
= b (I)
C (
(I) 1
) Ii + 1 .
i,Ii+1 Ii

Collecting all the terms proves the statement for i = l. There remains the case of different
accident years. W.l.o.g. we assume i < l which implies I i + 1 > I l + 1. This and conditional
independence, given DI , imply for the covariance between these accident years

h i J1 J1
b (I+1) DI
b (I+1) C b(I+1) Cl,Il+1
Y Y
b(I+1) DI

E C i,J l,J = E Ci,Ii+1 fj f m
tes

j=Ii+1 m=Il+1
Ii1 h i J1 h i
(I+1) (I+1) 2
Y Y
(I)
= Cl,Il fbm E Ci,Ii+1 fbIi DI E (fbj ) DI
m=Il j=Ii+1
h   i J1  
b (I) (I+1) (I) (I) (I)
Y
= Cl,Ii Cov Ci,Ii+1 , fbIi DI + Ci,Ii (fbIi )2 j j + 1 (fbj )2
no

j=Ii+1
J1  
b (I) [Ii + 1]
b (I) C (I)
Y
= Ci,J l,J j j + 1 .
j=Ii+1

This completes the proof. 2


NL

We study the conditional MSEP formula of the claims development result under
assumption (9.18). This assumption implies again that 0 j  1. Moreover, we
(I)
have j (0, 1) from which we see that (9.18) implies
(I)
0 j j  1.

The other term in Theorem 9.16 is more sophisticated. We have from the proof of
Lemma 9.15
2
!
(I) Ii
(Ii )1 Ii = + 1 (1 + Ii ) 1.
Ci,Ii

If in addition to (9.18) we assume


2
Ii  Ci,Ii , (9.31)

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 9. Claims Reserving 255

then we also obtain


(I)
0 (j )1 j  1.
Moreover, we get approximation (and lower bound) under (9.18) and (9.31)
2
(I) Ii
(Ii )1 Ii + Ii .
Ci,Ii

This implies that under assumptions (9.18) and (9.31) we obtain approximation

J1
2

w)
(I) (I)
msepCDRi,I+1 |DI (0) (Cbi,J )2 Ii + Ii +
X
j j , (9.32)
Ci,Ii j=Ii+1

where the right-hand side is a lower bound for the left-hand side for any j > 1.

(m
This formula should be compared to (9.19). We will give interpretations below,
after formula (9.34). Since (9.32) applies to any j > 1 we can again consider its
non-informative limit j 1: the Bayesian CL predictor converges to the classical
(I) CL
CL predictor, Cbi,J Cbi,J , see (9.15), and for the j -terms we obtain, see (9.20),
tes
j2
lim j PIj1 .
1j Cl,j
l=1

For the credibility weights we have

CIj,j CIj,j def. e(I)


no

(I)
lim j = lim PIj = PIj = j . (9.33)
j 1 j 1 2 (j 1) + i=1 Ci,j
j l=1 Cl,j

This motivates in the non-informative prior case j 1 the following approxima-


tion and lower bound to (9.32) and Theorem 9.16, respectively,
NL

CL 2
s2Ii /(fbIi
"
)
msepMW
CDRi,I+1 |DI (0) = (Cb CL )2
i,J (9.34)
Ci,Ii
CL 2
s2Ii /(fbIi J1 2 bCL 2 #
) e(I) sj /(fj )
X
+ P i1 + j Ij1 ,
l=1 Cl,Ii
P
j=Ii+1 l=1 Cl,j

where we set j2 = s2j /(fbjCL )2 . This is the Merz-Wthrich (MW) formula, see (3.17)
in [78]. We also refer to Bhlmann et al. [23] and Merz-Wthrich [80].

Remarks 9.17.

Concerning derivation and stochastic model choice for the MW formula (9.34)
the same Remarks 9.8 apply as for Macks formula (9.21).

Version March 14, 2017, M.V. Wthrich, ETH Zurich


256 Chapter 9. Claims Reserving

Macks formula (9.21) is often called total run-off uncertainty and the MW
formula (9.34) corresponds to the one-year run-off uncertainty. Comparing
these two formulas we observe that from the total run-off uncertainty the first
blue term with index j = I i also appears in the one-year run-off uncertainty.
This is the process variance in period j = I i. From the red terms, the
first red term with index j = I i appears (parameter uncertainty) and the
remaining red terms j I i + 1 of the summation in (9.21) are scaled
(I)
with factor ej (0, 1) to obtain the one-year run-off uncertainty . These
scalings reflect the release of parameter uncertainty when new information (a

w)
new diagonal in the claims development triangle) arrives.

The same interpretation applies to (9.32) versus (9.19).


For aggregated accident years, one has estimate

msepMW msepMW

(m
X
CDRi,I+1 |DI (0) = CDRi,I+1 |DI (0) (9.35)
P
i
i
"
CL 2
s2Ii /(fbIi J1 2 bCL )2#
) (I) s /(f
Cb CL Cb CL ej PjIj1j
X X
+2 i,J l,J Pi1 + .
i<l n=1 Cn,Ii j=Ii+1 n=1 Cn,j
tes
Example 9.18. We revisit claims reserving Example 9.9 and calculate the claims
development result uncertainty. We consider the non-informative prior case and

CL reserves total msep1/2 CDR msep1/2 CDR/total


no

i Rb CL Mack (9.21) MW (9.34) msep1/2


i
2 15126 267 267 100%
3 26257 914 884 97%
4 34538 3058 2948 96%
5 85302 7628 7018 92%
NL

6 156494 33341 32470 97%


7 286121 73467 66178 90%
8 449167 85398 50296 59%
9 1043242 134337 104311 78%
10 3950815 410817 385773 94%
total 6047061 462960 420220 91%

Table 9.9: Claims reserves and prediction uncertainty: Macks formula (9.21)-(9.22)
for the total run-off uncertainty and MW formula (9.34)-(9.35) for the one-year
claims development uncertainty.

we choose the same parameter estimates as in Example 9.9. Moreover, we consider


the MW formula (9.34)-(9.35).

Version March 14, 2017, M.V. Wthrich, ETH Zurich


Chapter 9. Claims Reserving 257

The results are presented in Table 9.9. We see that in this example the one-year
claims development result uncertainty measured by the square-rooted conditional
MSEP results in 91% of the total run-off uncertainty. The reason for this high
value is that knowing the next diagonal in the claims development triangle already
releases a major part of the claims run-off risks. For the next accounting year
we predict payments of 3873205 which is almost 2/3 of the total claims reserves,
i.e. we expect a rather fast claims settlement in this example and a fast decrease
of run-off uncertainties. Typically, the square-rooted conditional MSEP of the
claims development result is in the range of 50% to 95% relative to the total run-

w)
off uncertainty, the former relates to liability insurance and the latter to property
insurance. 

Exercise 26 (Italian motor third party liability insurance example). We revisit the
Italian motor third party liability insurance example of Bhlmann et al. [23]. The

(m
field study considers 12 12 run-off triangles of 37 Italian insurance companies at
the end of 2006. For these data the claims reserves and the corresponding square-
rooted conditional MSEPs for the total run-off uncertainty and for the one-year
claims development result uncertainty using Macks formula (9.22) and the MW
formula (9.35), respectively, were calculated. The results are presented in Table
9.10. Note that for confidentiality reasons the volumes of the 4 biggest companies
tes
were all set equal to 100.0 and the order of these 4 companies is arbitrary.
Give interpretations to these results. 

9.4.3 The full picture of run-off uncertainty


no

Note that in Theorem 9.16 and in the MW formula (9.34)-(9.35) we have only
derived the uncertainties in the next accounting year I + 1. A natural question is
what can we say about the individual uncertainties in all future accounting years?
This is exactly the question answered in Merz-Wthrich [80]. We would like to
briefly summarize these results (without proofs) because they give further insight
NL

in the run-off of risk behavior of claims development triangles.


We consider the total prediction error as a telescoping sum of successive claims
(i+J)
development results. Note that we have Cbi,J = Ci,J , P-a.s., because this ulti-
mate claim is observable at time t = i + J. This and the definition of the claims
development result imply for the total prediction error at time t = I
i+J i+JI
(I) X (k1) (k) X
Cbi,J Ci,J = Cbi,J Cbi,J = CDRi,I+k ,
k=I+1 k=1

for i > I J, see (9.29). This telescoping sum describes all innovations of the
claims development process. These innovations have mean zero (martingale), see
Corollary 9.13. This immediately implies that they are uncorrelated. Under the
assumption that the second moment exists, uncorrelatedness provides the following

Version March 14, 2017, M.V. Wthrich, ETH Zurich


258 Chapter 9. Claims Reserving

company | business volume in % | msep^{1/2} total (in % reserves) | msep^{1/2} CDR (in % reserves) | msep^{1/2} CDR / msep^{1/2} total (in %)
1 | 100.0 | 4.03 | 3.24 | 80.4
2 | 100.0 | 2.90 | 2.36 | 81.4
3 | 100.0 | 2.41 | 1.98 | 82.3
4 | 100.0 | 3.45 | 2.85 | 82.6
5 | 61.8 | 3.66 | 3.04 | 82.9
6 | 56.9 | 5.54 | 4.50 | 81.2
7 | 53.0 | 4.52 | 3.70 | 81.8
8 | 49.4 | 4.60 | 3.82 | 83.1
9 | 46.2 | 5.61 | 4.59 | 81.8
10 | 41.6 | 5.32 | 4.36 | 82.0
... | ... | ... | ... | ...
30 | 3.5 | 18.02 | 14.78 | 82.0
31 | 3.4 | 17.23 | 13.92 | 80.8
32 | 2.6 | 18.73 | 14.89 | 79.5
33 | 2.5 | 23.11 | 19.10 | 82.6
34 | 2.2 | 20.83 | 17.53 | 84.2
35 | 2.0 | 17.01 | 13.87 | 81.5
36 | 1.8 | 26.16 | 21.54 | 82.4
37 | 1.8 | 27.79 | 22.25 | 80.1
total | | 0.96 | 0.78 | 81.8

Table 9.10: Italian motor third party liability insurance example of Bühlmann et al. [23]. Prediction uncertainties: Mack's formula (9.22) for the total run-off uncertainty and MW formula (9.35) for the one-year claims development uncertainty.

decoupling property of the total prediction uncertainty

$$\mathrm{msep}_{C_{i,J}|\mathcal{D}_I}\!\left(\widehat{C}^{(I)}_{i,J}\right) \;=\; \sum_{k=1}^{i+J-I} \mathrm{Var}\left(\mathrm{CDR}_{i,I+k} \,\big|\, \mathcal{D}_I\right) \;=\; \sum_{k=1}^{i+J-I} \mathrm{msep}_{\mathrm{CDR}_{i,I+k}|\mathcal{D}_I}(0) \qquad (9.36)$$

$$=\; \sum_{k=1}^{i+J-I} E\left[\, \mathrm{msep}_{\mathrm{CDR}_{i,I+k}|\mathcal{D}_{I+k-1}}(0) \,\Big|\, \mathcal{D}_I \,\right].$$

The first line of (9.36) describes the total run-off uncertainty over the entire settlement period of the claims; the second line considers the claims development result volatilities based on today's knowledge $\mathcal{D}_I$; and the third line considers the expected one-year run-off uncertainties of all future periods. Thus, formula (9.36) exactly explains how the total run-off uncertainty needs to be split (dynamically) across all future development periods. In Theorem 9.16 and the MW formula (9.34)


we have only derived the first term with index k = 1 of this sum on the right-hand
side (in the gamma-gamma Bayesian CL model).
In the non-informative prior gamma-gamma Bayesian CL model all terms $k = 1, \ldots, i+J-I$ can be estimated, see Merz-Wüthrich [80], and these estimates have exactly the same structure as the MW formula (9.34). They are estimated by

$$\widehat{\mathrm{msep}}^{\,MW}_{\mathrm{CDR}_{i,I+k}|\mathcal{D}_I}(0) \;=\; \widehat{E}\left[\, \mathrm{msep}_{\mathrm{CDR}_{i,I+k}|\mathcal{D}_{I+k-1}}(0) \,\Big|\, \mathcal{D}_I \,\right] \qquad (9.37)$$

$$=\; \left(\widehat{C}^{\,CL}_{i,J}\right)^2 \left[ \frac{\widehat{\sigma}^2_{I-i+k-1}/(\widehat{f}^{\,CL}_{I-i+k-1})^2}{\widehat{C}^{\,CL}_{i,I-i+k-1}} \;+\; \prod_{m=1}^{k-1}\left(1 + e^{(I)}_{I-i+m}\right) \frac{\widehat{\sigma}^2_{I-i+k-1}/(\widehat{f}^{\,CL}_{I-i+k-1})^2}{\sum_{l=1}^{i-k} C_{l,I-i+k-1}} \right]$$

$$\;+\; \left(\widehat{C}^{\,CL}_{i,J}\right)^2 \sum_{j=I-i+k}^{J-1}\; \prod_{m=0}^{k-2}\left(1 + e^{(I)}_{j-m}\right) e^{(I)}_{j-k+1}\, \frac{\widehat{\sigma}^2_j/(\widehat{f}^{\,CL}_j)^2}{\sum_{l=1}^{I-j-1} C_{l,j}}.$$

Note that this latter formula is an approximation in the non-informative prior gamma-gamma Bayesian CL model. This is indicated by the symbol $\widehat{E}$ on the first line of (9.37). For its derivation we refer to Merz-Wüthrich [80]. The coloring in formula (9.37) is exactly the same as in the MW formula (9.34), and also the same interpretations apply. Note that (9.37) provides the natural split of Mack's formula (9.21) across all future accounting years.
For aggregated accident years we have for $k \ge 1$ the estimator
$$\mathrm{msep}^{\,MW}_{I+k|I} \;\overset{\text{def.}}{=}\; \mathrm{msep}^{\,MW}_{\sum_{i=I-J+k}^{I} \mathrm{CDR}_{i,I+k}|\mathcal{D}_I}(0) \;=\; \sum_{i=I-J+k}^{I} \mathrm{msep}^{\,MW}_{\mathrm{CDR}_{i,I+k}|\mathcal{D}_I}(0)$$

$$\;+\; 2\sum_{i<l} \widehat{C}^{\,CL}_{i,J}\, \widehat{C}^{\,CL}_{l,J} \prod_{m=1}^{k-1}\left(1 + e^{(I)}_{I-i+m}\right) \frac{\widehat{\sigma}^2_{I-i+k-1}/(\widehat{f}^{\,CL}_{I-i+k-1})^2}{\sum_{n=1}^{i-k} C_{n,I-i+k-1}} \qquad (9.38)$$

$$\;+\; 2\sum_{i<l} \widehat{C}^{\,CL}_{i,J}\, \widehat{C}^{\,CL}_{l,J} \sum_{j=I-i+k}^{J-1}\; \prod_{m=0}^{k-2}\left(1 + e^{(I)}_{j-m}\right) e^{(I)}_{j-k+1}\, \frac{\widehat{\sigma}^2_j/(\widehat{f}^{\,CL}_j)^2}{\sum_{n=1}^{I-j-1} C_{n,j}},$$

where the last two summations run over $I-J+k \le i < l \le I$.

These formulas are all implemented in the R package ChainLadder, see [52]. Let us describe the relevant code; for more details we refer to [52].

# loading the package and bringing data in appropriate triangular form
> library(ChainLadder)
> tri <- as.triangle(as.matrix(data.cumulative))
> dimnames(tri) <- list(origin=1:nrow(tri), dev=1:ncol(tri))

# illustrating data using standard plots in R
> plot(tri, ylab="", main="")
> plot(tri, lattice=TRUE, ylab="", main="")

# calculating the CL reserves and the corresponding MSEPs
> M <- MackChainLadder(tri, est.sigma="Mack")

# CL reserves and Mack's formula (9.21)-(9.22) including illustrations
> M
> plot(M)
> plot(M, lattice=TRUE)

# split of (9.21)-(9.22) into process variance and parameter error
> M$Mack.ProcessRisk[, ncol(tri)]
> M$Total.ProcessRisk[ncol(tri)]
> M$Mack.ParameterRisk[, ncol(tri)]
> M$Total.ParameterRisk[ncol(tri)]

# CL reserves and the MW formula (9.34)-(9.35)
> CDR(M)

# full uncertainty picture (9.37)-(9.38)
> CDR(M, dev="all")
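As a consistency check, the decoupling property (9.36) can be verified numerically on this output: the squared one-year uncertainties of all future accounting years should add up to the squared total run-off uncertainty. A minimal sketch (the column names follow ChainLadder version 0.2.0 [52] and may differ in other versions):

# sketch: numerical check of the decoupling property (9.36)
> full <- CDR(M, dev="all")
> se.cols <- grep("CDR", colnames(full))   # columns CDR(1)S.E., CDR(2)S.E., ...
> sqrt(rowSums(full[, se.cols]^2))         # should reproduce the Mack S.E. column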

Example 9.19. We revisit Example 9.9 and calculate for this example the full run-off uncertainty picture using (9.37)-(9.38). We start by illustrating the data using the above R commands. This provides Figure 9.2.

Figure 9.2: Illustration of the data of Table 9.4 (the labeling of the development year axis is shifted by 1).


The graphs show that the data is rather regular, with a small decrease of volume over accident years. Moreover, most of the payments are done in the first two development years j = 0, 1.
Figure 9.3: Predicted claims development including 1 standard deviation confidence bounds (the labeling of the development year axis is shifted by 1) and observed residuals in the upper triangle. (Panels: Mack chain-ladder results; chain-ladder developments by origin period; standardised residuals against fitted values, origin period, calendar period and development period.)
In Figure 9.3 the graphs of Figure 9.2 are complemented by the predicted payments in the lower triangle. These graphs also include the 1 standard deviation confidence bounds (top left and right-hand side). Moreover, Figure 9.3 (lhs) provides residuals in the direction of all three time axes and ordered by the size of the observations. These residuals should not show any trends in any of the (time) axes. We see that there might be some problem in the accident year direction. The decrease in the accounting/calendar year direction should not be overstated because the first two years contain rather scarce information.
Finally, in Table 9.11 we provide the full run-off picture. This table summarizes in the 5th column the expected future accounting year cash flows for $t > I$

$$\sum_{i+j=t} E\left[ X_{i,j} \,|\, \mathcal{D}_I \right] \;=\; \sum_{i+j=t} C_{i,I-i} \left( \prod_{l=I-i}^{j-2} \widehat{f}^{(I)}_l \right) \left( \widehat{f}^{(I)}_{j-1} - 1 \right),$$

and in the 2nd column the corresponding expected run-off of the claims reserves for $t \ge I$

$$E\left[ R^{(t)} \,\Big|\, \mathcal{D}_I \right] \;=\; \sum_{i+j \ge t+1} E\left[ X_{i,j} \,|\, \mathcal{D}_I \right].$$

Moreover, the table provides in the 6th column the square-rooted expected one-year uncertainties $(\mathrm{msep}^{\,MW}_{t+1|I})^{1/2}$ for $t \ge I$ and in the 3rd column the expected run-off of the total uncertainty calculated as

$$\left( \sum_{s \ge t} \mathrm{msep}^{\,MW}_{s+1|I} \right)^{1/2},$$


accounting year t | exp. run-off of reserves E[R^{(t)}|D_I] | rooted exp. run-off of MSEP | in % reserves | expected cash flows sum_{i+j=t} E[X_{i,j}|D_I] | (msep^{MW}_{t+1|I})^{1/2}
10 | 6047061 | 462960 | 8% | | 420220
11 | 2173856 | 194285 | 9% | 3873205 | 150544
12 | 1048144 | 122813 | 12% | 1125712 | 93390
13 | 570584 | 79758 | 14% | 477560 | 72882
14 | 293063 | 32397 | 11% | 277521 | 31459
15 | 148951 | 7739 | 5% | 144112 | 7172
16 | 67824 | 2906 | 4% | 81127 | 2803
17 | 36036 | 769 | 2% | 31788 | 744
18 | 13655 | 191 | 1% | 22381 | 191
19 | 0 | 0 | | 13655 | 0

Table 9.11: Full run-off picture of Example 9.9.

where the first term t = I corresponds to the square-rooted Mack formula (9.22). We conclude that we now have the full run-off picture: the 2nd column displays the expected run-off of the claims reserves and the 3rd column provides the expected run-off of the prediction uncertainty (measured by the square-rooted remaining conditional MSEPs). This is in particular of interest for risk margin calculations in solvency considerations. □
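The expected accounting year cash flows in the 5th column of Table 9.11 can, for instance, be extracted from the Mack chain-ladder output. The following is a minimal sketch: M$FullTriangle contains the observed and the CL predicted cumulative claims, and the accounting year labels live on a 1-based grid, i.e. they are shifted relative to the labels used in the text.

# sketch: expected cash flows per future accounting year from M$FullTriangle
> inc <- t(apply(M$FullTriangle, 1, function(x) diff(c(0, x))))  # incremental claims
> t.idx <- row(inc) + col(inc)             # accident year index plus development index
> future <- t.idx > nrow(inc) + 1          # cells below the last observed diagonal
> tapply(inc[future], t.idx[future], sum)  # expected accounting year cash flows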

Outlook. All claims reserving methods presented above consider aggregated claims reserving triangles. However, typically there is much more information available at the individual claims level. Claims reserving methods based on individual claims information are not yet established, but they may be derived with machine learning methods as presented in Wüthrich [100].



Chapter 10

Solvency Considerations

In the previous chapters we have mainly discussed the modeling of insurance contracts, the related liability cash flows and the implications for tariffication. Recalling the discussion in Chapter 1, the insurance company organizes the equal balance within the community. That is, it issues insurance contracts at a fixed premium and in return it promises to cover all (financial) claims that fall under these contracts. Of course, we need to make sure that the insurance company can keep its promises. This is exactly the crucial task of supervision (regulation) and sound risk management practice. Regulation aims to protect the policyholder in that it enforces (by law) that the insurance company follows good risk management practice. Companies should be sufficiently well capitalized so that they can fulfill their promises also under certain stress scenarios. This is exactly what we would like (and need) to study in the present chapter.

We have already touched upon this issue in Chapter 5 on ruin theory. The main purpose of Chapter 5 was to explain that there is a huge difference in ruin behavior between light tailed and heavy tailed claims. Beyond that insight, the random walk model of Chapter 5 is much too simple to reflect real world insurance problems. Therefore, we modify the ultimate ruin probability considerations so that they reflect the current risk management task. In a first step we will discuss more general risk management views, for a comprehensive discussion we refer to Wüthrich-Merz [103], and in a second step we discuss more explicitly the solvency and risk management implementations used in the insurance industry.

10.1 Balance sheet and solvency


In Chapter 1 of Wüthrich-Merz [103] we have provided the balance sheet of an insurance company. It may look as follows (we only provide the positions that are relevant for non-life insurance companies):


assets | liabilities
cash and cash equivalents | deposits
debt securities | policyholder deposits
bonds | reinsurance deposits
loans | borrowings
mortgages | money market
real estate | hybrid debt
equity | convertible debt
equity securities | insurance liabilities
private equity | claims reserves
investments in associates | premium reserves
hedge funds | annuities
derivatives | derivatives
futures, swaptions, equity options |
insurance and other receivables | insurance and other payables
reinsurance assets | reinsurance liabilities
property and equipment | employee benefit plan
intangible assets | other provisions
goodwill |
deferred acquisition costs |
income tax assets | income tax liabilities
other assets | other liabilities

Table 10.1: Balance sheet of a non-life insurance company at a fixed point in time.

Table 10.1 presents a snapshot of a non-life insurance company's balance sheet, that is, it reflects all positions at a certain moment in time $t \in \mathbb{R}_+$. The left hand side shows the assets at time point $t$ and the right hand side should show the liabilities at the same time point $t$. We denote the value of the assets at time $t$ by $A_t$, and $L_t$ denotes the value of the liabilities at time $t$.

In the language of Chapter 5, we can think of $A_t$ as denoting all asset values in the company at time $t$. These comprise the initial capital, all premia received and all other amounts received, minus the payments made up to time $t$. These amounts are invested at the financial market and, thus, are allocated to the different asset classes displayed in Table 10.1. On the other hand, the liabilities $L_t$ reflect the value of all obligations accepted by the insurance company that are still open at time $t$.

Similarly to the ruin theory of Chapter 5, we should have $A_t \ge L_t$ in order to cover the liabilities by asset values at time $t$. In fact, we may study the continuous time surplus process $(\widetilde{C}_t)_{t \in \mathbb{R}_+}$, given by $\widetilde{C}_t = A_t - L_t$, which should fulfill, for a given large probability $1-p \in (0,1)$,

$$P\left[ \inf_{t \in \mathbb{R}_+} \widetilde{C}_t \ge 0 \,\Big|\, \widetilde{C}_0 = c_0 \right] \;=\; P_{c_0}\left[ \inf_{t \in \mathbb{R}_+} \left( A_t - L_t \right) \ge 0 \right] \;\ge\; 1 - p. \qquad (10.1)$$

Since an insurance company cannot continuously verify the solvency situation, condition (10.1) is only checked on a discrete time grid $t \in \mathbb{N}_0$; this is similar to (5.5). In fact, one even goes beyond that, as we are now going to describe. This will be done in several steps, see also Wüthrich [98].

Step 1 (one-period problem). Let us assume that we are at time $t = 0$ and we would like to check a solvency condition (no ruin condition) similar to (10.1). Moreover, we assume that at time 0 we have only sold one-year contracts (one-year risk exposures) for which we receive a premium at time 0 and for which the claim is paid at the end of the year, i.e. at time $t = 1$.

The total asset value at time 0 is given by $A_0$. This value is invested at the financial market and generates value $A_1$ at time 1. Thus, for this one-period problem the no ruin condition reads as follows:

for a given large probability $1-p \in (0,1)$ the initial capital $c_0$ and the asset strategy should be chosen such that

$$P_{c_0}\left[ A_1 \ge L_1 \right] \;=\; P_{c_0}\left[ L_1 - A_1 \le 0 \right] \;\ge\; 1 - p. \qquad (10.2)$$

This means that we need to choose the initial capital $c_0$ and the asset strategy, which maps value $A_0$ at time 0 to value $A_1$ at time 1, such that the (given stochastic) liabilities $L_1$ can be covered with large probability at time 1. Note that $A_1$ and $L_1$ are, in general, not independent.

Step 2 (risk measure). The no ruin condition in (10.2) is described under the Value-at-Risk risk measure $\mathrm{VaR}_{1-p}(L_1 - A_1)$ on security level $1-p \in (0,1)$, see Example 6.25. Assume we have a normalized, monotone and translation invariant risk measure $\varrho$, see (6.12). Then, more generally,

the initial capital $c_0$ and the asset strategy should be chosen such that

$$\varrho\left( L_1 - A_1 \right) \;\le\; 0. \qquad (10.3)$$

Solvency II uses the VaR risk measure on the $1-p = 99.5\%$ security level and the Swiss Solvency Test (SST) uses the TVaR risk measure on the $1-p = 99\%$ security level, see also Examples 6.25, 6.20 and 6.26. The main aspect is now concerned with the stochastic modeling of the position $L_1 - A_1$.
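For illustration, empirical versions of these two risk measures may be evaluated on a simulated sample of $L_1 - A_1$; the following is a minimal sketch with purely illustrative distributional choices.

# sketch: empirical VaR and TVaR of the position L_1 - A_1 on a simulated
# sample (log-normal liabilities, normal assets; illustrative choices only)
> set.seed(100)
> L1 <- rlnorm(1e6, meanlog = 10, sdlog = 0.1)
> A1 <- rnorm(1e6, mean = 23500, sd = 500)
> X  <- L1 - A1
> quantile(X, 0.995)                  # VaR on the 1-p = 99.5% level
> mean(X[X > quantile(X, 0.99)])      # TVaR on the 1-p = 99% level
# the balance sheet is acceptable if the chosen risk measure is <= 0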


Step 3 (market(-consistent) values). The main difficulty is the stochastic modeling of $L_1 - A_1$. Some positions in this difference are traded at active financial markets. For these positions we need to stochastically model their market prices at time 1 (viewed from time 0). However, most positions (on the liability side of the balance sheet) are not traded at active markets. For these positions we need to determine market-consistent values of their stochastic developments in a marked-to-model approach, see also Happ et al. [59]. Let us explain the rationale behind this with the liabilities $L_1$ at hand and using the claims reserving context of Chapter 9.

Assume we can split the liabilities $L_1$ into two elements:

(i) payments $X_1$ made at time 1 (similar to Section 9.1 we map all payments in accounting year $[1,2) = [1/1/1, 31/12/1]$ to its endpoint);

(ii) outstanding loss liabilities $L_1^+$ at time 1 (at the end of accounting year 1).

The liabilities at time 1 are then given by

$$L_1 = X_1 + L_1^+.$$
The easier part is the modeling of $X_1$. We need to find a stochastic model that is able to predict the payments $X_1$ and capture the dependencies with $A_1$ and $L_1^+$. The more complicated part is $L_1^+$. This amount should reflect a market-consistent value for the outstanding loss liabilities at time 1. Observe that it differs from the best-estimate reserves $R^{(1)}$ given in (9.28) in two crucial ways:
(1) The best-estimate reserves $R^{(1)}$ were calculated on a nominal basis, i.e. the time value of money was not considered because no discounting was applied to $R^{(1)}$.

(2) The best-estimate reserves $R^{(1)}$ are conditional expectations, conditioned on the information $\mathcal{F}_1$. That is, these are expected payouts and we should add a (risk, market-value) margin/loading to obtain market-consistent values. Otherwise risk averse financial agents are not willing to do the run-off of these liabilities at price $L_1^+$, see Chapter 6 and Happ et al. [59].

The aim in these two items (1) and (2) is motivated by the fact that $L_1^+$ should reflect a price at which another insurance company is willing to take over the liabilities at time 1 and to complete the run-off of the outstanding loss liabilities (reflected by an appropriate marked-to-model price, sometimes also called transfer value).

Step 4 (acceptability and solvency). As described above, we have three building blocks $A_1$, $X_1$ and $L_1^+$ that we need to model stochastically (note that these building blocks are not independent). In the last step, we need to evaluate risk measure condition (10.3). If this condition is fulfilled we have an acceptable balance sheet and the company is solvent at time 0 w.r.t. the chosen risk measure $\varrho$. If (10.3) is not fulfilled we have an unacceptable balance sheet and it needs to be modified to achieve solvency. Options for modification are the following: change the asset strategy so that it better matches the liabilities; reduce liabilities and mitigate uncertainties in liabilities (if possible); inject more initial capital $c_0$.

In the remainder of this chapter we discuss the modeling of the asset deficit at time $t = 1$, where the asset deficit is for $t \in \mathbb{N}_0$ defined by

$$\mathrm{AD}_t \;\overset{\text{def.}}{=}\; L_t - A_t \;=\; X_t + L_t^+ - A_t. \qquad (10.4)$$

Thus, the insurance company is solvent at time 0 (w.r.t. the risk measure $\varrho$) if

$$\varrho\left( \mathrm{AD}_1 \right) \;=\; \varrho\left( L_1 - A_1 \right) \;\le\; 0.$$

10.2 Risk modules


Typically the modeling of the asset deficit $\mathrm{AD}_1$ at time $t = 1$, defined in (10.4), is split into different modules that reflect different risk classes. In a first step each risk class is studied individually and in a second step the results are aggregated to obtain the overall picture.

Figure 10.1: lhs: Swiss Solvency Test risk modules; rhs: Solvency II risk modules
(sources [26] and [44]).

One may question whether this modeling approach is smart. Modeling individual risk classes may still be fine, but the aggregation of risk classes is rather non-straightforward because it is very difficult to capture the interaction between the different risk classes. Nevertheless we would like to describe the approach used in practice (and also the shortcuts applied).


In Figure 10.1 we show the individual risk modules used in the Swiss Solvency Test [26] and in Solvency II [44]. Overall they are rather similar, though some differences exist. Often one considers the following 4 risk classes, driven by the risk factors that we now describe:

1. Market risk. We cite SCR.5.1. of QIS5 [44]: "Market risk arises from the level or volatility of market prices of financial instruments. Exposure to market risk is measured by the impact of movements in the level of financial variables such as stock prices, interest rates, real estate prices and exchange rates."

2. Insurance risk. Insurance risk is typically split into the different insurance branches: non-life insurance, life insurance, health insurance and reinsurance. Here we concentrate on non-life insurance risk. This is further subdivided into (i) reserve risk which describes outstanding loss liabilities of past exposure claims; and (ii) premium risk which describes the risk deriving from newly sold contracts that give an exposure over the next accounting period. Additionally, there is often an annuity portfolio deriving from liability insurance covering disability claims of third parties.

3. Credit risk. We cite SCR.6.1. of QIS5 [44]: "The counterparty default risk module should reflect possible losses due to unexpected default, or deterioration in the credit standing, of the counterparties and debtors of undertakings over the forthcoming twelve months. The scope of the counterparty default risk module includes risk-mitigating contracts, such as reinsurance arrangements, securitisations and derivatives, and receivables from intermediaries, as well as any other credit exposures which are not covered in the spread risk sub-module."

4. Operational risk. We cite SCR.3.1. of QIS5 [44]: "Operational risk is the risk of loss arising from inadequate or failed internal processes, or from personnel and systems, or from external events. Operational risk should include legal risks, and exclude risks arising from strategic decisions, as well as reputation risks. The operational risk module is designed to address operational risks to the extent that these have not been explicitly covered in other risk modules."

Let us formalize these risk factors and classes. Therefore, we first consider the beginning of accounting year 1. At time $t = 0$ the asset deficit is given by

$$\mathrm{AD}_0 = L_0 - A_0.$$

We assume that $X_0 = 0$ which implies that $L_0 = L_0^+$, thus, $L_0$ is the value of all liabilities that need to be settled after $t = 0$. For simplification we assume that the liabilities consist of insurance liabilities only. In this case, $L_0^+$ describes the liabilities stemming from claims with accident date prior to $t = 0$ (these are the liabilities of past exposure claims; we denote them by previous year (PY) claims, see also Chapter 9), and of claims with accident date in accounting year 1 (these are all liabilities of the new premium exposure if we assume one-year contracts only; we denote them by current year (CY) claims). Summarizing, this implies on the liability side of the balance sheet at time $t = 0$ (with the obvious notation)

$$L_0 = L_0^+ = L_0^{PY} + L_0^{CY}.$$

On the asset side of the balance sheet we have (this is also a simplified version)

$$A_0 = c_0 + A_0^{PY} + \Pi^{CY},$$

where $A_0^{PY}$ are the provisions to cover the PY liabilities $L_0^{PY}$, $\Pi^{CY}$ is the premium received for the CY claims $L_0^{CY}$, and $c_0$ is the initial capital. As described above, this amount $A_0$ is invested at the financial market and provides value $A_1$ at time $t = 1$. This value needs to be compared to

$$L_1 \;=\; X_1 + L_1^+ \;=\; \left( X_1^{PY} + X_1^{CY} + X_1^{Op} \right) + \left( L_1^{+,PY} + L_1^{+,CY} \right),$$

where $X_1^{PY}$ are the payments for PY claims, $X_1^{CY}$ are the payments for CY claims, $L_1^{+,PY}$ is the value of the outstanding loss liabilities at time $t = 1$ for claims with accident year prior to $t = 0$, and $L_1^{+,CY}$ is the value of the outstanding loss liabilities at time $t = 1$ for CY claims (i.e. accident date in year 1). Thus, if we merge these two values $L_1^+ = L_1^{+,PY} + L_1^{+,CY}$ we obtain the new outstanding loss liabilities for past exposure claims with accident date prior to $t = 1$. Finally, $X_1^{Op}$ denotes the operational risk loss payment where, for simplicity, we assume that this can immediately be settled. We conclude that the asset deficit at time 1 is given by
$$\mathrm{AD}_1 \;=\; \left( X_1^{PY} + X_1^{CY} + X_1^{Op} \right) + \left( L_1^{+,PY} + L_1^{+,CY} \right) - A_1 \qquad (10.5)$$

$$\;=\; \left( X_1^{PY} + L_1^{+,PY} \right) + \left( X_1^{CY} + L_1^{+,CY} \right) + X_1^{Op} - A_1. \qquad (10.6)$$

Let us comment on (10.5)-(10.6).


• Formula (10.5) gives the split into payments and outstanding loss liabilities. This view is crucial for doing asset-and-liability management, i.e. to compare the structure of the asset portfolio to the maturities of the liabilities.

• Formula (10.6) provides the split into PY risk and CY risk. The PY risk is mainly described by the claims development result described in Section 9.4. The CY risk is described by a compound distribution as, for instance, seen in Example 4.11. However, both these descriptions only consider nominal claims and in order to get values we still need to add time values for cash flow payments and a risk margin for bearing the run-off risks. Therefore, these values also depend on financial market movements. This second view is important for profitability analysis because it allows us to match liabilities with the corresponding insurance premium.


• Coming back to the risk modules: market risk affects all variables in (10.5); insurance risk is mainly reflected in $X_1^{PY}$, $L_1^{+,PY}$, $X_1^{CY}$ and $L_1^{+,CY}$; credit risk is a main risk driver in $A_0$ (if we assume that liabilities are considered before re-insurance is applied (gross)); and operational risk is reflected in $X_1^{Op}$.

In the remainder we concentrate on the modeling of insurance liabilities.

10.3 Insurance liability variables


10.3.1 Market-consistent values

We still face the difficulty of attaching market-consistent values to the insurance liabilities. Their value at time $t = 1$ is given by

$$L_1^{Ins} \;\overset{\text{def.}}{=}\; X_1^{PY} + L_1^{+,PY} + X_1^{CY} + L_1^{+,CY}.$$

Note that in our terminology $L_1 = L_1^{Ins} + X_1^{Op}$. Assume that $X = (X_1, \ldots, X_n)$ denotes the (random) cash flow that is generated by the insurance liabilities, see also (9.1). We assume that this cash flow is adapted to the filtration $\mathcal{F} = (\mathcal{F}_s)_{s \ge 1}$. In analogy to Wüthrich-Merz [102] we need to choose an appropriate (state price) deflator $\varphi = (\varphi_1, \ldots, \varphi_n)$ (which is $\mathcal{F}$-adapted and strictly positive, $P$-a.s.) and then

$$L_1^{Ins} \;=\; \frac{1}{\varphi_1} \sum_{s \ge 1} E\left[ \varphi_s X_s \,|\, \mathcal{F}_1 \right] \;=\; X_1 + \frac{1}{\varphi_1} \sum_{s \ge 2} E\left[ \varphi_s X_s \,|\, \mathcal{F}_1 \right]$$

provides a market-consistent value in an arbitrage-free pricing system described by the triple $(P, \mathcal{F}, \varphi)$. For a one-period problem this was already described in Section 6.2.5. Under the assumption of uncorrelatedness of $\varphi_s$ and $X_s$, conditionally given $\mathcal{F}_1$, we can rewrite the market-consistent value of the insurance liabilities as

$$L_1^{Ins} \;=\; \sum_{s \ge 1} \frac{1}{\varphi_1} E\left[ \varphi_s \,|\, \mathcal{F}_1 \right] E\left[ X_s \,|\, \mathcal{F}_1 \right] \;=\; \sum_{s \ge 1} P(1,s)\, E\left[ X_s \,|\, \mathcal{F}_1 \right] \qquad (10.7)$$

$$\;=\; X_1^{PY} + X_1^{CY} + \sum_{s \ge 2} P(1,s)\, E\left[ X_s \,|\, \mathcal{F}_1 \right],$$

where $P(1,s)$ denotes the price at time 1 of the zero-coupon bond that matures at time $s \ge 2$ (and $P(1,1) = 1$). Note that viewed from time 0 both $P(1,s)$ and $E[X_s | \mathcal{F}_1]$ are $\mathcal{F}_1$-measurable random variables in (10.7) and (expected) insurance cash flows are adjusted for the time value of money.
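As a small numerical illustration of the valuation formula (10.7), assume we are given zero-coupon bond prices and expected cash flows; the numbers below are purely illustrative, the expected cash flows being borrowed from Table 9.11 as stand-ins for $E[X_s|\mathcal{F}_1]$.

# sketch: market-consistent value (10.7) for given zero-coupon bond prices
# P(1,s) and expected cash flows E[X_s | F_1] (all inputs illustrative)
> P  <- c(1.00, 0.99, 0.97, 0.94)              # P(1,1), ..., P(1,4)
> EX <- c(3873205, 1125712, 477560, 277521)    # E[X_1|F_1], ..., E[X_4|F_1]
> sum(P * EX)                                  # value of these discounted cash flows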
Under all the previous assumptions (in particular the uncorrelatedness assumption in (10.7)) the acceptability requirement (10.3) reads as:

The initial capital $c_0$ and the asset strategy should be chosen such that

$$\varrho\left( \mathrm{AD}_1 \right) \;=\; \varrho\left( X_1^{PY} + X_1^{CY} + X_1^{Op} + \sum_{s \ge 2} P(1,s)\, E\left[ X_s \,|\, \mathcal{F}_1 \right] - A_1 \right) \;\le\; 0. \qquad (10.8)$$


Since the asset deficit still has a rather involved form, the model is further simplified. Denote the expected values

$$p(1,s) = E\left[ P(1,s) \,|\, \mathcal{F}_0 \right] \qquad \text{and} \qquad x_s = E\left[ X_s \,|\, \mathcal{F}_0 \right].$$

Then, we use the following linear approximation

$$P(1,s)\, E\left[ X_s \,|\, \mathcal{F}_1 \right] \;\approx\; p(1,s)\, x_s + \left( P(1,s) - p(1,s) \right) x_s + p(1,s) \left( E\left[ X_s \,|\, \mathcal{F}_1 \right] - x_s \right).$$

The first term $p(1,s)x_s$ is the expected value (viewed from time 0) of the time-1-price $P(1,s)E[X_s|\mathcal{F}_1]$. The term $(P(1,s) - p(1,s))x_s$ captures the uncertainties in financial discounting and $p(1,s)(E[X_s|\mathcal{F}_1] - x_s)$ describes the volatilities in the insurance cash flows. The cross term of the uncertainties was dropped in this approximation. Typically, the above terms are assumed to be independent so that they can be studied individually and aggregation is obtained by simply convoluting their marginal distributions.
This approximation implies that for (10.8) we study the following three terms

$$Z_1 \;=\; \sum_{s \ge 1} \left[ p(1,s)\, x_s + \left( P(1,s) - p(1,s) \right) x_s \right] - A_1,$$

$$Z_2 \;=\; \sum_{s \ge 1} p(1,s) \left( E\left[ X_s \,|\, \mathcal{F}_1 \right] - x_s \right),$$

$$Z_3 \;=\; X_1^{Op}.$$

$Z_1$ describes market and credit risks, $Z_2$ describes insurance risk and $Z_3$ describes operational risk. In non-life insurance one often assumes that these three random variables are independent (which may be problematic in particular w.r.t. re-insurance).
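Under this independence assumption, the aggregation by convolution of the marginal distributions is straightforward to carry out by Monte Carlo simulation. The following is a minimal sketch; all distributional choices and parameters are purely illustrative.

# sketch: Monte Carlo convolution of the independent terms Z1, Z2, Z3
# (all distributions and parameters are illustrative)
> n  <- 1e6
> Z1 <- rnorm(n, mean = -1000, sd = 400)           # market and credit risks
> Z2 <- rgamma(n, shape = 4, scale = 100) - 400    # insurance risk (centered)
> Z3 <- rlnorm(n, meanlog = 3, sdlog = 1)          # operational risk
> quantile(Z1 + Z2 + Z3, 0.995)                    # VaR of the aggregated position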
In the remainder of this chapter we describe the insurance liability variable $Z_2$. For the other terms we refer to the related solvency literature QIS5 [44], Swiss Solvency Test [46] and Wüthrich-Merz [103].

10.3.2 Insurance risk


We study the insurance risk given by

$$Z_2 \;=\; \sum_{s \ge 1} p(1,s) \left( E\left[ X_s \,|\, \mathcal{F}_1 \right] - x_s \right).$$

As already mentioned, the insurance variables are separated into PY variables and CY variables w.r.t. the valuation date $t = 0$. This provides the split

$$Z_2 \;=\; Z_2^{PY} + Z_2^{CY} \;\overset{\text{def.}}{=}\; \sum_{s \ge 1} p(1,s) \left( E\left[ X_s^{PY} \,|\, \mathcal{F}_1 \right] - x_s^{PY} \right) + \sum_{s \ge 1} p(1,s) \left( E\left[ X_s^{CY} \,|\, \mathcal{F}_1 \right] - x_s^{CY} \right).$$


The final simplification is that we assume that there are deterministic payout patterns $(\beta_s^{PY})_{s \ge 1}$ and $(\beta_s^{CY})_{s \ge 1}$, for instance, obtained by the CL method, see Theorem 9.11 (and the estimation errors in these patterns are neglected). Then the last expressions can be modified to

$$Z_2^{PY} \;=\; \left( \sum_{s \ge 1} p(1,s)\, \beta_s^{PY} \right) \left[ X_1^{PY} + R^{(1)} - R^{(0)} \right],$$

$$Z_2^{CY} \;=\; \left( \sum_{s \ge 1} p(1,s)\, \beta_s^{CY} \right) \left[ E\left[ S_1 \,|\, \mathcal{F}_1 \right] - E\left[ S_1 \,|\, \mathcal{F}_0 \right] \right].$$

The first line $Z_2^{PY}$ reflects the study of the claims development result, see (9.29). The second line $Z_2^{CY}$ describes the total nominal claim $S_1$ of accident year 1 that is caused by the premium exposure $\Pi^{CY}$. The terms in the round brackets are the deterministic discount factors that respect the underlying maturities of the cash flows; the terms in the square brackets are the random terms that need further modeling and analysis.
tes
Claims development result

The claims development result for PY claims, given by

$$\mathrm{CDR}_1 \;=\; -\left( X_1^{PY} + R^{(1)} - R^{(0)} \right),$$

has expected value 0, see Corollary 9.13, if the claims reserves are defined by conditional expectations in a Bayesian model. Therefore, there remains the study of higher moments. In practice, one restricts to the second moment:

• Calculate for every line of business the conditional MSEP of the claims development result prediction, for instance using the MW formula (9.35). This provides a variance estimate for every line of business.
provides a variance estimate for every line of business.

Specify a correlation matrix between the different lines of business, see for
instance SCR.9.34. in QIS5 [44].

The previous two items allow to aggregate the uncertainties of the individual
lines of business to obtain the overall variance over the sum of all lines of
business.

Fit a translated gamma or log-normal distribution to these first two mo-


ments assuming that the mean is exactly given by R(0) . This provides an
approximation to the distribution of CDR1 .
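For the log-normal variant, the two-moment fit may be done as follows; this is a minimal sketch, with the mean and the aggregated standard deviation as illustrative numbers taken from Example 9.19.

# sketch: log-normal distribution fitted to mean R(0) and the aggregated
# square-rooted conditional MSEP (illustrative inputs)
> R0 <- 6047061                       # claims reserves R(0)
> s  <- 462960                        # aggregated msep^{1/2}
> sigma2 <- log(1 + s^2 / R0^2)       # matches the second moment
> mu     <- log(R0) - sigma2 / 2      # matches the first moment
> qlnorm(0.995, meanlog = mu, sdlog = sqrt(sigma2))   # 99.5% quantile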


Premium liability risk

The claim $E[S_1|\mathcal{F}_1]$ resulting from the premium exposure $\Pi^{CY}$ is split into two independent random variables $S_{sc}$ and $S_{lc}$, where $S_{sc}$ reflects all small claims below a given threshold $M$ and $S_{lc}$ the claims above that threshold, see Examples 2.16 and 4.11.
The large claim $S_{lc}$ is modeled per line of business (or per peril) by independent compound Poisson distributions with Pareto claims severities, and aggregation is done using the aggregation Theorem 2.12, resulting in a compound Poisson distribution. The latter can be determined, for instance, with the Panjer algorithm, see Theorem 4.9, or the fast Fourier transform FFT, see Section 4.2.2.
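For instance, the Panjer recursion for a compound Poisson distribution with discretized severities may be implemented as follows; this is a minimal sketch, and the frequency parameter lambda and the severity distribution g are illustrative choices.

# sketch of the Panjer recursion (Theorem 4.9) for a compound Poisson
# distribution; g[k+1] = P[claim size = k] is a discretized severity pmf
> panjer.poisson <- function(lambda, g, s.max) {
+   f <- numeric(s.max + 1)
+   f[1] <- exp(-lambda * (1 - g[1]))          # P[S = 0]
+   for (s in 1:s.max) {
+     k <- 1:min(s, length(g) - 1)
+     f[s + 1] <- (lambda / s) * sum(k * g[k + 1] * f[s - k + 1])
+   }
+   f                                           # P[S = 0], ..., P[S = s.max]
+ }
> g <- c(0, 0.5, 0.3, 0.2)                      # illustrative severity on {1, 2, 3}
> sum((0:20) * panjer.poisson(2, g, 20))        # mean, compare lambda * E[Y] = 3.4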
The small claim $S_{sc}$ is treated similarly to the claims development result, i.e. estimate per line of business the first two moments. Aggregate these moments using an appropriate correlation matrix, see for instance Section 8.4.2 in the technical Swiss Solvency Test document [46], and fit a gamma or a log-normal distribution to these first two moments.

Remarks.

• In the Swiss Solvency Test one distinguishes between pure process risk and parameter uncertainty for the small claims layer, too. Process risk is diversifiable with increasing volume, whereas parameter uncertainty is not. As a result, the coefficient of variation per line of business has a similar form as has been found for the compound negative-binomial distribution, see Proposition 2.24. That is, for volume $v \to \infty$ the coefficient of variation does not vanish but stays strictly positive.

• In the Swiss Solvency Test one aggregates in addition so-called scenarios. The motivation for this is that the present model cannot reflect all uncertainties and therefore it is disturbed by scenarios. Basically, these scenarios are claims of Bernoulli type, i.e. they occur with a certain probability and if they occur they have a given amount.

• For the aggregation between PY and CY claims it is either assumed that they are independent, or the claims development result uncertainty $\mathrm{CDR}_1$ and the CY small claim $S_{sc}$ are again aggregated via a correlation matrix and then an overall distribution is fitted to the resulting first two moments.

• In summary, we see that many approximations are used (as described above) and, also crucially, that aggregation is done over correlation matrices. The latter may be quite problematic because correlations typically also depend on underlying volumes, which is neglected in actual solvency implementations. Therefore, this needs to be revised carefully in each individual case.


Market-value margin

The careful reader will have noticed that we have lost the risk margin somewhere on the way to the final result. We will not further discuss the risk and market-value margin here; we only want to mention that the current calculation of the market-value margin is quite ad-hoc, see Chapter 6 in the Swiss Solvency Test [46] and Section 10.3 in Wüthrich-Merz [102], and further refinements are necessary. The crucial point is that the conditional uncorrelatedness in (10.7) does not hold true in general, see Wüthrich-Merz [102]; for a more general discussion we also refer to Happ et al. [59] and Wüthrich [98].



Appendix

w)
Derivations from Gaussian distributions
Assume $Z_0, Z_1, \ldots$ are i.i.d. $\sim \mathcal{N}(0,1)$. We can derive the following distributions.

$\chi^2$-distribution. Define for $k \in \mathbb{N}$ the random variable

$$X_k = \sum_{i=1}^{k} Z_i^2.$$

$X_k$ has a $\chi^2$-distribution with $k$ degrees of freedom, see Example 2 on page 22. Its density is given by

$$f(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{k/2-1} \exp\{-x/2\} \qquad \text{for } x \ge 0,$$

and the corresponding moment generating function is

$$M_{X_k}(r) = (1-2r)^{-k/2} \qquad \text{for } r < 1/2.$$

Moreover, we have $E[X_k] = k$ and $\mathrm{Var}(X_k) = 2k$.

$t$-distribution. Define for $k \in \mathbb{N}$ the random variable

$$X_k = \frac{Z_0}{\sqrt{\sum_{i=1}^{k} Z_i^2 / k}}.$$

$X_k$ has a $t$-distribution with $k$ degrees of freedom. Its density is given by

$$f(x) = \frac{\Gamma((k+1)/2)}{\sqrt{k\pi}\,\Gamma(k/2)}\, (1 + x^2/k)^{-(k+1)/2} \qquad \text{for } x \in \mathbb{R}.$$

The moment generating function $M_{X_k}(r)$ does not exist for $r > 0$, and we have $E[X_k] = 0$, for $k > 1$, and $\mathrm{Var}(X_k) = k/(k-2)$, for $k > 2$.

$F$-distribution. Define for $k, m \in \mathbb{N}$ the random variable

$$X_{k,m} = \frac{\sum_{i=1}^{k} Z_i^2 / k}{\sum_{i=k+1}^{k+m} Z_i^2 / m}.$$

$X_{k,m}$ has an $F$-distribution with $k, m$ degrees of freedom. Its density is given by

$$f(x) = \frac{\Gamma((k+m)/2)}{\Gamma(k/2)\,\Gamma(m/2)}\, k^{k/2}\, m^{m/2}\, x^{k/2-1}\, (m+kx)^{-(k+m)/2} \qquad \text{for } x \ge 0.$$

The moment generating function $M_{X_{k,m}}(r)$ does not exist for $r > 0$, and we have $E[X_{k,m}] = m/(m-2)$, for $m > 2$.

Lemma A.1. Assume $Z_1, \ldots, Z_k$ are i.i.d. $\sim \mathcal{N}(\mu, \sigma^2)$. Define the sample estimators

$$\bar{Z} = \frac{1}{k} \sum_{j=1}^{k} Z_j \qquad \text{and} \qquad S^2 = \frac{1}{k-1} \sum_{j=1}^{k} \left( Z_j - \bar{Z} \right)^2.$$

Then $\bar{Z}$ and $S^2$ are independent with

$$\bar{Z} \sim \mathcal{N}(\mu, \sigma^2/k) \qquad \text{and} \qquad S^2 \overset{(d)}{=} \frac{\sigma^2}{k-1}\, \chi^2_{k-1},$$

where $\chi^2_{k-1}$ is $\chi^2$-distributed with $k-1$ degrees of freedom.

The previous lemma implies that

$$T = \sqrt{k}\; \frac{\bar{Z} - \mu}{S}$$

has a $t$-distribution with $k-1$ degrees of freedom.
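This consequence of Lemma A.1 is easily checked by simulation; the following is a minimal sketch with illustrative parameter values.

# sketch: simulation check that T = sqrt(k) * (Zbar - mu) / S has a
# t-distribution with k - 1 degrees of freedom
> k <- 5; n <- 1e5; mu <- 1; sigma <- 2
> Z <- matrix(rnorm(n * k, mean = mu, sd = sigma), ncol = k)
> Tstat <- sqrt(k) * (rowMeans(Z) - mu) / apply(Z, 1, sd)
> ks.test(Tstat, "pt", df = k - 1)   # should not reject the t-distribution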
no


Bibliography

[1] Acerbi, C., Tasche, D. (2002). On the coherence of expected shortfall. Journal of Banking and Finance 26/7, 1487-1503.

[2] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19/6, 716-723.

[3] Alai, D.H., Merz, M., Wüthrich, M.V. (2009). Mean square error of prediction in the Bornhuetter-Ferguson claims reserving method. Annals of Actuarial Science 4/1, 7-31.

[4] Alai, D.H., Merz, M., Wüthrich, M.V. (2010). Prediction uncertainty in the Bornhuetter-Ferguson claims reserving method: revisited. Annals of Actuarial Science 5/1, 7-17.

[5] Artzner, P., Delbaen, F., Eber, J.M., Heath, D. (1997). Thinking coherently. Risk 10/11, 68-71.

[6] Artzner, P., Delbaen, F., Eber, J.M., Heath, D. (1999). Coherent measures of risk. Mathematical Finance 9/3, 203-228.

[7] Asmussen, S., Albrecher, H. (2010). Ruin Probabilities. 2nd edition. World Scientific.

[8] Bahr, von B. (1975). Asymptotic ruin probabilities when exponential moments do not exist. Scandinavian Actuarial Journal 1975, 6-10.

[9] Bailey, R.A. (1963). Insurance rates with minimum bias. Proceedings CAS 50, 4-11.

[10] Bailey, R.A., Simon, L.J. (1960). Two studies on automobile insurance ratemaking. ASTIN Bulletin 1, 192-217.

[11] Bichsel, F. (1964). Erfahrungstarifierung in der Motorfahrzeug-Haftpflicht-Versicherung. Bulletin of the Swiss Association of Actuaries 1964, 119-130.

[12] Billingsley, P. (1968). Convergence of Probability Measures. Wiley.

[13] Billingsley, P. (1995). Probability and Measure. 3rd edition. Wiley.

[14] Boland, P.J. (2007). Statistical and Probabilistic Methods in Actuarial Science. Chapman & Hall/CRC.

[15] Bolthausen, E., Wüthrich, M.V. (2013). Bernoulli's law of large numbers. ASTIN Bulletin 43/2, 73-79.

[16] Bornhuetter, R.L., Ferguson, R.E. (1972). The actuary and IBNR. Proceedings CAS 59, 181-195.

[17] Boyd, S., Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

[18] Bühlmann, H. (1970). Mathematical Methods in Risk Theory. Springer.

[19] Bühlmann, H. (1980). An economic premium principle. ASTIN Bulletin 11/1, 52-60.

[20] Bühlmann, H. (1992). Stochastic discounting. Insurance: Mathematics and Economics 11/2, 113-127.

[21] Bühlmann, H. (1995). Life insurance with stochastic interest rates. In: Financial Risk in Insurance, G. Ottaviani (ed.), Springer, 1-24.

[22] Bühlmann, H. (2004). Multidimensional valuation. Finance 25, 15-29.

[23] Bühlmann, H., De Felice, M., Gisler, A., Moriconi, F., Wüthrich, M.V. (2009). Recursive credibility formula for chain ladder factors and the claims development result. ASTIN Bulletin 39/1, 275-306.

[24] Bühlmann, H., Gisler, A. (2005). A Course in Credibility Theory and its Applications. Springer.

[25] Bühlmann, H., Straub, E. (1970). Glaubwürdigkeit für Schadensätze. Bulletin of the Swiss Association of Actuaries 1970, 111-131.

[26] Bundesamt für Privatversicherungen (2004). Weissbuch des Schweizer Solvenztests. November 2004.

[27] Černý, A. (2006). Introduction to fast Fourier transform in finance. SSRN Manuscript ID 559416.

[28] Congdon, P. (2006). Bayesian Statistical Modelling. 2nd edition. Wiley.

[29] Cramér, H. (1930). On the Mathematical Theory of Risk. Skandia Jubilee Volume, Stockholm.

[30] Cramér, H. (1955). Collective Risk Theory. Skandia Jubilee Volume, Stockholm.

[31] Cramér, H. (1994). Collected Works. Volumes I & II. Edited by A. Martin-Löf. Springer.

[32] Delbaen, F. (2000). Coherent Risk Measures. Cattedra Galileiana. Pisa.

[33] Delbaen, F., Schachermayer, W. (1994). A general version of the fundamental theorem of asset pricing. Mathematische Annalen 300, 463-520.

[34] Denneberg, D. (1989). Verzerrte Wahrscheinlichkeiten in der Versicherungsmathematik, quantilabhängige Prämienprinzipien. Mathematik Arbeitspapiere 34, University of Bremen.

[35] Denuit, M., Maréchal, X., Pitrebois, S., Walhin, J.-F. (2007). Actuarial Modelling of Claim Counts. Wiley.

[36] Dickson, D.C.M. (2005). Insurance Risk and Ruin. Cambridge University Press.

[37] Duffie, D. (2001). Dynamic Asset Pricing Theory. 3rd edition. Princeton University Press.

[38] Embrechts, P., Frei, M. (2009). Panjer recursion versus FFT for compound distributions. Mathematical Methods of Operations Research 69/3, 497-508.

[39] Embrechts, P., Klüppelberg, C., Mikosch, T. (2003). Modelling Extremal Events for Insurance and Finance. 4th printing. Springer.

[40] Embrechts, P., Nešlehová, J., Wüthrich, M.V. (2009). Additivity properties for Value-at-Risk under Archimedean dependence and heavy-tailedness. Insurance: Mathematics and Economics 44/2, 164-169.

[41] Embrechts, P., Veraverbeke, N. (1982). Estimates for the probability of ruin with special emphasis on the possibility of large claims. Insurance: Mathematics and Economics 1/1, 55-72.

[42] England, P.D., Verrall, R.J. (2002). Stochastic claims reserving in general insurance. British Actuarial Journal 8/3, 443-518.

[43] England, P.D., Verrall, R.J., Wüthrich, M.V. (2012). Bayesian overdispersed Poisson model and the Bornhuetter-Ferguson claims reserving method. Annals of Actuarial Science 6/2, 258-283.

[44] European Commission (2010). QIS 5 Technical Specifications, Annex to Call for Advice from CEIOPS on QIS5.

[45] Feller, W. (1966). An Introduction to Probability Theory and its Applications. Volume II. Wiley.

[46] FINMA (2006). Swiss Solvency Test. FINMA SST Technisches Dokument, Version 2. October 2006.

[47] Föllmer, H., Schied, A. (2004). Stochastic Finance, An Introduction in Discrete Time. 2nd edition. de Gruyter.

[48] Fortuin, C.M., Kasteleyn, P.W., Ginibre, J. (1971). Correlation inequalities on some partially ordered sets. Communications in Mathematical Physics 22/2, 89-103.

[49] Frees, E.W. (2010). Regression Modeling with Actuarial and Financial Applications. Cambridge University Press.

[50] Fringeli, M. (2005). Credibility für Probleme mit räumlicher Abhängigkeit. Diploma Thesis, ETH Zurich.

[51] Garcia Ben, M., Yohai, V.J. (2004). Quantile-quantile plot for deviance residuals in the generalized linear model. Journal of Computational and Graphical Statistics 13/1, 36-47.

[52] Gesmann, M., Murphy, D., Zhang, W., Carrato, A., Crupi, G., Wüthrich, M.V. (2015). ChainLadder: statistical methods and models for the calculation of outstanding claims reserves in general insurance. R package version 0.2.0.

[53] Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall.

[54] Gisler, A. (2011). Nicht-Leben Versicherungsmathematik. Lecture Notes, ETH Zurich.

[55] Gisler, A., Wüthrich, M.V. (2008). Credibility for the chain ladder reserving method. ASTIN Bulletin 38/2, 565-600.

[56] Green, P.J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82/4, 711-732.

[57] Green, P.J. (2003). Trans-dimensional Markov chain Monte Carlo. In: Highly Structured Stochastic Systems, P.J. Green, N.L. Hjort, S. Richardson (eds.), Oxford Statistical Science Series, 179-206. Oxford University Press.

[58] Hachemeister, C.A., Stanard, J.N. (1975). IBNR claims count estimation with static lag functions. ASTIN Colloquium 1975, Portugal.

[59] Happ, S., Merz, M., Wüthrich, M.V. (2015). Best-estimate claims reserves in incomplete markets. European Actuarial Journal 5/1, 55-77.

[60] Hofert, M., Wüthrich, M.V. (2013). Statistical review of nuclear power accidents. Asia-Pacific Journal of Risk and Insurance 7/1, Article 1.

[61] Johansen, A.M., Evers, L., Whiteley, N. (2010). Monte Carlo Methods. Lecture Notes, Department of Mathematics, University of Bristol.

[62] Johnson, R.A., Wichern, D.W. (1998). Applied Multivariate Statistical Analysis. 4th edition. Prentice-Hall.

[63] Jung, J. (1968). On automobile insurance ratemaking. ASTIN Bulletin 5, 41-48.

[64] Kaas, R., Goovaerts, M., Dhaene, J., Denuit, M. (2008). Modern Actuarial Risk Theory, Using R. 2nd edition. Springer.

[65] Kehlmann, D. (2005). Die Vermessung der Welt. Rowohlt Verlag.

[66] Kolmogoroff, A. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer.

[67] Kremer, E. (1985). Einführung in die Versicherungsmathematik. Vandenhoeck & Ruprecht, Göttingen.

[68] Kyprianou, A. (2014). Gerber-Shiu Risk Theory. Springer.

[69] Laplace, P.S. (1812). Théorie analytique des probabilités. Suppl. to 3rd edition, Courcier, Paris 1820.

[70] Lehmann, E.L. (1983). Theory of Point Estimation. Wiley.

[71] Lundberg, F. (1903). Approximerad framställning av sannolikhetsfunktionen. Återförsäkring av kollektivrisker. Almqvist & Wiksell, Uppsala.

[72] Mack, T. (1991). A simple parametric model for rating automobile insurance or estimating IBNR claims reserves. ASTIN Bulletin 21/1, 93-109.

[73] Mack, T. (1993). Distribution-free calculation of the standard error of chain ladder reserve estimates. ASTIN Bulletin 23/2, 213-225.

[74] Mack, T. (2008). The prediction error of Bornhuetter/Ferguson. ASTIN Bulletin 38/1, 87-103.

[75] McCullagh, P., Nelder, J.A. (1983). Generalized Linear Models. Chapman & Hall.

[76] McGrayne, S.B. (2011). The Theory That Would Not Die. Yale University Press.

[77] McNeil, A.J., Frey, R., Embrechts, P. (2005). Quantitative Risk Management: Concepts, Techniques and Tools. Princeton University Press.

[78] Merz, M., Wüthrich, M.V. (2008). Modelling the claims development result for solvency purposes. CAS E-Forum Fall 2008, 542-568.

[79] Merz, M., Wüthrich, M.V. (2013). Mathematik für Wirtschaftswissenschaftler. Vahlen.

[80] Merz, M., Wüthrich, M.V. (2014). Claims run-off uncertainty: the full picture. SSRN Manuscript ID 2524352.

[81] Mikosch, T. (2006). Non-Life Insurance Mathematics. Springer.

[82] Ohlsson, E., Johansson, B. (2010). Non-Life Insurance Pricing with Generalized Linear Models. Springer.

[83] Panjer, H.H. (1981). Recursive evaluation of a family of compound distributions. ASTIN Bulletin 12/1, 22-26.

[84] Panjer, H.H. (2006). Operational Risk: Modeling Analytics. Wiley.

[85] Renshaw, A.E., Verrall, R.J. (1998). A stochastic model underlying the chain-ladder technique. British Actuarial Journal 4/4, 903-923.

[86] Resnick, S.I. (1997). Heavy tail modeling of teletraffic data. Annals of Statistics 25/5, 1805-1869.

[87] Resnick, S.I. (2002). Adventures in Stochastic Processes. 3rd printing. Birkhäuser.

[88] Robert, C.P. (2001). The Bayesian Choice. 2nd edition. Springer.

[89] Rolski, T., Schmidli, H., Schmidt, V., Teugels, J. (1999). Stochastic Processes for Insurance and Finance. Wiley.

[90] Saluz, A., Gisler, A., Wüthrich, M.V. (2011). Development pattern and prediction error for the stochastic Bornhuetter-Ferguson claims reserving model. ASTIN Bulletin 41/2, 279-317.

[91] Schmidli, H. (2007). Risk Theory. Lecture Notes, University of Cologne.

[92] Schweizer, M. (2009). Stochastic Processes and Stochastic Analysis. Lecture Notes, ETH Zurich.

[93] Smith, A., Thaper, S. (2014). Making uncertainty explicit: stochastic modelling. Actuarial Post, February 12, 2014, 12-15.

[94] Sovacool, B.K. (2008). The costs of failure: a preliminary assessment of major energy accidents, 1907-2007. Energy Policy 36/5, 1802-1820.

[95] Sundt, B., Jewell, W.S. (1981). Further results of recursive evaluation of compound distributions. ASTIN Bulletin 12/1, 27-39.

[96] Tsanakas, A., Christofides, N. (2006). Risk exchange with distorted probabilities. ASTIN Bulletin 36/1, 219-243.

[97] Williams, D. (1991). Probability with Martingales. Cambridge University Press.

[98] Wüthrich, M.V. (2015). From ruin theory to solvency in non-life insurance. Scandinavian Actuarial Journal 2015/6, 516-526.

[99] Wüthrich, M.V. (2016). Market-Consistent Actuarial Valuation. 3rd edition. Springer.

[100] Wüthrich, M.V. (2016). Machine learning in individual claims reserving. SSRN Manuscript ID 2867897.

[101] Wüthrich, M.V., Buser, C. (2016). Data analytics for non-life insurance pricing. SSRN Manuscript ID 2870308.

[102] Wüthrich, M.V., Merz, M. (2008). Stochastic Claims Reserving Methods in Insurance. Wiley.

[103] Wüthrich, M.V., Merz, M. (2013). Financial Modeling, Actuarial Valuation and Solvency in Insurance. Springer.

[104] Wüthrich, M.V., Merz, M. (2015). Stochastic claims reserving manual: advances in dynamic modeling. SSRN Manuscript ID 2649057.


List of exercises

Exercise 1, page 18
Exercise 2, page 22
Corollary 2.7, page 28
Exercise 3, page 28
Exercise 4, page 40
Exercise 5, page 51
Exercise 6, page 51
Exercise 7, page 60
Exercise 8, page 78
Exercise 9, page 84
Exercise 10, page 90
Exercise 11, page 90
Exercise 12, page 98
Exercise 13, page 143
Corollary 6.6, page 149
Exercise 14, page 150
Exercise 15, page 153
Exercise 16, page 154
Exercise 17, page 155
Exercise 18, page 158
Exercise 19, page 159
Exercise 20, page 164
Exercise 21, page 181
Exercise 22, page 192
Exercise 23, page 221
Exercise 24, page 221
Exercise 25, page 222
Exercise 26, page 257
Index

F-distribution, 180 Bayes, Thomas, 202


χ²-distribution, 22, 62 Bayesian
χ²-goodness-of-fit test, 49, 83 inference, 41, 201

p-value, 22 Bayesian CL
factor, 240
absolutely continuous distribution, 15 predictor, 241
acceptable, 161, 266 Bayesian information criterion, 83

accident Bernoulli
date, 226 distribution, 26
year, 228 experiment, 26
AD (asset deficit), 267 random walk, 127
AD test, 82 Bernoulli, Jakob, 12
adjustment coefficient, 125 best-estimate reserves, 231
admissible, 32 BF
age-to-age factor, 232 method, 232, 236
aggregation property, 31 reserves, 236
AIC, 83 BIC, 83
Akaike information criterion, 83 Bichsel, Fritz, 203
Akaike, Hirotugu, 83 binary variable, 177
alternative hypothesis, 21 binomial distribution, 26, 183
Anderson, Theodore Wilbur, 82 definition, 26
Anderson-Darling test, 82 moments, 27
approximation
NL

Bornhuetter, Ronald, 236


Edgeworth, 100 Bornhuetter-Ferguson method, 232, 236
normal, 94 BS model, 213
translated gamma, 97
translated log-normal, 97 CARA utility function, 147
arbitrage-free pricing, 270 categorical variable, 177
asset deficit, 267 CDR, 250, 272
uncertainty, 255
Bhlmann, Hans, 154, 212 central limit theorem, 13, 94
Bhlmann-Straub model, 213 chain-ladder method, 232
Bahr, von Bengt, 139 chain-ladder model
Bailey, Robert A., 171 distribution-free, 238
balance sheet, 264 Chebychevs inequality, 127
Bayes rule, 203 Chebychev, Pafnuty Lvovich, 17

284
Index 285

chi-square distribution, 22, 62 decomposition property, 33


chi-square-goodness-of-fit test, 49, 83 definition, 30
CL factor, 232 moments, 30
Bayes, 240 concave, 144
estimate, 240 conditional tail expectation, 159
CL method, 232 conjugate prior, 208
CL model constant absolute risk-aversion, 147
gamma-gamma Bayes, 239 constant relative risk-aversion, 147
MSEP, 242 continuous variable, 177

w)
CL reserves, 233 convergence in distribution, 17
claims convex cone, 161
counts, 23 convolution, 25
frequency, 26 cost-of-capital, 160, 163
claims development

(m
rate, 160, 164
result, 250, 272 Cramr, Harald, 121
triangle, 229 Cramr-Lundberg process, 121
claims inflation, 89 credibility coefficient, 218, 241
claims reserves, 230, 231 credibility estimator, 209
claims reserving, 225 homogeneous, 214
tes
algorithm, 232 inhomogeneous, 214
method stochastic, 237 credibility weight, 201, 205, 208
closing date, 226 credit risk, 268, 271
CLT, 13, 94 CRRA utility function, 147
CoC, 160 CTE, 159
no

rate, 160 cumulant function, 182


coefficient of determination, 178 cumulant generating function, 19
coefficient of variation, 16, 58 current year claim, 269
coherent risk measure, 157, 162 CY claim, 269
collective mean, 213 CY risk, 272
NL

collective risk model, 23


compound binomial distribution, 27 Darling, Donald Allan, 82
definition, 27 De Moivre, Abraham, 13, 94
moments, 28 decomposition property, 33
compound distribution, 23 deductible, 88
definition, 23 deflator, 165, 270
moments, 24 Delbaen, Freddy, 157
compound negative-binomial distribution, density, 15, 58
39 descending ladder epoch, 129
definition, 39 design matrix, 175
moments, 40 development year, 229
compound Poisson distribution, 30 deviance statistics, 190
aggregation property, 31 discrete distribution, 15

Version March 14, 2017, M.V. Wthrich, ETH Zurich


286 Index

discretization, 108 Fourier, Jean Baptiste Joseph, 117


disjoint decomposition, 32 Frchet, Maurice, 64
property, 33
dispersion, 182 gamma distribution, 37, 83, 184
distortion function, 156 gamma-gamma Bayes CL model, 239
distribution function, 15 gamma-gamma model, 210
distribution-free CL model, 238 Gauss, Carl Friedrich, 18
Duffie, James Darrell, 165 Gaussian distribution, 17, 184
generalized inverse, 20
EDF, 182 generalized linear model, 167, 182

w)
Edgeworth approximation, 100 Gerber, Hans-Ulrich, 121
Edgeworth, Francis Ysidro, 100 Gisler, Alois, 215
Embrechts, Paul, 139 Glivenko-Cantelli theorem, 80, 93
Embrechts-Veraverbeke theorem, 137 GLM, 170, 182

(m
empirical Goldie, Charles M., 139
distribution function, 56 goodness-of-fit, 83, 177
loss size index function, 56
mean excess function, 56 happiness index, 144
England, Peter D., 247 heavy tailed, 135
ES, 159 Hill
tes
Esscher estimator, 75
measure, 154 plot, 75
premium, 154, 165 histogram, 55
estimation error, 21 homogeneous credibility estimator, 214
no

estimator, 21 i.i.d., 20
expectation, 15 IBNYR, 227
expected claims frequency, 26 incomplete gamma function, 59
expected shortfall, 158, 159, 163 independent and identically distributed,
expected value, 15, 58 20
expected value principle, 141
NL

individual claim size, 23, 53


exponential dispersion family, 182, 208 informative prior, 206
exponential distribution, 62 inhomogeneous credibility estimator, 214
exponential utility function, 147 insurance risk, 268, 271
F-distribution, 180 inverse Gaussian distribution, 63
fast Fourier transform, 116 inversion formula, 117
Ferguson, Ronald E., 236 isoelastic utility function, 147
FFT, 116 Jewell, William S., 106
finite horizon ruin probability, 122 Jung, Jan, 173
first moment, 15
Fisher, Sir Ronald Aylmer, 45 Khinchin, Aleksandr Yakovlevich, 132
Fourier transform Kolmogorov distribution, 80
discrete, 117 Kolmogorov, Andrey Nikolaevich, 79




Kolmogorov-Smirnov test, 79 mean excess function, 56, 58


KS test, 79 mean excess plot, 57
mean square error of prediction
ladder
conditional, 237
epoch, 129
Merz, Michael, 250
height, 129
Merz-Wthrich formula, 255
Laplace, Pierre-Simon, 13, 94, 202
method of
large claims separation, 35
Bailey & Jung, 173
law of large numbers, 12
Bailey & Simon, 171
layer, 58, 86

w)
total marginal sums, 173
leverage effect, 89
method of moments, 40
likelihood function, 45
minimal variance estimator, 42
likelihood ratio test, 179
mixed Poisson distribution, 36
linear credibility, 201, 212

(m
definition, 36
link ratio, 232
MLE, 40, 45
LLN, 12
MM, 40
log-gamma distribution, 70
model risk, 13
log-likelihood function, 45
model world, 14
log-linear model, 176
moment estimator, 41
log-link function, 185
tes
moment generating function, 16, 19, 58
log-log plot, 57
moments, 15
log-normal distribution, 66
monotonicity, 161
loss size index function, 56, 58
Lundberg Monte Carlo simulation, 93
Morgenstern, Oskar, 144
no

bound, 125, 126


coefficient, 125 MSEP, 237, 242
Lundberg, Ernst Filip Oskar, 121 multiplicative tariff, 168
Lyapunov, Aleksandr Mikhailovich, 94 multivariate Gaussian distribution, 175
density, 176
Mack CL model, 238 MV, 42
NL

Mack formula, 245 MW formula, 255


Mack, Thomas, 238
margin, 266 negative-binomial distribution, 37
market risk, 268, 271 definition, 37
market-consistent, 266 moments, 38
value, 270 net profit condition, 124
market-value margin, 266, 274 Neumann, von John, 144
Markov chain Monte Carlo, 201 non-informative prior, 206
Markov, Andrey Andreyevich, 17 normal approximation, 94
maximum likelihood estimator, 45 normalization, 161
maximum likelihood method, 40 NPC, 124
MCMC, 201 null hypothesis, 21
mean, 15, 58 number of claims, 23




ODP model, 247 liability risk, 268, 273


one-period problem, 265 previous year claim, 268
one-year uncertainty, 256 Price, Richard, 202
operational risk, 268, 271 prior
loss, 269 distribution, 202
outstanding loss liabilities, 226, 230 parameter, 204
over-dispersed Poisson model, 247 probability distortion, 156
probability space, 14
p-value, 22 process uncertainty, 238
Plya, George, 38

provisions, 230, 269
Panjer pure randomness, 13
algorithm, 105, 107 PY claim, 268
distribution, 105 PY risk, 272
recursion, 105

(m
Panjer, Harry H., 105 radius of convergence, 16
parameter estimation Radon-Nikodym derivative, 165
claims count distribution, 40 random variables, 14
error, 238 random walk theorem, 124
Pareto distribution, 73 rapidly varying, 59
tes
Pareto, Vilfredo Federico Damaso, 73 RBNS, 228
past exposure claim, 230 re-insurance, 88
Pearsons residuals, 191 real world, 14
Pearson, Karl, 59, 83 regularly varying, 59, 136
Poisson distribution, 29, 184 renewal property, 125
no

definition, 29 reporting
moments, 29 date, 226
Poisson, Simon Denis, 29 delay, 226
Poisson-gamma model, 203 reserve risk, 268
Pollaczek, Flix, 132 reserves, 230, 231
NL

Pollaczek-Khinchin formula, 129, 132 residual standard deviation, 179


positive homogeneity, 161 Resnick, Sidney Ira, 75
posterior Riemann-Stieltjes integral, 15
distribution, 203 risk
parameter, 204 averse, 144
power law distribution, 73 bearing capital, 159
power utility function, 147 characteristics, 168
prediction error, 21, 221 class, 168
predictor, 21, 237 components, 13
premium margin, 266, 274
calculation principle, 141 measure, 159, 265
CY, 269 modules, 267
elements, 13 ruin probability




finite horizon, 122 survival function, 20, 58


ultimate, 123 Swiss Solvency Test, 265
ruin theory, 121
tail index, 59, 136
ruin time, 122
Tail-Value-at-Risk, 159
sample tariff criterion, 168
estimators, 41 tariffication, 167
mean, 41, 54 total claim amount, 23
variance, 41, 54 total uncertainty, 256
saturated model, 189 tower property, 20

w)
scale parameter, 59 translated gamma approximation, 97
scaled deviance, 190 translated log-normal approximation, 97
scatter plot, 54 translation invariance, 161
settlement TVaR, 159, 265

(m
date, 226 ultimate ruin probability, 123
delay, 229 utility function
period, 226 exponential, 147
shape parameter, 59 power, 147
Shiu, Elias S.W., 121 utility indifference price, 148
tes
significance level, 21 utility theory, 144
Simon, LeRoy J., 171
skewness, 16, 58 vague prior, 206
slowly varying, 59 value
Smirnov, Nikolai Vasilyevich, 79 assets, 264
no

solvency, 266 liabilities, 264


Solvency II, 265 Value-at-Risk, 159, 162
Spitzers formula, 130 VaR, 159, 162, 265
Spitzer, Frank Ludvig, 131 Var, 16
SST, 265 variable reduction analysis, 189
NL

standard assumptions for compound dis- variance, 16, 58


tributions, 23 variance loading principle, 142
standard deviation, 16 Vco, 16
standard deviation loading principle, 142 Veraverbeke, Nol, 139
stochastic claims reserving method, 237 Verrall, Richard J., 247
stochastic dominance, 109 volume, 26
stopping time, 129
Weibull distribution, 64
Straub, Erwin, 212
Weibull, Ernst Hjalmar Waloddi, 64
structural parameter, 218
subadditivity, 161 zero claim, 53
subexponential, 133, 135, 137 zero-coupon bond, 270
Sundt, Björn, 106
surplus process, 121, 264

