Applied Projects
(as summarised by their authors)
ANALYSIS OF US INSURANCE DATA
The purpose of the project was to analyse the gross payment amounts for 24051 professional indemnity insurance claims (i.e. claims incurred by professionals in the practice of their profession). The insurance was written between 1970 and 1993, in 10 different US states, by an unspecified insurance company. As well as State, the two other factors analysed were the Type Code (representing the type of risk insured) and the Year of claim.
The payment amount, while obvious for closed (i.e. finally settled) claims, had to be calculated for open claims as:

claim amount = estimated liability × share of contract

This failed to take into account the possibility of the insurance company being found not liable in court, so a second dataset was produced by sampling from the first with the estimated probability of liability.
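The calculation and the liability-weighted resampling can be sketched as follows. This is a minimal illustration in Python, assuming hypothetical record fields `amount` and `p_liable`; the original analysis was carried out in Splus on the real claims data.

```python
import random

def open_claim_amount(estimated_liability, share_of_contract):
    """Payment amount for an open claim: the estimated liability
    scaled by the insurer's share of the contract."""
    return estimated_liability * share_of_contract

def resample_by_liability(claims, seed=0):
    """Build a second dataset by keeping each claim with its
    estimated probability that the insurer is found liable.
    `claims` is a list of dicts with hypothetical keys 'amount'
    and 'p_liable' (closed claims have p_liable == 1.0)."""
    rng = random.Random(seed)
    return [c for c in claims if rng.random() < c["p_liable"]]
```

Closed claims, with liability certain, always survive the resampling; open claims drop out in proportion to the chance the company is found not liable.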
The Smirnov test, implemented in Splus, showed that all three factors affected the distribution, and as the sample size was insufficient to stratify the data in three ways at the same time, each factor was investigated separately. Three distributions were used to try to model the data:
The Log-Normal LN(μ, σ²)
The Generalised Gamma G(k, p, λ)
The Generalised F GF(m1, m2, μ, σ²)
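The stratified comparison can be illustrated with the two-sample Kolmogorov-Smirnov (Smirnov) test. The sketch below uses Python's scipy rather than the Splus implementation used in the project, and stand-in log-normal samples for two hypothetical states, since the claims data are not reproduced here.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Hypothetical stand-in data: payment amounts for two strata
# (e.g. two states), log-normal with different log-means.
state_a = rng.lognormal(mean=8.0, sigma=1.2, size=500)
state_b = rng.lognormal(mean=8.6, sigma=1.2, size=500)

# Two-sample Kolmogorov-Smirnov (Smirnov) test: do the two
# strata share the same payment distribution?
stat, p_value = ks_2samp(state_a, state_b)
print(f"D = {stat:.3f}, p = {p_value:.2e}")
```

A small p-value, as here, is the kind of evidence that led the project to treat the factor as affecting the distribution and to stratify by it.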
None of these fitted perfectly, although the log-normal was quite close, and the generalised F improved upon it slightly at the expense of two extra parameters. One conclusion drawn from the stratification by year was that the log-normality reduced as the balance between open and closed claims approached, and then increased again. This suggested that the estimates of liability in the open cases were biased: the fact that the estimates were used for determining the levels of reserves meant that they were probably conservative.
The best fit was for the upper tail of the distribution, possibly because the importance of the high-value claims meant that greater care was taken in estimating the liabilities, and so the estimates were more accurate. The fit of the log-normal to the upper tail was tested and found to be quite good, while the generalised F was again slightly better. An interesting application of these upper tails is in assessing the premiums charged to the insurance company for excess-of-loss reinsurance, and a sufficiently accurate fit of the log-normal would facilitate this.
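Under a fitted log-normal, the net premium for an excess-of-loss layer has a closed form, the stop-loss formula E[(X - d)+] = e^(μ+σ²/2) Φ(σ - z) - d Φ(-z) with z = (ln d - μ)/σ. A minimal Python sketch, with illustrative (not fitted) parameters:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lognormal_excess(d, mu, sigma):
    """Expected payout above retention d, E[(X - d)+], when the
    claim size X is log-normal LN(mu, sigma^2). This is the
    standard stop-loss quantity a reinsurer would price from."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    if d <= 0:
        return mean - d
    z = (math.log(d) - mu) / sigma
    return mean * norm_cdf(sigma - z) - d * norm_cdf(-z)

# Illustrative parameters only, not the project's fitted values.
premium = lognormal_excess(100_000.0, mu=9.0, sigma=1.5)
print(f"net excess-of-loss premium per claim: {premium:,.0f}")
```

Because only the upper tail enters this integral, an accurate log-normal fit to the tail is exactly what makes the premium calculation trustworthy, as the summary notes.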
ANALYSIS OF CRIMINAL DATA
The project is divided into two parts. In the first part, two sources of criminal records, the Offender Index and the records from the Criminal Record Office, are compared for accuracy and for differences between them. In the second part, models are built to predict the court outcome (in terms of the type of sentence given to the offender).
In the first part, a third source of information is also used to evaluate the accuracy of the two sources of criminal records. It is found that both sources are quite accurate in recording the personal information (date of birth) of individuals and the date of the sample court appearance. However, the two sources are not consistent with each other in recording the number of court appearances, the offence conviction information and the sentence information. The study also compares the information from four different local areas representing rural, county, metropolitan and small-city areas. These problems are found to be more apparent in the county area, where the information recorded in the two sources of criminal records shows great inconsistency.
In the second part of the project, the information in the Offender Index is used to build a model for predicting the court outcome. The information is stored in a 'Nested Data Structure'. After some manipulation, one dependent variable and 11 independent variables are extracted. Correlation analysis is first done on the dataset and reveals that three independent variables are adequate in explaining the dependent variable. A Linear Discriminant Analysis is done but does not yield a satisfactory model. A relatively new modelling technique, tree-based modelling, is then used and a much better model is obtained. The models are selected on the basis of the estimated misclassification rate, and a number of approaches to estimating it are considered: the apparent misclassification rate, the jackknife estimate, the cross-validation approach and the bootstrap estimates.
The statistical package SPSS was used in the first part and for the data manipulation in the second part of the project. Most of the analysis in the second part was done in the statistical package Splus, and a number of functions were written to perform the analysis.