
DIPLOMA IN MATHEMATICAL STATISTICS, 1993–94

Applied Projects
(as summarised by their authors)

P.A. Beresford Trinity Analysis of US insurance data
(P.D.G. Tompkins, Lane, Clark and Peacock)
S.D. Byers Darwin Analysis of multivariate hormone data in a
psychiatric study
(Professor I.M. Goodyer, Developmental Psychiatry)
I.J. Clubb Gonville & Caius Construction of age-related centiles from reference
data: a study of childhood haematology
(Dr T.J. Cole, Dunn Nutrition Centre)
A. Khan Trinity Hall Delta hedging strategies for European call options
(Dr P.J. Hunt, NatWest Markets)
V.S.F. Lo Trinity Hall Multidimensional scaling and intra-European trade
flows: a methodological evaluation
(Dr A.D. Cliff, Department of Geography)
S.R. Seaman Darwin Exercises in data-resampling with applications to
Alzheimer's disease data
(Dr C.J. Palmer, Institute of Public Health)
D.J. Stent Magdalene Survival analysis for censored data: a study of
English MPs (1550–1689)
(J.E. Oeppen, History of Population and Social Structure)
M. Stephens Churchill The results of Gregor Mendel: an analysis, and
comparison with the results of other researchers
(Dr A.W.F. Edwards, Institute of Public Health)
J.W. Tanser Darwin Estimation and analysis of non-linear growth models
for plants and plant diseases
(Dr C.A. Gilligan, Department of Plant Sciences)
B.D.M. Tom St John's Confidence interval procedures based on the median
survival time(s)
(J. Matcham, Amgen Ltd)
C.S. Wong Magdalene Analysis of criminal data
(D. O'Mahoney, Institute of Criminology)

ANALYSIS OF US INSURANCE DATA
The purpose of the project was to analyse the gross payment amounts for 24051 professional indemnity insurance claims (i.e. claims incurred by professionals during the practising of their profession). The insurance was written between 1970 and 1993, in 10 different US states, by an unspecified insurance company. As well as State, the two other factors that were analysed were the Type Code (representing the type of risk insured) and the Year of claim.
The payment amount, while obvious for closed (i.e. finally settled) claims, had to be calculated for open claims by

    claim amount = estimated liability × share of contract.

This failed to take into account the possibility of the insurance company being found not liable in court, and so another dataset was produced by sampling from the first with the estimated probability of liability.
The Smirnov test, implemented in Splus, showed that all three factors affected the distribution, and as the sample size was insufficient to stratify the data in three ways at the same time, each factor was investigated separately. Three distributions were used to try to model the data:
• The Log-Normal LN(μ, σ²)
• The Generalised Gamma G(k, p, λ)
• The Generalised F GF(m₁, m₂, μ, σ²)
None of these fitted perfectly, although the log-normal was quite close, and the generalised F improved upon it slightly at the expense of two extra parameters. One conclusion drawn from the stratification by year was that the log-normality weakened as the balance between open and closed claims was approached, and then strengthened again. This suggested that the estimates of liability in the open cases were biased: the fact that the estimates were used for determining the levels of reserves meant that they were probably conservative.
The best fit was for the upper tail of the distribution, possibly because the importance of the high-value claims meant that greater care was taken in estimating the liabilities, and so the estimates were more accurate. The fit of the log-normal to the upper tail was tested and found to be quite good, while the generalised F was again slightly better. An interesting application of these upper tails is in assessing the premiums charged to the insurance company for excess-of-loss reinsurance, and a sufficiently accurate fit of the log-normal would facilitate this.
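
As an illustration of the kind of fitting described above, here is a minimal Python sketch (the project itself used Splus, and the data below are simulated stand-ins for the confidential claim amounts):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    claims = rng.lognormal(mean=8.0, sigma=1.5, size=24051)  # stand-in data

    # Fit LN(mu, sigma^2) by maximum likelihood on the log scale.
    logs = np.log(claims)
    mu_hat, sigma_hat = logs.mean(), logs.std()

    # Kolmogorov-Smirnov test of the fitted log-normal (the p-value is only
    # approximate, since the parameters were estimated from the same data).
    d, p = stats.kstest(claims, "lognorm", args=(sigma_hat, 0, np.exp(mu_hat)))
    print(f"KS statistic {d:.4f}, p = {p:.3f}")

    # Upper-tail check: fitted tail probability at high empirical quantiles.
    for q in (0.90, 0.95, 0.99):
        x = np.quantile(claims, q)
        print(q, stats.lognorm.sf(x, sigma_hat, scale=np.exp(mu_hat)))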

ANALYSIS OF MULTIVARIATE HORMONE DATA IN A PSYCHIATRIC STUDY
A study of the dependency of major depressive disorder (MDD), a psychiatric condition, on various hormone levels and other disorders involved the collection of data on 134 children and the fitting of a logistic regression for the probability of having MDD.
The nine-dimensional hormone data were subjected to principal components analysis to try to reduce their dimensionality. The questions of discriminating between MDD and obsessive-compulsive disorder (OCD) were examined, as well as the testing of the efficiency of discrimination. Functions were written in the statistical package Splus to facilitate these operations.
Several questions of lesser importance were also addressed.
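
A minimal sketch of the dimension-reduction and discrimination steps, in Python rather than the project's Splus, with invented stand-in data (134 children, nine hormone measurements):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    hormones = rng.normal(size=(134, 9))   # stand-in hormone measurements
    mdd = rng.integers(0, 2, size=134)     # stand-in MDD indicator

    # Standardise, keep a few principal components, then fit the logistic
    # regression for P(MDD = 1 | components).
    model = make_pipeline(StandardScaler(), PCA(n_components=3),
                          LogisticRegression())
    model.fit(hormones, mdd)
    print(model.named_steps["pca"].explained_variance_ratio_)
    print("in-sample accuracy:", model.score(hormones, mdd))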

CONSTRUCTION OF AGE-RELATED CENTILES FROM REFERENCE DATA: A STUDY OF CHILDHOOD HAEMATOLOGY
This project is concerned with the construction of centile charts for haematological measurements in children between the ages of 1½ and 4½ years. It investigates the application of the LMS method proposed by Cole (1988, JRSSA, 151, 385–418).
Centile charts are widely used in clinical medicine as a screening tool. Typically a chart
will have curves denoting the 3rd, 10th, 25th, 50th, 75th, 90th and 97th centiles. They
enable clinicians to compare individuals' measurements with a reference population and
thus identify subjects with extreme measurements for further investigation. Unusually
high or low measurements may indicate the presence of an underlying pathological
condition. Similarly, a significant rise or fall in the centile score of an individual may
point to the development of a problem. Thus centiles are particularly useful when
considering measurements which depend heavily on a covariate such as age.
There are certain properties which it is desirable for the centiles to possess. They should
be smooth, follow the centiles of the reference population as closely as possible, and
have commonality. Commonality ensures that adjacent centiles do not touch or cross
each other and is achieved by some sort of distributional assumption, either explicit or
implicit.
The LMS method assumes that the data follow some Box–Cox power transformation of a normal. The centiles may then be summarised by three smooth curves:
1. The Box–Cox power (L)
2. The median (M)
3. The coefficient of variation (S)
The 100α-th centile is then given by

    C100α(t) = M(t)[1 + L(t)S(t)zα]^(1/L(t)),    L(t) ≠ 0,

or

    C100α(t) = M(t) exp[S(t)zα],    L(t) = 0,

where zα is the normal equivalent deviate for tail area α.
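
To make the formulas concrete, here is a minimal Python sketch evaluating a centile at a single age; the L, M and S values below are arbitrary placeholders (the method itself estimates them as smooth curves of age):

    import numpy as np
    from scipy.stats import norm

    def lms_centile(alpha, L, M, S):
        # 100*alpha-th centile from the LMS parameters at one age.
        z = norm.ppf(alpha)   # normal equivalent deviate for tail area alpha
        if L != 0:
            return M * (1 + L * S * z) ** (1 / L)
        return M * np.exp(S * z)

    # e.g. the 3rd and 97th centiles at an age where L=0.4, M=12.0, S=0.08
    for a in (0.03, 0.97):
        print(a, round(lms_centile(a, L=0.4, M=12.0, S=0.08), 2))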
The application of the method is investigated for two analytes, haemoglobin and ferritin, which together account for approximately 90% of the body's iron and are thus useful in identifying the possible presence of a variety of clinical disorders.
The analysis was performed on the department network using Splus.
DELTA HEDGING STRATEGIES FOR EUROPEAN CALL OPTIONS
The most striking attraction of a project loosely described as `Financial Modelling' is obvious: a map showing the way to riches, money and fast cars. Dreaming is certainly a favourite student pastime. However, a more specific project requirement of assessing hedging strategies for European call options soon arouses suspicion of hard work and a new language of incomprehensible fiscal terminology. Personally, I found the project to be an excellent learning ground, despite it having left my original hopes for piles of wealth unfulfilled.
Applying statistical techniques of stochastic differential equations to practical theories of running portfolios of options and stocks was a perfect way of expanding my knowledge of the financial industry, as well as of the practicality of mathematics, statistics and computing. The main objective of the project was to ascertain whether a maximum worthwhile hedge frequency exists for delta hedging of a European call option. In addition, some investigations were made into the distribution of the corresponding portfolio profit and loss account, and the effects of varying a few of the parameters inherent in the application of Black–Scholes theory to pricing options. The computer packages and languages used were S-plus for analysis and C for simulations.
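
The core of the investigation can be sketched as follows: simulate a Black–Scholes stock path, rebalance the delta hedge of a short call at n equally spaced times, and watch the spread of the terminal profit and loss as n grows. This Python sketch uses assumed parameter values (S0, K, r, sigma, T below), not those of the project:

    import numpy as np
    from scipy.stats import norm

    S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0

    def bs_call(S, tau):
        # Black-Scholes call price and delta with time tau to expiry.
        d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
        d2 = d1 - sigma * np.sqrt(tau)
        return S * norm.cdf(d1) - K * np.exp(-r * tau) * norm.cdf(d2), norm.cdf(d1)

    def hedged_pnl(n_hedge, rng):
        dt = T / n_hedge
        S = S0
        price, delta = bs_call(S, T)
        cash = price - delta * S                   # sell call, buy delta shares
        for i in range(1, n_hedge):
            S *= np.exp((r - 0.5 * sigma**2) * dt
                        + sigma * np.sqrt(dt) * rng.standard_normal())
            cash *= np.exp(r * dt)                 # cash accrues interest
            _, new_delta = bs_call(S, T - i * dt)
            cash -= (new_delta - delta) * S        # rebalance stock holding
            delta = new_delta
        S *= np.exp((r - 0.5 * sigma**2) * dt
                    + sigma * np.sqrt(dt) * rng.standard_normal())
        cash *= np.exp(r * dt)
        return cash + delta * S - max(S - K, 0.0)  # close out, pay the option

    rng = np.random.default_rng(2)
    for n in (4, 12, 52, 250):
        pnl = [hedged_pnl(n, rng) for _ in range(2000)]
        print(n, "hedges: mean %.3f, sd %.3f" % (np.mean(pnl), np.std(pnl)))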

MULTIDIMENSIONAL SCALING AND INTRA-EUROPEAN TRADE FLOWS: A METHODOLOGICAL EVALUATION
Existing literature on patterns of trade between countries suggests that the volume of imports/exports between trading partners is strongly inversely related to the distance between the partners. Accordingly, the conventional approach to modelling trade flows is through so-called Spatial Interaction or Gravity Models. Results from those models depend heavily on the way `distance' is defined.
It has also been of interest whether there is similarity between the outflows (exports) or the inflows (imports) of trade among pairs of trading partners who are comparably close geographically.
If it is true that the trade flow between trading partners is determined by the distance between the partners, then the relative geographical locations of pairs of trading partners should be evident from their relative locations constructable from the volume of imports/exports between these countries. And if it is true that pairs of countries which are closer geographically have higher correlation between their patterns of trade distributions, then the geographical locations of the trading partners should also be constructable from the patterns of outflows and inflows.
In this project we consider only the trading patterns between the members of the
European Community in 1958 and 1989.
The use of the multidimensional scaling method (MDS) is illustrated to examine the
trading pattern between the eleven EC members, namely:
Benelux (BL)    Denmark (DK)    Germany (D)
Greece (GR)     Spain (E)       France (F)
Eire (IRL)      Italy (I)       Netherlands (NL)
Portugal (P)    United Kingdom (UK)
in each of the two years 1958 and 1989 and the contrast in trading patterns between
these two years.
We begin by outlining the principles of MDS. The model is then formally specified and applied to trading patterns between the members of the European Community in the two years 1958 and 1989. Changes in trading patterns between 1958 and 1989 are also examined.
Correspondence analysis techniques are introduced, and applied to the available trade data in order to compare the configurations produced by these methods with those obtained from MDS. Their usefulness is discussed.
It is shown that there is no correlation between (i) the relative geographical locations in any of the final configurations and (ii) the import/export patterns among the EC members in both 1958 and 1989. There is also no evidence of a dependence of the volume of trade flows between the trading partners on their inter-distance.
Furthermore, in modelling trade flows using log-linear models for contingency tables, applied to trading patterns between the members of the European Community in 1958 and 1989, it is shown that trade flows are in fact dominated by the size of the trading partners rather than their relative geographical locations; inter-partner distances account for only a tiny proportion of the total variance in flows. Moreover, size rather than distance effects appear to have increased in importance over time (this may be due to improvements and lower costs in transportation). This explains to some extent the inadequacy of the MDS model.
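
A minimal sketch of a log-linear flow model of this kind, fitted as a Poisson generalized linear model with exporter and importer (size) main effects plus a log-distance term; the country set, distances and flows below are invented, not the project's ECU data:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    countries = list("ABCDEF")
    pairs = [(i, j) for i in countries for j in countries if i != j]
    df = pd.DataFrame(pairs, columns=["exporter", "importer"])
    df["logdist"] = np.log(rng.uniform(100, 3000, size=len(df)))
    df["flow"] = rng.poisson(50, size=len(df))

    # Size (main) effects for exporter and importer, plus a distance term.
    fit = smf.glm("flow ~ C(exporter) + C(importer) + logdist",
                  data=df, family=sm.families.Poisson()).fit()
    print(fit.params["logdist"])   # distance effect; tiny in the project's data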
The MDS non-metric method takes into account only the rank order of the percentage measures of the trade flow (in ECUs), so that all the information concerning the proportion of the raw trade flows accounted for by the size of trade partners is lost. Thus it is impossible to judge the significance or otherwise of the patterns revealed by the normalized data.
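
For concreteness, a minimal Python sketch of non-metric MDS on a precomputed dissimilarity matrix; the trade shares below are random placeholders, and the project's own analysis was done in Splus:

    import numpy as np
    from sklearn.manifold import MDS

    rng = np.random.default_rng(4)
    n = 11                                  # the eleven EC members
    shares = rng.uniform(size=(n, n))       # stand-in trade percentages
    dissim = 1.0 - (shares + shares.T) / 2  # symmetrise; heavy trade = 'close'
    np.fill_diagonal(dissim, 0.0)

    # Non-metric MDS uses only the rank order of the dissimilarities.
    mds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
              random_state=0)
    coords = mds.fit_transform(dissim)
    print("stress:", round(mds.stress_, 3))
    print(coords.round(2))                  # two-dimensional configuration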
One can also use the classical (metric) MDS method to model the problem. However, there is no recognisable evidence of geographical location effects of the EC countries in the final configuration produced. In fact it appears that the volume of trade flow is dominated by size rather than distance.
When the method of multidimensional scaling is applied to the residuals of the `size model' (the log-linear model of trade flows in which mass is measured, in this context, by the GNP per capita of the countries studied), there is no recognisable geographical location effect for the EC members in the final MDS configurations for either year, 1958 or 1989. This suggests that there may exist other factors besides the mass of the countries which play an important role in determining the trading patterns of the EC trading partners. This further suggests that the influence of inter-partner distances on trade flows between the countries may be even less significant.
Nevertheless, the MDS final configurations for the EC countries, in which the volume of trade flow (as a percentage) is used as the similarity matrix, may be more closely related to their relative locations when the inter-distances are measured in terms of `economic distances'. (`Economic distances' are distances measured in terms of the costs of transactions in space.
In broad terms, transport costs are likely to be strongly positively correlated with distance, and so such costs appear to be candidates for defining `distance'. However, it is generally not possible to obtain a single assessment of transfer costs between a pair of countries. Freight rates depend upon the mode of transport selected, the precise origin and destination of consignments, and the nature of the commodities being shifted. In any case, much of this information is treated as confidential between transport companies and their customers.)

EXERCISES IN DATA-RESAMPLING WITH APPLICATIONS TO ALZHEIMER'S DISEASE DATA
This project is concerned with a set of medical data collected for the purpose of investigating the relationship of Alzheimer's Disease to ageing. It is known that sufferers of the disease tend to have microscopic abnormalities called plaques within their brains. Older people without Alzheimer's Disease also commonly have plaques, but sufferers tend to have more. Over 200 autopsies were carried out and, in each case, measurements of processes relating to physical ageing were taken; plaques were counted and the age and sex of the individual were recorded.
The project divided into two parts. In the first part the large number of variables available was reduced to a more manageable number, and logistic models used to predict the probability that an individual has plaques were compared. It was found that the chronological age of the individual is the best predictor of whether he or she has plaques and that, once the age variable was used, none of the physical ageing variables was of further use in prediction.
In the second part of the project the dataset was used to investigate various computer-intensive data-resampling schemes. At present, such schemes are used to assess predictive model adequacy, that is, how well a model generated from one set of data can be expected to fit another sample from the same population, and to estimate the standard error or bias of sample estimators when they cannot be calculated analytically. In this project the possibility of using data resampling to compare the usefulness of the potential predictor variables was explored, and two methods of resampling for model adequacy were considered.
To assess variable usefulness the dataset was randomly partitioned into two halves of equal size; a logistic model was generated on one half (the generating half) and then applied to the other half (the testing half). The dataset was partitioned like this in 500 ways and each time statistics relating to model fit and variable significance were recorded. It was found that two statistics were of primary interest: one was the average significance of the variable over the 500 partitions and the other related to the ratio of deviances in the generating and testing halves.
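
A minimal sketch of this repeated half-splitting scheme, in Python rather than the original SPSS/S-Plus, with invented data; the deviance is computed as twice the summed log loss:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(5)
    X = rng.normal(size=(200, 3))                  # e.g. age + two markers
    y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)  # plaques present?

    ratios = []
    for seed in range(500):
        Xg, Xt, yg, yt = train_test_split(X, y, test_size=0.5,
                                          random_state=seed)
        fit = LogisticRegression().fit(Xg, yg)
        dev_g = 2 * len(yg) * log_loss(yg, fit.predict_proba(Xg))
        dev_t = 2 * len(yt) * log_loss(yt, fit.predict_proba(Xt))
        ratios.append(dev_t / dev_g)               # testing/generating ratio
    print("mean deviance ratio:", round(np.mean(ratios), 3))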
The best method of assessing model adequacy was found to be cross-validation, also known as `leave-one-out'. This involves leaving out each case in turn from the sample, fitting the logistic model to the remaining cases in order to determine the model parameter estimates, and then fitting this model to the excluded case. The model which performed best out of those assessed was the logistic model in which age was the only predictor variable.
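
A minimal leave-one-out sketch for the age-only logistic model, again with invented data in Python:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(6)
    age = rng.uniform(50, 95, size=(200, 1))                 # sole predictor
    plaques = (age[:, 0] + rng.normal(0, 10, 200) > 75).astype(int)

    # Each case is predicted from a model fitted to the remaining cases.
    acc = cross_val_score(LogisticRegression(), age, plaques,
                          cv=LeaveOneOut()).mean()
    print("leave-one-out predictive accuracy:", round(acc, 3))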
The dataset was collected by the Department of Pathology, Southampton University, in collaboration with the Department of Community Medicine, Cambridge University. It was available on SPSS and most of the work was done at the Department of Community Medicine using SPSS for Windows. The results of the resampling were analysed at the Statistical Laboratory using S-Plus, after Awk, a text-processing language, had been used to extract the important parts. This report was produced using the mathematical text processor LaTeX.

SURVIVAL ANALYSIS FOR CENSORED DATA: A STUDY OF ENGLISH MPs (1550–1689)
This project examines methods of optimising the use of limited datasets, particularly those with many missing values, for population history. This is of vital importance to the study of the survival characteristics of the population before the first official census, as so much information is missing. Therefore, the problem is to make the most we can of the limited data that exist.
The actual data used were of English MPs who entered Parliament between 1550 and 1689. Much of the data involved some kind of censoring, the various cases being explained in the text. Prior to 1550 up to half of the observations are unusable because no birthdate was recorded, while the recordings of the early 18th century were taken in a different demographic environment, owing to the inclusion of Scottish MPs for the first time.
The first method investigated in order to find the estimated survival distribution was the Kaplan–Meier product-limit estimator. Although suitable for other datasets and computationally convenient, this method underestimates mortality at younger ages. This was due to the truncation of observation of many of the MPs: they were only observed once they became MPs, and death prior to this was impossible to observe.
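
For reference, a minimal Python sketch of the product-limit estimator for right-censored data (the project implemented its estimators as Splus functions; the lifetimes below are invented):

    import numpy as np

    def kaplan_meier(times, events):
        t = np.asarray(times, float)
        e = np.asarray(events, int)        # 1 = death observed, 0 = censored
        order = np.lexsort((1 - e, t))     # sort by time; deaths first at ties
        t, e = t[order], e[order]
        surv, steps, n = 1.0, [], len(t)
        for i in range(n):
            if e[i] == 1:                  # a death among the n - i at risk
                surv *= 1.0 - 1.0 / (n - i)
                steps.append((t[i], surv))
        return steps

    times = [61, 45, 70, 33, 66, 52]       # hypothetical ages at death/exit
    events = [1, 1, 0, 1, 1, 0]
    for age, s in kaplan_meier(times, events):
        print(f"S({age}) = {s:.3f}")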
The paper by Turnbull (1976) suggests another method of non-parametric estimation of the distribution function, one which accounts for incomplete data due to truncation. Using the idea of self-consistency, an algorithm is constructed and shown to converge monotonically to give a maximum likelihood estimate. This method was formulated as an Splus function, and compared with the Kaplan–Meier results for verification.
Then a comparison between the first and last decades of the period in question was made. A significant difference was observed, so comparisons between every pair of successive decades were undertaken to see whether this was a gradual change, a dramatic change at a certain point, or a combination of the two. These comparisons revealed that the change was not as dramatic as some previous theories, which are mentioned in the project, had suggested. In fact they reflected a trend similar to that found in various results from parish registers of the time: a gradual shift, with mortality at younger ages being hit the hardest.
Finally, brief investigations into observations where births were unknown were carried out. As in many other fields of research, utilizing new techniques for different but similar problems is efficient in both time and money. This is the case here, where research into doubly-censored and/or truncated data, which occur in HIV/AIDS infection studies, is applied to a mortality problem.
The data used were collated by the Cambridge Group for the History of Population and
Social Structure. All of the analyses were carried out using the statistical computing
package Splus, which included using built-in functions and writing others, including the
Turnbull Algorithm. The text was prepared using the text-processing program LaTeX
on the Hewlett-Packard network system of the Statistical Laboratory.

THE RESULTS OF GREGOR MENDEL: AN ANALYSIS, AND COMPARISON WITH THE RESULTS OF OTHER RESEARCHERS
The work of Gregor Mendel on inheritance in the garden pea, which was carried out
in the 1850s and 1860s, has played a very important part in the formation of modern
genetical theory. There is no doubt that the ideas put forward by Mendel are still highly
relevant today, more than a century later. However, for more than 15 years after his
death in 1884 Mendel's work went mostly unnoticed. Mendel's work was `rediscovered'
by scientists at the beginning of this century, and with this renewed interest came
a much closer analysis of his numerical results, using statistical methods which were
certainly not available to Mendel. The conclusions they came to were that the data
actually fitted Mendel's theory (or at least their interpretation of it) much more closely than could reasonably be expected by chance, some commentators estimating that the chance of getting such a close fit was 1 in 700, or even less. This project analyses
Mendel's results, and compares them both with the standard Mendelian models, and
with the results of researchers who conducted experiments similar to those of Mendel.
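
As an example of the kind of test involved, here is a chi-squared goodness-of-fit calculation in Python for one of Mendel's best-known counts, 5474 round to 1850 wrinkled seeds against the 3:1 expectation (the project's own analyses were done in Splus):

    from scipy.stats import chisquare

    observed = [5474, 1850]                # round vs wrinkled seeds
    total = sum(observed)
    expected = [3 * total / 4, total / 4]  # Mendelian 3:1 ratio
    chi2, p = chisquare(observed, f_exp=expected)
    print(f"chi-squared = {chi2:.3f}, p = {p:.3f}")  # a strikingly close fit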
The fact that Mendel's work has been extensively analysed caused me a small prob-
lem. Readers of the project will not necessarily be familiar with the large amount of
background analysis which I have read. I have therefore attempted to keep reference
to previous analyses to a minimum; but there are so many different arguments and theories presented in the literature that the reader could not help but benefit from
some familiarity with the papers cited in the bibliography. Notwithstanding this, the
project is intended to stand alone as a self-contained report.
The structure of the report is as follows:
• Chapter 1 contains a brief introduction to the history of Mendel's experiments and the necessary basic genetical theory and terminology.
• Chapter 2 presents an analysis of Mendel's results using various statistical methods. The relative merits of these statistical methods are then compared in Chapter 3.
• Chapters 4 and 5 look at smaller sections of Mendel's results which are of particular interest.
• Chapter 6 introduces data collected by other researchers, and compares them with both the Mendelian model and the data collected by Mendel.
• Chapter 7 presents a summary of the conclusions drawn, and suggestions for further work.
The report was created using LaTeX, with graphics being imported from Splus via Postscript files. All statistical analysis and random simulation were carried out using Splus.

ESTIMATION AND ANALYSIS OF NON-LINEAR GROWTH MODELS FOR PLANTS AND PLANT DISEASES
The purpose of this project was to examine data taken from published papers on the
development of plants under various conditions, or the spread of a disease through
a population of plants. These data were then analysed using intrinsically non-linear
models, and the parameter estimates were examined to see if any relationships were
evident between these estimates and the conditions under which the development took
place.
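
A minimal sketch of such a fit, using a three-parameter logistic curve and least squares in Python (the project used the Maximum Likelihood Program; the disease-progress data below are invented):

    import numpy as np
    from scipy.optimize import curve_fit

    def logistic(t, K, r, tm):
        # Final level K, rate r, and time of fastest growth tm.
        return K / (1 + np.exp(-r * (t - tm)))

    t = np.array([0, 5, 10, 15, 20, 25, 30], float)     # days
    y = np.array([2, 8, 25, 55, 78, 90, 94], float)     # % plants diseased

    (K, r, tm), _ = curve_fit(logistic, t, y, p0=[100, 0.2, 15])
    print(f"K = {K:.1f}%, r = {r:.3f}, time of fastest growth = {tm:.1f}")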
This form of analysis had not been used to examine these data sets before, and it is thought that by using the non-linear models some understanding can be gained of the underlying processes. C.A. Gilligan published two papers in 1990 in which it was shown that the rate parameter is often unchanged by the treatments. This has implications for how the development can be changed: in this case, that the development cannot be speeded up or slowed down significantly. In the case where the progress of a disease is being studied, this would mean that it was not possible to reduce the speed at which the disease spread through a population, but it might be possible to delay the onset or reduce the severity.
This analysis can also lead to finding relationships between the quantitative level of the treatment in the experiment (which may be, e.g., time heated, temperature, or density of the disease-causing organism) and the parameters in the non-linear model. This is also useful in understanding the biological processes, as well as allowing predictions to be made about future behaviour.
The actual model tting was performed using the Maximum Likelihood Program, and
the residuals were recalculated using Splus, which also produced all the plots and graphs
in this project.
A total of seven data sets were examined in the course of the project. Two each came
from papers written by Y. Elad et al. (1984), A.H.C. van Bruggen et al. (1986), and
S. Freeman et al. (1988), with the final data set coming from a paper by N.W. Callan et al. (1990). The data sets were taken from figures in the papers, which were enlarged and scaled.
In these papers Y. Elad et al. (1981) and N.W. Callan et al. (1990) were measuring the percentage of plants diseased, and the other papers looked at the percentage of seeds which emerged. In each case the plants were grown under a number of conditions, designed to examine how these affected the variable of interest. The papers looked at the use of chemical, biological and thermal methods of controlling disease, and also at how the amount of disease-causing organism present in the soil changed the growth of the plant.
The non-linear models fitted allowed the final level of the disease, the speed of growth and the time of fastest growth to be compared between conditions, and where there was a sufficient number of related conditions an attempt was made to fit relationships between these parameters and the conditions. This was done for four of the data sets; for the others, suggestions were made on probable relationships where it was reasonable to do so, but there was insufficient time to fit the models.
The project was written using LaTeX and the spelling corrected using ispell.

CONFIDENCE INTERVAL PROCEDURES BASED ON THE MEDIAN SURVIVAL TIME(S)
The aim of this project was to design a program which would construct a confidence interval for the difference in median survival times when the data used are subject to right-censoring. The project was prompted by work done at Amgen Ltd: during the analysis of data collected from a Phase III clinical study on patients with small-cell lung cancer, it was found that a point estimate and a 95% confidence interval for the difference in median survival times would be useful additions to the statistical analysis of the study. However, no standard statistical package had a function which could determine the confidence interval.
The confidence interval procedure used in designing this program was developed by Su and Wei (1993). This procedure is based on minimizing a certain quantity W(θ₀, θ₁) to arrive at a test statistic G(θ₀), which is used to test the null hypothesis that the `suggested' difference in medians equals the true difference. The inversion of this test then provides a confidence interval for the difference of the medians.
The developed program was subjected to a small-scale simulation study to test its effectiveness, and was then applied to the aforementioned Phase III data to obtain a confidence interval for the difference.
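
Su and Wei's test inversion is not reproduced here, but the target quantity can be illustrated with a simpler bootstrap sketch in Python: resample each arm, recompute the Kaplan–Meier median survival times, and take percentile limits for the difference (all data below are simulated):

    import numpy as np

    def km_median(times, events):
        t = np.asarray(times, float)
        e = np.asarray(events, int)
        order = np.lexsort((1 - e, t))
        t, e = t[order], e[order]
        surv, n = 1.0, len(t)
        for i in range(n):
            if e[i] == 1:
                surv *= 1.0 - 1.0 / (n - i)
                if surv <= 0.5:
                    return t[i]            # first time S(t) drops to 1/2
        return np.nan                      # median not reached

    def simulate(n, scale, rng):           # lifetimes with random censoring
        life = rng.exponential(scale, n)
        cens = rng.uniform(0, 3 * scale, n)
        return np.minimum(life, cens), (life <= cens).astype(int)

    rng = np.random.default_rng(7)
    t1, e1 = simulate(80, 10.0, rng)
    t2, e2 = simulate(80, 14.0, rng)

    diffs = []
    for _ in range(1000):
        i = rng.integers(0, len(t1), len(t1))
        j = rng.integers(0, len(t2), len(t2))
        d = km_median(t2[j], e2[j]) - km_median(t1[i], e1[i])
        if not np.isnan(d):
            diffs.append(d)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    point = km_median(t2, e2) - km_median(t1, e1)
    print(f"difference in medians {point:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")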
The program was written in Splus, while interfacing with C to do the intensive numerical work required. All the graphical work, as well as the statistical analysis performed on the data, was done using Splus. The report was typed in LaTeX, with diagrams being imported from Splus and from the xfig drawing program present in the Unix system.
The make-up of this report is as follows: Chapter 1 introduces the notation and basic concepts which are essential for understanding the work discussed throughout the report. Chapter 2 introduces the problem of determining a confidence interval for the median, with emphasis being placed on the procedure of Brookmeyer and Crowley. The next chapter follows on from the one-sample problem and discusses the two-sample problem of constructing a confidence interval for the difference of medians. The problems experienced in putting the theory of Su and Wei's procedure into practice are also touched on in this chapter, as are the ways used to overcome these problems. Chapter 4 discusses the simulation work carried out in testing the effectiveness of the program, while Chapter 5 discusses the application of this program to an example, via the analysis carried out on the small-cell lung cancer data. Chapter 6 discusses recommendations for future work in this area, and Chapter 7 contains the conclusions of the report. The data set and the programs written can be found in the Appendices. Documentation and commentary for the programs can also be found there.

ANALYSIS OF CRIMINAL DATA
The project is divided into two parts. In the first part, the accuracy of, and the differences between, two sources of criminal records are compared: the Offender Index and the records from the Criminal Record Office. In the second part, models are built to predict the court outcome (in terms of the type of sentence given to the offender).
In the first part, a third source of information is also used to evaluate the accuracy of the two sources of criminal records. It is found that both sources of criminal records are quite accurate in recording the personal information (date of birth) of individuals and the date of the sample court appearance. However, the two sources of criminal records are not consistent with each other in recording the number of court appearances, the offence conviction information and the sentence information. The study also compares the information from four different local areas, representing rural, county, metropolitan and small-city areas. It is found that these problems are more apparent in the county area, where the information recorded in the two sources of criminal records shows great inconsistency.
In the second part of the project, the information in the Offender Index is used to build a model for predicting the court outcome. The information is stored in a `Nested Data Structure'. After some manipulation, one dependent variable and 11 independent variables are extracted. Correlation analysis is first done on the dataset and reveals that three independent variables are adequate in explaining the dependent variable. A linear discriminant analysis is done but does not yield a satisfactory model. A relatively new modelling technique, tree-based modelling, is then used and a much better model is obtained. The selection criteria of the models are based on the estimated misclassification rate. A number of approaches to estimating the misclassification rate are considered: the apparent misclassification rate, the jackknife estimate, the cross-validation approach and bootstrap estimates of the misclassification rate.
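
A minimal sketch of the tree-based model and two of these error estimates (apparent and cross-validated) in Python, with invented stand-in variables for the Offender Index data:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(8)
    X = rng.normal(size=(500, 3))             # the three useful predictors
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
    apparent = 1 - tree.score(X, y)           # optimistic: same data twice
    cv = 1 - cross_val_score(DecisionTreeClassifier(max_depth=4,
                                                    random_state=0),
                             X, y, cv=10).mean()
    print(f"apparent {apparent:.3f}, 10-fold cross-validated {cv:.3f}")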
The statistical package SPSS was used for the first part and for the data manipulation in the second part of the project. Most of the analysis in the second part was done in the statistical package Splus, and a number of functions were written to perform the analysis.
