
Beyond The Cox Model: Artificial Neural Networks
For Survival Analysis Part II

Rashmi Joshi*, Colin Reeves§

* Formerly Control Theory and Applications Centre, Coventry University, Priory Street, Coventry, U.K.
§ Faculty of Engineering and Computing, Coventry University, Priory Street, Coventry, U.K.
Tel: +44(0)7956 157094   E-mail: Rashmi.Joshi@hotmail.co.uk, C.Reeves@coventry.ac.uk

Keywords: Artificial Neural Networks (ANNs), survival, non-linear, malignant melanoma, confidence intervals

Abstract

Artificial neural networks (ANNs) are proving popular and successful in a wide variety of medical applications and for non-linear regression and classification. We have previously developed a novel, flexible, non-linear ANN model for the prognosis and prediction of conditional survival probabilities and applied it successfully to censored data [1]. Building on this, we expand the initial probabilistic model; this paper details results with refinements such as enhanced generalisation capability, in order to address issues such as model complexity and topographical structure. The model is trained using a maximum likelihood approach. An asymptotic approximation to the variance-covariance matrix using the Fisher information matrix is discussed, and provides standard errors on the parameter estimates. In addition to the prediction of conditional survival probabilities, hazard, and probability density functions, confidence intervals on the survival estimates are obtained using the Choleski decomposition algorithm and a quasi-bootstrap approach, and are shown. Thus the model's predictive accuracy is further confirmed. The ANN's performance is compared to other popular traditional survival modelling techniques. We conclude that the ANN model's predictive accuracy is at the very least as good as that of a heavily used leading statistical model, the Cox model, and that it is advantageous as a flexible general hazards model when analysing survival data where a specified distributional form or model assumptions are difficult to justify. The proposed ANN model therefore extends the range of data that can now be analysed using survival analysis methods, and is a candidate for use in the analysis of censored survival data.

1 Introduction

The field of ANN research experienced some growth in the 1950s, when one of the earliest models, known as the perceptron, was developed [2]. The growth in neural computing techniques is widely recognised: they have found applications in a variety of fields, such as medicine, for example in clinical diagnosis and analysis [3, 4, 5], and in some cases have given results that match or surpass those obtained from statistical models [6].

The analysis of time-to-event data, i.e. data concerned with the time from a defined time origin until the occurrence of a particular event of interest, is termed survival analysis. The field of survival analysis has experienced tremendous growth during the latter half of the 20th century. Of primary interest in this field is the investigation of the functional relationship between covariates, such as treatment or subject characteristics (possible risk factors), and the time to occurrence of an event such as death, disease recurrence or cure. By identifying factors of prognostic significance for a particular disease, valuable information may be utilised in an important area of medical statistics; for instance, predictions of the survival characteristics of a particular disease have major implications for patient management and care strategies.

The survivor and hazard functions are estimated from the observed survival times and are of main interest when analysing survival data. The probability density function of t, the actual survival time of an individual, is f(t). The survivor function, S(t), is the probability that the survival time is greater than or equal to t. The related hazard function h(t) denotes the instantaneous death rate and represents the probability that the event occurs at time t, conditional on it not occurring prior to time t. The following relationship holds:

    h(t) = f(t) / S(t)    (1)

The cumulative hazard function H(t) is defined as

    H(t) = ∫₀ᵗ h(u) du    (2)

so that

    S(t) = e^(−H(t))    (3)

A fundamental characteristic of survival data is that survival times are frequently censored (i.e. the end point of interest does not occur, for instance because the case has been lost to follow-up). Right-censored cases have survival times that are greater than some defined time point. The data used in this study contain right-censored survival times.

The semi-parametric Cox proportional hazards model (Cox PH) is a popular choice in the analysis of censored survival data [7], in addition to parametric models [8]. However, both impose distributional forms or assumptions that are not always justifiable; for instance, a
fundamental assumption of the former model is that the hazard of death at any given time for an individual in one group is proportional to the hazard at that time for a similar individual in another group. Although it has increased stability and flexibility over parametric models that specify a particular distribution, the proportional hazards assumption will often not be reasonable, and methods to test the validity of this assumption [9] frequently discover that it does not hold in many data sets [10, 11]. Furthermore, the regression requires that the correct functional form be defined. There has therefore been increasing interest in more flexible models in recent years.

Due to their less restrictive frameworks that can incorporate non-linearities, ANNs may be viewed as flexible models for non-linear multivariate problems. Indeed they have acquired increasing attention over the past decade as mathematical tools that may be used for solving non-linear regression or classification tasks. ANNs provide the potential of producing more accurate predictions of survival time than do traditional methods, and are becoming popular tools for analysing many types of data. Applications for the prediction of probability distributions have been suggested [12, 13], and ANNs have been applied to classification and prediction tasks in the biomedical field, where regression models have traditionally been used, in order to improve the prediction of outcome [14, 15]. One of the most important components of the development of survival analysis has been the formulation of censored-data survival analysis methods. Recent papers have shown that ANNs have successfully processed censored outcome time data, and therefore may be used as alternatives to standard regression models for survival data [16, 17]. However, few ANN studies in the published literature retain censored observations and/or use continuous time.

This paper details refinements to a probabilistic ANN model for the analysis of censored survival data, previously proposed by the authors [1], with enhancements concerning generalisation capability, model complexity and confidence intervals on the predicted survival outputs. Censored observations are not omitted from the analysis. The ANN's predictive performance is compared to traditional survival modelling techniques: the Cox PH model, the empirical non-parametric Kaplan-Meier estimate (KM) [18], and the parametric log-normal model (LN) [8].

2 The Data

The data set used in this study consists of malignant melanoma patients diagnosed from 1987-1996 in the West Midlands region. There are 1160 females and 786 males. The survival time variable t is defined as the number of days from date of diagnosis (entrance to the register) to the end of the study, due either to death or to survival until the cut-off date (31st December 1999). Censored observations are defined as patients who have not died; with 1439 censored cases, this is a heavily censored data set. Significant prognostic factors used are summarised in Table 1. As the distribution of the pathological depth (tumour thickness) variable is extremely positively skewed, a log transformation was applied in order to normalise it, resulting in the categories shown.

Table 1. The data

Factor   | Description                                                         | Level
Townsend Score: measure of social deprivation
TQ = 0   | Townsend score between -8.5 and -1.3                                | Affluent
TQ = 1   | Townsend score between -1.29 and 8.8                                | Deprived
Clarke Level: histological stage of cancer
Clar 12  | Melanoma in situ (no invasion) or tumour invades papillary dermis   | Least severe
Clar 3   | Tumour invades papillary reticular dermal interface                 | Medium
Clar 45  | Tumour invades reticular dermal interface or subcutaneous tissue    | Most severe
Log Pathological Depth: log of vertical thickness of tumour, in mm
L.P.12   | depth <= 0.75mm                                                     | Thinnest
L.P.3    | depth between 0.75mm and 1.5mm                                      | Medium thickness
L.P.4    | depth between 1.51mm and 4mm                                        | Medium thickness
L.P.5    | depth > 4mm                                                         | Thickest

3 The ANN Model

3.1 The Model

Using the existing feed-forward multi-layer perceptron (MLP) model [1], designed and implemented in an Excel spreadsheet incorporating matrix multiplication, 24 patient subsets are created according to levels of the input variables defined in Table 1 above.

For convenience, details of the model are presented here. The covariates X = {x₁, x₂, …, x_{p−1}}, a bias parameter x₀ ≡ 1, and the time input x_p, coded as a prognostic variable by using it as a scaled normalised input, comprise the source nodes in the input layer of the network. These are multiplied by the synaptic weights α_jk (j = 0, …, p), for the connections from input node j to hidden node k; the resulting weighted sums constitute the hidden nodes. Each weighted sum is then passed in turn through a non-linear activation function, the commonly used logistic function G, such that

    G(ϑ) = 1 / (1 + exp(−aϑ))    (4)

where a is the slope parameter of the sigmoid function; a slope parameter of unity was used. Using such activation functions gives the outputs a probabilistic interpretation; furthermore, it implies that the ANN is, in principle, a mixture of logistic distributions. The inputs to the second layer, G(y_k) (k = 1, …, m), are then multiplied by weights β_k, and adding a bias parameter β₀, we have a single input to the output node. The same activation function is then applied.

The ANN model is therefore represented as:

    y_k = ∑_{j=0}^{p} α_jk x_j    ∀ k ∈ {1, …, m}    (5)

    V = β₀ + ∑_{k=1}^{m} β_k G(y_k)    (6)

where β₀ is a bias parameter, and the functional representation of the output is

    Z = S(t, X) = G( ∑_{k=1}^{m} β_k G(y_k) + β₀ )    (7)

Equations (5) to (7) are summed over all cases (the sample
index i has been suppressed for clarity). As the batch method of learning is used, training and validation processes common to ANN studies are not required. The ANN model is illustrated in Figure 1.

Figure 1. The ANN model. (Input layer: X0 = 1 (bias), X1 = TQ, X2 = Clar12, X3 = Clar3, X4 = Clar45, X5 = L.P.12, X7 = L.P.3, X8 = L.P.5 and Xp = t, the scaled event times; hidden layer: nodes y_k with outputs G(y_k) and weights α_jk, plus a bias of 1; output layer: V and Z = G(V), with weights β_k.)

3.2 Training & Monotonicity

The back-propagation algorithm is a popular choice for training feed-forward MLPs; however, as an alternative optimisation technique we have used the SOLVER tool in Excel, in order to provide a more user-friendly approach, particularly for non-specialists. A common objective (cost) function is the sum of squared differences between output and target values; however, additional valuable information may be retrieved by using a log-likelihood approach.

The log-likelihood for a data set of n exact or right-censored survival times may be written as follows:

    log L = ∑_{i=1}^{n} [ δ_i log f(t_i, X_i) + (1 − δ_i) log S(t_i, X_i) ]    (8)

where δ_i is the censoring status (0 = "censored", 1 = "died"). Differentiating Equation (7) with respect to time, the probability density function may be found, as

    −f(t) = dZ/dt = G′(V) ∑_{k=1}^{m} β_k G′(y_k) α_pk    (9)

where G′(V) = G(V)(1 − G(V)) and G is the logistic activation function in Equation (4). The log-likelihood is then maximised using the SOLVER tool in Excel. This procedure is stopped once convergence is achieved.

In order to ensure that the survival functions are monotonic, the α_pk and β_k weight parameters are constrained to have opposite signs. Therefore, a computationally convenient regularity condition of the ANN, easily incorporated in the SOLVER tool, is

    α_pk ≤ 0,  β_k ≥ 0    ∀ k ∈ {1, …, m}    (10)

3.3 Estimating Hazard Functions

A probability density estimate for each non-censored case is used to compute the log-likelihood, as shown in Equation (8). Estimates of the hazard function for each subset may therefore also be obtained from the ANN by simply dividing the probability density function by the corresponding survivor function S(t).

3.4 Refinements to the Model

Regularisation and Generalisation

In order to find a balance between the bias and variance of the model, regularisation techniques may be employed so as to improve the generalisability of the model beyond the data set and thus prevent over-fitting. Weight decay [19] provides a simple and highly effective form of regulariser, and was used to control this aspect of model complexity by penalising the cost function: a multiple, ω, of the sum of squared weights is subtracted from it. ω was selected by analysing the profile log-likelihood upon convergence of the ANN; a value of 0.01 was determined for the results in this paper. All weights were greatly reduced in size, as weight decay favours small weights (over-fitted mappings with regions of large curvature tend to have large weights).

Pruning

In order to address the ANN's topographical structure, the significance of the weights in the model was assessed using the saliency network pruning technique. The saliency of a weight is defined as the change in the cost function resulting from the deletion of that weight. The ANN was pruned using the optimal brain surgeon (OBS) [20] stepwise procedure, reducing the number of weight parameters in the model to 24. This technique requires both the Hessian, H, and its inverse for the computations of the algorithm; details of the methods used to compute both are described in Section 3.6.

3.5 Model Predictions

Figure 2 below shows a comparison of the ANN output with each survival modelling technique discussed in Section 1. The plot shows predicted outputs for a representative example, subset 24, defined as a patient who resides in a deprived area, with a Clarke level 45 and L.P.5 tumour (most progressed stage and thickest tumour respectively).

Figure 2. Comparison of the ANN output with traditional models – Cox (stratified), LN (the lognormal model) and KM (the Kaplan-Meier estimate). (S(t), from 0.50 to 1.00, plotted against days, 0 to 6000.)
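The pieces assembled in Sections 3.1 to 3.4 can be sketched compactly in code. The following NumPy sketch is ours, not the authors' Excel/SOLVER implementation: the function names, the tiny network sizes and the illustrative weight values are assumptions, and a general-purpose optimiser would stand in for SOLVER. It implements the forward pass Z = S(t, X) of Equations (5) to (7), the density f(t) = −dZ/dt of Equation (9), the penalised log-likelihood of Equation (8) with the weight-decay term of Section 3.4, and the sign constraints of Equation (10).

```python
import numpy as np

def logistic(v, a=1.0):
    # Equation (4): logistic activation; the paper fixes the slope a at unity
    return 1.0 / (1.0 + np.exp(-a * v))

def survival(x, alpha, beta, beta0):
    # Equations (5)-(7); x = [x0 = 1 (bias), covariates..., scaled time t]
    y = x @ alpha                       # (5): y_k = sum_j alpha_jk x_j
    v = beta0 + beta @ logistic(y)      # (6)
    return logistic(v)                  # (7): Z = S(t, X) = G(V)

def density(x, alpha, beta, beta0):
    # Equation (9): -f(t) = dZ/dt = G'(V) sum_k beta_k G'(y_k) alpha_pk,
    # with G'(u) = G(u)(1 - G(u)); alpha[-1, :] holds the time weights alpha_pk
    y = x @ alpha
    g = logistic(y)
    z = logistic(beta0 + beta @ g)
    dz_dt = z * (1.0 - z) * np.sum(beta * g * (1.0 - g) * alpha[-1, :])
    return -dz_dt

def penalised_log_lik(times, X, delta, alpha, beta, beta0, omega=0.01):
    # Equation (8), penalised by the weight-decay term of Section 3.4
    ll = 0.0
    for t, xrow, d in zip(times, X, delta):
        xin = np.concatenate([xrow, [t]])
        f = density(xin, alpha, beta, beta0)
        s = survival(xin, alpha, beta, beta0)
        ll += d * np.log(f) + (1.0 - d) * np.log(s)
    return ll - omega * (np.sum(alpha ** 2) + np.sum(beta ** 2))

# Equation (10): alpha_pk <= 0 and beta_k >= 0 make S(t, X) monotone decreasing in t
rng = np.random.default_rng(0)
p, m = 3, 2                                 # hypothetical sizes, not the pruned 24-weight net
alpha = rng.normal(size=(p + 1, m))
alpha[-1, :] = -np.abs(alpha[-1, :])        # time weights constrained non-positive
beta = np.abs(rng.normal(size=m))           # output weights constrained non-negative
beta0 = 0.5

x_cov = np.array([1.0, 1.0, 0.0])           # bias plus two illustrative binary covariates
S = [survival(np.concatenate([x_cov, [t]]), alpha, beta, beta0) for t in (0.1, 0.5, 0.9)]
print(S)                                    # monotone decreasing in t, by Equation (10)
```

With the constraints of Equation (10) in force, the survivor output decreases in the scaled time input and the implied density is non-negative, which is exactly the property the SOLVER constraints enforce in the paper's implementation.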
The ANN ML output function provides a closer fit to the data than the Cox and lognormal models (the former is a stratified version, in order to accommodate the violation of the PH assumption discovered with the data [21]). This pattern was evident in comparisons of similar plots and 10-year survival probabilities for the other subsets, confirming the ANN's predictive accuracy.

3.6 Hessian and Variance-Covariance Matrices

The outer product approximation [19] (O.P.) was used for generating a Hessian matrix, in order to avoid the singularity and ill-conditioning issues associated with the exact Hessian [21]. Using the Jacobian of the log-likelihood function, the O.P. approximation is given by:

    H_N = ∑_{n=1}^{N} gⁿ (gⁿ)ᵀ    (11)

where H_N denotes the Hessian, N is the number of patterns in the data set and gⁿ is the gradient vector of the cost function for pattern n (for ease of presentation the actual expression is not displayed here). The inverse of the Hessian was computed using a computationally efficient procedure [19]

    H⁻¹_{N+1} = H⁻¹_N − ( H⁻¹_N g^{N+1} (g^{N+1})ᵀ H⁻¹_N ) / ( 1 + (g^{N+1})ᵀ H⁻¹_N g^{N+1} )    (12)

that constructs the matrix one data point at a time. The initial matrix H₀ is αI, where α is a small quantity. Results are not sensitive to the precise value of α, and a value of 0.01 proved acceptable. Both matrices were computed using a specifically written Matlab program [21].

Having computed the Hessian and its inverse, the large-sample approximate variance-covariance matrix may be computed noting Fisher's scoring method, such that the variance-covariance matrix is

    [ −E( d² log L(β) / dβ_j dβ_k ) ]⁻¹ = [ E( (d log L(β)/dβ_j)(d log L(β)/dβ_k) ) ]⁻¹    (13)

where E is the expectation operator and β_j is a parameter (not necessarily identifiable with the β_k weights of the network). This expression is valid for censored data [22]. Using the law of large numbers, the sample average of the partial derivatives in Equation (13) converges to the expectation of one term. The RHS of Equation (13), in its sample-average form, is the outer-product approximation of Equation (11).

3.7 Confidence Intervals

Having arrived at an ANN architecture that gives satisfactory predictions, confidence limits on the weight parameters (standard errors) may be obtained, in addition to confidence intervals on the survival predictions, which act in a further validatory role. Both are formed using simple statistical assumptions on the properties of the weight parameters of the trained ANN, with a quasi-bootstrap approach.

Standard Errors

Using asymptotic multi-parameter maximum likelihood theory, standard errors on the weights are constructed by

    ϑ̂_nj ± c √( (I_n(ϑ̂_n)⁻¹)_jj )    (14)

where ϑ̂_nj represents the optimised weight, c is the appropriate z critical value (1.96 for 95% confidence) and (I_n(ϑ̂_n)⁻¹)_jj is the jj component of the inverse Fisher information matrix [23] (for ease of presentation, values are not presented here).

Confidence Intervals on Survival Predictions

The Choleski decomposition [24] of the variance-covariance matrix of the optimised ANN was used as a function in producing simulations of a multivariate normal distribution [25]. This provides much flexibility and computational efficiency in simulations, and also permits the construction of confidence intervals on the output of the ANN using a bootstrap approach.

Using matrix notation, if W is the variance-covariance matrix and Z is a vector of n independent and identically distributed normal random variables, a linear transformation A of Z creates a new set of random variables, X, where A is chosen such that

    W = V[X] = V[AZ] = A V[Z] Aᵀ = A Aᵀ    (15)

where V[·] denotes the variance-covariance matrix. This approach is further suitable when correlations exist between the variables (analysis of the correlation matrix of the variance-covariance matrix showed various highly positively and negatively correlated weights). 24000 (1000 sets of 24) independent standard normal variables were generated using Matlab. 1000 vectors of transformed variables were then obtained by multiplying each vector of 24 variables by the Choleski triangle, A, to which the values of the optimal weights were added, such that

    X_Chol = AZ + Ŷ    (16)

applied to each simulated vector Z of 24 standard normal values, where Ŷ is the vector of estimated weights and X_Chol collects the 1000 simulated weight vectors. All computations were performed using a specifically written macro in Excel. The process is highly computationally efficient. The 95% upper and lower confidence intervals for the survival functions were then calculated as the 2.5% and 97.5% percentiles of the distribution of the estimates, and are non-symmetric, by nature of the non-parametric bootstrap approach.

Figure 3 depicts a representative example of survival predictions with confidence intervals for subset 23, a patient residing in a deprived area, with a Clarke level 3 and L.P.4 tumour (intermediate stage and medium thickness respectively).

Confidence intervals of predictions for low-volume subsets were neither wider nor particularly narrower than those of predictions for high-volume subsets, indicating the robustness of this approach.
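The computations of Sections 3.6 and 3.7, i.e. the outer-product Hessian of Equation (11), the standard errors of Equation (14), and the Choleski-based quasi-bootstrap of Equations (15) and (16), can be sketched as follows. The paper performs these steps in a Matlab program and an Excel macro; this NumPy version is an illustrative sketch under our own naming, with made-up sizes and a stand-in `predict` hook in place of the trained network.

```python
import numpy as np

def outer_product_hessian(grads):
    # Equation (11): H_N = sum_n g^n (g^n)^T, from per-case gradient (score) vectors
    # grads has shape (N, W): one row per case, one column per weight
    return grads.T @ grads

def half_widths(hessian, c=1.96):
    # Equations (13)-(14): invert the (approximate) information matrix and take
    # square roots of the diagonal; theta_hat +/- c * s.e. gives the limits
    cov = np.linalg.inv(hessian)
    return c * np.sqrt(np.diag(cov))

def survival_ci(cov, y_hat, predict, n_sims=1000, seed=0):
    # Equations (15)-(16): factor W = A A^T, simulate weight vectors A z + y_hat,
    # push each through the network output, and take percentile limits.
    A = np.linalg.cholesky(cov)                     # Choleski triangle (15)
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_sims, len(y_hat)))   # iid N(0, 1) draws
    sims = Z @ A.T + y_hat                          # (16), one weight vector per row
    outputs = np.array([predict(w) for w in sims])
    return np.percentile(outputs, [2.5, 97.5])      # non-symmetric 95% interval

# Illustrative run with made-up numbers (3 weights, not the paper's 24):
rng = np.random.default_rng(1)
grads = rng.normal(size=(50, 3))                    # pretend per-case score vectors
H = outer_product_hessian(grads)
se_limits = half_widths(H)
cov = np.linalg.inv(H)
y_hat = np.array([0.5, -0.2, 0.1])
lo, hi = survival_ci(cov, y_hat, lambda w: 1.0 / (1.0 + np.exp(-w.sum())))
print(se_limits, lo, hi)
```

Because the interval is read off the percentiles of the simulated outputs rather than from a symmetric normal approximation, it is non-symmetric about the point prediction, matching the behaviour reported above.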
Figure 3. ANN survival predictions with 95% lower and upper confidence intervals for subset 19. (S(t), from 0.4 to 1.0, plotted against days, 0 to 6000, showing the lower CI, S(t) and upper CI curves.)

4 Discussion and Conclusion

This paper details refinements to a flexible non-linear ANN model for the prediction of conditional survival probabilities, hazard and probability density functions. The model is trained using a maximum likelihood estimation approach. Generalisation and model complexity have been addressed. An asymptotic approximation to the variance-covariance matrix provides standard errors on the parameter estimates, in addition to confidence intervals on the survival estimates, obtained using a quasi-bootstrap approach with the Choleski decomposition algorithm. Thus the model's predictive accuracy is further confirmed, and it shares desirable similarities with conventional statistical methodology.

The lack of clear interpretation of the weight parameters and the choice of regularisation parameter form the biggest disadvantages of the model. However, our investigations show that predictions are not too sensitive to the latter.

At the very least, survival predictions are as good as those of traditional statistical models; moreover, the model overcomes difficulties in applying and fitting traditional models where distributional assumptions are untenable. The model's user-friendly approach is enhanced by its Excel spreadsheet implementation. Thus the model remains a valuable flexible prognostic tool for the analysis of censored survival data.

Acknowledgements

Thank you to the West Midlands Cancer Intelligence Unit for the data set and collaboration with the project.

References

[1] Joshi, R., Johnston, C., and Reeves, C., Beyond the Cox Model: Artificial Neural Networks for Survival Analysis, Proceedings ICSE 2003 Conference, Coventry (2003).
[2] Rosenblatt, F., The Perceptron: A Perceiving and Recognizing Automaton, Cornell Aeronautical Laboratory Report 85-460-1, Ithaca, NY (1957).
[3] Daponte, J. S., and Sherman, P., Classification of Ultrasonic Image Texture by Statistical Discriminant Analysis and Neural Networks, Computerized Medical Imaging and Graphics, 15, 3-9 (1991).
[4] Mann, N. H. I., and Brown, M. D., Artificial Intelligence in the Diagnosis of Low Back Pain, Orthopedic Clinics of North America, 22, 303-314 (1991).
[5] Weinstein, J. N., Grever, M. R., Viswanadhan, V. N., Rubinstein, L. V., Monks, A. P., et al., Neural Computing in Cancer Drug Development: Predicting Mechanism of Action, Science, 258, 447-451 (1992).
[6] Erler, B. S., Vitagliano, P., and Lee, S. L., Superiority of Neural Networks over Discriminant Functions for Thalassemia Minor Screening of Red Blood Cell Microcytosis, Archives of Pathology and Laboratory Medicine, 119, 350-354 (1995).
[7] Cox, D. R., Regression Models and Life-Tables, Journal of the Royal Statistical Society, Series B, 34, 187-220 (1972).
[8] Johnson, N. L., and Kotz, S., Distributions in Statistics: Continuous Univariate Distributions, 1, Houghton Mifflin, Boston (1970).
[9] Grambsch, P. M., and Therneau, T. M., Proportional Hazards Tests and Diagnostics Based on Weighted Residuals, Biometrika, 81, 515-526 (1994).
[10] Hanson, D. L., Horsburgh, C. R. Jr, Fann, S. A., Havlik, J. A., and Thompson, S. E. 3rd, Survival Prognosis of HIV-Infected Patients, Journal of Acquired Immune Deficiency Syndromes, 6(6), 624-629 (1993).
[11] Carter, W. H., Wampler, G. L., and Stablein, D. M., Regression Analysis of Survival Data in Cancer Chemotherapy, Marcel Dekker, New York (1983).
[12] Mulsant, B. H., A Neural Network as an Approach to Clinical Diagnosis, M.D. Computing, 7, 25-36 (1990).
[13] Baum, E. B., and Wilczek, F., Supervised Learning of Probability Distributions by Neural Networks, in Anderson, D. Z. (Ed.), Neural Information Processing Systems (Denver 1987), American Institute of Physics, 52-61 (1988).
[14] Baxt, W. G., Application of Artificial Neural Networks to Clinical Medicine, The Lancet, 346, 1135-1138 (1995).
[15] Dybowski, R., and Gant, V., Artificial Neural Networks in Pathology and Medical Laboratories, The Lancet, 346, 1203-1207 (1995).
[16] Liestøl, K., Andersen, P. K., and Andersen, U., Survival Analysis and Neural Nets, Statistics in Medicine, 13, 1189-1200 (1994).
[17] Ripley, R. M., Harris, A. L., and Tarassenko, L., Non-linear Survival Analysis using Neural Networks, Statistics in Medicine, 23, 825-842 (2004).
[18] Kaplan, E. L., and Meier, P., Nonparametric Estimation from Incomplete Observations, Journal of the American Statistical Association, 53, 457-481 (1958).
[19] Bishop, C. M., Neural Networks for Pattern Recognition, Oxford University Press (1995).
[20] Hassibi, B., Stork, D. G., and Wolff, G. J., Optimal Brain Surgeon and General Network Pruning, IEEE International Conference on Neural Networks, San Francisco, 293-299 (1993).
[21] Joshi, R., Modelling Survival Time Distributions of Cancer Data using Artificial Neural Networks, Ph.D. Thesis, Coventry University (2004).
[22] Escobar, L. A., and Meeker, W. Q., Fisher Information Matrices with Censoring, Truncation, and Explanatory Variables, Statistica Sinica, 8, 221-237 (1998).
[23] Geyer, C. J., Fisher Information and Confidence Intervals using Maximum Likelihood, Stat 5102 Notes, http://www.stat.umn.edu/geyer/5102/notes/fish.pdf (2003).
[24] Maindonald, J. H., Statistical Computation, Wiley (1984).
[25] Parramore, K., On Simulating Realizations of Correlated Random Variables, Teaching Statistics, 22(2), 61-63 (2000).