
A Method for Predicting Death and Illness in an Elderly Person

by
Michael Walter Anderson
B.A. (College of William and Mary in Virginia) 1991
M.A. (University of California at Berkeley) 1993
A dissertation submitted in partial satisfaction of the
requirements for the degree of
Doctor of Philosophy
in
Demography
in the
GRADUATE DIVISION
of the
UNIVERSITY of CALIFORNIA at BERKELEY
Committee in charge:
Professor Kenneth W. Wachter, Chair
Professor Ronald D. Lee
Professor John R. Wilmoth
Professor David A. Freedman
1997
The dissertation of Michael Walter Anderson is approved:

Chair                                                  Date

                                                       Date

                                                       Date
University of California at Berkeley
1997
A Method for Predicting Death and Illness in an Elderly Person
1997
by
Michael Walter Anderson
Abstract
A Method for Predicting Death and Illness in an Elderly Person
by
Michael Walter Anderson
Doctor of Philosophy in Demography
University of California at Berkeley
Professor Kenneth W. Wachter, Chair
This dissertation describes three accomplishments. Statistically, it defines a
nonparametric predictor for efficiently classifying binary observations. The model uses
Boolean operators to combine binary predictor variables in a variation on classification trees.
Second, this approach is demonstrated by predicting three-year mortality and illness in a
sample of elderly persons (ages 65+) with compact subsets of questions from a baseline
survey (EPESE: Established Populations for Epidemiologic Studies of the Elderly,
N=10,294). Third, the author constructs a search algorithm to fit the model, using a test set
to choose model sizes and estimate error. These tasks are united under a common objective:
predicting whether an elderly person would die within three years by asking simple questions
about the individual's demographic status, functionality, and epidemiological history. The
results also inform substantive analyses of mortality.
Several models are presented. Set A correctly predicted 66% of deaths in an internal
test set with a specificity of 70%. Set B caught 42% of deaths with a specificity of 89%. Set
C caught 22% of deaths with a specificity of 96%. The estimated area under the receiver
operating characteristic curve was 74.4% ± 1.2%. The accuracy is comparable to that
achieved by larger models built with more complicated methods (e.g., linear discriminant
analysis, logistic regression). However, the models developed here use fewer variables and
achieve a more elegant, easily interpreted representation. For example, Set B requires only
seven variables, all in the form of simple questions (e.g., "Can you walk half a mile?"). To
validate the error estimates, the models were applied to an independent sample to which the
author was blinded (the North Carolina EPESE respondents, N=4,162). The accuracy was
slightly lower for this sample, but the size of the difference was small (comparable to
sampling error).
Although the models are designed for the purposes of prediction, they also facilitate
etiological analyses by dividing the respondents into more homogeneous subsets. Based on
evidence provided in the results, it is argued that cancer was underestimated as a cause of
death in the EPESE sample, and that digitalis toxicity may have been a substantial source of
excess mortality.
Chair
This dissertation is dedicated to the elderly.
Table of Contents
List of Tables vi
List of Figures vi
Acknowledgments viii
Chapter 1 - Questions that predict death and illness 1
Section 1.1 - Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Section 1.2 - The value of predicting mortality 5
Section 1.3 - Data and concept 9
Section 1.4 - A general search algorithm 14
Section 1.5 - The test set method 19
Section 1.6 - Use of the test set to determine the number of questions 26
Section 1.7 - Results: mortality 32
Section 1.8 - Results: heart failure, stroke and cancer 38
Section 1.9 - Discussion 41
Section 1.10 - An overview of the dissertation 52
Chapter 2 - The problem of prediction 55
Section 2.1 - The specific problem of predicting death or illness 55
Section 2.2 - Defining prediction and prediction error for a general predictor .. 56
Section 2.3 - Previous efforts to predict mortality and morbidity 63
Section 2.4 - Regression as a method of classification 65
Section 2.5 - Classification trees and Bayes' rule 71
Section 2.6 - Other methods of classification 88
Section 2.7 - Existing applications of the models 96
Chapter 3 - Data: Established Populations for Epidemiologic Studies of the Elderly .. 105
Section 3.1 - The EPESE project 105
Section 3.2 - Composition of the populations 106
Section 3.3 - The baseline survey 107
Section 3.4 - Outcomes 109
Section 3.5 - A comparison of the EPESE sample and the U.S. population 110
Section 3.6 - Functionality, morbidity and causes of death in the
EPESE sample 118
Section 3.7 - Missing data 131
Chapter 4 - The statistical methods of prediction 135
Section 4.1 - Approaches to model selection 135
Section 4.2 - A method for choosing questions 137
Section 4.3 - Performance of the search algorithm 151
Section 4.4 - Linear discriminant analysis 163
Section 4.5 - Logistic regression 166
Section 4.6 - The CART algorithm for model selection 169
Section 4.7 - Modifications and additions to the question set method 172
Chapter 5 - Tools for prediction on an individual level 177
Section 5.1 - Simple questions for predicting survival or death 177
Section 5.2 - An index for the risk of mortality: linear discriminant analysis . 186
Section 5.3 - Classification trees for predicting death 197
Section 5.4 - A regression model of mortality 202
Section 5.5 - Validation of the question sets with the North Carolina sample .205
Section 5.6 - Validation of the discriminant and classification tree models 219
Section 5.7 - A comparative analysis of the models 222
Section 5.8 - An additional question set model 227
Chapter 6 - The causes of death in the elderly 230
Section 6.1 - Causes as identified by the death certificate 230
Section 6.2 - Underlying and associated causes in the EPESE population 236
Section 6.3 - The causes of death associated with the models 240
Section 6.4 - The causal processes and risk factors associated with death 243
Section 6.5 - Digitalis use and mortality: cause or consequence? 248
Chapter 7 - Conclusions 255
Section 7.1 - The power of the models for predicting mortality 255
Section 7.2 - The efficiency of the search method 256
Section 7.3 - Implications for causal analyses of mortality 257
Section 7.4 - Future applications of these methods 261
References 263
Appendix I - Questions for the prediction of death 269
Appendix II - Questions for predicting heart failure 273
Appendix III - Questions for predicting strokes 275
Appendix IV - Questions for predicting cancer 277
Appendix V - C code for the repeated random search algorithm (RSA) 279
Appendix VI - Mortality index for the elderly 294
Appendix VII - C code for the repeated random and exhaustive search
algorithm (RRESA), combined with backward deletion to estimate
error on a test set 299
Appendix VIII - A large question set for predicting death 333
List of Tables
Table 1.1 - Predicted outcome by true outcome for Question Set A of Appendix I
as applied to a test set 23
Table 1.2 - Deaths predicted correctly and survivors predicted incorrectly as dead 33
Table 1.3 - Heart failures predicted correctly and non-failures predicted
incorrectly 40
Table 1.4 - Strokes predicted correctly and non-strokes predicted incorrectly 42
Table 1.5 - Cancer predicted correctly and non-cancer predicted incorrectly 43
Table 3.1 - Variables in the baseline survey 108
Table 3.2 - Observed sample and U.S. population estimates of probability of dying
within three years of age x (3Qx) by sex 117
Table 4.1 - Specifications for the construction of the question set models 147
Table 5.1 - Predicted outcome by true outcome for Question Set A, full dataset 177
Table 5.2 - Predicted outcome by true outcome for Question Set A, internal test set .. 178
Table 5.3 - Predicted outcome by true outcome for Question Set B, full dataset 180
Table 5.4 - Predicted outcome by true outcome for Question Set B, test dataset 182
Table 5.5 - Predicted outcome by true outcome for Question Set C, full dataset 183
Table 5.6 - Predicted outcome by true outcome for Question Set C, test dataset 185
Table 5.7 - Probability of death within three years by mortality index 188
Table 5.8 - Average and median mortality index score by age and sex 192
Table 5.9 - Deaths and survivors predicted by discriminant analysis 194
Table 5.10 - Deaths and survivors predicted by classification trees 198
Table 5.11 - Coefficients in the logistic regression model of mortality 203
Table 5.12 - Predicted outcome by true outcome, logistic regression, test set 204
Table 5.13 - Results of applying Question Set A to Duke sample 210
Table 5.14 - Results of applying Question Set B to Duke sample 215
Table 5.15 - Results of applying Question Set C to Duke sample 217
Table 5.16 - Results of applying discriminant model to Duke sample 219
Table 5.17 - Results of applying discriminant model to Duke sample (highest risk) .. 220
Table 5.18 - Results of applying classification tree to Duke sample 221
Table 5.19 - Performance of three methods of prediction 223
Table 5.20 - Predicted outcome by true outcome, Question Set J, full dataset 227
Table 5.21 - Predicted outcome by true outcome, Question Set J, test dataset 227
Table 6.1 - Classification of ICD-9 codes 237
Table 6.2 - Underlying causes of death by question subset 242
List of Figures
Figure 1.1 - Body mass index (weight/height) by systolic blood pressure 20
Figure 1.2 - Misclassification error by number of questions asked 28
Figure 1.3 - True positive fraction by false positive fraction (ROC curve) for
questions predicting deaths in the test set 37
Figure 1.4 - Diagram of death from chronic illness 46
Figure 2.1 - Simulation of Y as a quadratic function of X 68
Figure 2.2 - Example of a classification tree for risk of death 72
Figure 2.3 - Distribution of survivors by blood pressure and body mass index 77
Figure 2.4 - Distribution of decedents by blood pressure and body mass index 78
Figure 2.5 - Contour plot of ratio of survivors' distribution to decedents' distribution 80
Figure 2.6 - Example of question sets combined with OR 86
Figure 2.7 - Division of contour plot, 9-question model 89
Figure 2.8 - Division of contour plot, 3-question model 90
Figure 2.9 - Distribution of deaths and survivors by z index 92
Figure 3.1 - Number of respondents by age and sex 112
Figure 3.2 - Probability of death within 3 years by age and sex 116
Figure 3.3 - Proportion unable to walk half of a mile without help by age and sex 120
Figure 3.4 - Proportion unable to bathe without help by age and sex 121
Figure 3.5 - Proportion ever diagnosed with heart failure by age and sex 123
Figure 3.6 - Proportion ever diagnosed with cancer by age and sex 124
Figure 3.7 - Incidence of heart failure by age and sex 127
Figure 3.8 - Incidence of stroke by age and sex 128
Figure 3.9 - Incidence of cancer by age and sex 129
Figure 4.1 - Histogram of numbers of mutations required to achieve absorption 155
Figure 4.2 - Total number of successful mutations by total number of mutations
(trajectories for 200 random absorption points) 156
Figure 4.3 - Total number of successful mutations by total number of mutations
(trajectories for 200 random absorption points) 157
Figure 4.4 - Probability of a successful mutation by number of previous
successful mutations 160
Figure 4.5 - Model fitness by model size, discriminant analysis 165
Figure 4.6 - Misclassification error by size of tree 173
Figure 5.1 - Respondents chosen by Set A, by age and sex 181
Figure 5.2 - Respondents chosen by Set B, by age and sex 184
Figure 5.3 - Bar plot of probability of death by linear discriminant index score 190
Figure 5.4 - Smoothed estimate of probability of death by linear discriminant
index score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Figure 5.5 - True positive fraction by false positive fraction (ROC curve) for
discriminant model of deaths in the test set 196
Figure 5.6 - True positive fraction by false positive fraction (ROC curve) for
classification trees of deaths in the test set 199
Figure 5.7 - Classification tree for risk of death within 3 years 201
Figure 5.8 - Number of respondents in Duke sample by age, sex, and race 209
Figure 5.9 - True positive fraction by false positive fraction (ROC curve)
for questions predicting death (with Duke results) 218
Acknowledgments
My research would not have been possible without the support of the following persons and
institutions: Ronald Lee, Kenneth Wachter, John Wilmoth, David Freedman, Leo Breiman,
David Aday, the Department of Demography and the Statistical Computing Facility at the
University of California at Berkeley, the National Institutes of Health, Gary Maloney,
Marilyn Friedemann, and my mother and father.
Chapter 1 - Questions that predict death and illness
Danger is a biologic necessity for men, like sleep and dreams. If you face death, for
that time, for the period of direct confrontation, you are immortal. For the Western middle
classes, danger is a rarity and erupts only with a sudden, random shock. And yet we are all
in danger at times, since our death exists; Mektoub, it is written, waiting to present the
aspect of surprised recognition.
Is there a technique for confronting death without immediate physical danger? Can
one reach the Western Lands without physical death? These are the questions that Hassan
I Sabbah asked.
Don Juan says that every man carries his own death with him at all times. The
impeccable warrior contacts and confronts his death at all times, and is immortal.
- William S. Burroughs, The Western Lands
1.1 Introduction
This research has four distinct objectives. From a statistical standpoint, the
dissertation defines a nonparametric method of efficiently classifying observations by using
Boolean operators to combine predictor variables in the form of binary questions. This
approach is demonstrated by predicting three-year mortality in a sample of elderly persons
with small subsets of baseline survey questions. To fit this form of model, the author
constructs a search algorithm that attempts to reduce misclassification error by using internal
test set techniques to choose model sizes and estimate error. Finally, these predictive models
are used to inform more substantive analyses of the causes of death in the elderly.
These diverse goals are unified under a single theme: the prediction of death with
simple survey questions. The central idea of the dissertation is to select a combination of
questions from a large pool of possible questionnaire items such that persons with certain
answers have a high probability of death within three years. Many of these items are simple
questions about the functional status of the individual (e.g., activities of daily living, or
ADL's) which are commonly found in many large health surveys. Below is an example of
such questions that can be answered without help from any clinician, and that take less than
a minute to administer in an interview:
(1) Other than when you might have been in the hospital, was there any time
in the past 12 months when you needed help from another person or from any
special equipment or device to do the following things?
i) Walking across a small room?
No help needed
Needed help
Unable to do
Missing/NA 1
ii) Bathing, either a sponge bath, tub bath or shower?
No help needed
Needed help
Unable to do
Missing/NA
(2) Are you retired?
Yes
No
Missing/NA
The chance that a randomly selected elderly person (aged 65 or over) in the U.S.
would give all three answers in bold is small. However, for those that do, it is estimated
with the data used in this dissertation that there is a 50% to 65% chance of death within three
years. It is improbable that this estimate is the result of a fluke, chance variation, or
statistical artifact, as these estimates were validated on a large, independent sample of elderly
1 The answer "Missing/NA" is explained below. For these questions, this category of answer is
given only by the small minority of respondents who do not give any of the other responses for some reason
(e.g., refusal). See Chapter 3 for a thorough treatment of missing values.
persons after the questions were selected.
First, the questions were constructed with the methods developed below using a
sample of 10,294 respondents from a survey study of persons aged 65 and over from Iowa,
New Haven and Boston (the original EPESE sites). Using an internal test set (explained
below), the respondents chosen by the above questions were estimated to have a 65% chance
of dying. When the questions were then applied to an entirely independent sample of 4,162
elderly persons having a different demographic makeup (the North Carolina EPESE dataset),
a death rate of greater than 50% was observed in the chosen respondents. This dissertation
argues that many more combinations of questions can be found which isolate other groups
of elderly persons who also have a very high risk of death and chronic illness. The central
task is in developing a method for selecting these combinations of questions. The approach
invented here is developed completely in the context of predicting mortality and morbidity;
yet it is somewhat serendipitously designed to be generalizable to the prediction or
classification of nearly any binary outcome based on a large pool of predictor variables. This
is accomplished by using a search algorithm which: 1) handles a large variety of data types;
2) does not utilize distribution-based statistical tests of parameters for model selection; and
3) does not depend on any substantive or context-specific information about the predictor
variables (e.g., medical knowledge about the causes of mortality). Rather, the strategy is
simply to use the power of the computer to search across combinations of questions and
answers that isolate large numbers of respondents with high death rates. To ensure that the
computer was not simply exploiting chance variation by identifying random correlations in
the data, a large test set of respondents was selected out of the dataset before the models were
built, and the computer used the remaining respondents (the learning set) to construct models
of different sizes. These models were then applied to the test set respondents, and the model
size that most accurately predicted mortality was selected. The accuracy of the models was
estimated with the test set data, and these error estimates were validated on an independent
sample from an entirely different geographic region, as mentioned (see Chapter 5).
In this way, an entirely systematic and atheoretic method of model construction is
defined. However, powerful statistical models are useless if they are not applicable to
contextually complex, real world situations. Thus, the author does not attempt to develop
this algorithm in a completely atheoretic, mechanistic context. Instead, the construction of
the statistical method is integrated with the more palpable task of predicting death in actual
elderly persons. Consequently, a considerable portion of the dissertation is concerned with
many different aspects of predicting and explaining mortality beyond those involving the
mere construction of statistical models. In short, the methods here were developed with the
ultimate intention of contributing to the existing knowledge of mortality; the statistical tools
were created to serve this purpose, and although they may be generalized to other problems,
the development of a general algorithm was not intended to be the primary contribution of
the dissertation. Since the methods matter to the results, however, a great deal of attention
is given to the details of the model construction in addition to pure mortality research.
The remainder of this chapter provides a roadmap for the dissertation in broad
strokes. First, the motivation for prediction is discussed in the context of predicting death
and illness in individuals, and the existing "state of the art" techniques of building models
for these purposes are briefly described. Next, the chapter provides background on the
survey data used for this project and presents the basic idea of combining the survey
questions in binary form with Boolean logic to classify respondents as high or low risk. A
general search algorithm for finding these questions is then described. The problem
presented by random variation in the data is defined in nontechnical language, and questions
are raised about the use of internal test set techniques to estimate the accuracy of the models.
Finally, a brief summary of the central results of the dissertation is provided, followed by a
discussion of substantive implications.
1.2 The value of predicting mortality
A vast literature exists concerning the statistical prediction of mortality and
morbidity, and this research serves a variety of purposes.2
The value of accurate prognoses
for clinicians responsible for medical decision making is one central interest. Since most
chronic diseases progress in stages, slowly deteriorating the body over time, there is a great
advantage in detecting the onset of these conditions sooner and treating the patient before
disease has fully taken its toll. Additionally, if physicians can reliably predict postoperative
mortality, the less suitable candidates for surgery could be more effectively identified before
the procedure is performed. Secondly, a large part of the research is concerned with the
assessment of risk-adjusted outcomes for hospital quality evaluation, since accurately
determining the risk-level of patients at admission is essential for a fair comparison of post-
admission mortality rates between hospitals. Similarly, the identification of an accurate
2 For some of the latest research in this field, see Anderson et al. (1994); Becker et al. (1995);
Bernstein et al. (1996); Cain et al. (1994); Davis et al. (1995); Eysenck (1993); Grubb et al. (1995);
Iezzoni et al. (1996); Josephson et al. (1995); Marshall et al. (1994); Normand et al. (1996); Ortiz et al.
(1995); Piccirillo and Feinstein (1996); Poses et al. (1996); Pritchard and Woolsey (1995); Quintana et al.
(1995); Reuben et al. (1992); Rowan et al. (1994); Schucter et al. (1996); Smith and Waitzman (1994);
Talcott et al. (1992); Turner et al. (1995); and Wong et al. (1995).
measure of risk using a small group of commonly-measured variables yields a natural proxy
for the observation of death. This provides a valuable research tool, as many studies in
which similar questionnaires are used do not observe mortality and morbidity. Lastly, some
researchers are interested in the causal mechanisms behind death and illness. Since causal
connections are difficult to establish with observational data, their quantitative analyses are
necessarily geared toward the statistically feasible task of prediction, with some hope of
gaining some substantive insight from the fitted models.
However, the goal of most existing research focuses on a limited universe of
respondents. In particular, the vast majority of attempts at genuine prediction (i.e., prognosis
as opposed to causal analysis) are directed at the minority of persons who are already quite
ill. They have usually been admitted to a hospital for their illness, and may be quite close
to death at admission. As such, the models use physiological data that require laboratory
work, such as blood analyses. The dominant statistical approach used in these efforts is
logistic regression, with some other equally cumbersome methods (e.g., discriminant
analysis, kernel density estimation) used less frequently. These methods use complicated
(usually inappropriate) assumptions about the data. In the case of regression, models tend
to be overfitted when many variables are introduced, and the usual R² statistic may
overestimate the accuracy of a model.3
The approach developed in this dissertation is distinct from these efforts in two ways.
First, the model is of a nonparametric form with a particularly simple representation (as
demonstrated by the question sets above). This idea of partitioning a dataset into high and
3 See Chapters 2 and 4 for more detailed discussions of these methods and explanations for why
the author finds them cumbersome.
low risk groups nonparametrically with "yes/no" questions (called binary splits on predictor
variables) is not original; indeed, the inspiration came from the idea of classification trees
as developed by Breiman et al. (1984). However, instead of arranging splits hierarchically
in a tree structure, the variation developed here uses Boolean logic to arrange splits into
subsets of questions, each defining a particular region of the sample space. When combined
with an algorithm for selecting these question subsets, this form of model appears to provide
advantages over the tree approach in terms of predictive power and simplicity of
representation.
Secondly, the dissertation aims to build powerful models designed to predict death
in the general population of elderly persons in the U.S., not patients in a hospital who are
already suffering from a known, diagnosed condition. Although there exist many fitted
models of mortality constructed from probability samples of U.S. elderly, these usually are
designed not for the purposes of prediction per se, but with the intention of conducting causal
analyses.4
The aim of this dissertation research is to build models constructed purely for
prediction, with the goal of attaining increased predictive power. Some readers not familiar with
modern statistical techniques may fail to see the distinction between these goals. The biggest
differences pertain to how the model is specified and how its accuracy is measured. By
defining prediction strictly, one is provided with a much more natural criterion for estimating
the predictive accuracy of models honestly (e.g., test set or cross-validation estimates of
error). This approach also aids the task of model specification by giving the researcher a
means by which the optimal model size may be gauged.
4 E.g., see the discussion on Smith and Waitzman (1994) in Section 2.7.
As a byproduct of models whose development is based on pure prediction, many
results obtained in this dissertation may be of academic interest for mortality researchers.
For example, because the predictor for mortality consists of a handful of questions which
are common to many surveys, the model yields a natural proxy for short-term mortality
which could be used in those surveys for which mortality is not directly observed as an
outcome. That is, a predictor which can classify a large number of high-risk respondents
accurately can be constructed through the use of a set of questions which number 10 or
fewer, and which require no information about most substantive variables of interest (e.g.
race, income, educational attainment). Thus, a researcher wishing to use such a proxy would
be left with most independent variables available for analysis. Additionally, the variables
required (principally age, sex, and activities of daily living, or ADL's) are widely available
on many survey instruments, most of which do not observe mortality directly.
Finally, although the method is statistically geared toward prediction, the exact
questions and answers by which high-risk persons are identified raise some interesting
substantive and theoretical issues. Beyond simply identifying possible risk factors, greater
structural features of the mortality processes in the elderly population are highlighted by the
analysis. Since the method for finding questions does not control for the vast majority of
covariates in any way, it might seem that the causal implications of such a model would be
much less interpretable than the usual regression model. However, the question set method
has a different type of advantage: it not only finds subsets of the population who are at very
high risk (or conversely, low risk), it identifies these persons in efficient and easy-to-
understand terms. Thus, the idea is not necessarily to find a direct causal connection between
mortality and the actual variables used in the question models, but to use the questions
simply to delimit the various (highly heterogeneous) high-risk subsets of the population; then
intuitive thinking about the causal processes can be usefully applied by comparing the high
and low-risk populations on more substantive grounds.
The results have some very interesting implications for this type of thinking. In
particular, many researchers seem to believe that by first controlling for age and sex, which
of course does divide the population into relatively low and high risk subsets, one gains not
only a great deal in substantive understanding, but in predictive power as well. However, the
questions below depart from this assumption radically; it seems possible to attain higher
levels of predictive power by considering age and sex secondarily. The result is that the
subsets of the population identified as high risk by a particular model usually include persons
of all ages and both sexes, and their mortality rates are much higher than predicted by their
age and sex distribution. Additionally, the models point to some possible deficiencies in
cause-specific death certificate data and the detection of illnesses (particularly cancer) that
may mislead research into the causal processes leading to death.
1.3 Data and concept
The EPESE project (Established Populations for Epidemiologic Studies of the
Elderly) gathered extensive survey data on 10,294 noninstitutionalized elderly persons (aged
65 and over) from New Haven, East Boston and Iowa County.5 In a 1983 baseline survey,
these respondents answered questions concerning demographic status, weight, height, blood
pressure, past medical diagnoses of maladies such as cancer, heart attacks, diabetes, strokes
5 See Cornoni-Huntley (1993). A more thorough description of the data is provided in Chapter 3.
and bone fractures, activities of daily living and other measures of physical and mental
functional status, depression, living arrangements, past histories of smoking, drinking and
obesity, and many other variables. The respondents were then followed over a period of
three years and the survivors annually provided follow-up data on the subsequent diagnoses
of chronic diseases (among other variables). Those respondents who died (1,450 within three
years of baseline) were matched with their death certificates. In this dissertation, the main
variable to be predicted from the baseline survey is whether the respondent is among these
decedents; occurrence of heart attacks, strokes, and the onset of cancer after baseline are the
other outcomes of interest.
For this project, the dataset was divided into two groups: a test set and a learning set.
A simple random sample was used to select out one third of the respondents (3,432) in the
survey, and these persons were designated as the test set respondents. At the earliest stages
of the research, the test set was set aside for future use, and a computer-intensive, randomly-
driven search algorithm was implemented on the remaining two thirds of the respondents
(6,862). The goal of this search was to find four survey questions such that the respondents
who supplied particular answers to these questions subsequently experienced extremely high
rates of mortality.6 The potential pool of questions was vast; 164 predictor variables were
culled from the survey.7 Variables could take on many different values, allowing for
6 At this point, the number of questions to ask was chosen solely from intuition. Later, a
backward deletion was used to decide the best number of questions.
7 There were more than 164 variables available in the dataset, but some were not consistently
phrased in all three geographic areas (e.g., most of the questions on emotional status), and some questions
were sample-specific items which did not have any particular relevance to the general population of elderly
persons; these items were ignored. Other than missing values, which were assigned a numeric value of 999,
there was no recoding of the data.
thousands of distinct questions. On the first application of this algorithm, without any
examination of the test set, the following set of four questions was discovered:
(l) Are you able to walk half a mile without any help? That's about eight ordinary
blocks.
Yes
No
Missing 8
(2) Are you able to do heavy work (such as shoveling snow, washing windows, walls)
without any help?
Yes
No
Missing
(3) What day of the week is it? (Respondent's answer is:)
Correct
Incorrect
Refused
Missing
(4) What is your household composition?
Alone
With spouse only
With spouse and other person(s)
Other arrangements
Missing
Once these questions were selected, the answers of the 3,432 test set respondents were
examined. In the test set, 82 respondents gave answers to the questions that were all in bold
(above). Of these respondents, about 50% died within three years. These individuals were
not all obviously ill: 60% said they had not spent one night in the hospital in the year before
8 The category "missing" includes all responses which do not fit the other categories, as well as
data which is literally missing since some questions were not asked of a handful of respondents (e.g., the
Iowa proxy and telephone interviews). See Section 3.7 for a discussion of missing data.
baseline. Some 85% said they had never been diagnosed with a heart attack and 81 % had
never been diagnosed with cancer. They were not all old males, either. In fact, the mortality
rate of the chosen test set respondents was more than twice as high as the expected death rate
based on the age and sex composition of the chosen respondents.9
The deaths in these chosen respondents only made up 8% of all deaths in the test set,
but this was not the only set of questions that could be found. Upon repeating the search, it
was found that other sets of questions chose other sets of respondents. When these questions
were applied to the test set, death rates were consistently higher than 50% and sometimes as
high as 80%. Thus, it seemed possible to combine sets of questions using simple Boolean
logic to identify larger groups of respondents. Two sets of questions, each of the form "Are
the answers to questions A, B and C all in bold?" can be combined to have the form "Are
the answers to questions A, B, and C all in bold, OR are the answers to questions X, Y and
Z all in bold?" Of course, it is only possible to gain predictive power if both sets of
questions do not identify the same set of respondents. Typically, however, these groups
were mostly disjoint, and the question subsets could be deliberately constructed together so
that they were mostly disjoint.
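
To make the form of such a combined model concrete, a minimal sketch in C follows. It only illustrates the Boolean structure (AND within a subset of questions, OR across subsets); the struct layout, the variable encoding, and the toy model in main() are hypothetical, and the code actually used for this research appears in Appendix V.

    #include <stdio.h>

    /* One binary question ("split"): is x[var] <= cut (dir = 0), or is
       x[var] > cut (dir = 1)?  This encoding is hypothetical. */
    struct question { int var; double cut; int dir; };

    /* One subset of questions: ALL of them must be answered "in bold". */
    struct subset { int n; struct question q[16]; };

    /* A respondent is chosen (classified as high risk) if ANY subset is
       satisfied, i.e., the subsets are combined with OR. */
    int classify(const double *x, const struct subset *s, int nsub)
    {
        int i, j;
        for (i = 0; i < nsub; i++) {
            int all = 1;
            for (j = 0; j < s[i].n; j++) {
                const struct question *q = &s[i].q[j];
                int yes = q->dir ? (x[q->var] > q->cut) : (x[q->var] <= q->cut);
                if (!yes) { all = 0; break; }
            }
            if (all) return 1;   /* chosen: predicted to die */
        }
        return 0;                /* not chosen: predicted to survive */
    }

    int main(void)
    {
        /* Hypothetical model: (x0 <= 0 AND x1 <= 0) OR (x2 > 1). */
        struct subset model[2] = {
            { 2, { { 0, 0.0, 0 }, { 1, 0.0, 0 } } },
            { 1, { { 2, 1.0, 1 } } }
        };
        double respondent[3] = { 1.0, 0.0, 2.0 };
        printf("predicted: %s\n",
               classify(respondent, model, 2) ? "death" : "survival");
        return 0;
    }

As noted above, the OR combination adds predictive power only to the extent that the subsets choose mostly disjoint groups of respondents.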
Since the number of questions was not systematically chosen at this stage of the
research, it was thought that the results of the search algorithm could probably be improved
by determining the best number of questions to ask. Larger sets of questions could provide
greater predictive power, for example, but they could also fail due to chance error in the
9 This expected age- and sex-standardized death rate is the death rate one would expect if a
random group of respondents with the same age and sex distribution was exposed to the age- and sex-
specific death rates calculated from the entire sample.
predictions. Also, it seemed that greater predictive power could be achieved by combining
sets of questions using the OR operator, as described. Again, at what point do the
combinations of question sets become so large that they fail to increase (or even decrease)
true predictive accuracy?10 The answer was learned by examining the test set with a
sequence of variously sized question sets.
Before examining more powerful sets of questions, however, it is instructive to
examine this search algorithm in somewhat more detail. It is also necessary to define what
is meant by "true" predictive power or accuracy with a probability model and error criterion,
and to describe how this accuracy might be estimated honestly. Thus, it is also necessary to
address another issue concerning the "honesty" of the error estimates. Multiple examinations
of the test set can introduce some downward bias in test set estimates of error, resulting in
an optimistic impression of accuracy. Also, the author had used the full dataset for other
analyses before this endeavor, and was consequently not truly blind to the test set
respondents. However, the search was constructed in an attempt to mitigate such bias.11
Ultimately, the predictive power of these questions (and the overall method) was validated
on a completely independent sample (the Duke EPESE sample), and additional efforts to do
so with other datasets (the NHANES I and NHEFS surveys) are presently underway. Thus
far, the results have shown that using the test set for both refining and assessing the model
does not impart substantial bias to the estimate of accuracy, and that the resulting models are
not overfitted.
10 The general problem of choosing a model to maximize predictive power usually resolves to a
bias-variance tradeoff. This is a well-studied subject in statistics; see Breiman et al. (1984) for an example.
11 See Chapter 2 for a heuristic justification of the estimation.
1.4 A general search algorithm
Suppose we have some number of predictor variables, or survey questions (164 here),
and a binary outcome variable, such as death or survival within three years of baseline.
Again, the goal is to search for a set of questions and answers such that the death rate of the
respondents who answer a certain way is extremely high.12
However, this task is more easily
stated than completed. To simplify the job, the questions can be constrained to take the form
of binary splits, i.e., "Is variable X < 2.3?" or "Is X > 1.5?", such that every question
has a "yes/no" answer (or bold and nonbold, as in the examples above). Some readers may
recognize that this is essentially a variation on the CART method of building classification
trees, which also finds sets of binary questions in classifying a categorical variable (see
Breiman et al. (1984)). Several important differences exist, however. The most significant
difference was in the construction of the models, as the two model forms themselves (the
Boolean structure and the tree structure) performed quite similar functions statistically.
Detailed comparisons of these methods are presented in Chapters 2 and 4, and Chapter 5
compares the results.
From the initial 164 variables chosen out of the survey, 3,248 binary questions could
be formed.13 Thus, if one is to find a particular combination of four questions, for example,
12 The term "death rate" is being used somewhat loosely here to mean the proportion of persons
who died within three years (typically called a "raw" death rate by demographers). Formally defined
however, a death rate for ages x to x+3, or 3Mx, measures deaths divided by exposure, while the quantity
measured here is actually a cohort estimate of 3qx, the probability of dying between ages x and x+3. This
distinction will be ignored in this chapter for the convenience of non-demographers.
13 Note that if a variable has J possible responses, then there are 2(J - 1) possible "≤/>" questions
that can be formed with this variable. For the example, the variable sex has only two responses (1 or 2),
and so only two questions from this variable can be formed: "Is sex ≤ 1?" or "Is sex > 1?". A variable with
three responses, however, yields four possible questions: "Is X ≤ 1?", "Is X ≤ 2?", "Is X > 1?" and "Is X >
2?". Of course, a variable with more than two responses may also be split with a more complicated
there exist more than four trillion possible combinations from which to choose (3,248 choose
four). If one wishes to find ten such "yes/no" questions, then more than 10²⁸ combinations
are possible (3,248 choose ten). Moreover, many possible combinations of questions and
answers did not pick out any respondents at all, and few chose sets of respondents with
extremely high death rates. (Indeed, the number of question sets with low error was quite
small for models of only several questions, as is shown by exhaustive searching in Chapter
4.)
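
The combinatorial counts quoted above can be verified directly; the brief routine below (an illustrative aside, not part of the programs in the appendices) evaluates the two binomial coefficients in double precision.

    #include <stdio.h>

    /* Binomial coefficient C(n, k), computed as a running product in
       double precision to avoid overflowing intermediate factorials. */
    static double choose(int n, int k)
    {
        double c = 1.0;
        int i;
        for (i = 1; i <= k; i++)
            c *= (double)(n - k + i) / (double)i;
        return c;
    }

    int main(void)
    {
        printf("C(3248, 4)  = %.2e\n", choose(3248, 4));    /* roughly 4.6e+12 */
        printf("C(3248, 10) = %.2e\n", choose(3248, 10));   /* roughly 3.5e+28 */
        return 0;
    }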
Suppose again that the number of questions is fixed at four. A measure of
"efficiency" for the prediction scheme was defined. This criterion was based on the death
rate of the chosen respondents and how many of them were chosen. Questions that picked
more respondents, and with a higher death rate were designated as more "efficient" than
questions that picked fewer respondents or respondents with a lower death rate, as defined
formally below. Then, to search across the large pool of combinations of four questions the
following algorithm was used: 14
(1) The computer started by randomly generating a set of questions from the
baseline data in the form of binary splits. A lower limit on the number of
respondents (dead or alive) chosen by these questions was enforced, say 25.
The death rate of these respondents was calculated, and the efficiency was
computed.
(2) Next, one question was picked at random and dropped. In its place, a new
random question was introduced. This question was found by choosing a
variable at random, and then choosing a split point at random from all
question, such as "Is X E {I ,3}". This same split can be achieved with".,,/>" style questions using the OR
operator, as done below, but this also increases the number of potential splits vastly.
14 The idea for this type of random search was inspired by evolutionary or genetic algorithms (see
Goldberg (1989); Davis (1987); Holland (1975)); however, it is a much simpler form of search, not a true
genetic algorithm (which did not seem to be necessary for the model sizes used here). The C code is
provided in Appendix V.
16
possible variable values. The death rate of the respondents chosen by this
new set of questions was calculated, and the efficiency was computed.
Again, it was ensured that the number of chosen respondents was above the
floor of 25.
(3) If the new set of questions was more efficient than the old questions, the
computer kept the new set of questions. If the new set was less efficient, then
the computer kept the old set of questions. The computer then returned to
step (2) and repeated this process until no further improvement could be
achieved by replacing any single question.
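
A self-contained sketch of this single-subset search loop is given below. It is only an illustration of steps (1) through (3): the data are fabricated, the efficiency criterion is simplified here to the death rate among the chosen respondents (the criterion actually used also rewards the number of respondents chosen, as defined formally below), and the exhaustive verification of an absorption point performed by the RSA code in Appendix V is omitted.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NRESP 200   /* synthetic learning-set size (illustrative only)     */
    #define NVARS 5     /* number of predictor variables                       */
    #define NQ    4     /* number of questions in the set                      */
    #define FLOOR 10    /* lower limit on respondents chosen (25 in the text;  */
                        /* smaller here for the tiny synthetic sample)         */

    static double x[NRESP][NVARS];   /* fabricated data in place of the EPESE  */
    static int    died[NRESP];       /* learning set                           */

    struct question { int var; double cut; int dir; };  /* dir: 0 "<=", 1 ">"  */

    static int chosen(const double *xi, const struct question *q, int n)
    {
        int j;
        for (j = 0; j < n; j++) {
            int yes = q[j].dir ? (xi[q[j].var] > q[j].cut)
                               : (xi[q[j].var] <= q[j].cut);
            if (!yes) return 0;
        }
        return 1;
    }

    /* Simplified efficiency: death rate among the respondents chosen. */
    static double efficiency(const struct question *q, int n, int *nchosen)
    {
        int i, nc = 0, nd = 0;
        for (i = 0; i < NRESP; i++)
            if (chosen(x[i], q, n)) { nc++; nd += died[i]; }
        *nchosen = nc;
        return nc ? (double)nd / nc : 0.0;
    }

    static struct question random_question(void)
    {
        struct question q;
        q.var = rand() % NVARS;
        q.cut = (double)(rand() % 4);   /* split points 0..3 */
        q.dir = rand() % 2;
        return q;
    }

    int main(void)
    {
        struct question q[NQ], old;
        double best;
        int i, it, nc, k;

        srand((unsigned)time(NULL));

        /* Fabricated data: five 0-3 valued variables; deaths are more likely
           when variable 0 is large (purely for demonstration). */
        for (i = 0; i < NRESP; i++) {
            int v;
            for (v = 0; v < NVARS; v++) x[i][v] = (double)(rand() % 4);
            died[i] = (rand() % 8) < (int)x[i][0];
        }

        /* Step (1): random starting set obeying the floor. */
        do {
            for (i = 0; i < NQ; i++) q[i] = random_question();
            best = efficiency(q, NQ, &nc);
        } while (nc < FLOOR);

        /* Steps (2) and (3): replace one question at random; keep the new set
           only if it respects the floor and improves the efficiency. */
        for (it = 0; it < 20000; it++) {
            double e;
            k = rand() % NQ;
            old = q[k];
            q[k] = random_question();
            e = efficiency(q, NQ, &nc);
            if (nc < FLOOR || e <= best) q[k] = old;   /* revert */
            else best = e;
        }
        printf("best death rate among chosen respondents: %.2f\n", best);
        return 0;
    }

Repeating the entire search from many independent random starting points and retaining the most efficient result corresponds to the RRSA(N) described below.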
It was also possible, as mentioned, to combine sets of questions using the OR operator.
(Again, such a set of questions works with Boolean logic: a respondent is chosen if all the
answers to any one of the four sets are all in bold, e.g., "Are the answers to A, B, C AND D
bold?" OR "Are E, F, G AND H bold?", and so on.) Here, the computer would start by
randomly generating a "full" model. For example, a full model would typically consist of
four subsets of four questions each, for 16 total questions. Then the above process of random
replacement simply cycled from subset to subset, replacing a question each time to check for
improvement in the efficiency of the set of all 16 questions. This search algorithm was
always run until no further improvement in the model could be found by replacing any single
question. To be sure this was true, the computer periodically checked the model by
exhaustively replacing each question in the set with each of all the 3,248 possible questions
one at a time. If the computer found that no single change in the model could improve its
efficiency, the question set was labeled an absorption point; otherwise, the computer
continued its search undisturbed until it reached an absorption point (which was always
observed to happen eventually). The above algorithm for finding such an absorption point
was called the random search algorithm, or RSA.
Note that because the starting point for RSA is random, if one executes the program
more than once, a different initial starting point will be used, and different questions will be
randomly generated as replacements.15 As a result, the algorithm would usually end at a
different absorption point with a different (perhaps higher) level of efficiency. Thus, this one
step RSA method is greedy in the sense that if it is executed only once, it can latch onto a
local maximum, finding no further improvement even if better absorption points exist; in
fact, if the number of questions in the set was at all large, this was typically the case. Chapter
4 shows this explicitly by searching exhaustively over models with only two questions.
As a result it was necessary to run the RSA many times over using different seeds for
the random number generator, so that many different absorption points could be observed.
From this large pool of question sets, the model with the lowest error was selected out. This
algorithm was labeled the repeated random search algorithm (denoted RRSA(N), where N
is the number of independent runs of the RSA). The RRSA(N) was always run for N ≥ 100,
and in some cases for N > 1,000. For Set B in Appendix I, RRSA(2,000) was used. (This
was done merely to evaluate the performance of the algorithm. The result of RRSA(100) has
a greater than 99% chance of achieving the same result as RRSA(2,000) for that model
structure). A single RSA search would usually take more than 10,000 iterations for a
moderately sized model (e.g., 10 questions). Thus, at least one million total passes through
the dataset were always made before the set of questions with the most efficient set of
15 Of course, "random" actually implies that the computer reads an internal clock for a seed to
generate pseudo-random numbers (not to be confused with the seed model defined in Chapter 4). This is to
say, the computer must be instructed to use a new seed in each run if "independent" output is desired from
multiple runs of the program. Alternatively, "deterministic" results could be achieved by using the same
seed repeatedly.
questions was selected.16 One million may be a large number of passes, but compared with
the number of potential question sets, it is minute.
Thus no guarantee exists that this search procedure will find the model with the lowest
possible error for any given model structure. To the contrary, if the starting question set is
made large enough (20 to 30 questions), the RRSA(100) method will usually not find the
global maximum. For smaller model sizes (less than 10 questions), finding a maximum with
RRSA(N) seems possible for an N of about 100 or more. Unfortunately, it is not clear how
to prove any given set is a maximum. Only for very small models (two questions) could it
be proved that RRSA(N) does find the global maximum for a reasonably sized N, since
exhaustive searches to find the true maximum were then possible. Applications of the
algorithm to simulated datasets also proved that the algorithm could find a global maximum
in some cases (as described in Chapter 4).
However, while the global maximum is of interest, it is the primary goal of this
research simply to find the most efficient absorption points it can, since the chosen
respondents still exhibit very high death rates. The reasons for not focusing strictly on a
search for a global maximum are that it may well be nearly impossible to find a maximum
and to prove that the given absorption point is such a maximum. This is due mainly to the
extremely large number of potential sets of questions for models with more than about ten
variables. Moreover, it is certainly not clear how to go about doing so if it were possible.
Even the basic RSA method described above (requiring about 80 lines of simple C code) uses
considerable computing time when applied to a large dataset; a more complicated algorithm
16 This would typically take more than an hour of CPU time on a Cray C-90 supercomputer,
depending on the particulars of the search.
would have to provide much greater searching efficiency to justify the lengthier code
(although some quite beneficial modifications were eventually discovered, as described in
Chapter 4). The search procedure also has the advantage of simplicity, as it is intuitive,
easy to describe, and easy to carry out. Most important, the method seems to work, and the
results below suggest that it provides all the predictive power of more conventional
predictive models, if not more (see Chapter 5).
1.5 The test set method
At this point, many readers will recognize that the above algorithm may have the
tendency to capitalize on "chance" variation. Chance error is precisely why the questions
found using the learning set data are checked using the test set respondents. To see vividly
the sort of error that requires that a test set be used to judge the accuracy of the questions,
consider the scatterplot shown in Figure 1.1 (constructed with a simple random sample of
several thousand EPESE respondents). In this graph, every respondent who died within three
years of baseline is represented with an asterisk (*), and every survivor is shown by a dash
(-). Respondents are plotted against the x-axis according to systolic blood pressure, and
against the y-axis according to body mass, or weight in pounds divided by height in inches.
Notice that the graph is separated into two regions. The smaller region is delineated by two
yes/no questions: "Is systolic blood pressure ≥ 190?" and "Is body mass < 2.6?"; in this
region of the graph, only respondents who answer "yes" to these questions are plotted. The
death rate in this region is 25%, while the death rate for respondents outside the region is
13%. In this way, a series of yes/no questions can divide the sample space into high and low
risk regions.
Figure 1.1 - Body mass index (weight/height) by systolic blood pressure
Percentages give % who died in three years (* is dead, - is alive)
[Scatterplot: body mass index (lbs/inch) on the vertical axis against systolic blood pressure (mm of Hg, approximately 100 to 250) on the horizontal axis; decedents are plotted as * and survivors as -. The region with systolic blood pressure ≥ 190 and body mass index < 2.6 is labeled "25% died"; the remainder of the plot is labeled "13% died".]
Source: EPESE New Haven, E. Boston and Iowa surveys (partial sample)
Notice, however, that if no constraints are placed on the search for questions, picking
out regions that have very high rates of death due largely to chance variation is easy. For
example, the pair of questions, "Is systolic blood pressure < 80?" and "Is body mass > 2?",
would isolate a single dead respondent on the far left side of the graph, for a death rate of
100%. Suppose the "true" death rates are defined as those experienced by the general U.S.
population of elderly. It is unlikely that the true death rate for all persons in this region of
the plot is this high, and in fact it is not clear from this scatterplot whether it is at all high.
The problem is that the number of respondents in that area of the variable space is so small
that it becomes very easy to find regions with decedents unfairly, or "after the fact". (This
can be directly confirmed by examining the rest of the EPESE respondents, as Figure 1.1
uses only a small sample.) This is the problem described above as the "capitalization on
chance error". Strictly speaking, the element of chance error is typically thought of as
resulting from sampling variation when persons are chosen through a small probability
sample (as with the probability model formally defined in Chapter 2). Unfortunately, only
half the EPESE respondents (the New Haven and North Carolina respondents) were chosen
via probability samples, so this notion of chance is not so clearly defined. However, if an
explicit stochastic structure is sought, one might also think of mortality itself as a chance
process. This is frequently assumed after conditioning on a given set of variables (as in the
construction of age and sex-specific life tables, for example).
It is also possible for one to choose such a large number of questions that even if the
isolated set of respondents is larger than one respondent, many questions are still chosen
because of chance variation. To see this with Figure 1.1, imagine many variously sized
rectangles, all shaped in a way that includes many of the oddly-positioned deaths on the fringes
of the scatterplot, but without containing many survivors. The more such questions are
allowed, the easier it becomes to find such groups of rectangles that isolate substantial
numbers of deaths, yet the actual predictive power of such partitions is clearly suspect. One
can eliminate much of this problem by placing a hard lower limit on the number of
respondents that anyone rectangle can isolate (say, 50 respondents, as is done here), yet it
is still possible for certain questions to be found purely because of chance variation. This
issue is examined in more detail and with more rigor in Chapter 2. In short, if too many
questions are found which choose too few respondents, the true accuracy of the preferred
questions suffers. Thus, it is necessary to limit the number of questions asked in some way.
However, the size of the question sets should not be constrained to be so small as to sacrifice
predictive accuracy by restricting the search too severely. The central problem is that of the
bias-variance tradeoff. To discuss this issue more meaningfully, it is necessary to define
exactly what is meant by predictive accuracy, or "efficiency".
To do so, notice that a set of questions that identifies a subset of test set respondents
who have death rates of 75% or higher is still not very "efficient" in some sense if it
picks out only a few deaths.17
Thus, it is useful to discuss a criterion that somehow
measures the efficiency of a particular set of questions. Misclassification error is defined
as the number of survivors in the subset of respondents chosen by the questions plus the
number of deaths in the subset of respondents not chosen by the questions, divided by the
total number of respondents.18

17 These characteristics are often referred to as "sensitivity" and "specificity", as explained below
(also see Ortiz et al. (1995); Thompson and Zucchini (1989)).

For example, in relation to Figure 1.1 and the two
corresponding questions, the misclassification error would be the number of -'s in the lower
right-hand region plus the number of *'s in the upper left-hand region, divided by the total
number of points plotted.
Table 1.1 - Predicted outcome by true outcome for Question Set A
of Appendix I as applied to a test set (N = 3,432)

                                      TRUE OUTCOME
                                  Survival       Death
    PREDICTED      Survival          2,035         172
    OUTCOME        Death               888         337
                   TOTAL             2,923         509
The term "misclassification error" is used because the chosen respondents are
essentially being classified or predicted as dead by the questions, while the ignored
respondents are classified as survivors. The error is in the deaths ignored by the questions,
and the survivors who were chosen (i.e., those respondents who were misclassified). This
can be seen directly in a simple 2x2 cross-tabulation of the predicted outcomes against the
true outcomes, as is shown in Table 1.1 for a typical set of questions (Question Set A, listed
in Appendix I). The misclassification error here is simply the number of misclassified
18 This misclassification error is exactly the same criterion by which CART judges its
classification trees in the context of the two class problem with unit misclassification costs. See Breiman et
al. (1984). It is also sometimes convenient to implement a set of misclassification costs, such that the
search may be more sensitive to picking up deaths at the cost of a lower death rate in the chosen
respondents; this is explored below. By convention, the criterion by which the overall accuracy of any
predictive classifier can be judged, which holds constant across varying levels of sensitivity and specificity,
is the area under the receiver operating characteristic curve (see Swets and Pickett (1982); Thompson and
Zucchini (1989)). An estimate of this area is presented below, and the fitting of the ROC curve is
explained in detail in Chapter 2.
respondents (172 + 888) divided by the test set N, or 1,060/3,432 = 31%. Note that those
sets of questions that isolate many deaths are rewarded more than questions that ignore many
deaths (holding the death rates of the chosen and unchosen persons constant). Thus, the more
"efficient" sets of questions result in a lower misclassification error. Note also that if the
number of respondents to be chosen by the questions is fixed, the goal of reducing
misclassification error is always equivalent to the goal of maximizing the death rate of the
chosen respondents.
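For concreteness, this calculation can be sketched in a few lines of Python; the counts are those of Table 1.1, and the snippet is purely illustrative rather than part of the original analysis.

    # Misclassification error of Question Set A, from the cells of Table 1.1.
    missed_deaths = 172      # deaths predicted as survivors
    false_positives = 888    # survivors predicted as dead
    n_test = 3432            # total respondents in the test set

    misclassification_error = (missed_deaths + false_positives) / n_test
    print(round(misclassification_error, 2))   # prints 0.31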
Now the statistical goal of the dissertation can be stated explicitly in terms of the
misclassification criterion. The purpose of the method developed in this research is to find
a set of questions, combined with the AND and OR operators, which yield a reasonably low
test set misclassification error for the prediction of mortality or morbidity. Thus it is simply
a search for the questions that, when applied to the test set, correctly choose the most persons
who die while simultaneously choosing the fewest survivors, and likewise for the occurrence
of heart attacks, strokes, and cancer. This is the ultimate objective of all searches, and the
resulting questions are always judged by how well this error is minimized (but not according
to whether the absolute, lowest possible minimum error truly was achieved).
Exactly how many is "most", or "fewest"? To answer this question, a slight
modification to the above definition of misclassification error is also useful: a cost-adjusted
misclassification error can be computed. To see why this is useful, consider Table 1.1 again;
notice that a lower misclassification error may be achieved simply by classifying all
respondents as survivors, since then the only misclassified respondents are the deceased,
resulting in an error of 509/3,432 = 15%. This is half the size of the error for Set A. By this
criterion, we should choose to ask no questions, and ignore the deceased completely! In fact,
it is easily seen that the only sets of questions that result in a lower error than the strategy of
ignoring the deaths are those sets that isolate respondents having a death rate higher than
50%.19 These sets can be found, but the questions only account for a few deaths, and we are
interested in identifying a much larger proportion of the deaths even if the chosen
respondents do not have death rates quite so high. In other words, we may be willing to
allow more than one misclassified survivor for each death we predict successfully. Thus, we
can define the cost of misclassifying a death as survival to be greater than the cost of
misclassifying a survivor as dead. For example, suppose the relative cost of misclassifying
a death is defined to be five times that of misclassifying a survivor; then the numerator for
the cost-adjusted misclassification error associated with Table 1.1 is equal to 5 × 172 + 888
= 1,748. In comparison, the numerator for the error associated with the ignorant strategy of
classifying all respondents as survivors is now 5 × 509 = 2,545, so asking the questions is an
improvement by this criterion since doing so yields a 31% reduction in the cost-adjusted
error rate (equal to (2,545 - 1,748)/2,545).
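The same arithmetic, with the relative cost applied to each missed death, can be sketched as follows (again purely illustrative, using the Table 1.1 counts):

    # Cost-adjusted misclassification error: misclassifying a death costs
    # five times as much as misclassifying a survivor.
    relative_cost = 5.0
    missed_deaths = 172       # deaths classified as survivors by Set A
    false_positives = 888     # survivors classified as dead by Set A
    total_deaths = 509        # deaths in the test set

    set_a_numerator = relative_cost * missed_deaths + false_positives   # 1,748
    ignorant_numerator = relative_cost * total_deaths                   # 2,545
    reduction = (ignorant_numerator - set_a_numerator) / ignorant_numerator
    print(set_a_numerator, ignorant_numerator, round(reduction, 2))     # 1748.0 2545.0 0.31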
As a result of increasing this relative cost, the questions designed to minimize the
adjusted error choose subsets of respondents who experience death rates lower than 50%, but
they also catch a higher proportion of the deaths. This tradeoff, as referred to above, is a
balance between specificity and sensitivity. Specificity is simply the proportion of survivors
correctly classified as survivors. For the questions summarized in Table 1.1, this is equal to
2,035/2,923 = 70%; this quantity is usually proportional to the death rate of the chosen
19 This is a direct result of Bayes' rule, defined in Chapter 2.
respondents. Sensitivity is the proportion of deaths correctly classified as deaths, equal to
337/509 = 66% in Table 1.1, and tends to be inversely proportional to specificity. That is to
say, it is possible to identify a small fraction of the unhealthy elderly who are at extremely
high risk, or to catch a larger proportion of elderly who have a somewhat lower (but still
relatively high) risk of death or illness, but it is not yet possible to identify the vast majority
of deaths while simultaneously achieving very high accuracy. The purpose of using a cost-
adjusted misclassification error is to explore the range of possible combinations of sensitivity
and specificity in this tradeoff. This is done by building multiple, separate sets of questions,
each built in an attempt to reduce cost-adjusted misclassification error, but with a different
relative cost for each set. In this way, the researcher is provided with a wide range of models
with varying levels of sensitivity and precision.
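Using the counts of Table 1.1 once more, the sensitivity, specificity, and death rate of the chosen respondents quoted above for Set A can be reproduced directly (an illustration only, not the dissertation's code):

    # Sensitivity, specificity and death rate of the respondents chosen
    # (predicted as dead) by Question Set A, from Table 1.1.
    deaths_caught = 337           # deaths correctly predicted as dead
    survivors_kept = 2035         # survivors correctly predicted as survivors
    total_deaths = 509
    total_survivors = 2923

    sensitivity = deaths_caught / total_deaths                      # ~0.66
    specificity = survivors_kept / total_survivors                  # ~0.70
    chosen = deaths_caught + (total_survivors - survivors_kept)     # 1,225 respondents chosen
    death_rate_of_chosen = deaths_caught / chosen                   # ~0.28
    print(round(sensitivity, 2), round(specificity, 2), round(death_rate_of_chosen, 2))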
1.6 Use of the test set to determine the number of questions
Consider the case of question Set B in Appendix I. How was this question set found?
First, the search algorithm of random replacement (called the RSA above) was employed on
the learning set data to build a combination of four subsets (combined with OR) of four
questions each (combined with AND), for 16 total questions. The search was run
independently 100 times, and the lone combination of questions with the lowest observed
misclassification error was chosen as the full model (defining the RRSA(100) method). A
backward deletion process was then applied to these questions. Each of the 16 questions
(and each of the four sets as a whole) was temporarily dropped from the model, and the
misclassification error was recomputed in its absence. The question (or set of questions)
which yielded the smallest increase in error when temporarily dropped was dropped from the
model permanently, resulting in a set of 15 (or 12) questions. The deletion process was then
repeated on this submodel to obtain a model of size 14 (or 11, etc.). This model was itself
subject to deletion and so on until no questions remained. This resulted in a sequence of up
to 16 nested models, each containing a unique number of questions, and each a subset of the
next larger set of questions.20
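The deletion procedure just described can be sketched roughly as below (an illustration, not the dissertation's code). The sketch simplifies under stated assumptions: a model is represented as a flat list of questions, only single questions are dropped (the actual procedure also considers dropping an entire subset), and error() is a placeholder for the cost-adjusted learning set misclassification error defined formally in Chapter 4.

    # Rough sketch of backward deletion: repeatedly drop the question whose
    # removal increases the learning set error the least, yielding a nested
    # sequence of submodels (from the full model down to the empty model).
    def backward_deletion(full_model, learning_set, error):
        sequence = [list(full_model)]
        current = list(full_model)
        while current:
            candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
            current = min(candidates, key=lambda model: error(model, learning_set))
            sequence.append(current)
        return sequence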
Next, this sequence of models was applied to the test set, and the misclassification
error of each model was estimated on this new data. Figure 1.2 shows a plot of
misclassification error as estimated on both the learning set and the test set for each set of
questions in the overall sequence of models. The shape of the learning set error curve is
much like that which would result from a sequence of regression models of increasing size.
That is, obtaining a slightly lower error by using a larger set of questions was always
possible, just as increasing R² by adding another coefficient to the regression equation is
always possible (provided the model is not completely saturated). However, when the larger
sets of questions were applied to the test set, they had slightly less predictive power than
more moderately sized sets of questions. This is because when more than seven questions
were found with the learning set, these additional questions began to reflect the random
variation in the learning set data. This occurred in much the same way that multiple
questions could be used to partition Figure 1.1 unfairly, as suggested above. Yet when fewer
than seven questions were used, it seemed that the resulting partition was not fine enough to
capture all of the truly high-risk regions adequately.

20 This backwards deletion process is almost exactly equivalent to CART's method of "pruning"
the full-sized tree to obtain a nested sequence of subtrees, which are then applied to the test set to determine
the best-sized tree (see Breiman et al. (1984)).

[Figure 1.2 - Misclassification error by number of questions asked. The cost-adjusted
misclassification error, as estimated on the learning set and on the test set, is plotted against
the number of questions in the model; the test set minimum occurs at about seven questions.]

Thus, if the test set estimates accurately
measure the predictive power of the questions, a set of about seven questions provides the
optimal tradeoff between the lack of fit and the fitting of chance variation. This result was
for a relative misclassification cost of 3.5; a larger set of 10 questions seemed better when
the cost was fixed at five. In fact, it was found that when starting with the 4x4 model
structure, the best number of questions invariably ranged between six and ten. Moreover,
these were always arranged in 3-5 subsets of questions combined by AND which were 1-3
questions in size. Once this preferred model size was chosen, the test set and learning set
respondents were recombined into one learning set. Finally, the repeated search algorithm
RRSA(N) was repeated using the full dataset in an attempt to find the best set of seven
questions.
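In code, the choice of model size amounts to scoring each submodel in the nested sequence on the test set and keeping the one with the lowest error; the minimal sketch below reuses the hypothetical error() placeholder from the previous sketch.

    # Apply the nested sequence of submodels to the test set and keep the size
    # with the lowest cost-adjusted misclassification error. The final model is
    # then refit at this size on the pooled learning and test data.
    def choose_model_size(sequence, test_set, error):
        best = min(sequence, key=lambda model: error(model, test_set))
        return len(best), best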
This set of seven questions, then, was designated as the preferred set for the relative
cost of 3.5. Now consider again the issue of the honesty of the prediction error for these
seven questions as it was estimated by the test set. Ideally, the test set would be a truly
independent, unobserved sample of data on which to test the accuracy of
the predictions. Unfortunately, in this research the test set was not kept completely blind to
the researcher. One problem was that the test set was already examined by the author before
the above algorithm was fully developed. In addition, the author had had previous
contact with the dataset in other analyses before the dissertation was conceived. Thus, it was
possible that any knowledge of which variables or transformations of variables yielded high
predictive power could have informed the model-building process, invalidating the honesty
of the test set. Several precautions were taken to mitigate the unfair advantage this might
have granted to the estimation of predictive accuracy.
First, the input to the algorithm was essentially the raw dataset itself. Almost no
variables were preselected out of the analysis; the sole exceptions were a handful of items
that were sample-specific (e.g., the respondent ID, the sample region) and some questions
that were unusable because they were not consistently phrased in all three surveys (consisting
mostly of the items on emotional status). These were selected out well before the analysis
began, and none of the remaining variables were dropped from the analysis at any point.
Also, no respondents were dropped from the analysis.21 Nor was any variable recoded from
its original form on the supplied tape or processed in any way at any point in the analysis
(except that missing values were assigned the value "999" at the start).22 Secondly, the
algorithm always chose the full question set by a systematically random process specified
completely by the simple three-step process above (see Chapter 4 for a formal definition).
That is, the initial set of questions was generated completely at random, and the replacement
of questions was also driven entirely randomly. Specifically, all variables had the same
probability of being selected into the initial model set. Also, within each replacement (steps
(2) and (3) above), all the questions in the model had the same probability of being replaced.
All variables had the same probability of being chosen to replace the dropped question, and
all possible splits for the replacement variable had an equal probability of being chosen.
Thus, knowledge about which variables or transformations of variables yielded predictive
power could not have directly influenced which questions were selected into the final
21 The exception to this rule was in the analysis of cancer, where respondents known to have
cancer at baseline were removed from the analysis because the determination of new cancers after baseline
was ambiguous for these persons.
22 One variable was added to the dataset: a measure of body mass, equal to the ratio of weight in
pounds to height in inches. However, this variable was created well before any analysis began.
models.
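One run of the random replacement search might be sketched as follows. This is an illustration only: the helper names (splits_for, error) are hypothetical, and the rule of keeping a replacement only when the error does not rise is an assumption standing in for the formal three-step process of Chapter 4; the uniform random choices are the ones described above.

    import random

    # Illustrative sketch of the random replacement search (RSA). A question is
    # a (variable, split) pair; a model is a list of subsets (joined by OR),
    # each subset a list of questions (joined by AND).
    def random_search(variables, splits_for, error, n_subsets=4, subset_size=4,
                      n_replacements=1000):
        def random_question():
            v = random.choice(variables)               # every variable equally likely
            return (v, random.choice(splits_for(v)))   # every split equally likely

        # The initial 4x4 model is generated completely at random.
        model = [[random_question() for _ in range(subset_size)]
                 for _ in range(n_subsets)]
        best_error = error(model)
        for _ in range(n_replacements):
            i, j = random.randrange(n_subsets), random.randrange(subset_size)
            old_question = model[i][j]
            model[i][j] = random_question()            # replacement chosen at random
            new_error = error(model)
            if new_error <= best_error:                # assumed acceptance rule
                best_error = new_error
            else:
                model[i][j] = old_question             # revert the replacement
        return model, best_error

    # The repeated search RRSA(100) simply keeps the best of 100 independent runs.
    def rrsa(n_runs, *args, **kwargs):
        return min((random_search(*args, **kwargs) for _ in range(n_runs)),
                   key=lambda run: run[1])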
It is more likely that if an unfair advantage has been gained, it was attained by sheer
familiarity with the data or some more heuristic type of knowledge that could have
influenced the model building strategies themselves. For example, it was thought that such
a splitting process was feasible due to initial runs of the CART algorithm on the data. This
is one reason to consider the true validation of these models with a completely independent
sample (see Chapter 5).
The second problem is that it is necessary to examine the test set more than once to
determine the best number of questions to ask. These multiple examinations of the test set
can also result in an optimistic assessment of the accuracy of the model if the same test set
is also used to estimate prediction error. This is partly why the method of model selection
by backward deletion outlined above (and described more thoroughly in Chapter 4) mimics
the extensively-simulated "pruning" method developed for classification trees.
However, the area of the curve near the minimum test set error in Figure 1.2 appears
quite stable (largely because of the large number of respondents). Sets of questions of size
six and eight also have very low error rates, and this stability in the error-by-model-size curve
was typical for all the model sequences examined below. Sets of questions at or just above
the right size consistently predicted at a comparable level of accuracy on the test set; thus,
the test set error curve was usually quite well defined at this point. Moreover, extensive
computer simulations of the backward-deletion method of model-selection and the test set
estimation of prediction error on nested sequences of partitions found by the CART method
(called "pruning") have found that in practice the downward bias in the test set estimates of
prediction error is small if N is large and the test set is used to select from a small sequence
of nested models built on the learning set (and 16 is probably a small enough sequence).23
Since the partitions found by the random replacement search are nearly identical in structure
to the type of partitions formed by CART, and since the method of backward deletion is
nearly identical to CART's pruning method, it is hoped that the estimates of prediction error
derived from the test set for the models in this dissertation are also nearly unbiased.24
Chapter 2 presents a somewhat more rigorous (but still heuristic) argument along these lines,
and Chapter 5 presents the results of the validation with an independent sample.
1.7 Results: Mortality
The above RRSA(100) method was used to find "full" models on the learning set for
three different misclassification costs: 1.5, 3.5 and 5. These models were then subjected to
backward deletion, and the resulting sequences of submodels were applied to the test set.
Those models with the lowest error at each cost were then selected out, yielding three models
consisting of seven, seven and ten questions respectively (results are presented in Table 1.2).
The test set performance of each preferred set of questions for each cost is summarized on
each row of the table, catalogued by capital letters, and the corresponding questions and answers are listed in Appendix I.
23 See Breiman et al. (1984).
24 The only difference in the structure of partitions chosen by the two methods is that subsets of
respondents isolated by the sets of questions here were not completely disjoint (but mostly disjoint). In
partitions built by CART, all terminal nodes of a tree identify completely disjoint sets of respondents. The
only difference in the method of backward deletion is in CART's use of a cost-complexity parameter to
index the "size" of the tree. Here, simply the number of questions is used, which is equivalent to a the
number of terminal nodes in a tree. The cost-complexity parameter is a linear combination of the number of
terminal nodes in a tree and the cost-adjusted learning set misclassification error of the tree. But the
backward-deletion method used here is more basic than CART's more thorough method, and a more basic
index is used. More complicated methods of backwards deletion (e.g., by using a cost-complexity
parameter) are explored in Chapter 4.
Table 1.2 - Deaths predicted correctly and survivors predicted incorrectly as dead
(in a test set of 2,923 survivors and 509 deaths three years after baseline)

Set(1)  Cost(2)  Reduction in          Deaths predicted  Sensitivity(5)  Survivors predicted  Specificity(7)  Death rate among      Ratio to rate predicted
                 prediction error(3)   correctly(4)                      incorrectly(6)                       predicted deaths(8)   by age and sex(8)
A       5        31%                   337               66%             888                  70%             28%                   1.39
B       3.5      17%                   209               41%             348                  88%             38%                   1.83
C       1.5      5%                    113               22%             132                  96%             46%                   2.12

1. The letter of each set refers to the letters of the sets of questions listed in Appendix I.
2. The ratio of the cost of misclassifying a death as survival to the cost of misclassifying a survival as death.
3. The percentage decrease in the cost-adjusted misclassification error rate compared to the error of classifying all respondents as survivors.
4. The number of deaths in the test set correctly classified as dead by the set of questions.
5. The number of deaths correctly predicted divided by the total number of deaths in the test set (509), also called the true positive fraction or TPF.
6. The number of survivors in the test set incorrectly classified as dead by the questions (i.e., false positives).
7. The proportion of all survivors (2,923) correctly classified as survivors by the questions.
8. The proportion of deaths among the respondents predicted as dead. In the column to the right, this rate is divided by the death rate a randomly chosen set of respondents (with the same age/sex distribution as the respondents chosen by the questions) would have suffered.
Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.
The third column gives the percent reduction in prediction
error achieved by the question sets as compared with the ignorant strategy of classifying all
deaths as survivors (as calculated above). This quantity is mathematically equivalent to the
R² statistic commonly used for regression models, where the weighted total sum of squares
is defined about the ignorant strategy instead of the mean (i.e., taking Ȳ ≡ 0, since one cannot be
classified as 14% dead). However, note that if all the deaths are weighted (by the relative
cost of misclassification here), this measure of accuracy is very sensitive to changes in cost.
That is, it would be possible to achieve a 100% reduction in error simply by defining a very
high relative cost and classifying all respondents as deaths (clearly an inaccurate strategy).
Likewise, one can obtain 0% reduction in error by defining a relative cost near zero even if
the predictor catches 100% of deaths with no false positives! So this is not an objective
method for comparing the accuracy of models without regard to cost.
This is why it is more informative to consider the sensitivity and specificity for any
one predictor. Again, the sensitivity of the questions is the proportion of deaths that are
correctly classified as deaths by the questions, and the specificity is the proportion of
survivors who are correctly classified as survivors (i.e., not chosen by the questions). Thus
the sensitivity of Set A is 337/509 = 66% and the specificity is (2923 - 888)/2923, or 70%.
The death rate in the table is simply the number of true deaths in the chosen (predicted as
dead) respondents divided by the total number of respondents classified as dead. The last
column presents the ratio of this observed raw death rate to the age- and sex-adjusted rate
predicted solely from the age and sex distribution of the chosen respondents. If the
respondents chosen by Set A had been subjected to the sample-average mortality risks for
their ages and sexes, we would have expected 19.7% of them to die.25 However, it was
observed that 27.5% died, so this ratio is equal to 0.275/0.197 = 1.39. Set A, a total of only
ten questions, is the most "sensitive" set of questions presented; that is, it correctly predicts
the most deaths (two-thirds of all deaths, in fact). Of course, it would have been possible to
build a more sensitive set of questions by using a higher relative misclassification cost.
However, since Set A misclassified as many as 30% of survivors as dead, it was thought that
any higher proportion of false positives would be unacceptable.26
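As an illustrative sketch (not the dissertation's code), the age- and sex-standardized comparison works as follows; the record layout and field names are hypothetical, and the procedure follows the description given in footnote 25 below.

    # Compare the observed death rate of the chosen respondents with the rate
    # expected from sample-wide age- and sex-specific death rates.
    def death_rate_ratio(chosen, sample_rates):
        # chosen: records with 'age_group', 'sex' and 'died' (0/1) fields;
        # sample_rates: (age_group, sex) -> death rate estimated from all
        # 10,294 respondents.
        observed = sum(r["died"] for r in chosen) / len(chosen)
        expected = sum(sample_rates[(r["age_group"], r["sex"])]
                       for r in chosen) / len(chosen)
        return observed / expected    # for Set A: 0.275 / 0.197 = 1.39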
When the cost was lowered to 3.5, the predictions decreased in sensitivity and
increased in specificity, as expected. Set B correctly predicted only 41% of all deaths, but
88% of survivors were also correctly classified, a significant rate of success for seven simple
questions. About 38% of the respondents chosen by these questions died, a death rate 1.83
times as high as that predicted by the age and sex distribution of the respondents. When the
cost of misclassification was further lowered to 1.5 (Set C), the death rate of the predicted
dead increased to 46%, nearly half the 245 chosen respondents! This rate was more than
twice that predicted by age and sex, partly representing the fact that no questions about age
are included in Set C.
The consistently high levels of specificity suggest that the high death rates
experienced by the chosen subsets of learning set respondents are not likely to be due entirely
25 Specifically, the death rate is compared to the death rate a randomly chosen group of elderly
would experience with the same age and sex composition. First, age- and sex-specific death rates were
estimated from all 10,294 respondents in the sample. Then each respondent chosen by the questions was
hypothetically subjected to the sample-wide mortality risks they would have experienced according to their
age and sex. This gives the death rate of the chosen respondents predicted only by age and sex.
26 Although if one turns the question around, so that the goal is to predict survival by isolating
groups of respondents with low death rates, the present method is an equivalent way to approach the
problem - simply specify a very high relative cost of misclassification.
to chance variation. While the highest level of sensitivity is only 66% with a specificity of
70% (which admittedly may be a slightly optimistic estimate), this level of accuracy is still
quite remarkable for a predictor that requires nothing more than asking a handful of simple
questions. It may also be quite useful to identify the 30% of survivors who were classified as
dead but did not die, as they may still be at very high risk of death beyond the three years of follow-up.27 Moreover,
the questions apply to elderly of all ages and both sexes, and as the preferred questions
demonstrate, the method deals effectively with missing data.
Finally, it is informative to calculate one additional statistic that is conventionally
used for assessing the overall accuracy of a predictive method without regard to specific
combinations of specificity and sensitivity. This is the area under the receiver operating
characteristic (ROC) curve, also known as the c statistic. Based on the sensitivity and
specificity levels of the sets of questions A through C, a simple curve can be plotted as in
Figure 1.3. The proportion of deaths predicted correctly (the sensitivity, or true positive
fraction) is plotted on the y-axis, and the proportion of survivors predicted incorrectly as
deaths (the false positive fraction, equal to 1 - specificity) is plotted on the x-axis. The three
dots labeled A, B and C correspond to the question sets, and a curve is neatly fitted to these
points. Note that for a method of prediction equivalent to random guessing, this curve would
follow the 45° line, and the area under it would be 50%. If the method of prediction was
perfectly accurate, implying 100% sensitivity with a 0% false positive fraction, the curve
would "fill up" the plot (i.e., it would merge with the y-axis and the line at y = 1), and the
area would be equal to 100%. The area under the ROC curve fitted to Sets A, B and C was
estimated at 74.4% ± 1.2%, about halfway between these two extremes.28
27 Efforts are now underway to apply these questions to later waves of the EPESE data for further
validation.
[Figure 1.3 - True positive fraction by false positive fraction (ROC curve) for questions
predicting deaths in the test set. The true positive fraction (sensitivity) is plotted against the
false positive fraction (the proportion of survivors predicted incorrectly as dead) for the three
question sets; the area under the fitted curve is 74.4%. Note: The test set consisted of 2,923
survivors and 509 deaths. Letters (A, B, C) correspond to the question sets in Appendix I.
Source: New Haven, East Boston and Iowa County EPESE.]
This gives a good
intuitive feel for the accuracy of the overall method without regard to costs or specific levels
of sensitivity and specificity.
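The dissertation estimates this area by fitting a rotated parabola to the three points (see Chapter 2). As a simpler, purely illustrative alternative, a trapezoidal rule through the operating points of Table 1.2 gives a figure in the same neighborhood:

    # Simple trapezoidal estimate of the area under the ROC curve through the
    # operating points of Sets C, B and A (false positive fraction, sensitivity),
    # anchored at (0, 0) and (1, 1).
    points = [(0.00, 0.00),
              (1 - 0.96, 0.22),   # Set C
              (1 - 0.88, 0.41),   # Set B
              (1 - 0.70, 0.66),   # Set A
              (1.00, 1.00)]
    area = sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
    print(round(area, 2))   # about 0.71, a little below the fitted 74.4%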
However, sets of questions with different levels of specificity and sensitivity were not
estimated solely for the fitting of the ROC curve. The main idea is to allow the models to be
used in different ways. For example, question set C, which isolates the small fraction of
respondents at the very highest risk of death, could be used by a physician whose goal is to
identify the neediest of elderly for the efficient distribution of expensive resources. Sets A
and B are more geared toward identifying a large number of elderly who are at moderately
high risk.
1.8 Results: Heart failure, stroke and cancer
Within three years of baseline, some 868 respondents (8.4%) either reported that they
had been diagnosed with some type of heart failure, or died of heart failure without having
had a heart attack previously.29 New diagnosis of strokes was reported by or as the cause of
death for 943 (8.6%) respondents. For cancer, it seemed impossible to distinguish between
new cancers reported post-baseline and the existence of cancer as it was reported at baseline.
This was due partly to the wording of the questions and partly to the fact that cancer is not
a discrete event in time like a stroke or heart failure. Therefore, all respondents who reported
that they had ever been diagnosed with cancer were removed from the prediction of cancer
28 The points were fit with a rotated parabola which was constrained to pass through the origin and
the point at (1, 1). The standard error was estimated with a bootstrap method. See Chapter 2 for an
explanation of this estimate and its standard error.
29 If a respondent died from heart failure without reporting it post-baseline and they reported
having experienced heart failure at baseline, it was assumed that no new heart failure occurred. The
occurrence of new strokes and diabetes was defined in this way also.
analysis to catch only genuinely new incidents of cancer. Thus, the resulting questions only
apply to persons never diagnosed with cancer. A total of 552 persons (6.2% of a total N of
8,874) never previously diagnosed with cancer either reported post-baseline that they had
been diagnosed with cancer or died of cancer.
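As a hedged illustration of how such an outcome might be coded, the sketch below uses hypothetical field names (not the actual EPESE variables) and follows the rule stated in footnote 29 above for deaths from heart failure.

    # Illustrative coding of the three-year heart failure outcome described
    # above (hypothetical field names, not the actual EPESE variables).
    def new_heart_failure(respondent):
        # A new diagnosis reported at follow-up counts as an event.
        if respondent["reported_heart_failure_followup"]:
            return 1
        # A death from heart failure counts only if heart failure was not
        # already reported at baseline (the rule of footnote 29).
        if (respondent["died_of_heart_failure"]
                and not respondent["heart_failure_at_baseline"]):
            return 1
        return 0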
One problem becomes evident at this point for any researcher familiar with survival
analysis: the occurrence of these traumatic events is censored in some respondents by death
due to other causes. (This is a demographer's euphemism for saying some respondents died
of cancer before they had a chance to have heart attacks!) There are two ways to think about
this intrusion. First, if one is genuinely concerned with estimating the probability of an event
in a certain group of respondents (specifically, with the idea that some deaths could be
averted, e.g., in a cause-elimination scenario), care must be taken to adjust for the changes
in exposure and examine any assumptions regarding the dependence of chronic illness events
and death due to other causes (the censoring agent). Still, for pure prediction, we can merely
define our event-of-interest to be the admittedly censored incidence of heart attack, stroke
or cancer, essentially ignoring the issue (that is to say, defining risk in the presence of other
causes of death, perhaps with the justification that we are more interested in identifying those
persons for whom a specific event will occur before they die of another illness!). The brief
results presented below are given in the context of pure prediction, but there are still some
interesting substantive hypotheses suggested by the questions themselves.
Table 1.3 shows the results of using the above method to predict heart failure, and Appendix
II contains the actual question sets (D and E). Questions were identified for two relative
costs of misclassification, nine and six.

Table 1.3 - Heart failures predicted correctly and non-failures predicted incorrectly
(in a test set of 288 heart failures and 3,144 non-failures three years after baseline)

Set(1)  Cost(2)  Reduction in          Failures predicted  Sensitivity(5)  Non-failures predicted  Specificity(7)  Failure rate among      Ratio to rate predicted
                 prediction error(3)   correctly(4)                        incorrectly(6)                          predicted failures(8)   by age and sex(8)
D       9        25%                   169                 59%             871                     72%             16%                     1.60
E       6        12%                   103                 36%             416                     87%             20%                     1.86

1. The letter of each set refers to the letters of the sets of questions listed in Appendix II.
2. The ratio of the cost of misclassifying a heart failure as non-failure to the cost of misclassifying a non-failure as failure.
3. The percentage decrease in the cost-adjusted misclassification error rate compared to the error of classifying all respondents as non-failures.
4. The number of heart failures in the test set correctly classified as failures by the set of questions.
5. The number of failures correctly predicted divided by the total number of failures in the test set (288), also called the true positive fraction or TPF.
6. The number of non-failures in the test set incorrectly classified as failures by the questions (i.e., false positives).
7. The proportion of all non-failures (3,144) correctly classified as non-failures by the questions.
8. The proportion of failures among the respondents predicted as failures. In the column to the right, this rate is divided by the failure rate a randomly chosen set of respondents (with the same age/sex distribution as the respondents chosen by the questions) would have suffered.
Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.

Set D, consisting of seven questions, correctly
predicted some 169 heart failures out of 288 failures total in the test set (sensitivity = 59%).
The number of false positives was 871, for a specificity of 72%, and the rate of heart failure
was 1.6 times as high as the rate predicted by the age and sex composition of the
respondents. The more specific questions (Set E), a total of only six questions, caught only
103 heart failures (36% of all failures in the test set) but with quite high specificity,
identifying only 416 false positives (87%). Of the respondents classified as high risk by the
questions, one in five suffered heart failure, a rate nearly 90% higher than that predicted by
their age and sex distribution! The area under the ROC curve was estimated as 71.0%.
The prediction of strokes was a little less accurate, as shown in Table 1.4; the
questions are in Sets F and G in Appendix III. The more specific set (G) catches 37% of all
strokes with a specificity of 86%, but Set F predicts 59% of strokes with only 66%
specificity. The area under the ROC curve was estimated as 69.1%. Cancer is much more
difficult to predict (see Table 1.5, and Sets H and I in Appendix IV). Questions in Set H
catch as many as 61% of all respondents who get cancer, but with a specificity of only 54%.
Question Set I appears more useful, catching as many as 43% of cancerous respondents with
a specificity of 71% (not high, but still of interest). For cancer, the area under the ROC curve
was estimated as 61.5%.

Table 1.4 - Strokes predicted correctly and non-strokes predicted incorrectly
(in a test set of 319 strokes and 3,113 non-strokes three years after baseline)

Set(1)  Cost(2)  Reduction in          Strokes predicted  Sensitivity(5)  Non-strokes predicted  Specificity(7)  Stroke rate among      Ratio to rate predicted
                 prediction error(3)   correctly(4)                       incorrectly(6)                         predicted strokes(8)   by age and sex(8)
F       9        22%                   189                59%             1,065                  66%             15%                    1.38
G       5        10%                   118                37%             439                    86%             21%                    1.73

1. The letter of each set refers to the letters of the sets of questions listed in Appendix III.
2. The ratio of the cost of misclassifying a stroke as a non-stroke to the cost of misclassifying a non-stroke as stroke.
3. The percentage decrease in the cost-adjusted misclassification error rate compared to the error of classifying all respondents as non-strokes.
4. The number of strokes in the test set correctly classified as strokes by the set of questions.
5. The number of strokes correctly predicted divided by the total number of strokes in the test set (319), also called the true positive fraction or TPF.
6. The number of non-strokes in the test set incorrectly classified as strokes by the questions (i.e., false positives).
7. The proportion of all non-strokes (3,113) correctly classified as non-strokes by the questions.
8. The proportion of strokes among the respondents predicted as strokes. In the column to the right, this rate is divided by the stroke rate a randomly chosen set of respondents (with the same age/sex distribution as the respondents chosen by the questions) would have suffered.
Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.

Table 1.5 - Cancer predicted correctly and non-cancer predicted incorrectly as cancer
(in a test set of 185 respondents with cancer and 3,046 without cancer three years after baseline)

Set(1)  Cost(2)  Reduction in          Cancers predicted  Sensitivity(5)  Non-cancers predicted  Specificity(7)  Cancer rate among      Ratio to rate predicted
                 prediction error(3)   correctly(4)                       incorrectly(6)                         predicted cancers(8)   by age and sex(8)
H       15       11.1%                 113                61%             1,387                  54%             7.5%                   1.15
I       13       6.3%                  79                 43%             875                    71%             8.3%                   1.18

1. The letter of each set refers to the letters of the sets of questions listed in Appendix IV.
2. The ratio of the cost of misclassifying a cancerous respondent as non-cancerous to the cost of misclassifying a non-cancerous respondent as cancerous.
3. The percentage decrease in the cost-adjusted misclassification error rate compared to the error of classifying all respondents as non-cancerous.
4. The number of respondents in the test set who got cancer and were correctly classified as cancerous by the set of questions.
5. The number of cancerous respondents predicted correctly divided by the total number of cancerous respondents in the test set (185), also called the true positive fraction or TPF.
6. The number of non-cancerous respondents in the test set incorrectly classified as cancerous by the questions (i.e., false positives).
7. The proportion of all non-cancerous respondents (3,046) correctly classified as non-cancerous by the questions.
8. The proportion of cancerous respondents among the respondents predicted as cancerous. In the column to the right, this rate is divided by the cancer rate a randomly chosen set of respondents (with the same age/sex distribution as the respondents chosen by the questions) would have suffered.
Note: This analysis included only respondents who were diagnosed as cancer free at baseline.
Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.

1.9 Discussion

There are several perplexing questions evident in the preferred models, and there are
many questions that are even more puzzling by their absence. A physician is very likely to
be disconcerted by this lack of many clinically obvious predictors. One might imagine (as
did the author, before the analysis) that the most powerful predictors would be age, sex,
smoking, heavy drinking, obesity, and questions asking about previous diagnoses of
illnesses (e.g., "Have you ever been told by a doctor that you had cancer?" and similarly for
heart failure, stroke, diabetes and high blood pressure). Yet consider Set A, which correctly
predicts the most deaths; of the above predictors, only age, sex and weight are considered.
Age and sex, which are undoubtedly the most commonly used variables in models of
mortality, are used only once each in Sets A and B. Set C, which catches the respondents
who are at the highest risk of death, considers neither! This may be a very disturbing result for
many scholars of mortality, prompting one to ask why so many other risk factors known to
be important are also not more prominently featured.
Several answers to this question exist. The first concerns the predominance of
measures of functional status. In particular, several questions that appeared repeatedly were
the items asking whether the respondent could walk a half mile, do heavy work, or bathe
without assistance. The tremendous predictive power of these ADL's (activities of daily
living) has been reported in a number of other studies on predicting mortality (specifically,
see Bianchetti et al. (1995); Cahalin et al. (1996); Davis et al. (1995); Reuben et al. (1992);
and Warren and Knight (1982)). In short, many researchers have found that ADL's have
more predictive power than nearly any other variable, including those "obvious" items listed
above. This may be because the conventional risk factors are all variables that affect
mortality over long periods, and with uncertain consistency; if the goal is to predict short
term mortality, it is more effective to focus on characteristics that appear just before death,
and these changes are clearly identified in the functional ability of the individual.
A highly simplified picture of the process of death from chronic illness over time
illustrates this, as shown in Figure 1.4. Any particular risk factor may or may not lead to
illness (hence the questionable arrows), and any particular illness may or may not lead to
death. However, that illness or combination of illnesses that leads to death is quite certain
to cause a change in the functionality of the individual at some time before death. To be sure,
death is the ultimate breakdown in functional status, and accurate prediction becomes a
matter of how soon functionality begins to deteriorate before death. Clearly, predicting the
death of the person who is perfectly capable one day and dies suddenly on the next, perhaps
experiencing loss of function only minutes before death, is much more difficult. Similarly,
the variable loses its prognostic value for persons who may become disabled for reasons
other than illness, and well before the onset of mortality. For the majority of individuals,
however, much loss of function is caused by chronic illnesses. Thus, not only are these
changes in functional ability temporally closer to the event of death than any of the other
predictors, we also expect this association to be the strongest as well. It is expected then that
this variable should be an extremely accurate short-term predictor of death, to the extent to
which it can be measured accurately.
It is for this reason, perhaps, that age and sex in particular have not emerged as a
central focus in this analysis.30 Every one of the overall sets (A through I) is capable of
identifying any person of an age greater than 65 as high risk. Only two of the 11 subsets of
questions in Appendix I include a question about age and only two contain a question about sex.
30 The author originally thought that the best way to proceed with the analysis would be to divide
the sample into age- and sex-specific subsets before applying the search algorithm. This turned out to be
not only much more complicated logistically, but suboptimal with respect to predictive power as well. For
example, some sets of questions find subsets of respondents which cut across age boundaries while other
subsets are very age-specific. It seems that allowing the computer to decide when age or sex should or
should not be included is the best way to build question sets.
[Figure 1.4 - Diagram of death from chronic illness. Risk factors (age/exposure, diet/obesity,
smoking, drinking, genetics, and other factors) may or may not lead to chronic illnesses
(heart disease, cancer, and other illnesses); the illness or combination of illnesses that leads
to death degrades functional status at some point beforehand, so the pathway runs from risk
factors through illness and declining functional status to death.]

Yet the rates of both death and illness that the respondents experience are much higher
than predicted by age and sex, a very counterintuitive result given our knowledge of these
variables and their relation to mortality! However, age is generally such a powerful predictor
because it is a variable that captures the conglomerated effect of many other variables; when
we refer to the "aging process", we usually think of the decline of the individual's functional
status and the onset of chronic illness, not simply the mere act of passing time. Thus the
main effect that age measures directly is the exposure to repeated insults, and most of the
other variables that relate to the aging process are quite well measured in the available survey
questions. However, age is often included in the questions, and usually the chosen
respondents are older than the general population. Ideally, one would like a set of questions
that chooses persons who are at highest risk relative to others of the same age and sex. In
Chapter 5, a hybrid combination of the question set method and a linear model gives a way
to accomplish this, but interestingly, it appears that the results are not substantially different
from those shown here.
Another reason for the puzzling inclusion and exclusion of certain variables is the
nonintuitive nature of the search strategy. Specifically, questions are designed to have low
misclassification error, and this is quite different from merely finding some respondents who
are at high risk. The criterion of misclassification error measures not only how high the risk
level is for those respondents classified as dead, it also reflects the number of respondents
classified as dead (i.e., the "efficiency" of the questions, as discussed above). This can be
seen vividly by actually trying to build question sets on a purely intuitive basis. For example,
consider a set of questions that isolates all those respondents who are males, aged 80+, and
who have been diagnosed with heart failure, cancer, stroke, or high blood pressure. One
expects these persons to be at very high risk, and indeed they are: 35% of these respondents
died. Yet of all 3,432 respondents in the test set, only 147 were isolated by these variables,
so only 53 deaths (out of 509) were correctly predicted. In contrast, the questions in Set C
isolated a group of respondents with a death rate of 46%. This was not only a set of
respondents at somewhat higher risk; it was also a much larger group of respondents. There
were 245 such persons in the test set, and 113 of them died, more than twice as many deaths
as predicted by the intuitive set! Even more incredibly, since the questions in Set C ask
nothing about sex or age and very little about illness, the persons so isolated are not all 80+
males who have been diagnosed with terminal diseases; so many respondents identified as
high risk are not already known to be at high risk prior to prediction.
An important hypothesis is suggested by these results. Because the observed
diagnoses of illness did not have a great deal of predictive power compared with other
symptomatic variables such as functionality, and because so many respondents who
eventually died had never been diagnosed with disease at baseline, it was thought that a great
deal of undetected morbidity existed in the EPESE sample at baseline. The relatively poor
respondents in this dataset likely had a particularly low level of access to and utilization of
medical resources, and it seemed that the diagnosis of morbidity was positively related with
income. It was also suspected that cancer accounted for many missing diagnoses. (This
issue is treated in detail in Chapter 6).
It seems that our etiological knowledge of mortality is still far too incomplete to
provide us with a predictor that is both causally intuitive and efficient based on the available
data. So, if the primary objective is to build an accurate and efficient predictor, as is the goal
of this dissertation, it benefits one to abandon (at least for the moment) conventional thinking
about causal structures. Much of the advantage of the present method comes from the fact
that it is not at all bound by our seriously flawed notions of what variables are causally
responsible for death. Rather, questions are chosen from the pool of potential variables by
a method that is completely objective, and for the purposes of building the models, the only
concern is that they increase predictive power. As it is statistically defined (see Chapter 2),
predictive structure is completely equivalent to correlative structure, and so one can
maximize predictive power totally without regard to causes. Thus, a word of caution is
necessary. Many laypersons (and even some academics) are ignorant of the difference
between cause and association. Careless consideration of the questions could lead
respondents to believe falsely that the variables suggest a way to mitigate their risk of death.
Question Set C, for example, predicts that respondents who cannot correctly state the day of
the week are at higher risk of death. This may be true, but it is doubtful that we could help
such persons merely by giving them calendars.
As a result, many researchers may be annoyed by this strategy of model-building
while ignoring considerations of causal structure. However, it is the intent of the author to
ignore causes only for the purpose of building the models; once they are built, it seems that
a great deal of insight into causal structure may be gained. The trick is to be extremely
cautious in this endeavor. As mentioned, many variables highly correlated with death are
not controlled in the question sets, in comparison to a regression model for instance (see
Chapter 5 for such a model). However, a different type of advantage is offered, and this
may also be another reason that the method predicts so well. People die for many different
reasons, and in many different ways, and it follows that using a single predictive equation
to try to predict all deaths is not an efficient approach. However, by allowing various
partitions to be combined with the OR operator, a set of questions can be constructed which
effectively deals with many different types of mortality since no single variable or question
is applied to all respondents.31 For example, consider Question Set A, which accounted for
66% of all deaths. This overall set consists of five subsets of questions (A.1 through A.5)
combined by the OR operator. Two-thirds of the respondents chosen by these questions were
only selected by one of the five sets. Moreover, the recorded causes of death vary
substantially from set to set; in particular, some sets are dominated by heart disease mortality,
as deaths due to heart failure appear to be more predictable than deaths from cancer and other
causes. Etiologies seem to vary widely from set to set, and so by efficiently dividing the
population into subgroups that are more homogeneous with respect to causes, we may greatly
simplify the causal analysis. Chapter 6 demonstrates this by applying the more conventional
types of analysis to respondents within these delineated regions in an attempt to glean more
information about the different causes behind the different types of deaths.
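To make the structure concrete, the sketch below shows how such an OR-of-ANDs question set classifies a single respondent; the two subsets are invented for illustration and are not the actual questions of Set A (those are listed in Appendix I).

    # A respondent is predicted dead if ALL questions in at least ONE subset
    # apply; subsets are joined by OR, questions within a subset by AND.
    def chosen(respondent, question_set):
        return any(all(question(respondent) for question in subset)
                   for subset in question_set)

    # Hypothetical example with two subsets (not the real Set A questions).
    example_set = [
        [lambda r: r["can_walk_half_mile"] == "no",
         lambda r: r["needs_help_bathing"] == "yes"],
        [lambda r: r["does_heavy_housework"] == "no"],
    ]
    respondent = {"can_walk_half_mile": "no", "needs_help_bathing": "yes",
                  "does_heavy_housework": "yes"}
    print(chosen(respondent, example_set))   # True: the first subset applies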
The nonparametric tactic of partitioning the sample space (as opposed to fitting
equations) also allows the predictors to deal very effectively with a number of more practical
problems: nonlinearities, the very large pool of potential predictor variables, the inevitable
missing values, and other "messy data" problems inherent in large scale survey endeavors.
31 This is an especially important distinction between the present search method and CART: with
classification trees, the binary splits are constrained to take the form of a tree structure. As a result, all
splits are beholden to the root node of the tree, and all splits lower in the tree are dependent on splits above
them.
Finally, it is important to mention some deficiencies of the method. Since functional
status is being used to predict death, it is clear that the method predicts death in persons who
have already suffered some degree of illness (hence their decreased functionality). It is for
this reason that the method is also used to predict the debilitating events themselves, such
as heart failure, stroke and cancer, before they occur. The questions for predicting illness
contain some very obvious risk factors (questions about smoking predict both cancer and
strokes), but there are still some other questions (ADL's, questions about chest pain) which
indicate that the isolated respondents, again, were already somewhat ill at baseline. In future
analyses, later waves of the EPESE data will be used in an attempt to identify more long-
term predictors and risk factors.
Unfortunately, another serious flaw in the prediction of illness exists. Obviously,
what is observed is not the true incidence of illness, but the diagnosis of it (either by a
physician or on a death certificate). In the case of heart failure or stroke, this difference may
not be too crucial. Cardiac events, if they are at all serious, tend to be painful and relatively
obvious, and so are probably diagnosed somewhat successfully. Cancer, however, can be
quite difficult to predict, particularly in its earliest stages. It is perhaps for this reason that
the method identifies questions such as that in Set H.4, which asks whether the respondent
has been to a dentist recently. Respondents who had been to a dentist in the previous five
years exhibited a higher risk of cancer than those respondents who had not been to a dentist.
It seems unlikely that going to a dentist is a cause of (or even moderately correlated with) the
true incidence of cancer; instead, it seems probable that persons who go to dentists are much
more likely to be diagnosed with cancer, perhaps because dentists can frequently spot oral
cancer, or maybe because such persons go to physicians more often as well. However, if one
is imaginative, it is usually possible to posit some kind of causal connection between
seemingly unrelated variables. For instance, it is possible that persons who go to dentists
regularly have received many more X-rays over the course of their lives, and so may actually
be at higher risk of cancer. This may be far-fetched, but nonetheless, the decision to reject
the hypothesis is purely a judgement call. Since such judgments are frequently wrong, and
since the observed correlation is still of interest in any case, these types of anomalies have
been allowed to remain in the preferred models. Consequently, the questions concerning the
prediction of cancer, which are not particularly accurate in any case, should be considered
principally of academic interest. As suggested above, two central concerns raised by the
results are the level of undetected morbidity evident in the population, and the possible
misspecification of causes on death certificates, as discussed in Chapter 6.
1.10 An overview of the dissertation
Chapter 2 of the dissertation addresses the problem of prediction from a statistical
standpoint and reviews the literature in this context. The chapter starts by formally stating
the problem of prediction, defining prediction error with respect to the test set, and then
briefly discusses recent research which is concerned with the prediction of mortality and
morbidity. A probability model for the classification problem is defined, and the statistical
theory behind the various existing methods is explored. Shortcomings of the present
methods are then examined vis a vis the literature, and the method suggested here is
compared with the existing models to show how these shortcomings might be addressed.
The motivation for constructing the overall search method centers around the bias-variance
tradeoff inherent in building large, multivariate models. The argument for the model
selection process works by analogy with CART, showing how the method of determining the
best number of questions and partitions is similar to determining the best-sized classification
tree.
Chapter 3 presents the particulars of the dataset used in this dissertation. Basic
summary statistics and the demographic makeup of the data are explored and compared with
the U.S. population of elderly. The issue of sample design is examined, as are the details of
missing data. Levels of disability, morbidity, and mortality in the EPESE sample are
summarized by age and sex.
Chapter 4 treats the method in detail, focusing first on the construction of the search
algorithm for selecting the full model and on the method of model selection through
backwards deletion. The primary goal of this chapter is to show how identifying a sequence
of submodels and applying it to the test set suggests a model size with low prediction error.
The practical problems associated with constructing this algorithm are discussed, and
variations on the basic search method are explored. The methods used for building
comparative models via linear discriminant analysis, logistic regression, and classification
trees are also set out. Again, the tactics used for model selection in all three cases revolve
around the optimization of the bias-variance tradeoff.
In Chapter 5, the models presented above and in the appendices are revisited in much
greater detail. The respondents chosen by the models are broken down for each of the
question subsets by age and sex. Performance characteristics of the questions are presented,
and they are compared in terms of predictive accuracy with results from the three other
methods discussed in Chapter 4. The validation of these models using the Duke EPESE
sample is also presented.
Chapter 6 is concerned largely with the substantive issues raised by the models. The
discussion begins with an in-depth examination of the causes in relation to the multiple cause
death certificate data linked to the sample. In particular, it is shown how the question
method developed here simplifies causal analysis by partitioning the high-risk respondents
into groups that are more homogeneous with respect to etiology. Some variables used in the
models (e.g., digitalis usage, body weight, disability) are analyzed as potential risk factors
in light of the causes of death associated with each question subset. The extent of the
problem of misspecification of causes of death on death certificates is explored. It is argued
that cancer deaths may be particularly underrepresented as a cause of death, and that the
detection of cancer as a cause of death and as a morbid condition in living persons is
dependent on the individual's access to medical care.
Chapter 7 summarizes the findings of the dissertation.
Chapter 2 - The Problem of Prediction
To be, or not to be: that is the question.
- Hamlet
2.1 The specific problem of predicting death or illness
Statistical theory has given researchers any number of models for predicting binary
outcomes such as death/survival. Most commonly, multiple regression is used (logistic and
otherwise), along with proportional hazards, linear discriminant analysis and classification
trees, to name only a few. However, the exact statistical definition of the general prediction
problem itself is rarely discussed, partly because of this overwhelming multitude of specific
approaches. Unfortunately, in the social sciences the problem of prediction has often become
synonymous with causal modeling, particularly with multiple regression. One speaks of the
"independent" or right-hand side variables of a regression equation as predictors of the
dependent variable, and indeed any properly fit regression equation can be used to predict
the dependent variable in a literal sense.
Yet often these equations are not built with the goal of maximizing true predictive
power. The ultimate purpose of the models for most social scientists is not to predict, but
to understand the nature of the associations between the dependent and independent
variables. As such, the building of the equation typically operates around the maximization
of an R² measure of fit or the exclusion of variables that are not statistically significant rather
than the minimization of some well-defined prediction error. The use of modern, high-
powered statistical techniques for predicting mortality in individuals (as opposed to
forecasting mortality rates in populations) has been defined and explored mostly by
researchers in medical fields. However, these efforts still do not usually use the available
methods to their full potential.
In this dissertation, the meaning of statistical prediction is stated very explicitly. All
models are treated within this framework, and the review of existing research focuses
primarily on those efforts that are truly aimed toward prediction as defined (as opposed to
causal models, for instance). This definition also yields a statistical criterion by which to
judge a predictive model of any type, and a common theme emerges which relates to the
problem of building models to maximize predictive accuracy. This theme, the bias-variance
tradeoff, centers around the dimensionality or size of the models, and in particular, the task
of finding what size model gives the most accurate predictions.
2.2 Defining prediction and prediction error for a general predictor
Suppose we have data on some potential predictor variables, X_1, ..., X_p, measured at
time T_0, for N individuals, and we also have a binary outcome variable Y for each person
(taking on the values zero and one, in this case representing survival and death respectively).
The outcome Y is usually measured at some later time T_1 for each person, but the time
element is not essential to the problem (i.e., it can also be defined in terms of classification).
The problem of prediction as defined here is to find some predictor (or classifier) of Y,
P(X_1, ..., X_p), built from X_1, ..., X_p, such that the value of Y can be determined accurately
before it is observed at T_1. Once Y is observed, a simple measure of accuracy, or prediction
error (PE), can now be defined - just divide the number of misclassified cases by N:

\[ PE = \frac{1}{N} \sum_{i=1}^{N} \left| Y_i - \hat{Y}_i \right|, \qquad [2.1] \]

where Ŷ_i is the predicted value of Y for the ith respondent.¹
The costs of misclassification can be adjusted asymmetrically to make the predictor
more sensitive to one type of misclassification or the other. For example, in the mortality
problem, suppose that we wish to find a set of questions that would catch a very high
proportion of the deaths in the sample. This might be possible only at the expense of more
misclassified survivors. Then we could define the cost of misclassifying a death as survival
to be higher than the cost of misclassifying a survivor as dead. If C(0,1) is the cost of
misclassifying Y_i as 1 when it equals 0, and I_i(0,1) is an indicator random variable for this
type of misclassification, then costs can be accounted for with an adjusted PE:²

\[ PE = \frac{\sum_{i=1}^{N} \left[ C(0,1)\, I_i(0,1) + C(1,0)\, I_i(1,0) \right]}{C(0,1) \sum_{i=1}^{N} (1 - Y_i) + C(1,0) \sum_{i=1}^{N} Y_i}, \qquad [2.2] \]

where the denominator is also adjusted simply to keep the quantity within the (0,1) interval
(the denominator stays constant regardless of the predictor). In this way, the method can be
applied using a variety of misclassification costs so that different degrees of specificity and
sensitivity may be achieved.
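As a concrete illustration of Equations 2.1 and 2.2, the following sketch (in Python, with a toy outcome vector and hypothetical cost values; it is not the code used for the analyses in this dissertation) computes the unit-cost and cost-adjusted misclassification rates:

```python
import numpy as np

def prediction_error(y, y_hat):
    """Unit-cost misclassification rate (Equation 2.1)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean(np.abs(y - y_hat))

def cost_adjusted_pe(y, y_hat, c01=1.0, c10=1.0):
    """Cost-adjusted misclassification rate (Equation 2.2).
    c01 = cost of calling a survivor (Y=0) a death (prediction 1);
    c10 = cost of calling a death (Y=1) a survivor (prediction 0)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    i01 = (y == 0) & (y_hat == 1)          # misclassified survivors
    i10 = (y == 1) & (y_hat == 0)          # misclassified deaths
    num = c01 * i01.sum() + c10 * i10.sum()
    den = c01 * (y == 0).sum() + c10 * (y == 1).sum()
    return num / den

# toy example: 10 respondents, 3 of whom die
y     = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])
y_hat = np.array([0, 1, 1, 0, 0, 0, 0, 0, 1, 0])
print(prediction_error(y, y_hat))              # 0.2
print(cost_adjusted_pe(y, y_hat, c01=1, c10=5))
```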
Of course, the statistical problem of building P(X_1, ..., X_p) is typically encountered
after T_1, so Y has been observed and used in the building process. Then, estimating the
prediction error in this way does not necessarily give an honest estimate of the accuracy of
the predictor as applied to additional, independent values of X_1, ..., X_p to predict a truly
unobserved Y. (The reasons for this are discussed below.) Instead, a test set
misclassification rate is defined.³

1 This is equivalent to the misclassification rate defined in Breiman et al. (1984) for the two-class
classification tree with unit misclassification costs, and is the same measure of prediction error employed in
Chapter 1 as a criterion for judging combinations of sets of questions.
2 By an indicator variable it is meant that I_i(0,1) is 1 if Y_i = 0 and Ŷ_i = 1, and 0 otherwise.
Before building a predictor, a simple random sample is used to select out some
respondents in the data, who are then set aside for future testing of the predictor, Y^TS and
X_1^TS, ..., X_p^TS. (This is usually a half or a third of the sample, by convention.) Using the
half or two thirds of the data remaining, a predictor of Y^LS is built, P(X_1^LS, ..., X_p^LS),
and this predictor is then applied to X_1^TS, ..., X_p^TS to produce predicted values, Ŷ_i, for
all i in the test set. A more honest estimate of prediction error for this predictor can now be
calculated as

\[ PE_{TS} = \frac{1}{N_{TS}} \sum_{i \in TS} \left| Y_i - \hat{Y}_i \right|, \qquad [2.3] \]

where N_TS is the number of observations in the test set. Likewise, in the cost-adjusted case,
it is

\[ PE_{TS} = \frac{\sum_{i \in TS} \left[ C(0,1)\, I_i(0,1) + C(1,0)\, I_i(1,0) \right]}{C(0,1) \sum_{i \in TS} (1 - Y_i) + C(1,0) \sum_{i \in TS} Y_i}. \qquad [2.4] \]
In both cases the summation is performed only over the test set. This measure of prediction
error can be used (and will be used, for the remainder of this dissertation) as a criterion by
which to judge the predictive accuracy of any binary predictor or classifier.

3 Numerous variants of this scheme exist, e.g., N-fold cross-validation, for cases where the dataset
is not large enough to afford selecting out a test set. However, in this case the number of observations is
large enough so that a test set method gives essentially the same result as cross-validation, but with much
less logistical complexity.
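A minimal sketch of this test-set procedure is given below. The data, the 50/50 split, and the deliberately simple stand-in classifier are all hypothetical; only the mechanics of setting aside a test set and computing Equation 2.3 on it are meant to be illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: N respondents, p binary predictors, binary outcome
N, p = 10_294, 20
X = rng.integers(0, 2, size=(N, p))
Y = rng.binomial(1, 0.14, size=N)            # roughly 14% deaths

# set aside a simple random sample as the test set (here, one half)
test = np.zeros(N, dtype=bool)
test[rng.choice(N, size=N // 2, replace=False)] = True
X_ls, Y_ls = X[~test], Y[~test]              # learning set
X_ts, Y_ts = X[test], Y[test]                # test set

# build a (deliberately crude) classifier on the learning set only: flag the
# single question whose "yes" group has the highest learning-set death rate
rates = np.array([Y_ls[X_ls[:, j] == 1].mean() for j in range(p)])
best_q = int(np.argmax(rates))

def predict(X):
    return (X[:, best_q] == 1).astype(int)

# honest estimate of prediction error (Equation 2.3)
pe_ts = np.mean(np.abs(Y_ts - predict(X_ts)))
print(pe_ts)
```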
To test whether a particular model is significantly different from the null model
(identified as the "ignorant" model in Chapter 1), one must estimate a standard error for the
PE_TS. If the test set is treated as a simple random sample from the distribution of interest,
then in the unit cost scenario this can simply be estimated as the standard error of the mean
of a binomial variable, where the binary outcome is misclassification (1) or correct
classification (0), and the "success" parameter is estimated by PE_TS.⁴ For the cost-adjusted
case, one simply models the PE_TS as a random sample from three possible outcomes: 0, 1,
or C. These three outcomes can then be assigned a distribution defined by the observed
proportions of correct classifications, incorrectly classified decedents, and incorrectly
classified survivors respectively. (This is not unlike a bootstrap estimate, but where no
resampling is necessary since the error distribution is completely specified by these
proportions.)⁵ A statistical test can then be calculated to assess whether the difference
between the test set estimate of error and the null error is statistically significant simply by
dividing this difference by the test set standard error estimate. For all the fitted models
presented in this dissertation, the differences were clearly significant far beyond the 0.01 level, but the
actual tests are usually not shown.
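The following sketch shows one way such a test could be computed under the binomial model described above; the numbers plugged in (the test-set size and the two error rates) are hypothetical.

```python
import numpy as np

def pe_ts_z_test(pe_ts, pe_null, n_ts):
    """Compare a test-set error with the null ("ignorant") model's error,
    treating each test-set case as an independent Bernoulli trial whose
    "success" probability is the misclassification rate."""
    se = np.sqrt(pe_ts * (1.0 - pe_ts) / n_ts)
    z = (pe_null - pe_ts) / se
    return se, z

# hypothetical numbers: 5,147 test-set respondents, 14% of whom die, so the
# null model (predict "survives" for everyone) misclassifies 14% of them
se, z = pe_ts_z_test(pe_ts=0.11, pe_null=0.14, n_ts=5147)
print(se, z)
```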
4 This is the model used by Breiman et al. with fair success. Since the respondents were not
chosen with a simple random sample, one may also attribute the element of chance to the mortality process
itself, after conditioning on some set of X variables. Identical results may be achieved with either model.
5 See Chapter 5 for examples of the explicit calculations.

To judge the overall accuracy of the method, taking into account any combination of
misclassification costs, it is conventional to estimate the receiver operating characteristic
(ROC) curve, as shown in Figure 1.3.⁶
The name for this curve is more complicated than its
substance; it simply refers to a plot of sensitivity as it varies by specificity. In the context of
mortality prediction, sensitivity is the number of deaths correctly classified as dead by the
predictor, divided by the total number of deaths, also called the TPF (true positive fraction).
In the test set application,

\[ TPF_{TS} = \frac{\sum_{i \in TS} I_i(1,1)}{\sum_{i \in TS} Y_i}, \qquad [2.5] \]

where I_i(1,1) is the indicator that the ith respondent is correctly classified as dead.
Specificity relates to the degree to which surviving respondents are incorrectly predicted as
dead, expressed here through the FPF (false positive fraction). Specifically,

\[ FPF_{TS} = \frac{\sum_{i \in TS} I_i(0,1)}{\sum_{i \in TS} (1 - Y_i)}, \qquad [2.6] \]

where, as above, I_i(0,1) is the indicator that the ith respondent is incorrectly classified as
dead. So the FPF is the number of respondents who survived but were incorrectly classified
as dead, as a proportion of the total number of survivors. (Specificity is usually defined as
1 - FPF.) Notice that as C(1,0) (the cost of misclassifying a dead person as a survivor) is
raised relative to C(0,1) (the cost of misclassifying a survivor as dead), a predictor built to
minimize such costs will improve sensitivity while sacrificing specificity.
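A small sketch of Equations 2.5 and 2.6, again on hypothetical test-set arrays:

```python
import numpy as np

def tpf_fpf(y_ts, y_hat_ts):
    """Test-set sensitivity (TPF, Eq. 2.5) and false positive fraction (FPF, Eq. 2.6)."""
    y, yh = np.asarray(y_ts), np.asarray(y_hat_ts)
    tpf = ((y == 1) & (yh == 1)).sum() / (y == 1).sum()
    fpf = ((y == 0) & (yh == 1)).sum() / (y == 0).sum()
    return tpf, fpf

y_ts     = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_hat_ts = np.array([1, 0, 1, 0, 0, 1, 0, 0])
print(tpf_fpf(y_ts, y_hat_ts))   # (0.666..., 0.2)
```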
6 See Swets and Pickett (1982), and Thompson and Zucchini (1989).

The principal reason for considering the ROC curve is that the area under it serves
as a natural and commonly used index of a predictive method's accuracy (as discussed in
Chapter 1), in addition to providing a visual summary of the error distribution which
expresses the ratio of sensitivity to specificity at any given level of either. There are many
methods for computing this statistic; they rely on any number of questionable assumptions
(usually involving normality) and are designed for a variety of situations. Here, there
was a rather remarkable feature of the fitted models that made the estimation quite
straightforward. It was observed, after having plotted many points in ROC space for several
different statistical methods, that the estimated points could always be extremely well
approximated by a particularly simple form. This fit was defined by a parabola constrained
to pass through the origin and the point at (1,1), resulting in a model with only one degree
of freedom. What allowed this approximation to work, however, was the following
serendipitous insight. To fit the data accurately, the parabola had to be formed not by
modeling the TPF as a quadratic function of the FPF, but as a parabola rotated 45°
counterclockwise. Such a parabola lies symmetrically about the line y = 1 - x in the
unrotated coordinate system. An example is the fitted curve in Figure 1.3, which passes
nearly exactly through all three observed points. This method was observed to work just as
well when applied to the ROC curve obtained from a linear discriminant analysis: with
nearly 10,000 unique points observed on the curve (as compared with the three points in
Figure 1.3), the single-parameter parabolic fit achieved an R² of greater than 0.999!
This method of fitting was not only extremely accurate, it was also easy to carry out.
First, only one point in ROC space is required to identify the model. The observed x and y
coordinates for each such point can be rotated 45° clockwise through a simple 2×2 change
of coordinate matrix (whose elements are composed of 1/√2 and -1/√2).⁷ The transformed
x and y (call them x' and y') can then be fit via ordinary least squares to the basis

\[ y' = \beta\, x' \left( 1 - \frac{x'}{\sqrt{2}} \right), \]

with the parameter β > 0. This basis constrains the parabola to pass through the origin and
the point (1,1) in the original, unrotated ROC space. Fitting the model in the rotated space
not only simplifies the process by allowing the use of least squares, it also minimizes the
squared errors along the y = 1 - x line in ROC space. This seems to provide a more
appropriate criterion than fitting OLS in the unrotated space, since we do not wish to model
the TPF as a function of the FPF (as in the usual OLS model). Also, by forcing the curve
to be symmetric about the y = 1 - x line, one insures that the area under the curve is not
greatly affected by the behavior of the curve in parts where no data are available. For instance,
in Figure 1.3, there were no data on models having an FPF higher than 0.35, as that area of
the curve was not of great interest for the moment. However, since the unobserved half of
the curve simply mimics the observed half, one need not worry whether the principal statistic
of interest (the area under the curve) is driven by a part of the curve that is neither observed
nor interesting. Finally, this method has the added benefit that the area under the fitted ROC
curve is easily estimated as β/3 + 0.5, which can be seen by integrating the parabola over the
range (0, √2) in the rotated space.
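The sketch below illustrates this fitting procedure. The one-parameter basis used here, x'(1 - x'/√2), is inferred from the stated constraints (the curve passes through (0,0) and (1,1), and the fitted area equals β/3 + 0.5), and the three operating points are hypothetical stand-ins for fitted question sets.

```python
import numpy as np

def rotate_clockwise_45(fpf, tpf):
    """Rotate observed ROC points (FPF, TPF) 45 degrees clockwise, so the
    chance diagonal maps onto the x' axis and (1,1) maps to (sqrt(2), 0)."""
    r = np.array([[1.0,  1.0],
                  [-1.0, 1.0]]) / np.sqrt(2.0)
    xy = r @ np.vstack([fpf, tpf])
    return xy[0], xy[1]

def fit_rotated_parabola(fpf, tpf):
    """One-parameter OLS fit of y' = beta * x' * (1 - x'/sqrt(2)) in the
    rotated space; the area under the fitted ROC curve is then beta/3 + 0.5."""
    xp, yp = rotate_clockwise_45(np.asarray(fpf), np.asarray(tpf))
    basis = xp * (1.0 - xp / np.sqrt(2.0))
    beta = np.sum(basis * yp) / np.sum(basis * basis)
    return beta, beta / 3.0 + 0.5

# three hypothetical operating points (FPF, TPF)
fpf = [0.30, 0.11, 0.04]
tpf = [0.66, 0.42, 0.22]
beta, auc = fit_rotated_parabola(fpf, tpf)
print(round(beta, 3), round(auc, 3))
```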
7 For a definition of the change of coordinate matrix, see Section 2.5 of Linear Algebra by
Friedberg et al. (1989).

To estimate standard errors for this statistic, a resampling (or bootstrap) approach was
used. For example, in the case of the question set method developed in this dissertation, the
misclassification error for the models was observed for each respondent in the survey. (That
is, it was noted whether the respondent was correctly or incorrectly classified as dead or alive
by each of the three question sets in Appendix I). By repeatedly drawing samples (with
replacement) of the same size as the test set from the entire sample distribution of all such
respondents, it was possible to mimic the observed misclassification error distribution
associated with the question set method. Then one can estimate the standard error of the
statistic by observing its inherent variance as generated by the repeated resamples. (The
probability model for misclassification error is specified in section 2.5 below.)
It is important to note that special care had to be taken to preserve dependencies
between the question sets. This was because the same respondents had answered all three sets
of questions, and some questions were shared by all three sets. First, it was observed
whether each respondent was classified or misclassified with respect to each of the three
question sets, so that every respondent was associated with a triplet of errors. Then in the
resampling process, these triplets were the units actually being resampled (such that a
particular respondent's status with respect to all three sets was always held together as a
single element). In this way, the correlation between the errors in the three models was
duplicated in each bootstrap sample. The estimation of standard errors for the other methods
was nearly identical.
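The following sketch illustrates the idea of resampling whole error triplets so that the dependence between the three question sets is preserved; the error data here are simulated stand-ins, and the statistic being resampled is an arbitrary example.

```python
import numpy as np

def bootstrap_se(error_triplets, n_ts, stat, n_boot=1000, seed=0):
    """Resample whole rows (one error triplet per respondent), so that each
    respondent's status with respect to all three question sets stays
    together, and return the standard error of an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    e = np.asarray(error_triplets)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(e), size=n_ts)   # same size as the test set
        stats.append(stat(e[idx]))
    return np.std(stats, ddof=1)

# hypothetical data: 0 = correct, 1 = misclassified, one column per question set
rng = np.random.default_rng(1)
triplets = rng.integers(0, 2, size=(10_294, 3))

# e.g., the statistic could be the mean error of the second question set
se = bootstrap_se(triplets, n_ts=5_147, stat=lambda e: e[:, 1].mean())
print(se)
```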
2.3 Previous efforts to predict mortality and morbidity
There is an extensive literature on efforts to predict mortality and morbidity. A
MEDLINE search through the past five years of abstracts yielded more than 600 articles
reporting on efforts to predict death and illness, often cause-specific mortality or a particular
morbid condition such as the occurrence of cancer or heart failure. Usually, however, the
goal is to predict these events for in-hospital subjects (frequently patients in intensive care
units) or persons who have already experienced severe medical trauma of some sort.⁸ Also,
the predictors almost always require extensive physiological information that can only be
obtained by a clinician (e.g., blood or urine analyses).⁹ Some methods do involve the use
of test set validation, but not usually for model selection.¹⁰ For many other researchers, it
appeared (although this was not always clear) that the primary goal of the model was to find
causal connections, often for specific interventions.¹¹
An excellent overview of most currently used predictive models is provided by
Marshall et al. (1994). The authors consider a total of eight approaches: simple logistic
regression (a full model and a backward, stepwise deleted model), cluster analysis combined
with intuition and logistic regression, principal components followed by logistic regression,
a subjective "sickness score", a model based on Bayes' theorem, an additive model, and
finally a classification tree approach. All of these methods are compared, using a large
dataset of heart surgery patients. A test set is used to maintain an accurate measure of
8 Some recent examples are Anderson et al. (1994); Becker et al. (1995); Davis et al. (1995);
Grubb et al. (1995); Josephson et al. (1995); Marshall et al. (1994); Normand et al. (1996); Ortiz et al.
(1995); Piccirillo et al. (1996); Pritchard et al. (1995); Quintana et al. (1995); Rowan et al. (1994);
Schuchter et al. (1996); and Wong et al. (1995).
9 See Assmann et al. (1996); Becker et al. (1995); Bosch et al. (1996); Flanagan et al. (1996);
Iezzoni et al. (1996); Rowan et al. (1994); and Schucter et al. (1996).
10 Research using a test set methodology includes Anderson et al. (1995); Becker et al. (1995);
Normand et al. (1996); Ortiz et al. (1995); and Schucter et al. (1996).
11 Causal analysis was more frequently the goal for social scientists. For examples of cause-
oriented research, see Friedman et al. (1995); Huppert et al. (1995); Josephson et al. (1995); Silber et al.
(1992); and Smith and Waitzman (1994).
predictive power, and the common criterion by which all models are judged is the area under
the ROC curve. This paper serves as a natural starting point for a comparative look at
commonly-used methods for predicting mortality. Before conducting an in-depth exploration
of the literature, however, it is necessary first to define some of the statistical issues and
models, so that the existing research can be placed in perspective more clearly.
Half the strategies employed by Marshall et al. involve logistic regression, and
similarly the research overall is dominated by logistic regression in one form or another.
Straightforward logistic regression (whether the model is fitted stepwise or with another
method) is the most commonly used model.¹² Marshall et al. fit a basic regression model in
two ways. They use all available variables not having too many missing values to build a
full model, and apply a backward, stepwise deletion of variables based on statistical
significance. The goal of the stepwise deletion was to avoid the overfit inherent in building
the full model, an example of the bias-variance tradeoff mentioned in Chapter 1. The effect
of noise on the building of sets of questions was easily demonstrated via Figure 1.1 in
Chapter 1, but exactly what was meant by bias and variance was not discussed, partly
because no parameters were being estimated. For regression, however, these ideas are quite
clearly defined and shown.
2.4 Regression as a method of classification
First consider the use of the OLS regression model for prediction, but where Y is real-valued
instead of binary. Again, the goal is to build a predictor of Y based on X_1, ..., X_p, or
P(Y | X_1, ..., X_p), to be used on future, unobserved realizations of Y; here, however, the
predictor is made to be a polynomial function of the independent variables, and so the
parameters of the model are the coefficients of this regression equation. Analogous to
Equation 2.3, one can easily define a measure of accuracy for this predictor, the most
obvious being

\[ PE_{TS} = \frac{1}{N_{TS}} \sum_{i \in TS} \left( Y_i - \hat{Y}_i \right)^2, \qquad [2.7] \]

the test set squared prediction error. This is a more honest measure of predictive accuracy
than the more conventional learning set R² statistic, as it reacts to overfit in the model.

12 Quite a few of the articles in the bibliography used regression, e.g., Schucter et al. (1996) and
Iezzoni et al. (1996, 1994).
For example, suppose that, unknown to the researcher, the underlying process that
generates Y is of the form of a quadratic function of a single X and some random error, so
that

\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \epsilon_i, \qquad [2.8] \]

and suppose further that the ε_i's are independent and identically distributed with a mean of
zero and some finite variance. There are three parameters to estimate for predicting Y (each
of the β's), and for a fixed X the expected value of Y is a linear combination of the expected
values of these estimates. Likewise, the noise in Y depends on the standard errors associated
with the estimates of the β's, so the accuracy of the predictor depends on minimizing the bias
and variance in the estimates of these parameters.
Now, the researcher has some data on X and Y, and some choice of models to fit,
depending on the number of distinct observations of X. For example, if there are four
observations of Y, with the corresponding X_i's taking on four distinct values, it is possible
to fit the usual straight line or a quadratic or cubic polynomial function to the data. These
equations require two, three and four parameters respectively and we can fit no larger model,
the last being saturated. Figure 2.1 shows a simulated set of observations, with a plot of the
true underlying quadratic function along with plots of the three regression equations fit to this
data. The straight line (represented by the dotted line) clearly cannot fit the curvature
adequately, and indeed there is bias in the model since the estimation of β_2 is ignored, or
equivalently, constrained to zero. When a quadratic is fit (the dashed line) the estimated
coefficients are obviously unbiased, and likewise for the cubic polynomial. However, in the
latter case, the variance of the estimates is higher than for the quadratic fit; therefore the
variance in Y is higher. With respect to Figure 2.1, we can see that the cubic fit (represented
by the dotted and dashed line) fits the data perfectly. As a result, from fit to fit, with new
errors introduced each time, the cubic curve will fluctuate around the true, underlying
function much more than the quadratic fit, fitting the new errors each time.
So although both the quadratic fit and cubic fit will overlay the true function on
average, the quadratic fit will yield the lowest PE as defined by Equation 2.7; it provides the
optimal tradeoff between bias and variance. Moreover, the researcher can estimate the test
set PE for each fit to infer this optimality. Here, if each model is fit and tested on a
sufficiently large test set, the linear model will generate the highest prediction error, the
quadratic the lowest, and that for the cubic model will be between the two. Even in larger
collections of models, a plot of PE against model size typically reveals this hook-shaped
pattern, and the optimal model size is easily gleaned. Notice that the more usual R² is of
course highest for the cubic fit, a well-known deficiency of the index that has inspired
several modifications and fixes (e.g., adjusted R², Mallows' C_p).

[Figure 2.1 - Simulation of Y as a quadratic function of X. Triangle = observed data; solid line = true function; dotted line = linear fit; dashed line = quadratic fit; dot-dash line = cubic fit.]
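The simulation below reproduces the flavor of this example. The true quadratic, the noise level, and the design points are hypothetical choices rather than those behind Figure 2.1; the point is only that the test-set squared error (Equation 2.7) typically comes out lowest for the quadratic fit.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):                       # hypothetical quadratic process
    return 1.0 + 2.0 * x - 0.4 * x ** 2

def test_pe(degree, n_sims=2000, n_test=200):
    """Average test-set squared prediction error (Eq. 2.7) for a polynomial
    fit of the given degree to four noisy observations of the quadratic."""
    x_learn = np.array([0.0, 2.0, 4.0, 6.0])
    x_test = rng.uniform(0.0, 6.0, size=n_test)
    pe = 0.0
    for _ in range(n_sims):
        y_learn = true_f(x_learn) + rng.normal(0, 1, size=4)
        y_test = true_f(x_test) + rng.normal(0, 1, size=n_test)
        coef = np.polyfit(x_learn, y_learn, deg=degree)
        pe += np.mean((y_test - np.polyval(coef, x_test)) ** 2)
    return pe / n_sims

for d in (1, 2, 3):
    print(d, round(test_pe(d), 2))   # the quadratic (d = 2) fit is typically lowest
```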
This type of overfitting problem is evident in the paper by Marshall et al., as the
larger regression model resulted in a higher prediction error than the smaller model. The
stepwise deletion process that created the smaller, more powerful model is unlikely to choose
the best sized model, however. There is no clear relationship between the statistical
significance of the estimated coefficients and the predictive power of the overall model as
measured by a test set. Consequently, a more involved approach is used for fitting the
regression models in this dissertation. As described above, a whole series of models of all
sizes are fitted in order to determine the model size that minimizes the test set PE.
There are other problems inherent to the regression approach. First, OLS, as
demonstrated above, models a real-valued Y, but here we are only interested in the
possibilities that Y is zero or one. So of course, a link function must be used to map the real-valued
output of the linear regression equation into the (0, 1) interval. Then the researcher
must select a cutoff for the predicted value of Y (e.g., 0.5 with unit misclassification costs)
below which respondents are classified as zero, and above which they are classified as one.
There is no clear choice of link (and variance) function. The binomial model suggests the
use of the logit function (i.e., logistic regression) but there is often no a priori reason why
an inverse Gaussian (probit) or any other mapping would not be more appropriate.
Ultimately, the problem is that regression is simply not geared toward binary classification,
and so attempts to mold it into a true classifier are usually circuitous. More direct strategies
of classification are suggested by Bayes' rule, as discussed below.
Secondly, the complexity and mixture of data types in the present dataset (as well as
most social science datasets) does not accommodate an equation-fitting approach to the
problem. For example, missing data present a problem, as there is no clear way to build such
anomalies into an equation.¹³
Also, many of the variables are coded such that the values of
the predictor variables do not naturally agree with the basis for a regression equation. That
is, most of the variables are not coded in an ordinal scale. So if these categorical variables
are to be meaningfully represented in a regression equation, the model must accommodate
them with a large number of 0-1 indicator functions (i.e., dummy variables). Since the
number of candidate variables is large (some with many possible values), the size of the
potential full model is extremely large. One is faced with the same problem as encountered
by the search method above: there is no way to find, for example, the optimal regression
model built with as many as fifty or sixty variables chosen from a pool of hundreds or
thousands of candidate predictors. The sheer immensity of the numbers implied by the
combinatorics of such a search is just as daunting an obstacle as that faced by the above
search for binary splits. Moreover, the resulting regression equation would be much more
unwieldy than sets of simple questions.
Thus it is the opinion of this author (and of many statisticians) that, compared with
many other existing methods, regression does not offer the best prospects for elegant or
efficient prediction. Undoubtedly, with some judicious recoding of data types, and with the
use of a test set for the purposes of model selection (as shown below), regression models can
achieve a high degree of predictive accuracy. However, these are not tactics that can be
easily packaged into generic software; they require a fair level of statistical sophistication,
and a degree of programming agility which is not available to most. (Most social science
researchers implement automatic stepwise model selection algorithms or build models
entirely by convenience or intuition.) It is perhaps unfortunate, then, that a plurality of
research projects aimed at predicting mortality are centered around the use of the regression
model.

13 The usual methods are to ignore such cases altogether, or to replace missing values with the
means of the respective variables, neither of which utilizes any predictive information missing data may
contain.
2.5 Classification trees and Bayes' rule
A more natural approach to classification, also described by Marshall et al., is the
CART method of building classification trees, invented by Breiman et al. (1984). This
method is also used by Gilpin et al. (1983) and Henning et al. (1976) for predicting
death in heart failure patients. The idea behind classification trees is easy to understand. As
with the method developed here, the technique centers around the construction of binary
splits (i.e., yes/no questions) as a way to partition the sample space into homogeneous
regions (that is, regions containing respondents that are mostly dead, or mostly alive). A
classification tree is simply a way of organizing a series of yes/no questions or splits into
a tree form, where groups of respondents are split successively depending on previous splits.
Ultimately all final partitions (called terminal nodes) are assigned a class. Figure 2.2 shows
a hypothetical example of such a tree.

[Figure 2.2 - Example of a classification tree for risk of death: a root split on body-mass index (<= 2.1) followed by further splits, with terminal nodes labeled low or high risk.]

A respondent is started at the top, with the question,
"Is the respondent's body-mass index less than 1.77?" If the answer is "yes", the respondent
is sent to the left for the next question, while if the answer is "no", the respondent is sent to
the right to answer a different question. If sent left, the respondent is asked, "Is blood
pressure ≤ 179?", and here the respondent reaches a terminal node: the respondent is
classified as low risk if the answer is "yes", or as high risk if the answer is "no". Similarly,
if the respondent had answered "no" to the first question on body-mass index, they were then
asked the question "Is blood pressure ≤ 179?" and subsequently classified as high or low risk
based on the answer. In this way, the tree-structured questions assign a class (represented
by one and zero, just as death and survival are represented) to every respondent in the dataset
by partitioning the sample space into rectangles, just as in Figure 1.1. Thus the tree works
quite like the method developed in this dissertation, and indeed the idea of a classification
tree was the inspiration for this research.
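To make the mechanics concrete, the sketch below walks a respondent down a small tree of the kind shown in Figure 2.2; the cutoff values are placeholders chosen for illustration, not fitted quantities.

```python
def classify_risk(body_mass_index, blood_pressure):
    """Traverse a small hypothetical tree of the kind shown in Figure 2.2.
    Each respondent follows yes/no splits until reaching a terminal node;
    the cutoffs here are illustrative placeholders."""
    if body_mass_index <= 2.1:                     # root split
        return "low risk" if blood_pressure <= 179 else "high risk"
    else:
        return "low risk" if blood_pressure <= 160 else "high risk"

print(classify_risk(1.8, 150))   # low risk
print(classify_risk(2.6, 190))   # high risk
```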
How is a classification tree built? As above, the CART software starts with a set of
respondents who are known to be dead or alive. It then searches across all possible yes/no
questions to find the question that most effectively separates the dead from the living (the
GINI index is used for this). Once such a question is found, the sample is divided into two
groups according to whether they answer "yes" or "no", and this forms the very first question
(the root node) of the tree. Then for each of these two subgroups, another exhaustive search
is conducted for a question that best separates the dead from the living. However, the same
problem addressed in Chapter 1 - that of how many questions to ask - was faced in the
building of a tree. That is, how large should the tree be made? As mentioned in Chapter 1,
the answer can be obtained by examining the bias-variance tradeoff inherent in changing the
size ofthe model. So far, the bias-variance tradeoff has only been defined formally for OLS
regression, but these terms can be defined for the classification scenario as well, as shown
directly below.
At this point it is useful to state explicitly the probability model behind the CART
endeavor, as the same model underlies the approach developed in this
dissertation. A general model for predicting any number of classes is set out in detail by
Breiman et al. (1984), but this model may be unfamiliar to the reader. Since it is central to
the method developed here, a more context-specific, two-class version is briefly presented
below with some simplifications in the notation, and with a concrete example illustrated by
Figure 1.1.
Let X be the space of all p-vectors x (a particular vector x being data on p variables
X_1, ..., X_p for a single, randomly chosen respondent from the population), and let Y be the set
{0,1} (here with Y = 0 corresponding to a survivor, and Y = 1 to a decedent, as above). Now
consider the space of all couples, X × Y, and assume that for any subset (A, j), where A ⊂ X and
j ∈ Y, there is some probability Pr(A, j) that a respondent chosen at random from the
population has an x that lies in A and a value of Y equal to j.¹⁴ For example, consider Figure
1.1 again. Suppose that the number of variables is p = 2, and define X' as the 2-dimensional
Euclidean plane containing the plot itself, which contains all possible combinations of values for
body-mass index and blood pressure. Now denote the subset of X' delimited by the rectangle
in the lower right corner of the graph as A', and let j' = 1. The assumption, in concrete terms,
is that there is some probability Pr(A', j') that a respondent drawn at random from the
population subsequently dies and has values of blood pressure and body-mass index in this
region. Similarly, such a probability was assumed to exist for any subset of X × Y.

14 Of course the actual sample design was much more complicated than a simple random sample,
as only some of the respondents (the New Haven and Duke samples) were systematically random samples;
respondents from East Boston and Iowa County were chosen through community censuses. It is hoped that
these latter two samples of convenience can be considered a reasonable facsimile of a random sample from
some general (but still meaningful) population of elderly. It would have been possible to use only the New
Haven dataset for this analysis, but since this would have ignored a large amount of data drawn from
respondents outside of New Haven, it probably would have yielded a set of results even less relevant to the
national population of elderly. Chapter 3 considers the EPESE sample designs in detail. Chapter 7
discusses possibilities for applying the present method to some existing datasets based on random samples
which represent the national population of elderly.
Then if we have some predictor or classifier P(x) that deterministically assigns zero
or one to a respondent based on x (here it is a function that simply partitions X into
rectangular subsets and classifies a respondent as zero or one according to which subset of
X the respondent belongs), we can define the probability that the predictor misclassifies a
respondent as the probability that the predicted value of Y given X is zero when the
respondent actually dies, or vice-versa. So define the true prediction error of a predictor,
PE_true, as Pr(P(X) ≠ Y), the probability that the predictor misclassifies the randomly chosen
respondent. For example, if a predictor P'(x) were formed which classifies all respondents
in A' as dead and all respondents in the complement of A' as survivors, the true prediction
error would be the probability that X ∈ A' and the respondent survives, or X ∉ A' and the
respondent dies. We would like to estimate this prediction error for any given classifier
(e.g., Equation 2.3), and we would also like to construct a classifier that minimizes this error.
Define Pr(A | j) as Pr(A, j)/Pr(Y = j), the probability that X ∈ A given that Y = j. With
respect to Figure 1.1, Pr(A' | j') is the probability that a randomly chosen dead respondent
is in the lower-right rectangular region A'. Then assume that Pr(A | j) has the probability
density f_j(x), so that

\[ \Pr(A \mid j) = \int_A f_j(x)\, dx. \qquad [2.9] \]
To see this assumption via Figure 1.1, lay it flat on a table and imagine a third, vertical axis.
For respondents who died, visualize some surface over this plane with a height that is defined
by the function f_1(x) (and similarly imagine another surface for the survivors). The
interpretation of this surface is that the probability that a deceased respondent lies in the
rectangle (and likewise for survivors) is equal to the volume under the surface that lies
directly over the rectangle. An approximation to this surface can be estimated from the data
by breaking the surface into equal squares and estimating a two-dimensional histogram, as
shown in Figure 2.3 for the survivors, and Figure 2.4 for the decedents. (The corners on
these histograms have been rounded, so that each appears as a smooth surface rather than a pile
of blocks.) The volume under each histogram has been normalized to one.

[Figure 2.3 - Distribution of survivors by blood pressure and body mass index.]

[Figure 2.4 - Distribution of decedents by blood pressure and body mass index.]
Now, by the definitions and assumptions above,

\[ \Pr(P(X) = Y) = \Pr(P(X)=0 \mid Y=0)\Pr(Y=0) + \Pr(P(X)=1 \mid Y=1)\Pr(Y=1) \]
\[ = \int_{x:\,P(x)=0} f_0(x)\Pr(Y=0)\,dx + \int_{x:\,P(x)=1} f_1(x)\Pr(Y=1)\,dx \]
\[ = \int \left[ I(P(x)=0)\, f_0(x)\Pr(Y=0) + I(P(x)=1)\, f_1(x)\Pr(Y=1) \right] dx, \]

where I(·) is the indicator function that the expression in parentheses is true. For a given x,
it can be seen upon inspection that

\[ I(P(x)=0)\, f_0(x)\Pr(Y=0) + I(P(x)=1)\, f_1(x)\Pr(Y=1) \le \max_j \left[ f_j(x)\Pr(Y=j) \right], \]

where max_j[f_j(x)Pr(Y=j)] is that value of f_j(x)Pr(Y=j) obtained by choosing the j which
maximizes this expression. The above inequality is an equality if indeed P(x) is equal to this
maximizing value of j. Thus it is shown that for any classifier P(X),

\[ \Pr(P(X) = Y) \le \int \max_j \left[ f_j(x) \Pr(Y=j) \right] dx, \]
which implies that the lowest prediction error a classifier can achieve is

\[ PE_{Bayes} = 1 - \int \max_j \left[ f_j(x) \Pr(Y=j) \right] dx, \qquad [2.10] \]

known as the Bayes misclassification rate.¹⁵

15 This proof is drawn directly from Breiman et al. (1984).

Define the subset A_j of X as all x such that f_j(x)Pr(Y=j) = max_l[f_l(x)Pr(Y=l)], for j
= 0, 1. Now suppose we consider the following rule for a classifier: classify all x in A_j as j.
That is, classify x as that j for which f_j(x)Pr(Y=j) is maximized, a maximum-likelihood
strategy known as Bayes' rule, or P^Bayes(x). As demonstrated above,

\[ \Pr(P^{Bayes}(X) = Y) = \int \max_j \left[ f_j(x) \Pr(Y=j) \right] dx, \]

so Bayes' rule achieves the optimal Bayes misclassification rate.
In relation to Figure 1.1, this rule suggests that for any x, we can estimate the product
of the height of the surface f_0(x) at x and the probability that the respondent survives. Then
just compare this with the estimated product of the height of f_1(x) and the probability that the
respondent dies. Bayes' rule simply classifies the respondent as alive or dead according to
which product is greater.
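The sketch below carries out this comparison on simulated stand-in data, using crude two-dimensional histograms as the density estimates (in the spirit of Figures 2.3 and 2.4); the group means, spreads, and prior probability of death are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# simulated stand-ins for the two groups: columns are (blood pressure, body mass index)
survivors = rng.normal([135.0, 26.0], [18.0, 4.0], size=(8_600, 2))
decedents = rng.normal([150.0, 23.0], [22.0, 5.0], size=(1_400, 2))

# crude density estimates f0 and f1: 2-D histograms, each normalized to volume 1
edges = (np.linspace(80, 220, 15), np.linspace(10, 40, 15))
f0, _, _ = np.histogram2d(survivors[:, 0], survivors[:, 1], bins=edges, density=True)
f1, _, _ = np.histogram2d(decedents[:, 0], decedents[:, 1], bins=edges, density=True)

def bayes_classify(bp, bmi, p_death=0.14):
    """Classify a point as dead (1) when f1(x)Pr(Y=1) exceeds f0(x)Pr(Y=0)."""
    i = np.searchsorted(edges[0], bp) - 1
    j = np.searchsorted(edges[1], bmi) - 1
    return int(f1[i, j] * p_death > f0[i, j] * (1.0 - p_death))

print(bayes_classify(175, 19))   # toward the high-pressure, low-BMI corner
print(bayes_classify(130, 27))   # near the survivors' mode
```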
Suppose one fixes Pr(Y=0) = Pr(Y=1) = 0.5. Then one simply classifies x according
to which is greater, f_0(x) or f_1(x). For example, consider the approximations to these
functions as pictured in Figures 2.3 and 2.4. Figure 2.5 shows a contour plot of the ratio of
the approximated functions, f_1(x)/f_0(x), at each point in the 2-dimensional space X'. Under
Bayes' rule, one classifies all x in those regions where this ratio is less than one as survivors,
and all x in those regions where the ratio is greater than one are classified as deceased.

[Figure 2.5 - Contour plot of the ratio of survivors' distribution to decedents' distribution, by systolic blood pressure (mm of Hg) and body mass index.]

One
can see from the contour plot that the area where the ratio is greater than one consists mainly
of the area on the bottom right and center, along with an area on the left lower part of the plot
(excluding the bottom left corner itself), and an "island" on the left side at a body mass of
about 3.0. Now suppose that Pr(Y=1) = 0.333, and therefore that Pr(Y=0) = 0.666. Then
Bayes' rule classifies all x in those regions where the ratio f_1(x)/f_0(x) is greater than 2 as dead. This is primarily
the region in the lower, right corner of the plot, almost exactly that region delineated in
Figure 1.1. There also seems to be a high-risk area on the far left side of the graph, and
toward the bottom center, but the number of respondents in these areas is quite small. If
Pr(Y=1) = 0.25, so that Pr(Y=0) = 0.75, Bayes' rule classifies all x in those regions where
the ratio is greater than 3 as dead. Judging from Figure 2.5, this area consists of the combined small
triangular regions on the far, right-hand edge of the plot, the far left edge, and the bottom,
right edge. These areas contain even fewer respondents.
Of course, estimating Pr(Y = j) is easy here. It is simply the proportion of class j in
the sample; the problem centers around the estimation of f_j(x). For example, linear
discriminant analysis tries to accomplish this by assuming f_0(x) and f_1(x) are normal
distributions with a common variance-covariance matrix; kernel density estimation fits them
nonparametrically.
Consider again the problem of deciding how large a classification tree should be.
With unit misclassification costs, the tree divides the learning sample space into disjoint
rectangles, and classifies each rectangle as zero or one according to whether most of the
respondents in the rectangle are alive or dead. With respect to Figure 1.1 then, we would
classify respondents in the lower right rectangle as dead only if more than 50% of these
respondents were observed to be dead, else we would classify them as survivors. The larger
the tree is (i.e., the more questions asked), the more of these rectangles it defines, and the
more the number of respondents in each rectangle diminishes. Denote the L disjoint
rectangles defined by a tree as S_1, ..., S_L, and let y_l, for l = 1, ..., L, represent the class (zero
or one) assigned to the lth partition (equal to the observed class majority in S_l). The
misclassification error for such a tree (call the predictor P^tree(x), and its error PE^tree) may be
expressed as

\[ PE^{tree} = \Pr(P^{tree}(X) \neq Y) = 1 - \sum_{l} \Pr(X \in S_l,\, Y = y_l) = 1 - \sum_{l} \left[ I(y_l = 0)\Pr(X \in S_l,\, Y = 0) + I(y_l = 1)\Pr(X \in S_l,\, Y = 1) \right], \]

where I(·) is again the indicator function. Let y_l^true be the true class majority in S_l, which is
equal to the j maximizing Pr(Y = j | X ∈ S_l). The above expression can then be written as

\[ PE^{tree} = \Pr(P^{tree}(X) \neq Y) = 1 - \sum_{l} \max_j \Pr(X \in S_l,\, Y = j) + \sum_{l} I(y_l \neq y_l^{true}) \left| \Pr(X \in S_l,\, Y = 0) - \Pr(X \in S_l,\, Y = 1) \right|. \qquad [2.11] \]

Note that the first part of the right side of this equation, or

\[ PE_1^{tree} = 1 - \sum_{l} \max_j \Pr(X \in S_l,\, Y = j), \qquad [2.12] \]
is very similar to the Bayes misclassification rate as defined by Equation 2.10. In fact, as
Breiman et al. describe it, Equation 2.12 "forms an approximation to the Bayes rate
constructed by averaging the densities over the rectangles". They then define the bias of the
tree classifier as PE_1^tree - PE_Bayes, and go on to argue that as L increases (implying that the
rectangles S_1, ..., S_L become not only more numerous but smaller as well) this bias decreases
rapidly for small L, falls more gradually as L grows, and that it eventually approaches zero.¹⁶
This is not a particularly intuitive result, but one can see how, as the rectangles become
smaller, the approximation to the density on each rectangle is more likely to equal the true
density on average, but with more instability since fewer observations are used in the
estimate (not unlike the regression example above, where increasing the number of
coefficients in the equation allowed the fitted line to fit the underlying function more easily
on average, while simultaneously causing the fitted function to "bounce around" more
severely).
To see the increased error due to this type of noise, examine the second part of the
right side of Equation 2.11, or

\[ PE_2^{tree} = \sum_{l} I(y_l \neq y_l^{true}) \left| \Pr(X \in S_l,\, Y = 0) - \Pr(X \in S_l,\, Y = 1) \right|, \qquad [2.13] \]
which is described by Breiman et al. as a "variance-like" term. The contribution of this term
to the error can be more intuitively understood. Note that, because of the indicator function,
this type of error is caused only by those rectangles for which y_l ≠ y_l^true; that is, only the
rectangles that are misclassified (where the observed class majority differs from the true class
majority) contribute any error. For Figure 1.1 then, the lower right rectangle would only
contribute error to this term if more than 50% of the persons in this rectangle were observed
to die, while in fact the true proportion of deaths in this region was less than 50% (or vice
versa). It can be seen intuitively that as the observed number of respondents in this rectangle
grows larger (and to the degree that the class proportions differ from 50%), this type of
misclassification is much less likely to occur. Thus, as the number of rectangles in a tree
increases (implying that the rectangles contain fewer respondents), the variance-like error
term becomes more prominent, gradually increasing, while the bias error term goes to zero.

16 This type of bias should not be confused with the downward bias in the prediction error
discussed in the first chapter, which is defined by the difference between the true prediction error of a
question set and the expected value of the test set estimate of prediction error.
However, the true error of the classifier is unknown, so it is not obvious what number
of rectangles provides the optimal tradeoff between bias and variance. The problem is that
the learning set misclassification error for a particular predictor is not an honest estimate; it
is always possible to achieve a reduction in this estimate of the prediction error by adding
another rectangle (at least until there are so many rectangles that every rectangle contains
only one respondent). What is required, once the classifier has been constructed, is a large
number of additional, independent observations drawn from the same distribution, so that
the probability of misclassification can be estimated directly as the proportion of
misclassified respondents in the new sample.
It is for this reason that a test set sample is selected out of the original data set with
a simple random sample and set aside while the classifier is constructed with the learning set.
The test set respondents mimic a sample of respondents drawn independently from the same
distribution (although technically the two datasets are not independent), and so the error can
be estimated more honestly. Then the bias-variance tradeoff can be seen explicitly by
applying a sequence of variously-sized models to the test set, as in Figure 1.2; invariably, the
prediction error is seen to fall rapidly as the model first enlarges (indicating the decrease in
bias). It then bottoms out and increases gradually as the model size becomes too large
(suggesting that the variance-like error term is rising). Thus the optimal model size is
gleaned (or a "standard error rule" may be used to determine the minimum size), and the test
set estimate of prediction error for this model is used as the estimate of the model's accuracy.
Since the test set is being used both to select the model size and to estimate the error, this
estimate of the error is not completely unbiased (i.e., E(PE_TS) ≠ PE_true). However, as
Breiman et al. have demonstrated, extensive computer simulations of the CART method of
classification suggest that this bias is small if the learning set is large and the optimal model
is chosen from a small sequence of nested submodels.
The method of model selection (but not the method of model building) used for the
models developed in this dissertation is nearly identical to the CART method. Consider
Figure 2.6, which shows a graphical depiction of the present method (in contrast to Figure
2.2, which depicts the tree method.) Again, one starts with a large set of binary splits that
divide X into "rectangular" regions. Each column of questions in Figure 2.6, for example,
defines a particular rectangular region (actually a cube). The only difference between these
regions and those defined by a tree's terminal nodes is that here they are not defined
disjointly, as they are in CART. However, there always exists such a disjoint set of
rectangles (denoted as S_1, ..., S_L above) corresponding to the set of regions defined by the
questions, since here any area that is in the union of two overlapping rectangles is always
classified as high risk. To see that this is so, note that it is always possible to convert the
question sets developed here into tree form, e.g., in Figure 2.6 by making the second and
third columns of questions into branches along the questions in the first column of questions.
[Figure 2.6 - Example of question sets combined with OR: each column of "If ... ≤ ..." questions defines a high-risk subgroup, and the columns are joined by OR.]
As the number of questions in each column increases, and as the total number of columns
increases, so does the number of disjoint regions (L) required to define the same partition.
Moreover, it is generally possible to define the regions S_1, ..., S_L as part of a nested
subsequence of such partitions, in much the same way a subsequence of nested tree models
defines them. Thus, the method of applying backward deletion to the questions to obtain a
subsequence of differently-sized partitions (described in greater detail in Chapter 4) is very
much like CART's pruning to obtain a set of smaller and smaller tree-based partitions.
Therefore, it is argued that the bias in the test-set estimates of prediction error attached to
these partitions is probably small. (Again, the principal difference between the two methods
lies not in the model selection or estimation of error, but in the building of the "full" model,
which was completed by use of the learning set only.)
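A schematic of the model-size selection shared by the two methods is sketched below; the nested submodels and their test-set predictions are hypothetical stand-ins, and the point is only that the submodel with the smallest test-set error is the one retained.

```python
import numpy as np

def choose_model_size(nested_predictions, y_ts):
    """Given test-set predictions from a nested sequence of submodels
    (index 0 = smallest), return the index of the submodel with the lowest
    test-set error, mimicking the backward-deletion selection described above."""
    errors = [np.mean(np.abs(np.asarray(y_ts) - np.asarray(p)))
              for p in nested_predictions]
    best = int(np.argmin(errors))
    return best, errors

# hypothetical example: test-set outcomes and three nested submodels'
# predictions (1, 2 and 3 questions retained after pruning)
y_ts = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 0])
preds = [
    np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),   # 1 question
    np.array([0, 1, 0, 0, 0, 0, 1, 1, 0, 0]),   # 2 questions
    np.array([0, 1, 1, 0, 0, 1, 1, 1, 0, 0]),   # 3 questions
]
best, errors = choose_model_size(preds, y_ts)
print(best, errors)   # typically a hook-shaped error curve; pick its minimum
```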
To see the model selection explicitly, consider again the contour map in Figure 2.5.
Suppose one assigns a cost of C = 6.1 to the misclassification of deceased respondents; this
is equal to the ratio of survivors to decedents, so the model is equivalent to assuming that
Pr(Y=0) = Pr(Y=1) = 0.5 in the unit cost scenario (as discussed above). Thus, Bayes' rule
classifies as dead all x for which f_1(x)/f_0(x) > 1, an area that can be approximated by the area
in Figure 2.5 where the contour is above one. Using the algorithm in Appendix V, a learning
set was drawn at random, a full model with nine questions was constructed, and this model was
pruned to obtain a sequence of submodels. This sequence of submodels was then applied to
the test set, and the test set error associated with each model was estimated. The lowest error
was achieved when three questions were used.
Next, the data were recombined, and the largest full model with nine questions was
constructed with the search algorithm. The classification achieved by this model is shown
graphically in Figure 2.7; the respondents inside the regions on the bottom and right side of
the plot are classified as high risk. When this model was pruned down to three questions,
the classification shown in Figure 2.8 was achieved. As can be seen, the simpler
classification covers nearly the same area of the plot but with fewer questions. According
to the test set estimates, it would probably have a lower prediction error than the larger model
when applied to a new sample from the same distribution.

[Figure 2.7 - Division of contour plot, 9-question model, by systolic blood pressure (mm of Hg) and body mass index.]

[Figure 2.8 - Division of contour plot, 3-question model, by systolic blood pressure (mm of Hg) and body mass index.]
2.6 Other methods of classification
Other commonly used methods of classification, as mentioned above, are linear
discriminant analysis, kernel density estimation, and cluster analysis. All these methods can
be related to Bayes' rule as a guide for classification based on estimates of the densities f_0(x)
and f_1(x), defined above. In the two-class problem, Fisher's method of linear discriminant
analysis assumes they are normal with mean vectors x̄_0 and x̄_1 and a common variance-covariance
matrix Σ. Then a p-vector of discriminant coordinates can be calculated via

\[ \hat{\beta} = \hat{\Sigma}^{-1} \left( \bar{x}_1 - \bar{x}_0 \right), \]

which can be interpreted as the linear combination of the p variables X_1, ..., X_p that maximizes
the ratio of the variance between the classes to the variance within the classes.¹⁷

17 For more extensive discussions of this technique, see Gnanadesikan (1977) or Mardia et al.
(1979). More modern versions of discriminant analysis (and other associated methods of classification) are
presented by Hastie and Tibshirani (1994, 1996), and Hastie et al. (1995).

To use this linear combination to classify a given observed vector of data x, one first computes the value
of this function at x, i.e.,

\[ z = x'\hat{\beta}, \]
so that z is a scalar, representing the projection of the vector x from p-space onto ℝ¹. Then
the respondent is typically classified as zero or one according to whether z is closer to the
projection of x̄_0 (equal to x̄_0'β̂) or the projection of x̄_1 (equal to x̄_1'β̂).
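The sketch below computes these quantities on simulated stand-in data: the pooled-covariance discriminant coordinates, the projected group means, and the nearest-projected-mean classification rule. It is not the 15-variable model fitted in this dissertation.

```python
import numpy as np

rng = np.random.default_rng(4)

# simulated stand-in data: p binary predictors for survivors (Y=0) and deaths (Y=1)
p = 15
x0 = rng.binomial(1, 0.25, size=(8_600, p)).astype(float)
x1 = rng.binomial(1, 0.45, size=(1_400, p)).astype(float)

# Fisher's linear discriminant: beta_hat = pooled-covariance^{-1} (mean1 - mean0)
m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
s_pooled = ((len(x0) - 1) * np.cov(x0, rowvar=False) +
            (len(x1) - 1) * np.cov(x1, rowvar=False)) / (len(x0) + len(x1) - 2)
beta = np.linalg.solve(s_pooled, m1 - m0)

# classify by whichever projected group mean the score z = x'beta is closer to
z0, z1 = m0 @ beta, m1 @ beta

def classify(x):
    z = np.asarray(x) @ beta
    return int(abs(z - z1) < abs(z - z0))

x_new = rng.binomial(1, 0.45, size=p).astype(float)
print(classify(x_new))
```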
For a nonstatistician, the interpretation may require some illustration. Consider
Figure 2.9, which shows two "rounded" histograms. Instead of showing rectangular bars,
the centers of the top of each bar are connected with lines, cutting or "rounding" the corners
of the bars to obtain a more natural-looking distribution. One is a histogram of survivors,
the other a histogram of deaths; both are scaled to an area of one. The x-axis is scaled from
zero to 1,000, and corresponds to the range of z scores for all 10,294 respondents as
calculated by a linear discriminant analysis of 15 variables (see Appendix V). (The method
for determining the model size, estimating β̂, and calculating these scores is discussed in
Chapter 4, and the results are presented in great detail in Chapter 5.) The height of each of
these distributions may be thought of as a crude estimate of each of the densities f_0(x) and
f_1(x). Here the X space is the 15-dimensional space corresponding to the space of possible
answers to the 15 survey questions used in the model (called X¹⁵). Specifically, the height
of the histogram at z estimates the height of f_0(x) and f_1(x) at all those points x in X¹⁵ such that
the Fisherian linear combination of x (or x'β̂) is equal to z. For example, let z = 500, and let
A_500+ be that subset of X¹⁵ such that x'β̂ > 500; that is, let A_500+ be that portion of the space of
question answers such that the z score computed from these answers is greater than 500.
Then Pr(z > 500) for the surviving respondents is equal to
\[ \Pr(X \in A_{500+} \mid Y = 0) = \int_{A_{500+}} f_0(x)\, dx, \]

and likewise for f_1(x) and the dead respondents. By Figure 2.9, these probabilities can be
seen as estimated by the areas under the histograms to the right of 500, equal to 6.6% and
34% for survivors and decedents respectively.

[Figure 2.9 - Distribution of deaths and survivors by z index (discriminant variable); the area under each curve is scaled to 1, with reference points marked at z = 289, 369, 449, and 625.]

The projections of the means onto z-space can
be seen at z equal to 289 and 449. Thus, according to the usual classification rule, a
respondent at x is classified as dead if z is closer to 449 than 289; that is, x is classified as
zero if z < (449+289)/2 = 369, and as one if z > 369. Since the deaths and the survivors are
assumed to be normally distributed with equal variance-covariance matrices, it follows that
the two distributions of z-scores should be normal with equal variances. This implies that the
midpoint between the two projected means is also that z at which f_0(x) is equal to f_1(x), and
that for z < 369, f_0(x) > f_1(x), and for z > 369, f_0(x) < f_1(x). Of course, it is obvious from
Figure 2.9 that the two groups are probably not distributed with equal variances, nor are
the distributions perfectly normal. Nonetheless, 369 does seem to be quite close to that value
of z at which f_0(x) is equal to f_1(x). This seems to be a decent rule for distinguishing
between the groups, as 77% of survivors lie to the left of this cutoff, and 65% of deaths lie
to the right.
However, recall that Bayes' rule suggests that the class assignment with minimal
error is the class j which maximizes f_j(x)Pr(Y = j). This classification rule seems to be
classifying x as the class j that maximizes f_j(x). The problem is that this assignment rule
assumes that Pr(Y=1) = Pr(Y=0); here, it assumes there are an equal number of survivors
and deaths in the population, and under this assumption the rules are equivalent. However,
the data suggest that these proportions are closer to 86% and 14%, a ratio of 86/14 = 6.14.
Thus Bayes' rule suggests that x should be classified as one when the ratio f_1(x)/f_0(x) is
greater than 6.14, which seems to correspond to the region where z > 625 or so. This is also
observed to be the lowest score at which the death rate exceeds 50%. Thus, Bayes' rule for
this scheme can be seen to be equivalent to the rule of classification used in the method
above with unit costs, where the rule is to classify a rectangular region of the space as "dead"
when the proportion of deaths in that region is higher than 50%, and as "alive" otherwise.
Note that the costs of misclassification can easily be incorporated into Bayes' rule by
scaling up the estimate of Pr(Y = 1) directly in proportion to the cost of misclassifying a
death as a survivor. Thus, assuming the defined relative misclassification cost of five is
mathematically equivalent to assuming that deaths in the data were undersampled by a factor
of five. So the cost can be incorporated directly into the error scheme simply by weighting
the dead respondents by a factor of five. (To see this via Equation 2.2, note that we obtain
the same error by replicating the data on each dead respondent five times over and applying
Equation 2.1.) For the observed proportions of 14% deaths and 86% survivors, then, Bayes'
rule corresponds to the simple linear discriminant rule of classifying a respondent according
to the relative distances to the projected means (i.e., according to the cutoff of that z where
f0(x) is equal to f1(x)) when the relative cost of misclassification is set at 6.14 (suggesting the
cutoff at z = 369). It appears that the risk of death (proportional to the ratio f1(x)/f0(x)) is
mostly monotonically increasing as a function of the discriminant score, as it should be under
the normality assumption. So scaling this cost between one and 6.14 is roughly equivalent
to scaling the z cutoff between 625 and 369. Then the same tradeoff of sensitivity and
specificity is observed as when scaling the cost of misclassification in the question scheme.
As such, the area under the ROC curve can be estimated directly by summing up the area
under the sequence of estimated TPF by FPF coordinates obtained by setting the cutoff at
each observed quantile of z. (The area under the ROC curve for this particular model was
estimated at 77.4% on the test set. However, since much knowledge about the test dataset
was used when recoding categorical variables and treating missing data, this is probably
an optimistic estimate.)
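As an illustration of this calculation, the following sketch (not the author's code; the function and variable names are hypothetical) accumulates the area under the (FPF, TPF) points produced by sweeping the cutoff across the observed values of z, using the trapezoid rule:

import numpy as np

def roc_area_from_scores(z, died):
    # z:    array of discriminant scores, one per respondent in the test set
    # died: array of 0/1 outcomes (1 = died within three years)
    z = np.asarray(z, dtype=float)
    died = np.asarray(died, dtype=int)
    cutoffs = np.unique(z)  # sweep the cutoff over every observed score
    # For a cutoff c, classify "dead" when z > c.
    tpf = [np.mean(z[died == 1] > c) for c in cutoffs]  # sensitivity
    fpf = [np.mean(z[died == 0] > c) for c in cutoffs]  # 1 - specificity
    # Add the endpoint where everyone is classified as dead (FPF = TPF = 1).
    tpf = [1.0] + tpf
    fpf = [1.0] + fpf
    # Trapezoid rule; the points run from (1, 1) down toward (0, 0).
    area = 0.0
    for i in range(1, len(fpf)):
        area += 0.5 * (tpf[i] + tpf[i - 1]) * (fpf[i - 1] - fpf[i])
    return area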
In this way, a connection between the linear discriminant analysis and the question
method can be established. For example, suppose Question Set B was converted to a linear
model. One way to do so is to form three indicator random variables, each corresponding
to one of the three question subsets B.1, B.2 or B.3. For example, to obtain an indicator
random variable for question set B.1, first form two indicators corresponding to the two
questions: let I_B1.1 be zero if the respondent does not answer question B.1.1 in bold (i.e., if
they can walk half a mile), and let it equal one if the answer is bold. Define I_B1.2 similarly
for the second question (one if the respondent uses digitalis, zero if not), and define the
indicator I_B1 = I_B1.1 · I_B1.2, the product of the two question indicators. Then define I_B2
similarly as I_B2.1 · I_B2.2 · I_B2.3, and likewise define I_B3 as I_B3.1 · I_B3.2. Now for each respondent,
arrange the vector x_B as (I_B1, I_B2, I_B3), and for all respondents grouped together, estimate a
common variance-covariance matrix Σ̂. Then, as above, one can calculate the linear
discriminant coordinates β̂_B and estimate a discriminant score z_B for each respondent via the
vector product x_B'β̂_B. The question set strategy of classifying a respondent as dead when all
the answers to any one of the three subsets are in bold is equivalent to classifying x_B as zero
when z = 0, and as one when z > 0, since any respondent classified as alive by the question
strategy has x_B = (0,0,0). This crude split attempts to approximate Bayes' rule with a relative
misclassification cost of five. (This assumes the components of β̂_B are all positive; since the vector is only
determined up to a multiplicative constant, the direction of the signs of β̂_B is arbitrary; it only
matters whether they are all in the same direction, which was certainly true here.) Thus the
two seemingly disparate strategies are doing something quite similar, once a common space
is defined. Both classify as dead those regions of the space where the estimated ratio
f1(x)/f0(x) is higher than the ratio 86%/(c · 14%), where c is the cost of misclassifying a death
relative to misclassifying a survivor. The main differences between the methods lie in how
the space is chosen and how the partition of regions is formed.
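To make the construction concrete, the following sketch (illustrative only; the question labels and the 0/1 coding are hypothetical stand-ins for the actual Set B items) builds the three subset indicators and computes a Fisher linear discriminant score from them:

import numpy as np

def set_b_indicators(bold):
    # bold maps a question label (e.g. 'B1.1') to 1 if the respondent gave
    # the bold (high-risk) answer and 0 otherwise.
    i_b1 = bold['B1.1'] * bold['B1.2']
    i_b2 = bold['B2.1'] * bold['B2.2'] * bold['B2.3']
    i_b3 = bold['B3.1'] * bold['B3.2']
    return np.array([i_b1, i_b2, i_b3], dtype=float)

def fisher_scores(X, y):
    # X: n-by-3 matrix of subset indicators; y: 0/1 outcomes (1 = died).
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    # Pooled (common) variance-covariance matrix.
    S = ((n0 - 1) * np.cov(X0, rowvar=False) +
         (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    S += 1e-8 * np.eye(X.shape[1])  # guard against a singular matrix
    beta = np.linalg.solve(S, X1.mean(axis=0) - X0.mean(axis=0))
    return X @ beta  # one discriminant score per respondent

Classifying a respondent as dead whenever any subset indicator equals one then corresponds, as noted above, to splitting these scores at zero (provided the fitted coefficients all point in the same direction).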
2.7 Existing applications of the models
Here the literature on mortality and morbidity prediction was grouped into two very
broad categories: 1) true prediction research, typically with in-hospital patients or for risk-
adjustment to conduct quality assessments of hospitals; and 2) causal or substantive analyses that
interpret predictive models substantively. Given the theoretically-based discussion above,
this literature can be analyzed on the basis of how well the researchers dealt with the issue
of model fitting from a bias/variance perspective. Unfortunately, the use of test set methods
to deal with this issue is not common, and in practice very little is said about the
bias/variance tradeoff at all.
Consider again the Marshall et al. paper discussed in section 2.3, which is an example
of the first type of research. The purpose of their research is to develop prognostic tools to
assist physicians in medical decision making. The subjects of this study were 12,712 patients
who underwent coronary artery bypass grafting between 1987 and 1990, and the outcome
was post-operative death within 30 days after the operation. The variables in their data
included age, body surface area, the usage of various medications (including Digoxin),
previous heart failure, diabetes, cerebrovascular disease, angina, hypertension, and many
other ailments. The authors compared a handful of various statistical methods (listed above),
and a test set was used to estimate the misclassification error associated with each. As
mentioned, there was clear overfitting in the larger models, as the test set accuracy of the full
logistic regression model (containing some 33 variables) was lower than the accuracy of the
abbreviated model with six variables (chosen via backwards deletion with the AIC criterion);
the area under the ROC curve was estimated as 71.0% for the smaller model, and 69.4% for
the larger model (with missing values excluded). They also found that cluster analysis, in
combination with logistic regression, formed the most powerful model (area under the ROC
curve = 71.1%). The classification trees, however, did not perform nearly as well as the
other methods, with an area under the ROC curve of 65.5% (which was the lowest of all
eight methods examined). In all cases, the learning set error was observed to be somewhat
lower than the test set error estimates, indicating the usual overfitting problems.
Unfortunately, apart from the classification tree method and the two different-sized
regression models, there were no other attempts to select model sizes based on test set error
estimates. Thus, it is not clear to what extent overfitting may have been a problem in
selection of the model sizes for the other methods examined.
Another recent example of excellent methodology is observed in an article by
Normand et al. (1996). The authors developed models for predicting in-hospital mortality
within 30 days of admission in more than 14,000 Medicare patients admitted with acute
myocardial infarction. The patients had been admitted to all acute care hospitals
in Alabama, Connecticut, or Iowa. The purpose of the research was the development of risk-
evaluation models for hospital quality evaluation. This is an increasingly important area of
research. For obvious reasons, researchers would like to be able to assess the ability of a
hospital to deliver health care effectively. One important, unambiguous outcome by which
a hospital's quality may be measured is the mortality of its patients. However, one must
adjust for the fact that different hospitals (with different geographic locations) serve very
different populations. Patients admitted to one hospital may be at much great risk than
patients admitted to another, prior to any hospital contact at all. Thus to compare mortality
rates across hospitals fairly, it is necessary to adjust for these differences in risk-at-admission.
The process of doing so can be expensive, and may require the gathering of data on many
variables. Creating efficient predictive models that can accurately adjust for patient risk-
at-admission without requiring a large number of variables would therefore be a highly cost-effective
solution.
These researchers used a learning sample of 10,936 patients, and a test sample of
3,645 patients to estimate model error. They started with a large number of candidate
predictor variables, including age, heart failure, cancer, functional status, laboratory
measurements (e.g., albumin, creatinine, serum urea nitrogen), diagnostic test results, and
anatomic location of the myocardial infarct. Using stepwise logistic regression on a learning
sample in conjunction with a rather complicated method of variable selection to minimize
overfit, a model was developed with some 30 variables. However, the important step in
assessing the accuracy of the model was the validation of the equation on a test sample,
wherein the area under the ROC curve was estimated as 78% (compared with 79% in the
learning sample). Therefore, the researchers could be sure that their model was not
dominated by variance error.
Davis et al. (1995), another example of good methodology, examined in-hospital
mortality for roughly 2,000 patients admitted to Beth Israel Hospital in Boston during 1987
and 1992 for pneumonia or cerebrovascular disease. The purpose was again risk-adjustment
for hospital evaluation. Variables used in their models included laboratory test results (e.g.,
blood urea nitrogen, white blood cell counts), chronic conditions, and functional status.
These authors also used a test set to validate the model error estimates, as well as using it to
refine the models built with the learning set. They used CART on the learning set to select
cutoff levels for some dichotomous variables, and used variables identified by CART in
addition to other variables to build a full logistic regression model with forward stepwise
selection. They then applied the full model to the test set with backward deletion, and
selected out variables that were not significant on test set estimated p-values. They also
computed the area under the ROC curve using the test set, as well as computing R² for some
models. The most interesting finding of the final models, despite the very different dataset,
was that the functionality variables (very similar to the ADL's presented in the question sets
here) were the best predictors of mortality, superior even to most of the laboratory results!
This sort of result suggests that the question sets presented in Appendix I might also serve
to classify in-hospital patients with some fair degree of accuracy.
These models are not new, and less sophisticated versions have existed for
more than a decade. One frequently discussed model for assessing patient health is the
APACHE (Acute Physiology and Chronic Health Evaluation) index, of which there are
several variations. One recent example is seen in a paper by Iezzoni et al. (1992) which
compared this method with a number of other commonly used, preconstructed models
(including a model simulating that used by the Health Care Financing Administration, and
the MedisGroup scoring system). These researchers used a cross-validated R² estimate
(which applies a modified, more complicated version of the test set procedure) to assess the
accuracy of models honestly, but the variables in these models were not necessarily selected
on any such basis; rather, they are typically built more through medical knowledge and
intuition. They found that these models typically performed quite poorly when compared
with models built empirically through standard statistical methods (i.e., stepwise logistic
regression).
Many researchers did not resort to any sort of test set estimation, but instead relied
on either intuition, traditional model selection techniques (stepwise logistic regression) or
other more complicated schemes relying on standard statistical tests. Iezzoni et al. (1994)
developed a logistic regression model to predict in-hospital death using 1988 California
hospital discharge data from the Office of Statewide Health Planning and Development,
which included nearly two million admissions. With such a large N, statistical significance
is easy to achieve even for small coefficients, so model selection was not an issue; thus no
test set estimation was required! Grubb et al. (1995) predicted in-hospital mortality after 346
cardiac arrests at a hospital in Edinburgh. In more typical fashion, no test set estimates are
used, and model building is completed via stepwise logistic regression.
The second category of literature considered here includes "substantive" analyses of
mortality, which serve many purposes. From a statistical viewpoint, there is no difference
in the actual models used (e.g., a logistic regression equation is estimated no
differently); rather, how and why the variables are chosen is the defining
difference. The typical mode is to use either theory or intuition to start with a full regression
model and then apply standard stepwise regression techniques without estimating model
error on a test set, in which case the models are typically overfit. The problem is that
interpreting such models is often a convoluted exercise, since frequently some regression
coefficients do not reflect the currently championed theoretical paradigms.
For a recent, representative example of theory-based mortality modeling, see Smith
and Waitzman (1994). The authors were interested in the interaction effect of poverty and
marital status on mortality; in particular, it was hypothesized that the effect of being both
unmarried and poor had a greater impact on the risk of death than the sum of the effects of
each single variable. Using the NHANES (National Health and Nutrition Examination
Survey, conducted in 1971, N = 20,279) and the epidemiological follow-up survey (NHEFS)
of respondents in 1982-1984, the researchers analyzed 25-74 year-olds with death (1,935
traced) as the outcome measure. In addition to poverty and marital status, they observed age,
sex, race, smoking status, physical activity, serum cholesterol level, body mass index, and
hypertension.
Here, rather than the more common logistic regression, the authors use the
proportional hazards regression model to estimate the effects of these variables (which treats
events and exposure explicitly, thus dealing with the censoring issues mentioned in Chapter
1). Here, one is not necessarily interested in estimating predictive power. Instead, the
authors wished to test the interaction hypothesis by estimating the coefficient on this variable
when the potential confounders above are controlled. Smith and Waitzman, for instance,
conclude that there were indeed additional risks for nonelderly men subject to both
nonmarriage and poverty, but not for women or the elderly, on the basis that these coefficients
were above a threshold. The problem, however, is that of course one can never control for the
many unobserved factors that must influence mortality, and these variables are in all
likelihood highly correlated with the independent variables considered in the model. The
authors argue that by controlling for the above list of potential confounders, they have
accurately estimated the direct effect of the independent variables involved; but one can easily
imagine that alcohol consumption, or access to health care, say (neither of which were built
into the model) is correlated with both poverty and mortality; by ignoring the potential effect
of these variables, one usually obtains a biased estimate of the coefficient on poverty. This
can lead to serious misinterpretations when coefficients are only taken at face value.
Furthermore, the authors (as do many researchers) often assess the importance of predictor
variables purely in terms of test statistics and p-values (usually on the basis of whether the
latter are smaller than 0.05). These measures are intrinsically driven by sample size and the
number of variables in the model, so they do not provide any real measure of a variable's
social or human significance. It is unfortunate that so many researchers hinge their
conclusions on such numbers.
A large number of "substantive" studies of mortality models are not necessarily
interested in direct causal connections, but in the general process-related or systematic
relationship between various variables and death (often from a descriptive or exploratory
perspective). One of the most well studied groups of variables in recent years has been the
various measures of functionality in elderly persons, in search of a description of how these
variables relate to mortality and morbidity. Reuben et al. (1992) studied a group of 282
elderly patients of UCLA faculty physicians by following them for 51 months and measuring
both death and functional status.
In addition to a battery of status questions measuring ADL's and IADL's, mental
health, social activity and self-assessed health perceptions, the researchers observed age,
gender, marital status, race, living arrangements, employment status, bed days, reduced
activity and other items. The model was fit with the standard stepwise backward deletion,
logistic regression methods, without estimating test set error. They too found that the best
predictors of mortality were the measures of functional status, particularly the IADL's (e.g.,
"During the past month, how much difficulty did you have doing work around the house?").
They also found gender, race and living alone to be significant at the 0.05 level. The authors
note the strength of association between variables, but stop short of drawing conclusions
about causality based on the results of the model. See Warren and Knight (1982) for another
study of the relationship between mortality and functionality, using 1,534 impaired, mostly
elderly persons from Canterbury.
Another interesting substantive but descriptive analysis of health status and mortality
among the elderly is provided by Berkman et al. (1989). Using a segment of the same data
used here (the New Haven third of the EPESE sample), the authors develop a "grade of
membership" model (GOM) to compare health conditions between whites and blacks in the
sample. The purpose is to identify "clusters" of homogeneity (with respect to health status,
risk factors, and functional status) in what is an otherwise diverse and heterogeneous sample
of elderly. In contrast to the usual regression based approach, the GOM model does not
assume a single dependent variable that is a function of multiple independent variables.
Rather, one considers one group of variables of interest to be internal, such as ADL's and chronic
conditions (i.e., those variables in which one would like to find homogeneity, not
unlike dependent variables), and all remaining variables are considered external (e.g., marital
status or income, not unlike independent variables).
This approach has much in common with the question set method considered here.
The GOM approach, however, has the advantage of recognizing that there are many
dimensions of health status (e.g., more than can be measured by mere survival status), in
addition to dealing with heterogeneity in the population. Future research into the method
developed here will explore the use of the search algorithm in Appendix VI for locating
efficient GOM representations.
Chapter 3 - Data: Established Populations for
Epidemiologic Studies of the Elderly
3.1 The EPESE project
The data for this dissertation come from the EPESE project (Established Populations
for Epidemiologic Studies of the Elderly), initiated by the Epidemiology, Demography, and
Biometry Program of the National Institute on Aging in 1980. The EPESE project was an
attempt to monitor four small populations of noninstitutionalized elderly persons
prospectively (elderly are those aged 65 and over; total N = 14,456). Initially, this involved
the administration of a baseline, household interview survey (conducted in the first three
populations in 1981-82) and subsequently continued through annual follow-up surveys, both
telephone and household interviews. The surveys were designed toward four objectives: 1)
to estimate the prevalence of various chronic conditions and impairments; 2) to estimate the
incidence of chronic conditions and impairments; 3) to identify the factors associated with
these conditions; and 4) to measure changes over time in the functioning of the elderly.¹ In
short, a fairly large group of elderly persons was observed in great detail at baseline with
respect to many variables. They were also observed over time with respect to several types
of outcomes, including functionality, various illnesses, and death.
Originally, the populations observed consisted of three communities: New Haven,
Connecticut (Yale Health and Aging Project, N=2,812); East Boston, Massachusetts (Senior
Health Project, N=3,809); and Iowa County and Washington County, Iowa (Iowa 65+ Rural
1. See Cornoni-Huntley et al. (1986, 1993).
Health Study, N=3,673), for a total N of 10,294. An additional site was added in 1984 near
Durham, North Carolina (N = 4,162), partly with the intent of oversampling blacks. This
latter dataset is considered here only for the purposes of testing the predictions made by the
models constructed in Chapter 4, which use data from the first three sites. For the remainder
of this chapter, all references to the data refer only to the New Haven, East Boston and Iowa
sites. The North Carolina sample is summarized briefly in Chapter 5, which contains the
results of the validation.
3.2 Composition of the populations
Unfortunately, the sample designs for the three populations were not uniform. In fact
the Iowa and East Boston respondents were chosen through total community censuses, not
random sampling; only the New Haven respondents were chosen randomly, with a stratified
cluster sample. In Iowa, the target population was all noninstitutionalized persons aged 65
and over in Iowa and Washington Counties, an agricultural area in East Central Iowa
consisting of about 16 small towns. A list of elders in the area was compiled by the area's
Agency on Aging, and this list was supplemented with additional names given by local
informants. About 80% of the persons identified responded to the survey. In East Boston,
a total community census was conducted concurrently with the baseline survey, and 84% of
the noninstitutionalized elderly persons enumerated by the census responded to the survey.
The New Haven data, the only randomly designed sample, was a cluster sample stratified by
type of housing: public housing for the elderly, private housing for the elderly, and all other
elderly in the community. The overall response rate for the New Haven elderly was 82%.
(All of this information was taken directly from the EPESE Resource Data Book, Cornoni-Huntley et al. (1983).)
Of all 10,294 respondents, 6,256 (60.8%) were female and 4,038 were
male. About 2,874 (28%) respondents were aged 65-69, 2,659 (26%) were 70-74, 3,274
(32%) were 75-84, and only 958 (9%) were 85 or older. The Iowa and East Boston
respondents were almost entirely white, while the New Haven sample was 18.8% black.
Ethnically, the East Boston respondents were described as predominantly Italian, Irish, and
northern/central European in descent, while the New Haven respondents were also largely
Italian, but with a more sizable eastern European contingent, in addition to the black
population. Presumably, the Iowans were mostly northern European. The East Boston
community was described as blue-collar, consisting of low- to middle-income working-class
persons. The New Haven area was dominated by educational institutions, manufacturing and
service industries with a level of income well below the state median. The Iowans, as
mentioned, were largely rural and agriculturally oriented with some light industry and retail
in small towns of populations less than 2,000, and one small city with a population of 6,500.
Since the respondents were not chosen with a nationally representative random
sample, it is important to examine the composition of this sample in some detail, and to
compare it with national statistics, particularly with respect to age, sex, and mortality. This
is done below in Section 3.5. First, it is useful to discuss the particulars of the survey
instruments.
3.3 The baseline survey
In all three populations, the initial baseline data were gathered with an extensive
household interview survey. The instruments for the three studies were not identical, but
were highly similar; this dissertation uses only variables which could be uniformly coded
across all three datasets. Table 3.1 lists these variables according to 10 general categories:
Table 3.1 - Variables in baseline survey
Demographic and personal characteristics
Sex, age, race, educational attainment, marital status, employment status, occupation, work history
income, household composition, religion, number of living children, numbers of friends and relatives
Mental and emotional status
Does respondent know: own age, date of birth, present date, day of week, who is president, mother's
maiden name, telephone number, subtracting 3 from 20
Depression
Physical functioning
Respondent reports whether they can perform activities of daily living: walking across the room,
bathing, using the toilet, grooming, walking half a mile, doing heavy work, moving large
objects, eating, getting dressed, using stairs, stooping/kneeling, handling small objects, etc.
Does respondent need help from special equipment or a person to do these things
Hearing, hearing aid usage, vision, eyeglass usage
Bowel and urinary control
Self-reported level of general health
Sleep habits
Chronic conditions and suspected illnesses
Respondent reports whether they were ever diagnosed with heart failure, cancer, stroke, diabetes,
hypertension, fractured hip, other bone fractures
Was respondent hospitalized overnight for any of these conditions
Chest pain (Rose questionnaire)
Possible infarction - severe chest pain, leg pain, pain when walking
Shortness of breath, coughing, phlegm, chest wheezing
Physiological characteristics
Weight and height at time of interview, weight at age 25 and 50, weight changes in past year
Pulse, two readings of systolic and diastolic blood pressure
Tobacco and alcohol consumption
Did respondent drink beer, wine, or liquor in the past year or month, how often in the past month, past
history of heavy drinking
Does respondent smoke now or in the past, how many cigarettes per day, age at first/last cigarette
Medications
Does respondent use insulin, hypertension medication, digitalis
Use of medical resources/institutions
Was respondent hospitalized for at least one night in the past year
Has respondent ever been a nursing home patient
How recent was last visit to a dentist
Some of these listed "variables" actually consisted of two or more questions; e.g., there were
a dozen questions on chest pain (the Rose questionnaire). It should also be noted that all
chronic condition, hospitalization and treatment variables captured only the self-reported
diagnosis and/or treatment; thus it was not discovered whether the respondent truly was
afflicted with or treated for a condition.
3.4 Outcomes
Following the baseline survey, respondents were recontacted annually to monitor a
number of different outcomes over time. The first two surveys were conducted by phone,
while the third involved another household interview. (Additional follow-up surveys were
completed, but these data are not publicly available, and are not considered in this
dissertation.) The outcome of central importance for this dissertation is mortality. Each of
the three centers established a "mortality surveillance system" to match up known deaths in
the community (e.g., through obituary notices or hospital records) with participants in the
study. Each respondent's status was observed at each annual recontact, and decedents were
matched with their death certificates. It was observed that 433 respondents died in the first
year, 486 died in the second, and 531 died in the third. Besides the mere occurrence of death,
a full listing of both underlying and associated causes is provided with the data, coded by a
single nosologist at each center.
For respondents who survived to the first or second annual recontacts, other variables
observed were chronic conditions, physical functioning (ADL's, hearing, vision),
hospitalizations, marital status, working status, household composition, weight loss,
medication usage, and nursing home admissions. The third follow-up, the second household
interview, was more detailed; respondents were also assessed with respect to mental status
and blood pressure, and they were asked about religious activity, relatives and friends,
urinary control, chest pain, possible infarction, sleeping habits, smoking, drinking, and
depression.
The other outcomes of interest for this dissertation, besides mortality, were the
occurrences of new illness, particularly heart failure, strokes and cancer. Since it was
possible for any particular condition to go unreported by the respondent but discovered at
death and listed as a cause of death, it was decided to include these listings as new incidences
of the illness. Thus any listing of heart failure, stroke or cancer as a cause of death when the
respondent had never reported having such was counted as a new occurrence. This was done
under the assumption that if the respondent died of the event without having reported it, the
event occurred between the time of reporting and death. For the analysis of cancer,
respondents who reported having ever had cancer at baseline were removed from the
analysis, as it was not clear whether "new" incidences of cancer were not simply new
malignancies from previously diagnosed cancers. After these recodings, it was observed that
868 respondents suffered new heart failures, 943 had strokes, and a total of 552 persons had
new cancers (6.2% of 8,874 persons never previously diagnosed with cancer). Again, it
should be noted that these "occurrences" of illness are mainly self-reported diagnoses (and
for new illnesses found on the death certificate, observed postmortem diagnoses). This is
important with respect to the incidence of cancer, since tumors in their earlier stages
frequently go undiagnosed.
3.5 A comparison of the EPESE sample and the U.S. population
The ideal sample design would have been a nationally-representative probability
sample of U.S. elderly, since one would like to build models applicable to these persons.
However, the EPESE dataset was instead a combination of two community censuses and a
stratified cluster sample, all from three small geographical regions that were themselves
chosen by convenience. It was clearly important, then, to note the differences and
similarities between the observed dataset and what would have been expected from a
nationally representative sample. It is argued below that the population is quite similar to
the U.S. population of elderly (although "statistically significant" differences do exist). With
respect to mortality in particular, the differences between the EPESE sample and the U.S.
population are relatively small (i.e., of the same size as the sampling variation one would
expect in a simple random sample of U.S. elderly).
To gauge the size of the differences between the EPESE sample and what would have
been expected from a nationally representative sample, one can calculate simple statistical
tests under the obviously false assumption that the EPESE sample was itself a simple random
sample (SRS) of U.S. elderly. That is, since the age and sex composition of the U.S.
population of elderly was known, and since population-level values for the probability of
death by age and sex were known, one can compare the EPESE sample estimates with the
population values directly. However, one would expect some small differences even if the
sample was highly representative of U.S. elderly. So the observed differences were
calibrated in terms of the standard errors one would have expected from a simple random
sample.
Figure 3.1 shows the age and sex distribution of all 10,294 respondents. The dashed
lines show the expected value for each total under the false assumption that the respondents
Figure 3.1 - Number of respondents, by age and sex (dashed lines show expected values for a simple random sample of U.S. elderly; age groups 65-69 through 85+, females and males shown separately). Respondents are from the New Haven, East Boston and Iowa EPESE baseline surveys (N = 10,294).
had been chosen with a SRS of the U.S. population of the elderly (including the
institutionalized) of size 10,294.² If the sample had been a simple random sample of
noninstitutionalized elderly only, one would expect more persons age 65-74, and fewer
persons aged 75 or older, compared with the entire U.S. population of elderly.3 It is clear
then that persons aged 65-74 were undersampled, while persons 75 and older were
oversampled. Under the SRS assumption, a calculation of the chi-square statistic across all
ten age-sex categories totals 57 on 9 degrees of freedom, p < .001; most of the contribution
to this statistic comes from the lack of males aged 65-69, the surplus of males 85+, and the
lack of females aged 70-74 (in that order). However, the overall sample distribution is
patterned quite similarly to the population distribution. There are significantly fewer
respondents with each increase in age, and there are significantly more females at every age.
Moreover, if one ranks the ten age-sex categories by size, they rank identically in both the
sample and the population. Although statistically significant differences exist, many of the
differences are not very large in real terms; rather, the large sample N makes even small
differences statistically significant. For example, with respect to sex only, the sample
proportion of females (60.8%) was actually quite close to the population proportion
(59.8%). However, this gives a significant t-statistic of 2.1 (p < 0.05) due to the small
standard error attached to the estimate.
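The size of such a difference, measured in SRS standard errors, can be reproduced with a few lines (a sketch using the figures just quoted; the function name is ours, not the author's):

from math import sqrt

def srs_t_statistic(p_sample, p_population, n):
    # t statistic for a sample proportion against a known population value,
    # under the (false) assumption of a simple random sample of size n.
    se = sqrt(p_population * (1.0 - p_population) / n)
    return (p_sample - p_population) / se

# Proportion female: 60.8% in the sample vs. 59.8% among U.S. elderly.
t = srs_t_statistic(0.608, 0.598, 10294)  # about 2.1, so p < 0.05 two-tailed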
A more serious difference is observed in the racial classifications. As mentioned,
only 5.1% of the sample reported itself as black, while some 8.2% of the elderly population
2. The numbers for the U.S. population are taken from the Current Population Reports, P-25 series, No. 1095 (1982).
3. Approximately 97% of the population aged 65-74 was noninstitutionalized, while 86% of persons 75+ were noninstitutionalized. Source: Statistical Abstract.
in 1982 was black. (Since blacks are oversampled in the North Carolina data, this issue is
addressed in more detail in Chapter 5.) Also, the oldest elderly may be substantially
oversampled, based on the observed proportions of respondents older than 85 and the high
degree of institutionalization that would have been expected at these oldest ages.
Thus there are some significant differences between the sample and the population,
whether or not the institutionalized are included, so the data are clearly not nationally
representative as gauged by SRS sampling error. Ultimately, however, the goal of the
dissertation is to develop models that predict mortality in the general population of elderly.
Thus, of particular importance is the effect of the sample's idiosyncrasies on estimates of
predicted probabilities of death and the approximate standard errors used to gauge the
accuracy of these estimates. Fortunately, as mentioned, one is in the position of knowing
near exact population values for the cohort probability of death according to U.S. single-year
period life tables for the three-year period 1982-1985.⁴ This was approximately the three-
year period during which the EPESE respondents were observed, so if the EPESE cohort was
similar to the U.S. population of elderly, they should have experienced death rates
approximately equal to these national rates. A cohort estimate of the population value of 3qx
(the probability that a person age x dies between age x and x+3) may be obtained first by
gleaning 1qx, 1qx+1 and 1qx+2 from the 1982, 1983 and 1984 period life tables respectively. For
example, call them 1qx^82, 1qx+1^83 and 1qx+2^84, where 1qx+k^8X is the probability that a person aged
x+k died between ages x+k and x+k+1 in the year 198X. For the average person aged x in
the U.S. population in 1982 (the time of the baseline survey), the probability of death within
4. Source: U.S. Vital Statistics, 1982-1984.
the next three years was estimated as:
\[
{}_{3}q_x^{\mathrm{cohort}} \;=\; 1 - \bigl(1 - {}_{1}q_x^{82}\bigr)\bigl(1 - {}_{1}q_{x+1}^{83}\bigr)\bigl(1 - {}_{1}q_{x+2}^{84}\bigr).
\]
Thus, under the false assumption that the sample is a SRS of elderly, the expected value
of the sample estimate of 3qx^cohort (the sample estimate simply being the proportion of sample
respondents aged x who died within three years) could be predicted separately for each of the ten
age-sex groups by setting x equal to the average age of the sample respondents in each age-
sex category and using the above formula. Also, under the SRS assumption, the standard
error for the sample estimate of 3qx^cohort can be estimated as:
\[
\mathrm{S.E.}\bigl({}_{3}\hat{q}_x^{\,\mathrm{cohort}}\bigr) \;=\; \sqrt{\frac{{}_{3}q_x^{\mathrm{cohort}}\,\bigl(1 - {}_{3}q_x^{\mathrm{cohort}}\bigr)}{N_x}},
\]
where N_x is the number of sample respondents aged x,
as implied by the formula for the variance of a random variable having a binomial
distribution. (Note that one should use the population value of 3qx^cohort for this calculation
since it is known, although there is little difference if the sample estimate is used.) Again,
since the population value includes the institutionalized elderly, one might expect that the
sample estimates of 3qx^cohort would be slightly lower than the population estimate for ages
above 75 or so (assuming institutionalized persons have higher rates of death than the
noninstitutionalized).
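The two calculations can be sketched as follows (the single-year probabilities and the cell size shown are made-up placeholders, not values from the life tables or the sample):

from math import sqrt

def three_year_q(q82, q83, q84):
    # Cohort probability of dying within three years of exact age x,
    # chained from the single-year probabilities 1qx(1982), 1qx+1(1983),
    # and 1qx+2(1984) taken from the period life tables.
    return 1.0 - (1.0 - q82) * (1.0 - q83) * (1.0 - q84)

def srs_standard_error(q3, n_aged_x):
    # Approximate SE of the sample estimate of 3qx under a binomial model,
    # using the known population value of 3qx.
    return sqrt(q3 * (1.0 - q3) / n_aged_x)

# Illustrative placeholder values only:
q3 = three_year_q(0.020, 0.022, 0.024)   # about 0.065
se = srs_standard_error(q3, 1500)        # about 0.0063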
Figure 3.2 shows the sample estimates of 3qx^cohort for the ten age-sex categories, with
dashed lines at 3qx^cohort +/- 1 S.E. for each estimate, and asterisks at the height corresponding
to the U.S. population values of 3qx^cohort. Table 3.2 also lists these estimated probabilities and
their standard errors in addition to the U.S. population values and the p-value for the t-
statistic (under the null hypothesis that the expected value of the estimate is equal to the
Figure 3.2 - Probability of death within 3 years, by age and sex (dashed lines show +/- 1 standard error, approximate estimates; asterisks show U.S. population values; age groups 65-69 through 85+). Sources: 1982-85 EPESE surveys from New Haven, East Boston and Iowa; U.S. Vital Statistics.
Table 3.2 - Observed sample and U.S. population estimates of the probability of dying within three years of age x (3qx), by sex

           Females                                     Males
Age        Sample(1)  S.E.(2)  US pop.(3)  p-value(4)  Sample(1)  S.E.(2)  US pop.(3)  p-value(4)
65-69      0.061      0.0056   0.056       0.382       0.101      0.0081   0.103       0.800
70-74      0.078      0.0065   0.082       0.581       0.159      0.0110   0.146       0.227
75-79      0.106      0.0088   0.127       0.020       0.222      0.0149   0.211       0.452
80-85      0.188      0.0135   0.209       0.115       0.279      0.0216   0.295       0.295
85+        0.279      0.0181   --          --          0.398      0.0265   --          --

1. This column shows the cohort estimate of 3qx, equal to the sample proportion of respondents in each age category who died within three years of baseline. Data: New Haven, East Boston, and Iowa County EPESE surveys, 1982-1985.
2. The S.E. column gives an approximate standard error for the sample estimate of 3qx, obtained by assuming a binomial distribution of deaths.
3. This column shows cohort values of 3qx for the entire U.S. population (including the institutionalized), obtained from single-year period life tables for 1982-1984 (U.S. Vital Statistics). The value of x for each age category was set equal to the average age of sample respondents in that age category. Note that about 97% of persons in the U.S. population aged 65-74 were noninstitutionalized, while only 86% of persons 75+ were noninstitutionalized.
4. Given the assumption that the sample is a simple random sample from the U.S. population, the expected value of the sample estimate of 3qx is equal to the U.S. population value. The p-value for the t statistic under this null hypothesis is shown (two-tailed test).
population value, so that any observed difference is due purely to random chance). For
example, the average female in the sample aged 65-69 had an estimated 6.1 % chance of
dying, give or take 0.56%. If the respondents were chosen from the U.S. population with a
SRS, the expected value of this estimate would be 5.6%, a difference of about 0.5%, or 0.9
standard errors. There was a 38.2% chance that a difference at least this big would have been
observed due to chance error, so one fails to reject the null hypothesis. This was the case for
all the estimates except that for females aged 75-79; this estimate is significantly lower than
the population value. This is not surprising, since the population values include
institutionalized persons, as mentioned above. In fact the estimates for both males and
females aged 80-85 were also somewhat below the population level, as expected (although
not significantly so). Overall, however, the confidence intervals implied by the standard
errors under the SRS assumption capture the true population values quite accurately, despite
the quite small standard errors for the estimates for ages less than 75.
3.6 Functionality, morbidity and causes of death in the EPESE sample
Since there were several key measures of functionality closely associated with
mortality, it was informative to understand what levels of functionality existed in the sample
as measured by these important variables, particularly regarding age and sex. For example,
part of the question model's surprising ability to predict nearly as many female deaths as
male (despite using "maleness" as a predictor of high risk) was the result of some interesting
gender differences in the various functionality measures.
Consider the variable that asked whether respondents could walk for half a mile
without help, which was an excellent predictor of mortality even after controlling for many
other variables (see Chapter 5). There were 2,527 respondents in the sample who responded
they were unable to walk this distance, about 24.6% of all respondents. Figure 3.3 shows the
percentage of respondents at each age and sex who reported that they could not walk half a
mile. As expected, the percentage increases with age. However, when examined by gender,
a surprising difference is observed. Females were much more likely to report this dysfunction
at all age groups! This was not an expected result; it was thought that since both the
probability of death and the incidence of chronic disease were higher for males (as estimated
below), males would also be more severely debilitated than females. However, this was
usually not so.
A very similar result could be seen with another important predictor variable
measuring functionality, which was the question asking whether respondents could bathe
without help. There were 889 respondents (about 9% of the sample) who reported that they
either needed assistance or were unable to bathe at all. Figure 3.4 shows the percentage at
each age and sex who reported this inability. Again, a similar pattern is observed: the
percentage increases with age, which is hardly surprising, but females report a significantly
higher level of dysfunction than males. Other variables, such as the variable reporting levels
of self-assessed health, also showed a higher level of dysfunction or illness in females. These
results are not unique to this sample; other researchers have reported higher levels of
dysfunction (as well as disease) in females. However, notice that by using these variables as
classifiers of high risk groups, one compensates for the gender imbalance in predicted deaths
caused by simultaneously using the male sex as a classifier.
Interestingly, the measures of mental functionality did not display the same patterns.
Figure 3.3 - Proportion unable to walk half of a mile without help, by age and sex (age groups 65-69 through 85+, males and females shown separately). Sources: EPESE baseline surveys from New Haven, East Boston and Iowa.
Figure 3.4 - Proportion unable to bathe without help, by age and sex (age groups 65-69 through 85+, males and females shown separately). Sources: EPESE baseline surveys from New Haven, East Boston and Iowa.
Only 406 respondents could not correctly state their mothers' maiden names, and 514
respondents could not correctly state the day of the week. However, both groups had many
men in most age categories.
Respondents were also asked directly about many illnesses, as indicated above.
There were 1,239 respondents who reported having ever been diagnosed with heart failure
and 124 who reported having been suspected of having heart failure. Of these respondents,
about 54% were male. Figure 3.5 shows the percentage of respondents who reported that
they had been diagnosed with heart failure (or suspected heart failure) at some point in their
past (not necessarily at the time of baseline) by age and sex. The proportion of males with
heart failure was much higher than the proportion of females with heart failure at all ages.
Generally, the proportion with heart failure increased slightly with age, excepting the oldest
males. It appeared that the lifetime prevalence of heart failure in the EPESE sample was
lower than one might expect compared with other national samples (i.e., the National Health
Interview Survey). For example, NHIS estimates suggest that the prevalence of heart disease
in 75-80 year olds was as high as 30% (compared with 12% to 21% in the sample), although the
NHIS definition of heart disease is wider than the definition here (which is heart failure, and
only the reported diagnosis of it).
There were 1,420 respondents who reported having ever been diagnosed with cancer
(about 13.6% of the sample), and 92 who reported suspected cancer. Of these respondents,
nearly 68% were female. Figure 3.6 shows the proportion of respondents at each age and sex
who reported having been diagnosed with cancer (or suspected cancer). The proportion of
respondents with cancer did not appear to increase substantially with age, and in fact seemed
Figure 3.5 - Proportion diagnosed with heart failure by age and sex (age groups 65-69 through 85+, males and females shown separately). Sources: EPESE baseline surveys from New Haven, East Boston and Iowa.
Figure 3.6 - Proportion ever diagnosed with cancer by age and sex (age groups 65-69 through 85+, males and females shown separately). Sources: EPESE baseline surveys from New Haven, East Boston and Iowa.
to decrease for females (perhaps showing the selection of those who die of cancer at earlier
ages). At all ages except 80-84, females were more likely to have been diagnosed with cancer.
The other chronic illnesses asked about in the survey included hypertension, strokes,
diabetes and bone fractures. Hypertension was by far the most common malady; some 4,429
respondents (43% of respondents) reported having ever been diagnosed with high blood
pressure, and 174 were suspected. Of these, the majority were again female (nearly 68%).
These levels may sound high, but comparisons with results from the National Health
Interview Survey suggest that these levels are quite plausible and that the gender difference
is quite real. For example, 48% of female respondents aged 75-80 in the NHIS sample had
hypertension, compared with 29% of the males. The proportion of respondents having been
diagnosed decreases with age, particularly for males.
There were 1,322 respondents who reported having ever been diagnosed with
diabetes (12.8% of respondents), and 157 suspected. Males were slightly more likely to
report having diabetes, and the proportion seemed to decline with age. Fewer respondents,
601, reported having been diagnosed with a stroke, and 60 were suspected of having had a
stroke. Males were much more likely to have had a stroke (except at the oldest ages), and
the proportion of respondents having had a stroke increased with age. About 4% of the
respondents reported having ever been diagnosed with a fractured hip, and 18.9% reported
having been diagnosed with some other bone fracture.
For the more common illnesses, one is also typically interested in the incidence of
disease besides the lifetime prevalence measure used above. The numbers of respondents
who reported being diagnosed with chronic illnesses during the three-year follow-up period were
reported briefly in section 1.8 for heart failure, cancer and stroke. Note that these numbers
include persons who died of the condition without having reported ever having it at baseline.
The assumption is that if the respondent died of a heart attack, for instance, but did not report
having ever been diagnosed with a heart attack at baseline, then the death was the result of
a new incident. Conversely, if the condition had been reported at baseline and the patient
died of the condition without reporting a new incident in follow-up, then it was assumed that
no new incident occurred.
There were 868 respondents suffering a new heart attack in the three-year follow-up
period (8.4% of the sample), of whom 408 were male (47% of all victims). Figure 3.7 shows
the proportion of respondents at each age and sex who experienced new incidents of heart
attack. The risk of a new heart attack clearly rises with age, and is greater for men than
women, congruent with Figure 3.5. There were 943 respondents with new incidents of stroke
(9.2% of the sample), of whom 446 (47.3%) were male. The proportion of respondents at
each age and sex with new incidences of stroke is shown in Figure 3.8. Again, the risk rises
with age, and is much higher for males of all ages (excepting 80-84 year-olds).
There were 726 respondents reporting incidents of cancer during follow-up, not all
of which were clearly new malignancies. Of these respondents, 356 (49%) were male.
Figure 3.9 shows the proportion of respondents at each age and sex with incidents of cancer.
Interestingly, the risk is much higher for males, and does not necessarily increase with age.
This pattern is in contrast to Figure 3.6, which suggests that the lifetime prevalence of cancer
is much higher in females. This implies that the duration and survivability of cancer spells
are greater for females. Notice again that for the purpose of prediction, it was decided that
Figure 3.7 - Incidence of heart failure by age and sex (age groups 65-69 through 85+, males and females shown separately). Sources: EPESE baseline surveys from New Haven, East Boston and Iowa. Note: Includes those respondents who died from new incidents.
Figure 3.8 - Incidence of stroke by age and sex (age groups 65-69 through 85+, males and females shown separately). Sources: EPESE baseline surveys from New Haven, East Boston and Iowa. Note: Includes those respondents who died from new incidents.
Figure 3.9 - Incidence of cancer by age and sex (age groups 65-69 through 85+, males and females shown separately). Sources: EPESE baseline surveys from New Haven, East Boston and Iowa. Note: Includes those respondents who died from new incidents.
all respondents who reported ever having been diagnosed with cancer should be dropped
from the analysis. This was due to the difficulty in distinguishing between incidents of
cancer during follow-up which were genuinely new incidents, and those that were essentially
malignancies that may have been diagnosed prior to baseline. Of the 8,874 such persons,
there were 552 who experienced genuinely new cases of cancer (6.2%).
Finally, it is informative to examine the causes of death assigned to the deceased
respondents. For simplicity, one usually looks at the underlying cause as implied by the
death certificate, although this is quite an insufficient piece of information for fully
understanding the true causes behind most deaths. The definition of underlying cause is
discussed in Chapter 6, as well as the shortcomings of these data and other details concerning the
classification of causes (both underlying and associated) on death certificates. Here,
however, it is sufficient to examine underlying causes as given, with the brief caveat that this
is an inexactly measured (and somewhat ambiguously defined) variable.
The most common underlying cause of death, as expected, was heart disease, which
accounted for 633 deaths, nearly 44% of all 1,450 deaths in the sample. The next most
common cause was cancer, assigned to 335 respondents, only 23% of all deaths.
Cerebrovascular disease was the cause identified for 6.8% of deaths, circulatory diseases
accounted for 2.8%, accidents accounted for 2.1%, diabetes for 1.9%, and hypertension for
1.7% of all deaths. The remaining 261 deaths (18% of all deaths) were attributed to other
causes.
A comparison of these proportions to the U.S. population of elderly reveals some
interesting differences. For instance, using cause-specific death rates by age and sex from
the National Center for Health Statistics, as applied to the EPESE age and sex sample
distribution, one would have expected that about 50% of all deaths would have been
attributed to heart disease. About 30% would have been attributed to cancer, and about 11%
would have been attributed to stroke. In fact, for every cause of death examined (including
accidents, diabetes, and hypertension), the proportion attributed to each cause is smaller in
the sample than in the population; this is to say, the only category of death that contains a
larger proportion of deaths in the sample than in the population is the residual, "other"
category. It is possible that some discrepancies in the grouping of ICD codes by cause may
have been responsible for some differences between the categories, but the observed
differences are quite substantial. More likely, it may have been that the EPESE population,
consisting of working-class, middle- and low-income persons, was less likely to have death
certificates examined by a physician or coroner than persons from a random sample of U.S.
elderly. This would have produced a lower consistency of classification and more deaths due
to unknown causes. See Chapter 6 for a more thorough treatment of deaths by cause.
3.7 Missing data
As often happens with large complex datasets in social science, data were not always
available on every respondent for every variable. Fortunately, the numbers of missing values
were not large for the vast majority of variables of interest. However, as the accuracy of the
models includes the classification and misclassification of cases for which certain values
were indeed missing, it is important to document their extent and explain the source
where possible.
For example, one primary reason for missing values on certain variables related to
the small number of phone and proxy interviews, in which some survey items were not asked
of the respondent. There were about 660 interviews (6.4% of the sample) completed by
telephone or proxy (for the New Haven sample, at least 50% of the interview was completed
by proxy). These respondents were important to isolate because, interestingly, they
experienced particularly high death rates: about 29% of them died, more than twice the raw
death rate in the whole sample. Thus, among these persons there were some 191 deaths
(13.2% of all deaths in the sample), of which 90 were concentrated in the East Boston
sample. Incredibly, of the 149 East Boston respondents who were interviewed by proxy (or
partially by proxy), 74 died (nearly 50%); this turned out to be, of all the variables on the
survey, the single largest split of respondents with such a high death rate. Thus, it was
possible that by including this set of respondents in the data, one might unfairly inflate the
apparent ability of the model to detect high risk persons by using questions that classified lots
of missing values as high risk (as do the questions in the appendix). The variables that many
of these respondents were not asked about included their mental status (e.g., the questions
asking for the mother's maiden name or the day of week), a handful of the physical
functioning questions (including questions on vision and hearing), some questions about
hospitalization and medication for chronic conditions, the items on physical measurement
(e.g., pulse, blood pressure), questions on sleeping habits, and the items on alcohol usage.
Other reasons for missing values included the more usual causes: some proportion
of respondents did not fully complete the survey for various reasons (about 162 surveys were
identified as partially complete or abbreviated), and inevitably respondents refused to answer
some questions, or were unable to respond for lack of opinion or knowledge. Some
questions were coded as "missing" but were not truly unknown. Rather, they were simply
not applicable to the respondent because of an earlier reply (e.g., a person who never
smoked was not asked at what age he or she started smoking); this is not what is considered
"missing" data in the totals presented below.
Of particular interest here are the questions presented in the appendices. Age and sex
had no missing values. The greatest number of missing values occurred in the item in Set
A.3 asking about the respondent's weight at age 25: there were 2,068 missing values for this
variable (20% of respondents), and these included all of the Iowa proxy and telephone
interviews. Clearly, however, it seemed that most of these missing values belonged to
persons who were simply unable or unwilling to answer the question. The next largest
number of missing values was contained in the question asking the respondent to state the
mother's maiden name (see Set B.3): there were 747 missing values (7.3% of respondents).
The item asking how much difficulty the respondent has pushing or pulling large objects (Set
A.1) had 745 missing values. The question asking about the difficulty of bathing (Set C.1)
had 736, and the weight of the respondent at time of interview (Set A.4) had 707. The item
asking the respondent to state the day of the week (Set C.2) had 686, and the question about
the ability to see a friend across the street (Set A.5) had 525 missing values. For all these
variables, the total included all 456 respondents in the Iowa telephone and proxy interviews,
who had a death rate of about 20%. The next largest number of missing values was present
in the digitalis question in Set A.2 (381, or 3.7% of the sample). The question asking about
heavy work (Set A.4) had 116 missing values (only 1.1% of respondents) and the two
remaining questions had only 57 missing values.
To find out how much effect the high-mortality proxy interviews may have had on
the estimates of misclassification error associated with the model in Appendix I, the error
estimates were recalculated with these respondents removed from the sample. For Set A,
there was in fact a slight increase in error when the proxy interviews were removed, as there
was a decrease in sensitivity (the proportion of deaths correctly predicted) when the higher
risk respondents were removed. This was only about a 3% increase in error, however, an
amount that was less than the approximate standard error attached to the estimate. For Set
B, the error was nearly unchanged when the proxy and telephone respondents were removed,
and for Set C there was a slight decrease in error. The change was always very small (less
than the rather small standard errors attached to the estimates). Interestingly, however, the
downward shift in sensitivity and upward shift in specificity observed when the
proxy/telephone respondents were removed almost exactly mimics the change observed
when the model was applied to the North Carolina (Duke) sample (although not quite as
large in magnitude as the Duke differences). It is possible that these differences account for
part of that observed shift, assuming the Duke proxy respondents (of which there were 162)
were not high-risk respondents like the proxy respondents from the original interview. (See
Chapter 5 for a detailed treatment of the validation with the Duke sample.)
Chapter 4 - The statistical methods of prediction
4.1 Approaches to model selection
Important to any multivariate statistical method of prediction is the decision of which
variables should be included as predictors. As suggested by Table 3.1, the number of
candidate variables in the EPESE survey was quite large. Since the sample size is also large,
the largest possible models could contain many more variables than one typically wants.
Sometimes, these models can be larger than a computer can manage. Thus the researcher is
required to devise some method for selecting the best variables, and to recode many
variables for some types of models (i.e., as indicator variables). As suggested repeatedly in
Chapter 2, the size of a model greatly affects its ability to predict accurately, so the matters
of model size and what variables belong in the model require much attention. The purpose
of this chapter is to describe the method of model selection not only for the question set
method invented above, but for logistic regression, linear discriminant analysis and the
CART method as well. (The results from these methods are compared in Chapter 5.)
The four methods considered below all have differing capacities for dealing with the
problems of variable selection and recoding, depending partly on the parameterization (or
lack of it), and partly on the abilities of the computer to handle large datasets. For example,
the nonparametric question set method developed here required no recoding of any variable
or any missing value, and could handle a very large data matrix or model consisting of
several hundred variables. Thus it was possible to perform a systematic and objective
variable selection by using the full, raw dataset as input to the algorithm. To perform logistic
regression, however, one is required to recode missing values and nonordinal variables.
Also, due to memory constraints, the computer was not able to handle a full model nearly as
large, requiring the elimination of many variables before a "full" model could be built at all.
Again, the main motivation behind dimensionality reduction lies in the bias-variance
tradeoff, discussed extensively in Chapter 2. In short, if too many variables or parameters
are used in the model, the predictions are subject to greater chance error, but if the model is
too small, one may forsake predictive accuracy by ignoring powerful predictors. The
problem is well recognized by statisticians, and many methods have been proposed for the
very purpose of reducing model size without losing predictive power. The most familiar
techniques involve some kind of backward deletion from a full model, and all the methods
considered below use a variant of this technique. Unfortunately, in actual application most
researchers use an inference-based approach (i.e., p-values) or a rough guide such as
Mallows' C_p or Akaike's Information Criterion (AIC), despite the frequently inappropriate
assumptions required.¹ Few researchers take advantage of information about the
approximate levels of bias and variance supplied by a test set method, although the CART
method does use such an approach.²
In this dissertation, a simple random sample was used to separate the dataset into a
learning set (N = 6,862) and an "internal" test set (N = 3,432). The former was used for
building various models (it is also called the "training set") and the latter for gauging their
accuracy and choosing a particular size. The same division was used for all methods except
CART, in which the program chose its own test set. Since a fair amount of back-and-forth
¹ See Akaike (1973) and Mallows (1973).
² Another example not specific to classification is the nonparametric smoothing algorithm in Splus
called supsmu(); see Friedman (1984).
between the test set and learning set was necessary in the model selection process, it is
important to recognize that the test set estimates of model error are probably biased
downward to some degree. Hopefully, this bias is small for the question set method, since
the sample size was quite large and the method of choosing questions was systematically
random. More stringent tests of the models are conducted in Chapter 5 with the use of the
North Carolina EPESE dataset. These results suggest that any bias was probably small.
4.2 A method for choosing questions
The method of selection for the model invented in this dissertation was initially
described in Chapter 1. As mentioned, it was possible to use nearly the entire, untransformed
dataset as input to the search algorithm. After dropping some variables for logistical reasons,
a set of 164 candidate variables, coded as on the original tape, was chosen from the survey
data and used as the penultimate "full" dataset for the remainder of the project. Using this
set of 164 variables, a search was implemented on the learning set data to find a "full model".
For Question Set B, for example, the full model consisted of a subset of 16 questions (four
sets of four questions each, combined with OR and AND as described in Chapter 1), which
gave a low learning set misclassification error. (Error was defined by Equation 2.2, as
applied only to the learning set respondents.) Each of these questions could be any question
in binary form. Such a question could be of the form "Is X < 3.2?" or "Is X ≥ 5?", where X
could be any of the 164 variables, and the value of the cutoff could be any possible value of
X in the dataset. Both were always chosen randomly. To do the random selection of
variables and cutoffs, the program uses the C library function drand48(), which
generates a pseudorandom number between zero and one. Thus, one could multiply the
output of this function times the number of variables in the data matrix (rounding the result
to an integer) in order to pick a variable at random. Similarly, one could pick a cutoff at
random for any given variable.
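For concreteness, the selection step just described might be sketched in C roughly as follows (this is only an illustration with hypothetical names; the actual program is listed in Appendix V):

    #include <stdlib.h>    /* declares drand48() */

    #define NVAR 164       /* number of candidate predictor variables */

    /* Choose a variable index at random (0 .. NVAR-1).  Truncating
       drand48()*NVAR to an integer gives each index equal probability. */
    int random_variable(void)
    {
        return (int)(drand48() * NVAR);
    }

    /* Choose a cutoff at random from the nvals observed values of a
       variable, stored in vals[0 .. nvals-1]. */
    double random_cutoff(const double *vals, int nvals)
    {
        return vals[(int)(drand48() * nvals)];
    }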
Briefly, the computer picked an initial set of 16 questions at random, and tried
dropping and replacing questions randomly to check for any improvement in the learning set
misclassification error. The event of dropping a question at random and trying out a new
random question in its place was called a mutation (a reference to genetic or evolutionary
algorithms, of which this algorithm is a very crude sort). Mutations that produced
improvement were kept, while mutations that did not were dropped. At some point, the
search always resulted in a set of questions that could no longer be improved with a single
step. That is, the search never continued indefinitely; eventually, the set of 16 questions
reached a state in which no single question could be replaced in a way that improved the
error. This could be checked by instructing the computer to try replacing each of the 16
questions one at a time, each time trying out all possible candidate questions in its place. By
exhaustively searching in this way, it was always possible to check whether a given set of
questions could no longer be improved by the random search method above. Here, such a
set of questions will be called an absorption point. This search algorithm is described in
Chapter 1 as the random search algorithm, or RSA.
To define this algorithm as precisely and carefully as possible (so that a reader need
not interpret the C code in the appendices directly), establishing several definitions is helpful.
Let V be an index of the set of all predictor variables in the data. (Here, this is just taken as
the sequence from one to 164, with the 164 "X" variables indexed arbitrarily.) Let U_i be the
set of all unique possible values (which were always real-valued) for the ith variable in V, and
let U_i' be the set U_i without its smallest element. Let S be a set of two comparison operators,
{<, ≥}. Define a question as any 3-tuple of elements of the form (v, u, s), where v is any
element of V, u is any element of U_v, and s is any element of S. Define a question subset as
any set of questions combined with the "AND" operator. Define a question model as any set
of question subsets combined with the "OR" operator.
Now define a mutation as a question that is randomly generated in the following
manner: Choose an element s at random from S, and choose v at random from V (here "at
random" always implies a simple random sample of size one). If s is "≥", choose u at
random from U_v; else, if s is "<", choose u at random from U_v'. Note that since V, S, and the U_v
for all v in V are directly observed and known, the joint distribution of (v, u, s) for a mutation
is completely described by this definition. Note also that the elements of (v, u, s) were
correlated within a single mutation, but that the mutations were always generated
independently of one another. Based on the structure of V and the U_v for all v in V, it was
determined that 3,248 unique mutations were possible. Note that generating a mutation as
defined above is different from choosing from the set of all unique possible mutations at
random. Finally, define an absorption point as any question model that cannot be improved
(meaning that no lower misclassification error can be achieved) by replacing any single
question in the model with any possible mutation.
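Translated into code, these definitions are quite compact. The following C fragment is a sketch only (with hypothetical type and function names, not the listing in Appendix V) of one way to store a question as the 3-tuple (v, u, s) and to evaluate a question model for a single respondent:

    #define OP_LT 0    /* "<"  */
    #define OP_GE 1    /* ">=" */

    typedef struct {       /* a question is the 3-tuple (v, u, s) */
        int    v;          /* index of the predictor variable     */
        double u;          /* cutoff value                        */
        int    s;          /* comparison operator, OP_LT or OP_GE */
    } Question;

    /* Does respondent x (a vector of raw predictor values, missing codes
       included) answer "yes" to question q? */
    static int answers_yes(const Question *q, const double *x)
    {
        return q->s == OP_LT ? (x[q->v] < q->u) : (x[q->v] >= q->u);
    }

    /* A question subset is a group of questions joined by AND; a question
       model is a set of subsets joined by OR.  Here q[] holds all the
       questions in order, and size[i] gives the number in subset i. */
    int model_flags_high_risk(const Question *q, const int *size, int nsub,
                              const double *x)
    {
        for (int i = 0; i < nsub; i++) {
            int all_yes = 1;
            for (int j = 0; j < size[i]; j++)
                if (!answers_yes(&q[j], x)) { all_yes = 0; break; }
            if (all_yes)
                return 1;      /* captured by this subset */
            q += size[i];      /* move on to the next subset */
        }
        return 0;
    }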
Now the RSA (random search algorithm) may be described for a full model of four
subsets offour questions each as follows (it is easily generalized to a model of any structure):
1. Generate a "seed model", Start by making a model that contains four question
140
subsets with four questions each by independently generating 16 mutations and
grouping them randomly into four subsets. Once this 4x4 model is built, check to
see if anyone of the four question subsets captures fewer than Nmin respondents in
the learning set (where N
min
is a small integer, e.g., for Set B it was set at 25; CART
typically sets it at 5). If such a subset exists, discard the model and generate a new
one. Repeat this process as necessary until all four question subsets choose at least
N
min
respondents. Once this is achieved, designate this model as the seed model,
and calculate its misclassification error as applied to the learning set.
2. Starting with the first question subset in the seed model, choose a question from
that subset at random to be replaced. Generate a mutation, and replace the chosen
question with the mutation. Designate this the new model. Calculate the
misclassification error of the new model as applied to the learning set. If the new
question subset does not identify at least N_min respondents, assign the new model an
error of infinity.
3. Compare the misclassification error of the new model to that of the seed model.
If the error of the new model is lower than that of the seed model, keep the new
model and discard the seed model. Otherwise, keep the seed model, and discard the
new model. Designate whichever model is kept as the seed+1 model.
4. Use the seed+1 model to repeat step 2, mutating the seed+1 model just as the
seed model was mutated. The only difference should be that the question to be
replaced should be chosen from the second subset in the model. Then repeat Step
3 to obtain a seed+2 model. Repeatedly cycle through steps 2-3, each time moving
to the next question subset until the last has been tried (when the next cycle starts
again with the first subset). Repeat this process until an absorption point has been
reached.
In this way, the algorithm cycles regularly through the subsets in the model, each time
choosing a question at random to be replaced, and generating a random mutation to replace
it. Models that produced lower error were kept, and the mutation process continued
until no further improvement was possible through the replacement of any one
question. Again, the RSA always happened upon an absorption point if the search was
allowed to run long enough (e.g., about a million mutations for a model of 16 questions).
This was not a proven result, but out of tens of thousands of runs of the RSA on the present
dataset (usually using fewer than 30 variables in a model), the algorithm never avoided
absorption. (Although if one started with large enough models, it would take an extremely
long time; millions of mutations were required to reach absorption by the RSA for a model
of size 30, with the present data). Here, the computer was instructed to check for absorption
after 400,000 mutations for a model with as many as 16 questions. Allowing 100,000
mutations per subset was a rough rule of thumb that was usually employed across different
model sizes. The algorithm was always allowed to continue with the RSA undisturbed if
absorption was not achieved, so the check for absorption made no difference to the ultimate
outcome of the RSA. (However, later improvements to the algorithm used information from
the check for absorption to speed up the search, as described in the section below).
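Stripped of bookkeeping, the body of the RSA can be sketched in C as follows. This builds on the hypothetical structures above; the routines for generating mutations, counting the respondents captured by a subset, computing the learning set error, and testing for absorption are assumed to exist elsewhere, and the fragment is an illustration rather than the Appendix V code:

    #include <stdlib.h>                    /* drand48()       */
    #include <math.h>                      /* INFINITY        */

    #define NMIN 25

    extern Question random_mutation(void);
    extern double   learning_set_error(const Question *q, const int *size, int nsub);
    extern long     subset_count(const Question *q, const int *size, int nsub, int which);
    extern int      at_absorption_point(const Question *q, const int *size, int nsub);

    /* One run of the RSA for a model of nsub subsets with sizes size[],
       its nq questions stored consecutively in q[]. */
    void rsa(Question *q, const int *size, int nsub, int nq)
    {
        /* Step 1: generate seed models until every subset captures at
           least NMIN learning-set respondents. */
        int ok;
        do {
            for (int k = 0; k < nq; k++)
                q[k] = random_mutation();
            ok = 1;
            for (int i = 0; i < nsub; i++)
                if (subset_count(q, size, nsub, i) < NMIN) ok = 0;
        } while (!ok);
        double best = learning_set_error(q, size, nsub);

        /* Steps 2-4: cycle through the subsets, replacing one randomly
           chosen question at a time; keep the mutation only if the error
           falls.  (The real program tests for absorption only after
           several hundred thousand mutations, not at every step.) */
        for (int i = 0; !at_absorption_point(q, size, nsub); i = (i + 1) % nsub) {
            int first = 0;
            for (int s = 0; s < i; s++) first += size[s];
            int      k   = first + (int)(drand48() * size[i]);
            Question old = q[k];
            q[k] = random_mutation();

            double err = subset_count(q, size, nsub, i) < NMIN
                             ? INFINITY
                             : learning_set_error(q, size, nsub);
            if (err < best) best = err;    /* keep the new model          */
            else            q[k] = old;    /* otherwise restore the old   */
        }
    }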
Define the observed absolute maximum for a question model of a particular structure
(e.g., 4x4) as that model with the lowest possible misclassification error when applied to a
particular dataset (usually the learning set), and assume this maximum is unique. (This
assumption was never contradicted in practice.) By definition, this model is an absorption
point. In any given run of the RSA, the resulting absorption point would usually not be the
absolute maximum. This fact was easily shown as it was usually possible to find an
absorption point with lower error using additional runs of the RSA with independently
generated seed models and mutations.
To show this problem explicitly, consider the very simple model form consisting of
two questions combined with "AND". From the 3,248 possible questions, the number of
possible models was about 5.2 million, so it was possible to search over this space
exhaustively to find the true maximum (and indeed, all absorption points). For instance,
using a misclassification cost of 3.5 with the full dataset, it was found through exhaustive
searches that there were only two possible absorption points. The model with the lowest
error chose the 1,702 respondents who were aged 75 or older and who could not walk half a mile
without help, of whom 520 died (for an error of 0.319). The second absorption point picked
the 1,032 respondents who had used digitalis and could not do heavy work around the house,
of whom 364 died (for an error of 0.321). Neither of these models could be improved by
replacing one question, and these were the only two models for which this was the case. By
independently running the RSA hundreds of times over to search across models of this form,
it was found that the algorithm located the true maximum in only 33% of the searches. The
suboptimal model was found in the other 67% of the runs.
To find the model with the lowest obtainable error for larger model sizes, it was
necessary to repeat the RSA (perhaps hundreds of times), each time starting with an
independently generated seed model, and independently generated mutations. Then, out of
the many resulting models, that absorption point with the lowest possible misclassification
error was chosen (the case where more than one model had the lowest possible error was
never observed). This method of independently repeating the RSA and selecting the model
with lowest error was called the RRSA (repeated random search algorithm), and the notation
RRSA(N) was used to denote N repeated searches with the RSA. As N goes to infinity,
RRSA(N) will find the observed absolute maximum by pure luck since the finite pool of
models will eventually be exhausted solely through the random sampling of seed models.
A central question is whether the required N is small enough so that this absolute maximum
can be found within a reasonable time frame, given the available computing power. For
small enough models (e.g., two or three questions), the algorithm could obviously find the
absolute maximum within a reasonable period. For example, since the RSA could find the
true maximum for the two-question model considered above about a third of the time, one
need only use RRSA(20) to ensure that the algorithm would find the maximum with 99.9%
certainty. Brute force arguments are made below to show that it may have found the
maximum for models with the structure of Set B (which contains a total of seven questions)
by using RRSA(2,000).
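To make the arithmetic behind that statement explicit: if a single run of the RSA locates the true maximum with probability p, then N independent runs all miss it with probability (1 − p)^N, so that

    Pr( RRSA(N) finds the maximum ) = 1 − (1 − p)^N.

With p of roughly 1/3 and N = 20, this gives 1 − (2/3)^20, or about 99.97%, which is consistent with the 99.9% certainty cited above.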
Consider the building of Question Set B again. First, the RRSA(100) was used on the
learning set to select the full model as that model with the lowest misclassification error after
100 independent runs of the RSA. This model was then subjected to backward deletion.
Starting with the full set of 16 questions, the backward deletion was conducted in the
following manner:
1) The first question was dropped from the set, and the misclassification error
of the remaining 15 questions was computed and recorded.
2) Then the first question was put back into the model, and the second
question was dropped. The misclassification error was again computed and
recorded for the remaining 15 questions. This process was repeated until all
16 questions had each been dropped from the model temporarily, resulting in
16 different misclassification errors for 16 models, each consisting of 15
questions.
3) The 16 errors were compared, and the question that produced the smallest
increase in prediction error when dropped was permanently dropped from the
model.
4) The process then returns to step one, this time dropping each of the 15
questions temporarily to find the question that, when dropped, produced the
smallest increase in error for models of size 14. The process continues until
no questions are left.
At the end of the process, one is left with a sequence of 16 nested submodels, one of each
size, all built with data from the learning set only. Once this sequence of models was
obtained, they were each applied to the test set to estimate prediction error, according to
Figure 1.2. (Computer code in C for the algorithms defining both the search and the
backward deletion process is given in Appendix V. This code is generic, and can be applied
to any dataset to predict a binary outcome variable merely by changing the parameters
describing the dimensions of the dataset and the costs of classification.)
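A sketch of that deletion loop in C, again with hypothetical helper routines rather than the Appendix V listing (the bookkeeping needed to keep the "AND"/"OR" grouping up to date as questions are removed is omitted), might read:

    /* error_dropping(q, n, skip) is assumed to return the learning-set
       misclassification error of the current n-question model with
       question 'skip' temporarily left out; record_submodel() is assumed
       to store each nested submodel as it is found. */
    extern double error_dropping(const Question *q, int n, int skip);
    extern void   record_submodel(const Question *q, int n, int skip, double err);

    void backward_delete(Question *q, int n)
    {
        for (int size = n; size > 1; size--) {
            int    drop = 0;
            double best = error_dropping(q, size, 0);

            /* temporarily drop each remaining question in turn */
            for (int k = 1; k < size; k++) {
                double e = error_dropping(q, size, k);
                if (e < best) { best = e; drop = k; }
            }

            record_submodel(q, size, drop, best);

            /* permanently remove the question whose loss hurt least */
            for (int k = drop; k < size - 1; k++)
                q[k] = q[k + 1];
        }
    }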
Once the relationship between model size and test set prediction error was observed
(e.g., by creating Figure 1.2), the structure of this curve was always seen to be of the
same general form: a rapid decrease in error as model size increases from zero
to the medium range (about seven questions for Set B as shown in Figure 1.2), after which
the curve levels out, increasing only very slowly as model size is increased up to 16 questions.
Sometimes (as was the case with all the models presented here) the researcher has the luxury of
examining such a graph. Then judgement can be used to choose the model size that seems
to lie right in the "elbow" of this curve (about 6-7 questions for the curve in Figure 1.2).
Sometimes, as with Set B and Figure 1.2, the chosen model size that seemed to lie just in the
"elbow" was also that model with the absolute lowest error on the test set, or min(PE_TS). If
one wishes instead to select a model size automatically, the standard error for the test set
misclassification error estimates can be estimated as described in Section 2.2. The preferred
model size can then be chosen as the smallest number of questions that has an error
less than min(PE_TS) plus a standard error (see Breiman et al. (1984)). This is sometimes
useful when the test set N (N respondents, not runs of the RSA) is not as large as it is here;
then the test set estimates of misclassification error may possess more variation than is shown
in Figure 1.2. To test whether any given model has a misclassification error that is
significantly better than that of the null model, the same standard error estimate defined in
Section 2.2 can be used to check that the absolute difference between the test set
misclassification error of the preferred model and that of the null model is bigger than two
standard errors. For all models in the appendices, this criterion was met by a wide margin.
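Written as a rule, let PE_TS(m) denote the test set misclassification error of the best submodel of size m, and let SE denote the estimated standard error attached to the smallest of these errors. The automatic choice just described is then

    m* = the smallest m such that PE_TS(m) ≤ min(PE_TS) + SE,

that is, the smallest model whose test set error lies within one standard error of the lowest observed test set error.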
Then the learning set and test set data were recombined into the full data set, and the
RRSA(100) was implemented again, this time in search of the best model of the preferred
size. For example, it was observed by Figure 1.2 that the best model size for predicting
mortality with a relative misclassification cost of 3.5 was seven questions. So, the data were
recombined and the RRSA(100) was conducted for the best set of seven questions, where the
model had the same combination of three subsets (one consisting of three questions, the
others consisting of two). As it turned out, the model obtained as the result of applying the
RRSA(100) to the combined dataset to find a model of seven questions was nearly the same
model achieved by applying the RRSA(100) to the learning set to find a full set of 16
questions and pruning backwards to seven questions. In fact only one of the subsets had
different questions, and it identified most of the same respondents.
This procedure of recombining the data and refitting the full model is a very typical
last step in model building (used by CART, for example). However, it was not entirely clear
whether the strategy was appropriate for the RRSA(N) search method. The problem, as
pointed out by the author's advisor, stems from the fact that the results of the RSA are
random, and so the RRSA(N) method is not guaranteed to happen upon the absolute
maximum for finite N. Thus there was some uncertainty whether the local maximum
discovered from the search on the learning set implies the correct model size for the local
maximum obtained from the full dataset search. A partial solution to this sort of problem
is addressed by cost-complexity pruning, which is treated below. However, it also seemed
(as demonstrated in the results below) that it was possible to obtain a given absorption point
consistently through repeated applications of the RRSA(100) algorithm on a given dataset
with a model structure of about seven questions. Whether this absorption point is the
absolute maximum or a local maximum was not proven here. However, it is noteworthy that
the same model was obtained consistently for many runs of the RRSA(100) (numbering well
into the hundreds at this point).
As mentioned, there was in fact very little difference found between the models built
on the learning set and those built on the full dataset. The questions in Set B, for example, were
almost the same; the questions in Set B.1 and B.2 were identical, and most of the respondents
identified by the learning set question Set B.3 were also identified by the full dataset version
of Set B.3. However, in other applications of the method to other datasets, one may not have
such a large N, and one may not be able to verify that these similarities exist. Then, it may
be advisable to skip this last step of recombining the learning and test sets to estimate a final
model.
Table 4.1 shows the details associated with the construction of Set B, plus Sets A, C,
and an additional set discussed below. For example, to create the questions in Set C, the
same procedure described above was followed, except that a misclassification cost of 1.5 was
used. Again, the RRSA(100) algorithm was applied to the learning set to obtain a full model
of 16 questions (four subsets of four questions each), and the model was pruned backwards
to obtain a sequence of nested submodels. This sequence of submodels was then applied
to the test set, and a curve similar to Figure 1.2 was plotted. It was again judged, by directly
observing this plot, that the best model size contained seven questions, composed of two
subsets of two questions each and one subset of three questions. Then the data were
recombined into the full dataset, and the RRSA(100) algorithm with the same model structure
of seven questions in three subsets was applied to determine the final model as presented in
Appendix I.

Table 4.1 - Specifications for the construction of the question set models

Model¹   Risk level of   Misclassification   Structure of      Algorithm for       Final         Algorithm for
         mortality²      cost³               full model⁴       full structure⁵     structure⁶    final structure⁷
Set A    high            5                   5 subsets of 4    RRSA(100)           3+2+2+2+1     RRSA(100)
Set B    higher          3.5                 4 subsets of 4    RRSA(100)           3+2+2         RRSA(2,000)
Set C    highest         1.5                 4 subsets of 4    RRSA(100)           3+2+2         RRSA(100)
Set J    higher          3.5                 10 subsets of 3   RRESA(200)          3+3+3+2+2+    RRESA(200) with
                                                                                   2+2+2+2+2     cost-complexity pruning

1. The letter of each model refers to one of the question sets in the appendices.
2. The "high" risk model chose persons with a three-year probability of death of about 28%. For the "higher" risk models this estimate was about 38%, and for the
"highest" risk model it was about 46%.
3. This was the cost of misclassifying a true decedent relative to the cost of misclassifying a true survivor.
4. This column refers to the number of questions in the full model and how they are grouped by "AND" and "OR". For example, the full model for Set A had
five subsets of questions combined with "OR". Each of these subsets consisted of four questions combined with "AND", so the model contained twenty questions in total.
5. The RRSA(N) method consisted of N independent runs of the random search algorithm (RSA) defined in Section 4.2. The RRESA(N) method consisted of N
independent runs of the random and exhaustive search algorithm (RESA) defined in Section 4.7. This was first applied to the learning set of respondents.
6. This refers to the grouping of questions for the final model. For example, Set B had three subsets of questions combined with "OR". The first subset had three
questions combined with "AND", and the others each had two questions combined with "AND", for seven total questions.
7. This refers to the algorithm which was applied to the full dataset to achieve the final model. For Set J, the RRESA method was applied to the full dataset with
the full model structure of 30 questions; cost-complexity pruning (see Breiman et al., 1984) was used to find the model structure in the column to the left.
The questions in Set A were constructed in the same way, with the sole exception that
the full model was composed of five subsets of four questions each (i.e., a 5x4 structure for
a total of 20 questions). The main reason for this difference was that it seemed possible to
achieve a lower model error by accommodating a somewhat larger model size when a higher
cost of misclassification was used. Then the subsets needed to identify a much larger number
of decedents (and respondents overall) to satisfy the low-error criterion. This suggested that
a great deal of heterogeneity existed in the data. In later analyses, when the search algorithm
was somewhat modified to achieve much greater speed, it was ultimately determined that
larger models could achieve a reduced test set error (see Section 4.7 below). This was
possible only when the model size was increased with the addition of more subsets, each
with fewer questions. For example, a model of 30 questions (ten subsets of three questions
each) may be a more appropriate full model (with the best pruned submodel containing 23
questions). All of the results presented in Appendix I, however, were achieved solely with
the slower, unmodified RRSA(100) algorithm, which took quite a bit of computing. A model
of 20 questions could require millions of mutations to achieve an absorption point, so at the
time it was not feasible to search over larger model spaces.
Some mention should be made of the reasons for using the particular costs and model
structures used in the creation of the models presented in Appendix I. To the reader, many
of these values (e.g., the relative misclassification costs of 1.5, 3.5 and 5, or the model
structure of four subsets of four questions each) may appear to be chosen from thin air.
Unfortunately, there were no systematic guidelines for choosing these parameters, so most
of them were found with some combination of experimentation and intuition.
The relative costs of misclassification were easy to specify, since the relative
proportions of decedents and survivors were known in the sample, and there was some
knowledge of the death rates among the smallest, highest risk subpopulations.
Experimentation combined with rough hand calculations of Bayes' rule then suggested a
rough misclassification cost which would identify a given proportion of decedents. For
example, when running the RRSA with a cost of unity, Bayes' rule suggests that the algorithm
should attempt to find regions of sample space that have death rates higher than 50%. It was
observed that only a few respondents in the test set (accounting for perhaps 2-3% of all
deaths) could be identified with such high death rates. It was thought that models that could
identify larger numbers of deaths were of more interest, so Bayes' rule suggested a level of
misclassification cost higher than one was needed to capture more deaths.
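One way to write out the hand calculation alluded to here: if misclassifying a true decedent costs c times as much as misclassifying a true survivor, then labeling a group of respondents as high risk lowers the expected cost only when the group's death rate p satisfies c·p > 1 − p, that is, when

    p > 1 / (1 + c).

For c = 1 this threshold is 50%, as noted above; for the three costs eventually used it is about 40% (c = 1.5), 22% (c = 3.5), and 17% (c = 5), which accords with the death rates of roughly 46%, 38%, and 28% observed in the groups chosen by Sets C, B, and A (see Table 4.1).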
On the other hand, it was also observed that when a cost of greater than about seven
was used, many more respondents were identified as high risk by the algorithm (e.g., more
than a third of all survivors). These models were not specific enough to be useful as
predictors of true "high risk"as advertised. It seemed that three different models with three
150
levels of risk seemed to be a reasonable alternative to a single level model. These levels
needed to be somewhere between one and seven, so three levels of misclassification cost
within this range were picked: 1.5,3.5 and 5. These three models captured roughly 20%,
40% and 65% of decedents, so the observed levels of sensitivity were evenly spread across
the range of interest. That is to say, a model identifying much less than 20% of deaths (on
the very far left, lower part of the ROC curve) was not particularly interesting for this
analysis; nor was a model that identified much more than 65% of decedents (as it would also
misclassify many survivors, being on the far right side of the ROC curve). However, the
three models built at the three given costs seemed to cover the remaining range of the ROC
curve quite well.
The structure of the full models (e.g., the decision to use four subsets of four
questions for Sets B and C) was also determined with some experimentation and by the
limitations of the available computing power. It was decided, somewhat arbitrarily, to run the RSA at
least 100 times on each of the learning set and the full dataset for a given model structure, so the
computer needed to be able to find an absorption point within a day to get a final model
within a month or so. (Note that since the RRSA ran the multiple RSA's independently, more
than one computer could always be used, if available, to reduce the total search time). This
dictated the use of a full model of less than 20 questions. It was not known at the time
whether more AND or OR operators should be used, so roughly the same number of each
was implemented (suggesting the 4x4 or 4x5 model).
The only other choice left to specify the RRSA(N) algorithm was the value of N_min,
the smallest number of respondents allowed to be identified by any given question subset.
CART, for example, is designed so that by default no single terminal node identifies fewer
than five cases (although the parameter is adjustable). In the present case, considering the
very large number of variables and the large size of the dataset, raising this limit seemed
reasonable, so N_min was set at 25 for all three models in Appendix I. In practice, the subsets
in the final models identified many more than 25 respondents (see Chapter 5 for
these details).
One big advantage of the nonparametric searching method described above is that the
program can adeptly handle a very large dataset, and it needs absolutely no recoding of the
data. Also, the computer could consider not only more variables but also many different
ways of splitting each variable. A formulaic
approach requires that splits (dummy variables) and missing values be defined in advance
to accommodate the linear basis of the equation (although one might use a similar search
algorithm to select from many regression equations). Since substantive interpretations were
not the primary goal of the model, missing values were allowed to remain coded as "999"
with no further treatment, so that they were treated just as any other value. Furthermore,
since all mutations were chosen randomly from the full range of all possible variables and
variable values with equal weighting, it was possible to claim that the splits were chosen in
a completely objective, systematic, and well-defined manner.
4.3 Performance of the search algorithm
There were three central questions of interest concerning the efficacy of the search
algorithm. First, it was clearly possible to find an absolute maximum for sufficiently small
sets of questions (e.g., two or three questions as a single model). It also seemed that because
of the extremely large number of possible models inherent in a search over larger models
(e.g., 40-50 questions in a single model), one would probably not be able to locate an
absolute maximum within a reasonably short time. Thus, one would like to know the model
size at which, given the present computing power, one can find an absolute maximum within
a reasonable length of time. Of particular interest is whether it is possible to find an absolute
maximum for models of the size presented in Appendix I (e.g., less than 16 questions or so),
and whether the models given in Appendix I are in fact such maxima. Secondly, there was
an interest in exactly how long such a search needed to run in order to find the maximum. Lastly,
one would also like to know how frequently the absolute maximum would be achieved, if
indeed a true maximum was found.
Suppose one starts with a relatively compact model (e.g., that given in Appendix I
as Set B). This set of questions consisted of a simple 2+3+2 structure (3 sets of questions
combined with "OR", with the sets consisting of2, 3 and 2 questions combined by "AND")
for a total of 7 questions. It used a relative misclassification cost of 3.5. At the time of this
writing, the RSA search across all models of this type (using the full dataset) has been
repeated independently more than 2,000 times; this was the equivalent of 20 independent
runs of RRSA(100). The observed maximum (defined as the result of the RRSA method,
which again is not proven to be an absolute maximum) of Question Set B was achieved in
almost exactly 5% of the RSA searches. When the RSA searches were grouped into 20
groups of 100 searches each, as if 20 RRSA(100) searches had been conducted, it was
observed that Set B was achieved in every one of the RRSA(100) searches. Thus, although
the RSA was inherently random and unlikely to settle into a single absorption point, the
RRSA(N) algorithm was nearly deterministic for an N of at least 100. Suppose, for example,
that any single run of the RSA has a 5% chance of finding the observed maximum of Set B,
and a 0% chance of finding a better maximum. (This is the case as estimated here from
thousands of repeated runs of the RSA.) Then the chance that RRSA(100) finds Set B as the
observed maximum is equal to 1 − (0.95)^100 = 99.4%. For RRSA(200), there is a 99.997%
chance that the algorithm will find Set B.
For this particular model structure, it seems probable that this observed maximum
was also the absolute maximum. It was possible through repeated applications of RRSA(N)
with large N to achieve the observed maximum an indefinite number of times, without ever
finding an improvement or settling on a worse model. If the true maximum were not being
achieved, one would expect that, with enough continued searching, one would eventually
find some degree of improvement. As the number of independent RSA runs goes
to infinity, the probability that the absolute maximum is found goes to one. This is because
the seeds themselves are chosen randomly from a finite pool of all possible models, so
eventually one will reach the maximum by pure chance. However, the author has left the
RRSA(100) search running continuously for multiple months, using more than 2,000 different
seeds. Ever since the observed maximum of Set B was found (which occurred within the first
25 or so searches), no additional improvement has ever been found.
This argument by brute force may become increasingly convincing with more
searches; in fact it seems possible in this case to catalog all the possible absorption points
that can result (not just those with the lowest error). There appears to be a relatively small
number of such points (less than 100 in all) for the given model structure. Thus, at some
point the search will no longer be able to find any additional absorption points (much less
any that give improvement), and all possible absorption points will have been
achieved more than several hundred times, for example. At this point, it would be possible
to estimate a probability function for all absorption points having a mass greater than 0.1%,
for instance.
The median number of mutations required to achieve any given absorption point for
the model size of seven questions and a cost of 3.5 in one given run of the RSA was about
42,000, and the average was roughly 111,000. However, the median number of mutations
required was much larger in those RSA searches where the observed maximum was obtained:
319,000, nearly eight times as large (the reason for this is discussed below). Also, the
distribution of the numbers of mutations required to achieve absorption was heavily skewed,
with an extremely long right tail. Figure 4.1 shows a histogram of the number of mutations
required to achieve absorption for about 400 random absorption points obtained from 400
independent runs of the RSA (where the total number of mutations has been transformed by
the base 10 logarithm). The range of values was amazingly wide: the smallest number of
mutations required was 1,160 and the largest was more than 1,139,000!
Also of interest was the rate at which an average run of the RSA converged to an
absorption point. Figure 4.2 shows the cumulative number of successful mutations (those
that achieve a reduction in learning set error) against the cumulative number of mutations for
three random searches. Interestingly, only about 40-50 successful mutations were required
to reach an absorption point in these three trajectories (the median for all trajectories was
almost exactly 50, no matter whether the observed maximum was achieved). Figure 4.3
shows about 200 such trajectories. These figures suggest that the process has some degree
of regularity despite the extremely wide range of possible outcomes.

Figure 4.1 - Histogram of the numbers of mutations required to achieve absorption (horizontal axis: log, base 10, of the number of mutations).

Figure 4.2 - Total number of successful mutations by total number of mutations, trajectories for 3 random absorption points (horizontal axis: log, base 10, of the total number of mutations).

Figure 4.3 - Total number of successful mutations by total number of mutations, trajectories for 200 random absorption points (horizontal axis: log, base 10, of the total number of mutations).
To see why this sort of pattern arises, consider a simple probabilistic model of the
search. After starting with a random seed, a mutation is generated and there is some
probability that this mutation will be successful (that it will achieve a reduction in error); let
this probability be denoted by 1 − π_1, where π_1 is the probability of not succeeding. In the
search the computer draws such mutations randomly from the pool of all possible mutations
until one is successful. Clearly, the probability that one achieves a successful mutation on
the first draw is 1 − π_1. The probability that one achieves success only after two mutations
is π_1·(1 − π_1), the probability that one achieves success only after three mutations is π_1²·(1 − π_1),
and so on. That is, the total number of mutations required before the first success is achieved
(call it Y_mut(1)) has the well known geometric distribution:

    Pr( Y_mut(1) = y ) = π_1^y · (1 − π_1),

which implies that Y_mut(1) has a mean of π_1/(1 − π_1) and a variance of π_1/(1 − π_1)².
Once the first success has been achieved, however, it seems clear that the probability
of drawing the second successful mutation in one more draw is not necessarily equal to (and
in many cases, will be smaller than) 1 − π_1. This is because the pool of successful mutations
may be smaller once the first successful mutation has been found (but not necessarily so).
In any case, it seems that the number of additional mutations required for the second
successful mutation to be achieved (call it Y_mut(2)) also has a geometric distribution. Most
likely, this distribution has a different probability of success (call it 1 − π_2), where π_2 is the
probability of not achieving a successful mutation, once the first successful mutation has
occurred.
Now, suppose that a total of K successful mutations were required before the search
reaches absorption. The total number of mutations required before the search reaches
absorption (call it T_mut) may then be thought of as a sum of all the unsuccessful mutations
and the successes, so that:

    T_mut = K + Σ_{i=1}^{K} Y_mut(i),

where the Y_mut(i) are each geometrically distributed with parameter π_i. If one takes K and
the sequence of parameters π_1, ..., π_K as given, then the Y_mut(i) are independent (but not
identically distributed) and the distribution of T_mut is completely determined. As mentioned,
K for this case was around 50 (between 20 and 80), and the success parameters could also
be estimated quite easily. Figure 4.4 shows the estimated probability of success on the y-
axis, against the number of successful mutations which have occurred. On average, then, the
probability of success starts high (near 0.5) but falls rapidly, asymptotically approaching
zero. It becomes vanishingly small and (in these estimates) reaches zero beyond 90
successful mutations or so, suggesting that absorption before this point is virtually certain.
The pattern of decreasing probabilities explains the nearly-exponential pattern to the
cumulative number of successful mutations by total mutations as shown in Figure 4.3; at
first, success is common, but becomes extremely rare once 40 or so successful mutations have
been achieved. If one were to estimate these sequences of probabilities for each of the
possible values of K (instead of averaging over all values, as in Figure 4.4) and observe the
distribution of the K's, this would completely specify the dynamics of the process.
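One immediate consequence of this representation is that the expected total search length is

    E[T_mut] = K + Σ_{i=1}^{K} π_i / (1 − π_i),

so the very long right tail seen in Figure 4.1 is driven by the last few terms of the sum, where the success probabilities 1 − π_i are close to zero.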
Figure 4.4 - Probability of a successful mutation by number of previous successful mutations, for models having the form of Question Set B in Appendix I (horizontal axis: number of previous successful mutations; vertical axis: probability of a successful mutation).
Unfortunately, these parameters depend heavily on the structure of the data (largely through the
sets V and U_v for all v in V, as defined in the section above), so there is no obvious model to
generalize the process any further.
For pragmatic purposes, some readers may be interested in the amount of real time
required to find the observed maximum. Since, for any given search, there was an estimated
5% chance of achieving the observed maximum, the total number of repeated RSA searches
required before reaching the maximum can itself be modeled geometrically, again with a
parameter π, which here is estimated as one minus the probability of success, or 95%. Then
the average number of RSA searches required before the observed maximum was achieved
was 95%/5% = 19 searches, plus the single successful search, for a total of 20 searches. On an unused
SparcStation Ultra, this required less than 8 megabytes of RAM, and took an average of
about 18 hours to complete. Of course, one does not know beforehand whether the
maximum has been achieved, and so additional computing is undoubtedly necessary.
However, this was at least a short enough period so that the project became feasible.
Now consider the performance of the algorithm for a larger model, in particular
the 4x4, 16-question structure used as the full model before pruning back to obtain Set B.
With twice as many questions in the model, the number of potential models was much larger,
and the computer required more searching time for any single run of the RSA since more
questions had to be handled by the computer at any given point. Then, given the smaller
number of absorption points observed and the larger pool of potential points, it is harder to
argue that an absolute maximum has been found. About 300 different seeds have been used
at this point, and the observed maximum has only been achieved several times, implying that
the mass associated with this absorption point is closer to 1% (compared with 5% for the
small model). However, no improvement has yet been found.
To assess the efficiency of the search algorithm further, a simulated dataset was
constructed by the author's advisor. The structure of the absorption points in the data was
such that there was a global maximum consisting of eight questions as well as multiple local
maxima (all known to the advisor, but unknown to the author). These were embedded in
a dataset of 32 variables and 1,000 observations. No clues about the structure of these
absorption points were given to the author, and the RSA search algorithm used on the data
was completed (and programmed) before the author's contact with the data. Using a full
model with the 4x4 structure above, it was found that the RSA located the absolute maximum
about 15% of the time. It took an average of less than five minutes on those runs that did find
the maximum. The algorithm also found the local maxima quite efficiently. Moreover, the
backwards deletion invariably chose the proper model size. Overall, the performance of the
algorithm on this ersatz data was quite encouraging. An additional simulated dataset
(consisting entirely of noise, unknown to the author) was also constructed by the advisor.
This provided an interesting test for the algorithm, since building a full model on the learning
dataset that appeared to predict the noise with some accuracy was quite possible, with only
a modest amount of searching. However, when the fitted models were applied to the test set,
the data were quite readily identified as noise. This was discovered simply by calculating
a standard error for the test set estimate of the misclassification error (as suggested in
Chapter 2). Then one could calculate whether the estimate was within two standard errors
of the null misclassification rate.
4.4 Linear discriminant analysis
For linear discriminant analysis, it was possible to handle a very large number of
variables in the full model, but the model required extensive recoding of the data, and some
preselection of variables. Specifically, it was necessary to reassign missing values, to
convert categorical or nonordinal variables into indicator (dummy) variables, and to
eliminate some variables that were highly correlated. After removing 16 variables that were
one of a pair of variables with a correlation coefficient greater than 0.9, there were 150
variables remaining out of the 166 variables in the penultimately full dataset. For variables that
were clearly ordinal or interval in value (e.g., weight in pounds, height, blood pressure, pulse,
frequency of alcohol consumption, and others) missing values were assigned the mean of the
nonmissing values. For many of the more categorical variables (particularly those pertaining
to functionality), all values including the missing were simply recoded as zero or one; for
example, the responses to the question "Was there any time in the past 12 months when you
needed help from a person or equipment to take a bath?" were recoded as zero if the answer
to the question was "no", and one if the answer was "yes", "unable to do", or "missing". (In
most of these cases the number of missing values was small.) It should be noted, however,
that much knowledge was gleaned from the above question-choosing process concerning
which splits were effective. Certain splits of responses in many categorical variables were
known to work because they had been observed as the output to the method of Section 4.2;
in this way, the linear discriminant analysis was probably afforded some advantage over the
above method, which used no such information.
Using the set of 150 recoded variables for the learning set data, it was possible to
estimate just such a "full" model of size 150 using linear discriminant analysis as described
in Chapter 2.³ With such a large model, a thorough backward deletion was not feasible.
Instead, a close approximation was used. By standardizing all the variables to have a mean
of zero and standard deviation of one, the size of the coefficient for each variable was taken
to be a good measure of a variable's importance in the model. Thus, the single variable
(from the full model of 150 variables) with the smallest standardized coefficient was dropped
from the model, and the parameters were reestimated. Again, the variable with the smallest
coefficient was dropped, and so on until only 30 variables remained. At this point, a more
thorough deletion process was implemented. Just as in Section 4.2, each of the 30 variables
was temporarily dropped from the model, and the error was estimated for each of the 30
submodels of size 29 obtained in this way. The variable that yielded the smallest increase
in error when temporarily dropped was then permanently dropped from the model to give the
best model of size 29. Again, the deletion was repeated in this way until no variables
remained, resulting in a sequence of 30 nested submodels, one of each size, built only with
the learning set data.
Next, this sequence of 30 models was applied to the test set, and the accuracy of each
model was recorded. Figure 4.5 shows the results of this application, where the y-axis shows
the accuracy of the model as measured by the correlation between the discriminant variable
( or z-score) and the true outcome of death or survival. The higher, smoother curve shows
the correlation coefficient as it was estimated by the learning set, and the lower curve shows
the result of applying the models to the test set (analogous to an upside-down version of
Figure 1.2).

³ The Splus function used here for the linear discriminant analysis was discr().

Figure 4.5 - Accuracy of the linear discriminant models (correlation between the discriminant score and the true outcome) by number of variables in the model, for the learning set and the test set.

As in Figure 1.2, the learning set measure of accuracy could always be
improved by adding yet another variable to the model. However, when the model is applied
to the test set, there is a point at which the addition of variables fails to improve (or even
decreases) the accuracy of the model. By observing Figure 4.5, it was decided that the best
linear discriminant model size contained 15 variables, and the accuracy of this model on the
test set was recorded. Then, the data were recombined into a full data set, and the best model
of size 15 was reestimated, yielding the "preferred" linear discriminant model.
4.5 Logistic regression
Probably the most commonly used multivariate model for predicting mortality (and
many other binary events) is the multivariate regression model, typically employed with the
logit or probit link functions. Under the binomial scenario (i.e., with the logit link function
log(p/(1−p)) and the variance function p(1−p)/N), one obtains the most frequently used
form, logistic regression. For this method, missing values and categorical variables need to
be recoded to conform to the regression equation's linear basis. So the same recodings as
used for the linear discriminant analysis were employed (i.e., assigning the mean to missing
values of ordinal/interval variables, and splitting categorical variables into indicators). Also,
due to memory limitations the computer was unable to estimate the large model of all 150
variables used in the linear discriminant analysis. So the set of 30 variables obtained after
the first deletion pass in the linear discriminant analysis was used as the "full" logistic
regression model. In this way, the logistic regression method was granted an even larger
advantage than that given to the linear discriminant analysis. A great deal of knowledge
about which variables were important was gained in the linear discriminant deletion from
167
150 variables to 30 variables, in addition to knowledge about how to recode variables
effectively from the question set method.
To estimate the model parameters and variances, the Splus function glm() (standing
for generalized linear models) was used with the usual binomial model. The parameters
were fit with maximum-likelihood estimation using an iteratively reweighted least squares
algorithm.⁴
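As a minimal sketch of the kind of call involved (the data frame and variable names below are hypothetical placeholders, not the actual EPESE variable list):

    # Fit the "full" binomial model on the learning set; glm() maximizes the
    # likelihood by iteratively reweighted least squares.
    full.fit <- glm(died ~ age + sex + digitalis + walk.half.mile + self.health,
                    family = binomial, data = learning)
    summary(full.fit)    # coefficients with asymptotic standard errors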
There were a number of algorithms for variable selection that could be applied
to the fitted full model at this point. The most common (and probably the most suspect)
method for stepwise backward deletion involves computing p-values for each coefficient,
dropping the variable with the lowest t-statistic at each pass until all the remaining
coefficients are statistically significant. (For the logistic regression model, this involves
computing standard errors from the asymptotic variance-covariance matrix of the coefficient
estimates.) The problem with this type of backward deletion is well known: "weeding" or
selecting out variables according to statistical inferences invalidates additional estimates of
p-values, since some coefficients are bound to be significant given a large enough number
of variables. The result is that the suggested model size is often larger than optimal.
Some statistics have been suggested which attempt to attach some "penalty" or cost
to each additional variable in a model, or to select out a model size based on some more
sophisticated criterion than t-statistics. Mallows' Cp and Akaike's Information Criterion both
attempt to pick an optimal model size (the AIC statistic actually being a likelihood version
of Cp). Here the AIC measure was quite easy to implement, with the Splus backward
deletion algorithm step(). The AIC is essentially a linear combination of the model deviance
4 See Chambers and Hastie (1992).
(a measure of model error analogous to the residual sum of squares of the usual OLS model)
and the number of degrees of freedom in the model.⁵ Specifically,

AIC = deviance + 2φ·df,

where φ, assumed to be nonnegative, is known as the "dispersion parameter" (the Splus
algorithm sets this equal to one by default). The deviance gauges the error in the model, and
the second term in the expression assigns a penalty of 2φ to each additional coefficient. As
one drops a coefficient from the model, the deviance increases (just as R² would decrease for
the OLS model) but there is also a decrement of 2φ; one thinks of the increase in deviance
as showing an increase in bias, while the decrease of 2φ reflects the decreased variance of
the smaller model. Thus, the goal is to find the model size that minimizes this criterion.
The deletion worked by starting with the full model as estimated on the learning set,
and dropping the variable that decreased the AIC the most, repeating this deletion until the
AIC could not be decreased by another deletion. It would have been possible to continue the
deletion to obtain a full sequence of nested models, applying each to the test set to determine
the optimal model size as above. However, it was felt that an interesting comparison would
be obtained by using the more standard method of model selection. The idea was to use
logistic regression much as it is commonly used in conventional practice (e.g., with a typical
"canned" algorithm for model selection), to compare it with the more sophisticated test set
techniques described above. The Splus automatic backward deletion algorithm step() is a
good example of the state of the art for such algorithms.
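A sketch of how such a backward deletion might be invoked is given below; it assumes the hypothetical fitted model "full.fit" from the earlier sketch, and the hand computation at the end simply restates the AIC definition above with the dispersion parameter fixed at one.

    # Backward deletion by AIC.
    reduced.fit <- step(full.fit, direction = "backward", trace = 0)

    # The criterion of the text, AIC = deviance + 2*phi*df, with phi = 1 and df
    # counted as the number of coefficients in the model.
    phi <- 1
    aic.by.hand <- deviance(reduced.fit) + 2 * phi * length(coef(reduced.fit))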
On the first pass, using only the learning set data and the 30 variables from the linear
5 See Chambers and Hastie (1992).
discriminant analysis, the step() function (with the dispersion parameter set at unity) picked
a regression model containing 26 coefficients, including the intercept. All coefficients had
asymptotic t-statistics that gave p-values significant beyond the 0.05 level. However, on
applying the model to the test set, this model size was evidently too large, and indeed, upon
recombining the data and reestimating the model of size 26, some coefficients were no longer
significant. Thus the step() function was then implemented using the entire dataset, and this
yielded a model with 24 coefficients. All p-values for this model were significant beyond
0.05, except two, which were significant beyond 0.10. It was suspected that this model was
still too large, particularly since it yielded almost the same level of accuracy (in both the test
set and the learning set) as the much smaller linear discriminant model (see Chapter 5).⁶ This
is a common criticism of Mallows' Cp or the AIC criterion; in retrospect, it might have been
wise to raise the dispersion parameter to obtain a smaller model.
4.6 The CART algorithm for model selection
Just as with the methods above, there are several parts to the CART method for
devising splits in the form of a classification tree: 1) the building of the full sized tree with
the learning set; 2) the pruning of the tree through backward deletion to obtain a sequence
of nested submodels; 3) the application of these models to a test set to determine the optimal
model size; 4) the use of the full, undivided sample to estimate the best model of the optimal
size. Since the algorithm could accomplish this all automatically with a test set method, it
was possible simply to provide the program with the undivided data (N = 10,294), allowing
6 Interestingly, although the standard errors for the coefficients were smaller when the full dataset
was used (due to the 50% larger N compared to the learning set N), many of the coefficient estimates were
shrunk towards 0, resulting in smaller t statistics.
it to draw its own test set. As with logistic regression, the program was not able to take as
many as 150 variables, and so the set of 30 variables culled from the first pass of the linear
discriminant analysis (Section 4.3) was used.⁷
An advantage to CART over logistic
regression and linear discriminant analysis, however, is that it is essentially possible to use
the raw dataset (including missing values) as the input, just as with the question set method
described above. The method for dealing with missing values involves finding substitutes
or "surrogate" splits that give a very similar division as the split that is missing. The
problem is that there is no guarantee that an accurate surrogate split exists; frequently
there would not be one. For this reason, it was decided to recode the missing values as
above.
The building process for the CART algorithm started by forming the split at the top
of the tree (the root node), conducting an exhaustive search of all possible splits and
choosing that split that best separated the survivors from the decedents. (The improvement
in the Gini index of heterogeneity is used to measure the success of the division). This root
node then yielded two subgroups of respondents, and for each of these subgroups, another
exhaustive search was conducted for the best split. This process was repeated until the
"full" sized tree (Tmax) was constructed; this tree usually contained well more than 100 splits,
depending on the parameters supplied to the program. Since the searches for splits were
exhaustive, but only within each given subgroup of respondents as defined by the succession
of splits, the process was labeled one-step optimal. That is, the optimal split was found only
7 The CART program as designed by the original creators was used for this analysis. There is also
a version of classification trees available within the Splus software package (which possibly could have
handled a larger number of variables), but after extensive experience with both, the CART program was felt
to be superior in most other respects.
for each stage of splitting; there was no guarantee that the overall combination of splits that
formed the tree as a whole would be the optimal combination.
Next the algorithm performed a backward deletion on this tree to obtain a sequence
of subtrees. The idea behind the deletion process was labeled cost-complexity pruning. The
idea was that pruning would be a process of dropping splits, starting from the bottom of the
tree and pruning upwards toward the root node. To determine which splits should be
dropped, each possible subtree was gauged by a cost-complexity measure. This was equal
to a linear combination of the learning set misclassification error of the tree and the number
of terminal nodes in the tree (the number of terminal nodes being equal to the number of
splits + 1):

cost-complexity = misclassification error + α · (# of terminal nodes in tree)

where α denotes the cost-complexity parameter. The construction of this measure is highly
analogous to the AIC measure in that it combines a measure of the goodness of fit of the
model (measured by deviance in a logistic regression model and misclassification error for
a tree) and a penalty for each additional coefficient or split in the model.

However, the CART algorithm does not use this directly to choose the overall model
size. Instead, the parameter α was scaled upward from zero along a continuum of values,
and at each stage the subtree that minimized the cost-complexity criterion by pruning from
the full tree was designated as the optimal tree for that value of α. Although the range of
possible α values (α ≥ 0) is a continuum, there is only a discrete number of such subtrees; when α
is fixed at zero, the largest tree possible is allowed, but as α is raised, there are threshold
values at which the splits toward the lower part of the tree are pruned off. As α is scaled
upwards, more splits are pruned from the bottom of the tree. Eventually, α can be raised
enough so that only a very small tree (or no tree at all) remains. This process yields a
sequence of nested submodels of varying sizes (varying numbers of splits, or terminal nodes)
built with the learning set only.
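The selection step can be sketched in a few lines; the subtree sizes and learning set errors below are made-up numbers used only to show how a given value of α singles out one member of the nested sequence.

    # Cost-complexity selection among nested subtrees (hypothetical inputs).
    sizes  <- c(40, 25, 16, 10, 6, 3, 1)                    # terminal nodes per subtree
    errors <- c(0.26, 0.27, 0.28, 0.29, 0.31, 0.34, 0.42)   # learning set error

    best.subtree <- function(alpha)
      which.min(errors + alpha * sizes)   # index of the cost-complexity minimizer

    # Sweeping alpha upward from zero reproduces the nested sequence: alpha = 0
    # keeps the full tree, large alpha favours very small trees.
    sapply(c(0, 0.002, 0.01, 0.05), best.subtree)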
Next, this sequence of models was applied to the test set, and the misclassification
error for each submodel was recorded. Figure 4.6 shows a plot of this error on the y-axis
with the number of terminal nodes on the x-axis. As usual (compare this figure to Figure 1.2
for instance), one observes that the smallest models have a higher error. For the test set, this
error decreases rapidly as the model size increases and then flattens once the tree contains
more than nine splits or questions. (If one tests larger trees, containing 100 splits say, the test
set error is much higher.) Thus the tree with nine splits was chosen as the optimal size, and
the data were recombined into the undivided, full (N = 10,294) dataset, whereupon the
optimal tree with the same cost-complexity parameter was grown. As with the question set
method above, three different trees were grown using three relative misclassification costs,
equal to 2, 3.5 and 7 (see Chapter 5 for an explanation of why the same set of costs was not
used).
4.7 Modifications and additions to the question set method
In latter stages of the dissertation research, a number of improvements were made to
both the RSA and the method of model selection. (None of these changes concerned the
results in Appendices I-IV.) First, as demonstrated in Section 4.3, it was noted that the RSA
was spending a great deal of searching time attempting to locate the last few successful
mutations. It turned out that exhaustive searching was much faster once most of the
successful mutations were achieved.

[Figure 4.6 appears here: misclassification error (y-axis) by number of terminal nodes in the tree (x-axis), estimated on the test set]

The trick, however, was that this exhaustive searching
had to be done in a random way to avoid systematically falling into local maxima. Thus, the
candidate questions were always reordered randomly before conducting the exhaustive search
(and of course, the chosen questions in the existing model were already randomly ordered
at this point). This provided much faster searching times, particularly because of the fact that
some variables contained many possible values (e.g., weight at time of interview took on
hundreds of possible unique values), making it very difficult to find the optimal cutoff value
through a purely random search. This was dubbed the random and exhaustive search
algorithm (RESA) and when N independent runs were made, the notation RRESA(N) was
used.
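The random reordering is the essential ingredient; a schematic version of one exhaustive pass is sketched below, where "candidates" and the scoring routine fitness.if.swapped() are hypothetical placeholders for the candidate question list and the fitness evaluation.

    # One randomized exhaustive pass: try the candidates in random order and keep
    # the first mutation that improves on the current fitness.
    random.exhaustive.pass <- function(candidates, current.fitness) {
      for (q in sample(candidates)) {         # random order guards against
        f <- fitness.if.swapped(q)            # systematically hitting local maxima
        if (f > current.fitness)
          return(list(question = q, fitness = f))
      }
      NULL                                    # no improving mutation: absorption
    }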
Secondly, a change was introduced in the method of pruning. As mentioned above,
one problem was that the searching could result in varying absorption points, perhaps
implying varying model sizes when pruned. Thus it was possible to estimate a model size
that was optimal for the absorption point constructed on the learning set, but suboptimal for
an absorption point constructed on the full, recombined dataset. One solution to this was to
allow more flexibility in the model size via a version of backward deletion called cost-
complexity pruning, which is the method used by CART.
The idea is discussed above: one assigns a cost or penalty to each variable in the
model (called α above), and this is combined linearly with the model error estimated from
the learning dataset. To estimate the optimal level of α, one applies the sequence of models
built on a learning set to a test set, fixing α at that value corresponding to the model size with
optimal test set error. For the question set method, this amounted to attaching a cost of
α to each question in the set. Then the data were recombined into the full dataset, and the
RRSA(N) was used to find a full model. This full model was then subjected to backward
deletion, the error was estimated for each model size, and the cost-complexity associated
with each model was then calculated using the value of α as estimated above. The preferred
model was then chosen as that which minimized the cost-complexity criterion. In this way,
the model chosen on the full dataset was allowed to have a different number of questions
than the model chosen with the learning dataset.
To determine the efficacy of this algorithm, the method was applied to the dataset
contrived by the author's advisor, as discussed in Section 4.3 above. This dataset contained
two local (suboptimal) maxima consisting of six questions each, while the global maximum
consisted of eight questions. Once the data was divided into a learning set and test set, the
RSA was applied to the learning set. It was deliberately allowed to locate one of the
suboptimal maxima (embedded in a full 4x4 model), a sequence of models was estimated
by backward deletion, and α was estimated as the value associated with the six-question local
maximum. Then the data were recombined into the full dataset, and the RSA was allowed to
locate the global, eight-question maximum (embedded in a full 4x4 model). Then backward
deletion was applied to obtain a sequence of submodels, and using the value of α as
estimated on the suboptimal local maximum, the model that minimized the cost-complexity
criterion was chosen. Invariably, this turned out to be the global, eight-question maximum.
The C code for conducting this improved method completely automatically (both the
exhaustive searching, and the cost-complexity pruning) is given in Appendix VII.
These changes in the method were used to compute a much larger model of mortality
with the EPESE data. The fact that the RESA was much faster than the RSA made it possible
to search across much larger model spaces. Based on the above results, it was thought that
the best-sized subsets invariably contained three or fewer questions. It was also found that
when a full model consisted of subsets with no more than three questions, such a model
structure could accommodate more of these subsets. For example, one gains greater
predictive power on the test set by starting with a full model of 30 questions (ten subsets of
three questions each).
To show this, the RRESA(200) algorithm was applied to the learning set with a full
model of 30 questions, using a misclassification cost of 3.5. This full model was then
subjected to backward deletion, and the resulting sequence of nested models was applied to
the test set.⁸
This suggested a model size of 20 to 25 questions. The cost-complexity
parameter for the optimal model size was estimated, and the learning and test sets were
recombined into the full dataset. Then, RRESA(200) was applied to the full dataset to choose
a full model with 30 questions. This full model was then subjected to backward deletion to
obtain a sequence of submodels, and the model that minimized the cost-complexity criterion
was selected out. The resulting model contained 23 questions in ten subsets, and it achieved
a lower prediction error than the most powerful linear discriminant model. See Chapter 5
for a summary of this model.
8 At this stage of the research, the original division of learning set and test set was lost. Therefore,
a new learning set was created by drawing a simple random sample from the full dataset as before. The
only difference was that this random division was repeated until the same proportion of deaths was
observed in both the learning and test datasets. Breiman et al. (1984) suggest that this may result in a more
stable estimate of prediction error.
Chapter 5 - Models for the prediction of mortality
5.1 Simple questions for predicting survival or death
The first set of questions in Appendix I, Set A, consists of 10 questions in total,
broken into five subsets of one, two or three questions each (see Table 4.1). The questions
in this set result in the following classification when applied to the full dataset, as shown in
Table 5.1.
Table 5.1 - Predicted outcome by true outcome, Question Set A,
full dataset estimates
Cell counts, percentages by row, and percentages by column
TRUE OUTCOME
survived died TOTAL
survived 6,743 (92.5%) 546 ( 7.5%) 7,289
PREDICTED (76.2%) (37.7%)
OUTCOME died 2,101 (69.9%) 904 (30.1%) 3,005
(23.8%) (62.3%)
TOTAL 8,844 1,450 10,294
Thus, 3,005 persons in the sample were classified as high risk (predicted as dead) and
of these, about 30% actually died. This classification correctly classifies about 62% of all
deaths and 76% of all survivors, corresponding to a cost-adjusted misclassification rate equal
to:
(546·5 + 2,101) / (1,450·5 + 8,844) = 0.300.
However, since this is computed using the same dataset on which the classification was built,
the estimate is downwardly biased, perhaps to a substantial degree. Therefore, a more honest
estimate is given by the corresponding internal test set estimate, obtained by applying the
same-sized model built on the learning set of 6,862 respondents to the test set of 3,432
respondents. The results of this application are shown in Table 5.2. (These are the same
results summarized in Table 1.2, but they are presented here again for convenience, and with
a little more detail).
Table 5.2 - Predicted outcome by true outcome, Question Set A,
internal test set estimates
Cell counts, percentages by row, and percentages by column
TRUE OUTCOME
survived died TOTAL
survived 2,035 (92.2%) 172 ( 7.8%) 2,207
PREDICTED (69.6%) (33.8%)
OUTCOME died 888 (72.5%) 337 (27.5%) 1,225
(30.4%) (66.2%)
TOTAL 2,923 509 3,432
Interestingly, one finds that when the model was applied to a test set, it captured
slightly more deaths (66%) with a lower specificity of nearly 70%. These were somewhat
different numbers than those obtained in the full dataset estimates. The error rate
corresponding to this result is equal to:
(172·5 + 888) / (509·5 + 2,923) = 0.320
and may be considered a much more honest estimate (i.e., having a much smaller, though
still downward, bias). As discussed, this estimate is not completely unbiased because
multiple observations of the test set were made to obtain the estimate; however, one hopes
that the bias in this estimate is small (and in the validation process, discussed below, this
held true). An approximate estimate for the standard error of this estimate of error (based
on the assumption that the test set is a simple random sample) is 0.012. (For the calculation
of this estimate, see section 5.5 below.)
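For reference, the cost-adjusted error above can be reproduced directly from the cells of Table 5.2; the short sketch below does so (the standard error calculation itself is described in Section 5.5).

    # Cost-adjusted misclassification rate from the Table 5.2 counts.
    missed.deaths   <- 172     # deaths predicted as survivors
    false.positives <- 888     # survivors predicted as deaths
    n.deaths        <- 509
    n.survivors     <- 2923
    cost            <- 5       # cost of missing a death relative to a false positive

    error <- (missed.deaths * cost + false.positives) / (n.deaths * cost + n.survivors)
    round(error, 3)            # 0.32, as above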
Suppose one breaks the high risk respondents down according to each of the five
question subsets. The most respondents were chosen by Set A.2, which classifies all 1,294
digitalis users (12.6% of the sample) as high risk. More than 31% of these persons (406)
died within three years. The second largest group of respondents was defined by Set A.3,
which classified all 1,234 persons who were aged 80 and over and weighed more than 139
pounds at age twenty as high risk. Again, about 31% (381) of these respondents died within
three years. Set A.4, which isolated males weighing less than 168 pounds, and who could
not do heavy work, chose some 751 persons (7.3% of all respondents), of whom 277, or 37%,
died within three years. Set A.1 chose 596 persons, and 234 (or about 39%) of them died.
Set A.5, which chooses respondents who cannot see well and need help to walk across a
room, chose the smallest number of respondents (311), but these persons also had the highest
death rate (44%, or 137). Interestingly, then, an inverse correlation existed between the
number of respondents chosen by a subset and the death rate of the chosen respondents.
About 908 respondents were chosen by more than one question subset, and 389
(43%) of these respondents died. Some 232 were chosen by more than two subsets (of whom
118, or 51%, died), and 38 were chosen by more than three subsets (of whom 22, or 58%,
died). Thus it seemed that the effect of being classified by each of the different subsets was
cumulative when a respondent was classified by more than one subset.
Interestingly, there were more females than males chosen as high risk by Set A,
though this was partly because there were more females in the sample. Some 1,452 males were
chosen (36% of all males), and 1,553 females were chosen (25% of all females). Figure 5.1
shows the distribution of chosen respondents by age and sex. As expected, the distribution
is quite elderly, but there are still many younger respondents chosen as high risk (nearly a
third of the respondents were younger than 75).
Consider Question Set B in Appendix I, consisting of seven questions in total, made up of
three subsets combined with "OR", with each subset containing two or three questions
combined with "AND". This model was obtained by searching on the full dataset (see Table
4.1 again). Table 5.3 shows the classification achieved by applying this question set to the
full dataset:
Table 5.3 - Predicted outcome by true outcome, Question Set B,
full dataset estimates
Cell counts, percentages by row, and percentages by column
TRUE OUTCOME
survived died TOTAL
survived 7,838 (90.3%) 844 ( 9.7%) 8,682
PREDICTED (88.6%) (58.2%)
OUTCOME died 1,006 (62.4%) 606 (37.6%) 1,612
(11.4%) (41.8%)
TOTAL 8,844 1,450 10,294
This model selected out 1,612 respondents as high risk (about 16% of all respondents), and
about 38% of them died within three years. The model error corresponding to this
classification was equal to:
(844·3.5 + 1,006) / (1,450·3.5 + 8,844) = 0.285,

again, a downwardly biased estimate since it was computed using the same data on which
the model was built.

[Figure 5.1 appears here: number of respondents chosen as high risk by Question Set A, by age group and sex]

In this case, however, when the same-sized model was built on the
learning set, it was nearly the same model (with the exception of one question). When
applied to a test set, the results were not dramatically different, as shown in Table 5.4:
Table 5.4 - Predicted outcome by true outcome, Question Set B,
internal test set estimates
Cell counts, percentages by row, and percentages by column
TRUE OUTCOME
survived died TOTAL
survived 2,575 (89.6%) 300 (10.4%) 2,875
PREDICTED (88.1%) (58.9%)
OUTCOME died 348 (62.5%) 209 (37.5%) 557
(11.9%) (41.1%)
TOTAL 2,923 509 3,432
Nonetheless, the accuracy is slightly lower than estimated from the full dataset, as can be
seen by calculating the error rate for the test set, equal to:
(300·3.5 + 348) / (509·3.5 + 2,923) = 0.297,
which can be regarded as a much more honest measure, though still (hopefully slightly)
biased. An approximate standard error for this estimate (as calculated below) is equal to
0.0125.
When the high risk respondents were broken down by the three different subsets, it
was found that Set B.2 (which selected out the same 751 respondents chosen by Set A.4)
chose the greatest number of respondents. Set B.1 picked out the next largest group,
classifying 650 persons as high risk (those digitalis users who could not walk a half mile),
of whom 267 (41%) died within three years. Set B.3 defined the smallest group, those 413
respondents who were age 80 or older and could not state their mother's maiden name, of
whom 181 (44%) died. There were 196 respondents classified as high risk by more than one
of the question sets, and of these 113 (58%) died. There were only six persons chosen by all
three questions, but all six of these persons died! The question subsets in Set B, then, also
seemed to have a cumulative effect on those respondents chosen by more than one subset.
Table 5.5 - Predicted outcome by true outcome, Question Set C,
full dataset estimates
Cell counts, percentages by row, and percentages by column
TRUE OUTCOME
survived died TOTAL
survived 8,580 (88.1%) 1,163 ( 11.9%) 9,743
PREDICTED (97.0%) (80.2%)
OUTCOME died 264 (47.9%) 287 (52.1%) 551
( 3.0%) (19.8%)
TOTAL 8,844 1,450 10,294
There were many more males chosen by Set B (939, compared with 673 females),
owing partly to the first question in Set B.2 that classifies males as high risk. Figure 5.2
shows the distribution of chosen respondents by age and sex.

[Figure 5.2 appears here: number of respondents chosen as high risk by Question Set B, by age group and sex]

Interestingly, the
predominance of males exists only at ages younger than 80. The number of younger males
chosen by this question set is quite large.
Table 5.5 shows the classification achieved by Set C in Appendix I (a model
consisting of seven questions with the same structure as Set B, but seeking to classify
respondents at the highest level of risk) when applied to the full dataset. Thus, only about
20% of deaths were correctly classified, but with a specificity of 97%. The death rate of the
high risk respondents exceeded 50% in three years, indicating quite a high level of risk. The
cost-adjusted misclassification error for this result was calculated as:
(1,163·1.5 + 264) / (1,450·1.5 + 8,844) = 0.182,
which, again, is downwardly biased. When the same-sized model as Set C was built with
the learning set (a set of questions very similar to Set C) and applied to the test set, the results
presented in Table 5.6 were observed:
Table 5.6 - Predicted outcome by true outcome, Question Set C,
internal test set estimates
Cell counts, percentages by row, and percentages by column
TRUE OUTCOME
survived died TOTAL
survived 2,791 (87.6%) 396 (12.4%) 3,187
PREDICTED (95.5%) (77.8%)
OUTCOME died 132 (53.9%) 113 (46.1%) 245
( 4.5%) (22.2%)
TOTAL 2,923 509 3,432
Here, the test set result revealed a higher level of sensitivity (22.2% of all deaths, compared
with less than 20% in the full sample), but with a lower level of specificity (95.5% compared
with 97%). However, the high risk persons experienced a lower death rate of 46%
(compared with 52%). The misclassification error was estimated as:
(396·1.5 + 132) / (509·1.5 + 2,923) = 0.197,
with a standard error of 0.008.
Breaking down the respondents by subset, one finds that Set C.2 chose the most high
risk persons, some 249 respondents of whom 131 (53%) died. Set C.1 chose 240 persons as
high risk, and 130 (54%) died within three years, while Set C.3 picked 225 persons, of whom
110 (49%) died. There were 126 respondents chosen by more than one subset (of whom 64,
or 51%, died), and 32 respondents were chosen by all three subsets (of whom 20, or 63%,
died). Thus the cumulative effect of the subsets seems slighter than in the first two sets.
It was quite surprising to find that, in this highest risk group as chosen by Set C,
females outnumbered males nearly two to one! There were 364 females (5.8% of all women)
and 192 males (4.8% of all men). Also, the distribution of males does not vary by age in any
regular way. High risk females were particularly predominant at older ages. It was expected
that the highest risk respondents would be primarily elderly males, but it appeared that this was
not necessarily so.
5.2 An index for the risk of mortality: linear discriminant analysis
After determining that the best model size for the linear discriminant analysis seemed
to contain about 15 coefficients, the data were recombined into the full dataset. Then the
stepwise backward deletion was repeated to obtain the preferred model of size 15. The form
of this predictor is a set of 15 questions to which the answers can be scored to obtain an
index of mortality. Appendix VI contains the actual questions and the scoring system for this
questionnaire. The order in which the questions are listed corresponds to the ranking of the
variables in terms of each variable's "impact" on the index score. This ranking was achieved
by standardizing all the variables to have a mean of zero and a standard deviation of one, and
ranking the standardized discriminant coefficients according to their absolute magnitudes. The
first variable, age, had the largest standardized coefficient (not surprisingly); the next most
important variable was sex, and so on.
For this tool to be applied to an elderly individual, the interviewer simply asks the 15
questions, writes down the point value corresponding to each answer in the blank space, and
totals the point values to obtain the index score. Then by consulting Table 5.7, one estimates
the probability of death within three years by finding the row of the table corresponding to
the score. For example, suppose the questionnaire is completed by a 65-year-old male with a
present and past weight of 170 pounds, and who scores zero on all other items. The index
score for this person is totaled as 145 + 91 - 225 + (170-26) = 155, corresponding to an
estimated probability of death of about 4.0%, give or take 0.55%. The point values
corresponding to each question are equivalent to the values of the discriminant coordinates
vector described in Chapter 2. These coordinates have been scaled so that the observed
sample range of index scores (the product of each respondent's data vector with the coordinates vector) equaled zero to 1,000.
As can be seen in Table 5.7, the higher the index score, the higher the probability of
death within three years. The probability of death for each range of scores in the table was
estimated by the proportion of respondents in each range who were observed to die. The
Table 5.7 - Probability of death within three years by mortality index

Index score   Probability of death within 3 years (3qx)   Approximate standard error
<100          2.56%                                        0.84%
100-149       2.91%                                        0.57%
150-199       4.00%                                        0.55%
200-249       4.35%                                        0.52%
250-299       8.70%                                        0.71%
300-349       11.97%                                       0.92%
350-399       17.77%                                       1.22%
400-449       21.56%                                       1.51%
450-499       28.57%                                       1.97%
500-549       33.42%                                       2.39%
550-599       36.16%                                       2.74%
600-649       51.21%                                       3.47%
650-800       57.72%                                       3.15%
800+          66.67%                                       6.09%
Note: The interpretation of this probability and the standard error are as such: from the full sample of 10,294
persons, 11.97% of those respondents who scored 300-349 actually died within three years; the standard error
roughly gauges the uncertainty in this estimate. For example, if the sample is assumed to be a simple random
sample from the general population of noninstitutionalized elderly, the range of probabilities 11.97% ± 0.92%
= (11.05%, 12.89%) has a 68% chance of catching the true probability of death for persons who scored 300-
349. For a 95% confidence interval, 11.97% ± 1.96·0.92% = (10.17%, 13.77%). The sample was not a simple
random sample from the general population of elderly, so this estimate is a rough approximation. Probabilities
were estimated for persons living in 1983-1985; present probabilities (for 1997) may be somewhat lower.
standard errors attached to these estimates were obtained by assuming a binomial
distribution of deaths (corresponding to the assumption that the respondents were chosen
with a simple random sample), and so should be taken as rough approximations. If the
respondents were chosen with a simple random sample, the interval of plus or minus one
standard error around each estimate would have a 68% chance of containing the true
population proportion of deaths for each group. Figure 5.3 shows a bar graph of Table 5.7,
where the heights of the bars are equal to the estimated probability of death within each range
of scores. The dashed lines around the top of each bar give the ±1 S.E. interval around each
estimate. Clearly, if the standard errors are taken to be even slightly accurate, there is a
substantial degree of differentiation between levels of risk by index score. The probability
of death ranges from 2.56% to 66.67% (the former estimate having a standard error of less
than 1%).
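A sketch of how such a table could be computed is given below; "score" and "died" are hypothetical vectors holding each respondent's index score and 0/1 outcome, and the standard error uses the same simple-random-sample approximation described above.

    # Estimated probability of death and rough standard error by score range.
    breaks      <- c(-Inf, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 800, Inf)
    score.range <- cut(score, breaks, right = FALSE)

    p  <- tapply(died, score.range, mean)     # proportion dying in each range
    n  <- tapply(died, score.range, length)
    se <- sqrt(p * (1 - p) / n)               # binomial approximation
    round(cbind(probability = p, std.error = se), 4)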
To further understand the relationship between this index score and the risk of death,
the outcome variable Y as a function of z was smoothed with the nonparametric Splus
smoothing algorithm supsmu(). This function uses cross-validation to determine the span
of the smoothing window. The output is plotted in Figure 5.4. It appears from this graph
that the probability of death rises slowly until the index score reaches about 220, at which
point it rises approximately linearly with a slope of 0.106 per 100 index points. This gives an
approximate rule of thumb for scores above 220, which included the scores of about 70% of
all respondents: an increase of 100 points in the index score is roughly equivalent to an
increase of 0.106 in the probability of death, giving a good intuitive feel for the importance
of various questions and their answers. For example, suppose we examine the scores of two
Figure 5.3 - Bar plot of probability of death by discriminant index
dashed lines show +/- 1 standard error (approximate estimates)
[Bar plot omitted: probability of death within three years (y-axis) by mortality index score (x-axis, 0 to 1,000); bar heights range from 2.6% to 66.7%]

Figure 5.4 - Smoothed estimate of probability of death by discriminant index
(observed 0/1 outcomes, smoothed with the Splus function supsmu(); the slope is approximately a 0.1 increase for every 100 index points)

[Plot omitted: probability of death within three years (y-axis) by mortality index score (x-axis, 0 to 1,000)]
males with scores above 220 whose answers are identical for every question on the survey,
except that one of them reports being a smoker. The smoker's score would be 54 points
higher, since that is the value of the points corresponding to that answer (compared with a
nonsmoker's score of zero). Assuming the scores for these persons are above 220 or so, the
difference between the index scores implies that the smoker has a probability of death that
is approximately 0.106 · 54 ≈ 5.7% higher than the nonsmoker. A linear regression of scores
in the range above 220 gave a simple rule for converting the index score to the estimated
probability of death: divide the index score by 10 and subtract 20 to obtain the percent
chance of death (these values are 9.457 and 21.323 to be more exact), an approximation that
works remarkably well for any score above 240 or so. For example, a person with a score
of 300 had about a 10% chance of dying (equal to (300/10) - 20), as confirmed by Figure 5.3
and Table 5.7.
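In code, the smoothing and the linear rule of thumb might be reproduced roughly as follows; "score" and "died" are the same hypothetical vectors as in the earlier sketch.

    # Smooth the 0/1 outcome against the index score, then refit the linear
    # approximation for scores above 220.
    sm <- supsmu(score, died)             # cross-validated smoothing span
    plot(sm, type = "l",
         xlab = "mortality index", ylab = "smoothed probability of death")

    high <- score > 220
    rule <- lm(died ~ score, subset = high)
    coef(rule)                            # slope of roughly 0.001 per point,
                                          # i.e. about 0.1 per 100 index points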
Table 5.8 - Average and median mortality index score by age and sex
Sex
Female scores Male scores
Age average median average median
65-69 189 163 288 267
70-74 240 214 336 316
75-79 302 276 402 375
80-85 384 356 457 430
85+ 471 449 534 517
Note: The mortality index score may be computed by using the questionnaire in Appendix VI.
Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.
Since it might be useful to be able to compare an individual's index score with that
of other persons of the same age and sex, Table 5.8 reports the average and median for the
index score by age and sex. It is evident from this table that for every age-sex group, the
distribution of scores had a long right tail, as the mean was found to lie well above the
median score for each group. Moreover, comparing the probability of death implied by the
average scores for each group with the observed sample proportions of dead reveals a
significant difference. For example, some 6.1% of females younger than 70 in the sample
died, but the average score for these women was only 183, implying a death rate of about
4%; the distribution of scores is clearly asymmetric, with a small contingent of women
having a risk of death well above average, as one would expect since there is much more
potential for above-average mortality at such a low rate of death.
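A table such as Table 5.8 is a one-line computation once the scores are available; in the sketch below, "score", "age.group" and "sex" are hypothetical vectors for the full sample.

    # Average and median index score by age group and sex, as in Table 5.8.
    round(tapply(score, list(age.group, sex), mean))      # averages
    round(tapply(score, list(age.group, sex), median))    # medians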
A simple measure of accuracy for the discriminant model is the raw correlation
between the index score and the binary outcome of survival or death, or r_disc = cor(z, Y). As
measured by the test set Y, when z was computed using coefficients estimated from the learning
set for the model of size 15, r_disc equaled 0.363. When respondents were recombined into the
full dataset and the best model of size 15 was estimated (giving the model above), r_disc was
0.372. In comparison, a linear discriminant model based only on age and sex yielded a
correlation of about 0.22 with either the test set or the full sample.
To compare the accuracy of the model with that of the question method above, one
simply chooses a cutoff value of the index score, classifying all persons scoring above the
cutoff as dead, and all lower-scoring persons as alive. Then it is possible to estimate
misclassification error just as it was computed for the questions. To simplify the
comparison, three cutoff values of the index score were chosen such that the same number
of persons were classified as dead as were classified as dead by the three sets of questions
(sets A through C). The results of these classifications can be observed in Table 5.9, which
Table 5.9 - Deaths and survivors predicted by discriminant analysis
(in a test set of 2,923 survivors and 509 deaths three years after baseline)

Cutoff¹   Cost²   Deaths predicted   Sensitivity⁴   Survivors predicted   Specificity⁶   Death rate of        Death rate / rate
                  correctly³                        incorrectly⁵                         deaths predicted⁷    predicted by age, sex
341       5       353                69%            872                   70%            29%                  1.44
455       3.5     221                43%            336                   88%            40%                  1.79
563       1.5     120                24%            125                   96%            49%                  2.09

1. The cutoff is that index score below which persons were classified as alive, and above which they were classified as dead. The index score may be computed by using the questionnaire in Appendix VI.
2. The ratio of the cost of misclassifying a death as survival to the cost of misclassifying a survival as death. This was not actually used in the computation of the model or the cutoff value; rather, it is included only for ease of comparison with Table 1.2.
3. The number of deaths in the test set correctly classified as dead by the model.
4. The number of deaths correctly predicted divided by the total number of deaths in the test set (509), also called the true positive fraction or TPF.
5. The number of survivors in the test set incorrectly classified as dead by the model (i.e., false positives).
6. The proportion of all survivors (2,923) correctly classified as survivors by the model.
7. The proportion of deaths in the respondents predicted as dead. In the column to the right, this rate is divided by the death rate a randomly chosen set of respondents (with the same age/sex distribution as the respondents classified as dead by the model) would have suffered.
Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.
is directly comparable to Table 1.2. It appears that the accuracy of the linear discriminant
method is slightly better than the question set method in terms of misclassification error. For
the most sensitive model, using a cutoff of 341, 69% (353) of deaths in the test set were
classified correctly compared to 66% (337) for question Set A. These both were achieved
with a specificity of about 70% (872 and 888 false positives, respectively). Compared to
question Set B (209 deaths predicted correctly and 348 false positives), the linear
discriminant model caught 43% of all deaths (221) with a specificity of 88% (336 false
positives). With respect to question set C, the discriminant model predicted 24% of deaths
correctly with a specificity of 96% (120 true deaths, 125 false positives), compared with a
sensitivity of 22% (113 true deaths, 132 false positives). The area under the ROC curve for
the linear discriminant model was estimated as 76.2% ± 1.3% using the test set respondents,
compared with 74.4% for the three question sets.
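The cutoff comparison can be sketched as follows; "z" and "died" are hypothetical vectors of test set index scores and outcomes, the cutoff of 455 matches the middle row of Table 5.9, and the rank-based (Mann-Whitney) statistic is shown only as one common way of estimating the area under the ROC curve, since the text does not state which estimator was used.

    # Classify by an index-score cutoff and compute sensitivity and specificity.
    cutoff      <- 455
    predicted   <- as.numeric(z > cutoff)
    sensitivity <- mean(predicted[died == 1])        # share of deaths caught
    specificity <- mean(1 - predicted[died == 0])    # share of survivors kept

    # One common ROC-area estimate: the proportion of (death, survivor) pairs in
    # which the death has the higher score, counting ties as one half.
    auc <- mean(outer(z[died == 1], z[died == 0], ">") +
                0.5 * outer(z[died == 1], z[died == 0], "=="))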
Figure 5.5 shows these three cutoff points plotted on the ROC curve for the linear
discriminant model, along with the points corresponding to questions Sets A through C. It
does appear that the discriminant model possesses a small advantage over the question
method. However, it should be noted that the manner in which categorical variables were
recoded as indicator variables was greatly informed by the results of the question set method.
This was done after the accuracy of those questions had been confirmed with the test set.
Thus, the test set error estimate for the linear discriminant model could be somewhat
optimistic. Building such an accurate model would have been much more difficult without
using such knowledge about how to recode the variables.
Figure 5.5 - True positive fraction by false positive fraction (ROC curve) for the discriminant model of deaths in the test set

[Plot omitted: true positive fraction (proportion of deaths predicted correctly) by false positive fraction (proportion of survivors predicted incorrectly as dead); estimated area under the curve = 76.2%, with the three cutoffs and question Sets A, B and C marked]

Note: The test set consisted of 2,923 survivors and 509 deaths. Letters (A, B, C) correspond to the question sets in Appendix I.
Source: New Haven, East Boston and Iowa County EPESE

The distribution of scores for both survivors and deaths was discussed in Chapter 2
and is shown in Figure 2.4. It is obvious from this graph that the assumptions concerning
the forms of the distributions (that the two distributions are distributed normally with the
same variance-covariance matrix) are not entirely accurate. The scores for the decedents
seem to have a larger variance than the survivors' scores, and the survivors exhibit an
abnormally long right tail, as mentioned. However, as the test set results indicate, the
obvious falsity of these assumptions does not seem to diminish the model's predictive
accuracy substantially. As Breiman et al. note, the success of the linear discriminant
approach is surprising given the questionable assumptions of the model.
5.3 Classification trees for predicting death
Using the 30 variables resulting from the first stage of the linear discriminant
backward deletion process, trees were grown for the same three levels of misclassification
cost (5, 3.5 and 1.5) as were used in the question set process. However, the program fit a
tree of size zero at a cost of 1.5 (i.e., it would not build a tree). The tree constructed with a
cost of five only caught 52% of all deaths. Thus, trees were constructed for the following
three levels of misclassification cost: 2, 3.5, and 7. Table 5.10 lists the results, and Figure
5.6 plots the models along the ROC curve, along with the points corresponding to the
question set method. With a misclassification cost of 3.5 the tree method seems to do about
as well as the question set method. At a cost of two or seven, however, the accuracy of the
trees falls well below the accuracy provided by Sets A and C. The accuracy of the medium-
cost tree lies almost exactly on the same ROC curve as question Set B, but the low cost tree
had a sensitivity of only 18% at roughly the same specificity as question Set C. The most
sensitive tree had a slightly lower level of accuracy than question Set A. The area under the
Table 5.10 - Deaths and survivors predicted by classification trees
(in a test set of 2,869 survivors and 486 deaths three years after baseline)

Size of tree¹   Cost²   Deaths predicted   Sensitivity⁴   Survivors predicted   Specificity⁶   Death rate of        Death rate / rate
                        correctly³                        incorrectly⁵                         deaths predicted⁷    predicted by age, sex
15              7       298                61%            791                   72%            27%                  --
9               3.5     216                44%            403                   86%            35%                  --
2               2       87                 18%            134                   95%            40%                  --

1. The size of the tree is the number of questions (or splits) in the tree.
2. The ratio of the cost of misclassifying a death as survival to the cost of misclassifying a survival as death.
3. The number of deaths in the test set correctly classified as dead by the tree.
4. The number of deaths correctly predicted divided by the total number of deaths in the test set (486), also called the true positive fraction or TPF.
5. The number of survivors in the test set incorrectly classified as dead by the tree (i.e., false positives).
6. The proportion of all survivors (2,869) correctly classified as survivors by the tree.
7. The proportion of deaths in the respondents predicted as dead.
Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.
Figure 5.6 - True positive fraction by false positive fraction (ROC curve) for classification trees of deaths in the test set

[Plot omitted: true positive fraction by false positive fraction for the three trees, with question Sets A, B and C marked; estimated area under the curve = 73.4%]

Note: The test set consisted of 2,869 survivors and 486 deaths. Letters (A, B, C) correspond to the question sets in Appendix I.
Source: New Haven, East Boston and Iowa County EPESE
ROC curve for the three trees was estimated as 73.5%.
Figure 5.7 shows the medium-cost tree, which contained nine questions. It is quite
informative to compare the structure of these binary splits with the questions in Set B. Note,
for instance, that every question contained in the tree is essentially contained within question
Set B, and that Set B contains one additional question, asking about the ability to do heavy
work. The first high risk terminal node (containing persons who cannot walk half a mile and
are taking digitalis) catches the same respondents classified as high risk by Set B.1. It is
interesting how similarly these two models are constructed, considering the vastly different
algorithms employed in their constructions. Yet there are some important differences
between them; Set B asks about more variables than the tree (the additional question about
doing heavy work), but actually contains fewer questions (seven for Set B, and nine for the
tree). The reason for this may be that the tree's tactic of splitting persons into completely
disjoint rectangles requires that some variables are asked about multiple times. For example,
in the first split respondents are divided completely into those who can walk half of a mile
and those who cannot. Thus any particularly powerful questions which are asked of the
former persons (e.g., is respondent male or female) must also be asked of the latter persons
separately if the predictive power of the question is to be fully exploited. This requires that
such a question appear multiple times in the tree.
Although the tree algorithm worked just as accurately for the medium level of cost,
the higher and lower cost models were less accurate than the question set models, as
mentioned. The most sensitive tree, built with a cost of seven, was only able to capture 61%
of all deaths, and lies slightly below the ROC curve for the question set method. This model
Figure 5.7 - Classification tree for risk of death within 3 years
cost of misclassification = 3.5
[Tree diagram omitted: nine yes/no splits assigning respondents to HIGH RISK or LOW RISK terminal nodes]
also required some 16 questions in total (compared with 10 questions, as in Set A), so that
the structure of this tree was quite complex. The tree built with a cost of two was clearly less
accurate than the comparable question set, and consisted of only two questions (the two
questions in Set B.1). It was not entirely clear why the tree was not able to pick out the
highest risk persons as accurately as the question set method. It may have been a result of
the one-step optimality of the tree-building process. It is likely that the highest risk persons
(who are small in number, by definition) may be identified with certain combinations of
questions which are not likely to occur in the form of a branch of a tree. This is because the
optimal splits formed near the top of the tree are greatly weighted by the large numbers of
the low-to-moderate-risk respondents.
5.4 A regression model of mortality
The final logistic regression model constructed with the full dataset (as described in
section 4.5) consisted of 23 variables, in addition to the intercept. Table 5.11 shows the 21
coefficients in the model (ordered by the absolute size of the standardized coefficient) which
were significant at the 0.05 level, along with the standard errors estimated for each
coefficient. The most important variable was age, followed by sex, digitalis usage, the ability
to walk a half mile, self-assessed health status, and the ability to state the mother's maiden
name; these variables were identified by both the question set method, and the linear
discriminant analysis. The remainder of the variables consisted of some that were identified
in the question set method, and some that were found instead by discriminant analysis:
primarily smoking, insulin usage, the ability to bathe or use the toilet without help, the
previous diagnoses of cancer and heart attacks, and weight (both at time of interview, and
Table 5.11 - Coefficients in the logistic regression model of mortality

Variable                                                      Coefficient   Std. Error
Age: 65 to 69=0, 70 to 74=1, ..., 85+=4                       0.3357        0.02578
Sex: female=0, male=1                                         0.6594        0.07112
Do you take digitalis? yes=1, no or missing=0                 0.6846        0.08159
Can you walk half a mile? yes=0, no or missing=1              0.5497        0.07494
Would you say your health is: excellent=0, ..., poor/bad=3    0.2497        0.04070
What is your mother's maiden name? correct=0, other=1         0.5464        0.09012
Do you smoke cigarettes now? yes=1, no/missing=0              0.5227        0.09190
Ever taken insulin for diabetes? yes=1, no/missing=0          0.6123        0.12407
Ever been diagnosed with cancer? yes=1, no/missing=0          0.3945        0.08255
Usual weight at age fifty (in pounds)                         0.0081        0.00172
Wake up at night? most of time=0, ..., rarely/never=2         0.1832        0.04079
Need help using toilet? yes=1, no/missing=0                   0.5229        0.12400
Need help from person to bathe? yes=1, no/missing=0           0.4405        0.10461
Ever hospitalized for a heart attack? yes=1, no/missing=0     0.3836        0.09679
Did you ever smoke cigarettes? yes=1, no/missing=0            0.1269        0.03523
What is your date of birth? correct=0, other=1                0.3332        0.09953
Weight (in pounds)                                            -0.0336       0.01024
Bring up phlegm at least 3 months of year? yes=1, no=0        0.4219        0.13313
Leg pain when walking on level ground? yes=1, no=0            0.4589        0.16007
Weight² (in pounds)                                           7.61·10⁻⁵     3.52·10⁻⁵
Note: All coefficients were significant beyond the 0.05 level.
at age fifty). The squared term for weight at baseline was included in the model as well,
indicating that the principal effect that this variable was capturing was probably that of
respondents wasting away just prior to death, resulting in high mortality at very low body
weights. (Note that the question set method indicates that this is particularly true for men,
but the model has no way of detecting this without the appropriate interaction term).
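For illustration only (the variable names are hypothetical and this is not the model actually fitted), such a squared term and a sex-by-weight interaction can be written directly in a glm() formula:

    # A squared weight term and a sex-by-weight interaction as formula terms.
    fit <- glm(died ~ age + sex + weight + I(weight^2) + sex:weight,
               family = binomial, data = epese)
    summary(fit)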
There were many ways to measure the accuracy of this model. The residual deviance
was 6992.037 on 10,270 degrees of freedom (compared with a null deviance of 8369.42 on
10,293 degrees of freedom). The correlation between the fitted values (ranging between zero
and one) and the true outcome of death (as one) or survival (as zero) was 0.395, roughly
equal to the correlation between the linear discriminant scores and the true outcome. A
model of nearly the same size was built on the learning set and applied to the test set (see
section 4.5) to calculate a set of fitted values. A cutoff was applied to the fitted values to
classify respondents as high or low risk, and the classification in Table 5.12 was achieved.
Table 5.12- Predicted outcome by true outcome, logistic regression
internal test set estimates
Cell counts, percentages by row, and percentages by column
TRUE OUTCOME
survived died TOTAL
survived 2,594 (90.2%) 281 (9.8%) 2,875
PREDICTED (88.7%) (55.2%)
OUTCOME died 329 (59.1%) 228 (40.9%) 557
(11.3%) (44.8%)
TOTAL 2,923 509 3,432
Note that the cutoff was chosen so that the row marginals would be exactly equal to those
displayed in Table 5.4 (the test set results from Question Set B) and the second row of Table
5.9 (the test set results from linear discriminant analysis). The regression model had a level
of sensitivity and specificity very close to that achieved by the linear discriminant model, but
required more variables than the discriminant analysis. The test set error estimate for the
regression model, using a relative misclassification cost of 3.5, was calculated as:
(281·3.5 + 329) / (509·3.5 + 2,923) = 0.279,
with an approximate standard error of 0.0122.
5.5 Validation of the question sets with the North Carolina sample
As mentioned above, all the models were constructed after having made multiple
observations on the test set, and the author had used the entire dataset (excepting the North
Carolina sample) for other related analyses prior to this research. As a result, the "internal"
test set error estimates above cannot be considered truly unbiased. However, the hope is that
the bias in these estimates is small. Fortunately, this hope can be validated to some extent,
as there exists, as mentioned, the additional EPESE sample of 4,162 respondents from North
Carolina, also called the "Duke" sample. These respondents had been ignored by the author
initially, despite the public availability of the baseline dataset, since the ID's of the deceased
respondents were not publicly released. After the construction of the above models,
however, it became possible to identify the deceased respondents through private
communication with Duke researchers. All the questions required by the models were asked
of the Duke respondents, and so the Duke sample formed a completely compatible but
independent sample. Taken at a different point in time from an entirely different geographical
region within the U.S., this dataset provided a natural yet challenging validation sample for
the models.
However, the fact that the North Carolina sample was so radically different from the
original samples made the validation slightly more complicated. Ideally, for a validation
sample, one would like to have an independent probability sample from the same population
from which the original sample was drawn. Then a statistical comparison of accuracy would
be straightforward. One simply needs to calculate standard errors for the estimates of
misclassification error, so that a statistical test can be calculated to determine whether any
difference between the estimates is significant. However, since the validation sample was
so different in this case, and since the original dataset was not constructed with a probability
sample, such a comparison is not necessarily meaningful. The problem was that when the
models were applied to the validation sample, a smaller proportion of respondents were
classified as high risk as compared with the original sample. This was because the Duke
sample was younger on average, and contained more females; both of these variables were
used in the question set models. (There were also fewer respondents who used digitalis).
With respect to Table 5.2, for example, this meant that in the Duke sample there was
a greater proportion of respondents in the upper row of the table. This shift in the marginals
could have affected estimates of misclassification error, perhaps resulting in more optimistic
estimates. Thus, it is more informative simply to compare the respondents within each row;
that is, one can simply compare the death rates of the high or low risk respondents. Notice
that if the row marginals are held constant, a decrease in the death rate of the high risk
respondents (or likewise, an increase in the death rate of the low risk respondents) is directly
proportional to an increase in misclassification error. Thus, comparing death rates provides
an informative comparison of accuracy regardless of the marginals. In these terms, the
models performed extremely well, as the death rates in the validation sample were nearly
exactly as predicted. It was also found that, coincidentally, the observed death rates were
such that shifts in marginals actually had a very small effect on estimates of misclassification
error. Thus, the observed differences in error were also very small. In fact, they were no
larger than one would have expected from simple random sampling variation, as
demonstrated in the statistical tests below. Clearly, the samples were not simple random
samples, so these tests should not be taken literally. They are simply provided to gauge the
sizes of the observed differences (i.e., they are used to define "small"). Since the sample
sizes were large, such a measure provides a fairly strict gauge; nonetheless, the models
passed the test.
The target population for the Duke sample consisted of persons 65 and older living
in a contiguous five-county area in North Carolina consisting of Durham, Franklin,
Granville, Vance and Warren Counties.1 A four-stage sample design was used to obtain a
probability sample of these persons. Area sampling was used to obtain census blocks and
enumeration districts, smaller geographic areas were randomly chosen within these units, and
the households within each area were listed and stratified by race. Finally, one person within
each household was randomly chosen as the designated interviewee. About 80% of these
selected persons completed interviews.
Apart from geography, there were other substantial differences between the Duke
1 See Cornoni-Huntley et al. (1986) for a detailed description of the North Carolina EPESE
sample.
respondents and the respondents from the first three EPESE sites. Since two of the original
three samples (the Iowa and Boston samples) contained virtually no nonwhites, the Duke
sample was designed to oversample blacks, so that less than half the Duke respondents were
white. Figure 5.8 shows the distribution of respondents by age, sex, and race. By comparing
this distribution with the age-sex distribution shown in Figure 3.1, one can see that besides
the majority of blacks, there is a preponderance of females at all ages (except nonblacks aged
65-69). Females were significantly oversampled relative to U.S. population sex-ratios with
respect to all age-race categories (again, excepting nonblacks aged 65-69).2 Thus, the
unweighted Duke sample was radically different from what would be expected in a simple
random sample of the U.S., raising the issue of whether sampling weights should be used in
the validation estimation of model error. The models, though, suggested that race should not
be an important predictor of mortality once other factors have been controlled (at least to the
extent to which the small number of nonwhites in the original sample provided such data);
so the unique makeup of the unweighted sample was viewed as providing an interesting,
more difficult test case, and the decision was made to use the raw, unweighted counts.
Again, it is important to note that all of the above models were constructed prior to
the author's knowing which Duke respondents had died. Each of the three methods was
applied to the Duke baseline dataset (Sets A through C, the linear discriminant model with
a cutoff score of 455, and the classification tree model in Figure 5.7). Using these models,
lists of ID's were compiled of those respondents predicted to die according to each method.
Finally, to match the chosen respondents to the decedents, these lists were sent via electronic
2 See the Current Population Reports, P-25 series No. 1095.
Figure 5.8 - Number of respondents in Duke sample by age, sex, and race
[Bar chart of the number of respondents by age group (65-69, 70-74, 75-79, 80-84, 85+),
with separate bars for black females, black males, nonblack females, and nonblack males.]
Respondents are from the North Carolina EPESE baseline survey (N = 4,162).
mail to a researcher (unassociated with the author) at the NIH who had access to a list of
the decedents' ID's.3
Table 5.13 - Results of applying Question Set A to Duke sample
Cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived            died             TOTAL
PREDICTED     survived     2,786 (90.9%)       278 ( 9.1%)      3,064
OUTCOME                    (78.5%)             (46.2%)
              died           765 (70.2%)       324 (29.8%)      1,089
                           (21.5%)             (53.8%)
TOTAL                      3,551               602               4,153
Table 5.13 shows the result of applying Question Set A in Appendix I to the Duke
sample. The cell counts give the number of persons according to the true and predicted
outcomes, and the percentages by row (next to the count) and column (below the count) are
also shown. There were nine respondents for whom the date of death was missing; these
respondents were removed from the analysis, leaving a total of 4,153 respondents for testing.
Of these, 602 (14.5%) died within three years, a raw death rate slightly higher than the rate
of 14.1% in the original sample. A total of 1,089 persons were classified as high risk, and
of these 324 (29.8%) died within three years, accounting for 53.8% of all deaths. There
were 765 persons incorrectly predicted to die, so of all the 3,551 survivors, 78.5% were
correctly classified. So the sensitivity is given by the lower right percentage-by-column
(53.8%) and the specificity is the upper left percentage-by-column (78.5%). The raw death
3 The actual ID matching was performed by Caroline Phillips at the National Institute on Aging,
to whom the author is eternally grateful.
rates of the two groups classified as low- and high-risk are the percentages-by-row in the
right column (9.1% and 29.8% respectively).
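For readers who wish to verify these summaries, the following minimal sketch (illustrative only; not the original analysis code) derives the sensitivity, specificity, and row death rates directly from the cell counts of Table 5.13.

```python
# Illustrative sketch: summary measures from the 2x2 counts in Table 5.13.
# Rows are predicted outcome, columns are true outcome.

pred_surv = {"survived": 2786, "died": 278}   # predicted to survive (low risk)
pred_died = {"survived": 765,  "died": 324}   # predicted to die (high risk)

deaths = pred_surv["died"] + pred_died["died"]                # 602
survivors = pred_surv["survived"] + pred_died["survived"]     # 3,551

sensitivity = pred_died["died"] / deaths                      # 0.538: deaths caught
specificity = pred_surv["survived"] / survivors               # 0.785: survivors kept low risk
low_risk_death_rate = pred_surv["died"] / sum(pred_surv.values())    # 0.091
high_risk_death_rate = pred_died["died"] / sum(pred_died.values())   # 0.298

print(sensitivity, specificity, low_risk_death_rate, high_risk_death_rate)
```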
These numbers should be compared with those in Table 5.2 (the internal test set
predictions, p. 178) and Table 5.1 (the full dataset predictions, p. 177). For example, one
central statistic of interest is the estimated death rate of the high risk respondents (the
percentage-by-row in the lower, right-hand cell). In the internal test set (Table 5.2), some
27.5% of respondents who were predicted to die actually died, and in the full dataset, 30.1%
of such persons died, quite close to the Duke sample estimate of 29.8%. Likewise, the test
set estimate of the death rate in the low-risk group was about 7.8%, and in the Duke sample
this rate was 9.1%. However, when one examines the percentages by
column (i.e., sensitivity and specificity), an interesting discrepancy is revealed: the
predictions (based on the test set estimates) were that 69.6% of all survivors and 66.2% of
all decedents would be classified correctly. However, when applied to the Duke sample, the
questions were much more successful at classifying survivors: 78.5% of survivors were
classified correctly, but only 53.8% of deaths were classified correctly.4 If one calculates the
marginal percentages by row, one notices that the reason for this may be that the row margins
for the two samples are quite different; in the test set, for example, 35.7% of persons were
classified as high risk, and in the full dataset some 29.2% of persons were predicted to die.
However, in the Duke sample, only 26.2% of persons were classified as high risk.
Thus, one type of error is more predominant in the Duke sample (the
misclassification of the decedents, or false negatives), but the other type of error (survivors
4 A chi-square test using the predicted, test-set column percentages as expected values and the
Duke estimates as observed values suggests that these differences are statistically significant.
predicted to die, or false positives) is less frequent. The reader, then, may wonder exactly
in what way the original results have been validated. To answer this question, compare the
overall, cost-adjusted misclassification rate for the two samples (the criterion all models
were designed to reduce). The internal test-set estimate of error, as calculated above, was
equal to 0.3196, and the Duke sample estimate is calculated as:

(5·278 + 765) / (5·602 + 3,551) = 0.3285.
So in terms of this criterion, the Duke classifications were slightly less accurate than
predicted. The difference is small, however, only 0.0088.
Using rough theoretical approximations for standard errors (i.e., based on the
assumption that the two samples are independent, simple random samples), one can calculate
a simple t-statistic for the difference between the two error rates. For example, suppose the
original three samples were considered a single simple random sample. Then the numerator
of the error rate for Question Set A may be modeled as the sum of 3,432 tickets randomly
drawn with replacement from a box that contains 3,432 tickets: 172 tickets are labeled with
a "5" (representing the errors for the misclassified decedents), 888 tickets are labeled "1" (the
errors for the misclassified survivors), and the remaining 2,372 tickets are labeled "0" (the
nonerrors for the properly classified respondents).5 The sum of these tickets has a standard
error equal to the product of the standard deviation of the tickets in the box and the square
root of the number of tickets drawn, here equal to 1.1192·√3,432. The denominator is
5 Some readers may recognize that Statistics by Freedman, Pisani and Purves (1991) is the
inspiration for this type of layperson's description.
constant, so the standard error of the internal test set estimate may be calculated as:

1.1192·√3,432 / (5·509 + 2,923) = 0.01199,
and likewise, the standard error for the Duke estimate of misclassification error can be
calculated as:

1.2605·√4,153 / (5·602 + 3,551) = 0.01238,

where 1.2605 is the standard deviation of the 4,153 tickets in a box with 278 tickets marked
"5", 765 tickets marked "1", and 3,110 tickets marked "0". (Of course, in the unit weight
scenario, this calculation is simply equivalent to the standard error as derived from the
binomial distribution.) Now to estimate the standard error for the difference between the two
misclassification errors, assume the samples are independent and therefore that the sums of
squares are orthogonal, so that the standard error of the difference is given by the usual
formula:6

SE_diff = √(0.01199^2 + 0.01238^2) = 0.01724.

Then the usual test statistic may be estimated as

observed difference / SE_diff = 0.008778 / 0.01724 = 0.5093,

which is not significant.
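The box-model arithmetic can be reproduced mechanically. The sketch below is a minimal illustration under the same simple-random-sampling assumption (it is not the original code); the ticket counts are those given in the text, and the relative cost is 5.

```python
# Minimal sketch of the box-model standard errors and the resulting test
# statistic described above; uses ticket values 5, 1, and 0 and a cost of 5.
import math

def box_se(n_fn, n_fp, n_ok, deaths, survivors, cost=5.0):
    """SE of the cost-weighted error rate: SD of the box times sqrt(draws),
    divided by the (constant) weighted denominator."""
    tickets = [cost] * n_fn + [1.0] * n_fp + [0.0] * n_ok
    n = len(tickets)
    mean = sum(tickets) / n
    sd = math.sqrt(sum((t - mean) ** 2 for t in tickets) / n)
    return sd * math.sqrt(n) / (deaths * cost + survivors)

se_test = box_se(172, 888, 2372, deaths=509, survivors=2923)   # about 0.0120
se_duke = box_se(278, 765, 3110, deaths=602, survivors=3551)   # about 0.0124
se_diff = math.sqrt(se_test ** 2 + se_duke ** 2)               # about 0.0172
z = (0.3285 - 0.3196) / se_diff                                # about 0.5
print(se_test, se_duke, se_diff, z)
```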
If one were comparing two independent, simple random samples from the same
population, this test would provide excellent evidence that any bias in the test-set estimate
6 See Elements by Euclid (ca. 300 BC) for a proof of this result.
of prediction error is negligibly small, as gauged by sampling error. However, the Duke
sample can hardly be considered a simple random sample of the U.S. population, and clearly
the faulty assumption makes a difference in regards to the model, as suggested by the shift
in row marginals. Exactly how does the assumption affect the statistical test above?
Consider the most obvious effect, as suggested by the shift in row marginals; that is, suppose
the percentages by row (the death rates) within each cell are held fixed while the row
margins are allowed to vary (which is an accurate description of what actually happened).
With respect to Table 5.13, for example, suppose that the death rates of the low- and
high-risk groups are held fixed at 9.1% and 29.8% respectively. Then shift the row margins
toward the high risk group so that the proportions in each row are equal to the proportions
observed in the test-set sample (Table 5.2). In the test set, 35.7% of respondents were
classified as high risk, so 35.7% of 4,153 or about 1,482 Duke respondents would have been
classified as high risk. Then, with the death rates implied by the fixed percentages by row,
one would have correctly classified about 569 out of 836 deaths, while misclassifying 913
out of 3,317 survivors; this would have yielded a misclassification error of 0.2999, more
than two standard errors lower than the observed Duke error rate. Thus, the observed shift
of Duke respondents into the low risk class tends to push the error rate upwards in this
instance. This implies that applying the model to the Duke sample rather than a true simple
random sample from the original EPESE population inflates the above test statistic.7 Yet,
the observed error rate for the Duke sample was still within one standard error of the
7 Of course, it is possible for the true expected value of the Duke error to be lower than the
expected value for the test error (as suggested by Question Set C, below), in which case this effect does not
necessarily inflate the statistic, but we can ignore the statistic then since we are only concerned about an
undue increase in error over the test set estimate. (That is to say, we wish to perform a one-tailed test).
predicted rate!
Table 5.14 - Results of applying Question Set B to Duke sample
Cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived            died             TOTAL
PREDICTED     survived     3,236 (88.7%)       413 (11.3%)      3,649
OUTCOME                    (91.1%)             (68.6%)
              died           315 (62.5%)       189 (37.5%)        504
                           ( 8.9%)             (31.4%)
TOTAL                      3,551               602               4,153
Consider Question Set B in Appendix I, for which the results are shown in Table
5.14. The test set results (which were very close to the results from the full dataset)
suggested that the death rates in the low and high risk respondents would be about 10.4% and
37.5% respectively. The observed death rates were 11.3% and 37.5% respectively, giving
nearly perfect results. With respect to percentages by columns, however, the same shift in
row margins (and therefore a shift from sensitivity to specificity) was observed as with
Question Set A: only 31.4% of deaths were correctly classified (compared with 41.1% in the
test set), while only 8.9% of survivors were misclassified (compared with 11.9% in the test
set). The resulting error rate for the Duke sample was estimated as 0.3112 (compared with
0.2971623 in the internal test set). So again the models were less accurate when applied to
the Duke sample, but only slightly so (a difference of 0.01399). Standard errors were
estimated for the two sample errors in the same way as above, giving 0.01249 for the test set
estimate, and 0.01202 for the Duke estimate. The standard error for the difference equaled
0.01734, so the test statistic was estimated as 0.01399/0.01734 = 0.8068, not a significant
difference. Here, however, the shift in marginals has the opposite effect on this statistic. For
example, suppose one holds the low and high risk death rates fixed at 11.3% and 37.5%
respectively. Then adjust the row margins so that the proportions of low and high risk
respondents are the same as in the test set (so that 16.2% of 4,153, or 674 respondents would
have been classified as high risk). This implies an error rate of 0.3119. Here, then, the
observed shift in row marginals toward the low risk class pushes the error rate downward.
However, the pressure is small; a hypothetical difference of about 0.0007 is obtained, less
than a tenth of a standard error. Thus, had the death rates held fixed (apparently a reasonable
assumption), one would have expected about the same test statistic had the question set been
applied to a true simple random sample from the original EPESE population.
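The fixed-death-rate, shifted-margin bookkeeping described here can be written as a small calculation. The sketch below (illustrative only, not the original code) reproduces the Question Set B comparison: roughly 0.311 under the observed Duke margin, and roughly 0.312 when the test-set margin of 16.2% high risk is imposed.

```python
# Illustrative sketch: recompute the cost-weighted error for Question Set B
# when the high-risk margin is shifted but the row death rates are held fixed.

def error_with_margins(n_total, prop_high, rate_low, rate_high, cost=3.5):
    """Cost-weighted error implied by fixed row death rates and a chosen
    proportion of respondents classified as high risk."""
    n_high = prop_high * n_total
    n_low = n_total - n_high
    missed_deaths = rate_low * n_low          # false negatives in the low-risk row
    false_positives = (1 - rate_high) * n_high
    deaths = missed_deaths + rate_high * n_high
    survivors = n_total - deaths
    return (missed_deaths * cost + false_positives) / (deaths * cost + survivors)

# Observed Duke margin (12.1% high risk) versus the test-set margin (16.2%),
# with the Duke death rates of 11.3% (low risk) and 37.5% (high risk) held fixed.
print(error_with_margins(4153, 504 / 4153, 0.113, 0.375))  # about 0.311
print(error_with_margins(4153, 0.162, 0.113, 0.375))       # about 0.312
```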
Table 5.15 shows the results of applying Question Set C to the Duke sample. Again,
the same pattern is observed: the death rates for the high and low risk groups were predicted
quite accurately (48.3% and 12.7% respectively, compared with 46.1% and 12.4% in the
internal test set), but a shift in row marginals toward the low risk category resulted in lower
sensitivity (16.8%, compared with 22.2%) and higher specificity (97.0% versus 95.5%). This
effect was quite systematic (not random). The cost-adjusted error rate was estimated at
0.1930, which was slightly lower than the internal test set estimate of 0.1969. In this case,
the observed shift in row margins toward the low risk class has the effect of pushing the error
estimate down and below the test set estimate. One can see this by holding the death rates
fixed, adjusting the marginals toward the high risk class, and recalculating the error rate as
above. This implies an error rate of 0.1987, only slightly higher than the test set error rate.
Thus, the test statistic is less informative here.
Table 5.15 - Results of applying Question Set C to Duke sample
Cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived            died             TOTAL
PREDICTED     survived     3,443 (87.3%)       501 (12.7%)      3,944
OUTCOME                    (97.0%)             (83.2%)
              died           108 (51.7%)       101 (48.3%)        209
                           ( 3.0%)             (16.8%)
TOTAL                      3,551               602               4,153
Figure 5.9 plots the results from all three question sets in ROC space, along with the
original test set results and the original ROC curve. In each case, one can see how the change
in specificity and sensitivity shifted the results downward and leftward in ROC space, with
only slightly less accuracy (which is to say that the results were quite close to the original
ROC curve). The area under the ROC curve for the Duke sample was estimated as 73.3%,
slightly lower than the test set estimate of 74.4%, showing the small decrease in accuracy.
As a general conclusion, however, the models performed quite admirably, particularly in
consideration of the diverse makeup of the Duke sample.
Because of this validation process, an interesting substantive issue arose: was the
observed shift in row marginals a product of the skewed racial stratification of the Duke
sample, and if so, what was the sociological mechanism responsible for this shift? Or is it
possible that the observed shift was in some way reflecting the potential problem of
overfitting which was inherent to the model-building process, as suggested in earlier
chapters? (This issue is not explored in detail in this dissertation, although it was thought that
some differences might have been due to the age and sex composition of the Duke sample.)
Figure 5.9 - True positive fraction by false positive fraction
(ROC curve) for questions predicting death (with Duke results)
[ROC plot: the true positive fraction (the proportion of deaths predicted correctly) is plotted
against the false positive fraction (the proportion of survivors predicted incorrectly as dead),
with the Duke results for Sets A, B, and C shown alongside the original test set ROC curve.
Duke area = 73.3%.]
Note: The test set consisted of 2,923 survivors and 509 deaths.
Letters (A,B,C) correspond to the question sets in Appendix I.
The Duke sample contained 3,551 survivors and 602 deaths.
Source: Established Populations for Epidemiologic Studies of the Elderly
First, it is informative to examine the validation of two of the other models discussed above:
the linear discriminant analysis, and the classification tree.
5.6 Validation of the discriminant and classification tree models
The linear discriminant model in Appendix VI was applied to the Duke sample with
a cutoff score of 455 to classify respondents as high or low risk. Table 5.16 shows the
results.
Table 5.16 - Result of applying discriminant model to Duke sample
Cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived            died             TOTAL
PREDICTED     survived     3,150 (90.0%)       352 (10.0%)      3,502
OUTCOME                    (88.7%)             (58.5%)
              died           401 (61.6%)       250 (38.4%)        651
                           (11.3%)             (41.5%)
TOTAL                      3,551               602               4,153
The internal test set predictions (from Table 5.9 above) were that the high risk respondents
would suffer a death rate of about 39.7%, and that the model would capture 43.4% of all
deaths with a specificity of 88.5%. The result was that 250 (38.4%) of the high risk
respondents died, and the model caught 41.5% of all deaths with a specificity of 88.7%. So
the results are slightly less accurate, but based on the simple random sample assumptions,
it is not a statistically significant difference. Chi-square tests on these observed counts (using
the internal test set predictions for the expected values, by column, row, or cell) show that
there is no statistically significant difference between the observed Duke counts and the
predicted counts. That is to say, the results were statistically indistinguishable from what
would have been observed if the model had zero bias and were applied to a simple random
sample from the EPESE population! The internal test set error rate was calculated as 0.2857,
and the Duke error rate was estimated to be 0.2886, a minute increase.
Table 5.17 - Result of applying discriminant model to Duke sample
Highest risk; cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived            died             TOTAL
PREDICTED     survived     3,406 (87.7%)       477 (12.3%)      3,883
OUTCOME                    (95.9%)             (79.2%)
              died           145 (53.7%)       125 (46.3%)        270
                           ( 4.1%)             (20.8%)
TOTAL                      3,551               602               4,153
It was also possible, knowing outcomes for all persons with scores higher than 455,
to calculate the error rate for the subset of these persons having scores higher than 563,
which was the highest-risk cutoff suggested by Table 5.9. The results of this classification
are shown in Table 5.17. The high risk respondents suffered a death rate of 46.3% (slightly
lower than the predicted death rate of 49%), and the model caught 20.8% of all deaths
(compared with 23.6%) with a specificity of 95.9% (slightly higher than the predicted
specificity of 95.7%). This loss of sensitivity with a slight gain in specificity is similar to the
shift which occurred in the application of the question sets above. There is also a slight shift
in row margins toward the low risk class (6.5% of respondents scored above 563, compared
with 7.1% of the respondents in the test set). However, this is not a statistically significant
difference; a chi-square test of the observed cell counts in Table 5.17 (using the cell
percentages from the internal test set results to calculate the expected values) totals 3.28 on
3 degrees of freedom, p-value = 0.35.
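A goodness-of-fit test of this kind is straightforward to compute. The sketch below is a minimal illustration, not the original calculation; in particular, the expected proportions shown are placeholders standing in for the internal test-set cell percentages from Table 5.9, which are not reproduced here.

```python
# Illustrative sketch of the goodness-of-fit test described above: observed Duke
# cell counts compared with expected counts derived from internal test-set cell
# percentages. The expected proportions below are placeholders, not the values
# actually used in the dissertation.

def chi_square_statistic(observed, expected_props):
    n = sum(observed)
    expected = [p * n for p in expected_props]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [3406, 477, 145, 125]               # Table 5.17 cells, read by row
expected_props = [0.820, 0.109, 0.041, 0.030]  # hypothetical test-set proportions
stat = chi_square_statistic(observed, expected_props)
print(stat)  # compare with a chi-square distribution on 3 degrees of freedom
```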
Thus, error rates for the discriminant model were slightly higher than predicted, but
the differences were statistically insignificant (despite the small standard errors involved).
The area under the ROC curve for the Duke sample (using data from both Tables 5.16 and
5.17) was estimated as 75.8% (compared with 76.2% using the internal test set estimates
from Table 5.9).
Table 5.18 - Result of applying classification tree to Duke sample
Cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived            died             TOTAL
PREDICTED     survived     2,921 (88.9%)       365 (11.1%)      3,286
OUTCOME                    (82.3%)             (60.6%)
              died           630 (72.7%)       237 (27.3%)        867
                           (17.7%)             (39.4%)
TOTAL                      3,551               602               4,153
Table 5.18 shows the result of applying the classification tree (shown in Figure 5.7)
to the Duke sample. The high risk respondents experienced a 27.3% death rate (somewhat
lower than the test set death rate of 35%). The tree correctly predicted 39.4% of all deaths
(compared with 44% in the test set) with a specificity of only 82.3% (compared with 86%).
Thus the tree model was less accurate than predicted; the error rate based on the test set was
0.3126, and the observed Duke error rate was 0.3371. A 2-sample, one-tailed t-test of these
errors (as calculated above) yields a p-value of 0.08, so the difference is probably real.
However, although the row margins shifted, they did so in the opposite direction from that
observed with the question sets; that is, there were more high risk respondents in the Duke
sample as classified by the tree (20.9%, compared with 18.5% in the test set.) The area under
the ROC curve was estimated as 67.7%, (compared with the test set estimate of 73.4%).
5.7 A comparative analysis of the models
The models above can be easily compared on predictive accuracy. Table 5.19
displays the central statistics of interest side-by-side. The general pattern is that the linear
discriminant analysis yielded the highest level of accuracy, followed by the question sets, and
lastly the classification tree. This order was maintained with respect to the internal test set
error estimates, the full EPESE dataset estimates, and the Duke dataset estimates. For all
three methods, the predictive accuracy when applied to the Duke sample was less accurate
than the predicted, internal test set estimates. Suppose one calculates standard errors and chi-
square tests based on the (admittedly questionable) assumption that the two datasets were
two independent simple random samples. Then these differences were not statistically
significant at the 0.05 level, although the increase in error for the classification tree was
bordering on significance (p-value = 0.08). For the discriminant model and question set
methods, however, the results indicated that any bias in the test set estimates of error was
relatively small, if not completely negligible. (The small increases in error that were observed
could have been due to the radical differences between the two samples). Overall, the linear
discriminant model performed most impressively; even the application of the linear
discriminant model to the Duke dataset predicted about as accurately as (if not more
accurately than) the internal test set error estimates for the question set method. However,
Table 5.19 - Performance of three methods of prediction
as applied to the North Carolina sample, and internal test sets from the original EPESE samples

                                      Internal test sets (N = 3,432)                  North Carolina sample (N = 4,153)
Method of        Model(4)             Sensitivity(5)  Specificity(6)  Area under       Sensitivity(5)  Specificity(6)  Area under
prediction                                                            ROC curve(7)                                     ROC curve(7)
Question         Set A (high risk)    66.2%           69.6%                            53.8%           78.5%
sets(1)          Set B (higher)       41.1%           88.1%           74.4% ± 1.2%     31.4%           91.1%           73.3%
                 Set C (highest)      22.2%           95.5%                            16.8%           97.0%
Discriminant     high risk            69.4%           70.2%                            --              --
analysis(2)      higher risk          43.4%           88.5%           76.2% ± 1.3%     41.5%           88.7%           75.8%
                 highest risk         23.6%           95.7%                            20.8%           95.9%
Classification   high risk            61.3%           72.4%                            --              --
tree(3)          higher risk          44.4%           86.0%           73.4% ± 1.3%     39.4%           82.3%           67.7%
                 highest risk         17.9%           95.3%                            --              --

1. The question sets are presented in Appendix I.
2. The linear discriminant model is given in Appendix VI.
3. The classification tree for the "higher risk" model is shown in Figure 5.7; the "highest risk" tree was a pruned version of this tree
(i.e., a subtree), and the "high risk" tree was a somewhat larger version.
4. For each method, three models were estimated; the "high risk" models were constructed with a relative cost of misclassification equal
to 5, the "higher risk" models with a cost of 3.5, and the "highest risk" models with a cost of 1.5.
5. Sensitivity is the proportion of deaths which were correctly predicted by the model.
6. Specificity is the proportion of survivors which were correctly predicted by the model.
7. The area under the ROC curve is a measure of each method's accuracy based on the levels of sensitivity and specificity achieved by the
models. An area of 100% would indicate a perfect predictor, while an area of 50% would indicate a useless predictor (i.e., an area of 50%
implies that there is no correlation between the predictions and the outcomes). The standard error was estimated with a "bootstrap"
resampling method.
the advantage in accuracy of the discriminant model over the question sets is not an
overwhelming one. The discriminant model was about 7.4% more accurate than the question
set method on the internal test set, and about 11% more accurate when applied to the Duke
sample.8
Notice, however, that (by comparing Tables 5.9 and 1.2) the discriminant model did
not gain any predictive power relative to what would have been expected by predicting death
rates based on age and sex. Next, the author calculated the ratio of the death rate for the high
risk persons in the discriminant model to the death rate predicted by the age and sex
distribution of those persons. This ratio was slightly lower than was estimated for the
question set method as shown in Table 1.2 (1.79 to 1.83 for the higher risk cutoff, and 2.09
to 2.12 for the highest risk cutoff). Thus, although the discriminant model may have been
able to predict more deaths than the question set method, these deaths tended to occur in
older males. Thus there was no gain over the death rates predicted by age and sex. That is,
the additional high risk respondents found by the discriminant model were already known
to be at high risk!
Perhaps more important than small differences in accuracy, however, are the less
tangible differences between the models themselves (and the methods). One of the most
glaring differences between the discriminant model and the two other methods of binary
classification is the ease of application as actual prognostic tools. Consider using the models
to diagnose a respondent in person. To score a respondent based on the discriminant model,
doing a substantial amount of arithmetic is necessary; this obviously allows more room for
8 These were calculated from the ratios of the areas under the ROC curves for the two methods
after subtracting off the baseline area of 50%.
error on the scorer's part (and assumes a certain level of numerical literacy). To apply the
question sets or the classification tree, however, one simply needs to obey simple conditional
rules of the form, "if X, then Y", so there is no arithmetic required. For the classification tree
with the highest level of sensitivity, though, the ease of use was hampered by the large,
complex structure of the tree (which is why the tree is not pictured). Thus, the simplicity of
representation achieved by the question set method offers more intuitive, "nonmathematical"
insight into the structure of the data than the linear model. Yet it achieves a greater accuracy
than the classification tree model, which is almost as easily interpreted.
Another important difference concerns the number of variables required for each
model (which is very important if one is interested in using the predictor variables to
construct a proxy of mortality risk). To some extent, this comparison is difficult to make
fairly, as the question set method actually consisted of three separate models, each requiring
a small number of variables. The linear discriminant model required 15 variables, but
question Sets B and C consisted of only seven questions (Set A had ten), with
considerable overlap in the variables actually needed for the two models.
However, the linear model could accommodate any cutoff level of risk (that is, any possible
combination of specificity and sensitivity on the ROC curve) with the same 15 variables. If
one totals the number of unique variables required for all three question sets (which still
provides only a narrow part of the complete range of potential specificity/sensitivity levels),
there are some 16 variables required. This is the same number used in the discriminant
model. However, the question sets had the advantage of being usable as separate modules
requiring between seven and 10 questions. Thus, the question sets required fewer variables
if one was willing to accept a fixed risk cutoff, but the discriminant model was more
effective for predicting along the entire range of cutoffs. The classification tree in Figure 5.7
required only six distinct variables, but many more questions were required for the larger,
high-sensitivity model.
The binary classification methods had another useful advantage over the discriminant
analysis that was particularly useful when considering the use of the models as proxies. For
some variables in the discriminant model, there were multiple levels of scoring when more
than two values were possible (e.g., the eighth question in Appendix VI). A problem with
this method arises when one wishes to use a set of predetermined proxy variables in a survey
where the necessary survey items are not coded or asked exactly as the original survey
questions from which the index was constructed. For example, the ADL variables used in
the question set method are quite common to many surveys in addition to the EPESE
questionnaire. Unfortunately, they are not usually phrased in the exact same manner, and the
possible answers for any survey question can vary from instrument to instrument. The binary
splitting of a variable is generally more adaptable to this type of problem, since this simpler
two-class classification is much easier to reconstruct based on different codings.
Constructing a discriminant model purely from binary splits would have been possible, as
with the tree or question sets, but the model would have been somewhat weaker.
As discussed in the theoretically-based comparisons of the methods in Chapter 2, it
is entirely possible to create a hybrid of the question set method and the discriminant analysis
model by forming interaction-based indicator random variables equivalent to the question
subsets and fitting the models parametrically. The results above suggest that this might
227
achieve even more powerful models, as there were many persons who were identified as high
risk by one model, but not the other.
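A minimal sketch of this hybrid idea follows (not the author's implementation): each question subset is collapsed into a single 0/1 indicator by taking the Boolean conjunction of its questions, and the indicators are then fit with a parametric logistic regression. The data, the subsets, and the library choice (scikit-learn) are all hypothetical stand-ins.

```python
# Minimal sketch (not the author's code) of the hybrid approach: question
# subsets become 0/1 indicator variables, which are then fit parametrically.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
# Hypothetical binary answers to individual survey questions.
answers = rng.integers(0, 2, size=(n, 6))

# Hypothetical subsets: each indicator flags respondents answering "yes"
# to every question in the subset (a Boolean conjunction).
subsets = [(0, 1), (2, 3, 4), (5,)]
indicators = np.column_stack([answers[:, list(q)].all(axis=1).astype(int)
                              for q in subsets])

# Simulated three-year mortality outcome (for illustration only).
y = rng.binomial(1, 0.1 + 0.2 * indicators[:, 0])

model = LogisticRegression().fit(indicators, y)
print(model.coef_, model.intercept_)
```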
5.8 An additional question set model
Table 5.20 - Predicted outcome by true outcome, Question Set J,
full dataset estimates
Cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived            died             TOTAL
PREDICTED     survived     7,539 (92.0%)       652 ( 8.0%)      8,191
OUTCOME                    (85.2%)             (45.0%)
              died         1,305 (62.1%)       798 (38.0%)      2,103
                           (14.8%)             (55.0%)
TOTAL                      8,844             1,450              10,294
Table 5.21 - Predicted outcome by true outcome, Question Set J,
test dataset estimates
Cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived            died             TOTAL
PREDICTED     survived     2,480 (91.3%)       235 ( 8.7%)      2,715
OUTCOME                    (84.2%)             (48.6%)
              died           466 (65.2%)       249 (34.8%)        715
                           (15.8%)             (51.4%)
TOTAL                      2,946               484               3,430
Question Set J in Appendix VII is the larger model, constructed with the RRESA(N)
method and Breiman et al.'s cost-complexity pruning method of backward deletion, as
discussed in the last section of Chapter 4. The full model, using a relative misclassification
cost of 3.5, consisted of ten subsets of three questions each, for 30 questions. The computer
required an average of about 20 hours to find a particular absorption point with the RESA
on a SparcStation Ultra. Since the RRESA(200) was applied to both the learning set and the
test set in this search, this method required more than 8,000 hours (333 days) of computing
time.9
Consequently, the final model was not constructed until the latter stages of the
research, so the model has not yet been validated with the Duke sample at the time of this
writing. Some readers may wish to treat this result with greater caution since it was achieved
with a different method, and was not validated on an independent sample. This is the reason
for considering the model separately, in this section.
The final model (Question Set J) consisted of 23 questions in ten subsets. All these
subsets consisted of two or three questions. Not one subset was dropped from the model,
suggesting that even larger models might be constructed without fitting variance. The result
of applying this model to the full dataset is shown in Table 5.20. The misclassification error
associated with this table is equal to:
(652·3.5 + 1,305) / (1,450·3.5 + 8,844) = 0.2577,
which is biased downward (by roughly 0.02, based on the above results). When the model of
a similar size (26 questions) was created on the learning set and applied to the test set, the
following classification was achieved (Table 5.21). Thus, a model like Set J is estimated to
9 Of course, since the repeated runs of the RESA were independent, the actual time needed to
conduct such a search could be reduced by using multiple computers in parallel. The author was able to use
six to eight computers at a time, thanks to the generous resources of the Statistical Computing Facility at the
University of California at Berkeley.
capture over half of all deaths with a specificity of 84.2%. The test set error associated
with Set J was estimated as:

(235·3.5 + 466) / (484·3.5 + 2,946) = 0.2777,
with a standard error of 0.0147. The area under the ROC curve was estimated as 76.6%.
This model outperforms the linear discriminant analysis on the test set, and requires only 17
unique variables to do so (compared with 15 for the linear model and 23 for the regression
model). Many questions in Set J are also found in Sets A through C, although in different
combinations. Interestingly, some questions included in Set J but not in Sets A through C
are also found in the latter part of the discriminant model (e.g., diabetes and hospitalization
for heart failure were in the final model, and smoking was also included in the learning set
model).
Chapter 6 - The causes of death in the elderly
6.1 Causes as identified by the death certificate
Of obvious interest to mortality researchers are the causal processes involved in
dying. Important clues about these processes may be gained as a byproduct of the predictive
models constructed above, if one is careful with the interpretation of the identified
correlations. The problem is that the mortality process is one of enormous complexity,
reflected by the incredibly large number of ways to die, as witnessed by any researcher
familiar with ICD (International Classification of Diseases) codes. Fortunately, some
important clues have already been provided along with the data, most notably the connection
between respondents and their death certificates. Because of the link, it was possible to tell
not only who had died, but what causes were listed on the certificate.
First, it must be recognized that death certificate data on the causes of death can be
a double-edged sword for the researcher. On the one hand, one has a great deal of
information about many respondents. This could include as many as 30 associated
conditions besides the underlying and immediate causes of death for each respondent.
Unfortunately, a great deal of inaccuracy and incompleteness pervades this information, and
in many ways it seems that the data can be more misleading through what is missing from
the certificate rather than what is included. First, the vast majority of death certificates are
not informed by autopsies and do not include any data from them (mostly because autopsies
are not usually done on elderly persons). Secondly, the causes of death are often extremely
difficult to pinpoint (even with autopsy information) because the process of death itself is so
complicated, usually involving more than one medical condition for elderly persons. The
typical strategy of identifying a single underlying cause does not account for the
multiplicity of conditions which may precipitate death. So it is often helpful to examine
more than one entry on the certificate's list of causes, provided one is lucky enough to have
this additional data on such respondents. The crux of the problem is that for many
respondents, the data are probably not fully present; inevitably, some "true" causes of death
must be absent from the certificates, and it seems likely that this is a substantial problem.
Thirdly, there is frequently a lack of uniformity in how certain conditions are reported and
coded (i.e., by ICD classification codes) on the death certificate. In the EPESE data, one is
fortunate to have data coded entirely by a single nosologist at each of the three centers,
hopefully providing some degree of uniformity. However, no information was available in
the EPESE data on the person who originally reported the causes on the certificate (e.g.,
whether it was a coroner, clinician or other). Thus, examining the methods by which death
certificates are typically completed (or supposed to be completed) by the general informant
is extremely important.
The relevant section of the death certificate for the purposes of this dissertation is the
listing of causes in the bottom area of the certificate. This section is divided into two parts:
Part I, which lists "immediate" and "underlying" causes, and Part II, containing "other
significant causes - conditions contributing to death but not related to the causes given in Part
I". Guidelines for filling out and coding this and other areas of the death certificate may be
found in several NCHS publications, including the Medical Examiners' and Coroners'
Handbook on Death Registration and Fetal Death Reporting, the Instruction Manual Part
2a - Instructions for Classifying the Underlying Cause of Death, and Instruction Manual
Part 2b - Instructions for Classifying Multiple Causes of Death. These manuals are the
sources for much of the information presented in this section.
Part I, as mentioned, contains the immediate and underlying causes of death and all
causes in between, with one cause listed on each line and additional causes listed as
necessary on improvised lines. (However, more than one condition was sometimes listed on
a single line.) The underlying cause is defined as "the disease or injury that initiated the train
of morbid events leading directly to death, or the circumstances of the accident or violence
that produced the fatal injury".1 The immediate cause is "the final disease, injury, or
complication leading directly to death" (emphasis added).2 Causes are listed in reverse order
according to their occurrence in time and causal ordering, so that the immediate cause is
listed first, the underlying cause is listed last, and all intermediate causes are listed between
them. If only one cause is listed, it appears on line "a.", and is considered both the
underlying and the immediate cause. An illustrative example given in the Medical
Examiners' and Coroners' Handbook is that of an unfortunate gardener who stepped on a
rake, contracted tetanus, and died of asphyxia (suffocation) during convulsions. Part I was
completed as follows:
Part I
a. Asphyxia
b. Convulsions
c. Tetanus
d. Infected laceration of foot
Thus infected laceration of foot was the underlying cause, which in turn caused tetanus,
1 Medical Examiners' and Coroners' Handbook on Death Registration and Fetal Death Reporting (1987).
2 Ibid.
which led to convulsions, which finally led to the immediate cause of death, asphyxia.
Additionally, the manner of death was recorded as accidental, and a short description of the
injury ("stepped on rake while gardening") was provided, both in the latter part of the
certificate. This example shows the causal and temporal order implied in the chain of events.
The approximate time between the onset of each condition and the cause of death is
sometimes listed on the actual death certificates, but unfortunately this information is not
coded into the EPESE data.
Part II of the certificate contains "any other important disease or condition that was
present at the time of death, and that may have contributed to death but did not result in the
underlying cause of death listed in Part I".3 Again, an example from the Medical Examiners'
and Coroners' Handbook demonstrates this classification: "On May 5, 1989, a 54-year-old
male was found dead from carbon monoxide poisoning in an automobile in a closed garage.
A hose, running into the passenger compartment of the car, was attached to the exhaust pipe.
The deceased had been despondent for some time as a result of a malignancy, and letters
found in the car indicated intent to take his own life." The death certificate was completed
as follows:
Part I
a. Carbon monoxide poisoning
b. Inhaled auto fumes
c.
d.
Part II
Cancer of stomach
The underlying cause was the inhalation of auto fumes, the immediate cause was carbon
3 Ibid.
monoxide poisoning, and the "other significant condition contributing to death but not
resulting in the underlying cause" was cancer of the stomach. Here, the stomach cancer was
not directly related (at least not in any physical or medical sense) to either the underlying or
immediate cause. It was only indirectly related in that it may have led to the decedent's
depression and therefore suicide. However, conditions listed in Part II do not necessarily
relate to the causes of death at all. Yet another example from the Handbook illustrates this:
"On July 4, 1989, a 56-year-old male was found dead in a hotel. Autopsy revealed
asphyxiation due to aspiration of vomitus - a result of acute alcohol intoxication. Blood
alcohol level was 0.350 gm percent." The death certificate contained these causes:
Part I
a. Asphyxiation
b. Aspiration of vomitus
c. Alcohol intoxication (0.350 gm percent)
d.
Part II
Alcoholic cirrhosis
Here the victim became excessively drunk and choked on his vomit. Also listed is cirrhosis,
which is a liver disease often resulting from the long-term consumption of alcohol.4
However, the bout of drunkenness resulting in death was not itself responsible for cirrhosis.
Nor was the disease any cause of the victim's intoxication. (While it is possible that the
victim was drinking to escape the disease psychologically, no indications to this effect
existed in the example). In no way did the victim's cirrhosis contribute to his death. So
according to the guidelines, only causes thought to contribute to death should be listed; yet
this designation is evidently subject to considerable interpretation and not strictly obeyed.
4 Wyngaarden and Smith (1988).
Several lessons can be learned from the examples above. First, notice that in all three
examples, the actual physical process surrounding the death was easy to identify for persons
familiar with the details. Yet, for a researcher faced with such data in coded form, it is
extremely difficult to pinpoint the actual process of death based on a handful of ICD codes.
A list of numbers does not tell the story of the gardener and his rake, nor the story of the
suicide (in fact suicide was never mentioned on the certificate, not an unusual omission). It
is also not at all clear to what extent, or in what way, any conditions listed in Part II might
have contributed to the death, so there is no way to work these causes into any
comprehensive story either.
Secondly, when considered from a causal standpoint, any given death is assumed to
be a result of a series of chain reaction events. However, it is often doubtful whether one can
truly identify a single "underlying" cause with any meaning. The gardener stepped on a rake,
but it was tetanus that eventually killed him. It seems both were necessary for his death to
occur (provided the laceration itself was not so life-threatening); neither event caused death
by itself. The only real distinction is one of temporal order, which may have been easy to
establish in these cases. With multiple chronic illnesses, a far more typical mode of death,
these connections are hardly so readily established. Suppose for example that a strong factor
in the gardener's demise was his chronically weakened immune system, which allowed him
to succumb to the infection. Presumably, this condition may have existed before the rake
incident. Where in the chain of events does this factor belong temporally? If the laceration
was minor, should one then list the immune condition as the underlying cause? What about
other chronic illnesses that work together simultaneously to deteriorate the body, so that no
single condition can be identified as underlying (e.g., arteriosclerosis and heart failure)? No
standards or rules exist for handling such ambiguities on the death certificate.
Thus consider that for many elderly persons the process of death is not always so easy
to recognize, even for a clinician who is intimately familiar with the patient. Such persons
frequently have any number of medical conditions and illnesses, and may also be under
multiple medications. Any or all of these events may interact or conspire to bring about the
ultimate demise of the body, and many of these factors may be entirely undetected by close
observers. Clearly, trying to pinpoint a cause for these deaths is quite inexact even for
experienced physicians; so trying to understand such a complex course of events based only
on a small list of incomplete ICD codes (which are themselves categorical representations
of a report that was often not derived from a physician's opinion) amounts to guesswork for
many decedents. It is with these caveats that one must examine any results from this data.
Part of this problem may be understood by trying to identify those certificates that are most
likely to have misspecified the causes of death, as suggested below.
6.2 Underlying and associated causes in the EPESE population
The underlying causes for the decedents in the EPESE sample, grouped into broad
categories, were presented in Section 3.6. As indicated, the most common underlying cause
of death was heart disease, followed by cancer and cerebrovascular disease, as one would
expect from a representative sample of U.S. elderly. However, although the general order
of causes was as expected, the proportions of deaths attributed to each cause were not as
expected. Instead, the underlying causes of death were much more diverse in the EPESE
sample, with more deaths categorized as "other". However, based on the discussion above,
examining the underlying cause alone is clearly insufficient. So the additional causes listed
on the death certificate were also examined.
There were 352 deaths (24% of all deaths) which had only one cause listed on Part
I of the death certificate, while 589 certificates (41%) listed two causes, 369 (25.4%) listed
three causes, and 134 (9.2%) listed more than three causes. On Part II of the certificate, only
about 42% of the documents listed a cause. There were 383 (26.4%) respondents who listed
one cause, 168 (11.6%) listed two causes, and 59 (4.1%) listed three or more causes. These
proportions were fairly close to what would be expected from a simple random sample of
death certificates in the U.S. For example, in a 1% simple random sample of U.S. death
certificates in 1988, about 43% of all certificates listed any causes in Part II, and roughly
75% of certificates contained more than one cause in Part I. To group the raw ICD codes
into meaningful categories, the ICD classifications shown in Table 6.1 were used. Any code
not included in the given ranges was classified as "other".
Table 6.1 - Classification of ICD-9 codes

ICD-9 code              Condition
390-398, 410-429        cardiovascular disease
140-239                 malignant neoplasms
430-438                 cerebrovascular diseases
440-454, 456-459        circulatory disease
400-404                 hypertension
250                     diabetes
487-496, 510-519        bronchopulmonary diseases
800+                    accidents
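As an illustration of how such a grouping might be applied in practice, the sketch below (not the original code) maps numeric ICD-9 codes to the categories of Table 6.1; real certificate data would require additional handling of E-codes, V-codes, and coding quirks.

```python
# Illustrative sketch: group numeric ICD-9 codes into the broad categories
# of Table 6.1; any code outside the listed ranges is classified as "other".

CATEGORIES = [
    ([(390, 398), (410, 429)], "cardiovascular disease"),
    ([(140, 239)],             "malignant neoplasms"),
    ([(430, 438)],             "cerebrovascular diseases"),
    ([(440, 454), (456, 459)], "circulatory disease"),
    ([(400, 404)],             "hypertension"),
    ([(250, 250)],             "diabetes"),
    ([(487, 496), (510, 519)], "bronchopulmonary diseases"),
    ([(800, 9999)],            "accidents"),
]

def classify_icd9(code):
    """Return the broad cause-of-death category for a numeric ICD-9 code."""
    whole = int(float(code))   # drop any decimal subdivision, e.g. 410.9 -> 410
    for ranges, label in CATEGORIES:
        if any(lo <= whole <= hi for lo, hi in ranges):
            return label
    return "other"

print(classify_icd9(410.9))   # cardiovascular disease
print(classify_icd9(799.1))   # other
```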
Suppose one considers immediate causes, that condition listed at the top (line "a")
of Part I on the certificate (which was identical to the underlying cause for the 352
respondents with only one cause listed). Of the death certificates listing only one cause,
some 31% listed cancer, and 41% listed heart disease. After removing persons with only one
cause listed, it was observed that of the 1,092 remaining certificates, more than half (580)
of these immediate causes were categorized as heart disease, and only 77 (7.1 %) were
categorized as cancer. There were 52 of these deaths with cerebrovascular disease listed as
the immediate cause, 31 certificates listed bronchopulmonary diseases, 22 listed accidents,
and 18 listed circulatory disease. It should also be noted that there were 127 certificates
which actually listed more than one condition on the first line of Part I (despite the
suggestion in the section above that only one cause should be listed per line). These conditions
were ignored in the above tallies. Most of these were also listed as heart attack or cancer.
There were 1,002 certificates that also listed a condition on line "b" of Part I. Of
these, 400 listed heart disease on line "b", 160 listed cancer, 41 listed cerebrovascular
disease, 64 listed circulatory disease, and 56 listed bronchopulmonary diseases. Some 408
certificates listed a cause on line "c", of which only 137 (33.6%) listed heart disease, and
only 56 (13.7%) listed cancer. Circulatory diseases were listed on 37 certificates (9.1%), 24
listed bronchopulmonary disease, and between ten and twenty certificates listed diabetes and
hypertension each (the latter two having been nearly absent in the above sections of the
certificate). Thus, circulatory disease, bronchopulmonary disease, diabetes and hypertension
appeared more frequently in this section of the certificate than in line "a". There were only
24 certificates listing any causes on line "d".
As explained above, Part II of the certificate contains any other causes that may have
contributed to death that were not associated with the immediate and underlying conditions
in Part I. There were 610 certificates that listed at least one condition on this part, and 227
that listed two or more conditions. Of the conditions listed first, 107 (17.5%) certificates
listed heart disease, 55 (9%) listed cancer, 71 (11.6%) listed diabetes, 69 listed
bronchopulmonary disease, 39 (6.4%) listed hypertension, 36 listed cerebrovascular disease,
and about two dozen listed circulatory disease and accidents each. In this section of the
certificate then, one is much more likely to find diabetes, bronchopulmonary disease, and
hypertension than in other parts of the certificate.
The general pattern observed in the death certificates then, is that cancer and heart
disease tended to dominate as immediate and underlying causes (with cancer more frequently
being considered both the immediate and underlying cause). Other conditions
such as circulatory disease, cerebrovascular disease, and bronchopulmonary disease were
seen as illnesses which participated in the causal chain of events leading to death.
Conditions such as diabetes and hypertension were apparently seen more as aggravating
conditions that contributed to death, but were not in the "chain of events" leading to the
immediate cause of death as described in Part I of the certificate.
To examine the associated conditions further, it was decided to group together all
those conditions listed in both Part I and Part II, with the sole exclusion of the condition
identified as the underlying cause. Thus the unit of analysis became the associated conditions
themselves rather than the respondents. The reasons for conglomerating all associated
conditions on the certificate, regardless of their location on the document, are discussed in the
section above. Briefly, it was not entirely clear that the location of a condition on the
certificate was really a meaningful indicator of its position in the causal "chain of events"
leading to death. Moreover, this assumes that such a straightforward chain of connections
could really be established, which is debatable.
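The reshaping just described, from one row per respondent to one row per associated condition, can be sketched as follows. The DataFrame and column names (`cert_id`, `underlying`, and the `cond_*` columns holding every condition listed in Parts I and II) are hypothetical, not the dissertation's actual variable names; this is only an illustration of the step, under those assumptions.

```python
import pandas as pd

def associated_conditions(certs: pd.DataFrame, cond_cols: list) -> pd.Series:
    # One row per (certificate, listed condition), carrying the underlying
    # cause alongside so it can be excluded below.
    long = certs.melt(id_vars=["cert_id", "underlying"], value_vars=cond_cols,
                      value_name="condition").dropna(subset=["condition"])
    # Exclude the condition identified as the underlying cause on each certificate.
    long = long[long["condition"] != long["underlying"]]
    return long["condition"]

# Example tally by broad category, given some mapping function `categorize`:
# associated_conditions(certs, ["cond_1", "cond_2", "cond_3"]).map(categorize).value_counts()
```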
After removing all underlying causes, a total of 1,516 associated conditions remained
(drawn from a total of 1,092 death certificates). Interestingly, these conditions were
dominated by those in the "other" category, which accounted for 598 (39.4%) of all
conditions. Only 326 (21.5%) conditions were categorized as heart disease, 141 (9.3%) as
bronchopulmonary diseases, 7.3% as circulatory diseases, and about 5% as each of
cerebrovascular disease, diabetes, and accidents. Only 3% were attributed to cancer. Since
the "other" conditions were so predominant, it was decided to examine these codes in more
detail. It was determined that of the 498 "other" conditions, 203 (41%) of the codes were
between 780.0 and 799.9, which contains all those "symptomatic" conditions, the causes of
which are unknown to the observer (e.g., weight loss, convulsions, or coma). These codes
were almost exclusively in the 799.0 - 799.9 range, consisting mostly of respiratory failure,
with a handful in the range 785.0-785.9, which consisted of symptoms involving the
cardiovascular system (mostly shock and gangrene). Of the remaining 296 "other" codes,
126 were between 560.0 and 579.9, consisting of diseases of the urinary and digestive tracts,
and the rest were distributed quite widely among all the remaining codes.
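The range-based grouping of ICD-9 codes used here can be made concrete with a short sketch. Only the boundaries quoted in the text are faithful (the symptomatic block 780.0-799.9, with 785.0-785.9 and 799.0-799.9 singled out, and the digestive/urinary block 560.0-579.9); the bucket labels and the catch-all branch are illustrative, not the dissertation's actual coding scheme.

```python
def icd9_bucket(code: float) -> str:
    """Map a numeric ICD-9 code to the broad buckets discussed in the text."""
    if 780.0 <= code <= 799.9:
        if 799.0 <= code <= 799.9:
            return "symptomatic: respiratory failure and related"
        if 785.0 <= code <= 785.9:
            return "symptomatic: cardiovascular (shock, gangrene)"
        return "symptomatic: other ill-defined"
    if 560.0 <= code <= 579.9:
        return "digestive and urinary"
    return "other"

# Example: icd9_bucket(799.1) -> "symptomatic: respiratory failure and related"
```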
6.3 The causes of death associated with the models
Suppose now that one considers the causes of death as they were listed for the sets
of persons chosen by the questions in Appendix I. First, consider the underlying causes
listed on the death certificates of the deceased among the high risk respondents identified
by these models. To examine the small groups of deaths captured, it was necessary to group
the conditions more broadly to maintain enough observations. Thus, four main groups of
conditions were identified: heart disease, cancer, cerebrovascular disease, and all other
remaining causes.
The persons chosen by Question Set A were dominated by heart disease deaths, as
431 (some 48%) of the 904 correctly predicted deaths were so listed, followed by 266 deaths
(29.4%) in the "other" category, 140 cancer deaths (15.5%), and 67 deaths (7.4%) due to
cerebrovascular disease. Thus, cancer deaths were substantially underrepresented, while
more heart disease and "other" deaths were captured. It appears that death due to cancer is
more difficult to predict than other types of death (congruent with the observation that cancer
itself was difficult to predict). The deaths correctly predicted by Set B were also skewed
toward heart disease and away from cancer. More than 50% of the deceased respondents
(308 out of 606 total deaths) had heart disease listed as the underlying cause, only 13.7%
were cancer deaths, 7.8% were cerebrovascular, and 27.7% were classified as "other".
The causes of death varied quite substantially from subset to subset, however. For
example, consider subset B.1, which asks about digitalis usage and whether the respondent
could walk half a mile without help. This set correctly predicted 267 deaths, of which
157 (58.8%) had heart disease listed as the underlying cause, only 24 (9%) had cancer listed,
and 21 (7.9%) certificates listed cerebrovascular disease. Suppose one computes a chi-square
statistic for this distribution, where the expected proportions are those observed in the set of all
deaths in the sample. The statistic is 37.9 on three degrees of freedom, suggesting that these
discrepancies were highly unlikely to be due to randomness. Consider, in contrast, the
deaths chosen by subset B.2, which selects out males who weigh less than 168 pounds and
cannot do heavy work. Of the 277 deaths correctly predicted by these questions, 117
certificates listed heart disease (42.2%), 50 listed cancer (18.1%), 22 listed cerebrovascular
disease (7.9%) and 88 (31.8%) listed other causes. Thus, the respondents chosen by the
questions in B.2 were more than twice as likely to have died of cancer as the persons
chosen by subset B.1, while those chosen by B.1 were about 40% more likely to have died of
heart disease. Table 6.2 shows the breakdown by cause for each of the unique question
subsets in Sets A and B. Note that subset B.2 was identical to subset A.4. The number of
deaths chosen by Set C was too small to provide reliable data when broken down by cause.
Table 6.2 - Underlying causes of death by question subset
cause    A.1        A.2        A.3        A.4        A.5       B.1        B.3
heart    105 (45%)  231 (57%)  188 (49%)  117 (42%)  72 (53%)  157 (59%)  96 (53%)
cancer   34 (15%)   47 (12%)   49 (13%)   50 (18%)   17 (12%)  24 (9%)    25 (14%)
stroke   16 (7%)    27 (7%)    24 (6%)    22 (8%)    14 (10%)  21 (8%)    12 (7%)
other    79 (34%)   101 (25%)  120 (32%)  88 (32%)   34 (25%)  65 (24%)   48 (27%)
total    234        406        381        277        137       267        181
Overall, heart disease as an underlying cause is more strongly associated with Sets
B.1, A.2, A.5 and B.3. Cancer was more likely to be listed on the certificates of respondents
chosen by Sets A.4 and A.1. Stroke was slightly more likely in those chosen by A.5, and the
"other" category was represented more strongly by Sets A.1, A.3 and A.4.
Of the 904 deceased respondents chosen by Set A, there were 676 (75%) with more
than one cause listed on the certificate. After removing the conditions specified as
underlying causes and conglomerating all the remaining conditions on the certificates as
above, it was found that on these 676 certificates, a total of 928 conditions were listed. Of
these 928 "associated" conditions, 375 (40.4%) were categorized as "other", 209 (22.5%)
were heart disease, and 82 (8.8%) were bronchopulmonary diseases. These proportions (and
the proportions of the other categories as well) were quite close to the proportions observed
in the entire sample of deaths (as listed in Section 2.2 above). When the conditions were
broken down by subsets, the same pattern was observed. When the associated conditions
were then conglomerated for the respondents chosen by Question Set B, the same patterns
were again observed, except that heart disease (26.1% of all conditions) was slightly more
frequent relative to the "other" conditions (36.8%).
Overall, the associated conditions were dominated by the "other" category, which
accounted for about 40% of all conditions, followed by heart disease (about 24%), and then
bronchopulmonary disease (about 9%), no matter which group of respondents was examined:
those chosen by Set A, Set B and their subsets, or the entire sample of deaths. Many of these
conditions (roughly a third of the "other" conditions) were completely symptomatic (of
entirely unknown etiology). Moreover, there was an extremely large degree of heterogeneity
associated with the various conditions, as no single category of disease (excepting the
catchall "other" category) was dominant: the single most prevalent condition was the
extremely broad category of "heart disease", which only accounted for about a quarter of all
conditions!
6.4 The causal processes and risk factors associated with death
At this point, the reader may be well aware of the double-edged nature of the death
certificate data. As demonstrated above, there is a vast amount of information associated
with death certificates. However, it is of highly dubious quality, and extremely difficult to
interpret on any substantive basis (partly because of the high degree of heterogeneity in the
causes of death). Unfortunately, a sensible story is difficult to decipher from the data without
much additional information, or at least some well-reasoned conjecture. It is argued below
that some additional information is available in the form of the models themselves.
However, these models tell only half the story. To make a reasonable interpretation, it is
necessary to look at the picture presented by the above data as a whole, and to consider the
deficiencies that are likely to distort this picture.
First, although it could not be proven solely with the data here, it was strongly
suspected that a fair number of ailments that may have contributed to death were either
misdiagnosed or entirely undetected. This may also be the case in the U.S. population as a
whole (based on previous analyses of U.S. death certificates conducted by the author), but
the EPESE deaths were probably less accurately documented. That inaccuracies exist in
death certificate data is hardly an original opinion, as demonstrated by studies comparing
actual autopsy results with death certificates. However, it was also suspected that cancer was
the illness that accounted for a disproportionately large number of undetected or
misdiagnosed conditions. Secondly, one particularly large group of high risk respondents
existed for whom death due to heart disease was excessively high. The author wondered
whether these respondents were actually suffering deaths due not to heart disease itself, but
toxicity from a purported treatment for heart disease: digitalis.
Several signs pointed to cancer as the culprit for much of the "missing morbidity".
First, cancer deaths were underrepresented in the EPESE sample relative to the U.S.
population of elderly, as indicated in Section 3.6. Secondly, the respondents in the sample
were "working class", and had lower incomes compared with the U.S. population. In the
East Boston and New Haven samples, two-thirds of the respondents had household incomes
less than $10,000 per year; in the Iowa sample, incomes were somewhat higher, as about
63% of respondents had household incomes less than $15,000. It was thought that because
these poorer persons were less likely to be seen by a clinician than wealthier elderly persons,
they were probably less likely to be diagnosed with cancer (and therefore less likely to have
cancer listed as a cause of death). This was observable in the data, although the relationship
was not extremely strong: about 13% of persons with an income less than $5,000 reported
having been diagnosed with cancer, compared with 15% of persons with incomes of $15,000
or more. In studies of the relatively wealthy women in Marin County, for another example,
breast cancer rates were also observed to be abnormally high; this was thought by some to
be the result of increased detection rather than a truly higher incidence of cancer, although
these issues have been hotly debated. It was also observed in the EPESE sample that among
persons who had seen a dentist within the previous five years, the lifetime prevalence of
cancer was higher than in those who had not seen a dentist (15% versus 12%). This was
despite the fact that the latter group was more likely to be smokers or ex-smokers than those
who had seen a dentist (45% versus 39%). These sorts of results suggest that many
cancers in the general population go undetected, since much of the elderly population is not
wealthy.
It was also thought that many deaths that were truly (or in part) due to cancer instead
had unknown or misdiagnosed causes listed on the certificate (e.g., the "symptomatic" ICD
codes of unknown etiology, ranging from 780 to 800, which make up many of the deaths in
the "other" category). For instance, it seemed plausible that lung cancer might be appearing
as respiratory failure (the most common symptomatic condition), or that tumors affecting the
nervous system might appear as convulsions. Table 6.2 suggested that some of these deaths
may have been identified by the question subsets A.4/B.2 and A.1 (particularly the former).
Consider the questions in subset A.4: this subset picked out males weighing less than 168
pounds who could not do heavy work. The fact that these men had lighter weights was likely
a sign of the physical "wasting away" which occurs in many cancer victims just before death.
Also, more than 68% of the persons identified by Set B.2 were either current smokers or
ex-smokers (compared with 39% of the persons not identified by Set B.2). Yet amazingly,
only 14.1% of the persons identified by Set B.2 had ever been diagnosed with cancer at
baseline (compared with 13.8% for all those not chosen by B.2)! There is a clear implication
for health policy: elderly males weighing less than 168 pounds who are unable to do heavy
work should be encouraged to be particularly vigilant against cancer, regardless of their
observed history with the disease. This also suggests that researchers interested in studying
risk factors associated with cancer should be very careful to control for factors that may
affect the likelihood of detection.
At this point, there is an interesting question concerning the exact causal nature of
the respondent's weight at the baseline time of interview. It seemed to be associated with
the relatively large numbers of deaths due to cancer and "other" causes. As always, one is
concerned with the issue of whether this variable is essentially symptomatic of the condition
that is causing death, or whether it is itself a "causal" risk factor for death (meaning that
mitigation of the variable would reduce death rates). The etiology suggested above (that
cancer was responsible for many of these deaths) would suggest that the weight variable is
entirely symptomatic, as one cannot sustain a cancer patient merely through nourishment.
However, it is also easy to argue that for many elderly persons a lack of body weight may
actually contribute to death in the sense that the body's tissue mass and fat reserves may
serve as a type of "cushion" against diseases which cause loss of appetite. Thus, it may be
that an elderly person of lower weight succumbs to some illness more rapidly due to the
shortened time period needed to drive the body mass down below some "healthy weight"
threshold.
Notice also that the relationship between body weight and mortality is quite
nonlinear, as was suggested by the size of the coefficient on the squared term in the logistic
regression model. Thus, excessively high body weight was not associated with lower
mortality; rather there seemed to be a threshold weight below which there was a high risk of
mortality. (Note that the question set method deals with this sort of nonlinearity more
adeptly than the linear discriminant analysis, since the latter model is forced to treat the
variable linearly.) Also, it appeared that a better variable for capturing the causal effect of
obesity was the weight of the respondent at previous ages rather than at baseline. This
variable appeared in all the fitted models as strongly and positively correlated with death,
even when all other variables were controlled. Set A.3 suggests it is a particularly good
predictor of death for those above age 80.
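The contrast drawn here between the linear model's treatment of body weight and the question-set threshold can be sketched as follows. The 168-pound cut is taken from the text; the simulated data, the variable names, and the fitting calls are hypothetical, and the outcome is generated only to mimic the kind of threshold effect described.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
weight = rng.normal(165, 30, size=1000)          # hypothetical baseline weights (pounds)
# Simulated three-year death indicator with an assumed risk jump below 168 pounds.
died = (rng.random(1000) < np.where(weight < 168, 0.30, 0.15)).astype(int)

# (1) Polynomial encoding, as in the logistic regression model: weight and weight squared.
X_poly = np.column_stack([weight, weight ** 2])
poly_fit = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_poly, died)

# (2) Threshold encoding, as in the question sets: a single yes/no item (weight < 168).
X_thresh = (weight < 168).astype(int).reshape(-1, 1)
thresh_fit = LogisticRegression().fit(X_thresh, died)
```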
The same issue of association or causality arose with the questions concerning the
functionality of the respondent. That is, if one cannot walk half a mile (or simply does
not attempt to do so under the belief that it would be too stressful), is this symptomatic of
disease, or does it contribute to one's demise? Both arguments are quite plausible, but it is
not clear to what extent one effect dominates over the other. The best way to answer such
a question would be to institute clinical trials designed to assess the efficacy of increasing
mobility and functionality in the elderly. If decreased functionality has even a moderate
influence on mortality, these results suggest that the impact on mortality rates would be
substantial. The argument for causality with respect to questions that measure mental
functioning is more difficult to make; it appears that these variables were largely
symptomatic.
Now consider the question subsets related to the highest levels of heart disease
mortality. The subsets most strongly related to heart disease (Sets B.1 and A.2) had one very
powerful predictor in common: the use of digitalis by the respondents chosen.
6.5 Digitalis use and mortality: cause or consequence?
Consider the 1,294 persons who were taking digitalis at baseline (12.6% of all
respondents). Of these, 792 (61%) were female, and 551 (43%) were younger than 75. Yet
there were 406 deaths among these persons in three years, a 31.4% probability of death.
There were nearly twice as many deaths as should have occurred based on the sex and age
distribution of these persons. Moreover, these 406 deaths accounted for 28% of all deaths
in the sample, although only 12.6% of respondents were on the drug!
The main question of interest is whether or not the use of the drug was symptomatic
of heart disease (clearly, physicians usually do not prescribe it to persons without some form
of heart disease), or whether it was a cause of death through digitalis toxicity. Initially, the
author assumed that the former was the case; however, it became increasingly clear that the
correlation between digitalis use and death was impossible to explain away with any other
variable. It was judged to be the third most important predictor (behind only age and sex)
in both the linear discriminant analysis and the logistic regression model; yet both
controlled extensively for every observed predictor thought to have any predictive power at
all, including previous diagnoses of heart failure. It is true that this still does not establish
a causal relationship. The problem was that many important predictors were not observed,
and most of the persons who were taking digitalis had heart trouble first; thus, one expects
these persons to have a high death rate despite digitalis use or even heart failure. Still, the
association between digitalis and death was powerfully stubborn. To see this explicitly,
consider two very different groups of persons. There were 665 respondents who reported
that they had been diagnosed with heart failure, but who had never used digitalis. Of these
persons, 126 died, a death rate of 19%. Another group of respondents reported that they had
never been diagnosed with heart failure, but were using digitalis at baseline. Of these 826
persons, 244 died (29.5%). The digitalis users with no heart failure, then, had a probability
of death more than half again as high as those who had suffered heart failure but had never
used digitalis; yet the group with the lower death rate (heart failure but no digitalis) was the
older of the two groups! The highest death rate belonged to the persons who had suffered heart
failure, and used digitalis; of these 468 persons, 162 (34.6%) died.
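The four death rates quoted in this paragraph amount to a two-by-two cross-tabulation of digitalis use against a prior heart failure diagnosis. A minimal sketch, assuming a hypothetical respondent-level DataFrame with 0/1 columns `digitalis`, `heart_failure`, and `died` (not the EPESE variable names):

```python
import pandas as pd

def death_rates_by_group(df: pd.DataFrame) -> pd.DataFrame:
    # Mean of the 0/1 death indicator within each digitalis x heart-failure cell,
    # together with the cell sizes.
    grouped = df.groupby(["digitalis", "heart_failure"])["died"]
    return pd.DataFrame({"n": grouped.size(), "death_rate": grouped.mean()})

# Pattern reported in the text: (digitalis=0, heart_failure=1) ~ 19%,
# (digitalis=1, heart_failure=0) ~ 29.5%, (digitalis=1, heart_failure=1) ~ 34.6%.
```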
It was this sort of finding that prodded the author to research digitalis and its use in
the elderly community. It was with some amazement, then, that the author discovered a
raging controversy in the medical community concerning the use of digitalis. Essentially,
the drug has been hotly debated since its very discovery (hundreds or probably thousands of
years ago).5

Digitalis, which is extracted from the leaves of the plant Digitalis lanata, is one
of the cardiac glycosides, a group of drugs that directly affect the muscle tissue of the heart.
It is usually prescribed for supraventricular fibrillation (essentially rapidity or irregularity
of the heart beat originating outside the ventricular, or lower, chambers of the heart, e.g., atrial
fibrillation and flutter) or for heart failure itself. The influence of the drug is highly dose-related.
At normal dosages, it causes a slowing of the heart rate (called a negative chronotropic
effect), an increase in the force of systolic contractions (called a positive inotropic effect),
and decreased conduction velocity through the atrioventricular node.6 In the past decade,
digitalis has been one of the most commonly prescribed drugs in the U.S., with some 21
million prescriptions in 1990.7
A known problem with digitalis is that it becomes fatally toxic if the concentration
of digitalis in the blood becomes too high. Unfortunately, very little is known about exactly
what the optimal dosage is, although it appears that the lethal dose is not very high relative
to the effective dose. Moreover, the proper dosage may vary tremendously from person to
person depending on any number of variables, such as age, body weight and renal
functionality. This last factor is tremendously important; persons who have bodies which
are less able to excrete wastes efficiently (as is frequently the case for those who are less
mobile and elderly persons overall) are much more susceptible to toxicity. This is because
the concentration of the drug in the blood can build up to a toxic level much more easily.

5 For a summary of the present state of the debate, see Milton Packer's editorial comments in the
New England Journal of Medicine, "End of the Oldest Controversy in Medicine: Are We Ready To
Conclude the Debate on Digitalis?" (1997).
6 See http://www.rxlist.com/cgi/generic/dig.htm for an excellent summary of the pharmacology
and chemistry of digitalis.
7 The Digitalis Investigation Group (1996).
Many substantial studies of digitalis toxicity exist.8 Based on some of these studies,
it does not appear to be a large problem. For example, Warren et al. estimated that only
0.85% of Medicare beneficiaries who used digitalis were hospitalized annually for adverse
effects from digitalis. Similarly, Kernan et al., using the New Haven EPESE sample,
estimated that only 4-6% of digitalis users were hospitalized for toxicity in a period of 4-6
years. However, both studies suffered serious deficiencies. First, both studies ignored
persons who died. Secondly, since the optimal dosage (which is not really known) varies
from person to person, and since "toxicity" is defined on the basis of a somewhat arbitrarily
high blood serum threshold concentration of digitalis, it was not clear that the "true"
incidence of toxicity was estimated. ("True" toxicity is thought of as that level of digitalis
concentration at which the health of the individual is negatively affected by the drug). There
were also a substantial number of studies which asserted that digitalis toxicity was much
higher. However, these were anecdotal or setting-specific (i.e., applying to some particular
group such as persons in nursing homes or patients already known to be suffering toxicity).9
None of these could document toxicity systematically in a large, representative group of U.S.
elderly. A number of other researchers using observational data have also noted the
persistence of the association between digitalis use and mortality despite the attempt to
control for many potential confounders.10
8 See Kernan et al. (1994) and Warren et al. (1994).
9 For example, see Aronow (1996).
10 See Moss et al. (1991) and Bigger et al. (1985).

A central problem in detecting digitalis toxicity, besides the ambiguity of the
definition, is that it usually kills the recipient through arrhythmia or heart failure; both are
precisely those ailments for which it is prescribed! Thus, if toxicity is not checked for
explicitly (which is the case for the vast majority of deaths) it is very likely to go undetected.
An ignorant observer would likely assign heart failure or arrhythmia as the underlying cause
of death. For elderly persons, who are generally much more susceptible to toxicity, and who
may require lower dosages, the problem is exacerbated; even if blood serum tests are
performed, the problem would not necessarily be detected because true toxicity may be
assumed away by definition. Consider that of the 1,450 deaths in the EPESE sample, only
one was attributed to digitalis toxicity (having an ICD-9 code of 792.1)!
estimate at face value, only 0.07% of digitalis users die from it in three years, a suspiciously
low estimate. There is little doubt that toxicity is going undetected; the question is to what
degree it may be killing people.
The National Institutes of Health, increasingly interested in the efficacy and safety
of digitalis, has sponsored several large randomized, double-blind clinical trials to examine
the drug.11 In the most recent of these (with results published in the February 1997 issue of
the New England Journal of Medicine), the Digitalis Investigation Group (a consortium of
researchers created by the NIH to study the issue) assessed the efficacy and safety of Digoxin
in a randomized, double-blind controlled clinical trial with 6,800 patients with heart failure
and normal sinus rhythm. There were 1,181 deaths in the treatment group (34.8%), and
1,194 deaths in the placebo group (35.1%) in an average of 37 months; thus, there was
almost no difference in mortality between the two groups. The benefits of the drug were
highly symptomatic; essentially, patients receiving the drug suffered somewhat less
discomfort and slightly fewer hospitalizations (6% fewer in the treatment group).

11 See The Digitalis Investigation Group (1997).
This result is representative of many results from previous clinical trials. Although
admitting that there was no impact on mortality and that the actual savings implied by the
decrease in hospitalizations was negligible, Dr. Packer concludes, "For most patients with
heart failure, digitalis remains an effective, safe, and inexpensive choice for the relief of
symptoms".
However, several mitigating conditions must be considered before drawing broad
conclusions that generalize from this study to most U.S. elderly who use the drug. First, the
study was not intended to examine elderly patients per se (the group that probably accounts
for most of the susceptible population), as only 27% of the patients were over age 70. Secondly, there
was a concerted, thorough effort to monitor dosages closely through the periodic
measurement of serum concentrations, and dosages were adjusted accordingly. Thirdly, the
patients appeared to be relatively healthy in comparison with many elderly persons who use
the drug in reality. 12
If one believes that there is a neutral effect on mortality when the drug is applied to
relatively young, healthy patients in a highly monitored clinical setting, as suggested by the
results of clinical trials, then the safety of the drug should be highly suspect when applied to
elderly, sicker, less functional persons in a real-world setting. Consider the following cost-
benefit analysis: If digitalis use is truly neutral with respect to mortality, then there is no real
harm in discontinuing the use of the drug, other than moderate symptomatic worsening.
12 A lengthy list of the exclusion criteria may be found in The Digitalis Investigation Group (1996).
However, if digitalis does have even a moderate positive impact on the risk of death in
elderly persons, the sheer numbers imply a substantial degree of excess mortality from this
drug. The implications for clinicians and public health researchers are clear. First,
physicians should be extremely cautious when prescribing the drug to elderly, less functional
patients, and particular care should be taken to monitor and moderate dosages. Secondly,
more extensive studies are needed to estimate the true level ofdigitalis toxicity and death due
to digitalis in elderly persons. Finally, it was worth noting that models of the sort in
Appendix I frequently chose male respondents who used digitalis and who weighed less than
about 160 pounds as being at particularly high risk of death. It is possible that prescribed
dosages of digitalis are not being adequately adjusted for the below-average body weight of
these persons.
Chapter 7 - Conclusions
7.1 The power of the models for predicting mortality
This dissertation has achieved several goals. First, it presented several powerful,
compact models for predicting three-year mortality in elderly persons. Second, it defined a
nonparametric model structure for binary prediction analogous to classification trees. Third,
the author assembled a search algorithm for selecting such models, and described a method
for using an internal test set to select model sizes and estimate prediction error. The
estimates of error were then validated on an independent sample of elderly persons. Finally,
the fitted models raised several interesting hypotheses concerning the causes of death in the
elderly, and suggested serious pitfalls that must be navigated by mortality researchers.
The primary focus of the research was the first of these objectives. In short, the
appendices of this dissertation present several small but efficient models for predicting short
term mortality and morbidity. Some of these models required as few as seven variables. Yet
they can detect about 40% of deaths with a specificity of 88%. The question set method
resulted in a ROC curve with an area estimated at 74.4% ± 1.2%. When these models were
applied to an independent sample of elderly persons with a radically different demographic
makeup, the estimates of error were shown to be quite accurate. The differences between the
test set estimates of error and the validation estimates were no bigger than the sampling
errors that could have occurred had the respondents been chosen with a simple random
sample.
The accuracy achieved by Question Sets A through C was superior to the accuracy
of the classification trees, but less than that of the linear discriminant model. This order was
maintained whether the models were applied to the learning set respondents, the test set
respondents, or the Duke sample. However, the differences were generally small in
magnitude. Logistic regression could achieve the same accuracy as the linear discriminant
model, but used more variables (particularly when fit with the AIC criterion). It also seems
possible to build larger question set models (such as Set J, consisting of 23 questions) with
even greater predictive power than the linear discriminant fit. Set J correctly predicted 51%
of deaths in the test set with a specificity of 84%, but it has not yet been validated with an
independent sample.
The models provided an interesting alternative to the more conventional, parametric
methods of prediction (such as linear discriminant analysis and logistic regression). Besides
requiring fewer variables, the form of the predictors (combinations of simple questions)
allowed for an ease of interpretation that was not available through equation-based methods.
It was possible to understand the structure of the models with little knowledge of
mathematics or distribution theory, and the results provided a stark contrast to more
complicated forms. They also provided valuable insights into mortality processes, as
discussed below.
7.2 The efficiency of the search method
The RSA was defined as the basic random search algorithm for finding question sets
that could not be improved by replacing any single question. The method was likely to
achieve a suboptimal model if the algorithm was only run once. Consequently, N
independent runs of the algorithm were made with independently generated starting models,
and the best model out of these N runs was selected (the RRSA(N) method). With very small
model sizes (i.e., two questions), it was proven that the RRSA(N) method selected the
optimal model with a high probability, even for a reasonably small N. For moderately sized
models (seven to ten questions), a brute force argument was made to suggest that the optimal
model could be found with RRSA(N) with an N of at least 100, but this could not be proven.
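The RSA/RRSA(N) procedure described here can be sketched as a repeated random-restart search. The scoring function and the question pool are placeholders, and the stopping rule below (a fixed number of consecutive unsuccessful mutations) is only a stand-in for the "no single replacement improves" criterion described in the text; only the overall control flow (random start, single-question mutations accepted when they improve the score, best of N independent runs) follows the description.

```python
import random

def rsa(questions, k, score, rng, max_stall=1000):
    """One run of a basic random search: start from a random k-question set and
    accept random single-question replacements that improve the score, stopping
    after max_stall consecutive unsuccessful mutations (an assumed stopping rule)."""
    current = rng.sample(questions, k)
    best = score(current)
    stall = 0
    while stall < max_stall:
        i = rng.randrange(k)
        q = rng.choice(questions)
        if q in current:
            stall += 1
            continue
        candidate = current[:i] + [q] + current[i + 1:]
        s = score(candidate)
        if s > best:
            current, best, stall = candidate, s, 0
        else:
            stall += 1
    return current, best

def rrsa(questions, k, score, n_runs, seed=0):
    """RRSA(N): run the RSA from N independent random starting models and keep the best."""
    rng = random.Random(seed)
    return max((rsa(questions, k, score, rng) for _ in range(n_runs)),
               key=lambda result: result[1])
```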
For example, consider the seven-question model structure of Set B in Appendix I
with a relative misclassification cost of 3.5. The median number of mutations required for
a single run of the RSA with this model structure was estimated as roughly 42,000. A single
run of the RSA was estimated to have a 5% chance of finding the observed maximum (Set
B) when applied to the full dataset. Consequently, one had to perform an average of 20
independent runs of the RSA to find Set B, requiring a total of about 18 hours of computing
time on a SparcStation Ultra. The RRSA(100) was estimated to have a 99.4% chance of
achieving the observed maximum.
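The figures of 20 expected runs and a 99.4% success chance follow from treating the restarts as independent trials with success probability p of about 0.05; a sketch of the arithmetic, assuming independence across runs:

```latex
\Pr(\text{at least one of } N \text{ runs finds the maximum}) = 1 - (1-p)^{N},
\qquad 1 - (0.95)^{100} \approx 0.994,
\qquad \text{expected runs to first success} = 1/p = 20 .
```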
For larger model sizes (containing more than a dozen questions), it was more
doubtful that the RRSA(N) could find an absolute maximum within a reasonable time. To
achieve shorter search times, the RSA was modified by combining random searching with
exhaustive searching (designated the RESA method). Also, a more flexible method of
choosing the model size was used (cost-complexity estimation, as defined by Breiman et al.).
These approaches were used to select Set J in Appendix VIII.
7.3 Implications for causal analyses of mortality
Besides age and sex, the best predictors of short term mortality were the use of
digitalis, body weight at time of interview, several measures of functionality, and the
previous diagnosis of illnesses (in that order). Interestingly, the age and sex variables
featured prominently in the linear models, as they were always the two most powerful
variables in the model. However, in the question sets, these variables played a much less
important role. In fact, it was possible to build a model of the highest risk persons (Set C)
which makes no reference to age or sex! One might expect that since age and sex were
correlated with many questions in the models, a model that controls for neither (like Set C)
is simply referencing the oldest males through these spurious predictors. Yet, the persons
chosen by Set C were hardly dominated by old males, and their probability of death was
more than twice as high as would have been predicted by age and sex. Sets A and B also use
age and sex sparingly. Each variable was used in only one subset of each model, and they
were never included in the same subset.
Several reasons exist to explain these results. As discussed in Chapter 1, the
variables that measure physical functioning are probably temporally closer to death than
many other predictors. There is also a less obvious reason. Clearly, the observed incidence
of heart failure, cancer and stroke was higher for males at all ages. However, females
showed a greater loss of function than males at all ages. It was also suspected that there was
much undetected morbidity in the EPESE sample. Many persons who died had never been
diagnosed with major illnesses at baseline (which is why previous diagnoses of illness were
relatively weak predictors). Taken together, these observations suggested that much of the
undetected morbidity was in the female population. Females had much lower incomes than
males; thus, they may have had less access to medical care, and consequently were less
likely to be diagnosed with illness.
The issue of "missing morbidity" is crucial to mortality researchers. It was suspected
that because of undetected morbidity and the poor quality of death certificate data, many
causes of death were not accurately recorded or completely missing from the certificates.
It was also suspected that cancer was the illness most likely to go undetected, and that many
deaths caused by cancer were actually listed with other or unknown causes. This was
because so many deaths listed as "other" (which included unknown, or "symptomatic"
causes) were found to have a strong common bond with many cancer deaths. Such persons
were chosen largely by the question concerning body weight at time of interview;
specifically, many of these persons had below-average weights at baseline, which would
indicate the "wasting" that frequently occurs before death from cancer. Most of them were
also smokers, or had smoked in the past. Finally, many of these deaths listed respiratory
failure of "unknown etiology" as a condition on the death certificate. This may lead
researchers to construct misguided conclusions about mortality processes. As suggested, the
likelihood of diagnoses (with true morbidity held constant) is probably much greater in those
with access to medical care, which is highly correlated with income and many other
variables. This is a major source of spurious association for any researcher wishing to
analyze the causes of death.
The body weight variable was an interesting predictor. Researchers have typically
thought of the variable as positively related to the risk of death, reflecting the negative health
effects of obesity. A better measure of these effects was obtained through the variable that
asked about body weight at younger ages, which did turn out to be a positive predictor of
mortality. For predicting short term mortality, it is much more powerful to use body weight
at baseline to choose those respondents who are "wasting away"; other variables that also
served this purpose were the questions about loss of weight in the past year (positively
associated with death) and the body-mass index. There was a strong, nonlinear "threshold"
effect for this variable (about 165 pounds, for males), below which the risk of mortality was
high. Clearly, the linear approximation was inaccurate, although it did not substantially
detract from the power of the discriminant model. (The regression model included a squared
term for this variable.)
It was also observed that low body weight was a particularly powerful predictor when
combined with digitalis use. It is possible that the prescribed dosages for the drug are not
adequately adjusted for underweight elderly persons. In fact, it was strongly suspected that
digitalis toxicity may have been a substantial source of excess mortality for elderly persons in
general. Discerning the true causal connection between digitalis use and mortality with
observational data is impossible, since this variable is highly confounded with heart disease
(which was obviously not measured perfectly in this data). However, clinical trials have
found that digitalis does not affect mortality when the drug is used on relatively younger,
healthier persons. The elderly persons in the EPESE sample were much more likely to have
renal failure, and were generally less functional. It is possible that toxic levels of the drug
were accumulating in these persons since their bodies would expel the drug less efficiently.
Moreover, digitalis toxicity kills by inducing heart failure or arrhythmia (exactly those
conditions for which it is prescribed), suggesting that the true cause of such deaths would
have been undetected. Physicians should be extremely careful to monitor digitalis dosages
in elderly, dysfunctional patients, or eliminate the use of the drug entirely, since the only
benefits are symptomatic. Additional studies are needed to estimate the "true" level of
digitalis toxicity in the elderly population.
7.4 Future applications of these methods
At the time of this writing, efforts are underway to apply the above model-building
techniques to nationally representative surveys. For example, the National Health and
Nutrition Examination Survey (NHANES I), conducted in 1971-75, provides an even larger
pool of predictor variables from which to select. Respondents were followed over time and
subsequently interviewed in the NHEFS epidemiologic followup studies in 1982 through
1986. Respondents were interviewed with respect to many different outcomes, and deceased
persons were matched with their death certificates. This rich dataset provides an opportunity
to construct nationally representative models for many different outcomes, including
mortality, morbidity and functionality. Other longitudinal, national probability samples
which track decedents are the AHEAD survey, and the National Health Interview Survey.
This research will have several objectives. Foremost, the goal is to develop a
compact proxy for mortality that can be used in studies that do not directly observe deaths.
This dissertation suggests that it would be possible to construct an accurate proxy using a
limited number of questionnaire items that are common to many surveys. Secondly, future
research will attempt to construct more powerful models purely for the purposes of
prediction. This endeavor will draw from the entire range of potential questionnaire items
in a survey such as NHANES. The most likely model structure will be a hybrid of the question
set method invented here and linear discriminant analysis, as discussed in Chapter 2. Thus,
it will be possible to control completely for age and sex while simultaneously utilizing the
predictive power of the question set method. The results of the above validation suggest that
this model structure may result in more stable error rates, since more variables can be
controlled.
To assess the predictive power of these models honestly, a test set of respondents will
be selected out with a simple random sample before conducting any analysis. The
researchers involved in the project will be kept blind to this dataset until the final models are
constructed. This technique will provide a truly unbiased estimate of model error.
(However, additional "test set"/"leaming set" divisions may be employed on the remaining
learning set respondents as part of the model building process.)
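A minimal sketch of the planned holdout design, with hypothetical names; the point is only that the test respondents are drawn by simple random sampling once, before any modeling, and kept aside until the final models exist.

```python
import numpy as np

def holdout_split(n_respondents: int, test_fraction: float, seed: int):
    """Draw a simple random sample of indices to serve as the blinded test set;
    the remainder is the learning set used for model building (which may be
    split again internally)."""
    rng = np.random.default_rng(seed)
    test_idx = rng.choice(n_respondents, size=int(test_fraction * n_respondents),
                          replace=False)
    learn_idx = np.setdiff1d(np.arange(n_respondents), test_idx)
    return learn_idx, test_idx
```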
Finally, the research will endeavor to create a nationally representative "health index"
using some subset of survey items. Models such as those presented in the appendices are
very easily applied to a living person, and they provide an excellent means by which the
general health of a person may be assessed. There is some demand for such an index among
clinicians and public health practitioners, and some attempts have been made
already (e.g., the SF-36 questionnaire). However, these are rarely linked with an
unambiguous outcome such as death, and they do not usually use cutting-edge statistical
techniques.
References
Akaike H. 1973. "Information theory and an extension of the maximum likelihood
principle." Second International Symposium on Information Theory (eds. Petrov
and Csaki). Akademia Kiado. Budapest. 267-81.
Anderson CS; Jamrozik KD; Broadhurst RJ; Stewart-Wynne EG. 1994. "Predicting
survival for 1 year among different subtypes of stroke. Results from the Perth
Community Stroke Study." Stroke, 25: 1935-44.
Aronow JS. 1996. "Prevalence of appropriate and inappropriate indications for use of
digoxin in older patients at the time of admission to a nursing home." Journal of
the American Geriatrics Society. 44:588-90.
Assmann G; Cullen P; Heinrich J; Schulte H. 1996. "Hemostatic variables in the
prediction of coronary risk: results of the 8 year follow-up of healthy men in the
Munster Heart Study (PROCAM)." Israel Journal of Medical Sciences,
32:364-70.
Becker RB; Zimmerman JE; Knaus WA; Wagner DP; Seneff MG; Draper EA; Higgins
TL; Estafanous FG; Loop FD. 1995. "The use of APACHE III to evaluate ICU
length of stay, resource use, and mortality after coronary artery by-pass surgery."
Journal of Cardiovascular Surgery, 36:1-11.
Bernstein JH; Carmel S. 1996. "Medical and psychosocial predictors of morbidity and
mortality: results of a 26 year follow-up." Israel Journal of Medical Sciences,
32:205-10.
Bianchetti A; Scuratti A; Zanetti O; Binetti G; Frisoni GB; Magni E; Trabucchi M. 1995.
"Predictors of mortality and institutionalization in Alzheimer disease patients 1
year after discharge from an Alzheimer dementia unit." Dementia, 6:108-12.
Bigger JT; Fleiss JL; Rolnitzky LM; Merab JP; Ferrick KJ. 1985. "Effect of digitalis treatment
on survival after acute myocardial infarction." American Journal of Cardiology,
55:623-30.
Blumberg D; Port JL; Weksler B; Delgado R; Rosai J; Bains MS; Ginsberg RJ; Martini
N; McCormack PM; Rusch V; et al. 1995. "Thymoma: a multivariate analysis of
factors predicting survival." Annals of Thoracic Surgery, 60:908-13.
Bosch X; Magrina J; March R; Sanz G; Garcia A; Betriu A; Navarro-Lopez F. 1996.
"Prediction of in-hospital cardiac events using dipyridamole-thallium scintigraphy
performed very early after acute myocardial infarction." Clinical Cardiology,
19:189-96.
Breiman L; Friedman JH; Olshen RA; Stone CJ. 1984. Classification and Regression
Trees. Wadsworth, Inc. Pacific Grove, California.
Cahalin LP; Mathier MA; Semigran MJ; Dec GW; DiSalvo TG. 1996. "The six-minute
walk test predicts peak oxygen uptake and survival in patients with advanced
heart failure". Chest, 110:325-32.
Cain KC; Martin DP; Holubkov AL; Raghunathan TE; Cole WG; Thompson A. 1994. "A
logistic regression model of mortality following hospital admissions among
Medicare patients: comparison with HCFA's model [abstract]." AHSR FHSR Annu
Meet Abstr Book, 11:81-2.
Chambers JM; Hastie TJ. 1991. Statistical Models in S. Wadsworth, Inc. Pacific Grove,
California.
Cornoni-Huntley J; Brock DB; Ostfeld AM; Taylor JO; Wallace RB. 1986. Established
Populations for Epidemiologic Studies of the Elderly: Data Resource Book.
Washington D.C.: U.S. Government Printing Office. (NIH Pub. No. 86-2443.)
Cornoni-Huntley J; Ostfeld AM; Taylor JO; Wallace RB; Blazer D; Berkman LF; Evans
DA; Kohout FJ; Lemke JH; Scherr PA; Korper SP. 1993. "Established
populations for epidemiologic studies of the elderly: Study design and
methodology." Aging Clin. Exp. Res. 5: 27-37.
Davis L. 1987. Genetic Algorithms and Simulated Annealing. Morgan Kaufmann
Publishers. San Mateo, California.
Davis RB; Iezzoni LI; Phillips RS; Reiley P; Coffman GA; Safran C. "Predicting in-
hospital mortality: The importance of functional status information." Medical
Care 33: 906-921.
The Digitalis Investigation Group. 1997. "The effect of digoxin on mortality and
morbidity in patients with heart failure." The New England Journal of Medicine,
336:525-533.
Eysenck HJ. 1993. "Prediction of cancer and coronary heart disease mortality by means of
a personality inventory: results of a 15-year follow-up study." Psychological
Reports, 72:499-516.
Flanagan JR; Pittet D; Li N; Thievent B; Suter PM; Wenzel RP. 1996. "Predicting
survival of patients with sepsis by use of regression and neural network models."
Clinical Performance and Quality Health Care, 4:96-103.
Freedman D; Pisani R; Purves R. 1991. Statistics. Norton. New York.
Friedman HS; Tucker JS; Schwartz JE; Tomlinson-Keasey C; Martin LR; Wingard DL;
Criqui MH. 1995. "Psychosocial and behavioral predictors of longevity. The
aging and death of the "termites"." American Psychologist, 50:69-78.
Friedman JH. 1984. "A variable span smoother." Technical Report No.5, Laboratory
for Computational Statistics, Department of Statistics, Stanford University,
California.
Friedberg SH; Arnold JI; Spence LE. 1989. Linear Algebra. Prentice Hall. Englewood
Cliffs, New Jersey.
Gnanadesikan R. 1977. Methods for statistical data analysis of multivariate
observations. Wiley. New York.
Goldberg DE. 1989. Genetic algorithms in search, optimization, and machine learning.
Addison-Wesley. Reading, Massachusetts.
Grubb NR; Elton RA; Fox KA. 1995. "In-hospital mortality after out-of-hospital cardiac
arrest." Lancet, 346:417-21.
Hastie TJ; Tibshirani R. 1994. "Discriminant adaptive nearest neighbor
classification." Technical Report (December).
Hastie TJ; Tibshirani R. 1996. "Discriminant analysis by Gaussian mixtures."
Technical Report (February).
Hastie TJ; Buja A; Tibshirani R. 1995. "Penalized discriminant analysis." Annals of
Statistics, 73-102.
Holland JH. 1975. Adaptation in natural and artificial systems. The University of
Michigan Press. Ann Arbor, Michigan.
Huppert FA; Whittington JE. 1995. "Symptoms of psychological distress predict 7-year
mortality." Psychological Medicine, 25: 1073-86.
Iezzoni LI; Ash AS; Coffman GA; Moskowitz MA. 1992. "Predicting in-hospital
mortality. A comparison of severity measurement approaches." Medical Care,
30:347-59.
Iezzoni LI; Heeren T; Foley SM; Daley J; Hughes J; Coffman GA. 1994. "Chronic
conditions and risk of in-hospital death." Health Services Research, 29:435-60.
Iezzoni LI; Shwartz M; Ash AS; Mackiernan YD. 1996. "Using severity measures to
predict the likelihood of death for pneumonia inpatients." Journal of General
Internal Medicine, 11:23-31.
Josephson RA; Chahine RA; Morganroth J; Anderson J; Waldo A; Hallstrom A. 1995.
"Prediction of cardiac death in patients with a very low ejection fraction after
myocardial infarction: a Cardiac Arrhythmia Suppression Trial (CAST) study."
American Heart Journal, 130:685-91.
Mallows CL. 1973. "Some comments on Cp." Technometrics, 15:661-675.
Mardia KV; Kent JT; Bibby JM. 1979. Multivariate Analysis. Academic Press. New
York.
Marshall G; Grover FL; Henderson WG; Hammermeister KE. 1994. "Assessment of
predictive models for binary outcomes: an empirical approach using operative
death from cardiac surgery." Statistics in Medicine, 13: 1501-11.
Moss AJ; Davis HT; Conard DL; DeCamilla JJ; Odoroff CL. 1985. "Digitalis associated
cardiac mortality after acute myocardial infarction." Circulation, 64:1150-56.
National Center for Health Statistics. 1987. Medical Examiners' and Coroners' Handbook
on Death Registration and Fetal Death Reporting. U.S. Department of Health and
Human Services. Hyattsville, Maryland.
National Center for Health Statistics. 1988. Instruction Manual Part 2a: Instructions for
Classifying the Underlying Cause ofDeath, 1988. U. S. Department of Health and
Human Services. Hyattsville, Maryland.
National Center for Health Statistics. 1986. Instruction Manual Part 2b: Instructions for
Classifying Multiple Causes ofDeath, 1986. U.S. Department of Health and
Human Services. Hyattsville, Maryland.
Normand ST; Glickman ME; Sharma RG; McNeil BJ. 1996. "Using admission
characteristics to predict short-term mortality from myocardial infarction in
elderly patients. Results from the Cooperative Cardiovascular Project." JAMA,
275:1322-8.
Ortiz J; Ghefter CG; Silva CE; Sabbatini RM. 1995. "One-year mortality prognosis in
heart failure: a neural network approach based on echocardiographic data."
Journal of the American College of Cardiology, 26:1586-93.
Piccirillo JF; Feinstein AR. 1996. "Clinical symptoms and comorbidity: Significance for
the prognostic classification of cancer." Cancer 77: 834-842.
Poses RM; McClish DK; Smith WR; Bekes C; Scott WE. 1996. "Prediction of survival
of critically ill patients by admission comorbidity." Journal of Clinical
Epidemiology, 49:743-7.
Pritchard ML; Woosley JT. 1995. "Comparison of two prognostic models predicting
survival in patients with malignant melanoma." Human Pathology, 26:1028-31.
Quintana M; Lindvall K; Brolund F; Eriksson SV; Ryden L. 1995. "Prognostic value of
exercise stress testing versus ambulatory electrocardiography after acute
myocardial infarction: a 3 year follow-up study." Coronary Artery Disease,
6:865-73.
Reuben DB; Rubenstein LV; Hirsch SH; Hays RD. 1992. "Value of functional status as
a predictor of mortality: Results of a prospective study." The American Journal
of Medicine, 93:663-669.
Rowan KM; Kerr JH; Major E; McPherson K; Short A; Vessey MP. 1994. "Intensive
Care Society's Acute Physiology and Chronic Health Evaluation (APACHE II)
study in Britain and Ireland: a prospective, multicenter, cohort study comparing
two methods for predicting outcome for adult intensive care patients." Critical
Care Medicine, 22:1392-401.
Schuchter L; Schultz DJ; Synnestvedt M; Trock BJ; Guerry D; Elder DE; Elenitsas R;
Clark WH; Halpern AC. 1996. "A prognostic model for predicting 10-year
survival in patients with primary melanoma." The Pigmented Lesion Group.
Annals of Internal Medicine, 125:369-75.
Silber JH; Williams SV; Krakauer H; Schwartz JS. 1992. "Hospital and patient
characteristics associated with death after surgery. A study of adverse occurrence
and failure to rescue." Medical Care, 30:615-29.
Smith KR; Waitzman NJ. 1994. "Double jeopardy: interaction effects of marital and
poverty status on the risk of mortality." Demography, 31 :487-507.
Swets JA; Pickett RM. 1982. Evaluation of Diagnostic Systems: Methods from Signal
Detection Theory. Academic Press. New York.
Thompson ML; Zucchini, W. 1989. "On the statistical analysis of ROC curves."
Statistics in Medicine 8:1277-1290.
Talcott JA; Siegel RD; Finberg R; Goldman L. 1992. "Risk assessment in cancer patients
with fever and neutropenia: a prospective, two-center validation of a prediction
rule." Journal ofClinical Oncology, 10:316-22.
Turner JS; Morgan CJ; Thakrar B; Pepper JR. 1995. "Difficulties in predicting outcome
in cardiac surgery patients." Critical Care Medicine, 23: 1843-50.
Warren MD; Knight R. 1982. "Mortality in relation to the functional capacities of people
with disabilities living at home." Journal of Epidemiology and Community Health,
36:220-223.
Warren JL; McBean AM; Hass SL; Babish JD. 1994. "Hospitalisations with adverse
events caused by digitalis therapy among elderly Medicare beneficiaries."
Archives of Internal Medicine, 154:1482-87.
Wong DT; Crofts SL; Gomez M; McGuire GP; Byrick RJ. 1995. "Evaluation of
predictive ability of APACHE II system and hospital outcome in Canadian
intensive care unit patients." Critical Care Medicine, 23: 1177-83.
Appendix I - Questions for the prediction of death
This appendix presents the preferred (having minimal test set misclassification error) sets
of questions, constructed as described in Chapters 1 and 3. In order for a respondent to be
chosen (classified as dead) by a particular subset of questions, he or she must answer every
question in that subset with an answer that is shown in bold. For example, consider
SET X.1 below:
SET X.1:
1. Age at time of interview: < 85 85+
2. Sex:
Female
Male
3. Other than when you might have been in the hospital, was there any time in the past 12
months when you needed help from some person or any equipment or device for using
the toilet?
No help
Help
Unable to do
Missing
In order for a respondent to be classified as dead by this set, he would have to be a male
aged 85+ who also needed help using the toilet, was unable to use it, or whose answer was
missing.
These sets are combined into overall sets using the OR operator. Then a respondent is
considered chosen if he/she is chosen by one set OR another. For example, the overall
SET A is comprised of sets A.1 through A.5. A respondent is considered chosen by SET A
if they are chosen by A.1, or A.2, or any of the other sets in SET A.
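The classification rule just described is a disjunction of conjunctions: a respondent is classified as dead if, for at least one subset, every answer falls among that subset's flagged (bold) categories. A minimal sketch, with hypothetical field names and with SET X.1's flagged answers written out from the worked example above:

```python
# Each subset is a list of (field, flagged_answers) pairs; a respondent is
# "caught" by a subset only if every listed field takes a flagged value.
SET_X1 = [
    ("age_group", {"85+"}),
    ("sex", {"Male"}),
    ("toilet_help", {"Help", "Unable to do", "Missing"}),
]

def caught_by_subset(respondent: dict, subset) -> bool:
    return all(respondent.get(field) in flagged for field, flagged in subset)

def classified_dead(respondent: dict, overall_set) -> bool:
    # Overall sets (e.g., SET A = A.1 OR A.2 OR ... OR A.5) are disjunctions of subsets.
    return any(caught_by_subset(respondent, subset) for subset in overall_set)
```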
Also reported with each overall set is the result of the application of this set to the test set
respondents. The sensitivity is the proportion of deaths in the test set that were identified
by the questions. For example, SET X.1 caught 8 deaths, and there were 509 deaths in the
test set, so the sensitivity is 8/509 = 1.6%. The specificity is the proportion of survivors
not chosen by the questions. SET X.1 incorrectly caught 3 survivors, and there were
2,923 survivors in the test set, so the specificity is (2,923 - 3)/2,923 = 99.9%.
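The sensitivity and specificity reported with each overall set follow exactly the arithmetic of the SET X.1 example; a short sketch, assuming `predicted` and `died` are 0/1 sequences over the test-set respondents:

```python
def sensitivity_specificity(predicted, died):
    caught_deaths = sum(1 for p, d in zip(predicted, died) if p == 1 and d == 1)
    false_alarms = sum(1 for p, d in zip(predicted, died) if p == 1 and d == 0)
    deaths = sum(died)
    survivors = len(died) - deaths
    sensitivity = caught_deaths / deaths                    # e.g., 8/509 = 1.6% for SET X.1
    specificity = (survivors - false_alarms) / survivors    # e.g., (2,923 - 3)/2,923 = 99.9%
    return sensitivity, specificity
```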
QUESTION SET A
(Sensitivity: 66% Specificity: 70%)
SET A.1:
1. Other than when you might have been in the hospital, was there any time in the past 12
months when you needed help from some person or any equipment or device to do the
following: Bathing, either a sponge bath, tub bath, or shower?
No help
Help
Unable to do
Missing
2. Do you still require this help? Yes No Missing

OR

SET A.2:
1. Do you take any digitalis, Digoxin, Lanoxin,
or Digitoxin pills now?
Yes
No
Missing

OR

SET A.3:
1. Age: <80 80+
2. What was your usual weight at age 25?
<139 pounds 139+ Missing
3. How much difficulty, if any, do you have
pulling or pushing large objects like a living
room chair?
No difficulty at all
A little or some difficulty
Just unable to do it
Missing

OR

SET A.4:
1. Sex: Female Male
2. Weight (in pounds): <168 168+ Missing
3. Are you able to do heavy work around the
house, like shoveling snow, washing windows,
walls or floors without help?
Yes
No
Missing

OR

SET A.5:
1. (When wearing eyeglasses/contact lenses)
Can you see well enough to recognize a friend
across the street?
Yes
No
Missing
2. Other than when you might have been in the
hospital, was there any time in the past 12
months when you needed help from some person
or any equipment or device to do the following
things: Walking across a small room?
No help
Help
Unable to do
Missing
QUESTION SET B
(Sensitivity: 41% Specificity: 88%)
SET B.1:
1. Are you able to walk half a mile without help?
That's about 8 ordinary blocks.
Yes
No
Missing
2. Do you take any digitalis, Digoxin, Lanoxin,
or Digitoxin pills now?
Yes
No
Missing
OR
SET B.2:
1. Sex: Female Male
2. Weight in pounds: <168 168+ Missing
3. Are you able to do heavy work around the
house, like shoveling snow, washing windows,
walls or floors without help?
Yes
No
Missing
OR
SET B.3:
1. Age: <80 80+
2. What was (is) your mother's maiden name?
Correct
Incorrect
Refused
Missing
QUESTION SET C
(Sensitivity: 22% Specificity: 96%)
SET C.1:
1. How much difficulty, on the average, do you
have bathing, either a sponge bath, tub bath, or
shower?
No difficulty at all
A little difficulty
Some difficulty
A lot of difficulty
Missing
2. Do you take any digitalis, Digoxin, Lanoxin,
or Digitoxin pills now?
Yes
No
Missing
3. Weight in pounds: < 168 168+ Missing
OR
SET C.2:
1. Other than when you might have been in the
hospital, was there any time in the past 12
months when you needed help from some
person or any equipment or device for bathing,
either a sponge bath, tub bath, or shower?
No help
Help
Unable to do
Missing
2. What day of the week is it?
Correct
Incorrect
Refused
Missing
OR
SET C.3:
Other than when you might have been in the
hospital, was there any time in the past 12
months when you needed help from some
person or any equipment or device for bathing,
either a sponge bath, tub bath, or shower?
No help
Help
Unable to do
Missing
1. Do you still require this help?
Yes
No
Missing (includes "No Help"
respondents)
2. Have you ever taken any digitalis, Digoxin,
Lanoxin, or Digitoxin pills?
Yes
No
Missing
Appendix II - Questions for predicting heart failure
QUESTION SET D
(Sensitivity: 59% Specificity: 72%)
SET D.1:
Have you ever had any pain or discomfort in
your chest?
Yes
No
Missing
(Note: The above question does not select out
any respondents, and is only included because
the question below refers to it.)
1. What do you do if you get this pain while you
are walking?
Stop or slow down
Take a nitroglycerin
Continue at same pace
Missing (includes those with no pain)
OR
SET D.2:
1. As compared to other people your own age,
would you say that your health is excellent, good,
fair, poor or very poor?
Excellent
Good
Fair
Poor or bad
Missing
2. Weight (in pounds): <170 170+ Missing
OR
SET D.3:
1. Age: <80 80+
2. During the past week, I felt depressed:
Rarely or none of the time
Some of the time
Much of the time
Most or all of the time
Don't know/Refused/Missing
For East Boston respondents, this question was
phrased: Have you felt this way much of the time
during the past week? -- I felt depressed:
No
Yes
Missing
Has a doctor ever told you that you had any
cancer, malignancy or malignant tumor of any
type?
Yes
Suspect
No
Missing
3. Were you hospitalized overnight or longer for
this?
Yes
No
Missing (includes those with no cancer)
OR
SET D.4:
1. Do you take any digitalis, Digoxin, Lanoxin
or Digitoxin pills now?
Yes
No
Missing
QUESTION SET E
(Sensitivity: 36% Specificity: 87%)
SET E.1:
1. Age: <85 85+
2. What was (is) your mother's maiden name?
Correct
Incorrect
Refused
Missing
OR
SET E.2:
1. Do you take any digitalis, Digoxin, Lanoxin
or Digitoxin pills now?
Yes
No
Missing
2. Are you able to do heavy work around the
house, like shoveling snow, washing windows,
walls or floors without help?
Yes
No
Missing
OR
SET E.3:
1. Are you able to walk half a mile without help?
That's about 8 ordinary blocks.
Yes
No
Missing
Have you ever had any pain or discomfort in
your chest?
Yes
No
Missing
(Note: The above question does not select out
any respondents, and is only included because
the question below refers to it.)
2. What do you do if you get this pain while you
are walking?
Stop or slow down
Take a nitroglycerin
Continue at same pace
Missing (includes persons with no pain)
Appendix III - Questions for predicting strokes
QUESTION SET F
(Sensitivity: 59% Specificity: 66%)
SET F.1:
1. As compared to other people your own age,
would you say that your health is excellent, good,
fair, poor or very poor?
Excellent
Good
Fair
Poor or bad
Missing
2. Has a doctor ever told you that you had any
cancer, malignancy or malignant tumor of any
type?
Yes
Suspect
No
Missing
OR
SET F.2:
1. Do you take any digitalis, Digoxin, Lanoxin,
or Digitoxin pills now?
Yes
No
Missing
OR
SET F.3:
1. Age: <75 80+
2. When was the last time you saw a dentist?
5 years ago or less
>5 years ago or never
Missing
OR
SET F.4:
Has a doctor ever told you that you had a heart
attack or coronary, or coronary thrombosis, or
coronary occlusion or myocardial infarction?
Yes
Suspect
No
Missing
1. Were you hospitalized overnight or longer for
this?
Yes
No
Missing
2. Do you smoke cigarettes regularly now?
Yes
No
Missing
QUESTION SET G
(Sensitivity: 37% Specificity: 86%)
SET G.1:
1. Do you take any digitalis, Digoxin, Lanoxin,
or Digitoxin pills now?
Yes
No
Missing
2. Do you get shortness of breath that requires
you to stop and rest?
Yes
No
Missing
OR
SET G.2:
1. How much difficulty, on the average, do you
have walking across a small room?
No difficulty at all
A little difficulty
Some difficulty
A lot of difficulty
Missing
2. Have you ever taken any digitalis, Digoxin,
Lanoxin, or Digitoxin pills?
Yes
No
Missing
SET G.3:
1. Age: <85 85+
2. What was (is) your mother's maiden name?
Correct
Incorrect
Refused
Missing
OR
SET G.4:
1. Has a doctor ever told you that you had any
cancer, malignancy or malignant tumor of any
type?
Yes
Suspect
No
Missing
2. Are you able to walk half a mile without help?
That's about 8 ordinary blocks.
Yes
No
Missing
Have you ever had any pain or discomfort in
your chest?
Yes
No
Missing
3. Do you get this pain (or discomfort) when you
walk uphill or hurry?
Yes
Never walks uphill or hurries or
cannot walk
No
Missing
Appendix IV - Questions for predicting cancer
QUESTION SET H
(Sensitivity: 61% Specificity: 54%)
SET H.1:
How many close friends do you have?
None
1 or 2
3 to 5
6 to 9
10 or more
Missing
1. How many of these friends do you see at least
once a month?
None
1 or 2
3 to 5
6 to 9
10 or more
Missing
2. In the past year, have you gained or lost more
than 10 pounds?
No change
Yes, gained
Yes, lost
Yes, both gained and lost
Missing
OR
SET H.2:
1. Do you smoke cigarettes regularly now?
Yes
No
Missing
OR
SET H.3:
1. On the average, how many cigarettes per day
did you usually smoke? (Former smokers only)
<20
20
>20
Missing
(Note: Missing for the above question includes
current smokers and nonsmokers)
2. How old were you when you first smoked
cigarettes regularly? (Former smokers only)
<40
40+
Missing
(Note: Missing for the above question includes
current smokers and nonsmokers)
OR
SET H.4:
1. When was the last time you saw a dentist?
5 years ago or less
>5 years ago or never
Missing
2. Have you ever taken any digitalis, Digoxin,
Lanoxin, or Digitoxin pills?
Yes
No
Missing
QUESTION SET I
(Sensitivity: 43% Specificity: 71%)
SET I.1:
1. Height (inches):
<64
64+
Missing
2. Pulse for 30 seconds:
<36
36+
Missing
3. Do you smoke cigarettes regularly now?
Yes
No
Missing
OR
SET I.2:
1. Did you ever smoke cigarettes regularly?
Yes
No
Missing
2. On the average, how many cigarettes per day
did you usually smoke?
<20
20
>20
Missing
OR
SET I.3:
1. About how often do you go to religious
meetings or services?
Never/almost never
Once or twice a year
Every few months
Once or twice a month
Once a week
More than once a week
Missing
Have you ever had any pain or discomfort in
your chest?
Yes
No
Missing
(Note: The above question does not select out
any respondents, and is only included because
the question below refers to it.)
2. What do you do if you get this pain while you
are walking?
Stop or slow down
Take a nitroglycerin
Continue at same pace
Missing (includes those with no pain)
OR
SET I.4:
1. Weight at age 50 (pounds): <170 170+
Missing
2. Have you ever taken any digitalis, Digoxin,
Lanoxin, or Digitoxin pills?
Yes
No
Missing
Appendix V - C code for the repeated random search algorithm (RRSA)
/******************************************************************
 * Program: Repeated random search algorithm (RRSA)                          *
 * Author: Michael Anderson                                                  *
 *                                                                           *
 * This program performs repeated, independent runs of the random search     *
 * algorithm defined in Section 4.2. It should be compiled with a command    *
 * of the form "cc -o progname sourcefile.c -lm" from a UNIX environment.    *
 * The files "y.dat" (containing a list of 0's and 1's) and "x.dat"          *
 * (containing a matrix of predictor variables in columns for records in     *
 * rows) must be placed in the same directory as the executable for the      *
 * program to run. Thanks are due to Brad Wallet for help in coding earlier  *
 * versions of this program.                                                 *
 *******************************************************************/
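/* Added illustration (not part of the original listing): with n = 3 records
 * and p = 2 predictor variables, the two input files might look like
 *
 *      y.dat            x.dat
 *      1                62.0  1.0
 *      0                75.0  0.0
 *      0                68.0  1.0
 *
 * i.e., y.dat holds one 0/1 outcome per record and x.dat holds the matching
 * whitespace-separated row of predictor values, in the same record order.   */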
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>
#include <limits.h>
#include <time.h>
#define FALSE 0
#define TRUE !FALSE
#define MAXQK 4       /* Defines the max # of questions in a single subset */
#define DEFKTOT 4     /* Defines the total # of subsets in the model */
#define DEFQK 4       /* Defines the # of questions in each subset */
#define DEFCOST 3.5   /* Defines the cost of misclassifying a decedent */
#define DEFFREQ 20000 /* # of mutations before checking for absorption */
#define DEFGENS INT_MAX
/* GLOBAL VARIABLES: */
double ndead, nalive;       /* # of 1s and 0s in the learning set Y */
double bfit, miscost;       /* Relative cost of misclass. dead as alive */
int    maxq, p, ktot, n;    /* # of ?s in subset, # of vars, # of sets, N */
int    min_part = 25;       /* Smallest # of obs to allow in a subset */
int    gens;                /* Max # of mutations allowed in one search */
int    checfreq;            /* How soon to start looking for absorption */
int    lmax = 1;            /* ignore this */
int    myseed = 0;          /* Seed; only used if greater than 0 */
float  **X;                 /* Ptr to ptr to float: holds the X data */
int    *y;                  /* Ptr to int: This holds the Y values */
int    *qk;                 /* Ptr to int: holds the number of ?s in sets */
float  **Uvals;             /* Ptr to ptr to float: holds variable values */
int    *nvals;              /* Ptr to int: holds # of values each variable */
int    **bfeat, **bdirect;  /* Ptr to ptr to int: variables, directions */
int    **feat, **direct;    /* Ptr to ptr to int: variables, directions */
float  **bthresh, **thresh; /* Ptr to ptr to flt: cutoff levels for variables */
int    srchreps = 1000000;  /* Max number of independent runs */
FILE   *absfile, *mutfile;  /* Ptrs to files that hold output */
/* These INPUT data files are required for the program to run: */
char *X_file = "x.dat";      /* Ptr to file: n X p block of variables by col */
char *y_file = "y.dat";      /* Ptr to file: n-vector of Y variables, 0 or 1 */
/* The program OUTPUT is stored in these files: */
char *abs_file = "abs.out";  /* Ptr to file: output - the absorption points */
char *mut_file = "mut.out";  /* Ptr to file: # of mutations at successes */
/* FUNCTION PROTOTYPES */
FILE *myopenf(FILE *locfile, char *loc_file);
void welcome(void);
void getdat(void);
void getval(void);
int mycomp(const void *i, const void *j);
int fltcompare(float *i, float *j);
double misclass(int **feat, int **direct, float **thresh, float **locX,
                int *locy, double locndead, double locnalive, int *locqk,
int locktot);
int checkabs(int **feat, int **direct, float **thresh, double best,
float **locX, int *locy, double locndead, double locnalive);
double mutpoint(int **feat, int **direct, float **thresh, float **locX,
int *locy, double locndead, double locnalive, double locfit);
double runsrch(int **feat, int **direct, float **thresh, float **locX,
int *locy, double locndead, double locnalive);
int indofmin(double *vect, int lengv);
int maxval(int *vect, int lengv);
void printquesa(int **feat, int **direct, float **thresh,
int *locqk, int locktot);
void printquesb(int **feat, int **direct, float **thresh,
int *locqk, int locktot);
void copyques(int **tofeat, int **todirect, float **tothresh,
int **frfeat, int **frdirect, float **frthresh);
void printscores(double *tscores, double *lscores, double *nvars,
                 double *cost_comp, int depth);
/********************* START MAIN PROGRAM *************************/
void main(argc, argv)
int argc;
char *argv[];
{
  int i, j, k, l, k1, k2, j1, minind, pr, wasimprov, printstat;
  double cntdiff, best;
  miscost = DEFCOST; gens = DEFGENS; checfreq = DEFFREQ;
  ktot = DEFKTOT; maxq = MAXQK;
  qk = (int *)malloc(ktot*sizeof(int));
  qk[0] = 4; qk[1] = 4; qk[2] = 4; qk[3] = 4;
  /* If you want deterministic results, use a positive value of myseed */
  if (myseed > 0)
  {
    for (k=0; k<myseed; k++) drand48();
  }
  else
  {
    srand48((unsigned)time(NULL));
  }
/* The variables below will hold info on the present point in space.
feat is the variable number, direct is < or >=, thresh is cutoff */
feat = (int **)malloc(maxq*sizeof(int *));
direct = (int **)malloc(maxq*sizeof(int *));
thresh = (float **)malloc(maxq*sizeof(float *));
bfeat = (int **)malloc(maxq*sizeof(int *));
bdirect = (int **)malloc(maxq*sizeof(int *));
bthresh = (float **)malloc(maxq*sizeof(float *));
  for (k=0; k<maxq; k++)
{
feat[k] = (int *)calloc(ktot,sizeof(int));
direct[k] = (int *)calloc(ktot,sizeof(int));
thresh[k] = (float *)calloc(ktot,sizeof(float));
bfeat[k] = (int *)calloc(ktot,sizeof(int));
bdirect[k] = (int *)calloc(ktot,sizeof(int));
bthresh[k] = (float *)calloc(ktot,sizeof(float));
}
  getdat();  /* Read in the data with this call */
n = (int)(ndead+nalive);
printf("\n\n %d records and %d variables detected. \n\n", n, p);
printf("\n\n Done reading in data, now processing data.\n");
printf(" (This may take a few minutes if your dataset is big.)\n\n");
  getval();  /* Get the variable values */
printf("\n\n Done processing data, beginning search.\n\n");
printf(" The most recent questions will be stored in the file \"question.out\".\n");
printf(" All absorption points will be stored in the file \"abs.out\".\n");
printf(" The mutation numbers at successful mutations will be in \"mut.out\" .\n");
printf(" (If these files already exist, the results will be appended to them.)\n");
printf("\n\n BOBCAT will now search indefinitely. Hit Ctrl-C to stop it.\n\n");
absfile = myopenf(absfile,abs_file);
fprintf(absfile, "\n Absorption point in the learning dataset: \n\n");
fclose(absfile);
  for (i=0; i<srchreps; i++)
{
bfit = runsrch(feat, direct, thresh, X, y, ndead, nalive);
}
} /******************** END OF MAIN PROGRAM **********************/
double misclass(int **feat, int **direct, float **thresh, float **locX,
                int *locy, double locndead, double locnalive, int *locqk,
                int locktot)
/* This function computes the misclassification error for any particular
 * point in space as defined by feat, direct, and thresh. It returns
 * (1 - error) as a double. */
{
  int checkin, i, j, k, insubset, locn, lktot, subsetsize, tdead;
  double fit;
  int *ssizes, *lqk;
  locn = locndead + locnalive;
  lqk = locqk; lktot = locktot; subsetsize = 0; tdead = 0;
  ssizes = (int *)malloc(lktot*sizeof(int));
  for (k=0; k<lktot; k++)
  {
    ssizes[k] = 0;
  }
  for (i=0; i<locn; i++)
  {
    checkin = 1;
    k = 0;
    while (checkin) /* checkin keeps track of whether a case is chosen */
    {
      while ((checkin) && (k<lktot)) /* Cycle through the k sets */
      {
        insubset = 1;
        j = lqk[k];
        while (insubset && j)
        {
          j--;
          if (direct[j][k])
          {
            if (locX[i][feat[j][k]] < thresh[j][k])
            {
              insubset--;
            }
          }
          else
          {
            if (locX[i][feat[j][k]] >= thresh[j][k])
            {
              insubset--;
            }
          }
        }
        if (insubset) /* If the respondent is chosen, check for death */
        {
          subsetsize++;
          if (locy[i]) tdead++;
          checkin--;
          ssizes[k]++;
        }
        k++;
      }
      if (checkin > 0) checkin--;
    }
  }
  fit = 1 - ((miscost*(locndead-(double)tdead)) +
             ((double)subsetsize - (double)tdead)) /
            (miscost*locndead + locnalive);
  for (k=0; k<lktot; k++)
  {
    if (ssizes[k]<min_part) fit = 0;
  }
  free( (void *) ssizes);
  return(fit);
}
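/* Added note (not part of the original listing): writing C for miscost,
 * D for locndead, A for locnalive, d for tdead (deaths among the chosen)
 * and s for subsetsize (all respondents chosen), the value returned above is
 *
 *     fit = 1 - [ C*(D - d) + (s - d) ] / ( C*D + A ),
 *
 * i.e., one minus the cost-weighted count of missed deaths plus false alarms,
 * scaled by C*D + A so that the result always lies between 0 and 1.          */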
double mutpoint(int **feat, int **direct, float **thresh, float **locX,
int *locy, double locndead, double locnalive, double locfit)
/*** This function mutates a point repeatedly, searching for lower error.
     It returns 1-error of the combination of questions with lowest error. ***/
{
int replaced, new_feat, old_feat, new_direct, old_direct;
int checkout, gen, i, k, j, cntmut;
double best;
float new_thresh, old_thresh;
best = locfit;
checkout = 1;
gen = 0; cntmut = 0;
mutfile = myopenf(mutfile, mut_file);
  fprintf(mutfile, "\n\n Total # of mutations at     Misclassification \n");
  fprintf(mutfile, "   successful mutations            error          \n");
fclose(mutfile);
while((gen<gens) && checkout) /* Enter loop to do mutations */
{
    for (k=0; k<ktot; k++) /* Loop through the ktot question sets */
    {
      /* Introduce mutation (save old question in case mutation is bad) */
      cntmut++;
      replaced = floor(drand48() * qk[k]);
      if (replaced==qk[k]) replaced--;
      old_feat = feat[replaced][k];
      old_direct = direct[replaced][k];
      old_thresh = thresh[replaced][k];
      new_feat = floor(drand48() * p);
      if (new_feat==p) new_feat--;
      j = floor(drand48()*nvals[new_feat]);
      if (j==nvals[new_feat]) j--;
      if (j < 1) j++;
      new_thresh = Uvals[new_feat][j];
      new_direct = floor(drand48() * 2);
      if (new_direct==2) new_direct--;
feat[replaced][k] = new_feat;
direct[replaced][k] = new_direct;
thresh[replaced] [k] = new_thresh;
/* Check the error rate */
      locfit = misclass(feat, direct, thresh, locX, locy, locndead,
                        locnalive,qk,ktot);
if (locfit <= best) /* If it isn't an improvement, get old quest */
{
feat[replaced][k] = old_feat;
direct[replaced][k] = old_direct;
thresh[replaced][k] = old_thresh;
}
      else /* If it IS better, keep it, output "gen" to out2.dat */
      {
        best = locfit;
        mutfile = myopenf(mutfile, mut_file);
        fprintf(mutfile,"%13d              %1.8lf\n", cntmut, 1-best);
        fclose(mutfile);
        if (gen > checfreq) /* check for absorption */
        {
          checkout = 1 - checkabs(feat,direct,thresh,best,locX,locy,
                                  locndead, locnalive);
        }
      }
}
if ((gen == checfreq) && checkout) /* check for absorption */
{
checkout = 1 - checkabs(feat,direct,thresh,best,locX,locy,
locndead,locnalive);
}
gen++;
} /* Exit from mutation loop */
  absfile = myopenf(absfile,abs_file);
  fprintf(absfile, "\n Misclassification error: %lf\n\n", 1-best);
  fclose(absfile);
  printquesa(feat, direct, thresh, qk, ktot);
  return(best);
}
double runsrch(int **feat, int **direct, float **thresh, float **locX,
               int *locy, double locndead, double locnalive)
/* This function not only returns the fit of the best question set from
   the search, it also changes bfit, the questions, AND the best questions */
{
  int l, k, j, j1;
  double lfit;
  for (l=0; l < lmax; l++) /* Enter loop to repeat the search lmax times */
  {
    lfit = 0.0;
    while (lfit < 0.001) /* This loop generates a random point in space */
    {
      for (k=0; k<ktot; k++)
      {
        for (j=0; j<qk[k]; j++)
        {
          feat[j][k] = floor(drand48() * p);
          if (feat[j][k]==p) feat[j][k]--;
          direct[j][k] = floor(drand48() * 2);
          if (direct[j][k]==2) direct[j][k]--;
          j1 = floor(drand48()*nvals[feat[j][k]]);
          if (j1==nvals[feat[j][k]]) j1--;
          if (j1 < 1) j1++;
          thresh[j][k] = Uvals[feat[j][k]][j1];
        }
      }
      lfit = misclass(feat, direct, thresh, locX, locy, locndead,
                      locnalive,qk,ktot);
    }
    lfit = mutpoint(feat, direct, thresh, locX, locy, locndead, locnalive,
                    lfit);
    if (lfit > bfit)
    {
      bfit = lfit;
      for (j=0; j<maxq; j++)
      {
        for (k=0; k<ktot; k++)
        {
          bfeat[j][k] = feat[j][k]; bdirect[j][k] = direct[j][k];
          bthresh[j][k] = thresh[j][k];
        }
      }
    }
  }
  return(bfit);
}
int checkabs(int **feat, int **direct, float **thresh, double best,
             float **locX, int *locy, double locndead, double locnalive)
/* This function checks a point in space to see if it is an absorption
 * point by exhaustively replacing each question with all possible
 * questions and checking for a lower error. Notice that although it
 * touches the point's parameters, it leaves them unchanged in the end. */
{
  int k, j, i, l, k1, checkout;
  int old_feat;
  int old_direct;
  float old_thresh;
  double fit;
  checkout = 1;
  k = 0;
  while((k<ktot) && checkout)
  {
    j = 0;
    while((j<qk[k]) && checkout)
    {
      old_feat = feat[j][k];
      old_direct = direct[j][k];
      old_thresh = thresh[j][k];
      l = 0;
      while(l<p)
      {
        for(i=0; i<nvals[l]; i++)
        {
          for(k1=0; k1<2; k1++)
          {
            feat[j][k] = l;
            direct[j][k] = k1;
            thresh[j][k] = Uvals[l][i];
            fit = misclass(feat, direct, thresh, locX, locy,
                           locndead, locnalive, qk, ktot);
            if (fit > best)
            {
              checkout = 0;
            }
          }
        }
        l++;
      }
      feat[j][k] = old_feat;
      thresh[j][k] = old_thresh;
      direct[j][k] = old_direct;
      j++;
    }
    k++;
  }
  return(checkout);
}
void getdat(void) /******* This function reads in the data ***/
{
int notdone, blnkint, i,j,k,np,n;
FILE *ptryfile, *ptrxfile;
double pcheck;
float blnkflt;
double outval;
  if ((ptryfile=fopen(y_file,"rb"))==NULL)
{
printf("\n\nError: Could not open file %s\n\n", y_file);
exit(-l);
}
n=O;
notdone = TRUE;
while (notdone)
{
if(fscanf(ptryfile, "%d", &blnkint) != EOF)
{
n++;
}
else
{
notdone = FALSE;
}
}
fclose(ptryfile);
if ((ptrxfile=fopen(X_file, "rb"))==NULL)
{
printf("\n\nError: Could not open file %s\n\n", X_file);
exit(-l);
}
np=O;
notdone = TRUE;
while (notdone)
{
    if(fscanf(ptrxfile, "%f", &blnkflt) != EOF)
{
np++;
}
else
{
notdone = FALSE;
}
}
fclose(ptrxfile);
pcheck = (double)np/(double)n;
p = floor(pcheck);
if ((pcheck - (double)p) > 0.0000000001)
{
printf("\n\nError: Your data files are not formatted properly. \nil);
printf("\nMake sure that the data in the files y.dat and x.dat \nil);
printf("have the same number of rows, and that every row of x.dat\n");
printf("has the same number of variables. \n\n");
exit(-l);
}
if ((ptryfile=fopen(y_file,"rb"))==NULL)
{
printf("\n\nError: Could not open file %s\n\n", y_file);
exit(-l);
}
if ((ptrxfile=fopen(X_file,"rb"))==NULL)
{
printf("\n\nError: Could not open file %s\n\n", X_file);
exit(-l);
}
ndead = 0; nalive = 0;
X = (float **)malloc(n*sizeof(float *));
y = (int *)malloc(n*sizeof(int));
  for (i=0; i<n; i++)
  {
    X[i] = (float *)malloc(p*sizeof(float));
    for (j=0; j<p; j++)
    {
      fscanf(ptrxfile, "%f", &X[i][j]);
    }
    fscanf(ptryfile, "%d", &y[i]);
    if (y[i]) { ndead++; }
  }
  nalive = (double)n - ndead;
  fclose(ptrxfile); fclose(ptryfile);
}
void getval(void) /******* This function gets the uniq variable values ****/
{
int i,j,k;
float *Tval;
float lastval;
Tval = (float *)malloc(n*sizeof(float));
nvals = (int *)malloc(p*sizeof(int));
Uvals = (float **)malloc(p*sizeof(float *));
for 0=0; j<p; j++)
{
for (i=O; i<n; i++)
{
Tval[i] = X[i][j];
}
qsort((void *)Tval, n, sizeof(float), mycomp);
lastval=Tval[O];
nvals[j] = 1;
for (i=l; i<n; i++)
{
if(Tval[i]>lastval)
{
nvals[j]++;
lastval = Tval[i];
}
}
Uvals[j] = (float *)malloc(nvals[j]*sizeof(float));
    k = 0;
    Uvals[j][k] = Tval[k];
    for (i=1; i<n; i++)
{
if(Tval[i]>Tval[i-1])
{
k++;
Uvals[j][k] = Tval[i];
}
}
}
free( (void *) Tval);
}
void copyques(int **tofeat, int **todirect, float **tothresh,
int **frfeat, int **frdirect, float **frthresh)
{
  int j, k;
  for (j=0; j<maxq; j++)
  {
    for (k=0; k<ktot; k++)
{
tofeat[j] [k] = frfeat[j] [k]; todirect[j] [k] = frdirect[j] [k];
tothresh[j] [k] = frthresh[j] [k];
}
}
}
int fltcompare(float *i, float *j) /** Compares reals for qsort() call ****/
{
  float myi, myj;
  myi = *i; myj = *j;
  if (myi > (myj + 0.000001))
  {
    return (1);
  }
  if (myi < (myj - 0.000001))
  {
    return (-1);
  }
  return(0);
}
int indofmin(double *vect, int lengv)
/* Returns smallest index of the "minimum" value in a vector of doubles */
{
  double tmpval;
  int i, cnti;
  tmpval = vect[0]; cnti = 0;
  for (i=1; i<lengv; i++)
  {
    if (vect[i] < (tmpval - 2*FLT_EPSILON))
    {
      tmpval = vect[i];
      cnti = i;
    }
  }
  return(cnti);
}
int maxval(int *vect, int lengv)
/* Returns the maximum value in a vector of ints*/
{
int tmpval;
int i;
  tmpval = vect[0];
for (i=1;i<lengv;i++)
{
if (vect[i]>tmpval)
{
tmpval = vect[i];
}
}
  return(tmpval);
}
int mycomp(const void *i, const void *j) /* This calls a function above */
{ /* called fltcompare*/
return( fltcompare((float *) i, (float *) j)); /* for qsortO */
}
void printquesa(int **feat, int **direct, float **thresh,
                int *locqk, int locktot)
{
  int i, j, k;
  absfile = myopenf(absfile,abs_file);
  for(k=0; k<locktot; k++)
  {
    for (i=0; i<locqk[k]; i++)
    {
      fprintf(absfile,"Is variable %3d ", feat[i][k]+1);
      if (direct[i][k]) fprintf(absfile, ">= ");
      else fprintf(absfile, "<  ");
      fprintf(absfile,"%3.4g ?\n", thresh[i][k]);
    }
    if(k<(locktot-1)) fprintf(absfile,"   OR \n");
  }
  fprintf(absfile," \n");
  fclose(absfile);
}
FILE *myopenf(FILE *locfile, char *loc_file)
{
  if ((locfile=fopen(loc_file,"a"))==NULL)
  {
    printf("\n\nError: Could not open file %s\n\n", loc_file);
    exit(-1);
  }
  return(locfile);
}
Appendix VI - Mortality Index for the Elderly
INSTRUCTIONS The following index is designed for noninstitutionalized
persons aged 65 and older. Enter the point value corresponding to each
answer in the blank space, and add up the points for all 16 questions.
(1) Intercept: 145 pts.                                  (1)   145   pts.

(2) Age:
        65-69:   (  0 pts.)
        70-74:   ( 42 pts.)
        75-79:   ( 83 pts.)
        80-85:   (125 pts.)
        85+  :   (166 pts.)                              (2) ______ pts.

(3) Sex:
        Female:  (  0 pts.)
        Male:    ( 91 pts.)                              (3) ______ pts.
(4) Do you take any digitalis, Digoxin, Lanoxin, or Digitoxin pills now?
        Yes      (128 pts.)
        No       (  0 pts.)
        Missing  (  0 pts.)                              (4) ______ pts.
(5) Weight at time of interview:   (Missing = -217 pts.)

    Females
    weight in pounds   points        weight in pounds   points
    < 100              -148          150-154            -231
    100-104            -155          155-159            -238
    105-109            -163          160-164            -246
    110-114            -170          165-169            -253
    115-119            -178          170-174            -261
    120-124            -185          175-179            -269
    125-129            -193          180-184            -276
    130-134            -200          185-189            -284
    135-139            -208          190-194            -291
    140-144            -216          195-199            -299
    145-149            -223          200+               -303

    Males
    weight in pounds   points        weight in pounds   points
    < 120              -144          185-189            -248
    120-124            -149          190-194            -255
    125-129            -157          195-199            -263
    130-134            -165          200-204            -270
    135-139            -172          205-209            -278
    140-144            -180          210-214            -286
    145-149            -187          215-219            -293
    150-154            -195          220-224            -301
    155-159            -202          225-229            -308
    160-164            -210          230-234            -316
    165-169            -217          235-239            -323
    170-174            -225          240-244            -331
    175-179            -233          245-249            -339
    180-184            -240          250+               -342

                                                         (5) ______ pts.
(Note: Negative points should be subtracted when adding up total.)
(6) Are you able to walk half a mile without help? That's about 8 ordinary blocks.
        Yes      (  0 pts.)
        No       ( 78 pts.)
        Missing  ( 78 pts.)                              (6) ______ pts.
(7) What was (is) your mother's maiden name?
        Correct    (  0 pts.)
        Incorrect  ( 90 pts.)
        Refused    ( 90 pts.)
        Missing    ( 90 pts.)                            (7) ______ pts.
(8) As compared to other people your own age, would you say that your health is
excellent, good, fair, poor or very poor?
        Excellent    (  0 pts.)
        Good         ( 31 pts.)
        Fair         ( 63 pts.)
        Poor or bad  ( 94 pts.)
        Missing      ( 40 pts.)                          (8) ______ pts.
(9) Other than when you might have been in the hospital, was there any time in the past
12 months when you needed help from some person or any equipment or device to do the
following things:
    Bathing, either a sponge bath, tub bath, or shower?
        No help
        Help
        Unable to do
        Missing
    Is (was) this help from a person, from special equipment, or both?
        Person             (100 pts.)
        Special equipment  (  0 pts.)
        Both               (  0 pts.)
        Missing            (  0 pts.)                    (9) ______ pts.
(10) What was your usual weight at age 50?
Females: # of pts. = subtract 3 from weight in lbs.
Males: # of pts. = subtract 26 from weight in lbs.
Missing: # of pts. = 141 pts. (10) ____pts.
(11) Other than when you might have been in the hospital, was there any time in the past
12 months when you needed help from some person or any equipment or device to do the
following things:
    Using the toilet?
        No help       (  0 pts.)
        Help          (104 pts.)
        Unable to do  (104 pts.)
        Missing       (104 pts.)                         (11) ______ pts.
(12) Do you smoke cigarettes now?
        Yes      ( 54 pts.)
        No       (  0 pts.)
        Missing  (  0 pts.)                              (12) ______ pts.
(13) Has a doctor ever told you that you had diabetes, high blood sugar, or sugar in your
urine?
        Yes
        Suspect
        No
        Missing
    Has a doctor ever told you to take insulin for this?
        Yes      ( 89 pts.)
        No       (  0 pts.)
        Missing  (  0 pts.)                              (13) ______ pts.
(14) Has a doctor ever told you that you had any cancer, malignancy or malignant tumor
of any type?
        Yes      ( 52 pts.)
        Suspect  ( 52 pts.)
        No       (  0 pts.)
        Missing  (  0 pts.)                              (14) ______ pts.
(15) Has a doctor ever told you that you had a heart attack or coronary, or coronary
thrombosis, or coronary occlusion or myocardial infarction?
        Yes
        Suspect
        No
        Missing
    Were you hospitalized overnight or longer for this?
        Yes      ( 58 pts.)
        No       (  0 pts.)
        Missing  (  0 pts.)                              (15) ______ pts.
(16) How often do you have trouble with waking up during the night?
        Most of the time  (  0 pts.)
        Sometimes         ( 21 pts.)
        Rarely or never   ( 41 pts.)
        Missing           ( 23 pts.)                     (16) ______ pts.
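To make the totaling step concrete, the short program below is an added illustration (it is
not part of the dissertation's software, and the sample point values in pts[] are
hypothetical); it simply sums the 16 entries, with the negative weight points from question
(5) subtracting automatically.

#include <stdio.h>

/* Total the 16 point entries recorded from the index above. */
static int index_total(const int pts[16])
{
    int i, total = 0;
    for (i = 0; i < 16; i++)
        total += pts[i];
    return total;
}

int main(void)
{
    /* Hypothetical respondent: replace these with the values read off
       questions (1) through (16). */
    int pts[16] = {145, 125, 91, 0, -246, 0, 0, 63,
                   0, 147, 0, 0, 0, 0, 0, 21};
    printf("Mortality index total: %d points\n", index_total(pts));
    return 0;
}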
Appendix VII - C code for the repeated random and exhaustive search
algorithm (RRESA), combined with backward deletion to
estimate error on a test set
/******************************************************************
 * Program: BOBCAT - Boolean Operators with Binary splitting Classification  *
 *          Algorithm for Test set prediction                                *
 * Author: Michael Anderson                                                  *
 *                                                                           *
 * This program performs a modified version of the random search algorithm   *
 * which incorporates exhaustive searching, for faster search times. It also *
 * uses a learning set/test set division automatically in conjunction with   *
 * backward deletion to estimate prediction error. It then estimates a       *
 * cost-complexity parameter and recombines the full dataset to find a model *
 * which achieves a low cost-complexity (see Breiman et al., 1984). The      *
 * program should be compiled with a command of the form                     *
 * "cc -o bobcat bobcat.c -lm" from a UNIX environment. It should be run in  *
 * the same directory with two files: "y.dat" (containing a list of 0's and  *
 * 1's) and "x.dat" (containing a matrix of predictor variables in columns   *
 * for records in rows). The program will prompt the user for information    *
 * concerning the model structure.                                           *
 *******************************************************************/
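/* Added note (not part of the original listing): the cost-complexity
 * criterion used in prune() below scores a candidate model with m questions as
 *
 *     R_alpha(m) = (misclassification error of the m-question model) + alpha * m
 *
 * in the spirit of Breiman et al. (1984); the model size minimizing this
 * quantity on the test set is kept, and alpha itself is re-estimated from the
 * learning-set error profile.                                                */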
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>
#include <limits.h>
#include <time.h>
#define FALSE 0
#define TRUE !FALSE
#define MAXQK 9 /* Defines the max # of questions in a single subset */
#define DEFKTOT 10 /* Defines the total # of subsets in the model */
#define DEFQK 9 /* Defines the # of questions in each subset */
#define DEFCOST 1 /* Defines the cost of misclassifying a decedent (1) */
#define DEFFREQ 60000 /* # of mutations before checking for absorption */
#define DEFGENS INT_MAX
/* GLOBAL VARIABLES: */
double tfrac;                  /* Fraction of N for test set respondents */
double ndead, nalive;          /* Number of 1s and 0s in the full set Y */
double lndead, lnalive;        /* Number of 1s and 0s in the learning set Y */
double tndead, tnalive;        /* Number of 1s and 0s in the test set Y */
double miscost;                /* Relative cost of misclass. dead as alive */
int    maxq, p, ln, tn, ktot;  /* Size of test set, number of question sets */
int    min_part = 5;           /* Smallest # of obs to allow in a subset */
int    gens;                   /* Max # of mutations allowed in one search */
int    checfreq;               /* How soon to start looking for absorption */
int    lmax = 1;               /* Max # of cycles through search (abs. pts.) */
int    myseed = 0;             /* Seed, only used if greater than 0 */
float  **fX, **tX, **lX;       /* Ptr to ptr to float: holds the X data */
int    *fy, *ty, *ly;          /* Ptr to int: This holds the Y values */
int    *qk;                    /* Ptr to int: holds the number of ?s in sets */
float  **Uvals;                /* Ptr to ptr to float: holds variable values */
int    *nvals;                 /* Ptr to int: holds # of values each variable */
int    **bfeat, **bdirect, **tbfeat, **tbdirect, **beslfeat, **besldirect;
int    **besffeat, **besfdirect, **feat, **direct;   /* These hold the model */
float  **bthresh, **tbthresh, **beslthresh, **besfthresh, **thresh;
double bfit, alpha, lalpha;    /* Cost-complexity parameter */
int    splitdat = 1;
int    prunereps = 1000000;    /* Number of repetitions */
int    lastsub, lasttd, goodalpha;
int    splitinit = 0;          /* Seed; only used if greater than zero. */
FILE   *absfile, *mutfile, *quesfile;  /* Ptrs to output files */
/* These INPUT data files are required for the program to run: */
char *X_file = "x.dat";            /* File: n X p block of variables by col */
char *y_file = "y.dat";            /* File: n-vector of Y variables, 0 or 1 */
/* The program OUTPUT is stored in these files: */
char *abs_file = "abs.out";        /* File: output - the absorption points */
char *mut_file = "mut.out";        /* File: # of mutations at successes */
char *ques_file = "question.out";  /* File: pruning output */
/* FUNCTION PROTOTYPES Note: some of these have major side-effects */
FILE *myopenf(FILE *locfile, char *loc_file);
void welcome(void);
void getdat(int splitdat);
void getval(void);
int mycomp(const void *i, const void *j);
int fltcompare(float *i, float *j);
double misclass(int **feat, int **direct, float **thresh, float **locX,
                int *locy, double locndead, double locnalive, int *locqk,
                int locktot);
int checkabs(int **feat, int **direct, float **thresh, double best,
float **locX, int *locy, double locndead, double locnalive);
double mutpoint(int **feat, int **direct, float **thresh, float **locX,
int *locy, double locndead, double locnalive, double locfit);
double runsrch(int **feat, int **direct, float **thresh, float **locX,
int *locy, double locndead, double locnalive);
int prune(int ldep, int **feat, int **direct, float **thresh, float **X,
int *y, double lndead, double lnalive, float **tX, int *ty,
double tndead, double tnalive, int printstat);
int indofmin(double *vect, int lengv);
int maxval(int *vect, int lengv);
void printquesa(int **feat, int **direct, float **thresh,
int *locqk, int locktot);
void printquesb(int **feat, int **direct, float **thresh,
int *locqk, int locktot);
void copyques(int **tofeat, int **todirect, float **tothresh,
int **frfeat, int **frdirect, float **frthresh);
void print2by2(double locnalive, double locndead, int locsubsetsize,
int loctdead, double miserr, int pstat, int minnvars);
void printcost(double *lscores, double *cost_comp, double *nvars, int depth,
double alpha);
void printscores(double *tscores, double *lscores, double *nvars,
double *cost_comp, int depth);
void scramblei(int *vect, int lengv);
void scramblef(float *vect, int lengv);
/********************* START MAIN PROGRAM *************************/
void main(argc, argv)
int argc;
char *argv[];
{
  int i, j, k, l, k1, k2, j1, minind, pr, wasimprov, printstat;
  double lfit, tfit, cntdiff, best, bestlfit, bestffit;
  miscost = DEFCOST; gens = DEFGENS; checfreq = DEFFREQ;
  ktot = DEFKTOT; maxq = MAXQK;
  qk = (int *)malloc(ktot*sizeof(int));
  for (i=0; i<ktot; i++)
  {
    qk[i] = DEFQK;
  }
  welcome();  /* Prompt the user for parameter values */
  /* If you want deterministic results, use a positive value of myseed */
  if (myseed > 0)
  {
    for (k=0; k<myseed; k++) drand48();
  }
  else
  {
    srand48((unsigned)time(NULL));
  }
  /* The variables below will hold info on the present point in space.
     feat is the variable number, direct is < or >=, thresh is cutoff */
  feat = (int **)malloc(maxq*sizeof(int *));
  direct = (int **)malloc(maxq*sizeof(int *));
  thresh = (float **)malloc(maxq*sizeof(float *));
  tbfeat = (int **)malloc(maxq*sizeof(int *));
  tbdirect = (int **)malloc(maxq*sizeof(int *));
  tbthresh = (float **)malloc(maxq*sizeof(float *));
  bfeat = (int **)malloc(maxq*sizeof(int *));
  bdirect = (int **)malloc(maxq*sizeof(int *));
  bthresh = (float **)malloc(maxq*sizeof(float *));
  beslfeat = (int **)malloc(maxq*sizeof(int *));
  besldirect = (int **)malloc(maxq*sizeof(int *));
  beslthresh = (float **)malloc(maxq*sizeof(float *));
  besffeat = (int **)malloc(maxq*sizeof(int *));
  besfdirect = (int **)malloc(maxq*sizeof(int *));
  besfthresh = (float **)malloc(maxq*sizeof(float *));
  for (k=0; k<maxq; k++)
  {
    feat[k] = (int *)calloc(ktot,sizeof(int));
    direct[k] = (int *)calloc(ktot,sizeof(int));
    thresh[k] = (float *)calloc(ktot,sizeof(float));
    bfeat[k] = (int *)calloc(ktot,sizeof(int));
    bdirect[k] = (int *)calloc(ktot,sizeof(int));
    bthresh[k] = (float *)calloc(ktot,sizeof(float));
    tbfeat[k] = (int *)calloc(ktot,sizeof(int));
    tbdirect[k] = (int *)calloc(ktot,sizeof(int));
    tbthresh[k] = (float *)calloc(ktot,sizeof(float));
    beslfeat[k] = (int *)calloc(ktot,sizeof(int));
    besldirect[k] = (int *)calloc(ktot,sizeof(int));
    beslthresh[k] = (float *)calloc(ktot,sizeof(float));
    besffeat[k] = (int *)calloc(ktot,sizeof(int));
    besfdirect[k] = (int *)calloc(ktot,sizeof(int));
    besfthresh[k] = (float *)calloc(ktot,sizeof(float));
}
  tfrac = 0.3333; lastsub = 0; lasttd = 0;
getdat(splitdat); /* Read in the data with this call */
printf("\n\n %d records and %d variables detected. \n\n",
(int)(lndead+lnalive+tndead+tnalive), p);
printf("\n\n Done reading in data, now processing data.\n");
printf(" (This may take a few minutes if your dataset is big.)\n\n");
  getval();  /* Get the variable values */
printf("\n\n Done processing data, beginning search.\n\n");
printf(" The most recent questions will be stored in the file \"question.out\".\n");
printf(" All absorption points will be stored in the file \"abs.out\".\n");
printf(" The mutation numbers at successful mutations will be in \"mut.out\" .\n");
printf(" (If these files already exist, the results will be appended to them.)\n");
printf("\n\n BOBCAT will now search indefinitely. Hit Ctrl-C to stop it.\n\n");
bfit = 0;
absfile = myopenf(absfile,abs_file);
fprintf(absfile, "\n Absorption point in the learning dataset: \n\n");
fclose(absfile);
  bfit = runsrch(feat, direct, thresh, lX, ly, lndead, lnalive);
  copyques(beslfeat,besldirect,beslthresh,bfeat,bdirect,bthresh);
  copyques(tbfeat,tbdirect,tbthresh,bfeat,bdirect,bthresh);
  bestlfit = bfit;
  minind = 0; alpha = 0; printstat = 1; goodalpha = 1;
  minind = prune(minind, tbfeat, tbdirect, tbthresh, lX, ly, lndead,
                 lnalive, tX, ty, tndead, tnalive, printstat);
if (goodalpha == 0)
{
copyques(tbfeat,tbdirect,tbthresh, bfeat,bdirect,bthresh);
minind = 0; printstat = 1; alpha = 0;
    minind = prune(minind, tbfeat, tbdirect, tbthresh, lX, ly, lndead,
                   lnalive, tX, ty, tndead, tnalive, printstat);
}
lalpha = alpha;
free ( (void *) Uvals); free ( (void *) nvals);
  getdat(0); getval();
bfit = 0;
absfile = myopenf(absfile, abs_file);
fprintf(absfile," \n Absorption point in the full dataset: \n\n");
fclose(absfile);
bfit = runsrch(feat, direct, thresh, fX, fy, ndead, nalive);
copyques(tbfeat,tbdirect,tbthresh,bfeat,bdirect,bthresh);
copyques(besffeat,besfdirect,besfthresh,bfeat,bdirect,bthresh);
bestffit = bfit;
  minind = 0; printstat = 2; goodalpha = 1;
  minind = prune(minind, tbfeat, tbdirect, tbthresh, fX, fy, ndead,
                 nalive, fX, fy, ndead, nalive, printstat);
  alpha = lalpha; printstat = 3; goodalpha = 1;
  copyques(tbfeat,tbdirect,tbthresh, bfeat,bdirect,bthresh);
  minind = prune(minind-1, tbfeat, tbdirect, tbthresh, fX, fy, ndead,
                 nalive, fX, fy, ndead, nalive, printstat);
for(pr=2;pr<(prunereps+1);pr++)
{
    bfit = 0; wasimprov = 0; alpha = lalpha;
    absfile = myopenf(absfile,abs_file);
fprintf(absfile, "\n Absorption point in the learning dataset: \n\n");
fclose(absfile);
    bfit = runsrch(feat, direct, thresh, lX, ly, lndead, lnalive);
if (bfit>bestlfit)
{
      bestlfit = bfit; minind = 0; alpha = 0; wasimprov = 1; printstat = 1;
      goodalpha = 1;
copyques(tbfeat,tbdirect,tbthresh,bfeat,bdirect,bthresh);
      minind = prune(minind, tbfeat, tbdirect, tbthresh, lX, ly, lndead,
                     lnalive, tX, ty, tndead, tnalive, printstat);
if (goodalpha == 0)
{
copyques(tbfeat,tbdirect,tbthresh,bfeat,bdirect,bthresh);
minind = 0; printstat = 1; alpha = 0;
        minind = prune(minind, tbfeat, tbdirect, tbthresh, lX, ly, lndead,
                       lnalive, tX, ty, tndead, tnalive, printstat);
}
lalpha = alpha;
copyques(beslfeat,besldirect,beslthresh,tbfeat,tbdirect,tbthresh);
}
bfit = 0;
absfile = myopenf(absfile, abs_file);
fprintf(absfile," \n Absorption point in the full dataset: \n\n");
fclose(absfile);
    bfit = runsrch(feat, direct, thresh, fX, fy, ndead, nalive);
if (bfit>bestffit)
{
copyques(besffeat,besfdirect,besfthresh,bfeat,bdirect,bthresh);
wasimprov = 1; bestffit = bfit;
}
if (wasimprov)
{
copyques(tbfeat,tbdirect,tbthresh,besffeat,besfdirect,besfthresh);
minind = 0; printstat = 2; goodalpha = 1;
minind = prune(minind, tbfeat, tbdirect, tbthresh, fX, fy, ndead,
nalive, fX, fy, ndead, nalive,printstat);
alpha = lalpha; printstat = 3;
copyques(tbfeat,tbdirect,tbthresh,besffeat,besfdirect,besfthresh);
goodalpha = 1;
minind = prune((minind-1), tbfeat, tbdirect, tbthresh, fX, fy,
ndead, nalive, fX, fy, ndead, nalive,printstat);
}
}
} /******************** END OF MAIN PROGRAM **********************/
double misclass(int **feat, int **direct, float **thresh, float **locX,
int *locy, double locndead, double locnalive, int *locqk,
int locktot)
/* This function computes the misclassification error for any particular
 * point in space as defined by feat, direct, and thresh. It returns
 * (1 - error) as a double. */
{
int checkin, i, j, k, insubset, locn, lktot,subsetsize, tdead;
double fit;
int *ssizes, *lqk;
  locn = locndead + locnalive; lastsub = 0; lasttd = 0;
  lqk = locqk; lktot = locktot; subsetsize = 0; tdead = 0;
  ssizes = (int *)malloc(lktot*sizeof(int));
  for (k=0; k<lktot; k++)
  {
    ssizes[k] = 0;
  }
  for (i=0; i<locn; i++)
{
checkin = 1;
k=O;
while (checkin) /* checkin keeps track of whether a case is chosen */
{
while ((checkin) && (k<lktot)) /* Cycle through the k sets */
{
        insubset = 1;
        j = lqk[k];
while (insubset && j)
{
          j--;
          if (direct[j][k])
{
if (locX[i][feat[j][k]] < thresh[j][k])
{
insubset--;
}
}
else
{
if (locX[i] [featO][k]] >= thresh[j][k])
{
insubset--;
}
}
}
if (insubset) /* If the respondent is chosen, check for death */
{
subsetsize++;
if(locy[i]) tdead++;
checkin--;
ssizes[k]++;
}
k++;
}
if (checkin > 0) checkin--;
}
}
  fit = 1 - ((miscost*(locndead-(double)tdead)) +
             ((double)subsetsize - (double)tdead)) /
            (miscost*locndead + locnalive);
  lastsub = subsetsize; lasttd = tdead;
  for (k=0; k<lktot; k++)
  {
    if (ssizes[k]<min_part) fit = 0;
  }
  free( (void *) ssizes);
  return(fit);
}
double mutpoint(int **feat, int **direct, float **thresh, float **locX,
int *locy, double locndead, double locnalive, double locfit)
/*** This function mutates a point repeatedly, searching for lower error.
It returns I-error of the combination of questions with lowest error. */
{
int replaced, new_feat, old_feat, new_direct, old_direct;
int checkout, gen, i, k, j, cntmut;
double best;
float new_thresh, old_thresh;
best = locfit;
checkout = 1;
gen = 0; cntmut = 0;
mutfile = myopenf(mutfile, mut_file);
  fprintf(mutfile, "\n\n Total # of mutations at     Misclassification \n");
  fprintf(mutfile, "   successful mutations            error          \n");
fclose(mutfile);
while((gen<gens) && checkout) /* Enter loop to do mutations */
{
for (k=O; k<ktot; k++) /* Loop through the ktot question sets */
{
/* Introduce mutation (save old question in case mutation is bad) */
cntmut++;
replaced = floor(drand480 * qk[k]);
if (replaced==qk[k]) replaced--;
old_feat = feat[replaced][k];
old_direct = direct[replaced][k];
old_thresh = thresh[replaced][k];
new_feat = floor(drand480 * p);
if (new_feat==p) new_feat--;
j = floor(drand48()*nvals[new_feat]);
if (j==nvals[new_feat]) j--;
if (j < 1) j++;
new_thresh = Uvals[new_feat][j];
new_direct = floor(drand48() * 2);
if (new_direct==2) new_direct--;
feat[replaced] [k] = new_feat;
direct[replaced] [k] = new_direct;
thresh[replaced][k] = new_thresh;
/* Check the error rate */
      locfit = misclass(feat, direct, thresh, locX, locy, locndead,
                        locnalive,qk,ktot);
if (locfit <= best) /* If it isn't an improvement, get old quest */
{
feat[replaced][k] = old_feat;
direct[replaced][k] = old_direct;
thresh[replaced] [k] = old_thresh;
}
else /* If it IS better, keep it, output "gen" to out2.dat */
{
best = locfit;
mutfile = myopenf(mutfile, mut_file);
fprintf(mutfile,"%13d %1.8If\n", cntmut, I-best);
fclose(mutfile);
if (gen > checfreq) /* check for absorption */
{
checkout = I - checkabs(feat,direct,thresh,best,locX,locy,
locndead, locnalive);
}
}
}
    while ((gen > checfreq) && checkout) /* check for absorption */
    {
      checkout = 1 - checkabs(feat,direct,thresh,best,locX,locy,
locndead,locnalive);
if (checkout)
{
best = misclass(feat, direct, thresh, 10cX, locy, locndead,
locnalive,qk,ktot);
}
}
gen++;
} /* Exit from mutation loop */
  best = misclass(feat, direct, thresh, locX, locy, locndead,
                  locnalive,qk,ktot);
  absfile = myopenf(absfile,abs_file);
  fprintf(absfile, "\n Misclassification error: %lf\n\n", 1-best);
fclose(absfile);
printquesa(feat, direct, thresh, qk, ktot);
return(best);
}
double runsrch(int **feat, int **direct, float **thresh, float **locX,
               int *locy, double locndead, double locnalive)
/* This function not only returns the fit of the best question set from
   the search, it also changes bfit, the questions, AND the best questions */
{
  int l, k, j, j1;
  double lfit;
  for (l=0; l < lmax; l++) /* Enter loop to repeat the search lmax times */
  {
    lfit = 0.0;
    while (lfit < 0.001) /* This loop generates a random point in space */
    {
      for (k=0; k<ktot; k++)
      {
        for (j=0; j<qk[k]; j++)
        {
          feat[j][k] = floor(drand48() * p);
          if (feat[j][k]==p) feat[j][k]--;
          direct[j][k] = floor(drand48() * 2);
          if (direct[j][k]==2) direct[j][k]--;
          j1 = floor(drand48()*nvals[feat[j][k]]);
          if (j1==nvals[feat[j][k]]) j1--;
          if (j1 < 1) j1++;
          thresh[j][k] = Uvals[feat[j][k]][j1];
        }
      }
      lfit = misclass(feat, direct, thresh, locX, locy, locndead,
                      locnalive,qk,ktot);
    }
    lfit = mutpoint(feat, direct, thresh, locX, locy, locndead, locnalive,
                    lfit);
    if (lfit > bfit)
    {
      bfit = lfit;
      for (j=0; j<maxq; j++)
      {
        for (k=0; k<ktot; k++)
        {
          bfeat[j][k] = feat[j][k]; bdirect[j][k] = direct[j][k];
          bthresh[j][k] = thresh[j][k];
        }
      }
    }
  }
  return(bfit);
}
int prune(int ldep, int **feat, int **direct, float **thresh, float **IX,
int *ly, double lndead, double lnalive, float **tX, int *ty,
double tndead, double tnalive, int printstat)
/*** This function performs backwards deletion on a model ***/
{
  int i, j, k, l, k1, j1, j2, checkin, checktot, old_direct, old_feat, insubset;
int *locqk, *ssizes, *nchosen, *ntrued;
double *lscores, *tscores, *nvars, *cost_comp;
int replaced, new_feat, new_direct, subsetsize, depth, minind;
float new_thresh, old_thresh;
double fit, oldfit, newfit;
int tsubset, tdead, ktemp, qtemp, kmin, qmin, kdrop, tktot, locn;
depth = floor( (double)ktot * (double)maxq );
lscores = (double *)calloc((depth+2),sizeof(double));
tscores = (double *)calloc((depth+2),sizeof(double));
nvars = (double *)calloc((depth+2),sizeof(double));
cost_comp = (double *)calloc((depth+2),sizeof(double));
nchosen = (int *)calloc((depth+2),sizeof(int));
ntrued = (int *)calloc((depth+2),sizeof(int));
  for(i=0; i<(depth+2); i++)
{
tscores[i] = 1 - ((miscost*(double)tndead)/
(miscost*(double)tndead + tnalive));
lscores[i] = 1 - ((miscost*(double)lndead)/
(miscost*(double)lndead + lnalive));
cost_comp[i] = I-tscores[i];
}
locqk = (int *)malloc(ktot*sizeof(int));
for (i=O;i<ktot;i++) {locqk[i] = qk[i];} tktot = ktot;
ktemp = -1; qtemp = -1; locn = lndead+lnalive;
  /* NOW ENTER INTO BACKWARD DELETION */
  l = depth-1;
  lscores[l+2] = misclass(feat, direct, thresh, lX, ly, lndead, lnalive, locqk,
                          tktot);
  tscores[l+2] = misclass(feat, direct, thresh, tX, ty, tndead, tnalive, locqk,
                          tktot);
  nchosen[l+2] = lastsub; ntrued[l+2] = lasttd;
  for (k=0; k<tktot; k++) {nvars[l+2] = nvars[l+2] + locqk[k];}
  cost_comp[l+2] = (1-tscores[l+2]) + alpha*nvars[l+2];
checktot = 1;
while ((l>=ldep) && (checktot))
{
    ktemp = 0; qtemp = 0; kdrop = -1;
    oldfit = 0.0;
    /* ENTER INTO LOOP THROUGH QUESTIONS, DROPPING ONE AT A TIME */
    for (k1=0; k1<tktot; k1++)
    {
      locqk[k1]--;
      for (j1=0; j1<(locqk[k1]+1); j1++)
      {
        old_feat = feat[j1][k1];
        old_thresh = thresh[j1][k1];
        old_direct = direct[j1][k1];
        feat[j1][k1] = feat[locqk[k1]][k1];
        thresh[j1][k1] = thresh[locqk[k1]][k1];
        direct[j1][k1] = direct[locqk[k1]][k1];
subsetsize = 0; tdead= 0;
newfit = misclass(feat, direct, thresh, IX, ly, lndead,
lnalive,locqk,tktot);
if (newfit>oldfit)
{
qtemp = j1; ktemp = k1;
oldfit = newfit;
}
feat[j1][k1] = old_feat;
thresh[jI][k1] = old_thresh;
direct[j 1] [k1] = old_direct;
}
locqk[k1]++;
}
for (k1=0; k1<tktot; k1++) /* NOW TRY DROPPING EACH PARTITION */
{
subsetsize = 0; tdead = 0;
for (i=O; i<locn; i++)
{
checkin = 1;
k=O;
while (checkin)
{
while ((checkin) && (k<tktot))
{
insubset= 1;
j = locqk[k];
if (k!=k1)
{
while (insubset && j)
{
J--;
if (direct[j] [kJ)
{
if (lX[i][feat[j][k]] < thresh[j][k])
{
insubset--;
}
}
else
{
if (lX[i][feat[j] [k]] >= thresh[j] [k])
{
insubset--;
}
}
}
if (insubset)
{
subsetsize++;
if(ly[iJ) tdead++;
checkin--;
}
}
k++;
}
if (checkin > 0) checkin--;
}
}
      newfit = 1 - ((miscost*(lndead-(double)tdead)) +
                    ((double)subsetsize - (double)tdead)) /
                   (miscost*lndead + lnalive);
if (newfit>oldfit)
{
oldfit = newfit;
kdrop = kl;
}
}
if ((I < minind) && (goodalpha==O
{
kdrop = -1;
}
if (kdrop < 0)
{
feat[qtemp] [ktemp] = feat[locqk[ktemp]-1] [ktemp];
thresh[qtemp] [ktemp] = thresh[locqk[ktemp]-1] [ktemp];
direct[qtemp][ktemp] = direct[locqk[ktemp]-1][ktemp];
locqk[ktemp]--;
if (locqk[ktemp]<1) kdrop = ktemp;
}
if (kdrop >=0)
{
locqk[kdrop]=locqk[tktot-1];
for (j2=0; j2<locqk[kdrop]; j2++)
{
feat02][kdrop] = feat 02] [tktot-1];
thresh02][kdrop] = thresh02][tktot-1];
direct02][kdrop] = direct02] [tktot-1];
}
tktot--;
}
if (tktot<l) checktot--;
1--' ,
if (l >=0 )
{
      lscores[l+2] = misclass(feat, direct, thresh, lX, ly, lndead, lnalive,
                              locqk,tktot);
      tscores[l+2] = misclass(feat, direct, thresh, tX, ty, tndead, tnalive,
                              locqk,tktot);
      nchosen[l+2] = lastsub; ntrued[l+2] = lasttd;
      for (k=0; k<tktot; k++) {nvars[l+2] = nvars[l+2] + locqk[k];}
      cost_comp[l+2] = (1-tscores[l+2]) + alpha*nvars[l+2];
}
}
minind = indofmin(cost_comp,depth+2);
if (minind == 0)
{
alpha = DBL_MAX;
minind++;
}
else
{
    alpha = 0.97*((1-lscores[minind-1])-(1-lscores[minind]));
}
goodalpha = 1;
if (printstat ==3)
{
nvars[minind-l] = nvars[minind]-l;
}
if ((nvars[minind]-nvars[minind-l D > 1)
{
goodalpha=O;
}
if (alpha < 0) { alpha = 0; }
if ((printstat == l) && (goodalpha))
{
printscores(tscores, lscores, nvars, cost_comp, depth+2);
print2by2(tnalive, tndead, nchosen[minind], ntrued[minind],
I-tscores[minind],printstat,nvars[minindD;
}
else if (printstat == 2)
{
printcost(lscores, cost_comp, nvars, depth+2, lalpha);
print2by2(lnalive, lndead, nchosen[minind], ntrued[minind],
l-lscores[minind] ,printstat,nvars[minindD;
}
else if (printstat == 3)
{
printquesb(feat,direct,thresh, locqk, tktot);
}
  free((void *) lscores); free((void *) ntrued); free((void *) nchosen);
  free((void *) tscores); free((void *) nvars); free((void *) cost_comp);
  return(minind);
}
int checkabs(int **feat, int **direct, float **thresh, double best,
float **locX, int *locy, double locndead, double locnalive)
/* This function checks a point in space to see if it is an absorption
 * point by exhaustively replacing each question with all possible
 * questions and checking for a lower error. Notice that although it
 * touches the point's parameters, it leaves them unchanged in the end. */
{
  int k, j, i, l, k1, checkout;
int old_feat;
int old_direct;
float old_thresh;
double fit;
int *tmpp;
float *Tval;
  tmpp = (int *)calloc(p,sizeof(int));
  for(i=0; i<p; i++) tmpp[i] = i;
  checkout = 1;
  k = 0;
  while((k<ktot) && checkout)
  {
    j = 0;
    while((j<qk[k]) && checkout)
{
      old_feat = feat[j][k];
      old_direct = direct[j][k];
      old_thresh = thresh[j][k];
      l = 0;
      scramblei(tmpp,p);
      while((l<p) && checkout)
      {
        Tval = (float *)calloc(nvals[tmpp[l]],sizeof(float));
        for(i=0; i<nvals[tmpp[l]]; i++) Tval[i] = Uvals[tmpp[l]][i];
        scramblef(Tval,nvals[tmpp[l]]);
        i = 0;
while((i<nvals[tmpp[l]]) && checkout)
{
k1=0;
while((k1 <2) && checkout)
{
feat[j][k] = tmpp[l];
direct[j Uk] = k1;
thresh[jUk] = Tval[i];
fit = misclass(feat, direct, thresh, locX, locy,
locndead, locnalive, qk, ktot);
if (fit> best)
{
checkout = 0;
}
k1++;
}
i++;
}
free((void *)Tval);
1++;
}
if (checkout)
{
feat[j][k] = old_feat;
thresh[j][k] = old_thresh;
direct[j] [k] = old_direct;
}
j++;
}
k++;
}
return(checkout);
}
void getdat(int splitdat) /******* This function reads in the data ***/
{
int notdone, blnkint, i,j ,k,np,n;
FILE *ptryfile, *ptrxfile;
double pcheck;
float blnkflt;
  int lcnt, tcnt;
double outval;
if ((ptryfile=fopen(y_file,"rb"))==NULL)
{
printf("\n\nError: Could not open file %s\n\n", y_file);
exit(-l);
}
n= 0;
notdone = TRUE;
while (notdone)
{
if(fscanf(ptryfile, "%d", &blnkint) != EOF)
{
n++;
}
else
{
notdone = FALSE;
}
}
fclose(ptryfile);
if ((ptrxfile=fopen(X_file, "rb"))==NULL)
{
printf("\n\nError: Could not open file %s\n\n", X_file);
exit(-l);
}
np=O;
notdone = TRUE;
while (notdone)
{
if(fscanf(ptrxfile, "%f', &blnkflt) != EOF)
{
np++;
}
else
{
notdone = FALSE;
}
}
fclose(ptrxfile);
pcheck = (double)np/(double)n;
p = floor(pcheck);
if ((pcheck - (double)p) > 0.0000000001)
{
printf("\n\nError: Your data files are not formatted properly. \n");
printf("\nMake sure that the data in the files y.dat and x.dat \n");
printf("have the same number of rows, and that every row of x.dat\n");
printf("has the same number of variables. \n\n");
exit(-I);
}
  tn = floor( tfrac * (double)n); ln = n - tn;
if ((ptryfile=fopen(y_file, "rb"))==NULL)
{
printf("\n\nError: Could not open file %s\n\n", y_file);
exit(-I);
}
if ((ptrxfile=fopen(X_file, "rb"))==NULL)
{
printf("\n\nError: Could not open file %s\n\n", X_file);
exit(-I);
}
  lcnt = 0; tcnt = 0;
  if (splitdat)
  {
    tndead = 0; tnalive = 0; lndead = 0; lnalive = 0;
    tX = (float **)malloc(tn*sizeof(float *));
    ty = (int *)malloc(tn*sizeof(int));
    for (i=0; i<tn; i++)
    {
      tX[i] = (float *)malloc(p*sizeof(float));
    }
    lX = (float **)malloc(ln*sizeof(float *));
    ly = (int *)malloc(ln*sizeof(int));
    for (i=0; i<ln; i++)
    {
      lX[i] = (float *)malloc(p*sizeof(float));
    }
    i = 0; srand48(splitinit);
    while (i<n)
    {
      outval = drand48();
      if ((outval < tfrac) && (tcnt < tn))
      {
        fscanf(ptryfile, "%d", &ty[tcnt]);
        tndead = tndead + (double)ty[tcnt];
        for (j=0; j<p; j++)
        {
          fscanf(ptrxfile, "%f", &tX[tcnt][j]);
        }
        tcnt++; i++;
      }
      else if ((outval >= tfrac) && (lcnt < ln))
      {
        fscanf(ptryfile, "%d", &ly[lcnt]);
        lndead = lndead + (double)ly[lcnt];
        for (j=0; j<p; j++)
        {
          fscanf(ptrxfile, "%f", &lX[lcnt][j]);
        }
        lcnt++; i++;
      }
      else if ( (tcnt==tn) && (lcnt==ln) )
      {
        i++;
      }
    }
    lnalive = (double)ln - lndead; tnalive = (double)tn - tndead;
}
else
{
    ndead = 0; nalive = 0;
    fX = (float **)malloc(n*sizeof(float *));
    fy = (int *)malloc(n*sizeof(int));
    for (i=0; i<n; i++)
    {
      fX[i] = (float *)malloc(p*sizeof(float));
      for (j=0; j<p; j++)
      {
        fscanf(ptrxfile, "%f", &fX[i][j]);
      }
      fscanf(ptryfile, "%d", &fy[i]);
      if (fy[i]) { ndead++; }
    }
nalive = (double)n - ndead;
}
fclose(ptrxfile); fclose(ptryfile);
if (myseed)
{
srand48(myseed+I);
}
else
{
srand48((unsigned)time(NULL));
}
}
void getval(void) /******* This function gets the uniq variable values ****/
{
int i,j,k;
float *Tval;
float lastval;
Tval = (float *)malloc(ln*sizeof(float));
nvals = (int *)malloc(p*sizeof(int));
Uvals = (float **)malloc(p*sizeof(float *));
  for (j=0; j<p; j++)
  {
    for (i=0; i<ln; i++)
{
Tval[i] = IX[i][j];
}
qsort((void *)Tval, In, sizeof(float), mycomp);
lastval=Tval[O] ;
nvals[j] = I;
for (i=l; i<ln; i++)
{
if(Tval[i]>lastval)
{
nvals[j]++;
lastval = Tval[i];
}
}
    Uvals[j] = (float *)malloc(nvals[j]*sizeof(float));
    k = 0;
    Uvals[j][k] = Tval[k];
    for (i=1; i<ln; i++)
{
if(Tval[i]>Tval[i-l])
{
k++;
Uvals[j][k] = Tval[i];
}
}
}
free( (void *) Tval);
}
void copyques(int **tofeat, int **todirect, float **tothresh,
int **frfeat, int **frdirect, float **frthresh)
{
  int j, k;
  for (j=0; j<maxq; j++)
  {
    for (k=0; k<ktot; k++)
{
tofeat[j][k] = frfeat[j] [k]; todirect[j][k] = frdirect[j][k];
tothresh[j] [k] = frthresh[j] [k];
}
}
}
int fltcompare(float *i, float *j) /** Compares reals for qsortO call ****/
{
int mycheck;
float myi, myj;
mycheck= 1;
myi = *i; myj = *j;
if (myi > (myj + 0.000001))
{
mycheck = 0;
return (1);
}
if (myi < (myj - 0.000001))
{
mycheck = 0;
return (-1);
}
if (mycheck) return(O);
}
int indofmin(double *vect, int lengv)
/* Returns smallest index of the "minimum" value in a vector of doubles */
{
double tmpval;
int i, cnti;
  tmpval = vect[0]; cnti = 0;
  for (i=1; i<lengv; i++)
  {
    if (vect[i] < (tmpval - 2*FLT_EPSILON))
{
tmpval = vect[i];
cnti = i;
}
}
return(cnti);
}
int maxval(int *vect, int lengv)
/* Returns the maximum value in a vector of ints*/
{
int tmpval;
int i;
tmpval = vect[O];
for (i=l ;i<lengv;i++)
{
if (vect[i]>tmpval)
{
tmpval = vect[i];
}
}
return(tmpval);
}
int mycomp(const void *i, const void *j) /* This calls a function above */
{ /* called fltcompare*/
return( fltcompare((float *) i, (float *) j)); /* for qsortO */
}
void printquesa(int **feat, int **direct, float **thresh,
int *locqk, int locktot)
{
int i,j,k;
  absfile = myopenf(absfile,abs_file);
  for(k=0; k<locktot; k++)
  {
    for (i=0; i<locqk[k]; i++)
    {
      fprintf(absfile,"Is variable %3d ", feat[i][k]+1);
      if (direct[i][k]) fprintf(absfile, ">= ");
      else fprintf(absfile, "<  ");
      fprintf(absfile,"%3.4g ?\n", thresh[i][k]);
    }
    if(k<(locktot-1)) fprintf(absfile,"   OR \n");
  }
  fprintf(absfile," \n");
  fclose(absfile);
}
void printquesb(int **feat, int **direct, float **thresh,
                int *locqk, int locktot)
{
  int i, j, k;
  quesfile = myopenf(quesfile,ques_file);
  if (locktot > 0)
  {
    fprintf(quesfile, "\n The latest best question set using the full dataset is: \n\n");
  }
  else
  {
    fprintf(quesfile, "\n The latest best question set using the full dataset has no questions.\n\n");
  }
  for(k=0; k<locktot; k++)
  {
    for (i=0; i<locqk[k]; i++)
    {
      fprintf(quesfile,"Is variable %3d ", feat[i][k]+1);
      if (direct[i][k]) fprintf(quesfile, ">= ");
      else fprintf(quesfile, "<  ");
      fprintf(quesfile,"%3.4g ?\n", thresh[i][k]);
    }
    if(k<(locktot-1)) fprintf(quesfile,"   OR \n");
  }
  fprintf(quesfile," \n");
  fclose(quesfile);
}
void print2by2(double locnalive, double locndead, int locsubsetsize,
               int loctdead, double miserr, int pstat, int minnvars)
{
  int cnt00, cnt01, cnt10, cnt11, row0, row1, col0, col1, totn;
  quesfile = myopenf(quesfile,ques_file);
  fprintf(quesfile, "\nA set of %d questions was applied to the", minnvars);
  if (pstat == 1)
  {
    fprintf(quesfile, " test dataset:\n");
  }
  else if (pstat == 2)
  {
    fprintf(quesfile, " full dataset:\n");
  }
  row0 = (int)locnalive; row1 = (int)locndead; col1 = locsubsetsize;
  cnt11 = loctdead;
  totn = row0 + row1; col0 = totn - col1; cnt01 = col1 - cnt11;
  cnt10 = row1 - cnt11; cnt00 = row0 - cnt01;
  fprintf(quesfile, "\n\n");
  fprintf(quesfile, "                     predicted outcome        \n");
  fprintf(quesfile, "                       0          1           \n\n");
  fprintf(quesfile, "  true       0   %6d     %6d     %6d \n", cnt00, cnt01, row0);
  fprintf(quesfile, "outcome                                       \n");
  fprintf(quesfile, "             1   %6d     %6d     %6d \n", cnt10, cnt11, row1);
  fprintf(quesfile, "                 %6d     %6d     %6d \n\n", col0, col1, totn);
  fprintf(quesfile, "\n Misclassification error = %lf\n\n\n", miserr);
  fclose(quesfile);
}
void printcost(double *lscores, double *cost_comp, double *nvars, int depth,
               double alpha)
{
    int i;

    quesfile = myopenf(quesfile, ques_file);
    fprintf(quesfile, "\n\n The estimated cost-complexity parameter was %1.7lg.\n",
            alpha);
    fprintf(quesfile, "\n # of questions    misclass. error    cost complexity\n\n");
    i = depth - 1;
    while (i >= 0)
    {
        fprintf(quesfile, "    %3d            %1.7lf        %1.7lg\n",
                (int)nvars[i], 1 - lscores[i], cost_comp[i]);
        if (nvars[i] == 0) i = 0;
        i--;
    }
    fprintf(quesfile, "\n");
    fclose(quesfile);
}
void printscores(double *tscores, double *lscores, double *nvars,
                 double *cost_comp, int depth)
{
    int i;

    quesfile = myopenf(quesfile, ques_file);
    fprintf(quesfile, "\n # of questions    learning set error    test set error\n\n");
    i = depth - 1;
    while (i >= 0)
    {
        fprintf(quesfile, "    %3d            %lf        %lf\n",
                (int)nvars[i], 1 - lscores[i], 1 - tscores[i]);
        if (nvars[i] == 0) i = 0;
        i--;
    }
    fprintf(quesfile, "\n");
    fclose(quesfile);
}
void scramblei(int *vect, int lengv)
{
    int templeng, i, tempind, tempval;

    templeng = lengv;
    for (i=0; i<lengv; i++)
    {
        tempind = (int)floor(drand48()*templeng);
        if (templeng == tempind) tempind--;
        tempval = vect[tempind];
        vect[tempind] = vect[templeng-1];
        vect[templeng-1] = tempval;
        templeng--;
    }
}
void scramblef(float *vect, int lengv)
{
    float tempval;
    int templeng, i, tempind;

    templeng = lengv;
    for (i=0; i<lengv; i++)
    {
        tempind = (int)floor(drand48()*templeng);
        if (templeng == tempind) tempind--;
        tempval = vect[tempind];
        vect[tempind] = vect[templeng-1];
        vect[templeng-1] = tempval;
        templeng--;
    }
}
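
The two scramble routines above perform a Fisher-Yates-style in-place shuffle driven by drand48(). As a reading aid, the stand-alone sketch below (not part of the BOBCAT source; the array contents and sizes are invented) shows how such a shuffle would typically be exercised, for example to randomize record indices before splitting a sample.

#include <stdio.h>
#include <stdlib.h>   /* drand48, srand48 */
#include <math.h>     /* floor */
#include <time.h>     /* time */

/* Same shuffle logic as scramblei(), reproduced so this sketch compiles alone */
static void shuffle_ints(int *vect, int lengv)
{
    int templeng = lengv, i, tempind, tempval;
    for (i = 0; i < lengv; i++)
    {
        tempind = (int)floor(drand48() * templeng);
        if (templeng == tempind) tempind--;
        tempval = vect[tempind];
        vect[tempind] = vect[templeng - 1];
        vect[templeng - 1] = tempval;
        templeng--;
    }
}

int main(void)
{
    int idx[8] = {0, 1, 2, 3, 4, 5, 6, 7};  /* e.g., hypothetical record indices */
    int i;

    srand48((unsigned)time(NULL));
    shuffle_ints(idx, 8);
    for (i = 0; i < 8; i++) printf("%d ", idx[i]);
    printf("\n");
    return 0;
}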
void welcome(void)
{
int resnotok, chkcnt, resp1, junkint, k, goback, tmpint;
char cresp1, junkchar;
double tmpflt;
printf("\n\n");
printf("
?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?11?&?&?&?\n");
printf(" & &\n");
printf(" ? Welcome to BOBCAT! ?\n");
printf(" & Boolean Operations on Binary splits Classification &\n");
printf("? Algorithm for Test-set prediction ?\n");
printf(" & &\n");
printf("? Author: Michael Anderson Copyright 1997 ?\n");
printf(" & &\n");
printf("
?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?\n");
resnotok=1;
while (resnotok > 0)
{
resnotok = 1;
printf("\n\n");
printf(" MAIN MENU\n\n");
printf(" 1. Search for a new set of questions\n");
printf(" 2. Analyze a previously constructed set of questions\n");
printf(" 3. Exit\n\n");
printf(" Enter a number from 1 to 3 (default = 1): ");
cresp1 = getchar();
resp1 = (int) cresp1; junkint = resp1;
while (junkint != 10)
{
junkchar = getchar(); junkint = (int) junkchar;
}
if (resnotok ==1)
{
switch(resp1)
{
case 49: resnotok = 0;
break;
case 50: printf("\n\n\n\n");
printf(" Sorry, that option is not supported yet.\n");
printf(" (BOBCAT is still under construction!)\n");
printf("\n Please enter another number.\n");
resnotok=1;
break;
case 51: printf("\n\n\n Thank you for using BOBCAT.\n\n\n");
resnotok = 0; exit(0);
break;
case 10: resnotok = 0;
break;
default: printf("\n Enter 1, 2, or 3. \n"); resnotok = 1;
break;
}
}
}
goback = 1;
while (goback > 0)
{
resnotok = 1;
while (resnotok > 0)
{
printf("\n\n Enter the number of question subsets to be");
printf(" combined \n");
printf(" with \"OR\" (maximum = %d, default = %d): ",MAXQK,ktot);
cresp1 = getchar();
resp1 = (int) cresp1; junkint = resp1;
while (junkint != 10)
{
junkchar = getchar(); junkint = (int) junkchar;
}
if ( ((resp1 < 49) || (resp1 > (48+MAXQK))) && (resp1 != 10))
{
printf("\n Enter a number no lower than 1, no greater than %d.",
MAXQK);
resnotok = 1;
}
else if (resp1 == 10)
{
ktot = DEFKTOT; resnotok = 0;
}
else
{
resp1 = resp1 - 48; ktot = resp1;
resnotok = 0;
}
}
free( (void *) qk);
qk = (int *)malloc(ktot*sizeof(int));
for (k=0; k<ktot; k++)
{
qk[k] = DEFQK;
resnotok=1;
while (resnotok > 0)
{
printf("\n\n Enter the number of questions in subset %d", k+1);
printf(" to be combined \n");
printf(" with \"AND\" (maximum = %d, default = %d): ",
MAXQK, qk[k]);
cresp1 = getchar();
resp1 = (int) cresp1; junkint = resp1;
while (junkint != 10)
{
junkchar = getchar(); junkint = (int) junkchar;
}
if ( ((resp1 < 49) || (resp1 > (MAXQK + 48))) && (resp1 != 10))
{
printf("\n Enter a number no lower than 1, no greater than %d.\n",
MAXQK);
resnotok = 1;
}
else if (resp1 == 10)
{
qk[k] = DEFQK; resnotok = 0;
}
else
{
resp1 = resp1 - 48; qk[k] = resp1;
resnotok = 0;
}
}
}
resnotok=1;
while (resnotok > 0)
{
printf("\n\n Enter the cost of misclassifying a true \" 1\" as a");
printf(" \"0\" relative to \nil);
printf(" misclassifying a true \"0\" as a \" 1\" (type \" 1\" for the\n");
printf(" default, unit cost): ");
chkcnt = scanf("%lf", &tmpflt);
cresp1 = getchar();
resp1 = (int) cresp1; junkint = resp1;
while (junkint != 10)
{
junkchar = getchar(); junkint = (int) junkchar;
}
if ((tmpflt < 0) || (chkcnt != 1))
{
printf("\n Enter a nonnegative, real-valued number.\n");
}
else
{
miscost = tmpflt; resnotok = 0;
}
}
resnotok=1;
printf("\n\n The following option only affects the speed ofthe search.\n\n");
printf(" Enter the number of mutations-per-subset to allow before checking\n");
printf(" whether the present set of questions is an absorption point. \nil);
printf(" This number may be as large as 100,000 if you dataset is big\n");
printf(" (e.g., more than 10,000 records and more than 150 variables), ar\n");
printf(" as small as 1,000 if your dataset is small (e.g., less than 1,000\n");
printf(" records and less than 40 variables). You may wish to experiment\n");
printf(" to find the optimal value. (Start with 10,000 if you are unsure).\n");
while (resnotok > 0)
{
printf(" Enter a positive integer (omit commas): ");
chkcnt = scanf("%d", &tmpint);
cresp1 = getchar();
resp1 = (int) cresp1; junkint = resp1;
while (junkint != 10)
{
junkchar = getchar(); junkint = (int) junkchar;
}
if ((tmpint < 1) || (chkcnt != 1))
{
resnotok = 1; printf("\n ");
}
else
{
checfreq = tmpint; resnotok = 0;
}
}
printf(" \n\n\n");
printf(" You have instructed BOBCAT to search for %d subsets of questions\n", ktot);
printf(" to be combined with \IOR\I.\n\n");
for (k=O;k<ktot;k++)
{
printf(" Subset %d will consist of%d questions combined with \IAND\I.\n", k+1,
qk[k]);
}
printf("\n The relative cost of misclassification will be %l.2lf.\nil,
miscost);
printf("\n %d mutations-per-subset will be allowed before checking for absorption.\n",
checfreq);
resnotok=1;
while (resnotok > 0)
{
printf("\n Are these instructions correct? (y or n, default = y): ");
cresp1 = getchar();
resp1 = (int) cresp1; junkint = resp1;
while (junkint != 10)
{
junkchar = getchar(); junkint = (int) junkchar;
}
if ((resp1 == 121) || (resp1 == 89) || (resp1 == 10))
{
goback = 0; resnotok = 0;
}
else if ((resp1 == 110) || (resp1 == 78))
{
goback = 1; resnotok = 0;
}
else
{
resnotok = 1;
printf("\n\n Enter \"y\" or \"n\" .\n");
}
}
}
maxq = maxval(qk,ktot);
/* If you want deterministic results, use a positive value of myseed */
if (myseed > 0)
{
for (k=0; k<myseed; k++) drand48();
}
else
{
srand48((unsigned)time(NULL));
}
/* The variables below will hold info on the present point in space.
feat is the variable number, direct is < or >=, thresh is cutoff */
feat = (int **)malloc(maxq*sizeof(int *));
direct = (int **)malloc(maxq*sizeof(int *));
thresh = (float **)malloc(maxq*sizeof(float *));
tbfeat = (int **)malloc(maxq*sizeof(int *));
tbdirect = (int **)malloc(maxq*sizeof(int *));
tbthresh = (float **)malloc(maxq*sizeof(float *));
bfeat = (int **)malloc(maxq*sizeof(int *));
bdirect = (int **)malloc(maxq*sizeof(int *));
bthresh = (float **)malloc(maxq*sizeof(float *));
beslfeat = (int **)malloc(maxq*sizeof(int *));
besldirect = (int **)malloc(maxq*sizeof(int *));
beslthresh = (float **)malloc(maxq*sizeof(float *));
besffeat = (int **)malloc(maxq*sizeof(int *));
besfdirect = (int **)malloc(maxq*sizeof(int *));
besfthresh = (float **)malloc(maxq*sizeof(float *));
for (k=0; k<maxq; k++)
{
feat[k] = (int *)calloc(ktot,sizeof(int));
direct[k] = (int *)calloc(ktot,sizeof(int));
thresh[k] = (float *)calloc(ktot,sizeof(float));
bfeat[k] = (int *)calloc(ktot,sizeof(int));
bdirect[k] = (int *)calloc(ktot,sizeof(int));
bthresh[k] = (float *)calloc(ktot,sizeof(float));
tbfeat[k] = (int *)calloc(ktot,sizeof(int));
tbdirect[k] = (int *)calloc(ktot,sizeof(int));
tbthresh[k] = (float *)calloc(ktot,sizeof(float));
beslfeat[k] = (int *)calloc(ktot,sizeof(int));
besldirect[k] = (int *)calloc(ktot,sizeof(int));
beslthresh[k] = (float *)calloc(ktot,sizeof(float));
besffeat[k] = (int *)calloc(ktot,sizeof(int));
besfdirect[k] = (int *)calloc(ktot,sizeof(int));
besfthresh[k] = (float *)calloc(ktot,sizeof(float));
}
}
FILE *myopenf(FILE *locfile, char *loc_file)
{
if ((locfile = fopen(loc_file, "a")) == NULL)
{
printf("\n\nError: Could not open file %s\n\n", loc_file);
exit(-1);
}
return(locfile);
}
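
The feat/direct/thresh arrays allocated in welcome() store each candidate question as a variable index, a direction (1 for ">=", 0 for "<"), and a threshold; printquesa() and printquesb() render these triples as the questions shown in the appendices. The stand-alone fragment below is not part of the BOBCAT source: it simply illustrates, under that stored representation, how one such question would be evaluated against a single record. The record values and index are invented for illustration.

#include <stdio.h>

/* Hypothetical evaluation of a stored question against one record x[]:
   "Is variable (feat+1) >= thresh ?" when direct is 1,
   "Is variable (feat+1) <  thresh ?" when direct is 0.
   (The +1 reflects the 1-based numbering used when questions are printed.) */
static int answer_question(const float *x, int feat, int direct, float thresh)
{
    if (direct) return (x[feat] >= thresh);
    return (x[feat] < thresh);
}

int main(void)
{
    float record[3] = {67.0f, 1.0f, 150.0f};  /* invented record values */

    /* e.g., "Is variable 3 >= 158 ?" -> feat index 2, direct 1, thresh 158 */
    printf("%d\n", answer_question(record, 2, 1, 158.0f));
    return 0;
}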
Appendix VIII - A large question set for predicting death
QUESTION SET J
(Sensitivity: 51% Specificity: 84%)
SET J.1
1. Other than when you might have been in the
hospital, was there any time in the past 12
months when you needed help from some person
or any equipment or device to do the following:
Walking across a small room?
No help
Help
Unable to do
Missing
2. (When wearing eyeglasses/contact lenses)
Can you see well enough to read ordinary
newspaper print?
Yes
No
Missing
OR
SET J.2
1. Are you able to walk half a mile without help?
That's about 8 ordinary blocks.
Yes
No
Missing
2. Do you take any digitalis, Digoxin, Lanoxin,
or Digitoxin pills now?
Yes
No
Missing
OR
SET J.3
1. Age: 80+ <80
2. Did a doctor ever tell you that you had a stroke
or brain hemorrhage?
Yes
Suspect
No
Missing
OR
SET J.4
1. Sex: Male Female
2. What is your telephone number?
Correct
Incorrect
Refused
Missing
3. Have you been to a hospital at least one night
in the past 12 months?
Yes
No
Missing
OR
SET J.5
1. Age: 80+ <80
2. What was (is) your mother's maiden name?
Correct
Incorrect
Refused
Missing
OR
SET J.6
1. As compared to other people your own age,
would you say that your health is excellent, good,
fair, poor or very poor?
Excellent
Good
Fair
Poor or bad
Missing
2. In the past year, have you gained or lost more
than 10 pounds?
No change
Yes, gained
Yes, lost
Yes, both gained and lost
Missing
OR
SET J.7
1. Weight at time of interview (in pounds):
<158 158+ Missing
2. Sex: Male Female
3. Do you take any digitalis, Digoxin, Lanoxin,
or Digitoxin pills now?
Yes
No
Missing
OR
SET J.8
1. What day of the week is it?
Correct
Incorrect
Refused
Missing
2. Other than when you might have been in the
hospital, was there any time in the past 12
months when you needed help from some person
or any equipment or device to do the following
things:
Bathing, either a sponge bath, tub bath, or
shower?
No help
Help
Unable to do
Missing
OR
SET J.9
1. Weight at time of interview (in pounds):
<170 170+ Missing
2. Sex: Male Female
3. Are you able to walk half a mile without help?
That's about 8 ordinary blocks.
Yes
No
Missing
OR
SET J.10
1. Has a doctor ever told you you had diabetes,
high blood sugar, or sugar in your urine?
Yes
Suspect
No
Missing
2. Has a doctor ever told you you had a heart
attack, coronary, coronary thrombosis, coronary
occlusion or myocardial infarction?
Yes
No
Missing
3. Were you hospitalized overnight or longer for
this?
Yes
No
Missing
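
To make the logical structure of Question Set J explicit: within each subset the questions are combined with "AND", and the subsets are combined with "OR", so a respondent is classified as a predicted death if all of the conditions in any one subset are satisfied. The stand-alone sketch below is not part of the dissertation's software; it illustrates this rule for the first two subsets only. The field names and the 0/1 coding of the answers are hypothetical (which answer category satisfies each condition depends on the fitted thresholds, which are not reproduced here).

#include <stdio.h>

/* Hypothetical 0/1 coding of one respondent's answers:
   1 = the response satisfying the question's condition, 0 = otherwise. */
struct respondent {
    int needs_help_walking_room;   /* Set J.1, question 1 */
    int cannot_read_newsprint;     /* Set J.1, question 2 */
    int cannot_walk_half_mile;     /* Set J.2, question 1 */
    int takes_digitalis;           /* Set J.2, question 2 */
    /* ...flags for subsets J.3 through J.10 would follow... */
};

/* Positive prediction if ANY subset has ALL of its conditions satisfied
   (AND within a subset, OR across subsets). Only J.1 and J.2 are sketched. */
int predict_positive(const struct respondent *r)
{
    if (r->needs_help_walking_room && r->cannot_read_newsprint) return 1;  /* J.1 */
    if (r->cannot_walk_half_mile && r->takes_digitalis) return 1;          /* J.2 */
    return 0;
}

int main(void)
{
    struct respondent r = {0, 0, 1, 1};  /* invented example answers */
    printf("Predicted outcome: %d\n", predict_positive(&r));
    return 0;
}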