
ISBN: 0-8247-9025-1

This book is printed on acid-free paper.

Headquarters
Marcel Dekker, Inc.
270 Madison Avenue, New York, NY 10016
tel: 212-696-9000; fax: 212-685-4540

Eastern Hemisphere Distribution


Marcel Dekker AG
Hutgasse 4, Postfach 812, CH-4001 Basel, Switzerland
tel: 41-61-261-8482; fax: 41-61-261-8896

World Wide Web


http://www.dekker.com

The publisher offers discounts on this book when ordered in bulk quantities. For more
information, write to Special Sales/Professional Marketing at the headquarters address
above.

Copyright  2001 by Marcel Dekker, Inc. All Rights Reserved.

Neither this book nor any part may be reproduced or transmitted in any form or by any
means, electronic or mechanical, including photocopying, microfilming, and recording,
or by any information storage and retrieval system, without permission in writing from
the publisher.

Current printing (last digit):


10 9 8 7 6 5 4 3 2 1

PRINTED IN THE UNITED STATES OF AMERICA


To my mother, who has been an inspiration to me for 55 years,
and to others for 91.

The idea for this book came to me from Graham Garrett of Marcel Dekker, Inc.,
who, tragically, passed away while the book was in progress. I hope that the
result lives up to his expectations.
All royalties for the editor will go to Cancer Research And Biostatistics,
a nonprofit corporation whose mission is to help conquer cancer through the
application of biostatistical principles and data management methods.
Preface

This book is a compendium of statistical approaches to the problems facing those


trying to make progress against cancer. As such, the focus is on cancer clinical
trials, although several of the contributions also apply to observational studies,
and many of the chapters generalize beyond cancer research. This field is approxi-
mately 50 years old, and it has been at least 15 years since such a summary
appeared; because much progress has been made in recent decades, the time is
propitious for this book. The intended audience is primarily but not exclusively
statisticians working in cancer research; it is hoped that oncologists might benefit
as well from reading this book.
The book has six sections:

1. Phase I Trials. This area has moved from art to science in the last
decade, thanks largely to the contributors to this book.
2. Phase II Trials. Recent advances beyond the widely accepted two-stage
design based on tumor response include designs based on toxicity and
response, and selection designs meant to guide decisions regarding
which of many treatments to move to Phase III trials.
3. Phase III Trials. A comprehensive treatment is provided of sample size,
as well as discussions of multiarm trials, equivalence trials, and early
stopping.
4. Complementary Outcomes. Quality of life and cost of treatment have
become increasingly important, but pose challenging analytical prob-
lems, as the chapters in this section describe.
5. Prognostic Factors and Exploratory Analysis. The statistical field of
survival analysis has had its main impetus from cancer research, and the
chapters in this section demonstrate the breadth and depth of activity in


this field today.
6. Interpreting Clinical Trials. This section provides lessons—never out-
dated and seemingly always needing repeating—on what can and can-
not be concluded from single or multiple clinical trials.
I would like to thank all the contributors to this volume.

John Crowley
Contents

Preface v
Contributors xi

PHASE I TRIALS

1. Overview of Phase I Trials 1


Lutz Edler

2. Dose-Finding Designs Using Continual Reassessment Method 35


John O’Quigley

3. Choosing a Phase I Design 73


Barry E. Storer

PHASE II TRIALS

4. Overview of Phase II Clinical Trials 93


Stephanie Green

5. Designs Based on Toxicity and Response 105


Gina R. Petroni and Mark R. Conaway

6. Phase II Selection Designs 119


P. Y. Liu

PHASE III TRIALS

7. Power and Sample Size for Phase III Clinical Trials of


Survival 129
Jonathan J. Shuster

8. Multiple Treatment Trials 149


Stephen L. George

9. Factorial Designs with Time-to-Event End Points 161


Stephanie Green

10. Therapeutic Equivalence Trials 173


Richard Simon

11. Early Stopping of Cancer Clinical Trials 189


James J. Dignam, John Bryant, and H. Samuel Wieand

12. Use of the Triangular Test in Sequential Clinical Trials 211


John Whitehead

COMPLEMENTARY OUTCOMES

13. Design and Analysis Considerations for Complementary


Outcomes 229
Bernard F. Cole

14. Health-Related Quality-of-Life Outcomes 249


Benny C. Zee and David Osoba

15. Statistical Analysis of Quality of Life 269


Andrea B. Troxel and Carol McMillen Moinpour

16. Economic Analysis of Cancer Clinical Trials 291


Gary H. Lyman

PROGNOSTIC FACTORS AND EXPLORATORY ANALYSIS

17. Prognostic Factor Studies 321


Martin Schumacher, Norbert Holländer, Guido Schwarzer,
and Willi Sauerbrei

18. Statistical Methods to Identify Prognostic Factors 379


Kurt Ulm, Hjalmar Nekarda, Pia Gerein, and Ursula Berger

19. Explained Variation in Proportional Hazards Regression 397


John O’Quigley and Ronghui Xu

20. Graphical Methods for Evaluating Covariate Effects in the


Cox Model 411
Peter F. Thall and Elihu H. Estey

21. Graphical Approaches to Exploring the Effects of Prognostic


Factors on Survival 433
Peter D. Sasieni and Angela Winnett

22. Tree-Based Methods for Prognostic Stratification 457


Michael LeBlanc

INTERPRETING CLINICAL TRIALS

23. Problems in Interpreting Clinical Trials 473


Lillian L. Siu and Ian F. Tannock

24. Commonly Misused Approaches in the Analysis of Cancer


Clinical Trials 491
James R. Anderson

25. Dose-Intensity Analysis 503


Joseph L. Pater

26. Why Kaplan-Meier Fails and Cumulative Incidence Succeeds


When Estimating Failure Probabilities in the Presence of
Competing Risks 513
Ted A. Gooley, Wendy Leisenring, John Crowley, and Barry
E. Storer

27. Meta-Analysis 525


Luc Duchateau and Richard Sylvester

Index 545
Contributors

James R. Anderson, Ph.D. Department of Preventive and Societal Medicine,


University of Nebraska Medical Center, Omaha, Nebraska

Ursula Berger, Dipl.Stat. Institute for Medical Statistics and Epidemiology,


Technical University of Munich, Munich, Germany

John Bryant, Ph.D. National Surgical Adjuvant Breast and Bowel Project, and
Biostatistical Center, University of Pittsburgh, Pittsburgh, Pennsylvania

Bernard F. Cole, Ph.D. Department of Community and Family Medicine,


Dartmouth Medical School, Lebanon, New Hampshire

Mark R. Conaway, Ph.D. Division of Biostatistics and Epidemiology, Depart-


ment of Health Evaluation Sciences, University of Virginia, Charlottesville, Vir-
ginia

John Crowley, Ph.D. Southwest Oncology Group Statistical Center, Fred


Hutchinson Cancer Research Center, Seattle, Washington

James J. Dignam, Ph.D. National Surgical Adjuvant Breast and Bowel Project
and Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylva-
nia, and Department of Health Sciences, University of Chicago, Chicago, Illinois

Luc Duchateau, Ph.D. EORTC Data Center, European Organization for Re-
search and Treatment of Cancer, Brussels, Belgium

Lutz Edler, Ph.D. Biostatistics Unit, German Cancer Research Center, Heidel-
berg, Germany

Elihu H. Estey, M.D. Department of Leukemia, University of Texas M.D. An-


derson Cancer Center, Houston, Texas

Stephen L. George, Ph.D. Department of Biostatistics and Bioinformatics,


Duke University Medical Center, Durham, North Carolina

Pia Gerein, Dipl.Stat. Institute for Medical Statistics and Epidemiology, Tech-
nical University of Munich, Munich, Germany

Ted A. Gooley, Ph.D. Department of Clinical Statistics, Southwest Oncology


Group Statistical Center, Fred Hutchinson Cancer Research Center, Seattle,
Washington

Stephanie Green, Ph.D. Program in Biostatistics, Southwest Oncology Group


Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, Washington

Norbert Holländer, M.Sc. Department of Medical Biometry and Statistics, In-


stitute of Medical Biometry and Medical Informatics, University of Freiburg,
Freiburg, Germany

Michael LeBlanc, Ph.D. Program in Biostatistics, Fred Hutchinson Cancer Re-


search Center, Seattle, Washington

Wendy Leisenring, Sc.D. Departments of Clinical Statistics and Biostatistics,


Southwest Oncology Group Statistical Center, Fred Hutchinson Cancer Research
Center, Seattle, Washington

P. Y. Liu, Ph.D. Public Health Sciences Division, Fred Hutchinson Cancer


Research Center, Seattle, Washington

Gary H. Lyman, M.D., M.P.H., F.R.C.P.(Edin) Department of Medicine,


Albany Medical College, and Department of Biometry and Statistics, State Uni-
versity of New York at Albany School of Public Health, Albany, New York

Carol McMillen Moinpour, Ph.D. Division of Public Health Sciences, South-


west Oncology Group Statistical Center, Fred Hutchinson Cancer Research Cen-
ter, Seattle, Washington

Hjalmar Nekarda, Dr. Institute for Medical Statistics and Epidemiology,


Technical University of Munich, Munich, Germany

John O’Quigley, Ph.D. Department of Mathematics, University of California–


San Diego, La Jolla, California

David Osoba, B.Sc., M.D.(Alta), F.R.C.P.C. Quality of Life Consulting, West


Vancouver, British Columbia, Canada

Joseph L. Pater, M.D., M. Sc., F.R.C.P.(C) NCIC Clinical Trials Group,


Queen’s University, Kingston, Ontario, Canada

Gina R. Petroni, Ph.D. Division of Biostatistics and Epidemiology, Depart-


ment of Health Evaluation Sciences, University of Virginia, Charlottesville, Vir-
ginia

Peter D. Sasieni, Ph.D. Department of Mathematics, Statistics, and Epidemi-


ology, Imperial Cancer Research Fund, London, England

Willi Sauerbrei, Ph.D. Department of Medical Biometry and Statistics, Insti-


tute of Medical Biometry and Medical Informatics, University of Freiburg, Frei-
burg, Germany

Martin Schumacher, Ph.D. Department of Medical Biometry and Statistics,


Institute of Medical Biometry and Medical Informatics, University of Freiburg,
Freiburg, Germany

Guido Schwarzer, M.Sc. Department of Medical Biometry and Statistics, In-


stitute of Medical Biometry and Medical Informatics, University of Freiburg,
Freiburg, Germany

Jonathan J. Shuster, Ph.D. Department of Statistics, University of Florida,


Gainesville, Florida

Richard Simon Biometric Research Branch, National Cancer Institute, Na-


tional Institutes of Health, Bethesda, Maryland

Lillian L. Siu, M.D., F.R.C.P.(C) Department of Medical Oncology and He-


matology, Princess Margaret Hospital, Toronto, Ontario, Canada

Barry E. Storer, Ph.D. Clinical Research Division, Fred Hutchinson Cancer


Research Center, Seattle, Washington

Richard Sylvester, Sc.D. EORTC Data Center, European Organization for Re-
search and Treatment of Cancer, Brussels, Belgium

Ian F. Tannock, M.D., Ph.D., F.R.C.P.(C) Department of Medical Oncology


and Hematology, Princess Margaret Hospital, Toronto, Ontario, Canada

Peter F. Thall, Ph.D. Department of Biostatistics, University of Texas M.D.


Anderson Cancer Center, Houston, Texas

Andrea B. Troxel, Sc.D. Division of Biostatistics, Joseph L. Mailman School


of Public Health, Columbia University, New York, New York

Kurt Ulm, Ph.D. Institute for Medical Statistics and Epidemiology, Technical
University of Munich, Munich, Germany

John Whitehead, Ph.D. Medical and Pharmaceutical Statistics Research Unit,


The University of Reading, Reading, England

H. Samuel Wieand, Ph.D. National Surgical Adjuvant Breast and Bowel Proj-
ect and Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsyl-
vania

Angela Winnett, Ph.D.* Department of Epidemiology and Public Health, Im-


perial Cancer Research Fund, London, England

Ronghui Xu, Ph.D. Department of Biostatistics, Harvard School of Public


Health and Dana-Farber Cancer Institute, Boston, Massachusetts

Benny C. Zee, Ph.D. Clinical Trials Group, National Cancer Institute of Can-
ada, Kingston, Ontario, Canada

* Current affiliation: Imperial College School of Medicine, London, England.


1
Overview of Phase I Trials

Lutz Edler
German Cancer Research Center, Heidelberg, Germany

I. INTRODUCTION

The phase I clinical trial constitutes a research methodology for the search and
establishment of new and better treatment of human diseases. It is the first of
the three phases—phase I–III trials—that became a ‘‘gold standard’’ of medical
research during the second half of the 20th century (1). The goal of the phase I
trial is to define and to characterize the new treatment in humans to set the basis
for later investigations of efficacy and superiority. Therefore, the safety and the
feasibility of the treatment are at the center of interest. A positive risk–benefit
judgment should be expected such that possible harm of the treatment is out-
weighed by possible gain in cure, in suppression of the disease and its symptoms,
and in improved quality of life and survival. The phase I trial should define a
standardized treatment schedule to be safely applied to humans and worth being
further investigated for efficacy. For non-life-threatening diseases, phase I trials
are usually conducted on human volunteers, at least as long as the expected toxic-
ity is mild and can be controlled without harm. In life-threatening diseases such
as cancer, AIDS, and so on, phase I studies are conducted with patients because
of the aggressiveness and possible harmfulness of cytostatic treatments, because
of possible systemic treatment effects, and because of the high interest in the
new drug’s efficacy in those patients directly. After failure of standard treatments
or in the absence of a curative treatment for seriously chronically ill patients, the
new drug may be the small remaining chance for treatment.

The methodology presented below is restricted to the treatment of cancer


patients. Biostatistical methodology for planning and analyzing phase I oncologi-
cal clinical trials is presented. The phase I trial is the first instance where patients
are treated experimentally with a new drug. Therefore, it has to be conducted
unconditionally under the regulations of the Declaration of Helsinki (2) to pre-
serve the patient’s rights in an extreme experimental situation and to render the
study ethically acceptable. The next section provides an outline of the task, in-
cluding the definition of the maximum tolerated dose (MTD), which is crucial
for the design and analysis of a phase I trial. Basic assumptions underlying the
conduct of the trial and basic definitions for the statistical task are given. The
presentation of phase I designs in Section III distinguishes between the determina-
tion of the dose levels (action space) and the choice of the dose escalation scheme
(decision options). This constitutes the core of this chapter. Phase I designs pro-
posed during the past 10 years are introduced there. The sample size per dose
level is discussed separately. Validations of phase I trials rely mostly on simula-
tion studies because the designs cannot be compared competitively in practice.
Practical aspects of the conduct of a phase I trial, including the choice of a starting
dose, are presented in Section IV. Individual dose adjustment and dose titration
studies are also addressed. Section V exhibits standard methods of analyzing
phase I data. Basic pharmacokinetic methods are outlined. Regulatory aspects
and guidelines are dealt with in Section VI. It will become clear that the method-
ology for phase I trials is far from being at an optimal level at present. Therefore,
Section VII addresses practical needs, problems occurring during the conduct of
a phase I trial, and future research topics.

II. TASKS, ASSUMPTIONS, AND DEFINITIONS


A. Clinical Issues and Statistical Tasks
Clinical phase I studies in oncology are of pivotal importance for the development
of new anticancer drugs and anticancer treatment regimens (3,4). If a new agent
has successfully passed preclinical investigations (5) and is judged as being ready
for application in patients, then the first application to humans should occur
within the framework of a phase I clinical trial (6–9). At this early stage, an
efficacious and safe dosing is unknown and information is available at best
from preclinical in vitro and in vivo studies (10). Beginning treatment at a low
dose very likely to be safe (starting dose), small cohorts of patients are treated
at progressively higher doses (dose escalation) until drug-related toxicity reaches
a predetermined level (dose limiting toxicity [DLT]). The objective is to deter-
mine the MTD (8) of a drug for a specified mode of administration and to
characterize the DLT. The goals in phase I trials are according to Von Hoff et
al. (11):

1. Establishment of an MTD,
2. Determination of qualitative and quantitative toxicity and of the toxic-
ity profile,
3. Characterization of DLT,
4. Identification of antitumor activity,
5. Investigation of basic clinical pharmacology,
6. Recommendation of a dose for phase II studies.

The primary goal is the determination of a maximum safe dose for a specified
mode of treatment as basis for phase II trials. Activity against tumors is examined
and assessed, but tumor response is not a primary end point.
Inherent in a phase I trial is the ethical issue that anticancer treatment is
potentially both harmful and beneficial to a degree that depends on dosage. The
dose dependency is at that stage of research not known for humans (12). To be
on the safe side, the treatment starts at low doses that are probably not high
enough such that the drug can be sufficiently active to elicit a beneficial effect.
Even worse, the experimental drug may appear finally—after having passed the
clinical drug development program—as inefficacious and may have been of harm
only. Then, retrospectively seen, patients in phase I trials hardly had any benefit
from the medical treatment. This is all unknown at the start of the trial. The
dilemma of probably unknowingly underdosing patients in the early stages of a
phase I trial has been of concern and challenged the search for the best possible
methodology for the design and conduct of a phase I trial. The goal is to obtain
the most information on toxicity in the shortest possible time with the fewest
patients (13).
Important clinical issues in phase I trials are patient selection and identifi-
cation of factors that determine toxicity, drug schedules, and the determination
and assessment of target toxicity (6,11). Important statistical issues are the design
parameters (starting dose, dose levels, dose escalation) and the estimation of the
MTD.

B. Assumptions
Most designs for dose finding in phase I trials assume a monotone dose–toxicity
relationship and a monotone dose–(tumor) response relationship (14). This idealized
relationship is: biologically inactive dose < biologically active dose < highly
toxic dose.
Methods considered below apply to adult cancer patients with confirmed
diagnosis of cancer not amenable to established treatment. Usually excluded are
leukemias and tumors in children (9). Phase I studies in radiotherapy may require
further consideration because of long delayed toxicity. The conduct of a phase
I trial requires an excellently equipped oncological center with high-quality

means for diagnosis and experimental treatment, for detection of toxicity, and
for fast and adequate reaction in the case of serious adverse events. Furthermore,
easy access to a pharmacological laboratory is needed for timely pharmacokinetic
analyses. These requirements indicate the advisability of restricting a phase I trial
to one or very few centers.

C. Definitions
Throughout this article we denote the set of dose levels at which patients are
treated by D = {xi, i = 1, . . .}, assuming xi < xi+1. The dose unit is usually mg/m2
of body surface area (15), but neither this choice nor the route of administration
has an impact on the methods.
after the other numbered by j, j ⫽ 1, 2, . . . and that treatment starts immediately
after entry (informed consent assumed). Denote by x( j ) the dose level of patient
j. Toxic response of patient j is assumed to be described by the dichotomous
random variable Yj , where Yj ⫽ 1 indicates the occurrence of a DLT and Yj ⫽
0 the nonoccurrence. To comply with most articles we denote the dose–toxicity
function by ψ(x, a) with a parameter (vector) a:

P(Y ⫽ 1|Dose ⫽ x) ⫽ ψ(x, a) (1)

ψ(x, a) is assumed as a continuous monotone nondecreasing function of the


dose x, defined on the real line 0 ⱕ x ⬍ ∞ with ψ(0, a) ⱖ 0 and ψ(∞, a)
ⱕ 1.
Small cohorts of patients of size nk, 1 ≤ nk ≤ nmax, are treated at a sequence of
doses x[k] ∈ D consecutive in time, where nmax is a theoretical limit on the number
of patients treated per dose level (e.g., nmax = 8) and where k = 1, 2, . . . counts
the time periods of treatment with dose x[k]; x[k] ≠ x[h] for k ≠ h is not assumed.
Notice that a dose xi ∈ D may be visited more than once, with some time delay
between visits. If the treatment at each of these dose levels x[k], k = 1, 2, . . . ,
lasts a fixed time length ∆t (e.g., 2 months), the duration of the phase I trial equals
∆t times the number of cohorts entering the trial, independent of the number nk
of patients per cohort at level x[k].

D. Maximum Tolerated Dose


The notion of an MTD is defined unequivocally in terms of the observed toxicity
data of the treated patients, using the concept of DLT under valid toxicity criteria
(8). Drug toxicity is considered tolerable if the toxicity is acceptable, manage-
able, and reversible. Drug safety has been standardized for oncological studies
recently by the establishment of the common toxicity criteria (CTC) of the U.S.
National Cancer Institute (NCI) (16). This is a large list of adverse events (AEs)

subdivided into organ/symptom categories that can be related to the anticancer


treatment. Each AE has been categorized into five classes:
1. CTC grade 0, no AE or normal;
2. CTC grade 1, mild (elevated/reduced);
3. CTC grade 2, moderate;
4. CTC grade 3, serious/severe;
5. CTC grade 4, very serious or life threatening.
The CTC grade 5, fatal, sometimes applied, is not used in the sequel because
death is usually taken as a very serious adverse event preceded by a CTC grade
4 toxicity. Of course, a death related to treatment has to be counted as DLT. The
list of CTC criteria has replaced the list of the World Health Organization (17)
based on an equivalent 0–4 scale. Investigators planning a phase I trial have to
identify in the CTC list a subset of candidate toxicities for dose limitation, and
they have to fix the grade for which that toxicity is considered to be dose limiting
such that treatment has to be either stopped or the dose has to be reduced. Usually,
a toxicity of grade 3 or 4 is considered dose limiting. That identified subset of
toxicities from the CTC list and the limits of grading define the DLTs for the
investigational drug. Sometimes the list of DLTs is open such that any AE from
the CTC catalogue of grade 3 and higher related to treatment is considered a
DLT.
During cancer therapy patients may show symptoms from the candidate
list of DLTs not caused by the treatment but by the cancer disease itself or by
concomitant treatment. Therefore, the occurrence of any toxicity is judged by
the clinician or study nurse for its relation to the investigational treatment. A
commonly used assessment scale is as follows:
1. Unclear/no judgment possible,
2. Not related,
3. Possibly,
4. Probably,
5. Definitively
related to treatment (18). Often, a judgment of ‘‘possibly’’ and more (i.e., proba-
bly or definitively) is considered as drug-related toxicity and called adverse
drug reaction (ADR). Therefore, one may define the occurrence of DLT for a
patient more strictly: a DLT has occurred if at least one toxicity from the candidate
subset of the CTC criteria of grade 3 or higher has occurred that was judged as at least possibly
treatment related. Obviously, this definition carries subjectivity: choice of the
candidate list of CTCs for DLT, assessment of the grade of toxicity, and assess-
ment of the relation to treatment (19). Uncertainty of the assessment of toxicity
has been investigated (e.g., in 20, 21). When anticancer treatment is organized
in treatment cycles—mostly for 3–4 weeks—DLT is usually assessed retrospec-

tively before the start of a new treatment cycle. In phase I studies, often two
cycles are awaited before a final assessment of the DLT is made. If at least one
cycle exhibits at least one DLT, that patient is classified as having reached DLT.
An unambiguous definition of the assessment rules of individual DLT is manda-
tory for the study protocol. For the statistical analysis, each patient should be
assessable at his or her dose level either as having experienced a DLT (Y ⫽ 1)
or not (Y ⫽ 0).
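
To make the resulting data structure concrete, the following sketch shows one way the binary indicator Y could be derived from recorded adverse events under the definition just given; the grade threshold of 3, the attribution scale, and all field names are illustrative assumptions rather than part of the chapter.

```python
# Sketch of deriving the binary DLT indicator Y from recorded adverse events.
# Assumptions (hypothetical, for illustration only): DLT = any adverse event from
# the protocol's candidate list with CTC grade >= 3 that was judged at least
# "possibly" related to treatment, assessed over the observed treatment cycles.
from dataclasses import dataclass

RELATION_SCALE = ["unclear", "not related", "possibly", "probably", "definitively"]

@dataclass
class AdverseEvent:
    term: str        # e.g., "neutropenia"
    ctc_grade: int   # 0-4 (a treatment-related death is counted via grade 4 here)
    relation: str    # one of RELATION_SCALE

def is_dlt(event: AdverseEvent, candidate_terms: set[str], grade_limit: int = 3) -> bool:
    """An event is dose limiting if it is on the candidate list, reaches the
    limiting grade, and is judged at least possibly treatment related."""
    related = RELATION_SCALE.index(event.relation) >= RELATION_SCALE.index("possibly")
    return event.term in candidate_terms and event.ctc_grade >= grade_limit and related

def patient_dlt_indicator(events: list[AdverseEvent], candidate_terms: set[str]) -> int:
    """Y = 1 if at least one recorded event qualifies as a DLT, else Y = 0."""
    return int(any(is_dlt(e, candidate_terms) for e in events))

# Example usage with made-up data:
events = [AdverseEvent("nausea", 2, "probably"), AdverseEvent("neutropenia", 4, "possibly")]
print(patient_dlt_indicator(events, {"neutropenia", "thrombocytopenia"}))  # 1
```

In practice, as described above, the candidate list of toxicities and the limiting grades are fixed in the study protocol before the trial starts.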
With the above definition of DLT, one can theoretically assign to each
patient an individual MTD (I-MTD) as the highest dose that can be administered
safely to that patient: The I-MTD is the highest dose x that can be given to a
patient before a DLT occurs. No within-patient variability is considered at this
instance (i.e., I-MTD is nonrandom). Because a patient can be examined at only
one dose, it is not possible to observe the I-MTD; one observes only whether the
given dose x exceeded the I-MTD or not. It is implicitly assumed that all patients
entering a phase I trial react in a statistical sense identically and independently of
each other. A population of patients gives rise to a statistical distribution. There-
fore, one postulates the existence of a population-based random MTD (realized
in I-MTDs) to describe the distribution of this MTD. The probability that x ex-
ceeds the random MTD is
P(x ⬎ MTD) ⫽ F(x) (2)
and describes the proportion of the population showing a DLT when treated by
dose x. This then becomes the adequate statistical model for describing the MTD.
This probabilistic approach known as tolerance distribution model for a quantal
dose–response relationship (22) allows any reasonable cumulative distribution
function F for the right side of Eq. (2). F is a nondecreasing function with values
between 0 and 1. In practice one should allow F(0) > 0 as a ‘‘baseline toxicity’’
and also F(∞) < 1 for saturation of toxicity. Classes of well-known tolerance
distributions are the probit, logit, and Weibull models (22). Based on this probabi-
listic basis, a practicable definition of an MTD of a phase I trial is obtained as
a percentile of the statistical distribution of the (random population) MTD as
follows. Determine an acceptable proportion 0 ⬍ θ ⬍ 1 of tolerable toxicity in
the patient population before accepting the new anticancer treatment. Define the
MTD as that dose for which the proportion of patients exceeding the DLT is at
least as large as θ: F(MTD) ⫽ θ or MTD ⫽ F⫺1(θ) (Fig. 1). Obviously, there
is a direct correspondence between F in Eq. (2) and ψ in Eq. (1) as
ψ(MTD) ⫽ P(Y ⫽ 1|Dose ⫽ MTD) ⫽ F(MTD) ⫽ θ (3)
If ψ(x) is monotone nondecreasing and continuous, the MTD for θ denoted MTDθ
is the θ percentile:
MTDθ ⫽ ψ⫺1(θ) (4)

Figure 1 Schematic dose–toxicity relationship ψ(x, a). The maximum tolerable dose
MTDθ is defined as the θ percentile of the monotone increasing function ψ(x, a) with
model parameter a.

The choice of θ depends on the nature of the DLT and the type of the target
tumor. For an aggressive tumor and a transient and non-life-threatening DLT, θ
could be as high as 0.5. For persistent DLT and less aggressive tumors, it could
be as low as 0.1 to 0.25. A commonly used value is θ ⫽ 1/3 ⫽ 0.33.

E. Dose–Toxicity Modeling
The choice of an appropriate dose–toxicity model ψ(x) is important not only for
the planning but also for the analysis of phase I data. Most applications use an
extended logit model and apply the logistic regression because of its flexibility,
the ease of accounting for patient covariates (e.g., pretreatment, disease staging,
performance, etc.), and the availability of computing software. A general class
of dose–toxicity models is a two-parameter family:

ψ(x, a) = F(a0 + a1h(x))     (5)



where F is a known cumulative distribution function, h a known dose metric,


and a = (a0, a1) unknown parameters. Monotone increasing functions F and h
are sufficient for a monotone increasing ψ. If h(x) = x, the MTD is

MTDθ = [F−1(θ) − a0]/a1     (6)
Convenient functions F are the
PROBIT (x): Φ(x)
LOGIT (x): {1 + exp(−x)}^−1
HYPERBOLIC TANGENT (x): {[tanh(x) + 1]/2}^a2

with a further unknown parameter component a2; see O’Quigley et al. (23).
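
As a small numerical illustration of the two-parameter family (5) and the closed-form MTD in Eq. (6), the sketch below uses the logit choice for F and h(x) = x; the parameter values and the target θ are arbitrary assumptions.

```python
# Sketch of the two-parameter dose-toxicity family (5) with F = logit and h(x) = x,
# and of the maximum tolerated dose from Eq. (6).  Parameter values are arbitrary.
import math

def logit_cdf(z: float) -> float:
    """F(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logit_quantile(theta: float) -> float:
    """F^{-1}(theta) = log(theta / (1 - theta))."""
    return math.log(theta / (1.0 - theta))

def psi(x: float, a0: float, a1: float) -> float:
    """psi(x, a) = F(a0 + a1 * x), the probability of DLT at dose x."""
    return logit_cdf(a0 + a1 * x)

def mtd(theta: float, a0: float, a1: float) -> float:
    """Eq. (6): MTD_theta = (F^{-1}(theta) - a0) / a1."""
    return (logit_quantile(theta) - a0) / a1

a0, a1, theta = -4.0, 0.05, 1.0 / 3.0        # arbitrary illustrative parameters
d = mtd(theta, a0, a1)
print(round(d, 2), round(psi(d, a0, a1), 3))  # psi(MTD_theta) recovers theta = 0.333
```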

III. DESIGN

A phase I trial design has to determine which dose levels are applied to how
many patients and in which sequence, given the underlying goal of estimating the
MTD. This implies three tasks: determination of the possible set of dose levels,
choosing the dose levels sequentially, and determining the number of patients
per dose level.

A. Choice of the Dose Levels—Action Space


From previous information—mostly preclinical results—a range D of possible
doses (action space) is assumed. One may distinguish between a continuous set
of doses DC, a discrete finite action space DK = {x1 < x2 < ⋅ ⋅ ⋅ < xK} of an
increasing sequence of doses, or an infinite ordered set D ∞ ⫽ {x1 ⬍ x2 ⬍ . . .}.
Simple dose sets are the additive set
xi ⫽ x1 ⫹ (i ⫺ 1)∆ x i ⫽ 1, 2, . . . (7)
and the multiplicative set
xi ⫽ x1 ⋅ f (i⫺1) i ⫽ 1, 2, 3 . . . (8)
where f denotes the factor by which the starting dose x1 is increased. A pure
multiplicative set cannot be recommended and is not used in phase I trials because
of its extreme danger of jumping from a nontoxic level directly to a highly toxic
level. In use are modifications of the multiplicative scheme that start with a few
large steps and slow down later. Such a modified action space could be the result
of a mixture where the first steps of low doses are obtained multiplicatively and

the remaining ones additively. Another smoother set is obtained when the factors
are decreasing with higher doses, for example:
xi ⫽ fi⫺1 xi⫺1 i ⫽ 1, 2, 3, . . . (9)
where { fi} is a nonincreasing sequence of factors, which may start with f1 ⫽ 2
as doubling dose from x1 to x2 and continues with 1 ⬍ fi ⬍ 2 for i ⱖ 2. The
modified Fibonacci scheme, described next, is of this general type. It has been
in use from the beginning of systematic phase I research.
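
The three constructions in Eqs. (7)–(9) can be sketched as follows; the starting dose, increment, factor, and factor sequence are arbitrary illustrative values, with the decreasing-factor sequence chosen to resemble the modified Fibonacci factors discussed next.

```python
# Sketch of the dose-set constructions in Eqs. (7)-(9); the starting dose, increment,
# factor and factor sequence are arbitrary illustrative values.

def additive_set(x1: float, delta: float, k: int) -> list[float]:
    """Eq. (7): x_i = x1 + (i - 1) * delta."""
    return [x1 + (i - 1) * delta for i in range(1, k + 1)]

def multiplicative_set(x1: float, f: float, k: int) -> list[float]:
    """Eq. (8): x_i = x1 * f^(i-1); not recommended because of the risk of jumping
    from a nontoxic level directly to a highly toxic one."""
    return [x1 * f ** (i - 1) for i in range(1, k + 1)]

def decreasing_factor_set(x1: float, factors: list[float]) -> list[float]:
    """Eq. (9): x_i = f_{i-1} * x_{i-1}, with a nonincreasing factor sequence."""
    doses = [x1]
    for f in factors:
        doses.append(doses[-1] * f)
    return doses

print(additive_set(10.0, 5.0, 6))                                 # 10, 15, 20, ...
print(multiplicative_set(10.0, 2.0, 6))                           # 10, 20, 40, ...
print(decreasing_factor_set(10.0, [2.0, 1.67, 1.5, 1.4, 1.33]))   # MFDE-like ladder
```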

1. Modified Fibonacci Dose Escalation


The most popular and most cited dose escalation scheme is the so-called modified
Fibonacci dose escalation (MFDE) (Table 1). A review of the literature for its
origin and justification as a dose-finding procedure is difficult. A number of au-
thors (8,24) refer to an article from 1975 of Goldsmith et al. (25), who present
the MFDE as an ‘‘idealized modified Fibonacci search scheme’’ in multiples of
the starting dose and as percent of increase. Two years earlier, Carter (3) summa-
rized the study design principles for early clinical trials. For methodology he
referred in a general way to O. Selawry, Chief of the Medical Oncology Branch
at the NCI in the early seventies, ‘‘who has elucidated many of the phase I study
principles.’’ Carter stated that ‘‘this scheme has been used successfully in two
Phase I studies performed by the Medical Oncology Branch’’ in 1970, one by
Hansen (26) and one by Muggia (27). Both studies are published in the Proceed-
ings of the American Association of Cancer Research without a bibliography.

Table 1 Evolution of the Modified Fibonacci Scheme from the Fibonacci Numbers fn
Defined Recursively by fn+1 = fn + fn−1, n = 1, 2, . . . , f0 = 0, f1 = 1

Fibonacci numbers      Fibonacci multiples    Modified Fibonacci    Smoothed modified
fn, n = 1, 2, . . .    fn+1/fn                fn+1/fn               Fibonacci

 1                     —                      —                     —
 2                     2.0                    2                     2.0
 3                     1.5                    1.65                  1.67
 5                     1.67                   1.52                  1.50
 8                     1.60                   1.40                  1.40
13                     1.63                   1.33                  1.30–1.35
21                     1.62                   1.33                  1.30–1.35
34                     1.62                   1.33                  1.30–1.35
55                     1.62                   1.33                  1.30–1.35
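
As a check on the first two columns of Table 1, the following sketch generates the Fibonacci numbers and their successive ratios, which approach the Golden Section 1.618; the modified and smoothed factors in the last two columns are empirical choices and are not produced by the recursion.

```python
# Sketch reproducing the first two columns of Table 1: the Fibonacci recursion
# f_{n+1} = f_n + f_{n-1} and the ratios f_{n+1}/f_n, which approach the Golden
# Section (1 + sqrt(5)) / 2 = 1.618.

def fibonacci(n_terms: int) -> list[int]:
    """Fibonacci numbers starting from f_0 = 0, f_1 = 1."""
    f = [0, 1]
    while len(f) < n_terms + 1:
        f.append(f[-1] + f[-2])
    return f

f = fibonacci(12)
ratios = [f[n + 1] / f[n] for n in range(2, 11)]   # skip the leading 0 and 1
print([round(r, 2) for r in ratios])
# -> [2.0, 1.5, 1.67, 1.6, 1.62, 1.62, 1.62, 1.62, 1.62]  (cf. Table 1, up to rounding)
```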

However, the relation to the Fibonacci numbers is not clarified in these early
studies. In 1977, Carter (28) refers to Schneiderman (10) and ‘‘a dose escalation
based on a numeral series described by the famed 13th Century Italian mathemati-
cian Leonardo Pisano, alias Fibonacci.’’ He also reported on the use of the MFDE
by Hansen et al. (29) when studying the antitumor effect of 1,3-bis(2-chloroethyl)-
1-nitrosourea (BCNU) chemotherapy. Both refer to Schneiderman (10).
The weakness of the foundation of the MFDE for phase I trials becomes even
more evident when one looks deeper into the history.
Fibonacci (Figlio Bonaccio, the son of Bonaccio), also known as Leonardo
da Pisa or Leonardo Pisanus, lived from 1180 to 1250 as a mathematician at the
court of Friedrich II in Sicilia working in number theory and biological applica-
tions. The sequence of numbers named after Fibonacci is created by a simple
recursive additive rule: Each number is the sum of its two predecessors: fn⫹1 ⫽
fn ⫹ fn⫺1 and starts from f1 ⫽ 1 using f0 ⫽ 0 (Table 1). Fibonacci related this
sequence to the ‘‘breeding of rabbits’’ problem in 1202 and also to the distribu-
tion of leaves about a stem. The sequence fn grows geometrically and is approxi-
mately equal to a[(1 ⫹ √5)/2]n , where a ⫽ (1 ⫹ √5)/√5. The ratio of successive
numbers fn /fn⫺1 converges to (1 ⫹ √5)/2 ⫽ (1 ⫹ 2.236)/2 ⫽ 1.618, the Golden
Section, a famous principle of ancient and Renaissance architecture. The Fibo-
nacci numbers have been used in optimization and dynamic programming in the
1950s (30) for determining the maximum of an unimodal function. One applica-
tion can be illustrated as follows: ‘‘How many meters long can a bent bridge be,
such that one can always locate its maximum height in units of meters by measur-
ing at most n times?’’ The solution is given by Bellman’s theorem (30) as fn
meters. The results say nothing on the placement of the measurements but only
on the needed number of measurements. In his article on methods for early clini-
cal trials research, Schneiderman (10) showed that he was familiar with this work
on optimization and cites Bellman’s result, but actually with the wrong page
number (correct is Ref. 30, page 34 and not 342). The optimization result is even
better explained in Ref. 31 on page 152. Schneiderman (10) tried to transpose
the Fibonacci search by fixing an initial dose x1, a maximum possible dose xK,
and the number n of steps for moving upward from x1 to xK. ‘‘By taking a Fibo-
nacci series of length n ⫹ 1, inverting the order, and spacing the doses in propor-
tion to the n intervals in the series’’ (10), Schneiderman obtained an increasing
sequence of doses. In contrast to the MFDE, which is based on a multiplicative
set of doses, this approach is somehow still additive. But it leads to smaller and
smaller steps toward higher doses similar to the MFDE. However, the steps ob-
tained by Schneiderman’s inversion are at the beginning very large and later very
small. This escalation is fixed to K doses and does not open to higher doses if the
MTD was not reached at xK. Schneiderman discussed a restarting of the scheme if
no toxicity was seen at xK and concluded that then ‘‘no guide seems to exist for
the number of steps.’’ And he confesses that he has ‘‘not seen it in any published

account of preliminary dose finding.’’ The number of steps in this reversed Fibo-
nacci scheme is strongly related to the escalation factor and so provides no guid-
ance for dose escalation. In the same article, however, he cites De Vita et al.
(32), in which a dose escalation with the factors of 2, 1.5, 1.33, and 1.25 is used.
A hint at the use of the inverse of a Fibonacci scheme where the dose increments
decrease with increasing numbers is also given by Bodey and Legha (8), who
refer to Ref. 25 ‘‘who examined the usefulness of the modified Fibonacci method
as a guide to reaching the MTD.’’
In summary, it seems that the idea of the so-called MFDE came up in the
NCI in the sixties when the early clinical trials programs started there and was
promoted by the scientists mentioned above. They searched for a dose escalation
scheme that slows down from doubling the dose to smaller increases within a
few steps. The MFDE (Table 1), slowing down the increase from 65% to 33%
within the first five steps, seemed reasonable enough to be used in many trials.
The method has been successful to the extent that MTDs have been determined
through its use. From empirical evidence and the simulation studies performed
later, the MFDE seems now to be too conservative in too many cases.

2. Starting Dose
The initial dose given to the first patients in a phase I study should be low enough
to avoid severe toxicity but also high enough for a chance of activity and potential
efficacy in humans. Extrapolation from preclinical animal data focused on the
lethal dose 10% (LD10) of the mouse (dose with 10% drug-induced deaths) con-
verted into equivalents in units of mg/m2 (33) of body surface area. The standard
starting dose became 1/10 of the minimal effective dose level for 10% deaths
(MELD10) of the mouse after verification that no lethal and no life-threatening
effects were seen in another species, for example, rats or dogs (7,11,34). Earlier
recommendations had used higher portions of the MELD10 (mouse) or other char-
acteristic doses as, for example, the lowest dose with toxicity (toxic dose low)
in mammals (35).

B. Dose Escalation Schemes


If a clinical action space has been defined as a set of dose levels D, the next step
in designing a phase I trial consists of the establishment of a rule by which the doses
of D are assigned to patients. Proceeding from a starting dose x1, the sequence of
dosing has to be fixed in advance in a so-called dose escalation rule. This section
starts with the traditional escalation rules (TER). Those rules are also known as
‘‘3 + 3’’ rules because it became usual to enter three patients at a new dose
level and, when any toxicity was observed, to enter six patients in total at that
dose level (11) before deciding to stop at that level or to increase the dose. Carter

(3) is also the first source I am aware of where the so-called 3 ⫹ 3 rule (the
traditional dose escalation scheme) is listed as a phase I study principle. Two
versions of this 3 ⫹ 3 rule are described below as TER and strict TER (STER),
respectively. Then we introduce the up-and-down rules (UaD) as fundamental
but not directly applicable rules and turn from this to Bayesian rules and the
intensively and sometimes also controversially discussed continual reassessment
method (CRM). Methods for the determination of the MTD have to be addressed
in this context also.

1. Traditional Escalation Rule


A long-used standard phase I design has been the TER, where the dose escalates
in DK or D∞ step by step from xi to xi⫹1, i ⫽ 1, 2, . . ., with three to six patients
per dose level; see Table 2 for an example taken from Ref. 36.
Using TER, patients are treated in cohorts of three each receiving the same
dose, say xi. If none of the three patients shows a DLT at level xi, the next cohort
of three patients receives the next higher dose xi⫹1. Otherwise, a second cohort
of three is treated at the same level xi again. If exactly one of the six patients
treated at xi exhibits DLT, the trial continues at the next higher level xi+1. If two
or more patients of the six exhibit DLT at the level xi, the escalation stops at
that level.

Table 2 Example of a Phase I Study Performed According to the Standard Design (TER)

Dose level    Dosage    Escalation factor    N    DLT    Grade
 1             0.45      —                   3    0      000
 2             0.9       2                   6    1      111311
 3             1.5       1.67                6    1      041111
 4             3         2                   3    0      012
 5             4.5       1.5                 4    0      1122
 6             6.0       1.33                3    0      111
 7             7.5       1.25                3    1      113
 8            10.5       1.4                 3    0      111
 9            13.5       1.29                4    1      0320
10            17.5       1.29                3    0      010
11            23.1       1.33                6    1      123122
12            30.0       1.3                 5    4      33331
13            39.0       1.3                 1    1      3

Each row shows the number of patients N of that dose level, the number of cases
with DLT defined as grade 3–4, and the actually observed toxicity grades of the
N patients. (From Ref. 36.)

or more patients of the six exhibit DLT at the level xi , the escalation stops at
that level.
When the escalation has stopped, various alternatives of treating a few more
patients are in use:

1. Treat a small number of additional patients at the stopping level xi ,


e.g., to a total of eight patients.
2. Treat another cohort of three patients at the next lower level xi⫺1 if six
patients had not already been treated.
3. Treat another cohort of three patients at all next lower levels xi⫺1, xi⫺2,
. . . if only three patients had been treated earlier, possibly going down
as far as x1.
4. Treat a limited number of patients at a level not previously included
in D located between xi⫺1 and xi .

A slightly more conservative escalation is implemented in a modified TER de-


noted here as STER. Using STER, patients are treated in cohorts of three each
receiving the same dose, say xi . If none of the three patients shows a DLT at
level xi, the next cohort of three patients receives the next higher dose xi+1. If
one DLT is observed, three other patients are included at the same level xi and
the procedure continues as TER. If two or three DLTs are observed among the
first three of that cohort, escalation stops at xi and the dose is de-escalated to the
next lower level xi⫺1 where a prefixed small number of cases is treated additionally
according to one of the options 2–4.
STER can be described formally as follows: Assume that j patients have
been treated before the turn to the level xi at lower dose levels x(1), . . . x( j ) ⬍
xi . Assume that ni⫺1 patients had been treated at dose level xi⫺1. Then denote by
S mji the number of patients with a DLT among m patients at dose level xi when j
patients have been treated before. Then

x( j ⫹ 1) ⫽ x( j ⫹ 2) ⫽ x( j ⫹ 3) ⫽ xi (10)

and set


x( j + 4) = x( j + 5) = x( j + 6) = xi+1   if S 3ji = 0 (continue)
x( j + 4) = x( j + 5) = x( j + 6) = xi     if S 3ji = 1 (continue)          (11)
x( j + 4) = x( j + 5) = x( j + 6) = xi−1   if S 3ji ≥ 2 (stop)

Then set next

x( j + 7) = x( j + 8) = x( j + 9) = xi+1   if S 6ji ≤ 1
x( j + 7) = x( j + 8) = x( j + 9) = xi−1   if S 6ji ≥ 2 and ni−1 < 6, or          (12)
stop                                       if S 6ji ≥ 2 and ni−1 = 6

If the escalation stops at a dose level xi , that level would be the first of
unacceptable DLT, and one would conclude that the MTD is exceeded. The next
lower level xi⫺1 is then considered as the MTD. It is common practice that at the
dose level of the MTD, at least six patients are treated. For this reason, the options
2 and 3 above are often applied at the end of a phase I trial. Therefore, the MTD
can be characterized as the highest dose level below the stopping dose level xi
at which (at least) six patients have been treated with no more than one case of
DLT. If no such dose level can be identified, the starting dose level x1 would be
taken as MTD.
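
The escalation logic of the traditional 3 + 3 rule described above can be sketched as a simple simulation; the true DLT probabilities per dose level and the random seed are hypothetical, and the convention of reporting the level below the stopping level as the MTD follows the description in the text.

```python
# Sketch of a single simulated trial under the traditional "3 + 3" rule (TER):
# escalate after 0/3 DLTs, expand to six after 1/3, stop after two or more DLTs
# at a level.  The true DLT probabilities per dose level are hypothetical.
import random

def simulate_ter(p_dlt: list[float], seed: int = 1) -> int:
    """Return the 0-based index of the declared MTD, or -1 if even the first
    level is too toxic.  p_dlt[i] is the true DLT probability at dose level i."""
    rng = random.Random(seed)
    n_dlt = [0] * len(p_dlt)   # DLTs observed per level
    n_pat = [0] * len(p_dlt)   # patients treated per level

    def treat(i: int, n: int) -> None:
        for _ in range(n):
            n_pat[i] += 1
            n_dlt[i] += rng.random() < p_dlt[i]

    i = 0
    while i < len(p_dlt):
        treat(i, 3)
        if n_dlt[i] == 0:
            i += 1                       # escalate
        elif n_dlt[i] == 1:
            treat(i, 3)                  # expand the cohort to six
            if n_dlt[i] <= 1:
                i += 1
            else:
                break                    # stop: dose i exceeds the MTD
        else:
            break
    return i - 1                         # MTD = highest level below the stopping level

true_p = [0.05, 0.10, 0.20, 0.35, 0.55]  # hypothetical dose-toxicity curve
print(simulate_ter(true_p))              # index of the estimated MTD for this trial
```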

2. Random Walk (RW) Designs


A large class of escalation rules is based on the sequential assignment of doses
to one patient after the other. Those rules have their origin in sequential statistical
designs and in stochastic approximation theory. If the dose assignment to the
current patient depends only on the result seen in the previous one, the assignment
process becomes Markovian and performs a random walk on the action space
D. Basically, a patient is assigned to the next higher, the same, or the next lower
dose level with a probability that depends on the previous subject’s response.
RW designs operate mostly on the finite lattice of increasingly ordered dosages
DK ⫽ {x1 ⬍ ⋅ ⋅ ⋅ ⬍ xK }. A Markov chain representation of the random walk on
D is given in Ref. 37. In principle, the MTD is estimated after each patient’s
treatment and the next patient is then treated at that estimated level. All optimality
results of RW designs require that the set D of doses remains unchanged during
the trial. They are simple to implement, essentially nonparametric, and of known
finite and asymptotic distribution behavior. RW designs have been applied to
phase I studies (37,38). Early prototypes in statistical theory were the UaD (39),
proposed originally for explosives testing, and the stochastic approximation
method (SAM) (40). SAM has never been considered seriously for phase I trials.
One reason may be the use of a continuum of dose levels leading to impracticable
differentiation between doses; another reason could be the ambiguity of the adapt-
ing parameter sequence {aj } (but see Ref. 41). The main reason has been stated
already in (10): ‘‘the up and down procedures and the usual overshooting are
not ethically acceptable in an experiment on man.’’ However, the UaD rules were
more recently adapted for medical applications by considering grouped entry,
biased coin randomization, and Bayesian methods (38). Thus, the elementary
UaD has been reintroduced into phase I trials—cited as Storer’s B design (37)—
as a tool to construct more appropriate combination designs.
Elementary UaD. Given patient j has been treated on dose level x( j ) ⫽
xi, the next patient j ⫹ 1 is treated at the next lower level xi⫺1 if a DLT was
observed in patient j, otherwise at the next higher level xi⫹1. Formally,

x( j + 1) = xi+1   if x( j ) = xi and no DLT
x( j + 1) = xi−1   if x( j ) = xi and DLT          (13)

Two modifications of the elementary rule were proposed (37): ‘‘modified


by two UaD’’ or Storer’s C design (UaD-C) and ‘‘modified by three UaD’’ or
Storer’s D design (UaD-D), which is quite similar to the 3 + 3 rule:
UaD-C: Proceed as in UaD but escalate only if two consecutive patients
are without DLT.
UaD-D: Three patients are treated at a new dose level. Escalate if no DLT
and de-escalate if more than one DLT occurs. If exactly one patient
shows a DLT, another three patients are treated at the same level and
the rule is repeated.
For an algorithm see O’Quigley and Chevret (42). The two designs were
combined with the elementary UaD to Storer’s BC and BD two-stage designs:
UaD-BC: Use the UaD until the first toxicity occurs and continue with the
UaD-C at the next lower dose level.
UaD-BD: Use the UaD until the first toxicity occurs and continue with
UaD-D design.
Simulations revealed a superiority of the UaD-BD over the UaD-BC and
the elementary UaD (37). Although the single-stage designs UaD, UaD-B, and
UaD-C were not considered as sufficient and only the two-stage combinations
were proposed for use (37), unfortunately, proposals of new designs were cali-
brated mostly on the one-stage designs instead of the more successful two-stage
designs such that recommendations for practice are hard to deduce from those
investigations. A new sequential RW (38) is the so-called biased coin design
(BCD) applicable for an action space DK = {x1 < ⋅ ⋅ ⋅ < xK}.
Using BCD, given patient j has been treated on dose level x( j ) ⫽ xi , the
next patient j ⫹ 1 is treated at the next lower level xi⫺1 if a DLT was observed
in patient j, otherwise at xi with some probability p# not larger than 0.5 or at xi⫹1
with probability 1 ⫺ p# (not smaller than 0.5). When reaching boundaries x1 and
xK, the procedure must stay there. This design centers the dose allocation unimod-
ally around the MTDθ for any θ, 0 < θ ≤ 0.5, if p# is chosen as θ/(1 − θ) and
if the MTD is in the interior of the dose space DK (38). Interestingly, the 33%
percentile, MTD1/3, is obtained with a nonbiased coin of probability p# ⫽ (1/3)/
(2/3) ⫽ 0.5.
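
A minimal sketch of such a biased-coin allocation for the common target θ = 1/3, for which the coin is fair, is given below; the true DLT probabilities and the number of patients are hypothetical.

```python
# Sketch of a biased coin up-and-down allocation for the target theta = 1/3, for
# which p# = theta / (1 - theta) = 0.5, i.e., a fair coin: after a DLT the dose is
# lowered one level; after a non-DLT it is raised one level or repeated with
# probability 0.5 each, staying within the boundaries of D_K.
import random

def bcd_allocation(p_dlt: list[float], n_patients: int = 24, seed: int = 1) -> list[int]:
    """Return the sequence of dose-level indices assigned to successive patients."""
    rng = random.Random(seed)
    i, path = 0, []
    for _ in range(n_patients):
        path.append(i)
        if rng.random() < p_dlt[i]:            # DLT observed
            i = max(i - 1, 0)                  # de-escalate, stay within D_K
        elif rng.random() < 0.5:               # fair coin for theta = 1/3
            i = min(i + 1, len(p_dlt) - 1)     # escalate by one level
    return path

true_p = [0.05, 0.10, 0.20, 0.35, 0.55]        # hypothetical dose-toxicity curve
print(bcd_allocation(true_p))
```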

3. Continual Reassessment Method


To reduce the number of patients treated with possibly ineffective doses by the
TER/STER or the RW type designs, a Bayesian based dose escalation rule was

introduced by O’Quigley et al. (23). A starting dose is selected using a prior


distribution of the MTD, and this distribution is updated with each patient’s ob-
servation with regard to the absence or presence of DLT. Each patient is treated
at the dose level closest to the currently estimated MTD. We describe the CRM
next and postpone a straightforward Bayesian design to the next subsection. As-
sume a finite action space DK ⫽ {x1 ⬍ ⋅ ⋅ ⋅ ⬍ xK } of dose levels, a fixed sample
size N and a one-parameter dose-toxicity function ψ(x, a) as defined in Eq. (1)
depending on the model parameter a. Estimation of the MTD is therefore equiva-
lent to the estimation of a. Assume a unique solution a0 at the MTDθ:

ψ(MTD, a0) = θ     (14)
Let Yj denote the dichotomous response variable of the jth patient, j ⫽ 1, . . . ,
N, and summarize the sequentially collected dose-toxicity information up to the
( j − 1)th patient by Ωj = { y1, . . . , yj−1}. The information on the parameter a is
described by its density function f (a, Ωj ), the posterior density distribution of
the parameter a given Yi ⫽ yi, i ⫽ 1, . . . , j ⫺ 1. In the following, a is assumed
as a scalar a ⬎ 0 and f normalized by

∫0^∞ f (a, Ωj ) da = 1,   j ≥ 1.     (15)

Using CRM, given the previous information Ωj ⫽ { y1, . . . , yj⫺1}, the next
dose level is determined such that it is closest to the current estimate of an MTDθ.
For this, the probability of a toxic response is calculated for each xi ε DK given
Ωj ⫽ { y1, . . . , yj⫺1} as

P(xi, a) = ∫0^∞ ψ(xi, a) f (a, Ωj ) da = θij     (16)

and the dose level x ⫽ x( j ) for the jth patient is selected from DK such that the
distance of P(x, a) to the target toxicity rate θ becomes minimal: x( j ) = xi if
|θ − θij| is minimal.
After observing the toxicity Yj at dose level x( j ), the posterior density of
the parameter a is obtained from the prior density f (a, Ωj ) and the likelihood of
the jth observation
L( yj , x( j ), a) = ψ(x( j ), a)^yj [1 − ψ(x( j ), a)]^(1−yj)     (17)

using Bayes theorem as


f (a, Ωj+1) = L( yj , x( j ), a) f (a, Ωj ) / ∫0^∞ L( yj , x( j ), u) f (u, Ωj ) du     (18)

The CRM starts with an a priori density g(a). The MTD is then estimated as
x(N + 1), the dose that would be assigned to the (N + 1)st patient. Consistency in the sense
that the recommended dose converges to the target level was shown even under

model misspecification (43). There are cases where it converges to a close but
not the closest level. The treatment of batches of patients per dose level has also
been proposed (23).
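
The CRM update in Eqs. (15)–(18) can be sketched numerically as follows. The sketch replaces the working models listed earlier by the simple one-parameter ‘‘power’’ model ψ(d, a) = d^a on standardized dose labels, uses a unit-exponential prior for a, and approximates the integrals on a grid; the skeleton values, prior, target θ, and the three hypothetical observations are all assumptions for illustration.

```python
# Sketch of the continual reassessment method (Eqs. 15-18) with a one-parameter
# power working model psi(d, a) = d**a on standardized dose labels d in (0, 1),
# a unit-exponential prior for a > 0, and a grid approximation of the integrals.
import numpy as np

skeleton = np.array([0.05, 0.10, 0.20, 0.33, 0.50])   # working DLT probabilities
theta = 1.0 / 3.0                                     # target toxicity rate

a_grid = np.linspace(0.01, 5.0, 2000)                 # grid over the parameter a
da = a_grid[1] - a_grid[0]
prior = np.exp(-a_grid)                               # unit-exponential prior g(a)
prior /= prior.sum() * da                             # normalized as in Eq. (15)

def psi(d, a):
    return d ** a                                     # monotone in d for a > 0

def next_dose(posterior):
    """Dose whose posterior expected DLT probability (Eq. 16) is closest to theta."""
    theta_i = np.array([(psi(d, a_grid) * posterior).sum() * da for d in skeleton])
    return int(np.argmin(np.abs(theta_i - theta)))

def update(posterior, dose_index, y):
    """Bayes update (Eqs. 17-18) after observing toxicity y at the given dose."""
    p = psi(skeleton[dose_index], a_grid)
    likelihood = p ** y * (1.0 - p) ** (1 - y)
    post = likelihood * posterior
    return post / (post.sum() * da)

posterior = prior.copy()
for dose, y in [(0, 0), (1, 0), (2, 1)]:              # three hypothetical observations
    posterior = update(posterior, dose, y)
print(next_dose(posterior))                           # level recommended for the next patient
```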

4. Modifications of the CRM


The CRM was criticized mainly from three aspects: choice of the starting dose
x(1) according to a prior g(a) that could result in a dose level in the middle and
not in the lower dose region, allowance to jump over a larger part of the dose
region and skipping intermediate levels, and lengthening the trial because of
allowing treatment and examination of toxicity only for one patient after the
other. The ethical argument of risking too high toxicity and the practical argument
of undue duration elicited modifications that try to reduce that risk and keep at
the same time the benefit of reaching the MTD with a smaller number of dose
levels and treating more patients at effective doses. Modifications of the CRM
were obtained (44–48) through restrictions on choosing x1 as starting dose, not
skipping consecutive dose levels and allowing groups of patients at one dose
level. From these proposals evolved a design sketched in Ref. 49, relying on a sugges-
tion of Faries (45).

Using modified CRM, start with one patient at dose level x1 and apply the
CRM. Given patient j-1 has been treated on dose level x( j ⫺ 1) ⫽ xi and informa-
tion Ωj ⫽ { y1, . . . , yj⫺1} predicts for the next patient j the dose level xCRM ( j),
then the next dose level x( j ) is chosen as follows:

x( j ) = xCRM( j )   if xCRM( j ) ≤ x( j − 1)
x( j ) = xi+1        if xCRM( j ) > x( j − 1) and yj−1 = 0 (no DLT)          (19)
x( j ) = xi          if xCRM( j ) > x( j − 1) and yj−1 = 1 (DLT)

The main restriction in Eq. (19) is the start at x1 and the nonskipping. Further
modifications demonstrate the efforts of reducing anticonservatism in the CRM:

1. Modified CRM but stay with x( j ) always one dose level below xCRM ( j)
(version 1 in 45).
2. Modified CRM but x( j ) is not allowed to exceed the MTD estimate
based on Ωj (version 2 in 45).
3. Modified CRM but use the starting level of the CRM and enter there
three patients (version 4 in 45).
4. CRM but escalation (i.e., xCRM ( j) ⬎ x( j ⫺ 1)) is restricted within DK
⫽ {x1 ⬍ ⋅ ⋅ ⋅ ⬍ xK } to one step only (restricted version in 46).
5. Modified CRM but stopping if the next level has been already visited
by a predetermined number of patients (e.g., six) (44). Korn et al. (44)
introduce additionally a dose level x0 ⬍ x1 for an estimate of the MTD

if a formally proposed MTD equal to x1 would exhibit an unacceptable


toxicity.
6. Modified CRM run in three variants of one, two, or three patients per
cohort (48). This version is identical to 5 except that a dose level cannot
be passed after a DLT was observed.
7. CRM but allow more than one patient at one step at a dose level and
limit escalation to one step (47). A simulation study showed a reduction
of trial duration (50–70%) and reduction of toxicity events (20–35%)
compared with the CRM.

Another approach to modify the CRM consisted in starting with the UaD
until the first toxicity occurs and then switching to the CRM using all information
obtained so far (46). Implicitly, this extended CRM starts at x1 and escalates not
by more than one step until the first toxicity occurs, but it may cover more doses
than the CRM. Further modifications of this type with two and three patients per
cohort were investigated (46,48).
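
The restriction in Eq. (19) itself is easy to express in code; the sketch below assumes that dose levels are addressed by their index in DK and that the CRM recommendation is supplied from elsewhere.

```python
# Sketch of the restriction in Eq. (19) applied to a CRM recommendation: never
# skip a level upward and do not escalate directly after a DLT.  Dose levels are
# referred to by their index in the ordered set D_K.

def restricted_next_level(crm_level: int, previous_level: int, previous_dlt: bool) -> int:
    """Return the dose-level index actually assigned to the next patient."""
    if crm_level <= previous_level:
        return crm_level                  # de-escalation or repetition as proposed
    if previous_dlt:
        return previous_level             # no escalation right after a DLT
    return previous_level + 1             # escalate, but by one level only

print(restricted_next_level(crm_level=3, previous_level=1, previous_dlt=False))  # 2
print(restricted_next_level(crm_level=3, previous_level=1, previous_dlt=True))   # 1
```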
Modifications Using Toxicity of CTC Grade 2 (Secondary Grade Toxicity).
A substantial modification of previously discussed phase I designs was the sugges-
tion to use toxicity information beyond that defined for the determination of
the DLT (44,45,48). It was proposed to account for the observation of grade 2
toxicity (secondary toxicity) that does not contribute normally to the DLT crite-
rion. The CRM was modified such that two patients were treated on the level
where the first patient had exhibited secondary toxicity (version 3 in 45). Refer-
ring to this and to a previous proposal (44), Ahn (48) implemented a so-called
secondary grade design similar to the UaD-BD design: The elementary UaD is
used with two patients per cohort. If one patient of the two shows a DLT or if
there is a second case of grade 2 toxicity, dose escalation continues with the
standard STER design.

5. Bayesian Designs
Bayesian methods are attractive for phase I designs because they can be applied
even if little data are present but prior information is available. Furthermore,
Bayesian methods are excellently adapted to decision making, and they tend to
allocate patients at higher doses after nontoxic and at lower doses after toxic
responses. The accumulation of information and the possibility of real-time
decisions make Bayesian designs best suited when patients are treated one
at a time and when toxicity information is available quickly. A full Bayesian
design (50) is set up by the data model ψ(x, a), the prior distribution g(a) of the
model parameter a, and a continuous set of actions D. A gain function (negative
loss) G is needed to characterize the gain of information if an action is taken
given the true (unknown) value of a or equivalently of the MTD. Whitehead and

Brunier (50) use the precision of the MTD estimate obtained from the next patient
as gain function. The number of patients is fixed in advance. Given the responses
Yj ⫽ yj , the Bayes rule determines the posterior density g (a| yj ) of the parameter
a as
g(a | yj ) = ψ( yj , a)g(a) / ∫ ψ( yj , a)g(a) da     (20)
A new dose level is then selected by maximizing the posterior expected gain
E[G(a)| yj ] (21)
Given a, the likelihood of the data Ωj for the first j⫺1 patients is then a product
of the terms (17) used already for the CRM
fj−1(Ωj , a) = ∏_{s=1}^{ j−1} ψ(x(s), a)^ys [1 − ψ(x(s), a)]^(1−ys)     (22)

The next dose x( j ) is then determined such that the posterior expected gain is
maximized with respect to x( j ). For details see Ref. 50.
Gatsonis and Greenhouse (51) used a Bayesian method to estimate directly
the dose–toxicity function and the MTDθ in place of the parameter a. The proba-
bility α ⫽ P(Dose ⬎ MTDθ) of overdosing a patient was used as target parameter
for Bayesian estimation in the so-called escalation with overdose control
(EWOC) method (41).
Using EWOC, given the previous results Ωj of the first j − 1 patients, one obtains the posterior cumulative distribution of the MTD as πj(x) = P(MTD ≤ x | Ωj), which is the conditional probability of overdosing the jth patient with dose x, given the current data Ωj. The criterion for determining the dose level for the jth patient is πj(x(j)) ≤ α. The posterior density of {ψ(x1, a), MTDθ} given Ωj is obtained from the prior distribution of a, and from it the marginal posterior density π(MTD | Ωj). The dose level x(j) is then calculated as x(j) = πj⁻¹(α).
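A minimal sketch of the overdose-control step is given below, assuming that a posterior sample for the MTD is already available from whatever dose–toxicity model is used; the dose values, the posterior sample, and α are hypothetical.

```python
import numpy as np

def ewoc_next_dose(mtd_posterior_sample, dose_levels, alpha=0.25):
    """Discretized overdose-control selection.

    The posterior probability of overdosing at dose x is pi(x) = P(MTD <= x | data);
    the next dose is the highest available dose with pi(x) <= alpha, a discretized
    version of x(j) = pi^{-1}(alpha).
    """
    pi = np.array([np.mean(mtd_posterior_sample <= x) for x in dose_levels])
    admissible = np.where(pi <= alpha)[0]
    return dose_levels[admissible[-1]] if admissible.size else dose_levels[0]

# Hypothetical posterior sample for the MTD (mg/m2) and candidate doses:
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=np.log(120), sigma=0.3, size=5000)
print(ewoc_next_dose(sample, dose_levels=np.array([40, 60, 90, 120, 160])))
```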

C. Sample Size per Dose Level


The number of patients to be treated per dose level was often implicitly determined in the previously described designs. Statistical methods for determining optimal sample sizes per dose level appear to be lacking. Recommendations vary between one and eight patients per dose level; other suggestions were three to six patients per lower dose level (9) or a minimum of three per dose level and a minimum of five near the MTD (7). Calculations of sample sizes separately from the sequential design can be based on the binomial distribution and hypothesis testing of toxicity rates (52). This gives some quantitative aid in planning the sample size per selected
dose level. Using the two probabilities PAT and PUAT of acceptable toxicity (AT)
and unacceptable toxicity (UAT), tables for a fixed low PAT ⫽ 0.05 and a few
PUAT values were given by Rademaker (52) with a straightforward algorithm.
Characteristically, the sample size and the error rates increase rapidly if the two
probabilities approach each other. If PAT ⫽ 0.05, a sample size of n ⱕ 10 per
dose level is only achieved if PUAT ⱖ 0.2. If n ⫽ 6, the probability of escalating
becomes 0.91 if the toxicity rate equals PAT ⫽ 0.05 and the probability of nonesca-
lating becomes 0.94 if the toxicity rate equals PUAT ⫽ 0.4. The latter probability
decreases from 0.94 to 0.90, 0.83 and 0.69 if n decreases from 6 to 5, 4 and 3,
respectively. If PUAT ⫽ 0.3, the probability of nonescalating decreases from 0.85
to 0.77, 0.66 and 0.51 when n decreases from 6 to 5, 4 and 3. Given the toxicity
rate p, one may determine a sample size by considering the probability POT ( p)
of overlooking this toxicity by the simple formula POT ( p) ⫽ (1 ⫺ p)n (53,54).
Given p = 0.33, POT( p) takes the values 0.09, 0.06, 0.04, 0.03, 0.02, 0.012, 0.008
for n ⫽ 6, 7, 8, 9, 10, 11, 12.
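These quantities are easily reproduced. The sketch below evaluates POT(p) = (1 − p)^n and a generic binomial tail probability; the exact escalation rule behind the tabled error rates of Ref. 52 is not reproduced here.

```python
from math import comb

def prob_overlooking_toxicity(p, n):
    """POT(p) = (1 - p)**n: probability that none of n patients shows a toxicity
    occurring with rate p (53,54)."""
    return (1.0 - p) ** n

def prob_at_most(k, n, p):
    """Binomial probability of at most k toxicities among n patients; tail
    probabilities of this type underlie sample-size considerations such as
    those of Ref. 52."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

for n in range(6, 13):
    # prints values from about 0.09 down to 0.008, matching those quoted above
    print(n, round(prob_overlooking_toxicity(0.33, n), 3))
```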

D. Validation and Comparison of Phase I Designs


Simulation studies were performed to compare new designs with the standard
designs TER/STER or to compare the CRM with its modifications
(23,37,38,42,44,45). Unfortunately, the results are based on different simulation
designs and different criteria of assessment such that the comparison is rather
limited. The criteria of the comparisons were
1. Distribution of the MTD on the dose levels,
2. Distribution of the occurrence of the dose levels,
3. Average number of patients,
4. Average number of toxicities,
5. Toxicity probability (percentage treated above the MTD),
6. Percentage treated at the MTD or one level below,
7. Average number of steps (cohorts treated on successive dose levels),
8. Percentage of correct MTD estimation.
Roughly summarizing the findings of those studies and without going into
details, it appears that the TER is inferior to the UaD-BC and the UaD-BD (37)
in terms of the fraction of successful trials. The UaD-BD is superior to UaD-BC
at least in some situations. The BCD design from the RW class and the CRM
performed very similarly (38). By examining the distribution of the occurrence of
toxic dose levels, percentage of correct estimation, or percentage treated at the
MTD or one level below (23), the single-stage UaD designs were inferior to the
CRM (23); the UaD recommended toxic levels much more often than the CRM.
STER designs were inferior to some modified CRMs with respect to the percent-
age of correct estimation, the percentage treated at the MTD or one level below,
and the percentage of patients treated with very low levels (44,45). The CRM
and some modified CRMs needed on average more steps than the STER, and they showed a tendency to treat more patients at levels higher than the MTD.
On the other hand, the STER recommended more dose levels lower than the true
MTD. The STER provided no efficient estimate of the MTD and did not stop at
the specified percentile of the dose–response function (55).

IV. CONDUCT OF THE TRIAL


A. Standard Requirements
Before a phase I trial is initiated, the following design characteristics should be
checked and defined in the study protocol:
1. Starting dose x1,
2. Dose levels D,
3. Prior information on the MTD,
4. Dose–Toxicity model,
5. Escalation rule,
6. Sample size per dose level,
7. Stopping rule,
8. Rule for completion of the sample size when stopping.
A simulation study is recommended before the clinical start of the trial. Examples
are found among the studies mentioned in Section II.D, and software has recently become available from Simon et al. (56).

B. Dose Titration Approach


Patients are treated in a phase I trial in oncology mostly with the therapeutic aim
of palliation and relief; the goal of cure would be unrealistic in most cases, but
achievement of a partial remission of the tumor remains a realistic motivation.
The assessment of toxicity can therefore be based on only a few treatment cycles,
mostly two. If treatment with the experimental drug is without complication and
if at least a status quo of the disease is retained, treatment normally continues.
The study protocol usually makes provisions for individual dose reductions in
case of non-DLT toxicity. From the same rationale by which the dose is decreased
in the case of toxicity, it should be allowed to increase in the case of nontoxicity
and good tolerance. Individual dose adjustment has been discussed repeatedly as
a possible extension of the design of a phase I trial (6,11,12,56). Intraindividual
dose escalation was proposed (6,11) if a sufficient time has elapsed after the last
treatment course such that any existing or lately occurring toxicity could have
been observed before an elevated dose is applied. It was recommended that pa-
tients escalated to a certain level are accompanied by ‘‘fresh’’ patients at that
level to allow the assessment of cumulative toxic effects or accumulating toler-
ance. Therefore, all patients should be at least 3–4 weeks on their initially scheduled dose before an intraindividual escalation is performed. When planning intra-
individual dose escalation, one should weigh the advantage of dose increase and
faster escalation in the patient population against the risks of cumulative toxicity
in the individual patients. Further consideration should be given to the develop-
ment of tolerance in some patients and the risk of then treating new patients at
too high levels (12). In Ref. 56 the STER was modified into a dose titration design. The end point was defined as a categorical variable with four levels: accept-
able toxicity (grade ⱕ 1), conditionally acceptable toxicity (grade ⫽ 2), DLT
(grade ⫽ 3), unacceptable toxicity (grade ⫽ 4). Two intraindividual strate-
gies were considered: One uses no intraindividual dose escalation and only de-
escalation by one level per course in case of DLT or unacceptable toxicity, the
other uses escalation per course by one level as long as no DLT or unacceptable
toxicity occurs and de-escalation as in the first case. Three new designs were for-
mulated by using one of these two options given an action space D or DK:

Speed-up Design: Escalate the dose after each patient by one level as long
as no DLT or unacceptable toxicity occurs in the first cycle and at most
one patient shows grade 2 toxicity in the first cycle. If a DLT occurs in
the first cycle or if grade 2 has occurred twice in the first cycle, switch
to the STER with the second intraindividual strategy.
Accelerated Speed-up Design: Same as the Speed-up Design except that
doubling dose escalation is used in the first stage before switching to
STER, using again the second intraindividual strategy.
Modified Accelerated Speed-up Design: Same as the Accelerated Speed-
up Design but no restrictions on the escalation with respect to the cycle.
If one DLT occurred in any cycle or if grade 2 toxicity occurred twice
in any cycle, switch to the STER (second intraindividual strategy).
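As an illustration, the accelerated stage of the Speed-up Design might be encoded as below; the grade coding (DLT = grade 3, unacceptable toxicity = grade 4) follows the end point definition above, while the function interface is an assumption rather than code from Ref. 56.

```python
def speed_up_action(first_cycle_grades):
    """One decision of the Speed-up Design stage.

    first_cycle_grades -- worst first-cycle toxicity grade (0-4) of each patient
                          treated so far in the accelerated stage.

    Returns 'escalate' while no DLT (grade 3) or unacceptable toxicity (grade 4)
    has occurred and grade 2 toxicity has been seen at most once; otherwise
    returns 'switch_to_STER' (with the second intraindividual strategy).
    """
    n_grade2 = sum(1 for g in first_cycle_grades if g == 2)
    if any(g >= 3 for g in first_cycle_grades) or n_grade2 >= 2:
        return "switch_to_STER"
    return "escalate"

print(speed_up_action([0, 1, 2]))      # escalate
print(speed_up_action([0, 1, 2, 2]))   # switch_to_STER
```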

C. Toxicity–Response Approach
A design involving both dose finding and evaluation of safety and efficacy in
one early phase I/II trial was proposed by Thall and Russell (57). The goal was
to define a dose satisfying both safety and efficacy requirements, to stop early
when it was likely that no such dose could be found, and to continue if there
was chance enough to find one. The number of patients should be large enough
to estimate reliably both the toxicity rate and the response rate at the selected
dose. Two toxicity outcomes (absence or presence of toxicity) and three efficacy-
related outcomes (no effect, desired effect, undesired effect) are considered. Tox-
icity and efficacy outcomes were combined into a comprehensive ternary out-
come variable Y with the values Y ⫽ 0 if no effect and no toxicity, Y ⫽ 1 if
desired effect and no toxicity, and Y = 2 if undesired effect and toxicity occurred. This results in three dose-dependent outcome probabilities ψj(d) = P(Y = j | Dose = d), j = 0, 1, 2, and in a two-dimensional dose–effect end point ψ(d) = (ψ1(d), ψ2(d)). The goal is to find a dose d such that the effect-and-no-toxicity probability is reasonably large, e.g., ψ1(d) ≥ θ1* (e.g., 0.50), and the undesired-effect-and-toxicity probability is limited, ψ2(d) ≤ θ2* (e.g., 0.33). The combined dose–response relationship is parameterized as γj(d) = P(Y ≥ j | Dose = d) for j = 0, 1, 2. As a dose–response model for (γ1, γ2), the proportional odds regression model and the cumulative odds model are considered (58), and a strategy is developed to find within DK the dose d* that satisfies both criteria.
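A small numerical sketch may help fix ideas: under a cumulative (proportional) odds model for the ternary end point, the two requirements carve out a window of acceptable doses. The intercepts, slope, and dose values below are invented for illustration and are not those of Thall and Russell.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def outcome_probabilities(d, mu1, mu2, beta):
    """Cumulative odds model: gamma_j(d) = P(Y >= j | dose d) = expit(mu_j + beta*d),
    so psi_1(d) = gamma_1(d) - gamma_2(d) and psi_2(d) = gamma_2(d)."""
    g1, g2 = expit(mu1 + beta * d), expit(mu2 + beta * d)
    return g1 - g2, g2

def acceptable(d, mu1, mu2, beta, theta1=0.50, theta2=0.33):
    """Check psi_1(d) >= theta1* and psi_2(d) <= theta2* for a candidate dose d."""
    p1, p2 = outcome_probabilities(d, mu1, mu2, beta)
    return p1 >= theta1 and p2 <= theta2

for d in [1, 2, 3, 4, 5]:
    print(d, acceptable(d, mu1=-1.0, mu2=-4.0, beta=0.7))   # True only for d = 2, 3, 4
```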

V. EVALUATION OF PHASE I DATA

The statistical evaluation of the data resulting from a phase I trial has two primary
objectives: the estimation of the MTD and the characterization of the DLT. The
MTD is obtained through dose–toxicity modeling. Evaluation of the DLT has to
account for dose and time under treatment and should use pharmacokinetic mod-
eling. Besides these objectives, phase I data have to be presented in full detail
using transparent descriptive methods.

A. Descriptive Statistical Analysis


All results obtained in a phase I trial have to be reported in a descriptive statistical
analysis that accounts for the dose levels. This is somewhat cumbersome because each dose level has to be described as a separate stratum. The evaluation of the response can usually be restricted to a case-by-case description of all those patients who exhibit a partial or complete response. Patients with stable disease for a longer period or patients with a minor improvement not sufficient for a partial response may also be described and emphasized. A comprehensive and transpar-
ent report of all toxicities observed in a phase I trial is an absolute must for both
the producer’s (drug developer) and the consumer’s (patient) risks and benefits.
Absolute and relative frequencies (related to the number of patients evaluable
for safety) are reported for all toxicities of the CTC list by distinguishing the
grading and the assessment of the relation to treatment. Table 3 provides an exam-
ple. It exhibits the complete toxicity observed, the assessment of the relation to
treatment, and two subtables that summarize for patients exhibiting toxicity the
ADRs and DLTs defined in Section II.C. The individual toxicity burden of each patient should be described separately in individual case descriptions, possibly supported by modern graphical methods such as linked scatterplots for multivariate data. Multiple DLTs in some patients can be presented in this way (see, e.g., Benner et al. (59)).

Table 3 Descriptive Evaluation of Phase I Toxicity Data

A. Absolute and Relative Frequencies of Toxicity

CTC item    n       Grade 0   Grade 1   Grade 2   Grade 3   Grade 4   Total with grade > 0
Vomiting    6       2         1         1         2         0         4
            100%    33        17        17        33        0         67

B. Assessment of the Relation of the Toxicity to the Treatment by Case Type Listing

Grade 0   Grade 1    Grade 2    Grade 3                Grade 4
—         Probable   Possible   Unprobable, possible   —

C. Summary Table of ADRs

n    No ADR    ADR
6    1         3

D. Summary Table of DLTs

n    No DLT    DLT
6    3         1
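Frequency tables of this kind are easily produced from per-patient worst grades. The sketch below uses hypothetical patient data chosen to mirror the vomiting row of the example; it is one possible way to tabulate such data, not a prescribed format.

```python
import pandas as pd

# Hypothetical worst grades of one CTC item for 6 evaluable patients.
grades = pd.Series([0, 0, 1, 2, 3, 3], name="Vomiting")

abs_freq = grades.value_counts().reindex(range(5), fill_value=0)
rel_freq = (100 * abs_freq / len(grades)).round(0)

summary = pd.DataFrame({"n": abs_freq, "%": rel_freq})
summary.loc["grade > 0"] = [int(abs_freq.iloc[1:].sum()),
                            round(100 * abs_freq.iloc[1:].sum() / len(grades))]
print(summary)
```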

B. MTD Estimation
The estimation of the MTD has been part of most search designs from Section III, and an estimate MTD̂ often resulted directly from the stopping criterion. In the TER, the MTD̂ is by definition the dose level next lower to the unacceptable dose level at which a predefined proportion of patients (e.g., 33%) experienced DLT.
An estimate of a standard error of the toxicity rate at the chosen dose level is
impaired by the small number of cases (ⱕ6) and also by the design. A general
method for analyzing dose–toxicity data is the logistic regression of the Yj on
the actually applied doses x( j ), j ⫽ 1, . . ., n, of all patients treated in the trial.
This would disregard any dependency of the dose–toxicity data on the design
that had created the data and may therefore be biased. If the sampling is forced
to choose doses below the true MTD, MTD̂ may be biased toward lower values.
The logistic regression takes all observed data ( yj , x( j ), j ⫽ 1, . . ., n) without
prejudice assuming that they are independently sampled from the patient popula-
tion and that the toxic responses per dose are identically distributed. The logistic
model for quantal response is therefore given by
P(Yj ⫽ 1 | x( j )) ⫽ {1 ⫹ exp[⫺(a0 ⫹ a1 x( j ))]}⫺1 (23)
Standard logistic regression provides the maximum likelihood estimate of a = (a0, a1) (22,60,61). MTDθ is estimated as the θ percentile of a tolerance distribution

MTD̂θ = [logit(θ) − â0]/â1 = [−ln 2 − â0]/â1     (24)

if, e.g., θ = 0.33. The large sample variance of MTD̂θ is given by

V = [Va0 + 2 MTD̂θ Va0a1 + MTD̂θ² Va1]/â1²     (25)
where (Va0, Va0a1, Va1) denotes the asymptotic variance–covariance matrix of the
model parameter vector. Confidence limits can be obtained by the delta method,
Fieller’s theorem, or the likelihood ratio test (37). For the dose–toxicity model

ψ(x, a) = F((x − a0)/a1)     (26)

MTD̂θ = â1 F⁻¹(θ) + â0.     (27)
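A self-contained sketch of this estimation step is given below: a Newton-Raphson maximum likelihood fit of the logistic model (23), followed by the point estimate (24) and the delta-method variance (25). The doses and outcomes are hypothetical, and, as noted above, the fit ignores the dependence induced by the sequential design.

```python
import numpy as np

def fit_logistic(x, y, n_iter=50):
    """Maximum likelihood fit of P(Y = 1 | x) = 1/(1 + exp(-(a0 + a1*x))).
    Returns the estimate (a0, a1) and its asymptotic covariance matrix."""
    X = np.column_stack([np.ones_like(x, dtype=float), np.asarray(x, dtype=float)])
    a = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ a))
        info = X.T @ (X * (p * (1.0 - p))[:, None])      # observed information
        a = a + np.linalg.solve(info, X.T @ (y - p))     # Newton-Raphson step
    p = 1.0 / (1.0 + np.exp(-X @ a))
    cov = np.linalg.inv(X.T @ (X * (p * (1.0 - p))[:, None]))
    return a, cov

def mtd_estimate(a, cov, theta=0.33):
    """MTD_theta = (logit(theta) - a0)/a1 with its delta-method variance, Eqs. (24)-(25)."""
    a0, a1 = a
    mtd = (np.log(theta / (1.0 - theta)) - a0) / a1
    var = (cov[0, 0] + 2 * mtd * cov[0, 1] + mtd**2 * cov[1, 1]) / a1**2
    return mtd, var

# Hypothetical applied doses (mg/m2) and observed DLT indicators:
x = np.array([10, 10, 10, 20, 20, 20, 35, 35, 35, 50, 50, 50], dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1], dtype=float)
a_hat, cov_hat = fit_logistic(x, y)
print(mtd_estimate(a_hat, cov_hat, theta=0.33))
```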

C. Pharmacokinetic Phase I Data Analysis


An often neglected but important secondary objective of a phase I trial is the
assessment of the distribution and elimination of the drug in the body. Specific
parameters that describe the pharmacokinetics of the drug are the absorption and
the elimination rate, the drug half-life, the peak concentration, and the area under
the time–drug concentration curve (AUC). Drug concentration measurements
ci (tr) of patient i at time tr are usually obtained from blood samples (additionally
also from urine samples) taken regularly during medication and are analyzed
using pharmacokinetic models (62). One- and two-compartment models have
been used to estimate the pharmacokinetic characteristics, often in a two-step approach: first for the individual kinetics of each patient and then for the patient population using population kinetic models. In practice, statistical methodology for pharmacokinetic data analysis is primarily based on nonlinear curve-fitting using least-squares methods or their extensions (63).
Subsequent to the criticism of the traditional methods for requiring too
large a number of steps before reaching the MTD, pharmacokinetic information
to reduce this number of steps was suggested (24,64). Therefore, a pharmacoki-
netically guided dose escalation (PGDE) was proposed (24,65) based on the
equivalence of drug blood levels in mice and humans and on the pharmacody-
namic hypothesis that equal toxicity is caused by equal drug plasma levels. It
postulates that the DLT is determined by plasma drug concentrations and that
AUC is a measure that holds across species (64). The AUC calculated at the
MTD for humans was found to be approximately equal to the AUC for mice when calculated
at the LD10 (in mg/m2 equivalents, MELD10). Therefore, AUC(LD10, mouse) was
considered as a target AUC, and a ratio

F = AUC(LD10, mouse) / AUC(Starting dose, human)     (28)

was used to define a range of dose escalation. One tenth of MELD10 is usually
taken as the starting dose x1. Two variants have been proposed. In the square
root method, the first step from the starting dose x1 to the next dose x2 is equal
to the geometric mean between x1 and the target dose x1 F, i.e., x2 ⫽ √x1 Fx1 ⫽
x1 √F. Subsequent dose escalation continues with the MFDE. In the extended
factors-of-two method, the first steps are achieved by a doubling dose scheme
as long as 40% of F has not been attained. Then one continues with the MFDE.
Potential problems and pitfalls were discussed (66). Although usage and further evaluation were encouraged and guidelines for its conduct were proposed, the PGDE has not often been used in practice since then. For a more recent discussion
and appraisal, see Newell (67).
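As a numerical illustration of the square root variant, the sketch below computes the first PGDE step x2 = x1·√F and then continues with modified Fibonacci-type factors; the factor sequence shown is merely illustrative, not a fixed standard.

```python
import math

def pgde_square_root_steps(x1, F, mfde_factors=(2.0, 1.67, 1.5, 1.4, 1.33)):
    """First step to the geometric mean of x1 and the target dose x1*F,
    i.e., x2 = x1*sqrt(F); subsequent escalation by the given factors."""
    doses = [x1, x1 * math.sqrt(F)]
    for f in mfde_factors:
        doses.append(doses[-1] * f)
    return doses

# Example: starting dose 10 mg/m2 and AUC ratio F = 16 gives x2 = 40 mg/m2.
print([round(d, 1) for d in pgde_square_root_steps(10.0, 16.0)])
```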

VI. REGULATIONS BY GOOD CLINICAL PRACTICE AND INTERNATIONAL CONFERENCE ON HARMONIZATION (ICH)

According to ICH Harmonized Tripartite Guideline, General Consideration for
Clinical Trials, Recommended for Adoption at Step 4 on 17 July 1997 (16), a
phase I trial is most typically a study on human pharmacology ‘‘of the initial
administration of an investigational new drug into humans.’’ It is considered as
having nontherapeutic objectives ‘‘to determine the tolerability of the dose range
expected to be needed for later clinical trials and to determine the nature of ad-
verse reactions that can be expected.’’ The definition of pharmacokinetics and
pharmacodynamics is seen as a major issue and the study of activity or potential
therapeutic benefit as a preliminary and secondary objective.
Rules for the planning and conduct of phase I trials are specifically ad-
dressed by the European Agency for the Evaluation of Medicinal Products and
its Note for Guidance on Evaluation of Anticancer Medicinal (18 March 1997)
CPMP/EWP/205/25. Primary objectives are

1. The determination of the MTD,
2. The characterization of frequent ‘‘side effects’’ of the agent and their
dose–response parameters,
3. The determination of relevant main pharmacokinetic parameters.
Phase I Trials 27

Single intravenous dosing every 3–4 weeks is recommended if nothing else is suggested. The starting dose is not fixed. Dose escalation is oriented toward the MFDE
or the PGDE scheme. A minimum of two cycles at the same dose level is pre-
ferred. The number of patients per dose level varies around n ⫽ 3 with an increase
to six in case of overt toxicity.
General guidelines for obtaining and using dose–response information for drug registration have been given in the ICH Tripartite Guideline, Dose-Response Information to Support Drug Registration, recommended for adoption at Step 4 on 10 March 1994, which covers to some extent studies with patients suffering from life-threatening diseases such as cancer. Based on these regulations and extending
earlier suggestions (6,7), a Phase I Trial Protocol may be organized as follows:

Objectives and Preclinical Background
Clinical Background
Eligibility and In/Exclusion of Patients
Treatment
Dose-Limiting Toxicity and MTD
Dose Levels and Dose Escalation Design
Number of Patients per Dose Level
End Points and Longitudinal Observations
Toxicity, Response
Termination of the Study
References
Appendices on Case Report Forms
Important Definitions
Common Toxicity Criteria
Informed Consent Form(s)
Declaration of Helsinki

VII. DISCUSSION AND OPEN PROBLEMS

Phase I trials are by their primary goal dose-finding studies, but they are per-
formed under the twofold paradigm of the existence of monotone dose–toxicity
and dose–benefit relationships. The standard paradigm has been as follows: Start
at a low dose that is likely to be safe and treat small cohorts of patients at progres-
sively higher doses until drug-related toxicity reaches some predefined level of
maximum toxicity or until unexpected and unacceptable toxicity occurs. The in-
herent ethical problem that a treatment is given for which risks and benefits are
unknown and that presumably is applied at a suboptimal or inactive dose has to
be accounted for by a design that treats each patient when entering the trial at
the maximal dose known to be safe at this time. Phase I trial designs have arisen
rather empirically, without strong statistical foundation and with limited estimating precision (13). The rather opaque appearance of the Fibonacci scheme in phase
I research seems to be symptomatic of this situation.
The two main constituents of a phase I trial are the action space of the
dose levels, including the choice of the starting dose, and the dose escalation
scheme. Different from dose finding in bioassays, phase I trials are smaller in
sample sizes, take longer for observing the end point, and are strictly constrained by the ethical requirement of choosing doses conservatively. Therefore, standard approaches such as unrestricted UaD and SAM designs are not applicable. Further complexity arises from population heterogeneity, subjectivity in judging toxicity,
and censoring because of early drop-out. Phase I trials are not completely re-
stricted to new drugs but may also be conducted for new schedules or for new
formulations (e.g., packed with liposomes). This challenges the use of prior infor-
mation. For the use of drug combinations, see Ref. 68.
Dose escalation was presented in broad detail ranging from the more prag-
matic standard rules (S/TER) to partial and full Bayesian methods. A comprehen-
sive comparison of all the proposed methods is, despite a number of simulation results (mentioned in Section III.B.5), not available at present. All previous at-
tempts to validate a design and to compare it with others have to use simulations
because a trial can be performed only once with one escalation scheme and it
cannot be ‘‘reperformed.’’ Therefore, all comparisons can only be as valid as
the setup of the simulation study is able to cover clinical realty and complexity.
Nevertheless, there is evidence from simulations and from clinical experience that
the standard designs are too conservative in terms of underdosing and needless
prolongation. At the same time, the pure Bayesian and the CRM rules are lacking
because their optimality is connected with treating one patient after the other,
which may prolong a trial even more, and because they run the risk of treating too
many patients at toxic levels (for further criticism, see 44). This dilemma has
motivated a large number of modifications both of the S/TER and the CRM
(lowering the starting dose and restricting the dose escalation to one step to find
a middle way between conservatism and anticonservatism). This has restricted the Bayesian dynamics of the CRM considerably, but there seems to be an advantage in using a modified CRM.
A driving force for the search of new designs and one of the most serious
objections against the standard design was the argument that as many patients
as possible should be treated at high dose levels, ideally near the true MTD, such that
they may have a therapeutic benefit from the experimental drug. This argument is
absolutely correct, and given that the new experimental drug is finally proved to
be efficient (e.g., the taxanes for breast cancer), it may seem empirically obvious, yet it remains a post hoc argument. One has to be reminded that efficacy is not the primary aim of a phase I study, which is dose finding. The percentage
of patients entered into phase I trials who benefited from that treatment has rarely
been estimated in serious studies. Rough estimates give response rates—mostly
partial response only—in the range of a few percent. Nineteen responses (3.1%)
were recorded among 610 patients in the 3-year review (69) of 23 trials of the MD
Anderson Cancer Center from 1991 to 1993. Therefore, it would be unrealistic to
expect a therapeutic benefit even at higher dose levels for most patients, and the
impact of treating at high doses should not be overemphasized. Nevertheless, it remains an ethical obligation to do the best possible even in a situation with a very low chance of benefit.
Phase I trials are weak in terms of a generalization to a larger patient popu-
lation because only a small number of selected patients is treated under some
special circumstances by specialists. Therefore, drug development programs usually implement more than one phase I trial for the same drug. This poses ethical concerns, however, about the premise that the phase I trial is the one in which patients receive the drug for the first time. Mostly, repeated phase I trials use variations
of the schedule and administration, but as long as there is no direct information
exchange among those trials with respect to occurrence of toxicity, ethical con-
cerns remain. The repetition of phase I trials has not received much consideration in the past.
Further concern arises if patients are selected for inclusion into the trial
(12), for example, by including at higher doses less pretreated cases because of
the fear of serious toxicities. Those concerns are nourished by the apparent impossibility of randomization. Given multicentricity and enough patients per
center, some restricted type of randomization may be feasible and should be
considered, for example, randomly choosing the center for the first patient for
the next dose level. Interestingly, the very early phase I trial of DeVita et al. (32)
from 1965 used a randomized approach. Further improvement was put into the
perspective of increasing the sample size of phase I trials and thereby increasing the information (13). The definition of the toxicity criteria and their assessment rules
are mostly beyond the statistical methodology but are perhaps more crucial than
any other means during the conduct of the trial. Care has to be taken that the assessment of toxicity does not change progressively during the trial.
Statistical estimation of the MTD has been restricted above to basic proce-
dures leaving aside the question of estimation bias when neglecting the design.
Babb et al. (41) noted that they can estimate the MTD using a different prior
than used in the design and refer to work of others who suggested using a Bayes-
ian scheme for design and a maximum likelihood estimate of the MTD. The
MTD estimation above was restricted to a qualitative and at most categorical
outcome measure. Laboratory toxicity is, however, available on a quantitative
scale, and this information could be used for the estimation of an MTD (70).
Mick and Ratain (71) therefore proposed ln(y) = a0 + a1x + a2 ln(z) as a dose-
toxicity model, where y was a quantitative toxicity of myelosuppression (nadir
of white blood cell count) and z was a covariate (pretreatment white blood cell
count). Further research and practical application is needed to corroborate the
findings. Extensions of the use of the toxicity grades 1–4 were addressed briefly
in Section III.B.4. One may be tempted to define DLT specifically for different
toxicity grades with the aim of a grade-specific MTD also using grade-specific
acceptable tolerability θ. That would allow, for example, tolerability θ2 for grade
2 toxicity higher than the tolerability θ3 for grade 3 toxicity. This would demand
an isotonic relationship such that any grade 3 toxicity is preceded by a grade 2
toxicity and P(DLT2 | x) ⬎ P(DLT3 |x) for all doses x. A multigrade dose escala-
tion scheme was proposed (72) that allows escalation and reduction of dosage
using knowledge of the grade of toxicity. Depending on severity of toxicity, up
to six dose escalations and three stages of recruiting one, three, or six patients
at each dose level were planned. They showed that the multigrade design was
superior to the standard design and that it could compete successfully with the
two-stage designs. But it was not uniformly better. The authors also admitted
that a multigrade design is harder to comprehend and use, and to my knowledge
that design has never been used in practice, although it may have influenced
recent research in titration designs.
A tacit assumption in this chapter was that the cohort of nk (e.g., three)
patients scheduled for one dose level x[k] are treated more or less in parallel and
that the toxicity results become available at once when that kth stage has been
finished. This situation may be rather rare in practice, and staggered information
by treatment course and patient may be the rule. If so, how can the cumulative
increase of information on the dose level x[k] be used to interfere with the dosing
of the current patients or to decide on the dosing of the next cohort? Except for
the dose-titration design in Section IV.B, no formal methods have been developed
to my knowledge that would account for such an overlap of information. The
multiplicity of courses and even more the multiplicity of toxicities assessed with
the CTC scheme needs further research. Bayesian approaches may be most prom-
ising to deal with this difficult problem. To comply with clinical practice, such
a design should be flexible enough to allow both staggered entry distributed over
1–2 months and parallel entry in a few weeks.

ACKNOWLEDGMENTS

This work could not have been done without the long-standing extremely fruitful
cooperation with the Phase I/II Study Group of the AIO in the German Cancer
Society and the work done there with Wolfgang Queißer, Axel Hanauske, E.-D.
Kreuser, and Heiner Fiebig. I also owe the support and statistical input from the
Biometry Departments of Martin Schumacher (Freiburg) and Michael Schemper
(Vienna), Harald Heinzl (Vienna), and my next door colleague Annette Kopp-
Schneider. For technical assistance I thank Gudrun Friedrich for analyses and
Regina Grunert and Renate Rausch for typing and the bibliography. Finally, I
thank John Crowley for all the help and encouragement.

REFERENCES

1. Simon RM. A decade of progress in statistical methodology for clinical trials. Stat
Med 1991; 10:1789–1817.
2. World Medical Association. Declaration of Helsinki (http://www.aix-scientifics.com/).
3. Carter SK. Study design principles for the clinical evaluation of new drugs as devel-
oped by the chemotherapy programme of the National Cancer Institute. In: Staquet
MJ, ed. The Design of Clinical Trials in Cancer Therapy. Brussels: Editions Scient
Europ 1973:242–289.
4. Schwartsmann G, Wanders J, Koier IJ, et al. EORTC New Drug Development Office
Coordinating and Monitoring Programme for Phase I and II Trials with new antican-
cer agents. Eur J Cancer 1991; 27:1162–1168.
5. Spreafico F, Edelstein MB, Lelieveld P. Experimental bases for drug selection. In:
Buyse ME, Staquet MJ, Sylvester RJ, eds. Cancer Clinical Trials. Methods and Prac-
tice. Oxford: Oxford University Press, 1984:193–209.
6. Carter SK, Bakowski MT, Hellmann K. Clinical trials in cancer chemotherapy. In:
Carter SK, Bakowski MT, Hellmann K, eds. Chemotherapy of Cancer, 3rd ed. New
York: John Wiley, 1987:29–31.
7. EORTC New Drug Development Committee. EORTC guidelines for phase I trials
with single agents in adults. Eur J Cancer Clin Oncol 1985; 21:1005–1007.
8. Bodey GP, Legha SS. The phase I study: general objectives, methods and evaluation.
In: Muggia FM, Rozencweig M, eds. Clinical Evaluation of Antitumor Therapy.
Dordrecht: Nijhoff The Netherlands, 1987:153–174.
9. Leventhal BG, Wittes RE. Phase I trials. In: Leventhal BG, Wittes RE, eds. Research
Methods in Clinical Oncology. New York: Raven Press, 1988:41–59.
10. Schneiderman MA. Mouse to man: statistical problems in bringing a drug to clinical
trial. In: Proc. 5th Berkeley Symp Math Statist Prob, 4th ed. Berkeley: University
of California Press, 1967:855–866.
11. Von Hoff DD, Kuhn J, Clark GM. Design and conduct of phase I trials. In: Buyse
ME, Staquet MJ, Sylvester RJ, eds. Cancer Clinical Trials: Methods and Practice.
Oxford: Oxford University Press, 1984:210–220.
12. Ratain MJ, Mick R, Schilsky RL, Siegler M. Statistical and ethical issues in the
design and conduct of phase I and II clinical trials of new anticancer Agents. J Nat
Cancer Inst 1993; 85:1637–1643.
13. Christian MC, Korn EL. The limited precision of phase I trials. J Nat Cancer Inst
1994; 86:1662–1663.
14. Kerr DJ. Phase I clinical trials: adapting methodology to face new challenges. Ann
Oncol 1994; 5:S67–S70.
15. Pinkel D. The use of body surface area as a criterion of drug dosage in cancer chemo-
therapy. Cancer Res 1958; 18:853–856.
16. International Conference on Harmonisation: Guideline for Good Clinical Practice (http://ctep.info.nih.gov/ctc3/ctc.htm).
17. World Health Organization (WHO). WHO Handbook for Reporting Results of Can-
cer Treatment. WHO Offset Publication No. 48, Geneva: WHO, 1979.
18. Kreuser ED, Fiebig HH, Scheulen ME, et al. Standard operating procedures and
organization of German Phase I, II, and III Study Groups, New Drug Development
Group (AWO), and Study Group of Pharmacology in Oncology and Hematology
(APOH) of the Association for Medical Oncology (AIO) of the German Cancer
Society. Onkol 1998; 21(suppl 3), 1–22.
19. Mick R, Lane N, Daugherty C, Ratain MJ. Physician-determined patient risk of toxic
effects: impact on enrollment and decision making in phase I trials. J Nat Cancer
Inst 1994; 86:1685–1693.
20. Franklin HR, Simonetti GPC, Dubbelman AC, et al. Toxicity grading systems. A
comparison between the WHO scoring system and the common toxicity criteria
when used for nausea and vomiting. Ann Oncol 1994; 5:113–117.
21. Brundage MD, Pater JL, Zee B. Assessing the reliability of two toxicity scales:
implications for interpreting toxicity data. J Nat Cancer Inst 1993; 85:1138–1148.
22. Morgan BJT. Analysis of Quantal Response Data. London: Chapman & Hall, 1992.
23. O’Quigley J, Pepe M, Fisher L. Continual reassessment method: a practical design
for phase I clinical trials in cancer. Biometrics 1990; 46:33–48.
24. Collins JM, Zaharko DS, Dedrick RL, Chabner BA. Potential roles for preclinical
pharmacology in phase I clinical trials. Cancer Treat Rep 1986; 70:73–80.
25. Goldsmith MA, Slavik M, Carter SK. Quantitative prediction of drug toxicity in
humans from toxicology in small and large animals. Cancer Res 1975; 35:1354–
1364.
26. Hansen HH. Clinical experience with 1-(2-chloroethyl) 3-cyclohexyl-1-nitrosourea
(CCNU, NSC-79037). Proc Am Assoc Cancer Res 1970; 11:87.
27. Muggia FM. Phase I study of 4′demethyl-epipodophyllotoxin-β d-thenylidene gly-
coside (PTG, NSC-122819). Proc Am Assoc Cancer Res 1970; 11:58.
28. Carter SK. Clinical trials in cancer chemotherapy. Cancer 1977; 40:544–557.
29. Hansen HH, Selawry OS, Muggia FM, Walker MD. Clinical studies with 1-(2-chlo-
roethyl)-3-cyclohexyl-1-nitrosourea (NSC79037). Cancer Res 1971; 31:223–227.
30. Bellman RE. Topics in pharmacokinetics. III. Repeated dosage and impulse control.
Math Biosci 1971; 12:1–5.
31. Bellman RE. Dynamic Programming. Princeton: Princeton University Press, 1962.
32. DeVita VT, Carbone PP, Owens AH, Gold GL, Krant MJ, Epmonson J. Clinical
Trials with 1,3-bis(2-chlorethyl)-1-nitrosourea, NSC 409962. Cancer Treat Rep
1965; 25:1876–1881.
33. Goldin A, Carter S, Homan ER, Schein PS. Quantitative comparison of toxicity in
animals and man. In: Staquet MJ, ed. The Design of Clinical Trials in Cancer Ther-
apy. Brussels: Editions Scient Europ, 1973:58–81.
34. Penta JS, Rozencweig M, Guarino AM, Muggia FM. Mouse and large-animal toxi-
cology studies of twelve antitumor agents: relevance to starting dose for phase I
clinical trials. Cancer Chemother Pharmacol 1979; 3:97–101.
35. Rozencweig M, Von Hoff DD, Staquet MJ, et al. Animal toxicology for early clinical
trials with anticancer agents. Cancer Clin Trials 1981; 4:21–28.
36. Willson JKV, Fisher PH, Tutsch K, Alberti D, Simon K, Hamilton RD, Bruggink
J, Koeller JM, Tormey DC, Earhardt RH, Ranhosky A, Trump DL. Phase I clinical
trial of a combination of dipyridamole and acivicin based upon inhibition of nucleo-
side salvage. Cancer Res 48:5585–5590.
37. Storer BE. Design and analysis of phase I clinical trials. Biometr 1989; 45:925–
937.
38. Durham SD, Flournoy N, Rosenberger WF. A random walk rule for phase I clinical
trials. Biometr 1997; 53:745–760.
39. Dixon WJ, Mood AM. A method for obtaining and analyzing sensivity data. J Am
Statist Assoc 1948; 43:109–126.
40. Robbins H, Monro S. A stochastic approximation method. Ann Math Stat 1951; 29:
400–407.
41. Babb J, Rogatko A, Zacks S. Cancer phase I clinical trials: efficient dose escalation
with overdose control. Stat Med 1998; 17:1103–1120.
42. O’Quigley J, Chevret S. Methods for dose finding studies in cancer clinical trials:
a review and results of a Monte Carlo study. Stat Med 1991; 10:1647–1664.
43. Shen LZ, O’Quigley J. Consistency of continual reassessment method under model
misspecification. Biometrika 1996; 83:395–405.
44. Korn EL, Midthune D, Chen TT, Rubinstein LV, Christian MC, Simon RM. A com-
parison of two phase I trial designs. Stat Med 1994; 13:1799–1806.
45. Faries D. Practical modifications of the continual reassessment method for phase I
clinical trials. J Biopharm Stat 1994; 4:147–164.
46. Moller S. An extension of the continual reassessment methods using a preliminary
up-and-down design in a dose finding study in cancer patients, in order to investigate
a greater range of doses. Stat Med 1995; 14:911–922.
47. Goodman SN, Zahurak ML, Piantadosi S. Some practical improvements in the con-
tinual reassessment method for phase I studies. Stat Med 1995; 14:1149–1161.
48. Ahn C. An evaluation of phase I cancer clinical trial designs. Stat Med 1998; 17:
1537–1549.
49. Hanauske AR, Edler L. New clinical trial designs for phase I studies in hematology
and oncology: principles and practice of the continual reassessment model. Onkol
1996; 19:404–409.
50. Whitehead J, Brunier H. Bayesian decision procedures for dose determining experi-
ments. Stat Med 1995; 14:885–893.
51. Gatsonis C, Greenhouse JB. Bayesian methods for phase I clinical trials. Stat Med
1992; 11:1377–1389.
52. Rademaker AW. Sample sizes in phase I toxicity studies. In American Statist. Assoc.
(Alexandria, VA) ASA Proc Biopharm Sect 1989; 137–141.
53. Geller NL. Design of phase I and II clinical trials in cancer: a statistician’s view.
Cancer Invest 1984; 2:483–491.
54. Edler L. Modeling and computation in pharmaceutical statistics when analysing drug
safety. In: Kitsos CP, Edler L, eds. Contributions to Statistics. Heidelberg: Physica,
1997:221–232.
55. Storer BE. Small-sample confidence sets for the MTD in a phase I clinical trial.
Biometr 1993; 49:1117–1125.
56. Simon RM, Freidlin B, Rubinstein LV, Arbuck SG, Collins J, Christian MC. Accel-
erated titration designs for phase I clinical trials in oncology. J Nat Cancer Inst 1997;
89:1138–1147.
57. Thall PF, Russell KE. A strategy for dose-finding and safety monitoring based on
efficacy and adverse outcomes in phase I/II clinical trials. Biometr 1998; 54:251–
264.
58. McCullagh P. Regression models for ordinal data. J R Stat Soc B 1980; 42:109–
112.
59. Benner A, Edler L, Hartung G. SPLUS support for analysis and design of phase I
trials in clinical oncology. In: Millard S, Krause A, eds. SPLUS in Pharmaceutical
Industry. New York: Springer, 2000.
60. Bliss CI. The method of probits. Science 1934; 79:409–410.
61. Finney DJ. Statistical Method in Biological Assay. London: C. Griffin, 1978.
62. Gibaldi M, Perrier D. Pharmacokinetics. New York: Marcel Dekker, 1982.
63. Edler L. Computational statistics for pharmacokinetic data analysis. In: Payne R,
Green P, eds. COMPSTAT. Proceedings in Computational Statistics. Heidelberg:
Physica, 1998:281–286.
64. Collins JM. Pharmacology and drug development. J Nat Cancer Inst 1988; 80:790–
792.
65. Collins JM, Grieshaber CK, Chabner BA. Pharmacologically guided phase I clinical
trials based upon preclinical drug development. J Nat Cancer Inst 1990; 82:1321–
1326.
66. EORTC Pharmacokinetics and Metabolism Group. Pharmacokinetically guided dose
escalation in phase I clinical trials. Commentary and proposed guidelines. Eur
J Cancer Clin Oncol 1987; 23:1083–1087.
67. Newell DR. Pharmacologically based phase I trials in cancer chemotherapy. New
Drug Ther 1994; 8:257–275.
68. Korn EL, Simon RM. Using the tolerable-dose diagram in the design of phase I
combination chemotherapy trials. J Clin Oncol 1993; 11:794–801.
69. Smith TL, Lee JJ, Kantarjian HM, Legha SS, Raber MN. Design and results of phase
I cancer clinical trials: three-year experience at M.D. Anderson Cancer Center. J Clin
Oncol 1996; 14:287–295.
70. Egorin MJ. Phase I trials: a strategy of ongoing refinement. J Nat Cancer Inst 1990;
82:446–447.
71. Mick R, Ratain MJ. Model-guided determination of maximum tolerated dose in
phase I clinical trials: evidence for increased precision. J Nat Cancer Inst 1993; 85:
217–223.
72. Gordon NH, Willson JKV. Using toxicity grades in the design and analysis of cancer
phase I clinical trials. Stat Med 1992; 11:2063–2075.
2
Dose-Finding Designs Using Continual
Reassessment Method

John O’Quigley
University of California at San Diego, La Jolla, California

I. CONTINUAL REASSESSMENT METHOD

The continual reassessment method (CRM), as a tool for carrying out phase I
clinical trials in cancer, has greatly gained in popularity in recent years. Here I
describe the basic ideas behind the method, some important technical consider-
ations, the properties of the method, and the possibility for substantial generaliza-
tion, specifically the use of graded information on toxicities, the incorporation
of a stopping rule leading to further reductions in sample size, the incorporation
of information on patient heterogeneity, the incorporation of pharmacokinetics,
and the possibility of modeling within patient dose escalation. At the time of
writing, few of these generalizations have been fully studied in any depth, al-
though it seems clear that the CRM provides a structure around which such fur-
ther developments can be carried out.

A. Motivation
The precise goals of a phase I dose-finding study in cancer have not always been
clearly defined. The absence of such definitions and the lack of clinically moti-
vated exigencies have led to the use of a number of schemes, in particular the
up and down scheme, recalled in a broad review by Storer (28), having properties
that can be considered undesirable in certain applications. This consideration un-
derscored the development of a different approach to such studies, the CRM (15),
in which the design was constructed to respond to specific requirements of the
phase I clinical investigation in cancer. These requirements are the following:
1. We should minimize the number of undertreated patients, i.e., patients
treated at unacceptably low dose levels.
2. We should minimize the number of patients treated at unacceptably
high dose levels.
3. We should minimize the number of patients needed to complete the
study (efficiency).
4. The method should respond quickly to inevitable errors in initial
guesses, rapidly escalating in the absence of indication of drug activity
(toxicity) and rapidly de-escalating in the presence of unacceptably
high levels of observed toxicity.
Before describing just how the CRM meets these requirements, let us first look
at the requirements themselves in the context of cancer dose-finding studies.
Most phase I cancer clinical trials are carried out on patients for whom all
currently available therapies have failed. There will always be hope in the thera-
peutic potential of the new experimental treatment, but such hope is invariably
tempered by the almost inevitable life-threatening toxicity accompanying the
treatment. Given that candidates for these trials have no other options concerning
treatment, their inclusion appears contingent on maintaining some acceptable de-
gree of control over the toxic side effects and trying to maximize treatment effi-
cacy (which translates as dose). Too high a dose, although offering in general
better hope for treatment effect, will be accompanied by too high a probability
of encountering unacceptable toxicity. Too low a dose, although avoiding this
risk, may offer too little chance of seeing any benefit at all.
Given this context, requirements 1 and 2 appear immediate. The third re-
quirement, a concern for all types of clinical studies, becomes of paramount im-
portance here where very small sample sizes are inevitable. This is because of
the understandable desire to proceed quickly with a potentially promising treat-
ment to the phase II stage. At the phase II stage the probability of observing
treatment efficacy is almost certainly higher than that for the phase I population
of patients. We have to do the very best we can with the relatively few patients
available, and the statistician involved in such studies should also provide some
ideas as to the error of our estimates, translating the uncertainty of our final
recommendations based on such small samples. The fourth requirement is not
an independent requirement and can be viewed as a partial re-expression of re-
quirements 1 and 2.
Taken together, the requirements point toward a method where we con-
verge quickly to the correct level, the correct level being defined as the one having
Figure 1 Typical trial histories. (Top) CRM, (bottom) standard design.

Figure 2 Fit of the working model.
a probability of toxicity as close as possible to some value θ. The value θ is
chosen by the investigator such that he or she considers probabilities of toxicity
higher than θ to be unacceptably high, whereas those lower than θ unacceptably
low in that they indicate, indirectly, the likelihood of too weak an antitumor
effect.
Figure 1 illustrates the comparative behavior of CRM with a fixed-sample
up and down design in which level 7 is the correct level.
How does CRM work? The essential idea is close to that of stochastic
approximation, the main differences being the use of a nonlinear underparameter-
ized model, belonging to a particular class of models, and a small number of
discrete dose levels rather than a continuum.
Figure 2 provides some insight into how the method behaves after having
included a sufficient number of patients into the study. The true unknown dose–
toxicity curve is not well estimated overall by the underparameterized dose–
toxicity curve taken from the CRM class, but, at the point of main interest, the
correct targeted level, the two curves nearly coincide. CRM will not do well if
the task is to estimate the overall dose–toxicity curve. It generally does very well
in addressing the more limited goal: to identify some single chosen percentile
from this unknown curve.
Patients enter sequentially. The working dose–toxicity curve, taken from
the CRM class (described below), is refitted after each inclusion. The curve is
then inverted to identify which of the available levels has an associated estimated
probability as close as we can get to the targeted acceptable toxicity level. The
next patient is then treated at this level. The cycle is continued until a fixed
number of subjects have been treated or until we apply some stopping rule (see
Sect. V). Typical behavior is that shown in Fig. 1.

B. Operating Characteristics
The above paragraphs outline how CRM works. The technical details are pro-
vided below. After asking how CRM works it is natural to ask how CRM behaves.
The original article on the method by O’Quigley et al. (15) considered, as an
illustration, a particularly simple model and how it worked when used to identify
a target dose level having probability of toxicity as near as possible to 0.2. Simu-
lations were encouraging and showed striking improvement over the standard
design, both in terms of accuracy of final recommendation and in terms of concen-
trating as large a percentage as possible of studied patients close to the target
level, thereby minimizing the number of overtreated and undertreated patients.
A large sample study (25) showed that under some broad conditions, the
level to which a CRM design converged will indeed be the closest to the target.
As pointed out by Storer (30), large sample properties themselves will not be
wholly convincing because, practically, we are inevitably faced with small to
moderate sample sizes. Nonetheless, if any scheme fails to meet such basic statis-
tical criteria as large sample convergence, we need to investigate with great care
its finite sample properties. The tool to use here is mostly that of simulation,
although for the standard up and down schemes, the theory of Markov chains
enables us to carry out exact probabilistic calculations (23,29).
Whether Bayesian or likelihood based, once the scheme is under way (i.e.,
the likelihood is nonmonotone), it is readily shown that a nontoxicity always
points in the direction of higher levels and a toxicity in the direction of lower
levels, the absolute value of the change diminishing with the number of included
patients. For nonmonotone likelihood it is impossible to be at some level, observe
a toxicity, and then for the model to recommend a higher level as claimed by
some authors (see Sect. VI.B), unless pushed in such a direction by a strong
prior. Furthermore, when targeting lower percentiles such as 0.2, it can be calcu-
lated and follows our intuition that a toxicity, occurring with a frequency a factor
of 4 less than that for the nontoxicities, will have a much greater impact on the
likelihood or posterior density. This translates directly into an operating charac-
teristic whereby model-based escalation is relatively cautious and de-escalation
more rapid, particularly early on where little information is available. In the
model and examples of O’Quigley et al. (15), dose levels could never be skipped
when escalating. However, if the first patient, treated at level 3, suffered a toxic
side effect, the method skipped when de-escalating, recommending level 1 for
the subsequent two entered patients before, assuming no further toxicities were
seen, escalating to level 2.
Simulations in O’Quigley et al. (15), O’Quigley and Chevret (16), Good-
man et al. (9), Korn et al. (11), and O’Quigley (1999) show the operating charac-
teristics of CRM to be good, in terms of accuracy of final recommendation, while
simultaneously minimizing the numbers of overtreated and undertreated patients.
However, violation of the model requirements and allocation principle of CRM,
described in the following section, can have a negative, possibly disastrous, effect
on operating characteristics. Chevret (6, situation 6 in Table 2) used a model,
failing the conditions outlined in Section II.A, that resulted in never recommend-
ing the correct level, a performance worse than we achieve by random guessing.
Goodman et al. (9) and Korn et al. (11) worked with this same model, and their
results, already favorable to CRM, would have been yet more favorable had a
model not violating the basic requirements been used (19). Both Faries (7) and
Moller (13) assigned to early levels other than those indicated by the model,
leading to large skips in the dose allocation, in one case skipping nine levels
after the inclusion of a single patient. We return to this in Section VI.

II. TECHNICAL ASPECTS

The aim of CRM is to locate the most appropriate dose, the so-called target dose,
the precise definition of which is provided below. This dose is taken from some
given range of available doses. The problem of dose spacing for single drug
combinations, often addressed via a modified Fibonacci design, is beyond the
scope of CRM. The need to add doses may arise in practice when the toxicity
frequency is deemed too low at one level but the next highest level is considered
too toxic. CRM can help with this affirmation, but as far as extrapolation or
interpolation of dose is concerned, the relevant insights will come from pharma-
cokinetics. For our purposes we assume that we have available k fixed doses;
d 1, . . . d k. These doses are not necessarily ordered in terms of the d i themselves,
in particular since each d i may be a vector, being a combination of different
treatments, but rather in terms of the probability R(d i ) of encountering toxicity
at each dose d i. It is important to be careful at this point since confusion can
arise over the notation. The d i, often multidimensional, describe the actual doses
or combinations of doses being used. We assume monotonicity and we take
monotonicity to mean that the dose levels, equally well identified by their integer
subscripts i (i ⫽ 1, . . . k), are ordered whereby the probability of toxicity at level
i is greater than that at level i′ whenever i ⬎ i ′. The monotonicity requirement or
the assumption that we can so order our available dose levels is thus important.
Currently, all the dose information required to run a CRM trial is contained in the
dose levels. Without wishing to preclude the possibility of exploiting information
contained in the doses d i and not in the dose levels i, at present we lose no
information when we replace d i by i.
The actual amount of drug therefore, so many mg/m2 say, is typically not
used. For a single-agent trial (see 13), it is in principle possible to work with the
actual dose. We do not advise this since it removes, without operational advan-
tages, some of our modeling flexibility. For multidrug or treatment combination
studies there is no obvious univariate measure. We work instead with some con-
ceptual dose, increasing when one of the constituent ingredients increases and,
under our monotonicity assumption, translating itself as an increase in the proba-
bility of a toxic reaction. Choosing the dose levels amounts to selecting levels
(treatment combinations) such that the lowest level hopefully has an associated
toxic probability less than the target and the highest level possibly close or higher
than the target.
The most appropriate dose, the ‘‘target’’ dose, is that dose having an associ-
ated probability of toxicity as close as we can get to the target ‘‘acceptable’’
toxicity θ. Values for the target toxicity level, θ, might typically be 0.2, 0.25,
0.3, 0.35, although there are studies in which this can be as high as 0.4 (13). The
value depends on the context and the nature of the toxic side effects.
The dose for the jth entered patient, X j, can be viewed as random taking
values x j, most often discrete in which case x j ∈ {d 1, . . . d k} but possibly continu-
ous where X j ⫽ x; x ∈ R⫹. In light of the remarks of the previous two paragraphs
we can, if desired, entirely suppress the notion of dose and retain only information
pertaining to dose level. This is all we need and we may prefer to write x j ∈
{1, . . . k}. Let Y j be a binary random variable (0, 1) where 1 denotes severe
toxic response for the jth entered patient ( j ⫽ 1, . . . n). We model R(x j ), the
true probability of toxic response, at X j ⫽ x j; x j ∈ {d 1, . . . d k} or x j ∈ {1, . . . k}
via
R(x j ) ⫽ Pr(Y j ⫽ 1|X j ⫽ x j ) ⫽ E(Y j | x j ) ⫽ ψ(x j, a)
for some one parameter model ψ(x j, a).
For the most common case of a single homogeneous group of patients, we
are obliged to work with an underparametrized model, notably a one-parameter
model. Although a two-parameter model may appear more flexible, the sequential
nature of CRM together with its aim to put the included patients at a single correct
level means that we will not obtain information needed to fit two parameters. We
are close to something like nonidentifiability. A likelihood procedure will be unsta-
ble and may even break down, whereas a two-parameter fully Bayesian approach
(8,15,31) may work initially, although somewhat artificially, but behave erratically
as sample size increases (see also Sect. VI). This is true even when starting out
at a low or the lowest level, initially working with an up and down design for
early escalation, before a CRM model is applied. Indeed, any design that ultimately
concentrates all patients from a single group on some given level can fit no more
than a single parameter without running into problems of consistency.
A. Model Requirements
The restrictions on ψ(x, a) were described by O’Quigley et al. (15).
For given fixed x we require that ψ(x, a) be strictly monotonic in a. For fixed a
we require that ψ(x, a) be monotonic increasing in x or, in the usual case of
discrete dose levels d i i ⫽ 1, . . . , k, that ψ(d i, a) ⬎ ψ(d m, a) whenever i ⬎ m.
The true probability of toxicity at x (i.e., whatever treatment combination has
been coded by x) is given by R(x), and we require that for the specific doses
under study (d 1, . . . , d k ) there exists values of a, say a 1, . . . , a k, such that
ψ(d i, a i ) ⫽ R(d i ), (i ⫽ 1, . . . , k). In other words, our one-parameter model has
to be rich enough to model the true probability of toxicity at any given level.
We call this a working model since we do not anticipate a single value of a to
work precisely at every level, that is, we do not anticipate a 1 ⫽ a 2 ⫽ ⋅ ⋅ ⋅ ⫽ a k
⫽ a. Many choices are possible. We have obtained excellent results with the
simple choice:
ψ(d i, a) ⫽ α ai, (i ⫽ 1, . . . , k) (1)
where 0 < α_1 < · · · < α_k < 1 and 0 < a < ∞. For the six levels studied in the simulations by O'Quigley et al. (15), the working model had α_1 = 0.05, α_2 = 0.10, α_3 = 0.20, α_4 = 0.30, α_5 = 0.50, and α_6 = 0.70. In that article this was expressed a little differently in terms of conceptual doses d_i, where d_1 = −1.47, d_2 = −1.10, d_3 = −0.69, d_4 = −0.42, d_5 = 0.0, and d_6 = 0.42, obtained from a model in which

α_i = (tanh d_i + 1)/2,  (i = 1, . . . , k)    (2)
The above "tanh" model was first introduced in this context by O'Quigley et al. (15), the idea being that (tanh x + 1)/2 increases monotonically from 0 to 1 as x increases from −∞ to ∞. This extra generality is not usually needed since attention is focused on the few fixed d_i. Note that, at least as far as maximum likelihood estimation is concerned (see Sect. III.A), working with model (1) is equivalent to working with a model in which α_i (i = 1, . . . , k) is replaced by α_i* (i = 1, . . . , k), where α_i* = α_i^m for any real m > 0. Thus, we cannot really attach any concrete meaning to the α_i. The spacing between adjacent α_i, however, will have an impact on operating characteristics. Working with real doses corresponds to using some fixed dose spacing, although not necessarily one with nice properties. The spacings chosen here have proved satisfactory in terms of performance across a broad range of situations. An investigation into how to choose the α_i with the specific aim of improving certain aspects of performance has yet to be carried out.
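To make the working model concrete, the following short sketch (Python; the function names are ours and purely illustrative, not part of any published CRM software) evaluates the power model of Eq. (1) for the six-level skeleton quoted above and recovers the conceptual doses of Eq. (2) by inverting the tanh transformation.

```python
import math

# Skeleton probabilities alpha_i for the six-level example discussed above.
alpha = [0.05, 0.10, 0.20, 0.30, 0.50, 0.70]

def conceptual_dose(a_i):
    """Invert Eq. (2): recover the conceptual dose d_i from alpha_i."""
    # alpha_i = (tanh(d_i) + 1)/2  =>  d_i = atanh(2*alpha_i - 1)
    return math.atanh(2 * a_i - 1)

def psi(i, a, alpha=alpha):
    """Working model of Eq. (1): probability of toxicity at level i (0-based)
    for parameter a > 0, psi(d_i, a) = alpha_i ** a."""
    return alpha[i] ** a

if __name__ == "__main__":
    # At a = 1 the model returns the skeleton itself; other values of a bend
    # the whole curve up (a < 1) or down (a > 1) while preserving monotonicity.
    print([round(conceptual_dose(x), 2) for x in alpha])   # approximately the d_i of Eq. (2)
    print([round(psi(i, 0.7), 3) for i in range(6)])       # a smaller a raises all toxicity probabilities
```

Running the sketch reproduces, up to rounding, the conceptual doses −1.47, −1.10, −0.69, −0.42, 0.0, and 0.42 quoted above.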
Some obvious choices for a model can fail the above conditions, leading to potentially poor operating characteristics. The one-parameter logistic model, ψ(x, a) = w/(1 + w), in which b is fixed and where w = exp(b + ax), can be seen to fail the above requirements (25). On the other hand, the less intuitive
model obtained by redefining w so that w = exp(a + bx), b ≠ 0, belongs to the CRM class.
III. IMPLEMENTATION

Once a model has been chosen and we have data in the form of the set Ω_j = {y_1, x_1, . . . , y_j, x_j}, the outcomes of the first j experiments, we obtain estimates R̂(d_i), (i = 1, . . . , k) of the true unknown probabilities R(d_i), (i = 1, . . . , k) at the k dose levels (see below). The target dose level is that level having associated with it a probability of toxicity as close as we can get to θ. The dose or dose level x_j assigned to the jth included patient is such that

|R̂(x_j) − θ| < |R̂(d_i) − θ|,  (i = 1, . . . , k; x_j ≠ d_i)

Thus, x_j is the closest level to the target level in the above precise sense. Other choices of closeness could be made, incorporating cost or other considerations. We could also weight the distance, for example, multiply |R̂(x_j) − θ| by some constant greater than 1 when R̂(x_j) > θ. This would favor conservatism, such a design tending to experiment more often below the target than a design without weights. Similar ideas have been pursued by Babb et al. (5).
The estimates R̂(x_j) are obtained from the one-parameter working model. Two questions dealt with in this section arise: How do we estimate R(x_j) on the basis of Ω_{j−1}, and how do we obtain the initial data, in particular since the first entered patient or group of patients must be treated in the absence of any data-based estimates of R(x_1)?
Even though our model is underparametrized, leading us into the area of
misspecified models, it turns out that standard procedures of estimation work.
Some care is needed to show this, and we look at this in Section IV. The proce-
dures themselves are described just below. Obtaining the initial data is partially
described in these same sections as well as being the subject of its own subsection,
two-stage designs.
To decide, on the basis of available information and previous observations,
the appropriate level at which to treat a patient, we need some estimate of the
probability of toxic response at dose level d i, (i ⫽ 1, . . . , k). We would currently
recommend use of the maximum likelihood estimator (17) described in section
III.A. The Bayesian estimator, developed in the original paper by O’Quigley et
al. (15), will perform very similarly unless priors are strong. The use of strong
priors in the context of an underparametrized and misspecified model may require
deeper study. Bayesian ideas can nonetheless be very useful in addressing more
complex questions such as patient heterogeneity and intrapatient escalation. We
return to this in Section VII.
A. Maximum Likelihood Implementation


After the inclusion of the first j patients, the log-likelihood can be written as
ℒ_j(a) = Σ_{ℓ=1}^{j} y_ℓ log ψ(x_ℓ, a) + Σ_{ℓ=1}^{j} (1 − y_ℓ) log(1 − ψ(x_ℓ, a))    (3)

and is maximized at a = â_j. Maximization of ℒ_j(a) can easily be achieved with a Newton–Raphson algorithm or by visual inspection using some software package such as Excel. Once we have calculated â_j, we can next obtain an estimate of the probability of toxicity at each dose level d_i via

R̂(d_i) = ψ(d_i, â_j),  (i = 1, . . . , k)

On the basis of this formula, the dose to be given to the (j + 1)th patient, x_{j+1}, is determined. We can also calculate an approximate 100(1 − α)% confidence interval for ψ(x_{j+1}, â_j) as (ψ_j^−, ψ_j^+), where

ψ_j^− = ψ{x_{j+1}, â_j + z_{1−α/2} v(â_j)^{1/2}},  ψ_j^+ = ψ{x_{j+1}, â_j − z_{1−α/2} v(â_j)^{1/2}}

where z_α is the αth percentile of a standard normal distribution and v(â_j) is an estimate of the variance of â_j. For the model of Eq. (1), this turns out to be particularly simple, and we can write

v^{−1}(â_j) = Σ_{ℓ ≤ j, y_ℓ = 0} ψ(x_ℓ, â_j)(log α_ℓ)^2 / (1 − ψ(x_ℓ, â_j))^2

Although based on a misspecified model, these intervals turn out to be quite accurate, even for sample sizes as small as 16, and thus helpful in practice (14).
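As an illustration of the maximum likelihood implementation, the sketch below (Python, written for this chapter rather than taken from any published software) maximizes the log-likelihood of Eq. (3) under the power model of Eq. (1) by a simple one-dimensional search and then recommends the level whose estimated toxicity probability is closest to the target. The skeleton and the data pattern are those of the illustration in Section III.E, so the output should approximately recover the estimate â ≈ 0.715 reported there; a Newton–Raphson step or any standard optimizer could replace the golden-section search.

```python
import math
from typing import List

ALPHA = [0.04, 0.07, 0.20, 0.35, 0.55, 0.70]   # skeleton of the illustration in Section III.E
THETA = 0.20                                    # target toxicity

def psi(level: int, a: float) -> float:
    return ALPHA[level] ** a

def log_lik(a: float, levels: List[int], tox: List[int]) -> float:
    """Log-likelihood of Eq. (3) for dose levels 'levels' (0-based) and binary outcomes 'tox'."""
    ll = 0.0
    for x, y in zip(levels, tox):
        p = psi(x, a)
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

def mle(levels, tox, lo=1e-3, hi=20.0, iters=200):
    """Golden-section search for the maximizing a; requires at least one toxicity
    and one nontoxicity, otherwise the maximum sits on the boundary."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if log_lik(c, levels, tox) < log_lik(d, levels, tox):
            a = c
        else:
            b = d
    return (a + b) / 2

def next_level(levels, tox):
    a_hat = mle(levels, tox)
    probs = [psi(i, a_hat) for i in range(len(ALPHA))]
    return min(range(len(ALPHA)), key=lambda i: abs(probs[i] - THETA)), a_hat, probs

if __name__ == "__main__":
    # Data pattern of the illustration: three nontoxicities at level 1, three at level 2,
    # then two toxicities out of three patients at level 3.
    levels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
    tox    = [0, 0, 0, 0, 0, 0, 1, 1, 0]
    rec, a_hat, probs = next_level(levels, tox)
    print(rec + 1, round(a_hat, 3), [round(p, 3) for p in probs])
```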
A requirement to be able to maximize the log-likelihood on the interior of the parameter space is that we have heterogeneity among the responses, that is, at least one toxic and one nontoxic response (27). Otherwise, the likelihood is maximized on the boundary of the parameter space and our estimates of R(d_i), (i = 1, . . . , k) are trivially either zero or one or, depending on the model we are working with, may not even be defined.
Thus, the experiment is considered as not being fully underway until we
have some heterogeneity in the responses. These could arise in a variety of differ-
ent ways: use of the standard up and down approach, use of an initial Bayesian
CRM as outlined below, or use of a design believed to be more appropriate by
the investigator. Once we have achieved heterogeneity, the model kicks in and
we continue as prescribed above (estimation-allocation).
Getting the trial underway, that is, achieving the necessary heterogeneity
to carry out the above prescription, is largely arbitrary. This feature is specific
to the maximum likelihood implementation and such that it may well be treated
separately. Indeed, this is our suggestion and we describe this more fully below
in the subsection Two-Stage Designs.

B. Bayesian Implementation
Before describing more closely the Bayesian implementation, it is instructive to
consider Fig. 3 (likelihood and posterior densities for small samples). After seeing the first observed toxicity at level 3, with two other patients treated at this same level, a further two one level below, and a single patient at the lowest level, the likelihood estimate for a can be seen to
be very close to that based on the Bayesian posterior. We expect this for
vague priors, but the illustration helps eliminate a potential concern that the
maximum likelihood estimate may be too unstable for small samples. This could
be the case for two-parameter models (see also Sect. VI.A and VI.C), but such
erratic behavior is avoided by the dampening effects of the one-parameter model.
Very few further observations are required before the maximum likelihood
estimator and the Bayesian estimator become, for practical purposes, indistin-
guishable.
In the original Bayesian setup (15), CRM stipulated that the first entered
patient would be treated at some level, believed by the experimenter, in the light
of all current available knowledge, to be the target level. Such knowledge, possi-
bly together with his or her own subjective conviction, led the experimenter to
a ‘‘point estimate’’ of the probability of toxicity at the starting dose to be the
same as the targeted toxic level. This level was chosen to be level 3 in an experi-
ment with six levels allowing the possibility of both escalation and de-escalation.
In O'Quigley et al. (15) the targeted toxicity level was 0.2. In Eq. (2) the "dose" that satisfies this is −0.69, so that we had d_3 = −0.69. In addition, we had d_1 = −1.47, d_2 = −1.10, d_3 = −0.69, d_4 = −0.42, d_5 = 0.0, and d_6 = 0.42, so that for a_0 = 1, corresponding to the mean of some prior distribution, the prior point estimates of the toxic probabilities were 0.05, 0.1, 0.2, 0.3, 0.5, and 0.7, respectively. We considered that our point estimate 0.2, corresponding to a = 1, was uncertain and likely to be in error.
The notion of uncertainty was expressed via a prior density g(a) for a, having support on 𝒜. In the model of Section II.A, 𝒜 is the positive real line, and so we gave consideration to distributions having support on the positive axis, the family of gamma distributions in particular. The simplest member of the gamma family, the standard exponential distribution with g(a) = exp(−a), showed itself to be a prior sufficiently vague for a large number of situations. For this prior, 95% Bayesian intervals for the probability of toxicity at the starting dose lay between 0.003 and 0.96. For the lowest level, the corresponding interval is (10^{−5}, 0.93), whereas for the highest level, having a prior point estimate of 0.7, the interval becomes (0.26, 0.99). Such a prior is therefore not vague at all levels and suggests that the highest level is likely to be too high. A more vague
levels and suggests that the highest level is likely to be too high. A more vague
prior would help acceleration from the starting level to the highest level when
we greatly overestimate the new treatment’s toxic potential. Even so, it does not
take long for the accumulating information to ‘‘override’’ the prior, and this
simple exponential formulation appeared to be fairly satisfactory for many cases
(15).
The starting level d_i is such that we should have

∫ ψ(d_i, u) g(u) du = θ
This may be a difficult integral equation to solve, and practically we might take the starting dose to be obtained from ψ(d_i, µ_0) = θ, where

µ_0 = ∫ u g(u) du

The other doses could be chosen so that

ψ{d_1, µ_0} = α_1, . . . , ψ{d_k, µ_0} = α_k

These initial values for the toxicities may reflect the experimenter’s best guesses
about the potential for toxicities at the available doses. Note that in contrast to
the maximum likelihood approach, the α i can be ascribed a more concrete mean-
ing, in terms of probabilities, rather than simply being parameters to a model up
to an arbitrary positive power transformation. Difficulties can arise if this proce-
dure is not followed. For instance, suppose we decide to start out with a deliber-
ately low dose, although according to the prior a higher dose would have been
indicated. This can lead to undesirable behavior of CRM, an example being the
potential occurrence of big jumps when escalating (7,13). Restricting escalation
increments may appear to alleviate the problem (see MCRM, Sect. VI.B), but
we do not recommend this in view of its ad hoc nature and since the problem
can be avoided at the setup stage when the guidelines of O’Quigley et al. (15),
recalled here, are carefully followed. Such difficulties do not arise with the maxi-
mum likelihood approach.
Given the set Ω_j we can calculate the posterior density for a as

f(a | Ω_j) = H_j^{−1} exp{ℒ_j(a)} g(a)    (4)

where H_j is a sequence of normalizing constants, i.e.,

H_j = ∫_0^∞ exp{ℒ_j(u)} g(u) du    (5)

The dose x_{j+1} ∈ {d_1, . . . , d_k} assigned to the (j + 1)th included patient is the dose minimizing

|θ − ∫ ψ{x_{j+1}, u} f(u | Ω_j) du|

If there are many dose levels, it may be more computationally efficient to locate the dose level x_{j+1} ∈ {d_1, . . . , d_k} satisfying

Q_ij (P_ij − 2θ) < 0,  (i = 1, . . . , k; x_{j+1} ≠ d_i)    (6)

where

Q_ij = ∫_0^∞ {ψ(x_{j+1}, u) − ψ(d_i, u)} f(u | Ω_j) du
and

P_ij = ∫_0^∞ {ψ(x_{j+1}, u) + ψ(d_i, u)} f(u | Ω_j) du

Often it will make little difference if, rather than work with the expectations of the toxicities, we work with the expectation of a, thereby eliminating the need for k − 1 integral calculations. Thus, we treat the (j + 1)th included patient at the level x_{j+1} ∈ {d_1, . . . , d_k} such that |θ − ψ{x_{j+1}, µ_j}| is minimized, where µ_j = ∫ u f(u | Ω_j) du. As in the likelihood approach, we can calculate an approximate 100(1 − α)% Bayesian interval for ψ(x_{j+1}, µ_j) as (ψ_j^−, ψ_j^+), where

ψ_j^− = ψ(x_{j+1}, α_j^−),  ψ_j^+ = ψ(x_{j+1}, α_j^+),  ∫_{α_j^+}^{α_j^−} f(u | Ω_j) du = 1 − α
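A minimal numerical sketch of the Bayesian implementation is given below (Python), assuming the standard exponential prior g(a) = exp(−a) and the power working model with the skeleton used earlier in this section; midpoint-rule quadrature on a grid stands in for whatever integration routine one actually prefers, and all function names are ours. The sketch computes the posterior mean µ_j, the posterior expected toxicity at each level, and the level minimizing |θ − ∫ψ(d_i, u)f(u|Ω_j)du|.

```python
import math

ALPHA = [0.05, 0.10, 0.20, 0.30, 0.50, 0.70]   # prior guesses of the toxic probabilities
THETA = 0.20                                    # target toxicity

def psi(level, a):
    return ALPHA[level] ** a

def log_lik(a, levels, tox):
    return sum(y * math.log(psi(x, a)) + (1 - y) * math.log(1 - psi(x, a))
               for x, y in zip(levels, tox))

def posterior_summaries(levels, tox, a_max=20.0, n_grid=4000):
    """Midpoint-rule quadrature of the posterior of Eq. (4) under the
    standard exponential prior g(a) = exp(-a), truncated at a_max."""
    h = a_max / n_grid
    grid = [h * (i + 0.5) for i in range(n_grid)]
    w = [math.exp(log_lik(a, levels, tox)) * math.exp(-a) for a in grid]
    norm = sum(w) * h                                        # the normalizing constant H_j
    mu_j = sum(a * wi for a, wi in zip(grid, w)) * h / norm   # posterior mean of a
    # Posterior expected probability of toxicity at each dose level.
    exp_tox = [sum(psi(i, a) * wi for a, wi in zip(grid, w)) * h / norm
               for i in range(len(ALPHA))]
    return mu_j, exp_tox

def bayes_next_level(levels, tox):
    _, exp_tox = posterior_summaries(levels, tox)
    return min(range(len(ALPHA)), key=lambda i: abs(exp_tox[i] - THETA))

if __name__ == "__main__":
    # Hypothetical start: a single nontoxicity for the first patient at level 3 (index 2).
    mu_1, exp_tox = posterior_summaries([2], [0])
    print(round(mu_1, 3), [round(p, 3) for p in exp_tox])
    print("next level (1-based):", bayes_next_level([2], [0]) + 1)
```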

The Bayesian approach has the apparent advantage of being immediately


operational in that it is not necessary to wait for patient heterogeneity before
being able to assess to which level we should assign the successively entered
patients. By modifying our prior and/or model, quantities that are to a large extent
arbitrarily defined, we can alter these early operating characteristics to mimic the
kind of behavior we would like, for instance, rapid or less rapid escalation. In
principle we could fine tune prior and model parameters to achieve, for example,
rapid initial escalation that is gradually dampened in the absence of observed
toxicities. Such goals, however, are more readily accomplished via the two-stage
designs of the following section.

C. Two-Stage Designs
It may be believed that we know so little before undertaking a given study that
it is worthwhile to split the design into two stages: an initial exploratory escalation
followed by a more refined homing in on the target. Such an idea was first pro-
posed by Storer (28) in the context of the more classical up and down schemes.
His idea was to enable more rapid escalation in the early part of the trial where
we may be quite far from a level at which treatment activity could be anticipated.
Moller (13) was the first to use this idea in the context of CRM designs. Her
idea was to allow the first stage to be based on some variant of the usual up and
down procedures.
In the context of sequential likelihood estimation, the necessity of an initial
stage was pointed out by O’Quigley and Shen (17) since the likelihood equation
fails to have a solution on the interior of the parameter space unless some hetero-
geneity in the responses has been observed. Their suggestion was to work with
any initial scheme, Bayesian CRM, or up and down, and for any reasonable
scheme the operating characteristics appear relatively insensitive to this choice.
However, we believe there is something very natural and desirable in two-
stage designs and that currently they could be taken as the designs of choice.
Early behavior of the method, in the absence of heterogeneity (i.e., lack of toxic response), appears to be rather arbitrary. A decision to escalate after three included patients tolerate some level, or after a single patient tolerates a level, or according to some Bayesian prior, however constructed, translates directly (less directly for the Bayesian prescription) the simple desire to try a higher dose because thus far we have encountered no toxicity.
The use of a working model at this point, as occurs for Bayesian estimation,
may be somewhat artificial, and the rate of escalation can be modified at will,
albeit somewhat indirectly, by modifying our model parameters and/or our prior.
Rather than lead the clinician into thinking that something subtle and carefully
analytic is taking place, our belief is that it is preferable that he or she be involved
in the design of the initial phase. Operating characteristics that do not depend
on data ought to be driven by clinical rather than statistical concerns. More im-
portantly, the initial phase of the design, in which no toxicity has yet been ob-
served, can be made much more efficient, from both the statistical and ethical
angles, by allowing information on toxicity grade to determine the rapidity of
escalation.
Here we describe an example of a two-stage design that has been used in
practice. There were many dose levels, and the first included patient was treated
at a low level. As long as we observe very low-grade toxicities, then we escalate
quickly, including only a single patient at each level. As soon as we encounter
more serious toxicities then escalation is slowed down. Ultimately we encounter
dose-limiting toxicities, at which time the second stage, based on fitting a CRM
model, comes fully into play. This is done by integrating this information and
that obtained on all the earlier non-dose-limiting toxicities to estimate the most
appropriate dose level.
It was decided to use information on low-grade toxicities in the first stage
of a two-stage design to allow rapid initial escalation, since it is possible that we are far below the target level. Specifically, we define a grade severity variable S(i) to be the average toxicity severity observed at dose level i, that is, the sum of the severities at that level divided by the number of patients treated at that level (Table 1).

Table 1  Toxicity "grades" (severities) for the trial

  Severity    Degree of toxicity
  0           No toxicity
  1           Mild toxicity (non-dose limiting)
  2           Nonmild toxicity (non-dose limiting)
  3           Severe toxicity (non-dose limiting)
  4           Dose-limiting toxicity

The rule is to escalate, provided S(i) is less than 2. Furthermore, once we
have included three patients at some level, then escalation to higher levels only
occurs if each cohort of three patients does not experience dose-limiting toxicity.
This scheme means that, in practice, as long as we see only toxicities of severities
coded 0 or 1, then we escalate. The first severity coded 2 necessitates a further
inclusion at this same level, and anything other than a 0 severity for this inclusion
would require yet a further inclusion and a non-dose-limiting toxicity before be-
ing able to escalate. This design also has the advantage that should we be slowed
down by a severe (severity 3) albeit non-dose-limiting toxicity, we retain the
capability of picking up speed (in escalation) should subsequent toxicities be of
low degree (0 or 1). This can be helpful to avoid being handicapped by an outlier
or an unanticipated and possibly not drug-related toxicity.
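The escalation logic of this first stage can be written down in a few lines. The sketch below (Python; the function and variable names are ours) captures only the average-severity rule described above and closes the first stage at the first dose-limiting toxicity (severity 4); the three-patient cohort refinement mentioned in the text is left aside. It is meant as an illustration of the rule rather than a reproduction of the actual trial software.

```python
def first_stage_decision(severities_by_level, current_level, n_levels):
    """Sketch of the first-stage escalation rule described above.

    severities_by_level: dict mapping level -> list of observed severities (0-4)
    for patients treated at that level.  Returns the level for the next patient,
    or None once a dose-limiting toxicity (severity 4) has been seen, at which
    point the second (CRM) stage takes over.
    """
    # Close the first stage as soon as any dose-limiting toxicity is observed.
    if any(4 in sev for sev in severities_by_level.values()):
        return None

    sev_here = severities_by_level.get(current_level, [])
    if not sev_here:
        return current_level                     # nobody treated here yet

    avg = sum(sev_here) / len(sev_here)          # S(i) in the text
    if avg < 2 and current_level + 1 < n_levels:
        return current_level + 1                 # escalate
    return current_level                         # stay and accrue more patients

if __name__ == "__main__":
    # One patient per level with severities 0, 1, then a severity-2 toxicity at level 2:
    hist = {0: [0], 1: [1], 2: [2]}
    print(first_stage_decision(hist, current_level=2, n_levels=6))  # stays at level 2 since S(2) = 2
```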
Once dose-limiting toxicity is encountered, this phase of the study (the
initial escalation scheme) comes to a close and we proceed on the basis of CRM
recommendation. Although the initial phase is closed, the information on both
dose-limiting and non-dose-limiting toxicities thereby obtained is used in the
second stage.

D. Grouped Designs
O’Quigley et al. (15) described the situation of delayed response in which new
patients become available to be included in the study while the toxicity results
are still outstanding on already entered patients. The suggestion was, in the ab-
sence of information on such recently included patients, that the logical course
to take was to treat at the last recommended level. This is the level indicated by
all the currently available information.
The likelihood for this situation was written down by O’Quigley et al. (15)
and, apart from a constant term not involving the unknown parameter, is just
the likelihood we obtain were the subjects to have been included one by one.
There is therefore, operationally, no additional work required to deal with such
situations.
The question does arise, however, as to the performance of CRM in such
cases. The delayed response can lead to grouping or we can simply decide on
the grouping by design. Goodman et al. (9) and O’Quigley and Shen (17) studied
the effects of grouping. The more thorough study was that of Goodman et al. in
which cohorts of one, two, and three were evaluated. Broadly speaking, the cohort
size had little impact on operating characteristics and the accuracy of final recom-
mendation. O’Quigley and Shen (17) indicated that for groups of three and rela-
tively few patients (n = 16), when the correct level was the highest available
level and we start out at the lowest or a low level, then we might anticipate some
marked drop in performance when contrasted with, say, one-by-one inclusion.
Simple intuition would tell us this, and the differences disappeared for samples of
size 25. One-by-one inclusion tends to maximize efficiency, but should stability
throughout the study be an issue, then this extra stability can be obtained through
grouping. The cost of this extra stability in terms of efficiency loss appears to
be generally small. The findings of Goodman et al. (9), O’Quigley and Shen (17)
and O’Quigley (19) contradict the conjecture of Korn et al. (11) that any grouping
would lead to substantial efficiency losses.

E. Illustration
This brief illustration is recalled from O’Quigley and Shen (17). The study con-
cerned 16 patients. Their toxic responses were simulated from the known dose–
toxicity curve. There were six levels in the study, maximum likelihood was used,
and the first entered patients were treated at the lowest level. The design was
two stage. The true toxic probabilities were R(d_1) = 0.03, R(d_2) = 0.22, R(d_3) = 0.45, R(d_4) = 0.6, R(d_5) = 0.8, and R(d_6) = 0.95. The working model was that given by Eq. (1), where α_1 = 0.04, α_2 = 0.07, α_3 = 0.20, α_4 = 0.35, α_5 = 0.55, and α_6 = 0.70. The targeted toxicity was given by θ = 0.2, indicating that
the best level for the maximum tolerated dose (MTD) is given by level 2 where
the true probability of toxicity is 0.22. A grouped design was used until heteroge-
neity in toxic responses was observed, patients being included, as for the classical
schemes, in groups of three. The first three patients experienced no toxicity at
level 1. Escalation then took place to level 2, and the next three patients treated
at this level did not experience any toxicity either. Subsequently, two of the three
patients treated at level 3 experienced toxicity. Given this heterogeneity in the
responses, the maximum likelihood estimator for a now exists and, following a
few iterations, could be seen to be equal to 0.715. We then have that R̂(d_1) = 0.101, R̂(d_2) = 0.149, R̂(d_3) = 0.316, R̂(d_4) = 0.472, R̂(d_5) = 0.652, and R̂(d_6) = 0.775. The 10th entered patient is then treated at level 2, for which R̂(d_2) = 0.149, since, from the available estimates, this is the closest to the target θ = 0.2.
The 10th included patient does not suffer toxic effects, and the new maximum
likelihood estimator becomes 0.759. Level 2 remains the level with an estimated
probability of toxicity closest to the target. This same level is in fact recom-
mended to the remaining patients so that after 16 inclusions the recommended
MTD is level 2. The estimated probability of toxicity at this level is 0.212, and
a 90% confidence interval for this probability is estimated as (0.07, 0.39).

IV. STATISTICAL PROPERTIES

Recall that CRM is a class of methods rather than a single method, the members
of the class depending on arbitrary quantities chosen by the investigator, such as the form of the model, the spacing between the doses, the starting dose, whether
single or grouped inclusions, the initial dose escalation scheme in two-stage designs, or the prior density chosen for Bayesian formulations. The statistical proper-
ties described in this section apply broadly to all members of the class, the mem-
bers nonetheless maintaining some of their own particularities.

A. Convergence
Convergence arguments are obtained from considerations of the likelihood. The
same arguments apply to Bayesian estimation as long as the prior is other than
degenerate, that is, all the probability mass is not put on a single point. Usual
likelihood arguments break down since our models are misspecified.
The maximum likelihood estimate, R̂(d_i) = ψ(d_i, â_j), exists as soon as we have some heterogeneity in the responses (27). We assume the dose–toxicity function, ψ(x, a), to satisfy the conditions described in Section II.A, in particular the condition that, for i = 1, . . . , k, there exists a unique a_i such that ψ(d_i, a_i) = R(d_i). Note that the a_i depend on the actual probabilities of toxicity and are therefore unknown. We also require:
1. For each 0 < t < 1 and each x, the function

   s(t, x, a) := t (ψ′/ψ)(x, a) + (1 − t) {−ψ′/(1 − ψ)}(x, a)

   is continuous and strictly monotone in a.
2. The parameter a belongs to a finite interval [A, B].
The first condition is standard for estimating equations to have unique solutions. The second imposes no real practical restriction. We also require the true unknown dose–toxicity function, R(x), to satisfy the following conditions:
1. The probabilities of toxicity at d_1, . . . , d_k satisfy 0 < R(d_1) < · · · < R(d_k) < 1.
2. The target dose level is x_0 ∈ {d_1, . . . , d_k}, where |R(x_0) − θ| < |R(d_i) − θ|, (i = 1, . . . , k; x_0 ≠ d_i).
3. Before writing down the third condition, note that since our model is misspecified, it will generally not be true that ψ(d_i, a_0) ≡ R(d_i) for i = 1, . . . , k. We nonetheless require that the working model is not "too distant" from the true underlying dose–toxicity curve, and this can be made precise with the help of the set

   S(a_0) = {a : |ψ(x_0, a) − θ| < |ψ(d_i, a) − θ|, for all d_i ≠ x_0}    (7)

   The condition we require is that, for i = 1, . . . , k, a_i ∈ S(a_0).
It can be shown (25), under the assumptions on R(d_i) and ψ(d_i, a), that
S(a_0) is an open and convex set. This result is the key to rewriting Ĩ_n(a) below in a way that is convenient.
Letting

I_n(a) = (1/n) Σ_{j=1}^{n} [ y_j (ψ′/ψ)(x_j, a) + (1 − y_j) {−ψ′/(1 − ψ)}(x_j, a) ]

and

Ĩ_n(a) = (1/n) Σ_{j=1}^{n} [ R(x_j) (ψ′/ψ)(x_j, a) + {1 − R(x_j)} {−ψ′/(1 − ψ)}(x_j, a) ]

then

sup_{a ∈ [A, B]} |I_n(a) − Ĩ_n(a)| → 0,  almost surely    (8)

This convergence result follows intuitively and can be demonstrated rigorously in a number of ways. For instance, observe that for each dose level d_i, (ψ′/ψ)(d_i, ·) and {ψ′/(1 − ψ)}(d_i, ·) are uniformly continuous in a over the finite interval [A, B]. Shen and O'Quigley (25) applied this result to a sufficiently fine partition of the interval [A, B] to bound the above differences by arbitrarily small quantities. The result then follows.
The next important step is to consider the finite interval S_1(a_0) = [a_(1), a_(k)], in which a_(1) = min{a_1, . . . , a_k} and a_(k) = max{a_1, . . . , a_k}. The third condition on R(x) and the convexity of the set S(a_0) imply that S_1(a_0) ⊂ S(a_0). Define π_n(d_i) ∈ [0, 1] to be the frequency with which the level d_i has been used in the first n experiments. Then we can rewrite Ĩ_n(a) as

Ĩ_n(a) = Σ_{i=1}^{k} π_n(d_i) { R(d_i) (ψ′/ψ)(d_i, a) + (1 − R(d_i)) {−ψ′/(1 − ψ)}(d_i, a) }    (9)

Now, let ã_n be the solution to the equation Ĩ_n(a) = 0, i.e., ã_n solves

Σ_{i=1}^{k} π_n(d_i) { R(d_i) (ψ′/ψ)(d_i, a) + (1 − R(d_i)) {−ψ′/(1 − ψ)}(d_i, a) } = 0    (10)

For each 1 ≤ i ≤ k, the definition of a_i and condition 1 on the dose–toxicity function indicate that a_i is the unique solution to the equation

R(d_i) (ψ′/ψ)(d_i, a) + (1 − R(d_i)) {−ψ′/(1 − ψ)}(d_i, a) = 0

It follows that ã_n will fall into the interval S_1(a_0). Since â_n solves I_n(a) = 0, Eq. (8) and uniform continuity ensure that, almost surely, â_n ∈ S(a_0) for n sufficiently large. Hence, for large n, â_n satisfies

|ψ(x_0, â_n) − θ| < |ψ(d_i, â_n) − θ|,  for i = 1, . . . , k, d_i ≠ x_0
Thus, for n large enough, x_{n+1} ≡ x_0, so that x_{n+1} satisfies |x_{n+1} − x_0| < |x_n − x_0| whenever x_n ≠ x_0. In other words, x_n converges to x_0 almost surely. Since there are only a finite number of dose levels, x_n will stay at x_0 ultimately. To establish the consistency of â_n, observe that, as n tends to infinity, the π_n(d_i), (i = 1, . . . , k) in Eq. (9) become negligible, except π_n(x_0), which tends to 1. Thus ã_n, being the solution of Eq. (10), will tend to the solution of

R(x_0) (ψ′/ψ)(x_0, a) + {1 − R(x_0)} {−ψ′/(1 − ψ)}(x_0, a) = 0

The solution to the above equation is a_0. Applying Eq. (8) again, we obtain the consistency of â_n and, further, that the asymptotic distribution of √n (â_n − a_0) is N(0, σ²), with σ² = {ψ′(x_0, a_0)}^{−2} θ_0 (1 − θ_0).

B. Efficiency
O'Quigley (14) proposes using θ̂_n = ψ(x_{n+1}, â_n) to estimate the probability of toxicity at the recommended level x_{n+1}, where â_n is the maximum likelihood estimate. An application of the δ method shows that the asymptotic distribution of √n{θ̂_n − R(x_0)} is N{0, θ_0(1 − θ_0)}. The estimate provided by CRM is then fully efficient. This is what our intuition would suggest, given the convergence properties of CRM. What actually takes place in finite samples needs to be investigated on a case-by-case basis. Nonetheless, the relatively broad range of cases studied by O'Quigley (14) shows a mean squared error for the estimated probability of toxicity at the recommended level under CRM corresponding well with the theoretical variance for samples of size n, were all subjects to be experimented on at the correct level. Some cases studied showed evidence of superefficiency, reflecting a nonnegligible bias that happens to be in the right direction, whereas a few others indicated efficiency losses large enough to suggest the potential for improvement.

C. Safety
In any discussion of a phase I design the word safety will arise. This reflects a central ethical concern. A belief that CRM would tend to treat the early included patients in a study at high dose levels convinced many investigators that, without some modification, CRM was not "safe."
Safety is in fact a statistical property of any method. When faced with some
potential realities or classes of realities, we can ask ourselves questions such as
what is the probability of toxicity for a randomly chosen patient that has been
included in the study or, say, what is the probability of toxicity for those patients
entered into the study at the very beginning?
Once we know the realities or classes of realities we are facing, the op-
erating rules of the method, which are obvious and transparent for up and down
schemes and less transparent for model-based schemes such as CRM, then in
principle we can calculate the probabilities mentioned above. In practice, these
calculations are involved, and we may simply prefer to estimate them to any
desired degree of accuracy via simulation.
Theoretical work and extensive simulations (1,15–17,19,23) indicate CRM
to be a safer design than any of the commonly used up and down schemes in
that for targets of less than θ = 0.30, the probability that a randomly chosen
patient suffers a toxicity is lower. Furthermore, the probability of being treated
at levels higher than the MTD was, in all the studied cases, higher with the
standard designs than with CRM.
If the definition of safety was to be widened to include the concept of
treating patients at unacceptably low levels, that is, levels at which the probability
of toxicity is deemed too close to zero, then CRM does very much better than
the standard designs. This finding is logical given that the purpose of CRM is
to concentrate as much experimentation as possible around the prespecified tar-
get. In addition, it ought to be emphasized that we can adjust the CRM to make it as safe as we require by changing the target level. For instance, if we decrease the target from 0.20 to 0.10, the observed number of toxicities will, on average, be
roughly halved. This is an important point since it highlights the main advantages
of the CRM over the standard designs in terms of flexibility and the ability to
be adapted to potentially different situations. An alternative way to enhance con-
servatism is rather than choose the closest available dose to the target, systemati-
cally take the dose immediately lower than the target or change the distance
measure used to select the next level to recommend. Safety ought to be improved,
although the impact such an approach might have on the reliability of final estima-
tion remains to be studied. Some study on this idea has been carried out by Babb
et al. (5).

V. MORE COMPLEX CRM DESIGNS

The different up and down designs amount to a collection of ad hoc rules for
making decisions when faced with accumulating observations. The CRM leans
on a model that, although not providing a broad summary of the true underlying
probabilistic phenomenon, in view of its being underparametrized, does nonethe-
less provide structure enabling better control in an experimental situation. In prin-
ciple at least, a model enables us to go further and accommodate greater complex-
ity. Care is needed, but with some skill in model construction, we may hope to
capture some other effects that are necessarily ignored by the rough and ready
up and down designs. The following sections consider some examples.
A. Inclusion of a Stopping Rule


The usual CRM design requires that a given sample size is determined in advance.
However, given the convergence properties of CRM, it may occur in practice
that we appear to have settled on a level before having included the full sample
size n of anticipated patients. In such a case we may wish to bring the study to
an early close, thereby enabling the phase II study to be undertaken more quickly.
One possible approach suggested by O’Quigley et al. (15) would be to use
the Bayesian intervals, (ψ_j^−, ψ_j^+), for the probability of toxicity, ψ(x_{j+1}, â_j), at the currently recommended level and, when this interval falls within some pre-
specified range, we stop the study. Another approach would be to stop after some
fixed number of subjects have been treated at the same level. Such designs were
used by Goodman et al. (9) and Korn et al. (11) and have the advantage of great
simplicity. The properties of such rules remain to be studied, and we do not
recommend their use at this point.
One stopping rule that has been studied in detail (18) is described here.
The idea is based on the convergence of CRM and that as we reach a plateau,
the accumulating information can enable us to quantify this notion. Specifically,
given Ω j we would like to say something about the levels at which the remaining
patients, j + 1 to n, are likely to be treated. The quantity we are interested in is

𝒫_{j,n} = Pr{x_{j+1} = x_{j+2} = · · · = x_{n+1} | Ω_j}

In words, 𝒫_{j,n} is the probability that x_{j+1} is the dose recommended to all remaining patients in the trial and is the final recommended dose. Thus, to find 𝒫_{j,n} one needs to determine all the possible outcomes of the trial based on the results known for the first j patients.
The following algorithm achieves the desired result.
1. Construct a complete binary tree on 2^{n−j+1} − 1 nodes corresponding to all possible future outcomes (y_{j+1}, . . . , y_n). The root is labeled with x_{j+1}.
2. Assuming that y_{j+1} = 1, compute the value of â_{j+1}. Determine the dose level that would then be recommended to the next included patient. Label the left child of the root with this dose level.
3. Repeat step 2, this time with y_{j+1} = 0. Determine the dose level that would then be recommended to the next included patient. Label the right child of the root with this dose level.
4. Label the remaining nodes of the tree level by level, as in the preceding
steps.
We use the notation T j,n to refer to the tree constructed with this algorithm.
Each path in T j,n that starts at the root and ends at a leaf whose nodes all have
the same label represents a trial where the recommended dose is unchanged be-
tween the (j + 1)st and the (n + 1)st patient. The probability of each such path is given by

{R(x_{j+1})}^τ {1 − R(x_{j+1})}^{n−j−τ}

where τ is the number of toxicities along the path. Since the paths are mutually exclusive, we can sum the probabilities of all such paths to obtain 𝒫_{j,n}.
Using ψ{x_{j+1}, â_j}, the current estimate of the probability of toxicity at x_{j+1}, we may estimate the probability of each path by

[ψ(x_{j+1}, â_j)]^τ [1 − ψ(x_{j+1}, â_j)]^{n−j−τ}

Adding up these path estimates yields an estimate of 𝒫_{j,n}. Details are given in
O’Quigley and Reiner (18).
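A direct, if brute-force, way to realize this calculation is to enumerate the future outcome paths recursively, pruning any path on which the recommendation moves away from x_{j+1}. The sketch below (Python; a maximum likelihood version with the power working model and an illustrative skeleton, all names ours) assumes that heterogeneity has already been observed so that the likelihood can be maximized on the interior of the parameter space.

```python
import math

ALPHA = [0.05, 0.10, 0.20, 0.30, 0.50, 0.70]   # illustrative skeleton
THETA = 0.20

def psi(level, a):
    return ALPHA[level] ** a

def log_lik(a, levels, tox):
    return sum(y * math.log(psi(x, a)) + (1 - y) * math.log(1 - psi(x, a))
               for x, y in zip(levels, tox))

def mle(levels, tox, lo=1e-3, hi=20.0, iters=120):
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if log_lik(c, levels, tox) < log_lik(d, levels, tox):
            a = c
        else:
            b = d
    return (a + b) / 2

def recommend(levels, tox):
    a_hat = mle(levels, tox)
    return min(range(len(ALPHA)), key=lambda i: abs(psi(i, a_hat) - THETA))

def prob_settled(levels, tox, n):
    """Estimate of P_{j,n}: the probability that the currently recommended level
    is recommended to every remaining patient and is the final recommendation.
    Each path is weighted using the current estimate psi(x_{j+1}, a_hat),
    as in the estimate described in the text."""
    x_next = recommend(levels, tox)
    p_tox = psi(x_next, mle(levels, tox))

    def walk(lv, tx, prob):
        if len(lv) == n:
            return prob
        total = 0.0
        for y, w in ((1, p_tox), (0, 1.0 - p_tox)):
            lv2, tx2 = lv + [x_next], tx + [y]
            if recommend(lv2, tx2) == x_next:        # path still on the same level
                total += walk(lv2, tx2, prob * w)
        return total

    return walk(list(levels), list(tox), 1.0)

if __name__ == "__main__":
    # Hypothetical data after 12 of n = 16 planned patients.
    levels = [0]*3 + [1]*3 + [2]*6
    tox    = [0]*3 + [0]*3 + [0, 1, 0, 1, 0, 0]
    print(round(prob_settled(levels, tox, n=16), 3))
```

For the handful of patients remaining near the end of a trial the enumeration involves at most 2^{n−j} paths, so the computation is immediate; O'Quigley and Reiner (18) describe the exact tree construction.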

B. Patient Heterogeneity
As in other types of clinical trials we are essentially looking for an average effect.
Patients of course differ in the way they may react to a treatment, and although
hampered by small samples, we may sometimes be in a position to specifically
address the issue of patient heterogeneity. One example occurs in patients with
acute leukemia where it has been observed that children will better tolerate more
aggressive doses (standardized by their weight) than adults. Likewise, heavily
pretreated patients are more likely to suffer from toxic side effects than lightly
pretreated patients. In such situations we may wish to carry out separate trials
for the different groups to identify the appropriate MTD for each group. Other-
wise we run the risk of recommending an ‘‘average’’ compromise dose level,
too toxic for a part of the population and suboptimal for the other. Usually, clini-
cians carry out two separate trials or split a trial into two arms after encountering
the first dose limiting toxicities (DLTs) when it is believed that there are two
distinct prognostic groups. This has the disadvantage of failing to use information
common to both groups. A two-sample CRM has been developed so that only
one trial is carried out based on information from both groups (20). A multisample
CRM is a direct generalization, although we must remain realistic in terms of
what is achievable in the light of the available sample sizes.
Let I, taking value 1 or 2, be the indicator variable for the two groups.
Otherwise, we use the same notation as previously defined. For clarity, we sup-
pose that the targeted probability is the same in both groups and is denoted by
θ, although this assumption is not essential to our conclusions.
The dose–toxicity model is now the following:

Pr(Y = 1 | X = x, I = 1) = ψ_1(x, a)
Pr(Y = 1 | X = x, I = 2) = ψ_2(x, a, b)
Parameter b measures to some extent the difference between the groups. The functions ψ_1 and ψ_2 are selected in such a way that for each θ ∈ (0, 1) and each dose level x there exists (a_0, b_0) satisfying ψ_1(x, a_0) = θ and ψ_2(x, a_0, b_0) = θ. This condition is satisfied by many function pairs. The following model has performed well in simulations:

ψ_1(x, a) = exp(a + x) / {1 + exp(a + x)},  ψ_2(x, a, b) = b exp(a + x) / {1 + b exp(a + x)}

There are many other possibilities, an obvious generalization of the model of O'Quigley et al. (15) arising from Eq. (1), in which Eq. (2) applies to group 1 and

α_i = {tanh(d_i + b) + 1}/2,  (i = 1, . . . , k)    (11)

to group 2. A nonzero value for b indicates group heterogeneity. Let z_k = (x_k, y_k, I_k), k = 1, . . . , j, be the outcomes of the first j patients, where I_k indicates to which group the kth subject belongs, x_k is the dose level at which the kth subject is tested, and y_k indicates whether or not the kth subject suffered a toxic response. To estimate the two parameters, one can use a Bayesian estimate or maximum likelihood estimate as for a traditional CRM design. On the basis of the observations z_k, (k = 1, . . . , j) on the first j_1 patients in group 1 and j_2 patients in group 2 (j_1 + j_2 = j), we can write down the likelihood as

ℒ(a, b) = Π_{i=1}^{j_1} ψ_1(x_i, a)^{y_i} {1 − ψ_1(x_i, a)}^{1−y_i} × Π_{i=j_1+1}^{j} ψ_2(x_i, a, b)^{y_i} {1 − ψ_2(x_i, a, b)}^{1−y_i}

If we denote by (â_j, b̂_j) the values of (a, b) maximizing this likelihood after the inclusion of j patients, then the estimated dose–toxicity relations are ψ_1(x, â_j) and ψ_2(x, â_j, b̂_j), respectively.
If the (j + 1)th patient belongs to group 1, he or she will be allocated the dose level that minimizes |ψ_1(x_{j+1}, â_j) − θ|, with x_{j+1} ∈ {d_1, . . . , d_k}. On the other hand, if the (j + 1)th patient belongs to group 2, the recommended dose level minimizes |ψ_2(x_{j+1}, â_j, b̂_j) − θ|.
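The following sketch (Python) illustrates the two-sample likelihood using the logistic pair ψ_1 and ψ_2 given above; the standardized dose labels, the crude grid search used to maximize over (a, b), and the example data are all hypothetical choices made purely for illustration.

```python
import math

DOSES = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0]     # hypothetical standardized dose labels
THETA = 0.20

def psi1(x, a):
    """Group 1 dose-toxicity model from the text."""
    e = math.exp(a + x)
    return e / (1.0 + e)

def psi2(x, a, b):
    """Group 2 model; b measures the between-group difference."""
    e = math.exp(a + x)
    return b * e / (1.0 + b * e)

def log_lik(a, b, data):
    """data: list of (dose, toxicity, group) triples, group in {1, 2}."""
    ll = 0.0
    for x, y, grp in data:
        p = psi1(x, a) if grp == 1 else psi2(x, a, b)
        p = min(max(p, 1e-12), 1 - 1e-12)       # guard the logarithms
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

def fit(data):
    """Coarse grid search for (a_hat, b_hat); adequate for a sketch, not for production."""
    a_grid = [i / 50.0 for i in range(-300, 301)]            # a in [-6, 6]
    b_grid = [math.exp(i / 20.0) for i in range(-60, 61)]    # b > 0 on a log grid
    return max(((a, b) for a in a_grid for b in b_grid),
               key=lambda ab: log_lik(ab[0], ab[1], data))

def recommend(data, group):
    a_hat, b_hat = fit(data)
    f = (lambda x: psi1(x, a_hat)) if group == 1 else (lambda x: psi2(x, a_hat, b_hat))
    return min(range(len(DOSES)), key=lambda i: abs(f(DOSES[i]) - THETA))

if __name__ == "__main__":
    # Hypothetical mixed-group data: (dose, toxicity, group)
    data = [(-3, 0, 1), (-3, 0, 2), (-2, 0, 1), (-2, 1, 2),
            (-1, 0, 1), (-1, 1, 1), (-1, 1, 2), (0, 1, 1)]
    print(recommend(data, group=1), recommend(data, group=2))
```

In practice the maximization would be handed to a proper optimizer, and the design would include a shared initial escalation stage of the kind illustrated in Fig. 4.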
The trial is carried out as usual: After each inclusion, our knowledge of the probabilities of toxicity at each dose level for either group is updated via the parameters. It has been shown that, under some conditions, the recommendations will converge to the right dose level for both groups and the estimates will converge to the true probabilities of toxicity at these two levels. Note that it is not necessary that the two sample sizes be balanced, nor that entry into the study alternate between the groups.
Figure 4 Results of a simulated trial for two groups.

Figure 4 illustrates a simulated trial carried out with a two-parameter model.


Implementation was based on likelihood estimation, necessitating nontoxicities
and a toxicity in each group before the model could be fully fit. Before this, dose-
level escalation followed an algorithm incorporating grade information parallel-
ing that of Section III.C. The design called for shared initial escalation, that is,
the groups were combined until evidence of heterogeneity began to manifest it-
self. The first DLT occurred in group 2 for the fifth included patient. At this point, the trial was split into two arms, group 2 recommendation being based on ℒ(â, 0) and ψ_2(x, â, 0) and group 1 continuing without a model as in Section III.C. Note that there are many possible variants on this design. The first DLT in group 1 was encountered at dose level 6 and led to a lower level being recommended to the next patient to be included. For the remainder of the study, allocation for both
groups leaned on the model together with the minimization algorithms described
above.

C. Pharmacokinetic Studies
Statistical modeling of the clinical situation of phase I dose-finding studies, such
as takes place with the CRM, is relatively recent. Much more fully studied in the
phase I context are pharmacokinetics and pharmacodynamics. Roughly speaking,


pharmacokinetics deals with the study of concentration and elimination character-
istics of given compounds in specified organ systems, most often blood plasma,
whereas pharmacodynamics focuses on how the compounds affect the body. This
is a vast subject referred to as PK/PD modeling.
Clearly, such information will have a bearing on whether or not a given
patient is likely to encounter dose-limiting toxicity or, in retrospect, why some
patients and not others were able to tolerate some given dose. There are many
parameters of interest to the pharmacologist, for example, the area under the con-
centration time curve, the rate of clearance of the drug, and the peak concentration.
For our purposes, a particular practical difficulty arises in the phase I con-
text in which any such information only becomes available once the dose has
been administered. Most often then, the information will be of most use in terms
of retrospectively explaining the toxicities. However, it is possible to have phar-
macodynamic information and other patient characteristics relating to the pa-
tient’s ability to synthesize the drugs, available before selecting the level at which
the patient should be treated.
In principle we can write down any model we care to hypothesize, say one
including all the relevant factors believed to influence the probability of encoun-
tering toxicity. We can then proceed to estimate the parameters. However, as in
the case of patient heterogeneity, we must remain realistic in terms of what can
be achieved given the maximum obtainable sample size. Some pioneering work
has been carried out here by Piantadosi et al. (22), indicating the potential for
improved precision by the incorporation of pharmacokinetic information. This is
a large field awaiting further exploration.
The strength of CRM is to locate with relatively few patients the target
dose level. The remaining patients are then treated at this same level. A recom-
mendation is made for this level. Further studies, following the phase I clinical
study, can now be made, and this is where we see the main advantage of pharma-
cokinetics. Most patients will have been studied at the recommended level and
a smaller number at adjacent levels. At any of these levels, we will have responses
and a great deal of pharmacokinetic information. The usual models, in particular
the logistic model, can be used to see if this information helps explain the toxici-
ties. If so, we may be encouraged to carry out further studies at higher or lower
levels for certain patient profiles, indicated by the retrospective analysis to have
probabilities of toxicity much lower or much higher than suggested by the average
estimate. This can be viewed as the fine tuning and may itself give rise to new
more highly focused phase I studies.
At this point we do not see the utility of a model in which all the different
factors are included as regressors. These further analyses are necessarily very
delicate, requiring great statistical and/or pharmacological skill, and a mechanis-
tic approach based on a catch-all model is probably to be advised against.
D. Graded Toxicities
Although we refer to dose-limiting toxicities as a binary (0,1) variable, most
studies record information on the degree of toxicity, from 0, complete absence
of side effects, to 4, life-threatening toxicity. The natural reaction for a statistician
is to consider that the response variable, toxicity, has been simplified when going
from five levels to two and that it may help to use models accommodating multi-
level responses.
In fact this is not the way we believe that progress is to be made. The issue
is not that of modeling a response (toxicity) at five levels but of controlling for
dose-limiting toxicity, mostly grade 4 but possibly also certain kinds of grade 3.
Lower grades are helpful in that their occurrence indicates that we are approaching
a zone in which the probability of encountering a dose-limiting toxicity is becom-
ing large enough to be of concern. This idea is used implicitly in the two-stage
designs described in Section III. If we are to proceed more formally and hopefully
extract yet more information from the observations, then we need models relating
the occurrence of dose-limiting toxicities to the occurrence of lower grade toxicities.
In the unrealistic situation in which we can accurately model the ratio of
the probabilities of the different types of toxicity, we can make striking gains in
efficiency since the more frequently observed lower grade toxicities carry a great
deal of information on the potential occurrence of dose-limiting toxicities. Such
a situation would also allow gains in safety since, at least hypothetically, it would
be possible to predict at some level the rate of occurrence of dose-limiting toxici-
ties without necessarily having observed very many, the prediction leaning
largely on the model. At the opposite end of the model/hypothesis spectrum, we
might decide we know nothing about the relative rates of occurrence of the differ-
ent toxicity types and simply allow the accumulating observations to provide the
necessary estimates. In this case it turns out that we neither lose nor gain effi-
ciency, and the method behaves identically to one in which the only information
we obtain is whether or not the toxicity is dose limiting. These two situations
suggest a middle road, using a Bayesian prescription, in which very careful mod-
eling can lead to efficiency improvements, if only moderate, without making
strong assumptions.
To make this more precise, let us consider the case of three toxicity levels, the highest being dose limiting. Let Y_k denote the graded toxic response of the kth included patient, taking the values 1, 2, or 3. The goal of the trial is still to identify a level of dose whose probability of severe toxicity is closest to a given percentile of the dose–toxicity curve. A working model for the CRM could be

Pr(Y_k = 3) = ψ_1(x_k, a)
Pr(Y_k = 2 or Y_k = 3) = ψ_2(x_k, a, b)
Pr(Y_k = 1) = 1 − ψ_2(x_k, a, b)
The contributions to the likelihood are 1 − ψ_2(x_k, a, b) when Y_k = 1, ψ_1(x_k, a) when Y_k = 3, and ψ_2(x_k, a, b) − ψ_1(x_k, a) when Y_k = 2.
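For completeness, the likelihood contributions just listed translate directly into code (Python); the particular working models ψ_1 and ψ_2 used in the example at the bottom are hypothetical choices, picked only so that ψ_2 ≥ ψ_1 holds at every level.

```python
import math

def loglik_graded(data, psi1, psi2):
    """Log-likelihood for three ordered toxicity levels, using the contributions
    given above: P(Y=3) = psi1, P(Y=2 or Y=3) = psi2, P(Y=1) = 1 - psi2.

    data: list of (x, y) with y in {1, 2, 3}; psi1(x), psi2(x) are callables
    already evaluated at the current parameter values."""
    ll = 0.0
    for x, y in data:
        p3 = psi1(x)
        p23 = psi2(x)
        if y == 3:
            contrib = p3
        elif y == 2:
            contrib = p23 - p3            # P(Y = 2)
        else:
            contrib = 1.0 - p23           # P(Y = 1)
        ll += math.log(max(contrib, 1e-12))
    return ll

if __name__ == "__main__":
    # Hypothetical working models: psi1 = alpha^a for dose-limiting toxicity and
    # psi2 = alpha^(a*b) with 0 < b < 1 for toxicity of grade 2 or worse.
    alpha = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.30, 5: 0.50, 6: 0.70}
    a, b = 1.0, 0.5
    psi1 = lambda x: alpha[x] ** a
    psi2 = lambda x: alpha[x] ** (a * b)
    print(round(loglik_graded([(3, 1), (3, 2), (4, 3)], psi1, psi2), 3))
```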
With no prior information and maximizing the likelihood, we obtain exactly
the same results as with the more usual one-parameter CRM. This is due to the
parameter orthogonality. There is therefore no efficiency gain, although there is
the advantage of learning about the relationship between the different toxicity
types.
Let us imagine that the parameter b is known precisely. The model need
not be correctly specified, although b should maintain interpretation outside the
model, for instance some simple function of the ratio of grade 3 to grade 2 toxici-
ties. Efficiency gains are then quite substantial (21). This is not of direct practical
interest since the assumption of no error in b is completely inflexible; that is, should we be wrong in our assumed value, a noncorrectable bias is induced.
To overcome this we have investigated a Bayesian setup in which we use prior
information to provide a ‘‘point estimate’’ for b but having uncertainty associated
with it, expressed via a prior distribution. Errors can then be overwritten by the
data. This work is incomplete at the time of this writing, but early results are
encouraging.

VI. RELATED APPROACHES

A number of other designs for phase I studies have been suggested. The modified
continual reassessment method, as its name suggests, is clearly related to CRM.
It is described below. Schemes that predate the CRM, such as those of Anbar (2–4) and Wu (32,33), which lean on stochastic approximation, are in fact quite closely related.

A. Stochastic Approximation
The problem is considered from the angle of estimating the root of a regression
function M(x) from observations (x_1, y_1), . . . , (x_n, y_n), where (x_i, y_i) satisfy the relation

y_i = M(x_i) + σε_i

The errors ε_1, . . . , ε_n are i.i.d. random variables with mean zero and unit variance. Let θ_0 be a real value and ξ_0 be the solution of M(x) = θ_0. We are interested in the sequential determination of the design values x_1, . . . , x_n, so that ξ_0 can be estimated consistently and efficiently from the corresponding observations y_1, . . . , y_n. This problem has its application in both industrial experiments and
medical research. Following the pioneering work of Robbins and Monro (24),
stochastic approximation has been applied to this problem and has been studied
by many authors.
More precisely, the Robbins–Monro procedure calculates the design values sequentially according to

x_{n+1} = x_n − (c/n)(y_n − θ_0)    (12)

where c is some constant. Lai and Robbins (1979) pointed out the connection between Eq. (12) and the following procedure based on ordinary linear regression applied to (x_1, y_1), . . . , (x_n, y_n):

x_{n+1} = x_n − β̂_n^{−1}(y_n − θ_0)    (13)

where β̂_n is the least-squares estimate of β. Wu (33) gave a heuristic argument
that Eq. (12) is equivalent to Eq. (13). Under certain circumstances, stochastic
approximation can be considered as fitting an ordinary linear model to the existing
data and then treating the regression line as an approximation of M(⋅) and using
it to calculate the next design point. Anbar (2,3) used Robbins–Monro in the
context of phase I dose finding, sequentially estimating the slope. Considerably
more observations were required to achieve relative stability than for the CRM
(16), although it was clear that such designs were superior to the standard up
and down design.
As pointed out by Wu (32), the Robbins–Monro procedure is unstable at
places where the function M(x) is flat. Wu (32) then proposed truncating β̂ n when-
ever it becomes too large. This stabilizes the estimate of β. We believe, however,
that the intrinsic source of instability lies in the use of a model with two parame-
ters (the intercept and the slope) in the estimation of a single root. Imagine that
after many experiments the design points have concentrated around the root ξ 0.
With relatively few points outside the small region around the root, there are
infinitely many pairs that fit the data equally well, since for every given intercept
we can find a slope passing through the observations. It is then quite possible
for the estimate of β to be unstable if a regression line is fitted to data with both
intercept and slope, irrespective of whether M(·) is flat or not.
It should be possible to estimate ξ 0 by fitting the data with a linear model
without an intercept. Intuitively, a one-parameter model would be sufficient for
determining ξ 0 if most of the design values are around it. The estimate of the
slope then becomes relatively stable. The main concern is the consistency of the
estimate since the model becomes less flexible and thus it may be more difficult
for it to capture the nature of the data.
However, consider the simple case in which the data are generated according to an ordinary linear regression (M(x) = α + βx):

Y = α + βX + σε,  α > 0,  β > 0,  X > 0    (14)
Then ξ_0 is the solution of the equation α + βx = θ_0 > 0. Suppose that we have collected data (x_1, y_1), . . . , (x_n, y_n); then we can fit the underparameterized regression line going through the origin. This results in an estimate of the slope β̂_n = ȳ_n/x̄_n, where x̄_n = Σx_i/n and ȳ_n = Σy_i/n. Solving β̂_n x = θ_0 yields the recommended design value for the next experiment:

x_{n+1} = θ_0/β̂_n = θ_0 x̄_n/ȳ_n    (15)

This process is repeated after observation of y_{n+1}. Note that the model used by the procedure is different from that generating the data.
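The two updating rules can be compared directly in simulation. The sketch below (Python; the constants and the linear M(x) are arbitrary illustrative choices) implements the Robbins–Monro recursion of Eq. (12) alongside the no-intercept update of Eq. (15); both sequences of design points should drift toward the root ξ_0 = (θ_0 − α)/β.

```python
import random

def robbins_monro(m, theta0, x1, c, n, sigma=0.02, seed=1):
    """Eq. (12): x_{n+1} = x_n - (c/n)(y_n - theta0), with y_n = m(x_n) + noise."""
    rng = random.Random(seed)
    x = x1
    for i in range(1, n + 1):
        y = m(x) + sigma * rng.gauss(0, 1)
        x = x - (c / i) * (y - theta0)
    return x

def no_intercept_update(m, theta0, x1, n, sigma=0.02, seed=1):
    """Eq. (15): fit a line through the origin, beta_hat = ybar/xbar, and solve
    beta_hat * x = theta0, giving x_{n+1} = theta0 * xbar_n / ybar_n."""
    rng = random.Random(seed)
    xs, ys, x = [], [], x1
    for _ in range(n):
        y = m(x) + sigma * rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
        x = theta0 * (sum(xs) / len(xs)) / (sum(ys) / len(ys))
    return x

if __name__ == "__main__":
    # Data generated from M(x) = alpha + beta*x with alpha, beta > 0, as in Eq. (14).
    alpha, beta, theta0 = 0.05, 0.10, 0.30
    m = lambda x: alpha + beta * x
    print("true root xi_0:", (theta0 - alpha) / beta)
    print("Robbins-Monro after 200 steps:", round(robbins_monro(m, theta0, x1=1.0, c=5.0, n=200), 3))
    print("no-intercept update after 200 steps:", round(no_intercept_update(m, theta0, x1=1.0, n=200), 3))
```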
The design point x_n and the average x̄_n can both serve as estimates of ξ_0. These estimates are called consistent if they converge to ξ_0 almost surely as n goes to infinity. The definition of x̄_n implies that its consistency is equivalent to that of x_n. The conditions for consistency have been identified by Shen and O'Quigley (25), and the main arguments are close to those showing the consistency of the CRM.

B. Modified, Extended, and Restricted CRM


The operating characteristics of CRM depend, as has been outlined in Section
II, on certain arbitrary specifications. For any situation or class of situations we
can obtain the operating characteristics to decide on the most appropriate design
within the class of CRM designs. Note that once the working model and so forth
have been specified, the levels recommended by the method are entirely determin-
istic. It is enough to specify given paths of visited levels and the associated re-
sponses to know exactly which level will be recommended by CRM. If we wish
to leave the paths as random, depending on the possible outcomes, then the appro-
priate tool to use is that of simulation. This would be analogous to calculating
sample size, degree of balance, and stratification issues in the more common
design context of randomized phase III studies. There is enough flexibility in the
CRM to obtain any reasonable characteristics we might believe necessary. There
is therefore no cause to be concerned about unanticipated, erratic, or aberrant
behavior. Before we carry out any given trial, we can put ourselves in a position
of knowing just what the behavior will be when faced with some particular cir-
cumstance.
The modified continual reassessment method (7) was developed to deal
with perceived problems in operating characteristics, in particular the possibility
of jumping dose levels. However, such problems do not arise when CRM is
correctly implemented (Section III.B). Correct implementation is preferable to
using schemes with ad hoc design modifications resulting in potentially poor
operating characteristics. Before considering this method, sometimes referred to
as MCRM, it is important to underline that the perceived difficulties arise from
an implementation of CRM differing from that described in this current work as
well as in the original paper by O’Quigley et al. (15). CRM does not work with
actual doses, as does Faries. As described by O’Quigley et al. (15), O’Quigley and
Shen (17), and Shen and O’Quigley (25), the ‘‘doses’’ are conceptual, ordered by
increasing probabilities of toxic reaction (see also Sect. II). In addition, if we are
to use a Bayesian implementation of CRM and our prior knowledge is weak,
then this must be reflected in the choice of prior. The priors selected by Faries
(7), in the context of a one-parameter model where the dose is now the real dose,
are informative. It should be no surprise at all that ‘‘CRM,’’ in such circum-
stances, does not behave as we might hope.
The consequences of the particular setup for CRM by Faries (7) were two-
fold: It was possible to observe dose escalation after an observed toxicity and it
was possible to observe large skips in the dose levels after nontoxicities. Faries
suggested fixing these awkward operating characteristics by continuing with the
same prescription but, rather than use the recommended level, introduce two
further rules. The first of these is that dose escalation after an observed toxicity
is not allowed. The method’s prescription is overruled, and we allocate at the
same level. The second rule is that escalation should never be more than a single
level so that ‘‘skipping’’ is also overruled when indicated by the method. The
new modified method is called MCRM.
For the standard case of six dose levels studied by O’Quigley et al. (15),
MCRM fixes a problem that does not exist under the usual guidelines. The usual
CRM will never recommend escalation after an observed toxicity and will never
skip doses when escalating. We may conclude that there is then little to choose
between CRM and MCRM. However, working with the actual doses as does
MCRM can be problematic. MCRM, as can be seen by studying Fig. 1 of Faries
(7), can be unstable. Essentially MCRM works with a different dose–toxicity
function, lacking the flexibility we may need, and although large sample behavior
should be similar to CRM, for small finite samples the relatively steep dose–
toxicity working models of Section II.A may be required to dampen oscillations
and reduce instability. Note that none of the comparative studies in Tables 4,
5, and 6 of Faries compare MCRM with CRM. A comparative study of MCRM
and CRM, to our knowledge, has yet to be carried out. Our few limited studies
indicate that MCRM performs poorly when compared with CRM.
For studies with a large number of dose levels, 12 or more (see 13), the
problem of skipping really could arise. However, the apparently alarming exam-
ple that Moller quotes in which CRM recommends treating the second entered
patient at level 10, following a nontoxicity for the first patient at level 1, would
as with Faries, never arise had CRM been implemented according to O’Quigley
et al. (15). To prevent such an occurrence, Moller also suggests the same rule as
Faries. This design is called restricted CRM, but as described above for MCRM, it
addresses problems equally well solved within the framework of the standard
setup.
It is easy to understand what is going on from Fig. 1 of Moller (13). The
middle curve is the working model, and had this been chosen according to the
prescription of O’Quigley et al. (15), then the horizontal line at the target θ ⫽
0.4 would meet the lowest level rather than level 10 as in the figure. Essentially
this particular Bayesian setup expresses the idea that we believe the correct level
to be level 10, although there is some uncertainty in this quantified by our prior.
It is therefore perfectly natural that integrating some further small piece of infor-
mation (the first patient treated at the lowest level and tolerating the treatment),
we then recommend level 10 to patient 2. The posterior and the prior are almost
indistinguishable (see also Sect. III.B and Ahn (1)).
The prior uncertainty should be quantified by the function g(a). In Moller’s
example, the choice of an exponential prior would not seem a good one. The
subsequent attempt to make the exponential prior noninformative by modifying
the conceptual doses is certainly interesting albeit not the easiest way to go and
would need further study before it could be recommended. The comparisons of
the respective performances of restricted CRM and original CRM are based
on this rather unusual setup, certainly not the CRM as described by O’Quigley
et al. (15). Furthermore, these comparisons, unlike others in this area, assume
the particular model to truly generate the data. This is not realistic. For these
reasons, conclusions based on the comparisons should not be given very much
weight.
Skipping doses is an issue that necessitates some thought. It should be
viewed as an operating characteristic of the method. As mentioned earlier, such
operating characteristics depend on the choice of model and given design features
such as the number of dose levels and the value of θ. The model chosen by
O’Quigley et al. in the context of six doses is such that it is not possible to skip
doses when escalating. For larger numbers of doses or other choices of models,
skipping could occur. Certainly we would not skip from level 1 to level 10, as
happened in Moller (13), unless we made this a feature of the design. Typically,
we would not work with such models. However, we may wish to use a model
that could jump a level, for instance, with between 12 and 20 levels and relatively
few subjects. Indeed, since there is always the conceptual possibility of having
intermediary levels, we are of course inevitably ‘‘jumping’’ dose levels. This is
therefore not a real issue, and the relevant issue is to understand at the trial outset
just how we proceed through the available levels, given different models. This
can be relatively rapid or relatively slow, depending on initial design and the
chosen model. If the experimenter wishes to modify these properties, then this
can be done by changing the model. This is in our view to be preferred over
‘‘modified,’’ ‘‘restricted,’’ or other ‘‘improved’’ CRM designs where rules of
thumb, having their inspiration largely from the old, and very poorly behaved,
up and down schemes are grafted onto a methodology with a more solid founda-
tion and whose operating characteristics can be anticipated. This is not to say
that improvements cannot be made. But they are unlikely to come from ad hoc
improvisations that have, unfortunately, dominated the area of phase I trial design
for many years.
Moller (13) also refers to extended CRM. This is a particular case of
the two-stage designs described in Section III.C. As we pointed out there, we
believe the two-stage design to be the most flexible and generally applicable
design.

VII. BAYESIAN APPROACHES

The CRM is often referred to as the Bayesian alternative to the classic up and
down designs used in phase I studies. Hopefully Section III makes it clear that
there is nothing especially Bayesian about the CRM (see also 30). In O’Quigley
et al. (15), for the sake of simplicity, Bayesian estimators and vague priors were
proposed. However, there is nothing to prevent us from working with other esti-
mators, in particular the maximum likelihood estimator.
More fully Bayesian approaches, and not simply a Bayesian estimator, have
been suggested for use in the context of phase I trial designs. By more fully
we mean more in the Bayesian spirit of inference, in which we quantify prior
information, observed from outside the trial and that solicited from clinicians
and/or pharmacologists. Decisions are made more formally using tools from deci-
sion theory. Any prior information can subsequently be incorporated via the
Bayes formula into a posterior density that also involves the actual current obser-
vations. Given the typically small sample sizes often used, a fully Bayesian ap-
proach has some appeal in that we would not wish to waste any relevant informa-
tion at hand. Unlike the setup described by O’Quigley et al. (15) we could also
work with informative priors.
Gatsonis and Greenhouse (8) consider two-parameter probit and logit mod-
els for dose response and study the effect of different prior distributions.
Whitehead and Williamson (31) carried out similar studies but with attention
focusing on logistic models and beta priors. Whitehead and Williamson (31)
worked with some of the more classic notions from optimal design for choosing
the dose levels in a bid to establish whether much is lost by using suboptimal
designs. O’Quigley et al. (15) ruled out criteria based on optimal design due to
the ethical criterion of the need to attempt to assign the sequentially included
patients at the most appropriate level for the patient. This same point was also
emphasized by Whitehead and Williamson (31).
Whitehead and Williamson (31) suggest that CRM could be viewed as a
special case of their designs with their second parameter being assigned a degen-
erate prior and thereby behaving as a constant. Although in some senses this
view is technically correct, it can be misleading in that for the single sample
case, two-parameter CRM and one-parameter CRM are fundamentally different.
Two-parameter CRM was seen to behave poorly (15) and is generally inconsis-
tent (25). We have to view the single parameter as necessary in the homogeneous
case and not simply a parametric restriction to facilitate numerical integration
(see also Sect. VI.A). This was true even when the data were truly generated by
the same two-parameter model, an unintuitive finding at first glance but one that
makes sense in the light of the comments in Section VI.A.
We do not believe it will be possible to demonstrate large sample consis-
tency for either the Gatsonis and Greenhouse (8) approach or that of Whitehead
and Williamson (31) as was done for CRM by Shen and O’Quigley (25). It may
well be argued that large sample consistency is not very relevant in such typically
small studies. However, it does provide some theoretical comfort and hints that
for finite samples things might work out okay too. At the very least, if we fail
to achieve large sample consistency, then we might carry out large numbers of
finite samples studies, simulating most often under realities well removed from
our working model. This was done by O’Quigley et al. (15) for the usual under-
parametrized CRM. Such comparisons remain to be done for the Bayesian meth-
ods discussed here. Nonetheless, we can conclude that a judicious choice of
model and prior, not running into serious conflict with the subsequent observa-
tions, may help inference in some special cases.
A quite different Bayesian approach has been proposed by Babb et al. (5).
The context is fully Bayesian. Rather than concentrate experimentation at some
target level as does CRM, the aim here is to escalate as fast as possible toward
the MTD while sequentially safeguarding against overdosing. This is interesting
in that it could be argued that the aim of the approach translates in some ways
more directly the clinician’s objective than does CRM. Model misspecification
was not investigated and would be an interesting area for further research. The
approach appears promising, and the methodology may be a useful modification
of CRM when primary concern is on avoiding overdosing and we are in a position
to have a prior on a two-parameter function. As above, there may be concerns
about large sample consistency when working with a design that tends to settle
on some level.
A promising area for Bayesian formulations is one where we may have
little overall knowledge of any dose–toxicity relationship but we may have some,
possibly considerable, knowledge of some secondary aspect of the problem. Con-
sider the two-group case. For the actual dose levels we are looking for we may
know almost nothing. Uninformative Bayes or maximum likelihood would then
seem appropriate. But we may well know that one of the groups, for example,
a group weakened by extensive prior therapy, is most likely to have a level strictly
less than that for the other group. Careful parametrization would enable this infor-
mation to be included as a constraint. However, rather than work with a rigid
and unmodifiable constraint, a Bayesian approach would allow us to specify the
anticipated direction with high probability while enabling the accumulating data
to override this assumed direction if the two run into serious conflict. Exactly
the same idea could be used in a case where we believe there may be group
heterogeneity but that it is very unlikely that the correct MTDs differ by more than
a single level. Incorporating such information will improve efficiency. As already
mentioned in Section V, under some strong assumptions, we can achieve clear
efficiency gains, when incorporating information on the graded toxicities. Such
gains can be wiped out and even become negative, through bias, when the as-
sumptions are violated. Once again, a Bayesian setup opens up the possibility of
compromise so that constraints become modifiable in the light of accumulating
data.

VIII. RESEARCH DIRECTIONS

There are many outstanding questions, both practical and theoretical, concerning
CRM. We do not believe that ad hoc modifications such as MCRM will be fruit-
ful, and we do not therefore see this as a useful research area. Nonetheless, op-
erating characteristics depend to some extent on arbitrary specification parameters
such as the chosen model, and it would clearly be helpful to have a better under-
standing of this. A deep understanding of how certain types of changes to the
model correspond to certain types of changes in operating characteristics could
ultimately lead to selecting models on the basis of some particularly desired be-
havior, for instance, relatively rapid escalation to the higher levels or models that
require accumulating a lot of precision on the safety of some particular level
before further escalation.
Very often we do not know the number of levels that may be used. In such
cases the model cannot be determined in advance. In our own applied work, we
have improvised when faced with such situations via two-stage designs, with the
model being constructed at the beginning of the second stage. A more systematic
approach may be useful.
Regression models have been suggested to address issues of patient hetero-
geneity (26). This appears natural, but given the typically small sample sizes
and the implementation algorithms via underparametrized models, great care is
needed. Deeper studies showing how and in which situations advantage can be
drawn from a regression model are needed. PK/PD information, some or all of
which may not be available before deciding on the appropriate dose level for the
patient, raises particular methodological questions.
Within-patient dose escalation is frequently undertaken in practice but not
analyzed as such. Indeed, regulatory agencies sometimes disallow the inclusion
of information on doses other than that at which the patient was first treated.
Nonetheless, the patient, even if off study, will continue to be treated to the
clinician’s best ability and often at doses higher than that initially given. Model-
ing would necessarily be difficult in view of the complex potential relationships
governing the outcomes at different levels for the same patient, whether toxicities
are cumulative or whether cumulative treatment provides some kind of protection.
Finally, Bayesian approaches, in conjunction with careful modeling, open
up many possibilities. Some of these are mentioned in the above paragraph. There
are numerous outstanding theoretical issues to be resolved: Conditions for con-
vergence, inference under misspecified models, optimality, incorporation of ran-
dom allocation, and efficiency of different stopping rules are just some of these.
The most interesting problems, as always, are those arising from practical applica-
tions, and as the method is increasingly used these will require more attention
from statisticians.

REFERENCES

1. Ahn C. An evaluation of phase I cancer clinical trial designs. Stat Med 1998; 17:
1537–1549.
2. Anbar D. The application of stochastic methods to the bioassay problem. J Stat
Planning Inference 1977; 1:191–206.
3. Anbar D. A stochastic Newton-Raphson method. J Stat Planning Inference 1978;
2:153–163.
4. Anbar D. Stochastic approximation methods and their use in bioassay and Phase I
clinical trials. Commun Statist 1984; 13:2451–2467.
5. Babb J, Rogatko A, Zacks S. Cancer Phase I clinical trials: efficient dose escalation
with overdose control. Stat Med 1998; 17:1103–1120.
6. Chevret S. The continual reassessment method in cancer phase I clinical trials: a
simulation study. Stat Med 1993; 12:1093–1108.
7. Faries D. Practical modifications of the continual reassessment method for phase I
cancer clinical trials. J Biopharm Stat 1994; 4:147–164.
8. Gatsonis C, Greenhouse JB. Bayesian methods for phase I clinical trials. Stat Med
1992; 11:1377–1389.
9. Goodman S, Zahurak ML, Piantadosi S. Some practical improvements in the contin-
ual reassessment method for phase I studies. Statist Med 1995; 14:1149–1161.
10. Huber PJ. The behavior of maximum likelihood estimates under nonstandard condi-
tions. Proc 6th Berkeley Symp 1967; 1:221–233.
11. Korn EL, Midthune D, Chen TT, Rubinstein LV, Christian MC, Simon R. A compar-
ison of two Phase I trial designs. Stat Med 1994; 13:1799–1806.
12. Mick R, Ratain MJ. Model-guided determination of maximum tolerated dose
in Phase I clinical trials: evidence for increased precision. JNCI 1993; 85:217–
223.
13. Moller S. An extension of the continual reassessment method using a preliminary
up and down design in a dose finding study in cancer patients in order to investigate
a greater number of dose levels. Stat Med 1995; 14:911–922.
14. O’Quigley J. Estimating the probability of toxicity at the recommended dose follow-
ing a Phase I clinical trial in cancer. Biometrics 1992; 48:853–862.
15. O’Quigley J, Pepe M, Fisher L. Continual reassessment method: a practical design
for Phase I clinical trials in cancer. Biometrics 1990; 46:33–48.
16. O’Quigley J, Chevret S. Methods for dose finding studies in cancer clinical trials:
a review and results of a Monte Carlo study. Stat Med 1991; 10:1647–1664.
17. O’Quigley J, Shen LZ. Continual reassessment method: a likelihood approach. Bio-
metrics 1996; 52:163–174.
18. O’Quigley J, Reiner E. A stopping rule for the continual reassessment method. Bio-
metrika 1998; 85:741–748.
19. O’Quigley J. Another look at two Phase I clinical trial designs (with commentary).
Stat Med 1999; 18:2683–2692.
20. O’Quigley J, Shen L, Gamst A. Two sample continual reassessment method. J Bio-
pharm Stat 1999; 9:17–44.
21. Paoletti X, O’Quigley J. Using Graded Toxicities in Phase I Trial Design. Technical
Report 4, Biostatistics Group, University of California at San Diego. 1998.
22. Piantadosi S, Liu G. Improved designs for dose escalation studies using pharmacoki-
netic measurements. Stat Med 1996; 15:1605–1618.
23. Reiner E, Paoletti X, O’Quigley J. Operating characteristics of the standard phase
I clinical trial design. Comp Stat Data Anal 1998; 30:303–315.
24. Robbins H, Monro S. A stochastic approximation method. Ann Math Stat 1951; 29:
351–356.
25. Shen LZ, O’Quigley J. Consistency of continual reassessment method in dose find-
ing studies. Biometrika 1996; 83:395–406.
26. Simon R, Freidlin B, Rubinstein L, Arbuck S, Collins J, Christian M. Accelerated
titration designs for Phase I clinical trials in oncology. JNCI 1997; 89:1138–
1147.
27. Silvapulle MJ. On the existence of maximum likelihood estimators for the binomial
response models. JR Stat Soc B 1981; 43:310–313.
28. Storer BE. Design and analysis of Phase I clinical trials. Biometrics 1989; 45:925–
937.
29. Storer BE. Small-sample confidence sets for the MTD in a phase I clinical trial.
Biometrics 1993; 49:1117–1125.
30. Storer BE. Phase I clinical trials. Encyclopedia of Biostatistics. New York: Wiley,
1998.
31. Whitehead J, Williamson D. Bayesian decision procedures based on logistic regres-
sion models for dose-finding studies. J Biopharm Stat 1998; 8:445–467.
32. Wu CFJ. Efficient sequential designs with binary data. J Am Stat Assoc 1985; 80:
974–984.
33. Wu CFJ. (1986). Maximum likelihood recursion and stochastic approximation in
sequential designs. In: van Ryzin J, ed. Adaptive Statistical Procedures and Related
Topics. Institute of Mathematical Statistics Monograph 8, Hayward, CA: Institute
of Mathematical Statistics, pp 298–313.
34. Lai TL, Robbins H. Adaptive design and stochastic approximation. Annals of Statis-
tics 1979; 7:1196–1221.
3
Choosing a Phase I Design

Barry E. Storer
Fred Hutchinson Cancer Research Center, Seattle, Washington

I. INTRODUCTION AND BACKGROUND

Although the term phase I is sometimes applied generically to almost any ‘‘early’’
trial, in cancer drug development it usually refers specifically to a dose-finding
trial whose major end point is toxicity. The goal is to find the highest dose of a
potential therapeutic agent that has acceptable toxicity; this dose is referred to
as the maximum tolerable dose (MTD) and is presumably the dose that will be
used in subsequent phase II trials evaluating efficacy. Occasionally, one may
encounter trials that are intermediate between phase I and phase II, referred to
as phase IB trials. This is a more heterogeneous group but typically includes
trials evaluating some measure of biological efficacy over a range of doses that
have been found to have acceptable toxicity in a phase I (or phase IA) trial. This
chapter focuses exclusively on phase I trials with a toxicity end point.
What constitutes acceptable toxicity of course depends on the potential thera-
peutic benefit of the drug. There is an implicit assumption with most anticancer
agents of a positive correlation between toxicity and efficacy, but most drugs that
will be evaluated in phase I trials will prove ineffective at any dose. The problem
of defining an acceptably toxic dose is complicated by the fact that patient response
is heterogeneous: At a given dose, some patients may experience little or no toxicity,
whereas others may have severe or even fatal toxicity. Since the response of the
patient will be unknown before the drug is given, acceptable toxicity is typically
defined with respect to the patient population as a whole. For example, given a
toxicity grading scheme ranging from 0 to 5 (none, mild, moderate, severe, life
threatening, fatal), one might seek the dose where, on average, one out of three
patients would be expected to experience a grade 3 or worse toxicity. The latter
is referred to as ‘‘dose-limiting toxicity’’ (DLT) and does not need to correspond
to a definition of unacceptable toxicity in an individual patient.
When defined in terms of the presence or absence of DLT, the MTD can
be defined as some quantile of a dose–response curve. Notationally, if Y is a
random variable whose possible values are 1 and 0, respectively, depending on
whether a patient does or does not experience DLT, and for dose d we have ψ(d )
= Pr(Y = 1|d ), then the MTD is defined by ψ(d MTD ) = θ, where θ is the desired
probability of toxicity.1
There are two significant constraints on the design of a phase I trial. The
first is the ethical requirement to approach the MTD from below, so that one
must start at a dose level believed almost certainly to be below the MTD and
gradually escalate upward. The second is the fact that the number of patients
typically available for a phase I trial is relatively small, say 15–30, and is not
driven traditionally by rigorous statistical considerations requiring a specified
degree of precision in the estimate of MTD. The pressure to use only small num-
bers of patients is large—literally dozens of drugs per year may come forward
for evaluation, and each combination with other drugs, each schedule, and each
route of administration requires a separate trial. Furthermore, the number of pa-
tients for whom it is considered ethically justified to participate in a trial with
little evidence of efficacy is limited. The latter limitation also has implications
for the relevance of the MTD in subsequent phase II trials of efficacy. Since the
patient populations are different, it is not clear that the MTD estimated in one
population will yield the same result when implemented in another.

II. DESIGNS FOR PHASE I TRIALS

As a consequence of the above considerations, the traditional phase I trial design
uses a set of fixed dose levels that have been specified in advance, that is, d ∈
{d 1, d 2, . . . , d K}. The choice of the initial dose level d 1, and the dose spacing, are
discussed in more detail below. Beginning at the first dose level, small numbers of
patients are entered, typically three to six, and the decision to escalate or not
depends on a prespecified algorithm related to the occurrence of DLT. When a
dose level is reached with unacceptable toxicity, then the trial is stopped.

A. Initial Dose Level and Dose Spacing


The initial dose level is generally derived either from animal experiments, if the
agent in question is completely novel, or by conservative consideration of previ-
ous human experience, if the agent in question has been used before but with a
different schedule, route of administration, or with other concomitant drugs. A
common starting point based on the former is from 1/10 to 1/3 of the mouse
LD 10, the dose that kills 10% of mice, adjusted for the size of the animal on a
per kilogram basis or by some other method.
Subsequent dose levels are determined by increasing the preceding dose
level by decreasing multiples, a typical sequence being {d 1, d 2 = 2d 1, d 3 = 1.67d 2,
d 4 = 1.5d 3, d 5 = 1.4d 4, and thereafter d k+1 = 1.33d k, . . .}. Such sequences are often
referred to as ‘‘modified Fibonacci.’’2 Note that after the first few increments, the
dose levels are equally spaced on a log scale. With some agents, particularly
biological agents, the dose levels may be determined by log, that is, {d 1, d 2 =
10d 1, d 3 = 10d 2, d 4 = 10d 3, . . .}, or approximate half-log, that is, {d 1, d 2 = 3d 1,
d 3 = 10d 1, d 4 = 10d 2, d 5 = 10d 3, . . .}, spacing.
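As a quick illustration, a short function can generate such a sequence; the multipliers below are the ones just quoted, and the starting dose of 100 (arbitrary units) is our own choice. Up to rounding of the multipliers, the result agrees with the d 1 through d 7 used in the simulation study of Section III.

def modified_fibonacci(d1, n_levels):
    """Dose levels obtained by applying multipliers 2, 1.67, 1.5, 1.4, then 1.33."""
    multipliers = [2.0, 1.67, 1.5, 1.4]
    doses = [d1]
    for k in range(1, n_levels):
        m = multipliers[k - 1] if k - 1 < len(multipliers) else 1.33
        doses.append(doses[-1] * m)
    return doses

print([round(d, 1) for d in modified_fibonacci(100.0, 7)])
# e.g., [100.0, 200.0, 334.0, 501.0, 701.4, 932.9, 1240.7]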

B. Traditional Escalation Algorithms


A wide variety of dose escalation rules may be used. For purposes of illustration,
we describe the following, which is often referred to as the traditional ‘‘3 + 3’’
design. Beginning at k = 1,

(A) Evaluate three patients at d k:
(A1) If zero of three have DLT, then increase dose to d k+1 and go to (A).
(A2) If one of three have DLT, then go to (B).
(A3) If at least two of three have DLT, then go to (C).
(B) Evaluate an additional three patients at d k:
(B1) If one of six have DLT, then increase dose to d k+1 and go to (A).
(B2) If at least two of six have DLT, then go to (C).
(C) Discontinue dose escalation.
If the trial is stopped, then the dose level below that at which excessive
DLT was observed is the MTD. Some protocols may specify that if only three
patients were evaluated at that dose level, then an additional three should be
entered, for a total of six, and that process should proceed downward, if neces-
sary, so that the MTD becomes the highest dose level where no more than one
toxicity is observed in six patients. The actual θ that is desired is generally not
defined when such algorithms are used, but implicitly 0.17 ≤ θ ≤ 0.33, so we
could take θ ≈ 0.25.
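The escalation logic above translates directly into code. The following sketch (ours, not part of any published protocol) simulates a single 3 + 3 trial given a vector of true DLT probabilities and returns the index of the selected MTD, with −1 standing for the level below d 1:

import numpy as np

def three_plus_three(p_true, rng):
    """Simulate one 3 + 3 trial; p_true[k] is the true DLT probability at level k."""
    k = 0
    while k < len(p_true):
        dlt = rng.binomial(3, p_true[k])          # first cohort of three
        if dlt == 0:
            k += 1                                # escalate
            continue
        if dlt == 1:
            dlt += rng.binomial(3, p_true[k])     # expand to six patients
            if dlt == 1:
                k += 1                            # escalate
                continue
        return k - 1                              # excessive DLT: MTD is the level below
    return len(p_true) - 1                        # ran off the top of the dose range

rng = np.random.default_rng(1)
print(three_plus_three([0.05, 0.10, 0.20, 0.33, 0.50, 0.65], rng))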
Another example of a dose escalation algorithm, referred to as the ‘‘best-
of-5’’ design, is described here. Beginning at k = 1,
(A) Evaluate three patients at d k:
(A1) If zero of three have DLT, then increase dose to d k+1 and go to (A).
(A2) If one or two of three have DLT, then go to (B).
(A3) If three of three have DLT, then go to (D).
(B) Evaluate an additional one patient at d k:
(B1) If one of four have DLT, then increase dose to d k+1 and go to (A).
(B2) If two of four have DLT, then go to (C).
(B3) If three of four have DLT, then go to (D).
(C) Evaluate an additional one patient at d k:
(C1) If two of five have DLT, then increase dose to d k+1 and go to (A).
(C2) If three of five have DLT, then go to (D).
(D) Discontinue dose escalation.
Again, the value of θ is not explicitly defined, but we could take θ ≈ 0.40.
Although traditional designs reflect an empirical common sense approach
to the problem of estimating the MTD under the noted constraints, only brief
reflection is needed to see that the determination of MTD will have a rather
tenuous statistical basis. Consider the outcome of a trial using the 3 + 3 design
where the frequency of DLT for dose levels d 1, d 2, and d 3 is 0 of 3, 1 of 6, and
2 of 6, respectively. Ignoring the sequential nature of the escalation procedure,
the pointwise 80% confidence intervals for the rate of DLT at the three dose
levels are, respectively, 0–0.54, 0.02–0.51, 0.09–0.67. Although the middle dose
would be taken as the estimated MTD, there is not even reasonably precise evi-
dence that the toxicity rate for any of the three doses is either above or below
the implied θ of approximately 0.25.
Crude comparisons among traditional dose escalation algorithms can be
made by examining the level-wise operating characteristics of the design, that
is, the probability of escalating to the next dose level given an assumption regard-
ing the underlying probability of DLT at the current dose level. Usually, this
calculation is a function of simple binomial success probabilities. For example,
in the 3 + 3 algorithm described above, the probability of escalation is B(0, 3;
ψ(d )) + B(1, 3; ψ(d )) B(0, 3; ψ(d )), where B(r, n; ψ(d )) is the probability of r
successes (toxicities) out of n trials (patients) with underlying success probability
at the current dose level ψ(d ). The probability of escalation can then be plotted
over a range of ψ(d ), as is done in Fig. 1 for the two algorithms described above.
Although it is obvious from such a display that one algorithm is considerably
more aggressive than another, the level-wise operating characteristics do not pro-
vide much useful insight into whether or not a particular design will tend to select
an MTD that is close to the target. More useful approaches to choosing among
traditional designs and the other designs described below are discussed in Sec-
tion III.
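The level-wise calculation just described is easy to reproduce. The sketch below evaluates the escalation probability of the 3 + 3 rule from the formula in the text, together with the corresponding expression for the best-of-5 rule, which is our own derivation from the algorithm given above:

import numpy as np

def p_escalate_3plus3(p):
    """B(0, 3; p) + B(1, 3; p) * B(0, 3; p): escalate on 0/3, or on 1/3 followed by 0/3."""
    q = 1.0 - p
    return q**3 + 3 * p * q**2 * q**3

def p_escalate_best_of_5(p):
    """Escalation probability for the best-of-5 rule (our own derivation)."""
    q = 1.0 - p
    return (q**3                             # 0 of 3
            + 3 * p * q**2 * (q + p * q)     # 1 of 3, then 1 of 4, or 2 of 4 -> 2 of 5
            + 3 * p**2 * q * q * q)          # 2 of 3, then 2 of 4 -> 2 of 5

for p in np.arange(0.1, 0.61, 0.1):
    print(f"p = {p:.1f}:  3+3 {p_escalate_3plus3(p):.2f}   best-of-5 {p_escalate_best_of_5(p):.2f}")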

C. A Bayesian Approach: The Continual Reassessment Method

The small sample size and low information content in the data derived from
traditional methods have suggested to some the usefulness of Bayesian methods
Figure 1 Level-wise operating characteristics of two traditional dose escalation algorithms. The probability of escalating to the next higher dose level is plotted as a function
of the true probability of DLT at the current dose.

to estimate the MTD. In principle, this approach allows one to combine any
prior information available regarding the value of the MTD with subsequent data
collected in the phase I trial to obtain an updated estimate reflecting both.
The most clearly developed Bayesian approach to phase I design is the
continual reassessment method (CRM) proposed by O’Quigley and colleagues
(1,2). From among a small set of possible dose levels, say {d 1, . . ., d 6}, experi-
mentation begins at the dose level that the investigators believe, based on all
available information, is the most likely to have an associated probability of DLT
equal to the desired θ. It is assumed that there is a simple family of monotone
dose–response functions ψ such that for any dose d and probability of toxicity
p there exists a unique a where ψ(d, a) = p and in particular that ψ(d MTD, a 0 )
= θ. An example of such a function is ψ(d, a) = [(tanh d + 1)/2]^a. Note that
ψ is not assumed to be necessarily a dose–response function relating a character-
istic of the dose levels to the probability of toxicity. That is, d does not need to
correspond literally to the dose of a drug. In fact, the treatments at each of the
dose levels may be completely unrelated, as long as the probability of toxicity
increases from each dose level to the next; in this case d could be just the index
of the dose levels. The uniqueness constraint implies in general the use of one-
parameter models and explicitly eliminates popular two-parameter dose–re-
sponse models like the logistic. In practice, the latter have a tendency to become
‘‘stuck’’ and oscillate between dose levels when any data configuration leads to
a large estimate for the slope parameter.
A prior distribution g(a) is assumed for the parameter a such that for the
initial dose level, for example d 3, either ∫_0^∞ ψ(d 3, a)g(a)da = θ or, alternatively,
ψ(d 3, µ a ) = θ, where µ a = ∫_0^∞ ag(a)da. The particular prior used should also
reflect the degree of uncertainty present regarding the probability of toxicity at
the starting dose level; in general, this will be quite vague.
After each patient is treated and the presence or absence of toxicity ob-
served, the current distribution g(a) is updated along with the estimated probabili-
ties of toxicity at each dose level, calculated by either method above (1). The
next patient is then treated at the dose level minimizing some measure of the
distance between the current estimate of the probability of toxicity and θ. After
a fixed number n of patients has been entered sequentially in this fashion, the
dose level selected as the MTD is the one that would be chosen for a hypothetical
(n + 1)st patient.
An advantage of the CRM design is that it makes full use of all the data
at hand to choose the next dose level. Even if the dose–response model used in
updating is misspecified, CRM will tend eventually to select the dose level that
has a probability of toxicity closest to θ (3), although its practical performance
should be evaluated in the small sample setting typical of phase I trials. A further
advantage is that unlike traditional algorithms, the design is easily adapted to
different values of θ.
In spite of these advantages, some practitioners object philosophically to
the Bayesian approach, and it is clear in the phase I setting that the choice of
prior can have a measurable effect on the estimate of MTD (4). However, the
basic framework of CRM can easily be adapted to a non-Bayesian setting and
can conform in practice more closely to traditional methods (5). For example,
there is nothing in the approach that prohibits one from starting at the same low
initial dose as would be common in traditional trials or from updating after groups
of three patients rather than single patients. Allowing for some ad hoc determinis-
tic rules to start the trial off, the Bayesian prior can be abandoned entirely and
the updating after each patient can be fully likelihood based.3
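As a sketch of what likelihood-based updating can look like with the one-parameter working model quoted above, ψ(d, a) = [(tanh d + 1)/2]^a: the conceptual dose values, target, and toy data below are our own assumptions, and the restriction against escalating by more than one level is omitted for brevity.

import numpy as np
from scipy.optimize import minimize_scalar

def psi(d, a):
    """One-parameter working model: probability of DLT at conceptual dose d."""
    return ((np.tanh(d) + 1.0) / 2.0) ** a

def crm_next_level(doses, levels_treated, dlt, theta):
    """Maximum likelihood update of a, then choose the level whose estimated
    DLT probability is closest to the target theta."""
    def neg_log_lik(a):
        p = psi(doses[levels_treated], a)
        return -np.sum(dlt * np.log(p) + (1 - dlt) * np.log(1.0 - p))
    a_hat = minimize_scalar(neg_log_lik, bounds=(0.01, 20.0), method="bounded").x
    return int(np.argmin(np.abs(psi(doses, a_hat) - theta)))

# Hypothetical conceptual doses and outcomes containing both a DLT and a non-DLT
doses = np.array([-1.5, -1.0, -0.5, 0.0, 0.5, 1.0])
levels_treated = np.array([0, 1, 2, 3])
dlt = np.array([0, 0, 0, 1])
print(crm_next_level(doses, levels_treated, dlt, theta=0.25))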

D. Storer’s Two-Stage Design


Storer (6,7) explored a combination of more traditional methods implemented in
such a way as to minimize the numbers of patients treated at low dose levels
and to focus sampling around the MTD; these methods also use an explicit dose–
response framework to estimate the MTD.
The design has two stages and uses a combination of simple dose-escalation
algorithms. The first stage assigns single patients at each dose level and escalates
upward until a patient has DLT or downward until a patient does not have DLT.
Algorithmically, beginning at k = 1,
(A) Evaluate one patient at d k:
(A1) If no patient has had DLT, then increase dose to d k+1 and go to (A).
(A2) If all patients have had DLT, then decrease dose to d k−1 and go to (A).
(A3) If at least one patient has and one has not had DLT, then if the current
patient has not had DLT, go to (B); otherwise, decrease the dose to d k−1 and
go to (B).
Note that the first stage meets the requirement for heterogeneity in response
needed to start off a likelihood-based CRM design and could be used for that
purpose. The second stage incorporates a fixed number of cohorts of patients. If
θ = 1/3, then it is natural to use cohorts of size three, as follows:
(B) Evaluate three patients at d k:
(B1) If zero of three have DLT, then increase dose to d k+1 and go to (B).
(B2) If one of three have DLT, then remain at d k and go to (B).
(B3) If at least two of three have DLT, then decrease dose to d k−1 and
go to (B).
After completion of the second stage, a dose–response model is fit to the
data and the MTD estimated by maximum likelihood or other method. For exam-
ple, one could use a logistic model where logit(ψ(d )) = α + β log(d ), whence
the estimated MTD is log(d MTD ) = (logit(θ) − α̂)/β̂. A two-parameter model is
used here to make fullest use of the final sample of data; however, as noted
above, two-parameter models have undesirable properties for purposes of dose
escalation. To obtain a meaningful estimate of the MTD, one must have 0 < β̂
< ∞. If this is not the case, then one needs either to add additional cohorts of
patients or substitute a more empirical estimate, such as the last dose level or
hypothetical next dose level.
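A minimal sketch of this final estimation step, using the two-parameter logistic model just described; the second-stage data below are invented for illustration, and the fallback when β̂ is not usable is left to the caller:

import numpy as np
from scipy.optimize import minimize

def fit_logistic_mtd(doses, n_dlt, n_pat, theta):
    """Fit logit(psi(d)) = alpha + beta*log(d) by maximum likelihood and solve
    log(d_MTD) = (logit(theta) - alpha_hat)/beta_hat."""
    def neg_log_lik(par):
        a, b = par
        p = np.clip(1.0 / (1.0 + np.exp(-(a + b * np.log(doses)))), 1e-10, 1 - 1e-10)
        return -np.sum(n_dlt * np.log(p) + (n_pat - n_dlt) * np.log(1.0 - p))
    pbar = np.sum(n_dlt) / np.sum(n_pat)
    start = [np.log(pbar / (1 - pbar)) - np.mean(np.log(doses)), 1.0]
    a_hat, b_hat = minimize(neg_log_lik, x0=start, method="Nelder-Mead").x
    if not (0.0 < b_hat < np.inf):
        return None                          # caller falls back to an empirical estimate
    logit_theta = np.log(theta / (1.0 - theta))
    return np.exp((logit_theta - a_hat) / b_hat)

# Hypothetical second-stage summary: DLTs / patients at each dose level used
doses = np.array([333.3, 500.0, 700.0, 933.3])
n_dlt = np.array([0, 1, 2, 3])
n_pat = np.array([3, 6, 6, 3])
print(fit_logistic_mtd(doses, n_dlt, n_pat, theta=1/3))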
As noted, the algorithm described above is designed with a target θ = 1/3
in mind. Although other quantiles could be estimated from the same estimated
dose–response curve, a target θ different from 1/3 would probably lead one to
use a modified second-stage algorithm.
Extensive simulation experiments using this trial design compared with
more traditional designs demonstrated the possibility of reducing the variability
of point estimates of the MTD and reducing the proportion of patients treated at
very low dose levels, without markedly increasing the proportion of patients
treated at dose levels where the probability of DLT is excessive. Storer (7) also
evaluated different methods of providing confidence intervals for the MTD. Stan-
dard likelihood-based methods that ignore the sequential sampling scheme (a
delta method, a method based on Fieller’s theorem, and a likelihood ratio method)
are often markedly anticonservative. More accurate confidence sets can be con-
structed by simulating the distribution of any of those test statistics at trial values
of the MTD; however, the resulting confidence intervals are often extremely
wide. Furthermore, the methodology is purely frequentist and may be unable to
account for minor variations in the implementation of the design when a trial is
conducted.
With some practical modifications, the two-stage design described above
has been implemented in a real phase I trial (8). The major modifications included
a provision to add additional cohorts of three patients, if necessary, until the
estimate of β in the fitted logistic model becomes positive and finite; a provision
that if the estimated MTD is higher than the highest dose level at which patients
have actually been treated, the latter will be used as the MTD; and a provision
to add additional intermediate dose levels if, in the judgement of the protocol
chair, the nature or frequency of toxicity at a dose level precludes further patient
accrual at that dose level.

E. Continuous Outcomes
Although not common in practice, it is useful to consider the case where the
major outcome defining toxicity is a continuous measurement, for example, the
nadir white blood count (WBC). This may or may not involve a fundamentally
different definition of the MTD in terms of the occurrence of DLT. For example,
suppose that DLT is determined by the outcome Y < c, where c is a constant, and
we have Y ∼ N(α + βd, σ²). Then d MTD = (c − α − Φ⁻¹(θ)σ)/β has the traditional
definition that the probability of DLT is θ. The use of such a model in studies with
small sample size makes some distributional assumption imperative. Some sequen-
tial design strategies in this context have been described by Eichhorn et al. (9).
Alternatively, the MTD might be defined in terms of the mean response,
that is, the dose where E(Y) ⫽ c. For the same simple linear model above, we
then have that d MTD = (c − α)/β. Fewer distributional assumptions are needed
to estimate d MTD, and stochastic approximation techniques might be applied in
the design of trials with such an end point (10). Nevertheless, the use of a mean
response to define MTD is not generalizable across drugs with different or multi-
ple toxicities and consequently has received little attention in practice.
A recent proposal for a design incorporating a continuous outcome is that
of Mick and Ratain (11). This is also a two-stage design, which for a hypothetical
study of etoposide assumes a simple regression model relating dose to the WBC
nadir. The model is log(WBC) = α + β 1 log(WBC pre ) + β 2 d, where WBC pre
is the pretreatment WBC. The first phase uses cohorts of two patients. Ad hoc
rules for dose escalation are determined by the toxicity experience in the current
cohort; however, the model is fit each time and cohorts of two are added until
at least eight patients have been treated and β̂ 2 is significantly different from 0
at the 0.05 level of significance. In the second stage of the study, the dose for
Choosing a Phase I Design 81

the next cohort of two patients is determined by fitting the regression model to
the accumulated data and estimating the dose that leads to a mean nadir WBC
of 2.5, that is, d k+1 = (log(2.5) − α̂ − β̂ 1 log(WBC pre ))/β̂ 2. This continues until
at least eight patients have been treated and β̂ 2 is significantly different from 0
at the 0.001 level of significance.
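A sketch of this second-stage dose computation, fitting the regression by ordinary least squares; the data below are entirely made up, and the pretreatment WBC for the next cohort is taken, for simplicity, as the sample average:

import numpy as np

def next_dose(dose, wbc_pre, wbc_nadir, target=2.5, wbc_pre_next=None):
    """Fit log(WBC) = a + b1*log(WBC_pre) + b2*dose and solve for the dose that
    gives an expected nadir equal to `target`."""
    X = np.column_stack([np.ones_like(dose), np.log(wbc_pre), dose])
    a, b1, b2 = np.linalg.lstsq(X, np.log(wbc_nadir), rcond=None)[0]
    if wbc_pre_next is None:
        wbc_pre_next = np.mean(wbc_pre)
    return (np.log(target) - a - b1 * np.log(wbc_pre_next)) / b2

# Hypothetical accumulated data (dose in arbitrary units)
dose = np.array([100.0, 100.0, 200.0, 200.0, 333.3, 333.3, 500.0, 500.0])
wbc_pre = np.array([7.2, 6.1, 8.0, 5.5, 6.8, 7.5, 6.0, 7.0])
wbc_nadir = np.array([5.0, 4.2, 4.0, 3.1, 3.3, 3.6, 2.4, 2.9])
print(next_dose(dose, wbc_pre, wbc_nadir))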
Simulation studies of this design using a pharmacokinetic model and historic
database demonstrated a clear increase in precision in the MTD estimated from
the model-based dose-escalation method, as compared with the MTD estimated
from a more traditional design. The average sample size was also measurably
smaller. Though such results are promising, the method applies only to situations
where the DLT is a single continuous outcome. Furthermore, the simulation stud-
ies that are needed to establish the usefulness of the method in specific situations
often require the use of human pharmacokinetic data that might not be available
at the time the study was being planned.

III. CHOOSING A PHASE I DESIGN

As noted above, only limited information regarding the suitability of a phase I
design can be gained from the levelwise operating characteristics shown in Fig.
1. Furthermore, for designs like CRM, which depend on data from prior dose
levels to determine the next dose level, it is not even possible to specify a level-
wise operating characteristic.
Useful evaluations of phase I designs must involve the entire dose–re-
sponse curve, which of course is unknown. Many simple designs for which the
level-wise operating characteristics can be specified can be formulated as discrete
Markov chains (6). The states in the chain refer to treatment of a patient or group
of patients at a dose level, with an absorbing state corresponding to the stopping
of the trial. For various assumptions about the true dose–response curve, one can
then calculate exactly many quantities of interest, such as the number of patients
treated at each dose level, from the appropriate quantities determined from suc-
cessive powers of the transition probability matrix P. Such calculations are fairly
tedious, however, and do not accommodate designs with nonstationary transition
probabilities, such as CRM. Nor do they allow one to evaluate any quantity de-
rived from all of the data, such as the MTD estimated after following Storer’s
two-stage design.
For these reasons, simulations studies are the only practical tool for evaluat-
ing phase I designs. As with exact computations, one needs to specify a range
of possible dose–response scenarios and then simulate the outcome of a large
number of trials under each scenario. Here we give an example of such a study
to illustrate the kinds of information that can be used in the evaluation and some
of the considerations involved in the design of the study.
A. Specifying the Dose–Response Curve


We follow the modified Fibonacci spacing described in Section II. For example,
in arbitrary units, we have d 1 = 100.0, d 2 = 200.0, d 3 = 333.3, d 4 = 500.0, d 5
= 700.0, d 6 = 933.3, d 7 = 1244.4, . . . . We also define hypothetical dose levels
below d 1 that successively halve the dose above, that is, d 0 = 50.0, d −1 =
25.0, . . . . The starting dose is always d 1, and we assume that the true MTD is
four dose levels higher, at d 5, with θ = 1/3. To define a range of dose–response
scenarios, we vary the probability of toxicity at d 1 from 0.01 (0.01) 0.20, and
graph our results as a function of that probability. The true dose–response curve
is determined by assuming that a logistic model holds on the log scale.4 Varying
the probability of DLT at d 1 while holding the probability at d 5 fixed at θ results
in a sequence of dose–response curves ranging from relatively steep to relatively
flat. An even greater range could be encompassed by also varying the number
of dose levels between the starting dose and the true MTD, which of course need
not be exactly at one of the predetermined dose levels. The point is to study the
sensitivity of the designs to features of the underlying dose–response curve,
which obviously is unknown.
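One way to set these scenarios up in code, following the parameterization spelled out in endnote 4; the dose values are the modified Fibonacci levels above, and everything else is just the anchoring described in the text:

import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def true_dose_response(doses, d1, p1, d_mtd, theta=1/3):
    """DLT probabilities under a logistic curve in log dose, anchored so that
    psi(d1) = p1 and psi(d_mtd) = theta (see endnote 4)."""
    beta = (logit(theta) - logit(p1)) / (np.log(d_mtd) - np.log(d1))
    alpha = logit(p1) - beta * np.log(d1)
    return expit(alpha + beta * np.log(doses))

doses = np.array([100.0, 200.0, 333.3, 500.0, 700.0, 933.3, 1244.4])
for p1 in (0.01, 0.05, 0.10, 0.20):
    print(p1, np.round(true_dose_response(doses, 100.0, p1, 700.0), 3))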

B. Specifying the Designs


This simulation will evaluate the two traditional designs described above, Storer’s
two-stage design, and a non-Bayesian CRM design. It is important to make the
simulation as realistic as possible in terms of how an actual clinical protocol
would be implemented or at least to recognize what differences might exist. For
example, the simulation does not place a practical limit on the highest dose level,
although it is rare for any design to escalate beyond d 10. An actual protocol might
have an upper limit on the number of dose levels, with a provision for how to
define the MTD if that limit is reached. Similarly, the simulation always evaluates
a full cohort of patients, whereas in practice, where patients are more likely en-
tered sequentially than simultaneously, a 3 + 3 design might, for example, forego
the last patient in a cohort of three if the first two patients had experienced DLT.

1. Traditional 3 + 3 Design
This design is implemented as described in Section II. In the event that excessive
toxicity occurs at d 1, the MTD is taken to be d 0. Although this is an unlikely
occurrence in practice, a clinical protocol should specify any provision to de-
crease dose if the stopping criteria are met at the first dose level.

2. Traditional Best-of-5 Design


Again implemented as described in Section II, with the same rules applied to
stopping at d 1.
3. Storer’s Two-Stage Design


Implemented as described in Section II, with a second-stage sample size of 24
patients. A standard logistic model is fit to the data. If it is not the case that 0
< β̂ < ∞, then the geometric mean of the last dose level used and the dose level
that would have been assigned to the next cohort is used as the MTD. In either
case, if that dose is higher than the highest dose at which patients have actually
been treated, then the latter is taken as the MTD.

4. Non-Bayesian CRM Design


We start the design using the first stage of the two-stage design as described
above. Once heterogeneity has been achieved, 24 patients are entered in cohorts
of three. The first cohort is entered at the same dose level as for the second stage
of the two-stage design; after that successive cohorts are entered using likelihood
based updating of the dose–response curve. For this purpose we use a single
parameter logistic model—a two-parameter model with β fixed at 0.75.5 After
each updating, the next cohort is treated at the dose level with estimated probabil-
ity of DLT closest in absolute value to θ; however, the next level cannot be more
than one dose level higher than the current highest level at which any patients
have been treated. The level that would be chosen for a hypothetical additional
cohort is the MTD; however, if this dose is above the highest dose at which
patients have been treated, the latter is taken as the MTD.

C. Simulation and Results


The simulation is performed by generating 5000 sequences of patients and
applying each of the designs to each sequence for each dose–response curve
being evaluated. The sequence of patients is really a sequence of pseudo-random
numbers generated to be Uniform (0, 1). Each patient’s number is compared with
the hypothetical true probability of DLT at the dose level the patient is entered
at for the dose–response curve being evaluated. If the number is less than that
probability, then the patient is taken to have experienced DLT.
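In code, the DLT indicator is just a comparison of a pre-generated Uniform(0, 1) draw with the true probability at the assigned level, which is what lets the same patient sequences be reused across designs and dose–response curves; the array shapes and probabilities below are our own choices:

import numpy as np

rng = np.random.default_rng(2025)
n_trials, max_patients = 5000, 40
u = rng.uniform(size=(n_trials, max_patients))   # one sequence of draws per simulated trial

def has_dlt(trial, patient, level, p_true):
    """DLT occurs if the patient's uniform draw falls below the true DLT
    probability at the assigned dose level."""
    return u[trial, patient] < p_true[level]

p_true = [0.05, 0.10, 0.20, 0.33, 0.50]
print(has_dlt(0, 0, 2, p_true))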
Figure 2 displays results of the simulation study above that relate to the
estimate d̂ MTD. Since the dose scale is arbitrary, the results are presented in terms
of ψ(d̂ MTD ). Figure 2(a) displays the mean probability of DLT at the estimated
MTD. The horizontal line at 1/3 is a point of reference for the target θ. Although
none of the designs is unbiased, all except the conservative 3 + 3 design perform
fairly well across the range of dose–response curves. The precision of the esti-
mates, taken as the root MSE of the probabilities ψ(d̂ MTD ), is shown in Fig. 2
(b). In this regard the CRM and two-stage designs perform better than the best-
of-5 design over most settings of the dose–response curve. One should also note
Figure 2 Results of 5000 simulated phase I trials according to four designs, plotted as
a function of the probability of DLT at the starting dose level. The true MTD is fixed at
four dose levels above the starting dose, with θ = 1/3. Results are expressed in terms of
p(MTD) = ψ(d̂ MTD ).
that, in absolute terms, the precision of the estimates is not high even for the
best designs.
In addition to the average properties of the estimates, it is also relevant to
look at the extremes. Figure 2(c) and (d) present the fraction of trials where
ψ(d̂ MTD ) < 0.20 or ψ(d̂ MTD ) > 0.50, respectively.6
at which the odds of DLT are half that of θ. Although this may not be an important
consideration, to the extent that the target θ defines a dose with some efficacy
in addition to toxicity, the fraction of trials below this arbitrary limit may repre-
sent cases in which the dose selected for subsequent evaluation in efficacy trials
is ‘‘too low.’’ Because of their common first-stage design that uses single patients
at the initial dose levels, the two-stage and CRM designs do best in this regard.
Conversely, the cutoff used in Fig. 2 (d) is the level at which the odds of toxicity
are twice that of θ. Although the occurrence of DLT in and of itself is not neces-
sarily undesirable, as the probability of DLT increases there is likely a corre-
sponding increase in the probability of very severe or even fatal toxicity. Hence,
the trials where the probability of DLT is above this arbitrary level may represent
cases in which the dose selected as the MTD is ‘‘too high.’’ In this case there
are not large differences among the designs, and in particular we find that the
two designs that perform the best in Fig. 2(c) do not carry an unduly large penalty.
One of course could easily evaluate other limits if desired.
Some results related to the outcome of the trials themselves are presented
in Fig. 3. Panels (a) and (b) present the overall fraction of patients that are treated
below and above, respectively, the same limits as for the estimates in Fig. 2. The
two-stage and CRM designs perform best at avoiding treating patients at the
lower dose levels; the two-stage design is somewhat better than the CRM design
at avoiding treating patients at higher dose levels, although of course it does not
do as well as the very conservative 3 + 3 design.
Sample size considerations are evaluated in Fig. 3, (c) and (d). Panel (c)
shows the mean number of patients treated. Because they share a common first
stage and use the same fixed number of patients in the second stage, the two-
stage and CRM designs yield identical results. The 3 + 3 design uses the smallest
number of patients, but this is because it tends to stop well below the target. On
average, the best-of-5 design uses six to eight fewer patients than the two-stage
or CRM design. Figure 3(d) displays the mean number of ‘‘cycles’’ of treatment
that are needed to complete the trial, where a cycle is the period of time over
which a patient or group of patients needs to be treated and evaluated before a
decision can be made as to the dose level for the next patient or group. For
example, the second stage in the two-stage or CRM designs above always uses
eight cycles; each dose level in the 3 + 3 design uses either one or two cycles,
and so on. This is a consideration only for situations where the time needed to
complete a phase I trial is not limited by the rate of patient accrual but by the
Figure 3 Results of 5000 simulated phase I trials according to four designs, plotted as
a function of the probability of DLT at the starting dose level. The true MTD is fixed at
four dose levels above the starting dose, with θ = 1/3.
time needed to treat and evaluate each group of patients. In this case the results
are qualitatively similar to that of Fig. 3(c).

D. Summary and Conclusion


Based only on the results above, one would likely eliminate the 3 + 3 design
from consideration. The best-of-5 design would probably also be eliminated as
well due to the lower precision and greater likelihood that the MTD will be well
below the target. On the other hand, the best-of-5 design uses fewer patients. If
small patient numbers are a priority, it would be reasonable to consider an addi-
tional simulation in which the second-stage sample size for the two-stage and
CRM designs is reduced to, say, 18 patients. This would put the average sample
size for those designs closer to that of the best-of-5, and one could see whether
they continued to maintain an advantage in the other aspects. Between the two-
stage and CRM designs, there is perhaps a slight advantage to the former in terms
of greater precision and a smaller chance that the estimate will be too far above
the target; however, the difference is likely not important in practical terms and
might vary under other dose–response conditions.7
A desirable feature of the results shown is that both the relative and absolute
properties of the designs do not differ much over the range of dose–response
curves. Additional simulations could be carried out that would vary also the dis-
tance between the starting dose and the true MTD or place the true MTD between
dose levels instead of exactly at a dose level.
To illustrate further some of the features of phase I designs and the neces-
sity of studying each situation on a case by case basis, we repeated the simulation
study above using a target θ = 0.20. Exactly the same dose–response settings
are used, so that the results for the two traditional designs are identical to those
shown previously. The two-stage design is modified to use five cohorts of five
patients but follows essentially the same rule for selecting the next level described
above with ‘‘three’’ replaced by ‘‘five.’’ Additionally, the final fitted model esti-
mates the MTD associated with the new target, and of course the CRM design
selects the next dose level based on the new target.
The results for this simulation are presented in Fig. 4 and 5. In this case
the best-of-5 design is clearly eliminated as too aggressive. However, and perhaps
surprisingly, the 3 + 3 design performs nearly as well, or better, than the suppos-
edly more sophisticated two-stage and CRM designs. There is a slight disadvan-
tage in terms of precision, but given that the mean sample size with the 3 + 3
design is nearly half that of the other two, this may be a reasonable trade-off.
Of course, it could also be the case in this setting that using a smaller second-
stage sample size would not adversely affect the two-stage and CRM designs.
Finally, we reiterate the point that the purpose of this simulation was to
demonstrate some of the properties of phase I designs and of the process of
Figure 4 Results of 5000 simulated phase I trials according to four designs, plotted as
a function of the probability of DLT at the starting dose level. The dose–response curves
are identical to those used for Fig. 2 but with θ = 0.20. Results are expressed in terms
of p(MTD) = ψ(d̂ MTD ).

simulation itself, not to advocate any particular design. Depending on the particu-
lars of the trial at hand, any one of the four designs might be a reasonable choice.
An important point to bear in mind is that traditional designs must be matched
to the desired target quantile and will perform poorly for other quantiles. CRM
designs are particulary flexible in this regard; the two-stage design can only be
modified to a lesser extent.

Figure 5 Results of 5000 simulated phase I trials according to four designs, plotted as a function of the probability of DLT at the starting dose level. The dose–response curves are identical to those used for Fig. 3 but with θ = 0.20.

ENDNOTES

1. Alternately, one could define Y to be the random variable representing the threshold
dose at which a patient would experience DLT. The distribution of Y is referred to
as a tolerance distribution and the dose–response curve is the cumulative distribution
function for Y, so that the MTD would be defined by Pr(Y ≤ d MTD) = θ. For a given
sample size, the most effective way of estimating this quantile would be from a sample
of threshold doses. Such data are nearly impossible to gather, however, as it is imprac-
tical to give each patient more than a small number of discrete doses. Further, the
data obtained from sequential administration of different doses to the same patient
would almost surely be biased, as one could never distinguish the cumulative effects
of the different doses from the acute effects of the current dose level. Extended wash-
out periods between doses are not a solution, since the condition of the patient and
hence the response to the drug is likely to change rapidly for the typical patient in a
phase I trial. For this reason, almost all phase I trials involve the administration of
only a single dose level to each patient and the observation of the frequency of DLT
in all patients treated at the same dose level.
2. In a true Fibonacci sequence, the increments would be approximately 2, 1.5, 1.67,
1.60, 1.63, and then 1.62 thereafter, converging on the golden ratio.
3. Without a prior, the dose–response model cannot be fit to the data until there is some
heterogeneity in outcome, i.e., at least one patient with DLT and one patient without
DLT. Thus, some simple rules are needed to guide the dose escalation until heteroge-
neity is achieved. Also, one may want to impose rules that restrict one from skipping
dose levels during escalation, even if the fitted dose–response model would lead one
to select a higher dose.
4. The usual formulation of the logistic dose–response curve would be that logit ψ(d) = α + β log d. In the above setup, we specify d 1, ψ(d 1), and that ψ(d 5) = 1/3, whence β = {logit(1/3) − logit(ψ(d 1))}/∆, where ∆ = log d 5 − log d 1, and α = logit(ψ(d 1)) − β log d 1.
5. This value does have to be tuned to the actual dose scale but is not particularly sensitive to the precise value. That is, similar results would be obtained with β in the range 0.5–1. For reference, on the natural log scale the distance log(d MTD) − log(d 1) ≈ 2, and the true value of β in the simulation ranges from 2.01 to 0.37 as ψ(d 1) ranges from 0.01 to 0.20.
6. The see-saw pattern observed for all but the two-stage design is caused by changes
in the underlying dose–response curve, as the probability of DLT at particular dose
levels moves over or under the limit under consideration. Since the three designs
select discrete dose levels as d̂ MTD, this will result in a corresponding decrease in the
fraction of estimates beyond the limit.
7. The advantage of the two-stage design may seem surprising, given that the next dose
level is selected only on the basis of the outcome at the current dose level, and ignores
the information that CRM uses from all prior patients. However, the two-stage design
also incorporates a final estimation procedure for the MTD that uses all the data and
uses a richer family of dose–response models. This issue is examined in Storer (12).

REFERENCES

1. O’Quigley J, Pepe M, Fisher L. Continual reassessment method: a practical design for Phase I clinical studies in cancer. Biometrics 1990; 46:33–48.
2. O’Quigley J, Chevret S. Methods for dose finding studies in cancer clinical trials: a review and results of a Monte Carlo study. Stat Med 1991; 10:1647–1664.

3. Shen LZ, O’Quigley J. Consistency of continual reassessment method under model misspecification. Biometrika 1996; 83:395–405.
4. Gatsonis C, Greenhouse JB. Bayesian methods for Phase I clinical trials. Stat Med
1992; 11:1377–1389.
5. O’Quigley J, Shen LZ. Continual reassessment method: a likelihood approach. Bio-
metrics 1996; 52:673–684.
6. Storer B. Design and analysis of Phase I clinical trials. Biometrics 1989; 45:925–
937.
7. Storer B. Small-sample confidence sets for the MTD in a Phase I clinical trial. Bio-
metrics 1993; 49:1117–1125.
8. Berlin J, Stewart JA, Storer B, Tutsch KD, Arzoomanian RZ, Alberti D, Feierabend
C, Simon K, Wilding G. Phase I clinical and pharmacokinetic trial of penclomedine
utilizing a novel, two-stage trial design. J Clin Oncol 1998; 16:1142–1149.
9. Eichhorn BH, Zacks S. Sequential search of an optimal dosage. J Am Stat Assoc
1973; 68:594–598.
10. Anbar D. Stochastic approximation methods and their use in bioassay and Phase I
clinical trials. Commun Stat Theory Methods 1984; 13:2451–2467.
11. Mick R, Ratain MJ. Model-guided determination of maximum tolerated dose in
Phase I clinical trials: evidence for increased precision. J Nat Cancer Inst 1993; 85:
217–223.
12. Storer B. An evaluation of Phase I designs for continuous dose response. Stat Med
(In press.)
4
Overview of Phase II Clinical Trials

Stephanie Green
Fred Hutchinson Cancer Research Center, Seattle, Washington

I. DESIGN

Standard phase II studies are used to screen new regimens for activity and to
decide which ones should be tested further. To screen regimens efficiently, the
decisions generally are based on single-arm studies using short-term end points
(usually tumor response in cancer studies) in limited numbers of patients. The
problem is formulated as a test of the null hypothesis H 0: p ⫽ p 0 versus the
alternative hypothesis H A: p ⫽ p A, where p is the probability of response, p 0 is
the probability which, if true, would mean that the regimen was not worth study-
ing further, and p A is the probability which, if true, would mean it would be
important to identify the regimen as active and to continue studying it. Typically,
p 0 is a value at or somewhat below the historical probability of response to stan-
dard treatment for the same stage of disease, and p A is typically somewhat above.
For ethical reasons studies of new regimens usually are designed with two
or more stages of accrual, allowing early stopping due to inactivity of the regi-
men. A variety of approaches to early stopping has been proposed. Although
several of these include options for more than two stages, only the two-stage
versions are discussed in this chapter. (In typical clinical settings it is difficult
to manage more than two stages.) An early approach, due to Gehan (1), suggested
stopping if 0/N responses were observed, where N was chosen so that the probability of 0/N responses was less than 0.05 under a specified alternative. Otherwise accrual was to be continued until
the sample size was large enough for estimation at a specified level of precision.

In 1982, Fleming (2) proposed stopping when results are inconsistent either with H 0 or with H A: p = p′, where H 0 is tested at level α and p′ is the alternative for which the procedure has power 1 − α. The bounds for stopping after the first stage of a two-stage design are the nearest integer to N 1 p′ − Z 1−α {Np′(1 − p′)}^{1/2} (for concluding early that the regimen should not be tested further) and the nearest integer to N 1 p 0 + Z 1−α {Np 0 (1 − p 0)}^{1/2} + 1 (for concluding early that the regimen is promising), where N 1 is the first-stage sample size and N is the total after the second stage. At the second stage, H 0 is accepted or rejected according to the normal approximation for a single-stage design.
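To make the arithmetic concrete, the following sketch (Python with SciPy; the function name and the "nearest integer" rounding via Python's round are our choices) evaluates the quoted first-stage bounds. With the 0.1 versus 0.3 design parameters it should agree with the a1 and b1 entries in the Fleming row of Table 1 below.

```python
import math
from scipy.stats import norm

def fleming_stage1_bounds(p0, p_alt, n1, n, alpha=0.05):
    """First-stage stopping bounds for a two-stage Fleming design,
    computed from the formulas quoted in the text."""
    z = norm.ppf(1 - alpha)
    # Accept H0 early (stop, not promising) if responses <= lower.
    lower = round(n1 * p_alt - z * math.sqrt(n * p_alt * (1 - p_alt)))
    # Reject H0 early (stop, promising) if responses >= upper.
    upper = round(n1 * p0 + z * math.sqrt(n * p0 * (1 - p0)) + 1)
    return lower, upper

print(fleming_stage1_bounds(p0=0.1, p_alt=0.3, n1=20, n=35))
# (2, 6), in agreement with a1 and b1 for the 0.1 vs. 0.3 Fleming design in Table 1
```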
Since then other authors, rather than proposing tests, have proposed
choosing stopping boundaries to minimize the expected number of patients re-
quired, subject to level and power specifications. Chang et al. (3) proposed
minimizing the average expected sample size under the null and alternative
hypotheses. Simon (4), recognizing the ethical imperative of stopping when
the agent is inactive, recommended stopping early only for unpromising results
and minimizing the expected sample size under the null or, alternatively,
minimizing the maximum sample size. A problem with these designs is that
sample size has to be accrued exactly for the optimality properties to hold, so in
practice they cannot be carried out faithfully in many settings. Particularly in
multi-institution settings, studies cannot be closed after a specified number of
patients have been accrued. It takes time to get a closure notice out, and during
this time more patients will have been approached to enter the trial. Patients who
have been asked and have agreed to participate in a trial should be allowed to
do so, and this means there is a period of time during which institutions can
continue registering patients even though the study is closing. Furthermore, some
patients may be found to be ineligible after the study is closed. It is rare to end
up with precisely the number of patients planned, making application of fixed
designs problematic.
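The optimality criterion just described can be illustrated with a brute-force search. The sketch below (Python with SciPy; the function names and the narrowed grid ranges are ours, chosen only to keep the illustration fast) enumerates two-stage designs that stop early only for futility, keeps those meeting the level and power constraints, and returns the one minimizing the expected sample size under the null; output can be checked against Simon's published tables.

```python
import numpy as np
from scipy.stats import binom

def reject_prob(p, n1, r1, n, r):
    """P(declare the regimen promising): pass stage 1 (X1 > r1) and end with
    more than r responses among n patients, when the true response rate is p."""
    x1 = np.arange(r1 + 1, n1 + 1)
    return float(np.sum(binom.pmf(x1, n1, p) * binom.sf(r - x1, n - n1, p)))

def expected_n(p0, n1, r1, n):
    """Expected sample size under p0 when the trial stops early only for futility."""
    pet = binom.cdf(r1, n1, p0)          # probability of early termination
    return n1 + (1 - pet) * (n - n1)

def search(p0, pa, alpha=0.05, beta=0.10):
    """Naive search over a restricted grid of (n1, r1, n, r) values."""
    best = None
    for n in range(25, 41):
        for n1 in range(10, min(26, n)):
            for r1 in range(0, n1):
                for r in range(r1, n):
                    if reject_prob(p0, n1, r1, n, r) > alpha:
                        continue
                    if reject_prob(pa, n1, r1, n, r) < 1 - beta:
                        continue
                    en = expected_n(p0, n1, r1, n)
                    if best is None or en < best[0]:
                        best = (en, n1, r1, n, r)
    return best   # (E[N | p0], n1, r1, n, r)

print(search(0.1, 0.3))
```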
To address this problem, Green and Dahlberg (5) proposed designs
allowing for variable attained sample sizes. The approach is to accrue patients
in two stages with about the same number of patients per stage, to have level
approximately 0.05 and power approximately 0.9, and to stop early if the agent
appears unpromising. Specifically, the regimen is concluded unpromising and the
trial is stopped early if the alternative is rejected at the 0.02 level after the first
stage of accrual and the agent is concluded promising if H 0 is rejected at the
0.055 level after the second stage of accrual. The level 0.02 was chosen to balance
the concern of treating the fewest possible patients with an inactive agent against
the concern of rejecting an active agent due to treating a chance series of poor
risk patients. Level 0.05 and power 0.9 are reasonable for solid tumors due to
the modest percent of agents found to be active in this setting (6); less conserva-
tive values might be appropriate in more responsive diseases.
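As a sketch of how these two tests translate into a decision rule, the following Python fragment (function name ours; exact binomial tail probabilities are assumed, so the implied bounds may differ slightly from the published designs) applies the stage-1 level-0.02 test of the alternative and the final level-0.055 test of H 0.

```python
from scipy.stats import binom

def green_dahlberg_decision(x1, n1, x_total, n_total, p0, pa):
    """Two-stage decision following the rule described in the text."""
    # Stage 1: stop (not promising) if the alternative pa is rejected at level 0.02,
    # i.e., so few responses that P(X <= x1 | n1, pa) <= 0.02.
    if binom.cdf(x1, n1, pa) <= 0.02:
        return "stop after first stage: agent unpromising"
    # Stage 2: conclude promising if H0 is rejected at level 0.055,
    # i.e., P(X >= x_total | n_total, p0) <= 0.055.
    if binom.sf(x_total - 1, n_total, p0) <= 0.055:
        return "promising: warrants further study"
    return "not promising"

# With p0 = 0.1 and pa = 0.3, one response in 20 first-stage patients stops the trial:
print(green_dahlberg_decision(1, 20, 1, 35, 0.1, 0.3))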

The design has the property that stopping at the first stage occurs when the
estimate of the response probability is less than approximately p 0, the true value
that would mean the agent would not be of interest. At the second stage the
agent is concluded to warrant further study if the estimate of the response proba-
bility is greater than approximately (p A + p 0)/2, which typically would be
equal to or somewhat above the historical probability expected from other
agents and a value at which one might be expected to be indifferent to the
outcome of the trial. However, there are no optimality properties. Chen and Ng
(7) proposed a different approach to flexible design by optimizing (with respect
to expected sample size under p 0) across possible attained sample sizes. They
assumed a uniform distribution over sets of eight consecutive N 1s and eight con-
secutive Ns; presumably if information is available on the actual distribution in
a particular setting, then the approach could be used for a better optimization.
Herndon (8) described another variation on the Green and Dahlberg designs. To
address the problem of temporary closure of studies, an alternative approach is
proposed that allows patient accrual to continue while results of the first stage
are reviewed. Temporary closures are disruptive, so this approach might be rea-
sonable for cases when accrual is relatively slow with respect to submission of
information (if too rapid, the ethical aim of stopping early due to inactivity is
lost).
Table 1 illustrates several of the design approaches mentioned above for
level 0.05 and power 0.9 tests, including Fleming designs, Simon minimax de-
signs, Green and Dahlberg designs, and Chen and Ng optimal design sets. Powers
and levels are reasonable for all approaches. (Chen and Ng designs have correct
level on average, although individual realizations have levels up to 0.075 among
the tabled designs.) Of the four approaches, Green and Dahlberg designs are the
most conservative with respect to early stopping for level 0.05 and power 0.9,
whereas Chen and Ng designs are the least.
In another approach to phase II design, Storer (9) suggested a procedure
similar to two-sided testing instead of the standard one-sided test. In this ap-
proach, the phase II trial is considered negative (H A: p ≥ p A is rejected) if the number of responses is sufficiently low, positive (H 0: p ≤ p 0 is rejected) if sufficiently high, and equivocal if intermediate (neither hypothesis rejected). For a value p m between p 0 and p A, upper and lower rejection bounds (r U and r L) are chosen such that P(x ≥ r U | p m) < γ and P(x ≤ r L | p m) < γ, with p m and sample size chosen to have adequate power to reject H A under p 0 or H 0 under p A. When p 0 = 0.1 and p A = 0.3, an example of a Storer design is to test p m = 0.193 with γ = 0.33 and power 0.8 under p 0 and p A. For a two-stage design, N 1, N, r L1, r U1, r L2, and
r U2 are 18, 29, 1, 6, 4, and 7, respectively. If the final result is equivocal (5 or
6 responses in 29 for this example), the conclusion is that other information is
necessary to make a decision.
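The decision rule for this example can be written out directly. The sketch below (Python; the function name is ours, and the reading that the trial stops at stage 1 when the count is at or below r L1 or at or above r U1 is our assumption based on the rejection bounds described above).

```python
def storer_decision(x1, x_total,
                    r_l1=1, r_u1=6, r_l2=4, r_u2=7):
    """Two-stage Storer design for p0 = 0.1, pA = 0.3 (N1 = 18, N = 29)."""
    # Stage 1 (first 18 patients)
    if x1 <= r_l1:
        return "negative: reject HA (p >= pA)"
    if x1 >= r_u1:
        return "positive: reject H0 (p <= p0)"
    # Stage 2 (29 patients in total)
    if x_total <= r_l2:
        return "negative: reject HA"
    if x_total >= r_u2:
        return "positive: reject H0"
    return "equivocal: neither hypothesis rejected"

print(storer_decision(x1=3, x_total=5))   # 5 responses in 29 is an equivocal result
```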
Table 1 Examples of Designs

                                                                   Level (average and     Power (average and
          H0 vs. HA      N1       a1    b1    N        a2    b2    range for Chen)        range for Chen)

Fleming   0.05 vs. 0.2   20       0     4     40       4     5     0.052                  0.92
          0.1 vs. 0.3    20       2     6     35       6     7     0.053                  0.92
          0.2 vs. 0.4    25       5     10    45       13    14    0.055                  0.91
          0.3 vs. 0.5    30       9     16    55       22    23    0.042                  0.91
Simon     0.05 vs. 0.2   29       1     —     38       4     5     0.039                  0.90
          0.1 vs. 0.3    22       2     —     33       6     7     0.041                  0.90
          0.2 vs. 0.4    24       5     —     45       13    14    0.048                  0.90
          0.3 vs. 0.5    24       7     —     53       21    22    0.047                  0.90
Green     0.05 vs. 0.2   20       0     —     40       4     5     0.047                  0.92
          0.1 vs. 0.3    20       1     —     35       7     8     0.020                  0.87
          0.2 vs. 0.4    25       4     —     45       13    14    0.052                  0.91
          0.3 vs. 0.5    30       8     —     55       22    23    0.041                  0.91
Chen      0.05 vs. 0.2   17–24    1     —     41–46    4     5     0.046                  0.90
                                              47–48    5     6     0.022–0.069            0.845–0.946
          0.1 vs. 0.3    12–14    1     —     36–39    6     7     0.050                  0.90
                         15–19    2     —     40–43    7     8     0.029–0.075            0.848–0.938
          0.2 vs. 0.4    18–20    4     —     48       13    14    0.050                  0.90
                         21–24    5     —     49–51    14    15    0.034–0.073            0.868–0.937
                         25       6     —     52–55    15    16
          0.3 vs. 0.5    19–20    6     —     55       21    22    0.050                  0.90
                         21–23    7     —     56–58    22    23    0.035–0.064            0.872–0.929
                         24–26    8     —     59–60    23    24
                                        —     61–62    24    25

N1 is the sample size for the first stage of accrual, N is the total sample size after the second stage of accrual, ai is the bound for accepting H0 at stage i, and bi is the bound for rejecting H0 at stage i (i = 1, 2). Designs are listed for Fleming (2), Simon (4), and Green and Dahlberg (5); the optimal design set is listed for Chen and Ng (7).

II. ANALYSIS OF STANDARD PHASE II DESIGNS

As noted in Storer, the hypothesis testing framework used in phase II studies is useful for developing designs and determining sample sizes. The resulting deci-
sion rules are not always meaningful, however, except as tied to hypothetical
follow-up trials that in practice may or may not be done. Thus, it is important
to present confidence intervals for phase II results, which can be interpreted ap-
propriately regardless of the nominal ‘‘decision’’ made at the end of the trial as
to whether further study of the regimen is warranted.
The main analysis issue after a multistage trial is how to generate a con-
fidence interval, since the usual procedures assuming a single-stage design
are biased. Various approaches to generating intervals have been proposed.
These involve ordering the outcome space and inverting tail probabilities or test
acceptance regions, as in estimation following single-stage designs; however,
with multistage designs, the outcome space does not lend itself to any simple
ordering.
Jennison and Turnbull (10) order the outcome space by which boundary
is reached, by the stage stopped at, and by the number of successes (stopping at
stage i is considered more extreme than stopping at stage i + 1 regardless of the number of successes). A value p is not in the 1 − 2α confidence interval if the probability under p of the observed result or one more extreme according to this ordering is less than α (either direction). Chang and O’Brien (11) order the sample space instead based on the likelihood principle. For each p, the sample space for a two-stage design is ordered according to L(x, N*) = (x/N*)^x (1 − p)^{N*−x} / [p^x {(N* − x)/N*}^{N*−x}], where N* is N 1 if x can only be observed at the first stage and N if
at the second (x is the number of responses). P is in the confidence ‘‘interval’’ if
one half of the probability of the observed outcome plus the probability of a more
extreme outcome according to this ordering is α or less. The confidence set is not
always strictly an interval, but the authors state that the effect of discontinuous
points is negligible. Chang and O’Brien intervals were shorter than those of Jenni-
son and Turnbull, although this in part would be due to the fact that Jennison and
Turnbull did not adjust for discreteness by assigning only 1/2 of the probability
of the observed value to the tail as Chang and O’Brien did. Duffy and Santner
(12) recommend ordering the sample space by success percent and also develop
intervals of shorter length than Jennison and Turnbull intervals.
Although they produce shorter intervals, these last two approaches have the
major disadvantage of requiring knowledge of the final sample size to calculate an
interval for a study stopped at the first stage; as noted above, this typically will
be random. The Jennison and Turnbull approach can be used since it only requires
knowledge up to the stopping time.
However, it is not entirely clear how important it is to adjust confidence
intervals for the multistage nature of the design. From the point of view of appro-
priately reflecting the activity of the regimen tested, the usual interval assuming
a single stage design may be sufficient. In this setting the length of the confidence
interval is not of primary importance (sample sizes are small and all intervals are
long). The primary concern is that the interval appropriately reflects the activity of
the regimen. Similar to Storer’s idea, it is assumed that if the confidence interval
excludes p 0, the regimen is considered active, and if it excludes p A, the regimen
is considered insufficiently active. If it excludes neither, results are equivocal;
this seems reasonable whether or not continued testing is recommended for the
better equivocal results.
For Green and Dahlberg designs, the differences between Jennison and Turnbull and unadjusted tail probabilities are 0 if the trial stops at the first stage, −∑_{i=0}^{a 1} bin(i, N 1, p) ∑_{j=x−i}^{N 2} bin(j, N − N 1, p) for the upper tail if stopped at the second stage, and +∑_{i=0}^{a 1} bin(i, N 1, p) ∑_{j=x−i+1}^{N 2} bin(j, N − N 1, p) for the lower tail if stopped at the second stage. (Here a 1 is the stopping bound for accepting H 0 at the first stage and N 2 = N − N 1 is the second-stage sample size.)
Both the upper and lower confidence bounds are shifted to the right for Jennison
and Turnbull intervals. These therefore will more often appropriately exclude p 0
when p A is true and inappropriately include p A when p 0 is true compared with the
unadjusted interval. However, the tail differences are generally small, resulting in
small differences in the intervals. The absolute value of the upper tail difference
is less than approximately 0.003 when the lower bound of the unadjusted interval
is p 0 (normal approximation), whereas the lower tail difference is constrained to
be <0.02 for p > p A due to the early stopping rule. Generally, the shift in a
Jennison and Turnbull interval is noticeable only for small x at the second stage.
As Rosner and Tsiatis (13) note, such results (activity in the first stage, no activity
in the second) are unlikely, possibly suggesting the independent identically dis-
tributed assumption was incorrect.
For example, consider a common design for testing H 0: p = 0.1 versus H A: p = 0.3: stop in favor of H 0 at the first stage if 0 or 1 responses are observed
in 20 patients and otherwise continue to a total of 35. Of the 36 possible trial
outcomes (if planned sample sizes are achieved), the largest discrepancy in the
95% confidence intervals occurs if two responses are observed in the first stage
and none in the second. For this outcome, the Jennison and Turnbull 95% confi-
dence interval is from 0.013 to 0.25, whereas the unadjusted interval is from
0.007 to 0.19. Although not identical, both intervals lead to the same conclusion:
The alternative is ruled out.
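The unadjusted interval quoted above appears to be the exact (Clopper-Pearson) binomial interval; under that assumption, the sketch below (Python with SciPy; the function name is ours) reproduces the quoted endpoints for two responses in 35 patients.

```python
from scipy.stats import beta

def clopper_pearson(x, n, conf=0.95):
    """Exact binomial confidence interval, ignoring the multistage design."""
    a = (1 - conf) / 2
    lower = beta.ppf(a, x, n - x + 1) if x > 0 else 0.0
    upper = beta.ppf(1 - a, x + 1, n - x) if x < n else 1.0
    return lower, upper

print(clopper_pearson(2, 35))   # roughly (0.007, 0.19), the unadjusted interval in the text
```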
For the Fleming and Green and Dahlberg designs listed in Table 1, Table
2 lists the probabilities that the 95% confidence intervals lie above p 0 (evidence
the regimen is active), below p A (evidence the regimen has insufficient activity
to pursue), or cover both p 0 and p A (inconclusive). (In no case are p 0 and p A both
excluded.) Probabilities are calculated for p ⫽ p A and p ⫽ p 0 and for adjusted
(by the Jennison and Turnbull method) and unadjusted intervals.

Table 2 Probabilities Under p 0 and p A for Unadjusted and Jennison-Turnbull (J-T) Adjusted 95% Confidence Intervals

                                       Probability 95% CI       Probability 95% CI       Probability 95% CI
                                       is above p0 when p =     is below pA when p =     includes p0 and pA when p =

                                       p0        pA             p0        pA             p0        pA

0.05 vs. 0.2   Green     J-T           0.014     0.836          0.704     0.017          0.282     0.147
                         Unadjusted    0.014     0.836          0.704     0.017          0.282     0.147
               Fleming   J-T           0.024     0.854          0.704     0.017          0.272     0.129
                         Unadjusted    0.024     0.854          0.704     0.017          0.272     0.129
0.1 vs. 0.3    Green     J-T           0.020     0.866          0.747     0.014          0.233     0.120
                         Unadjusted    0.020     0.866          0.747     0.014          0.233     0.120
               Fleming   J-T           0.025     0.866          0.392     0.008          0.583     0.126
                         Unadjusted    0.025     0.866          0.515     0.011          0.460     0.123
0.2 vs. 0.4    Green     J-T           0.025     0.856          0.742     0.016          0.233     0.128
                         Unadjusted    0.025     0.856          0.833     0.027          0.142     0.117
               Fleming   J-T           0.023     0.802          0.421     0.009          0.556     0.189
                         Unadjusted    0.034     0.862          0.654     0.022          0.312     0.116
0.3 vs. 0.5    Green     J-T           0.022     0.859          0.822     0.020          0.156     0.121
                         Unadjusted    0.022     0.859          0.822     0.020          0.156     0.121
               Fleming   J-T           0.025     0.860          0.778     0.025          0.197     0.115
                         Unadjusted    0.025     0.860          0.837     0.030          0.138     0.110

For the Green and Dahlberg designs considered, probabilities for the Jennison and Turnbull and unadjusted intervals are the same in most cases. The only discrepancy occurs for the 0.2 versus 0.4 design when the final outcome is 11
of 45 responses. In this case the unadjusted interval is from 0.129 to 0.395,
whereas the Jennison and Turnbull interval is from 0.131 to 0.402. There are
more differences between adjusted and unadjusted probabilities for Fleming de-
signs, the largest for ruling out p A in the 0.2 versus 0.4 and 0.1 versus 0.3 designs.
In these designs, no second-stage Jennison and Turnbull interval excludes the
alternative, making this probability unacceptably low under p 0.
The examples presented suggest that adjusted confidence intervals do not
necessarily result in more sensible intervals in phase II designs and in some cases
are worse than not adjusting.

III. OTHER PHASE II DESIGNS


A. Multiarm Phase II Designs
Occasionally, the aim of a phase II study is not to decide whether a particular
regimen should be studied further but to decide which of several new regimens
should be taken to the next phase of testing (assuming they cannot all be). In
these cases selection designs are used, often formulated as follows: Take on to
further testing the treatment arm observed to be best by any amount, where the
number of patients per arm is chosen to be large enough such that if one treatment
is superior by ∆ and the rest are equivalent, the probability of choosing the supe-
rior treatment is p.
Simon et al. (14) published sample sizes for selection designs with response
endpoints, whereas Liu et al. (15) provide sample sizes for survival end points.
For survival the approach is to choose the arm with the smallest estimated β in
a Cox model. Sample size is chosen so that if one treatment is superior with β = −ln(1 + ∆) and the others have the same survival, then the superior treatment will be chosen with probability p.
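A quick simulation makes the operating characteristics of a selection design easy to explore. The sketch below (Python/NumPy; the function name and the tie-handling convention are ours, with ties counted as failures, which is slightly conservative relative to breaking ties at random) estimates the probability that the truly superior arm has the strictly highest observed response count.

```python
import numpy as np

rng = np.random.default_rng(2024)

def prob_correct_selection(n_per_arm, p_best, p_other, n_arms=3, n_sim=100_000):
    """Estimate P(the arm with true rate p_best is observed best by any amount)."""
    best = rng.binomial(n_per_arm, p_best, size=n_sim)
    others = rng.binomial(n_per_arm, p_other, size=(n_sim, n_arms - 1))
    return float(np.mean(best > others.max(axis=1)))

# Example: three arms, 30 patients per arm, one arm superior by 15 percentage points.
print(prob_correct_selection(30, 0.35, 0.20))
```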
Theoretically this is all fine, but in reality the designs are not strictly followed. If response is poor in all arms, the conclusion is to pursue none of the regimens (not an option allowed in these designs). If a ‘‘striking’’ difference is observed, then the temptation is to bypass the confirmatory phase III. In a follow-up to the survival selection paper, Liu et al. (16) noted that the probability of an observed β < −ln(1.7), which cancer investigators consider striking, is not trivial: with two to four arms the probabilities are 0.07–0.08 when in fact there are no differences in the treatment arms.

B. Phase II Designs with Multiple End Points


The selected primary end point of a phase II trial is just one consideration in the
decision to pursue a new agent. Other end points (such as survival and toxicity if
response is primary) must also be considered. For instance, a trial with a sufficient
number of responses to be considered active may still not be of interest if too
many patients experience life-threatening toxicity or if they all die quickly. On
the other hand, a trial with an insufficient number of responses but a good toxicity
profile and promising survival might still be considered for future trials.
Designs have been proposed to incorporate multiple end points explicitly
into phase II studies. Bryant and Day (17) proposed an extension of Simon’s
approach, identifying designs that minimize the expected accrual when the regi-
men is unacceptable either with respect to response or toxicity. Their designs are
terminated at the first stage if either the number of responses is C R1 or less or
the number of patients without toxicity is C T1 or less (or both). The regimen is
concluded useful if the number of patients with responses and the number without
toxicity are greater than C R2 and C T2 respectively, at the second stage. N 1, N, C R1,
C T1, C R2, and C T2 are chosen such that the probability of recommending the regi-
men when the probability of no toxicity is acceptable (p T = p T1) but response is unacceptable (p R = p R0) is less than or equal to α R, the probability of recommending the regimen when response is acceptable (p R = p R1) but toxicity is unacceptable (p T = p T0) is less than or equal to α T, and the probability of recommending the regimen when both are acceptable is 1 − β or better. The constraints are applied either uniformly over all possible correlations between toxicity and response or assuming independence of toxicity and response. Minimization is done
subject to the constraints. For many practical situations, minimization assuming
independence produces designs that perform reasonably well when the assump-
tion is incorrect.
Conaway and Petroni (18) proposed similar designs assuming that a partic-
ular relationship between toxicity and response, an optimality criterion, and a
fixed total sample size are all specified. Design constraints proposed include lim-
iting the probability of recommending the regimen to α or less when both response and toxicity are unacceptable (p R = p R0 and p T = p T0) and to γ or less anywhere else in the null region (p R ≤ p R0 or p T ≥ p T0). The following year,
Conaway and Petroni (19) proposed boundaries allowing for trade-offs between
toxicity and response. Instead of dividing the parameter space as in Fig. 1a, it is
divided according to investigator specifications, such as in Fig. 1b, allowing for
fewer patients with no toxicity when the response probability is higher (and the
reverse).
The test proposed is to accept H 0 when T(x) < c 1 at the first stage or < c 2 at the second, subject to maximum level α over the null region and power at least 1 − β when p R = p R1 and p T = p T1 for an assumed value for the association between response and toxicity. Here, T(x) = ∑ p*_ij ln(p*_ij /p̂_ij), where ij indexes the cells of the 2 × 2 response–toxicity table, the p̂_ij are the usual probability estimates, and the p*_ij are the values achieving inf_H0 ∑ p_ij ln(p_ij /p̂_ij). (T(x) can be interpreted in some sense as a distance from p̂ to H 0.) Interim stopping bounds are chosen to satisfy optimality criteria (the authors’ preference is minimization of expected sample size under the null).

Figure 1 Division of parameter space for two approaches to bivariate phase II design.
(a) An acceptable probability of response and an acceptable probability of no toxicity are
each specified. (b) Acceptable probabilities are not fixed at one value for each but instead
allow for a trade-off between toxicity and response.

There are a number of practical problems with these designs. As for other
designs relying on optimality criteria, they generally cannot be done faithfully
in realistic settings. Even when they can be carried out, defining toxicity as a
single yes–no variable is problematic, since typically several toxicities of various
grades are of interest. Perhaps the most important issue is that of the response–
toxicity trade-off. Any function specified is subjective and cannot be assumed to
reflect the preferences of either investigators or patients in general.

IV. DISCUSSION

Despite the precise formulation of hypotheses and decision rules, phase II trials
are not as objective as we would like. The small sample sizes used cannot support
decision making based on all aspects of interest in a trial. Trials combining more
than one aspect (such as toxicity and response) are fairly arbitrary with respect
to the relative importance placed on each end point (including the 0 weight placed
on the end points not included), so are subject to about as much imprecision in
interpretation as results of single end point trials. Furthermore, a phase II trial
would rarely be considered on its own. By the time a regimen is taken to phase
III testing, multiple phase II trials have been done and the outcomes of the various
trials weighed and discussed. Perhaps statistical considerations in a phase II de-
sign are most useful in keeping investigators realistic about how limited such
studies are.
For similar reasons, optimality considerations both with respect to design
and confidence intervals are not particularly compelling in phase II trials. Sample
sizes in the typical clinical setting are small and variable, making it more impor-
tant to use procedures that work reasonably well across a variety of circumstances
rather than optimally in one. Also, there are various characteristics it would be
useful to optimize; compromise is often in order.
A final practical note—choices of null and alternative hypotheses in phase
II trials are often made routinely, with little thought, but phase II experience
should be reviewed occasionally. As definitions and treatments change, old his-
torical probabilities do not remain applicable.

REFERENCES

1. Gehan EA. The determination of number of patients in a follow up trial of a new chemotherapeutic agent. J Chronic Dis 1961; 13:346–353.
2. Fleming TR. One sample multiple testing procedures for Phase II clinical trials.
Biometrics 1982; 38:143–151.

3. Chang MN, Therneau TM, Wieand HS, Cha SS. Designs for group sequential Phase
II clinical trials. Biometrics 1987; 43:865–874.
4. Simon R. Optimal two-stage designs for Phase II clinical trials. Controlled Clin
Trials 1989; 10:1–10.
5. Green SJ, Dahlberg S. Planned vs attained design in Phase II clinical trials. Stat
Med 1992; 11:853–862.
6. Simon R. How large should a Phase II trial of a new drug be? Cancer Treatment
Rep 1987; 71:1079–1085.
7. Chen T, Ng T-H. Optimal flexible designs in Phase II clinical trials. Stat Med 1998;
17:2301–2312.
8. Herndon J. A design alternative for two-stage, Phase II, multicenter cancer clinical
trials. Controlled Clin Trials 1998; 19:440–450.
9. Storer B. A class of Phase II designs with three possible outcomes. Biometrics 1992;
48:55–60.
10. Jennison C, Turnbull BW. Confidence intervals for a binomial parameter following a
multistage test with application to MIL-STD 105D and medical trials. Technometrics
1983; 25:49–58.
11. Chang MN, O’Brien PC. Confidence intervals following group sequential tests. Con-
trolled Clin Trials 1986; 7:18–26.
12. Duffy DE, Santner TJ. Confidence intervals for a binomial parameter based on
multistage tests. Biometrics 1987; 43:81–94.
13. Rosner G, Tsiatis AA. Exact confidence intervals following a group sequential trial:
a comparison of methods. Biometrika 1988; 75:723–729.
14. Simon R, Wittes R, Ellenberg S. Randomized Phase II clinical trials. Cancer Treat-
ment Rep 1985; 69:1375–1381.
15. Liu PY, Dahlberg S, Crowley J. Selection designs for pilot studies based on survival
endpoints. Biometrics 1993; 49:391–398.
16. Liu PY, LeBlanc M, Desai M. False positive rates of randomized Phase II designs.
Controlled Clin Trials 1999; 20:343–352.
17. Bryant J, Day R. Incorporating toxicity considerations into the design of two-stage
Phase II clinical trials. Biometrics 1995; 51:1372–1383.
18. Conaway M, Petroni G. Bivariate sequential designs for Phase II trials. Biometrics
1995; 51:656–664.
19. Conaway M, Petroni G. Designs for phase II trials allowing for a trade-off between
response and toxicity. Biometrics 1996; 52:1375–1386.
5
Designs Based on Toxicity
and Response

Gina R. Petroni and Mark R. Conaway


University of Virginia, Charlottesville, Virginia

I. INTRODUCTION

In principle, phase II trials evaluate whether a new agent is sufficiently promising to warrant a comparison with the current standard of treatment. An agent is con-
sidered sufficiently promising based on the proportion of patients who ‘‘re-
spond,’’ that is, experience some objective measure of disease improvement. The
toxicity of the new agent, usually defined in terms of the proportion of patients
experiencing severe side effects, has been established in a previous phase I trial.
In practice, the separation between establishing the toxicity of a new agent
in a phase I trial and establishing the response rate in a phase II trial is artificial.
Most phase II trials are conducted not only to establish the response rate but also
to gather additional information about the toxicity associated with the new agent.
Conaway and Petroni (1) and Bryant and Day (2) cite several reasons why toxicity
considerations are important for phase II trials:
1. Sample sizes in phase I trials. The number of patients in a phase I trial
is small and the toxicity profile of the new agent is estimated with little
precision. As a result, there is a need to gather more information about
toxicity rates before proceeding to a large comparative trial.
2. Ethical considerations. Most phase II trials are designed to terminate
the study early if it does not appear that the new agent is sufficiently

promising to warrant a comparative trial. These designs are meant to protect patients from receiving substandard therapy. Patients should be
protected also from receiving agents with excessive rates of toxicity,
and consequently phase II trials should be designed with the possibility
of early termination of the study if an excessive number of toxicities
are observed. This consideration is particularly important in studies of
intensive chemotherapy regimens, where it is hypothesized that a more
intensive therapy induces a greater chance of a response but also a
greater chance of toxicity.
3. The characteristics of the patients enrolled in the previous phase I trials
may be different from those of the patients to be enrolled in the phase
II trial. For example, phase I trials often enroll patients for whom all
standard therapies have failed. These patients are likely to have a
greater extent of disease than patients who will be accrued to the phase
II trial.

With these considerations, several proposals have been made for designing
phase II trials that formally incorporate both response and toxicity end points.
Conaway and Petroni (1) and Bryant and Day (2) propose methods that extend
the two-stage designs of Simon (3). In each of these methods, a new agent is
considered sufficiently promising if it exhibits both a response rate that is greater
than that of the standard therapy and a toxicity rate that does not exceed that of the
standard therapy. Conaway and Petroni (4) consider a different criterion, based on
a trade-off between response and toxicity rates. In these designs, a new agent
with a greater toxicity rate might be considered sufficiently promising if it also
has a much greater response rate than the standard therapy. Thall et al. (5,6)
propose a Bayesian method for monitoring response and toxicity that can also
incorporate a trade-off between response and toxicity rates.

II. DESIGNS FOR RESPONSE AND TOXICITY

Conaway and Petroni (1) and Bryant and Day (2) present multistage designs that
formally monitor response and toxicity. As a motivation for the multistage de-
signs, we first describe the methods for a fixed sample design, using the notation
in Conaway and Petroni (1). In this setting, binary variables representing response
and toxicity are observed in each of N patients. The data are summarized in a 2 × 2 table where X ij is the number of patients with response classification i and toxicity classification j (Table 1). The observed number of responses is X R = X 11 + X 12 and the observed number of patients experiencing a severe toxicity is X T = X 11 + X 21. It is assumed that the cell counts in this table, (X 11, X 12, X 21, X 22), have a multinomial distribution with underlying probabilities, (p 11, p 12, p 21, p 22).

Table 1 Classification of Patients by Response and Toxicity

                          Toxicity
Response        Yes           No            Total

Yes             X 11          X 12          X R
No              X 21          X 22          N − X R
Total           X T           N − X T       N

That is, in the population of patients to be treated with this new agent, a proportion, p ij, would have response classification i and toxicity classification j (Table 2). With this notation the probability of a response is p R = p 11 + p 12 and the probability of a toxicity is p T = p 11 + p 21.
The design is based on having sufficient power to test the null hypothesis
that the new treatment is ‘‘not sufficiently promising’’ to warrant further study
against the alternative hypothesis that the new agent is sufficiently promising to
warrant a comparative trial. Conaway and Petroni (1) and Bryant and Day (2)
interpret the term ‘‘sufficiently promising’’ to mean that the new treatment has
a greater response rate than the standard and that the toxicity rate with the new
treatment is no greater than that of the standard treatment. Defining pRo as the
response rate with the standard treatment and pTo as the toxicity rate for the stan-
dard treatment, the null and alternative hypotheses can be written as

H o: p R ≤ p Ro or p T ≥ p To
H a: p R > p Ro and p T < p To
The null and alternative regions are displayed in Fig. 1.
Table 2 Population Proportions for Response and Toxicity Classifications

                          Toxicity
Response        Yes           No            Total

Yes             p 11          p 12          p R
No              p 21          p 22          1 − p R
Total           p T           1 − p T       1

Figure 1 Null and alternative regions.

A statistic for testing H o versus H a is (X R, X T), with a critical region of the form C = {(X R, X T): X R ≥ c R and X T ≤ c T}. We reject the null hypothesis and declare the treatment sufficiently promising if we observe many responses and few toxicities. We do not reject the null hypothesis if we observe too few re-
sponses or too many toxicities. Conaway and Petroni (1) choose the sample size,
N, and critical values (c R , c T ) to constrain three error probabilities to be less than
prespecified levels α, γ, and 1 ⫺ β, respectively. The three error probabilities
are
1. The probability of incorrectly declaring the treatment promising when
the response and toxicity rates for the new therapy are the same as
those of the standard therapy.
2. The probability of incorrectly declaring the treatment promising when
the response rate for the new therapy is no greater than that of the
standard or the toxicity rate for the new therapy is greater than that of
the standard therapy.
3. The probability of declaring the treatment not promising at a particular
point in the alternative region. The design should yield sufficient power
to reject the null hypothesis for a specific response and toxicity rate,
where the response rate is greater than that of the standard therapy and
the toxicity rate is less than that of the standard therapy.
Mathematically, these error probabilities are constrained as follows:

P(X R ≥ c R, X T ≤ c T | p R = p Ro, p T = p To, θ) ≤ α,   (1)

sup_{p R ≤ p Ro or p T ≥ p To} P(X R ≥ c R, X T ≤ c T | p R, p T, θ) ≤ γ,   (2)

P(X R ≥ c R, X T ≤ c T | p R = p Ra, p T = p Ta, θ) ≥ 1 − β   (3)

where these probabilities are computed for a prespecified value of the odds ratio, θ = (p 11 p 22)/(p 12 p 21), in Table 2. The point (p Ra, p Ta) is a prespecified point in the alternative region, with p Ra > p Ro and p Ta < p To.
Conaway and Petroni (1) compute the sample size and critical values by
enumerating the distribution of (X R , X T ) under particular values for (pR , pT , θ).
As an example, Conaway and Petroni (1) present a proposed phase II trial of
high-dose chemotherapy for patients with non-Hodgkin’s lymphoma. Results
from earlier studies for this patient population have indicated that standard ther-
apy results in an estimated response rate of 50% with approximately 30% of
patients experiencing life-threatening toxicities. In addition, previous results indi-
cated that approximately 35–40% of the patients who experienced a complete
response also experienced life-threatening toxicities. The odds ratio, θ, is deter-
mined by the assumed response rate, toxicity rate, and the conditional probability
of experiencing a life-threatening toxicity given that patient had a complete re-
sponse. Therefore, (p Ro, p To) is assumed to be (0.50, 0.30) and the odds ratio is assumed to be 2.0. Conaway and Petroni (1) chose values α = 0.05, γ = 0.30, and β = 0.10. The trial is designed to have approximately 90% power at the alternative determined by (p Ra, p Ta) = (0.75, 0.15).
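The design calculations require the full 2 × 2 cell probabilities, which are determined by the response rate, the toxicity rate, and the odds ratio. A small sketch (Python; the function name is ours) solves the resulting quadratic for p 11; with the values in the example it gives a conditional probability of toxicity given response of roughly 0.37, consistent with the 35–40% mentioned above.

```python
import math

def cells_from_margins(p_r, p_t, theta):
    """Cell probabilities (p11, p12, p21, p22) of the response-by-toxicity table
    with margins p_r, p_t and odds ratio theta = (p11*p22)/(p12*p21)."""
    if abs(theta - 1.0) < 1e-12:
        p11 = p_r * p_t                                  # independence
    else:
        s = 1.0 + (theta - 1.0) * (p_r + p_t)
        disc = s * s - 4.0 * theta * (theta - 1.0) * p_r * p_t
        p11 = (s - math.sqrt(disc)) / (2.0 * (theta - 1.0))
    return p11, p_r - p11, p_t - p11, 1.0 - p_r - p_t + p11

p11, p12, p21, p22 = cells_from_margins(0.50, 0.30, 2.0)
print(round(p11 / 0.50, 3))   # P(toxicity | response), about 0.372
```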
The extension to multistage designs is straightforward. The multistage de-
signs allow for the early termination of a study if early results indicate that the
treatment is not sufficiently effective or is too toxic. Although most phase II trials
are carried out in at most two stages, for the general discussion, Conaway and
Petroni (1) assume that the study is to be carried out in K stages. At the end of
the kth stage, a decision is made whether to enroll patients for the next stage or to
stop the trial. If the trial is stopped early, the treatment is declared not sufficiently
promising to warrant further study. At the end of the kth stage, the decision to
continue or terminate the study is governed by the boundaries (c Rk , c Tk ), k ⫽
1, . . . , K. The study continues to the next stage if the total number of responses
observed up to and including the kth stage is at least as great as c Rk and the total
number of toxicities up to and including the kth stage is no greater than c Tk. At
the final stage, the null hypothesis that the treatment is not sufficiently promising
to warrant further study is rejected if there are a sufficient number of observed
responses (at least c RK ) and sufficiently few observed toxicities (no more than
c TK ).
In designing the study, the goal is to choose sample sizes for the stages
m 1 , m 2 , . . . , m K and boundaries (c R1 , c T1 ), (c R2 , c T2 ), . . . , (c RK, c TK ) satisfying
the error constraints listed above. For a fixed total sample size, N = ∑_k m k, there
may be many designs that satisfy the error requirements. An additional criterion,
such as one of those proposed by Simon (3) in the context of two-stage trials
with a single binary end point, can be used to select a design. The stage sample
sizes and boundaries can be chosen to give the minimum expected sample size
at the response and toxicity rates for the standard therapy (pRo , pTo ) among all
designs that satisfy the error requirements. Alternatively, one could choose the
design that minimizes the maximum expected sample size over the entire null
hypothesis region.
Conaway and Petroni (1) compute the ‘‘optimal’’ designs for these criteria
for two-stage and three-stage designs using a fixed prespecified value for the
odds ratio, θ. Through simulations, they evaluate the sensitivity of the designs
to a misspecification of the value for the odds ratio.
Bryant and Day (2) also consider the problem of monitoring binary end
points representing response and toxicity. They present optimal designs for two-
stage trials that extend the designs of Simon (3). In the first stage, N 1 patients
are accrued and classified by response and toxicity; Y R1 patients respond and Y T1
patients do not experience toxicity. At the end of the first stage, a decision to
continue to the next stage or to terminate the study is made according to the
following rules, where N 1 , C R1 , and C T1 are parameters to be chosen as part of
the design specification:
1. If Y R1 ≤ C R1 and Y T1 > C T1, terminate due to inadequate response.
2. If Y R1 > C R1 and Y T1 ≤ C T1, terminate due to excessive toxicity.
3. If Y R1 ≤ C R1 and Y T1 ≤ C T1, terminate due to both factors.
4. If Y R1 > C R1 and Y T1 > C T1, continue to the second stage.
In the second stage, N 2 ⫺ N 1 patients are accrued. At the end of this stage, the
following rules govern the decision whether or not the new agent is sufficiently
promising, where N 2 , C R2 , and C T2 are parameters to be determined by the design:
1. If Y R2 ≤ C R2 and Y T2 > C T2, ‘‘not promising’’ due to inadequate response.
2. If Y R2 > C R2 and Y T2 ≤ C T2, ‘‘not promising’’ due to excessive toxicity.
3. If Y R2 ≤ C R2 and Y T2 ≤ C T2, ‘‘not promising’’ due to both factors.
4. If Y R2 > C R2 and Y T2 > C T2, ‘‘sufficiently promising.’’
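The eight rules above translate directly into code. The sketch below (Python; names ours) takes the first-stage counts and the cumulative second-stage counts, with the y_t arguments counting patients without toxicity as in the text.

```python
def bryant_day_decision(y_r1, y_t1, y_r2, y_t2,
                        c_r1, c_t1, c_r2, c_t2):
    """Two-stage decision rules as listed in the text.
    y_r* count responders; y_t* count patients WITHOUT toxicity (cumulative at stage 2)."""
    # First stage
    if y_r1 <= c_r1 and y_t1 <= c_t1:
        return "terminate: inadequate response and excessive toxicity"
    if y_r1 <= c_r1:
        return "terminate: inadequate response"
    if y_t1 <= c_t1:
        return "terminate: excessive toxicity"
    # Second stage
    if y_r2 <= c_r2 and y_t2 <= c_t2:
        return "not promising: inadequate response and excessive toxicity"
    if y_r2 <= c_r2:
        return "not promising: inadequate response"
    if y_t2 <= c_t2:
        return "not promising: excessive toxicity"
    return "sufficiently promising"
```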
The principle for choosing the stage sample sizes and stage boundaries is
the same as in Conaway and Petroni (1). The design parameters are determined
from prespecified error constraints. Although the methods differ in the particular
constraints considered, the motivation for these error constraints is the same.
One would like to limit the probability of recommending a treatment that has an
insufficient response rate or excessive toxicity rate. Similarly, one would like to
constrain the probability of failing to recommend a treatment that is superior to
the standard treatment in terms of both response and toxicity rates. Finally, among
all designs meeting the error criteria, the optimal design is the one that minimizes
the average number of patients treated with an ineffective therapy.
In choosing the design parameters, Q = (N 1, N 2, C R1, C R2, C T1, C T2), Bryant and Day (2) specify an acceptable (P R1) and an unacceptable (P R0) response rate along with an acceptable (P T1) and unacceptable (P T0) rate of nontoxicity. Under any of the four combinations of acceptable or unacceptable rates of response and nontoxicity, Bryant and Day (2) assume that the association between response and toxicity is constant. The association between response and toxicity is determined by the odds ratio, ϕ, in the 2 × 2 table cross-classifying response and
toxicity,

ϕ = [P(No Response, Toxicity) × P(Response, No Toxicity)] / [P(No Response, No Toxicity) × P(Response, Toxicity)]

Bryant and Day (2) parameterize the odds ratio in terms of response and no
toxicity so ϕ corresponds to 1/θ in the notation of Conaway and Petroni (1). For
a design, Q, and an odds ratio, ϕ, let α ij (Q, ϕ) be the probability of recommending
the treatment, given that the true response rate equals PRi and the true nontoxicity
rate equals P Tj, i = 0, 1; j = 0, 1. Constraining the probability of recommending a treatment with an insufficient response rate leads to α 01 (Q, ϕ) ≤ α R, where α R is a prespecified constant. Constraining the probability of recommending a treatment with an unacceptable toxicity rate leads to α 10 (Q, ϕ) ≤ α T, and ensuring a sufficiently high probability of recommending a truly superior treatment requires α 11 (Q, ϕ) ≥ 1 − β, where α T and β are prespecified constants. Bryant
and Day (2) note that α 00 (Q, ϕ) is less than either α 01(Q, ϕ) or α 10 (Q, ϕ), so that
an upper bound on α 00(Q, ϕ) is implicit in these constraints.
There can be many designs that meet these specifications. Among these
designs, Bryant and Day (2) define the optimal design to be the one that mini-
mizes the expected number of patients in a study of a treatment with an unaccept-
able response or toxicity rate. Specifically, Bryant and Day (2) choose the design,
Q, that minimizes the maximum of E 01 (Q, ϕ) and E 10 (Q, ϕ), where E ij is the
expected number of patients accrued when the true response rate equals P Ri and
the true nontoxicity rate equals P Tj , i ⫽ 0, 1; j ⫽ 0, 1. The expected value E 00 (Q,
ϕ) does not play a role in the calculation of the optimal design because it is less
than both E 01 (Q, ϕ) and E 10(Q, ϕ).
The stage sample sizes and boundaries for the optimal design depend on
the value of the nuisance parameter, ϕ. For an unspecified odds ratio, among all
designs that meet the error constraints, the optimal design minimizes the maxi-
mum expected patient accruals under a treatment with an unacceptable response
or toxicity rate, max_ϕ {max(E 01(Q, ϕ), E 10(Q, ϕ))}. Assumptions about a fixed value of the odds ratio lead to a simpler computational problem; this is particularly true if response and toxicity are assumed to be independent (ϕ = 1). Bryant
and Day (2) provide bounds that indicate that the characteristics of the optimal
design for an unspecified odds ratio do not differ greatly from the optimal design
found by assuming that response and toxicity are independent. By considering
a number of examples, Conaway and Petroni (1) came to a similar conclusion.
Their designs are computed under a fixed value for the odds ratio, but different
values for the assumed odds ratio led to similar designs.

III. DESIGNS THAT ALLOW A TRADE-OFF BETWEEN


RESPONSE AND TOXICITY

The designs for response and toxicity proposed by Conaway and Petroni (1) and
Bryant and Day (2) share a number of common features, including the form for
the alternative region. In these designs, a new treatment must show evidence of
a greater response rate and a lesser toxicity rate than the standard treatment. In
practice, a trade-off could be considered in the design, since one may be willing
to allow greater toxicity to achieve a greater response rate or may be willing to
accept a slightly lower response rate if lower toxicity can be obtained. Conaway
and Petroni (4) propose two-stage designs for phase II trials that allow early
termination of the study if the new therapy is not sufficiently promising and allow
a trade-off between response and toxicity.
The hypotheses are the same as those considered for the bivariate designs
of the previous section. The null hypothesis is that the new treatment is not suffi-
ciently promising to warrant further study, either due to an insufficient response
rate or excessive toxicity. The alternative hypothesis is that the new treatment is
sufficiently effective and safe to warrant further study. The terms ‘‘sufficiently
safe’’ and ‘‘sufficiently effective’’ are relative to the response rate, pRo , and the
toxicity rate, pTo , for the standard treatment.
One of the primary issues in the design is how to elicit the trade-off specifi-
cation. Ideally, the trade-off between safety and efficacy would be summarized
as a function of toxicity and response rates that defines a treatment as ‘‘worthy
of further study.’’ In practice this can be difficult to elicit. A simpler method for
obtaining the trade-off information is for the investigator to specify the maximum
toxicity rate, pT,max , that would be acceptable if the new treatment were to produce
responses in all patients. Similarly, the investigator would be asked to specify
the minimum response rate, pR,min , that would be acceptable if the treatment pro-
duced no toxicities. Figure 2 illustrates the set of values for the true response
rate (p R) and true toxicity rate (p T) which satisfy the null and alternative hypotheses.

Figure 2 Null and alternative regions for trade-offs.

The values chosen for Fig. 2 are p Ro = 0.5, p To = 0.2, p R,min = 0.4, and p T,max = 0.7. The line connecting the point (p Ro, p To) and (1, p T,max) is given by the equation p T = p To + tan(ψ T)(p R − p Ro), where tan(ψ T) = (p T,max − p To)/(1 − p Ro). Similarly, the line connecting (p Ro, p To) and (p R,min, 0) is given by the equation p T = p To + tan(ψ R)(p R − p Ro), where tan(ψ R) = p To /(p Ro − p R,min). With ψ T ≤ ψ R, the null hypothesis is

H o: p T ≥ p To + tan(ψ T)(p R − p Ro) or p T ≥ p To + tan(ψ R)(p R − p Ro)

and the alternative hypothesis is

H a: p T < p To + tan(ψ T)(p R − p Ro) and p T < p To + tan(ψ R)(p R − p Ro)
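Written as code, the trade-off region is just a pair of linear inequalities. The sketch below (Python; the function name is ours, and the default values are those of the Fig. 2 example) shows that a treatment with a modestly higher toxicity rate can still fall in the alternative region when its response rate is high enough.

```python
def in_alternative(p_r, p_t, p_ro=0.5, p_to=0.2, p_r_min=0.4, p_t_max=0.7):
    """True if (p_r, p_t) satisfies H_a for the trade-off design (psi_T <= psi_R case)."""
    tan_t = (p_t_max - p_to) / (1.0 - p_ro)        # slope of the line toward (1, p_t_max)
    tan_r = p_to / (p_ro - p_r_min)                # slope of the line toward (p_r_min, 0)
    below_t = p_t < p_to + tan_t * (p_r - p_ro)
    below_r = p_t < p_to + tan_r * (p_r - p_ro)
    return below_t and below_r

print(in_alternative(0.60, 0.25))   # True: extra toxicity is traded for a higher response rate
print(in_alternative(0.50, 0.25))   # False: no response gain, so the extra toxicity is unacceptable
```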
The forms of the null and alternative are different for the case where ψT ⱖ ψR ,
although the basic principles in constructing the design and specifying the trade-
off information remain the same (cf. Conaway and Petroni [4]). Special cases of these hypotheses have been used previously: ψ T = 0 and ψ R = π/2 yield the critical regions of Conaway and Petroni (1) and Bryant and Day (2); ψ R = ψ T = 0 yields hypotheses in terms of toxicity alone, and ψ R = ψ T = π/2 yields hypotheses in terms of response alone.
To describe the trade-off designs for a fixed sample size, we use the nota-
tion and assumptions for the fixed sample size design described in Section II.
As in their earlier work, Conaway and Petroni (4) determine sample size and
critical values under an assumed value for the odds ratio between response and
toxicity. The sample size calculations require a specification of a level of type
I error, α, and power, 1 ⫺ β, at a particular point pR ⫽ pRa and pT ⫽ pTa. The
point (pRa , pTa ) satisfies the constraints defining the alternative hypothesis and
represents the response and toxicity rates for a treatment considered to be superior
to the standard treatment. The test statistic is denoted by T(p), where p ⫽ (1/
N)(X 11 , X 12 , X 21 , X 22 ), is the vector of sample proportions in the four cells of
Table 1 and is based on computing an ‘‘I-divergence measure’’ (cf. Robertson
et al. [7]).
The test statistic has the intuitively appealing property of being roughly
analogous to a ‘‘distance’’ from p to the region H o. Rejection of the null hypothe-
sis results when the observed value of T(p) is ‘‘too far’’ from the null hypothesis
region. A vector of observed proportions p leads to rejection of the null hypothe-
sis if T(p) ⱖ c. For an appropriate choice of sample size (N), significance level
(α), and power (1 ⫺ β), the value c can be chosen to constrain the probability
of recommending a treatment that has an insufficient response rate relative to the
toxicity rate and ensure a high probability of recommending a treatment with
response rate pRa and toxicity rate pTa. The critical value c is chosen to meet the
error criteria:

sup_{(p R, p T) ∈ H o} P(T(p) ≥ c | p R, p T, θ) ≤ α

and

P(T(p) ≥ c | p Ra, p Ta, θ) ≥ 1 − β

These probabilities are computed for a fixed value of the odds ratio, θ, by enumerating the value of T(p) for all possible realizations of the multinomial vector (X 11, X 12, X 21, X 22).
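The I-divergence statistic can be approximated numerically for any observed table. The sketch below (Python with SciPy) is our own numerical formulation, not the authors' enumeration algorithm: it minimizes ∑ p_ij ln(p_ij /p̂_ij) over each of the two half-planes whose union forms the trade-off null region and takes the smaller value, with slopes taken from the Fig. 2 example.

```python
import numpy as np
from scipy.optimize import minimize

def kl(p, p_hat):
    """I-divergence sum of p_ij * log(p_ij / p_hat_ij) over the four cells."""
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p / np.clip(p_hat, 1e-12, 1.0))))

def t_statistic(p_hat, p_ro=0.5, p_to=0.2, tan_t=1.0, tan_r=2.0):
    """Approximate distance from p_hat = (p11, p12, p21, p22) to the trade-off null region."""
    def min_over_halfplane(tan_psi):
        # Null constraint: pT >= p_to + tan_psi*(pR - p_ro), with pR = p11 + p12, pT = p11 + p21.
        cons = [
            {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
            {"type": "ineq",
             "fun": lambda p: (p[0] + p[2]) - (p_to + tan_psi * ((p[0] + p[1]) - p_ro))},
        ]
        res = minimize(kl, x0=np.full(4, 0.25), args=(p_hat,),
                       bounds=[(1e-9, 1.0)] * 4, constraints=cons, method="SLSQP")
        return res.fun
    # The null region is the union of the two half-planes, so take the smaller minimum.
    return min(min_over_halfplane(tan_t), min_over_halfplane(tan_r))

# Observed table with 60% responses and 10% toxicity (an alternative-looking result):
print(t_statistic(np.array([0.05, 0.55, 0.05, 0.35])))
```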
The trade-off designs can be extended to two-stage designs that allow early
termination of the study if the new treatment does not appear to be sufficiently
promising. In designing the study, the goal is to choose the stage sample sizes
(m 1 , m 2 ) and decision boundaries (c 1 , c 2 ) to satisfy error probability constraints
similar to those in the fixed sample size trade-off design.

sup_{(p R, p T) ∈ H o} P(T 1(p 1) ≥ c 1, T 2(p 1, p 2) ≥ c 2 | p R, p T, θ) ≤ α

and

P(T 1(p 1) ≥ c 1, T 2(p 1, p 2) ≥ c 2 | p Ra, p Ta, θ) ≥ 1 − β

where T 1 is the test statistic computed on the stage 1 observations and T 2 is the
test statistic computed on the accumulated data in stages 1 and 2. As in the fixed
sample size design, these probabilities are computed for a fixed value of the odds
ratio and are found by enumerating all possible outcomes of the trial.
In cases where many designs meet the error requirements, an optimal design
is found according to the criterion in Bryant and Day (2) and Simon (3). Among
all designs that meet the error constraints, the chosen design minimizes the maxi-
mum expected sample size under the null hypothesis. Through simulations, Cona-
way and Petroni (4) investigate the effect of fixing the odds ratio on the choice
of the optimal design. They conclude that unless the odds ratio is badly misspeci-
fied, the choice of the odds ratio has little effect on the properties of the optimal
design.
The critical values for the test statistic are much harder to interpret than
the critical values in Conaway and Petroni (1) or Bryant and Day (2), which are
counts of the number of observed responses and toxicities. We recommend two
plots, similar to Figs. 2 and 3 in Conaway and Petroni (4) to illustrate the charac-
teristics of the trade-off designs. The first is a display of the power of the test,
so that the investigators can see the probability of recommending a treatment
with true response rate pR and true toxicity rate pT. The second plot displays the
rejection region, so that the investigators can see the decision about the treatment
that will be made for specific numbers of observed responses and toxicities. With
these plots, the investigators can better understand the implications of the trade-
off being proposed.
The trade-off designs of Conaway and Petroni (4) were motivated by the
idea that a new treatment could be considered acceptable even if the toxicity rate
for the new treatment is greater than that of the standard treatment, provided
the response rate improvement is sufficiently large. This idea also motivated the
Bayesian monitoring method of Thall et al. (5,6). They note that, for example,
a treatment that improves the response rate by 15 percentage points might be
considered promising, even if its toxicity rate is 5 percentage points greater than
the standard therapy. If, however, the new therapy increases the toxicity rate by
10 percentage points, it might not be considered an acceptable therapy.
Thall, et al. (5,6) outline a strategy for monitoring each end point in
the trial. They define, for each end point in the trial, a monitoring boundary
based on prespecified targets for an improvement in efficacy and an unacceptable
increase in the rate of adverse events. In the example given above for a trial
with a single response end point and a single toxicity end point, the targeted
improvement in response rate is 15% and the allowance for increased toxicity is
5%.
Thall et al. (5,6) take a Bayesian approach that allows for monitoring each
end point on a patient by patient basis. Although their methods allow for a number
of efficacy and adverse event end points, we simplify the discussion by consider-
ing only a single efficacy event (response) and a single adverse event (toxicity)
end point. Before the trial begins, they elicit a prior distribution on the cell proba-
bilities in Table 2. Under the standard therapy, the cell probabilities are denoted
PS = (pS11, pS12, pS21, pS22); under the new experimental therapy, the cell probabili-
ties are denoted PE = (pE11, pE12, pE21, pE22). Putting a prior distribution on the
cell probabilities (pG11, pG12, pG21, pG22) induces a prior distribution on pGR = pG11
+ pG12 and on pGT = pG11 + pG21, where G stands for either S or E. A Dirichlet
prior for the cell probabilities is particularly convenient in this setting, since this
induces a beta prior on pGR and pGT, for G = S or E.
In addition to the prior distribution, Thall et al. (5,6) specify a target
improvement, δ(R), for response, and a maximum allowable difference, δ(T),
for toxicity. The monitoring of the end points begins after a minimum num-
ber of patients, m, have been observed. It continues until either a maximum
number of patients, M, have been accrued or a monitoring boundary has been
crossed.
In a typical phase II trial, in which only the new therapy is used, the distri-
bution on the probabilities under the standard therapy remains constant through-
out the trial, whereas the distribution on the probabilities under the new therapy
is updated each time a patient’s outcomes are observed. After the response and
toxicity classification on j patients, X j , have been observed, there are several
possible decisions one could make. If there is strong evidence that the new ther-
apy does not meet the targeted improvement in response rate, then the trial should
be stopped and the new treatment declared ‘‘not sufficiently promising.’’ Alterna-
tively, if there is strong evidence that the new treatment is superior to the standard
treatment in terms of the targeted improvement for response, the trial should be
stopped and the treatment declared ‘‘sufficiently promising.’’ In terms of toxicity,
the trial should be stopped if there is strong evidence of an excessive toxicity
rate with the new treatment. Thall et al. (5,6) translate these rules into statements
about the updated (posterior) distribution [pE | Xj] and the prior distribution pS,
using prespecified cutoffs for what constitutes ‘‘strong evidence.’’ For m ≤ j ≤
M, the monitoring boundaries are
P[pER − pSR > δ(R) | Xj] ≤ pL(R)
P[pER > pSR | Xj] ≥ pU(R)
P[pET − pST > δ(T) | Xj] ≥ pU(T)
where pL (R), pU (R), and pU (T) are prespecified probability levels. Numerical inte-
gration is required to compute these probabilities, but the choice of the Dirichlet
prior makes the computations relatively easy.
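In practice these posterior probabilities can also be approximated by Monte Carlo simulation rather than numerical integration. The sketch below (Python) is illustrative only: the Dirichlet prior parameters, observed cell counts, and function name are assumptions for the example, not values from Thall et al., and the cells are ordered (response and toxicity, response only, toxicity only, neither).

import numpy as np

rng = np.random.default_rng(0)

def monitoring_probs(prior_E, prior_S, counts_E, delta_R, delta_T, n_draws=200_000):
    # Dirichlet posterior for the experimental arm, Dirichlet prior for the standard arm.
    pE = rng.dirichlet(np.asarray(prior_E, float) + np.asarray(counts_E, float), n_draws)
    pS = rng.dirichlet(np.asarray(prior_S, float), n_draws)
    pER, pET = pE[:, 0] + pE[:, 1], pE[:, 0] + pE[:, 2]   # induced beta margins
    pSR, pST = pS[:, 0] + pS[:, 1], pS[:, 0] + pS[:, 2]
    return (np.mean(pER - pSR > delta_R),   # compare with pL(R)
            np.mean(pER > pSR),             # compare with pU(R)
            np.mean(pET - pST > delta_T))   # compare with pU(T)

# illustrative call: vague prior for the new therapy, 20 patients observed
print(monitoring_probs([1, 1, 1, 1], [4, 8, 2, 6], [5, 6, 3, 6], 0.15, 0.05))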

IV. SUMMARY

All methods discussed in this chapter have advantages in monitoring toxicity in


phase II trials. None of the methods relies on asymptotic approximations for distribu-
tions, and all are well suited for the small sample sizes typically encountered in phase
II trials. The bivariate designs of Conaway and Petroni (1) and Bryant and Day
(2) have critical values that are based on the observed number of responses and
the observed number of toxicities; these statistics are easily calculated and inter-
preted by the investigators. Although there is no formal trade-off discussion in
these articles, the general methods can be adapted to the kind of trade-off discussed
in Thall et al. (5,6). To do this, one needs to modify the hypotheses to be tested.
For example, the null and alternative hypotheses could be changed to
Ho: pR ≤ pRo + δR or pT ≥ pTo + δT
Ha: pR > pRo + δR and pT < pTo + δT
for some prespecified δR and δT. The trade-off designs of Conaway and Petroni
(4) have a trade-off strategy that permits the allowable level of toxicity to increase
with the response rate. In contrast, in the trade-off example of Thall et al. (5,6),
a 5% increase in toxicity would be considered acceptable for a treatment with
a 15% increase in response. Because the allowance in toxicity is prespecified,
this means that only a 5% increase in toxicity is allowable even if the increase in
response rate with the new treatment is as much as 30%. With the trade-off of Conaway
and Petroni (4), the standard for ‘‘allowable toxicity’’ is greater for a treatment
with a 30% improvement than for one with a 15% improvement. The methods
of Thall et al. (5,6) have advantages in terms of being able to monitor outcomes
on a patient by patient basis. At each monitoring point, the method can provide
graphical representations of the probability associated with each of the decision
rules.

REFERENCES

1. Conaway MR, Petroni GR. Bivariate sequential designs for phase II trials. Biometrics
1995; 51:656–664.
2. Bryant J, Day R. Incorporating toxicity considerations into the design of two-stage
phase II clinical trials. Biometrics 1995; 51:1372–1383.
3. Simon R. Optimal two-stage designs for phase II clinical trials. Controlled Clin Trials
1989; 10:1–10.
4. Conaway MR, Petroni GR. Designs for phase II trials allowing for trade-off between
response and toxicity. Biometrics 1996; 52:1375–1386.
5. Thall PF, Simon RM, Estey EH. Bayesian sequential monitoring designs for single-
arm clinical trials with multiple outcomes. Stat Med 1995; 14:357–379.
6. Thall PF, Simon RM, Estey EH. New statistical strategy for monitoring safety and
efficacy in single-arm clinical trials. J Clin Oncol 1996; 14:296–303.
7. Robertson T, Dykstra RL, Wright FT. Order Restricted Statistical Inference. Chiches-
ter: John Wiley and Sons, Ltd., 1988.
6
Phase II Selection Designs

P. Y. Liu
Fred Hutchinson Cancer Research Center, Seattle, Washington

I. BASIC CONCEPT

When there are multiple promising new therapies in a disease setting, it may not
be possible to test all of them against the standard treatment in a definitive phase
III trial. The sample sizes required for a phase III study with more than three
arms could be prohibitive (1). In addition, the analysis can be highly complex
and prone to errors due to the large number of possible comparisons in a multiarm
study (see Chap. 4). An alternative strategy is to screen the new therapies first
and choose one to test against a standard treatment in a simple two-arm phase
III trial. Selection designs can be used in such circumstances.
Simon et al. (2) first introduced statistical methods for ranking and selection to
the oncology literature. In a selection design, patients are randomized to treatments
involving new combinations or schedules of known active agents or new agents for
which activity against the disease in question has already been demonstrated in
some limited setting. In other words, the regimens under testing have already shown
promise. Now the aim is to narrow down the choice for formal comparisons with
the standard therapy. In this approach, one always selects the observed best treat-
ment for further study, however small the advantage over the others may appear to
be. Hypothesis tests are not performed. Sample sizes are established so that if a
treatment exists for which the underlying efficacy is superior to the others by a
specified amount, it will be selected with a high probability. The required sample
sizes are usually similar to those associated with pilot phase II trials.
Although the statistical principles for the selection design are simple, its
application can be rather slippery. Falsely justified by the randomized treatment
assignment, the major abuse of the design is to treat the observed ranking as
conclusive and forego the subsequent phase III testing. This practice is especially
dangerous when a standard arm is included as the basis for comparison or when
all treatment arms are experimental and a standard treatment does not exist for
the particular disease. A ‘‘treatment of choice’’ is often erroneously concluded
in such situations. It is of vital importance to emphasize up front that a selection
design serves merely as a precursor to the requisite definitive phase III compari-
son. Because of the design’s moderate sample sizes and lack of control for false-
positive and false-negative findings, the observed best treatment could be truly
superior or it could appear so simply due to chance with no real advantage over
the other treatments. The approach presumes the subsequent conduct of definitive
phase III trials and makes no attempt, and therefore has no power, to distinguish
the former from the latter at the selection step. Results from selection trials are
error prone when treated as ends in themselves. Yet, one treatment can look
substantially better than the others, and there is often great temptation to treat
the unproved results as final (3). Therefore, unless the follow-on phase III study
is ensured by some external mechanism such as government regulations for new
drug approval, selection designs can do more harm than good by their propensity
for being misused. The false-positive rate of the misapplication is discussed in
more detail in Section IV.

II. SAMPLE SIZE REQUIREMENTS


A. Binary Outcomes
Table 1 shows sample size requirements for binary outcomes with K = 2, 3,
and 4 groups from Simon et al. (2). With the listed N per group and true re-
sponse rates, the correct selection probability should be approximately 0.90. The
sample sizes were presumably derived by normal approximations to binomial
distributions. A check by exact probabilities indicates that the actual correct se-
lection probability ranges from 0.89 down to 0.86 when N is small. Increasing
the sample size per group by five raises the correct selection probability to 0.90
in all cases and may be worth considering when N is less than 30.
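The exact check is straightforward to reproduce. The sketch below (Python) computes the correct selection probability for K arms of N patients when arms 1 through K − 1 share one response rate and arm K has the higher rate; ties for the observed maximum are broken at random, which is one common convention and is treated as an assumption here.

from math import comb
from scipy.stats import binom

def correct_selection_prob(n, p_inferior, p_best, k):
    # P(arm K is selected): sum over its response count x, with j of the other
    # k-1 arms tying at x and the rest falling below x; a tie is won with
    # probability 1/(j+1).
    total = 0.0
    for x in range(n + 1):
        p_x = binom.pmf(x, n, p_best)
        p_less = binom.cdf(x - 1, n, p_inferior) if x > 0 else 0.0
        p_tie = binom.pmf(x, n, p_inferior)
        for j in range(k):
            total += p_x * comb(k - 1, j) * p_tie**j * p_less**(k - 1 - j) / (j + 1)
    return total

# e.g. the first row of Table 1: 10% vs. 25%, 21 patients per arm, K = 4
print(round(correct_selection_prob(21, 0.10, 0.25, 4), 3))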
Except in extreme cases, Table 1 indicates the sample size to be relatively
insensitive to baseline response rates (i.e., response rates of groups 1 through
K − 1). Since precise knowledge of the baseline rates is often not available, a
common conservative approach is to always use the largest sample size for each
K, that is, 37, 55, and 67 patients per group for K = 2, 3, and 4, respectively.
Although a total N of 74 for two groups is in line with large phase II studies,
the total number of patients required for four groups, that is, close to 270, could
render the design impractical for many applications.
Table 1  Sample Size per Treatment for Binary Outcomes and 0.90 Correct Selection Probability

Response rates                      N per group
P1, . . . , PK−1     PK       K = 2    K = 3    K = 4

10%                  25%        21       31       37
20%                  35%        29       44       52
30%                  45%        35       52       62
40%                  55%        37       55       67
50%                  65%        36       54       65
60%                  75%        32       49       59
70%                  85%        26       39       47
80%                  95%        16       24       29

From Ref. 2.

B. Survival Outcomes
For censored survival data, Liu et al. (4) suggested fitting the Cox proportional
hazards model, h(t, z) = h0(t)exp(β′z), to the data, where z is the (K − 1)-dimen-
sional vector of treatment group indicators and β = (β1, . . . , βK−1) is the vector
of log hazard ratios. We proposed selecting the treatment with the smallest β̂i
(where β̂K ≡ 0) for further testing. Sample sizes for 0.90 correct selection proba-
bility were calculated based on the asymptotic normality of the β̂i. The require-
ments for exponential survival and uniform censoring are reproduced in Table 2.
Simulation studies of robustness of the proportional hazards assumption found
the correct selection probabilities to be above 0.80 for moderate departures from
the assumption.
As with binary outcomes, the design becomes less practical when the
hazard ratio between the worst and the best groups is smaller than 1.5 or when
there are more than three groups. Table 2 covers scenarios where the patient
enrollment period is similar to the median survival of the worst groups. It
does not encompass situations where the two are different. Since the effective
sample size for exponential survival distributions is the number of uncensored
observations, the actual number of expected events is the same for the differ-
ent rows in Table 2. For 0.90 correct selection probability, Table 3 gives
the approximate number of events needed per group for the worst groups. With
∫IdF as the proportion of censored observations, where I and F are the respec-
tive cumulative distribution functions for censoring and survival times, some
readers may find the expected event count more flexible for planning pur-
poses.
Table 2  Sample Size per Treatment for Exponential Survival Outcomes with 1-Year Accrual and 0.90 Correct Selection Probability

                          K = 2                K = 3                K = 4
Median†  Follow‡    1.3*   1.4*   1.5*   1.3*   1.4*   1.5*   1.3*   1.4*   1.5*

0.5        0         115     72     51    171    107     76    206    128     91
0.5        0.5        71     44     31    106     66     46    127     79     56
0.5        1          59     36     26     88     54     38    106     65     46
0.75       0         153     96     69    229    143    102    275    172    122
0.75       0.5        89     56     40    133     83     59    160    100     70
0.75       1          70     44     31    104     65     46    125     78     55
1          0         192    121     87    287    180    128    345    216    153
1          0.5       108     68     48    161    101     72    194    121     86
1          1          82     51     36    122     76     54    147     92     65

* Hazard ratio of groups 1 through K − 1 vs. group K.
† Median survival in years for groups 1 through K − 1.
‡ Additional follow-up in years after accrual completion.
From Ref. 4.

Table 3  Event Count per Group for the Worst Groups for Exponential Survival and 0.90 Correct Selection Probability

           HR*
K      1.3    1.4    1.5

2       54     34     24
3       80     50     36
4       96     60     43

* Hazard ratio of groups 1 through K − 1 vs. group K.

III. VARIATIONS OF THE DESIGN


A. Designs with Toxicity Acceptance Criteria
Toxicity or side effects often are major concerns for cancer treatments. If the
toxicity profiles of the treatments under evaluation are not well known, it is rec-
ommended that formal acceptance criteria should be established for them. Selec-
tion then takes place only among those with acceptable toxicity. Treatments could
even be stopped early to guard against excessive toxicity. As an example, the
Southwest Oncology Group study S8835 investigated mitoxantrone or floxuridine


administered intraperitoneally in patients with minimal residual ovarian cancer
after the second-look laparotomy (5). The study was designed with 37 patients
per arm; the treatment with a higher percent of patients free of disease progression
or relapse at 1 year would be selected for more evaluation. However, patient
accrual to either treatment could be stopped early if unacceptable toxicity was
observed. Forty percent or more of the patients not tolerating the initial dose for
at least two courses of treatment was considered not acceptable. A treatment
would be dropped if 13 or more of the first 20 patients on that arm could not tolerate
at least two courses of treatment at the starting dose, since the probability of 13
or more out of 20 is only 0.02 if the true proportion is 40%.
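The 0.02 figure is a simple binomial tail probability and can be verified directly, as in the short Python sketch below.

from scipy.stats import binom

# chance that 13 or more of the first 20 patients cannot tolerate the starting
# dose when the true intolerance rate is 40%
print(binom.sf(12, 20, 0.40))   # sf(12) = P(X >= 13), roughly 0.02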

B. Designs with Minimum Activity Requirements


Though the selection design is most appropriate when acceptable levels of treat-
ment efficacy are no longer in question, the idea of selection is sometimes applied
to randomized phase II trials when anticancer activities have not been previously
established for the treatments involved. Alternatively, the side effects of the treat-
ments could be sufficiently severe that a certain activity level must be met to
justify the therapy. In such cases, each treatment arm is designed as a stand-alone
phase II trial with the same acceptance criteria for all arms. When more than one
treatment arm is accepted, the observed best arm is selected for further study.
Statistical properties of this approach have not been formally quantified. How-
ever, designing the study with the larger sample size between what is required
for a standard phase II and that for a selection phase II would generally give a
reasonable result. For example, with binary data, if a 20% response rate would
not justify further investigating a treatment regimen whereas a 40% response rate
definitely would, a standard phase II design based on the binomial distribution
requires 44 patients in a single-stage study. Fourteen or more responses out of
44 would represent sufficient activity for continued pursuit (6). This design has
a type I error of 0.04 (chance of observing ≥14 responses out of 44 when the
true response rate is 20%) and a power of 0.90 (chance of observing ≥14 re-
sponses out of 44 when the true response rate is 40%).
On the other hand, for a selection design to have 0.90 correct selection
probability when the response rate in one treatment is higher than the others by an
absolute 15%, a sample size of 37 or 55 per group is required for K = 2 or 3,
respectively, per Table 1. Therefore, one would design the study with 44 patients
per arm when K = 2 and 55 patients per arm when K = 3. The acceptance
criterion for N = 55 is adjusted to 17 or more responses. When activity levels
are low for all treatments, the minimum response requirement would dominate
and reject all treatments. When there is more than one treatment with an acceptable
response level and one particularly superior treatment, the combined minimum
activity/selection criteria would also result in the correct selection with high prob-
ability. Limited simulations were conducted for K = 2, N = 44 and K = 3, N = 55,
with the minimum acceptance criteria stated above and one superior arm for
which the response rate is higher than the rest by 15%. The results indicate that
the probability of the best arm meeting the minimum requirement and being selected
is close to 0.20 when the true best response rate (pK) is 25%. The correct selection
probability is in the low to mid 0.70 range when pK = 35% and approximately
0.90 when pK is 45% or higher.
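Simulations of this combined rule are easy to set up. The sketch below (Python) treats the K = 2 case with 44 patients per arm and the 14-response minimum described above; the true response rates used in the example and the random tie-breaking rule are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)

def accept_and_select_best(p_rates, n=44, min_resp=14, n_sim=100_000):
    # Probability that the truly best arm (last entry of p_rates) has at least
    # min_resp responses and is the observed best, ties broken at random.
    p_rates = np.asarray(p_rates, float)
    best = len(p_rates) - 1
    x = rng.binomial(n, p_rates, size=(n_sim, len(p_rates)))
    is_top = x == x.max(axis=1, keepdims=True)
    win = is_top[:, best] / is_top.sum(axis=1)      # random tie-break
    return float(np.mean(win * (x[:, best] >= min_resp)))

# inferior arm 15 points below the best arm, as in the text's configuration
for p_best in (0.25, 0.35, 0.45):
    print(p_best, round(accept_and_select_best([p_best - 0.15, p_best]), 3))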
Alternatively, for outcomes available in a short time, the trial could proceed
in stages. The larger of the two sample sizes, that of the two-stage design at full
accrual and that of the selection design, should still be used. The sample sizes at different stages and
acceptance criteria could be adapted if needed (6). If more than one treatment is
accepted at the completion of the second stage, the observed best treatment would
be selected for further study. Some authors suggest designing the activity screen-
ing with standard two-stage design sample sizes, with more patients enrolled for
the selection purpose only when more than one treatment is accepted. This ap-
proach is not recommended because the decision to enroll more patients (or not)
could be influenced by observed outcome data.

C. Designs for Ordered Treatments


When the K (≥3) treatments under consideration consist of increasing dose
schedules of the same agents, the design can take advantage of this inherent order.
A simple method is to fit regression models to the outcomes with treatment groups
coded in an ordered manner according to the increasing dose levels. Logistic
regression for binary outcomes and the Cox model for survival are obvious
choices. A single independent variable with equally spaced scores for the treat-
ments would be included in the regression (e.g., 1, 2, and 3 for three groups). If
the sign of the observed slope is in the expected direction, the highest dose with
acceptable toxicity is selected for further study. Otherwise, the lowest dose sched-
ule would be selected.
Compared with the nonordered design, this approach should require smaller
sample sizes for the same correct selection probability. Limited simulations were
conducted with the following results. For binary data with K = 3, p1 = 40%, p3
= 55%, p1 ≤ p2 ≤ p3, approximately N = 35 per arm is needed for a 0.90 chance
that the slope from the logistic regression is positive. Compared with N = 55
given in Table 1, this is a substantial reduction in sample size. Similarly, for
K = 4, p1 = 40%, p4 = 55%, p1 ≤ p2 ≤ p3 ≤ p4, 40 patients per arm are needed
instead of 67. For exponential survival data with a 1.5 hazard ratio between the
worst groups and the best group, approximately 28 and 32 events per group are
needed for the worst groups for K = 3 and 4, respectively, as compared with 34
and 41 given in Table 3.
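A simulation along these lines is sketched below (Python, using statsmodels for the logistic fit). The response rate for the middle dose is an assumed value, since the text specifies only p1 and p3; the quantity estimated is the probability that the fitted slope on the dose score is positive.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

def prob_positive_slope(p_by_dose, n_per_arm, n_sim=2000):
    # Probability that the logistic-regression slope on the dose score
    # (1, 2, ..., K) is positive, under the true response rates p_by_dose.
    k = len(p_by_dose)
    dose = np.repeat(np.arange(1.0, k + 1.0), n_per_arm)
    X = sm.add_constant(dose)
    p = np.repeat(np.asarray(p_by_dose, float), n_per_arm)
    positive = 0
    for _ in range(n_sim):
        y = rng.binomial(1, p)
        fit = sm.Logit(y, X).fit(disp=0)
        positive += int(fit.params[1] > 0)
    return positive / n_sim

# e.g. K = 3, response rates 40%, 47.5% (assumed middle value), 55%, N = 35 per arm
print(prob_positive_slope([0.40, 0.475, 0.55], n_per_arm=35))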
IV. MISAPPLICATIONS AND RESULTING FALSE-POSITIVE


RATES

As mentioned in the beginning, the principal misuse of the selection design is


to treat the results as ends in themselves without the required phase III investiga-
tions. Liu et al. (3) previously published the false-positive rates of this misapplica-
tion. Briefly, for binary outcomes, when the true response rates are the same in
all treatments and in the 10–20% range, the chance of observing an absolute
10% or higher difference between two treatments is roughly 0.20 to 0.35 when
K = 2 and 0.30 to 0.45 when K = 3. When the shared response rate is in the
30–60% range, the chance of observing a 15% or greater difference is close to
0.20 for K = 2 and 0.20 to 0.30 for K = 3. There is a 0.10 chance for a greater
than 20% difference if the true rate is 50% or 60% for either K. These are impres-
sive-looking differences arising with high frequency purely by chance. Simi-
larly, with Table 2 sample sizes and the same exponential survival distribution
for all groups, the chances of observing a hazard ratio greater than 1.3 and 1.5
are 0.37 and 0.16, respectively, for K = 2 and 0.52 and 0.21, respectively, for
K = 3. Again, observed hazard ratios of 1.3 to 1.5 often represent true treatment
advances in large definitive phase III studies. But with selection sample sizes they
appear with alarmingly high probabilities when there are no survival differences
between treatments at all.
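Probabilities of this kind are simple two-sample binomial calculations. The sketch below (Python) computes the chance of an observed difference of at least Δ when both arms share the same true rate; the particular inputs (37 patients per arm, a common 50% rate, Δ = 15%) are one illustrative combination from the ranges quoted above.

from scipy.stats import binom

def prob_observed_diff_at_least(n, p, delta):
    # P(|p1_hat - p2_hat| >= delta) for two arms of n patients with the same
    # true response rate p (i.e., no real treatment difference).
    pmf = [binom.pmf(x, n, p) for x in range(n + 1)]
    return sum(pmf[x1] * pmf[x2]
               for x1 in range(n + 1)
               for x2 in range(n + 1)
               if abs(x1 - x2) >= delta * n)

# e.g. 37 patients per arm, common 50% response rate, a difference of 15% or more
print(round(prob_observed_diff_at_least(37, 0.50, 0.15), 2))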
Some have proposed changing the selection criterion so that the observed
best treatment will be further studied only when the difference over the worst
group is greater than some positive ∆; otherwise, none of the treatments will be
pursued. Although this approach may seem appealing at face value, the sample
size requirement for the same correct selection probability increases quickly as
∆ increases. To illustrate, for binary data with K = 2, p1 = 50%, and p2 = 65%,
36 patients per treatment are required for ∆ = 0 and a 0.90 correct selection
probability per Table 1. With the same configuration and 0.90 correct selection
probability, the required number of patients per group is 40, 55, 79, and 123
when ∆ = 1%, 3%, 5%, and 7%, respectively. Clearly, when ∆ > 5% the sample
size required is impractical. Even with ∆ = 5% and 79 patients per group, by
design the correct selection probability remains 0.90 when the true response rates
are 50% and 65%, yet the results are by no means definitive when a greater than
5% absolute difference is observed. When p1 = p2 = 50% and n1 = n2 = 79,
the chances of observing |p̂1 − p̂2| > 5% and 10% are approximately 0.53 and
0.21. Therefore, this approach is not recommended because it may incur a false
sense of confidence that a truly superior treatment has been found when the ob-
served results are positive.
We (3) also pointed out that performing hypothesis tests post-hoc changes
the purpose of the design. If the goal is to reach definitive answers, then a phase
III comparison should have been designed with appropriate analyses and error
rates. Testing hypotheses with selection sample sizes could be likened to conduct-
ing the initial interim analysis for phase III trials. It is well known that extremely
stringent p values are required to ‘‘stop the trial’’ at this early stage.
Finally, the inclusion of a standard or control treatment in a selection design
is especially dangerous. Without a control arm, any comparison between the stan-
dard and the observed best treatment from a selection trial is recognized as infor-
mal because the limitations of historical comparisons are widely accepted. When
a control arm is included for randomization, the legitimacy for comparison is
established and there is great temptation to interpret the results literally and
‘‘move on.’’ If there is no efficacy difference between treatments, the chance
of observing an experimental treatment better than the control is (K − 1)/K, that
is, 1/2 for K = 2, 2/3 for K = 3, and so on. The chance of an impressive difference
between an experimental treatment and the control treatment is as discussed
above for K = 2 and higher than 2/3 of those discussed above for K = 3. In this
case, false-negative conclusions are as damaging as false-positive ones because a
new treatment with similar efficacy as the control but less severe side effects
could be dismissed as ineffective.

V. CONCLUDING REMARKS

The statistical principles of the selection design are simple and adaptable to vari-
ous situations present in cancer clinical research. Applied correctly, the design
can serve a useful function in the long and arduous process of new treatment
discovery. However, perhaps due to the time and resources involved, there can
be a tremendous tendency to stop short of the phase III testing and declare winners
at less than a quarter of the distance of this marathon race. The resulting errors
can send research into lengthy detours and do great disservice to cancer patients.
Unless follow-on definitive evaluations are ensured by external means, the ab-
sence of a control treatment may be the only imperfect safeguard for the design.

REFERENCES

1. Liu PY, Dahlberg S. Design and analysis of multiarm clinical trials with survival
endpoints. Controlled Clin Trials 1995; 16:119–130.
2. Simon R, Wittes RE, Ellenberg SS. Randomized phase II clinical trials. Cancer
Treatment Rep 1985; 69:1375–1381.
3. Liu PY, LeBlanc M, Desai M. False positive rates of randomized phase II designs.
Controlled Clin Trials 1999; 20:343–352.
4. Liu PY, Dahlberg S, Crowley J. Selection designs for pilot studies based on survival.
Biometrics 1993; 49:391–398.
5. Muggia FM, Liu PY, Alberts DS, et al. Intraperitoneal mitoxantrone or floxuridine:
effects on time-to-failure and survival in patients with minimal residual ovarian can-
cer after second-look laparotomy—a randomized phase II study by the Southwest
Oncology Group. Gynecol Oncol 1996; 61:395–402.
6. Green S, Dahlberg S. Planned versus attained design in phase II clinical trials. Stat
Med 1992; 11:853–862.
7. Gibbons JD, Olkin I, Sobel M. Selecting and Ordering Populations: A New Statisti-
cal Methodology. New York: Wiley, 1977.
7
Power and Sample Size for Phase III
Clinical Trials of Survival

Jonathan J. Shuster
University of Florida, Gainesville, Florida

I. INTRODUCTION

This chapter is devoted to two-treatment 1-1 randomized trials where the outcome
measure is survival or, more generally, time until an adverse event. In practical
terms, planning requires far more in the way of assumptions than trials whose
results are accrued quickly. This chapter is concerned with the most common set
of assumptions, proportional hazards, which implies that the ratio of the instanta-
neous probability of death at a given instant (treatment A: treatment B), for pa-
tients alive at time T, is the same for all values of T. Sample size guidelines are
presented first for situations in which there is no sequential monitoring. These
are modified by an inflation factor to allow for O’Brien–Fleming type sequential
monitoring (1). The methodology is built on a two-piece exponential model ini-
tially, but this is later relaxed to cover proportional hazards in general.
One of the most important aspects of this chapter centers on the non-
robustness of the power and type I error properties when proportional hazards
are violated. This is especially problematic when sequential methods are used.
Statisticians are faced with the need for such methods on the one hand but with
the reality that they are based on questionable forecasts. As a partial solution to
this problem, a ‘‘leap-frog’’ approach is proposed, where a trial accrues a block
of patients and is then put on hold.
In Section II, a general sample size formulation for tests that are inversions
of confidence intervals is presented. In Section III, results for exponential survival
and estimation of the difference between hazard rates will be given. In Section IV,
the connection between the exponential results and the logrank test as described in
Peto and Peto (2) and Cox regression (3) are given. Section V is devoted to
a generalization of the results for the exponential distribution to the two-piece
exponential distribution with proportional hazards and to a numerical example.
In Section VI, the necessary sample size is developed for the O’Brien–Fleming
method of pure sequential monitoring. In that section, it is shown that the maxi-
mum sample size inflation over trials with no monitoring is typically of the order
of 7% or less. For group sequential plans, this inflation factor is even less. Section
VII is devoted to ways of obtaining the ‘‘information fraction time,’’ an essential
ingredient for sequential monitoring. The fraction of failures observed (interim
analysis to final planned analysis) is the recommended measure. If that measure
is indeed used and if the trial is run to a fixed number of failures, then the piece-
wise exponential assumption can be further relaxed to proportional hazards. Sec-
tion VIII deals with application of the methods to more complex designs, includ-
ing multiple treatments and 2 ⫻ 2 factorial designs. Section IX is concerned with
competing losses. Finally, Section X gives the practical conclusions, including
major cautions about the use of sequential monitoring.

II. A BASIC SAMPLE SIZE FORMULATION

The following formulation is peculiar looking but very useful (see Shuster (4)
for further details). Suppose the following Eqs. (a) and (b) hold under the null
(H0) and alternative hypothesis (H1), respectively:

H0: ∆ = ∆0 vs. H1: ∆ = ∆1 > ∆0

P[(∆̂ − ∆0)/SE > Zα] ≈ α    (a)

P[W + (∆̂ − ∆1)/SE > −Zβ] ≈ 1 − β    (b)

where W = (Zα + Zβ)(S − SE)/SE. SE is a standard error estimate, calculated
from the data only, valid under both the null and alternate hypothesis, S is a
function of the parameters (including sample size) under the alternate hypothesis,
and S satisfies the implicit sample size equation:

S = (∆1 − ∆0)/(Zα + Zβ)    (1)

Then it follows under Ha,

P[(∆̂ − ∆0)/SE > Zα] ≈ 1 − β.    (2)

That is, the approximate α level test has approximate power 1 − β.
Note that Zα and Zβ are usually (but need not always be) the 100α and
100β upper percentiles of the standard normal cumulative distribution function
(CDF).
Suppose you have a ‘‘confidence interval inversion’’ test, with normal dis-
tributions after standardizing by the standard error statistic. To validate the im-
plicit sample size formula, you need only show that under Ha, S/SE converges
to one in probability. For two-sided tests, set up so that ∆1 > ∆0, α is replaced
by α/2.

Binomial Example: (one-sided test)

For binomial trials with success rates P1 and P2 and equal sample size N per
group,

S = sqrt[{P1(1 − P1) + P2(1 − P2)}/N]

∆1 = (P1 − P2)

∆0 = 0

and hence from Eq. (1), each treatment will have sample size

N = [P1(1 − P1) + P2(1 − P2)]{(Zα + Zβ)/(P1 − P2)}²

This formula is useful for determining the sample size based on the Kaplan–
Meier statistic (5). For the two-sided test, we replace α by α/2 in the above
expression.
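A small helper makes the formula concrete; the sketch below (Python, with the function name and default arguments as assumptions) returns N per group, replacing α by α/2 when a two-sided test is requested.

from math import ceil
from scipy.stats import norm

def binomial_n_per_group(p1, p2, alpha=0.05, beta=0.20, sides=1):
    # N = [P1(1-P1) + P2(1-P2)] * {(Z_alpha + Z_beta)/(P1 - P2)}^2
    z_a = norm.ppf(1 - alpha / sides)
    z_b = norm.ppf(1 - beta)
    n = (p1 * (1 - p1) + p2 * (1 - p2)) * ((z_a + z_b) / (p1 - p2)) ** 2
    return ceil(n)

# e.g. success rates of 50% vs. 65%, one-sided alpha = 0.05, power = 0.80
print(binomial_n_per_group(0.50, 0.65))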

III. EXPONENTIAL SURVIVAL

If the underlying distribution is exponential, it can be shown that the stochastic
process that plots the number of deaths observed (Y-axis) versus total accumu-
lated time on test (X-axis) is a homogeneous Poisson process with hazard rate
equal to the constant hazard of the exponential distribution.
For treatments i = 1, 2, let λi = hazard rate for treatment i, Fi = total
number of failures, and Ti = total accumulated time on test.

λ̂i = Fi/Ti ∼ Asy N[λi, λi²/E(Fi)]

Let ∆̂ = λ̂1 − λ̂2 and ∆ = λ1 − λ2.

In the notation of the previous section, one has

SE = [(λ̂1²/F1) + (λ̂2²/F2)]^0.5
and

S = [λ1²/E(F1) + λ2²/E(F2)]^0.5    (3)

If patients are accrued uniformly (Poisson arrival) over calendar time (0, X)
and followed until death or to the closure (time X + Y) (Y being the minimum
follow-up time), then the probability of death for a random patient assigned to
treatment i is easily obtained as

Pi = 1 − Qi    (4)

where

Qi = exp(−λiY)[1 − exp(−λiX)]/(λiX)

The expected number of failures on treatment i is

E(Fi) = 0.5ΨXPi    (5)

where Ψ is the accrual rate for the study (half assigned to each treatment). If the
allocations are unequal with γi assigned to treatment i (γ1 + γ2 = 1), simply
replace the 0.5 in Eq. (5) by γi. This is useful in planning studies of prognostic
factors or clinical trials where the experimental treatment is very costly compared
with the control.
After substituting Eq. (3) in Eq. (1), the resulting equation must be solved
iteratively (bisection is the method of choice) to identify the accrual period, X,
required for given planning values of the accrual rate, Ψ, minimum follow-up
period Y, and values λ1 and λ2 (and hence of ∆) under the alternate hypothesis.
Similar methods for exponential survival have been published by Bernstein
and Lagakos (6), George and Desu (7), Lachin (8), Morgan (9), Rubinstein et
al. (10), and Schoenfeld (11). These methods use various transformations and
thus yield slightly different though locally equivalent results. Schoenfeld allowed
the incorporation of covariates, whereas Bernstein and Lagakos allowed one to
incorporate stratification.

IV. APPLICATIONS TO THE LOGRANK TEST AND COX


REGRESSION

Two important observations extend the utility of the above sample size formula-
tion to many settings under ‘‘proportional hazards.’’
1. Peto and Peto (2) demonstrated the full asymptotic local efficiency of
the logrank test when the underlying survival distributions are expo-
nential. This implies that the power and sample size formulas of Sec-
tion III apply directly to the logrank test (as well as the likelihood-
based test, used above.) See Appendix I for further discussion of the
efficiency of the logrank test. For two treatments, the test for no treat-
ment effect in Cox regression (with no other covariates) is equivalent
to the logrank test.
2. Two distributions have proportional hazards if and only if there exists
a continuous strictly monotonic increasing transformation of the time
scale that simultaneously converts both to exponential distributions.
This means that an investigator can plan any trial that assumes proportional haz-
ards, as long as a planning transformation that converts the outcomes to exponen-
tial distributions is prespecified. This can be approximated well if historical con-
trol data are available. The only problems are to redefine the λ i and to evaluate
E(F i ), taking the transformation into account, since accrual may not be uniform
in the transformed time scale.
Three articles offer software that can do the calculations: Halperin and
Brown (12), Cantor (13), and Henderson et al. (14). These programs (or papers
as in the original) can also investigate the robustness of the sample size to devia-
tions from distributional assumptions. Another approach, that is quite robust to
failure in correctly identifying the transformation, presumes a two-piece exponen-
tial model. This is discussed in the next section along with a numerical example.

V. PLANNING A STUDY WITH PIECE-WISE EXPONENTIAL


SURVIVAL
A. Parameters to Identify
1. Input Parameters
Ψ = Annual planning accrual rate (total) (e.g., 210 patients per year are
planned, 50% per arm)
Y = Minimum follow-up time (e.g., Y = 3 years)
R1 = Planning Y-year survival under the control treatment (e.g., R1 = 0.60)
R2 = Planning Y-year survival under the experimental treatment (e.g., R2 = 0.70)
Sidedness of test (one or two) (e.g., one)
α = type 1 error (e.g., 0.05)
π = 1 − β, the power of the test (e.g., 0.80)
ρ = Post-time Y : pre-time Y hazard ratio (e.g., 0.5)

ρ represents the piecewise exponential assumption: before the minimum follow-up Y, if the
hazard is λi on treatment i, it is ρλi on treatment i after Y. Note that λi =
−ln(Ri)/Y.
2. Output Parameters
X = Accrual period (e.g., 2.33 years)
X + Y = Total study duration (e.g., 5.33 years)
ΨX = Total accrual required (e.g., 490 patients)
E(F1) + E(F2) = Total expected failures (e.g., 196)

Based on note 2, Section IV, the calculation of E(Fi) for Eqs. (3) and (5) is straight-
forward. For a randomly selected patient, the time at risk is uniformly distributed
over the period from Y to (X + Y), and the probability of death, Di(x), conditional
on a time at risk Y + x, is the probability of death by time Y, (1 − Ri), plus the
probability of surviving to time Y but dying between times Y and x + Y, that is,

Di(x) = (1 − Ri) + Ri[1 − exp(−ρλix)]

The unconditional probability of death for a randomly selected patient on treat-
ment i is found by taking the expectation of Di(x) over the uniform distribution
for x from 0 to X.
For this piecewise exponential model and Poisson accrual process, the
value of Qi to be used in Eqs. (4) and (5) is

Qi = Ri[1 − exp(−ρλiX)]/(ρλiX)    (6)

Note that λi is defined via the equation

Ri = exp(−λiY)    (7)

and hence for exponential data (ρ = 1) the value of Qi agrees with that below
Eq. (4).
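The worked example can be reproduced with a short root-finding routine. The sketch below (Python; the bisection replaces the coarse/fine search of the SAS macro in Appendix II, and the function name is ours) implements Eqs. (1), (3), (5), (6), and (7) and, with the planning values listed above, returns an accrual period of about 2.33 years, a total accrual of 490, and roughly 196 expected failures.

import math
from scipy.stats import norm

def accrual_period(psi, y, r1, r2, alpha, power, rho, sides=1, alloc=0.5):
    # Solve S(X) = (lam1 - lam2)/(Z_alpha + Z_beta) for the accrual period X,
    # with S(X) from Eq. (3) and E(F_i) = alloc_i * psi * X * P_i.
    lam1, lam2 = -math.log(r1) / y, -math.log(r2) / y
    target = abs(lam1 - lam2) / (norm.ppf(1 - alpha / sides) + norm.ppf(power))

    def s_and_failures(x):
        q1 = r1 * (1 - math.exp(-rho * lam1 * x)) / (rho * lam1 * x)   # Eq. (6)
        q2 = r2 * (1 - math.exp(-rho * lam2 * x)) / (rho * lam2 * x)
        ef1 = alloc * psi * x * (1 - q1)                               # Eq. (5)
        ef2 = (1 - alloc) * psi * x * (1 - q2)
        return math.sqrt(lam1**2 / ef1 + lam2**2 / ef2), ef1 + ef2     # Eq. (3)

    lo, hi = 1e-6, 100.0                 # S(X) decreases in X, so bisect
    while hi - lo > 1e-6:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if s_and_failures(mid)[0] > target else (lo, mid)
    return hi, math.ceil(psi * hi), s_and_failures(hi)[1]

# planning values of the example: 210/year accrual, Y = 3, 60% vs. 70% at 3 years,
# one-sided alpha = 0.05, power = 0.80, post:pre hazard ratio rho = 0.5
print(accrual_period(210, 3, 0.60, 0.70, 0.05, 0.80, 0.5))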

B. Key Observations
1. The expected failures on treatment i in the first Y years of patient risk
is 0.5ΨX(1 − Ri), where Ψ is the annual accrual rate, X is the accrual period,
and Ri is the planning Y-year survival on treatment i.
2. If one transformed the subset of the time scale (0, Y) onto (0, Y), with
a strictly monotonic increasing function, the expected number of failures that
occurred before patient time Y would be unchanged. This in turn implies that the
exponential assumption over the interval (0, Y) can be relaxed to require propor-
tional hazards only. (From Eq. (7), the definition of λi = −ln(Ri)/Y in Eq. (3)
depends only on the fixed Y-year survival on treatment i, Ri, and is not affected
by the transformation.)
3. All other things being equal, Eq. (3) implies that the larger the expected
number of failures, the greater the power of the test. Since the larger the value
of ρ, the greater the hazard post-Y-years, power increases as ρ increases.
4. If hazard rates are low after time Y, the exponentiality assumption post-
Y is not important (although proportional hazards may be). It is convenient to
think of ρ as an average hazard ratio (post:pre-year Y), since it enters the sample
size equation only through the expected number of failures. In fact, if one used
the above sample size methodology to approximate the accrual duration and mini-
mum follow-up but actually ran the trial until the total failures equaled the total
expected in the study, E(F1) + E(F2), then the power holds up under proportional
hazards, without regard to the piece-wise exponential assumption. To further il-
lustrate this point, in the above example, the planning hazard ratio (HR) for sur-
vival rates of 60% versus 70% at 3 years is 0.698. Based on Shuster (15), if each
failure is considered to approximate an independent binomial event, which under
the null hypothesis has a 50% probability of falling into each treatment and under
the alternative a HR/(1 + HR) = 0.698/1.698 = 0.411 probability of being
in the experimental arm versus 0.589 probability of being in the control arm,
then from the results of Section II, the study should accrue until 189 failures
have occurred (nearly an identical number to the 196 expected failures derived
from the piece-wise exponential model).
5. The variance (the square of the right-hand side of formula (3)) de-
creases more rapidly than (1/X), the reciprocal of the accrual duration, because
although the number of patients increases linearly, the longer time at risk increases
the probability of death for each entrant. However, for small increments in X,
the rate of change is approximately proportional to the derivative of (1/X), namely
−(1/X²). This will be used to apply an inflation factor for sequential monitoring
(Section VI).
6. The effect of ρ in the above example, where accrual is 2.33 years and
minimum follow-up is 3 years, is relatively modest. Under exponentiality (ρ = 1), 448
patients would be required; in the actual case, ρ = 0.5, 490 patients; whereas
if ρ = 0.0, 561 patients would be needed. The effect of ρ is much more striking
if accrual is slow. Under the same parametrization, but with an accrual of 60
patients per year instead of 210, the patient requirements are 353, 407, and 561
for ρ = 1, 0.5, and 0.0, respectively.
7. Software for these calculations is available in Shuster (5), on a Win-
dows platform. A Macintosh version exists but must invoke 16-bit architecture.
Appendix II contains a SAS macro that also can be used for the calculations.
The equal patient allocation of 50% to each arm is nearly optimal in terms of
minimizing the smallest sample size, but the macro allows for both equal and
unequal allocation.

VI. SEQUENTIAL MONITORING BY THE


O’BRIEN–FLEMING METHOD

In this section, from elementary asymptotic considerations of the Poisson pro-


cesses obtained in exponential survival, one can use asymptotic Brownian motion
to connect full sequential monitoring power to no monitoring power.
First, define a ‘‘time parameter’’ θ, 0 < θ < 1, with θ = 1 being the
maximum allowable time of completion, such that

∆̂θ is asymptotically N(∆, S²/θ)

θ represents the ratio of variances of the estimate of effect size (final to interim
analysis) and S² is the variance of the estimate of effect size at the planned final
analysis (θ = 1), calculated per Eq. (3). From asymptotic considerations of the
Poisson process (with the notation the same as the section on the exponential,
except that the index θ was added to delineate the time scale), θ∆̂θ is asymptoti-
cally a Brownian motion process with drift ∆ and diffusion constant S².
From first passage time considerations (see Cox and Miller (16) for exam-
ple), on the basis of the rejection region for testing ∆ = 0 versus ∆ > 0, the
power function for the O’Brien–Fleming test that rejects the null hypothesis if
at any time θ

θ∆̂θ > SZα/2

is

Π(∆, S, α) = Φ[(∆/S) − Zα/2] + exp[(2∆Zα/2)/S]Φ[−(∆/S) − Zα/2]

where Φ = standard normal CDF and Zp is the upper 100p percentile of Φ.
Note that Π(0, S, α) = α. For no monitoring, the power function is defined
by Πno(∆, S, α) = Φ[(∆/S) − Zα]
For example, if investigator 1 ran a study sensitive to an effect size ∆/S
= 2.486 and required no monitoring, and investigator 2 ran a slightly larger study
sensitive to an effect size of ∆/S = 2.576 but used continuous O’Brien–Fleming
bounds for sequential monitoring, the two studies would both have 80% power
Table 1  Sensitivity of No Monitoring (None) vs. Continuous Monitoring by O’Brien–Fleming (OF)

                 None     O-F
α       Power    ∆/S      ∆/S

0.010   0.80     3.168    3.238
0.025   0.80     2.801    2.881
0.050   0.80     2.486    2.576
0.010   0.90     3.608    3.688
0.025   0.90     3.242    3.332
0.050   0.90     2.926    3.026

∆/S represents the detectable effect size.
at α = 0.05. There is almost no penalty for this type of monitoring, even if indeed
the study runs to completion.
Since, as remarked in Section V, note 5, increasing the accrual duration
slightly means that the variance changes proportionally to the change in the recipro-
cal of the accrual duration, an approximate measure of the increase in accrual
mandated by continuous O’Brien–Fleming monitoring is the square of the
ratio of the entries in Table 1. The inflation factors would be 4% (α = 0.01),
6% (α = 0.025), and 7% (α = 0.05).
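The entries of Table 1 and the resulting inflation factors can be recomputed by solving Π(∆, S, α) = power for the detectable effect size ∆/S, as in the sketch below (Python; the function name is ours).

import math
from scipy.optimize import brentq
from scipy.stats import norm

def detectable_effect(alpha, power, monitoring):
    # No monitoring: Delta/S = Z_alpha + Z_beta.  Continuous O'Brien-Fleming
    # monitoring: solve Pi(Delta, S, alpha) = power with the power function above.
    if not monitoring:
        return norm.ppf(1 - alpha) + norm.ppf(power)
    z = norm.ppf(1 - alpha / 2)

    def deficit(x):
        return norm.cdf(x - z) + math.exp(2 * x * z) * norm.cdf(-x - z) - power

    return brentq(deficit, 0.1, 10.0)

for alpha in (0.01, 0.025, 0.05):
    none = detectable_effect(alpha, 0.80, monitoring=False)
    of = detectable_effect(alpha, 0.80, monitoring=True)
    # squared ratio approximates the accrual inflation for continuous monitoring
    print(alpha, round(none, 3), round(of, 3), round((of / none) ** 2, 3))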
Group sequential monitoring by the O’Brien–Fleming method would re-
quire a smaller inflation factor for the maximum sample size, since it will fall
between no monitoring and continuous monitoring. The software package
‘‘East’’ (17) does not handle nonexponential data but can be used to derive an
approximate inflation factor for true group sequential designs. But experience
has shown it to be not much smaller than those derived for continuous monitoring.
In the example above, where 490 patients were needed for a trial conducted
without sequential monitoring, an additional 7% would bring the necessary ac-
crual to 524 (34 more entrants), if the trial was to be sequentially monitored by
the O’Brien–Fleming method. This represents an accrual duration of at most 2.49
years with monitoring versus 2.33 years without monitoring.

VII. EVALUATION OF THE INFORMATION FRACTION

It is of interest to note that for the piece-wise exponential model, one can derive
the information fraction as the ratio of the final to interim variance as in the strict
definition (see Eq. (3)). All one needs to compute is the expected number of
failures at an interim analysis. Others use the ratio of expected failures (interim
to final), whereas still others use the ratio of actual failures to total expected.
Although the use of variance ratios Eq. (3) appears to be quite different from the
ratio of expected failures when the hazard ratios are not close to one, the fact is
that for studies planned with survival differences of the order of 15% or less,
where the expected total failures is 50 or higher, the two are almost identical.
In the numerical example, above, where planning accrual is 210 per year,
planning difference is a 10% improvement in 3-year survival from 60% (control)
to 70% (experimental) and monitoring is handled by a continuous O’Brien–Flem-
ing approach, it was concluded that 2.49 years of accrual plus 3 years of minimal
follow-up (total maximum duration of 5.49 years) were needed. The information
fractions, θ, for calendar times 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, and 5.49 years
are, respectively, for Eq. (3) and for expected failures ratios in []: 1.0 years 0.0685
[0.0684]; 1.5 years 0.1504 [0.1502]; 2.0 years 0.2611 [0.2607]; 2.5 years 0.3985
[0.3979]; 3.0 years 0.5424 [0.5418]; 3.5 years 0.6704 [0.6698]; 4.0 years 0.7785
[0.7780]; 5.49 years 100% [100%]. These numbers are impressively close, de-
spite the fact that under the alternate hypothesis, the planning hazard ratio (experi-
mental to control) is 0.698, hardly a local alternative to the null value of 1.00.
As noted above, the most robust concept is to use actual failures and termi-
nate at a maximum of the expected number, which can be shown to be 211 for
this continuous O’Brien–Fleming design.

VIII. MULTIPLE TREATMENTS AND STRATIFICATION

The methods can be applied to pair-wise comparisons in multiple treatment trials.


If one wishes to correct for multiple comparisons (a controversial issue), then
one should use a corrected level of α to deal with this but keep the original
power. Note that the accrual rate would be computed for the pair of treatments
being compared. For example, if the study is a three-armed comparison and ac-
crual is estimated to be 300 patients per year, 200 per year would be accrued to
each pair-wise comparison. The definition of power applies to each pair-wise
comparison. In most applications, it is my opinion that the planning α level should
not be altered for multiple comparisons. This is because the inference about A
versus B should be the same, for the same data, whether or not there was a third
arm C. In other words, had a hypothetical trial of only A versus B been run and
accrued the same data, one could reach a different conclusion if the actual analysis
had been corrected for the number of pair-wise comparisons in the original trial.
In general, the use of stratification on a limited scale can increase the preci-
sion of the comparison. However, the improvement over a completely random-
ized design is difficult to quantify. Hence, the nonstratified plan represents a
conservative estimate of patient needs when the design and analysis are in fact
stratified.
If a 2 ⫻ 2 factorial study is conducted, this could be analyzed as a stratified
logrank test, stratifying for the concomitant treatment. Interventions should be
carefully selected to minimize the potential for qualitative interaction, a situation
where the superior treatment depends on which of the concomitant treatments is
used. If the assumption of no major interaction in the hazard ratios is reasonable,
a study planned as if it was a two-treatment nonstratified study will generally
yield a sample size estimate slightly larger (but unquantifiably so) than needed
in the stratified analysis, presuming proportional hazards within the strata hold.
If a qualitative interaction is indeed anticipated, then the study should be designed
as a four-treatment trial for the purposes of patient requirements.

IX. COMPETING LOSSES

In general, competing losses, patients censored for reasons other than being alive
at the time of the analysis, are problematic unless they are treatment uninforma-
tive (the reason for the loss is presumed to be independent of the treatment assign-
ment and, at least conceptually, unrelated to the patient’s prognosis). Although
it is possible to build in these competing losses in a quantified way for sample
size purposes [see (10)], a conservative approach, preferred by this author, is to
use a second ‘‘inflation factor’’ for competing losses. For example, if L = 0.10
(10%) of patients are expected to be lost for competing reasons, the sample size would be
inflated by dividing the initial sample size calculation by (1 − L) = 0.9 to obtain
a final sample size.

X. PRACTICAL CONCLUSIONS

The methods proposed herein allow a planner to think in terms of fixed-term
survival rather than hazard rates or hazard ratios. The use of a fixed number of
failures in implementation also allows for simple estimates of the information
fraction, as the failures to date divided by the total failures that would occur if
the study is run to its completion.
Irrespective of the proportional hazards assumption, the actual test (logrank
or Cox) is valid for testing the null hypothesis that the survival curves are identi-
cal. However, the sample size calculation and power are sensitive to violations
of the proportional hazards assumption and especially to ‘‘crossing hazards.’’ If
a superior treatment is associated with a high propensity for early deaths, an early
significance favoring the wrong treatment may emerge, causing the study to be
stopped early and reaching the incorrect conclusion. For example, if at an early
interim analysis, one treatment (e.g., bone marrow transplant) is more toxic and
appears to have inferior outcome when compared with the other (chemotherapy
alone), it might be very tempting but unwise to close the study for lack of efficacy.
Although sequential analysis is often an ethical necessity, it is also an ethical
dilemma, since statisticians are really being asked to forecast the future, where
they have little in the way of reliable information to work with.
It is recommended for studies where proportional hazards is considered to
be a reasonable assumption that the study be planned as if the piece-wise expo-
nential assumptions hold. If the plan is to monitor the study by the O’Brien–
Fleming method, whether it is group sequential or pure sequential, apply the
small inflation factor by taking the square of the ratio of the entry in the ∆/S
columns per Table 1. Next, apply Eq. (3) using expected values in Eq. (4), (5),
and (6) for the piece-wise model to obtain the expected number of failures to be
seen in the trial. Conduct the trial as planned until this many failures occur or
until the study is halted for significance by crossing an O’Brien–Fleming bound,
where the information fraction is calculated as actual failures/maximum number
of failures that would occur at the planned final analysis.
One possible recommendation, for survival trials where there is concern
about the model assumptions, is to conduct accrual in stages. For example, it
might be prudent to accrue patients onto a trial (A) for a period of 1 year and,
irrespective of any interim results, close the trial temporarily and begin accrual
on a new trial (B) for a period of 1 year. A decision as to whether to continue
accrual to trial (A) for another year or begin yet another trial (C) for a year could
be made with more mature data than otherwise possible. This leap-frog approach
would continue in 1-year increments. This process might slow down completion
of trial A but also might speed up completion of trials B and C. In addition, for
safety purposes, there would be a much smaller disparity between calendar time
and information time. In the numerical example initiated in Section V, for the
O’Brien–Fleming design, the information fraction at 1 year (about 40% of ac-
crual) would be only 7%, whereas the information fraction at 2 years (about 80%
of accrual completed) would be 26% (19% coming from the first year of accrual
plus only 7% coming from the second year of accrual). In fact, if one had a 1-
year ‘‘time out’’ for accrual but ran the second stage for 1.2 years, analyzing the
data 3.0 years after the end of accrual, the study would have the same power as
the 2.49 year accrual study (both with continuous O’Brien–Fleming bounds).
Completion would occur after 6.2 years instead of 5.49 years, presuming the
study ran to completion. Offsetting this difference, trial B would be completed
earlier by the leap-frog approach by about 0.8 years. The slowdown, generated
by the leap-frog approach, would also allow the analyst to have a better handle
on model diagnostics, while fewer patients were at risk.

ACKNOWLEDGMENT

Supported in part by grant 29139 from the National Cancer Institute.

APPENDIX I: LOGRANK VS. EXPONENTIAL ESTIMATION—


LOCAL OPTIMALITY

For treatment i and day j, let N ij and F ij represent the number at risk at the start
of the day and number of failures on that day, respectively. For the logrank test,
the contribution to the observed minus expected for day j (to the numerator of the
statistic, where the denominator is simply the standard error of the numerator) is
F_1j − N_1j(F_1j + F_2j)/(N_1j + N_2j) = {(F_1j/N_1j) − (F_2j/N_2j)}/(1/N_1j + 1/N_2j)
The day j estimate of each hazard, F_ij/N_ij, is weighted inversely proportional
to (1/N_1j + 1/N_2j), that is, proportional to N_1j N_2j/(N_1j + N_2j). The weight is zero if
either group has no patient at risk starting day j.
For the exponential test, the contributions to the estimates of the hazards
for day j are
(N_ij/N_i.)(F_ij/N_ij)
where N_i. is the total days on test for treatment i.
Under the null hypothesis and exponential survival, both tests weight the information
inversely proportional to the variance, and as such, both are locally optimal.
Using the approximation that the N ij are fixed rather than random, a relative
efficiency measure can easily be obtained for nonlocal alternatives, and in prac-
tice, the relative efficiency for studies planned as above with 30 or more failures
will generally be only trivially less than 1.00 (typically 0.99 or higher). Note that
there are two things helping the logrank test. First, if the ratio of N 1j /N 2j remains
fairly constant over time j ⫽ 1, 2, and so on, until relatively few are at risk, the
weights will be very similar except when both become very small. Second, stra-
tified estimation is robust against modest departures from the ‘‘optimal alloca-
tion,’’ which is represented by the exponential weights.
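To make the weighting above concrete, the following small sketch (written in Python for illustration only; the numbers and function names are hypothetical, not from the chapter) computes the day j observed-minus-expected contribution and verifies that it equals the difference in the daily hazard estimates weighted by N_1j N_2j/(N_1j + N_2j).

def logrank_day_contribution(f1, n1, f2, n2):
    # Observed minus expected failures in group 1 for a single day.
    if n1 == 0 or n2 == 0:
        return 0.0  # weight is zero if either group has no patient at risk
    return f1 - n1 * (f1 + f2) / (n1 + n2)

def weighted_hazard_difference(f1, n1, f2, n2):
    # The same quantity written as a weighted difference of daily hazard estimates.
    if n1 == 0 or n2 == 0:
        return 0.0
    weight = n1 * n2 / (n1 + n2)  # inverse of (1/n1 + 1/n2)
    return weight * (f1 / n1 - f2 / n2)

# Hypothetical day: 3 of 80 at risk fail in group 1, 1 of 75 in group 2.
print(logrank_day_contribution(3, 80, 1, 75))
print(weighted_hazard_difference(3, 80, 1, 75))  # identical value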
Comment: The mathematical treatment of this chapter is somewhat uncon-
ventional in that it uses the difference in hazards for the exponential rather than
the hazard ratio or log hazard ratio. This enables a user to directly investigate
the relative efficiency issue of the logrank to exponential test. On the negative
side, by not using a transformation, the connection to expected failures is not as
direct as it would be under the transform to a hazard ratio.
Comment: SAS (Statistical Analysis Systems) macros for the logrank and
stratified logrank tests are included in Appendix III.

APPENDIX II: SAS MACRO FOR SAMPLE SIZE AND ACCRUAL DURATION

Note that this macro can handle unequal allocation of patients, something useful
for planning studies for the prognostic significance of yes/no covariates.

Usage

%ssize(ddsetx,alloc,psi,y,r1,r2,side,alpha,pi,rho,lfu);
%ssize(a,alloc,psi,y,r1,r2,side,alpha,pi,rho,lfu);
ddsetx=user-supplied name for data set containing the planning parameters. See Section V.
alloc=Fraction allocated to control group (use .5 for 1-1 randomized trials)
psi=Annual accrual rate
y=Minimum follow-up
r1=Control group planned Y-year survival
r2=Experimental group planned Y-year survival
side=1 (one-sided) or 2 (two-sided)
alpha=Size of type I error
pi=Power
rho=Average post:pre Y-year hazard ratio
lfu=Fraction expected to be lost to follow-up.

%MACRO ssize(ddsetx,alloc,psi,y,r1,r2,side,alpha,pi,rho,lfu);
options ps=60 ls=75;
OPTIONS NOSOURCE NONOTES;
DATA DDSETX;set &ddsetx;
alloc=&alloc;psi=&psi;y=&y;r1=&r1;r2=&r2;side=&side;alpha=&alpha;
pi=&pi;rho=&rho;lfu=&lfu;
* hazard rates implied by the planned y-year survivals;
lam1=-log(r1)/y;
lam2=-log(r2)/y;
del=abs(lam1-lam2);
za=-probit(alpha/side);
zb=probit(pi);
s=del/(za+zb);
* search for the accrual duration x in 1-year steps, then refine in 0.01-year steps;
x=0;inc=1;
aa:x=x+inc;
q1=r1*(1-exp(-rho*lam1*x))/(rho*lam1*x);
q2=r2*(1-exp(-rho*lam2*x))/(rho*lam2*x);
p1=1-q1;
p2=1-q2;
v1=((lam1**2)/(alloc*psi*x*p1));
v2=((lam2**2)/((1-alloc)*psi*x*p2));
s2=sqrt(v1+v2);
if s2>s and inc>.9 then goto aa;
if inc>.9 then do;inc=.01;x=x-1;goto aa;end;
if s2>s then goto aa;
n=int(.999+psi*x);
na=n/(1-lfu);na=int(na+.999);
ex_fail=psi*x*(alloc*p1+(1-alloc)*p2);
data ddsetx;set ddsetx;
label alloc='Allocated to Control'
rho='Post to Pre y-Year Hazard'
psi='Accrual Rate'
y='Minimum Follow-up (MF)'
r1='Planned Control Survival at MF'
r2='Planned Experimental Survival at MF'
side='Sidedness'
alpha='P-Value'
pi='Power'
lfu='Loss to Follow-up Rate'
x='Accrual Duration'
n='Sample size, no losses'
na='Sample size with losses'
ex_fail='Expected Failures';
proc print label;
var alloc psi y r1 r2 side alpha pi rho lfu x n na ex_fail;
run;
%mend;
To produce the example in Section V, this program was run.
data ddsetx;
input alloc psi y r1 r2 side alpha pi rho lfu;
cards;
.5 210 3 .6 .7 1 .05 .8 1 .1
.5 210 3 .6 .7 1 .05 .8 .5 .1
.5 210 3 .6 .7 1 .05 .8 .000001 .1
.5 60 3 .6 .7 1 .05 .8 1 .1
.5 60 3 .6 .7 1 .05 .8 .5 .1
.5 60 3 .6 .7 1 .05 .8 .000001 .1
;
%ssize(ddsetx, alloc, psi, y, r1, r2, side, alpha, pi, rho, lfu);

APPENDIX III: SAS MACROS FOR LOGRANK AND STRATIFIED LOGRANK TESTS

Sample usage:
%logrank(ddset, time, event, class)
%logrank(a, x1, x2, treat_no)
for the unstratified logrank test
%logstr(ddset, time, event, class, stratum)
%logstr(a, x1, x2, treat_no, gender)
for the stratified logrank test
time, event, class, and stratum are user-supplied names in the user-supplied
data set called ddset.
ddset = name of data set needing analysis.
time = time variable in data set ddset (time < 0 not allowed)
event = survival variable in data set ddset; event = 0 (alive) or event = 1 (died)
class = categorical variable in ddset identifying treatment group.
stratum = categorical variable in ddset defining stratum for the stratified logrank test.

REFERENCES

1. O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics
1979; 35:549–556.
2. Peto R, Peto J. Asymptotically efficient rank invariant test procedures. J R Stat Soc
1972; A135:185–206.

3. Cox DR. Regression models and life tables [with discussion]. J R Stat Soc 1972;
B34:187–220.
4. Shuster JJ. Practical Handbook of Sample Size Guidelines for Clinical Trials. Boca
Raton, FL: CRC Press, 1992.
5. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am
Stat Assoc 1958; 53:457–481.
6. Bernstein D, Lagakos SW. Sample size and power determination for stratified clini-
cal trials. J Stat Comput Simul 1987; 8:65–73.
7. George SL, Desu MM. Planning the size and duration of a clinical trial studying
the time to some critical event. J Chronic Dis 1974; 27:15–29.
8. Lachin JM. Introduction to sample size determination and power analysis for clinical
trials. Controlled Clin Trials 1981; 2:93–113.
9. Morgan TM. Planning the duration of accrual and follow-up for clinical trials. J
Chronic Dis 1985; 38:1009–1018.
10. Rubinstein LJ, Gail MH, Santner TJ. Planning the duration of a comparative clinical
trial with loss to follow-up and a period of continued observation. J Chronic Dis
1981; 34:469–479.
11. Schoenfeld DA. Sample size formula for the proportional-hazards regression model.
Biometrics 1983; 39:499–503.
12. Halperin J, Brown BW. Designing clinical trials with arbitrary specification of sur-
vival functions and for the log rank test or generalized Wilcoxon test. Controlled
Clin Trials 1987; 8:177–189.
13. Cantor AB. Power estimation for rank tests using censored data: conditional and
unconditional. Controlled Clin Trials 1991; 12:462–473.
14. Henderson WG, Fisher SG, Weber L, Hammermeister KE, Sethi G. Conditional
power for arbitrary survival curves to decide whether to extend a clinical trial. Con-
trolled Clin Trials 1991; 12:304–313.
15. Shuster JJ. Fixing the number of events in large comparative trials with low event
rates: a binomial approach. Controlled Clin Trials 1993; 14:198–208.
16. Cox DR, Miller HD. The Theory of Stochastic Processes. London: Methuen, 1965.
17. Mehta C. EaST: Early Stopping in Clinical Trials. Cambridge, MA: CyTEL Software
Corporation.
8
Multiple Treatment Trials

Stephen L. George
Duke University Medical Center, Durham, North Carolina

I. INTRODUCTION

Randomized clinical trials involving more than two treatments present problems
and challenges in their design and analysis that are absent in trials involving only
two treatments. These problems are part of the larger issues raised by multiplici-
ties in clinical trials. Multiplicity (1–3) refers to issues concerning multiple out-
come variables, subgroups, sequential analysis, covariate adjustments, and multi-
ple treatments. This chapter deals with multiple treatments, although some issues
raised here are relevant to other multiplicity topics.
When there are only two treatments in a trial, inference is relatively
straightforward since the only direct comparison is between the two treatments.
When there are more than two treatments inference is more complex, and this
complexity affects the design, conduct, and analysis of the trial (4). For example,
with three treatments, in addition to the global comparison of all three treatments
simultaneously, there are three possible paired comparisons among the three treat-
ments and an unlimited number of other comparisons arising from various pooled
weightings of the treatment groups. The number of paired and subset comparisons
increases rapidly as the number of treatments increases. In this setting, it is vitally
important to specify clearly the objectives of the trial and to be certain that these
objectives are reflected in an appropriate design and analysis.


In the following sections, several common settings in which more than


two treatments are involved are presented. These include multiple independent
treatments, multiple experimental treatments compared with a control, factorial
designs, and selection designs. In each setting, design strategies are presented
along with their implications for sample size and for the conduct and analysis
of the trial.

II. MULTIPLICITY AND ERROR RATES


A. Types of Errors
The primary difficulty raised by multiplicity, including multiple treatments, is
that error rates may be elevated beyond their putative or nominal level. A simple
and well-known example of this phenomenon arises in multiple significance test-
ing. If we conduct N independent tests of significance, each at the nominal α
level, the overall probability of finding at least one erroneous ‘‘significant’’ result
when there are truly no differences is 1 ⫺ (1 ⫺ α) N. For α ⫽ 0.05, this error
probability becomes quite large even for moderate N. For N ⫽ 5, it is 0.23. For
N ⫽ 10, it is 0.40. Multiple testing with no adjustment of individual error rates
dramatically increases the probability of finding spurious, but significant, differ-
ences.
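The arithmetic is easy to reproduce; the short sketch below (Python, purely illustrative) computes the experiment-wide error rate 1 − (1 − α)^N for several numbers of independent tests.

alpha = 0.05
for n_tests in (1, 5, 10):
    experimentwise = 1 - (1 - alpha) ** n_tests
    print(n_tests, round(experimentwise, 2))  # 0.05, 0.23, 0.40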
In the classic statistical hypothesis testing framework, two types of errors
can occur in testing a specified null hypothesis against an alternative non-null
hypothesis. Erroneously rejecting a true null hypothesis is a type I error. Erron-
eously failing to reject a false null hypothesis is a type II error. The error rates
for these two types of errors are usually denoted by α and β, respectively. The
complement of the type II error rate (1 ⫺ β) is referred to as the statistical power
of the test procedure. These concepts apply equally to the overall experiment
(‘‘experimental’’ error rates) or to the individual comparisons within the experi-
ment. However, as the calculation in the previous paragraph illustrates, if we test
all possible pairs of treatments and wish to set the experimental type I error rate
to α, the individual type I error rates must each be less than α. But if we reduce
the individual type I error rates without increasing the sample size, the individual
type II error rates are increased. Thus, proper control of the experimental type I
error rate can lead to highly conservative procedures with low power to detect
individual differences. The sample size can be increased to avoid this problem,
but this may require an unacceptably large sample size. The purpose of this chap-
ter is to describe techniques of statistical design and analysis that address these
issues.

B. Bonferroni and Related Procedures


One of the older procedures used to control the experimental error rate is the
Bonferroni procedure, based on a simple form of one of the Bonferroni inequalities:

P ≤ Σ_{i=1}^{N} p_i

where P is the probability of at least one event occurring out of N possible events
and p_i is the probability of the ith event (i = 1, . . . , N). If we want P ≤ α, one
approach is to set p_i = α/N. Hence, the simplest Bonferroni procedure, applied
to multiple significance tests, is to use a significance level α/N for each individual
test, where N is the number of tests and α is the overall type I error rate (5).
This procedure is guaranteed to yield an overall error rate no larger than α when
the global null hypothesis is true. However, it is a very conservative procedure,
particularly where N is large and the tests are correlated. Application to treatment
trials involving K treatments in which the tests are limited to all pair-wise compar-
isons yields a significance level of 2α/K(K ⫺ 1) for each individual test. For
example, if α ⫽ 0.05 and K ⫽ 5, the significance level would be 0.005 for each
of the 10 possible paired comparisons.
Modifications of the simple Bonferroni procedure have been proposed to
mitigate the overly conservative nature of the procedure. One of these (6) uses
a significance level of jα/N for the jth ordered test ( j = 1, . . . , N). That is,
ordering the p values p_(1) ≤ ⋅⋅⋅ ≤ p_(N), one rejects the global null hypothesis if
p_( j) ≤ jα/N for any j. This procedure is less conservative than the simple Bonferroni
procedure. It has been extended by Hochberg (7), and Holm (8) defined a proce-
dure that rejects the hypothesis H_(i) corresponding to p_(i) if and only if

p_( j) ≤ α/(N − j + 1)   for all j ≤ i

This procedure is identical to Bonferroni for i = 1 but not as conservative other-
wise. Hochberg and Benjamini (9) discuss these procedures and provide some
suggested improvements.
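A compact illustration of these rules is given below (Python; purely a sketch under the definitions above, with function names of my own choosing). It applies the simple Bonferroni rule, the ordered-p global rule of reference (6), and Holm's step-down rule to a hypothetical set of p values.

def bonferroni_reject(p_values, alpha=0.05):
    # Simple Bonferroni: test each hypothesis at level alpha / N.
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def ordered_global_reject(p_values, alpha=0.05):
    # Reject the global null if p_(j) <= j * alpha / N for any j (reference 6).
    n = len(p_values)
    return any(p <= (j + 1) * alpha / n for j, p in enumerate(sorted(p_values)))

def holm_reject(p_values, alpha=0.05):
    # Holm step-down: reject H_(i) iff p_(j) <= alpha / (N - j + 1) for all j <= i.
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])
    reject = [False] * n
    for rank, k in enumerate(order, start=1):  # rank plays the role of j
        if p_values[k] <= alpha / (n - rank + 1):
            reject[k] = True
        else:
            break  # once one step fails, no further hypotheses are rejected
    return reject

pvals = [0.001, 0.012, 0.020, 0.300]  # hypothetical p values
print(bonferroni_reject(pvals))
print(ordered_global_reject(pvals))
print(holm_reject(pvals))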
To avoid complexity and to focus attention on problems of multiplicity
arising from multiple treatments, in the remainder of this chapter it is assumed
that the K treatments have a single primary outcome measure, x i , which is nor-
mally distributed with an unknown mean µ i and common variance σ 2 . Thus, the
various scenarios considered here are all expressed in terms of hypotheses involv-
ing the µ i . Complications arising from unequal variances, non-normal distribu-
tions, stratification, drop-outs, censoring, and so on are avoided. Such additional

considerations are of considerable practical importance in actual trials, but their


introduction here would add complexities that would obscure the salient
points of emphasis.

III. MULTIPLE INDEPENDENT TREATMENTS


A. Global Null Hypothesis
Perhaps the most common setting in clinical trials with multiple treatments in-
volves the comparison of K independent, or nominal, treatments. That is, there
are K different treatments with no implied ordering or other special relationship
among the treatments. Typically, K is equal to three or four, but in some trials
might be larger.
In this setting, the primary hypothesis test is usually
H_0: µ_1 = µ_2 = ⋅⋅⋅ = µ_K
vs.
H_1: µ_i ≠ µ_j for some i, j with i ≠ j
That is, the global null hypothesis H 0 is that all of the µ i are identical. The alterna-
tive H 1 is that at least one of the µ i is not equal to the others. There are two
general approaches in this setting:
(i) Hierarchical approach—First construct a global α-level test of H 0
(10). If H 0 is not rejected, conclude in favor of H 0 with no further
tests. However, if H 0 is rejected at the first step, proceed with paired
comparisons of the µi. The primary attraction of this approach is that
control of the experimental error rate is simple and direct. A rather
unsatisfying result is that in case of failure to reject the global null
hypothesis, there is no further examination of the treatment differ-
ences. The conclusion is simply that the K treatments are not demon-
strably different.
(ii) All possible paired comparisons—The second general approach is to
conduct all possible paired comparisons (K(K ⫺ 1)/2 in number) but
to do so in a way that preserves the experimental error rate α. As
discussed above in connection with Bonferroni-type adjustments, this
means that each comparison must be carried out at a significance level
less than α. Methods for testing all possible pairs of treatments in a
study have been investigated for a very long time (11,12). Some of
these techniques are Tukey’s honestly significant difference and
wholly significant difference tests, the Newman-Keuls test, the Dun-
can test, and the least significant difference test. All provide ap-

proaches to analyses that control the overall type I error rate. Details
are not given here.

B. Sample Size Implications


For both of the general approaches described above, there is a cost in terms of
the required sample size. In the case of two treatments with σ 2 known, the re-
quired (equal) sample size N for each treatment is
N = 2(z_{α/2} + z_β)²/∆²
where z x is the 100x percentile of the standard normal distribution and ∆ ⫽ (µ 1 ⫺
µ 2 )/σ, the standardized difference in means that we desire to detect with power
1 ⫺ β. In practice, N is rounded up to the nearest integer value. For example, if
α ⫽ 0.05, β ⫽ 0.10, and ∆ ⫽ 0.50, then N ⫽ 85 patients are required for each
treatment.
If there are K ⬎ 2 treatments, the required sample size depends on the
specification of the means in the alternative case. Denoting the ordered means by
µ (1) ⱕ ⋅⋅⋅ ⱕ µ (K) , the least favorable configuration of means for a given maximum
difference ∆ ⫽ (µ (K) ⫺ µ (1))/σ occurs when µ (i) ⫽ (µ (1) ⫹ µ (K) )/2 for i ≠ 1, K.
This is the configuration yielding the minimum power for a fixed sample size
and is thus the configuration used to determine sample size. The exact sample
size can be obtained by an iterative solution to an equation involving noncentral
F distributions (13), but a close approximation is the following:

N = 2{√[χ²_{1−α}(K − 1) − (K − 2)] + z_β}²/∆²
where χ²_{1−α}(K − 1) is the 100(1 − α) percentile of the chi-square distribution with K − 1 degrees of freedom.
The required N for the case K ⬎ 2 treatments is always greater than that required
for K ⫽ 2 treatments. The relative increase is given in Table 1 for K ⫽ 3 to 10.
For example, with α = 0.05, β = 0.10, we require 18% more patients per treatment
for three treatments than for two treatments. Thus, if ∆ = 0.50 as in the earlier
example, a two-treatment trial requires 2 × 85 = 170 patients but a three-treatment
trial requires 3 × 1.18 × 85 ≅ 301 patients, rather than the apparent, but erroneous,
requirement of 3 × 85 = 255 patients based on the two-treatment situation. More
complex situations are covered elsewhere (14–18). The salient point in all situations
is that the number of patients per treatment group for K > 2 treatments must be
increased rather substantially over the number required for two treatments, and the
more treatments, the greater the increase.

Table 1  Multiples of Sample Size for Two Treatments Required for K > 2 Treatments

                       Number of treatments (K)
(α, β)             3      4      5      6      7      8      9     10
(0.05, 0.20)     1.21   1.35   1.46   1.56   1.65   1.73   1.80   1.87
(0.05, 0.10)     1.18   1.30   1.40   1.48   1.55   1.62   1.68   1.73
(0.01, 0.10)     1.16   1.27   1.35   1.43   1.50   1.56   1.61   1.67
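The sketch below (Python, using scipy for the normal and chi-square percentiles; illustrative only, and only approximate since Table 1 is based on an exact noncentral F calculation) evaluates the two formulas above.

from math import ceil, sqrt
from scipy.stats import norm, chi2

def n_per_arm_two(delta, alpha=0.05, beta=0.10):
    # Per-arm N for two treatments, two-sided level alpha, power 1 - beta.
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(1 - beta)
    return ceil(2 * (z_a + z_b) ** 2 / delta ** 2)

def n_per_arm_k(k, delta, alpha=0.05, beta=0.10):
    # Approximate per-arm N for K > 2 treatments, least favorable configuration.
    z_b = norm.ppf(1 - beta)
    crit = chi2.ppf(1 - alpha, k - 1)  # chi-square critical value, K - 1 df
    return ceil(2 * (sqrt(crit - (k - 2)) + z_b) ** 2 / delta ** 2)

print(n_per_arm_two(0.5))   # 85, matching the two-treatment example
print(n_per_arm_k(3, 0.5))  # about 99, close to the Table 1 multiple of 1.18 x 85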

IV. PRESPECIFIED COMPARISONS

In the previous section, it was assumed that comparison of all pairs of treatments
was of interest. No specific comparison was assumed to be of more or less interest
than any other comparison. In other settings, specific combinations of treatments
or specific comparisons may be of primary or even exclusive interest. If this is
the case, the number of comparisons can be limited and prespecified to reduce
the inherent problems of multiplicity.
One common example in a clinical trial setting is the use of K experimental
treatments and a control or standard treatment (or perhaps both). If the control
treatment mean is denoted µ 0 and the experimental treatment means µ i (i ⫽ 1,
. . . , K), the hypotheses of interest may be, for example, µ i vs. µ 0 , ∑ µ i /K vs.
µ 0 , or related hypotheses. The number of such hypotheses may be far less than
K(K ⫺ 1)/2, the total number of paired comparisons.
In the particular case of K experimental treatments compared with a control
treatment, the hypotheses of interest are
H_0: µ_i = µ_0, i = 1, . . . , K
H_1: µ_i ≠ µ_0 for at least one µ_i
This represents exactly K comparisons, one for each experimental treatment with
the control. The experimental treatments themselves are not directly compared.
The most common test procedure in this case is Dunnett’s procedure (19), an
example of a ‘‘many-one’’ test statistic. In brief, the procedure involves comput-
ing K t-statistics in the usual way:
t_i = (y_i − y_0)√N/(s√2)
where y i is the mean for the ith treatment group, s is the (pooled) standard devia-
tion, and N is the number of observations in each treatment, assumed equal in
this formulation. To preserve the experimental type I error rate, the statistics t i
must each be compared with a critical value larger than the value for K ⫽ 1 (i.e.,

Table 2  Critical Values for Comparing K Experimental Treatments to a Single
Control Treatment Using Dunnett’s Procedure (α = 0.05, two-sided tests)

Degrees of        Number of experimental treatments
freedom          1       2       3       4       5
10             2.23    2.57    2.76    2.89    2.99
20             2.09    2.38    2.54    2.65    2.73
60             2.00    2.27    2.41    2.51    2.58
120            1.98    2.24    2.38    2.47    2.55
∞              1.96    2.21    2.35    2.44    2.51

one treatment, one control). This critical value is a function of N and the degrees
of freedom, ν ⫽ (K ⫹ 1)(N ⫺ 1), in estimating s. Table 2 gives the required
critical values for this purpose (11).
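As a numerical illustration (Python; the data are hypothetical and the function name is my own), the sketch below forms the K many-one t statistics defined above and compares them with a critical value read from Table 2.

from math import sqrt

def dunnett_t_statistics(control_mean, treatment_means, pooled_sd, n_per_arm):
    # Many-one t statistics t_i = (y_i - y_0) * sqrt(N) / (s * sqrt(2)).
    return [(m - control_mean) * sqrt(n_per_arm) / (pooled_sd * sqrt(2))
            for m in treatment_means]

# Hypothetical trial: a control plus K = 4 experimental arms, N = 50 per arm.
t_stats = dunnett_t_statistics(10.0, [10.4, 11.2, 10.1, 11.5],
                               pooled_sd=2.5, n_per_arm=50)
critical = 2.44  # Table 2 entry for K = 4 and large degrees of freedom
print([round(t, 2) for t in t_stats])
print([abs(t) >= critical for t in t_stats])  # which arms differ from the control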
Other work has extended Dunnett’s procedure in various ways. For exam-
ple, Chen and Simon, in a series of papers (20–22), considered comparing the
experimental treatments that are found to differ from the control in the case when
there is a prior preference ordering for the treatments based on considerations
such as toxicity or cost.

V. OTHER DESIGNS
A. Factorial Designs
Another type of design involving multiple treatments is the factorial design
(23,24) in which two or more factors, each with two or more levels, are combined
together. For example, two different drugs, say A and B, may be either present
or absent resulting in four different treatment groups: A only, B only, A and B,
neither A nor B. This is an example of the simplest type of factorial design, the
2 ⫻ 2 factorial. If the two factors have a and b levels, the design is an a ⫻ b
design and the number of treatment groups is K ⫽ ab. In general there is no limit
to the number of factors and the number of levels of each factor, but in clinical
trials the 2 ⫻ 2 design is the most common and introduces fewer complexities
than higher level designs.
Factorial designs are popular because they carry a promise of being able
to exploit the relationships among the treatments in ways not possible in the case
of K-independent treatments. And in a clinical setting in which the factors are
different treatment options (drugs, modalities, etc.) that might plausibly be ad-
ministered jointly, it is natural to study their joint effects and interactions directly.

Unfortunately, as in the case with K-independent treatments, it is not possible to


get something for nothing.
The primary issues are addressed in terms of the 2 ⫻ 2 design described
earlier in which the four treatments are defined based in the presence or absence
of the treatments A and B. Table 3 gives the four possible treatment group means
as µ i j , where i ⫽ 0, 1 and j ⫽ 0, 1 depending on the absence or presence of
treatments A and B, respectively.
The pooled means are simply the means pooled over the relevant categories.
For example, µ 1• is the mean where treatment A is present pooled over the treat-
ment B categories (absent, present). The ‘‘main effect’’ of treatment A is defined
as µ 1• ⫺ µ 0• and the ‘‘simple main effects’’ of treatment A are µ 10 ⫺ µ 00 and
µ 11 ⫺ µ 01 , the main effects of treatment A in the absence and in the presence of
treatment B. Similar definitions apply for treatment B. The primary advantage
of a factorial design occurs when it can be assumed that the simple main effects
of a treatment are equal. Then we can design the trial as if it were a two-treatment
trial for treatment A and obtain a test for treatment B seemingly at no cost. The
problem arises when the simple main effects are not equal. In this situation the
effect of one treatment depends on whether or not the second treatment is present
and is called an interaction between the treatments. The model can be written
µ_ij = µ_00 + αx + βy + γxy
where x ⫽ 0, 1 and y ⫽ 0, 1 depending on the absence or presence of treatments
A and B, respectively, and α,β, and γ are parameters defining the effects of A
and B and their interaction, respectively. If γ ⫽ 0, there is no interaction.
Thus, for example, the simple main effects of A are α and α ⫹ γ in the
absence or presence of treatment B. If γ ⬎ 0, a positive interaction, the effect
of treatment A is heightened when given with treatment B. If γ ⬍ 0, a negative
interaction, the effect of A is decreased. Nonzero interactions are the source of
potential difficulties in factorial designs. Even moderate interaction effects can
have a profound impact on statistical power. For example, to detect an interaction

Table 3  Treatment Group Means in 2 × 2 Factorial Design

                         Treatment B
                  Absent    Present    Pooled
Treatment A
  Absent           µ_00      µ_01       µ_0•
  Present          µ_10      µ_11       µ_1•
  Pooled           µ_•0      µ_•1       µ_••

effect of the same magnitude as a treatment main effect requires four times the
number of patients required to detect the main effect where no interaction is
assumed (23). Smaller interactions will require more patients for reliable detec-
tion. Stated another way, a negative interaction will produce smaller overall treat-
ment effects and reduce the power of the statistical tests. Thus, in practice, one
must give careful consideration to the plausible size of an interaction and design
the study accordingly. It is generally unreasonable to expect to detect a small or
moderate interaction but reasonable to assume that they might exist. Simon and
Freedman (24) provide a Bayesian approach (see also Green [25] in this volume).
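To see the 2 × 2 model concretely, the following sketch (Python; the parameter values are hypothetical, and the function name is my own) builds the four cell means from µ_00, α, β, and γ and prints the simple main effects of A with and without B, showing how a nonzero γ makes them unequal.

def cell_means(mu00, a_eff, b_eff, interaction):
    # Cell means mu_ij = mu00 + alpha*x + beta*y + gamma*x*y, for x, y in {0, 1}.
    return {(x, y): mu00 + a_eff * x + b_eff * y + interaction * x * y
            for x in (0, 1) for y in (0, 1)}

mu = cell_means(mu00=1.0, a_eff=0.5, b_eff=0.3, interaction=-0.4)

simple_a_without_b = mu[(1, 0)] - mu[(0, 0)]           # = alpha
simple_a_with_b = mu[(1, 1)] - mu[(0, 1)]              # = alpha + gamma
main_a = 0.5 * (simple_a_without_b + simple_a_with_b)  # pooled main effect of A

print(round(simple_a_without_b, 3), round(simple_a_with_b, 3), round(main_a, 3))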

B. Selection Designs
An additional design involving multiple treatments and occasionally used in clini-
cal trials is a selection design (26). In this setting the purpose is to identify the
‘‘best’’ treatment or the one with the largest or smallest value of some parameter.
In the normal means case, we have an ordering of the means µ (1) ⱕ µ (2) ⱕ ⋅⋅⋅ ⱕ
µ (K) , and the purpose is to identify (‘‘select’’) the treatment associated with µ (K)
with a high probability of correct selection P* whenever µ (K) ⫺ µ (K⫺1) ⱖ δ*. The
required sample sizes in this setting are surprisingly small for most values of
δ*/σ.
The most highly publicized of this type of design applied to clinical trials
is a randomized phase II design (27,28). Here, several treatments that might have
been tested in a sequence of traditional phase II designs, one for each treatment,
are instead considered in a single study with random assignment of the available
treatments. At the end of the trial, the treatment with the highest response rate
is selected for further testing in a phase III trial.
These selection designs do not allow explicit comparisons among the treat-
ments in the usual sense. The smaller sample sizes required by these designs are
obtained by changing the objectives of the trial. The treatment selected may be
imperceptibly better than some or all of the competing treatments. Thus, such
designs should be carefully applied only in very select circumstances. One exam-
ple is when the treatments differ only slightly (e.g., in dose or intensity) so that
the loss in selecting the wrong treatment is not excessive. If one wishes to com-
pare the treatments in the usual way, a selection design is not appropriate. One
of the designs discussed earlier should be used (see also Liu [29] in this volume).

VI. SUMMARY

Multiple treatment trials are more difficult to design and analyze than trials with
only two treatments. Careful attention to detail and proper characterization of
the objectives of the trial can minimize these difficulties. However, it is important

to recognize at the outset that such trials will of necessity be larger than two-
treatment trials. In many settings the advantages may outweigh the costs, but there
are always costs. The considerations in this chapter can help in an assessment of
when the advantages outweigh the costs.

REFERENCES

1. Tukey JW. Some thoughts on clinical trials, especially problems of multiplicity.


Science 1977; 198:679–684.
2. Koch GG, Gansky SA. Statistical considerations for multiplicity in confirmatory
protocols. Drug Inform J 1996; 30:523–534.
3. Senn S. Statistical Issues in Drug Development. New York: Wiley, 1997.
4. Green S, Benedetti J, Crowley J. Clinical Trials in Oncology. New York: Chapman
and Hall, 1997.
5. Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. Br Med
J 1995; 310:170.
6. Simes RJ. An improved Bonferroni procedure for multiple tests of significance. Bio-
metrika 1986; 73:751–754.
7. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Bio-
metrika 1988; 75:800–802.
8. Holm S. A simple sequentially rejective multiple test procedure. Scand J Statist
1979; 6:65–70.
9. Hochberg Y, Benjamini Y. More powerful procedures for multiple significance test-
ing. Stat Med 1990; 9:811–818.
10. Bauer P. Multiple testing in clinical trials. Stat Med 1991; 10:871–890.
11. Miller RG. Simultaneous Statistical Inference. New York: Springer-Verlag, 1981.
12. Klockars AJ, Sax G. Multiple Comparisons. London: Sage Publications, 1986.
13. Desu MM, Raghavarao D. Sample Size Methodology. San Diego: Academic Press,
1990.
14. George SL. The required size and length of a clinical trial. In: Buyse ME, Staquet
MJ, Sylvester RJ, eds. Cancer Clinical Trials. Oxford: Oxford University Press,
1984, pp. 287–310.
15. Makuch RW, Simon RM. Sample size requirements for comparing time-to-failure
among k treatment groups. J Chron Dis 1982; 35:861–867.
16. Liu PY, Dahlberg S. Design and analysis of multiarm clinical trials with survival
endpoints. Control Clin Trials 1995; 16:119–130.
17. Phillips A. Sample size estimation when comparing more than two treatment groups.
Drug Inform J 1998; 32:193–199.
18. Ahnn S, Anderson SJ. Sample size determination for comparing more than two sur-
vival distributions. Stat Med 1995; 14:2273–2282.
19. Dunnett CW. A multiple comparison procedure for comparing several treatments
with a control. J Am Stat Assoc 1955; 50:1096–1121.
20. Chen TT, Simon R. A multiple-step selection procedure with sequential protection
of preferred treatments. Biometrics 1993; 49:753–761.

21. Chen TT, Simon RM. Extension of one-sided test to multiple treatment trials. Con-
trol Clin Trials 1994; 15:124–134.
22. Chen TT, Simon R. A multiple decision procedure in clinical trials. Stat Med 1994;
13:431–446.
23. Peterson B, George SL. Sample size requirements and length of study for testing
interaction in a 2 ⫻ k factorial design when time-to-failure is the outcome. Con-
trolled Clin Trials 1993; 14:511–522.
24. Simon R, Freedman LS. Bayesian design and analysis of two ⫻ two factorial clinical
trials. Biometrics 1997; 53:456–464.
25. Green S. Factorial designs with time-to-event end points. In: Crowley J, ed. Hand-
book of Statistics in Clinical Oncology. New York: Marcel Dekker, 2001.
26. Gibbons JD, Olkin I, Sobel M. Selecting and Ordering Populations: A New Statisti-
cal Methodology. New York: John Wiley & Sons, 1977.
27. Simon R, Wittes RE, Ellenberg SS. Randomized phase II clinical trials. Cancer
Treatment Rep 1985; 69:1375–1381.
28. Liu PY, Dahlberg S, Crowley J. Selection designs for pilot studies based on survival.
Biometrics 1993; 49:391–398.
29. Liu PY. Phase II selection designs. In: Crowley J, ed. Handbook of Statistics in
Clinical Oncology. New York: Marcel Dekker, 2001.
9
Factorial Designs with Time-to-Event
End Points

Stephanie Green
Fred Hutchinson Cancer Research Center, Seattle, Washington

FACTORIAL DESIGN

The frequent use of the standard two-arm randomized clinical trial is due in part
to its relative simplicity of design and interpretation. Conclusions are straightfor-
ward: Either the two arms are shown to be different or they are not. Complexities
arise with more than two arms; with four arms there are six possible pairwise
comparisons, 19 ways of pooling and comparing two groups, 24 ways of ordering
the arms, plus the global test of equality of all four arms. Some subset of these
comparisons must be identified as of interest; each comparison has power, level,
and magnitude considerations; the problems of multiple testing must be ad-
dressed; and conclusions can be difficult, particularly if the comparisons specified
to be of interest turn out to be the wrong ones.
Factorial designs are sometimes considered when two or more treatments,
each of which has two or more dose levels (possibly including level 0; i.e., no
treatment) are of interest alone or in combination. A factorial design assigns
patients equally to each possible combination of levels of each treatment. If treat-
ment i, i ⫽ 1 ⫺ K, has l i levels the result is an l1 ⫻ l2 . . . ⫻ lK factorial. Generally,
the aim is to study the effect of levels of each treatment separately by pooling
across all other treatments. The assumption often is made that each treatment has
the same effect regardless of assignment to the other treatments (no interaction).
There has been a fair amount of recent interest in factorial designs. Byar
(1) suggested potential benefit in use of factorials for studies with low event rates,

such as screening studies. A theoretical discussion of factorials in the context of


the proportional hazards model is presented by Slud (2). Other recent contribu-
tions to the topic include those by Simon and Freedman (3), who discussed Bayesian
design and analysis of 2 ⫻ 2 factorials (allowing for some uncertainty in
the assumption of no interaction); by Hung (4), who discussed testing first for
interaction when outcomes are normally distributed and interactions occur only
if there are effects of both treatment arms; and by Akritas and Brunner (5), who
proposed a nonparametric approach to analysis of factorial designs with censored
data (making no assumptions about interaction).
To illustrate the issues in factorial designs, a simulation of a 2 ⫻ 2 factorial
trial of control treatment O (control arm) vs. O plus treatment A (arm A) vs. O
plus treatment B (arm B) vs. O plus A and B (arm AB) was performed. The
simulated trial had 125 patients per arm accrued over 3 years with 3 additional
years of follow-up. Survival was exponentially distributed on each arm, and me-
dian survival was 1.5 years on the control arm. The sample size was sufficient
for a one-sided 0.05 level test of A vs. no-A to have power 0.9 with no effect
of B, no interaction, and an A/O hazard ratio of 1/1.33. Various cases were
considered using the usual proportional hazards model λ = λ_0 exp(αz_A + βz_B + γz_Az_B):
neither A nor B effective (α and β = 0), A effective (α = −ln(1.33)) with no effect of B,
A effective and B detrimental (β = ln(1.5)), and both A and B effective (α and β both
−ln(1.33)). Each of these was considered with no interaction (γ = 0), favorable interaction
(AB hazard improved compared to expected, γ = −ln(1.33)), and unfavorable interaction
(worse, γ = ln(1.33)). Table 1 summarizes the cases considered.
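A sketch of how such a simulation might be set up is shown below (Python; this is my own reconstruction under the stated assumptions, not the author's code). It generates a single replicate of the 2 × 2 trial: 125 patients per arm accrued uniformly over 3 years, exponential survival with hazard λ_0 exp(αz_A + βz_B + γz_Az_B), λ_0 chosen to give a 1.5-year control median, and censoring at the analysis time of 6 years.

import numpy as np

rng = np.random.default_rng(0)

def simulate_trial(alpha_a, beta_b, gamma_ab, n_per_arm=125,
                   accrual_years=3.0, total_years=6.0, control_median=1.5):
    # One replicate of the 2 x 2 factorial trial under the proportional hazards model.
    lam0 = np.log(2) / control_median
    arms = []
    for z_a in (0, 1):
        for z_b in (0, 1):
            hazard = lam0 * np.exp(alpha_a * z_a + beta_b * z_b + gamma_ab * z_a * z_b)
            entry = rng.uniform(0, accrual_years, n_per_arm)
            survival = rng.exponential(1 / hazard, n_per_arm)
            follow_up = total_years - entry              # time from entry to analysis
            time = np.minimum(survival, follow_up)       # observed (possibly censored) time
            event = (survival <= follow_up).astype(int)  # 1 = death observed
            arms.append((z_a, z_b, time, event))
    return arms

# Case 4 with an unfavorable interaction: A effective, B detrimental.
for z_a, z_b, time, event in simulate_trial(alpha_a=-np.log(1.33),
                                            beta_b=np.log(1.5),
                                            gamma_ab=np.log(1.33)):
    print(z_a, z_b, int(event.sum()), "deaths of", len(time))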
The multiple comparisons problem is one of the issues that must be consid-
ered in factorial designs. If tests of each treatment are performed at level α (typi-
cal for factorial designs; see Gail et al. [6]), then the experiment-wide level (prob-
ability that at least one comparison will be significant under the null hypothesis)
is greater than α. There is disagreement on the issue of whether all primary
questions should each be tested at level α or whether the experiment-wide level
across all primary questions should be level α, but clearly if the probability of at
least one false-positive result is high, a single positive result from the experiment
will be difficult to interpret and may well be dismissed by many as inconclusive.
Starting with global testing followed by pairwise tests only if the global test is
significant is a common approach to limit the probability of false-positive results;
a Bonferroni approach (each of T primary tests performed at α/T ) is also an option.
Power issues must also be considered. From the point of view of individual
tests, power calculations are straightforward under the assumption of no interac-
tion—calculate power according to the number of patients in the combined
groups (also typical; again see Gail et al. [6]). A concern even in this ideal case
may be the joint power for both A and B. If power to detect a specified effect

Table 1  Arm Medians (each cell shows O  A on the first line and B  AB on the second)
Used in the Simulation

Case: 1—Null; 2—Effect of A, no effect of B; 3—Effect of A, effect of B;
4—Effect of A, detrimental effect of B

Interaction                 Case 1          Case 2          Case 3          Case 4
No interaction            1.5   1.5       1.5   2         1.5   2         1.5   2
                          1.5   1.5       1.5   2         2     2.67      1     1.33
Unfavorable interaction   1.5   1.5       1.5   2         1.5   2         1.5   2
                          1.5   1.13      1.5   1.5       2     2         1     1
Favorable interaction     1.5   1.5       1.5   2         1.5   2         1.5   2
                          1.5   2         1.5   2.67      2     3.55      1     1.77

Each case has the median of the best arm in bold.

of A is 1 ⫺ β and power to detect a specified effect of B is also 1 ⫺ β, the joint


power to detect the effects of both is closer to 1 ⫺ 2β.
From the point of view of choosing the best arm, power considerations are
considerably more complicated. The ‘‘best’’ arm must be specified for the possi-
ble true configurations, the procedures for designating the preferred arm at the
end of the trial (which generally is the point of a clinical trial) must be specified,
and the probabilities of choosing the best arm under alternatives of interest must
be calculated. Several scenarios were considered in the simulation.
The first approach is to analyze assuming there are no interactions and to
do only two one-sided tests, A vs. not-A and B vs. not-B. If both A is better
than not-A and B is better than not-B, then AB is assumed to be the best arm.
The second approach is to test first for interaction (two-sided) using the
model λ ⫽ λ0exp(αzA ⫹ βzB ⫹ γzAzB). If the interaction term is not significant,
then base conclusions on the tests of A vs. not-A and B vs. not-B. If it is signifi-
cant, then base conclusions on tests of the three terms in the model and on appro-
priate subset tests. The treatment of choice is as follows:
Arm O if
1. γ is not significant, A vs. not-A is not significant, and B vs. not-B is
not significant, or
2. γ is significant and negative (favorable interaction), α and β are not

significant in the three-parameter model, and the test of O vs. AB is


not significant, or
3. γ is significant and positive (unfavorable interaction) and α and β are
not significant in the three-parameter model.
Arm AB if
1. γ is not significant, and A vs. not-A and B vs. not-B are both significant,
or
2. γ is significant and favorable and α and β are both significant in the
three-parameter model, or
3. γ is significant and favorable, α is significant and β is not significant
in the three-parameter model, and the test of A vs. AB is significant,
or
4. γ is significant and favorable, β is significant and α is not significant
in the three-parameter model, and the test of B vs. AB is significant,
or
5. γ is significant and favorable, α is not significant and β is not significant
in the three-parameter model, and the test of O vs. AB is significant.
Arm A if
1. γ is not significant, B vs. not-B is not significant, and A vs. not-A is
significant, or
2. γ is significant and favorable, α is significant and β is not significant
in the three-parameter model, and the test of A vs. AB is not significant,
or
3. γ is significant and unfavorable, α is significant and β is not significant
in the three-parameter model, or
4. γ is significant and unfavorable, α and β are significant in the three-
parameter model, and the test of A vs. B is significant in favor of A.

Arm B if
results are similar to A above, but with the results for A and B reversed.
Arm A or Arm B if

γ is significant and unfavorable, α and β are significant in the three-parame-


ter model, and the test of A vs. B is not significant.

(Try putting this into the statistical considerations of a protocol.)


The third approach is to control the overall level of the experiment by first
doing an overall test of differences among the four arms and to proceed with the

second approach above only if this test is significant. If the overall test is not
significant, then arm O is concluded to be the treatment of choice.
The possible outcomes of a trial of O vs. A vs. B vs. AB are to recommend
one of O, A, B, AB, or A or B but not AB. Tables 2–4 show the simulated
probabilities of making each of the conclusions in the 12 cases of Table 1, for
the approach of ignoring interaction, for the approach of testing for interaction,
and for the approach of doing a global test before testing for interaction. The
global test was done at the 0.05 level. Tests of A vs. not-A, B vs. not-B, O vs.
A, O vs. B, O vs. AB, and model terms α and β were one-sided; other tests were
two-sided. One-sided tests were done at the 0.05 level, two-sided at 0.1. For each
table the probability of drawing the correct conclusion is in bold.
Tables 2–4 illustrate several points. In the best case of using approach 1
when in fact there is no interaction, the experiment level is 0.11 and power when
both A and B are effective is 0.79, about as anticipated, and possibly insufficiently
conservative. Apart from that, approach 1 is best if there is no interaction. The
probability of choosing the correct arm is reduced if approach 2 (testing first for
interaction) is used instead of approach 1 in all four cases with no interaction.
If there is an interaction, approach 2 may or may not be superior. If the
interaction masks the effectiveness of the best regimen, it is better to test for
interaction (e.g., case 4 with an unfavorable interaction, where the difference
between A and not-A is diminished due to the interaction). If the interaction
enhances the effectiveness of the best arm, testing is detrimental (e.g., case 4

Table 2  Simulated Probability of Conclusion with Approach 1: No Test of Interaction

                              Conclusion
Case, interaction       O       A       B       AB      A or B

1, none 0.890 0.055 0.053 0.002 0


1, unfavorable 0.999 0.001 0 0 0
1, favorable 0.311 0.243 0.259 0.187 0
2, none 0.078 0.867 0.007 0.049 0
2, unfavorable 0.562 0.437 0 0.001 0
2, favorable 0.002 0.627 0.002 0.369 0
3, none 0.010 0.104 0.095 0.791 0
3, unfavorable 0.316 0.244 0.231 0.208 0
3, favorable 0 0.009 0.006 0.985 0
4, none 0.078 0.922 0 0 0
4, unfavorable 0.578 0.422 0 0 0
4, favorable 0.002 0.998 0 0 0
Table 3  Simulated Probability of Conclusion with Approach 2: Test of Interaction

                              Conclusion                           Test for interaction,
Case, interaction       O       A       B       AB      A or B     probability of rejection

1, none 0.865 0.060 0.062 0.005 0.008 0.116


1, unfavorable 0.914 0.036 0.033 0 0.017 0.467
1, favorable 0.309 0.128 0.138 0.424 0 0.463
2, none 0.086 0.810 0.006 0.078 0.019 0.114
2, unfavorable 0.349 0.601 0.001 0.001 0.048 0.446
2, favorable 0.003 0.384 0.002 0.612 0 0.426
3, none 0.009 0.089 0.089 0.752 0.061 0.122
3, unfavorable 0.185 0.172 0.167 0.123 0.353 0.434
3, favorable 0 0.004 0.003 0.990 0.002 0.418
4, none 0.117 0.883 0 0 0 0.110
4, unfavorable 0.341 0.659 0 0 0 0.472
4, favorable 0.198 0.756 0 0.046 0 0.441

Table 4  Simulated Probability of Conclusion with Approach 3: Global Test Followed by Approach 2

                              Conclusion                           Global test,
Case, interaction       O       A       B       AB      A or B     probability of rejection
1, none 0.972 0.011 0.010 0.003 0.004 0.052


1, unfavorable 0.926 0.032 0.026 0 0.015 0.578
1, favorable 0.503 0.049 0.057 0.390 0 0.558
2, none 0.329 0.578 0.001 0.074 0.018 0.684
2, unfavorable 0.528 0.432 0 0 0.039 0.541
2, favorable 0.014 0.374 0.001 0.611 0 0.987
3, none 0.068 0.069 0.063 0.741 0.059 0.932
3, unfavorable 0.466 0.067 0.072 0.109 0.286 0.535
3, favorable 0 0.004 0.003 0.990 0.002 1.00
4, none 0.117 0.882 0 0 0 0.997
4, unfavorable 0.341 0.659 0 0 0 1.00
4, favorable 0.198 0.756 0 0.046 0 0.999


with favorable interaction, where the difference between A and not-A is larger
due to the interaction whereas B is still clearly ineffective). In all cases the power
for detecting interactions is poor. Even using 0.1 level tests, the interactions were
detected at most 47% of the time in these simulations.
Approach 3 does restrict the overall level (probability of not choosing O
when there is no positive effect of A or B or AB), but this is at the expense of
a reduced probability of choosing the correct arm when the four arms are not
sufficiently different for the overall test to have high power.
Unfavorable interactions are particularly devastating to a study. The proba-
bility of identifying the correct regimen is poor for all methods if the correct arm
is not the control arm. Approach 1, assuming there is no interaction, is particularly
poor. Unfavorable interactions and detrimental effects happen: Study 8300 (simi-
lar to case 4 with an unfavorable interaction) is an unfortunate example (7). In
this study in limited non-small cell lung cancer, the roles of both chemotherapy
and prophylactic radiation to the brain were of interest. All patients received
radiation to the chest and were randomized to receive prophylactic brain irradia-
tion (PBI) plus chemotherapy vs. PBI vs. chemotherapy vs. no additional treat-
ment. PBI was to be tested by combining across the chemotherapy arms and
chemotherapy was to be tested by combining across PBI arms. Investigators
chose a Bonferoni approach to limit type 1 error: The trial design specified level
0.025 for two tests, a test of whether PBI was superior to no PBI, and a test of
whether chemotherapy was superior to no chemotherapy. No other tests were
specified. It was assumed that PBI and chemotherapy would not affect each other.
Unfortunately, PBI was found to be detrimental to patient survival. The worst
arm was PBI plus chemotherapy, followed by PBI, then no additional treatment,
then chemotherapy alone. Using the design criteria one would conclude that nei-
ther PBI nor chemotherapy should be used. With this outcome, however, it was
clear that the comparison of no further treatment vs. chemotherapy was critical—
but the study had seriously inadequate power for this test, and no definitive con-
clusion could be made concerning chemotherapy.
Once you admit your K ⫻ J factorial study is not one K-arm study and
one J-arm study (which happen to be in the same patients) but rather a K ⫻ J-
arm study with small numbers of patients per arm, the difficulties become more
evident. This is not ‘‘changing the rules’’ of the design; it is acknowledging the
reality of most clinical settings. Perhaps in studies where A and B have unrelated
mechanisms of action and are being used to affect different outcomes, then as-
sumptions of no interaction may not be unreasonable. However, in general A
cannot be assumed to behave the same way in the presence of B as in the absence
of B. Potential drug interactions, overlapping toxicities, differences in compli-
ance, and so on all make it more reasonable to assume there will be differences—
and with small sample sizes per group it is unlikely these will be detected.

OTHER APPROACHES TO MULTIARM STUDIES

Various approaches to multiarm studies are available. If the example study could
be formulated as O vs. A, B and AB, the problem of comparing control vs.
multiple experimental arms would apply. There is a long history of articles on
this problem, for instance Dunnett (8), Marcus et al. (9), and Tang and Lin (10),
focusing on appropriate global tests or on appropriate tests for subhypotheses.
Liu and Dahlberg (11) discuss design and provide sample size estimates (based
on the least favorable alternative for the global test) for K-arm trials with time-
to-event end points. The procedure investigated (a K-sample logrank test is per-
formed at level α, followed by α level pairwise tests if the global test is signifi-
cant) has good power for detecting the difference between a control arm and the
best treatment. These authors also point out the problems when power is consid-
ered in the broader sense of drawing the correct conclusions. Properties are good
for this approach when each experimental arm is similar either to the control arm
or the best arm but not when survivals are more evenly spread out among the
control and other arms.
Designs for ordered alternatives are another possibility (say for this exam-
ple there are theoretical reasons to hypothesize superiority of A over B, resulting
in the alternative O ⬍ B ⬍ A ⬍ AB). Liu et al. (12) propose a modified logrank
test for ordered alternatives,

T = Σ_{i=1}^{K−1} L_(i) / [Σ_{i=1}^{K−1} var(L_(i)) + 2 Σ_{i<j} cov(L_(i), L_(j))]^{1/2}

where L (i) is the numerator of the one-sided logrank test between the pooled
groups 1, . . . i and pooled groups i ⫹ 1 . . . K; this test is used as the global
test before pairwise comparisons in this setting. Similar comments apply as to
the more general case above, with the additional problem that the test will not
work well if the ordering is misspecified. A related approach includes a prefer-
ence ordering, say by expense of the regimens (which at least has a good chance
of being specified correctly), and application of a ‘‘bubble sort’’ analysis (e.g.,
take the most costly regimen only if it is significantly better than the rest; take the
second most costly only if it is significantly better than the less costly arms and the
most costly is not significantly better; and so on). This approach is discussed in Chen and Simon (13).
Any model assumption can result in problems when the assumptions are
not correct. As with testing for interactions, testing other assumptions can either
be beneficial or detrimental, with no way of ascertaining beforehand which is
the case. If assumptions are tested, procedures must be specified for when the
assumptions are shown not to be met, which changes the properties of the experi-
ment and complicates sample size considerations. Southwest Oncology Group

study S8738 (14) provides an example of incorrect assumptions. This trial ran-
domized patients to low-dose cisplatin (CDDP) vs. high-dose CDDP vs. high-
dose CDDP plus mitomycin-C (with the obvious hypothesized ordering). The
trial was closed approximately half way through the planned accrual because
survival on high-dose CDDP was convincingly shown not to be superior to stan-
dard-dose CDDP by the hypothesized 25% (in fact, it appeared to be worse). A
beneficial effect of adding mitomycin-C to high-dose CDDP could not be ruled
out at the time, but this comparison became meaningless in view of the standard-
dose vs. high-dose comparison.

CONCLUSION

The motivation for simplifying assumptions in multiarm trials is clear. The sam-
ple size required to have adequate power for multiple plausible alternatives, while
at the same time limiting the overall level of the experiment, is large. If power
for specific pairwise comparisons is important for any outcome, then the required
sample size is larger. An even larger sample size is needed if detection of interac-
tion is of interest. To detect an interaction of the same magnitude as the main
effects in a 2 ⫻ 2 trial, four times the sample size is required (15), thereby elimi-
nating what most view as the primary advantage to factorial designs.
Likely not all simplifying assumptions are wrong, but disappointing experi-
ence tells us that too many are too often wrong. Unfortunately, the small sample
sizes resulting from oversimplification lead to unacceptable chances of inconclu-
sive results (and a tremendous waste of resources). The correct balance between
conservative assumptions vs. possible efficiencies is rarely clear.
In the case of factorial designs, combining treatment arms seems to be a
neat trick—multiple answers for the price of one—until you start considering
how to protect against the possibility that the assumptions allowing the trick are
incorrect.

REFERENCES

1. Byar J. Factorial and reciprocal control design. Stat Med 1990; 9:55–64.
2. Slud E. Analysis of factorial survival experiments. Biometrics 1994; 50:25–38.
3. Simon R, Freedman L. Bayesian design and analysis of two × two factorial clinical
trials. Biometrics 1997; 53:456–464.
4. Hung H. Two-stage tests for studying monotherapy and combination therapy in two-
by-two factorial trials. Stat Med 1993; 12:645–660.
5. Akritas M, Brunner E. Nonparametric methods for factorial designs with censored
data. J Am Stat Assoc 1997; 92:568–576.

6. Gail M, You W-C, Chang Y-S, Zhang L, Blot W, Brown L, Groves F, Heinrich J,
Hu J, Jin M-L, Li J-Y, Liu W-D, Ma J-L, Mark S, Rabkin C, Fraumeni J, Xu
G-W. Factorial trial of three interventions to reduce the progression of precancerous
gastric lesions in Sandong, China: design issues and initial data. Control Clin Trials
1998; 19:352–369.
7. Miller T, Crowley J, Mira J, Schwartz J, Hutchins L, Baker L, Natale R, Chase E,
Livingston R. A randomized trial of chemotherapy and radiotherapy for stage III
non-small cell lung cancer. Cancer Ther 1998; 1:229–236.
8. Dunnett CW. A multiple comparison procedure for comparing several treatments with
a control. J Am Stat Assoc 1955; 50:1096–1121.
9. Marcus R, Peritz E, Gabriel K. On closed testing procedures with special reference
to ordered analysis of variance. Biometrika 1976; 63:655–660.
10. Tang D-I, Lin S. An approximate likelihood ratio test for comparing several treat-
ments to a control. J Am Stat Assoc 1997; 92:1155–1162.
11. Liu P-Y, Dahlberg S. Design and analysis of multiarm trial clinical trials with sur-
vival endpoints. Control Clin Trials 1995; 16:119–130.
12. Liu P-Y, Tsai W-Y, Wolf M. Design and analysis for survival data under order
restrictions with a modified logrank test. Stat Med 1998; 17:1469–1479.
13. Chen T, Simon R. Extension of one-sided test to multiple treatment trials. Control
Clin Trials 1994; 15:124–134.
14. Gandara D, Crowley J, Livingston R, Perez E, Taylor C, Weiss G, Neefe J, Hutchins
L, Roach R, Grunberg S, Braun T, Natale R, Balcerzak S. Evaluation of cisplatin
in metastatic non-small cell lung cancer: A phase III study of the Southwest Oncol-
ogy Group. J Clin Oncol 1993; 11:873–878.
15. Peterson B, George S. Sample size requirements and length of study for testing
interaction in a 2 ⫻ K factorial design when time to failure is the outcome. Control
Clin Trials 1993; 14:511–522.
10
Therapeutic Equivalence Trials

Richard Simon
National Cancer Institute, National Institutes of Health, Bethesda, Maryland

I. INTRODUCTION

The objective of a therapeutic equivalence trial is generally to demonstrate that


a new treatment is equivalent to a standard therapy with regard to a specified
clinical end point. The new treatment may be less invasive and less debilitating
or it may be more convenient. Consequently, if it is equivalent to the standard
with regard to the primary efficacy end point, it would be attractive to patients.
Usually, however, one is willing to exchange only very small reductions in effi-
cacy for the advantages in secondary end points.
Therapeutic equivalence trials are contrasted to bioequivalence trials where
the objective is to demonstrate equivalence of serum concentrations of the active
moiety. In some therapeutic equivalence trials called active control trials, investi-
gators would like to demonstrate that the new treatment is effective compared
with no treatment, but because use of a no-treatment arm is not feasible, they
attempt to demonstrate therapeutic equivalence to a standard treatment.
In this chapter we review some of the problems with therapeutic equiva-
lence trials and provide recommendations for the design and analysis of such
trials.


II. PROBLEMS WITH THERAPEUTIC EQUIVALENCE


TRIALS

One inherent problem is that it is impossible to demonstrate equivalence. If the


outcomes for the two treatments are very different, then one can conclude that
the two treatments are not therapeutically equivalent. In the absence of demon-
strating lack of equivalence, however, one can only conclude that results are
consistent with only small differences.
In conventional trials, rejection of the null hypothesis is usually established
with substantial statistical reliability, and rejection of the null hypothesis leads
to change in the treatment of future patients. The implications of failure to reject
the null hypothesis are often more difficult to interpret. Failure to reject the null
hypothesis in conventional trials generally does not lead to changes in medical prac-
tice, however, and hence the ambiguity associated with its interpretation is in
some sense of less concern.
For therapeutic equivalence trials the situation is quite different. Failure to
demonstrate nonequivalence is the ambiguous outcome and the outcome that
leads to change in the treatment of future patients. This is not always the case
but is for a large class of therapeutic equivalence trials where a standard effective
treatment is compared with a shorter, lower dose, or less invasive regimen. Failure
to demonstrate nonequivalence is often interpreted as a demonstration of thera-
peutic equivalence and grounds for adoption of the new regimen. Many problems
with therapeutic equivalence trials are associated with reasons why failure to
demonstrate nonequivalence should not be interpreted as demonstration of equiv-
alence.
Someone once said that everything looks like a nail to a person whose only
tool is a hammer. Overreliance on statistical significance testing is one of the
problems with the conduct of therapeutic equivalence trials. Failure to reject the
null hypothesis may be a result of inadequate sample size and not grounds for
concluding equivalence. Unfortunately, closeness of sample means or sample
distributions and nonsignificant p values are convincing to a large part of the
medical audience. The sample size for clinical trials is often determined on practi-
cal grounds based on patient availability over a limited time period or funding
available. Consequently, many trials are undersized. This is a particular problem
for therapeutic equivalence trials because it leads to failure to demonstrate non-
equivalence as a result of inadequate statistical power.
A related problem with therapeutic equivalence trials is that large sample
sizes are often needed. For example, consider a cancer trial evaluating tumor
resection as an alternative to amputation of the organ containing the tumor in a
setting where amputation is the standard therapy known to be curative in a large
number of cases. Tumor resection may have clear advantages with regard to
quality of life, but many patients would be interested in these advantages only
if they were assured that the chance for cure they might give up would be very
small. Hence, the appropriate trial should focus on distinguishing the null hypoth-
esis that the new treatment is no worse than the standard, i.e., ∆ ≤ 0, from the hypothesis that the standard treatment is superior by at least some very small amount δ, i.e., ∆ ≥ δ. Consequently, this trial would have to be quite large.
Some therapeutic equivalence trials compare a treatment regimen E to a
control C when the advantage of C over placebo P or no treatment is small. Such
trials must also be very large because they must demonstrate that the difference
in efficacy between E and C is no greater than a fraction of the difference between
C and P.
Another difficulty with the therapeutic equivalence trial is that there is no
internal validation of the assumption that the control C is actually effective for
the patient population at hand. It is not enough for E to be therapeutically equiva-
lent to C; we want equivalence coupled with the effectiveness of E and C relative
to P.
A related problem is the difficulty in selecting the difference δ to be distin-
guished from the null hypothesis. In general, the difference δ should represent
the largest difference that a patient is willing to give up in efficacy of the standard
treatment C for the secondary benefits of the experimental treatment E. The differ-
ence δ must be no greater than the efficacy of C relative to P and will in general be a fraction of this quantity δ_c. Estimation of δ_c requires review of clinical trials that established the effectiveness of C relative to P. δ_c should not be taken as
cause there is substantial probability that the true treatment effect in those trials
was less than the point mle. We discuss later in this chapter quantitative methods
for utilizing information that demonstrates the effectiveness of C relative to P in
planning a therapeutic equivalence trial.

III. DESIGN AND ANALYSIS

Of the methods that have been proposed for the design and analysis of therapeutic equivalence trials, the one based on confidence intervals has important advantages (1–3). As indicated above, a significance test of the null hypothesis
can provide a very misleading interpretation of the results of a therapeutic equiva-
lence trial. Testing an alternative hypothesis has been proposed as an alternative
method (4). For an experiment that is too small to be informative, a test of an
alternative hypothesis will be less likely to be interpreted as supporting therapeu-
tic equivalence, but this approach will still not indicate the basic inadequacy of
the experiment. A confidence interval for the difference in efficacy between E
and C will indicate the range of values that are consistent with the data. A confi-
dence interval is more informative than a hypothesis test and a statement of statis-
tical power. Statistical power is relevant for planning the study, but it is not a
good parameter for interpreting the results of a study because it ignores the data
actually obtained. A confidence interval incorporates both the size of the study
and the results obtained. The well-known study by Frieman et al. (5) tabulated
the power of 71 trials reported as negative. They found that 50 of these 71 trials
had power less than 0.9 for detecting a 50% treatment effect. If one computes
approximate confidence intervals from those trials, however, one finds that 16
of these 50 trials with inadequate statistical power have confidence limits that
exclude 50% treatment effects and hence are definitively negative. Frieman et
al. focused attention on statistical power of trials that claimed to be ‘‘negative.’’
This was useful, but calculation of confidence intervals for treatment differences
is a much more relevant and informative means of analysis. To encourage physi-
cians to use such confidence intervals, Simon (2) showed how to calculate such
approximate confidence intervals for commonly encountered end points.
If one decides to use a confidence interval as the method of analysis, the
questions of one-sided versus two-sided and the confidence coefficient arise.
Therapeutic equivalence trials are by nature asymmetric with regard to E and C.
We are generally interested in whether E is about the same as or substantially
worse than C. Often, there is information that makes it less likely that E will be
superior to C. In any case, clinical decision making may be the same whether E
is equivalent to C or if E is superior to C. Consequently, one-sided confidence
intervals are often justified. Since for some therapeutic equivalence trials it is
possible that E is superior to C, two-sided confidence intervals may also be desir-
able. I have recommended symmetric two-sided 90% confidence intervals in
many cases. This provides the same upper limit for the C ⫺ E difference as a
one-sided 95% confidence interval and also provides a lower limit for evaluating
whether E is actually superior to C. Another alternative is the 11/2-sided confi-
dence limit in which the lower limit is extended. For example, a 11/2-sided 95%
confidence limit would have 31/3% area above the upper limit and 12/3% area
below the lower limit.
Let ∆̂ denote the mle of the difference in treatment effects C − E. We will assume that a positive value favors C and that ∆̂ is approximately normal with mean ∆ and standard deviation σ. An upper 1 − α level confidence limit for ∆ is approximately

    ∆_up = ∆̂ + z_{1−α}σ     (1)

In planning the trial a value δ must be specified, where δ represents the largest true C − E difference consistent with therapeutic equivalence. Once the data are obtained, if ∆_up < δ, then one concludes that E is therapeutically equivalent to
C. Whether or not this condition is achieved, the confidence interval provides the range of relative effectiveness values consistent with the data.
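As a simple illustration of this decision rule, the following Python sketch (my own code, not from the chapter, with hypothetical numbers) computes the upper limit of Eq. (1) and compares it with a prespecified δ.

    # Illustrative sketch (not from the chapter): applying the decision rule based
    # on the upper confidence limit of Eq. (1).  All numbers are hypothetical.
    from scipy.stats import norm

    def upper_confidence_limit(delta_hat, sigma, alpha=0.05):
        """Upper 1 - alpha confidence limit for the true C - E difference."""
        return delta_hat + norm.ppf(1 - alpha) * sigma

    # Hypothetical data: observed C - E difference 0.02, standard error 0.03,
    # and a largest tolerable difference delta = 0.10.
    delta_hat, sigma, delta = 0.02, 0.03, 0.10
    d_up = upper_confidence_limit(delta_hat, sigma)
    verdict = "consistent with" if d_up < delta else "does not establish"
    print(f"Upper 95% limit = {d_up:.3f}; {verdict} therapeutic equivalence")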
There are several approaches to planning the size of the study using the confidence interval as the basis for analysis. All the methods are based, however, on the fact that the σ that occurs in Eq. (1) is a function of the sample size. In the case of survival comparisons with proportional hazards, σ is a function of the number of events observed. One approach is to set σ so that under the null hypothesis, the probability that ∆_up < δ is a specified value 1 − β. If σ is independent of the value of ∆, this leads to the familiar condition for sample size planning with normal distributions that

    δ/σ = z_{1−α} + z_{1−β}     (2)

For the two-sample normal case σ = √(2σ₀²/n), where σ₀ is the standard deviation per observation and n is the sample size per treatment group. For the two-sample normal case, this approach provides the same sample size as does the hypothesis testing framework, with proper definition of α and β. For the two-sample binomial or the two-sample time-to-event case, the correspondence is not exact because of the dependence of σ on ∆. The correspondence is approximately the same, however. For example, Eq. (2) can be used for sample size planning in the two-sample time-to-event case with the approximation σ = √(4/total events).
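For the time-to-event case, Eq. (2) translates directly into a required total number of events. The Python sketch below is my own illustration with hypothetical design values.

    # Minimal sketch (hypothetical design values): total events implied by Eq. (2)
    # for a time-to-event trial, using sigma = sqrt(4 / total events).
    from math import ceil
    from scipy.stats import norm

    def required_events(delta, alpha=0.05, beta=0.10):
        """Total events so that delta/sigma = z_{1-alpha} + z_{1-beta}."""
        z = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
        return ceil(4 * (z / delta) ** 2)

    # delta is the largest tolerable C - E log hazard ratio; 0.15 is illustrative.
    print(required_events(delta=0.15))   # roughly 1,500 total events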
An alternative approach to sample size planning is to use a symmetric two-sided confidence interval and require that

    Pr_{∆=0}[∆̂ + z_{1−α}σ > δ] = β   and   Pr_{∆=δ}[∆̂ − z_{1−α}σ < 0] = β     (3)

In the case where σ is independent of ∆, satisfying condition (2) automatically satisfies both parts of condition (3).
A more stringent approach to sample size planning is to require that the
width of the two-sided confidence interval be of size δ. This ensures that the
confidence interval will always exclude either 0 or δ. It requires substantially
more patients, however.
Interim analysis using confidence intervals and group sequential methods
is described by Durrleman and Simon (6).

IV. ANALYSIS TO ESTABLISH EQUIVALENCE AND EFFECTIVENESS

In an important sense, none of the above approaches represents a satisfactory statistical framework for the design and analysis of therapeutic equivalence trials. These approaches depend on the specification of a minimal difference δ in effi-
cacy that one is willing to tolerate. None of the approaches deal with how δ is
determined. Fleming (7) and Gould (8,9) have noted that the design and interpre-
tation of equivalence trials must utilize information about previous trials of the
active control. Fleming proposed that the new treatment is considered effective
if an upper confidence limit for the amount that the new treatment may be inferior
to the active control does not exceed a reliable estimate of the improvement of
the active control over placebo or no treatment. Gould provided a method for
creating a synthetic placebo control group based on previous trials comparing
the active control to placebo. Simon presented a general Bayesian approach to
the utilization of information from previous trials in the design and analysis of
an equivalence trial (10).
Two major objectives can be distinguished. The first is to determine
whether the experimental treatment is effective relative to P. This requires explicit
use of prior information about outcomes of trials comparing P to the active con-
trol. Meaningful interpretation of active control trials is impossible without con-
sideration of such information. Establishing whether or not the experimental treat-
ment is effective relative to P is a first requirement. The second objective is to
determine whether any medically important portion of the treatment effect for
the active control is lost with the experimental treatment. In some cases this
objective is unrealistic because the size of the treatment effect (relative to P) for
the active control is imprecisely determined.
We use the following model: y = α + βx + γz + ε, where y denotes the response of a patient, x = 0 for placebo or the experimental treatment and 1 for the control treatment, z = 0 for placebo or the control treatment and 1 for the experimental treatment, and ε is normally distributed experimental error. Hence the expected response for C is α + β, the expected response for E is α + γ, and the expected response for P is α. The likelihood function for the data (D) from the active controlled trial can be expressed as π(D|α, β, γ) ∝ π(y_c|α, β)π(y_e|α, γ), where the first factor is the likelihood of the data for the control group and the second factor is the likelihood of the data for the experimental group. We use the notation π(·) informally to denote either the probability density of observable data, the prior probability density of a parameter, or the posterior density of a parameter. The first factor is N(α + β, σ²) and the second factor is N(α + γ, σ²), where σ is the standard error for the observed means. We assume that σ² is known, although it will generally be estimated. For the large sample sizes appropriate for active control trials, the additional variability caused by uncertainty in σ² should be very small. This assumption enables us to obtain simple analytical results, but a more exact treatment is possible using posterior distribution sampling methods.
The posterior distribution of Θ = (α, β, γ) has density proportional to π(D|α, β, γ)π(Θ). We shall assume that the parameters have independent normal prior densities π(α) ~ N(µ_α, σ²_α), π(β) ~ N(µ_β, σ²_β), and π(γ) ~ N(µ_γ, σ²_γ). Hence,
the posterior distribution of Θ is π(Θ|D) ∝ π(y_c|α, β)π(y_e|α, γ)π(α)π(β)π(γ). The posterior distribution can be shown to be multivariate normal. The covariance matrix is

    Σ = (σ²/K) ×
        | (1 + r_β)(1 + r_γ)   −(1 + r_γ)                  −(1 + r_β)                |
        | −(1 + r_γ)            r_γ + (1 + r_α)(1 + r_γ)    1                         |     (4)
        | −(1 + r_β)            1                           r_β + (1 + r_α)(1 + r_β)  |

where r_α = σ²/σ²_α, r_β = σ²/σ²_β, and r_γ = σ²/σ²_γ, and K = r_α(1 + r_β)(1 + r_γ) + r_β(1 + r_γ) + (1 + r_β)r_γ. The mean vector η = (η_α, η_β, η_γ) of the posterior distribution is

    η_α = [r_α(1 + r_β)(1 + r_γ)µ_α + r_β(1 + r_γ)(y_c − µ_β) + r_γ(1 + r_β)(y_e − µ_γ)]/K
    η_β = [r_β(r_γ + (1 + r_α)(1 + r_γ))µ_β + r_α(1 + r_γ)(y_c − µ_α) + r_γ(y_c − y_e + µ_γ)]/K     (5)
    η_γ = [r_γ(r_β + (1 + r_α)(1 + r_β))µ_γ + r_α(1 + r_β)(y_e − µ_α) + r_β(y_e − y_c + µ_β)]/K
This indicates that the posterior mean of α is a weighted average of three estimates of α. The first estimate is the prior mean µ_α. The second estimate is the observed y_c minus the prior mean for β. This makes intuitive sense since the expectation of y_c is α + β. The third estimate in the weighted average is the observed y_e minus the prior mean for γ. The expectation of y_e is α + γ. The sum of the weights is K. The other posterior means are similarly interpreted.
The marginal posterior distribution of γ is normal with mean η_γ and variance equal to the (3, 3) element of Σ. The parameter γ represents the contrast of experimental treatment versus placebo. One can thus easily compute the posterior probability that γ > 0, which would be a Bayesian analog of a statistical significance test of the null hypothesis that the experimental regimen is no more effective than placebo (if negative values of the parameter represent effectiveness).
The posterior distribution of γ − kβ is univariate normal with mean η_γ − kη_β and variance Σ_33 + k²Σ_22 − 2kΣ_23. Consequently, one can also easily compute the posterior probability that γ − kβ ≤ 0. For k = 0.5, if β < 0 this represents the probability that the experimental regimen is at least half as effective as the active control. Since there may be positive probability that β > 0, it is more appropriate to compute the joint probability that β < 0 and γ − kβ ≤ 0 to represent the probability that the experimental regimen is at least 100k% as effective as the active control.
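The posterior moments in Eqs. (4) and (5) are straightforward to compute numerically. The Python sketch below is my own illustration, not code from the chapter; the function and variable names are mine, the inputs are hypothetical, and flat priors are approximated by large prior variances.

    # Hedged sketch of the posterior moments in Eqs. (4) and (5); inputs are
    # hypothetical values on the logit scale.
    import numpy as np
    from scipy.stats import norm

    def posterior_moments(y_c, y_e, sigma2, mu, prior_var):
        """mu = (mu_a, mu_b, mu_g); prior_var = (var_a, var_b, var_g)."""
        mu_a, mu_b, mu_g = mu
        r_a, r_b, r_g = (sigma2 / v for v in prior_var)
        K = r_a*(1 + r_b)*(1 + r_g) + r_b*(1 + r_g) + (1 + r_b)*r_g
        Sigma = (sigma2 / K) * np.array([
            [(1 + r_b)*(1 + r_g), -(1 + r_g),                -(1 + r_b)],
            [-(1 + r_g),          r_g + (1 + r_a)*(1 + r_g),  1.0],
            [-(1 + r_b),          1.0,                        r_b + (1 + r_a)*(1 + r_b)]])
        eta_a = (r_a*(1 + r_b)*(1 + r_g)*mu_a + r_b*(1 + r_g)*(y_c - mu_b)
                 + r_g*(1 + r_b)*(y_e - mu_g)) / K
        eta_b = (r_b*(r_g + (1 + r_a)*(1 + r_g))*mu_b + r_a*(1 + r_g)*(y_c - mu_a)
                 + r_g*(y_c - y_e + mu_g)) / K
        eta_g = (r_g*(r_b + (1 + r_a)*(1 + r_b))*mu_g + r_a*(1 + r_b)*(y_e - mu_a)
                 + r_b*(y_e - y_c + mu_b)) / K
        return np.array([eta_a, eta_b, eta_g]), Sigma

    # Hypothetical logit-scale inputs; negative values represent effectiveness.
    eta, Sig = posterior_moments(y_c=-2.50, y_e=-2.45, sigma2=0.06**2,
                                 mu=(0.0, -0.10, 0.0),
                                 prior_var=(100.0, 0.039**2, 100.0))
    k = 0.5
    m = eta[2] - k * eta[1]
    v = Sig[2, 2] + k**2 * Sig[1, 1] - 2 * k * Sig[1, 2]
    print("Pr(gamma < 0 | data)           =",
          norm.cdf(0.0, loc=eta[2], scale=np.sqrt(Sig[2, 2])))
    print("Pr(gamma - k*beta <= 0 | data) =", norm.cdf(0.0, loc=m, scale=np.sqrt(v)))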

In the special case where noninformative prior distributions are adopted for α and γ, one obtains

    Σ = σ²_β ×
        | 1 + r_β       −1     −(1 + r_β) |
        | −1             1      1         |     (6)
        | −(1 + r_β)     1      1 + 2r_β  |

In this case the posterior distribution of β is N(µ_β, σ²_β), the same as the prior distribution, the posterior distribution of γ is N(µ_β + y_e − y_c, σ²_β + 2σ²), and the posterior distribution of α is N(y_c − µ_β, σ²_β + σ²). It can be seen that the clinical trial comparing C to E contains information about α if an informative prior distribution is used for β.
One may permit correlation among the prior distributions. Let S denote the covariance matrix for the multinormal prior distribution for (α, β, γ). Then Σ⁻¹ = M + S⁻¹, where

    M = (1/σ²) ×
        | 2  1  1 |
        | 1  1  0 |     (7)
        | 1  0  1 |

and the posterior mean vector is the solution of Σ⁻¹η = (1/σ²)(y•, y_c, y_e)′ + S⁻¹µ′, where µ = (µ_α, µ_β, µ_γ) and y• = y_c + y_e.
The above results can be applied to binary outcome data by approximating the log odds of failure by a normal distribution. The approach can also be extended in an approximate manner to the proportional hazards model. Let the hazard be written as

    λ(t) = λ_0(t) exp(βx + γz)

where λ_0(t) denotes the baseline hazard function and the indicator variables x and z are the same as described above in Section IV. The data will be taken as the maximum likelihood estimate of the log hazard ratio for E relative to C for the active control study and will be denoted by y. Thus, for large samples y is approximately normally distributed with mean γ − β and variance σ² = 1/d_C + 1/d_E, where the d's are the numbers of events observed on C and E, respectively. Using normal priors for β and γ as above, the same reasoning results in the posterior distribution of the parameters (β, γ) being approximately normal with mean η = (η_β, η_γ) and covariance matrix Σ = (λ_ij)⁻¹ with

    λ_11 = 1/σ² + 1/σ²_β
    λ_22 = 1/σ² + 1/σ²_γ
    λ_12 = −1/σ²
and mean vector determined by

    Λη = ( µ_β/σ²_β − y/σ²,  y/σ² + µ_γ/σ²_γ )′

If a noninformative prior is used for γ, then λ_22 = −λ_12, and we obtain that the posterior distribution of β is N(µ_β, σ²_β), the same as the prior distribution. In this case the posterior distribution of γ is N(µ_β + y, σ²_β + σ²). The posterior covariance of β and γ is −σ²_β. Hence, the posterior probability that the experimental treatment is effective relative to placebo is Φ(−(µ_β + y)/√(σ²_β + σ²)).
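As a small illustration (my own sketch; the event counts and prior values are hypothetical), the closing expression can be evaluated directly.

    # Posterior probability that E is effective relative to placebo when a flat
    # prior is used for gamma and a N(mu_b, var_b) prior for beta.
    from math import sqrt
    from scipy.stats import norm

    def prob_effective(y, d_c, d_e, mu_b, var_b):
        """y = estimated log hazard ratio (E vs. C) from the active control trial."""
        sigma2 = 1.0 / d_c + 1.0 / d_e          # variance of y
        return norm.cdf(-(mu_b + y) / sqrt(var_b + sigma2))

    # Hypothetical event counts and a prior centered on a modest control benefit.
    print(prob_effective(y=0.03, d_c=350, d_e=360, mu_b=-0.10, var_b=0.039**2))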

V. PLANNING TO ESTABLISH EQUIVALENCE AND EFFECTIVENESS

A minimal objective of the active controlled trial is to determine whether or not E is effective relative to P. Hence, we might require that if γ = β, then it should be very probable that the trial will result in data y = (y_e, y_c) such that Pr(γ < 0 | y) > 0.95, where γ < 0 represents effectiveness of the experimental treatment. Thus, we want

    Pr[η_γ/√Σ_33 < −1.645] ≥ ξ     (8)

where η_γ and Σ_33 are the posterior mean and variance of γ, the probability is calculated assuming γ = β and that β is distributed according to its prior distribution, and ξ is some appropriately large value such as 0.90.
The posterior mean η_γ is a linear combination of the data and is thus itself normally distributed, with mean and variance denoted by ρ_γ and ζ_γ, respectively. Thus, Eq. (8) can be written

    (−1.645√Σ_33 − ρ_γ)/√ζ_γ = z_ξ     (9)

where z_ξ is the 100ξ-th percentile of the standard normal distribution. When γ = β,

    ρ_γ = Σ_31(2µ_β/σ² + µ_α/σ²_α) + Σ_32(µ_β/σ² + µ_β/σ²_β) + Σ_33(µ_β/σ² + µ_γ/σ²_γ)     (10)

and

    ζ_γ = Σ_31(2 + µ_α/σ²_α) + Σ_32(1 + µ_β/σ²_β) + Σ_33(1 + µ_γ/σ²_γ)     (11)


Hence, one can determine the value of σ² that satisfies Eq. (9). σ² represents the variance of the means y_e and y_c and hence is inversely proportional to the sample size per treatment arm in the active controlled trial.
In the special case where noninformative prior distributions are adopted for α and γ, that is, σ²_α = σ²_γ → ∞, the above results simplify, and the mean of the predictive distribution is ρ_γ = µ_β with predictive variance ζ_γ = 2σ². Using these results in Eq. (9) and simplifying yields

    [−1.645√(1 + 2σ²/σ²_β) − µ_β/σ_β]/√(2σ²/σ²_β) = z_ξ     (12)
The trial may be sized by finding the value of σ² that satisfies Eq. (12). It is of interest that µ_β/σ_β is the ‘‘z value’’ for the evaluation of the active control versus placebo. The required sample size for the active control trial is very sensitive to that z value. For example, suppose that µ_β/σ_β = 3. This represents substantial evidence that the active control is indeed effective relative to placebo. In this case, for ξ = 0.8 one requires that the ratio r = σ²/σ²_β = 0.4 for Eq. (12) to be satisfied. Since σ²_β is known and since σ² represents the variance of the mean response per treatment arm in the active controlled trial, the sample size per arm can be determined. Alternatively, if there is less substantial evidence for the effectiveness of the active control, for example µ_β/σ_β = 2, then one requires that the ratio r = σ²/σ²_β = 0.05 to satisfy Eq. (12). This represents eight times the sample size required for the case when µ_β/σ_β = 3. When the evidence for the effectiveness of the active control is marginal, then the active control design is neither feasible nor appropriate.
For the binary response approximation described above, we have approximately σ² = 1/npq, where n is the sample size per treatment group in the active control trial. If there is one previous randomized trial of active control versus placebo on which to base the prior distribution of β, then we have approximately that σ²_β = 2/n_0 pq, where n_0 denotes the average sample size per treatment group in that trial. Consequently, σ²/σ²_β = n_0/2n. If µ_β/σ_β = 3, then n_0/2n = 0.4, that is, n = 1.25n_0, and the sample size required for the active control trial is 25% larger than that required for the trial demonstrating the effectiveness of the active control. On the other hand, if µ_β/σ_β = 2, then n_0/2n = 0.05, that is, n = 10n_0.
Planning the trial to demonstrate that the new regimen is effective com-
pared with placebo seems a minimal requirement. As indicated above, even estab-
lishing that objective may not be feasible unless the data demonstrating the effec-
tiveness of the active control is definitive. One can be more ambitious and plan
the trial to ensure with high probability that the results will support the conclusion
that the new treatment is at least 100k% as effective as the active control when
in fact the new treatment is equivalent to the active control. That is, we would require that Pr(γ < kβ | y) > 0.95. To achieve this, one obtains instead of Eq. (9) the requirement

    [−1.645√((1 − k)² + 2σ²/σ²_β) − (1 − k)µ_β/σ_β]/√(2σ²/σ²_β) = z_ξ     (13)
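As an informal illustration, the Python sketch below (my own code) solves Eq. (13) numerically for the ratio r = σ²/σ²_β; Eq. (12) is the special case k = 0. The sign convention assumed here is that the prior ‘‘z value’’ µ_β/σ_β is negative when the active control is effective, so the z values of 3 and 2 discussed above enter as −3 and −2; the function name and bracketing interval are mine.

    # Solving Eq. (13) for r = sigma^2 / sigma_b^2.
    from math import sqrt
    from scipy.stats import norm
    from scipy.optimize import brentq

    def required_ratio(z_control, k=0.0, xi=0.80, conf=1.645):
        """Find r so that the left side of Eq. (13) equals the xi-th normal quantile."""
        z_xi = norm.ppf(xi)

        def lhs(r):
            return (-conf * sqrt((1 - k)**2 + 2*r) - (1 - k) * z_control) / sqrt(2*r)

        # lhs(r) decreases in r, so bracket the root between a tiny and a large value
        return brentq(lambda r: lhs(r) - z_xi, 1e-6, 100.0)

    print(required_ratio(z_control=-3.0))            # roughly 0.4, as quoted above
    print(required_ratio(z_control=-2.0))            # roughly 0.05
    print(required_ratio(z_control=-2.57, k=0.5))    # about 0.06 (cf. the next section)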

VI. EXAMPLE

As an example of the analysis of therapeutic equivalence trials, we consider two recently reported clinical trials of bolus t-PA (tissue plasminogen activator) for
lysis of coronary artery thrombosis. Both trials, GUSTO III and COBALT, com-
pared t-PA administered in two boluses separated by 30 min to standard t-PA
administered in an accelerated infusion over 90 min (11,12). Heparin was admin-
istered intravenously in all cases. The GUSTO III trial used a recombinant mutant
version of t-PA for the bolus group. Infusion t-PA was considered the standard treatment, but bolus administration is more convenient.
Thirty-day mortality results for the COBALT and GUSTO III trials are
shown in Tables 1 and 2. In COBALT, the 30-day mortality for bolus was higher
than that for infusion, but the difference was not statistically significant. The
investigators concluded that ‘‘double-bolus alteplase was not shown to be equiva-
lent according to the prespecified criteria, to accelerated infusion with regard to
30-day mortality. There was also a slightly higher rate of intracranial hemorrhage
with the double-bolus method. Therefore, accelerated infusion of alteplase over
a period of 90 minutes remains the preferred regimen.’’
The results of GUSTO III were similar to those for COBALT. The 30-day
mortality for the bolus arm was slightly but not statistically significantly higher
than for the infusion arm. In contrast to the COBALT result, the investigators
implied that the two regimens were equivalent, although they indicated that the
trial was not sized for demonstrating therapeutic equivalence since they expected
the bolus regimen to be superior.

Table 1  COBALT

                               n (planned)   n (actual)   30-day mortality (%)
    t-PA + IV heparin              4029          3584             7.53
    Bolus t-PA + IV heparin        4029          3595             7.98

Table 2  GUSTO III

                                        n       30-day mortality (%)
    t-PA + IV heparin                  4,921            7.24
    Bolus reteplase + IV heparin      10,138            7.47

Using the logit approximation, the log odds of 30-day mortality for the infusion regimen compared with the bolus regimen was −0.0621 with a standard error of 0.088 for COBALT and −0.0341 with a standard error of 0.067 for GUSTO III. A weighted average of these two results gives a log odds ratio of −0.044 with a standard deviation of 0.053. The negative logit reflects an odds ratio of 0.957, slightly favoring the infusion regimen. A 95% two-sided confidence interval for the log odds ratio is (−0.148, 0.06), which corresponds to a confidence interval for the odds ratio of (0.862, 1.062). The lower limit corresponds to a 14% lower 30-day mortality for the standard infusion regimen compared with the bolus regimen.
The two arms of GUSTO I using infusion t-PA gave an average 30-day mortality rate of 6.65% based on 20,672 patients (13). The other two arms involving streptokinase (SK) gave an average of 7.30% 30-day mortality based on 20,173 patients. The odds ratio for infusion t-PA relative to SK is 0.9046 and the log odds ratio is −0.10027 with a standard error of 0.039. The Z value for this comparison is 2.57, and an approximate 95% confidence limit for the odds ratio is (0.838, 0.976). Since the point estimate of the odds ratio for 30-day mortality for infusion t-PA versus SK is 0.9046, there is about a 50% chance that the reduction in risk is less than 10%.
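The empirical logit summaries used in the two-step analysis described below are easy to reproduce. The short Python sketch that follows is my own code; it uses the published COBALT 30-day mortality rates from Table 1 and recovers approximately y_c = −2.507, y_e = −2.445, and σ = 0.0625.

    # Empirical logits and approximate standard errors from observed proportions.
    from math import log, sqrt

    def empirical_logit(p, n):
        """Logit of an observed proportion and its approximate standard error."""
        return log(p / (1 - p)), sqrt(1.0 / (n * p * (1 - p)))

    y_c, se_c = empirical_logit(0.0753, 3584)    # accelerated-infusion t-PA arm
    y_e, se_e = empirical_logit(0.0798, 3595)    # double-bolus t-PA arm
    sigma = (se_c + se_e) / 2                    # average SE, as described in the text
    print(round(y_c, 3), round(y_e, 3), round(sigma, 4))   # approx. -2.507, -2.445, 0.0625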
The Bayesian analysis described previously was applied to these data in an approximate manner. Flat prior distributions were used for the intercept parameter (α) and for the effect of bolus t-PA relative to SK (γ). The prior distribution for the effect of infusion t-PA relative to SK was obtained from GUSTO I as indicated in the previous paragraph. That is, for β we used a normal prior with mean −0.10 and standard deviation of 0.039. This ignores any possible interstudy variability in the effectiveness of infusion t-PA. We could account for such additional variability by increasing the standard deviation of β.
We incorporated the COBALT and GUSTO III data in a two-step manner. First, we summarized the result of COBALT using the empirical logit transform to be represented by y_c = −2.507, y_e = −2.445, and σ = 0.0625. The standard deviation was computed as the average of the standard deviations for the two treatment arms. Using these data, we computed the posterior distributions of the parameters. These results are shown in Table 4. We then summarized the results

Table 3  GUSTO I

                                Sample size   30-day mortality (%)
    t-PA + IV heparin              10,344             6.3
    SK + IV heparin                10,377             7.4
    SK + SC heparin                 9,796             7.2
    t-PA + SK + IV heparin         10,328             7.0

of GUSTO III in a similar manner as y_c = −2.5512, y_e = −2.517, and σ = 0.0472. In this study the sample sizes for the two arms are quite different, and it would be more accurate to generalize the results of the Bayesian approach to accommodate this. We have, however, approximated using an average standard deviation. For this second step of the analysis we used the posterior distribution obtained from the COBALT data as a prior distribution for incorporating the GUSTO III data. It should be noted that there are substantial correlations among the parameters in the posterior distribution obtained from the COBALT data, and hence the generalized formula (7) was used. The last column of Table 4 shows the approximate posterior distributions obtained after incorporating both the COBALT and GUSTO III data. From the mean and standard deviation of the posterior distribution of γ we can compute that the posterior probability that γ < 0, that is, that bolus t-PA is more effective than SK, is 0.80. Hence, these data provide only suggestive, but not definitive, evidence that bolus t-PA is even more effective than SK. We also computed the posterior probability that both β < 0 and γ < 0.5β. This can be interpreted as the probability that infusion t-PA is more effective than SK and that bolus t-PA is at least 50% as effective as infusion t-PA. This probability was 0.54. Hence there appears to be little evidence from these trials that bolus t-PA is at least 50% as effective as infusion t-PA relative to SK.

Table 4  Distribution of Parameters

                       Prior             After COBALT        After COBALT and GUSTO III
    α: mean ± SD        0 ± 10           −2.41 ± 0.074            −2.44 ± 0.054
    β: mean ± SD    −0.10 ± 0.039        −0.10 ± 0.039            −0.10 ± 0.039
    γ: mean ± SD        0 ± 10           −0.038 ± 0.0966          −0.056 ± 0.066
    ρ_αβ                  0                −0.53                    −0.72
    ρ_αγ                  0                −0.76                    −0.82
    ρ_βγ                  0                 0.40                     0.59

One can obtain from Eq. (13) the size of the clinical trial needed to establish that a regimen is at least 50% as effective as infusion t-PA relative to SK. We used Eq. (13) with Z = −2.57 from GUSTO I. With k = 0.5 we found that a ratio R = σ²/σ²_β of 0.059 is required to make the right-hand side equal 0.84, corresponding to 80% power. This means that the sample size required for the planned equivalence trial should be 1/0.059, or about 17 times, the size of GUSTO I. Even to perform an equivalence trial for establishing indirectly that a regimen is more effective than SK (k = 0), one obtains that a ratio R of 0.235 is required for 80% power. This corresponds to a sample size 4.25 times as large as that for GUSTO I. One can conclude that infusion t-PA was not sufficiently better than SK, and the difference was not strongly enough established in GUSTO I, to make therapeutic equivalence trials practical.

VII. CONCLUSION

In this chapter we have attempted to clarify the serious limitations of therapeutic equivalence trials. We have also tried to indicate that standard methods for the
planning and analysis of such trials are problematic and potentially misleading,
and we have described a new approach to planning and analysis of such trials.
This new approach is based on the premise that a therapeutic equivalence trial
is not interpretable unless one provides the quantitative evidence that the control
treatment is effective. The method is presented in a Bayesian context but has a
frequentist interpretation if flat priors are used for the α and γ parameters. An
important implication of the new approach is that reliable therapeutic equivalence
trials are not practical unless the evidence of the effectiveness of the control
treatment is overwhelming. Unless this is the case, the sample size needed for
the equivalence trial is many times larger than the sample size needed to establish
the effectiveness of the control treatment. Conventional methods for planning
therapeutic equivalence trials often miss this point because they take the maxi-
mum likelihood estimate of the effectiveness of the control treatment as if it were
a value known with certainty. This ignores the fact that the degree of effectiveness
of the control treatment is only imprecisely known unless the effect is overwhelm-
ingly significant. For example, if the effect is of borderline significance, then the
confidence interval for the size of the effect almost includes zero. Consequently,
many planned therapeutic equivalence trials, even large multicenter trials, cannot
demonstrate clinically relevant objectives. The methods described here for the
planning of such trials will hopefully help organizations to avoid such misdirected
efforts. A corollary to these considerations is that superiority trials, rather than therapeutic equivalence trials with marginally effective control treatments, are strongly preferable whenever possible.

REFERENCES

1. Makuch R, Simon R. Sample size requirements for evaluating a conservative therapy. Cancer Treatment Rep 1978; 62:1037–1040.
2. Simon R. Confidence intervals for reporting results from clinical trials. Ann Intern
Med 1986; 105:429–435.
3. Simon R. Why confidence intervals are useful tools in clinical therapeutics. J Bio-
pharm Stat 1993; 3:243–248.
4. Blackwelder W. Proving the null hypothesis in clinical trials. Control Clin Trials
1982; 3:345–353.
5. Frieman JA, Chalmers TC, Smith HJ. The importance of beta, the type II error, and
sample size in the design and interpretation of the randomized control trial: survey
of 71 ‘‘negative’’ trials. N Engl J Med 1978; 299:690–694.
6. Durrleman S, Simon R. Planning and monitoring of equivalence studies. Biometrics
1990; 46:329–336.
7. Fleming T. Evaluation of active control trials in AIDS. J Acquir Immune Defic Syndr
1990; 3:S82–S87.
8. Gould A. Another view of active-controlled trials. Control Clin Trials 1991; 12:
474–485.
9. Gould L. Sample sizes for event rate equivalence trials using prior information. Stat
Med 1993; 12:2001–2023.
10. Simon R. Bayesian design and analysis of active control clinical trials. Biometrics
1999; 55:484–487.
11. The GUSTO III Investigators. A comparison of reteplase with alteplase for acute
myocardial infarction. N Engl J Med 1997; 337:1118–1123.
12. The COBALT Investigators. A comparison of continuous infusion of alteplase with
double bolus administration for acute myocardial infarction. N Engl J Med 1997; 337:1124–1130.
13. The GUSTO Investigators. An international randomized trial comparing four throm-
bolytic strategies for acute myocardial infarction. N Engl J Med 1993; 329:673–
682.
11
Early Stopping of Cancer Clinical Trials

James J. Dignam
National Surgical Adjuvant Breast and Bowel Project and University of
Pittsburgh, Pittsburgh, Pennsylvania, and University of Chicago,
Chicago, Illinois
John Bryant and H. Samuel Wieand
National Surgical Adjuvant Breast and Bowel Project and University of
Pittsburgh, Pittsburgh, Pennsylvania

I. INTRODUCTION

Most cancer clinical trials use formal statistical monitoring rules to serve as
guidelines for possible early termination. Such rules provide for the possibility
of early stopping in response to positive trends that are sufficiently strong to
establish the treatment differences the clinical trial was designed to detect. At
the same time, they guard against prematurely terminating a trial on the basis of
initial positive results that may not be maintained with additional follow-up.
We may also consider stopping a trial before its scheduled end point if cur-
rent trends in the data indicate that eventual positive findings are unlikely. For
example, consider a trial comparing a new treatment to an established control regi-
men. Early termination for negative results may be called for if the data to date are
sufficient to rule out the possibility of improvements in efficacy that are large
enough to be clinically relevant. Alternatively, it may have become clear that study
accrual, drug compliance, follow-up compliance, or other factors have rendered
the study incapable of discovering a difference, whether or not one exists.
In this chapter we discuss methods for early stopping of cancer clinical
trials. We focus in particular on situations where evidence suggests that differ-
ences in efficacy between treatments will not be demonstrated, as this aspect of trial
monitoring has received less attention. For concreteness, we restrict our attention
to randomized clinical trials designed to compare two treatments, using survival
(or slightly more generally, time to some event) as the primary criterion. However,
the methods we discuss may be extended to other trial designs and efficacy criteria.
In most applications, one treatment represents an established regimen for the disease
and patient population in question, whereas the second is an experimental regimen
to be tested by randomized comparison with this control.
In this chapter, we first describe group sequential approaches to trial moni-
toring and outline a general framework for designing group sequential monitoring
rules. We then discuss the application of asymmetric monitoring boundaries to
clinical trials in situations where it is appropriate to plan for the possibility of
early termination in the face of negative results. Next, we consider various ap-
proaches to assessing futility in ongoing trials, including predictive methods such
as stochastic curtailment. We then briefly examine Bayesian methods for trial
monitoring and early stopping. National Surgical Adjuvant Breast and Bowel
Project (NSABP) Protocol B-14 is presented as a detailed example illustrating
the use of stochastic curtailment calculations and Bayesian methods. We also
give a second example to illustrate the use of a common asymmetric monitoring
plan adapted for use in Southwest Oncology Group (SWOG) Protocol SWOG-
8738. This approach is compared with a slight modification of a monitoring rule
proposed by Wieand et al. We conclude with a discussion of considerations rele-
vant to the choice of a monitoring plan.

II. GROUP SEQUENTIAL MONITORING RULES

The most common statistical monitoring rules are based on group sequential
procedures. Consider a clinical trial designed to compare two treatments, A and
B, using survival as the primary end point. The relative effectiveness of the two
treatments can be summarized by the parameter δ = ln(λ_B(t)/λ_A(t)), the logarithm of the ratio of hazard rates λ_B(t) and λ_A(t). We assume that this ratio is independent of time t. Thus, the hypothesis that the two treatments are equivalent is H_0: δ = 0, whereas values of δ > 0 indicate the superiority of A to B and values of δ < 0 indicate the superiority of B to A.
Suppose patients are accrued and assigned at random to receive either treat-
ment A or B. In a group sequential test of H 0, information is allowed to accumu-
late over time; at specified intervals an interim analysis is performed, and a deci-
sion is made whether to continue with the accumulation of additional information
or to stop and make some decision based on the information collected to date.
The accumulation of information is usually quantified by the total number of
deaths, and the comparison of treatments is based on the logrank statistic.
A large number of group sequential procedures have been proposed in this
setting. Most fall into a common general framework that we now describe: For k = 1, 2, . . . , K − 1, an interim analysis is scheduled to take place after m_k total deaths have occurred on both treatment arms, and a final analysis is scheduled to occur after the m_K-th death. Let L_k denote the logrank statistic computed at the kth analysis, let V_k denote its variance, and let Z_k represent the corresponding standardized statistic Z_k = L_k/√V_k. For each k = 1, 2, . . . , K, the real line R¹ is partitioned into a continuation region C_k and a stopping region S_k = R¹ − C_k; if Z_k ∈ C_k, we continue to the (k + 1)st analysis, but if Z_k ∈ S_k, we stop after the kth analysis. The stopping region for the Kth analysis is the entire real line, S_K = R¹.
Define t_k = m_k/m_K, so that t_k represents the fraction of the total information available at the kth analysis, and let W_k = Z_k·√t_k, k = 1, 2, . . . , K. Under appropriate conditions (roughly, sequential entry, randomized treatment assignment, loss to follow-up independent of entry time and treatment assignment), the W_k behave asymptotically like Brownian motion: Defining ∆t_k = t_k − t_{k−1} and η = δ·√m_K/2, the increments W_k − W_{k−1} are approximately uncorrelated normal random variables with means η·∆t_k and variances ∆t_k (1–3). This result permits the extension of sequential methods based on evolving sums of independent normal variates to the survival setting. In particular, the recursive integration scheme of Armitage et al. (4) may be used to compute the density

    f_k(w; η) = dPr{τ ≥ k, W_k ≤ w; η}/dw

where τ represents the number of the terminal analysis: Letting φ{·} represent the standard normal density, f_1(w; η) = φ{(w − η·t_1)/√t_1}/√t_1, and for k = 2, 3, . . . , K

    f_k(w; η) = ∫_{C_{k−1}} f_{k−1}(y; η)·[φ{(w − y − η·∆t_k)/√∆t_k}/√∆t_k] dy     (1)

From this result all operating characteristics of the group sequential procedure
(size, power, stopping probabilities, etc.) may be obtained.
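As an informal illustration, the Python sketch below (my own code, not part of the original text) applies the recursive integration idea of Eq. (1) with a simple trapezoid rule to compute the boundary-crossing probabilities of a symmetric two-sided design; the grid size and the illustrative O'Brien–Fleming-type boundary values are assumptions of the sketch.

    # Recursive numerical integration for group sequential crossing probabilities.
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import trapezoid

    def crossing_probs(t, b, eta=0.0, ngrid=2001):
        """t: information fractions; b: Z-scale critical values; eta: drift."""
        t = np.asarray(t, float)
        B = np.asarray(b, float) * np.sqrt(t)           # boundaries on the W scale
        cross = []
        w = np.linspace(-B[0], B[0], ngrid)             # continuation region C_1
        f = norm.pdf(w, loc=eta * t[0], scale=np.sqrt(t[0]))
        cross.append(1.0 - trapezoid(f, w))             # P(stop at analysis 1)
        for k in range(1, len(t)):
            dt = t[k] - t[k - 1]
            w_new = np.linspace(-B[k], B[k], ngrid)
            kern = norm.pdf(w_new[:, None], loc=w[None, :] + eta * dt,
                            scale=np.sqrt(dt))
            f_new = trapezoid(kern * f[None, :], w, axis=1)
            cross.append(trapezoid(f, w) - trapezoid(f_new, w_new))
            w, f = w_new, f_new
        return np.array(cross)

    # Illustrative O'Brien-Fleming-type Z boundaries for K = 5 equally spaced looks.
    t = [0.2, 0.4, 0.6, 0.8, 1.0]
    b = [4.56, 3.23, 2.63, 2.28, 2.04]
    print(crossing_probs(t, b, eta=0.0).sum())          # approximately 0.05 overall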
In cases where a two-sided symmetric test of the hypothesis H_0: δ = 0 is appropriate, the continuation regions are of the form C_k = {Z_k: −b_k ≤ Z_k ≤ b_k}, k = 1, 2, . . . , K − 1. If Z_k < −b_k at the kth analysis, we reject H_0 in favor of H_A: δ < 0, whereas if Z_k > b_k, H_0 is rejected in favor of H_A: δ > 0. If testing continues to the Kth and final analysis, a similar decision rule is applied except that if −b_K ≤ Z_K ≤ b_K, we accept H_0 rather than continuing to an additional analysis. The b_k are chosen to maintain a desired experiment-wise type I error rate Pr{Reject H_0 | H_0} = α, and the maximum duration m_K of the trial is selected to achieve power 1 − β against a specified alternative H_A: δ = δ_A, by determining that value of η that yields Pr{Reject H_0 | η} = 1 − β, and then setting

    m_K = 4·η²/δ²_A     (2)

Early stopping rules proposed by Haybittle (5), Pocock (6,7), O'Brien and Fleming (8), Wang and Tsiatis (9), and Fleming et al. (10) all fit into this general framework. In the method by Haybittle, a constant large critical value is used for analyses k = 1, 2, . . . , K − 1, and the final analysis is performed using a critical value corresponding to the desired overall type I error level. For a moderate number of analyses (say, K = 5) and a large critical value (z = 3.0 was suggested if one wishes to obtain an overall 0.05-level procedure), the method can be shown to achieve nearly the desired type I error rate, despite no adjustment to the final test boundary value. To obtain the final critical value that would yield precisely the desired level overall, Eq. (1) can be used. The Pocock bounds are obtained by constraining the z-critical values to be identical for each k: b_k = constant, k = 1, 2, . . . , K. For the O'Brien–Fleming procedure, the W-critical values are constant, so that b_k = constant/√t_k. Wang and Tsiatis (9) boundaries have the form b_k = constant·t_k^(∆−1/2), where ∆ is a specified constant. Fleming et al. (10) boundaries retain the desirable property of the O'Brien–Fleming procedure that the nominal level of the Kth analysis is nearly equal to α but avoid the extremely conservative nature of that procedure for small k when K > 3.
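These boundary families differ only in how the Z-scale critical values vary with information time. The short Python sketch below is my own illustration; the constants shown are approximate tabulated values for K = 5 and two-sided α = 0.05 and would in practice be determined with the recursive integration sketched earlier.

    # Z-scale boundary shapes in the Wang-Tsiatis family.
    import numpy as np

    def z_boundaries(t, constant, delta):
        """b_k = constant * t_k**(delta - 1/2);
        delta = 1/2 gives Pocock, delta = 0 gives O'Brien-Fleming."""
        return constant * np.asarray(t, float) ** (delta - 0.5)

    t = [0.2, 0.4, 0.6, 0.8, 1.0]
    print(z_boundaries(t, 2.04, delta=0.0))   # O'Brien-Fleming-like shape
    print(z_boundaries(t, 2.41, delta=0.5))   # Pocock-like (flat) shape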
Since most phase III trials compare a new therapy A to an accepted standard B, it may often be appropriate to consider one-sided hypothesis tests of H_0: δ = 0 versus H_A: δ > 0 and to make use of asymmetric continuation regions of the form C_k = {Z_k: a_k ≤ Z_k ≤ b_k} or, equivalently, C_k = {W_k: A_k ≤ W_k ≤ B_k}, where A_k = a_k·√t_k and B_k = b_k·√t_k. Crossing the upper boundary results in rejection of H_0 in favor of H_A, whereas crossing the lower boundary results in trial termination in recognition that the new therapy is unlikely to be materially better than the accepted standard or that H_0 is unlikely to be rejected with further follow-up. The design of such asymmetric monitoring plans presents no significant new computational difficulties. After restricting the choice of a_k and b_k, k = 1, 2, . . . , K, to some desired class of boundaries, Eq. (1) is used (generally in an iterative fashion) to fix both the size of the monitoring procedure and its power against a suitable alternative or set of alternatives. Equation (2) is used to determine the maximum duration of the trial in terms of observed deaths.
In this context, DeMets and Ware (11) proposed the use of asymmetric Pocock boundaries: The lower boundary points a_k are fixed at some specified value independent of k (the range −2.0 ≤ a_k ≤ −0.5 is tabled), and then a constant value for the b_k is determined by setting the type I error rate to α. A second suggestion was to use a test with boundaries that are motivated by their similarity to those of a sequential probability ratio test (12). This procedure is most easily expressed in terms of its W-critical values, which are linear in information time:

    A_k = −(Z′_L/η) + (η/2)·t_k,   B_k = (Z′_U/η) + (η/2)·t_k,   k = 1, 2, . . . , K     (3)
Here Z′_L = ln((1 − α)/β), and Z′_U and η are chosen to satisfy type I and type II error requirements by iterative use of Eq. (1). The maximum number of observed deaths is given by Eq. (2), as before. In a subsequent publication (13), DeMets and Ware recommended that the Wald-like lower boundary should be retained but the upper boundary should be replaced by an O'Brien–Fleming boundary B_k ≡ B, k = 1, 2, . . . , K. Although iteration is still required to determine η and B, the value of B is reasonably close to the symmetric O'Brien–Fleming bound at level 2α.
Whitehead and Stratton (14) indicate how the sequential triangular test (15,16) may be adapted to the group sequential setting in which K analyses will be carried out at equally spaced intervals of information time, t_k = k/K, k = 1, 2, . . . , K. Suppose first that it is desired to achieve type I error rate α and power 1 − α against the alternative δ = δ_A. The W-critical values are

    A_k = −Q + (3η/4)·t_k,   B_k = Q + (η/4)·t_k,   k = 1, 2, . . . , K

where η satisfies η² + (2.332/√K)·η − 8·ln(1/2α) = 0 and Q = 2·ln(1/2α)/η − 0.583/√K. The maximum number of observed deaths required to achieve this is given by Eq. (2). If instead one wishes to achieve a power of 1 − β ≠ 1 − α against the alternative δ = δ_A, the operating characteristic curve of the fixed sample size test satisfying Pr{Reject H_0 | δ = 0} = α, Pr{Reject H_0 | δ = δ_A} = 1 − β may be used to determine an alternative δ′_A such that Pr{Reject H_0 | δ = δ′_A} = 1 − α. Then δ′_A should be used in place of δ_A in Eq. (2). The adjustment factor of 0.583/√K in the formula for Q is an approximate correction for exact results that hold in the case of a purely sequential procedure. Slightly more accurate results may be obtained by iteratively determining η and Q to satisfy type I and type II error constraints via Eq. (1).
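A small Python sketch of these W-scale boundaries and the implied maximum number of events follows; it is my own illustration under the equally spaced-looks assumption, and the choices K = 4, α = 0.05, and a planning hazard ratio of 0.6 are purely illustrative.

    # Group sequential triangular test boundaries (equally spaced looks).
    import numpy as np

    def triangular_boundaries(K, alpha, delta_A):
        c = 8.0 * np.log(1.0 / (2.0 * alpha))
        a = 2.332 / np.sqrt(K)
        eta = (-a + np.sqrt(a * a + 4.0 * c)) / 2.0     # positive root of the quadratic
        Q = 2.0 * np.log(1.0 / (2.0 * alpha)) / eta - 0.583 / np.sqrt(K)
        t = np.arange(1, K + 1) / K
        A = -Q + (3.0 * eta / 4.0) * t                  # lower (futility) boundary, W scale
        B = Q + (eta / 4.0) * t                         # upper (efficacy) boundary, W scale
        m_K = 4.0 * eta**2 / delta_A**2                 # maximum number of events, Eq. (2)
        return t, A, B, m_K

    t, A, B, m_K = triangular_boundaries(K=4, alpha=0.05, delta_A=np.log(1 / 0.6))
    print(np.round(A, 2), np.round(B, 2), round(m_K))   # boundaries meet at t = 1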
The triangular test approximately minimizes the expected number of events
at termination under the alternative δ = δ′_A/2. Jennison (17) considered group sequential tests that minimize the expected number of events under various alterna-
tives and presented parametric families of tests that are nearly optimal in this
sense. These are specified in terms of spending functions, similar to Lan and
DeMets (18).
The power boundaries of Wang and Tsiatis (9) may be adapted for use in testing hypotheses of the form H_0: δ = 0 versus H_A: δ = δ_A (19,20). W-critical values are of the form

    A_k = −Q·t_k^∆ + η·t_k,   B_k = (η − Q)·t_k^∆,   k = 1, 2, . . . , K

where the constant ∆ is specified by the trial designer; η and Q are determined iteratively to satisfy type I and type II error constraints using Eq. (1). The maximum number of required deaths is given by Eq. (2). ∆ = 0 corresponds essentially to a design using an upper O'Brien–Fleming bound to test H_0: δ = 0 and a lower O'Brien–Fleming bound to test H_A: δ = δ_A. ∆ = 1/2 results in Pocock-like boundaries. In general, larger values of ∆ correspond to a greater willingness to
terminate at an earlier stage. Emerson and Fleming (19) compared the efficiencies
of one-sided symmetric designs having power boundaries to the results of Jenni-
son (17) and concluded that the restriction to boundaries of this form results in
negligible loss of efficiency. Pampallona and Tsiatis (20) provide a comparison
of the operating characteristics of asymmetric one-sided designs based on power
boundaries with the designs proposed by DeMets and Ware (11,13). Both Emer-
son and Fleming (19) and Pampallona and Tsiatis (20) also consider two-sided
group sequential procedures that allow for the possibility of early stopping in
favor of the null hypothesis. These procedures are similar in spirit to the double
triangular test (14).
Wieand, Schroeder, and O’Fallon (21) proposed a method for early termi-
nation of trials when there appears to be no benefit after a substantive portion
of total events has been observed, which is tantamount to adopting asymmetric
boundaries after sufficient information has been obtained to guarantee high power
against alternatives of interest. The method is an extension of earlier work by
Ellenberg and Eisenberger (22) and Wieand and Therneau (23) and was first
considered for multistage trials in advanced disease, where patient outcomes are
poor and there is likely to be substantial information regarding treatment efficacy
while accrual is still underway. In its simplest form, the proposed rule calls for
performing an interim analysis when one half of the required events have taken
place. At that time, if the event rate on the experimental arm exceeds that on the
control arm, then termination of the trial should be considered. It can be shown
that the adoption of this rule has essentially no effect on the size of a nominal 0.05-level test of equality of hazards and results in a loss of power of ≤0.02 for any alternative hypothesis indicating a treatment benefit, compared with a fixed sample size test of that same alternative at the scheduled definitive analysis (21). Similarly, if this rule is superimposed on symmetric two-sided boundaries by replacing the lower boundary a_k with 0 for any information time t_k greater than or equal to one half and the result is viewed as an asymmetric group sequential procedure testing a one-sided hypothesis, there is almost no change in
the operating characteristics. In this implementation, the stopping rule calls for
early termination if at any scheduled interim analysis at or after the halfway
point the experimental treatment is observed to be no more efficacious than the
control.

III. CONDITIONAL POWER METHODS


A. Stochastic Curtailment
A commonly applied predictive approach to early stopping makes use of the
concept of stochastic curtailment (24–27). The stochastic curtailment approach
requires a computation of conditional power functions, defined as
    γ = Pr(Z(1) ∈ R | D, H)     (4)

where Z(1) represents a test statistic to be computed at the end of the trial, R is the rejection region of this test, D represents the current data, and H denotes either the null hypothesis H_0 or an alternative hypothesis H_a. If this conditional probability is sufficiently large under H_0, one may decide to stop and immediately reject H_0. On the other hand, if under a ‘‘realistic’’ alternative hypothesis H_a this probability is sufficiently small or, equivalently, if 1 − γ = Pr(Z(1) ∉ R | D, H_a) is sufficiently large, we may decide that continuation is futile because H_0 ultimately will not be rejected regardless of further observations. This is the case of interest when considering early stopping for negative results. In an example presented later, we condition on various alternatives in favor of the treatment to assess the potential for a trial to reverse from early interim analysis results unexpectedly favoring the control group.
In Section II it was noted that the normalized logrank statistics asymptotically behave like Brownian motion. This provides an easy way to compute conditional power over a range of alternatives (26,27):

    C(t) = 1 − Φ[(Z_α − Z(t)√t − η(1 − t))/√(1 − t)]     (5)

In Eq. (5), Φ(·) is the standard normal distribution function, t is the fraction of total events for definitive analysis that have occurred to date (so-called information time; this was defined for prespecified increments as t_k = m_k/m_K in Sect. II), Z(t) is the current standard normal variate associated with the logrank test, Z_α is the critical value against which the final test statistic Z(1) is to be compared, and η is defined in Section II.
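A minimal Python sketch of Eq. (5) follows (my own code); the interim Z value, information fraction, and design values in the example call are hypothetical.

    # Conditional power under a specified drift, following Eq. (5).
    from math import sqrt, log
    from scipy.stats import norm

    def conditional_power(z_t, t, m_K, delta, z_alpha=1.645):
        """P(final Z exceeds z_alpha | Z(t) = z_t), with drift eta = delta*sqrt(m_K)/2."""
        eta = delta * sqrt(m_K) / 2.0
        return 1.0 - norm.cdf((z_alpha - z_t * sqrt(t) - eta * (1.0 - t)) / sqrt(1.0 - t))

    # A hypothetical interim look at 40% of 115 planned events, with the current
    # logrank Z slightly unfavorable, evaluated under a 40% hazard reduction.
    print(conditional_power(z_t=-0.5, t=0.40, m_K=115, delta=log(1 / 0.6)))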

B. Predictive Power and Current Data Methods


Stochastic curtailment has been criticized on the basis that it requires conditioning
on the current data and at the same time an alternative hypothesis that may be
unlikely to have given rise to that data. In any case, since H a must be specified,
the method always depends on unknown information at the time of the decision.
This criticism has motivated methods that take an unconditional predictive ap-
proach in assessing the consequences of continuing the trial (28–30). These so-
called predictive power procedures use weighted averages of conditional power over values of the alternative, specified through a distribution

    Pr(Z(1) ∈ R | D) = ∫ Pr(Z(1) ∈ R | D, H) Pr(H | D) dH     (6)

A Bayesian formulation is a natural setting for this approach. If a noninformative prior distribution is used for the distribution of the parameter of interest (e.g., H
expressed as a difference of means, difference or ratio of proportions, or hazard ratio), then the posterior distribution in this weighted average of conditional
power depends only on the current data. Alternatively, the current (observed)
alternative could be used in the conditional power formulation in Eq. (5) to project
power resulting from further follow-up according to the pattern of observations
thus far (26,29,30).
When an informative prior distribution is used, then we obtain a fully
Bayesian approach, described in the following section.

IV. A BAYESIAN APPROACH TO ASSESS EARLY TERMINATION FOR NO BENEFIT

Recently, interest has grown in the application of Bayesian statistical methodology to problems in clinical trial monitoring and early stopping (31–35). Although
Bayesian analyses entail the difficult and sometimes controversial task of speci-
fying prior information, if the goal of any clinical trial is ultimately to influence
clinical practice, its results must be sufficiently strong to prove compelling to a
community of clinical researchers whose prior opinions and experiences are di-
verse. Thus, in situations where early termination is considered, an analysis of
the robustness of conclusions over a range of priors thought to resemble the a
priori beliefs of reasonable members of the clinical research community should
provide insight into the impact that trial results might be expected to exert on
clinical practice. This is often an overlooked aspect of trial monitoring, as early
stopping can result in diminished impact of the findings and continued con-
troversy and delay while results are debated and large expensive trials are repli-
cated.
Bayesian calculations for clinical trial monitoring can be motivated by
adopting the log hazard ratio δ defined in Section II as a summary measure of
relative treatment efficacy (32). We denote the partial maximum likelihood esti-
mate of the log hazard ratio as δ̂. δ has an approximately normal likelihood with
mean δ̂ and variance 4/m, where m is the total number of events currently ob-
served. We assume a normal prior distribution with specified mean δ p and vari-
ance σ 2p. The values of δ p and σ 2p may be determined to reflect an individual’s
prior level of enthusiasm regarding the efficacy of a proposed regimen, and these
parameters may be altered to reflect varying degrees of enthusiasm. In this spirit,
the notion of ‘‘skeptical’’ and ‘‘optimistic’’ prior distributions is discussed by
numerous authors (31–33,36). It is suggested that a skeptical member of the
clinical community may adopt a prior for δ that is centered at 0, reflecting the
unfortunate fact that relatively few regimens tested lead to material improvements
in outcome. Nevertheless, the trial designers will have specified a planning alter-
native for δ, δ = δ_A, say, that they must believe is both clinically meaningful
and relatively probable. If the skeptic is reasonably open-minded, he or she would
be willing to admit some probability that this effect could be achieved, perhaps
on the order of 5%. Using these considerations, a skeptical prior with mean δp
= 0 and standard deviation σp = δA/1.645 is specified. Using similar logic, one
might be inclined to consider the trial organizers as being among the most opti-
mistic of its proponents, but even they would be reasonably compelled to admit
as much as a 5% chance that the proposed regimen will have no effect, i.e., that
δ ≤ 0. It therefore may be reasonable to model an optimist’s prior by setting δp
= δA and σp = δA/1.645.
For a given prior distribution and the observed likelihood, a posterior den-
sity can be computed for δ, and the current weight of evidence for benefit can
thus be assessed directly by observing the probability that the effect is in some
specified range of interest, say δ > 0, indicating a benefit, or δ ≥ δALT > 0,
corresponding to some clinically relevant effect size δALT. Following well-known
results from Bayesian inference using the normal distribution, the posterior distri-
bution for δ has mean and variance given by
δpost = (n0δp + mδ̂)/(n0 + m)   and   σpost² = 4/(n0 + m)
where n0 = 4/σp². This quantity is thought of as the prior ‘‘sample size,’’ since
the information in the prior distribution is equivalent to that in an hypothetical
trial yielding a log hazard ratio estimate of δp based on this number of events.
From the posterior distribution one can also formulate a predictive distribu-
tion to assess the consequences of continuing the trial for some fixed additional
number of failures. As before, let m be the number of events observed thus far
and let n be the number of additional events to be observed. Denote by δ̂n the log
relative risk that maximizes that portion of the partial likelihood corresponding to
failures m + 1, m + 2, . . . , m + n. The predictive distribution of δ̂n is normal
with the same mean as the posterior distribution and variance σpred² = 4/(n0 + m)
+ 4/n.
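
To make these expressions concrete, the following short sketch (in Python; it is
illustrative only and not taken from the chapter, and the function and argument
names are ours) evaluates the posterior and predictive summaries for a set of
hypothetical inputs.

# Illustrative sketch: posterior and predictive summaries for the log hazard
# ratio delta under the normal approximation described above.
from math import sqrt
from scipy.stats import norm

def posterior(delta_hat, m, delta_p, sigma_p):
    # n0 = 4/sigma_p^2 plays the role of the prior "sample size" in events
    n0 = 4.0 / sigma_p ** 2
    mean = (n0 * delta_p + m * delta_hat) / (n0 + m)
    sd = sqrt(4.0 / (n0 + m))
    return mean, sd

def predictive(delta_hat, m, delta_p, sigma_p, n):
    # Predictive distribution of the estimate from the next n events: same
    # mean as the posterior, variance 4/(n0 + m) + 4/n.
    n0 = 4.0 / sigma_p ** 2
    mean, _ = posterior(delta_hat, m, delta_p, sigma_p)
    sd = sqrt(4.0 / (n0 + m) + 4.0 / n)
    return mean, sd

# Hypothetical inputs: 100 events observed, delta_hat = -0.30, and an
# "optimistic" prior centered at delta_A = 0.40 with sigma_p = delta_A/1.645.
post_mean, post_sd = posterior(-0.30, 100, 0.40, 0.40 / 1.645)
print("Pr(delta > 0 | data) =", 1 - norm.cdf(0.0, post_mean, post_sd))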

V. EXAMPLES
A. A Trial Stopped Early for No Benefit
In 1982, NSABP initiated Protocol B-14, a double-blind comparison of 5 years
of tamoxifen (10 mg b.i.d.) with placebo in patients having estrogen receptor-
positive tumors and no axillary node involvement. The first report of findings in
1989 indicated improved disease-free survival (DFS, defined as time to either
breast cancer recurrence, contralateral breast cancer or other new primary cancer,
or death from any cause, 83% vs. 77% event free at 4 years, p < 0.00001).
Subsequent follow-up through 10 years has confirmed this benefit, with 69% of
patients receiving tamoxifen remaining event free compared with 57% of placebo
patients and has also shown a significant survival advantage at 10 years (80%
tamoxifen vs. 76% placebo, p = 0.02) (37).
In April 1987 a second randomization was initiated. Patients who had re-
ceived tamoxifen and were event free through 5 years were rerandomized to
either continue tamoxifen for an additional 5 years or to receive placebo. Between
April 1987 and December 1993, 1172 patients were rerandomized. To provide
for a 0.05 level one-sided test with a power of at least 0.85 under the assumed
alternative of a 40% reduction in DFS failure rate, a total of 115 events would
be required before definitive analysis. Four interim analyses were scheduled at
approximately equal increments of information time. Stopping boundaries were
obtained using the method of Fleming et al. (10) at the two-sided 0.10 level.
Confidential interim end point analyses were to be compiled by the study statisti-
cian and presented to the independent Data Monitoring Committee of the
NSABP.
At the first interim analysis, based on all data received as of September
30, 1993, more events had occurred in the tamoxifen group (28 events) than
among those receiving placebo (18 events). There had been six deaths on the
placebo arm and nine among tamoxifen patients. By the second interim analysis
(data received as of September 30, 1994), there were 24 events on the placebo
arm and 43 on the tamoxifen arm (relative risk = 0.57, nominal 2p = 0.03),
with 10 deaths on the placebo arm and 19 among patients receiving tamoxifen.
Although there was concern regarding the possibility of a less favorable outcome
for patients continuing tamoxifen, we recommended that the trial be continued
to the next scheduled interim analysis because the early stopping criterion was
not achieved (2α = 0.0030) and follow-up for most patients was relatively short
(mean, 3.75 years). At that time, we computed the conditional probability of
rejecting the null hypothesis at the scheduled final analysis (115 events), given
the current data and a range of alternative hypotheses [Eq. (5)]. Results suggested
that even under extremely optimistic assumptions concerning the true state of
nature, the null hypothesis could almost certainly not be rejected: Even under the
assumption of a 67% reduction in failures, the conditional probability of eventual
rejection was less than 5%. We also considered an ‘‘extended trial,’’ repeating
conditional power calculations as if we had intended to observe a total of 229
events before final analysis (this number of events would allow for the detec-
tion of a 30% reduction in event rate with a power of 85%). Results indicated
that if the trial was continued and the underlying relative risk was actually
strongly in favor of tamoxifen, then the null hypothesis could possibly be rejected
(Fig. 1).
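
The sketch below gives one generic calculation of this kind, under a normal
approximation to the logrank score statistic with information taken as one
quarter of the number of events; it is illustrative only, need not agree in
every detail with Eq. (5), and the inputs shown are hypothetical.

# Generic conditional-power sketch. Convention assumed here: delta > 0 favors
# the experimental arm, and H0 is rejected for a large positive score.
from math import sqrt, log
from scipy.stats import norm

def conditional_power(score, events_now, events_final, delta, alpha=0.05):
    # The logrank score has approximate mean delta*V and variance V, with
    # information V = events/4; the increment from now to the final analysis
    # is independent of the data observed so far.
    v_now, v_final = events_now / 4.0, events_final / 4.0
    crit = norm.ppf(1 - alpha) * sqrt(v_final)   # one-sided rejection bound
    mean_future = delta * (v_final - v_now)
    sd_future = sqrt(v_final - v_now)
    return 1.0 - norm.cdf((crit - score - mean_future) / sd_future)

# Hypothetical inputs: 88 of 115 planned events, a current score of -12
# (results running against the experimental arm), projected under an assumed
# 40% reduction in hazard (delta = -ln(0.6)).
print(conditional_power(score=-12.0, events_now=88, events_final=115,
                        delta=-log(0.6)))
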
At the third interim analysis (data received as of June 30, 1995), there were
32 events on the placebo arm and 56 on the treatment arm (relative risk = 0.59).
Four-year DFS was 92% for patients on placebo and 86% for patients on tamoxi-
fen. The boundary for early termination (2α = 0.0035) was not crossed (2p =
0.015). However, calculations showed that even if the remaining 27 events (of
115) all occurred on the placebo arm, the logrank statistic would not approach
Figure 1 Conditional probability of finding a significant benefit for tamoxifen in the
NSABP B-14 trial if definitive analysis was deferred to the 229th event. Probabilities are
conditional on the results of the second and third interim analysis and are graphed as a
function of the assumed placebo/tamoxifen relative risk. The solid line is based on Eq.
(5) (27); the dashed line is based on a binomial calculation following from a Poisson
occurrence assumption. Reprinted from (41) with permission from Elsevier Science.

significance. The imbalance in deaths also persisted (13 placebo arm, 23 tamoxi-
fen, 2p = 0.11). For the extended trial allowing follow-up to 229 events, Figure
1 shows that conditional power was now about 15% under the planning alterna-
tive of 40% reduction in relative risk and was only 50% under the more unlikely
assumption of a twofold benefit for continuing tamoxifen. At this time, we also
considered the early stopping rule proposed by Wieand et al. (21) discussed ear-
lier. To illustrate the consequences of superimposing this rule on the established
monitoring boundaries of this trial, suppose the lower boundaries at the third,
fourth, and final analysis were replaced with zeros. Then the (upper) level of
significance is reduced from 0.0501 to 0.0496 and the power under the alternative
of a 40% reduction in event rate is reduced from 0.8613 to 0.8596. By this interim
analysis, considerably more events had occurred on the treatment arm than on
the control arm. Had a conservative ‘‘futility’’ rule such as that described
above been incorporated into the monitoring plan, it would have suggested termi-
nation by this time.
As discussed, the approaches taken in considering the early termination of
the B-14 study were frequentist. We subsequently also applied Bayesian methods
for comparative purposes and to attempt to address the broader question of con-
sensus in clinical trials, as the closure of the B-14 study had prompted some
criticism from the cancer research community (38–40).
Figure 2 shows the log hazard ratio likelihood for the B-14 data at the third
interim analysis, having mean ln(0.586) = −0.534 and standard deviation 0.213.
Also shown is an ‘‘optimistic’’ prior distribution centered at δp = δA = 0.511,

Figure 2 Prior distribution, likelihood, and posterior distribution of the logged placebo/
tamoxifen hazard ratio after the third interim analysis of NSABP B-14. An ‘‘optimistic’’
normal prior distribution is assumed, under which the most probable treatment effect is
a 40% reduction in risk, with only a 5% prior probability that the treatment provides no
benefit. The resulting posterior distribution contains about 13% probability mass to the
right of ln(hazard ratio) δ = 0. Reprinted from (41) with permission from Elsevier Science.
corresponding to a 40% reduction in failures for patients continuing on tamoxifen
relative to those stopping at 5 years. The prior standard deviation is σp = δA/
1.645 = 0.311. The resulting posterior distribution has mean −0.199 and stan-
dard deviation 0.176 (also shown). From this distribution one can determine that
the posterior probability that δ > 0 is 1 − Φ(0.199/0.176) = 0.13, where Φ(⋅) is
the standard normal distribution function. To the degree that this prior distribution
represents that of a clinical researcher who was initially very optimistic, these
calculations suggest that even in the face of the negative trial findings as of the
third interim analysis, this individual would still assign a small but nonnegligible
probability to the possibility that continued tamoxifen has some benefit. On the
other hand, this individual would now assign essentially no probability (≈3 ×
10⁻⁵) to the possibility that the benefit is as large as a 40% reduction in risk.
For the prior distribution specified above and the observations at the third
interim analysis, we also computed the predictive distribution. If the trial were
to be extended to allow a total of 229 events or 141 events beyond the third
interim analysis, we obtain µpred = −0.199 and σpred = 0.243. The predictive
probability of a significant treatment comparison following the 229th event is
determined as follows: If δ̂m+n denotes the estimated log relative risk based on
all the data, then the ultimate result will be significant at the 0.05 level if δ̂m+n
> 1.645 ⋅ √(4/229) = 0.217. Since approximately δ̂m+n ≈ (88δ̂ + 141δ̂n)/229,
this event requires that δ̂n > 0.686. The predictive probability of this occurrence
is 1 − Φ({0.686 + 0.199}/0.243) ≈ 0.0001.
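
The quantities quoted in the preceding two paragraphs can be checked with a few
lines of code. The sketch below (illustrative only) simply re-applies the
formulas of Section IV to the values given above (m = 88 events, δ̂ = ln(0.586),
the optimistic prior with δp = 0.511 and σp = 0.511/1.645, and n = 141 further
events); it reproduces the stated results up to rounding.

# Sketch reproducing (to rounding) the Bayesian quantities quoted above for
# the third interim analysis of NSABP B-14, using the formulas of Section IV.
from math import sqrt, log
from scipy.stats import norm

m, delta_hat = 88, log(0.586)             # 88 events; ln(0.586) = -0.534
delta_p, sigma_p = 0.511, 0.511 / 1.645   # "optimistic" prior
n0 = 4.0 / sigma_p ** 2                   # prior "sample size" in events

post_mean = (n0 * delta_p + m * delta_hat) / (n0 + m)   # about -0.199
post_sd = sqrt(4.0 / (n0 + m))                          # about 0.176
print("Pr(delta > 0 | data) =", 1 - norm.cdf(0, post_mean, post_sd))  # ~0.13

# Predictive probability of a significant result after 229 total events
n = 141                                    # additional events to be observed
pred_sd = sqrt(4.0 / (n0 + m) + 4.0 / n)   # about 0.243
crit = 1.645 * sqrt(4.0 / 229)             # 0.217 on the delta-hat scale
needed = (229 * crit - m * delta_hat) / n  # delta_hat_n must exceed ~0.686
print("predictive prob =", 1 - norm.cdf(needed, post_mean, pred_sd))  # ~1e-4
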
Subsequent follow-up of Protocol B-14 continues to support the findings
that prompted early closure of this study. By 1 year subsequent to publication
(data through December 31, 1996), 135 total events had occurred, 85 among
patients who had received continued tamoxifen and 50 among patients rerandom-
ized to placebo (relative risk = 0.60, nominal 2p = 0.002). There were 36 deaths
among tamoxifen patients and 17 among control patients (nominal 2p = 0.01).
A more extensive discussion of this case study has been published else-
where (41).

B. Effect of Two Easily Applied Rules in the Adjuvant and Advanced Disease Setting
The trial presented in the preceding example was not designed with an asymmet-
ric rule for stopping in the face of negative results. It was partly for this reason
that the data monitoring committee and investigators considered several methods
of analysis before reaching a decision to stop the trial. Although there will some-
times be special circumstances that require analyses not specified a priori, it is
preferable to determine in advance whether the considerations for stopping are
asymmetric in nature and, if so, to include an appropriate asymmetric stopping
rule in the initial protocol design.
Computer packages (East, Cytel Software Corp., Cambridge, MA; and
PEST3, MPS Research Unit, University of Reading, Reading, UK) are available
to help with the design of studies using any of the rules discussed in Section II
(see Emerson [42] for a review). Alternatively, one may modify a ‘‘standard’’
symmetric rule (e.g., O’Brien–Fleming boundaries) by retaining the upper
boundary for early stopping due to positive results but replacing the lower bound-
ary to achieve a more appropriate rule for stopping due to negative results. It is
often the case that this will alter the operating characteristics of the original plan
so little that no additional iterative computations are required.
To illustrate this, suppose one had designed a trial to test the hypothesis
H0: δ = 0 versus δ > 0 to have 90% power versus the alternative HA: δ = ln(1.5)
with a one-sided α of 0.025, using a Fleming et al. (10) rule with three interim
looks. Such a design would require interim looks when there had been 66, 132,
and 198 events, with final analysis at 264 events. From Table 1a of Fleming et
al. (10), one such rule would be to stop and conclude that the treatment was
beneficial if the standardized logrank statistic Z exceeded 2.81 at the first look,
2.74 at the second look, or 2.67 at the third look. The null hypothesis would be
rejected at the end of the trial if Z exceeded 2.02. If a symmetric lower boundary
were considered inappropriate, one might choose to replace it by simply testing
the alternate hypothesis HA: δ = ln(1.5) versus δ < ln(1.5) at some small signifi-
cance level (e.g., α = 0.005) at each interim look (this suggestion is adapted
from the monitoring rules in SWOG Protocol SWOG-8738, an advanced disease
lung cancer trial). In the framework of Section II, this rule is asymptotically
equivalent to stopping if the standardized Z is < −0.93 at the first look, or <
−0.25 at the second look, or < 0.28 at the third look (a fact that one does not
need to know to use the rule, since the alternative hypothesis can be tested directly
using standard statistical software, e.g., SAS Proc PHREG [43]). It follows that
if the experimental treatment adds no additional benefit to the standard regimen,
there would be a 0.18 chance of stopping at the first look, a 0.25 chance of
stopping at the second look, and a 0.22 chance of stopping at the third look (Table
1). Adding this rule does not significantly change the experiment-wise type I
error rate (α = 0.0247) and would only lower the power to detect a treatment
effect of δA = ln(1.5) from 0.902 to 0.899.
Following Wieand et al. (21), an alternative but equally simple way to
modify the symmetric Fleming et al. boundaries would be simply to replace the
lower Z-critical values with zeroes at each interim analysis at or beyond the half-
way point. Using this approach, if the experimental arm offered no additional
benefit to that of the standard regimen, the probability of stopping at the first
look would be very small (0.0025), but the probability of stopping at the second
look would be 0.4975, and at the third look would be 0.10 (Table 1). Again no
special program is needed to implement this rule, and its use has a negligible
effect on the original operating characteristics of the group sequential procedure
(α = 0.0248, power = 0.900).
Table 1 Probability of Stopping at Each of Three Early Looks Using Two Easily
Applied Rules

                     Probability of stopping        Probability of stopping
                     if treatments are equivalent   under alternative δ = ln(1.5)

No. of events        SWOG         WSO               SWOG         WSO

 66                  0.18         0.0025            0.005        0.000
132                  0.25         0.4975            0.004        0.010
198                  0.22         0.10              0.003        0.001

SWOG, Southwest Oncology Group.
WSO, Wieand, Schroeder, O’Fallon (21).

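The entries in Table 1 can be checked by simulation. In the sketch below
(illustrative only), the standardized logrank statistic is simulated as a
process with independent normal increments at information times 66/4, 132/4,
and 198/4; the upper Fleming et al. boundaries are retained, the SWOG-style
lower boundaries are those given above (−0.93, −0.25, 0.28), and for the Wieand
et al. rule the lower critical values at the second and third looks are zero
while the first look is assumed to keep the symmetric value −2.81. The reported
probabilities are those of stopping via the lower boundary and agree with
Table 1 to within simulation error.

# Simulation sketch: probability of early stopping for lack of benefit at
# each look, under H0 (delta = 0) and the alternative delta = ln(1.5).
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
v = np.array([66, 132, 198]) / 4.0             # information at the three looks
upper = np.array([2.81, 2.74, 2.67])           # Fleming et al. upper bounds
rules = {"SWOG-style": np.array([-0.93, -0.25, 0.28]),
         "Wieand et al.": np.array([-2.81, 0.0, 0.0])}
dv = np.diff(np.concatenate(([0.0], v)))       # information increments

for delta in (0.0, np.log(1.5)):
    # score statistic S_k ~ N(delta*V_k, V_k) with independent increments
    incr = rng.normal(delta * dv, np.sqrt(dv), size=(N, 3))
    z = np.cumsum(incr, axis=1) / np.sqrt(v)   # standardized Z at each look
    for name, lower in rules.items():
        alive = np.ones(N, dtype=bool)
        probs = []
        for k in range(3):
            probs.append((alive & (z[:, k] < lower[k])).mean())
            alive &= (z[:, k] >= lower[k]) & (z[:, k] <= upper[k])
        print(f"delta = {delta:.3f}, {name}:",
              ", ".join(f"{p:.3f}" for p in probs))
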
The decision of which rule to use will depend on several factors, including
the likelihood that patients will still be receiving the treatment at the time of the
early looks and whether it is likely that the experimental treatment would be used
outside the setting of the clinical trial before its results are presented. To illustrate
this, we consider two scenarios.
Scenario 1: The treatment is being tested in an advanced disease trial where
the median survival with conventional therapy has been 6 months and the alterna-
tive of interest is to see if the experimental treatment results in at least a 9-month
median survival. Under the assumption of constant hazards, this is equivalent to
the hypothesis δ A ⫽ ln(1.5). Suppose one would expect the accrual rate to such
a study to be 150 patients per year. If one designed the study to accrue 326
patients, which would take 26 months, one would need to follow them for slightly
less than 4.5 additional months to observe 264 deaths if the experimental regimen
offers no additional benefit to the standard regimen or an additional 7.5 months
if δ = δA = ln(1.5). If the experimental treatment offers no additional benefit,
one would expect 146 patients to have been entered when 66 deaths have oc-
curred, 227 patients to be entered when 132 deaths have occurred, and 299 pa-
tients to have been entered when 198 deaths have occurred (Table 2). Early stop-
ping after 66 deaths have occurred would prevent 180 patients from being entered
to the trial and stopping after 132 deaths would prevent 99 patients from being
entered. Thus, the potential benefit of stopping in the face of negative results
would be to prevent a substantial number of patients from receiving the appar-
ently ineffective experimental regimen, in addition to allowing early reporting
of the results (the savings in time for reporting the results would be approximately
19, 12, and 7 months according to whether the trial stopped at the first, second,
or third look, respectively).
Scenario 2: The treatment is being tested in an adjuvant trial where the
expected hazard rate is 0.0277 deaths/person-year, corresponding to a 5-year sur-
vival rate of slightly more than 87%. If one is now looking for an alternative δA
Table 2 Effect of Early Stopping on Accrual and Reporting Time

                 No. of patients            No. of patients            Time until final
                 accrued                    to be accrued              analysis (mo)

                 Advanced      Adjuvant     Advanced      Adjuvant     Advanced      Adjuvant
No. of events    disease trial trial        disease trial trial        disease trial trial

 66              146           1975         180           625          19            36
132              227           2600          99             0          12            24
198              299           2600          27             0           7            12
264              326           2600           0             0           0             0

= ln(1.5) (which would roughly correspond to increasing the 5-year survival rate
to 91%) and if the accrual rate was approximately 800 patients per year, a reason-
able plan would be to accrue 2600 patients, which would take approximately 39
months, and to analyze the data when 264 deaths have occurred, which should
occur approximately 66 months after initiation of the trial, if the experimental
regimen offers no additional benefit to the standard regimen (75 months after
initiation if δ = δA = ln{1.5}). With this accrual and event rate, 1975 of the
expected 2600 patients will have been entered by the time 66 events have oc-
curred if the experimental regimen offers no additional benefit to the standard
regimen (Table 2). The second and third looks would occur approximately 3 and
15 months after the termination of accrual, so early stopping after these analyses
would have no effect on the number of patients entering the trial, although it
could permit early reporting of the results. The savings in time for reporting the
results would be approximately 36, 24, and 12 months according to whether the
trial stopped at the first, second, or third look, respectively. If there is little likeli-
hood that the therapy will be used in future patients unless it can be shown to
be efficacious in the current trial, there may be little advantage to reporting early
negative results, and one might choose not to consider early stopping for negative
results at any of these looks.

VI. SUMMARY

Statistical monitoring procedures are used in cancer clinical trials to ensure the
early availability of efficacious treatments while at the same time preventing
spurious early termination of trials for apparent benefit that may later diminish.
Properly designed, these procedures can also provide support for stopping a trial
early when results do not appear promising, conserving resources and affording
patients the opportunity to pursue other treatment options and avoid regimens
that may have known and unknown risks while offering little benefit. By
weighing these considerations against each other in the specific study situation
at hand, a satisfactory monitoring procedure can be chosen.
The group sequential monitoring rules described in this chapter differ with
respect to their operating characteristics, and care should be taken to select a
monitoring policy that is consistent with the goals and structure of a particular
clinical trial. For example, among symmetric rules, the Pocock procedure is asso-
ciated with a relatively large maximum number of events (mK) required for final
analysis. But the probability of early stopping under alternatives of significant
treatment effect is relatively high, so that the expected number of events required
to trigger reporting of results is reduced under such alternatives. In contrast, for
the O’Brien–Fleming procedure, mK is only very slightly greater than the number
of events that would be required if no interim analyses were to be performed. The
price paid for this is some loss of efficiency (in terms of expected number of
events required for final analysis) under alternatives of significant treatment ef-
fect. In phase III cancer trials (particularly in the adjuvant setting), it is often the
case that both accrual and treatment of patients are completed before a significant
number of clinical events (e.g., deaths or treatment failures) have occurred, and
more emphasis has been placed on the use of interim analysis policies to prevent
the premature disclosure of early results than on their use to improve efficiency
by allowing the possibility of early reporting. In such circumstances, it has often
been considered to be most important to minimize the maximum number of re-
quired events, leading to the rather widespread use of the O’Brien–Fleming
method and similar methods such as those of Haybittle and Fleming et al. Other
considerations (e.g., the need to perform secondary subset analyses, the possibil-
ity of attenuation of treatment effect over time) also argue for the accumulation
of a significant number of events before definitive analysis and therefore favor
policies that are rather conservative in terms of early stopping. The power bound-
aries of Wang and Tsiatis provide a convenient way to explore trade-offs between
maximum event size and expected number of events to final analysis by consider-
ing a variety of values of the tuning parameter ∆.
In the absence of a prespecified stopping rule that is sensitive to the possi-
bility of early stopping for no benefit or negative results, such evidence of a
negative effect for an experimental therapy at the time of an interim analysis may
be analyzed with the help of conditional power calculations, predictive power, or
fully Bayesian methods. These methods are quite practical, have had a careful
mathematical development, and have well-studied operating characteristics. We
discussed these approaches in Sections III and IV and applied several of them
to data from the NSABP Protocol B-14, a study that was closed in the face of
negative results for an experimental schedule of extended tamoxifen (10 years
vs. the standard 5 years).
It is preferable to include a plan for stopping in the face of negative results
at the time the study protocol is developed. In particular, it is important to know
what effect the plan will have on the power and significance level of the overall
design. The mathematics required to create an appropriate group sequential de-
sign that incorporates asymmetric monitoring boundaries is presented in Section
II, with examples of several asymmetric designs in current use. Many factors
enter into the choice of a design, including the anticipated morbidity of the experi-
mental regimen, severity of the disease being studied, the expected accrual rate
of the study, and the likely effect of early release of results on other studies. The
example in Section V.B showed that a rule applied in an advanced disease setting
might prevent the accrual of a fraction of patients to an ineffective experimental
regimen, whereas the same rule applied to an adjuvant trial is likely only to affect
the timing of the presentation of results, as accrual may be completed before
the rule is applied. When asymmetric monitoring boundaries are required, our
preference has been to use simple approaches such as the Wieand et al. modi-
fication of symmetric boundaries or the use of an upper boundary of the
O’Brien–Fleming or Fleming et al. type, coupled with a lower boundary derived
by testing the alternative hypothesis HA: δ = δA, using the O’Brien–Fleming or
Fleming et al. rules. In the latter case, one may or may not require the procedure
to be closed (i.e., require the upper and lower boundaries to join at the Kth
analysis). If closure is required, the use of O’Brien–Fleming boundaries leads
to the method of Pampallona and Tsiatis (20) with ∆ = 0. Our experience is
that these approaches are easily explained to (and accepted by) our clinical col-
leagues.
We advocate that trial designers give serious thought to the suitability of
asymmetric monitoring rules. If such an approach is reasonable, the methods in
Section II allow the statistician to develop a design that seems most appropriate
for his or her situation. If at an interim look one is faced with negative results
and has not designed the trial to consider this possibility, we recommend one of
the approaches described in Sections III and IV. Even when one has had the
foresight to include an asymmetric design, one may gain further insight regarding
unexpected results by applying some of these methods. Of course, when one
deviates from the original design, the initial power and significance level compu-
tations of the study are altered.
After we completed this chapter, we became aware of a new volume by
Jennison and Turnbull (44). They were kind enough to provide us with an advance
copy as one of us (Bryant) was about to teach a Group Sequential Monitoring
course at the University of Pittsburgh. Their work contains a comprehensive cov-
erage of many of the methods and issues discussed in our chapter, and we heartily
recommend the book to individuals who wish to expand their knowledge regard-
ing early stopping procedures.

REFERENCES

1. Tsiatis AA. The asymptotic joint distribution of the efficient scores test for the pro-
portional hazards model calculated over time. Biometrika 1981; 68:311–315.
2. Tsiatis AA. Repeated significance testing for a general class of statistics used in cen-
sored survival analysis. J Am Stat Assoc 1982; 77:855–861.
3. Gail MH, DeMets DL, Slud EV. Simulation studies on increments of the two-sample
logrank score test for survival time data, with application to group sequential bound-
aries. In: Crowley J, Johnson RA, eds. Survival Analysis. Hayward, CA: Institute
of Mathematical Statistics Lecture Notes—Monograph Series, Vol. 2, 1982, pp.
287–301.
4. Armitage P, McPherson CK, Rowe BC. Repeated significance tests on accumulating
data. J Royal Stat Soc Series A 1969; 132:235–244.
5. Haybittle JL. Repeated assessment of results in clinical trials in cancer treatments.
Br J Radiol 1971; 44:793–797.
6. Pocock SJ. Group sequential methods in the design and analysis of clinical trials.
Biometrika 1977; 64:191–199.
7. Pocock SJ. Interim analyses for randomized clinical trials: the group sequential ap-
proach. Biometrics 1982; 38:153–162.
8. O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics
1979; 35:549–556.
9. Wang SK, Tsiatis AA. Approximately optimal one-parameter boundaries for group
sequential trials. Biometrics 1987; 43:193–199.
10. Fleming TR, Harrington DP, O’Brien PC. Designs for group sequential tests. Control
Clin Trials 1984; 5:348–361.
11. DeMets DL, Ware JH. Group sequential methods for clinical trials with a one-sided
hypothesis. Biometrika 1980; 67:651–660.
12. Wald A. Sequential Analysis. New York: John Wiley and Sons, 1947.
13. DeMets DL, Ware JH. Asymmetric group sequential boundaries for monitoring clin-
ical trials. Biometrika 1982; 69:661–663.
14. Whitehead J, Stratton I. Group sequential clinical trials with triangular continuation
regions. Biometrics 1983; 39:227–236.
15. Whitehead J, Jones D. The analysis of sequential clinical trials. Biometrika 1979;
66:443–452.
16. Whitehead J. The Design and Analysis of Sequential Clinical Trials. Chichester:
Ellis Horwood Ltd. 1983.
17. Jennison C. Efficient group sequential tests with unpredictable group sizes. Biome-
trika 1987; 74:155–165.
18. Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika
1983; 70:659–663.
19. Emerson SS, Fleming TR. Symmetric group sequential test designs. Biometrics
1989; 45:905–923.
20. Pampallona S, Tsiatis AA. Group sequential designs for one-sided and two-sided
hypothesis testing with provision for early stopping in favor of the null hypothesis.
J Stat Plan Infer 1994; 42:19–35.
21. Wieand S, Schroeder G, O’Fallon JR. Stopping when the experimental regimen does
not appear to help. Stat Med 1994; 13:1453–1458.
22. Ellenberg SS, Eisenberger MA. An efficient design for phase III studies of combina-
tion chemotherapies. Cancer Treatment Rep 1985; 69:1147–1154.
23. Wieand S, Therneau T. A two-stage design for randomized trials with binary out-
comes. Control Clin Trials 1987; 8:20–28.
24. Lan KKG, Simon R, Halperin M. Stochastically curtailed tests in long-term clinical
trials. Commun Stat Sequent Anal 1982; 1:207–219.
25. Halperin M, Lan KKG, Ware JH, Johnson NJ, DeMets DL. An aid to data monitoring
in long-term clinical trials. Control Clin Trials 1982; 3:311–323.
26. Lan KKG, Wittes J. The B-value: a tool for monitoring data. Biometrics 1988; 44:
579–585.
27. Davis BR, Hardy RJ. Upper bounds for type I and type II error rates in conditional
power calculations. Commun Stat Theory Meth 1991; 19:3571–3584.
28. Spiegelhalter DJ, Freedman LS, Blackburn PR. Monitoring clinical trials: condi-
tional or predictive power? Control Clin Trials 1986; 7:8–17.
29. Choi SC, Pepple PA. Monitoring clinical trials based on predictive probability of
significance. Biometrics 1989; 45:317–323.
30. Jennison C, Turnbull BW. Statistical approaches to interim monitoring of medical
trials: a review and commentary. Stat Sci 1990; 5:299–317.
31. Kass R, Greenhouse J. Comment on ‘‘Investigating therapies of potentially great
benefit: ECMO’’ by J. Ware. Stat Sci 1989; 4:310–317.
32. Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian approaches to randomized
trials [with discussion]. J R Stat Soc A 1994; 157:357–416.
33. Freedman LS, Spiegelhalter DJ, Parmar MKB. The what, why and how of Bayesian
clinical trials monitoring. Stat Med 1994; 13:1371–1383.
34. Greenhouse J, Wasserman L. Robust Bayesian methods for monitoring clinical trials.
Stat Med 1994; 14:1379–1391.
35. Berry DA, Stangl DK, eds. Bayesian Biostatistics. New York: Marcel Dekker, 1996.
36. Parmar MKB, Ungerleider RS, Simon R. Assessing whether to perform a confirma-
tory randomized clinical trial. J Natl Cancer Inst 1996; 88:1645–1651.
37. Fisher B, Dignam J, Bryant J, et al. Five versus more than five years of tamoxifen
therapy for breast cancer patients with negative lymph nodes and estrogen receptor-
positive tumors. J Natl Cancer Inst 1996; 88:1529–1542.
38. Swain SM. Tamoxifen: the long and short of it. J Natl Cancer Inst 1996; 88:1510–
1512.
39. Peto R. Five years of tamoxifen—or more? J Natl Cancer Inst 1996; 88:1791–1793.
40. Current Trials Working Party of the Cancer Research Campaign Breast Cancer Trials
Group. Preliminary results from the Cancer Research Campaign trial evaluating ta-
moxifen duration in women aged fifty years or older with breast cancer. J Natl Can-
cer Inst 1996; 88:1834–1839.
41. Dignam J, Bryant J, Wieand HS, Fisher B, Wolmark N. Early stopping of a clinical
trial when there is evidence of no treatment benefit: Protocol B-14 of the National
Surgical Adjuvant Breast and Bowel Project. Control Clin Trials 1998; 19:575–588.
42. Emerson SS. Statistical packages for group sequential methods. Am Stat 1996; 50:
183–192.
43. SAS Technical Report P-217. SAS/STAT Software: the PHREG Procedure, Version
6. Cary, NC: SAS Institute Inc., 1991. 63 pp.
44. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical
Trials. Boca Raton: Chapman & Hall/CRC, 2000.
12
Use of the Triangular Test in Sequential
Clinical Trials

John Whitehead
The University of Reading, Reading, England

I. INTRODUCTION

A clinical trial is described as sequential if its design includes one or more interim
analyses that could lead to a resolution of the primary therapeutic question. Thus
it can be distinguished from a fixed-sample trial, in which there are no interim
analyses and the necessary sample size is calculated in advance. In a sequential
trial, the sample size is unknown in advance and is determined in part by the
nature of the emerging data. Although a fixed-sample trial may not, in the event,
achieve its target sample size, that will occur for practical or logistical reasons
rather than as a consequence of the nature of the accumulating data. Trials with
purely administrative looks or with interim assessments of safety only are not
generally considered to be sequential, as the primary therapeutic question is not
repeatedly addressed.
Sequential clinical trials are becoming increasingly common in clinical re-
search because they offer the ethical and economic advantages of avoiding contin-
uation in the face of mounting evidence against a treatment and of requiring
relatively small sample sizes when the advantage of a treatment becomes quickly
and clearly apparent. Most sequential designs currently being used are derived
from either the boundaries approach or the α-spending approach, and this chap-
ter concentrates on the most frequently implemented member of the former class:
the triangular test.
The triangular test is an efficient form of sequential procedure that uses as
small a sample size as possible while still maintaining the required precision of
the testing procedure. It is an asymmetrical procedure in the sense that over-
whelming evidence is required to reach an early conclusion that an experimental
treatment is effective, whereas the trial will be stopped for lack of effect as soon
as it is apparent that continuation is futile. These features are made more precise
in subsequent sections.
The following section is an introduction to the clinical trial context in which
the triangular test can most easily be applied. Section III consists of a detailed
account of a trial in renal cancer that used the triangular test. The history and
mathematical properties of the method are given in Section IV, and rivals and
variations to the approach are described in Section V. Section VI is a survey of
recent applications of the triangular test.

II. COMPARATIVE CLINICAL TRIALS

Throughout this chapter it is assumed that patients are being randomized between
one experimental treatment E and one control treatment C and that the primary
therapeutic question concerns a single patient response. The symbol θ will be
used to denote the advantage of E over C in the patient population as a whole,
whereas Z will denote the observed advantage apparent from the current data.
The amount of information about θ contained in Z will be denoted by V. The
quantity θ is an unknown population parameter, whereas Z and V are observable
sample statistics.
To clarify the meaning of each of the quantities above, two examples are
given. Suppose that in a trial of cancer therapy, the primary patient response is the
survival time of the patient from randomization to death. Then θ, the advantage of
E over C, might be expressed as minus the log of the ratio of the hazard on E
to that on C. The log is taken so that when hazards are equivalent, θ is equal to
log (1) = 0, thus expressing zero advantage; the minus sign means that a reduc-
tion in hazard on E will show as a positive value of θ. The statistic Z is the
logrank statistic, which can be thought of as the observed number of deaths on
C minus the number expected under the null hypothesis of no advantage of E
over C. The control treatment is focused on, so that a positive value of Z indicates
an advantage of E. The statistic V will be the null variance of Z, which is approxi-
mately equal to one quarter of the number of deaths. The full formulae for Z and
V are given in Section 3.4 of Whitehead (1).
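
For readers who wish to compute these quantities directly, the sketch below
(illustrative code, not taken from Whitehead (1); the array names are ours)
accumulates the observed-minus-expected deaths on C and the corresponding
hypergeometric variances over the distinct death times.

# Sketch: logrank efficient score Z and its null variance V for two groups.
import numpy as np

def logrank_Z_V(time, event, is_control):
    """time: follow-up times; event: 1 = death, 0 = censored;
    is_control: 1 if the patient is on C, 0 if on E.
    Returns (Z, V), with Z = observed minus expected deaths on C."""
    time = np.asarray(time, float)
    event = np.asarray(event, int)
    is_control = np.asarray(is_control, int)
    Z = V = 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n = at_risk.sum()                          # total at risk at time t
        n_c = (at_risk & (is_control == 1)).sum()  # at risk on C
        d = ((time == t) & (event == 1)).sum()     # deaths at time t
        d_c = ((time == t) & (event == 1) & (is_control == 1)).sum()
        Z += d_c - d * n_c / n                     # O - E for the control arm
        if n > 1:                                  # hypergeometric variance
            V += d * (n_c / n) * (1 - n_c / n) * (n - d) / (n - 1)
    return Z, V                # with balanced arms, V is roughly (deaths)/4

# Hypothetical data: times, death indicators, and control-arm indicators
t = [5, 8, 12, 12, 15, 20, 22, 30]
e = [1, 1, 1, 0, 1, 1, 0, 0]
c = [1, 0, 1, 0, 0, 1, 1, 0]
print(logrank_Z_V(t, e, c))
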
For a second example, take a comparison of the types of mattress on which
a patient lies during surgery, in which the primary patient response is the inci-
dence of pressure sores. Denote by pE and pC the probabilities of suffering a
pressure sore on the experimental and control (standard) mattresses, respectively.
Then θ will be taken to be the log-odds ratio:

θ = log{(1 − pE)/pE} − log{(1 − pC)/pC}.

Let SE, SC and FE, FC denote the numbers of successes (no pressure sores) and
failures (pressure sores) on E and C, respectively. Then
Z = SE − nE S/n   and   V = nE nC S F/n³
where nE and nC denote the total number of patients on E and C, respectively,
and S = SE + SC, F = FE + FC, and n = nE + nC. The traditional χ² statistic for
a 2 × 2 contingency table, usually expressed as ∑(O − E)²/E, can be shown to
be equal to Z²/V.
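
A small illustration (not code from the chapter; the counts shown are
hypothetical) that computes Z, V, and Z²/V from the four cell counts of a
2 × 2 table is given below.

# Sketch: efficient score Z, information V, and chi-square Z^2/V for a
# 2 x 2 table, using the formulas given above.
def binary_Z_V(SE, FE, SC, FC):
    # SE, FE: successes/failures on E; SC, FC: successes/failures on C
    S, F = SE + SC, FE + FC
    nE, nC = SE + FE, SC + FC
    n = nE + nC
    Z = SE - nE * S / n
    V = nE * nC * S * F / n ** 3
    return Z, V

Z, V = binary_Z_V(SE=40, FE=10, SC=30, FC=20)    # hypothetical counts
print(Z, V, Z * Z / V)   # Z^2/V equals the usual Pearson chi-square (4.76)
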
At each interim analysis of the trial, Z and V are computed from the avail-
able data. The statistic Z is constructed so that its expected value is θV and its
variance is V; consequently, a positive value of Z is encouraging. In a sequential
analysis, the value of Z is plotted against V at each interim, resulting in an ex-
pected linear path of gradient θ, with random variation about it quantified by the
variance V. This is illustrated in Figure 1.

Figure 1 Maintaining a plot of Z against V at interim analyses.


Sequential designs deriving from the boundaries approach are defined by
superimposing boundaries on the Z-V plane. The triangular test is illustrated in
Figure 2. A rising path of Z against V indicates growing evidence of an advantage
of E over C, and the upper boundary is placed so that once the path crosses it,
the trial can be stopped with a conclusion that E is significantly better than C.
The trial is also stopped if the lower boundary is crossed, and a region correspond-
ing to significant disadvantage of E over C is indicated as the solid portion of
the lower boundary in Figure 2.
In large samples the maximum likelihood estimate θ̂ of θ and its stan-
dard error SE(θ̂) are approximately equal to Z/V and 1/√V, respectively.
One commonly used variation on the general scheme above is to use θ̂{SE(θ̂)}⁻²
in place of Z and {SE(θ̂)}⁻² in place of V.
Sequential designs are constructed according to the same sort of power
requirement as governs a conventional sample size calculation. Crossing the up-
per boundary represents significant evidence that E is better than C. It is arranged
that when θ = 0, this occurs with probability ½α, and for θ equal to some refer-
ence improvement θR > 0, it occurs with probability (1 − β). The part of the

Figure 2 Boundaries of the triangular test. ——, reject the null hypothesis; – – –, do
not reject the null hypothesis.
lower boundary corresponding to significant evidence that E is worse than C is
reached with probability ½α when θ = 0; consequently, if either rejection region
is reached the final two-sided p value will be less than α.
Once the trial has stopped, an analysis must be performed. If a conventional
analysis is applied, then the p value found is invalid, the point estimate of θ is
biased, and confidence intervals for θ are too narrow. A variety of techniques is
now available to overcome these problems and to produce acceptable analyses
based on the form of sequential design used. Full details of both design and
analysis methods are given in Whitehead (1).

III. A CLINICAL TRIAL OF ALPHA-INTERFERON IN METASTATIC RENAL CARCINOMA

The MRC Renal Cancer Collaborators (2) described a multicenter, randomized,
controlled trial in patients with metastatic renal carcinoma. Standard treatment
in this indication is hormonal therapy with medroxyprogesterone acetate, and the
patients randomized to the control group (C) received this therapy. The treatment
under investigation was biological therapy with alpha-interferon, and patients
receiving this treatment formed the experimental group (E). Patients were as-
signed to treatment using the method of minimization, stratifying by center and
by whether the patient had had a nephrectomy and by single or multiple metasta-
ses. The primary treatment comparison concerned survival time from randomiza-
tion to death.
Treatment with alpha-interferon is both expensive and toxic. It was be-
lieved to be appropriate to continue the trial only as long as the emerging results
were consistent with an outcome in favor of E. It was certainly not believed
necessary to have a high power of establishing a significant disadvantage of E
relative to C to dissuade clinicians from using E if the trial was negative. Follow-
ing these considerations, a triangular design was chosen.
Interim analyses were planned for 12 months after the start of the trial and
at 6-month intervals thereafter. It was anticipated that recruitment would proceed
at the rate of 125 patients per year. At each interim, the survival patterns of
groups E and C were compared using a logrank test, stratified by whether the
patient had had a nephrectomy before randomization. The stratification was not
accounted for in the design, which was based on overall target and anticipated
survival patterns on E and C.
Denote the hazard functions of patients on E and C by hE(t) and hC(t),
respectively, and the survivor functions by SE(t) and SC(t). For survival data, the
parameter θ measuring the advantage of E over C was defined in Section I to be
minus the log of the ratio of the hazard on E to that on C. Mathematically,
θ = −log{hE(t)/hC(t)},   for all t > 0.   (1)

The assumption that this quantity is constant over all t is known as the propor-
tional hazards assumption. An equivalent expression for θ is
θ = −log{−log SE(t)} + log{−log SC(t)},   for all t > 0.   (2)
Based on results from Selli et al. (3), a 2-year survival rate of 0.2 was
anticipated in the control group, that is, SC(2) = 0.2. A power of (1 − β) = 0.90
was set for achieving significance at the 5% level (two-sided alternative) if alpha-
interferon increased this 2-year survival rate to SE(2) = 0.32. Substituting in Eq.
(2) gives a reference improvement of θR = −log{−log(0.32)} + log{−log(0.2)}
= 0.345, corresponding to a hazard ratio of hE(t)/hC(t) = 0.708.
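
This arithmetic is easily verified; the short sketch below (illustrative only)
applies Eq. (2) to the assumed 2-year survival rates.

# Sketch: reference improvement theta_R and the corresponding hazard ratio
# from the anticipated 2-year survival rates, via Eq. (2).
from math import log, exp

S_C, S_E = 0.20, 0.32                  # anticipated 2-year survival on C and E
theta_R = -log(-log(S_E)) + log(-log(S_C))
print(theta_R, exp(-theta_R))          # about 0.345 and 0.708
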
The triangular design satisfying this power requirement has upper and
lower boundaries
Z = 14.28 + 0.105V
and
Z = −14.28 + 0.315V
respectively. If the lower boundary is crossed with V ≤ 17.3, then E will be
declared to be significantly inferior to C. Table 1 shows the properties of the

Table 1 Properties of the Triangular Test Used in the Renal Carcinoma Trial

                               Probability of finding   No. of deaths at      Duration of
                               E significantly          termination           trial (mo)

θ          SC(2)    SE(2)      Better      Worse        Median (90th %ile)    Median (90th %ile)

−0.345     0.2      0.103      0.000       0.293         82 (124)             17 (22)
−0.173     0.2      0.148      0.000       0.105        109 (176)             21 (29)
 0         0.2      0.200      0.025       0.025        163 (284)             28 (41)
 0.173     0.2      0.258      0.362       0.004        246 (381)             38 (52)
 0.345     0.2      0.320      0.900       0.000        201 (341)             34 (49)
design. If θ = −0.345, that is, E is worse than C by a magnitude equal to the
target improvement (on the log-hazards scale), then the power of obtaining a
significant result is only 0.3. This emphasises the asymmetrical nature of the
design and represents the scientific loss due to reducing expected sample size.
(Although a scientific loss, it is of course an ethical gain.) The equivalent fixed
sample size design would require 353 deaths and would last for 49 months. It
can be seen that a substantial reduction in trial duration is likely to be achieved
by use of the triangular test.
In Table 1, the probabilities of finding E significantly better or worse, and
the medians and 90th percentiles of the number of deaths, depend only on the
values of the (minus) log-hazard ratio θ indicated. On the other hand, the medians
and 90th percentiles of duration depend on the pretrial estimate of SC(2) as 0.2
being correct, on an exponential form of survival pattern in both groups, and on
a steady entry rate of 125 patients per year. Recruitment was to continue until
a boundary was reached or until 600 patients had entered the trial. Details of the
trial design were published by Fayers et al. (4).
Recruitment to the trial began in February 1992. The recruitment rate aver-
aged 60 per year throughout the trial, less than half the anticipated rate. A total
of six interim analyses were performed, and these are summarized in Table 2.
Each row first gives the date of the interim analysis, the number of patients re-
cruited (n), and the number of known deaths (d). Then the logrank statistic (Z)
and its null variance (V) are given separately for each stratum by nephrectomy
and combined by summation. The Q statistic is Cochran’s test statistic for hetero-
geneity between strata and is given by Q = ∑(Z²/V) − (∑Z)²/(∑V). This
formula is familiar from meta-analysis (see ref. 5). If this were a fixed sample
study, Q would follow the χ² distribution on one degree of freedom: Here caution
is required in interpretation due to the repeated interim analyses.

Table 2 Interim Analyses for the Renal Carcinoma Trial

                            Nephrectomy        No nephrectomy     Combined

Date          n      d      Z        V         Z        V         Z        V        Q

29 Oct 93      69     20    −1.62     2.44     2.69      1.99      1.07     4.42    4.47
22 Sept 94    122     37    −1.84     4.79     4.21      3.75      2.38     8.54    4.78
20 Feb 95     158     67    −2.01     9.38     3.61      6.33      1.60    15.71    2.33
14 Feb 96     222    130     2.03    16.73     7.17     14.63      9.21    31.35    1.06
10 Feb 97     293    190     6.68    23.99     9.69     22.00     16.37    45.99    0.30
1 Oct 97      335    236     9.27    29.98    10.55     26.68     19.83    56.66    0.10
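
The Q statistics in Table 2 can be recomputed from the stratum-specific Z and V;
for example, the sketch below (illustrative only) reproduces the value for the
first interim analysis up to rounding of the tabulated figures.

# Sketch: Cochran's Q for heterogeneity between strata,
# Q = sum(Z^2/V) - (sum Z)^2/(sum V).
def cochran_Q(Z, V):
    return sum(z * z / v for z, v in zip(Z, V)) - sum(Z) ** 2 / sum(V)

# First interim analysis, nephrectomy and no-nephrectomy strata (Table 2):
print(cochran_Q(Z=[-1.62, 2.69], V=[2.44, 1.99]))   # about 4.5
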
The combined values of Z and V are plotted against one another in Figure
3. This figure displays a feature of the stopping rule, which has not yet been
discussed. The outer triangular boundaries are calculated to achieve the required
error probabilities when monitoring of the trial is continuous. Because the interim
analyses are discrete, it is possible that the hypothetical sample path arising from
continuous monitoring might cross the boundaries undetected between interim
analyses, and return by the time of the next look. Thus discrete monitoring makes
stopping less likely. To compensate and achieve the required error probabilities,
the stopping boundaries must be brought closer together. The jagged inner bound-
aries achieve this: The longer the gap between looks, the more the boundaries
are brought in. The magnitude of the correction is 0.583 times the square root
of the increment in V. The resulting boundaries are known as Christmas tree
boundaries because of their shape. It is sufficient for the plotted point to reach
these inner boundaries for the trial to be stopped. They can be used with any
design based on straight-line boundaries but are especially accurate for the trian-
gular test.
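
As an illustration of how the correction operates (a sketch, not PEST code),
the outer boundaries of this trial can be pulled in by 0.583√(increment in V)
at each look and the plotted point checked against them; applied to the combined
values at the sixth interim analysis this signals crossing of the upper boundary,
in agreement with the account below.

# Sketch: triangular boundaries of this trial with the "Christmas tree"
# adjustment of 0.583 * sqrt(increment in V) applied at a discrete look.
from math import sqrt

def triangular_look(Z, V, V_prev):
    shrink = 0.583 * sqrt(V - V_prev)      # correction for discrete monitoring
    upper = 14.28 + 0.105 * V - shrink
    lower = -14.28 + 0.315 * V + shrink
    if Z >= upper:
        return "stop: significant advantage of E"
    if Z <= lower:
        return "stop: no worthwhile advantage of E"
    return "continue"

# Sixth interim analysis (combined Z and V from Table 2; previous V = 45.99):
print(triangular_look(Z=19.83, V=56.66, V_prev=45.99))
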
At each of the first two interim analyses, the Q statistic was nominally
significant at the 5% level, indicating an interaction between treatment and ne-
phrectomy group. Relative to control, the experimental group appeared to be

Figure 3 The final sequential plot for the MRC trial of alpha-interferon.
benefited within the no nephrectomy stratum and disadvantaged within the ne-
phrectomy stratum. These first two interim analyses, like all subsequent ones,
were presented to a Data Monitoring Committee, and the apparent heterogeneity
caused some concern. However, taking informal account of multiple and repeated
testing, it was decided to take no action at either of the first two interims. The
impression of heterogeneity thereafter subsided, and later analyses showed no
trace of it at all. At the sixth interim analysis, the upper boundary was reached.
The trial protocol gave the Data Monitoring Committee the power to recommend
stopping or continuing the trial. The validity of proportional hazards and the
consistency of the advantage of E over C over various stratifications of the pa-
tients were considered informally and found to be satisfactory. The Data Monitor-
ing Committee did recommend stopping, and the decision was confirmed by the
MRC Renal Cancer Working Party. Recruitment was closed on 30 November
1997.
The analysis conducted on 1 October 1997, which allowed for the sequen-
tial nature of the design, found that alpha-interferon is associated with signifi-
cantly better survival than medroxyprogesterone acetate, with p = 0.017 (two-
sided alternative). The median unbiased estimate of log-hazard ratio is θM =
0.334, with 95% confidence interval (0.062, 0.600). Transformed to the hazard
ratio hE(t)/hC(t), a median unbiased estimate of 0.716 is obtained, with a 95%
confidence interval of (0.549, 0.940).
The value of θM is very close to the target improvement of θR = 0.345
under which the trial was powered. Two-year survival probabilities for E and C
were estimated to be 0.22 and 0.12, respectively. Both are worse than anticipated,
and this greater death rate has contributed to the reduction in duration of the trial.
Median survival times are estimated to be 8.5 months on E and 6.0 months on
C, the extra 2.5 months on E being a modest advantage given the cost and toxicity
of alpha-interferon. At the time of the sixth interim analysis, 236 deaths had
occurred. This is just two thirds of the 353 calculated as required for a fixed
sample design at the beginning of the trial. By reacting to both the increased death
rate relative to predictions and to the emerging advantage of E, the triangular test
appreciably reduced the duration of the renal carcinoma trial.

IV. HISTORY OF THE TRIANGULAR TEST

Sequential analysis originated in the war-time work of Wald (6) and Barnard (7)
concerning the inspection of newly manufactured batches of military hardware.
These authors both developed the sequential probability ratio test (SPRT). Ap-
plied directly to the clinical trial context of Section II, the stopping boundaries
for Z are a + cV (upper) and −b + cV (lower). These parallel boundaries are
open-ended, and so there is no upper limit on the amount of information that is
required.
The SPRT has an optimality property, stated and proved by Wald and Wol-
fowitz (8). For the clinical trial context, imagine that interim analyses are very
frequent and that the power is set at 1 − ½α. Under these circumstances b = a
and c = ½θR. If V* denotes the value of V at termination, then the optimality of
the SPRT implies that both E(V*; 0) and E(V*; θR)—which will be equal to one
another—reach the minimum value achievable by a sequential test satisfying the
specified power requirement.
Following the publications of Wald and Barnard (6,7), many authors sought
to modify the SPRT and in particular to overcome the open-ended nature of the
boundaries. Anderson (9) derived properties of a variety of procedures based on
straight-line stopping boundaries, which included the triangular test. The 2-SPRT
design of Lorden (10) and the minimum probability ratio test of Hall (11) are
both alternative characterizations of the triangular test.
Now we return to the case of very frequent interim analyses and a power
set to be 1 − ½α, in which the SPRT minimizes E(V*; θ) at θ = 0 and θ = θR.
Despite this optimal property, the maximum value of E(V*; θ), which occurs at
θ = ½θR, can be quite large. When θ = ½θR, the sample path has the maximum
propensity for wandering between the two parallel boundaries. Lai (12) sought
designs that minimize maxθ E(V*; θ) = E(V*; ½θR) for the specified power re-
quirement. He described these and found that as α → 0 they become the triangular
test, with gradient of the lower boundary three times that of the upper boundary as
described in Sections I and II above. The recommendation of the triangular test as
a means of minimizing expected sample size is justified by this result. Huang et al.
(13) found that, when α > 0, E(V*; ½θR) can be reduced further by using a
triangular test that is longer and thinner than that illustrated in Figures 2 and 3;
however, the reduction is very small. Jennison (14) reported numerical work seek-
ing optimal designs when interim analyses are not especially frequent. When ex-
pressed on the Z-V diagram, they become similar to triangular stopping regions.
Although sequential analysis was first introduced in the context of quality
control, its attractions for clinical research were quickly recognized. During the
early 1950s, Bross (15) devised sequential medical plans, which resemble the
double triangular test in spirit, and Kilpatrick and Oldham (16) implemented a
sequential t-test (a form of SPRT) in a trial of bronchial dilators. Excellent sur-
veys of the emerging methodology are given in the two editions of the book by
Armitage (17,18). The latter edition describes a wide variety of designs, most
based on the approximate properties of straight-line boundaries. Included are dou-
ble versions of the SPRT (popularly known as the trouser test), restricted proce-
dures, and skew plans that resemble the triangular test (see also Spicer [19]).
The sequential designs of the mid-1970s suffered from four major limita-
tions:
1. Only normally distributed and binary patient responses were catered
for, and in the case of binary observations the artificial stratagem of
matched pairs was usually necessary to eliminate nuisance parame-
ters.
2. Approximations used in the theory were accurate only if interim analy-
ses occurred after every individual response or matched pair of re-
sponses.
3. No special methods existed for producing a valid analysis once a se-
quential trial has been completed.
4. Software for sequential methods was rudimentary, and usually a choice
had to be made from a limited repertoire of tabulated designs.

My own work (20,21) examined the analogy between the test statistics Z
and V and the sample sum and sample size of independent normal observations
with mean θ and variance 1. This allows many of the early procedures developed
for the latter case to be applied far more generally.
The requirement that sequential methods involve very frequent interim
analyses was a greater practical barrier to widespread implementation. Continu-
ous monitoring of data emerging from a trial was logistically difficult. More
reasonable was a series of a few interim analyses, each conducted after a new
group of patients had responded. The group sequential designs of Pocock (22)
and O’Brien and Fleming (23) were based on an approach that was totally differ-
ent from much of what had been developed earlier. The concept was not of bound-
aries but of setting the null probability of stopping to reject the null hypothesis
at or before each interim analysis, so that the final value was equal to the required
α. The idea was given flexibility and generality through the α-spending function
introduced by Lan and DeMets (24). Although formulated in terms of error proba-
bilities, ‘‘group sequential’’ designs could be displayed on the Z-V plane, where
they were seen as symmetrical, bearing a general resemblance to restricted proce-
dures.
Two camps of sequential methodology having been established, they rap-
idly began to converge. Asymmetrical group sequential methods introduced by
DeMets and Ware (25) began the development of an α-spending counterpart to
the triangular test, whereas adjustments for discrete monitoring (26) allowed the
triangular test to be used with group sequential sampling and eventually led to
the Christmas tree correction described in Section III above.
For continuous monitoring, all sequential designs have both a boundaries
and an α-spending function representation, two such functions being needed to
characterize asymmetrical designs. Allowance for discrete monitoring can be
achieved by using the recursive numerical integration routines described by Arm-
itage et al. (27) or by corrections of continuous boundaries such as the Christmas
tree method. The latter has been developed only for straight-line boundaries. It
is extremely accurate when used with the triangular test (28), but less precise for
restricted procedures or the SPRT.
The issue of posttrial analysis has attracted a great deal of attention, which
is outside the scope of this chapter. For a summary of methods that lead to valid
analyses, see Chapter 5 of Whitehead (1). Software in the form of PEST 4 (29)
and EaSt 2000 (30) is now available; for a comparative review of earlier versions of these two packages, see Emerson (31). A sequential module for the S-Plus language is also available. The renal carcinoma trial described in Section III above was designed and analyzed using PEST.
The triangular test was ready for application in the early 1980s and was
first used in a comparison of anesthetic techniques (32) and in a clinical trial in
lung cancer (33–35).

V. RIVALS AND VARIATIONS TO THE TRIANGULAR TEST

The triangular test is a design that will efficiently distinguish between superiority
of an experimental treatment E over a control C and lack of superiority. The
power properties noted in the alpha-interferon example of Section III hold quite
generally: A triangular test designed to have high power of detecting a treatment
advantage will have little power to detect a disadvantage of comparable magni-
tude. Figure 4, a and b, illustrates two asymmetric designs with similar properties.
Figure 4a shows a truncated SPRT, whereas Figure 4b shows a member of the
class of designs described by Pampallona and Tsiatis (36).
The precise optimality results of the (untruncated) SPRT and the triangular
test cited in Section IV motivate the following less formal considerations. The
triangular test is effective in reducing sample size whatever the true value of θ
might be and especially for moderate values lying between 0 and the reference
improvement θR, for which the largest sample sizes occur. The truncated SPRT
is effective at reducing sample size if θ is negative or exceeds θR. This can be
seen from the fact that for small V, the boundaries of the truncated SPRT are
closer together than those of the triangular test. To achieve such optimality, trun-
cation should occur at quite a large value of V, so that the design resembles its
untruncated counterpart. Truncation at a value of V 50% greater than the fixed sample size required for equivalent power should suffice.
If there is good reason to believe that θ exceeds θR, perhaps from previous
studies, then the truncated SPRT might be chosen. It should lead to a small sample
size if the prediction of substantial superiority proves true, while providing a
valid if more lengthy procedure if such optimism turns out not to be justified.
Sometimes there is reason to fear that θ < 0 but nevertheless to wish to proceed
with a trial because the potential for benefit remains worthy of investigation. In
this case too, a truncated SPRT is indicated. It is likely, however, that the triangu-
lar test would be preferred in most trials.

Figure 4 Alternative sequential designs. (a) Truncated SPRT; (b) a design of Pam-
pallona and Tsiatis; (c) reverse triangular test; (d) double triangular test.

The Pampallona and Tsiatis design shown in Figure 4b is one of a family of asymmetric designs indexed by the curvature of the boundaries. This one is
very similar to a triangular test of equivalent power, especially if there are to be
no interim analyses when V is small. The designs are not derived from consider-
ations of optimality.
Other sequential alternatives to the triangular test for fulfillment of asym-
metric power requirements include stochastic curtailment procedures based on
conditional or predictive power. The use of conditional power was introduced
by Lan et al. (37), whereas Spiegelhalter et al. (38) discussed predictive power.
These and other approaches are described in the book by Jennison and Turnbull
(39). Although outside the scope of this chapter, it is worth remarking that any
such stopping rule can be mapped onto the Z-V plane as a stopping boundary
for comparison with the procedures discussed here. Such a step is recommended
so that an informed choice of design can be made.
Figure 4, c and d, shows, respectively, a reverse triangular test and a double
triangular test. The reverse triangular test has high power of detecting inferiority
of the experimental treatment but low power of showing advantage. It might be
used in cases in which E has clear nonefficacy advantages such as cost, ease of
use, or safety, so that only proven inferiority will dissuade clinicians and patients
from using it. The double triangular test will potentially distinguish between three
true situations: E better than C, E no different from C, and E worse than C. It
is sometimes used when investigators wish to hedge their bets. The most desirable
trial outcome is a claim that E is better than C, but a fall-back in which E is no
different from C but has secondary advantages (not apparent in the sequential
plot) might be worthwhile. The double triangular test is also appropriate when
demonstration of equivalence is the primary concern. When departing from a
conventional objective of demonstrating superior efficacy, the values specified
for power and for type I error rate must be chosen with caution (see ref. 40).
The truncated SPRT and the reverse and double triangular tests are imple-
mented in the computer package PEST 4, whereas the Pampallona and Tsiatis
designs (including ‘‘double’’ versions) are available in EaSt 2000. Sometimes
symmetrical designs are required, which will have small sample size only if a
major treatment difference is apparent, with the maximum sample size being
desirable otherwise. This may be to allow sufficient power to meet secondary
objectives such as subgroup analysis or investigation of secondary end points.
This requirement also rules out the double triangular test. In these situations,
restricted procedures (1,18) or symmetrical designs based on the α-spending
function approach described in Section IV could be implemented.

VI. RECENT CLINICAL TRIALS BASED ON THE TRIANGULAR TEST AND FURTHER WORK

The triangular test has now been used in a wide variety of clinical studies con-
cerned with many therapeutic areas. Examples include trials of corticosteroids
for AIDS-induced pneumonia (41), of enoxaparin for prevention of deep vein
thrombosis resulting from hip replacement surgery (42), of isradipine for the
acute treatment of stroke (43), and of implanted defibrillators in coronary heart
disease (44). In pediatric medicine, the triangular design has been used to study
the use of surfactant to alleviate respiratory distress in infants (45) and in a trial
concerning gastrointestinal reflux (46). An evaluation of the drug Viagra in the
treatment of erectile dysfunction after spinal injury also used the method (47),
and it has been implemented in animal studies of medical techniques (48). An
interesting combination of the triangular test with the play-the-winner rule was
applied in a study of spinal anesthesia during cesarean section (49). Within oncol-
ogy, besides the renal and lung cancer trials mentioned in Sections III and IV,
respectively, Storb et al. (50) described a triangular test of immunotherapy as a
preparation for bone marrow transplantation in leukemia.
The double triangular test has also found application. Nixon et al. (51)
described such a design in a comparison of pressure sore rates after the use of
two types of mattress during cancer surgery, Boden et al. (52) reported a trial based on the design in cardiology, and Yentis (53) described an application to a trial in elective cesarean section. Other trials using triangular and related designs are given on the web page: http://www.rdg.ac.uk/mps/mps_home/software/pest4/practice.htm
The properties of the triangular test are now well understood, and its opti-
mality makes it hard to improve. It is not appropriate for every situation, and
suitable alternatives exist. The Christmas tree correction could be improved on,
but in the case of the triangular test, not by much. An exact version of the triangu-
lar test, applicable to a single stream of binary observations, is described by
Stallard and Todd (54). There is a range of response types to which the triangular
and related designs can be extended; in particular, work is proceeding on survival
responses with nonproportional hazards and longitudinal ordinal responses.
Methods for analyzing data after sequential trials of this type are also being devel-
oped.
In the future, two main challenges remain. One is to use the principles
underlying the triangular design to help in the construction of optimal sequential
designs for multivariate responses and for multiple treatment comparisons. The
other is, quite simply, to spread its usage to all trials that can benefit from its
ethical and economic advantages.

REFERENCES

1. Whitehead J. The Design and Analysis of Sequential Clinical Trials. Revised 2nd
Edition. Chichester: Wiley, 1997.
2. MRC Renal Cancer Collaborators. Interferon-α and survival in metastatic renal car-
cinoma: early results of a randomised controlled trial. Lancet 1999; 353:14–17.
3. Selli C, Hinshaw W, Woodward BH, Paulson DF. Stratification of risk factors in
renal cell carcinoma. Cancer 1983; 52:899–903.
4. Fayers PM, Cook PA, Machin D, Donaldson N, Whitehead J, Ritchie A, Oliver RTD,
Yuen P. On the development of the medical research council trial of α-interferon in
metastatic renal carcinoma. Stat Med 1994; 13:2249–2260.
5. Whitehead A, Whitehead J. A general parametric approach to the meta-analysis of
randomized clinical trials. Stat Med 1991; 10:1665–1677.
6. Wald A. Sequential Analysis. New York: Wiley, 1947.
7. Barnard GA. Sequential tests in industrial statistics. J R Statist Soc 1946; (Suppl.
8): 1–26.
8. Wald A, Wolfowitz J. Optimum character of the sequential probability ratio test.
Ann Math Stat 1948; 19:326–339.
9. Anderson TW. A modification of the sequential probability ratio test to reduce sam-
ple size. Ann Math Stat 1960; 31:165–197.
10. Lorden G. 2-SPRT’s and the modified Kiefer-Weiss problem of minimizing an ex-
pected sample size. Ann Stat 1976; 4:281–291.
11. Hall WJ. Sequential minimum probability ratio tests. In: Chakravarti IM, ed. Asymp-
totic Theory of Statistical Tests and Estimation. New York: Academic Press,
1980.
12. Lai TL. Optimal stopping and sequential tests which minimize the maximum ex-
pected sample size. Ann Stat 1973; 1:659–673.
13. Huang P, Dragalin V, Hall WJ. Asymptotic design of symmetric triangular tests for
the drift of Brownian motion. Sequential Analysis, to appear.
14. Jennison C. Efficient group sequential tests with unpredictable group sizes. Biome-
trika 1987; 74:155–165.
15. Bross I. Sequential medical plans. Biometrics 1952; 8:188–205.
16. Kilpatrick GS, Oldham PD. Calcium chloride and adrenaline as bronchial dilators
compared by sequential analysis. Br Med J 1954; 2:1388–1391.
17. Armitage P. Sequential Medical Trials, 1st ed. Oxford: Blackwell, 1960.
18. Armitage P. Sequential Medical Trials, 2nd ed. Oxford: Blackwell, 1975.
19. Spicer CC. Some new closed sequential designs for clinical trials. Biometrics 1962;
18:203–211.
20. Whitehead J. Large sequential methods with application to the analysis of 2 × 2
contingency tables. Biometrika 1978; 65:351–356.
21. Jones DR, Whitehead J. Sequential forms of the log rank and modified Wilcoxon
tests for censored data. Biometrika 1979; 66:105–113. [see correction, Biometrika
1981; 68:576.]
22. Pocock SJ. Group sequential methods in the design and analysis of clinical trials.
Biometrika 1977; 64:191–199.
23. O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics
1979; 35:549–556.
24. Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika
1983; 70:659–663.
25. DeMets DL, Ware JH. Group sequential methods for clinical trials with a one-sided
hypothesis. Biometrika 1980; 67:651–660.
26. Whitehead J, Stratton I. Group sequential clinical trials with triangular continuation
regions. Biometrics 1983; 39:227–236.
27. Armitage P, McPherson CK, Rowe BC. Repeated significance tests on accumulating
data. J R Stat Soc A 1969; 132:235–244.
28. Stallard N, Facey KM. Comparison of the spending function method and the Christ-
mas tree correction for group sequential trials. J Biopharm Stat 1996; 6:361–
373.
29. MPS Research Unit. PEST 4: Operating Manual. The University of Reading: En-
gland, 2000.
30. Cytel Software Corporation. EaSt 2000: A software package for the design and in-
terim monitoring of group sequential clinical trials. Cambridge MA: Cytel, 2000.
31. Emerson SS. Statistical packages for group sequential methods. Am Stat 1996; 50:
183–192.
32. Hackett GH, Harris MNE, Plantevin M, Pringle HM, Garrioch DB, Avery A. Anaes-
thesia for out-patient termination of pregnancy, a comparison of two anaesthetic
techniques. Br J Anaesth 1982; 54:865–870.
33. Jones DR, Newman CE, Whitehead J. The design of a sequential clinical trial for
comparison of two lung cancer treatments. Stat Med 1982; 1:73–82.
34. Whitehead J, Jones DR, Ellis SH. The analysis of a sequential clinical trial for the
comparison of two lung cancer treatments. Stat Med 1983; 2:183–190.
35. Newman CE, Cox R, Ford CHJ, Johnson JR, Jones DR, Wheaton M, Whitehead J.
Reduced survival with radiotherapy and razoxane compared with radiotherapy alone
for inoperable lung cancer in a randomised double-blind trial. Br J Cancer 1985; 51:
731–732.
36. Pampallona S, Tsiatis AA. Group sequential designs for one-sided and two-sided
hypothesis testing with provision for early stopping in favor of the null hypothesis.
J Stat Plan Infer 1994; 42:19–35.
37. Lan KKG, Simon R, Halperin M. Stochastically curtailed tests in long-term clinical
trials. Sequent Anal 1982; 1:207–219.
38. Spiegelhalter DJ, Freedman LS, Blackburn PR. Monitoring clinical trials: condi-
tional or predictive power? Control Clin Trials 1986; 7:8–17.
39. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical
Trials. London: Chapman and Hall/CRC, 2000.
40. Whitehead J. Sequential designs for equivalence studies. Stat Med 1996; 15:2703–
2715.
41. Montaner JSG, Lawson LM, Levitt N, Belzberg A, Schechter MT, Ruedy J. Cortico-
steroids prevent early deterioration in patients with moderately severe pneumocystis
carinii pneumonia and the acquired immunodeficiency syndrome (AIDS). Ann Intern
Med 1990; 113:14–20.
42. Whitehead J. Sequential designs for pharmaceutical clinical trials. Pharm Med 1992;
6:179–191.
43. Whitehead J. Application of sequential methods to a phase III clinical trial in stroke.
Drug Inform J 1993; 27:733–740.
44. Moss AJ, Hall WJ, Cannom DS, Daubert JP, Higgins SL, Klein H, Levine JH, Sak-
sena S, Waldo AL, Wilber D, Brown MW, Heo M. Improved survival with an im-
planted defibrillator in patients with coronary disease at high risk for ventricular
arrhythmia. N Engl J Med 1996; 335:1933–1940.
45. Gortner L, Pohlandt F, Bartmann P, Bernsau U, Porz F, Hellwege H-H, Seitz RC,
Hieronimi G, Kuhls E, Jorch G, Hentschel R, Reiter H-L, Bauer J, Versmold H,
Meiler B. High-dose versus low-dose bovine surfactant treatment in very premature
infants. Acta Paediatr 1994; 83:135–141.
46. Bellisant E, Duhamel J-F, Guillot M, Pariente-Khayat A, Olive G, Pons G. The
triangular test to assess the efficacy of metoclopramide in gastroesophageal reflux.
Clin Pharma Ther 1997; 61:377–384.
47. Derry FA, Dinsmore WW, Fraser M, Gardner BP, Glass CA, Maytom MC, Smith
MD. Efficacy and safety of oral sildenafil (Viagra) in men with erectile dysfunction
caused by spinal cord injury. Neurology 1998; 51:1629–1633.
48. Niemann TJ, Cairns CB, Sharma J, Lewis RJ. Treatment of prolonged ventricular
fibrillation: immediate countershock versus high dose epinephrine and CPR preced-
ing countershock. Circulation 1992; 85:281–287.
49. Rout CC, Rocke DA, Levin J, Gouws E, Reddy D. A reevaluation of the role of
crystalloid preload in the prevention of hypotension associated with spinal anesthesia
for elective cesarean section. Anesthesiology 1993; 79:262–269.
50. Storb R, Deeg J, Whitehead J, Appelbaum F, Beatty P, Bensinger W, Buckner CD,
Clift R, Doney K, Farewell V, Hansen J, Hill R, Lum L, Martin P, McGuffin R,
Sanders J, Stewart P, Sullivan K, Witherspoon R, Yee G, Thomas ED. Methotrexate
and cyclosporine compared with cyclosporine alone for prophylaxis of acute graft
versus host disease after bone marrow transplantation for leukemia. N Engl J Med
1986; 314:729–735.
51. Nixon J, McElvenny D, Mason S, Brown J, Bond S. A sequential randomised con-
trolled trial comparing a dry visco-elastic polymer pad and standard operating table
mattress in the prevention of post-operative pressure sores. Int J Nurs Stud 1998;
35:193–203.
52. Boden WE, van Gilst WH, Scheldewaert RG, Starkey IR, Carlier MF, Julian DG,
Whitehead A, Bertrand ME, Col JJ, Pedersen OL, Lie KI, Santoni J-P, Fox KM.
Diltiazem in acute myocardial infarction treated with thrombolytic agents: a ran-
domised placebo-controlled trial. Lancet 2000; 355:1751–1756.
53. Yentis SM, Jenkins CS, Lucas DN, Barnes PK. The effect of prophylactic glycopyr-
rolate on maternal haemodynamics following spinal anaesthesia for elective caesar-
ian section. Int J Obstet Anesth 2000; 9:156–159.
54. Stallard N, Todd S. Exact sequential tests for single samples of discrete responses
using spending functions. Stat Med 2000; 19:3051–3064.
13
Design and Analysis Considerations for
Complementary Outcomes

Bernard F. Cole
Dartmouth Medical School, Lebanon, New Hampshire

I. INTRODUCTION

In cancer clinical trials, data are routinely collected for various patient outcomes,
including treatment-related adverse events and clinical response. Adverse event
data are used to describe the risks associated with a new treatment, and the clinical
response data are used to describe the benefits. Clinical response is often defined
in terms of changes in tumor size, time until disease progression, or time until
death. By compiling such data, researchers are able to make an objective evalua-
tion of the safety and efficacy of a new therapy. Modern clinical trials often
collect data in addition to these usual outcomes. The two most common of these
‘‘complementary outcomes’’ are quality of life (1–3) and factors related to eco-
nomic cost (4). By including these outcomes in clinical trials, researchers are
able to address questions regarding quality of life and monetary cost that may
be raised by patients, physicians, payers, and policy makers considering the use
of a new regimen. In this chapter, we provide an overview of design and analysis
considerations relating to complementary outcomes in cancer clinical trials.


II. QUALITY-OF-LIFE ASSESSMENT


A. History
Early measures of quality of life in cancer focused on physical functioning. The
Karnofsky performance status (KPS), introduced in 1948 (5,6), is generally con-
sidered to be the first such measure. KPS is measured on an 11-point scale from
0% to 100% (10% increments) where 0% denotes death, 100% denotes normal
function, and other values denote ‘‘approximate percentage of normal physical
performance.’’ The KPS assessment is made by the clinician.
Subsequent efforts in quality-of-life assessment evaluated illnesses and
therapeutic regimens from the patients’ perspective. For example, in 1971 Izsak
and Medalie (7) developed a multidimensional scale that measured physical, so-
cial, and psychological variables in cancer patients. In 1975 a trial for patients
with acute myelogenous leukemia used a six-level assessment of quality of life
ranging from ‘‘hospital stay throughout illness’’ to ‘‘no symptoms, normal life’’
(8). The assessments were based on patient reports of their symptoms and func-
tioning.
Modern quality-of-life assessment in cancer clinical trials is generally cited
to have begun in 1976 with Priestman and Baum’s study of breast cancer treat-
ment (9). Using a 10-question instrument, they assessed patients’ general feeling
of well-being, mood, level of activity, pain, nausea, appetite, ability to perform
housework, social activities, general level of anxiety, and overall treatment expe-
rience. The results indicated that this instrument could be used to assess the sub-
jective benefit of treatment in individual women, to detect changes over time,
and to compare different treatments within a clinical trial.

B. Measuring Quality of Life


Many instruments are available for measuring quality of life in clinical trials.
These can be divided into general and disease-specific instruments. Commonly
used general instruments include the SF-36 (10), the Sickness Impact Profile (11),
and the SCL-90-R (12). Each of these instruments includes general questions
relating to a patient’s health and functioning, and they can be applied in a wide
range of disease settings. A list of cancer-specific instruments is provided in
Table 1.
The goal of each instrument is to measure overall quality of life and various
quality-of-life domains, such as physical functioning, social functioning, and
mental health. Other domains include disease symptoms, pain, general health
perceptions, vitality, and role functioning. Quality-of-life instruments usually in-
clude several individual questions, or items pertaining to a particular domain,
and the domain score (also called scale score) is obtained by summarizing the
responses from the associated items (e.g., average of the item responses).
Table 1 Cancer-Specific Quality-of-Life Measurement Instruments

Instrument (reference); number of items; domains assessed:

Breast Cancer Chemotherapy Questionnaire (BCQ) (43); 30; attractiveness, fatigue, physical symptoms, inconvenience, emotional, hope, social support.
Cancer Rehabilitation Evaluation System (CARES) (44); 93–132; physical, psychosocial, medical interaction, marital, sexual, symptom- and treatment-specific items.
European Organization for Research and Treatment of Cancer scale (EORTC: QLQ-C30) (45); 42; five functional scales (physical, role, cognitive, emotional, social), three symptom scales (fatigue, pain, nausea), disease-specific items, global quality of life.
Functional Assessment of Cancer Therapy (FACT) (46); 36–40; physical, social/family, relationship with doctor, emotional, functional well-being, disease-specific items.
Functional Living Index—Cancer (FLIC) (47); 22; psychological, social, disease symptoms, global well-being, treatment and disease issues, physical functioning.
International Breast Cancer Study Group Quality of Life Questionnaire (IBCSG–QL) (48); 10; physical well-being, mood, fatigue, appetite, coping, social support, symptoms, overall health.
Linear Analogue Self-Assessment (LASA) (9); 25; physical, psychological, social.
Quality of Life Index (QLI) (18); 5; physical activity, daily living, health perceptions, psychological, social support, outlook on life.

Each instrument has its own rules regarding the computation of the domain scores,
and these rules are established after careful testing. Each item is generally mea-
sured on a Likert scale or a linear analogue self-assessment (LASA) scale. The
Likert scale is an ordered categorical scale consisting of a limited choice of
clearly defined responses. The most frequently used scales have either four or
five categories. In contrast, the LASA scale is an unmarked line, usually 10 cm
long, with text at either end describing the extremes of the scale. Each patient
is asked to place a mark on the line in a position that best reflects his or her
response relative to the two labeled extreme points.
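As a generic illustration of domain scoring (the four-category Likert coding, the 0-100 rescaling, and the ‘‘at least half the items answered’’ rule below are common conventions chosen for illustration, not the published algorithm of any instrument in Table 1):

# Illustrative domain scoring: average the answered Likert items (coded 1-4),
# rescale to 0-100, and return a missing score if fewer than half the items
# were answered. These rules are generic, not those of a specific instrument.
import numpy as np

def domain_score(items, n_categories=4, min_fraction_answered=0.5):
    items = np.asarray(items, dtype=float)
    answered = items[~np.isnan(items)]
    if len(answered) < min_fraction_answered * len(items):
        return np.nan                                   # too much missing data
    raw = answered.mean()                               # mean of answered items
    return 100.0 * (raw - 1.0) / (n_categories - 1.0)   # rescale to 0-100

# Example: a five-item domain with one missing response scores 66.7 out of 100.
print(domain_score([3, 4, np.nan, 2, 3]))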

C. Measuring Patients’ Preferences and Utilities


In addition to measuring descriptive quality of life, it is possible to measure a
patient’s preference, or utility, for particular health states. This can be accom-
plished by assessing a patient’s value for one health state compared with another
based on quality-of-life considerations. For example, two patients might report
similar symptoms with similar frequency and duration, but they may differ on
how important these symptoms are in their daily lives. Descriptive quality-of-life
instruments will correctly provide similar scores for these two patients, whereas a
measurement of preference or utility will differentiate them.
Utility is measured on a scale from 0 to 1, where 0 denotes a health state
‘‘as bad as death’’ and 1 denotes a health state ‘‘as good as perfect health.’’
Values between 0 and 1 denote degrees between these extremes. A simple inter-
pretation of a utility for a specific health state, A, is that the utility represents the
amount of time in a state of perfect health that a patient values as equal to one
unit of time in state A. For example, suppose that state A has a utility of 0.7.
Then 1 month in state A is equivalent in value to 0.7 months of perfect health.
This interpretation leads to the idea that quality-of-life-adjusted time can be ob-
tained by multiplying a health state duration by its utility coefficient. For example,
if a patient experiences 6 months of toxicity and has a utility weight of 0.8 for
time with toxicity, then the quality-adjusted time spent with toxicity is 4.8
months. This adjustment allows treatments that have different impacts on quality
of life to be compared in a meaningful way.
Classically, utility assessment is carried out using interview techniques.
The ‘‘standard gamble’’ technique gives patients a choice between a chronic
health state with certainty or an uncertain health state that is either perfect health
(with probability p) or death (with probability 1 − p). The probability p is varied
until the patient is indifferent between the certain and the uncertain choice and
the final p is taken as the utility value. The ‘‘time trade-off’’ technique gives
patients a choice between living for a certain amount of time in a state of less
than perfect health or a shorter amount of time in a state of perfect health. The
duration of the ‘‘perfect health’’ state is varied until the patient expresses indiffer-
ence to the choice. The utility is then taken as the ratio of the final health state
durations. For a detailed overview of utility assessment, see Bennett and Torrance
(13).
Interview techniques are cumbersome to use in practice. Fortunately, there
are procedures for obtaining utility data from quality-of-life instruments using
multiattribute utility theory (14). Generally, these procedures were developed by
administering both the instrument and the interview to a study sample and build-
ing a statistical model for predicting the utility value from the instrument re-
sponses. Instruments that can be used for this purpose include the EuroQol (15),
Health Utilities Index (16), and the Q-tility Index (17), which uses Spitzer’s Qual-
ity of Life Index (18).

III. ANALYSIS OF QUALITY-OF-LIFE DATA


A. Overview
Quality-of-life data are generally collected longitudinally in cancer clinical trials.
Often the data collection is most intense during the treatment phase of the study
when patients have frequent clinic visits (e.g., 1-month intervals). Posttreatment
measurements are generally collected at longer intervals corresponding to follow-
up visits or are obtained using mailed surveys (e.g., 6-month intervals). Therefore,
a longitudinal analysis procedure is appropriate. Standard methods include re-
peated-measures analysis of variance or more general mixed effects regression
models. Other techniques include growth curve models and the construction of
summary measures.
The main difficulty in analyzing longitudinal quality-of-life data is the ap-
propriate handling of missing observations. Observations may be missing for
many reasons, some of which may be considered missing at random, whereas
others are related to the quality of life the patient is experiencing. For example,
a patient may be too ill to complete the questionnaire, or the patient may be doing
so well that he or she does not visit the clinic at the time when a quality-of-life
assessment is due.
As with most statistical analyses, a graphical display of the data is a useful
starting point. A common display for longitudinal quality-of-life data consists of
a plot of mean scores over time according to treatment group for each quality-
of-life domain measured. It is useful to indicate on these graphs which assessment
time points occurred during the treatment phase of the study. It is also useful to
indicate how many subjects provided data at each time point. These summaries
will help to guide the modeling of the data and the interpretation of the modeling
results.

B. Repeated-Measures Analysis of Variance


Repeated-measures analysis of variance is a commonly used and convenient ap-
proach to modeling longitudinal quality-of-life data. Generally, the model pre-
dicts a specific domain of quality of life using treatment group, time point, and
a treatment group by time point interaction as independent factors. Other covari-
ates, such as age, may be included as adjustment factors if confounding is a
concern. In addition, specific contrasts of the regression parameters can be evalu-
ated. For example, if a treatment by time interaction is found, the treatment effect
can be evaluated at particular time points by testing the appropriate linear combi-
nation of the regression parameters. Standard software for this analysis usually
assumes compound symmetry for the covariance matrix of the longitudinal as-
sessments. The use of mixed effect regression analysis allows one to fit other
forms for the covariance matrix or to leave the covariance matrix unspecified,
in which case it is estimated from the data. When quality-of-life assessments are
obtained sporadically or at varying time points, one can model covariance as a
function of the time difference between two assessments.
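A minimal sketch of such a fit, assuming a long-format data set with hypothetical columns qol (a domain score), treatment, time (assessment number), and patient (identifier); a random intercept per patient induces a compound-symmetry-like covariance, and richer structures require a more elaborate random-effects or covariance specification:

# Sketch: mixed-effects analysis of a longitudinal quality-of-life domain score.
# The data below are simulated purely so that the example runs end to end.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_patients, times = 40, [0, 1, 2, 3]
df = pd.DataFrame(
    [(pid, pid % 2, t) for pid in range(n_patients) for t in times],
    columns=["patient", "treatment", "time"],
)
df["qol"] = (
    60.0 + 5.0 * df["treatment"] + 2.0 * df["time"]
    + np.repeat(rng.normal(0.0, 6.0, n_patients), len(times))  # patient effect
    + rng.normal(0.0, 8.0, len(df))                            # residual noise
)

# Fixed effects: treatment, time (categorical), and their interaction;
# random intercept for each patient.
model = smf.mixedlm("qol ~ C(treatment) * C(time)", data=df, groups=df["patient"])
result = model.fit()
print(result.summary())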
Generally, a separate analysis is performed for each quality-of-life measure
obtained in a study. However, this approach can lead to inflated type I error due
to the multiple testing. Although a multivariate analysis approach may be used,
this uses only subjects with valid data on all subscales. Often patients are missing
one or more subscales from an instrument, so that the multivariate approach uses
only a fraction of the available subjects. Other corrections for multiple testing
can be used when this problem is present (e.g., the Bonferroni procedure). The
main difficulty with repeated-measures analysis of variance is accommodating
missing observations.

C. Other Techniques and Missing Data


Other techniques for analyzing longitudinal quality-of-life data include growth
curve models and the construction of summary measures. Growth curve modeling
generally involves fitting polynomials to the longitudinal data for an individual,
with the analysis focusing on the fitted polynomials (19). The method of summary
measures similarly collapses the multiple observations for an individual into a
single outcome. Examples include the mean of the observations or the area under
the curve of the plotted quality-of-life scores over time. An advantage of these
methods is that missing data can be accommodated in a variety of ways. Alterna-
tively, more recently developed methods for missing data can be applied when
using a multivariate or repeated-measures model (e.g., imputation techniques).
We refer to Fairclough (20) for an excellent review of these methods. An addi-
tional valuable reference in this area is the proceedings volume of the Workshop
on Missing Data in Quality of Life Research in Cancer Clinical Trials: Practical
and Methodological Issues (21). Finally, Bonetti et al. (22) recently developed
a method-of-moments estimation procedure for evaluating quality-of-life data in
the presence of nonignorable missing observations.
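For instance, an area-under-the-curve summary for a single patient can be computed with the trapezoidal rule over the nonmissing assessments (hypothetical data; simply dropping missing assessments, as done here, is only one of the options discussed in the references above):

# Summary-measure sketch: per-patient area under the quality-of-life curve.
import numpy as np

times = np.array([0.0, 2.0, 5.0, 8.0, 11.0, 14.0])         # months from randomization
scores = np.array([70.0, 55.0, np.nan, 65.0, 72.0, 75.0])  # one patient's domain scores

keep = ~np.isnan(scores)                            # drop missing assessments (one option)
t, s = times[keep], scores[keep]
auc = np.sum(np.diff(t) * (s[:-1] + s[1:]) / 2.0)   # trapezoidal rule
time_averaged = auc / (t[-1] - t[0])

print(f"AUC = {auc:.1f} score-months; time-averaged score = {time_averaged:.1f}")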

D. Example
Hürny et al. (23) evaluated the quality-of-life impact of various adjuvant cyto-
toxic therapy schedules in patients with breast cancer who were separately treated
in two clinical trials conducted by the International Breast Cancer Study Group
(trials VI and VII). One of these trials, trial VI, consisted of 1475 patients who
were pre- or perimenopausal at the time of study recruitment and were random-
ized in a 2 × 2 factorial design to receive three or six initial cycles of chemother-
apy with or without later reintroduction of three single cycles of chemotherapy
administered at 3-month intervals. The quality-of-life instrument used in the study
measured five indicators of quality of life: physical well-being, mood, appetite,
perceived adjustment/coping, and emotional well-being. The first four indicators
were measured using LASA scales, and the fifth was based on a 28-item adjective
checklist. Measurements were obtained at baseline, 2 months later, then every 3
months until 24 months, and again 1 and 6 months after disease recurrence.
The analysis conducted by Hürny et al. (23) focused on the data recorded
during the first 18 months after randomization. By this time all patients had com-
pleted their treatment. Square-root transformations were used to stabilize the vari-
ance of all scales. Analysis of variance was used to compare the treatment groups
at each time point, and a repeated-measures model was used to make comparisons
over time. Both models included the patient’s language/culture as a covariate.
Patients who had missing data at a particular time point were excluded only from
the analysis of that time point.
The number of patients who furnished usable data at each time point varied
from 1022 patients at baseline to 797 patients at the 18-month time point. The
results indicated that mean quality-of-life scores increased over time for all treat-
ment groups. In particular, statistically significant (p < 0.05) differences were
observed in the mean coping scores for the four groups at time points 6, 9, 12,
and 15 months. At these time points, the group that received three cycles of initial
therapy with no reintroduction therapy had the highest coping scores (indicating
a better degree of coping). At the 18-month time point, all four treatment groups
had similar mean coping scores.

IV. ANALYSIS OF QUALITY-ADJUSTED SURVIVAL


A. Introduction
Quality-of-life-adjusted survival time is a complementary outcome that is increas-
ingly being used in cancer clinical research. It represents a patient’s survival time
weighted by the quality of life experienced, where the weightings are based on
utility values. Because utility is measured on the unit interval (0,1), quality-
adjusted survival time is in the same time units as overall survival. This allows
comparisons of treatments that differ in their quality-of-life effects and their ef-
fects on survival using a metric that simultaneously accounts for both of these
differences.
The Quality-adjusted Time Without Symptoms of disease or Toxicity of
treatment (Q-TWiST) method (24,25) is one technique for evaluating quality-
adjusted survival in clinical trials. Q-TWiST compares treatments by computing
the time spent in a series of clinical health states that may impact a patient’s
quality of life. Each health state is then weighted by a utility value, and the
Q-TWiST outcome is defined by the sum of the weighted health state durations.
The three steps involved in a Q-TWiST analysis are described briefly below fol-
lowed by an illustrative example.

B. Step 1: Define Clinical Health States


The first step in the analysis is to define quality-of-life-oriented health states that
are relevant for the disease setting and the treatments being studied. The health
states should reflect changes in clinical status that may be associated with changes
in quality of life (e.g., treatment-related adverse events, disease progression, late
sequelae), and they should be progressive. That is, patients must move through
the health states in order, although health states may be skipped. For example,
we may define a k-state model with health states S1, . . . , Sk, where the only
possible transitions are from Si to Sj, 1 ⱕ i ⱕ j ⱕ k. For example, in the adjuvant
chemotherapy setting, health states may be defined as follows: S1 ⫽ time with
treatment-related toxicity, S2 ⫽ time without toxicity and without disease progres-
sion, and S3 ⫽ time with disease recurrence until death.

C. Step 2: Partition Overall Survival


The second step is to estimate the mean health state durations using the clinical
trial data. This is accomplished by partitioning the overall survival time into the
health states for each treatment group separately. The Kaplan-Meier method can
be used for this purpose; however, in practice, censoring precludes one from
estimating the entire survival curve. In this case, the partitioning is done up to
a restriction time L. A common choice for L is the median follow-up duration.
For each patient, the clinical trial data are used to define the exiting time for each
health state. For the k-state model, let ti denote the exiting time (measured from
study entry or randomization time) from health state Si, i = 1, . . . , k. If a state Sj is skipped, then tj = tj−1. If S1 is skipped, then t1 = 0. The exiting time from state Sk will be the time of death. For example, in the adjuvant chemotherapy setting, the health state exiting times may be defined as follows: t1 = time from randomization until the end of treatment-related toxicity, disease progression, or death, whichever occurs first; t2 = time from randomization until disease progression or death, whichever occurs first; and t3 = time from randomization until
death. If any of the exiting times are censored, then the exiting times for all
subsequent states will be similarly censored.
Let Ki(·) denote the Kaplan-Meier estimate corresponding to ti, i = 1, . . . , k. Then the mean health state duration, restricted to L, for health state S1 is


\hat{\tau}_1 = \int_0^L K_1(u)\,du

and the mean health state duration for health state Si (2 ≤ i ≤ k) is

\hat{\tau}_i = \int_0^L \bigl[K_i(u) - K_{i-1}(u)\bigr]\,du

This approach to the estimation of the health state durations provides consistent
estimates, whereas averaging individual health state durations when censoring is
present leads to biased results (26).
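A bare-bones sketch of this partitioning for one treatment arm, using a hand-rolled Kaplan-Meier estimator and hypothetical exit times t1 ≤ t2 ≤ t3 (a survival-analysis package would normally be used, and standard errors are not addressed here):

# Sketch: restricted mean health-state durations from Kaplan-Meier curves of the
# state exit times. Pure-numpy implementation with hypothetical data.
import numpy as np

def km_curve(time, event):
    # Kaplan-Meier estimate: returns the event times and the survival
    # probability just after each event time.
    order = np.argsort(time)
    time, event = time[order], event[order]
    event_times = np.unique(time[event == 1])
    surv, s = [], 1.0
    for u in event_times:
        at_risk = np.sum(time >= u)
        deaths = np.sum((time == u) & (event == 1))
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)

def restricted_mean(time, event, L):
    # Area under the Kaplan-Meier curve up to the restriction time L.
    ts, surv = km_curve(time, event)
    grid = np.concatenate(([0.0], ts[ts < L], [L]))
    steps = np.concatenate(([1.0], surv[ts < L]))   # survival value on each interval
    return np.sum(np.diff(grid) * steps)

L = 84.0
# Hypothetical exit times (months) and event indicators (1 = observed, 0 = censored):
t1, e1 = np.array([2.0, 6.0, 3.0, 0.0, 5.0]), np.array([1, 1, 1, 1, 0])      # end of toxicity
t2, e2 = np.array([20.0, 30.0, 15.0, 40.0, 5.0]), np.array([1, 1, 1, 1, 0])  # progression
t3, e3 = np.array([35.0, 50.0, 25.0, 60.0, 5.0]), np.array([1, 1, 0, 1, 0])  # death

tau_1 = restricted_mean(t1, e1, L)
tau_2 = restricted_mean(t2, e2, L) - restricted_mean(t1, e1, L)
tau_3 = restricted_mean(t3, e3, L) - restricted_mean(t2, e2, L)
print(f"tau_1 = {tau_1:.1f}, tau_2 = {tau_2:.1f}, tau_3 = {tau_3:.1f} months")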

D. Step 3: Compare the Treatments


The third step is to compare the treatments using a weighted sum of the mean
health state durations. For example, if u1, . . . , uk denote the utility coefficients
for the respective health states S1, . . . , Sk, the Q-TWiST end point is given by
\text{Q-TWiST} = \sum_{i=1}^{k} u_i \tau_i

Note that if all of the utility coefficients equal unity, Q-TWiST is equivalent to
the mean survival time restricted to L.
Q-TWiST is calculated separately for each treatment group, and the treat-
ment effects are obtained by subtracting the Q-TWiST estimates corresponding
to two treatment groups (e.g., experimental drug group vs. control group). Stan-
dard errors for the treatment effects can be obtained using the bootstrap method
(24) or recently derived closed-form estimators (27–29). If data are available for
estimating u1, ..., uk, then these data can be incorporated directly into the analysis.
When utility data are not available, the treatment comparison can be evaluated
by computing the treatment effect for varying values of the utility weights in a
sensitivity analysis. When two utility weights are unknown and two treatments
are being compared, the treatment comparison can be plotted for all possible
values of the unknown utility coefficients in a two-dimensional graph called a
‘‘threshold utility plot.’’ Contour lines can be used to indicate the magnitude of
the Q-TWiST treatment effect associated with different pairs of utility values.
The contour line corresponding to a treatment effect of zero is called the ‘‘thresh-
old line.’’ The threshold line indicates all utility value pairs for which the treat-
ment effects are equal in terms of Q-TWiST. Confidence bounds for the threshold
line can also be plotted to define regions of utility coefficient values for which
the treatment effect difference is statistically significant.
For example, in the case where k = 3 and u2 = 1, the contour lines for
the threshold utility plot are defined by

u_3 = \frac{c - u_1\Delta_1 - \Delta_2}{\Delta_3}

where c represents the Q-TWiST treatment effect and ∆i is the treatment group difference for the duration of health state Si. Note that if c = 0, the above equation represents the threshold line. Contour lines can be obtained by setting appropriate values for c and plotting the resulting equations (e.g., c = −3, −2, −1, 0, 1, 2, 3 months).
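A small numerical sketch of the threshold-utility calculation, using the rounded health-state differences reported in Table 2 below (interferon minus observation: Tox 5.8, TWiST 3.1, Rel −2.0 months) and uTWiST = 1:

# Sketch: Q-TWiST treatment-effect surface over the utility square and the
# threshold line, using rounded differences from Table 2 (months).
import numpy as np

d_tox, d_twist, d_rel = 5.8, 3.1, -2.0      # interferon minus observation

u_tox = np.linspace(0.0, 1.0, 101)
u_rel = np.linspace(0.0, 1.0, 101)
U_tox, U_rel = np.meshgrid(u_tox, u_rel)

effect = U_tox * d_tox + d_twist + U_rel * d_rel   # Q-TWiST gain with u_TWiST = 1
print("Q-TWiST gain over the utility square:",
      round(float(effect.min()), 1), "to", round(float(effect.max()), 1), "months")

# Threshold line (gain = 0): u_rel = (0 - u_tox * d_tox - d_twist) / d_rel.
u_rel_threshold = (-(u_tox * d_tox) - d_twist) / d_rel
crosses = np.any((u_rel_threshold >= 0.0) & (u_rel_threshold <= 1.0))
print("threshold line crosses the unit square:", bool(crosses))

With these rounded inputs the gain is positive everywhere on the utility square, so the threshold line falls outside the plot, consistent with the description of Figure 2; small differences from the contour values quoted in the text reflect the rounding of the inputs.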

E. Example
The Eastern Cooperative Oncology Group (ECOG) clinical trial EST1684 com-
pared high-dose interferon alfa-2b therapy versus clinical observation for the ad-
juvant treatment of high-risk resected malignant melanoma in 280 patients
(30,31). The health states defined for the Q-TWiST analysis were as follows:
Tox ⫽ all time with severe or life-threatening side effects of high-dose interferon;
TWiST ⫽ all time without severe or life-threatening treatment toxicity and with-
out symptoms of disease relapse; and Rel ⫽ all time following disease relapse.
These health states reflect the major clinical changes in quality of life that are
important for evaluating the impact of high-dose interferon.
Figure 1 shows the partitioning of overall survival into the health states
according to treatment group based on the product limit method and restricted
to the median follow-up interval of 84 months. For each graph, the area beneath
the overall survival curve (OS) is partitioned into the health states Tox, TWiST,
and Rel. This was accomplished by plotting on the same graph as OS a survival
curve for the duration of severe or life-threatening side effects of interferon (Tox)
and a survival curve for the time until disease relapse or death (RFS).
Table 2 shows the mean health state durations, the mean overall survival
time (OS), and the mean relapse-free survival time (RFS) within the first 84
months from randomization in the study. The results indicate that patients in the
interferon group experienced more time in TWiST and less time in Rel as com-
pared with the observation group; however, the interferon group also experienced
more time with severe or life-threatening toxicity. Table 3 shows the computation
of Q-TWiST for two possible selections of the utility weights uTox and uRel. The
utility weight for TWiST in this analysis was assumed to be unity because TWiST
represents a state of best possible quality of life. Note that mean OS is equivalent
to mean Q-TWiST when uTox = uRel = 1 and that mean RFS is equivalent to mean Q-TWiST when uTox = 1 and uRel = 0.
Figure 2 illustrates the threshold plot for the treatment comparison. Note
that in this case, the interferon group experienced more quality-adjusted time
than the control group regardless of the utility values used. Therefore, the thresh-
old line does not appear on the graph. However, the contour lines indicate that

Figure 1 Partitioned survival plots for the ECOG clinical trial comparing (A) clinical
observation (i.e., no therapy) and (B) interferon for patients with malignant melanoma.
Each plot illustrates the overall survival curve (OS), the relapse-free survival curve (RFS),
and a curve representing the duration of Toxicity (Tox). The area between the OS and
RFS curves represents the duration of the relapse health state (Rel), and the area between
the Tox and RFS curves represents time without symptoms of relapse or toxicity (TWiST).
(From Ref. 31.)

Table 2 Mean Time in Months for the Components of Q-TWiST Restricted to 84 Months of Median Follow-Up in the ECOG Trial EST 1684

Outcome*   Observation   Interferon Alfa-2b   Difference†   95% CI†         p (two-sided)†
Tox            0.0             5.8                 5.8        5.0 to 6.7       <0.001
TWiST         30.0            33.1                 3.1       −4.8 to 11.0      0.4
Rel           12.4            10.4                −2.0       −6.2 to 2.3       0.4
OS            42.4            49.3                 7.0       −0.6 to 14.5      0.07
RFS           30.0            38.9                 8.9        0.8 to 17.0      0.03

* Tox, time with severe or life-threatening side effects of treatment; TWiST, time without severe or life-threatening side effect of treatment and without symptoms of disease relapse; Rel, time following disease relapse until death; OS, overall survival time; RFS, relapse-free survival time.
† Treatment difference corresponds to interferon minus observation and is given with a 95% confidence interval (CI) and a two-sided p value based on a Z-test.
Source: Ref. 31.

the Q-TWiST benefit for interferon ranges from 2 to 8 months depending on the
selection of utility weight values. In addition, the upper 95% confidence band
for the threshold line appears as a dashed line. Values of the utility weights above
the dashed line correspond to a significant (p < 0.05) benefit for interferon in
terms of Q-TWiST and values below the dashed line indicate utility coefficient

Table 3 Mean Q-TWiST in Months Within 84 Months of Median Follow-Up in the ECOG Trial for Arbitrary Sets of Utility Weight Values

uTox   uRel   Observation   Interferon Alfa-2b   Difference†   95% CI†        p (two-sided)†
0.5    0.5       36.2             41.2                5.0      −2.4 to 12.5     0.2
0.9    0.4       34.9             42.5                7.6       0.0 to 15.1     0.05

* uTox, the utility weight associated with the severe or life-threatening side effects of treatment (TOX); uRel, the utility weight associated with disease relapse. Each utility weight is measured on a scale from 0 = ‘‘as bad as death’’ to 1 = ‘‘as good as perfect health.’’
† Treatment difference corresponds to interferon minus observation and is given with a 95% confidence interval (CI) and a two-sided p value based on a Z-test.
Source: Ref. 31.

Figure 2 Threshold utility analysis for the ECOG clinical trial. The graph illustrates
the treatment comparison in terms of Q-TWiST for all possible values of the utility weights
for toxicity and relapse. The parallel dotted lines represent contours for the treatment effect
in terms of Q-TWiST (interferon minus observation) as the utility weights vary between
0 and 1. The positive numbers on the contour lines indicate that mean Q-TWiST for the
interferon group is greater than for the observation group for all possible pairs of utility
weights between 0 and 1. For utility value pairs above the heavy dashed line in the upper
left corner of the plot, the Q-TWiST treatment effect is statistically significant (i.e., p < 0.05). (From Ref. 31.)

value pairs for which the Q-TWiST comparison favored the interferon group but
did not reach statistical significance.

F. Further Developments
Quality-of-life-adjusted survival in clinical trials is an area of active methodologi-
cal research. Parametric (32) and semiparametric (33) regression models have
been developed for quality-adjusted survival. In addition, methods have been
developed for forecasting treatment effects (34) and performing meta-analyses
(35–37). A number of recently published papers hold promise for an increasing
array of statistical tools. In particular, Glasziou et al. (38) describe methods for
combining longitudinal quality-of-life data with survival data using an integra-
tion-based approach. Zhao and Tsiatis (27) present a consistent estimator for the
distribution of quality-adjusted survival time and, in a second paper (28), describe
techniques for estimating mean quality-adjusted lifetime based on their consistent
estimator. Ongoing research focuses on developing statistical tools for the analy-
sis of utility data in conjunction with quality-adjusted survival analysis.

V. COST-EFFECTIVENESS AND COST–UTILITY ANALYSIS


A. Overview
The economic evaluation of different treatment options is critically important for
including medical costs in policy decisions and clinical practice decisions. Cost-
effectiveness analysis and cost–utility analysis are two common methods for
comparing the economic cost of treatments in a standardized way. Both methods
result in ratios that relate the incremental cost of a therapy to the incremental
clinical benefit of that therapy. The denominator in cost-effectiveness analysis is
any measure of benefit, whereas in cost–utility analysis it is expressed in quality-
adjusted time. For example, to evaluate the cost-effectiveness of drug A relative
to drug B, the additional cost of using drug A is quantified, as is the additional clinical benefit of drug A relative to drug B. In particular, if cA and cB denote the total costs of
using drug A and drug B, respectively, and if µA and µB denote the respective
mean survival times for the two drugs, the cost-effectiveness ratio is given by

\frac{c_A - c_B}{\mu_A - \mu_B}

This ratio represents the additional cost of using drug A per unit of lifetime saved
relative to drug B. The cost–utility ratio is obtained by replacing the denominator
in the cost-effectiveness ratio with the quality-adjusted treatment effect, where
the quality-adjustment is made using utility weights. For example, the treatment
effect can be expressed in terms of Q-TWiST in a cost–utility ratio.
The main challenge for clinical trials is to measure the utilization of re-
sources related to treatment. The gathering of cost information is not necessary
because utilization data can be assigned appropriate costs at a later time. More-
over, in practice it is very difficult to measure the real costs of medical treatment;
costs can change drastically over the course of a study, and costs may vary widely
from one geographic region to another. As a result, it is much more practical to
collect data on resource utilization and later assign costs.
At a minimum, data should be collected for all physician visits, drug treat-
ments, diagnostic tests, and hospitalizations. The clinical centers involved with
the trial can provide some of these data for the utilization at these clinical sites;
however, it is important that data also are collected directly from the patients
because many will receive care from other sites. The use of a patient diary or
interval questionnaire, along with periodic telephone calls from a study coordina-
tor, will facilitate accurate collection of the data.
The statistical issues related to economic analysis are too complex to be
covered in this chapter. The main difficulty is that cost-effectiveness ratios are
derived from varied data sources, and some of the parameters (e.g., cost) may
be point estimates and not based on sampled data. The result is that the ratios have
an unknown distribution, making statistical inferences impossible. A common
solution to this problem is to use sensitivity analysis to examine how the cost-
effectiveness ratio varies as the parameters of the analysis vary (e.g., cost parame-
ters can be varied within certain ranges). Nevertheless, in some cases, it is possi-
ble to make statistical inferences on cost-effectiveness and cost–utility ratios (39).

B. Example
Hillner et al.(40) performed an economic evaluation of interferon treatment for
malignant melanoma based on the ECOG clinical trial EST1684. They estimated
that the total cost of medical care within the first 7 years after diagnosis was
$91,656 for interferon-treated patients and $76,580 for patients not treated with
interferon. Therefore, the incremental cost was estimated at $15,076. These fig-
ures were discounted at an annual rate of 3%. This discounting allows the future
(decreased) value of money to be expressed in current dollars. Table 2 indicates
that the increase in mean survival within the first 84 months associated with
interferon was 7.0 months. The cost-effectiveness ratio is therefore given by
$15,076/7.0 = $2152 per life month saved ($25,848 per life-year saved). Note
that the results presented by Hillner et al. differ slightly from these calculations
because of rounding and because Hillner et al. discounted the survival time (as
well as costs) by 3% per year.
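The ratio can be reproduced directly from the figures quoted above, together with a crude one-way sensitivity analysis on the incremental cost (the ±20% range is arbitrary and purely illustrative; small differences from the published figures arise from rounding of the 7.0-month gain):

# Sketch: incremental cost-effectiveness ratio for the interferon example.
incremental_cost = 91656 - 76580             # discounted dollars (= 15,076)
incremental_benefit_months = 7.0             # restricted mean survival gain (Table 2)

icer_month = incremental_cost / incremental_benefit_months
print(f"ICER: ${icer_month:,.0f} per life-month (${12 * icer_month:,.0f} per life-year)")

# One-way sensitivity analysis: vary the incremental cost by +/- 20%.
for factor in (0.8, 1.0, 1.2):
    cost = factor * incremental_cost
    print(f"cost x {factor:.1f}: ${cost / incremental_benefit_months:,.0f} per life-month")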

VI. STUDY DESIGN IN CLINICAL TRIALS

As a gold standard study design we propose a cancer clinical trial including the
following outcomes: (1) the usual clinical end points such as progression-free
survival and overall survival, (2) the usual assessment of toxicity/adverse event
frequency and grade, (3) measurements of the timing and duration of all toxicities
and adverse events, (4) longitudinal assessment of quality of life using a general
instrument and a disease-specific instrument, (5) a procedure for estimating pa-
tient utility or preference, and (6) a procedure for estimating health care utiliza-
tion. By including all these components in a clinical trial, it becomes possible
to address the clinical benefits of a new therapy and its impact on quality of life
and whether it is cost effective. Of course, few studies will include all these
components due to constrained resources. In addition, clinical trials that began
in the 1980s or early 1990s will generally not include components for measuring
quality of life or utility, because methods for assessment were not as well estab-
lished as they are today. To fill this potential gap, researchers use other methods
to address pressing clinical issues. One approach is to launch a smaller study
that collects quality-of-life and utility data from a group of patients. Such a study
can be longitudinal or cross-sectional. The advantage of the cross-sectional de-
sign is that the study can be completed more quickly. The disadvantage is that
longitudinal effects on quality of life cannot be estimated. Inferior ancillary study
designs include those that use proxy data for subjective quality-of-life domains.
Another approach is to retrospectively evaluate the duration of major health
states that are thought to impact quality of life (e.g., toxicity, disease progression).
By combining clinical outcome data with patient-level cycle-by-cycle toxicity
data (both of which are typically collected in cancer clinical trials), it is often
possible to obtain estimates of durations of the health states. Utility weights can
then be assigned to the health states, and this health state utility model can be
used to compare treatments in terms of quality-adjusted time. The utility weights
can be estimated from a secondary cross-sectional study, or they may be left
unspecified. In the latter case, the results of the analysis should be displayed for
a wide variety of choices for the utility weight values and not for just one or two
arbitrary selections.
For many clinical trials currently being designed where quality of life is an
important end point, it is critical that quality-of-life components are prospectively
incorporated. At a minimum, a disease-specific quality-of-life instrument should
be administered longitudinally. The timing of assessments should be designed to
measure quality of life for the various clinical health states that a patient might
experience both during and after therapy (e.g., treatment-related toxicity, disease
progression). For randomized studies, a baseline assessment should take place
before treatment randomization. In addition, patients should be asked to self-
report troublesome adverse events and symptoms and their durations using a
diary. These data could be used to validate the physician-reported adverse event
data typically collected. The patient diary idea is particularly appealing from a
quality-of-life perspective because it is likely that a patient will self-report ad-
verse events that cause distress and therefore represent decrements in quality of
life. As a result, patient-diary data are useful for estimating the duration of time
spent with adverse events—an outcome necessary for a health-state utility model.

VII. CONCLUSION

In this chapter, we provide an overview of the basic components of quality-of-
life research in cancer clinical trials. Unfortunately, in this short space, we cannot
fully cover all aspects of this topic; however, there are a number of excellent
references for further reading. In particular, the large volume edited by Spilker
(41) is a thorough reference covering quality-of-life measurement, analysis,
cross-cultural and cross-national issues, health policy issues, and pharmacoeco-
nomics. This book is particularly well suited to the quality-of-life researcher in-
volved with study design and analysis. Another more compact reference is the
chapter by Gelber and Gelber (42), which reviews methods used in clinical re-
search and provides more detail regarding statistical analysis methods.
We also provided an example illustrating the use of quality-of-life-adjusted
survival time (Q-TWiST) in cancer clinical research. The Q-TWiST analysis of
the ECOG trial EST1684 improved the clinical usefulness of the information
obtained from the clinical trial. Moreover, the evaluation illustrates the need to
consider quality-adjusted survival comparisons in clinical research and to develop
more practical methods for assessing patient preferences for incorporation in the
decision-making process.
The use of assessment tools and procedures similar to those described in
this chapter is becoming increasingly important in cancer clinical research. At a
minimum, future clinical trials should carefully collect data regarding toxicity
grade and duration in addition to the usual clinical outcomes. The longitudinal
use of a quality-of-life instrument is also strongly recommended, as is the tracking
of individual health care costs over the course of a study. With these components
in place, a meaningful evaluation can be made of treatments in terms of clinical
outcome, quality of life, and cost.

ACKNOWLEDGMENTS

I thank Richard Gelber and Shari Gelber for helpful comments on this chapter.
Supported in part by the American Cancer Society (RPG-90-013-08-PBP) and
the National Cancer Institute (CA23108).

REFERENCES

1. Schumacher M, Olschewski M, Schulgen G. Assessment of quality of life in clinical
trials. Stat Med 1991; 10:1915–1930.
2. Cox DR, Fitzpatrick R, Fletcher AE, Gore SM, Spiegelhalter DJ, Jones DJ. Quality
of life assessment: can we keep it simple? J R Stat Soc A 1992; 155:353–393.
3. Gelber RD, Goldhirsch A, Hürny C, Bernhard J, Simes RJ. Quality of life in clinical
trials of adjuvant therapies. J Nat Cancer Inst Monogr 1992; 11:127–135.
4. Neymark N, Kiebert W, Torfs K, et al. Methodological and statistical issues of qual-
ity of life and economic evaluation in cancer clinical trials: report of a workshop.
Eur J Cancer 1998; 34:1317–1333.
5. Karnofsky DA, Abelmann WH, Craver LF, Burchenal JH. The use of nitrogen mus-
tards in the palliative treatment of carcinoma. Cancer 1948; 1:634.
6. Yates JW, Chalmer B, McKegney FP. Evaluation of patients with advanced cancer
using the Karnofsky performance status. Cancer 1980; 45:2220–2224.
7. Izsak FC, Medalie JH. Comprehensive follow-up of carcinoma patients. J Chron
Dis 1971; 24:179–191.
8. Burge PS, Prankerd TAJ, Richards JDM, et al. Quality of survival in acute myeloid
leukemia. Lancet 1975; 2:621–624.
9. Priestman TJ, Baum M. Evaluation of quality of life in patients receiving treatment
for advanced breast cancer. Lancet 1976; 1:899–900.
10. Ware JE Jr. The SF-36 health survey. In: Spilker B, ed. Quality of Life and Pharma-
coeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven, 1996, pp.
337–345.
11. Damiano AM. The sickness impact profile. In: Spilker B, ed. Quality of Life and
Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven,
1996, pp. 347–354.
12. Derogatis LR, Derogatis MF. SCL-90-R and the BSI. In: Spilker B, ed. Quality of
Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-
Raven, 1996, pp. 323–335.
13. Bennett KJ, Torrance GW. Measuring health state preferences and utilities: rating
scale, time trade-off and standard gamble techniques. In: Spilker B, ed. Quality of
Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-
Raven, 1996, pp. 253–265.
14. Farquhar PH. A survey of multiattribute utility theory and applications. TIMS Stud-
ies Mgmt Sci 1977; 6:59–89.
15. Kind P. The EuroQoL instrument: an index of health-related quality of life. In:
Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed.
Philadelphia: Lippincott-Raven, 1996, pp. 191–201.
16. Feeny DH, Torrance GW, Furlong WJ. Health Utilities Index. In: Spilker B, ed.
Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia:
Lippincott-Raven, 1996, pp. 239–252.
17. Weeks J, O’Leary J, Fairclough D, et al. The ‘‘Q-tility index’’: a new tool for assess-
ing health-related quality of life and utilities in clinical trials and clinical practice.
Proc ASCO 1994; 13:436.
18. Spitzer WO, Dobson AJ, Hall J, et al. Measuring the quality of life of cancer patients:
a concise QL-index for use by physicians. J Chron Dis 1981; 34:585–597.
19. Zee B. Growth curve model analysis for quality of life data. Stat Med 1998; 17:
757–766.
20. Fairclough DL. Methods of analysis for longitudinal studies of health-related quality
of life. In: Staquet M, ed. Quality of Life Assessment in Clinical Trials. Oxford:
Oxford University Press 1998, pp. 227–247.
21. Bernhard J, Gelber RD, eds. Workshop on Missing Data in Quality of Life Research
in Cancer Clinical Trials: Practical and Methodological Issues. Stat Med 1998; 17:
511–796.
22. Bonetti M, Cole BF, Gelber RD. A method-of-moments estimation procedure for
categorical quality-of-life data with non-ignorable missingness. J Am Stat Assoc
1999; 94:1025–1034.
23. Hürny C, Bernhard J, Coates AS, et al. Impact of adjuvant therapy on quality of
life in women with node-positive breast cancer. Lancet 1996; 347:1279–1284.
24. Glasziou PP, Simes RJ, Gelber RD. Quality adjusted survival analysis. Stat Med
1990; 9:1259–1276.
25. Gelber RD, Cole BF, Gelber S, Goldhirsch A. The Q-TWiST method. In: Spilker B,
ed. Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia:
Lippincott-Raven, 1996, pp. 437–444.
26. Gelber RD, Gelman RS, Goldhirsch A. A quality-of-life oriented endpoint for com-
paring therapies. Biometrics 1989; 45:781–795.
27. Zhao H, Tsiatis AA. A consistent estimator for the distribution of quality adjusted
survival time. Biometrika 1997; 84:339–348.
28. Zhao H, Tsiatis AA. Estimating mean quality adjusted lifetime with censored data.
Sankhya 2000; 62, Series B, Part 1: 175–188.
29. Murray S, Cole BF. Variance and sample size calculations in quality-of-life adjusted
survival analysis (Q-TWiST). Biometrics 2000; 56:173–182.
30. Kirkwood JM, Hunt Strawderman M, Ernstoff MS, et al. Interferon alpha-2b adju-
vant therapy of high-risk resected cutaneous melanoma: The Eastern Cooperative
Oncology Group Trial EST 1684. J Clin Oncol 1996; 14:7–17.
31. Cole BF, Gelber RD, Kirkwood JM, et al. Quality-of-life adjusted survival analysis
of interferon alfa-2b adjuvant treatment of high-risk resected cutaneous melanoma:
an Eastern Cooperative Oncology Group Study. J Clin Oncol 1996; 14:2666–
2673.
32. Cole BF, Gelber RD, Anderson KM. Parametric approaches to quality adjusted sur-
vival analysis. Biometrics 1994; 50:621–631.
33. Cole BF, Gelber RD, Goldhirsch A. Cox regression models for quality adjusted
survival analysis. Stat Med 1993; 12:975–987.
34. Gelber RD, Goldhirsch A, Cole BF. Parametric extrapolation of survival estimates
with applications to quality of life evaluation of treatments. Controlled Clin Trials
1993; 14:485–499.
35. Cole BF, Gelber RD, Goldhirsch A. A quality-adjusted survival meta-analysis of
adjuvant chemotherapy for premenopausal breast cancer. Stat Med 1995; 14:1771–
1784.
36. Gelber RD, Cole BF, Goldhirsch A, et al. Adjuvant chemotherapy for premenopausal
breast cancer: a meta-analysis using quality-adjusted survival. Cancer J Sci Am
1995; 1:114–121.
37. Gelber RD, Cole BF, Goldhirsch A, et al. Adjuvant chemotherapy plus tamoxifen
compared with tamoxifen alone for postmenopausal breast cancer: a meta-analysis
using quality-adjusted survival. Lancet 1996; 347:1066–1071.
38. Glasziou P, Cole BF, Gelber RD, Hilden J, Simes RJ. Quality-adjusted survival
analysis with repeated quality-of-life measures. Stat Med 1998; 17:1215–1229.
39. O’Brien BJ, Drummond MF. Statistical versus quantitative significance in the socio-
economic evaluation of medicines. PharmacoEconom 1994; 5:389–398.
40. Hillner BE, Kirkwood JM, Atkins MB, Johnson ER, Smith TJ. Economic analysis of
adjuvant interferon alfa-2b in high-risk melanoma based on projections from Eastern
Cooperative Oncology Group 1684. J Clin Oncol 1997; 15:2351–2358.
41. Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed.
Philadelphia: Lippincott-Raven, 1996.
42. Gelber R, Gelber S. Quality-of-life assessment in clinical trials. In: Thall PF, ed.
Recent Advances in Clinical Trial Design and Analysis. Norwell, Massachusetts:
Kluwer Academic Publishers, 1995, pp. 225–246.
43. Levine MN, Guyatt GH, Gent M. Quality of life in stage II breast cancer: an instru-
ment for clinical trials. J Clin Oncol 1988; 6:1789–1810.
44. Ganz PA, Schag CA, Lee JJ, et al. The CARES: A generic measure of health-related
quality of life for patients with cancer. Qual Life Res 1992; 1:19–29.
45. Aaronson NK, Bullinger M, Ahmedzai S. A modular approach to quality-of-life
assessment in cancer clinical trials. Recent Results Cancer Res 1988; 111:231–249.
46. Cella DF, Bonomi AE. The functional assessment of cancer therapy (FACT)
and functional assessment of HIV infection (FAHI) quality of life measurement sys-
tem. In: Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials,
2nd ed. Philadelphia: Lippincott-Raven, 1996, pp. 203–214.
47. Clinch JJ. The functional living index—cancer: ten years later. In: Spilker B, ed.
Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia:
Lippincott-Raven, 1996, pp. 215–225.
48. Hürny C, Bernhard J, Gelber RD, et al. Quality of life measures for patients receiving
adjuvant therapy for breast cancer: an international trial. Eur J Cancer 1992; 28:
118–124.
14
Health-Related Quality-of-Life Outcomes

Benny C. Zee
National Cancer Institute of Canada, Kingston, Ontario, Canada

David Osoba
Quality of Life Consulting, West Vancouver, British Columbia, Canada

I. INTRODUCTION

Health-related quality of life (HRQL) is now included as a major end point, in
addition to the traditional end points such as tumor response and survival, in
many oncological clinical trials. When HRQL was first proposed as an outcome
to be measured in clinical trials, there was controversy over how to define the term
HRQL and the breadth of the constructs to be included. The construct ‘‘quality of
life’’ can be very broad and can include such dimensions as the air we breathe,
socioeconomic status, and job satisfaction. However, in health care the term
HRQL is restricted to how illness or its treatment affects patients’ ability to func-
tion and their symptom burden (1). Thus, HRQL is not in itself only performance
status, toxicity ratings, tumor measurement, or laboratory values. Researchers
have agreed that at a minimum, it should include physical, social, and emotional
dimensions of life (2–4). Thus, HRQL in oncology is a multidimensional con-
struct consisting of subjective indicators of health. These indicators should in-
clude physical concerns (symptoms and pain), functional ability (activity), family
well-being, emotional well-being, spirituality, treatment satisfaction, future orien-
tation, sexuality, social functioning, and occupational functioning. Depending on
the purpose of the study, some quality-of-life assessments are targeted to obtain
information for decision making in health care policy, whereas others are de-
signed to assess the impact of both symptoms of disease and toxicity of treatment
in a phase III randomized trial setting. Information obtained from quality-of-life
assessments in phase II testing for new chemotherapeutic agents can also guide
quality-of-life evaluations planned in future large randomized studies (5). How-
ever, not all dimensions are relevant to a particular study. The general approach
in medical studies is to use a generic, or a general (condition-specific), question-
naire that assesses physical, emotional, and social functioning and then, de-
pending on the population being studied and the specific conditions of the study,
to add disease-specific or situation-specific modules or checklists (6,7). There
are several reasons for applying HRQL assessments in oncology (8): (1) studies
in which symptom control is the primary outcome, (2) cancers with a poor prog-
nosis, (3) treatment arms with similar survivals, (4) supportive care interventions,
(5) identification of the full range of side effects and impact of treatment, and
(6) using quality of life as a predictor of response and survival. From the regula-
tory point of view, the U.S. Food and Drug Administration recommends that
the beneficial effects on HRQL and/or survival be the basis of approval of new
anticancer drugs. Therefore, when treatment does not have an impact on survival,
demonstration of a favorable effect on HRQL is more important than most other
traditional measures of efficacy (9).
To perform a clinical trial using HRQL as end points, it is important to
develop a protocol with clearly defined objectives and definitions of end points
(7). For example, many antineoplastic therapies give rise to a number of dis-
tressing side effects with a presumed deterioration in HRQL. However, these
side effects are usually reversible when the treatments have been completed, and
the patients may have increased survival and improved HRQL (10). When the
objective of a study is to evaluate longer term effects, the duration and frequency
of the HRQL measurements should be clearly stated in the protocol, so that the
design of the study and follow-up schedule are developed to ensure good compli-
ance (HRQL questionnaire completion rates). Eligibility requirements for the
HRQL component should be given. The choice of instruments is important;
they should be discussed in the protocol and the psychometric properties of the
generic instruments should be referenced. When a supplemental disease-specific
checklist is needed, the motivation for adding these items should be addressed.
Frequency of measurements and logistics of data collection should be considered
to minimize potential problems with missing data. In the following sections,
we discuss some of these issues in clinical trials incorporating HRQL as an end
point.

II. DESIGN CONSIDERATIONS

During the design stage of a phase III randomized trial, one of the most important
questions is whether the HRQL end point is unambiguously stated and addresses
the study question fully. For example, a study of dose-intensive chemotherapy
versus standard alternating-dose chemotherapy for patients with extensive-stage
small cell lung cancer may expect to have more treatment-related symptoms dur-
ing the duration of chemotherapy. When the primary objective is to evaluate
whether patients survive longer with a dose-intensive regimen, then survival time
is an obvious end point. However, extensive-stage small cell lung cancer patients
have a rather short median survival of about 1 year, and the emergence and mag-
nitude of a clinically significant benefit may be evident around 4–6 months. It
is also likely that dose-intensive treatment produces more side effects and may
have an impact on patients' HRQL. The improvement in median survival time
gained by using highly toxic dose-intensive chemotherapy needs to be justified
by the HRQL outcomes to make an informed treatment decision (11). Another
example where HRQL is an important end point is the situation where the primary
objective of the study, such as survival, shows equivalence in treatment effect
but one treatment arm shows less toxicity or improved overall HRQL. Such an
outcome may indicate that one treatment is preferable to another, even when it
does not extend survival.

A. Randomization and Blinding


The basic design issues when HRQL is included as an end point are similar to
those of most conventional randomized controlled trials. However, there are
some additional practical issues that need to be considered in phase III trials
with HRQL components. For example, a study may permit patient entry to the
treatment protocol without participation in the HRQL component. It is impor-
tant to consider the proportion of nonparticipants in the HRQL components,
and appropriate stratification may be required so that the purpose of the ran-
domization process to reduce bias in treatment selection is preserved. In a
double-blind randomized trial, blinding of the allocated treatment for the person
administering the HRQL questionnaire(s) is as important as the blinding of the
allocated treatment for the primary care personnel. For many nonblinded cancer
trials, a standardized procedure must be used to reduce bias from the admini-
stration procedure. This is critical when a study is comparing two different treat-
ment modalities or treatment schedules; the HRQL assessments should be
patterned on the treatment schedules with an identical administration procedure
(12).

B. Eligibility
The effect of a disproportionate number of nonparticipators in HRQL components
between two treatment arms may introduce bias. The randomization and the strat-
ification procedure should account for this problem. However, it is difficult to
know how much the generalizability of the results from the study population, as
defined in the eligibility criteria of the protocol, will be affected. Does the patient
population include all those who consent to take part in the treatment aspect or
include only those who consent to participate in the HRQL component? Exclu-
sion of those who do not take part in HRQL assessments may limit the generaliz-
ability of the treatment outcome (10). For example, a study that includes only
patients who consent to take part in HRQL assessments may involve only well-
adjusted patients. Information about HRQL for this subset of patients may affect
the generalizability of the study results. On the other hand, allowing physicians
to treat the quality-of-life assessments as optional and then to stratify for those
who consent to take part in HRQL assessments versus those who do not may
reduce the problem, but the HRQL results of the study must be interpreted with
caution since they represent a slightly different patient population than that of
the survival end point. A strategy for collecting HRQL data that is appropriate to
all patients should be considered. For example, the use of culturally appropriate
instruments, translation into other languages for various ethnic groups, assistance
for patients with visual or auditory impairment, an instrument of appropriate length
to minimize noncompliance due to deteriorating physical or emotional illness, and
the use of proxies who know the patients well are all important considerations.

C. Psychometric Properties
The choice of instruments with well-established psychometric properties is im-
portant for proper interpretation of the HRQL results. There are many good re-
view articles on the choice of quality of life instruments. Moinpour (13) discussed
the operational definition of HRQL in clinical trials with respect to health care
and the treatment of disease, i.e., how physical, mental, and social well-being
are affected by medical intervention. Once an instrument has been chosen for a
study, previous work on the psychometric properties of the instrument should be
referenced in the protocol. The chosen instrument should have demonstrated good
internal consistency, high test–retest, and interrater reliability (if rater-completed
HRQL assessment is used). Internal consistency is usually measured by Cron-
bach’s alpha coefficient (14), which is a measure of the extent to which different
items represent the same domain content. Test–retest reliability reflects the repro-
ducibility of scores at two different time points between which HRQL is not
expected to change. Interrater reliability estimates the degree of agreement among
different raters but is not applicable in most patient self-administered HRQL
instruments. The concept of validity is more complex since there is no gold stan-
dard for HRQL assessment and it is primarily the repeated use of an instrument
in many trials over time that establishes validity. However, an instrument that
demonstrates an appropriate content validity (including adequate relevant items
in a specific domain) and convergent and divergent validity with other instru-
ments is sufficient to justify its use in most clinical trials. Sometimes a demonstra-
tion of criterion validity is possible if the criterion exists at the same time as the
measure, e.g., if we know that there is a significant difference between two groups
of patients with respect to a criterion (e.g., performance status), criterion validity
may be established if the quality of life domains correctly show a difference
between groups. However, criterion validity is not easy to establish in HRQL
instruments. Finally, construct validity assesses whether the HRQL instrument
relates to other observed variables or other constructs in a way that is consistent
with a theoretical framework. Factor analysis, multitrait scaling analysis, and
structural equation modeling are common methods to assess construct validity
(15).
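As an aside, internal consistency is easy to compute once an item-response matrix is available. The following Python sketch (using simulated, hypothetical item data) implements Cronbach's alpha, α = [k/(k − 1)][1 − Σ(item variances)/var(total score)].

```python
# Minimal sketch: Cronbach's alpha for internal consistency of a k-item scale.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: array of shape (n_respondents, k_items)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))                       # common domain content
items = latent + rng.normal(scale=0.8, size=(100, 5))    # five correlated items
print(round(cronbach_alpha(items), 2))
```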

D. Global Quality of Life


An important aspect of choosing HRQL instruments is to select one that con-
tains relevant items for a specific purpose. For example, if we are to study over-
all health, global quality of life, and general well-being, a questionnaire with a
separate global quality-of-life domain is an important consideration. Question-
naires that aggregate individual items into a total score do not necessarily repre-
sent global quality of life or general well-being. This is important because the
weighting of various domains in an instrument varies with the individual valua-
tion. As Mor and Guadagnoli (16) pointed out, the interpretation of an aggregate
score from a number of existing domains as equaling global quality of life may
provide misleading results when the instrument is dominated by certain do-
mains. For example, the Functional Living Index—Cancer (FLIC) and Cancer
Rehabilitation Evaluation System (CARES) were used to assess the quality of
life of patients treated with either modified radical mastectomy or segmental mas-
tectomy (17). A global score for FLIC was obtained from 22 visual analogue
scales, including concerns related to pain and stress and the ability to work and
do household chores. The CARES (18) instrument has 93 to 132 items with a
five-point Likert scale, and a summary score is obtained from five higher-order
factors: physical, psychosocial, medical interaction, marital, and sexual
domains. The Profile of Mood States, with 65 items in a five-point Likert
response format, was also used, the average score representing a total mood distur-
bance. Both the CARES summary scores and FLIC global score showed no sig-
nificant difference between the two groups, but patients with segmental mastec-
tomy had significantly more mood disturbance at one month than did those with
total mastectomy. More importantly, at 1 year of follow-up, patients with segmen-
tal mastectomy had significantly fewer problems with clothing and body image
as indicated in these domains in CARES. These are clear indications of differ-
ences in HRQL. However, the conclusion based on aggregate scores indicates no
significant difference. One explanation for these results may be that the chosen
questionnaires did not include factors that patients believed to be important. Kem-
eny et al. (19) pointed out that the subject matter in identifying proper domains
to study, such as body image, appearance, and femininity, is important. Another
likely explanation is that the scores for all the domains in a questionnaire may
not change in the same direction, and thus improvement in some domains may
be canceled out by worsening in others, giving no change in the aggregate score.
It is therefore very difficult to develop an instrument that provides an aggregate
score applicable across different studies. Statements about overall or global qual-
ity of life based on aggregate scores may be dangerous. Although some investiga-
tors have suggested weighting of certain domain scores (e.g., toward physical
functioning or emotional functioning), the assigned weights are derived from
observers' opinions and not necessarily those of patients (2). In contrast to the
method of aggregate scoring, one or two questions asking directly about overall
health/global quality of life provide an assessment that incorporates an
individual's own values. Once this information is obtained,
association of global quality of life with physical, emotional, and social domains
and other symptoms can be determined. This would provide us with further infor-
mation on the impact of specific dimensions on patients’ global quality of life.

E. Symptom Checklists and HRQL


One of the design questions in HRQL studies is whether to add supplementary
symptom checklists to validated general HRQL questionnaires (20). Checklists
(and modules) are developed to be disease and situation specific, that is, to capture
additional information about the effects of a particular cancer, or its treatment
in a given clinical trial. They can provide additional information not provided
by general (core) questionnaires.
There is no theoretical reason why symptom checklists cannot be completed
at the same time as general HRQL questionnaires. One difficulty is that the
timing of the symptom checklists intended to only capture the side effects of
treatment may differ from that of HRQL assessments. The effects of some symp-
toms, which are expected to peak within a day or two (e.g., nausea and vomiting
after chemotherapy), may not be captured by a questionnaire with only a 7-day
time frame, whereas the same questionnaire may be adequate for symptoms that
occur over longer periods of time (e.g., fatigue) after treatment. Symptom check-
lists, in the form of a self-administered daily diary, are often used to capture
symptoms induced by treatment. However, the design of the diary has to be short
and simple so that the data collection burden for both patients and the data center
is minimized but the chance of capturing crucial symptomatic data is maximized.
Some appropriately scheduled quality-of-life assessments have been shown
to capture critical information when compared with a symptom checklist. For
example, Pater et al. (21) showed that the timing of assessment (either day 4 or
day 8 after chemotherapy) and the recall period (either 3 days or 7 days) using
the EORTC QLQ-C30 is associated closely with the occurrence of nausea and
vomiting as captured by a daily diary using a visual analogue scale. Thus, patients
were capable of assimilating the symptomatic experience for the corresponding
recall period. It is also possible to add additional HRQL assessments (including
checklists) at other times to capture the effects of maximum toxicity (12).

III. DATA COLLECTION

It is widely recognized that there is disagreement between physician ratings and
patients’ self-assessed HRQL scores (22). It has been generally accepted that
HRQL should be assessed by patient self-administered questionnaires because
of the subjective nature of the content. The data management approach suitable
for obtaining HRQL data has to be considered to minimize the chance of missing
data. The following are a few points about the general principles of data manage-
ment practices.

A. HRQL Questionnaires
The format and layout of the questionnaire should be spacious and clear, using
a larger than usual font for instructions and questions to avoid having double
answers for one item and missing answers in the adjacent item. A clear informa-
tion sheet about the HRQL assessment and instructions about how to fill out the
questionnaires are useful. Printing should always be done on a single page, similar
to a case report form, even when the questionnaires are in the form of a booklet
(patients sometimes miss an entire page of questions if these are placed on the
reverse side). It is also important that the instrument is brief to reduce patient
burden as much as possible and questions are easy to understand without using
technical terminology or jargon. When a symptom checklist is added to the core
questionnaire, the use and wording of a conditional lead-in question must be
carefully considered. For example, in a symptom control study of antiemetics,
the investigators wished to know how nausea and vomiting affected patients’
functioning and HRQL. A lead-in instruction was used in a nausea and vomiting
checklist to identify patients who had experienced nausea and vomiting in the
past 7 days, and they were then asked to assess the impact of that nausea and
vomiting on their functioning. It was discovered that some patients who had
reported no experience of nausea and vomiting in their daily diary provided an-
swers to the checklist of questions and some patients who should have answered
did not answer the checklist. However, the core questionnaire in the same study
had very good completion rates. It was also noted that the nausea and vomiting
domain based on two items from the core HRQL questionnaire had a high correla-
tion with the nausea and vomiting scores from the daily diary (21).

B. Compliance
Some earlier studies reported difficulties collecting HRQL data (23). However,
compliance (i.e., HRQL questionnaire completion rates) in later studies has im-
proved. For example, the National Surgical Adjuvant Breast and Bowel Project
Breast Cancer Prevention Trial had a very high compliance rate for the first 12-
month assessments on the placebo arm (23). The general awareness of the value
of HRQL data has been raised, and more attention has been given to achieve
high compliance (24–29). In a workshop on ‘‘Missing Data in Quality of Life
Research in Cancer Clinical Trials’’ (30) held in 1997, most research groups
reported that the baseline compliance rates were above 90%. The compliance rate
while patients were receiving treatment was in the 80% range and after patients
completed treatment, in the 70% range. There are a number of common character-
istics of studies with good compliance. Often there has been a training session
on HRQL data collection provided for the clinical research personnel both at data
collection centers and at central data processing and analysis centers. An ongoing
interaction between the clinical research personnel in the clinics and the study
coordinator of the central office is useful to understand the problems a particular
center might be facing, and possible solutions may be attempted. A clear set of
instructions or a full manual should be developed for the clinical research person-
nel in the centers to follow. If the instructions are specific to a particular trial,
they should be included within the protocol. The schedule for HRQL assessments
should coincide with the schedule of regular follow-up visits as much as possible
to facilitate data collection. It is generally better to complete the HRQL question-
naires at a standardized time in the patient visit while patients are still in the
clinic rather than have patients take questionnaires home and mail them back at
specific time points. Standardized telephone interviews may also be used, particu-
larly if HRQL assessment is required at times not corresponding to clinic visits.
In the data center, it may be difficult to monitor HRQL compliance when
the schedules and frequency of collections vary between many trials. Special
computer programs are needed to monitor compliance in specific studies. To
develop a general computer program for monitoring compliance, a simple ques-
tion on the case report form asking whether HRQL should be performed at the
current visit is extremely helpful. The study coordinator can check this item with
the appropriate protocol schedule, and a query can be sent to the center if the
expected HRQL questionnaires were not done. For each HRQL assessment, a
front-page (cover sheet) form is required from the center. This front page contains
information on the date and location of completion and whether assistance was
required at the time of completion. If a quality-of-life assessment was scheduled
but patients did not complete a questionnaire, then the reasons for noncompliance
are documented (31). Computerization of these items makes the compliance rate
calculation simpler than assigning windows for scheduled visits where delay of
treatment or visits may increase the complexity of the calculation.
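A minimal sketch of the kind of compliance calculation described above is given below in Python; the table layout, column names, and example rows are hypothetical, standing in for the cover-sheet data a data center would actually hold.

```python
# Minimal sketch: HRQL questionnaire completion rates by scheduled assessment,
# computed from a cover-sheet style table (hypothetical columns and rows).
import pandas as pd

cover_sheets = pd.DataFrame({
    "patient":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "assessment": ["baseline", "on_treatment", "follow_up"] * 3,
    "expected":   [True] * 9,                  # was a questionnaire due at this visit?
    "completed":  [True, True, True, True, True, False, True, False, False],
})

due = cover_sheets[cover_sheets["expected"]]
compliance = due.groupby("assessment")["completed"].mean().mul(100).round(1)
print(compliance)   # completion rate (%) per scheduled assessment
```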

IV. SAMPLE SIZE

One of the critical aspects in the design of a phase III clinical trial involving
HRQL as an end point(s) is sample size calculation. The study must be adequately
powered to be able to make a firm conclusion about the HRQL hypothesis. For
example, considering a comparative study between two treatment arms, the pri-
mary objective is to compare the study treatment with the control treatment to
show the difference in effects on HRQL data. In this setting, a hypothesis-testing
approach is used for the quality-of-life data. Since quality of life is a multidimen-
sional construct, the hypothesis of interest should be clearly defined. The study
may have a specific question in mind, for example, a study designed to evaluate
the impact on cognitive functioning in patients with metastatic breast cancer re-
ceiving high-dose chemotherapy versus a treatment that may improve cognitive
functioning. The hypothesis for this study is to focus specifically on cognitive
functioning. Other studies may focus on different dimensions, for example, in
symptom control studies, the primary end point may be a specific symptom such
as nausea and vomiting induced by moderately or highly emetogenic chemother-
apy or fatigue. In general, if the study question is not about a specific functioning
domain or symptom, the global quality-of-life scores can be used as the end point
for the sample size estimation.
To determine the sample size required for a trial, several quantities must
be considered: (1) the significance level at which we wish to perform our hypothe-
sis test (usually 5%) and if it is to be one-tailed or two-tailed; (2) the smallest
clinically meaningful difference; (3) the power of the statistical test, which is the
probability of rejecting the null hypothesis if the real effect is larger than the
smallest clinically meaningful difference specified (usually 80% or 90%); and
(4) the variability of the outcome measure estimated from previous studies. For
a study with HRQL as an end point, the most difficult quantity to justify in the
sample size calculation is the smallest clinically meaningful difference. This is
because the meaning of the change on HRQL depends on the perspective of the
potential user of the information (31). The societal perspective considers the de-
gree of importance at a population level where small differences may be impor-
tant because of the large number of individuals who may be affected. At the
institution level, one may consider a degree of change to be large enough if it
leads to the adoption of certain health care policies. In a randomized clinical trial,
the magnitude of change may be considered clinically worthwhile when the de-
gree of change is large enough to cause most clinicians to consider using a speci-
fied study intervention in a given situation (e.g., discontinuation or alteration of
treatment). However, when HRQL is considered as an end point, none of the
above definitions seems sensitive enough to provide us with a practical criterion
that defines the smallest clinically meaningful difference. In fact, since HRQL
is a subjective end point, the change perceived by patients as being meaningful
should be considered important. Osoba et al. (31) used a Subjective Significance
Questionnaire asking about physical, emotional and social functioning, and global
quality of life to assess the degree of change in the EORTC QLQ-C30 scores
that were perceptible to patients with breast cancer and small cell lung cancer.
The results showed that patients with breast cancer and small cell lung cancer
perceived changes of 6.9 to 10.7 points (on a 0- to 100-point scale) in global
quality of life. From this information, a sample size formula can be derived
for a randomized trial with two treatment arms, using global quality of life as
the primary end point. The sample size formula reviewed by Lachin (32) can be used:

n = \frac{2 (Z_{1-\alpha/2} + Z_{1-\beta})^2 \sigma^2}{d^2}

The standard deviation for global quality of life can be determined from previous
studies; a sample size of 63 patients per arm is required to detect a difference
of 10 points using a two-sided 5% level test with 80% power if the standard
deviation is 20. To detect a difference of seven points, 128 patients per arm would
be needed.
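These calculations can be reproduced directly from the formula; the short Python sketch below (using standard normal quantiles from the standard library) returns the approximate number of patients required per arm.

```python
# Two-arm sample size per the formula above:
# n per arm = 2 * (Z_{1-alpha/2} + Z_{1-beta})^2 * sigma^2 / d^2
from statistics import NormalDist

def n_per_arm(d, sigma, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * z**2 * sigma**2 / d**2

print(round(n_per_arm(d=10, sigma=20), 1))  # ~62.8, i.e., about 63 patients per arm
print(round(n_per_arm(d=7, sigma=20), 1))   # ~128.2, i.e., about 128 patients per arm
```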
Another consideration in sample size estimation is the multidimensional
aspect of HRQL. If global quality of life is not the only specific domain of inter-
est, inclusion of other functioning domains or symptoms may inflate the type I
error of the hypothesis tests. It is advisable to control for the increase in type I
error using Bonferroni type adjustment for multiple end points when we estimate
the required sample size. Here we are adopting the view that global quality of
life should be treated as a separate domain and that this information cannot be
assimilated from other existing domains and symptoms. Otherwise, a global test
statistic for multiple end points such as those proposed by O’Brien (33) or Tandon
(34) may be more appropriate. A discussion about matching the clinical questions
for multiple end points to appropriate statistical procedures can be found in
O’Brien and Geller (35).
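For example, with a Bonferroni correction the same formula is simply evaluated at α/m for m HRQL end points, as in this brief sketch (m = 5 is an arbitrary illustrative choice):

```python
# Bonferroni adjustment: test each of m end points at level alpha/m,
# which increases the required sample size relative to alpha = 0.05.
from statistics import NormalDist

m_endpoints, alpha, power, sigma, d = 5, 0.05, 0.80, 20.0, 10.0
z = NormalDist().inv_cdf(1 - (alpha / m_endpoints) / 2) + NormalDist().inv_cdf(power)
print(round(2 * z**2 * sigma**2 / d**2, 1))  # ~93 per arm, versus ~63 at alpha = 0.05
```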
In many clinical trials, the HRQL assessments are scheduled to be done
repeatedly over time. If the analysis is to be done based on summary statistics
for individuals’ repeated measurements, the method by Dawson (36) can be used.
Examples of summary statistics include average postrandomization HRQL as-
sessments, last observation, the slope of HRQL scores over time, and area under
the curve. For a vector Y_ij of (K + 1) observed repeated HRQL measurements
for treatment i and subject j, the observed measures are represented by a random
effects model:

Y_{ij} = \mu_i + X b_{ij} + e_{ij}

where µ_i = (µ_i0, µ_i1, . . . , µ_iK) are the group means for the (K + 1) repeated
measures, b_ij is a q × 1 vector of random subject effects distributed as
N(0, D_(q×q)), and e_ij is random error distributed as N(0, σ_e² I_(K+1)). Therefore,
Y_ijk is distributed as N(µ_ik, x_k′ D x_k + σ_e²), where x_k′ is the (k + 1)th row of X.
The summary statistic can be written as a linear combination of the observations,
S_ij = c′Y_ij. For example, the last observation is obtained by defining
c′ = (0, . . . , 0, 1) and the slope by c′ = (−K/2, −K/2 + 1, . . . , K/2 − 1, K/2).
The average of the summary statistics for treatment group i is Σ_j S_ij /n_i and is
distributed as N(c′µ_i, [c′XDX′c + c′c σ_e²]/n_i). With sample sizes of n_1 and
n_2 = mn_1 in the two treatment groups,

n_1 = \frac{(c'XDX'c + c'c\,\sigma_e^2)(1 + 1/m)(Z_{1-\alpha/2} + Z_{1-\beta})^2}{[c'(\mu_1 - \mu_2)]^2}
When there are missing data and the data can be separated into strata according
to missing data patterns, the above sample size formula can be modified (36).
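The formula above is easy to evaluate numerically. The Python sketch below does so for a slope summary statistic; the design matrix, variance components, and mean profiles are hypothetical illustrative values, not taken from any study.

```python
# Minimal sketch: summary-statistic sample size for a slope contrast.
import numpy as np
from statistics import NormalDist

K = 4                                        # K + 1 = 5 repeated assessments
X = np.column_stack([np.ones(K + 1), np.arange(K + 1)])  # random intercept/slope design
D = np.diag([100.0, 4.0])                    # covariance of random subject effects
sigma_e2 = 225.0                             # residual variance
c = np.arange(K + 1) - K / 2                 # slope contrast c' = (-K/2, ..., K/2)

mu1 = np.array([60, 60, 60, 60, 60.0])       # flat mean profile on arm 1
mu2 = np.array([60, 62, 64, 66, 68.0])       # improving mean profile on arm 2
m = 1.0                                      # equal allocation, n2 = m * n1

var_S = c @ X @ D @ X.T @ c + c @ c * sigma_e2
z = NormalDist().inv_cdf(1 - 0.05 / 2) + NormalDist().inv_cdf(0.80)
n1 = var_S * (1 + 1 / m) * z**2 / (c @ (mu1 - mu2))**2
print(round(n1, 1))                          # patients required in group 1
```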
For analysis based on multivariate analysis of variance for repeated mea-
sures, the sample size can be determined using the method proposed by Rochon
(37). Let y_ij′ = [y_ij1, . . . , y_ijT] denote the set of repeated measures of quality
of life for the jth individual in the ith treatment group, and assume that each
y_ij ~ MVN(µ_i, Σ), a multivariate normal distribution in which µ_i′ = [µ_i1, . . . , µ_iT]
represents the mean quality-of-life scores for treatment group i = 1, 2 at times
t = 1 to T, and the variance–covariance matrix Σ can be written as a function of
a vector of anticipated standard deviations σ and the correlation matrix P, i.e.,
Σ = D_σ P D_σ, where D_σ is a diagonal matrix whose elements consist of
σ′ = [σ_1, . . . , σ_T] and P is a function of ρ. Consider the hypothesis

H_0 : H\delta = 0 \quad \text{vs.} \quad H_a : H\delta \neq 0

where δ = µ_1 − µ_2, and H is an (h × T) matrix, of full row rank, imposing h
linearly independent restrictions on δ. To estimate sample size, we need to assume
either a compound symmetry (PCS) or autoregressive (PAR) correlation structure.
The parameters required to be specified include the vector of anticipated standard
deviations σ, a value for ρ, the anticipated difference δ, and the matrix H. In a
repeated-measures approach, we need to formulate the hypothesis of interest. In
particular, a test of whether the treatment difference is consistent from evaluation
to evaluation is of interest, i.e., a test of the treatment × time interaction, where
H = [−1, I_(T−1)], −1 is a column vector of −1s, and I_(T−1) is an identity matrix of
dimension T − 1. The sample size can be determined using Hotelling's T²
distribution, with h and n_1 + n_2 − 2 degrees of freedom, and the noncentrality
parameter. Since there is no closed-form expression, an iterative procedure is
required.
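A hedged sketch of such an iterative search is given below in Python. It uses the noncentral F representation of the two-sample Hotelling T² statistic rather than Rochon's published algorithm, and the assumed standard deviations, correlation, and mean differences are purely illustrative.

```python
# Iterative per-group sample size for a Hotelling T^2 test of the
# treatment-by-time interaction (a sketch, not Rochon's exact procedure).
import numpy as np
from scipy.stats import f, ncf

T = 4                                       # number of repeated assessments
sigma = np.full(T, 20.0)                    # anticipated SDs at each time point
rho = 0.6                                   # compound symmetry correlation
P = np.full((T, T), rho) + (1 - rho) * np.eye(T)
Sigma = np.diag(sigma) @ P @ np.diag(sigma)

delta = np.array([0.0, 5.0, 8.0, 10.0])     # anticipated mean differences over time
H = np.hstack([-np.ones((T - 1, 1)), np.eye(T - 1)])   # H = [-1, I_(T-1)]
h = H.shape[0]

def power(n_per_group, alpha=0.05):
    nu = 2 * n_per_group - 2
    lam = (n_per_group / 2) * (H @ delta) @ np.linalg.solve(H @ Sigma @ H.T, H @ delta)
    f_crit = f.ppf(1 - alpha, h, nu - h + 1)
    return ncf.sf(f_crit, h, nu - h + 1, lam)   # tail probability of noncentral F

n = h + 2
while power(n) < 0.80:                      # increase n until the target power is met
    n += 1
print(n, round(power(n), 3))
```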
Instead of using the multivariate analysis of variance (MANOVA) approach, the
method of Liu and Liang (38) may be used. They proposed a modification of
the sample size and power formula of Self and Mauritsen (39) to handle correlated
observations based on the generalized estimating equation method and a quasi-
score test statistic. This method requires specification of the regression model
for the marginal mean and parameters for both null and alternative hypotheses.

V. ANALYSIS

The analysis of HRQL data presents a number of challenges to statisticians. Some
of these problems have been discussed in previous sections and include (1) deal-
ing with multiple end points due to the multidimensional nature of HRQL data,
(2) the administration and collection of questionnaires, (3) the careful implemen-
tation of data management procedures and monitoring, (4) longitudinal data col-
lection that may happen at irregular intervals due to delay in treatment and miss-
ing follow-up visits, (5) dealing with missing data, and (6) difficulty in the
interpretation when missing data are nonrandom.

A. General Methods of Analysis for HRQL Data


The analysis of HRQL data at a cross-sectional time point is the simplest way
of looking at the data. Simple, cross-sectional, descriptive statistics at each time
interval can be used to describe the overall trend of the two treatment groups.
Comparisons using this method do not take into account within-patient variation
and may inflate the type I error. The tests between mean values of two treatment
groups at different time intervals may also hide significant individual changes
and do not take into account different proportions of missing data in the two
arms. An alternative method is to summarize the longitudinal data into a summary
statistic before performing a between-arms comparison, as suggested by Dawson
and Lagakos (40). This method is simple but may overlook important changes in
HRQL over time, and it suffers from problems similar to those of the cross-sectional analysis.
Ganz et al. (17) suggested assessing the general change of HRQL over time
and called this approach ‘‘pattern analysis.’’ For each individual, the differences
between two consecutive quality-of-life scores for a specific domain were calcu-
lated, and the sign of the differences was recorded (positive, zero, or negative).
Based on the sign of the differences, the patterns of changes were classified into
one of three categories: consistent increase over time, no consistent patterns of
increase or decrease, or consistent decrease over time. The patterns were com-
pared between the two treatment groups using ordered polytomous logistic re-
gression in which the patterns were treated as ordered categorical variables. The
treatment group assessment adjusted for confounding variables was then carried
out. However, the definition of consistent increasing or decreasing patterns is
rather arbitrary, and it only provides a rough estimate of the general changes in
quality of life.
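A simple version of this classification step can be coded directly; the sketch below (with hypothetical score vectors and one reasonable convention for handling ties) assigns each patient's trajectory to one of the three pattern categories.

```python
# Minimal sketch: classify each patient's HRQL trajectory by the signs of
# consecutive differences, as in the "pattern analysis" described above.
import numpy as np

def classify_pattern(scores):
    """Return 'consistent increase', 'consistent decrease', or 'no consistent pattern'."""
    diffs = np.diff(np.asarray(scores, dtype=float))
    if np.all(diffs >= 0) and np.any(diffs > 0):
        return "consistent increase"
    if np.all(diffs <= 0) and np.any(diffs < 0):
        return "consistent decrease"
    return "no consistent pattern"

patients = {101: [50, 55, 60, 70], 102: [65, 60, 58, 40], 103: [50, 70, 45, 60]}
for pid, scores in patients.items():
    print(pid, classify_pattern(scores))
# The resulting ordered categories can then be compared between arms with an
# ordered (polytomous) logistic regression, adjusting for confounders.
```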
Other methods include univariate repeated measures and MANOVA as de-
scribed in Zee and Pater (41) for continuous repeated measures. For categorical
data, Koch et al. (42) suggested a weighted least-squares approach as described
in Grizzle et al. (43) to determine a test statistic for testing the hypothesis. A
review of other methods by Davis (44) includes generalized estimating equations
of Liang and Zeger (45) and the two-stage method of Stram et al. (46). Since
the number of repeated measures of HRQL in clinical trials may be quite large,
especially for studies with long-term follow-up, the analysis of repeated-measures
approach may require a large number of parameters in the model. All these meth-
ods require a transformation of the time axis into discrete time intervals to set
up for the repeated-measures analysis. A different approach was proposed by
Zwinderman (47) using a logistic latent trait model. However, when the number
of time points is large, the number of parameters will increase and the estimation
procedure for a conditional logistic model will again become complicated.

B. Methods of Analysis with Missing Data


For HRQL data with missing observations, the summary statistics for repeated
measurements approach can be used. Dawson and Lagakos (40) proposed a stra-
tified test for repeated-measures data that contain missing data for comparing
two treatment groups in a randomized trial. The type I error of the test is properly
retained when the distribution of missing data patterns is the same and the distribution
of outcome conditional on the pattern of missing data is also the same between
the two treatment groups. Suppose that the quality-of-life measurements Y_ijk were
measured at time points x_k (k = 0, 1, . . . , K) for subject j (j = 1, 2, . . . , n_i)
within group i (i = 1, 2). The null hypothesis of interest is that the outcome
vectors are distributed equally in the two groups:

H_0 : F_1(y) = F_2(y) \quad \text{vs.} \quad H_1 : F_1(y) \neq F_2(y)


Suppose that the summary statistic S is some scalar function of Y. Under H_0, the
n_1 + n_2 vectors Y_11, . . . , Y_1n1, Y_21, . . . , Y_2n2 are independent and identically
distributed, as are the corresponding values of S, say S_11, . . . , S_1n1, S_21, . . . , S_2n2.
Consequently, any distribution-free test comparing the summary statistics in the
two groups will have the correct size under H0.
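For instance, one might reduce each patient's available assessments to a within-patient mean and compare the two groups with a Wilcoxon (Mann–Whitney) rank-sum test, as in this sketch with hypothetical data and missing assessments coded as None.

```python
# Minimal sketch: reduce each patient's repeated HRQL scores to one summary
# statistic and compare groups with a distribution-free test.
import numpy as np
from scipy.stats import mannwhitneyu

group1 = [[60, 65, None, 70], [55, 50, 45, None], [70, 72, 75, 80]]
group2 = [[58, 52, 50, None], [62, 60, None, 55], [49, 45, 40, 38]]

def summary(scores):
    """Mean of the available (nonmissing) assessments for one patient."""
    vals = [s for s in scores if s is not None]
    return float(np.mean(vals))

s1 = [summary(p) for p in group1]
s2 = [summary(p) for p in group2]
stat, p_value = mannwhitneyu(s1, s2, alternative="two-sided")
print(round(stat, 1), round(p_value, 3))
```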
Shih and Quan (48) further assessed the situations when some sufficient
conditions as proposed by Dawson and Lagakos are not met, for example, when
data are not missing completely at random or the test statistics across strata indi-
cate a treatment by missing data pattern interaction.
Curran et al. (49) gave a general review of the test of missing data assump-
tions with respect to repeated HRQL data. They focused particularly on the
method proposed by Ridout (50), using logistic regression to model the missing
data pattern and examine whether the assumption of missing completely at ran-
dom (MCAR) is valid. Park and Davis (51) provided a test of the missing data
mechanism for repeated categorical data. Lipsitz et al. (52) extended the general-
ized estimating equations method proposed by Liang and Zeger (45) to incorpo-
rate repeated multinomial response. However, this class of model requires the
MCAR assumption.
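In the spirit of the logistic-regression check described above, the following sketch (on simulated data) regresses an indicator of a missing follow-up form on the previously observed score; a clearly nonzero slope argues against the MCAR assumption, although it cannot distinguish missing at random from nonignorable missingness.

```python
# Minimal sketch: logistic regression of a missingness indicator on the
# previously observed HRQL score (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
baseline = rng.normal(60, 15, size=200)                   # observed baseline HRQL score
p_missing = 1 / (1 + np.exp(-(2.0 - 0.05 * baseline)))    # worse baseline -> more dropout
missing_next = rng.binomial(1, p_missing)                 # 1 if the follow-up form is missing

X = sm.add_constant(baseline)
fit = sm.Logit(missing_next, X).fit(disp=False)
print(fit.params)      # a clearly nonzero slope suggests the data are not MCAR
print(fit.pvalues)
```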
Little (53) gave a comprehensive review of the modeling approaches for
various drop-out mechanisms in repeated-measures analysis and proposed the
pattern-mixture model for testing various assumptions. Schluchter (54) proposed
a joint mixed-effect and survival model to estimate the association between the
repeated-measures and the drop-out mechanism. Fairclough et al. (55) used these
methods with respect to quality-of-life examples. Another model-based approach
was proposed by Zee (56) based on growth curve models stratified by health
state, since different health states are likely to have different missing data mechanisms. For example,
the health states can be separated by the period of time when patients are receiving
protocol treatment versus the period after patients are off protocol treatment. This
model requires only the missing at random assumption to be satisfied within each
of the health states. Methods such as the pattern-mixture model (53) can be
adopted in the growth-curve method to assess the missing data mechanism for
both the whole data set and within health states to evaluate the appropriateness
of the assumption.

C. Other Methods
The incorporation of HRQL end points in a study may be motivated by a highly
toxic treatment, where the clinical benefit of a relative gain in survival is
discounted by the duration of toxicity experienced from treatment and the duration
of progression due to disease. There is a trade-off between the duration of quality
lifetime gained by the study treatment and the duration of toxicity due to treat-
ment and relapse due to disease. Goldhirsch et al. (57) proposed a Quality-adjusted
Time Without Symptoms and Toxicity (Q-TWiST) model to assess the benefit
of adjuvant treatment in breast cancer. This method can be used to incorporate
both survival and HRQL data in a single comparison between treatment groups.
However, a utility measure is needed to summarize the trade-off for individual
patients, which is traditionally determined using standard gamble or time trade-
off techniques (58). These methods are rather difficult to implement because they
are conceptually complex tools. Other alternative methods include the Health
Utility Index (59) and cancer-specific Q-tility Index (60). For the purpose of
comparison in a randomized controlled trial, it is not necessary to obtain patient-
derived utility scores. Instead, a sensitivity analysis, called a threshold utility
analysis, can be performed to assess the relative benefits between treatment
groups for all combinations of utility scores. Approximate 95% confidence inter-
vals for the differences in Q-TWiST between the two arms for each set of utility
values can be determined by the bootstrap method (61).
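As a rough illustration of a threshold utility analysis (not the published Q-TWiST machinery, and ignoring censoring for simplicity), the Python sketch below computes the between-arm difference in mean quality-adjusted time over a grid of utility weights for TOX and REL, with a bootstrap percentile interval at one grid point; all durations are simulated.

```python
# Minimal sketch: threshold utility analysis from per-patient durations of
# TOX, TWiST, and REL (simulated data; censoring is ignored here).
import numpy as np

rng = np.random.default_rng(2)
def simulate_arm(n, tox_mean, twist_mean, rel_mean):
    return np.column_stack([rng.exponential(tox_mean, n),
                            rng.exponential(twist_mean, n),
                            rng.exponential(rel_mean, n)])

arm_a = simulate_arm(150, tox_mean=4, twist_mean=20, rel_mean=6)   # months in TOX, TWiST, REL
arm_b = simulate_arm(150, tox_mean=1, twist_mean=16, rel_mean=8)

def mean_qtwist(arm, u_tox, u_rel):
    return np.mean(u_tox * arm[:, 0] + arm[:, 1] + u_rel * arm[:, 2])

# Difference in mean quality-adjusted time over a grid of utility weights.
for u_tox in (0.0, 0.5, 1.0):
    for u_rel in (0.0, 0.5, 1.0):
        print(u_tox, u_rel,
              round(mean_qtwist(arm_a, u_tox, u_rel) - mean_qtwist(arm_b, u_tox, u_rel), 1))

# Bootstrap percentile interval for the difference at one choice of utilities.
diffs = []
for _ in range(2000):
    ia = rng.integers(0, len(arm_a), len(arm_a))
    ib = rng.integers(0, len(arm_b), len(arm_b))
    diffs.append(mean_qtwist(arm_a[ia], 0.5, 0.5) - mean_qtwist(arm_b[ib], 0.5, 0.5))
print(np.percentile(diffs, [2.5, 97.5]).round(1))
```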

D. Conclusion
In summary, two different approaches are available to measure HRQL. The first
is to measure multidimensional domains of HRQL and follow patients longitudi-
nally. This approach provides a clear picture of the experience of patients with
respect to the HRQL domains being measured. The average HRQL scores can
be assessed using summary statistics or longitudinal data analysis methods for
comparing the effect of treatments. Model-based methods such as mixed-effects
models or growth-curve models are appropriate. Missing data patterns should be
verified to identify potential violation of assumptions. Another approach is to
define a totally new end point that incorporates overall survival and various health
states of patients during and after treatment, such as the Q-TWiST end point. A
major limitation of this method is that it does not address change in psychosocial
and emotional functioning. One may try to expand the utility coefficient and the
definition of toxic side effects to cover the psychosocial and emotional functioning. However, the
advantage and usefulness of the Q-TWiST model are compromised when
the summary measure becomes more complicated to interpret.

REFERENCES

1. Osoba D. Measuring the effect of cancer on quality of life. In: Osoba D, ed. Effect
of Cancer on Quality of Life. Boca Raton: CRC Press, 1991, pp. 26–40.
2. Ware JE Jr. Measuring functioning, well-being, and other generic health concepts.
In: Osoba D, ed. Effect of Cancer on Quality of Life. Boca Raton: CRC Press, 1991,
pp. 7–23.
3. Kaasa S. Measurement of quality of life in clinical trials. Oncology 1992; 49:288–
294.
4. Bruner D. In search of the ‘‘quality’’ in quality-of-life research. Int J Radiat Oncol
Biol Phys 1995; 31:191–192.
5. Seidman A, Portenoy R, Yao T, Lapore J, Mont E, Kortmansky J, Onetto N, Ren
L, Grechko J, Beltangady M, Usakewicz J, Souhrada M, Houston C, McCabe M,
Salvaggio M, Thaler H, Norton L. Quality of life in phase II trials: a study of method-
ology and predictive value in patients with advanced breast cancer treated with pacli-
taxel and granulocyte colony-stimulating factor. J Nat Cancer Inst 1995; 87:316–
322.
6. Aaronson NK, Bullinger M, Ahmedzai S. A Modular approach to quality-of-life
assessment in clinical trials. In: Scheurlen H, Kay R, Baum M, eds. Cancer Clinical
Trials: A Critical Appraisal. Recent Results in Cancer Research. Berlin: Springer,
1988; 111:231–249.
7. Osoba D. Guidelines for measuring health-related quality of life in clinical trials.
In: Staquet MJ, Hays RD, Fayers PM; eds. Quality of Life Assessment in Clinical
Trials. Methods and Practice. Oxford: Oxford University Press, 1998, pp. 19–
35.
8. Osoba D, Aaronson NK, Till JE. A practical guide for selecting quality-of-life mea-
sures in clinical trials and practice. In: Osoba D, ed. Effect of Cancer on Quality
of Life. Boca Raton: CRC Press, 1991, pp. 89–104.
9. Beitz J, Gnecco C, Justice R. Quality of life endpoints in cancer clinical trials: the
Food and Drug Administration perspectives. J Natl Cancer Inst Monogr 1996; 20:
7–9.
10. Gotay C, Korn E, McCabe M, Moore T, Cheson B. Quality of life assessment in
cancer treatment protocols: research issues in protocol development. JNCI 1992; 84:
575–579.
11. Murray N, Livingston R, Shepherd F, et al. A randomised study of CODE plus
thoracic irradiation versus alternating CAV/EP for extensive stage small cell lung
cancer (ESCLC) (abstr). Proc ASCO 1997; 16:456a.
12. Osoba D. Rationale for the timing of health-related quality-of-life assessments in
oncological palliative therapy. Cancer Treat Rev 1996 22 (suppl A): 69–73.
13. Moinpour C. Cost of quality of life research in Southwest Oncology Group. J Natl
Cancer Inst Monogr 1996; 20:11–16.
14. Cronbach L. Coefficient alpha and the internal structure of tests. Psychometrika
1951; 16:297–334.
15. Bollen K. Structural Equations with Latent Variables. New York: John Wiley &
Sons, 1989.
16. Mor V, Guadagnoli E. Quality of life measurement: a psychometic tower of Babel.
J Clin Epidemiol 1988; 41:1055–1058.
17. Ganz P, Schag A, Lee J, Polinsky M, Tan S. Breast conservation versus mastectomy:
is there a difference in psychological adjustment or quality of life in the year after
surgery? Cancer 1992; 69:1729–1738.
18. Schag C, Heinrich R, Aadland R, Ganz P. Assessing problems of cancer patients:
psychometric properties of the Cancer Inventory of Problem Situations. Health Psy-
chol 1990; 9:83–102.
19. Kemeny M, Wellisch D, Schain W. Psychosocial outcome in a randomized surgical
trial for treatment of primary breast cancer. Cancer 1988; 62:1231–1237.
20. Osoba D. Self-rating symptom checklists: a simple method for recording and evalu-
ating symptom control in oncology. Cancer Treat Rev 1993 19 (suppl A): 43–
51.
21. Pater J, Osoba D, Zee B, Lofter W, Gore M, Dempsey E, Palmer M, Chin C. Effects
of altering the time of administration and the time frame of quality of life assess-
ments in clinical trials: an example of using the EORTC QLQ-C30 in a large anti-
emetic trial. Qual Life Res 1998; 7:273–278.
22. Slevin ML, Plant H, Lynch D, et al. Who should measure quality of life, the doctor
or the patient? Br J Cancer 1988; 57:109–112.
23. Ganz P, Haskell C, Figlin R, Soto N, Siau J. Estimating the quality of life in a
clinical trial of patients with metastatic lung cancer using Karnofsky performance
status and the Functional Living Index—Cancer. Cancer 1988; 61:849–856.
24. Osoba D. The Quality of Life Committee of the Clinical Trials Group of the National
Cancer Institute of Canada: organization and functions. Qual Life Res 1992; 1:211–
218.
25. Osoba D, Dancey J, Zee B, Myles J, Pater J. Health related quality of life studies
of the National Cancer Institute of Canada Clinical Trials Group. J Natl Cancer Inst
Monogr 1996; 20:107–111.
26. Sadura A, Pater J, Osoba D, et al. Quality-of-life assessment: patient compliance
with questionnaire completion. J Nat Cancer Inst 1992; 84:1023–1026.
27. Coates A, Gebski V. Quality of life studies of the Australian New Zealand Breast
Cancer Trials Group: approaches to missing data. Stat Med 1998; 17:533–540.
28. Hahn E, Webster K, Cella D, Fairclough D. Missing data in quality of life research
in Eastern Cooperative Oncology Group (ECOG) clinical trials: problems and solutions.
Stat Med 1998; 17:547–559.
29. Osoba D, Zee B. Completion rates in health-related quality of life assessment: ap-
proach of the National Cancer Institute of Canada Clinical Trials Group. Stat Med
1998; 17:603–612.
30. Bernhard J, Gelber RD, eds. Workshop on Missing Data in Quality of Life Research
in Cancer Clinical Trials: Practical and Methodological Issues. Stat Med 1998; 17:
511–651.
31. Osoba D, Rodrigues G, Myles J, Zee B, Pater J. Interpreting the significance of
changes in health-related quality of life scores. J Clin Oncol 1998; 16:139–144.
32. Lachin J. Introduction to sample size determination and power analysis for clinical
trials. Control Clin Trials 1981; 2:93–113.
33. O’Brien P. Procedures for comparing samples with multiple endpoints. Biometrics
1984; 40:1079–1087.
34. Tandon P. Application of global statistics in analysing quality of life data. Stat Med
1990; 9:819–827.
35. O’Brien P, Geller N. Interpreting test for efficacy in clinical trials with multiple
endpoints. Controlled Clin Trials 1997; 18:222–227.
266 Zee and Osoba

36. Dawson JD. Sample size calculation based on slopes and other summary statistics.
Biometrics 1998; 54:323–330.
37. Rochon J. Sample size calculations for two-group repeated-measures experiments.
Biometrics 1991; 47:1383–1398.
38. Liu G, Liang K. Sample size calculation for studies with correlated observations.
Biometrics 1997; 53:937–947.
39. Self S, Mauritsen R. Powers/sample size calculations for generalized linear models.
Biometrics 1988; 44:79–46.
40. Dawson JD, Lagakos SW. Size and power of two-sample tests of repeated measures
data. Biometrics 1993; 49:1022–1032.
41. Zee B, Pater J. Statistical analysis of trials assessing quality of life. In: Osoba D,
eds. The Effect of Cancer on Quality of Life. Boca Raton: CRC Press, 1991, pp.
113–123.
42. Koch G, Landis R, Freeman J, Freeman D, Lehnen R. A general methodology for
the analysis of experiments with repeated measurement of categorical data. Biomet-
rics 1977; 33:133–158.
43. Grizzle J, Starmer F, Koch G. Analysis of categorical data by linear models. Biomet-
rics 1969; 25:489–504.
44. Davis C. Semi-parametric and non-parametric methods for the analysis of repeated
measurements with applications to clinical trials. Stat Med 1991; 10:1959–1980.
45. Liang K, Zeger S. Longitudinal data analysis using generalized linear models. Bio-
metrika 1986; 73:13–22.
46. Stram D, Wei L, Ware J. Analysis of repeated ordered categorical outcomes with
possibly missing observations and time-dependent covariates. J Am Stat Assoc 1988;
83:631–637.
47. Zwinderman A. The measurement of change of quality of life in clinical trials. Stat
Med 1990; 9:931–942.
48. Shih J, Quan H. Stratified testing for treatment effects with missing data. Biometrics
1998; 54:782–787.
49. Curran D, Molenberghs G, Fayers P, Machin D. Incomplete quality of life data in
randomized trials: missing forms. Stat Med 1998; 17:697–709.
50. Ridout M. Testing for random dropouts in repeated measurement data. Biometrics
1991; 47:1617–1618.
51. Park T, Davis CS. A test of missing data mechanism for repeated categorical data.
Biometrics 1993; 49:631–638.
52. Lipsitz S, Kim K, Zhao L. Analysis of repeated categorical data using generalized
estimating equations. Stat Med 1994; 13:1149–1163.
53. Little R. Modelling the drop-out mechanism in repeated-measures studies. J Am Stat
Assoc 1995; 90:1112–1121.
54. Schluchter MD. Methods for the analysis of informatively censored longitudinal
data. Stat Med 1992; 11:1861–1870.
55. Fairclough D, Peterson H, Cella D, Bonomi P. Comparison of several model-based
methods for analysing incomplete quality of life data in clinical trials. Stat Med
1998; 17:781–796.
56. Zee B. Growth curve model analysis for quality of life data. Stat Med 1998; 17:
757–766.
Health-related Quality of Life Outcomes 267

57. Goldhirsh A, Gelber R, Simes J, Glasziou P, Coates A. Cost and benefit of adjuvant
therapy in breast cancer: a quality adjusted survival analysis. J Clin Oncol 1989; 7:
36–44.
58. Weeks J. Taking quality of life into account in health economic analysis. J Natl
Cancer Inst Monogr 1996; 20:23–27.
59. Torrance G, Furlong W, Feeny D, Boyle M. Multi-attribute preference functions:
health utilities index. Pharmacoeconomics 1995; 7:503–520.
60. Weeks J, O’Leary J, Fairclough D, Paltiel D, Weinstein M. The Q-tility index: a
new tool for assessing the health-related quality of life and utilities in clinical trials
and clinical practices. Proc ASCO 1994; 13:436.
61. Glasziou PP, Simes RJ, Gelber RD. Quality adjusted survival analysis. Stat Med
1990; 9:1259–1276.
15
Statistical Analysis of Quality of Life

Andrea B. Troxel
Joseph L. Mailman School of Public Health, Columbia University, New York,
New York

Carol McMillen Moinpour


Fred Hutchinson Cancer Research Center, Seattle, Washington

I. INTRODUCTION

In randomized treatment trials for cancer or other chronic diseases, the primary
reason for assessing quality of life (QOL) is to broaden the scope of treatment
evaluation. We sometimes characterize QOL and cost outcomes as alternative or
complementary because they add to information provided by traditional clinical
trials’ end points such as survival, disease-free survival, tumor response, and
toxicity. The challenge lies in combining this information in the treatment evalua-
tion context. There is fairly strong consensus that, at least in the phase III setting,
QOL should be measured comprehensively (1–3). Although a total or summary
score is desirable for the QOL measure, it is equally important to have separate
measures of basic domains of functioning (e.g., physical, emotional, social, role
functioning) and symptom status. Symptoms specific to the cancer site and/or
the treatments under evaluation are also usually included to monitor for toxicities
and to gauge the palliative effect of the treatment on disease-related symptoms.
In some trials, investigators may study additional areas such as financial concerns,
spirituality, family well-being, and satisfaction with care. Specific components
of QOL provide information not only on the interpretation of treatment effects
but can also identify areas in which cancer survivors need assistance in their
return to daily functioning. Data on specific areas of functioning can also
help suggest ways to improve cancer treatments; Sugarbaker et al. (4) conducted
a study in which the radiotherapy regimen was modified as a result of QOL data.

Table 1 Examples of Comprehensive QOL Questionnaires

SF-36 (Refs. 8–14)
   QOL dimensions (# items): Physical functioning (10), Role-physical (4), Bodily
   pain (2), General health (5), Vitality (4), Social functioning (2), Role-emotional
   (3), Mental health (5), Health transition (1)
   Scores: Physical component summary score (35), Mental component summary
   score (35); no total score

EORTC QLQ-C30, Version 2 (Refs. 15–21)
   Core (30): Physical functioning (5), Role functioning (2), Cognitive functioning (2),
   Emotional functioning (4), Social functioning (2); symptom scales: Fatigue (3),
   Pain (2), Nausea/vomiting (2); single-item symptoms (5); Financial impact (1);
   Global HRQOL (1)
   Modules: cancer-specific: lung (13), breast (23), head and neck (35), esophageal
   (24), colorectal (38); others in development: bladder, body image, leukemia,
   myeloma, ophthalmic, ovarian, pancreatic, prostate; treatment-specific: high-dose
   chemotherapy, palliative care
   Scores: no total score; module scores

Functional Assessment of Cancer Therapy (FACT),† Versions 3 and 4 (Refs. 22–26)
   Core:* Physical (7), Functional (7), Social (7), Emotional (6)
   Additional concerns (modules): cancer-specific: breast (9), bladder (12), brain (19),
   colorectal (9), CNS (12), cervix (15), esophageal (17), head and neck (11),
   hepatobiliary (18), lung (9), ovarian (12), prostate (12); treatment-specific: BMT
   (23), biologic response modifiers (13), neurotoxicity from systemic chemo (11),
   taxane (16); symptom-specific: anorexia/cachexia (12), diarrhea (11), fatigue (13),
   anemia/fatigue (20), endocrine (18), fecal incontinence (12), urinary incontinence
   (11); other modules/scales: Spirituality (12), FAHI (47), FAMS (59), FANLT (26)
   Scores: FACT-G (core items); TOI; module; total score

Cancer Rehabilitation Evaluation System—Short Form (CARES-SF) (Refs. 27–30)
   QOL dimensions (# items): Physical (10), Psychosocial (17), Medical interaction (4),
   Marital (6), Sexual (3), miscellaneous items contributing to overall score (19)
   Scores: total score; 5 subscale scores

EORTC, European Organization for Research and Treatment of Cancer.
* Relationship-with-doctor items no longer included in FACT scores; TOI, Trial Outcome
Index, composed of physical, functional, and cancer-specific module; CNS, central nervous
system; BMT, bone marrow transplantation; FAHI, Functional Assessment of HIV Infection;
FAMS, Functional Assessment of Multiple Sclerosis; FANLT, Functional Assessment of
Non-Life Threatening Conditions.
† The Functional Assessment of Chronic Illness Therapy (FACIT) measurement system
represents the current version (#4) of the FACT questionnaires.

QOL data should be generated by patients in a systematic standardized
fashion. Interviews can be used to obtain these data, but self-administered ques-
tionnaires are usually more practical in the multi-institution setting of clinical
trials. Selected questionnaires must be reliable and valid (5) and sensitive to
change over time (6,7); good measurement properties, along with appropriate
item content, ensure a more accurate picture of the patient’s QOL. Table 1 de-
scribes four QOL questionnaires that meet these measurement criteria and are
frequently used in cancer clinical trials. The FACT and EORTC QOL question-
naires have a core section and symptom modules specific to the disease or type
of treatment. Others, like the SF-36 and CARES-SF, can be used with any cancer
site but may require supplementation with a separate symptom measure to address
concerns about prominent symptoms and side effects.
When QOL research is conducted in many and often widely differing insti-
tutions, quality control is critical to ensure clean, complete data. The first step is
to make completion of a baseline QOL assessment a trial eligibility criterion.
Enforcement of the same requirements for both clinical and QOL follow-up data
communicates the importance of the QOL data for the trial. Ongoing training of
clinical research associates is mandatory because QOL data are still not consid-
ered routine and there is a fair degree of turnover in data management staff.
Centralized monitoring of both submission rates and the quality of data submitted
must also be considered. This effort requires substantial staff time and therefore
cannot be done without adequate resources. Even with the best quality control
procedures, submission rates for follow-up QOL questionnaires can be less than
desirable, particularly in the advanced-stage disease setting. It is precisely in the
treatment of advanced disease, however, that QOL data provide important out-
comes, that is, they can document the extent of palliation achieved by an experi-
mental treatment.
An increasing focus on QOL studies in the context of clinical trials has
resulted in accumulation of QOL data in a wide variety of conditions and patient
populations. Although this is a rich source of information, data analysis is often
complicated by problems of missing information. Patients sometimes fail to com-
plete QOL assessments because of negative events they experience, such as treat-
ment toxicities, disease progression, or death. Because not all patients are subject
to these missing observations at the same rate, especially when treatment failure
or survival rates differ between arms, the set of complete observations is not
always representative of the total group; analyses using only complete observa-
tions are therefore potentially biased.

Several methods have been developed to address this problem. They range
in emphasis from the data collection stage, where attention focuses on obtaining
the missing values, to the analysis stage, where the goal is adjustment to properly
account for the missing values. We first describe methods that are appropriate
for complete or nearly complete data and then move on to techniques for incom-
plete data sets.

II. ANALYSIS OPTIONS FOR ‘‘COMPLETE’’ DATA SETS


A. Longitudinal Methods
In general the methods described below are applicable to both repeated measures
on an individual over time and measurements of different scales or scores on a
given individual at the same point in time. Many studies of course use both de-
signs, asking patients to fill out questionnaires comprising several subscales at
repeated intervals over the course of the study.

1. Repeated-Measures ANOVA or MANOVA


Analysis of variance (ANOVA) and covariance (ANCOVA) or their multivariate
versions (MANOVA) represent a very popular class of models for continuous
QOL data. The models rely on an assumption of normally distributed data; the
total variation in the data is then attributed to between-group and within-group
portions. The groups may be defined by treatment arm, prognosis, grade, and so
on; hypotheses concerning difference among groups or between combinations of
groups may be tested.

2. Generalized Linear Models


A second general class of models is the likelihood-based generalized linear model
(GLM) (31). This framework is attractive since it accommodates a whole class of
data rather than being restricted to continuous Gaussian measurements; it allows a
unified treatment of measurements of different types, with specification of an
appropriate link function that determines the form of the mean and variance.
Estimation proceeds by solving the likelihood score equations, usually using iter-
atively reweighted least squares or Newton-Raphson algorithms. GLMs can be
fit with generalized linear interactive modeling (GLIM) (32) or with Splus, using
the glm() function (33). If missing data are random (see below), unbiased esti-
mates will be obtained.
Generalized linear mixed models are a useful extension, allowing for the
inclusion of random effects in the GLM framework. SAS macros are available
to fit these models.
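
As a concrete illustration of the GLM framework described above, the minimal sketch below fits a Gaussian GLM (identity link) to QOL scores with treatment arm and assessment time as covariates. The data frame and variable names (qol, arm, time) are invented for illustration and do not come from any trial discussed in this chapter.

```python
# A minimal sketch of fitting a generalized linear model to QOL scores.
# The data below are simulated purely for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "arm": rng.integers(0, 2, n),       # 0 = control, 1 = experimental
    "time": rng.choice([0, 3, 6], n),   # months since randomization
})
df["qol"] = 60 + 5 * df["arm"] - 1.5 * df["time"] + rng.normal(0, 10, n)

# Gaussian family with identity link; other families (e.g., Binomial for a
# dichotomized score) can be substituted via the family argument.
model = smf.glm("qol ~ arm + time", data=df, family=sm.families.Gaussian())
result = model.fit()
print(result.summary())
```

The same formula interface accepts other link functions and error distributions, which is the unified treatment of measurement types referred to in the text.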

3. Change-Score Analysis
Analysis of individual or group changes in QOL scores over time is often of
great importance in longitudinal studies. Growth curve models can be used to
accomplish this, either at a population level or using a two-stage model to allow
individual rates of change that are then dependent on characteristics such as treat-
ment group or demographic variables. Change-score analysis has the advantage
of inherently adjusting for the baseline score but must also be undertaken with
caution, as it is by nature sensitive to problems of regression to the mean (34).

4. Time-to-Event Analysis
If attainment of a particular QOL score or milestone is the basis of the experiment,
time-to-event or survival analysis methods can be applied. Once the event has
been clearly defined, the analysis tools can be directly applied. These include
Kaplan-Meier estimates of ‘‘survival’’ functions (35), Cox proportional hazard
regression models (36) to relate covariates to the probability of the event, and
logrank and other tests for differences in the event history among comparison
groups. The QOL database, however, supports few such milestones at this time.
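
To make the time-to-event idea concrete, the sketch below computes a Kaplan-Meier product-limit estimate of the "survival" function for time to a hypothetical QOL milestone (for example, first decline of 10 points from baseline). The event and censoring times are simulated for illustration only.

```python
# Hand-rolled Kaplan-Meier product-limit estimator for time to a QOL milestone.
# Times and event indicators below are simulated for illustration.
import numpy as np

def kaplan_meier(times, events):
    """Return distinct event times and the product-limit survival estimate."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    surv, t_out, s = [], [], 1.0
    for t in np.unique(times[events == 1]):
        at_risk = np.sum(times >= t)                  # still under observation at t
        d = np.sum((times == t) & (events == 1))      # milestones reached at t
        s *= 1.0 - d / at_risk
        t_out.append(t)
        surv.append(s)
    return np.array(t_out), np.array(surv)

rng = np.random.default_rng(1)
time_to_milestone = rng.exponential(scale=12.0, size=50)   # months
censoring_time = rng.uniform(0, 24, size=50)
observed = np.minimum(time_to_milestone, censoring_time)
event = (time_to_milestone <= censoring_time).astype(int)

t, s = kaplan_meier(observed, event)
for ti, si in zip(t, s):
    print(f"t = {ti:5.1f} months, S(t) = {si:.3f}")
```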

III. TYPES OF MISSING DATA PROBLEMS

As mentioned briefly above, QOL data are often subject to missingness. De-
pending on the nature of the mechanism producing the missing data, analyses
must be adjusted differently. Below, we list several types of missing data and
provide general descriptions of the mechanisms along with their more formal
technical names and terms.

A. Missing Completely at Random (MCAR)


This mechanism is sometimes termed ‘‘sporadic.’’ Missing data probabilities are
independent of both observable and unobservable quantities; observed data are
a random subsample of complete data. This type of mechanism rarely obtains in
real data.

B. Missing at Random (MAR)


Missing data probabilities are dependent on observable quantities (such as covari-
ates like age, sex, stage of disease), and the analysis can generally be adjusted
by weighting schemes or stratification. This type of mechanism can hold if sub-
jects with poor baseline QOL scores are more prone to missing values later in
the trial or if an external measure of health, such as the Karnofsky performance

status, completely explains the propensity to be missing. Because the missingness


mechanism depends on observed data, analyses can be conducted that adjust
properly for the missing observations.

C. Nonrandom, Missing Not at Random (MNAR), or


Nonignorable
Missing data probabilities are dependent on unobservable quantities, such as
missing outcome values or unobserved latent variables describing outcomes such
as general health and well-being. This type of mechanism is fairly common in
QOL research. One example is treatment-based differences in QOL compliance,
due to worse survival on one arm of the trial. Or, subjects having great difficulty
coping with disease and treatment may be more likely to refuse to complete a
QOL assessment.
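
A compact way to summarize the three mechanisms, in notation used later in this chapter, is given below; Y_obs and Y_mis denote the observed and missing portions of the QOL outcomes, R the missingness indicators, and φ the parameters of the missingness process.

\begin{aligned}
\text{MCAR:}\quad & f(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \phi) = f(R \mid \phi) \\
\text{MAR:}\quad  & f(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \phi) = f(R \mid Y_{\mathrm{obs}}, \phi) \\
\text{MNAR:}\quad & f(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \phi)\ \text{depends on}\ Y_{\mathrm{mis}}
\end{aligned}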

D. Evaluating the Missing Data Problem


As noted, clinical trials studying QOL will likely suffer somewhat from missing
QOL data, even with the best efforts at data collection. Other covariates are col-
lected routinely on patients, however, such as survival and disease status and
toxicities. Many groups include a cover sheet with the QOL questionnaire to
record the reason for incomplete assessments. This information can be used to
model the missing data and QOL processes and determine the extent and nature
of the missing data process.
To determine which methods of statistical analysis will be appropriate, the
analyst must first determine the patterns and amount of missing data and identify
the mechanisms that generate missing data. Rubin (37) addressed the assumptions
necessary to justify ignoring the missing data mechanism and established that
the extent of ignorability depends on the inferential framework and the research
question of interest. More precisely, likelihood-based and Bayesian inference are
valid under both MCAR and MAR, but non-likelihood-based inference is valid
under MCAR only. The research question is relevant when considering condi-
tional analyses given complete data; results assuming MCAR and MAR may
differ. Identification of missing data mechanisms in QOL research proceeds
through two complementary avenues: collecting as much additional patient infor-
mation as possible, and applying simple graphical techniques and hypothesis
testing to distinguish missing data processes.
Graphical presentations can be crucial as a first step in elucidating the rela-
tionship of missing data to the outcome of interest and providing an overall sum-
mary of results that is easily understood by nonstatisticians. A clear picture of
the extent of missing QOL assessments is necessary both for selection of the
appropriate methods of analysis and for honest reporting of the trial with respect

to reliability and generalizability. In clinical trials, this means summarizing the


proportions of patients in whom assessment is possible (e.g., surviving patients
still on study) and then the pattern of assessments among these patients. Machin
and Weeden (38) combine these two concepts in Figure 1, using the familiar
Kaplan-Meier plot to indicate survival rates and a simple table describing QOL
assessment compliance.

Figure 1 Kaplan-Meier estimates of the survival curves of patients with small cell lung
cancer by treatment group (after MRC Lung Cancer Working Party, 1996). The times at
which QOL assessments were scheduled are indicated beneath the time axis. The panel
indicates the QOL assessments made for the seven scheduled during the first 6 months
as a percentage of those anticipated from the currently living patients. (From Ref. 38,
copyright John Wiley & Sons Limited, reproduced with permission.)

For this study of palliative treatment for patients with
small cell lung cancer and poor prognosis, the Kaplan-Meier plot illustrates why
the expected number of assessments is reduced by 60% at the time of the final
assessment. The table further indicates the increase in missing data even among
surviving subjects, from 25% at baseline to 71% among the evaluable patients
at 6 months. If the reasons for missing assessments differ over time or across
treatment groups, it may be necessary to present additional details about the miss-
ing data.
A second step is describing the missing data mechanism, especially in rela-
tion to patients’ QOL. A useful technique is to present the available data sepa-
rately for patients with different amounts of and reasons for drop-out. This is
illustrated in Figure 2, due to Troxel (39), where estimates of average symptom
distress in patients with advanced colorectal cancer are presented by reason for
drop-out and duration of follow-up. Higher symptom distress is reported by pa-
tients who drop out due to death or illness, and the worsening of symptom status
over time is more severe for these patients as well. Patients with a decreasing
QOL score may also be more likely to drop out, as demonstrated by Curran et
al. (40), where a change score between two previous assessments was predictive
of drop-out.
Finally, graphical presentations can convey results so that readers may indi-
vidually balance the importance of early versus late differences among treatment
arms or across the different domains of QOL. A particularly simple but informa-
tive display is given in Figure 3, due to Coates and Gebski (41). This illustrates
the differences in the physician ratings of QOL between patients who completed
and did not complete the QOL self-assessment. The average differences, with
95% confidence intervals, clearly communicate the consistent trends across all
the time points and the statistical significance at specific points in time. Since
baseline QOL scores are often predictive of survival (42), the usual Kaplan-Meier
plots, stratified by baseline QOL, can be very informative.

1. Comparing MCAR and MAR


Assuming a monotone pattern of missing data, Diggle (43) and Ridout (44) pro-
posed methods to compare MCAR and MAR drop-out. The former proposal in-
volves testing whether scores from patients who drop out immediately after a
given time point are a random sample of scores from all available patients at
that assessment. The latter proposal centers on logistic regression analysis to test
whether observed covariates affect the probability of dropout.
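
A hedged sketch of the logistic-regression approach is given below: the dropout indicator at a given assessment is regressed on observed quantities such as the previous QOL score and baseline covariates, and a nonzero coefficient for observed data argues against MCAR in favor of MAR. The variable names (dropout, prev_qol, age) and the data are invented for illustration.

```python
# Ridout-style check: regress the dropout indicator on observed covariates.
# A significant effect of observed data on dropout is evidence against MCAR.
# All data below are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "prev_qol": rng.normal(60, 15, n),   # last observed QOL score
    "age": rng.normal(62, 8, n),
})
# Simulate dropout that depends on the previous QOL score (an MAR mechanism).
logit_p = -0.5 - 0.04 * (df["prev_qol"] - 60)
p_drop = 1.0 / (1.0 + np.exp(-logit_p))
df["dropout"] = rng.binomial(1, p_drop)

fit = smf.glm("dropout ~ prev_qol + age", data=df,
              family=sm.families.Binomial()).fit()
print(fit.summary())   # a nonzero prev_qol coefficient is inconsistent with MCAR
```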

2. Testing for MNAR


As mentioned earlier, if likelihood or Bayesian inference is used, then distin-
guishing between MCAR and MAR is often not the primary concern.

Figure 2 Average scores by type and length of follow-up: Symptom Distress Scale.
——, lost-death; - - - , lost-illness; – – –, lost-other; — —, complete follow-up. (From Ref.
39, copyright John Wiley & Sons Limited, reproduced with permission.)

Recall that if either MCAR or MAR holds, the missing data mechanism depends on observed
quantities only and inferences on Y can be based solely on the observed data.
The main issue for likelihood or Bayesian inference is distinguishing between
MAR and MNAR. Unfortunately, testing the assumptions of MAR against a hy-
pothesis of MNAR is not trivial; such a procedure rests on strong assumptions
that are themselves untestable (40).

Figure 3 ANZ 8614: difference in the Quality of Life Index (QLI) score (assessed by
the physician) between patients who did or did not comply with self-assessment of QOL
at the relevant time point, indicated in weeks after randomization. Plots show mean differ-
ence and 95% confidence intervals. The possible range of difference in scores is 10. Nega-
tive scores indicate that patients who did not comply with self-assessment of QOL had
worse QOL as assessed by the physician using the QLI. *Number not complying, number
complying with self-assessment. (From Ref. 41, copyright John Wiley & Sons Limited,
reproduced with permission.)

When fitting a nonignorable model, certain
assumptions are made in the specification of the model about the relationship
between the missing data process and unobserved data. These assumptions are
fundamentally untestable. Molenberghs et al. (45) provide examples where differ-
ent models produce very similar fits to the observed data but yield completely
different predictions for the unobserved data. Little (46), discussing pattern-mix-
ture models, suggests that underidentifiability is a serious problem with MNAR
missing data models and that problems may arise when estimating the parameters
of the missing data mechanism simultaneously with the parameters of the under-
lying data model. Similar problems may exist in the selection model framework
(47).
Because of the difficulties in identifying the missing data mechanism, anal-
ysis of repeated measures with missing data is not trivial. This is especially true
for QOL assessments where data may be missing for several reasons. If a suffi-
cient amount of data is collected relating to why QOL questionnaires have not
been completed, however, analysts will have a more solid basis for missing data
models.

E. Special Problems
Several special scenarios arise with respect to QOL data collection, and a few
are described below. Although these issues are of some concern in their own
right, they become increasingly problematic when there is differential missing
data between two treatment arms. Differential rates of illness, death, or relapse
will generally result in differing amounts of missing data for the study arms,
making a valid comparison of study treatments extremely difficult with respect
to the QOL end point.

1. Missing Data due to Illness


It is perhaps inevitable in a clinical trial studying QOL that some subjects will
be at certain times too ill to complete their QOL assessments. Since health status
and QOL are almost certainly not independent, this results in nonignorable miss-
ingness, as described above. To facilitate modeling of the missing data process
in this situation, as much information as possible should be collected regarding
the patient’s health status, toxicity episodes, and clinical status; proxy measures
of the patient’s QOL can also be useful, although the poor correlation between
patient and proxy measures is well documented (48).

2. Missing Data due to Death


Obviously subjects who die before completion of the study will have shortened
QOL vectors. Because vital status and QOL are almost certainly not independent,
this too results in nonignorable ‘‘missing data.’’ This situation is different from
that of missing data due to illness described above, however, since in this case
the patient’s potentially available data has been simply truncated rather than not
observed. It makes little sense to discuss what a patient’s QOL score would have
been after death, had it been observed. In this situation, conditional analyses of
QOL, given survival up to a certain point, may be the most appropriate. Models
that jointly assess both the QOL and survival end points can also be useful.

3. Enforced Missingness due to Study Constraints


In some trials, patients who fail or relapse are taken off-study; in general they
are still followed for vital status. Although every effort should be made to com-
plete the QOL assessment schedule, subsequent QOL assessments are not always
obtained. As with patients who have missing QOL data due to illness, this can
result in nonignorable missing data. It is potentially less severe, since all patients
who fail are taken off-study rather than some being self-selected for inability to
complete a QOL assessment. Nonetheless, differential failure rates on treatment
arms can have a devastating effect on the analysis of the QOL data.

IV. METHODS

Several extensions to standard mixed models have been proposed in the context
of longitudinal measurements in clinical trials. Zee (49) proposed growth curve
models where the parameters relating to the polynomial in time are allowed to
differ according to the various health states experienced by the patient (e.g., on
treatment, off treatment, postrelapse, etc.). The model may also contain other
covariates, such as the baseline response value or other relevant clinical informa-
tion, and these may be either constant or varying across health states. This method
requires that the missing data be MAR and may be fit with standard packages
by simply creating an appropriate variable to indicate health state; in essence it
is a type of pattern-mixture model (see below). Care must be taken to ensure
that enough patients remain in the various health states to properly estimate the
extra parameters.
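
One way to implement this health-state growth-curve idea in standard software, assuming MAR as noted above, is sketched below: a categorical health-state indicator interacts with time in a linear mixed model with a random intercept per patient. The variable names (qol, months, state, patient) and the data are hypothetical placeholders, not from any study cited here.

```python
# Sketch of a growth-curve model whose time trend differs by health state,
# fit as a linear mixed model with a random intercept per patient.
# All data below are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
rows = []
for pid in range(80):
    base = rng.normal(65, 10)
    for months in (0, 3, 6, 9):
        state = "on_treatment" if months < 6 else "off_treatment"
        slope = -1.0 if state == "on_treatment" else -2.5
        rows.append({"patient": pid, "months": months, "state": state,
                     "qol": base + slope * months + rng.normal(0, 5)})
df = pd.DataFrame(rows)

# Time-by-state interaction allows a different slope within each health state.
model = smf.mixedlm("qol ~ months * state", data=df, groups=df["patient"])
result = model.fit()
print(result.summary())
```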
Schluchter (50) proposed a joint mixed effects model for the longitudinal
assessments and the time to drop-out. Suppose the time to drop-out, or censoring,
is denoted by Ti. The joint model allows the Ti (or a function of the Ti, in this
case the log) to be correlated with the random effects bi. The model is as follows:

\begin{pmatrix} b_i \\ \log(T_i) \end{pmatrix}
\sim N\!\left( \begin{pmatrix} 0 \\ \mu_t \end{pmatrix},
\begin{pmatrix} B & \sigma_{bt} \\ \sigma_{bt}' & \sigma_t^{2} \end{pmatrix} \right)

For example, patients with steeper rates of decline in measurements over time
(as measured by the random effects bi) may be more likely to fail early. This
model allows MNAR data in the sense that the time of drop-out is allowed to
depend, through the covariance parameter σbt, on the rate of change in the under-
lying measurements. If there are intermittent patterns of missing data, these must
be assumed to be MCAR. Software to fit this model is not readily available.

A. Generalized Estimating Equations (GEEs)


GEEs (51) provide a framework to treat disparate kinds of data in a unified way.
In addition, they require specification of only the first two moments of the re-
peated measures, rather than the likelihood. Instead, estimates are obtained by
solving an estimating equation of the following form:
U = \sum_{i=1}^{n} D_i' V_i^{-1} (Y_i - \mu_i) = 0

Here µi = E(Yi | Xi, β) and Di = ∂µi/∂β are the usual mean and derivative functions
and Vi is a working correlation matrix. For Gaussian measurements, the estimat-
ing equations resulting from the GEE are equivalent to the usual score equations

obtained from a multivariate normal maximum likelihood model; the same esti-
mates will be obtained from either method. Software is available in the form of
an SAS macro (52). GEEs produce unbiased estimates for data that are MCAR.
Extensions to the GEE do exist for data that are MAR: Weighted GEEs will
produce unbiased estimates provided the weights are estimated consistently
(53,54). When the missingness probabilities depend only on observed covariates,
such as the stage of disease, or responses, such as the baseline QOL score, a
logistic or probit model can be used to estimate missingness probabilities for
every subject; the weights used in the analysis are then the inverses of these
estimated probabilities. Robins et al. (54) discuss these equations and their prop-
erties in detail; presented simply, the estimating equation takes the form

U = \sum_{i=1}^{n} D_i' V_i^{-1}\, \operatorname{diag}\!\left(\frac{R_i}{\hat{\pi}_i}\right) (Y_i - \mu_i) = 0

where πij = P(Rij = 1 | Y0i, Wi, α), π̂ij is an estimate of πij, and diag(Q) indicates
a matrix of zeroes with the vector Q on the diagonal. Although software exists
to fit GEEs, additional programming is required to fit a weighted version.
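
For an identity link and an independence working correlation, the weighted estimating equation displayed above reduces to weighted least squares with weights Rij/π̂ij, which can be coded directly. The sketch below, with hypothetical variable names and, for simplicity, one assessment per subject, first estimates the observation probabilities with a logistic model of an observed baseline covariate and then solves the weighted normal equations.

```python
# Inverse-probability-weighted GEE with an identity link and independence
# working correlation: the estimating equation reduces to weighted least
# squares with weights R / pi_hat. Data and names are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 400
baseline_qol = rng.normal(60, 12, n)      # observed covariate driving missingness
arm = rng.integers(0, 2, n)
y_full = 55 + 6 * arm + 0.3 * baseline_qol + rng.normal(0, 8, n)

# Missingness depends only on the observed baseline score (an MAR mechanism).
p_obs = 1.0 / (1.0 + np.exp(-(-1.0 + 0.05 * baseline_qol)))
r = rng.binomial(1, p_obs)

# Step 1: estimate observation probabilities with a logistic model.
Z = sm.add_constant(baseline_qol)
pi_hat = sm.GLM(r, Z, family=sm.families.Binomial()).fit().predict(Z)
w = r / pi_hat                            # zero weight for missing responses

# Step 2: solve the weighted normal equations sum_i X_i' W_i (Y_i - X_i b) = 0.
X = sm.add_constant(np.column_stack([arm, baseline_qol]))
y = np.where(r == 1, y_full, 0.0)         # unobserved responses carry zero weight
XtWX = X.T @ (w[:, None] * X)
XtWy = X.T @ (w * y)
beta_hat = np.linalg.solve(XtWX, XtWy)
print("IPW estimates (intercept, arm, baseline):", beta_hat)
```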

B. Joint Modeling of Measurement and Missingness


Processes
One can model the joint distribution of the underlying complete data Yi and the
missingness indicators Ri. If conditioning arguments are used, two types of mod-
els can result; the selection model is concerned with f(Yi)f(Ri|Yi), whereas the
pattern mixture model is concerned with f(Ri)f(Yi|Ri). The two approaches are
discussed and compared in detail by Little (46). Pattern mixture models proceed
by estimating the parameters of interest within strata defined by patterns of and/
or reasons for missingness and then by combining the estimates. Selection models
proceed by modeling the complete data and then modeling the behavior of the
missingness probabilities conditional on the outcome data.
Models for continuous data have been proposed by Diggle and Kenward
(47) and Troxel et al. (55). Although the computations can be burdensome, the
approach will produce unbiased estimates even in the face of MNAR data, pro-
vided that both parts of the model are correctly specified. The selection models
assume that the complete underlying responses are multivariate normal; any para-
metric model, such as the logistic, can be used for the missing data probabilities.
The type of missingness mechanism is controlled by the covariates and/or re-
sponses that are included in the model for the missingness probabilities. In this
example, the probabilities may depend on the current, possibly unobserved mea-
surement Yij, implying that the missing data may be MNAR; it is possible to
allow dependence on previous values as well. The observed data likelihood is

obtained by integrating the complete data likelihood over the missing values.
Estimates are usually obtained through direct maximization of the likelihood sur-
face; numerical integration is generally required. Once estimates are obtained,
inference is straightforward using standard likelihood techniques. This method
allows analysis of all the data, even when the missingness probabilities depend
on potentially unobserved values of the response. The estimates are also likely
to depend on modeling assumptions, most of which are untestable in the presence
of MNAR missing data. Despite these drawbacks, these models can be very useful
for investigation and testing of the missingness mechanism. In addition, the bias
that results from assuming the wrong type of missingness mechanism may well
be more severe than the bias that results from misspecification of a full maximum
likelihood model. Software to fit the Diggle and Kenward (47) model is available
(56).
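
In symbols, and consistent with the description above, the observed-data likelihood contribution of subject i in a selection model integrates the complete-data model over the missing responses; suppressing covariates, a minimal version is

L_i(\theta, \psi) = \int f(y_{i,\mathrm{obs}}, y_{i,\mathrm{mis}}; \theta)\,
                         f(r_i \mid y_{i,\mathrm{obs}}, y_{i,\mathrm{mis}}; \psi)\, dy_{i,\mathrm{mis}},

where θ indexes the (e.g., multivariate normal) model for the complete responses and ψ the (e.g., logistic) model for the missingness indicators. When f(r | y) does not depend on y_mis, the missingness factor comes out of the integral and likelihood inference about θ can ignore the missing data mechanism, which is the ignorability result cited earlier.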
For discrete data, methods allowing for nonignorable missing data have
been proposed by Fay (57), Baker and Laird (58), and Conaway (59). Here, log-
linear models are used for the joint probability of outcome and response variables
conditional on covariates. The models can be fit using the EM algorithm (60),
treating the parameters of the missingness model as a nuisance.

C. Multiple Imputation
Imputation, or ‘‘filling-in,’’ of data sets is a way of converting an incomplete to
a complete data set. This method is attractive because once the imputation is
conducted, the methods for complete data described in Section II can be applied.
Simple imputation consists of substituting a value for the missing observations,
such as the mean of the existing values, and then adjusting the analysis to account
for the fact that the substituted value was not obtained with the usual random
variation.
Multiple imputation (61) is similar in spirit to simple imputation but with
added safeguards against underestimation of variance due to substitution. Several
data sets are imputed, and the analysis in question is conducted on each of them,
resulting in a set of estimates obtained from each imputed data set. These several
results are then combined to obtain final estimates based on the multiple set.
Multiple imputation can be conducted in the presence of all kinds of miss-
ingness mechanisms. The usual drawback with respect to nonignorable miss-
ingness applies, however. A model is required to obtain the imputed values, and
in the presence of nonignorable missingness, the resultant estimates are sensitive
to the chosen model; even worse, the assumptions governing that model are gen-
erally untestable due to the very nature of the missing data. Finally, some question
whether it is appropriate to impute values for subjects whose data are missing
because of early death. Such an approach yields an estimate of the study popula-
tion’s experience had the entire group remained on study for the entire period of
QOL assessment; that is, QOL results would then generalize to new patients who
are candidates for the treatments and whose length of survival is unknown.
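
For completeness, a small sketch of the combining step is given below: after fitting the same analysis to each of M imputed data sets, the point estimates are averaged and the total variance combines within- and between-imputation components (Rubin's rules). The numbers fed to the function are placeholders.

```python
# Combine estimates from M imputed data sets using Rubin's rules:
# pooled estimate = mean of the estimates; total variance = mean within-
# imputation variance + (1 + 1/M) * between-imputation variance.
import numpy as np

def pool_rubin(estimates, variances):
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    u_bar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = u_bar + (1.0 + 1.0 / m) * b     # total variance
    return q_bar, t

# Placeholder results from, say, M = 5 imputed analyses of a treatment effect.
est = [2.1, 1.8, 2.4, 2.0, 1.9]
var = [0.50, 0.55, 0.48, 0.52, 0.51]
q, t = pool_rubin(est, var)
print(f"pooled estimate = {q:.2f}, standard error = {np.sqrt(t):.2f}")
```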

V. AVOIDING PITFALLS: SOME COMMONLY


USED SOLUTIONS
A. Substitution Methods
In general, methods that rely on substitution of some value, determined in a vari-
ety of ways, are subject to bias and heavily subject to assumptions made in ob-
taining the substituted value. For these reasons, they should not be used to pro-
duce a primary analysis on which treatment or other decisions are based. One
of the most serious problems with substitution methods, especially when the
worse score method is used, is that they can seriously damage the psychometric
properties of a measure. These properties, such as reliability and validity, rely
on variations in the scores to hold. A second problem is that in substituting values
and then conducting analyses based on that data, the variance of estimates will
be underestimated, since the missing values, had they been observed, would carry
with them random variation, which the substituted values do not. Substitution
methods can be useful, however, in conducting sensitivity analyses to determine
the extent to which the analysis is swayed by differing data sets.

1. Worst Score
This method is often used in sensitivity analyses, since the implicit assumption
is that the subjects who did not submit an assessment are all as badly off as they
can possibly be with respect to QOL. This is usually the most extreme assumption
possible, so an analysis robust to worst-score substitution has a strong defense.
The comment raised above regarding the psychometric measurement properties
warrants caution, however.

2. Last Value Carried Forward


This substitution method tries to use each patient’s score to provide information
about the imputed value. It assumes, however, that subjects who drop out do not
have a changing QOL score, when in practice often it is the subjects in rapid
decline who tend to drop out prematurely. For this reason, last value carried
forward should be used with extreme care, if at all.
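
Mechanically, last value carried forward simply propagates each patient's most recent observed score into subsequent missing assessments, as in the short pandas sketch below with an invented data frame; the caveats noted above apply with full force.

```python
# Last value carried forward within patient: shown for illustration only,
# given the caveats discussed in the text. The data frame below is invented.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2, 2],
    "month":   [0, 3, 6, 0, 3, 6],
    "qol":     [70.0, 65.0, np.nan, 55.0, np.nan, np.nan],
})
df = df.sort_values(["patient", "month"])
df["qol_locf"] = df.groupby("patient")["qol"].ffill()   # carry last score forward
print(df)
```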

3. Average Score
Use of the average score, either within a patient or within a group of patients
(such as those on the same treatment arm), is more closely related to classic

imputation methods. Again, it assumes that the imputed values are no different
from the observed values, but it does not necessarily force each subject’s score
to remain constant.

B. Adjusted Survival Analyses


Some authors (62,63) proposed analyses in which survival is treated as the pri-
mary outcome, but it is adjusted for the QOL experience of the patients. This is
an extremely appealing idea, for it clarifies the inherent trade-off between length
and quality of life that applies to most patients. It can be difficult to implement
satisfactorily in practice, however, because of the difficulty of obtaining the ap-
propriate values with which to weight survival in different periods. The two meth-
ods described below have gained some popularity.

1. Quality-adjusted Life Years


This method consists of estimating a fairly simple weighted average, in which
designated periods of life are weighted according to some utility describing QOL.
Because utilities are obtained using lengthy interviews or questionnaires focusing
on time trade-offs or standard gambles, investigators commonly substitute utili-
ties obtained from some standard population rather than any information obtained
directly from the patient. This renders the analysis largely uninterpretable, in our
view.

2. Q-TWiST
Q-TWiST (64), or quality-adjusted time without symptoms and toxicity, is a more
detailed method of adjustment, though still one that relies on utilities. The pa-
tient’s course through time is divided into intervals in which the patient expe-
riences toxicity due to treatment, toxicity due to disease (i.e., brought on by re-
lapse), and no toxicity. These intervals may be somewhat arbitrary, determined
not by the patient’s actual experience with toxicity but by a predefined expecta-
tion of the average interval in which patients with a given disease receiving a
given treatment experience problems due to treatment, the average interval they
spend disease-free after the end of therapy, and the average time until relapse.
To compound this arbitrariness, utilities for each period are chosen by the analyst.
For example, time spent suffering from treatment-induced toxicities may be rated
at 50% of perfect health. This results in an analysis that reflects only a small
amount of patient-derived data and a large number of parameters chosen by the
investigator. Data from patient rating scales and Q-TWiST analyses can differ
(65).
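
As an arithmetic illustration of the construction, a sketch of a Q-TWiST calculation is shown below, with made-up interval durations and analyst-chosen utilities of exactly the kind criticized above: the toxicity (TOX) and relapse (REL) periods are down-weighted by their utilities, while time without symptoms or toxicity (TWiST) receives weight 1.

```python
# Illustrative Q-TWiST computation: weighted sum of mean time in three health
# states. Durations (months) and utilities below are arbitrary placeholders.
mean_months = {"TOX": 3.0, "TWiST": 14.0, "REL": 5.0}
utilities = {"TOX": 0.5, "TWiST": 1.0, "REL": 0.3}   # chosen by the analyst

q_twist = sum(utilities[state] * mean_months[state] for state in mean_months)
print(f"Q-TWiST = {q_twist:.1f} quality-adjusted months")
# 0.5*3 + 1.0*14 + 0.3*5 = 17.0 quality-adjusted months
```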

REFERENCES

1. Moinpour CM, Feigl P, Metch B, Hayden KH, Meyskens FL Jr, Crowley J. Quality
of life end points in cancer clinical trials: review and recommendations. J Natl Can-
cer Inst 1989; 81:485–495.
2. Nayfield S, Ganz PA, Moinpour CM, Cella D, Hailey B. Report from a National
Cancer Institute (USA) workshop on quality of life assessment in cancer clinical
trials. Quality of Life 1992; 1:203–210.
3. National Cancer Institute (US). Quality of life in clinical trials. Proceedings of a
workshop held at the National Institutes of Health; March 1–2, 1995. Bethesda, MD:
National Institutes of Health, 1996.
4. Sugarbaker PH, Barofsky I, Rosenberg SA, Gianola FJ. Quality of life assessment
of patients in extremity sarcoma clinical trials. Surgery 1982; 91:17–23.
5. Nunnally J. Psychometric Theory. New York: McGraw-Hill, 1978.
6. Kirshner B, Guyatt GH. A methodological framework for assessing health indices.
J Chronic Dis 1985; 1:27–36.
7. Guyatt GH, Deyo RA, Charlson M, Levine MN, Mitchell A. Responsiveness and
validity in health status measurement: a clarification. J Clin Epidemiol 1989; 42:
403–408.
8. Stewart AL, Ware JE Jr. Measuring Functioning and Well-Being: The Medical Out-
comes Approach. Durham and London: Duke University Press, 1992.
9. Ware JE Jr, Sherbourne CD. The MOS 36-item short-form health survey (SF-36).
I. Conceptual framework and item selection. Med Care 1992; 30:473–483.
10. McHorney C, Ware J, Raczek A. The MOS 36-item short-form health survey (SF-
36). II. Psychometric and clinical tests of validity in measuring physical and mental
health constructs. Med Care 1993; 31:247–263.
11. McHorney C, Ware J, Lu R, Sherbourne C. The MOS 36-item short-form health
survey (SF-36). III. Tests of data quality, scaling assumptions, and reliability across
diverse patient groups. Med Care 1994; 32:40–46.
12. Ware JE Jr, Snow KK, Kosinski MA, Gandek B. SF-36 Health Survey: Manual and
Interpretation Guide. Boston: Nimrod Press, 1993.
13. Ware JE Jr, Kosinski M, Keller SD. SF-36 Physical and Mental Health Summary
Scales: A User’s Manual. Boston: The Health Institute, New England Medical Cen-
ter, 1994.
14. Ware JE Jr. The SF-36 Health Survey. In: Spilker B, ed. Quality of Life and Pharma-
coeconomics in Clinical Trials. 2nd ed. Philadelphia: Lippincott-Raven, 1996.
15. Aaronson NK, Bullinger M, Ahmedzai S. A modular approach to quality of life
assessment in cancer clinical trials. Recent Results Cancer Res 1988; 111:231–249.
16. Aaronson NK, Ahmedzai S, Bergman B, et al. The European Organization for Re-
search and Treatment of Cancer QLQ-C30: a quality of life instrument for use in
international clinical trials in oncology. J Natl Cancer Inst 1993; 85:365–373.
17. Bergman B, Aaronson NK, et al. The EORTC QLQ-LC13: a modular supplement
to the EORTC QLQ-C30 for use in lung cancer trials. Eur J Cancer 1994; 30A:
635–642.
18. Bjordal K, Ahlner EM, et al. Development of an EORTC questionnaire module to

be used in quality of life assessments in head and neck cancer patients. Acta Oncol
1994; 33:879–885.
19. Sprangers M, Cull A. The European Organization for Research and Treatment of
Cancer approach to quality of life: guidelines for developing questionnaire modules.
Qual Life Res 1993; 2:287–295.
20. EORTC Quality of Life Study Group. EORTC QLQ-C30 Scoring Manual. Brussels:
EORTC Quality of Life Study Group, 1997.
21. EORTC Quality of Life Study Group. EORTC QLQ-C30 Reference Values. Brus-
sels: EORTC Quality of Life Study Group, 1998.
22. Cella D, Tulsky DS, Gray G, Sarafian B, Linn E, Bonomi AE, et al. The functional
assessment of cancer therapy scale: development and validation of the general mea-
sure. J Clin Oncol 1993; 11:570–579.
23. Weitzner M, Meyers C, Gelke C, Byrne K, Cella DF, Levin V. The Functional
Assessment of Cancer Therapy (FACT) scale. Development of a brain subscale
and revalidation of the general version (FACT-G) in patients with primary brain
tumors. Cancer 1995; 75:1151–1161.
24. Brady MJ, Cella DF, Mo F, Bonomi AE, Tulsky DS, Lloyd SR, Deasy S, Cobleigh
M, Shiomoto G. Reliability and validity of the Functional Assessment of Cancer
Therapy–Breast Quality-of-Life Instrument. J Clin Oncol 1997; 15:974–986.
25. D’Antonio LL, Zimmerman GJ, Cella DF, Long S. Quality of life and functional
status measures in patients with head and neck cancer. Arch Otolaryngol Head Neck
Surg 1996; 122:482–487.
26. Cella DF. Manual of the Functional Assessment of Chronic Illness Therapy (FACIT
Scales)—Version 4. Outcomes Research and Education (CORE). Evanston North-
western Healthcare and Northwestern University, 1997.
27. Schag CAC, Ganz PA, Heinrich RL. Cancer Rehabilitation Evaluation System—
Short Form (CARES-SF): a cancer specific rehabilitation and quality of life instru-
ment. Cancer 1991; 68:1406–1413.
28. Heinrich RL, Schag CAC, Ganz PA. Living with cancer: the cancer inventory of
problem situations. J Clin Psychol 1984; 40:972–980.
29. Schag CAC, Heinrich RL, Ganz PA. The cancer inventory of problem situations:
an instrument for assessing cancer patients’ rehabilitation needs. J Psychosoc Oncol
1983; 1:11–24.
30. Schag CAC, Heinrich RL. Development of a comprehensive quality of life measure-
ment tool: CARES. Oncology 1990; 4:135–138.
31. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. London: Chapman
and Hall. 1989.
32. Baker RJ, Nelder JA. The GLIM System, Release 3, Generalized Linear Interactive
Modeling. Oxford: Numerical Algorithms Group. 1978.
33. Hastie TJ, Pregibon D. Generalized linear models. In: Chambers JM, Hastie TJ, eds.
Statistical Models in S. London: Chapman and Hall, 1993.
34. Fleiss JL. The Design and Analysis of Clinical Experiments. New York: John Wiley
and Sons, 1986.
35. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am
Stat Assoc 1958; 53:457–481.

36. Cox DR. Regression models and life tables [with discussion]. J R Stat Soc B 1972;
34:187–220.
37. Rubin DB. Inference and missing data. Biometrika 1976; 63:581–592.
38. Machin D, Weeden S. Suggestions for the presentation of quality of life data from
clinical trials. Stat Med 1998; 17:711–724.
39. Troxel AB. A comparative analysis of quality of life data from a Southwest Oncol-
ogy Group randomized trial of advanced colorectal cancer. Stat Med 1998; 17:767–
779.
40. Curran D, Bacchi M, Schmitz SFH, Molenberghs G, Sylvester RJ. Identifying the
types of missingness in quality of life data from clinical trials. Stat Med 1998; 17:
697–710.
41. Coates A, Gebski VJ. Quality of life studies of the Australian New Zealand Breast
Cancer Trials Group: approaches to missing data. Stat Med 1998; 17:533–540.
42. Moinpour CM, Savage MJ, Troxel AB, Lovato LC, Eisenberger M, Veith RW, Hig-
gins B, Skeel R, Yee M, Blumenstein BA, Crawford ED, Meyskens FL Jr. Quality
of life in advanced prostate cancer: results of a randomized therapeutic trial. JNCI
1998; 90:1537–1544.
43. Diggle PJ. Testing for random dropouts in repeated measurements data. Biometrics
1989; 45:1255–1258.
44. Ridout M. Testing for random dropouts in repeated measurement data. Biometrics
1991; 47:1617–1621.
45. Molenberghs G, Goetghebeur EJT, Lipsitz SR. Non-random missingness in categori-
cal data: strengths and limitations. Am Statist 1999; 53:110–118.
46. Little RJA. Modeling the drop-out mechanism in repeated-measures studies. JASA
1995; 90:1112–1121.
47. Diggle P, Kenward M. Informative drop-out in longitudinal analysis [with discus-
sion]. Appl Stat 1994; 43:49–93.
48. Sprangers MAG, Aaronson NK. The role of health care providers and significant
others in evaluating the quality of life of patients with chronic disease: a review.
J Clin Epidemiol 1992; 45:743–760.
49. Zee BC. Growth curve model analysis for quality of life data. Stat Med 1998; 17:
757–766.
50. Schluchter MD. Methods for the analysis of informatively censored longitudinal
data. Stat Med 1992; 11:1861–1870.
51. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models.
Biometrika 1986; 73:13–22.
52. Groemping U. GEE: a SAS macro for longitudinal data analysis. Technical Report,
Fachbereich Statistik, Universitaet Dortmund, Germany, 1994.
53. Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression mod-
els with missing data. JASA 1995; 90:122–129.
54. Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models
for repeated outcomes in the presence of missing data. JASA 1995; 90:106–121.
55. Troxel AB, Harrington DP, Lipsitz SR. Analysis of longitudinal data with non-ignor-
able non-monotone missing values. Appl Stat 1998;47:425–438.
56. Smith DM. The Oswald Manual. Technical Report, Statistics Group, University of
Lancaster, Lancaster, England, 1996.

57. Fay RE. Causal models for patterns of nonresponse. JASA 1986; 81:354–365.
58. Baker SG, Laird NM. Regression analysis for categorical variables with outcome
subject to nonignorable nonresponse. JASA 1988; 83:62–69.
59. Conaway MR. The analysis of repeated categorical measurements subject to nonig-
norable nonresponse. JASA 1992; 87:817–824.
60. Dempster AP, Laird NM, Rubin DB. Maximum likelihood estimation from incom-
plete data via the EM algorithm [with discussion]. J R Stat Soc B 1977; 39:1–38.
61. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley
and Sons, 1987.
62. Glasziou PP, Simes RJ, Gelber RD. Quality adjusted survival analysis. Stat Med
1990; 9:1259–1276.
63. Goldhirsch A, Gelber RD, Simes J, Glasziou P, Coates A. Costs and benefits of
adjuvant therapy in breast cancer: a quality-adjusted survival analysis. JCO 1989;
7:36–44.
64. Gelber RD, Cole BF, Goldhirsch A. Comparing treatments using quality-adjusted
survival: the Q-TWiST method. Am Stat 1995; 49:161–169.
65. Fairclough DL, Fetting JH, Cella D, Wonson W, Moinpour CM. Quality of life and
quality adjusted survival for breast cancer patients receiving adjuvant therapy. Qual
Life Res 2000; 8:723–731.
16
Economic Analysis of Cancer
Clinical Trials

Gary H. Lyman
Albany Medical College, and State University of New York at Albany School of
Public Health, Albany, New York

I. INTRODUCTION
A. Costs of Cancer Care
Health care expenditures in the United States have risen dramatically, now ex-
ceeding one trillion dollars annually and constituting 14% of the gross domestic
product (Fig. 1) (1,2). Approximately 10% of health care expenditures are allo-
cated for cancer care, totaling more than $100 billion annually. More than 90% of
medical costs for cancer are associated with five diagnoses: breast cancer (24%),
colorectal cancer (24%), lung cancer (18%), prostate cancer (18%), and bladder
cancer (8%) (3,4). Hospital care represents the largest single cost component
accounting for approximately 50% of total cancer care costs. Other major compo-
nents of health care costs include physician/professional costs (25%) and pharma-
ceutical and home care costs (approximately 10% each). Cancer care costs vary
over time and are generally greater during the period immediately after diagnosis
and during the last few months before death (3).

B. Health Care Outcome Measures


There is increasing interest in the assessment of health care outcomes beyond
traditional clinical measures of efficacy. Alternative measures of interest include
health-related quality of life and economic outcomes.

Figure 1 Annual health care expenditures. Annual U.S. health care expenditures for
selected years from 1960 to 1998 reported by the Health Care Financing Administration.
Total annual expenditures are reported in units of $100 billion, whereas per capita expendi-
tures are presented in $ thousands. Total U.S. health expenditures projected for the year
2007 are $2.1 trillion.

The primary economic
measure in most economic studies is the mean cost or cost difference between
treatment groups. To facilitate the comparison of different treatment strategies,
combined measures have been developed that bring together clinical, quality of
life, and economic outcomes into summary measures such as the quality-adjusted
life year (QALY) and cost-effectiveness and cost-utility ratios.
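
In the usual notation (not specific to this chapter), these ratios compare an experimental strategy (subscript 1) with a standard strategy (subscript 0), with C denoting mean cost and E mean effectiveness (life years for cost-effectiveness, QALYs for cost-utility):

\text{ICER} = \frac{\bar{C}_1 - \bar{C}_0}{\bar{E}_1 - \bar{E}_0},
\qquad
\text{cost-utility ratio} = \frac{\bar{C}_1 - \bar{C}_0}{\overline{\mathrm{QALY}}_1 - \overline{\mathrm{QALY}}_0}.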

C. Economic Analyses
A number of different types of economic evaluations have been developed, in-
cluding cost-minimization, cost-effectiveness, and cost-utility analyses. The anal-
ysis of economic outcomes is complicated by the multiple outcomes, skewed
distributions, and frequent missing data. Nevertheless, combined clinical and eco-
nomic outcome measures permit more rational comparisons of different clinical
strategies for purposes of medical decision making, patient counseling, clinical
practice guideline development, and health care policy formulation.

D. Economic Analysis in Controlled Clinical Trials


Performing economic analyses in association with controlled clinical trials
(CCTs) has attracted increasing interest in recent years. Figure 2 compares
published cost-effectiveness measures for several types of cancer treatment de-
rived from CCTs. Such analyses, however, are associated with several impor-
tant methodological challenges. Economic measures are often of secondary in-
terest in such trials, which may lack a priori hypotheses and suffer from frequent missing data and
inadequate sample sizes for valid statistical inference. The variability in the cost
measures and the lack of agreement on clinically meaningful cost differences
further limit the conclusions derived from such studies. The addition of eco-
nomic outcomes to traditional measures of clinical efficacy increases the com-
plexity and cost of CCTs. Economic analyses, therefore, should be limited to
large phase III trials where important trade-offs between efficacy and cost are
anticipated. This chapter focuses attention on the design, conduct, analysis, and
reporting of economic analyses in the setting of cancer clinical trials. The
strengths and the limitations of such analyses are discussed, and guidelines are
offered for the proper conduct, evaluation, and interpretation of such economic
analyses.

Figure 2 Cost effectiveness of cancer treatment. Estimated cost effectiveness for vari-
ous cancer treatment modalities adapted from Smith et al. (41). Cost effectiveness is ex-
pressed in terms of incremental cost ($U.S. thousands) per life year saved. AML, acute
myelogenous leukemia; NSCLC, non small cell lung cancer; HD, Hodgkin’s disease;
ABMT, autologous bone marrow transplantation; CMF, cyclophosphamide, methotrexate,
5-fluorouracil; CAE, cyclophosphamide, adriamycin (doxorubicin), etoposide; IFN, inter-
feron; adj, adjuvant; met, metastatic; adv, advanced.

II. HEALTH CARE OUTCOMES


A. Clinical Efficacy
Response and survival often represent the primary clinical end points for the
assessment of efficacy upon which sample size and power calculations are based.
Alternatively, clinical outcome can be measured in terms of life expectancy or
the average number of years of life remaining at a given age. The life expectancy
of a population can be thought of as representing the area under the corresponding
survival curve (5). The gain in life expectancy or life years saved with treatment
represents the marginal efficacy and can be thought of as the area between the
survival curves with and without intervention. This represents a more powerful
method for assessing treatment effect than comparing median survivals or the
proportion event free at a given time (Fig. 3). Changes in life expectancy are often
used in economic analyses to express the efficacy of treatment. Such measures are
limited by the difficulty in judging a clinically important gain in life years and
extrapolating censored survival data beyond the trial period.

B. Health-related Quality of Life (HRQOL)


In recent years, there has been increasing interest in the assessment of the impact
of cancer and cancer treatment on quality of life. Health profiles derived from
psychosocial theory attempt to assess HRQOL through one of a variety of scales
addressing the relevant dimensions associated with quality of life, such as func-
tional ability, emotional well-being, sexuality/intimacy, family well-being, treat-
ment satisfaction, and social functioning (6). Alternatively, utility measures de-
rived from economic and decision theory attempt to assess HRQOL by eliciting
patient preferences for specific outcome states (7). Patient preferences can be
assessed through a time trade-off method incorporating a standard reference gam-
ble generating a single value of health status along a linear continuum from death
(0) to full health (1). The major advantage of measures of patient preference or
utility is that they can then be used to adjust measures of longevity such as life
expectancy (e.g., quality-adjusted life years or QALYs). The QALY represents
the time in full health considered by the patient equivalent to actual time in the
diseased state. Serial measurement of patient preferences over time can be used
to estimate the cumulative impact of treatment on HRQOL. The sum over all
health care states of the product of the time spent in each state and the utility
associated with the state will yield the quality-adjusted time without symptoms
of disease or toxicity of treatment, or Q-TWiST, described by Gelber et al. (8).

Figure 3 Gain in life expectancy. Hypothetical survival curves of control and treatment subjects displaying the probability of survival over time since randomization, adapted from Naimark and from Wright and Weinstein (5). The gain in median survival and probability of 5-year survival are shown. The area between the curves represents the life years gained with the intervention.


The value of such measures is limited by the time and cost involved in their
assessment through direct patient encounters and the lack of elucidation of the
multidimensional aspects of HRQOL. The assessment of HRQOL in conjunction
with conventional clinical efficacy measures in CCTs has gained increasing inter-
est over the past several years (9–11). Several authors have addressed the method-
ological challenges of HRQOL outcomes associated with the design and analysis
of clinical trials (12,13). Guidelines have been proposed for the incorporation of
HRQOL measures into CCTs (14).
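
To make the utility weighting concrete, the following sketch computes quality-adjusted survival by summing, over health states, the product of the time spent in each state and its utility, in the spirit of the Q-TWiST partition just described. The states, durations, and utilities below are invented solely for illustration.

# Minimal sketch of quality-adjusted survival (all values hypothetical).
# Each health state contributes (time in state) x (utility of state) to the total,
# as in the Q-TWiST partition: toxicity (TOX), time without symptoms or
# toxicity (TWiST), and time after relapse (REL).

health_states = [
    # (state name, time in state in years, utility on the 0-1 death-to-full-health scale)
    ("TOX",   0.5, 0.6),   # time with treatment toxicity
    ("TWiST", 3.0, 1.0),   # time without symptoms of disease or toxicity
    ("REL",   1.0, 0.5),   # time after relapse
]

overall_survival = sum(t for _, t, _ in health_states)
qalys = sum(t * u for _, t, u in health_states)

print(f"Overall survival: {overall_survival:.1f} years")
print(f"Quality-adjusted survival: {qalys:.1f} QALYs")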

C. Economic Outcomes
Economic outcome measures differ in several respects from traditional clinical
outcome measures. The most important economic outcome of interest for clinical
decision making and health policy formation is cumulative total cost, which con-
siders both the activity level over time and unit costs. The activity level represents
the amount of various resources used and the time expended in providing medical care.

Table 1 Types of Economic Analysis

Methodology          Cost unit    Effect unit
Cost of illness      Monetary     —
Cost minimization    Monetary     Equal
Cost effectiveness   Monetary     LYS*
Cost utility         Monetary     QALYS†
Cost benefit         Monetary     Monetary

* Life years saved.
† Quality-adjusted life years saved.

Unit costs represent the cost associated with each unit of activity. The total
cost of illness represents the weighted sum of the unit costs where the weights
are represented by the units of activity for each cost item such that
Total cost = Σ (unit activity × unit cost)

where the sum is taken over the n cost items.


The major focus of such economic analyses relates to those resources and costs
that might differ between treatment groups. The types of economic analyses asso-
ciated with cancer CCTs are summarized in Table 1. Direct medical costs repre-
sent the costs of providing medical services for the prevention, diagnosis, treat-
ment, follow-up, rehabilitation, and palliation of disease. These costs include
those associated with hospitalization, professional services, pharmaceuticals, ra-
diologic and laboratory testing, and home health care services. Direct nonmedical
costs represent additional expenditures incurred while receiving medical care,
such as transportation costs to and from the institution and child care expenses.
Indirect costs include those associated with the morbidity of disease and treat-
ment, such as days lost from work and the economic impact of lost economic
output due to premature death. Intangible costs are those associated with pain
and suffering and the loss of companionship. Although it is very difficult to ex-
press such concerns in monetary terms, these represent real social and emotional
costs to the patient and family. Often economic outcome measures are combined
with clinical and/or quality of life measures to provide a summary outcome mea-
sure reflecting the simultaneous difference in cost and the change in survival or
quality-adjusted survival.
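
The total-cost calculation above is a weighted sum of unit costs with the observed units of activity as weights. The brief sketch below illustrates the arithmetic for a single hypothetical patient; the resource items and prices are invented for illustration, and the marginal cost of interest would be the difference in such totals between treatment groups.

# Minimal sketch of the total-cost calculation: a weighted sum of unit costs,
# with the observed units of activity as weights (all values hypothetical).

unit_costs = {          # cost per unit of each resource item ($)
    "hospital_day": 1200.0,
    "chemo_cycle":   900.0,
    "cbc_test":       25.0,
    "home_visit":     80.0,
}

patient_activity = {    # units of each resource used by one patient
    "hospital_day": 4,
    "chemo_cycle":  6,
    "cbc_test":    12,
    "home_visit":   3,
}

total_cost = sum(patient_activity[item] * unit_costs[item] for item in patient_activity)
print(f"Total cost for this patient: ${total_cost:,.2f}")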

III. ECONOMIC ANALYSIS


A. When Should Economic Analyses Be Performed?
Economic analyses are most useful when the clinical and economic outcomes of
interest are discordant, that is, when an intervention is associated with equal or
improved outcome but a greater cost or when the cost of an intervention is the
same or less but with less effectiveness. Clearly, interventions that are associated
with large or uncertain resource consequences and small or unclear efficacy are
most likely to be candidates for an economic analysis. The proper timing of an
economic evaluation in the development of a new intervention is important. Intro-
duction too early in the process before efficacy and standard procedures have been
established may lead to the waste of limited resources, whereas incorporation too
late in the process may limit the ability of the evaluation to alter the dissemination
of the technology. Economic analyses, therefore, should generally be limited to
definitive or confirmatory studies of promising approaches likely to have consid-
erable economic consequences or for which a trade-off between efficacy and cost
is anticipated.

B. Types of Economic Analysis


1. Noncomparative Evaluations
Noncomparative (descriptive) economic studies generally are performed for ei-
ther health administrative or public health purposes and do not involve explicit
comparisons of treatment options, although implicit comparisons are often made.
A common approach is that of burden-of-illness or cost-of-illness studies where
the cost of disease in a population is summarized by tabulating the incidence or
prevalence of disease, the associated morbidity or mortality, and the total costs
of illness.

2. Comparative Evaluations
Comparative economic studies evaluate possible interventions in cohorts of indi-
viduals comparing the benefits and the costs. As shown in Table 1, several types
of economic evaluations are available (15–21). When clinical effectiveness is
not an issue or is considered equal between therapeutic alternatives, the evalua-
tion may be most reasonably based on differences in resource utilization or cost
through a cost-minimization analysis where the strategy associated with the low-
est total cost is identified. Clinical benefits are sometimes converted into the
same economic measure in a cost-benefit analysis to combine them into a single
measure. Such an approach, however, is limited by the requirement that a mone-
tary value is placed on clinical and quality of life outcome measures. When im-
portant differences in both clinical efficacy and cost are anticipated, it is often
preferable to combine economic measures with those of clinical efficacy (Fig.
4). In this situation, the measures of interest are generally the additional cost of
one strategy over another (marginal cost) and the additional clinical benefit
(marginal efficacy) or quality-adjusted clinical benefit (marginal utility). Cost-
effectiveness analysis compares interventions based on the ratio of the marginal
cost and the marginal effectiveness (marginal cost-effectiveness) expressed as
the added cost per life year saved (Fig. 5). Cost-utility analysis compares treat-
ments based on the ratio of the marginal cost and the marginal utility (marginal
cost-utility) expressed as the added cost per QALY saved. Ultimately, these latter
two approaches attempt to identify the most efficient approach, that is, the least
costly strategy associated with the greatest effectiveness or utility.

Figure 4 Combined outcome measures. Relationship between clinical and economic outcome measures. Clinical measures such as survival or life expectancy and quality of life may be combined with economic outcome measures such as cost to simultaneously evaluate cost and efficacy in terms of cost-effectiveness or cost-utility ratios.
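
To illustrate the incremental ratios just defined, the following sketch computes the marginal cost, marginal effectiveness, and marginal utility for two hypothetical strategies and forms the corresponding cost-effectiveness and cost-utility ratios; all inputs are invented per-patient means, not data from any trial.

# Minimal sketch of incremental cost-effectiveness and cost-utility ratios
# (all inputs hypothetical, per-patient means).

standard = {"cost": 20_000.0, "life_years": 2.0, "qalys": 1.5}
new_tx   = {"cost": 35_000.0, "life_years": 2.6, "qalys": 2.1}

marginal_cost    = new_tx["cost"] - standard["cost"]
marginal_effect  = new_tx["life_years"] - standard["life_years"]
marginal_utility = new_tx["qalys"] - standard["qalys"]

icer = marginal_cost / marginal_effect     # added cost per life year saved
icur = marginal_cost / marginal_utility    # added cost per QALY saved

print(f"Marginal cost:      ${marginal_cost:,.0f}")
print(f"Cost-effectiveness: ${icer:,.0f} per life year saved")
print(f"Cost-utility:       ${icur:,.0f} per QALY saved")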

C. Limitations of Economic Analyses


The evaluation and interpretation of an economic analysis will often differ sub-
stantially depending on the perspective from which it was undertaken, for exam-
ple, the patient or family, a health care provider or institution, a third party payor,
or that of society as a whole. Indirect and intangible costs, although very impor-
tant to the patient and family, may not even be considered in economic analyses
from most other perspectives. From the narrowest perspective, the lowest cost
will be associated with the absence of care or no intervention and the shortest
survival. Likewise, lifetime costs will often be less in those with the shortest life
expectancy such as the elderly. From a more global perspective, public health
efforts aimed at screening and early detection and disease prevention assume
greater importance since these will ultimately improve clinical outcome. In addi-
tion, marginal summary measures do not reflect the absolute benefit or cost of
an intervention. A strategy associated with a lower absolute effectiveness may
actually appear superior in terms of cost-effectiveness or cost-utility. It is impor-
tant, therefore, to measure both the absolute and the marginal benefit and cost
in such analyses.

Figure 5 Cost-effectiveness plane. Plane displaying the relationship between incremental cost (ordinate) and incremental effectiveness (abscissa). Any point on the plane represents cost effectiveness expressed as the ratio of incremental cost to incremental effectiveness. The straight line from the lower left of the plane to the upper right represents the maximum acceptable cost-effectiveness ratio determined by society. Interventions associated with greater effectiveness and lower cost (lower right) are always considered acceptable, whereas those associated with greater cost and less effectiveness (upper left) are always unacceptable. The acceptability of cost-effectiveness ratios in the other quadrants depends on whether the ratio lies below or above the maximum cost-effectiveness line. Any estimated cost effectiveness below that line represents an acceptable ratio, whereas those above the line are considered unacceptable.

IV. ECONOMIC ANALYSIS AND CCTs


A. Why Perform Economic Analyses in Association
with Clinical Trials?
1. Strengths
The quality of an economic analysis depends upon the precision and validity of
the underlying data best provided by CCTs. Just as CCTs are thought to represent
the most definitive way to evaluate interventions for efficacy, economic analyses
based on such trials may represent the best means to evaluate the cost and cost-
efficiency of treatment. Such economic analyses will be based on the most reli-
able estimates of treatment efficacy, and they will facilitate the comprehensive
comparison of therapeutic options. Study design, data collection, and planned
analyses are generally detailed in a written protocol. The care taken in the design,
conduct, and analysis of such trials may provide the best available information
on resource utilization and treatment efficacy. The importance of randomized
controlled trials is evident in efforts of observational studies and nonrandomized
trials to emulate their careful design and analysis procedures to achieve the same
conclusions. Economic analyses associated with CCTs should be sought before
wide dissemination of new technologies, especially when the resource conse-
quences or costs are large.

2. Weaknesses
Economic outcomes measured in association with clinical trials are often consid-
ered of secondary importance with no a priori hypothesis, small sample size, and
frequent missing data. Even when properly designed and conducted, economic
analyses with CCTs may have low external validity related to the lack of represen-
tativeness and limited generalizability due, in part, to strict eligibility criteria.
The study population also must adhere to clinical monitoring that may not be
representative of clinical practice and will be associated with resource utilization
and costs differing considerably from routine. The costs involved with explor-
atory or early clinical trials may not be representative of what they would be
with more experience. Finally, economic analysis will add to the cost and com-
plexity of CCTs and should generally be limited to use with large, prospective,
phase III, randomized clinical trials. Careful consideration should be given to the
importance of the economic information and the appropriateness of the clinical
trial design prior to incorporating economic assessment into a CCT, and care
must be utilized in selecting only the most relevant and objective measures of
resource utilization for inclusion in the trial. The same methodological rigor
should be applied to the economic analysis as is commonly used in the assessment
of therapeutic efficacy. The appropriate use of economic analyses in association
with CCTs, therefore, requires careful attention to the proper design, conduct,
analysis, and reporting of such analyses (22–27).

B. Design Considerations
1. Types of Studies
As shown in Table 2, three general types of economic analysis related to CCTs
are described, which vary in the nature and source of the economic data.

Table 2 Economic Analyses Associated with Cancer Clinical Trials

Type   Efficacy      Activity       Unit cost      Precision   Generalizability
I      Prospective   Retrospective  Retrospective  +           +++
II     Prospective   Prospective    Retrospective  ++          ++
III    Prospective   Prospective    Prospective    +++         +

Prospective, data from all or a sample of trial institutions; Retrospective, retrospective data from
study institutions or other sources; Efficacy, primary outcomes; Activity, resources used.

In type I economic analyses, all cost information is obtained either from an independent


source in an unsampled fashion or from a subsample of study subjects. Such
studies can often be performed rapidly at relatively low cost, but there is little
information on measure variability and subsequent analysis is based on sensitivity
analysis to assess the robustness of the assumptions. In type II economic studies,
resource utilization is sampled concurrently with measures of clinical efficacy.
Such an approach provides information on variability for estimation and hypothe-
sis testing but may limit generalizability to other economic environments and
time periods. Missing data cannot be assumed to be missing at random, which
may introduce measurement bias. In type III economic studies,
complete cost information is obtained on the trial subjects, including resource
utilization and unit costs. The amount of information collected often requires
limiting sampling to a subgroup of the study population, usually at a few institu-
tions. Such an analysis has limited generalizability and requires considerable ef-
fort and justification addressing concerns about sampling and measurement bias.

2. Study Hypotheses
The major study questions related to economic measures should be clearly stated
in terms of testable hypotheses. All primary economic questions and secondary
hypotheses relating to outcome differences among specified subgroups should be
stated in advance of the trial. The clinical and economic relevancy of the study
hypotheses should be clearly stated. The economic importance of specific inter-
ventions is likely to be greatest when considering diseases of clinical and public
health significance and interventions associated with considerable cost trade-offs.

3. Study Design
The design of a clinical investigation, including any economic analysis, should
attempt to minimize the potential for systematic error or bias, including that asso-
ciated with subject selection, measurement, and confounding (28). Confounding
represents the modification of the true treatment effect by a factor associated with
both the outcome of interest and treatment group assignment. Confounding can
obscure a true outcome difference when it exists or create an apparent difference
that does not exist. The potential for confounding is most effectively addressed
in the design of a trial by incorporating appropriate controls, basing treatment
assignment on randomization, and by blinding subjects and investigators to the
assigned treatment (double blinding). Randomization ensures that both known
and unknown confounding factors will be distributed equally in the treatment
groups on average. The balance of important prognostic factors within treatment
groups can be enhanced by randomization separately within subgroups (stratifi-
cation) but should be confirmed in the analysis.

4. Study Population
All subjects in the study should be described and accounted for. The nature,
location, and setting of the study should be fully detailed. Eligibility criteria,
including any inclusion and exclusion criteria, should be presented. A balance
between narrow eligibility to enhance study power and limiting restrictions to
increase generalizability should be sought.

5. Sample Size
The goal of a clinical trial is to confirm the treatment effect accurately or to refute
it unambiguously. The sample size necessary to adequately address primary study
hypotheses should be stated in advance based on the likely treatment effect or
the number of events anticipated, measurement variation, maximum tolerable
alpha error (false positive), and maximum beta error (false negative) considered
acceptable. It is imperative that sufficient numbers of subjects are included in
the trial so that a negative study is unlikely to be a false negative (29). Sample sizes
large enough to achieve a power of 80–95% are generally considered desirable
for detecting meaningful differences. Multiinstitutional CCTs may increase study
accrual, sample size, and the external validity of both the clinical and the economic
outcomes. When the primary outcomes represent failure time data (time-to-
event), sample size estimation should consider the event (or cost) rate and the
anticipated duration of observation and the expected censoring rate. Failure to
consider censoring in an economic analysis may further compromise the power
of the study. When the sample size of the trial is appropriately targeted to the
economic outcomes, it must be anticipated that a longer duration of accrual or
follow-up may be needed. This longer period of observation may not always be
justified or even ethical, especially when meaningful differences in clinical out-
come are already apparent. In small trials, even relatively large and clinically
important differences in outcome may be statistically insignificant because of
low study power. In studies with insufficient sample size to address subgroup
analyses, results should be presented descriptively for purposes of hypothesis


generation only. Because of greater variability, skewed distributions and frequent
missing data, CCTs with primary economic hypotheses may require larger sample
sizes to achieve the desired ability to demonstrate an economic effect. It may be
difficult to estimate sample size requirements given the limited information on
what constitutes meaningful differences in economic outcomes. Sample size esti-
mates should consider any adjustment needed for multiple testing due to the mul-
tiple outcome measures involved. Sample size estimation in economic studies is
complicated by the limited efficiency of conventional methods used with such
data. As a first-order approximation, sample size requirements can be estimated
on the basis of the approximately log-normal distribution of cost data. Sample
size estimates based on ratios of cost and efficacy should consider the variance
and covariance of both measures and the desired level of precision. Although
interim analyses of large trials of expensive technologies might be desirable, they
are seldom designed with early stopping rules based on secondary outcomes such
as cost, which are often not available until later or even after trial completion.
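
As an illustration of the first-order approximation mentioned above, the sketch below estimates the sample size per arm for comparing mean log costs with a standard two-sample normal approximation. The standard deviation of log cost and the log-scale difference to be detected are assumed values chosen only for illustration.

# Minimal sketch of a first-order sample size calculation for comparing costs,
# treating log-transformed costs as approximately normal (hypothetical inputs).
import math
from scipy.stats import norm

alpha = 0.05          # two-sided type I error
power = 0.80          # 1 - beta
sd_log_cost = 0.9     # assumed standard deviation of log(cost) in each arm
delta_log = 0.25      # difference in mean log cost to detect (~28% higher geometric mean)

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

n_per_arm = 2 * ((z_alpha + z_beta) * sd_log_cost / delta_log) ** 2
print(f"Approximate sample size per arm: {math.ceil(n_per_arm)}")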

6. Outcome Measures and Analysis


Economic outcomes should include measures of activity level (including time
and resources used) and unit costs of such activity. The economic resource and
cost information collected should be objective and comprehensive and yet limited
to that needed to address prestated hypotheses matching clinical measures in style
and frequency. Resource utilization measures should be specified in advance and
applied equally to each intervention group ideally by blinding both the patient
and investigator to the treatment assignment. Where this is not feasible or ethical,
standardized measurement of economic outcomes should be applied equally to
treatment groups. The quality and completeness of observation, measurement,
and data recording are important to minimize bias and random error in a trial.
Missing data associated with death, disability, treatment delay, loss to follow-
up, or noncompliance may result in either item nonresponse at a specific point
in time or unit nonresponse where most information on a resource component is
missing. Missing data, even when randomly missing, will reduce the power of
the study analysis. However, of greater concern is the possibility that missing
data, including subject withdrawal, may be missing nonrandomly and bias group
comparisons. Although the prospective concurrent collection of outcome data in
a CCT generally reduces the potential for missing data, methods for minimizing
and dealing with missing data should be discussed in advance and explicitly han-
dled in the analysis. The primary data analysis and any planned subgroup analysis
should be described in advance in sufficient detail to provide the reader with a
full understanding of the planned analysis. Even the most elegant analysis, how-
ever, will not salvage an underpowered or biased clinical trial.

C. Study Conduct Considerations


1. Resource Utilization Data
Patient monitoring and data collection procedures for conventional clinical out-
comes in the conduct of a clinical trial are relatively standardized. Unfortunately,
economic data are often considered of secondary importance or are added to a
clinical trial as an afterthought and relegated to a low level of importance. Re-
source utilization data in a CCT are most accurate and complete when collected
concurrently with efficacy data. The types of resource utilization generally con-
sidered in such studies are summarized in Table 3. Economic analyses in associa-
tion with CCTs often do not adequately address changes in resource utilization
and cost that occur over time. Answers to economic questions depend on resource
utilization and cost through the period of full recovery or death requiring longer
patient monitoring than for the estimation of clinical efficacy. It is essential that
the same systematic effort and precision are applied to the collection of economic
outcomes as are used to measure clinical efficacy. It is also important that
resources consumed and unit costs are measured separately since they vary
quite differently. Resource utilization depends primarily on the clinical situation,
whereas unit costs vary considerably between institutions, regions, and health
systems and over time. It is essential to distinguish between resource utilization
related to the intervention and that related to the conduct of the trial, including
data collection and altered patterns of care and follow-up.

2. Unit Cost Data


Costing methodology varies considerably between studies. When the focus is on
internal validity and maintaining the direct association between resource utiliza-
tion and cost, concurrent and prospective collection of cost information should
be considered. Even when concurrent costing is not feasible or desirable, the use
of site-specific cost information should be applied to the pooled resource data.
When the focus is on external validity or institutional data are not considered
representative, the use of more representative external unit cost data may be con-
sidered. It must always be kept in mind, however, that resource utilization and
unit cost information are generally not independent of one another or of the clini-
cal trial design.

D. Analysis Considerations
1. Type of Study
The type of analysis appropriate for an economic evaluation depends on the study
design and the nature of the data (30,31). A survey of published randomized
trials, including an economic evaluation with cost values suitable for statistical
analysis, was recently reported (32).

Table 3 Sources of Resource Utilization in Economic Analyses


Associated with Cancer Clinical Trials

Direct: medical
1. Hospitalization*
Routine vs. intensive care
Frequency
Duration
Physician/nursing services
Laboratory/radiology services
Type and number of tests
Pharmacy services (medications, chemotherapy)
Radiation therapy services
Drugs/treatments
Surgical procedures
Blood bank services (transfusions)
Other services: support services
2. Ambulatory (clinic)
Frequency
Outpatient tests/procedures
Outpatient treatment (surgery, radiation, chemotherapy, etc.)
3. Nursing home/hospice care
Visits (M.D., R.N., Social Services, other)
Direct: nonmedical
1. Loss of work time by patient, family, and friends during treatment.
Lost wages, distance traveled, time spent
Indirect: medical
1. Medical/nursing services
Home visits
Interim testing
2. Physical therapy
3. Social Services
4. Other medical support services
Indirect: nonmedical
1. Impact on family resources
Days lost from work
Transportation costs
Out-of-pocket expenses

* Direct and indirect institutional expenditures including overhead (utilities, rent, etc.),
equipment maintenance and depreciation, consumables.
Economic outcomes collected in the context
of CCTs are often considered of secondary importance with limited attention
given to prestated economic hypotheses, sample size requirements, missing data,
or multiple testing issues. The source of economic data in such studies is often
derived from small subsamples or from separate nonsampled sources. The results
of such economic studies should be viewed as exploratory or hypothesis generat-
ing and should be presented descriptively. When information on variability is
available, it is often informative to review the distribution of each outcome mea-
sure along with some percentile range such as the interquartile range. Economic
measures are often skewed with frequent outliers and greater variability than
most clinical measures. Measurement variability is often greater for indirect costs
where missing or incomplete data are also more likely to be a problem. Calculated
mean costs or combined measures of cost and efficacy, such as cost effectiveness,
frequently ignore the inherent variability between subjects relying on sensitivity
analyses to assess the robustness of any conclusions. In such analyses, the investi-
gator controls the variation and range of parameters, the potential interaction
between parameters is ignored, and robustness is arbitrarily defined. In larger
studies incorporating a limited number of a priori economic hypotheses, the same
rigor of statistical analysis should be applied as is used for assessing clinical
efficacy. Study evaluation should be based on an intention-to-treat analysis and
appropriately powered to measure effect sizes of economic importance carefully
considering measurement distributions, missing data, and multiple comparisons.
The source of unit cost information and any discounting considered should be
justified, and the external validity or generalizability of results should be dis-
cussed.

2. Missing Data
Missing data may have an impact not only on study precision by reducing the
number of subjects with complete data but also study validity by biasing outcome
estimates if the missing data are associated with outcome measures or treatment
group assignment. Missing data can also complicate multivariate modeling which
considers only cases for which data are available on all covariates considered.
The relationship between missing data and treatment group assignment, efficacy
and cost outcomes, or important covariates should be studied. If missing data
are independent of observed and unobserved data, they are considered missing
completely at random and can be dealt with by complete case analysis with some
loss in power or by simple imputation of missing values such as the last observed
or mean values with some underestimation of variance. When missing data are
missing at random but related to the observed data, multiple imputation tech-
niques and bootstrapping can provide more reasonable point estimates and vari-
ance. The most difficult situation is that of informative missing data, where the
probability of missingness depends on the unobserved values themselves or on
the parameters of interest. It is gener-


ally not considered necessary to have unit cost information on all subjects as
long as resource utilization data are complete. Unit cost data can be collected on
a subset of patients or from an independent data source, which may actually
increase external validity with total costs estimated by regression or multiple
imputation techniques. When the variance of unit costs and the covariance be-
tween cost and resources used are not available, the robustness of the assump-
tions may be assessed with a sensitivity analysis. Such an analysis is limited by
the potential bias in selecting variables for analysis, the range of values consid-
ered, the lack of standard criteria for ‘‘robustness,’’ and the inability to address
interaction.
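
The toy sketch below, using fabricated cost data, contrasts complete-case analysis with simple mean imputation and shows why the latter understates variability; it is meant only to illustrate the issues discussed above, not to recommend an approach.

# Toy illustration (fabricated data) of complete-case analysis versus simple mean
# imputation for a cost variable with missing values. Simple imputation fills the
# holes but understates the variance, as noted in the text.
import numpy as np

rng = np.random.default_rng(0)
costs = np.exp(rng.normal(9.0, 0.8, size=200))   # skewed (log-normal) costs
missing = rng.random(200) < 0.25                 # 25% missing, here completely at random
observed = costs[~missing]

# Complete-case analysis: fewer subjects, unbiased here because missingness is MCAR
cc_mean, cc_sd = observed.mean(), observed.std(ddof=1)

# Simple mean imputation: same mean, artificially reduced spread
imputed = costs.copy()
imputed[missing] = observed.mean()
imp_mean, imp_sd = imputed.mean(), imputed.std(ddof=1)

print(f"Complete case:   mean ${cc_mean:,.0f}, SD ${cc_sd:,.0f} (n={observed.size})")
print(f"Mean imputation: mean ${imp_mean:,.0f}, SD ${imp_sd:,.0f} (n={imputed.size})")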

3. Statistical Analysis
Economic outcomes of a clinical trial such as activity level or cost are seldom
equal among the study groups. The observed differences in outcome may repre-
sent either true effects or differences due to random error (variability) or system-
atic error (bias). A true difference is supported by a large treatment effect, small
variability, large sample size, low false-positive rate, and low false-negative rate.
In the analysis of a clinical trial, random error is addressed through statistical
inference, including estimation and hypothesis testing. Estimation summarizes
the distribution of outcomes providing measures of central tendency, such as
means or proportions, and measures of variability or precision, such as confidence
intervals, which represent the upper and lower bounds likely to contain the true
value of a variable. Hypothesis testing involves an assessment of the probability
of obtaining the observed difference in outcome under the null hypothesis of no
true difference between the groups. Economic analyses are often faced with mul-
tiple outcome measures and repeated measures over time, which increases the
chance of observing a statistically significant difference due to chance alone.
Appropriate adjustment in significance levels for multiple testing in the analysis
is necessary. Although it is sometimes useful to compare cumulative cost dis-
tributions between groups using a general nonparametric technique such as the
Kolmogorov-Smirnov test, more powerful methods exist for comparing specific
distribution parameters such as mean and median costs.
Cost Differences. Statistical inference in economic studies is most com-
monly based on differences in arithmetic mean costs between treatment groups
since only these estimates permit ready calculation of the total costs of interest.
Inferences on cost differences between treatment groups should be supported by
measures of precision (e.g., confidence intervals) of the estimated difference in
mean costs or appropriate hypothesis testing considering outcome distributions.
Cost data, however, are often highly skewed due to high costs incurred by a few
patients. When dealing with very large data sets that are reasonably well behaved,
greater power will be associated with the use of parametric analyses such as
Student’s t test, which may be reasonably applied with confidence intervals calcu-
lated on the mean cost difference. Log transformation of costs will reduce the
impact of outliers and may be useful when it results in normal and similarly
sized distributions (30,33). However, inferences based on log-transformed costs
compare geometric means, which do not address the primary issue of importance
related to arithmetic mean cost differences. Zhou and Gao (34) proposed a Z-
score for differences in means when group variances are not equal, using the fact
that for log-normal data the log of the mean of the untransformed costs equals the
mean of the log-transformed costs plus one half of their variance. Alternative methods, including the
truncation of outliers, will result in loss of economically important information
and may yield misleading results. When faced with smaller data sets or unre-
solved skewed distributions, analysis with nonparametric methods such as rank
and log-rank tests is more appropriate. The Wilcoxon rank-sum (Mann-Whitney) test
is often used in this situation since it is much more efficient for comparing
asymmetric distributions and yet relatively efficient even when comparing normal
distributions.
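
A minimal sketch of the two broad options just described, applied to simulated skewed cost data: a t test on log-transformed costs (whose inference concerns geometric rather than arithmetic means) and the Wilcoxon rank-sum (Mann-Whitney) test. The data and effect size are fabricated for illustration.

# Minimal sketch comparing cost distributions between two arms (simulated, skewed data):
# a t test on log-transformed costs and the Wilcoxon rank-sum (Mann-Whitney) test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
cost_a = np.exp(rng.normal(9.0, 0.8, size=150))   # control arm costs
cost_b = np.exp(rng.normal(9.2, 0.8, size=150))   # experimental arm costs (~22% higher)

# Parametric comparison on the log scale (inference concerns geometric means)
t_stat, t_p = stats.ttest_ind(np.log(cost_a), np.log(cost_b))

# Nonparametric comparison of the two distributions
u_stat, u_p = stats.mannwhitneyu(cost_a, cost_b, alternative="two-sided")

print(f"Arithmetic means: ${cost_a.mean():,.0f} vs ${cost_b.mean():,.0f}")
print(f"t test on log costs:      p = {t_p:.3f}")
print(f"Mann-Whitney (rank) test: p = {u_p:.3f}")
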
The product-limit estimation method of Kaplan and Meier represents a rea-
sonable approach for dealing with cumulative costs over time, particularly when
censoring is present (35,36). Methods that ignore censoring will potentially bias
mean costs and cost-effectiveness ratios. A number of difficulties may be encoun-
tered in assessing costs in failure time studies, however. The assumption of inde-
pendent censoring is often violated in cost-to-event type analyses. This is illus-
trated by the nonconstant changes in cost over time and the informative
relationship of costs to health status exemplified by the increase in costs immedi-
ately before death. When cost data are censored before death, censoring is infor-
mative with regard to costs and survival. The different scales for death and cen-
soring can result in informative censoring even if no deaths are observed (27).
When dealing with the need for covariate adjustment, the log-rank test related
to the proportional hazards regression method of Cox may have advantages. Rank
procedures, however, generally assume that group distributions have the same
variance and shape and replace economically relevant information with ranks.
In addition, they compare the median and the distributions of costs rather than
arithmetic mean cost differences. Recently, nonparametric bootstrap methods
based on the original data have been proposed which make no assumption about
the shape or equality of the underlying distributions (36). The observed data are
treated as an empirical probability distribution that can be sampled repeatedly
with replacement providing a distribution of outcomes from which confidence
limits and hypothesis testing can be developed. In addition, Bayesian methods
based on subjective prior beliefs have been proposed but the need to determine
a priori distributions and computational complexity limit their applicability.
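
The nonparametric bootstrap mentioned above can be sketched in a few lines: resample each arm with replacement, recompute the difference in arithmetic mean costs, and take percentile confidence limits from the resulting distribution. The data below are simulated and purely illustrative.

# Minimal sketch of a nonparametric bootstrap confidence interval for the difference
# in arithmetic mean costs between two arms (simulated data for illustration).
import numpy as np

rng = np.random.default_rng(2)
cost_a = np.exp(rng.normal(9.0, 0.8, size=150))
cost_b = np.exp(rng.normal(9.2, 0.8, size=150))

n_boot = 5000
diffs = np.empty(n_boot)
for b in range(n_boot):
    resample_a = rng.choice(cost_a, size=cost_a.size, replace=True)
    resample_b = rng.choice(cost_b, size=cost_b.size, replace=True)
    diffs[b] = resample_b.mean() - resample_a.mean()

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"Observed mean cost difference: ${cost_b.mean() - cost_a.mean():,.0f}")
print(f"95% bootstrap CI: (${lo:,.0f}, ${hi:,.0f})")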

Combined Outcome Measures. Statistical inference on combined mea-


sures of cost and effectiveness is complicated by the lack of information on the
variance and covariance structure of costs and clinical efficacy. These are often
dealt with conservatively by presenting variance or confidence limits around point
estimates of efficacy and resource utilization or cost separately, ignoring correla-
tion between cost and benefit. A ‘‘confidence box’’ may be defined by estimating
confidence limits separately for incremental effect and incremental cost. The re-
sulting confidence limits on the cost-effectiveness plane are generally considered
overly conservative. In addition, the confidence limits are problematic when the
uncertainty includes different quadrants of the cost-effectiveness plane (37). Sev-
eral methods for estimating confidence intervals for cost-effectiveness ratios
based on the joint variance of cost and efficacy have been proposed, none of
which is entirely satisfactory (38). Parametric estimation of the joint density of
incremental effect and cost considers the covariance generally defining an ellipse
on the cost-effectiveness plane. Van Hout et al. (39) calculated the probability
that the cost-effectiveness ratio falls below a defined maximum acceptable ratio
on the cost-effectiveness plane (Fig. 5), which they claim is equal to integrating
under the appropriate regions of the joint probability distribution f(E, C) around
maximum likelihood point estimates for cost effectiveness, where E and C repre-
sent observed incremental mean effectiveness and mean cost, respectively. Hlatky
et al. (40) reported the use of the bootstrap technique to obtain a nonparametric
estimate of the joint density based on the probability of results falling below a
specified threshold level of cost effectiveness. Assessing the probabilities associ-
ated with varying ceiling cost-effectiveness ratios defines an acceptability curve
where the 50th percentile defines the point estimate. Acceptability curves can be
used to summarize uncertainty in cost-effectiveness studies. Such a curve crosses
the probability axis at the one-sided p value for the incremental cost (∆C) and
is asymptotic to 1 minus the one-sided p value for the incremental effectiveness
(∆E). Therefore, confidence limits may be defined for the ceiling cost-effective-
ness ratio from the acceptability curve. Confidence limits may also be derived
from the net-benefit statistic, where the net benefit (NB) is defined as

NB = CER_ceiling · ∆E − ∆C

The net benefit can be shown to be normally distributed with variance and confidence limits given by

Var(NB) = CER_ceiling² · var(∆E) + var(∆C) − 2 · CER_ceiling · cov(∆E, ∆C)

Confidence limits = (NB − z_α/2 √Var(NB), NB + z_α/2 √Var(NB))
The net-benefit statistic offers some advantages for handling uncertainty in cost-
effectiveness analysis, including sample size calculation (37). It has also been
suggested that a Bayesian approach allows a more direct method for estimating
cost-effectiveness ratios (41).
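
The net-benefit statistic and the acceptability curve described above can be sketched directly from bootstrap (or otherwise simulated) replicates of incremental effectiveness and cost: for each candidate ceiling ratio, the curve reports the proportion of replicates with positive net benefit. The replicates below are simulated values chosen only for illustration.

# Minimal sketch of the net-benefit statistic and a cost-effectiveness acceptability
# curve from replicates of incremental cost and effectiveness (simulated values).
import numpy as np

rng = np.random.default_rng(3)
n_boot = 5000
delta_e = rng.normal(0.4, 0.15, size=n_boot)      # replicates of incremental effect (life years)
delta_c = rng.normal(12_000, 4_000, size=n_boot)  # replicates of incremental cost ($)

for ceiling in (20_000, 40_000, 60_000, 80_000):  # candidate ceiling ratios ($ per life year)
    net_benefit = ceiling * delta_e - delta_c     # NB = CER_ceiling * dE - dC
    prob_acceptable = np.mean(net_benefit > 0)
    print(f"Ceiling ${ceiling:,}/LY: P(NB > 0) = {prob_acceptable:.2f}")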

4. Adjustment
Covariate adjustment is generally undertaken for one of three reasons: to increase
precision or tighten confidence intervals on estimates of treatment effect, to in-
crease validity by controlling for confounding bias, and to estimate outcomes in
patient subgroups. Covariate adjustment of treatment effect and costs will nearly
always increase power. Covariates of particular interest in economic analyses
include demographic factors (age, sex, race, marital status), socioeconomic fac-
tors (income, education, occupation, employment status, residence, family/care-
giver status, type of health insurance and provider organization), and comorbidi-
ties (functional status, prior treatment). The outcomes of interest in economic
analysis are the absolute cost difference and absolute treatment effect that depend
on the control survival and the relative survival advantage with treatment. Even
when relative treatment effects are the same across subgroups, covariate adjust-
ment is necessary to estimate absolute effects because of the heterogeneity in
prognostic factors. Despite efforts to minimize bias in the design and conduct of a
clinical trial, the distribution of known prognostic factors within treatment groups
should be evaluated. Any covariate found to be associated with both treatment
group assignment and the outcome of interest must be considered a possible con-
founding factor and addressed further in the analysis. If actual confounding has
occurred, the apparent relationship between treatment and outcome will be either
strengthened or weakened with adjustment through either stratified analysis or
multivariate modeling. While multiple regression is commonly used in covariate
adjustment, the skewness of cost distributions may result in overestimates of
variance and broad confidence limits. Regression on linear and logarithmic trans-
formations of costs may not yield normal residuals limiting the interpretation of
results. The proportional hazards regression method of Cox has been proposed
for skewed resource or cost data providing estimates of mean cost differences
by including treatment assignment as a covariate (42). Such models are complex,
make no assumption about the distribution of costs for an individual, and must
deal with the proportional hazards and linearity assumptions of the model. Never-
theless, they permit cost analyses to consider the issue of censoring that might
otherwise result in low cost estimates when considering a severe illness or an
intervention associated with high early mortality or withdrawal.

5. Cost Discounting
It is also important to adjust changes in cost or benefit measures for changes over
time and place. Cost discounting considers preferences for immediate over future
benefit and for delaying present costs to the future. Price adjustment is necessary
when observations extend over time (>1 year) or geographical region to present
economic results in a common framework. In the United States, cost adjustments
are generally based on the Consumer Price Index or the Fixed Weight Index. All
future and past costs are generally expressed in terms of the present or some
fixed point in time. The Cost Discount Rate (CDR) represents the cost discount
(future cost − present cost) as a proportion of the present cost. The present cost,
therefore, represents the future cost divided by (1 + CDR)^n when discounting is
conducted over n years.
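
A minimal sketch of the discounting arithmetic just described, with a hypothetical discount rate and yearly cost stream: each future cost is converted to present value by dividing by (1 + CDR)^n.

# Minimal sketch of cost discounting: present value = future cost / (1 + CDR)**n
# (hypothetical discount rate and yearly cost stream).

cdr = 0.03                                    # annual cost discount rate
yearly_costs = [10_000, 8_000, 6_000, 6_000]  # cost incurred in years 0, 1, 2, 3

present_value = sum(cost / (1 + cdr) ** year for year, cost in enumerate(yearly_costs))
print(f"Undiscounted total: ${sum(yearly_costs):,.0f}")
print(f"Present value at {cdr:.0%}: ${present_value:,.2f}")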

6. Subgroup Analyses
Treatment effects and costs in a clinical trial often differ between subgroups of
the study population. Although such differences may represent an interaction
between the intervention and a covariate (e.g., prognostic or predictive factor),
the observed differences may also be the result of random error or study bias.
Statistically significant treatment effects in one group and not in another may
reflect the variation in power when one group has a more favorable outcome with
fewer events and therefore lower power to show an effect. Even when the treat-
ment effect is uniform across subsets, multiple testing in subgroup analyses is
associated with an increased probability of finding significant differences due to
chance alone (type I error). Therefore, multiple subgroup analyses should be
discouraged and limited to those of major interest and stated in advance of the
trial for which a difference in efficacy or cost effectiveness might be anticipated
(e.g., stratification factors). Subgroup analyses should include measures of vari-
ability in the effect measures such as confidence limits. Unless there are valid
reasons to expect such subgroup differences in treatment effect, strong evidence
for such effect modification should be provided. The best approach to subgroup
analyses is to perform a test for interaction to assess the homogeneity of treatment
effect across patient subsets rather than reporting difference in outcomes between
subgroups. It is reasonable to view any differences with considerable skepticism
utilizing more restrictive criteria for judging statistical significance.

7. Modeling
Modeling of the relationship between treatment and outcome is used for a variety
of purposes: adjustment for known confounding variables, development of clini-
cal prediction models, and decision modeling of clinical and economic outcomes.
Adjustment for confounding factors may improve both validity and precision
providing more accurate estimates. Such models may also permit estimation of
outcome differences within subgroups when these are heterogeneous. Clinical prediction
models for patient selection may improve the cost efficiency of an intervention.
Ideally, such models should be externally validated on an independent data set
and some measure of goodness-of-fit of the model to the data reported. Clinical
decision models represent valuable methods for the economic evaluation of data
from comparative studies of intervention strategies permitting simultaneous con-
sideration of more than one type of outcome measure (e.g., cost-effectiveness
analysis or cost-utility analysis). The analysis of decision models requires speci-
fication of the model structure, including choices, chance events, and outcomes;
probabilities of all chance events; and outcome values, including benefits and
costs. The analysis of decision models is based on calculating the expected value
of each choice by a process of folding back, which involves multiplying the
estimated outcome value by the probability of that outcome occurring and sum-
ming over all branches of the immediately preceding chance event. This weighted
sum then represents the expected value of the outcome, which now becomes the
outcome value for the immediately preceding step.
When a decision point is reached, the choice associated with the greatest
expected benefit or lowest expected cost represents the preferred choice. Sensitiv-
ity analyses based on such models permit an assessment of the robustness of the
optimal strategy by assessing how changes in parameter values affect the ex-
pected value of the choices and the threshold where expected outcome values
are equal. The threshold probability of an event relates to the ratio of benefits
and costs reflected in the values or utilities incorporated into the model (See the
Appendix). In such models, emphasis on descriptive and graphical displays is
often more rewarding than any formal statistical testing. Despite certain limita-
tions, Markov modeling provides a valuable tool for economic evaluation of
chronic diseases with simultaneous assessment of effectiveness and cost, includ-
ing discounting with disease progression over time (43). Particular attention must
be paid to the Markovian assumption that state transition probabilities are inde-
pendent of previous health states requiring the use of a combination of distinct
health states to model the medical history.
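
For a one-level tree, the folding-back procedure described above reduces to a probability-weighted average of outcome values under each choice. The sketch below works through a hypothetical treat versus no-treat decision; the probability and utilities are invented for illustration.

# Minimal sketch of "folding back" a simple decision tree: the expected value of each
# choice is the probability-weighted sum of its outcome values (all numbers
# hypothetical, utilities on a 0-1 scale).

p_disease = 0.30   # probability the patient truly has the disease

utilities = {
    ("treat", "disease"):       0.80,
    ("treat", "no disease"):    0.95,   # treatment cost/toxicity without benefit
    ("no treat", "disease"):    0.50,
    ("no treat", "no disease"): 1.00,
}

def expected_value(choice, p):
    # Fold back the chance node: weight each outcome utility by its probability.
    return p * utilities[(choice, "disease")] + (1 - p) * utilities[(choice, "no disease")]

for choice in ("treat", "no treat"):
    print(f"EV({choice}) = {expected_value(choice, p_disease):.3f}")

best = max(("treat", "no treat"), key=lambda c: expected_value(c, p_disease))
print(f"Preferred strategy at p = {p_disease}: {best}")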

E. Interpretation and Reporting Considerations


1. Individual Studies
The interpretation and reporting of economic analyses should always consider
other possible explanations for the observed differences in outcome, including
low study power (sample size), measurement variability, differences in study
populations, missing data, and multiple comparisons (44,45,46). The generaliz-
ability of the results for patients outside of the context of the individual CCT
should be discussed (47). The costs measured and details of the cost analysis
should be presented and discussed. A review of cost analyses associated with
randomized clinical trials revealed that only one half of the studies actually re-
ported cost figures and few reported indirect costs or study-related costs (44).
Economic analyses related to CCTs are subject to the same sources of variation
in results as other clinical investigations.

Table 4 Guidelines for Economic Analyses Associated with Cancer Clinical Trials

Design
1. Study hypotheses: Define before study initiation, a limited number of testable eco-
nomic hypotheses along with their relevance.
2. Study rationale: The logical basis and importance of an economic analysis should
be laid out along with the rationale for conducting an economic evaluation in relation-
ship to a CCT.
3. Perspective: The viewpoint from which the study is to be conducted and ana-
lyzed should be specified and justified.
4. Study population: Define the source and nature of the study population (treatment
and control groups) including eligibility and exclusion criteria.
5. Sample size: Sample size should be sufficient for valid conclusions concerning
primary and major secondary hypotheses: effect size, variability, alpha and beta
error (power), and multiple comparisons, repeated measures over time.
6. Treatment assignment: Treatment assignment should be randomized or at least
standardized and the rationale presented.
7. Outcome measures: Measures of clinical efficacy and economic cost should be
specified in advance.
8. Planned analyses: The type of economic analysis should be specified and justified
in advance including any subgroup analysis planned and any model to be used.
Data collection
1. Outcome measures: All pertinent clinical efficacy, quality of life, and economic
outcomes should be measured using valid instruments specified in advance.
2. Activity measures (quantities of resources used): These and unit costs (direct and
indirect) should be collected and reported separately.
3. Unit costs: Measure and record unit cost data including adjustments/conversions.
4. Missing data: Every effort should be made to minimize incomplete data.
Data analysis
1. Separate analysis: Resources used and unit costs should be analyzed separately
before any combined analysis.
2. Estimation: After careful examination of distributions, calculate summary esti-
mates and confidence limits for treatment effect and resource quantities and unit
costs.
3. Combined outcomes: Focus subsequent estimation on combined outcomes of in-
cremental cost and incremental efficacy (cost-effectiveness or cost-utility ratios).
4. Hypothesis testing: Apply appropriate method of statistical inference to group
comparisons based on the observed data distributions.
5. Multiple testing: Correct for multiple testing related to multiple outcomes and re-
peated measures over time.
6. Power: Estimate statistical power for evaluating group comparisons and assessing
confidence in reported results.
7. Discounting: Any applied discount rate for inflation/time should be specified and
justified.
8. Cost-effectiveness/utility: Treatment groups should be compared on the basis of


an incremental analysis, e.g., marginal cost effectiveness.
9. Quality of Life: HRQOL measures should be reported separately and in combination
with other clinical measures.
10. Modeling: Any model used should be justified and model parameters, including
probabilities and outcome values, should be appropriately estimated and justified.
11. Sensitivity analysis: Sensitivity analyses should be based on valid models with
justification for the range of variable variation.
Data interpretation and reporting
1. Methods: Present methods fully including a priori hypotheses, study population,
sample size originally planned, treatment assignment, outcome measures includ-
ing resource utilization and cost estimates, data analysis including statistical infer-
ence and modeling.
2. Resource utilization: Present resources used and cost estimates separately and uti-
lizing appropriate aggregate or combined measures.
3. Results: Discuss results in the context of the primary and any secondary hypothe-
ses.
4. Limitations: Discuss the limitations of the study design, study population, mea-
surements obtained, and analyses performed.
5. Validity: Discuss the issues of internal and external validity, including generaliz-
ability to other settings.
6. Relevance: Discuss importance of study question, including relevance to clinical
decision making, cost efficiency, and health policy formulation.

In a recent review of 45 randomized
trials that included individual cost data, 25 (56%) presented statistical tests
or measures of precision on the cost comparisons between groups, whereas only
9 (20%) reported adequate measures of variability (48). The authors of this
study concluded that only 36% provided conclusions justified on the basis of
the data presented. Preliminary guidelines for the design, conduct, analysis, and
interpretation of economic analyses associated with clinical trials are offered in
Table 4.

2. Meta-Analysis
If the existing information already suggests that the intervention in question is
efficacious, then it may be reasonable to base an economic analysis on either a
systematic review or formal meta-analysis. Meta-analysis can form the basis of
an economic evaluation by systematically summarizing the results of several
studies of a given clinical intervention providing greater confidence of treatment


effect and resource utilization than individual studies. Such an analysis is limited
by the type and quality of economic data collected or reported. Meta-analysis of
economic evaluations related to clinical trials must consider the same method-
ological challenges as other meta-analyses. The principal difficulty is identifying
and accessing all relevant results on a particular issue, considering
publication bias due to failure to publish negative study results, studies indepen-
dent of the primary trial, and studies commissioned for specific administrative
purposes. Computerized literature searches are inadequate for identifying unpub-
lished analyses. Clinical trial data banks may identify additional clinical trials
with concurrent economic evaluations but will not detect independent economic
evaluations. Any economic analysis based on the results of a meta-analysis is
constrained by the potential bias from incomplete ascertainment and by the in-
complete collection and reporting of resource use by most CCTs. Nevertheless,
when properly designed and conducted, economic analyses based on such com-
prehensive data may provide powerful information on the cost efficiency of an
intervention.

V. SUMMARY AND CONCLUSIONS

In conclusion, cancer care is associated with both clinical and economic outcomes
of interest. Economic analyses have gained increasing importance in the evalua-
tion of costly cancer treatments in the setting of limited resources (49–59). CCTs
appear to represent an excellent source of carefully obtained information for in-
corporation into economic analyses. In many ways, such trials provide a desirable
environment for assessing complementary outcomes such as costs and quality of
life in addition to measures of clinical efficacy. Recent reviews of the analysis
and interpretation of economic data in randomized controlled trials reveal a lack
of awareness about important statistical issues. Guidelines have been provided
here for the design and analysis of economic studies in association with CCTs.
Clearly, the investigator must first decide whether an economic analysis is needed
and whether a CCT is a reasonable framework. When such analyses are war-
ranted, the same methodological rigor in design, conduct, analysis, and reporting
should be applied as used for conventional measures of clinical efficacy. Ideally,
the economic analysis will be incorporated into a written and approved protocol,
including a priori hypotheses, the population to be studied, the clinical and eco-
nomic measurements to be obtained, and the planned statistical analysis. Atten-
tion to the many important issues in planning, conducting, analyzing, or reporting
an economic analysis related to a CCT discussed in this chapter will enhance the
quality and validity of the study. The investigator must ultimately decide how
to interpret and present the data and what it means in the broader health care
setting. Perhaps of greatest importance is the generalizability of economic results
of a clinical trial for the routine application of such interventions within a larger
population. In the years to come, such analyses will play an increasingly impor-
tant role in clinical decision making, individual patient counseling, evidence-
based clinical guideline development, reimbursement, and national and interna-
tional health policy formulation. The ability to properly measure and analyze
such data will greatly aid clinicians and health care planners in providing optimal
quality and cost-effective care to patients with cancer (60).

APPENDIX: DECISION MODEL THRESHOLD ANALYSIS BASED ON BENEFITS AND COSTS

Each possible outcome in a realistic clinical situation can be considered to have a certain value or utility (U) and a certain probability of disease (p). The expected
value of the treatment and no treatment strategies is therefore
EV_treatment = p · U_treat/disease + (1 − p) · U_treat/no disease
EV_no treatment = p · U_no treat/disease + (1 − p) · U_no treat/no disease
The treatment strategy associated with the greatest expected value should
be chosen to optimize the likelihood of the best result. The benefits and costs
can be derived from utility estimates as shown:
Benefit of treatment = U_treat/disease − U_no treat/disease
Cost of treatment = U_no treat/no disease − U_treat/no disease
A sensitivity analysis could be conducted comparing the expected value functions as the probability of disease is varied. Most often, however, we are interested in determining the threshold probability at which the expected values of the two strategies are equal, i.e.,

EV_treatment = EV_no treatment

p · U_treat/disease + (1 − p) · U_treat/no disease = p · U_no treat/disease + (1 − p) · U_no treat/no disease

p_threshold = (U_no treat/no disease − U_treat/no disease) / (U_treat/disease − U_no treat/disease + U_no treat/no disease − U_treat/no disease)

            = cost/(benefit + cost) = 1/[(benefit/cost) + 1]
From such a relationship, it is evident that as the ratio of benefit to cost
increases, the threshold probability of disease decreases. Above the threshold probability of disease, treatment will be associated with a greater expected value and will therefore be the favored strategy. The indications for treatment therefore broaden as the ratio of benefit to cost increases.
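As a small illustration of the arithmetic above, the following sketch computes the expected values and the threshold probability from a set of utilities. This is our own minimal example in Python; the utility values are hypothetical placeholders and are not taken from any study discussed in this chapter.

```python
def expected_values(p, u_treat_dis, u_treat_no, u_notreat_dis, u_notreat_no):
    """Expected value of the treatment and no-treatment strategies for a
    given probability of disease p and the four outcome utilities."""
    ev_treatment = p * u_treat_dis + (1 - p) * u_treat_no
    ev_no_treatment = p * u_notreat_dis + (1 - p) * u_notreat_no
    return ev_treatment, ev_no_treatment


def threshold_probability(u_treat_dis, u_treat_no, u_notreat_dis, u_notreat_no):
    """Probability of disease at which both strategies have equal expected value:
    p_threshold = cost / (benefit + cost)."""
    benefit = u_treat_dis - u_notreat_dis   # gain from treating diseased patients
    cost = u_notreat_no - u_treat_no        # loss from treating disease-free patients
    return cost / (benefit + cost)


# Hypothetical utilities on a 0-1 scale (treat/disease, treat/no disease,
# no treat/disease, no treat/no disease)
p_t = threshold_probability(0.80, 0.95, 0.40, 1.00)
print(f"threshold probability of disease: {p_t:.3f}")   # 0.111 for these utilities
# Above p_t the treatment strategy has the greater expected value; as the
# benefit-to-cost ratio increases, p_t decreases and the indications broaden.
```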

REFERENCES

1. Brown ML. The national economic burden of cancer: an update. J Natl Cancer Inst
1990; 82:1811–1814.
2. Brown ML, Fintor L. The economic burden of cancer. In: Cancer Prevention and
Control. New York: Marcel Dekker. 1995.
3. Gaumer GL, Stavins J. Medicare use in the last 90 days of life. Med Care 1991;
29:725–742.
4. Baker MS, Kessler LC, et al. Site-specific treatment costs. In: Cancer in Cancer
Care and Cost. Health Administration Press 1989.
5. Wright JC, Weinstein MC. Gains in life expectancy from medical interventions-
standardizing data on outcomes. N Engl J Med 1998; 330:380–404.
6. Cella DF, Bonomi AE. Measuring quality of life: 1995 update. Oncology 1995; 9:
47–60.
7. Weeks J. Measurement of utilities and quality-adjusted survival. Oncology 1995; 9:
67–70.
8. Gelber RD, Goldhirsch A, Cavelli F. Quality-of-life-adjusted evaluation of adjuvant
therapy for operable breast cancer. Ann Intern Med 1991; 114:621–628.
9. Gotay CC, Korn EL, McCabe MS, Moore TD, Cheson BD. Quality-of-life assess-
ment in cancer treatment protocols: research issues in protocol development. J Natl
Cancer Inst 1992; 84:575–579.
10. Drummond MF. Resource allocation decisions in health care: a role for quality of
life assessments. J Chronic Dis 1987; 40:605–616.
11. Staquet MJ, Hays RD, Fayers PM. Quality of Life Assessment in Clinical Trials:
Methods and Practice. Oxford: Oxford University Press, 1998.
12. Olschewski M, Schulgen G, Schumacher M, Altman DG. Quality of life assessment
in clinical cancer research. Br J Cancer 1994; 70:1–5.
13. Pocock SJ. A perspective on the role of quality-of-life assessment in clinical trials.
Controlled Clin Trials 1991; 12:257S–265S.
14. Fayers PM, Hopwood P, Harvey A, Girling DJ, Machin D, Stephens R. Quality of
life assessment in clinical trials—guidelines and a checklist for protocol writers: the
U.K. Medical Research Council Experience. Eur J Cancer 1997; 33:20–28.
15. Detsky AS, Naglie IG. A clinician’s guide to cost-effectiveness analysis. Ann Intern
Med 1990; 113:147–154.
16. Task Force on Principles for Economic Analyses of Health Care Technology. Eco-
nomic analyses of health care technology: a report on principles. Ann Intern Med
1995; 122:61–70.
17. American Society of Clinical Oncology. Outcomes of cancer treatment for technol-
ogy assessment and cancer treatment guidelines. J Clin Oncol 1996; 14:671–
679.
18. Russell LB, Gold MR, Siegel JE, Daniels N, Weinstein MC. The role of cost-
effectiveness analysis in health and medicine. JAMA 1996; 276:1172–1177.
19. Weinstein MC, Siegel JE, Gold MR, Kamiet MS, Russell LB. Recommendations
of the panel on cost-effectiveness in health and medicine. JAMA 1996; 276:1253–
1258.
20. Siegel JE, Weinstein MC, Russell LB, Gold MR. Recommendations for reporting
cost-effectiveness analysis. JAMA 1996; 276:1330–1341.
21. Udvarhelyi IS, Colditz GA, Rai A, Epstein AM. Cost-effectiveness and cost benefit
analyses in the medical literature: are the methods being used correctly? Ann Intern
Med 1992; 116:238–244.
22. Drummond MF, Davies L. Economic analysis alongside clinical trials. Revisiting
the methodological issues. Int J Technol Assess Health Care 1991; 7:561–573.
23. Drummond MF, Stoddart GL. Economic analysis and clinical trials. Controlled Clin
Trials 1984; 5:115–128.
24. Bennett CL, Golub R, Waters TM, Tallman MS, Rowe JM. Economic analyses of
phase III cooperative cancer group clinical trials: are they feasible? Cancer Invest
1997; 15:227–236.
25. Bennett CL, Armitage JL, Buchner D, Gulati S. Economic analysis in phase III
clinical cancer trials. Cancer Invest 12:336–342.
26. Bennett CL, Westerman IL. Economic analysis during phase III clinical trials: who,
what, when, where, and why? Oncology 1994; 9:169–175.
27. Brown M, Glick HA, Harrell F, et al. Integrating economic analysis into cancer
clinical trials: the National Cancer Institute-American Society of Clinical Oncology
Economics Workbook. J Natl Cancer Inst Monogr 1998; 24:1–28.
28. Coyle D, Davies L, Drummond MF. Trials and tribulations: emerging issues in de-
signing economic evaluations alongside clinical trials. Int J Technol Assess Health
Care 1998; 14:135–144.
29. O’Brien BJ, Drummond MF, Labelle RJ, Willan A. In search of power and signifi-
cance: issues in the design and analysis of stochastic cost-effectiveness studies in
health care. Med Care 1994; 32:150–163.
30. Rutten-Van Mölken MPMH, Van Doorslaer EKA, Van Vliet RCJA. Statistical anal-
ysis of cost outcomes in a randomized controlled clinical trial. Health Econ 1994;
3:333–345.
31. Grieve AP. Issues for statisticians in pharmaco-economic evaluations. Stat Med
1998; 17:1715–1723.
32. Barber JA, Thompson SG. Analysis and interpretation of cost data in randomized
controlled trials: review of published studies. Br Med J 1998; 317:1195–1200.
33. Thompson SG, Barber JA. How should cost data in pragmatic randomized trials be
analyzed? Brit Med J 2000; 320:1197–1200.
34. Zhou XH, Gao S. Confidence intervals for log-normal means. Stat Med 1997; 16:
783–790.
35. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am
Stat Assoc 1958; 53:457–481.
36. Desgagne A, Castilloux A-M, Angers J-F, LeLorier J. The use of the bootstrap statis-
tical method for the pharmacoeconomic cost analysis of skewed data. Pharmacoeco-
nomics 1998; 13:487–497.
37. Briggs AH, Fenn P. Confidence intervals or surfaces? Uncertainty on the cost effec-
tiveness plane. Health Econ 1998; 7:723–740.
38. Willan AR, O’Brien BJ. Confidence intervals for cost-effectiveness ratios: an appli-
cation of Fieller’s theorem. Health Econ 1996; 5:297–305.
39. Van Hout BA, Maiwenn JA, Gilad S. Costs, effects and C/E ratios alongside a
clinical trial. Health Econ 1994; 3:309–319.
40. Hlatky MA, Boothroyd DB, Johnstone IM, et al. Long-term cost-effectiveness of
alternative management strategies for patients with life-threatening ventricular ar-
rhythmias. J Clin Epidemiol 1997; 50:185–193.
41. Heitjan DF, Moskowitz AJ, Whang W. Bayesian estimation of cost-effectiveness
ratios from clinical trials. Health Econ 1999; 8:191–201.
42. Dudley RA, Harrell FE, Smith LR, et al. Comparison of analytic models for estimat-
ing the effect of clinical factors on the cost of coronary artery bypass graft surgery.
J Clin Epidemiol 1993; 46:261–271.
43. Briggs A, Sculpher M. An introduction to Markov modelling for economic evalua-
tion. Pharmacoeconomics 1998; 13:397–409.
44. Balas EA, Rainer ACK, Gnann W, et al. Interpreting cost analysis of clinical inter-
ventions. JAMA 1998; 279:54–57.
45. Drummond MF, Jefferson TO, on behalf of the BMJ Economic Evaluation Working
Party. Guidelines for authors and peer reviewers of economic submissions to the
BMJ. Br Med J 1996; 313:275–283.
46. Torgerson DS, Campbell MK. Cost effectiveness calculations and sample size. Brit
Med J 2000; 321:697.
47. Fayers PM, Hand DJ. Generalization from phase III clinical trials: survival, quality
of life, and health economics. Lancet 1997; 350:1025–1027.
48. Barber JA, Thompson SG. Analysis and interpretation of cost data in randomised
controlled trials: review of published studies. Br Med J 1998; 317:1195–1200.
49. Earle CC, Coyle D, Evans WK. Cost-effectiveness analysis in oncology. Ann Oncol
1998; 9:475–482.
50. Smith TJ, Hillner BE, Desch CE. Efficacy and cost-effectiveness of cancer treatment:
rational allocation of resources based on decision analysis. J Natl Cancer Inst 1993;
85:1460–1474.
51. Smith, TJ, Hillner BE. The efficacy and cost-effectiveness of adjuvant therapy of
early breast cancer in pre-menopausal women. J Clin Oncol 1993; 11:771–776.
52. Reeves GAG. Cost effectiveness in oncology. Lancet 1985; 2:1405–1408.
53. Goodwin PJ, Feld R, Evans WK, Pater J. Cost-effectiveness of cancer chemotherapy:
an economic evaluation of a randomized trial in small-cell lung cancer. J Clin Oncol
1988; 6:1537–1547.
54. Smith TJ, Hillner BE, Neighbors DM, McSorley PA, LeChevalier T. Economic eval-
uation of a randomized clinical trial comparing vinorelbine, vinorelbine plus cis-
platin, and vindesine plus cisplatin for non-small cell lung cancer. J Clin Oncol 1995;
13:2166–2173.
55. Jaakkimainen L, Goodman PJ, Pater J, et al. Counting the costs of chemotherapy
in a National Cancer Institute of Canada randomized trial of non-small cell lung
cancer. J Clin Oncol 1990; 8:1301–1309.
56. Hillner BE, Smith TJ, Desch CE. Efficacy and cost-effectiveness of autologous bone
marrow transplantation in metastatic breast cancer: estimates using decision analysis
while awaiting clinical trial results. JAMA 1992; 267:2055–2061.
57. Emanuel EJ, Emanuel LL. The economics of dying: the illusion of cost savings at
the end of life. N Engl J Med 1994; 330:540–544.
58. Bailes JS. Cost Aspects of palliative cancer care. Semin Oncol 1995; 22:64–66.
59. Torgerson DJ, Campbell MK. Use of unequal randomization to aid the economic
efficiency of clinical trials. Brit Med J 2000; 321:759.
60. Goldman DP, Schoenbaum ML, Potsky AL, Weeks JC, Berry SH, Escarce JJ,
Weidmer BA, Kilore ML, Wagle N, Adams JL, Figlin RA, Lewis JH, Kaplan R,
McCabe M. Measuring the incremental cost of clinical cancer research. J Clin Oncol
2000; 19:105–110.
17
Prognostic Factor Studies

Martin Schumacher, Norbert Holländer, Guido Schwarzer, and Willi Sauerbrei
Institute of Medical Biometry and Medical Informatics, University of Freiburg,
Freiburg, Germany

I. INTRODUCTION

Besides investigations on etiology, epidemiology, and the evaluation of therapies,


the identification and assessment of prognostic factors constitutes one of the ma-
jor tasks in clinical cancer research. Studies on prognostic factors attempt to
determine survival probabilities or, more generally, a prediction of the course of
the disease for groups of patients defined by the values of prognostic factors, and
to rank the relative importance of various factors. In contrast to therapeutic stud-
ies, however, where statistical principles and methods are well developed and
generally accepted, this is not the case for the evaluation of prognostic factors.
Although some efforts toward an improvement of this situation have been under-
taken (1–4), most studies investigating prognostic factors are based on historical
data lacking precisely defined selection criteria. Furthermore, sample sizes are
often far too small to serve as a basis for reliable results. As far as the statistical
analysis is concerned, a proper multivariate analysis considering simultaneously
the influence of various potential prognostic factors on overall or event-free sur-
vival of the patients is not always attempted. Missing values in some or all prog-
nostic factors constitute a serious problem that is often underestimated.
In general, the evaluation of prognostic factors based on historical data has
the advantages that follow-up and other basic data of patients might be readily
available in a database and that the values of new prognostic factors obtained
from stored tissue or blood samples may be added retrospectively. However,


such studies are particularly prone to some of the deficiencies mentioned above,
including insufficient quality of data on prognostic factors and follow-up data
and heterogeneity of the patient population due to different treatment strategies.
These issues are often not mentioned in detail in the publication of prognostic
studies but might explain, at least to some extent, why prognostic factors are
discussed controversially and why prognostic models derived from such studies
are often not accepted for practical use (5).
There have been some ‘‘classic’’ articles on statistical aspects of prognostic
factors in oncology (6–10) that describe the statistical methods and principles
that should be used to analyze prognostic factor studies. These articles, however,
do not fully address the problem that statistical methods and principles are not
adequately applied when analyzing and presenting the results of a prognostic
factor study (4,5,11,12). It is therefore a general aim of this chapter not only to
present updated statistical methodology but also to point out the possible pitfalls
when applying these methods to prognostic factor studies. Statistical aspects of
prognostic factor studies are also discussed in the monograph on prognostic fac-
tors in cancer (13) and in some recent textbooks on survival analysis (14,15). To
illustrate important statistical aspects in the evaluation of prognostic factors and
to examine the problems associated with such an evaluation in more detail, data
from three prognostic factor studies in breast cancer serve as illustrative exam-
ples. In this disease, the effects of more than 160 potential prognostic factors are
currently controversially discussed; more than 500 papers have been published in
1997. This illustrates the importance and the unsatisfactory situation in prognostic
factors research. A substantial improvement of this situation seems possible with
an improvement in the application of statistical methodology in this area.
Throughout this chapter we assume that the reader is familiar with standard statistical methods for survival data to the extent presented in more practically oriented textbooks (14–19); for a deeper understanding of why these meth-
ods work, we refer to the more theoretically oriented textbooks on survival analy-
sis and counting processes (20–22).

II. ‘‘DESIGN’’ OF PROGNOSTIC FACTOR STUDIES

The American Joint Committee on Cancer has established three major criteria
for prognostic factors: Factors must be significant, independent, and clinically
important (23). According to Hermanek et al. (13), significance implies that the
prognostic factor rarely occurs by chance, independent means that the prognostic
factor retains its prognostic value despite the addition of other prognostic factors,
and clinically important implies clinical relevance, such as being capable (at least
in principle) of influencing patient management and thus outcome.
From these criteria it becomes obvious that statistical aspects will play an
important role in the investigation of prognostic factors (13,24–26). That is also
emphasized by Simon and Altman (4), who give a concise and thoughtful review
on statistical aspects of prognostic factor studies in oncology. Recognizing that
these will be observational studies, the authors argue that they should be carried
out in a way that the same careful design standards are adopted as are used in
clinical trials, except for randomization. For confirmatory studies, which may be seen as comparable with phase III studies in therapeutic research, they listed 11
important requirements, given in a somewhat shortened version in Table 1. From
these requirements it can be deduced that prognostic factors should be investi-
gated in carefully planned prospective studies with sufficient numbers of patients
and sufficiently long follow-up to observe the end point of interest (usually event-
free or overall survival). Thus, a prospective observational study where treatment
is standardized and everything is planned in advance emerges as the most desir-
able study design. A slightly different design is represented by a randomized
controlled clinical trial where in addition to some therapeutic modalities, various
prognostic factors are investigated. It is important in such a setting that the prog-
nostic factors of interest are measured either in all patients enrolled into the clini-
cal trial or in those patients belonging to a predefined subset. Both designs, how-
ever, usually require enormous resources and especially a long time until results
will be available. Thus, a third type of ‘‘design’’ is used in most prognostic factor
studies that can be termed a ‘‘retrospectively defined historical cohort’’ where
stored tumor tissue or blood samples are available and basic and follow-up data
of the patients are already documented in a database. To meet the requirements
listed in Table 1 in such a situation, it is clear that inclusion and exclusion criteria
have to be carefully applied. Especially, treatment has to be given in a standard-

Table 1 Requirements for Confirmatory Prognostic Factor Studies According to Simon and Altman (4)

1. Documentation of intra- and interlaboratory reproducibility of assays
2. Blinded conduct of laboratory assays
3. Definition and description of a clear inception cohort
4. Standardization or randomization of treatment
5. Detailed statement of hypotheses (in advance)
6. Justification of sample size based on power calculations
7. Analysis of additional prognostic value beyond standard prognostic factors
8. Adjustment of analyses for multiple testing
9. Avoidance of outcome-orientated cutoff values
10. Reporting of confidence intervals for effect estimates
11. Demonstration of subset-specific treatment effects by an appropriate statistical
test
ized manner, at least to some sufficient extent. Otherwise, patients for whom
these requirements are not fulfilled have to be excluded from the study. If the
requirements are followed in a consistent manner, this will usually lead to a
drastic reduction in the number of patients eligible for the study as compared
with that number of patients originally available in the database. In addition,
follow-up data are often not of such quality as should be the case in a well-
conducted clinical trial or prospective study. Thus, if this design is applied, spe-
cial care is necessary to arrive at correct and reproducible results regarding the
role of potential prognostic factors.
The three types of designs described above will also be represented by the
three prognostic studies in breast cancer that we use as illustrative examples and
that are dealt with in more detail in the next section. It is interesting to note that
other types of designs (e.g., nested case-control studies, case-cohort studies, or
other study types often used in epidemiology [27]) have only been rarely used
for the investigation of prognostic factors. Their role and their potential use for
prognostic factor research has not yet been fully explored. There is one situation
where the randomized controlled clinical trial should be the design type of choice:
the investigation of so-called predictive factors that indicate whether a specific
treatment works in a subgroup of patients defined by the predictive factor but
not—or is even harmful—in another subgroup of patients. Since this is clearly
an investigation of treatment–covariate interactions, this ideally should be per-
formed in the setting of a large-scaled randomized trial where information on
the potential predictive factor is recorded and analyzed by means of appropriate
statistical methods (28–31).

III. EXAMPLES: PROGNOSTIC STUDIES IN BREAST


CANCER
A. Freiburg DNA Study
The database of the first study consisted of all patients with primary previously
untreated node-positive breast cancer who underwent surgery between 1982 and 1987 in the Department of Gynecology at the University of Freiburg and whose tumor material was available for DNA investigations. Some exclusion criteria (history of malignancy, T4 and/or M1 tumors according to the TNM classification system of the International Union Against Cancer (13), no adjuvant therapy after primary surgery, age older than 80 years, etc.) were defined retrospectively. This left
139 of 218 patients originally investigated for the analysis. This study is referred
to as the Freiburg DNA study.
Eight patient characteristics were investigated. Besides age, number of
positive lymph nodes, and size of the primary tumor, the grading score according
to Bloom and Richardson (32) and estrogen and progesterone receptor status were
recorded. DNA flow cytometry was used to measure ploidy status of the tumor
(using a cutpoint of 1.1 for the DNA index) and S-phase fraction, which is the
percentage of tumor cells in the DNA-synthesizing phase obtained by cell cycle
analysis. The distribution of these characteristics in the patient population is
shown in Table 2A.
The median follow-up was 83 months. At the time of analysis, 76 events
were observed for event-free survival, which was defined as the time from surgery
to the first of the following events: occurrence of locoregional recurrence, distant
metastasis, second malignancy, or death. Event-free survival was estimated as
50% after 5 years. Further details of the study, which we use solely for illustrative purposes, can be found elsewhere (33).

Table 2A Patient Characteristics in the Freiburg DNA Breast Cancer Study

Factor Category n (%)

Age ≤50 yr 52 (37)
>50 yr 87 (63)
No. of positive lymph nodes 1–3 66 (48)
4–9 42 (30)
≥10 31 (22)
Tumor size ≤2 cm 25 (19)
2–5 cm 73 (54)
>5 cm 36 (27)
Missing 5
Tumor grade 1 3 (2)
2 81 (59)
3 54 (39)
Missing 1
Estrogen receptor ≤20 fmol 32 (24)
>20 fmol 99 (76)
Missing 8
Progesterone receptor ≤20 fmol 34 (26)
>20 fmol 98 (74)
Missing 7
Ploidy status Diploid 61 (44)
Aneuploid 78 (56)
S-phase fraction <3.1 27 (25)
3.1–8.4 55 (50)
>8.4 27 (25)
Missing 30
B. GBSG-2 Study
The second study is a prospective, controlled, clinical trial on the treatment of
node-positive breast cancer patients conducted by the German Breast Cancer
Study Group (GBSG) (34); this study is referred to as GBSG-2 study.
The principal eligibility criterion was a histologically verified primary
breast cancer of stage T1a–3a N+ M0, that is, with positive regional lymph nodes
but no distant metastases. Primary local treatment was by a modified radical
mastectomy (Patey) with en bloc axillary dissection with at least six identifiable
lymph nodes. Patients were not older than 65 years of age and presented with a
Karnofsky index of at least 60. The study was designed as a comprehensive cohort
study (35), that is, randomized and nonrandomized patients who fulfilled the entry
criteria were included and followed according to the study procedures.
The study had a 2 × 2 factorial design with four adjuvant treatment arms:
three versus six cycles of chemotherapy with and without hormonal treatment.
Prognostic factors evaluated in the trial were patient’s age, menopausal status,
tumor size, estrogen and progesterone receptor, tumor grading according to
Bloom and Richardson (32), histological tumor type, and number of involved
lymph nodes. Histopathological classification was reexamined, and grading was
performed centrally by one reference pathologist for all cases. Event-free survival

Table 2B Patient Characteristics in GBSG-2 Study

Factor Category n (%)

Age ≤45 yr 153 (22)
46–60 yr 345 (50)
>60 yr 188 (27)
Menopausal status Pre 290 (42)
Post 396 (58)
Tumor size ≤20 mm 180 (26)
21–30 mm 287 (42)
>30 mm 219 (32)
Tumor grade 1 81 (12)
2 444 (65)
3 161 (24)
No. of positive lymph nodes 1–3 376 (55)
4–9 207 (30)
≥10 103 (15)
Progesterone receptor <20 fmol 269 (39)
≥20 fmol 417 (61)
Estrogen receptor <20 fmol 262 (38)
≥20 fmol 424 (62)
was defined as time from mastectomy to the first occurrence of either locoregional
or distant recurrence, contralateral tumor, secondary tumor, or death.
During 6 years, 720 patients were recruited, of whom about two thirds were
randomized. Complete data on the seven standard prognostic factors as given in
Table 2B were available for 686 patients (95.3%), who were taken as the basic
patient population for this study. After a median follow-up of nearly 5 years, 299
events for event-free survival and 171 deaths were observed. Event-free survival
was about 50% at 5 years. The data of this study as used in this chapter are
available from http://www.blackwellpublishers.co.uk/rss/.

C. GBSG-4 Study
As a third example we use data from a prospective study in node-negative breast
cancer conducted by the GBSG (36) that is referred to as GBSG-4 study. From
1984 to 1989, 662 patients were enrolled into the study, all having mastectomy

Table 2C Patient Characteristics in GBSG-4 Study

Factor Category n (%)

Age ≤40 yr 62 (10)
>40 yr 541 (90)
Menopausal status Pre 215 (36)
Post 388 (64)
Tumor size ≤10 mm 45 (7)
11–20 mm 236 (39)
21–30 mm 236 (39)
31–50 mm 74 (12)
>50 mm 12 (2)
Estrogen receptor <20 fmol 270 (45)
20–49 fmol 98 (16)
50–299 fmol 181 (30)
≥300 fmol 54 (9)
Progesterone receptor <20 fmol 283 (47)
20–49 fmol 81 (13)
50–299 fmol 175 (29)
≥300 fmol 64 (11)
Tumor grade 1 136 (23)
2 325 (54)
3 142 (24)
Histologic tumor type Solid 300 (50)
Invasive duct or lob 124 (21)
Others 179 (30)
and one cycle of chemotherapy given perioperatively as standardized treatment.


Age, menopausal status, tumor size, tumor grade, histological tumor type and
estrogen and progesterone receptor were recorded as prognostic factors. We re-
strict ourselves to 603 patients with complete data on the seven prognostic factors
considered. The distribution of these factors is summarized in Table 2C. Median
follow-up is about 5 years; the end point of primary interest is event-free survival
defined as the time from treatment to the first of the following events: locoregio-
nal recurrence, distant metastases, second cancer, and death. There have been
155 events observed so far; the Kaplan-Meier estimate of event-free survival at
5 years is 0.73.

IV. CUTPOINT MODEL

In prognostic factor studies, values of the factors considered are often categorized
in two or three categories. This may sometimes be done according to medical or
biological reasons or may just reflect some consensus in the scientific community.
When a ‘‘new’’ prognostic factor is investigated, the choice of such a categoriza-
tion represented by one or more cutpoints is by no means obvious. Thus, often
an attempt is made to derive such cutpoints from the data and to take those cut-
points that give the best separation in the data at hand. In the Freiburg DNA breast cancer study we consider S-phase fraction (SPF) as a ‘‘new’’ prognostic factor, although it has in fact been discussed for some years (11). For simplicity, we restrict ourselves to the problem of selecting only one cutpoint and to a so-called univariate analysis.
This means that we consider only one covariate Z—in the Freiburg DNA breast
cancer data the SPF—as a potential prognostic factor. If this covariate has been
measured on a quantitative scale, the proportional hazards (37) cutpoint model
is defined as
λ(t | Z > µ) = exp(β) λ(t | Z ≤ µ),  t > 0
where λ(t | ·) = lim_{h→0} (1/h) Pr(t ≤ T < t + h | T ≥ t, ·) denotes the hazard function of the event-free survival time random variable T. The parameter θ = exp(β) is referred to as the relative risk of observations with Z > µ with respect to observations with Z ≤ µ and is estimated through θ̂ = exp(β̂) by maximizing the corresponding partial likelihood (37) with given cutpoint µ. The fact that µ
is usually unknown makes this a problem of model selection where the cutpoint
µ has to be estimated from the data too. A popular approach for such a data-
dependent categorization is the so-called minimum p value method where—
within a certain range of the distribution of Z, the selection interval—the cutpoint
µ̂ is taken such that the p value for the comparison of observations below and
above the cutpoint is a minimum. Applying this method to SPF in the Freiburg
DNA breast cancer data we obtain, based on the logrank test, a cutpoint of µ̂ = 10.7 and a minimum p value of p_min = 0.007 when using the range between the 10% and the 90% quantile of the distribution of Z as the selection interval. Figure 1A shows the resulting p values as a function of the possible cutpoints considered; Figure 1B displays the Kaplan-Meier estimates of the event-free survival functions of the groups defined by the estimated cutpoint µ̂ = 10.7. The difference in event-free survival looks rather impressive, and the estimated relative risk with respect to the dichotomized covariate I(Z > µ̂) using the ‘‘optimal’’ cutpoint µ̂ = 10.7, θ̂ = 2.37, is quite large; the corresponding 95% confidence interval is [1.27; 4.44].
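The minimum p value search itself is easy to sketch in code. The following is a minimal Python illustration, assuming the lifelines package for the logrank test; the arrays z, time, and event are hypothetical stand-ins for S-phase fraction, event-free survival time, and the event indicator, and the resulting p value is, of course, the uncorrected p_min.

```python
import numpy as np
from lifelines.statistics import logrank_test

def minimum_p_value_cutpoint(z, time, event, lower_q=0.10, upper_q=0.90):
    """Scan all observed values of z inside the selection interval and return
    the cutpoint with the smallest (uncorrected) logrank p value."""
    z, time, event = np.asarray(z), np.asarray(time), np.asarray(event)
    lo, hi = np.quantile(z, [lower_q, upper_q])
    candidates = np.unique(z[(z >= lo) & (z <= hi)])
    best_mu, p_min = None, 1.0
    for mu in candidates:
        below = z <= mu
        if below.all() or (~below).all():          # both groups must be nonempty
            continue
        res = logrank_test(time[below], time[~below],
                           event_observed_A=event[below],
                           event_observed_B=event[~below])
        if res.p_value < p_min:
            best_mu, p_min = mu, res.p_value
    return best_mu, p_min

# mu_hat, p_min = minimum_p_value_cutpoint(spf, time, event)
```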
Simulating the null hypothesis of no prognostic relevance of SPF with respect to event-free survival (β = 0), we illustrate that the minimum p value method may lead to a drastic overestimation of the absolute value of the log-relative risk (38). By a random allocation of the observed values of SPF to the observed survival times, we simulate independence of these two variables, which is equivalent to the null hypothesis β = 0. This procedure was repeated 100 times, and in each repetition we selected a cutpoint by using the minimum p value method, which is often also referred to as an optimal cutpoint. In the 100 repetitions, we obtained 45 significant (p_min < 0.05) results for the logrank test, corresponding well to theoretical results as outlined in Lausen and Schumacher (39).

Figure 1 p Values of the logrank test as a function of all possible cutpoints for S-phase fraction (A) and Kaplan-Meier estimates of event-free survival probabilities by S-phase fraction (B) in the Freiburg DNA study.
The estimated optimal cutpoints of the 100 repetitions and the correspond-
ing estimates of the log-relative risk are shown in Figure 2A. We obtained no
estimates near the null hypothesis β = 0 as a result of the optimization process
of the minimum p value approach. Because of the well-known problems resulting
from multiple testing, it is obvious that the minimum p value method cannot lead
to correct results of the logrank test. However, this problem can be solved by
using a corrected p value p cor as proposed in Lausen and Schumacher (39), which
has been developed by taking the minimization process into account. The formula
reads


p_cor = ϕ(u) (u − 1/u) log[(1 − ε)² / ε²] + 4 ϕ(u)/u

where ϕ denotes the probability density function and u is the (1 − p_min/2) quantile of the standard normal distribution. The selection interval is characterized by the proportion ε of smallest and largest values of Z that are not considered as potential cutpoints. It should be mentioned that other approaches of correcting the mini-
mum p value could be applied; a comparison of three approaches can be found in an article by Hilsenbeck and Clark (40). Especially, if there are only a few cutpoints, an improved Bonferroni inequality can be applied (41–43). Using the correction formula in the 100 repetitions of our simulation experiment, we obtained four significant results (p_cor < 0.05), corresponding well to the significance level of α = 0.05. Four significant results were also obtained with the usual p value when using the median of the empirical distribution of SPF in the original data as a fixed cutpoint in all repetitions.

Figure 2 Estimates of cutpoints and log-relative risks in 100 repetitions of randomly allocated observed SPF values to event-free survival times in the Freiburg DNA study before (A) and after (B) correction.
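The correction formula is easily programmed. The helper below is a minimal Python sketch using scipy for the standard normal density and quantile function; applied to p_min = 0.007 with ε = 0.10 it should reproduce, up to rounding, the corrected p value of 0.123 reported later in this section for the Freiburg DNA data.

```python
from math import log
from scipy.stats import norm

def corrected_p_value(p_min, eps=0.10):
    """Corrected p value of Lausen and Schumacher (39) for a minimum p value
    obtained over the selection interval [eps, 1 - eps] of the covariate."""
    u = norm.ppf(1 - p_min / 2)      # (1 - p_min/2) quantile of the standard normal
    phi_u = norm.pdf(u)              # standard normal density at u
    return phi_u * (u - 1 / u) * log((1 - eps) ** 2 / eps ** 2) + 4 * phi_u / u

print(round(corrected_p_value(0.007, eps=0.10), 3))   # approximately 0.123
```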
To correct for overestimation, a so-called shrinkage factor has been pro-
posed (44) to shrink the parameter estimates. Considering the cutpoint model,
the log-relative risk should then be estimated by
β̂_cor = ĉ · β̂
where β̂ is based on the minimum p value method and ĉ is the estimated shrinkage factor. Values of ĉ close to one should indicate a minor degree of overestimation, whereas small values of ĉ should reflect a substantial overestimation of the log-relative risk. Obviously, with maximum partial likelihood estimation of c in a model
λ(t | SPF > µ) = exp(cβ̂) λ(t | SPF ≤ µ)
using the original data, we get ĉ = 1 since β̂ is the maximum partial likelihood estimate. Recently, Schumacher et al. (45) compared several methods to estimate ĉ. In Figure 2B the results of the correction process in the 100 simulated studies are displayed when a heuristic estimate ĉ = (β̂² − var(β̂))/β̂² was applied where
β̂ and var(β̂) result from the minimum p value method (46). This heuristic esti-
mate performed quite well when compared with more elaborated cross-validation
and resampling approaches (45).
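The heuristic shrinkage correction can be written down in a few lines; the sketch below is ours and simply encodes the formula above. The example values are not new data: β̂ = log(2.37) and var(β̂) ≈ 0.10 are derived from the estimate and 95% confidence interval reported earlier for the optimal SPF cutpoint, and they roughly reproduce the corrected relative risk of about 2.1 mentioned below.

```python
from math import exp, log

def heuristic_shrinkage(beta_hat, var_beta_hat):
    """Heuristic shrinkage factor c_hat = (beta^2 - var(beta)) / beta^2 and the
    corrected log-relative risk c_hat * beta_hat."""
    c_hat = (beta_hat ** 2 - var_beta_hat) / beta_hat ** 2
    return c_hat, c_hat * beta_hat

beta_hat = log(2.37)        # estimate from the minimum p value method
var_beta_hat = 0.10         # squared standard error implied by the reported 95% CI
c_hat, beta_cor = heuristic_shrinkage(beta_hat, var_beta_hat)
print(round(c_hat, 2), round(exp(beta_cor), 2))   # shrinkage factor and corrected RR (about 2.1)
```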
In general, it has to be recognized that the minimum p value method leads
to a dramatic inflation of the type I error rate; the chance of declaring a quantita-
tive factor as prognostically relevant when in fact it does not have any influence
on event-free survival is about 50% when a level of 5% has been intended. Thus,
correction of p values is essential but leaves the problem of overestimation of
the relative risk in absolute terms. The latter problem that is especially relevant
when sample sizes and/or effect sizes are of small or moderate magnitude could
at least partially be solved by applying some shrinkage method. It should be
noted, however, that the optimal cutpoint approach has further disadvantages.
One of these is that in almost every study where this method is applied, another
cutpoint will emerge. This makes comparisons across studies extremely difficult
or even impossible. Altman et al. (11) pointed out this problem for studies of
the prognostic relevance of SPF in breast cancer published in the literature. They
identified 19 different cutpoints used in the literature, and some of them were
solely used because they emerged as the optimal cutpoint in a specific data set.
Thus, other approaches, such as regression modeling, might be preferred.
In the Freiburg DNA breast cancer data, we obtain a corrected p value of p_cor = 0.123 that provides no clear indication that S-phase is of prognostic relevance for node-positive breast cancer patients. The correction of the relative risk estimate by applying some shrinkage factor leads to a value of θ̂_cor = 2.1 for the heuristic method and to θ̂_cor = 2 for the cross-validation and bootstrap approaches. Unfortunately, confidence intervals are not straightforward to obtain; bootstrapping the whole model-building process including the estimation of a shrinkage factor would be one possibility. In contrast, taking S-phase as a continuous covariate with an assumed log-linear relationship in a conventional Cox regression model,

λ(t | Z) = λ₀(t) exp(β̃ Z)

leads to a p value of p = 0.061 for testing the null hypothesis β̃ = 0. For comparison, the estimated log-relative risks for both approaches are displayed in Figure 3.

Figure 3 Log-relative risk for S-phase fraction in the Freiburg DNA study estimated
by the minimum p value method, before and after correction, and by a Cox model assuming
a log-linear relationship.
V. REGRESSION MODELING AND RELATED ASPECTS

The standard tool for analyzing the prognostic relevance of various factors—in
more technical terms usually called covariates—is the Cox proportional hazards
regression model (37,47). If we denote the prognostic factors under consideration
by Z_1, Z_2, . . . , Z_k, then the model is given by

λ(t | Z_1, Z_2, . . . , Z_k) = λ₀(t) exp(β_1 Z_1 + β_2 Z_2 + . . . + β_k Z_k)

where λ(t | ·) denotes the hazard function of the event-free or overall survival time random variable T and λ₀(t) is the unspecified baseline hazard. The estimated log-relative risks β̂_j can then be interpreted as estimated ‘‘effects’’ of the factors Z_j (j = 1, . . . , k). If Z_j is measured on a quantitative scale, then exp(β̂_j) represents the increase or decrease in risk if Z_j is increased by one unit; if Z_j is a binary covariate, then exp(β̂_j) is simply the relative risk of category 1 to the reference category (Z_j = 0), which is assumed to be constant over the time range consid-
ered. It has to be noted that the ‘‘final’’ multivariate regression model is often
the result of a more or less extensive model-building process that may involve
the categorization and/or transformation of covariates and the selection of vari-
ables in an automatic or a subjective manner. This model-building process should
in principle be taken into account when judging the results of a prognostic study; in practice it is often neglected. We come back to this problem on several occasions below, especially in Sections VII and VIII.
We demonstrate various approaches with the data of the GBSG-2 study.
The factors listed in Table 2B are investigated with regard to their prognostic
relevance. Since all patients received adjuvant chemotherapy in a standardized manner and there appeared to be no difference between three and six cycles (34), chemotherapy is not considered any further. Because of the patients’ preference in
the nonrandomized part and because of a change in the study protocol concerning
premenopausal patients, only about a third of the patients received hormonal
treatment. Age and menopausal status had a strong influence on whether this
therapy was administered. Therefore, all analyses were adjusted for hormonal
treatment.
Since the impact of hormonal treatment is not of primary interest in this
prognostic study, this was done by using a Cox regression model stratified for
hormonal treatment, that is, the baseline hazard is allowed to vary between the
two strata while keeping the regression coefficients of the other factors constant
over strata.
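As an illustration of this first step, a stratified Cox model of this kind can be fitted as follows. This is a minimal sketch assuming the Python package lifelines and a data frame df whose column names (time, event, hormone, and the seven factors) are hypothetical; it is not the software actually used for the analyses reported here.

```python
from lifelines import CoxPHFitter

covariates = ["age", "menopausal_status", "tumor_size", "tumor_grade",
              "lymph_nodes", "progesterone", "estrogen"]
cph = CoxPHFitter()
cph.fit(df[covariates + ["hormone", "time", "event"]],
        duration_col="time", event_col="event",
        strata=["hormone"])     # separate baseline hazard for each treatment stratum
cph.print_summary()             # relative risks exp(coef), Wald tests, confidence intervals
```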
In a first attempt all quantitative factors are included as continuous covari-
ates assuming a log-linear relationship. Age is taken in years, tumor size in mm, and so on; menopausal status is a binary covariate per se; specifically, we coded
‘‘0’’ for premenopausal and ‘‘1’’ for postmenopausal patients. Grade is consid-
ered as a quantitative covariate in this approach, that is, the relative risk between grade
categories 1 and 2 is the same as between grade categories 2 and 3. The results
of this Cox regression model are given in Table 3 in terms of estimated relative
risks and p values of the corresponding Wald tests under the heading ‘‘full
model.’’ In a publication, this should at least be accompanied by confidence intervals for the relative risks, which we have omitted here in order not to present too many numbers. From this full model it can be seen that tumor size, tumor grade,
number of positive lymph nodes, and the progesterone receptor have a significant
impact on event-free survival, when a significance level of 5% is used. Age,
menopausal status, and the estrogen receptor do not exhibit prognostic relevance.
The full model has the advantage that the regression coefficients of the factors
considered can be estimated in an unbiased fashion; it is, however, hampered by
the fact that the assumed log-linear relationship for quantitative factors may be
in sharp contrast to the real situation and that also irrelevant factors are included
that will not be needed in subsequent steps, for example, in the formation of risk
groups defined by the prognostic factors. In addition, correlation between various
factors may lead to undesirable statistical properties of the estimated regression
coefficients, such as inflation of standard errors or problems of instability caused
by multicollinearity. It is therefore desirable to arrive at a simple and parsimoni-
ous ‘‘final model’’ that only contains those prognostic factors that strongly affect
event-free survival (48). The three other columns of Table 3 contain the results
of the Cox regression models obtained after backward elimination (BE) for three
different selection levels (49). For selection of a single factor, backward elimina-
tion with a selection level of 15.7% (BE(0.157)) corresponds asymptotically to
the well-known Akaike information criterion whereas selection levels of 5% or
even 1% lead to a more stringent selection of factors (50). In general, backward
elimination can be recommended because of several advantages compared with
other stepwise variable selection procedures (48,51,52).
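A backward elimination step of this kind is easy to sketch; the function below repeatedly refits the Cox model and drops the factor with the largest Wald p value until all remaining p values are below the selection level. It is again only an illustration assuming lifelines and numerically coded covariates, not the authors' implementation.

```python
from lifelines import CoxPHFitter

def backward_elimination(df, covariates, duration_col, event_col, alpha=0.05):
    """Backward elimination based on Wald p values with selection level alpha."""
    current = list(covariates)
    while current:
        cph = CoxPHFitter()
        cph.fit(df[current + [duration_col, event_col]],
                duration_col=duration_col, event_col=event_col)
        pvals = cph.summary["p"]
        if pvals.max() <= alpha:          # all remaining factors are significant
            return cph, current
        current.remove(pvals.idxmax())    # drop the least significant factor and refit
    return None, []

# model, selected = backward_elimination(df, covariates, "time", "event", alpha=0.05)
```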
In the GBSG-2 study, tumor grade, lymph nodes, and progesterone receptor
are selected for all three selection levels considered; when using 15.7% as the
selection level, tumor size is included in addition. Thus, the results of the full
model and the three backward elimination procedures do not differ too much in
these particular data; this, however, should not be expected in general. One reason
might be that there is a relatively clearcut difference between three strong factors
(and tumor size that seems to have a borderline influence) and the others that
show only a negligible influence on event-free survival in this study.
The previous approach implicitly assumes that the influence of a prognostic
factor on the hazard function follows a log-linear relationship. By taking lymph
nodes as the covariate Z, for example, this means that the risk is increased by
the factor exp(β̂) if the number of positive lymph nodes is increased from l to l + 1 for l = 1, 2, . . . . This could be a questionable assumption at least for
large numbers of positive lymph nodes. For other factors even monotonicity of
the log-relative risk may be violated, which could result in overlooking an impor-
Table 3 Estimated Relative Risks (RR) and Corresponding p Values in the Cox Regression Models for the GBSG-2 Study;
Quantitative Prognostic Factors Are Taken as Continuous Covariates Assuming a Log-Linear Relationship

Full model BE (0.157) BE (0.05) BE (0.01)

Factor RR p value RR p value RR p value RR p value

Age 0.991 0.31 — — — — — —
Menopausal status 1.310 0.14 — — — — — —
Tumor size 1.008 0.049 1.007 0.061 — — — —
Tumor grade 1.321 0.009 1.325 0.008 1.340 0.006 1.340 0.006
Lymph nodes 1.051 <0.001 1.051 <0.001 1.057 <0.001 1.057 <0.001
Progesterone receptor 0.998 <0.001 0.998 <0.001 0.998 <0.001 0.998 <0.001
Estrogen receptor 1.000 0.67 — — — — — —

tant prognostic factor. Because of this uncertainty, the prognostic factors under
consideration are often categorized and so-called dummy variables for the differ-
ent categories are defined, thus avoiding that the categorized factors are treated
as quantitative covariates. In the GBSG-2 study, the categorization presented in
Table 2B is used that was specified independently of the specific data set in
accordance with the literature (34). For those factors with three categories, two
binary dummy variables were defined contrasting the corresponding category
with the reference category chosen as that with the lowest values. So, for example,
lymph nodes were categorized into 1–3, 4–9, and ≥10 positive nodes; 1–3
positive nodes serves as the reference category. Table 4 displays the results of
the Cox regression model for the categorized covariates; again, the results of the
full model are supplemented by those obtained after backward elimination with
three selection levels. Elimination of only one dummy variable corresponding to
a factor with three categories would correspond to an amalgamation of categories
(8). In these analyses where tumor grade, lymph nodes, and progesterone receptor
show again the strongest effects, age and menopausal status are also marginally
significant and are included into the model by backward elimination with a selec-
tion level of 15.7%. For age, there is some indication that linearity or even mono-
tonicity of the log-relative risk may be violated. Grade categories 2 and 3 do not
seem well separated as is suggested by the previous approach presented in Table
3 where grade was treated as ordinal covariate; the latter one would lead to esti-
mated relative risks of 1.321 and 1.745 ⫽ (1.321) 2 for grade 2 and grade 3,
respectively, in contrast to values of 1.723 and 1.746 when using dummy vari-
ables. The use of dummy variables may also be the reason that grade is no longer
included by backward elimination with a selection level of 1%. In Table 4 we
give the p values of the Wald tests for the two dummy variables separately;
alternatively, we could also test the two-dimensional vector of corresponding
regression coefficients to be zero. In any case this needs two degrees of freedom,
whereas when treating grade as a quantitative covariate, one degree of freedom
would be sufficient. The data of the GBSG-2 study suggest that grade categories
2 and 3 could be amalgamated into one category (grade 2–3); this would lead
to an estimated relative risk of 1.728 and a corresponding p value of 0.019.
The results of the two approaches presented in Tables 3 and 4 show that
model building within the framework of a prognostic study has to find a compro-
mise between sufficient flexibility with regard to the functional shape of the un-
derlying log-relative risk functions and simplicity of the derived model to avoid
problems with serious overfitting and instability. From this point of view the first
approach assuming all relationships to be log-linear may not be flexible enough
and may not capture important features of the relationship between various prog-
nostic factors and event-free survival. On the other hand, the categorization used
in the second approach can always be criticized because of some degree of arbi-
trariness and subjectivity concerning the number of categories and the specific
Table 4 Estimated Relative Risks (RR) and Corresponding p Values in the Cox Regression Models for the GBSG-2 Study;
Prognostic Factors Are Categorized as in Table 2B

Full model BE (0.157) BE (0.05) BE (0.01)

Factor RR p value RR p value RR p value RR p value

Age ≤45 1 — 1 — — — — —
45–60 0.672 0.026 0.679 0.030 — — — —
>60 0.687 0.103 0.692 0.108 — — — —
Menopausal status Pre 1 — 1 — — — — —
Post 1.307 0.120 1.304 0.120 — — — —
Tumor size ≤20 1 — — — — — — —
21–30 1.240 0.165 — — — — — —
>30 1.316 0.089 — — — — — —
Tumor grade 1 1 — 1 — 1 — — —
2 1.723 0.031 1.718 0.032 1.709 0.033 — —
3 1.746 0.045 1.783 0.036 1.778 0.037 — —
Lymph nodes 1–3 1 — 1 — 1 — 1 —
4–9 1.976 <0.001 2.029 <0.001 2.071 <0.001 2.110 <0.001
≥10 3.512 <0.001 3.687 <0.001 3.661 <0.001 3.741 <0.001
Progesterone receptor <20 1 — 1 — 1 — 1 —
≥20 0.545 <0.001 0.545 <0.001 0.536 <0.001 0.494 <0.001
Estrogen receptor <20 1 — — — — — — —
≥20 0.994 0.97 — — — — — —

cutpoints chosen. In addition, it will not fully exploit the information available
and will be associated with some loss in efficiency. For a more flexible modeling
of the functional relationship, a larger number of cutpoints and corresponding
dummy variables would be needed. We will therefore sketch a third approach
that will provide more flexibility while preserving simplicity of the final model
to an acceptable degree.
The method has been originally developed by Royston and Altman (53) and
has been termed the ‘‘fractional polynomial’’ (FP) approach. For a quantitative
covariate Z it uses functions β_0 + β_1 Z^p + β_2 Z^q to model the log-relative risk; the powers p and q are taken from the set {−2, −1, −0.5, 0, 0.5, 1, 2, 3} and Z^0 is defined as log Z. This simple extension of ordinary polynomials generates
a considerable range of curve shapes while still preserving simplicity when com-
pared with smoothing splines or other nonparametric techniques, for example.
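To give an idea of how such a model is found, the sketch below selects the best first-degree fractional polynomial (a single power from the set above) for one covariate by maximizing the Cox partial likelihood. It is a deliberately simplified illustration assuming lifelines (and its log_likelihood_ attribute); the full FP procedure of Royston and Altman additionally searches second-degree polynomials and uses a formal sequence of tests.

```python
import numpy as np
from lifelines import CoxPHFitter

POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]

def fp_transform(z, p):
    """Fractional polynomial transformation; Z^0 is defined as log Z.
    Assumes z > 0 (e.g., progesterone receptor + 1, as in the text)."""
    z = np.asarray(z, dtype=float)
    return np.log(z) if p == 0 else z ** p

def best_fp1(df, covariate, duration_col="time", event_col="event"):
    """Return (power, partial log-likelihood, fitted model) of the best FP1 model."""
    best = None
    for p in POWERS:
        data = df[[duration_col, event_col]].copy()
        data["fp"] = fp_transform(df[covariate], p)
        cph = CoxPHFitter().fit(data, duration_col=duration_col, event_col=event_col)
        if best is None or cph.log_likelihood_ > best[1]:
            best = (p, cph.log_likelihood_, cph)
    return best
```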
Sauerbrei and Royston (54) extended the proposed multivariate FP ap-
proach to a model-building strategy considering transformation and selection of
variables. Without going into the details of this model-building process reported
elsewhere (54,55), we summarize the results in Table 5. For age, the powers −2 and −0.5 have been estimated and provide significant contributions to the log-
relative risk function. This function is displayed in Figure 4A in comparison with
the corresponding functions derived from the two other approaches. It provides
some further indication that there is a nonmonotonic relationship that would be
overlooked by the log-linear approach. Grade categories 2 and 3 have been amal-
gamated as has been pointed out above. For lymph nodes a further restriction
has been incorporated by assuming that the relationship should be monotone with
an asymptote for large numbers of positive nodes. This was achieved by using
the simple primary transformation exp(−0.12 · lymph nodes) where the factor
0.12 was estimated from the data (54). The estimated power for this transformed
variable was equal to one and a second power was not needed. Likewise, for
progesterone receptor, a power of 0.5 was estimated that gives a significant contri-

Table 5 Estimated Regression Coefficients and Corresponding p Values in the Final Cox Regression Model for the GBSG-2 Study Using the Fractional Polynomial Approach

Factor/function Regression coefficient p Value

(Age/50)^−2 1.742 <0.001
(Age/50)^−0.5 −7.812 <0.001
Tumor grade 1 0 —
Tumor grade 2–3 0.517 0.026
exp(−0.12 · Lymph nodes) −1.981 <0.001
(Progesterone receptor + 1)^0.5 −0.058 <0.001
bution to the log-relative risk functions. Figure 4 shows these functions for lymph nodes (B) and progesterone receptor (C) in comparison with those derived from the log-linear and from the categorization approaches. For lymph nodes, it suggests that the log-linear approach underestimates the increase in risk for small numbers of positive nodes, whereas it substantially overestimates it for very large numbers. The categorization approach seems to provide a reasonable compromise for this factor.

Figure 4 Estimated log-relative risk functions for age (A), lymph nodes (B), and progesterone receptor (C) obtained by the FP, categorization, and log-linear approaches in the GBSG-2 study.
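For concreteness, the final FP model of Table 5 can be written out as a small function of the four selected factors. The coefficients below are transcribed directly from Table 5; the function gives the estimated log-relative risk up to an additive constant, so only differences (or ratios of risks) between patient profiles are meaningful.

```python
import numpy as np

def log_relative_risk(age, grade, lymph_nodes, progesterone):
    """Estimated log-relative risk of the GBSG-2 fractional polynomial model (Table 5)."""
    lp = 1.742 * (age / 50.0) ** -2 - 7.812 * (age / 50.0) ** -0.5
    lp += 0.517 * (1 if grade >= 2 else 0)          # grade 2-3 versus grade 1
    lp += -1.981 * np.exp(-0.12 * lymph_nodes)      # monotone with an asymptote
    lp += -0.058 * np.sqrt(progesterone + 1.0)
    return lp

# Relative risk of one (hypothetical) patient profile versus another:
rr = np.exp(log_relative_risk(45, 3, 8, 10) - log_relative_risk(60, 1, 1, 200))
```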
At the end of this section, some general comments are in order. First, there
is a variety of other flexible methods available that has not been presented here.
Sauerbrei and Royston (54) provide a graphical comparison of the FP approach
with generalized additive models (56) for the log-relative risk functions displayed
in Figure 4, A–C. Various other nonparametric methods could in principle be
used. It should be stressed, however, that there must always be a compromise
between flexibility and simplicity and that simple models have the additional
advantage that they can be interpreted by clinical colleagues more easily. The
aspect of model complexity is discussed in more detail by Sauerbrei (48). Second,
the analyses presented for the GBSG-2 study concentrated on the Cox regression
model, the standard statistical tool for prognostic studies and without any doubt
the one that is most commonly used. This, however, has important consequences:
Model checking with regard to the assumptions of this model has to be carefully
undertaken. Some special aspects have already been addressed above (e.g., the
log-linear relationship in a ‘‘standard’’ Cox model); numerous others can be
found in textbooks and review articles on survival analysis (14,57,58). If impor-
tant assumptions appear to be seriously violated, extensions of the Cox model
(e.g., with time-varying regression coefficients) or other models should be taken
into consideration; some alternative approaches are discussed in Sections VI and
VIII.
Third, when dealing with prognostic factor studies, features other than fulfillment of model assumptions become more important. One is stability, which addresses the question of whether we could replicate the selected final model with different data. Bootstrap resampling has been applied to investigate the stability
of the selected ‘‘final model’’ (59–61). In each bootstrap sample, the whole
model selection or building process is repeated and the results are summarized
over the bootstrap samples. We illustrate this procedure for backward elimination
with a selection level of 5% in the Cox regression model with quantitative factors
included as continuous covariates (Table 3). Because of simplicity we do not
consider the selection process including transformation of covariates as was used
by Sauerbrei and Royston (54). In Table 6 the inclusion frequencies over 1000
bootstrap samples are given for the prognostic factors under consideration. These
frequencies underline that tumor grade, lymph nodes, and progesterone receptor

Table 6 Inclusion Frequencies over 1000 Bootstrap Samples Using the Backward Elimination Method (BE(0.05)) with a Selection Level of 5% in the GBSG-2 Study

Factor Inclusion frequency

Age 18.2%
Menopausal status 28.8%
Tumor size 38.1%
Tumor grade 62.3%
Lymph nodes 100%
Progesterone receptor 98.1%
Estrogen receptor 8.1%
are by far the strongest factors; lymph nodes are always included, progesterone
receptor in 98% and tumor grade in 62% of the bootstrap samples, respectively.
The percentage of bootstrap samples where exactly this model—containing these
three factors only—is selected is 26.1%. In 60.4% of the bootstrap samples a
model is selected that contains these three factors, possibly with other selected
factors. These figures might be much lower in other studies where more factors
with a weaker effect are investigated. Bootstrap resampling of this type also pro-
vides insight into the interdependencies between different factors by inspecting
the bivariate inclusion frequencies (61).
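A bootstrap investigation of this kind can be sketched in a few lines, reusing the backward_elimination helper shown earlier; as before, lifelines and a suitably coded data frame are assumed, and the run time for 1000 resamples of a model-building process can be considerable.

```python
import numpy as np
from collections import Counter

def bootstrap_inclusion_frequencies(df, covariates, duration_col, event_col,
                                    alpha=0.05, n_boot=1000, seed=1):
    """Repeat backward elimination on bootstrap samples and tabulate how often
    each factor ends up in the selected model."""
    rng = np.random.default_rng(seed)
    counts = Counter()
    for _ in range(n_boot):
        sample = df.sample(n=len(df), replace=True,
                           random_state=int(rng.integers(2 ** 31 - 1)))
        _, selected = backward_elimination(sample, covariates,
                                           duration_col, event_col, alpha=alpha)
        counts.update(selected)
    return {z: counts[z] / n_boot for z in covariates}
```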

VI. CLASSIFICATION AND REGRESSION TREES

Analysis by building a hierarchical tree is one approach for nonparametric model-


ing of the relationship between a response variable and several potential prognos-
tic factors. Breiman et al. (62) gives a comprehensive description of the method
of classification and regression trees (CART) that has been modified and extended
in various directions (63). We concentrate solely on the application to survival
data (64–68) and use the abbreviation CART as a synonym for different types
of tree-based analyses.
Briefly, the idea of CART is to construct subgroups that are internally as
homogeneous as possible with regard to the outcome and externally as separated
as possible. Thus, the method leads directly to prognostic subgroups defined by
the potential prognostic factors. This is achieved by a recursive tree-building
algorithm. As in Section V, we start with k potential prognostic factors Z_1,
Z_2, . . . , Z_k that may have an influence on the survival time random variable T.
We define a minimum number of patients within a subgroup, n_min say, and prespecify
an upper bound for the p values of the logrank test statistic, p_stop. Then
the tree-building algorithm is defined by the following steps (42):

1. The minimal p value of the logrank statistic is computed for all k factors
   and all allowable splits within the factors. An allowable split is given by a
   cutpoint of a quantitative or an ordinal factor within a given range of the
   distribution of the factor or by some bipartition of the classes of a nominal factor.
2. The whole group of patients is split into two subgroups based on the factor
   and the corresponding cutpoint with the minimal p value, provided that this
   minimal p value is smaller than or equal to p_stop.
3. The partition procedure is stopped if no allowable split exists, if the minimal
   p value is greater than p_stop, or if the size of the subgroup is smaller than n_min.
4. For each of the two resulting subgroups, the procedure is repeated.

This tree-building algorithm yields a binary tree with a set of patients, a splitting
rule, and the minimal p value at each interior node. For the patients in the resulting
final nodes, which may again be combined by some amalgamation, various quantities
of interest, such as Kaplan-Meier estimates of event-free survival or relative risks
with respect to some reference, can be computed.
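The following sketch makes the recursive structure explicit. It assumes a pandas DataFrame df with columns time and event plus the candidate factors (hypothetical names), uses the logrank test from lifelines, and, for brevity, omits both the p value correction and the restriction of cutpoints to the 10%–90% range discussed in Section IV; both would have to be added for a faithful implementation.

```python
# Sketch: recursive tree building with logrank splits, n_min and p_stop criteria.
import numpy as np
from lifelines.statistics import logrank_test

def best_split(df, factors, n_min):
    """Return (p value, factor, cutpoint) of the best allowable split, or (1, None, None)."""
    best = (1.0, None, None)
    for f in factors:
        for cut in np.unique(df[f])[:-1]:                  # candidate cutpoints
            left, right = df[df[f] <= cut], df[df[f] > cut]
            if len(left) < n_min or len(right) < n_min:    # split not allowable
                continue
            res = logrank_test(left["time"], right["time"],
                               event_observed_A=left["event"],
                               event_observed_B=right["event"])
            if res.p_value < best[0]:
                best = (res.p_value, f, cut)
    return best

def grow_tree(df, factors, n_min=20, p_stop=0.05, depth=0):
    p, f, cut = best_split(df, factors, n_min)
    if f is None or p > p_stop:                            # stopping rule
        print("  " * depth +
              f"final node: n={len(df)}, events={int(df['event'].sum())}")
        return
    print("  " * depth + f"split on {f} <= {cut} (p={p:.4g})")
    grow_tree(df[df[f] <= cut], factors, n_min, p_stop, depth + 1)
    grow_tree(df[df[f] > cut], factors, n_min, p_stop, depth + 1)

# grow_tree(gbsg2, ["age", "size", "grade", "nodes", "pgr", "er"])
```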
Since the potential prognostic factors are usually measured on different
scales, the number of possible partitions will also be different. This leads to the
problems that have already been extensively discussed in Section IV: because of
multiple testing, factors allowing more splits have a higher chance of being selected
by the tree-building algorithm and may be preferred to binary factors with prognostic
relevance. Thus, correction of p values and/or restriction to a set of few prespecified
cutpoints may be useful to overcome this problem.
We illustrate the procedure by means of the GBSG-2 study. If we restrict
the possible splits to the range between the 10% and 90% quantiles of the empirical
distribution of each factor, then the factor age, for example, will allow 25 splits,
whereas the binary factor menopausal status will allow only 1 split. Likewise,
tumor size will allow 32 possible splits and tumor grade, 2. Lymph nodes will
allow for 10 possible splits, and progesterone and estrogen receptor offer 182
and 177 possible cutpoints, respectively. Thus, we decide to use the p value
correction as outlined in Section IV, and we define n_min = 20 and p_stop = 0.05. As
a splitting criterion we use the test statistic of the logrank test; for simplicity,
the logrank test was not stratified for hormonal therapy, which could have been done
in principle.
We start with the whole group of 686 patients (the "root"), where a total
of 299 events (crude event rate 43.6%) has been observed. The factor with the
smallest corrected p value is lymph nodes, and the whole group is split at an
estimated cutpoint of nine positive nodes (p_cor < 0.0001), yielding a subgroup of
583 patients with less than or equal to nine positive nodes (event rate 38.8%)
and a subgroup of 103 patients with more than nine positive nodes (event rate
70.9%). The procedure is then repeated with the left node (patients with number
of positive lymph nodes ≤9) and the right node (patients with number of positive
lymph nodes >9). At this level, in the left node, lymph nodes again appeared
to be the strongest factor. With a cutpoint of three positive nodes (p_cor < 0.0001),
this yields a subgroup of 376 patients with less than or equal to three positive
nodes (event rate 31.6%) and a subgroup of 207 patients with four to nine positive
nodes (event rate 51.7%). For the right node (patients with more than nine positive
nodes), progesterone receptor is associated with the smallest corrected p
value and the cutpoint is obtained as 23 fmol (p = 0.0003). This yields subgroups
with 43 patients (progesterone receptor > 23, event rate 51.2%) and 60 patients
(progesterone receptor ≤ 23, event rate 85%), respectively. In these two subgroups,
no further splits are possible because of the p_stop criterion; thus they are
regarded as final nodes.

The subgroups of patients with one to three and four to nine positive nodes
allow further splits; again, progesterone receptor is the strongest factor, with
cutpoints of 90 fmol (p_cor = 0.006) and 55 fmol (p_cor = 0.0018), respectively.
Because of the p_stop criterion, no further splits are possible and the resulting
subgroups are considered as final nodes too. The result of the tree-building
procedure is summarized in Figure 5. In this graphical representation, the width
of the boxes is proportional to the size of the subgroups, whereas the centers of
the boxes correspond to the observed event rates. This presentation allows an
immediate visual impression of the resulting prognostic classification obtained
by the final nodes of the tree.
As already outlined above, a variety of definitions of CART-type algorithms
exists; they usually consist of tree building, pruning, and amalgamation
(62,63,69,70). We present a somewhat different algorithm that concentrates on
the tree-building process. To protect against serious overfitting of the data,
which in other algorithms is accomplished by tree pruning, we define various
restrictions such as the p_stop and n_min criteria and the use of corrected p values.
Applying these restrictions, we obtain the tree displayed in Figure 5, which is
parsimonious in the sense that only the strongest factors, lymph nodes and
progesterone receptor, are selected for the splits.

Figure 5  Classification and regression tree obtained for the GBSG-2 study; p value correction but no prespecification of cutpoints was used.

However, the values of the cutpoints obtained for progesterone receptor (90, 55,
and 23 fmol) are somewhat arbitrary and may not be reproducible and/or
comparable with those obtained in other studies. Thus, another useful restriction
may be the definition of a set of prespecified possible cutpoints for each factor.
In the GBSG-2 study we used 35, 40, 45, 50, 55, 60, 65, and 70 years for age;
10, 20, 30, and 40 mm for tumor size; and 5, 10, 20, 100, and 300 fmol for
progesterone and estrogen receptors. The resulting tree is displayed in Figure 6A.
It differs from the tree without this restriction only in that the selected cutpoints
for the progesterone receptor are now 100, 20, and 20 fmol in the final nodes.
For comparison, the trees obtained without the p value correction, with and
without prespecification of a set of possible cutpoints, are presented in Figure 6,
B and C. Since lymph nodes and progesterone receptor are the dominating
prognostic factors in this patient population, the resulting trees are identical at
the first two levels to those where the p values have been corrected. The final
nodes in the latter ones, however, will again be split, leading to a larger number
of final nodes. In addition, other factors like age, tumor size, and estrogen receptor
are now used for the splits at subsequent nodes too. A more detailed investigation
of the influence of p value correction and prespecification of possible cutpoints
on the resulting trees and their stability is given by Sauerbrei (71).

Figure 6  Classification and regression trees obtained for the GBSG-2 study. p Value correction and prespecification of cutpoints (A); no p value correction with (B) and without (C) prespecification of cutpoints.

VII. FORMATION AND VALIDATION OF RISK GROUPS

The final nodes of a regression tree define a prognostic classification scheme per
se; to be useful in practice, however, some combination of final nodes into
prognostic subgroups might be indicated. This is especially important if the number
of final nodes is large and/or if the prognosis of patients in different final nodes
is comparable. From the regression tree presented in Figure 6A (p value correction
and predefined cutpoints), for example, the prognostic classification given in
Table 7 can be derived, which is very much in agreement with current knowledge
about the prognosis of node-positive breast cancer patients. The definition of
subgroups III and IV reflects that patients with more than nine positive lymph
nodes can still be further separated by other prognostic factors, in particular by
subdivision into progesterone-positive (>20) and -negative patients (72), and that
progesterone-negative patients with four to nine positive lymph nodes have a
similarly poor prognosis as progesterone-positive patients with more than nine
positive lymph nodes.

Table 7  Prognostic Classification Scheme Derived from the Regression Tree (p Value Correction and Predefined Cutpoints) in the GBSG-2 Study

Prognostic subgroup    Definition of subgroup
I                      LN ≤ 3 and PR > 100
II                     (LN ≤ 3 and PR ≤ 100) or (LN 4–9 and PR > 20)
III                    (LN 4–9 and PR ≤ 20) or (LN > 9 and PR > 20)
IV                     LN > 9 and PR ≤ 20

LN, no. of positive lymph nodes; PR, progesterone receptor.

Among the other patients, subgroup I with a relatively
favorable prognosis can be defined by one to three positive lymph nodes and a
"markedly" positive progesterone receptor (>100). The results in terms of estimated
event-free survival are displayed in Figure 7A; the Kaplan-Meier curves
show a good separation of the four prognostic subgroups. Since in other studies
or in clinical practice the progesterone receptor may often be recorded only as
positive or negative, the prognostic classification scheme in Table 7 may be modified
so that the definitions of subgroups I and II are replaced by

I*: (LN ≤ 3 and PR > 20)

and

II*: (LN ≤ 3 and PR ≤ 20) or (LN 4–9 and PR > 20)

respectively, where LN is lymph nodes and PR is progesterone receptor, since
20 fmol is a more commonly agreed cutpoint. The resulting Kaplan-Meier estimates
of event-free survival are depicted in Figure 7B.
For two of the regression approaches outlined in Section V, prognostic
subgroups have been formed by dividing the distribution of the so-called prognostic
index, β̂_1 Z_1 + ... + β̂_k Z_k, into quartiles. The results in terms of estimated
event-free survival are displayed in Figure 8A (Cox regression model with continuous
factors, BE(0.05), Table 3) and in Figure 8B (Cox regression model with
categorized covariates, BE(0.05), Table 4). It should be noted that in the definition
of the corresponding subgroups, tumor grade enters in addition to lymph nodes
and progesterone receptor.
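As a small sketch of this grouping step (hypothetical column names; the covariates stand in for those selected by BE(0.05)), the prognostic index can be read off a fitted lifelines Cox model and cut at its quartiles:

```python
# Sketch: prognostic subgroups defined by quartiles of the Cox prognostic index.
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter

cph = CoxPHFitter()
cph.fit(df[["time", "event", "grade", "nodes", "pgr"]],
        duration_col="time", event_col="event")

# Linear predictor beta_1*Z_1 + ... + beta_k*Z_k (lifelines centers it internally,
# which does not affect the quartile grouping).
df["pi"] = cph.predict_log_partial_hazard(df)
df["risk_group"] = pd.qcut(df["pi"], 4, labels=["I", "II", "III", "IV"])

# Kaplan-Meier curves of event-free survival per subgroup (cf. Figure 8).
for label, grp in df.groupby("risk_group"):
    km = KaplanMeierFitter(label=f"group {label}")
    km.fit(grp["time"], grp["event"])
    # km.plot_survival_function()
```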

Figure 7  Kaplan-Meier estimates of event-free survival probabilities for the prognostic subgroups derived from the CART approach (A) and the modified CART approach (B) in the GBSG-2 study.

Figure 8  Kaplan-Meier estimates of event-free survival probabilities for the prognostic subgroups derived from a Cox model with continuous (A) and categorized (B) covariates and according to the Nottingham Prognostic Index (C) in the GBSG-2 study.

For comparison, Figure 8C shows the Kaplan-Meier estimates of event-free
survival for the well-known Nottingham Prognostic Index (NPI) (73,74), which
is the only prognostic classification scheme based on standard prognostic factors
that enjoys widespread acceptance (75). This index is defined as

NPI = 0.02 × size (in mm) + lymph node stage + tumor grade

where lymph node stage is equal to 1 for node-negative patients, 2 for patients
with one to three positive lymph nodes, and 3 if four or more lymph nodes were
involved. It is usually divided into three prognostic subgroups: NPI-I (NPI < 3.4),
NPI-II (3.4 ≤ NPI ≤ 5.4), and NPI-III (NPI > 5.4). Since it was developed
for node-negative and node-positive patients, there seems to be room for improvement
by taking other factors (e.g., progesterone receptor) into account (76).
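The NPI itself is easily computed from these definitions; a minimal sketch (hypothetical column names, and note that the handling of values exactly on the boundaries 3.4 and 5.4 follows one possible convention):

```python
# Sketch: Nottingham Prognostic Index and its three usual risk groups.
import numpy as np
import pandas as pd

def nottingham_index(size_mm, n_positive_nodes, grade):
    """NPI = 0.02 * size (in mm) + lymph node stage + tumor grade."""
    node_stage = np.where(n_positive_nodes == 0, 1,
                          np.where(n_positive_nodes <= 3, 2, 3))
    return 0.02 * size_mm + node_stage + grade

df["npi"] = nottingham_index(df["size"], df["nodes"], df["grade"])
df["npi_group"] = pd.cut(df["npi"], bins=[-np.inf, 3.4, 5.4, np.inf],
                         labels=["NPI-I", "NPI-II", "NPI-III"])
```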

Since the NPI has been validated in various other studies (75), we can argue
that the degree of separation displayed in Figure 8C could be achieved in general.
This, however, is by no means true for the other proposals derived by regression
modeling or CART techniques, where some shrinkage has to be expected
(46,77,78). We therefore attempted to validate the prognostic classification
schemes defined above with the data of an independent study, which, in more
technical terms, is often referred to as a "test set" (79). As a test set we take the
Freiburg DNA study, which covers the same patient population and in addition
comprises the same prognostic factors as the GBSG-2 study. Some complications
have to be resolved, however. For instance, only progesterone and estrogen receptor
status (positive, >20 fmol; negative, ≤20 fmol) is recorded in the Freiburg DNA
study, and the original values are not available. Thus, only those classification
schemes where progesterone receptor enters as positive or negative can be considered
for validation. Furthermore, we restrict ourselves to those patients for whom the
required information on prognostic factors is complete. Table 8A shows the estimated
relative risks for the prognostic groups derived from the categorized Cox model
and from the modified CART classification scheme defined above. The relative
risks have been estimated by using dummy variables defining the risk groups
and by taking the group with the best prognosis as reference.

Table 8A  Estimated Relative Risks for Various Prognostic Classification Schemes Derived in the GBSG-2 Study and Validated in the Freiburg DNA Study

                         Estimated relative risks (no. of patients)
Prognostic groups        GBSG-2 study        Freiburg DNA study
Cox
  I                      1    (52)           1    (33)
  II                     2.68 (218)          1.78 (26)
  III                    3.95 (236)          3.52 (58)
  IV                     9.92 (180)          7.13 (14)
CART
  I*                     1    (243)          1    (50)
  II*                    1.82 (253)          1.99 (38)
  III                    3.48 (133)          3.19 (33)
  IV                     8.20 (57)           4.34 (11)
NPI
  II                     1    (367)          1    (46)
  III                    2.15 (301)          2.91 (87)

When applying the classification schemes to the data of the Freiburg DNA study,
the definitions and categorization derived in the GBSG-2 study are used. Note that
the categorization into quartiles of the prognostic index does not yield groups with
equal numbers of patients, since the prognostic index from the categorized Cox
model takes only a few different values.
From the values given in Table 8A, it can be seen that there is some shrinkage
in the relative risks when they are estimated in the Freiburg DNA study, which we
used as a test set. This shrinkage is more pronounced in the modified CART
classification scheme (reduction by a factor of 0.53 in the high-risk group) than in
the categorized Cox model (reduction by a factor of 0.72 in the high-risk group).
To get some idea of the amount of shrinkage that has to be anticipated in
a test set, based on the original data from which the classification scheme has been
developed (the so-called training set (79)), cross-validation or other resampling
methods can be used. For classification schemes derived by regression modeling,
techniques similar to those already outlined in Section IV can be used. These consist
essentially of estimating a shrinkage factor for the prognostic index (44,46). The
relative risks for the prognostic subgroups are then estimated by categorizing the
shrunken prognostic index according to the cutpoints used in the original data.
In the GBSG-2 study we obtained an estimated shrinkage factor of ĉ = 0.95 for
the prognostic index derived from the categorized Cox model, indicating that we
would not expect serious shrinkage of the relative risks between the prognostic
subgroups. Compared with the estimated relative risks in the Freiburg DNA study
(Table 8A), it is clear that the shrinkage effect in the test set can only be predicted
to a limited extent. This deserves at least two comments. First, we have used
leave-one-out cross-validation, which possibly could be improved by bootstrap or
other resampling methods (45); second, we did not take the variable selection
process into account. By doing so, we would expect more realistic estimates of
the shrinkage effect in an independent study. Similar techniques can in principle
also be applied to classification schemes derived by CART methods. How best to do
this, however, is still a matter of ongoing research (71).
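One concrete way to obtain such an estimate, in the spirit of the shrinkage and cross-validation ideas referred to above (44–46), is a cross-validated calibration slope: the prognostic index of each patient is computed from a model fitted without that patient's fold, and a Cox model with this cross-validated index as its only covariate is then fitted; its coefficient serves as the estimated shrinkage factor ĉ. The sketch below assumes hypothetical column names and is not the authors' exact procedure.

```python
# Sketch: cross-validated shrinkage factor (calibration slope) for a Cox prognostic index.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def cv_shrinkage_factor(df, covariates, n_folds=10, seed=2):
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=len(df))
    pi_cv = pd.Series(index=df.index, dtype=float)
    for k in range(n_folds):
        train, test = df[fold != k], df[fold == k]
        cph = CoxPHFitter()
        cph.fit(train[["time", "event"] + covariates],
                duration_col="time", event_col="event")
        pi_cv[test.index] = cph.predict_log_partial_hazard(test)
    # Calibration model: hazard proportional to exp(c * PI_cv); c is the shrinkage factor.
    calib = df[["time", "event"]].copy()
    calib["pi_cv"] = pi_cv
    cal_fit = CoxPHFitter().fit(calib, duration_col="time", event_col="event")
    return cal_fit.params_["pi_cv"]

# c_hat = cv_shrinkage_factor(gbsg2, ["grade_cat", "nodes_cat", "pgr_cat"])
```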

VIII. ARTIFICIAL NEURAL NETWORKS

In recent years, the application of artificial neural networks (ANNs) for
prognostic and diagnostic classification in clinical medicine has attracted growing
interest in the medical literature. For example, a "miniseries" on neural networks
that appeared in the Lancet contained three more or less enthusiastic review
articles (80–82) and an additional commentary expressing some scepticism (83).
In particular, feed-forward neural networks have been used extensively, often
accompanied by exaggerated statements of their potential. In a recent review
article (84), we identified a substantial number of articles with applications of
ANNs to prognostic classification in oncology.

The relationship between ANNs and statistical methods, especially logistic
regression models, has been described in several articles (85–90). Briefly, the
conditional probability that a binary outcome variable Y is equal to one, given
the values of k prognostic factors Z_1, Z_2, . . . , Z_k, is given by a function f(Z,w).
In feed-forward neural networks, this function is defined by

$$ f(Z,w) = \Lambda\left(W_0 + \sum_{j=1}^{r} W_j \cdot \Lambda\left(w_{0j} + \sum_{i=1}^{k} w_{ij} Z_i\right)\right) $$

where w = (W_0, . . . , W_r, w_{01}, . . . , w_{kr}) are unknown parameters (called
"weights") and Λ(·) denotes the logistic function Λ(u) = (1 + exp(−u))^{−1},
called the "activation function." The weights w can be estimated from the data via
maximum likelihood, although other optimization procedures are often used in
this framework. The ANN is usually introduced by a graphical representation
like that in Figure 9, which illustrates a feed-forward neural network with
one hidden layer. The network consists of k input units, r hidden units, and one
output unit and corresponds to the ANN with f(Z, w) defined above. The arrows
indicate the "flow of information." If there is no hidden layer (r = 0), the ANN
reduces to a common logistic regression model, which is also called the "logistic
perceptron."

Figure 9  Graphical representation of an artificial neural network with one input, one hidden, and one output layer.

In general, feed-forward neural networks with one hidden layer are universal
approximators (91) and thus can approximate any function defined by the
conditional probability that Y is equal to one given Z with arbitrary precision by
increasing the number of hidden units. This flexibility can lead to serious overfitting,
which can in turn be compensated by introducing some weight decay (79,92),
that is, by adding a penalty term

$$ -\lambda \left( \sum_{j=1}^{r} W_j^2 + \sum_{j=1}^{r} \sum_{i=1}^{k} w_{ij}^2 \right) $$

to the log-likelihood. The smoothness of the resulting function is then controlled
by the decay parameter λ. It is interesting to note that in our literature review of
articles published between 1991 and 1995, we have not found any application
in oncology where weight decay has been used (84).
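For a single fixed time point, a network of the kind described above (one hidden layer, logistic activations, quadratic penalty on the weights) can be sketched, for example, with scikit-learn, where the parameter alpha plays the role of the decay parameter λ; column names are hypothetical, inputs are assumed to be scaled to [0, 1], and patients censored before t* are simply omitted, which, as discussed below, is itself problematic.

```python
# Sketch: feed-forward network with one hidden layer and weight decay, predicting
# the event status at a fixed time horizon t_star (single time point model).
from sklearn.neural_network import MLPClassifier

t_star = 2.0   # hypothetical horizon in years

# Keep only patients whose status at t_star is known (event before t_star or follow-up beyond it).
known = df[(df["time"] >= t_star) | (df["event"] == 1)]
X = known[["age", "size", "grade", "nodes", "pgr", "er", "therapy"]].to_numpy()
y = ((known["time"] <= t_star) & (known["event"] == 1)).astype(int)   # 1 = event by t_star

ann = MLPClassifier(hidden_layer_sizes=(5,),     # r = 5 hidden units
                    activation="logistic",       # Lambda(u) = 1 / (1 + exp(-u))
                    alpha=0.1,                   # weight decay parameter
                    max_iter=5000, random_state=0)
ann.fit(X, y)
p_event_free = 1 - ann.predict_proba(X)[:, 1]    # estimated P(T > t_star | Z)
```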
Extension to survival data with censored observations is associated with
various problems. Although there is a relatively straightforward extension of
ANNs to handle grouped survival data (93), several naive proposals can be found
in the literature. To predict outcome (death or recurrence) of individual breast
cancer patients, Ravdin and Clark (94) and Ravdin et al. (95) used a network
with only one output unit but using the number j of the time interval as additional
input. Moreover, they consider the unconditional probability of dying before t_j
rather than the conditional one as output. Their underlying model then reads

$$ P(T < t_j \mid Z) = \Lambda\left(w_0 + \sum_{i=1}^{k} w_i Z_i + w_{k+1} \cdot j\right) $$

for j = 1, . . . , J. Again, T denotes the survival time random variable, and the
time intervals are defined through t_{j−1} ≤ t < t_j, 0 = t_0 < t_1 < ... < t_J < ∞.
This parameterization ensures monotonicity of the survival probabilities but also
implies a rather stringent and unusual shape of the survival distribution, since in
the case that no covariates are considered it reduces to

$$ P(T < t_j) = \Lambda(w_0 + w_{k+1} \cdot j) $$

for j = 1, . . . , J. Obviously, the survival probabilities do not depend on the
length of the time intervals, which is a rather strange and undesirable feature.
Including a hidden layer in this expression is a straightforward extension retaining
all the features summarized above. De Laurentiis and Ravdin (96) call such
networks "time-coded models." Another form of neural network that has been
applied to survival data is the so-called single time point model (96). Since such
models are identical to a logistic perceptron or a feed-forward neural network
with a hidden layer, they correspond to the fitting of logistic regression models or
their generalizations to survival data. In practice, a single time point t* is fixed
and the network is trained to predict the survival probability. The corresponding
model is given by

$$ P(T < t^* \mid Z) = \Lambda\left(w_0 + \sum_{i=1}^{k} w_i Z_i\right) $$

or its generalization when introducing a hidden layer. This approach is used by
Burke (97) to predict 10-year survival of breast cancer patients based on various
patient and tumor characteristics at the time of primary diagnosis. McGuire et al.
(98) used this approach to predict 5-year event-free survival of patients with
axillary node-negative breast cancer based on seven potentially prognostic variables.
Kappen and Neijt (99) used it to predict 2-year survival of patients with
advanced ovarian cancer based on 17 pretreatment characteristics. The neural
network they actually used reduced to a logistic perceptron.
Of course, such a procedure can be applied repeatedly for the prediction
of survival probabilities at fixed time points t_1 < t_2 < ... < t_J. For example,
Kappen and Neijt (99) trained several (J = 6) neural networks to predict survival
of patients with ovarian cancer after 1, 2, . . . , 6 years. The corresponding model
reads

$$ P(T < t_j \mid Z) = \Lambda\left(w_{0j} + \sum_{i=1}^{k} w_{ij} Z_i\right) $$

in the case that no hidden layer is introduced. Note that without restrictions on
the parameters such an approach does not guarantee that the probabilities
P(T < t_j | Z) increase with j and hence may result in life-table estimators suggesting
a nonmonotone survival function. Closely related to such an approach are the
so-called multiple time point models (96), where one neural network with J output
units, with or without a hidden layer, is used.

The common drawback of these naive approaches is that they do not allow
one to incorporate censored observations in a straightforward manner, which is
closely related to the fact that they are based on unconditional survival probabilities
instead of conditional survival probabilities, as is the Cox model. Neither
omission of the censored observations, as suggested by Burke (97), nor treating
censored observations as uncensored is a valid approach; both are a serious source
of bias, which is well known in the statistical literature. De Laurentiis and Ravdin
(96) propose imputing estimated conditional survival probabilities for the censored
cases from a Cox regression model; that is, they use a well-established statistical
procedure just to make an artificial neural network work. The latter approach is
also used by Ripley (92), who emphasized that the resulting bias may be negligible.

Figure 10  Estimated event-free survival probabilities at 2 years vs. at 1 year for various artificial neural networks in the GBSG-2 study.
We illustrate some of the points made above with data from the GBSG-2
study. First, we have used the approach of single time point models for the prediction
of event-free survival at 1, 2, 3, 4, and 5 years, respectively. All seven prognostic
factors except menopausal status were considered as quantitative covariates and
were scaled to the interval [0, 1]; these factors, together with hormonal therapy,
were used as inputs for the neural nets. Censored observations occurring before
the corresponding time points were omitted. Figure 10 shows the results of various
ANNs in terms of estimated event-free survival at 2 years versus estimated
event-free survival at 1 year for those 623 patients who were not censored before
2 years. The ANN with no hidden unit corresponds to an ordinary logistic regression
model for event-free survival at 1 and 2 years, respectively. It can be seen
that in this model the estimated event-free survival probabilities are still monotone.
We then increased the number of hidden units to 2 and 5 and varied the degree
of weight decay, which resulted in severe violations of monotonicity of the estimated
event-free survival probabilities for a considerable number of patients. For a
decay parameter of λ = 0.1 we then obtain nearly the same results as for the logistic
regression model (9 parameters), although five hidden units and a corresponding
number of 42 additional parameters were introduced.
In a second stage, we illustrate the impact of insufficient handling of censored
observations. Figure 11, A and B, shows the estimated event-free survival
probabilities at 5 years when censored observations are omitted or replaced by
imputed values from a Cox model, respectively. For this imputation, we used
the Cox model presented in Table 4 (full model). Both ways of handling censored
observations are contrasted with the estimated event-free survival probabilities at
5 years derived from the final FP Cox model (Sec. V, Table 5). The figures
demonstrate the possible bias resulting from the omission of censored observations;
the bias is smaller for the imputation method, but this may still not be
considered a fully satisfactory method.

Figure 11  Estimated event-free survival probabilities at 5 years when censored observations are omitted (A) and replaced by imputed values (B) vs. estimated event-free survival probabilities at 5 years obtained from the final FP Cox model in the GBSG-2 study.
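The imputation step itself is straightforward once a Cox model has been fitted: for a patient censored at c < t*, the conditional event-free probability P(T > t* | T > c, Z) = S(t*|Z)/S(c|Z) replaces the unknown status. A sketch with lifelines and hypothetical column names:

```python
# Sketch: impute the event-free status at t_star for patients censored before t_star
# by conditional survival probabilities from a Cox model.
from lifelines import CoxPHFitter

t_star = 5.0
covariates = ["age", "size", "grade", "nodes", "pgr", "er"]   # hypothetical

cph = CoxPHFitter()
cph.fit(df[["time", "event"] + covariates], duration_col="time", event_col="event")

y_star = (df["time"] > t_star).astype(float)          # observed status where it is known
censored_early = (df["time"] < t_star) & (df["event"] == 0)

for idx in df.index[censored_early]:
    c = df.loc[idx, "time"]
    surv = cph.predict_survival_function(df.loc[[idx]], times=[c, t_star])
    s_c, s_tstar = surv.iloc[0, 0], surv.iloc[1, 0]
    y_star[idx] = s_tstar / s_c                        # P(T > t_star | T > c, Z)
```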
We therefore come to a third approach that was originally suggested
by Faraggi and Simon (100) and extended by others (101). The idea is to replace
the function exp(β_1 Z_1 + ... + β_k Z_k) in the definition of the Cox model by a
more flexible function motivated by the function f(Z, w) used in ANNs. This
leads to a neural network generalization of the Cox regression model defined by

$$ \lambda(t \mid Z_1, \ldots, Z_k) = \lambda_0(t) \exp\bigl(f_{FS}(Z, w)\bigr) $$

where

$$ f_{FS}(Z, w) = \sum_{j=1}^{r} W_j \, \Lambda\left(w_{0j} + \sum_{i=1}^{k} w_{ij} Z_i\right) $$

Note that the constant W_0 is omitted in the framework of the Cox model. Estimation
of the weights is then done by maximizing the partial likelihood, which includes
the correct and usual handling of censored observations in that these patients
contribute to the partial likelihood as long as they are at risk. Although the problem
of censoring is satisfactorily solved in this approach, problems remain with
potentially serious overfitting of the data, especially if the number r of hidden
units is large.
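To make the structure of this generalization explicit, the following bare-bones sketch writes down f_FS and the (penalized) negative log partial likelihood and hands them to a generic optimizer. It uses a Breslow-type treatment of the risk sets, handles ties only approximately, penalizes the W_j and w_ij as in the penalty term given earlier, and is meant as an illustration of the idea rather than as the implementation used by Faraggi and Simon.

```python
# Sketch: Faraggi-Simon type network, i.e. a Cox model with f_FS(Z, w) in place of
# the linear predictor; weights estimated by maximizing a penalized partial likelihood.
import numpy as np
from scipy.optimize import minimize

def f_fs(params, Z, r):
    """f_FS(Z, w) = sum_j W_j * Lambda(w_0j + sum_i w_ij * Z_i)."""
    k = Z.shape[1]
    W, w0 = params[:r], params[r:2 * r]
    wij = params[2 * r:].reshape(r, k)
    hidden = 1.0 / (1.0 + np.exp(-(Z @ wij.T + w0)))    # n x r matrix of Lambda(.)
    return hidden @ W

def penalized_neg_log_pl(params, Z, time, event, r, decay=0.0):
    eta = f_fs(params, Z, r)
    order = np.argsort(-time)                   # decreasing time: risk sets are prefixes
    eta_s, ev_s = eta[order], event[order]
    log_risk = np.logaddexp.accumulate(eta_s)   # log of sum of exp(eta) over the risk set
    loglik = np.sum((eta_s - log_risk)[ev_s == 1])
    penalty = decay * (np.sum(params[:r] ** 2) + np.sum(params[2 * r:] ** 2))
    return -loglik + penalty

def fit_fs_network(Z, time, event, r=5, decay=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x0 = 0.1 * rng.standard_normal(2 * r + r * Z.shape[1])
    res = minimize(penalized_neg_log_pl, x0, args=(Z, time, event, r, decay),
                   method="L-BFGS-B")
    return res.x

# params = fit_fs_network(Z_scaled, time, event, r=5, decay=0.1)
# prognostic index per patient: f_fs(params, Z_scaled, r=5)
```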
For illustration, we again used the data of the GBSG-2 study, where we
applied some preselection of variables in that we took those factors that were
included in the final FP model (Sec. V, Table 5). Thus, we used the four factors
age, grade, lymph nodes, and progesterone receptor (all scaled to the interval
[0, 1]), and hormone therapy, as inputs for the Faraggi and Simon (F&S) network.
Figure 12 shows the results for various F&S networks compared with the FP
approach in terms of Kaplan-Meier estimates of event-free survival in the prognostic
subgroups defined by the quartiles of the corresponding prognostic indices.
It should be noted that the F&S network contains 5 + (6 × 5) = 35 parameters
when 5 hidden units are used and 20 + (6 × 20) = 140 when 20 hidden units
are used. The latter must be suspected of serious overfitting, with a high
chance that the degree of separation achieved could never be reproduced in other
studies. To highlight this phenomenon we trained a slightly different F&S network
where, in addition to age, tumor size, tumor grade, and number of lymph
nodes, estrogen and progesterone status (positive, >20 fmol; negative, ≤20 fmol)
were used as inputs. This network contained 20 hidden units (20 + (7 × 20) =
160 parameters) and showed a separation similar to the one where estrogen

Figure 12  Kaplan-Meier estimates of event-free survival probabilities for the prognostic subgroups derived from various Faraggi & Simon networks and from the FP approach in the GBSG-2 study.
and progesterone receptor entered as quantitative inputs. Table 8B contrasts the
results from the GBSG-2 study, used as training set, with those from the Freiburg
DNA study, used as test set, in terms of estimated relative risks, where the predicted
event-free survival probabilities are categorized into quartiles. In the training set,
we observe a 20-fold increase in risk between the high-risk and the low-risk group,
whereas the F&S network turns out to yield a completely useless prognostic
classification scheme in the test set, where the estimated relative risks are not even
monotonically increasing.

Table 8B  Estimated Relative Risks for Various Prognostic Classification Schemes Based on F&S Neural Networks Derived in the GBSG-2 Study and Validated in the Freiburg DNA Study

                         Estimated relative risks (no. of patients)
Prognostic groups        GBSG-2 study        Freiburg DNA study
F&S*
  I                      1     (179)         1     (37)
  II                     3.24  (178)         0.34  (16)
  III                    7.00  (159)         0.98  (38)
  IV                     22.03 (170)         1.39  (35)
F&S†
  I                      1     (171)         1     (23)
  II                     1.45  (172)         1.57  (25)
  III                    2.62  (171)         3.09  (32)
  IV                     5.75  (172)         4.27  (46)
F&S‡
  I                      1     (172)         1     (20)
  II                     1.64  (171)         1.03  (31)
  III                    3.27  (171)         1.89  (28)
  IV                     8.49  (172)         2.72  (47)
F&S§
  I                      1     (172)         1     (23)
  II                     1.46  (171)         1.57  (25)
  III                    2.62  (171)         3.22  (33)
  IV                     5.77  (172)         4.14  (45)

* Twenty hidden units, weight decay = 0.
† Twenty hidden units, weight decay = 0.1.
‡ Five hidden units, weight decay = 0.
§ Five hidden units, weight decay = 0.1.
It is obvious that some restrictions, either in terms of a
maximum number of parameters or by using some weight decay, are absolutely
necessary to avoid the amount of overfitting observed in the two prognostic
classification schemes based on F&S networks where weight decay was not
applied. The results for an F&S network with five hidden units are very much
comparable with those of the FP approach, especially when some weight decay is
introduced. It should be noted that the FP approach contains at most eight parameters
if we ignore the preselection of the four factors.

Summarizing our experience with ANNs for prognostic classification, it
should be emphasized that they have to be regarded as very flexible nonlinear
regression models deserving the same careful model building as other statistical
models of similar flexibility. In particular, the dangers of serious overfitting have
to be taken into account. When applying ANNs to survival data, one has to take
care that the standard requirements for such data, such as the proper incorporation
of censored observations and the modeling of conditional survival probabilities,
are met. In our literature survey we did not find a satisfactory application (84),
although some progress has been made in recent methodological contributions
(92,101,102).

IX. ASSESSMENT OF PROGNOSTIC CLASSIFICATION SCHEMES

Once a prognostic classification scheme is developed and defined, the question
arises how its predictive ability can be assessed and how its performance can be
compared with that of competitors. It is interesting to note that there is no commonly
agreed approach available in the statistical literature, and most measures
that are used have some ad hoc character. Suppose that a prognostic classification
scheme consists of g prognostic groups (called risk strata or risk groups); then
one common approach is to present the Kaplan-Meier estimates for event-free
or overall survival in the g groups. This is the way in which we also presented the
results of prognostic classification schemes derived by various statistical methods
in previous sections. The resulting figures are often accompanied by p values of
the logrank test for the null hypothesis that the survival functions in the g risk
strata are equal. It is clear that a significant result is a necessary but not a sufficient
condition for good predictive ability. Sometimes a Cox model using dummy
variates for the risk strata is fitted, and the log-likelihood and/or estimated relative
risks of risk strata with respect to a reference are given. Recently, we proposed
a summary measure of separation (36) defined as

$$ \mathrm{SEP} = \exp\left[\sum_{j=1}^{g} \frac{n_j}{n}\, |\hat{\beta}_j|\right] $$

where n_j denotes the number of patients in risk stratum j and β̂_j is the estimated
log-hazard ratio or log relative risk of patients in risk stratum j with respect to
a baseline reference. In particular, we used the baseline reference estimated in a
Cox model where the dummy variates for risk strata were centered to have mean
zero. SEP is the weighted geometric mean of "absolute" relative risks between
strata and baseline, "absolute" meaning that 1/RR replaces RR for relative risks
RR < 1 (103). Often this model-based baseline reference turns out to be very
similar to the estimated marginal distribution of T, i.e., to the pooled Kaplan-
Meier estimate Ŝ(t). Therefore, SEP essentially compares risks within strata with
the risk in the entire population. In fact, the pooled Kaplan-Meier estimate has
been used previously as baseline reference (36), although the model-based
approach may be preferable for formal reasons.
In this section, we use the GBSG-4 study for illustration. In the node-negative
breast cancer patients we compare the NPI (73,74), which has already been
defined in Section VII, with two classification schemes that have been derived from
a Cox regression model and a CART approach, respectively. The first scheme is
defined through a simplified prognostic index

COX = I(age ≤ 40 years) + I(size > 20 mm) + grade

where I(·) denotes the indicator function, equal to 1 if "·" holds true and
0 otherwise. Three risk groups (I, II, and III, with numbers of patients in the
GBSG-4 study) are defined through COX = 1, 2 (n_1 = 277), COX = 3 (n_2 =
205), and COX = 4, 5 (n_3 = 121), respectively. For the second scheme, four risk
groups have been obtained (36), given as

CART I: grade 1 and age ≤ 60 years (n_1 = 78),
CART II: size ≤ 20 mm and [(grade 2–3 and age > 40 years) or (grade 1 and age > 60 years)] (n_2 = 222),
CART III: (age ≤ 40 years and grade 2–3) or (size > 20 mm and age > 60 years and grade 1) or (size > 20 mm and grade 2–3 and estrogen receptor ≤ 300 fmol) (n_3 = 284),
CART IV: size > 20 mm and grade 2–3 and estrogen receptor > 300 fmol (n_4 = 19).

As can be seen from the numbers of patients, this leads to two relatively large
medium-risk strata, a smaller low-risk stratum, and a very small high-risk stratum
in the GBSG-4 study.
Figure 13, A–C, shows the Kaplan-Meier estimates in the various risk strata
corresponding to the three prognostic classification schemes. Table 9A summa-
rizes the results of some ad hoc measures applied to the data of the GBSG-4
study. For all three prognostic classification schemes considered, the p values of
the logrank test are highly significant (p < 0.0001). Thus, this measure does not
prove to be particularly useful. There is some improvement in the log-likelihood
for the NPI and some further improvement for the simplified Cox Index. The
CART classification scheme shows the best result. A formal comparison, how-
ever, is hampered by the fact that the corresponding regression models are not
nested. The summary measure SEP yields an average absolute risk with respect
to baseline of about 1.56 for the NPI and 1.55 for the simplified Cox Index. With a
value of 1.71, the CART classification scheme again shows the best performance.

Figure 13  Kaplan-Meier estimates of event-free survival probabilities for the prognostic subgroups according to the Nottingham Prognostic Index (A), derived by a Cox model (B), and by a CART approach (C) in the GBSG-4 study in node-negative breast cancer.

Table 9A  Ad Hoc Measures for Predictive Ability of Three Prognostic Classification Schemes in the GBSG-4 Study in Node-negative Breast Cancer

Ad hoc measure     Pooled Kaplan-Meier     NPI        COX index     CART index
p value            —                       <0.0001    <0.0001       <0.0001
−2 log L           1856.1                  1827.9     1817.3        1802.9
SEP                1                       1.559      1.550         1.710

Since these ad hoc measures are only of limited value, we now briefly
outline some recent developments; a detailed description can be found elsewhere
(103). First, it is of central importance to recognize that the time-to-event itself
cannot be adequately predicted (104–108). The best one can do at t = 0 is to
try to estimate the probability that the event of interest will not occur until a
prespecified time horizon represented by some time point t*, given the available
covariate information for a particular patient at t = 0. Consequently, a measure
of inaccuracy that aims to assess the value of a given prognostic classification
scheme should compare the estimated event-free probabilities with the observed
individual outcomes.

Thus, we consider an approach based directly on the estimates of event-free
probabilities S(t*| Z = z) for patients with Z = z. As outlined above, it is
the aim of a prognostic classification scheme to provide estimated event-free
probabilities Ŝ(t*| j) for patients in risk stratum j (j = 1, . . . , g). These estimated
probabilities may be used as predictions of the event status Y = I(T > t*). To
determine the mean square error of prediction in this case, the observed survival
or event status at t*, Y = I(T > t*), has to be compared with the estimated
probability Ŝ(t*| j), leading to
$$ \mathrm{BS}(t^*) = \frac{1}{n} \sum_{i=1}^{n} \bigl(I(T_i > t^*) - \hat{S}(t^* \mid j_i)\bigr)^2 $$

where the sum goes over all n patients. This quantity is known as the quadratic
score. Multiplied by a factor of 2 (omitted here for simplicity), it is equal to
the Brier score, which was originally developed for judging the inaccuracy of
probabilistic weather forecasts (109–111). The expected value of the Brier score
may be interpreted as a mean square error of prediction if the event status at t*
is predicted by the estimated event-free probabilities Ŝ(t*| j). In the extreme
case where the estimated event-free probabilities are 0 or 1 for all patients
(corresponding to the assertion that the event-free status at t* can be predicted
without error), BS(t*) will be zero if Ŝ(t*| j) coincides with the observed event
status. It will attain its maximum value of 1 only if the estimated event-free
probabilities happen to be equal to 1 minus the observed event status for all patients.
In the absence of any knowledge about the disease under study, a trivial constant
prediction Ŝ(t*) = 0.5 for all patients would be the most plausible approach. This
yields a Brier score equal to 0.25.
If some closer relationship to the likelihood is intended, the so-called
logarithmic score may be preferred. This is given by

$$ \mathrm{LS}(t^*) = -\frac{1}{n} \sum_{i=1}^{n} \Bigl\{ I(T_i > t^*) \log \hat{S}(t^* \mid j_i) + I(T_i \le t^*) \log\bigl(1 - \hat{S}(t^* \mid j_i)\bigr) \Bigr\} $$



where we adopt the conventions "0 · log 0 = 0" and "1 · log 0 = −∞." Hence
LS(t*) is equal to zero in the extreme situation where the estimated event-free
probabilities Ŝ(t*| j_i) are 0 or 1 for all patients and coincide with their observed
event status I(T_i > t*). It will attain infinity if the estimated event-free probability
happens to be equal to I(T_i ≤ t*) for at least one patient (111,112).
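Ignoring censoring for the moment (it is dealt with below), both scores at a fixed t* amount to only a few lines; time is the vector of observed times and s_hat contains, for each patient, the estimated event-free probability Ŝ(t*|j) of his or her risk stratum (hypothetical variable names).

```python
# Sketch: Brier (quadratic) and logarithmic scores at a fixed horizon t_star, no censoring.
import numpy as np

def brier_score(time, s_hat, t_star):
    y = (time > t_star).astype(float)        # observed event-free status at t_star
    return np.mean((y - s_hat) ** 2)

def logarithmic_score(time, s_hat, t_star, eps=1e-12):
    y = (time > t_star).astype(float)
    s = np.clip(s_hat, eps, 1 - eps)         # guard against log(0)
    return -np.mean(y * np.log(s) + (1 - y) * np.log(1 - s))

# brier_score(time, s_hat, t_star=5)         # e.g., compare with Table 9B
```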
If we do not wish to restrict ourselves to one fixed time point t*, we can
consider both the Brier score and the logarithmic score as functions of time for
0 ≤ t ≤ t*. These functions can also be averaged over time, i.e., for t ∈ [0, t*],
by integrating them with respect to some weight function W(t) (103).

So far, censoring has not been taken into account in the definition of either
measure of inaccuracy. How to do that in such a way that the resulting measures are
still consistent estimates of the population quantities (in the case of the Brier score,
the mean square error of prediction) is not a trivial problem. It can, however,
be solved by reweighting the individual contributions in a similar way as in the
calculation of the Kaplan-Meier estimator. The reweighting of uncensored observations
and of observations censored after t* is done by the reciprocal of the
Kaplan-Meier estimate of the censoring distribution, whereas observations censored
before t* get weight zero. With this weighting scheme, a Brier or a logarithmic
score under random censorship can be defined that enjoys the desirable statistical
properties (103). Using these scores, R²-type measures (113–116) can also
be readily defined by relating the Brier or the logarithmic score for a prognostic
classification scheme to the corresponding score where the pooled Kaplan-Meier
estimate is used as "universal" prediction for all patients.
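A sketch of this reweighting for the Brier score is given below; Ĝ denotes the Kaplan-Meier estimate of the censoring distribution, and, for simplicity, Ĝ is evaluated at T_i rather than at the left limit T_i−.

```python
# Sketch: censoring-adjusted Brier score at t_star with inverse probability of
# censoring weights derived from the Kaplan-Meier estimate of the censoring distribution.
import numpy as np
from lifelines import KaplanMeierFitter

def ipcw_brier_score(time, event, s_hat, t_star):
    km_cens = KaplanMeierFitter()
    km_cens.fit(time, event_observed=1 - event)       # "event" here means being censored
    g_at_T = km_cens.survival_function_at_times(time).to_numpy()
    g_at_tstar = float(km_cens.survival_function_at_times(t_star).iloc[0])

    y = (time > t_star).astype(float)
    w = np.zeros(len(time), dtype=float)               # censored before t_star keep weight 0
    uncensored_before = (time <= t_star) & (event == 1)
    observed_beyond = time > t_star
    w[uncensored_before] = 1.0 / g_at_T[uncensored_before]
    w[observed_beyond] = 1.0 / g_at_tstar
    return float(np.mean(w * (y - s_hat) ** 2))
```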
We calculated the Brier and the logarithmic score for the data of the GBSG-4
study in node-negative breast cancer. For the NPI, the Brier score at t* = 5
is equal to 0.184, which is not very much below 0.196, the value reached by the
pooled Kaplan-Meier prediction Ŝ(t*) = 0.733 for all patients; remember that
the Brier score is equal to 0.25 when the trivial prediction Ŝ(t*) = 0.5 is made
for all patients. Table 9B summarizes the results of various measures of inaccuracy
for the NPI, the simplified COX index, and the CART index. For all measures,
the simplified COX index performs better than the NPI, and some further
improvement is achieved by the CART index.

Table 9B  Measures of Inaccuracy for Three Prognostic Classification Schemes in the GBSG-4 Study in Node-negative Breast Cancer

Measure of inaccuracy     Pooled Kaplan-Meier     NPI       COX index     CART index
BS (t* = 5)               0.196                   0.184     0.179         0.175
LS (t* = 5)               0.580                   0.549     0.538         0.529
R²                        0.0%                    6.1%      8.7%          10.4%

Relative to the prediction with the
pooled Kaplan-Meier estimate for all patients, there is only a moderate gain in
accuracy; the R²-measure of explained residual variation based on the Brier score
just reaches 10.4% for the best prognostic classification scheme.

In general, it has to be acknowledged that measures of inaccuracy tend to
be large or, put the other way round, that R²-type values tend to be small, reflecting
that predictions are far from perfect (117). In addition, it has to be mentioned
that there may still be some overoptimism present when a measure of
inaccuracy is calculated on the same data from which the prognostic classification
scheme was derived. We have assumed that the estimated probabilities Ŝ(t| j)
of being event free up to time t have emerged from external sources. For the
GBSG-4 study in node-negative breast cancer, this is by no means true and was only
assumed for illustrative purposes. Actually, the COX and the CART indices
have been derived from the same data set. Even for the NPI, which has been
proposed in the literature (73,74), we have estimated the event-free probabilities
from our own data set and used them as predictions. To reduce the resulting
overoptimism, cross-validation and resampling techniques may be used in a similar
way as for the estimation of error rates (118,119) or for the reduction of bias of
effect estimates, as outlined in Section IV. For definitive conclusions, however, the
determination of measures of inaccuracy in an independent test data set is absolutely
necessary (79).

X. SAMPLE SIZE CONSIDERATIONS

If the role of a new prognostic factor is to be investigated, careful planning of
an appropriate study is required. This includes an assessment of the power of
the study in terms of sample size. An adequate analysis of the independent prognostic
effect of a new factor has to be adjusted for the existing standard factors
(4,120). With survival or event-free survival as the end point, this will often be
done with the Cox proportional hazards model. Sample size and power formulae
in survival analysis have been developed for randomized treatment comparisons.
In the analysis of prognostic factors, however, the covariates included are expected
to be correlated with the factor of primary interest. In this situation, the
existing sample size and power formulae are not valid and may not be applied.
In this section we give an extension of Schoenfeld's formula (121) to the situation
in which a correlated factor is included in the analysis.

We consider the situation in which we wish to study the prognostic relevance
of a certain factor, denoted by Z_1, in the presence of a second factor Z_2, which
can also be a score based on several other factors. The criterion of interest is
survival or event-free survival of the patients. We assume that the analysis of

the main effects of Z_1 and Z_2 is performed with the Cox proportional hazards model
given by

$$ \lambda(t \mid Z_1, Z_2) = \lambda_0(t) \exp(\beta_1 Z_1 + \beta_2 Z_2) $$

where λ_0(t) denotes an unspecified baseline hazard function and β_1 and β_2 are
the unknown regression coefficients representing the effects of Z_1 and Z_2. For the
sake of simplicity we assume that Z_1 and Z_2 are binary, with p = P(Z_1 = 1)
denoting the prevalence of Z_1 = 1. The relative risk between the groups defined
by Z_1 is then given by θ_1 = exp(β_1). Assume that the effect of Z_1 is to be tested
by an appropriate two-sided test based on the partial likelihood derived from the
Cox model, with significance level α and power 1 − β to detect an effect given
by a relative risk of θ_1.
For independent Z_1 and Z_2, it was shown by Schoenfeld (121) that the total
number of patients required is given by the following expression:

$$ N = \frac{(u_{1-\alpha/2} + u_{1-\beta})^2}{(\log \theta_1)^2\, \psi\, p(1-p)} $$

where ψ is the probability of an uncensored observation and u_γ denotes the γ
quantile of the standard normal distribution. This is the same formula as that
used for a comparison of two populations, as developed by George and Desu
(122) for an unstratified and by Bernstein and Lagakos (123) for a stratified
comparison of exponentially distributed survival times, and by Schoenfeld (124)
for the unstratified logrank test.
The sample size formula depends on p, the prevalence of Z_1 = 1, which has
to be taken into account. Obviously, the expected number of events (often also
called the "effective sample size") needed to achieve a prespecified power is minimal
for p = 0.5, the situation of a randomized clinical trial with equal probabilities
for treatment allocation. By using the same approximations as Schoenfeld (121),
one can also derive a formula for the case when Z_1 and Z_2 are correlated with
correlation coefficient ρ; for details we refer to Schmoor et al. (125). This formula
reads

$$ N = \frac{(u_{1-\alpha/2} + u_{1-\beta})^2}{(\log \theta_1)^2\, \psi\, p(1-p)} \cdot \frac{1}{1-\rho^2} $$

The factor 1/(1 − ρ²) is usually called the variance inflation factor (VIF).
This formula is identical to a formula derived by Lui (126) for the exponential
regression model in the case of no censoring. Table 10 gives, for some situations,
the value of the VIF and the effective sample size Nψ, that is, the number
of events required to obtain a power of 0.8 to detect an effect of Z_1 of magnitude
θ_1, as calculated by the formula given above. It shows that the required number
of events for the case of two correlated factors may increase by up to about 50%
in situations that are realistic in practice. Note that the formula given above is
identical to that developed by Palta and Amini (127) for the situation in which the
effect of Z_1 is analyzed by a stratified logrank test, where Z_2 = 0 and Z_2 = 1
define the two strata.

Table 10  Variance Inflation Factors and Effective Sample Size Nψ Required to Detect an Effect of Z_1 of Magnitude θ_1 with Power 0.8, as Calculated by the Approximate Sample Size Formula, for Various Values of p, ρ, and θ_1 (α = 0.05)

                              Nψ
p      ρ      VIF      θ_1 = 1.5     θ_1 = 2     θ_1 = 4
0.5    0      1        191           65          16
       0.2    1.04     199           68          17
       0.4    1.19     227           78          19
       0.6    1.56     298           102         26
0.3    0      1        227           78          19
       0.2    1.04     237           81          20
       0.4    1.19     271           93          23
       0.6    1.56     355           122         30
The sample size formulae given above will now be illustrated by means
of the GBSG-2 study. Suppose we want to investigate the influence of the
progesterone receptor in the presence of tumor grade. The Spearman correlation
coefficient of these two factors is ρ = −0.377; if they are categorized as binary
variables, we find ρ = −0.248 from Table 11. Taking the prevalence of progesterone-positive
tumors (p = 60%) into account, a number of 213 events is required
to detect a relative risk of 0.67, and 74 events to detect a relative risk of 0.5,
with a power of 80% (significance level α = 5%). In this situation, the variance
inflation factor is equal to 1.07, indicating that the correlation between the two
factors has only little influence on power and required sample sizes.

Table 11  Distribution of Progesterone Receptor by Tumor Grade and Estrogen Receptor in the GBSG-2 Study

                                Tumor grade           Estrogen receptor
Progesterone receptor           1        2+3          <20        ≥20
<20                             5        264          190        79
≥20                             76       341          72         345
Correlation coefficient             −0.248                  0.536
If we want to investigate the prognostic relevance of progesterone receptor
in the presence of estrogen receptor, a higher correlation has to be considered.
The Spearman correlation coefficient is equal to ρ = 0.598 if both factors are
measured on a quantitative scale and ρ = 0.536 if they are categorized into positive
(>20 fmol) and negative (≤20 fmol), as given in Table 11. This leads to a
variance inflation factor of 1.41 and a number of events of 284 and 97 required
to detect a relative risk of 0.67 and of 0.5, respectively (power = 80%, significance
level α = 5%). This has to be contrasted with the situation in which both factors
under consideration are uncorrelated; in this case the required number of events
is 201 to detect a relative risk of 0.67 and 69 to detect a relative risk of 0.5, both
with a power of 80% at a significance level of 5%.
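These calculations are easy to reproduce; the sketch below implements the extended formula with the variance inflation factor (scipy's normal quantile function supplies u_γ), and the commented examples can be compared with the numbers of events quoted above (small differences arise from rounding of p, ρ, and θ_1).

```python
# Sketch: required number of events (effective sample size N*psi) for testing Z_1
# in the presence of a correlated factor Z_2, using the extended Schoenfeld formula.
import numpy as np
from scipy.stats import norm

def required_events(theta1, p, rho=0.0, alpha=0.05, power=0.80):
    """Events needed to detect a relative risk theta1 for Z_1 with prevalence p,
    allowing for correlation rho with Z_2 via the variance inflation factor."""
    u = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    vif = 1.0 / (1.0 - rho ** 2)
    return u ** 2 / (np.log(theta1) ** 2 * p * (1 - p)) * vif

# required_events(1.5, p=0.6, rho=-0.248)   # progesterone receptor adjusted for tumor grade
# required_events(1.5, p=0.6, rho=0.536)    # adjusted for estrogen receptor
# required_events(1.5, p=0.6, rho=0.0)      # uncorrelated case
```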
From this aspect, the GBSG-2 study with 299 events does not seem too
small to investigate the relevance of prognostic factors that exhibit at least a
moderate effect (relative risk of 0.67 or 1.5). The question is whether it is large
enough to permit the investigation of several prognostic factors. There have been
some recommendations in the literature, based on practical experience or on
results from simulation studies, regarding the events-per-variable relationship
(58,128–131). More precisely, it is the number of events per model parameter
that matters, which is often overlooked. These recommendations range from 10
to 25 events per model parameter to ensure stability of the selected model and
of the corresponding parameter estimates and to avoid serious overfitting.
The sample size formula given above addresses the situation of two binary
factors. For more general situations (i.e., factors occurring on several levels or
factors with continuous distributions), the required sample size may be calculated
using a more general formula that can be developed along the lines of
Lubin and Gail (132). The anticipated situation then has to be specified in terms
of the joint distribution of the factors under study and the size of the corresponding
effects on survival. It may be more difficult to specify the necessary assumptions
than in the situation of only two binary factors, but in principle it is possible to
base the sample size calculation on that formula. Numerical integration techniques
are then required to perform the necessary calculations.
So if several factors should be included in the analysis, one practical
solution is to prespecify a prognostic score based on the existing standard factors
and to consider this score as the second covariate to be adjusted for. Another
possibility would be to adjust for the prognostic factor for which the largest effect
on survival and the highest correlation are anticipated. Finally, it should be mentioned
that a sample size formula for the investigation of interactive effects of
two prognostic factors is also available (125,133,134).

XI. CONCLUDING REMARKS

In this chapter we consider statistical aspects of the evaluation of prognostic
factors. In particular, we highlight the situation in which historical data might be
available in a database. In contrast to therapeutic studies, where the comparison of
treatments based on historical data is almost always a totally useless exercise
(135,136), such data might provide a valuable source for studying the role of
prognostic factors under certain conditions. The problems when dealing with
historical data range from the probably insufficient quality and completeness
of baseline and follow-up data to the heterogeneity of patients with respect to
prognostic factors and therapy. Problems with completeness and quality of data
can usually not be solved retrospectively; the only possibility is then a prospective
collection of such data, including regular follow-up. Problems with the heterogeneity
of the patient population can at least partially be avoided by the definition
of suitable inclusion and exclusion criteria, preferably laid down in a study protocol
similar to what is common practice in prospective therapeutic studies. This has,
for example, been done in a study on the role of DNA content in advanced ovarian
cancer (137). Here it is desirable to define a relatively homogeneous study population
and to avoid a nontransparent mixture of patients from different stages of a
disease and with substantially different treatments.
As far as the statistical analysis is concerned, a multivariate approach is
absolutely essential. In prognostic factor studies in oncology, where the end point
is survival or event-free survival, the Cox regression model provides a flexible
tool for such an analysis, but here too various problems have to be considered,
requiring expert knowledge in medical statistics. Using data from three prognostic
factor studies in breast cancer, we have demonstrated some of these problems
in this chapter. We have shown that various approaches for such a model-building
process exist, including the method of classification and regression trees
and ANNs. The most important requirements one should keep in mind are to
arrive at models that are as simple and parsimonious as possible and to avoid
serious overfitting. Only if these requirements are acknowledged can generalizability
to future patients be achieved. Thus, validation in an independent study
is an essential step in establishing a prognostic factor or a prognostic classification
scheme. Some insight into the stability and generalizability of the derived models
can be gained by cross-validation and resampling methods, which, however, cannot
be regarded as completely replacing an independent validation study. In addition,
further methodological research is needed on the use of such methods under
various circumstances.
The lack of appropriate validation studies, in combination with insufficient
design considerations and inadequate statistical analyses resulting in serious
overfitting, has led to a situation of conflicting results and failures to establish many
"new" and even "old" prognostic factors (138,139). Thus, careful planning
of prognostic factor studies and a proper and thoughtful statistical analysis are
essential prerequisites for achieving an improvement of the current situation. In
illustrating various approaches, we were not consistent in the sense that we have
not attempted a complete analysis of a particular study according to some generally
accepted guidelines. Rather, we have shown the strengths and weaknesses of
various approaches in order to protect against potential pitfalls. For a concrete
study, the statistical analysis should be carefully planned step by step, and the
model-building process should at least in principle be fixed in advance in a statistical
analysis plan, as is required in much more detail for clinical trials according to
international guidelines (140,141).
The problem of adequate sample sizes for prognostic factor studies has not
been fully appreciated in the past. In contrast to therapeutic studies where one
might argue that also very small differences associated with relative risks close
to 1 are relevant for a comparison of therapies (142–144), one might accept the
requirement that established prognostic factors should exhibit large relative risks.
Thus, at a first glance, studies on prognostic factors seem to require smaller num-
ber of patients. Three points have to be recognized, however, that are essential
for the calculation of sample sizes. First, it is the expected number of events and
not the total number of patients that constitutes the quantity of central importance
and that depends on the length and completeness of follow-up. Second, the distri-
bution can be described by the prevalence of the factor that might differ substan-
tially from the optimal value of 0.5 usually arising in a randomized clinical trial.
For small values of the prevalence, a study has to be larger than a comparable
therapeutic trial using the same value of the relative risk as a clinically relevant
difference. Third, it is the number of factors under consideration, or better the
number of model parameters, that should affect the size of a study. Practical
experience and results from simulation studies suggest, as a rule of thumb,
that studies with fewer than 10 to 25 events per factor (or parameter) cannot be
considered an informative and reliable basis for the evaluation of prognostic
factors. It also has to be recognized that the number of patients or events, respec-
tively, suitable for the final analysis might differ substantially from the number
of patients available in a database when rigorous inclusion and exclusion criteria
are applied. From these considerations it can be derived that small or premature
studies are not informative and cannot lead to an adequate assessment of prognos-
tic factors. They can only create hypotheses in an explorative sense and might
even lead to more confusion on the role of prognostic factors because of various
sources of bias, including publication bias (145–148). To reach definitive
conclusions, close cooperation between different centers or study groups might
be necessary, which could lead to a meta-analysis-type evaluation of prognostic
factors (149). This approach would surely be associated with many additional
difficulties, but it would help to avoid some of the problems due to publication
bias (146,149,150). It would also encourage the use of standard prognostic mod-
els for particular entities or stages of cancer. In addition, the size of an indepen-
dent validation study should be large enough to allow valid and definitive conclu-
sions. Thus, the Freiburg DNA study that we used as a validation study several
times throughout this contribution was far too small from this point of view and
should have served for illustrative purposes only.
A number of important topics have not been mentioned, or have been mentioned only in
passing, in this chapter. One of these topics is concerned with the handling of
missing values in prognostic factors. We have always confined ourselves to a
so-called complete case analysis that would lead to consistent estimates of the
regression coefficients if some assumptions are met (151,152). However, this
may not be a very efficient approach, especially if the missing rates are higher
than in the three prognostic studies that we used for illustration. Thus, more
sophisticated methods for dealing with missing values in some prognostic factors
might be useful (152–154). For applying prognostic factors or prognostic classi-
fication schemes to future patients, one also has to be prepared for the possibility that some factors
may not have been measured. To arrive at a prediction of survival probabilities for
such patients, surrogate definitions for the corresponding prognostic classification
schemes are required.
Throughout this chapter, we have also assumed that effects of prognostic
factors are constant over time and that prognostic factors are recorded and known
at time of diagnosis. These assumptions do not cover the situation of time-varying
effects and of time-dependent covariates. If multiple end points or different events
are of interest, the use of competing risk and multistate models may be indicated.
For these topics that are also of importance for prognostic factor studies, we
refer to more advanced textbooks in survival analysis (14,19,20,22) and current
research papers.
In general, the methods and approaches presented here have at least in part
been selected and assessed according to the subjective views of the authors. Thus,
other approaches might also be seen as useful and adequate. What should not be
a matter of controversy, however, is the need for a careful planning, conducting,
and analyzing of prognostic factor studies to arrive at generalizable and reproduc-
ible results that could contribute to a better understanding and possibly to an
improvement of the prognosis of cancer patients.

ACKNOWLEDGMENTS

We thank our colleagues Erika Graf and Claudia Schmoor for valuable contribu-
tions and Regina Gsellinger for her assistance in preparing the manuscript.
REFERENCES

1. McGuire WL. Breast cancer prognostic factors: evaluation guidelines. J Natl Can-
cer Inst 1991; 83:154–155.
2. Infante-Rivard C, Villeneuve J-P, Esnaola S. A framework for evaluating and con-
ducting prognostic studies: an application to cirrhosis of the liver. J Clin Epidemiol
1989; 42:791–805.
3. Clark GM, ed. Prognostic factor integration. Special issue. Breast Cancer Res Treat
1992; 22:185–293.
4. Simon R, Altman DG. Statistical aspects of prognostic factor studies in oncology.
Br J Cancer 1994; 69:979–985.
5. Wyatt JC, Altman DG. Prognostic models: clinically useful or quickly forgotten?
Commentary, Br Med J 1995; 311:1539–1541.
6. Armitage P, Gehan EA. Statistical methods for the identification and use of prog-
nostic factors. Int J Cancer 1974; 13:16–36.
7. Byar DP. Analysis of survival data: Cox and Weibull models with covariates. In:
Mike V, Stanley KE, eds. Statistics in Medical Research. New York: Wiley, 1982,
pp. 365–401.
8. Byar DP. Identification of prognostic factors. In: Buyse ME, Staquet MJ, Sylvester
RJ, eds. Cancer Clinical Trials: Methods and Practice. Oxford: Oxford University
Press, 1984, pp. 423–443.
9. Simon R. Use of regression models: statistical aspects. In Buyse ME, Staquet MJ,
Sylvester RJ, eds. Cancer Clinical Trials: Methods and Practice. Oxford: Oxford
University Press, 1984, pp. 444–466.
10. George SL. Identification and assessment of prognostic factors. Semin Oncol 1988;
5:462–471.
11. Altman DG, Lausen B, Sauerbrei W, Schumacher M. Dangers of using ‘‘optimal’’
cutpoints in the evaluation of prognostic factors. Commentary. J Nat Cancer Inst
1994; 86:829–835.
12. Altman DG, De Stavola BL, Love SB, Stepniewska KA. Review of survival analy-
ses published in cancer journals. Br J Cancer 1995; 72:511–518.
13. Hermanek P, Gospodarowicz MK, Henson DE, Hutter RVP, Sobin LH. Prognostic
Factors in Cancer. Heidelberg–New York: Springer, 1995.
14. Marubini E, Valsecchi MG. Analysing Survival Data from Clinical Trials and Ob-
servational Studies. Chichester: Wiley, 1995.
15. Parmar MKB, Machin D. Survival Analysis: A Practical Approach. Chichester:
Wiley, 1995.
16. Harris EK, Albert A. Survivorship Analysis for Clinical Studies. New York: Marcel
Dekker, 1991.
17. Lee ET. Statistical Methods for Survival Data Analysis. 2nd ed. New York: Wiley,
1992.
18. Collett D. Modelling Survival Data in Medical Research. London: Chapman &
Hall, 1994.
19. Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and
Truncated Data. New York: Springer, 1997.
20. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. New
York: Wiley, 1980.
21. Fleming TR, Harrington DP. Counting Processes and Survival Analysis. New York:
Wiley, 1991.
22. Andersen PK, Borgan O, Gill RD, Keiding N. Statistical Methods Based on Count-
ing Processes. New York: Springer, 1992.
23. Burke HB, Henson DE. Criteria for prognostic factors and for an enhanced prognos-
tic system. Cancer 1993; 72:3131–3135.
24. Henson DE. Future directions for the American Joint Committee on Cancer. Cancer
1992; 69:1639–1644.
25. Fielding LP, Fenoglio-Preiser CM, Freedman LS. The future of prognostic factors
in outcome prediction for patients with cancer. Cancer 1992; 70:2367–2377.
26. Fielding LP, Henson DE. Multiple prognostic factors and outcome analysis in pa-
tients with cancer. Cancer 1993; 71:2426–2429.
27. Rothman KJ, Greenland S. Modern Epidemiology. 2nd ed. Philadelphia: Lippin-
cott-Raven, 1998.
28. Simon R. Patients subsets and variation in therapeutic efficacy. Br J Cancer 1982;
14:473–482.
29. Gail M, Simon R. Testing for qualitative interactions between treatment effects and
patient subsets. Biometrics 1985; 41:361–372.
30. Byar DP. Assessing apparent treatment covariate interactions in randomized clini-
cal trials. Stat Med 1985; 4:255–263.
31. Schmoor C, Ulm K, Schumacher M. Comparison of the Cox model and the regres-
sion tree procedure in analyzing a randomized clinical trial. Stat Med 1993; 12:
2351–2366.
32. Bloom HJG, Richardson WW. Histological grading and prognosis in primary breast
cancer. Br J Cancer 1957; 2:359–377.
33. Pfisterer J, Kommoss F, Sauerbrei W, Menzel D, Kiechle M, Giese E, Hilgarth M,
Pfleiderer A. DNA flow cytometry in node positive breast cancer: prognostic value
and correlation to morphological and clinical factors. Anal Quant Cytol Histol
1995; 17:406–412.
34. Schumacher M, Bastert G, Bojar H, Hübner K, Olschewski M, Sauerbrei W,
Schmoor C, Beyerle C, Neumann RLA, Rauschecker HF for the German Breast
Cancer Study Group. Randomized 2 × 2 trial evaluating hormonal treatment and
the duration of chemotherapy in node-positive breast cancer patients. J Clin Oncol
1994; 12:2086–2093.
35. Schmoor C, Olschewski M, Schumacher M. Randomized and non-randomized pa-
tients in clinical trials: experiences with comprehensive cohort studies. Stat Med
1996; 15:263–271.
36. Sauerbrei W, Hübner K, Schmoor C, Schumacher M for the German Breast Cancer
Study Group. Validation of existing and development of new prognostic classifica-
tion schemes in node negative breast cancer. Breast Cancer Res Treat 1997; 42:
149–163. [Correction. Breast Cancer Res Treat 1998; 48:191–192.]
37. Cox DR. Regression models and life tables (with discussion). J R Stat Soc Ser B
1972; 34:187–220.
38. Schumacher M, Holländer N, Sauerbrei W. Reduction of bias caused by model
building. Proceedings of the Statistical Computing Section, American Statistical Association, 1996, pp. 1–7.
39. Lausen B, Schumacher M. Maximally selected rank statistics. Biometrics 1992;
48:73–85.
40. Hilsenbeck SG, Clark GM. Practical P-value adjustment for optimally selected cut-
points. Stat Med 1996; 15:103–112.
41. Worsley KJ. An improved Bonferroni inequality and applications. Biometrika
1982; 69:297–302.
42. Lausen B, Sauerbrei W, Schumacher M. Classification and regression trees (CART)
used for the exploration of prognostic factors measured on different scales. In:
Dirschedl P, Ostermann R, eds. Computational Statistics. Heidelberg: Physica-
Verlag, 1994, pp. 1483–1496.
43. Lausen B, Schumacher M. Evaluating the effect of optimized cutoff values in the
assessment of prognostic factors. Comput Stat Data Analysis 1996; 21:307–326.
44. Verweij P, Van Houwelingen HC. Cross-validation in survival analysis. Stat Med
1993; 12:2305–2314.
45. Schumacher M, Holländer N, Sauerbrei W. Resampling and cross-validation tech-
niques: a tool to reduce bias caused by model building. Stat Med 1997; 16:2813–
2827.
46. Van Houwelingen HC, Le Cessie S. Predictive value of statistical models. Stat Med
1990; 9:1303–1325.
47. Andersen PK. Survival analysis 1982–1991: the second decade of the proportional
hazards regression model. Stat Med 1991; 10:1931–1941.
48. Sauerbrei W. The use of resampling methods to simplify regression models in medi-
cal statistics. Appl Stat 1999; 48:313–329.
49. Miller AJ. Subset Selection in Regression. London: Chapman and Hall, 1990.
50. Teräsvirta T, Mellin I. Model selection criteria and model selection tests in regres-
sion models. Scand J Stat 1986; 13:159–171.
51. Mantel N. Why stepdown procedures in variable selection? Technometrics 1970;
12:621–625.
52. Sauerbrei W. Comparison of variable selection procedures in regression models—
a simulation study and practical examples. In: Michaelis J, Hommel G, Wellek S,
eds. Europäische Perspektiven der Medizinischen Informatik, Biometrie und Epide-
miologie. München: MMV, 1993, pp. 108–113.
53. Royston P, Altman DG. Regression using fractional polynomials of continuous
covariates: parsimonious parametric modelling (with discussion). Appl Stat 1994;
43:429–467.
54. Sauerbrei W, Royston P. Building multivariable prognostic and diagnostic models:
transformation of the predictors by using fractional polynomials. J R Stat Soc Ser
A 1999; 162:71–94.
55. Sauerbrei W, Royston P, Bojar H, Schmoor C, Schumacher M, and the German
Breast Cancer Study Group (GBSG). Modelling the effects of standard prognostic
factors in node positive breast cancer. Br J Cancer 1999; 79:1752–1760.
56. Hastie TJ, Tibshirani RJ. Generalized Additive Models. New York: Chapman and
Hall, 1990.
57. Valsecchi MG, Silvestri D. Evaluation of long-term survival: use of diagnostics
and robust estimators with Cox’s proportional hazards model. Stat Med 1996; 15:
2763–2780.
58. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in devel-
oping models, evaluating assumptions and adequacy, and measuring and reducing
errors. Stat Med 1996; 15:361–387.
59. Chen CH, George SL. The bootstrap and identification of prognostic factors via
Cox’s proportional hazards regression model. Stat Med 1985; 4:39–46.
60. Altman DG, Andersen PK. Bootstrap investigation of the stability of a Cox regres-
sion model. Stat Med 1989; 8:771–783.
61. Sauerbrei W, Schumacher M. A bootstrap resampling procedure for model build-
ing: application to the Cox regression model. Stat Med 1992; 11:2093–2109.
62. Breiman L, Friedman JH, Olshen R, Stone CJ. Classification and Regression Trees.
Wadsworth: Monterey, 1984.
63. Zhang H, Crowley J, Sox HC, Olshen R. Tree-structured statistical methods. In:
Armitage P, Colton T, eds. Encyclopedia of Biostatistics. Chichester. Wiley, 1998,
pp. 4561–4573.
64. Gordon L, Olshen R. Tree-structured survival analysis. Cancer Treat Rep 1985;
69:1065–1069.
65. LeBlanc M, Crowley J. Relative risk regression trees for censored survival data.
Biometrics 1992; 48:411–425.
66. LeBlanc M, Crowley J. Survival trees by goodness of split. JASA 1993; 88:457–
467.
67. Segal MR. Regression trees for censored data. Biometrics 1988; 44:35–47.
68. Segal MR. Tree-structured survival analysis in medical research. In: Everitt BS,
Dunn G, eds. Statistical Analysis of Medical Data: New Developments, London:
Arnold, 1998, pp. 101–125.
69. Ciampi A, Hendricks L, Lou Z. Tree-growing for the multivariate model: the RE-
CPAM approach. In: Dodge Y, Whittaker J, eds. Computational Statistics. Vol 1.
Heidelberg: Physica-Verlag, 1992.
70. Tibshirani R, LeBlanc M. A strategy for binary description and classification.
J Comput Graph Stat 1992; 1:3–20.
71. Sauerbrei W. On the development and validation of classification schemes in sur-
vival data. In: Klar R, Opitz O, eds. Classification and Knowledge Organization.
Berlin, Heidelberg, New York: Springer, 1997, pp. 509–518.
72. Schmoor C, Schumacher M. Methodological arguments for the necessity of ran-
domized trials in high-dose chemotherapy for breast cancer. Breast Cancer Res
Treat 1999; 54:31–38.
73. Haybittle JL, Blamey RW, Elston CW, Johnson J, Doyle PJ, Campbell FC, Nichol-
son RI, Griffiths K. A prognostic index in primary breast cancer. Br J Cancer 1982;
45:361–366.
74. Galea MH, Blamey RW, Elston CE, Ellis IO. The Nottingham Prognostic Index
in primary breast cancer. Breast Cancer Res Treat 1992; 22:207–219.
75. Balslev I, Axelsson CK, Zedeler K, Rasmussen BB, Carstensen B, Mouridsen HT.
The Nottingham Prognostic Index applied to 9,149 patients from the studies of the
Danish Breast Cancer Cooperative Group (DBCG). Breast Cancer Res Treat 1994;
32:281–290.
76. Collett K, Skjaerven R, Machle BO. The prognostic contribution of estrogen and
progesterone receptor status to a modified version of the Nottingham Prognostic
Index. Breast Cancer Res Treat 1998; 48:1–9.
77. Copas JB. Using regression models for prediction: shrinkage and regression to the
mean. Stat Methods Med Res 1997; 6:167–183.
78. Vach W. On the relation between the shrinkage effect and a shrinkage method.
Comput Stat 1997; 12:279–292.
79. Ripley BD. Pattern Recognition and Neural Networks. Cambridge: University
Press, 1996.
80. Baxt WG. Application of artificial neural networks to clinical medicine. Lancet
1995; 346:1135–1138.
81. Cross SS, Harrison RF, Kennedy RL. Introduction to neural networks. Lancet 1995;
346:1075–1079.
82. Dybowski R, Gant V. Artificial neural networks in pathology and medical labora-
tories. Lancet 1995; 346:1203–1207.
83. Wyatt J. Nervous about artificial neural networks? Lancet 1995; 346:1175–
1177.
84. Schwarzer G, Vach W, Schumacher M. On the misuses of artificial neural networks
for prognostic and diagnostic classification in oncology. Stat Med 2000; 19:541–
561.
85. Penny W, Frost D. Neural networks in clinical medicine. Med Decis Making 1996;
16:386–398.
86. Cheng B, Titterington DM. Neural networks: a review from a statistical perspective
(with discussion). Stat Sci 1994; 9:2–54.
87. Ripley BD. Statistical aspects of neural networks. In: Barndorff Nielsen OE, Jensen
JL, eds. Networks and Chaos—Statistical and Probabilistic Aspects. London:
Chapman and Hall, 1993.
88. Schumacher M, Rossner R, Vach W. Neural networks and logistic regression. Part
I. Comput Stat Data Anal 1996; 21:661–682.
89. Stern HS. Neural networks in applied statistics (with discussion). Technometrics
1996; 38:205–220.
90. Warner B, Misra M. Understanding neural networks as statistical tools. Am Stat
1996; 50:284–293.
91. Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are univer-
sal approximators. Neural Networks 1989; 2:359–366.
92. Ripley RM. Neural network models for breast cancer prognosis. Ph.D. dissertation,
University of Oxford, Dept. of Engineering Science, Oxford, 1998.
93. Liestøl K, Andersen PK, Andersen U. Survival analysis and neural nets. Stat Med
1994; 13:1189–1200.
94. Ravdin PM, Clark GM. A practical application of neural network analysis for pre-
dicting outcome on individual breast cancer patients. Breast Cancer Res Treat 1992;
22:285–293.
95. Ravdin PM, Clark GM, Hilsenbeck SG, Owens MA, Vendely P, Pandian MR,
McGuire WL. A demonstration that breast cancer recurrence can be predicted by
neural network analysis. Breast Cancer Res Treat 1992; 21:47–53.
96. De Laurentiis M, Ravdin PM. Survival analysis of censored data: neural network
analysis detection of complex interactions between variables. Breast Cancer Res Treat 1994; 32:113–118.
97. Burke HB. Artificial neural networks for cancer research: outcome prediction.
Semin Surg Oncol 1994; 10:73–79.
98. McGuire WL, Tandon AK, Allred, DC, Chamness GC, Ravdin PM, Clark GM.
Treatment decisions in axillary node-negative breast cancer patients. Monogr Nat
Cancer Inst 1992; 11:173–180.
99. Kappen HJ, Neijt JP. Neural network analysis to predict treatment outcome. Ann
Oncol 1993; 4(suppl):31–34.
100. Faraggi D, Simon R. A neural network model for survival data. Stat Med 1995;
14:73–82.
101. Biganzoli E, Boracchi P, Mariani L, Marubini E. Feed forward neural networks
for the analysis of censored survival data: a partial logistic regression approach.
Stat Med 1998; 17:1169–1186.
102. Ripley BD, Ripley RM. Neural networks as statistical methods in survival analysis.
In: Dybowski R, Gant V, eds. Clinical Applications of Artificial Neural Networks.
New York: Cambridge University Press, 2001 (in press).
103. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of
prognostic classification schemes for survival data. Stat Med 1999; 18:2529–
2545.
104. Parkes MC. Accuracy of predictions of survival in later stages of cancer. Br Med
J 1972; 264:29–31.
105. Forster LA, Lynn J. Predicting life span for applicants to inpatient hospice. Arch
Intern Med 1988; 148:2540–2543.
106. Maltoni M, Pirovano M, Scarpi E, Marinari M, Indelli M, Arnoldi E, Galluci M,
Frontini L, Piva L, Amadori D. Prediction of survival of patients terminally ill with
cancer. Cancer 1995; 75:2613–2622.
107. Henderson R, Jones M. Prediction in survival analysis. model or medic. In: Jewell
NP, Kimber AC, Ting Lee ML, Withmore GA, eds. Lifetime Data: Models in Reli-
ability and Survival Analysis. Dordrecht: Kluwer Academic Publishers. 1995.
108. Henderson R. Problems and prediction in survival-data analysis. Stat Med 1995;
3:143–152.
109. Brier GW. Verification of forecasts expressed in terms of probability. Monthly
Weather Rev 1950; 78:1–3.
110. Hilden J, Habbema JDF, Bjerregard B. The measurement of performance in proba-
bilistic diagnosis III: methods based on continuous functions of the diagnostic prob-
abilities. Methods Inform Med 1978; 17:238–246.
111. Hand DJ. Construction and Assessment of Classification Rules. Chichester: Wiley,
1997.
112. Shapiro AR. The evaluation of clinical predictions. N Engl J Med 1977; 296:1509–
1514.
113. Korn EJ, Simon R. Explained residual variation, explained risk and goodness of
fit. Am Stat 1991; 45:201–206.
114. Schemper M. The explained variation in proportional hazards regression. Biome-
trika 1990; 77:216–218. [Correction. Biometrika 1994; 81:631.]
115. Graf E, Schumacher M. An investigation on measures of explained variation in survival analysis. Statistician 1995; 44:497–507.
116. Schemper M, Stare J. Explained variation in survival analysis. Stat Med 1996; 15:
1999–2012.
117. Ash A, Schwartz M. R2: a useful measure of model performance when predicting
a dichotomous outcome. Stat Med 1999; 18:375–384.
118. Efron B. Estimating the error rate of prediction rule: improvement on cross-valida-
tion. JASA 1983; 78:316–330.
119. Efron B, Tibshirani R. Improvement on cross-validation: the .632+ bootstrap
method. JASA 1997; 92:548–560.
120. Fayers PM, Machin D. Sample size: how many patients are necessary? Br J Cancer
1995; 72:1–9.
121. Schoenfeld DA. Sample size formula for the proportional-hazard regression model.
Biometrics 1983; 39:499–503.
122. George SL, Desu MM. Planning the size and duration of a clinical trial studying
the time to some critical event. J Chronic Dis 1974; 27:15–24.
123. Bernstein D, Lagakos SW. Sample size and power determination for stratified clini-
cal trials. J Stat Comput Sim 1978; 8:65–73.
124. Schoenfeld DA. The asymptotic properties of nonparametric tests for comparing
survival distribution. Biometrika 1981; 68:316–319.
125. Schmoor C, Sauerbrei W, Schumacher M. Sample size considerations for the evalu-
ation of prognostic factors in survival analysis. Stat Med 2000; 19:441–452.
126. Lui K-J. Sample size determination under an exponential model in the presence of
a confounder and type I censoring. Control Clin Trials 1992; 13:446–458.
127. Palta M, Amini SB. Consideration of covariates and stratification in sample size
determination for survival time studies. J Chronic Dis 1985; 38:801–809.
128. Concato J, Peduzzi P, Holford TR, Feinstein AR. Importance of events per indepen-
dent variable in proportional hazards analysis. I. Background, goals and general
strategy. J Clin Epidemiol 1995; 48:1495–1501.
129. Peduzzi P, Concato J, Feinstein AR, Holford TR. Importance of events per indepen-
dent variable in proportional hazards analysis. II. Accuracy and precision of regres-
sion estimates. J Clin Epidemiol 1995; 48:1503–1510.
130. Harrell FE, Lee KL, Califf RM, Pryor DB, Rosati RA. Regression modeling strate-
gies for improved prognostic prediction. Stat Med 1984; 3:143–152.
131. Harrell FE, Lee KL, Matchar DB, Reichert TA. Regression models for prognostic
prediction: advantages, problems and suggested solutions. Cancer Treat Rep 1985;
69:1071–1077.
132. Lubin JH, Gail MJ. On power and sample size for studying features of the relative
odds of disease. Am J Epidemiol 1990; 131:551–566.
133. Peterson B, George SL. Sample size requirements and length of study for testing
interaction in a 2 × k factorial design when time-to-failure is the outcome. Control
Clin Trials 1993; 14:511–522.
134. Olschewski M, Schumacher M, Davis K. Analysis of randomized and nonrandom-
ized patients in clinical trials using the comprehensive cohort follow-up study de-
sign. Control Clin Trials 1992; 13:226–239.
135. Dambrosia JM, Ellenberg JH. Statistical considerations for a medical data base.
Biometrics 1980; 36:323–332.
136. Green SB, Byar DP. Using observational data from registries to compare treat-
ments: the fallacy of omnimetrics. Stat Med 1984; 3:361–370.
137. Pfisterer J, Kommoss F, Sauerbrei W, Renz H, duBois A, Kiechle-Schwarz M,
Pfleiderer A. Cellular DNA content and survival in advanced ovarian cancer. Can-
cer 1994; 74:2509–2515.
138. Hilsenbeck SG, Clark GM, McGuire WL. Why do so many prognostic factors fail
to pan out? Breast Cancer Res Treat 1992; 22:197–206.
139. Altman DG, Royston P. What do we mean by validating a prognostic model? Stat
Med 2000; 19:453–473.
140. European Community, CPMP Working Party on Efficacy of Medicinal Products.
Biostatistical methodology in clinical trials in applications for marketing authorisa-
tions for medicinal products. Note for Guidance III/3630/92-EN, December 1994.
Stat Med 1995; 14:1659–1682.
141. International Conference of Harmonisation and Technical Requirements for Regis-
tration of Pharmaceuticals for Human Drugs. ICH Harmonised Tripartite Guideline:
Guideline for Good Clinical Practice. Recommended for Adoption at Step 4 of the
ICH Process on 1 May 1996 by the ICH Steering Committee.
142. Peto R. Clinical trial methodology. Biomed 1978; 28:24–36.
143. Yusuf S, Collins R, Peto R. Why do we need some large, simple randomized trials?
(with discussion). Stat Med 1984; 3:402–422.
144. Lubsen J, Tijssen GP. Large trials with simple protocols. Indications and contraindi-
cations. Control Clin Trials 1987; 10:151–160.
145. Simes RJ. Publication bias: the case for an international registry of clinical trials.
J Clin Oncol 1986; 4:1529–1541.
146. Begg CB, Berlin A. Publication bias: a problem of interpreting medical data. J R
Stat Soc Ser A 1988; 151:419–463.
147. Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical
research. Lancet 1991; 337:867–872.
148. Stern JM, Simes RJ. Publication bias: evidence of delayed publication in a cohort
study of clinical research projects. Br Med J 1997; 315:640–645.
149. Simon R. Meta-analysis and cancer clinical trials. Principles Pract Oncol 1991; 5:
1–9.
150. Felson DT. Bias in meta-analytic research. J Clin Epidemiol 1992; 45:885–892.
151. Vach W. Logistic Regression with Missing Values in the Covariates. Lecture Notes
in Statistics 86. New York: Springer 1994.
152. Vach W. Some issues in estimating the effect of prognostic factors from incomplete
covariate data. Stat Med 1997; 16:57–72.
153. Robins JM, Rotnitzky A, Zhao LD. Estimation of regression coefficients when
some regressors are not always observed. JASA 1994; 89:846–866.
154. Lipsitz SR, Ibrahim JG. Estimating equations with incomplete categorical covari-
ates in the Cox model. Biometrics 1998; 54:1002–1013.
18
Statistical Methods to Identify
Prognostic Factors

Kurt Ulm, Hjalmar Nekarda, Pia Gerein, and Ursula Berger
Technical University of Munich, Munich, Germany

I. INTRODUCTION

In recent years the search for prognostic factors has attracted increasing atten-
tion in medicine, especially in oncology and cardiology. Two reasons are mainly
responsible for this trend. First, one is interested in gaining more insight into the
development of the disease (e.g., into the tumor biology). Second, there is a ten-
dency away from a more or less uniform therapy toward an individualized therapy.
An improved estimate of prognosis can also be used to inform a patient more
precisely about the further course of the disease. Further reasons to explore
prognostic factors are discussed by Byar (1).
Risk stratification in oncology until now is based mainly on conventional
factors of tumor staging (TNM classification: UICC [2]) like local tumor invasion
or size, status of lymph nodes, and status of metastasis. But the ability of these
factors to answer the questions mentioned above is limited. For a better stratifica-
tion of patients for prognosis and therapy, the staging system needs to be more
sophisticated and new factors have to be identified. For example in breast cancer,
presumably one of the leading fields in that research area, about 100 new factors
are under discussion (3). One of the related topics is the question about adjuvant
therapy, especially who should be treated.
The problems are not restricted to breast cancer. In other locations and in
other medical disciplines there are the same problems (e.g., in stomach cancer
[4,5] or in cardiology [6]). Stomach cancer belongs to the category of tumors
where the effect of an adjuvant chemotherapy has not been proven.
The search for important factors is a great challenge in medicine. The ap-
propriate analysis, on the other hand, is far from a routine task for statisticians.
Just as traditional factors have mainly been used in medicine, classic methods
like logistic or Cox regression have been used in statistics for decades.
Parallel to the development of new prognostic factors in medicine, new
statistical tools have been described. The aim of this chapter is to summarize and
highlight some of these new developments. Data on patients with stomach cancer
are used to illustrate these new methods. Recently, Harrell et al. (7) proposed a
system that can be used to identify important factors. The proposals made here
contain some other features and concentrate on giving an answer to the two
questions mentioned at the beginning.

II. METHODS
A. Classic Method
The example used to illustrate the new developments contains censored data. For
this type of data, the Cox model is mostly used in the literature. If the time of
follow-up is denoted by t and the potential prognostic factors by Z = (Z₁, . . . , Z_p),
the model is usually given in the following form (8):

$$\lambda(t \mid Z) = \lambda(t \mid Z = 0)\cdot e^{\sum_j \beta_j Z_j} = \lambda_0(t)\cdot e^{\sum_j \beta_j Z_j} \qquad (1)$$

with λ(t |Z ) being the hazard rate at time t given the factors Z.
Throughout the chapter, very often the ratio of the hazard functions, or its
logarithm, is considered:

$$\ln \frac{\lambda(t \mid Z)}{\lambda_0(t)} = \sum_{j=1}^{p} \beta_j Z_j \qquad (2)$$

which is also denoted as relative hazards and can be interpreted as the logarithm
of the relative risk (ln RR).
B. Linearity Assumption
In the simplest form, each continuous factor Z is assumed, maybe after some
transformation, to be linearly related to the outcome. There are several proposals
in the literature on how to check the assumption of linearity. One approach is to
change the linear relationship β ⋅ Z into a functional form β(Z ). We want to
mention two methods to estimate β(Z). The first is not to specify the func-
tion β(Z) at all; the only assumption is that β(Z) has to be smooth. This leads to
smoothing splines (9). The method to estimate β(Z ) can be described in the con-
text of a penalized log-likelihood function Lp(β):
$$L_p(\beta) = 2\, l(\beta) - \lambda \cdot P(\beta) \qquad (3)$$

where l(β) is the usual log-likelihood function, P(β) is a roughness penalty, penal-
izing deviations from smooth functions, and λ is the weight of the penalty. Using
the integrated squared second derivative as the roughness penalty,

$$P(\beta) = \int \bigl(\beta''(Z)\bigr)^{2}\, dZ,$$

the maximum of relation (3) leads to natural cubic splines (10).
The problem is the appropriate choice of λ. The main obstacles to wider
application of this approach are the need for special software like S-Plus
and the lack of a simple statistic for testing whether β(Z) is equal to β · Z or
different.
A second option is the use of fractional polynomials (11). The idea is sim-
ply to construct a function β(Z ) consisting of up to some polynomials of the
form Z^{p_i}, with p_i ∈ {−2, −1.5, −1, −0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3}. Either one compo-
nent or at most two components seems to be sufficient for most practical applica-
tions, i.e., i ∈ {1, 2}. The advantage of this approach is the representation of
β(Z) in a functional form. If two polynomials are used, a variety of functional
relationships can be described. In Figure 1 the results of using both methods,
smoothing splines and fractional polynomials, are shown for one of the factors,
PAI-1, of the example used in Section III. There is only a slight difference be-
tween the results of the two methods. The deviation from linearity is obvious.

Figure 1  Influence of a prognostic factor (PAI-1) on the relative risk plotted on an ln scale. The results of assuming a linear relationship (····), a smoothing spline (——) including the 95% CI (---), and fractional polynomials (— —) are shown.

The advantage of fractional polynomials is that standard software pack-
ages, like SPSS or SAS, and common test statistics can be used for their determi-
nation.
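As an illustration of how such a search might be carried out in practice, the following sketch scores each candidate power of a single continuous factor by the maximized Cox partial log-likelihood. It is not taken from this chapter; the Python package lifelines and the column names are assumptions.

```python
# Sketch only: a first-degree fractional polynomial search for one continuous
# factor in a Cox model, scored by the maximized partial log-likelihood.
# The package (lifelines) and column names ("time", "event", factor) are
# assumptions, not the authors' code.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

POWERS = [-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3]  # power 0 means ln Z

def fp_transform(z, p):
    """Fractional polynomial transformation Z**p, with Z**0 read as ln Z."""
    return np.log(z) if p == 0 else z ** float(p)

def best_fp1(df, factor, duration_col="time", event_col="event"):
    """Return the power maximizing the Cox partial log-likelihood."""
    loglik = {}
    for p in POWERS:
        d = df[[duration_col, event_col]].copy()
        d["fp"] = fp_transform(df[factor].astype(float), p)
        cph = CoxPHFitter().fit(d, duration_col=duration_col, event_col=event_col)
        loglik[p] = cph.log_likelihood_
    best = max(loglik, key=loglik.get)
    return best, loglik

# Usage (hypothetical data; the factor must be strictly positive for negative
# and logarithmic powers):
# p_best, loglik = best_fp1(df, "pai1")
```

A second-degree polynomial would be handled analogously by looping over pairs of powers, and the difference in log-likelihoods against the linear model provides the test statistic mentioned above.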

C. Proportional Hazards Assumption


One of the basic assumptions in using the Cox model is that of proportional
hazards. The effect of a certain factor is assumed to be constant over the total
follow-up period. On the other hand, it is more natural to assume a change (e.g.,
a decrease in the effect if time is prolonged). Considering one factor Z, the idea
is to extend the linear assumption β · Z into γ(t) · Z. Now the influence of a
certain factor Z on the hazard ratio can be described as a function of time. One
way to simplify the analysis is to restrict this form of relationship to binary vari-
ables. Otherwise, some form of relationship between γ(t) and Z has to be assumed.
The hypothesis of interest is whether γ(t) is constant or not (H₀: γ(t) = γ₀).
There is a long history of extending the classical Cox model. The first approach,
by Cox himself, was based on using some predefined functions, e.g., a linear
function (γ(t) = t) or a log function (γ(t) = ln t). Over the years several proposals have
been made to extend this approach. One approach is again the use of smoothing
splines (12) to analyze the time-varying effect of a certain factor Z.
As an alternative, fractional polynomials can also be used by defining
γ(t) = ∑ β_i · t^{p_i}. The advantages of fractional polynomials compared with smooth-
ing splines are again the use of standard software packages and the direct use of
a simple test statistic. For the analysis, estimation methods for regression models
with time-dependent covariates can be applied by performing the transformations
X_i(t) = Z · t^{p_i}. According to the selection of γ(t), the values X_i(t) have to be
calculated for all observed failure times. Figure 2 shows an example with a time-
varying effect. In this example the influence of age on the survival rate is consid-
ered (for details, see Sect. III). Age is divided into two groups according to the
median of 65 years. In all the analyses, a decrease of the effect during extended
follow-up can be seen.

Figure 2  The time-varying effect of age (≤65 years vs. older) using fractional polynomials (γ(t) = −1.28 + 4.94/√t) is shown.
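To make the transformation X_i(t) = Z · t^{p_i} concrete, a minimal sketch follows. It is not the authors' code; it splits each subject's follow-up at the observed failure times and fits the time-dependent covariate with lifelines' counting-process Cox fitter. Column names and the choice f(t) = 1/√t are illustrative assumptions.

```python
# Sketch only (not the authors' code): testing a time-varying effect through
# the transformation X(t) = Z * f(t) evaluated at the observed failure times,
# using lifelines' counting-process Cox fitter.
import numpy as np
import pandas as pd
from lifelines import CoxTimeVaryingFitter

def split_at_event_times(df, duration_col="time", event_col="event", z_col="z"):
    """Expand each subject into (start, stop] intervals at the failure times."""
    event_times = np.sort(df.loc[df[event_col] == 1, duration_col].unique())
    rows = []
    for i, row in df.iterrows():
        cuts = event_times[event_times < row[duration_col]]
        starts = np.concatenate(([0.0], cuts))
        stops = np.concatenate((cuts, [row[duration_col]]))
        for s, e in zip(starts, stops):
            rows.append({"id": i, "start": s, "stop": e, "z": row[z_col],
                         "event": int(bool(row[event_col]) and e == row[duration_col])})
    return pd.DataFrame(rows)

# long = split_at_event_times(df)
# long["z_t"] = long["z"] / np.sqrt(long["stop"])   # X(t) = Z * t**(-1/2)
# ctv = CoxTimeVaryingFitter().fit(long, id_col="id", event_col="event",
#                                  start_col="start", stop_col="stop")
# ctv.print_summary()  # the coefficient on z_t estimates the time-varying part
```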

D. Combination of ␤(Z ) and ␥(t) into One Model


In the context of a regression model both extensions can be combined into a
model of the form
$$\lambda(t \mid Z, Z^{*}) = \lambda_0(t)\cdot e^{\sum_j \beta_j(Z_j) + \sum_j \gamma_j(t)\, Z^{*}_j} \qquad (4)$$
where Z denotes all the continuous factors and Z* all the binary covariates, either
binary in a natural way, like gender, or coded. Within this model the influence
of certain factors on the event rate can be investigated to a greater extent compared
with the classical Cox model.

E. Selection Procedure
We describe very briefly one option for the selection of factors in the context of
fractional polynomials. In the first step, in a univariate analysis the ‘‘optimal’’
choices for β(Z ) and γ(t) can be identified. For the division of a continuous factor
Z into a binary variable Z*, two options are possible, either the use of the prede-
fined cutpoints or the selection of ‘‘optimal’’ cutpoints based on maximal selected
test statistics. The second choice is connected with an inflated p value. However,
no proof exists whether ‘‘optimal’’ binary coding has any influence on the selec-
tion of γ(t). In a stepwise forward procedure, either the factor Zj or Z*j is selected
that provides the best fit, based on the likelihood ratio statistics taking into ac-
count the degrees of freedom for β(Z ) or γ(t). An alternative can be the use of
the criterion proposed by Akaike
AIC = Dev + 2 · ν
with Dev the deviance and ν the number of parameters used to describe β(Z)
or γ(t). The selection in connection with the use of smoothing splines is described
in another article (13). We want to concentrate here on the use of fractional
polynomials. However, model (4) does not directly give an answer to the classifi-
cation of a patient into a certain risk group. One approach is to divide the func-
tional term PI = ∑ β_j(Z_j) + ∑ γ_j(t) Z*_j into certain intervals. The problem is of
course related to the cutpoints used. Therefore, another approach, the CART
method, has attracted great attention—at least in the medical community (14).

F. CART Method
The idea of this method is simply to split the whole data into two subsamples
with the greatest difference in the outcome (e.g., the survival rate). For this split
all factors with all possible divisions into two groups are considered. If a split
is performed, both subsamples are further analyzed independently in the same
way until no further split is recommended, for example, the difference is too
small or the number of patients in the subsample is too low. For performing a
split, a certain test statistic has to be selected. In the situation of failure time data,
the log-rank test is often used. The CART method results in a variety of so-called
terminal nodes with subgroups of patients at different risks. The clinicians can
now identify the subgroups where different therapies should be applied. There
are several proposals on how to define the ‘‘optimal’’ tree (e.g., with the lowest
misclassification rate). One way to get an optimal tree is to prune the tree (15).
This means a large tree is constructed and afterward this large tree is cut back.
Another problem is related to the selection of the optimal splits. Usually there
is a mixture of continuous and discrete factors. It is well known that there is an
inflation of the test statistics in analyzing continuous data, called the maximal
selected test statistics. To make a fair comparison, the p value has to be adjusted.
One can use a permutation test or some correction formulas (16).
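A minimal sketch of the first step of such a tree is given below: all cutpoints of one continuous factor are scanned and the split with the largest log-rank statistic is taken, as in Figure 5. It is not the authors' implementation; column names are hypothetical, and the maximally selected statistic would still need the p-value adjustment discussed above.

```python
# Sketch only (not the authors' implementation): the first CART-type split,
# chosen as the cutpoint of one continuous factor with the largest log-rank
# statistic.  The resulting maximally selected statistic needs a p-value
# adjustment before any formal inference.
import numpy as np
from lifelines.statistics import logrank_test

def best_cutpoint(df, factor, duration_col="time", event_col="event",
                  min_group_size=20):
    """Return (cutpoint, log-rank chi-square) of the best binary split."""
    best_cut, best_stat = None, -np.inf
    for c in np.unique(df[factor]):
        left = df[factor] <= c
        if left.sum() < min_group_size or (~left).sum() < min_group_size:
            continue  # avoid splits that leave very small subgroups
        res = logrank_test(df.loc[left, duration_col], df.loc[~left, duration_col],
                           event_observed_A=df.loc[left, event_col],
                           event_observed_B=df.loc[~left, event_col])
        if res.test_statistic > best_stat:
            best_cut, best_stat = c, res.test_statistic
    return best_cut, best_stat

# cut, chi2 = best_cutpoint(df, "nodos_pr")  # e.g., percentage of positive nodes
```

Applying the same search recursively within each resulting subgroup, and pruning afterward, gives a tree of the kind shown in Figure 6.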

III. RESULTS
A. Description of the Data
The study contains data of 295 completely resected patients with stomach cancer
who underwent curative surgery between 1987 and 1996 (17). The follow-up
period is between 3 months and 11 years (median, 41 months). So far, 108
patients have died. Figure 3 shows the survival curve for the whole sample.

Figure 3  Survival curve for the total sample (n = 295, 108 deaths) of patients with stomach cancer.

In addition to the traditional factors (TNM classification), new factors like
uPA and PAI-1 were investigated (17). Table 1 gives a short description of the
prognostic factors used in the analysis.
The classic Cox model gives the results as shown in Table 2. The continu-
ous factors have not been divided into certain categories.
In the multivariate analysis the percentage of positive lymph nodes
(NODOS.PR) and the local tumor invasion (T.SUB) (18) turned out to be statis-
tically significantly correlated with the survival rate.

B. Results Using Fractional Polynomials


1. Univariate Analysis
The first step was used to identify ‘‘optimal’’ functions for β(Z ) and γ(t). The
results for β(Z ) can be seen in Table 3. Only the continuous factors are included
in this analysis.
Three of five continuous factors show an association with the event rate
(NODOS.PR, uPA, and PAI-1) comparable with the classic Cox model. However,
the form of the relationship is better described in a nonlinear way. Two factors
(AGE and NODOS.GE) show no association with the log of the hazard ratio even
using a more flexible form of the relationship.
For the analysis of time-varying effects all continuous factors Z have been
changed into binary variables Z*. The transformation was based either on pre-
defined cutpoints (AGE, NODOS.PR, and NODOS.GE) or optimized cutpoints
(uPA and PAI-1). The results of the univariate analysis of time-varying effects
(= γ(t)) can be seen in Table 4.
For AGE, T.SUB, METAS, DIFF, and NODOS.GE there is a significant
change in the effect during follow-up.

2. Multivariate Analysis
In the multivariate analysis, model (4) has been considered. In a stepwise forward
procedure all the results from the univariate analyses either in considering β(Z )
or γ(t) have been included (Table 5). This means the functional form remained
unchanged. Only the parameters are newly estimated in this multivariate analysis.
The selection procedure is based on the likelihood ratio statistics taking into ac-
count the degrees of freedom.
The percentage of positive lymph nodes is the most important factor, show-
ing a constant effect over time. The value of the likelihood function (−2 · ln L)
Table 1  Prognostic Factors Analyzed in the Stomach Cancer Study

Factor      Range         Coding                          Interpretation
AGE         28–90         0: ≤65 (median); 1: >65         Age at surgery
NODOS.PR    0–97          0: <20; 1: ≥20                  Percentage of positive lymph nodes
T.SUB       1–7           0: ≤4 (cutoff); 1: >4           Local tumor invasion (Japanese staging system [17]); cutoff = lamina subserosa
METAS       yes/no        0: no; 1: yes                   Lymph node metastasis (no. 12, 13 of comp. III [17])
DIFF        1–4           0: 1, 2 (cutoff); 1: 3, 4       Grading
NODOS.GE    6–105         0: ≤42 (median); 1: >42         Total number of removed lymph nodes
uPA         0.02–20.57    0: ≤5.94 (cutoff); 1: >5.94     Urokinase-type plasminogen activator
PAI-1       0.02–264.62   0: <4.13 (cutoff); 1: ≥4.13     Plasminogen activator inhibitor type 1

Table 2  Result of the Analysis of the Stomach Cancer Study Using the Classic Cox Model

                              Type of analysis
                       Univariate               Multivariate
Factor                 e^β      p Value         e^β      p Value
AGE                    1.01     0.41            1.01     0.34
NODOS.PR               50.6     <0.001          19.2     <0.001
T.SUB (5–7 vs. 1–4)    4.7      <0.001          2.7      <0.001
METAS (yes vs. no)     3.9      <0.001          1.3      0.30
DIFF (3,4 vs. 1,2)     1.7      0.02            1.1      0.64
NODOS.GE               1.0      0.48            1.0      0.77
uPA                    1.1      <0.001          1.1      0.11
PAI-1                  1.1      0.004           1.0      0.43

Table 3  ‘‘Optimal’’ Choices for the Fractional Polynomials β(Z) in the Univariate Analysis (Only Continuous Factors Are Included)

Factor      β(Z)                           p(H₀: β(Z) = 0)
AGE         0.01 · Z²                      0.35
NODOS.PR    −0.12 Z⁻¹ + 1.97 Z²            <0.001
NODOS.GE    −113 · Z⁻²                     0.12
uPA         −0.17 Z⁻² + 0.01 Z²            <0.001
PAI-1       −1.99/√Z                       <0.001

Table 4  ‘‘Optimal’’ Choices for the Fractional Polynomials γ(t) Analyzing the Time-Varying Effect of All Dichotomized Factors (Univariate Analysis)

Factor      γ(t)                                    p(H₀: γ(t) = γ₀)
AGE         −1.3 + 4.9/√t                           0.001
NODOS.PR    Constant                                —
T.SUB       1.0 + 0.01 t² − 0.003 t² · ln t         0.04
METAS       1.8 + 84.5/t² − 68.7/t² · ln t          0.01
DIFF        −0.09 − 0.09/t² + 37.8/t² · ln t        0.01
NODOS.GE    0.7 − 0.001 · t²                        0.001
uPA         Constant                                —
PAI-1       Constant                                —

Table 5  Multivariate Analysis: Result of the Stepwise Selection Procedure

Step    Factor       β(Z) or γ(t)*    Likelihood ratio statistic R    p
1       NODOS.PR     β₁(Z)            115                             <0.001
2       T.SUB        γ₂(t)            20                              <0.001
3       AGE          γ₃(t)            13                              0.002
4       NODOS.GE     γ₄(t)            10                              0.007
5       PAI-1        β₅(Z)            7                               0.008
Total                                 165

* β₁(Z) = −0.09/Z + 1.84 · Z²
  γ₂(t) = −0.09 + 0.01 · t² − 0.004 t² · ln t
  γ₃(t) = −0.96 + 4.44/√t
  γ₄(t) = 0.62 − 0.001 · t²
  β₅(Z) = −1.05/√Z

Figure 4  Results of the multivariate analysis: (a) time-varying effects γ(t) for T.SUB, AGE, and NODOS.GE; (b) β(Z) for NODOS.PR (b1) and PAI-1 (b2).

is increased by a value of R = 115. The second factor selected is the local tumor
invasion (T.SUB), showing a strong time-varying effect (R = 20). The influence
of T.SUB increases within the first 3 years and then decreases. The next factor
selected is age, with a dynamic effect (R = 13). Shortly after surgery the older
patients (65 years and older) had a higher mortality rate. But the difference is
declining, and after about 2 years of follow-up the situation changes and the
younger patients seem to have the higher risk.
The fourth factor selected is the total number of lymph nodes removed
(R = 10). There is again a change of the effect over time. The patients with 42
or more lymph nodes removed have the higher risk at the beginning. But about
2 years after surgery, the risk in the group with fewer lymph nodes removed is
increasing.
Finally, the effect of PAI-1 is considered to be important (R = 7). The
influence of PAI-1 is constant over time but the value of PAI-1 seems important.
In contrast to the result of the classical Cox model, an additional effect of
AGE, NODOS.GE, and PAI-1 could be identified. The result can be seen in
Figure 4.
C. CART Method
The most important factor was the percentage of positive lymph nodes. This
factor was first divided into two categories (ⱕ20% and greater) based on clinical
experience (19). Seventy-three patients had more than 20% positive lymph nodes,
of whom 54 have died so far. Among the remaining 222 patients with less than
20% positive lymph nodes, 54 have also died. The analysis of the continuous
factor, percentage of positive lymph nodes, showed that the predefined cut-
point of 20% was close to the optimal cutpoint (Fig. 5). The cutpoint with the
highest value of the test statistic was 12% (χ²_LR = 124.5), followed by 21%
(χ²_LR = 121.3).
In the next step, the subsample with the high mortality rate was further
divided by the same factor using a cutpoint of 70% into a group of 12 patients
where all have died and another group where 42 of 61 have died. In the subsample
with less than 20% positive lymph nodes, the factor T.SUB shows the best dis-
crimination. The next split is performed with uPA. Altogether six subgroups are

identified with the optimal tree after pruning, given in Figure 6. There is a great
difference in the mortality rates, ranging from 13 of 126 to 12 of 12.

Figure 5  Log-rank test statistic to select the optimal cutoff value for the split of the continuous factor NODOS.PR into two groups.

Figure 6  Result of the CART analysis after pruning. At each split, the value of the log-rank test statistic, the total number of patients (= n), and the number of deaths (= +) are given. For each terminal node, additionally the relative risk (RR) compared with the total sample is calculated.

The relative risks given in Figure 6 for each of the terminal nodes represent
the risk compared with the total sample. Therefore, some of the values are below
1 and some are above 1. Four of the six subgroups have a higher mortality rate
compared with the total sample. The two remaining subgroups containing about
half of all patients have a low mortality rate.

IV. DISCUSSION

Within the extended regression model, additionally the nonlinear influence of
PAI-1 and the significant change in the effect of AGE, T.SUB, and NODOS.GE
during follow-up could be detected.
Based on these results, a more detailed prognosis for an individual patient
can be made. A further impact of this analysis can be a different schedule for
follow-up. Patients with an increased risk at the beginning should be medically
examined more frequently shortly after surgery. The result can also be used to
investigate the factors in greater detail. There is a discussion in the literature
regarding adjuvant treatment in gastric cancer (20). Based on the result of CART,
all patients except those in the two groups with the lowest risks (RR = 0.23 and
RR = 0.47) seem to be candidates for some sort of adjuvant therapy.
Fractional polynomials can be applied in connection with standard software
packages like SPSS or SAS. Especially the time-varying effect can be analyzed
in an easy way. The idea is simply to use available estimation methods for time-
dependent covariates after having applied suitable transformations. Therefore at
each observed failure time the transformed value of the covariate, assuming a
time-varying effect, has to be calculated. To simplify the analysis, the form of
the relationship can be estimated in a univariate model. This form is then used
in the multivariate analysis. The next step can be to extend this model to analyze
interactions as well.
On the other hand, developing treatment decisions based on the result of
regression analysis should be met with suspicion. It seems more appropriate to
use the result of CART for the identification of subgroups of patients where
certain strategies should be applied. But in any case it seems justified to use
these extensions of the classic models to get more insight into the data and the
disease.
For the selection of variables in the classic Cox model, a stepwise proce-
dure, either forward or backward, is mostly used. However, the effect of the
variables is sometimes overestimated. In recent years some procedures to correct
these estimates have been proposed. One method is called shrinkage (21). A
shrinkage factor λ (λ < 1), depending on the total number of regression parame-
ters and the likelihood ratio statistics of the particular model, is calculated and
the estimated regression parameters β have to be multiplied by λ to give adequate
values for these parameters. The shrinkage factor can also be estimated by using
bootstrapping or cross-validation (7).
Another approach, recently published by Tibshirani (22), is called lasso.
The idea there is to estimate the parameters β under the constraint ∑|β_j| ≤ c:
the sum of the absolute standardized regression parameters should be less than some
predefined value c. The effect is that some of the factors Z_j, which are only
borderline significant, are ignored in the model. The estimation of β depends
on the choice of c. A small value of c corresponds to a model with only few
parameters.
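As a rough illustration, a lasso-type penalized Cox fit can be obtained with standard software; the sketch below uses lifelines' elastic-net penalty with a pure L1 component, which plays the role of the constraint c (a larger penalty corresponds to a smaller c). It is not the algorithm of reference 22 itself, and the column names are assumptions.

```python
# Sketch only: a lasso-type penalized Cox fit via lifelines' elastic-net
# penalty with a pure L1 component.  Not the method of reference 22 itself;
# column names are assumptions.
from lifelines import CoxPHFitter

def lasso_cox(df, duration_col="time", event_col="event", penalty=0.1):
    covariates = [c for c in df.columns if c not in (duration_col, event_col)]
    d = df.copy()
    # standardize covariates so that the penalty acts on comparable scales
    d[covariates] = (d[covariates] - d[covariates].mean()) / d[covariates].std()
    cph = CoxPHFitter(penalizer=penalty, l1_ratio=1.0)
    cph.fit(d, duration_col=duration_col, event_col=event_col)
    return cph.params_  # some coefficients are shrunk to (near) zero

# betas = lasso_cox(df, penalty=0.2)
```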
Breiman (23) made a proposal on how to improve the prediction using
CART, called ‘‘bagging.’’ The idea is to construct new samples by using boot-
strap techniques and apply the CART method to all samples. For each sample a
394 Ulm et al.

new tree or decision rule is obtained. The result is a whole set of different decision
rules. To classify a new patient, these decision rules have to be applied and the
average has to be calculated. It can be shown that the misclassification rate can
be reduced. The problem of this method, however, is that no simple decision rule
is available.
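The bagging idea can be sketched in a few lines. In the sketch below, fit_rule and the predict method of the object it returns are hypothetical placeholders for any CART-type survival tree implementation; the point is only the bootstrap-and-average structure.

```python
# Sketch only: bagging for survival prediction.  `fit_rule` and the `predict`
# method of the object it returns are hypothetical placeholders for a
# CART-type survival tree; the bootstrap-and-average structure is the point.
import numpy as np

def bagged_risk(df, fit_rule, newdata, n_boot=100, seed=0):
    """Average the predictions of rules fitted to bootstrap samples."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(df), size=len(df))   # bootstrap resample
        rule = fit_rule(df.iloc[idx])                  # e.g., grow and prune a tree
        preds.append(rule.predict(newdata))
    return np.mean(preds, axis=0)   # bagged prediction; no single tree remains
```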
In summary the result of the extended Cox model gives more insight into
the further development of the disease. The result can be used to give a better
information about prognosis and to define the appropriate schedule for further
medical examinations. The CART method helps to identify risk groups and can
be used directly for treatment decisions.

REFERENCES

1. Byar D. Identification of prognostic factors. In: Buyse M, et al., eds. Cancer Clinical
Trials. Oxford Press, 1984.
2. Hermanek P, Henson DE, Hutter RUP, Sobin LH. UICC TNM supplement. A com-
mentary on uniform use. Berlin: Springer, 1993.
3. McGuire WL, Clark GM. Prognostic factors and treatment decisions in axillary-
node-negative breast cancer. N Engl J Med 1992; 326:1756–1761.
4. Siewert JR, Fink U, et al. Gastric cancer. Curr Probl Surg 1997; 34:838–928.
5. Allgayer H, Heiss MM, Schildberg FW. Prognostic factors in gastric cancer. Br J
Surg 1997; 84:1651–1664.
6. Schmidt G, Malik M, Ulm K, et al. Heart rate chronotropy following ventricular
premature beats predicts mortality after acute myocardial infarction. Lancet 1999;
353:1390–1396.
7. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in devel-
oping models, evaluating assumptions and adequacy and measuring and reducing
errors. Stat Med 1996; 15:361–387.
8. Cox DR. Regression models and life tables. J R Stat Soc B 1972; 34:187–220.
9. Hastie T, Tibshirani R. Generalized additive Models. London: Chapman and Hall,
1990.
10. Green P, Silverman B. Nonparametric regression and generalized linear models.
London: Chapman and Hall, 1994.
11. Royston P, Altman DG. Regression using fractional polynomials of continuous
covariates: parsimonious parametric modelling. Appl Stat 1994; 43:429–467.
12. Hastie T, Tibshirani R. Varying-coefficient models. J R Stat Soc B 1993; 55:757–
796.
13. Ulm K, Klinger A, Dannegger F. Identifying and modeling prognostic factors with
censored data. Stat Med 1998.
14. Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees.
New York: Chapman and Hall, 1984.
15. Le Blanc M, Crowley J. Survival trees by goodness of split. J Am Stat Assoc 1993;
88:457–467.
Statistical Methods to Identify Prognostic Factors 395

16. Hilsenbeck SG, Clark GM. Practical p-value adjustment for optimally selected cut-
points. Stat Med 1996; 15:103–112.
17. Nekarda H, Schmitt M, Ulm K, Wenninger A, Vogelsang H, et al. Prognostic impact
of urokinase-type plasminogen activator and its inhibitor PAI-1 in completely re-
sected gastric cancer. Cancer Res 1994; 54:2900–2907.
18. Japanese Research Committee for Gastric Cancer. The general rules for gastric can-
cer study in surgery and pathology. Jpn J Surg 1981; 11:127–138.
19. Roder JD, Bottcher K, Busch R, Wittekind C, Hermanek P, Siewert JR. Classifica-
tion of regional lymph node metastasis from gastric carcinoma. German Gastric Can-
cer Study Group. Cancer 1998; 82:621–631.
20. Bleiberg H, Sahmoud T, Di Leo A, Cunningham D, Rougier P. Adequate number
of patients are needed to evaluate adjuvant treatment in gastric cancer. J Clin Oncol
1998; 16:3714.
21. Van Houwelingen JC, Le Cessie S. Predictive value of statistical models. Stat Med
1990; 9:1303–1325.
22. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B 1996;
58:267–288.
23. Breiman L. Bagging predictors. Machine Learning 1996; 26:123–140.
19
Explained Variation in Proportional
Hazards Regression

John O’Quigley
University of California at San Diego, La Jolla, California

Ronghui Xu
Harvard School of Public Health and Dana-Farber Cancer Institute,
Boston, Massachusetts

I. EXPLAINED VARIATION IN SURVIVAL


A. Motivation
For many survival studies based on the use of a regression model, in addition to
the usual model fitting and diagnostic tools—the evaluation of relative and com-
bined predictive effects—it is also desirable to present summary measures esti-
mating the percentage of explained variation. Making precise the notion of ex-
plained variation in the particular context of proportional hazards regression
requires some thought. But before considering more closely the specifics of the
model, roughly speaking we know that any suitable measure would reflect the
relative importance of the covariates. This relative importance applies to the data
set in hand, but additionally any measure should be estimating some meaningful
population counterpart, a population value that can be given a concrete and intu-
itively useful interpretation.
To give the ideas a more tangible framework, consider a study of 2174
breast cancer patients, followed over a period of 15 years at the Institute Curie
in Paris, France. A large number of potential and known prognostic factors were
recorded. Detailed analyses of these data have been the subject of a number of
communications. Let us suppose that we focus here on a subset of prognostic
factors: age at diagnosis, histology grade, stage, progesterone receptor status, and
tumor size. We would like to be able to say, for example, that stage explains
some 20% of survival but that once we have taken account progesterone status,
age, and grade, this figures drops to 5%. Or that by adding tumor size to a model
in which the main prognostic factors are already included, the explained variation
increases, say, a negligible amount, specifically from 32% to 33%. Or given that
some variable can explain so much variation, then to what extent do we lose (or
gain), in terms of these percentages, by recoding a continuous prognostic variable,
age at diagnosis for example, into discrete classes on the basis of cutpoints? Note
that for this latter problem the models are nonnested and so the problem would
be inherently more involved.

B. Explained Variation in Regression Models


Consider the pair of random variables (T, Z ). Denote the marginal distribution
functions by F(t) and G(z) and the conditional distribution functions by F(t | z)
and G(z| t). A question of interest might relate to the reduction in variance of the
random variable T by conditioning upon Z. The conditional variance of T given
Z translates predictability, for a normal model directly in terms of prediction
intervals and for other models if only by virtue of the Chebyshev inequality.
Quite generally, i.e., independently of any model, we have that

$$\operatorname{Var}(T) = E\{\operatorname{Var}(T \mid Z)\} + \operatorname{Var}\{E(T \mid Z)\} \qquad (1)$$

The above identity enables us to write down an expression for the proportion of
explained variation Ω 2 as

Var(T ) ⫺ E{Var(T| Z )} Var{E(T| Z )}


Ω 2 (T|Z ) ⫽ ⫽ (2)
Var(T ) Var(T )

the notation Ω 2 (T| Z ) reminding us which way round we are conditioning. We


may also be interested in Ω 2 (Z|T ), the two quantities coinciding for bivariate
normal models. The above expression does not lean on any model. When there
is no reduction in variance by conditioning upon Z, then Var(T ) ⫽ E{Var(T| Z )}
and Ω 2 ⫽ 0. When a knowledge of Z makes T deterministic, then Var(T|Z ) ⫽
0 and Ω 2 ⫽ 1. Intermediate values of Ω 2 have a precise interpretation in terms
of percentages of explained variation as a consequence of Eq. (2). Apart from
the marginal variance, Var(T ), the relevant quantity we need to define Ω 2 can
be expressed as
Explained Variation in Proportional Hazards Regression 399

冦 冮 冧
2

E{Var(T|Z )} ⫽ 冮冮ᐆ ᐀
t⫺

tdF (t| z) dF (t | z)dG(z) (3)

This elementary definition is helpful in highlighting two important and related
points: First, the values t and z only enter into the equation as dummy variables
and second, consistent estimates for Ω² will follow if we can consistently estimate
F(t | z) and G(z). Given the pairs of i.i.d. observations {(t_i, z_i); i = 1, . . . , n},
the empirical distribution functions F_n(t), G_n(z), and F_n(t | z), then it is only
necessary to replace F(t), G(z), and F(t | z) in Eq. (2) by F_n(t), G_n(z), and
F_n(t | z), respectively, to obtain R² as a consistent estimator of Ω². It will also be
helpful to consider an equivalent expression for E{Var(T | Z)} arising from a simple
application of Bayes theorem. Instead of Eq. (3) write

E{Var(T | Z)} = ∫_𝒯 ∫_𝒵 { t − [∫ t g(z | u) dF(u)] / [∫ g(z | u) dF(u)] }² dG(z | t) dF(t)        (4)

The above expression can be advantageous in certain estimation contexts, for


example, when we may be able to estimate more readily the conditional distribu-
tion of Z given T rather than the other way around.
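To make the plug-in idea concrete, the following minimal S-Plus/R sketch (simulated, hypothetical data and variable names) computes the empirical analogue of Eq. (2) for uncensored pairs (t_i, z_i) with a discrete covariate, so that the conditional variances can be estimated group by group. With censored survival data this direct route is unavailable, which is what motivates the developments below.

set.seed(1)
z  <- rep(0:2, each = 100)                  # hypothetical three-level covariate
ti <- rexp(300, rate = exp(0.5 * z))        # uncensored times whose rate depends on z

vpop <- function(x) mean((x - mean(x))^2)   # empirical (1/n) variance
pz   <- table(z) / length(z)                # empirical marginal distribution G_n
E.var.T.given.Z <- sum(pz * tapply(ti, z, vpop))   # estimate of E{Var(T | Z)}
R2 <- 1 - E.var.T.given.Z / vpop(ti)               # sample version of Eq. (2)
R2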
In the main we are interested in regression models, the dependence being
expressed via the conditional distribution of one of the variables given the other
and quantified by some parameter β: the larger the value of β in absolute value,
the greater the degree of dependence for any given covariate distribution. Again
we will sometimes wish to make this dependence explicit by writing Ω²(Z | T; β),
or simply Ω²(β) when it is clear which way the regression is being done.
We parameterize our regression model such that the special value β = 0
indicates an absence of association between the variables. The value of β itself
quantifies the strength of regression effect and thereby directly relates to Ω²(β),
the reason for including it as an argument. Typically we would not be interested
in values of Ω²(β) elsewhere than at the true population value of β, but the concept
turns out to be useful. Whereas the actual value of β itself will depend on the
scaling of the covariate, Ω²(β) will be invariant to location and scale changes in
the covariate and, in some sense, represent a standardized measure of strength of
effect lying between 0 and 1.
    To avoid confusion when referring to β as an argument of a function as
opposed to some assumed population value, we may denote the fixed population
value as β₀. Note that the special value β = 0 corresponds to absence of association
between T and Z; it helps the development by making this explicit in that

E{Var(T | Z; β = 0)} = E{Var(T)} = Var(T)        (5)

leading to an expression for Ω²(β₀) in which the role of β is readily understood
from

Ω²(T | Z; β₀) = [E{Var(T | Z; β = 0)} − E{Var(T | Z; β = β₀)}] / E{Var(T | Z; β = 0)} = Var{E(T | Z)} / Var(T)        (6)

One of the variables, most often Z, may have been assigned certain values by
design, and we model the conditional distribution of T given Z. This is the case
with the proportional hazards model, where Z represents the covariate and T the
elapsed time. It may seem more natural to work with Ω²(T | Z); however, this
may not be the way to proceed, the definition Ω²(Z | T) having some advantage
in this context. The reason is outlined in the following section.

C. Schoenfeld Residuals and Explained Variation in


Proportional Hazards Models
Inference in the proportional hazards model remains invariant under monotonic
increasing transformations of the time scale. This is a fundamental property,
expressed via a model including an unknown baseline hazard function. Only the
observed ranks of the failures matter, the actual values of the failure times themselves
having no impact on parameter estimates and their associated variance. It
could be argued that such a property ought to be maintained for an appropriate
Ω²(β) measure and its sample-based estimate R²(β̂). For the definition Ω²(Z | T; β)
it can be seen that failure rank invariance is respected, whereas for the definition
Ω²(T | Z; β) such invariance fails. More importantly, if we wish to accommodate
time-dependent covariates, a fundamental feature of the Cox model, then Ω²(Z | T; β)
can be readily generalized, maintaining an analogous interpretation, whereas
Ω²(T | Z; β) is no longer even well defined.
    It can then be argued that a suitable measure of explained variation for
the Cox model should relate to the predictability of the failure ranks rather than
the actual times: absence of effect should translate as 0%, perfect prediction of
the survival ranks should translate as 100%, and intermediate values should be
interpretable. The measure introduced by O'Quigley and Flandre (1994) comes
under this heading and corresponds to Ω²(Z | T; β). Xu (1996) shows that a reduction
in the conditional variance of Z given T translates as greater predictability of
the failure rankings given Z. These considerations lead to an Ω² of the form
Ω²(Z | T; β) where

Ω²(Z | T; β) = [E{Var(Z(t) | T = t; β = 0)} − E{Var(Z(t) | T = t; β)}] / E{Var(Z(t) | T = t; β = 0)}        (7)

When talking about proportional hazards regression, this form is assumed unless
indicated otherwise. We therefore suppress the notation Z | T in the definition of
Ω², although the dependence on β may be indicated. We can write the above as

Ω²(β) = 1 − [∫ E_β{[Z(t) − E_β(Z(t) | t)]² | t} dF(t)] / [∫ E_β{[Z(t) − E_0(Z(t) | t)]² | t} dF(t)]        (8)
where E_β denotes expectation assuming the model to be true at the value β and
E_0 is the expected value under the null model. We return to this below but note
that if we can consistently estimate all the quantities in Eq. (8), then our problem
is solved. As it turns out (O'Quigley and Flandre 1994), the usual Schoenfeld
residuals play a key role here, and it can be shown (Xu 1996) that

R²(β̂) = 1 − [Σ_{δ_i=1} r_i²(β̂) w_i] / [Σ_{δ_i=1} r_i²(0) w_i]        (9)

where the r_i(⋅) are the Schoenfeld (1982) residuals, evaluated at β̂ and 0, is consistent
for Ω²(β). In this expression the weights w_i are the decrements in the marginal
Kaplan-Meier estimate. In many practical cases, ignoring the w_i by equating them
all to one may have little impact, thereby providing a particularly simple expression
in terms of the Schoenfeld residuals. These residuals are typically an ingredient
of any standard analysis.
    For ordinary linear regression, one minus the usual R² is the ratio of the
average of the squared residuals to the average squared deviation about the
overall mean. The estimate R² here is of the same form, in which the squared
residuals of linear regression are replaced by the squared Schoenfeld residuals
and where the squared deviations about the overall mean are replaced by the
squared deviations of Z about the mean values of Z sequentially conditional on
the risk sets. In the absence of censoring, the quantity Σ_{i=1}^n r_i²(β̂)/n can be viewed
as the average discrepancy between the observed covariate and its expected value
under the model, whereas Σ_{i=1}^n r_i²(0)/n can be viewed as the average discrepancy
without a model. Censoring is dealt with by correctly weighting the squared residuals.
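As a concrete illustration of Eq. (9) in its simple unweighted form (all w_i set to one), the following S-Plus/R sketch uses the survival package. The data frame dat and its column names are hypothetical, and holding the coefficient at zero via iter.max = 0 is merely one convenient device for obtaining the null residuals r_i(0).

library(survival)
## dat: hypothetical data frame with columns time, status (1 = failure), z
fit  <- coxph(Surv(time, status) ~ z, data = dat)
fit0 <- coxph(Surv(time, status) ~ z, data = dat,
              init = 0, control = coxph.control(iter.max = 0))
r.hat  <- residuals(fit,  type = "schoenfeld")   # r_i(beta-hat), one per failure
r.null <- residuals(fit0, type = "schoenfeld")   # r_i(0)
R2 <- 1 - sum(r.hat^2) / sum(r.null^2)           # Eq. (9) with w_i = 1
R2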

II. ESTIMATION UNDER PROPORTIONAL HAZARDS


A. Model and Notation
Let T_1, T_2, . . . , T_n be the failure times and C_1, C_2, . . . , C_n be the censoring
times for the individuals i = 1, 2, . . . , n. For each i we observe X_i = min(T_i, C_i)
and δ_i = I(T_i ≤ C_i), where I(⋅) is the indicator function. Define the ''at risk''
indicator Y_i(t) = I(X_i ≥ t). We also use the counting process notation: let
N_i(t) = I{T_i ≤ t, T_i ≤ C_i} and N(t) = Σ_{i=1}^n N_i(t). The left-continuous version of
the Kaplan-Meier estimate of survival is denoted Ŝ(t) and the Kaplan-Meier estimate
of the distribution function by F̂(t) = 1 − Ŝ(t). Usually we are interested in the
situation where each subject has related covariates, or explanatory variables, Z_i
(i = 1, 2, . . . , n). All the results given here hold for an independent censorship
model. Mostly, for ease of exposition, we assume the covariate Z to be one dimensional.
Z in general could be time dependent, in which case it is assumed to be a
predictable stochastic process and we will use the notation Z(t), Z_i(t), etc.
    The Cox (1972) proportional hazards model assumes that the hazard function
λ_i(t) (i = 1, . . . , n) for individuals with different covariates, Z_i(t), can be written

λ_i(t) = λ_0(t) exp{βZ_i(t)}        (10)

where λ_0(t) is a fixed unknown ''baseline'' hazard function and β is a relative
risk parameter to be estimated.

B. Basis for Inference


First some basic definitions. Let

π_i(β, t) = K(t) Y_i(t) exp{βZ_i(t)}        (11)

where K⁻¹(t) = Σ_{ℓ=1}^n Y_ℓ(t) exp{βZ_ℓ(t)}. Under the model, π_i(β, t) is exactly the
conditional probability that at time t it is precisely individual i who is selected
to fail, given all the individuals at risk and given that one failure occurs. Let 𝒵(t)
be a step function of t with discontinuities at the points X_i, i = 1, . . . , n, at
which the function takes the value Z_i(X_i). Also, for fixed t, define the expectation
of Z(t) under the distribution {π_i(β, t)}_{i=1}^n by

ε_β(Z | t) = Σ_{ℓ=1}^n Z_ℓ(t) π_ℓ(β, t) = Σ_{ℓ=1}^n K(t) Y_ℓ(t) Z_ℓ(t) exp{βZ_ℓ(t)}        (12)

Statistical inference on β is usually carried out by maximizing Cox's (1975)
partial likelihood which is equivalent to obtaining the value of β satisfying

U_1(β) = ∫ {𝒵(t) − ε_β(Z | t)} dN(t) = 0        (13)

An alternative to the partial likelihood estimator, useful when β may not be constant
with time, is given by the solution to (Xu and O'Quigley 1998)

U_2(β) = ∫ W(t){𝒵(t) − ε_β(Z | t)} dN(t) = 0        (14)

where W(t) = Ŝ(t){Σ_{i=1}^n Y_i(t)}⁻¹. Our purpose here is consistent estimation of
Ω² and not robust estimation of β, but the two estimates of Ω² of the following
section are related in a way not dissimilar to the relationship between the above
two estimators. For practical calculation note that W(X_i) = F̂(X_i+) − F̂(X_i) = w_i
at each observed failure time X_i, i.e., the jump of the Kaplan-Meier curve. This
is also of theoretical interest since under departures from proportional hazards, an
estimate based on U_2(β) has a solid interpretation as average effect, whereas the
estimate based on U_1(β) cannot be interpreted in the presence of censoring (Xu
1996, Xu and O'Quigley 1998). This can be anticipated from the definition of
W(X_i) whereby we can write

U_2(β) = ∫ {𝒵(t) − ε_β(Z | t)} dF̂(t) = 0        (15)
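As a sketch of the practical calculation just described (hypothetical data frame dat; for simplicity we assume no tied failure times), the weights w_i are simply the Kaplan-Meier increments at the observed failure times:

library(survival)
km     <- survfit(Surv(time, status) ~ 1, data = dat)
S.left <- c(1, head(km$surv, -1))       # left-continuous S-hat
jump   <- S.left - km$surv              # increment of F-hat = 1 - S-hat at each distinct time
w      <- jump[match(dat$time[dat$status == 1], km$time)]   # w_i = W(X_i)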

C. Estimating Ω²

Our basic task is accomplished in this section via a main theorem and a series
of corollaries. The proofs are not given here. They can be found in Xu (1996),
where proofs of the statements of the following section can also be found.

Theorem 1  Under model (10), an independent censoring mechanism, and where
β̂ is any consistent estimate of β, the conditional distribution function of Z(t)
given T = t is consistently estimated by

F̂(z | t) = P̂(Z(t) ≤ z | T = t) = Σ_{ℓ: Z_ℓ(t) ≤ z} π_ℓ(β̂, t)

Corollary 1  Defining

ε_β(Z^k | t) = Σ_{i=1}^n Z_i^k(t) π_i(β, t) = Σ_{i=1}^n K(t) Y_i(t) Z_i^k(t) exp{βZ_i(t)},   k = 1, 2, . . .        (16)

then ε_β̂(Z^k | t) provide consistent estimates of E_β(Z^k(t) | T = t), under the model.
In addition we have the following two results. Let

𝒥(β, b) = ∫_0^∞ W(t) ε_β{[Z(t) − ε_b(Z(t) | t)]² | t} dN(t)        (17)

then

Corollary 2  𝒥(β, β) converges in probability to

∫ E_β{[Z(t) − E_β(Z(t) | t)]² | t} dF(t)        (18)



Corollary 3  𝒥(β, 0) converges in probability to

∫ E_β{[Z(t) − E_0(Z(t) | t)]² | t} dF(t)        (19)

Theorem 2  Define

R²_ε(β) = 1 − 𝒥(β, β) / 𝒥(β, 0)        (20)

Then R²_ε(β̂) converges in probability to Ω²(β) in Eq. (8).

Theorem 3  Let ℐ(b) for b = 0, β be defined by

ℐ(b) = Σ_{i=1}^n ∫_0^∞ W(t){Z_i(t) − ε_b(Z | t)}² dN_i(t)        (21)

then

R²(β̂) = 1 − ℐ(β̂) / ℐ(0)        (22)

is a consistent estimate of Ω²(β₀) in Eq. (8).


Notice that the above defined R 2 (β̂) is the same as Eq. (9). Our experi-
ence has been that when the proportional hazards model correctly generates the
data, R 2ε will be very close in value to R 2. Indeed, we can show that under the
model, | R 2 (β̂) ⫺ R 2ε (β̂) | converges to zero in probability. When discrepancies
arise, this would seem to be indicative of a failure in model assumptions. Al-
though R 2ε (β̂) is of interest in its own right, our main purpose for studying it has
been to develop certain statistical properties and for providing a simple way to
construct confidence intervals for the population quantity Ω 2 (β). The coefficients
R 2 and R 2ε and the population counterpart Ω 2 have a number of useful properties.
We have R 2 (0) ⫽ 0, R 2ε (0) ⫽ 0, R 2ε (β) ⱕ 1, and R 2 (β) ⱕ 1. Although R 2ε and Ω 2
are nonnegative, we cannot guarantee the same for R 2. This would nonetheless
be unusual corresponding to the case in which the best-fitting model, in a least-
squares sense, provides a poorer fit than the null model. Our experience is that
R 2 (β̂) will only be slightly negative in finite samples if β̂ is very close to zero.
Both R 2 and R 2ε are invariant under linear transformations of Z and monotonically
increasing transformations of T. Viewed as a function of β, R 2 (β) reaches its
maximum close to β̂. In contrast, R 2ε (β) → 1 as | β | → ∞ and, as a function of
β, R 2ε (β) increases monotonically with |β|. This last property (as well as all the
stated properties of R 2ε ) also applies to Ω 2 (β) and enables us to construct confi-
dence intervals for Ω 2 using that of β, as illustrated in the example. Finally, we
can show that R 2 (β̂) and R 2ε (β̂) are asymptotically normal.

III. SUMS OF SQUARES INTERPRETATION

We have the following sums of squares decomposition for R²_ε(β):

ε_β{[Z − ε_0(Z | X_i)]² | X_i} = ε_β{[Z − ε_β(Z | X_i)]² | X_i} + {ε_β(Z | X_i) − ε_0(Z | X_i)}²        (23)

on the basis of which we can rearrange Eq. (20) so that

R²_ε(β) = Σ_{i=1}^n δ_i W(X_i){ε_β(Z | X_i) − ε_0(Z | X_i)}² / Σ_{i=1}^n δ_i W(X_i) ε_β{[Z − ε_0(Z | X_i)]² | X_i}        (24)

Furthermore, we can take Σ_{i=1}^n δ_i W(X_i) r_i²(β) to be a residual sum of squares
analogous to those from linear regression, whereas Σ_{i=1}^n δ_i W(X_i) r_i²(0) corresponds
to the total sum of squares. Then

Σ_{i=1}^n δ_i W(X_i){Z_i(X_i) − ε_0(Z | X_i)}²
   = Σ_{i=1}^n δ_i W(X_i){Z_i(X_i) − ε_β(Z | X_i)}² + Σ_{i=1}^n δ_i W(X_i){ε_β(Z | X_i) − ε_0(Z | X_i)}²
   + 2 Σ_{i=1}^n δ_i W(X_i){ε_β(Z | X_i) − ε_0(Z | X_i)}{Z_i(X_i) − ε_β(Z | X_i)}

Now the last term in the above is a weighted score that according to Proposition
1 of Xu (1996) is asymptotically zero with β = β̂. So defining

SS_tot = Σ_{i=1}^n δ_i W(X_i) r_i²(0)

SS_res = Σ_{i=1}^n δ_i W(X_i) r_i²(β̂)

SS_reg = Σ_{i=1}^n δ_i W(X_i){ε_β̂(Z | X_i) − ε_0(Z | X_i)}²

we obtain an asymptotic decomposition of the total sum of squares into the residual
sum of squares and the regression sum of squares, i.e.,

SS_tot = SS_res + SS_reg        (25)

holds asymptotically. So R² is asymptotically equivalent to the ratio of the regression
sum of squares to the total sum of squares.
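Continuing the sketch of Sec. I.C (hypothetical data frame dat; the weights W(X_i) taken equal to one), the sums of squares above can be formed directly from the two sets of Schoenfeld residuals, using the identity r_i(0) − r_i(β̂) = ε_β̂(Z | X_i) − ε_0(Z | X_i):

library(survival)
fit  <- coxph(Surv(time, status) ~ z, data = dat)
fit0 <- coxph(Surv(time, status) ~ z, data = dat,
              init = 0, control = coxph.control(iter.max = 0))
r.hat  <- residuals(fit,  type = "schoenfeld")   # Z_i(X_i) - eps_betahat(Z | X_i)
r.null <- residuals(fit0, type = "schoenfeld")   # Z_i(X_i) - eps_0(Z | X_i)
SS.tot <- sum(r.null^2)
SS.res <- sum(r.hat^2)
SS.reg <- sum((r.null - r.hat)^2)
c(SS.tot, SS.res + SS.reg)    # approximately equal, as in Eq. (25)
1 - SS.res / SS.tot           # R^2, asymptotically SS.reg / SS.tot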

IV. MULTIVARIATE EXTENSION

Most often we are interested in explanatory variables Z of dimension greater
than 1. Common classes of regression models consider the impact of a linear
combination η = β′Z on T, where β is a vector of the same dimension as Z and
a′b denotes the usual inner product of a with b. For the multivariate normal
model F(t | Z = z) and F(t | η = β′z) are the same, and so it is only necessary to
consider η and not the actual values of z themselves. For other models, we may
not have such a result, but in as much as we consider the effect as essentially
being summarized via β′z, it makes sense to consider the multiple coefficient of
explained variation, known as the coefficient of determination in linear regression,
as Ω²(T | η; β). Such a quantity would not be invariant to monotonic transformations
upon T and, assuming we deem this a requirement, then we should consider
Ω²(η | T; β). Everything now follows through exactly as for the univariate case
in which we work with the fitted Schoenfeld residuals and the null residuals.
The exact linear combination we would use for η is of course unknown, and in
practice we replace the vector β by the vector β̂. The multiple coefficient is then

R²(β̂) = 1 − [Σ_{δ_i=1} {β̂′r_i(β̂)}² w_i] / [Σ_{δ_i=1} {β̂′r_i(0)}² w_i]        (26)
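With several covariates the Schoenfeld residuals form a matrix with one column per covariate, and Eq. (26) simply projects them onto β̂. A hedged sketch (hypothetical data frame dat with covariates z1 and z2; w_i again taken as one):

library(survival)
fit  <- coxph(Surv(time, status) ~ z1 + z2, data = dat)
fit0 <- coxph(Surv(time, status) ~ z1 + z2, data = dat,
              init = c(0, 0), control = coxph.control(iter.max = 0))
b      <- coef(fit)
e.hat  <- as.matrix(residuals(fit,  type = "schoenfeld")) %*% b   # beta-hat' r_i(beta-hat)
e.null <- as.matrix(residuals(fit0, type = "schoenfeld")) %*% b   # beta-hat' r_i(0)
R2.multi <- 1 - sum(e.hat^2) / sum(e.null^2)                      # Eq. (26) with w_i = 1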

V. OTHER SUGGESTED MEASURES

There have been other suggestions for suitable measures of explained variation
under the proportional hazards model. The earliest suggestions date back to Har-
rell (1986). His measure depends heavily on censoring, and this effectively rules
it out for practical use. Kent and O’Quigley (1988) developed a measure based
on the Kullback-Leibler information gain, and this could be interpreted as the
proportion of randomness explained in the observed survival times by the covari-
ates. The principal difficulty with Kent and O'Quigley's measure was its complexity
of calculation, although a very simple approximation was suggested and appeared
to work well. The Kent and O'Quigley measure was not able to accommodate
time-dependent covariates. Korn and Simon (1990) suggested a class of potential
functionals of interest, such as the conditional median, and evaluated the ex-
plained variation via an appropriate distance measuring the ratio of average dis-
persions with the model to those without a model. Their measures are not invari-
ant to time transformation nor could they accommodate time-dependent
covariates. They have some advantage in generality in being applicable to a much
wider class of models than the proportional hazards one. Schemper (1990,1994)
introduced the concept of individual survival curves for each subject, with the
model and without the model. Interpretation is difficult. As with the Harrell measure,
the Schemper measures depend on censoring, even when the censoring
mechanism is independent of the failure mechanism. Other measures have also
been proposed in the literature (see, e.g., Schemper and Stare 1996), but it is not
our intention here to give a complete review of them. Schemper and Kaider (1997)
suggested multiple imputation as a way to deal with censoring. This is a promising
idea that requires further study to establish the statistical properties of the
approach. Intuitively it appears that such an approach would come under the
heading of providing estimators for the relevant population quantities of Section
I. The unavailable empirical estimators are replaced by estimators derived from
an iterative algorithm. Currently this appears somewhat ad hoc, the population
model not being referred to in the work of Schemper and Kaider (1997). However,
it seems quite likely that the approach is consistent; further work is needed to
demonstrate this and to establish under what conditions it holds.
An alternative to the information gain measure of Kent and O'Quigley
(1988), similar in spirit but leaning on Theorem 1 and the conditional distribution
of Z given T rather than the other way around, leads to a coefficient with good
properties (Xu and O'Quigley 1999). In practical examples this coefficient based
on information gain appears to give close agreement with the R² measure discussed
here. There are in fact theoretical reasons for this agreement, and assuming
the data do not strongly contradict the proportional hazards assumption, we
anticipate the two coefficients to be close to one another.

VI. ILLUSTRATION

We illustrate the basic ideas on the well-known Freireich (1963) data, which
record the remission times of 42 patients with acute leukemia treated by 6-
mercaptopurine (6-MP) or placebo. These data were used in the original Cox (1972)
paper and have been studied by many other authors under the proportional hazards
model. Our estimate of the regression coefficient is β̂ = 1.53, with R²(β̂) = 0.386
and R²_ε(β̂) = 0.371. The 95% confidence interval for Ω²(β) obtained using the
monotonicity of Ω²(β), i.e., plugging the two end points of the interval for β
into R²_ε(⋅), is (0.106, 0.628). A bootstrap with 1000 replications, using Efron's
bias-corrected accelerated (BCa) method, gives confidence intervals of (0.111, 0.631)
using R² and (0.103, 0.614) using R²_ε. These are in very good agreement with the
interval obtained through monotonicity. For practical use in calculating confidence
intervals for Ω²(β), we recommend the ''plug-in'' method, which is the
most computationally efficient.
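A bootstrap interval of the kind described above can be mimicked as in the following sketch with the boot package; the data frame, variable names, and the use of the simple unweighted R² of Eq. (9) are illustrative simplifications, so the result should not be expected to reproduce the numbers quoted here.

library(survival); library(boot)
r2.stat <- function(d, idx) {
  b  <- d[idx, ]                                   # bootstrap resample
  f1 <- coxph(Surv(time, status) ~ z, data = b)
  f0 <- coxph(Surv(time, status) ~ z, data = b,
              init = 0, control = coxph.control(iter.max = 0))
  1 - sum(residuals(f1, type = "schoenfeld")^2) /
      sum(residuals(f0, type = "schoenfeld")^2)
}
out <- boot(dat, r2.stat, R = 1000)
boot.ci(out, type = "bca")                         # Efron's BCa interval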
    The above R²(β̂) can be compared with some of the suggestions of the
previous section. For the same data the measure proposed by Kent and O'Quigley
(1988) resulted in the value 0.37. The explained variation proposals of Schemper

(1990), based on empirical survival functions per subject, resulted in (his notation)
V_1 = 0.20 and V_2 = 0.19, and Schemper's later correction (1994) resulted
in V_2 = 0.29. There is no obvious link between the Schemper measures and those
presented here since the Schemper measures depend on an independent censoring
mechanism. The measure of Korn and Simon (1990), based on quadratic loss,
gave the value 0.32. Although there is some comfort to be gained from a value
not dissimilar to that obtained above, again there appear to be no good grounds
for investigating a potential association between the measures. This is because
their measure, unlike the measure suggested here and the partial likelihood estimator
itself, does not remain invariant to monotone increasing transformations of
time. For these data the value 0.32 drops to 0.29 if the failure times are replaced
by the square roots of the times. The Korn and Simon measure is most useful
when the time variable provides more information than just an ordering, whereas
rank ordering is the only assumption we need for the measure presented in this
chapter. Finally, the measure of Schemper and Kaider (1997) is calculated to be
0.34, and the measure of Xu and O'Quigley (1999) turns out to be 0.40.

REFERENCES

Cox DR. Regression models and life tables (with discussion). J R Stat Soc B 1972; 34:
187–220.
Cox DR. Partial likelihood. Biometrika 1975; 62:269–276.
Freireich EJ. The effect of 6-mercaptopurine on the duration of steroid-induced remission
in acute leukemia. Blood 1963; 21:699–716.
Harrell FE. The PHGLM Procedure, SAS Supplement Library User’s Guide, Version 5,
Cary, NC: SAS Institute Inc., 1986.
Kent JT, O’Quigley J. Measure of dependence for censored survival data. Biometrika
1988; 75:525–534.
Korn EL, Simon R. Measures of explained variation for survival data. Stat Med 1990; 9:
487–503.
O’Quigley J, Flandre P. Predictive capability of proportional hazards regression. Proc Natl
Acad Sci USA 1994; 91:2310–2314.
Schemper M. The explained variation in proportional hazards regression. Biometrika
1990; 77:216–218.
Schemper M. Correction: the explained variation in proportional hazards regression. Bio-
metrika 1994; 81:631.
Schemper M, Kaider A. A new approach to estimate correlation coefficients in the pres-
ence of censoring and proportional hazards. Comput Stat Data Anal 1997; 23:467–
476.
Schemper M, Stare J. Explained variation in survival analysis. Stat Med 1996; 15:1999–
2012.

Schoenfeld DA. Partial residuals for the proportional hazards regression model. Biome-
trika 1982; 69:239–241.
Xu R. Inference for the Proportional Hazards Model. Ph.D. thesis, University of Califor-
nia, San Diego, 1996.
Xu R, O’Quigley J. Estimating average log relative risk under nonproportional hazards.
ASA 1998 Proceedings of the Biometrics Section, 216–221.
Xu R, O'Quigley J. An R²-type measure of dependence for proportional hazards models.
Nonparam Stat 1999; 12:83–107.
20
Graphical Methods for Evaluating
Covariate Effects in the Cox Model

Peter F. Thall and Elihu H. Estey


University of Texas M.D. Anderson Cancer Center, Houston, Texas

I. INTRODUCTION

In medicine, patient characteristics often have profound effects on prognosis. For


example, in oncology a patient’s age, extent of disease, or the presence of a
particular cytogenetic or molecular abnormality typically have substantial effects
on his or her survival. When comparing the effects of two or more treatments
on patient outcome, a fundamental scientific problem is that apparent treatment
differences may result not from the inherent superiority of one particular treat-
ment over another but rather from differences between the patients in the treat-
ment groups. This observation has led to use of the randomized clinical trial to
ensure that groups given different treatments are on average similar with regard
to characteristics that may be related to response (‘‘covariates’’). Although ran-
domization is an essential tool in comparative treatment evaluation, it cannot
guarantee that treatment groups are perfectly balanced with regard to all variables
that may be related to outcome. This is especially true in small (e.g., <200-patient)
randomized trials. In the more common and problematic setting where treat-
ment comparisons are based on data from separate trials, as when evaluating data


from two or more single-arm phase II trials of different treatments, the potential
for the effects of unbalanced covariates to confound actual treatment differences
is much greater. Therefore, the use of statistical methods to account for variables
that may influence patient outcome is critically important in evaluating both ran-
domized and nonrandomized clinical trials. Although unobserved ‘‘latent’’ ef-
fects are also an important consideration when combining data from multiple
clinical centers or separate trials, we do not address this issue here. For the in-
terested reader, treatments of this problem are given by Li and Begg (2) and
Stangl (3).
Accounting for individual patient characteristics when evaluating treatment
effects entails some form of statistical regression analysis. The Cox regression
model (1) is the most widely used tool for evaluating the relationship between
covariates and time-to-event treatment outcomes, such as survival time or
disease-free survival (DFS) time. Unfortunately, the assumptions underlying the
Cox model are often violated in practice. In particular, many published results
in the medical literature are based on fitted models for which no goodness-of-fit
analysis has been performed. If such model criticism is not done and if the qualita-
tive relationship between a covariate and patient outcome is different from that
assumed by a particular model, then the statistical estimate of the covariate’s
effect under the fitted model may greatly misrepresent its actual effect. When
this is the case, apparent covariate effects and treatment effects obtained from a
fitted Cox model may be substantively misleading.
The purpose of this chapter is to illustrate some statistical methods for
assessing goodness-of-fit under the Cox model and also for correcting poor
model fit. Formal descriptions of the methods are given by Therneau et al. (4),
Grambsch and Therneau (5), Grambsch (6), and in Chapter 4 of the important
book by Fleming and Harrington (7). An excellent, albeit somewhat more mathe-
matical, explanation of the type of methods discussed here is given in Chapter
4.6 of Fleming and Harrington. We do not attempt to discuss all existing methods
for assessing goodness-of-fit of the Cox model, since the current literature is quite
extensive. Some earlier references are Crowley and Hu (8), Crowley and Storer
(9), Kay (10), Schoenfeld (11), Cain and Lange (12), and Harrell (13). Our goal
is to discuss and illustrate by example some useful graphical displays and statisti-
cal tests in terms that can be understood by physicians or other nonmathematical
readers. The graphical methods illustrate qualitative relationships between covari-
ates and outcome that are not otherwise apparent, and they also lead to use of
the extended Cox model, which allows the possibility of covariate effects that
vary with time, to obtain an improved model fit. Because these methods provide
more accurate and reliable evaluation of covariate and treatment effects on patient
outcome, their application often leads in turn to profound changes in the substan-
tive inferences formed from a particular data set. These techniques are straightfor-

ward to implement using freely available computer programs in either Splus or


SAS (13,14). Our goal is to bring these methods into more widespread use, espe-
cially in the analysis of medical data.
We illustrate the methods using three data sets arising from clinical trials
in acute myelogenous leukemia (AML) and myelodysplastic syndromes (MDS)
conducted at M.D. Anderson Cancer Center: a data set arising from several phase
II trials of combination chemotherapies each involving fludarabine (16), where
415 of 530 patients had events (died or relapsed), with a 26-week median DFS
time; a data set for which the effects of two proteins, caspase 2 (C_2) and caspase
3 (C_3), on survival were evaluated (17), where 116 of the 185 good-prognosis
patients had events with a median DFS time of 82 weeks; and a data set arising
from a four-arm randomized trial designed to evaluate the effects of all-trans
retinoic acid (ATRA) and the growth factor granulocyte colony-stimulating factor
on survival (18), where 139 of 215 patients died with a median survival time of
28 weeks. We refer to these as the ‘‘fludarabine,’’ ‘‘caspase,’’ and ‘‘ATRA’’
data sets. For one example we also use simulated data having specific properties.
Most of our examples deal with the relationship between a single covariate
and survival or DFS time. However, we also discuss how properly modeling
covariate effects may affect treatment effect estimates in a multivariate model
(Sect. VII) and also the use of conditional survival plots to assess interactions
between two covariates (Sect. XII). The methods apply quite generally to any
time-to-event outcome subject to right censoring, as is the case with patients who
have not suffered the event in question when the trial is analyzed.

II. COX REGRESSION MODEL

Consider the common problem of assessing the relationships between each of a
collection of covariates Z = (Z_1, . . . , Z_k) and the elapsed time T from a ''baseline,''
usually defined as the time of diagnosis or initiation of treatment, to a particular
event such as relapse or death. The covariates typically include one or more
indicator variables denoting treatments given to patients. Each covariate may or
may not be of value in predicting T, and those covariates that are predictive
typically differ substantially in their qualitative relationships and strength of
association with T. In addition, for some patients the value of T may be right censored
in that T is not observed but rather is known only to be no smaller than a censoring
time, usually due to the fact that the study ended without the patient experiencing
the event. Many models and methods deal with this type of data (7,19). By far
the most commonly used methods are based on the Cox regression model (1). The
Cox model assumes that the instantaneous hazard λ(t; Z) of the event occurring at
time t from baseline for a patient with covariates Z takes the form

λ(t; Z) = λ_0(t) exp(β_1 Z_1 + . . . + β_k Z_k), where λ_0(t) is an underlying baseline hazard
function not depending on the covariates and β_1, . . . , β_k are unknown parameters
quantifying the covariate effects. The expression β_1 Z_1 + . . . + β_k Z_k, known
as the linear component of the model, is typically the main focus of a Cox model
analysis since the β_j's quantify the covariate effects. Due to the fact that
β_1 Z_1 + . . . + β_k Z_k = log_e{λ(t; Z)/λ_0(t)}, the covariates are said to have a log
linear effect on the hazard of the event. If a particular β_j = 0, then the covariate
Z_j has no effect on the hazard. If one defines the binary indicator variable Z_A = 1
for treatment group A and Z_A = 0 for treatment B and includes β_A Z_A in the
linear component, then exp(β_A) is the relative risk or hazard ratio of the event
for a patient given treatment A compared with B, regardless of the patient's other
covariates. For this reason, the Cox model is also called the proportional hazards
model. A relative risk of 1 corresponds to the case where the risk of the event
is the same with the two treatments. Numerical values β_A < 0, β_A = 0, and
β_A > 0 correspond to relative risks below 1, equal to 1, and above 1, respectively.
Thus, a value of β_A significantly less (greater) than 0 is the basis for
inferring that A is superior (inferior) to B. Two crucial assumptions underlying
the Cox model are that the covariates have a log linear effect on the hazard of
the event and that the value of each β_j does not vary with time. Thus, for example,
the relative risk exp(β_A) associated with treatment A vis a vis treatment B is the
same at any time t.
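As a small illustration of this parameterization (a sketch only; the data frame dat and the treatment indicator zA are hypothetical), the estimate of β_A and the corresponding relative risk exp(β_A) can be read directly from standard output:

library(survival)
fit <- coxph(Surv(time, status) ~ zA, data = dat)   # zA = 1 for treatment A, 0 for B
summary(fit)        # coef estimates beta_A; exp(coef) is the relative risk of A versus B
exp(confint(fit))   # 95% confidence interval for the hazard ratio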

III. GOODNESS-OF-FIT

The Cox model has proved to be an extremely useful statistical tool for evaluating
covariate effects on events that occur over time. When the underlying model
assumptions are not met in that the model does not fit the data well, however,
the sort of analysis described above may be invalid. Since no model can be per-
fectly correct, the practical question is whether a given statistical model provides
a reasonable fit to the data at hand. Consequently, practical application of any
statistical model should include some form of data-driven model criticism, com-
monly known as goodness-of-fit analysis. Ideally, model criticism should also
include consideration of fits obtained with other similar data sets. We do not
pursue this issue here, however, since it involves notions of Bayesian inference,
cross-validation, and meta-analysis that go far beyond the present discussion. The
point is that use of a statistical model without some goodness-of-fit assessment
is bad scientific practice, since it may easily produce flawed or substantively
misleading inferences. It is this danger, given the widespread use of the Cox
model to analyze medical data, that has motivated this chapter.

IV. MARTINGALE RESIDUAL PLOTS

In linear regression analysis the residuals are the differences, one for each patient,
between the observed outcome variable and the value of that variable predicted
by the fitted regression model. These ‘‘observed minus predicted’’ values may
be used to assess how well the regression model fits the data. A wide variety of
methods for residual analyses is discussed in the statistical literature, and each
method applies to a particular type of regression model (e.g., linear, logistic,
Cox). The martingale residuals associated with a fitted Cox model are the analog
of ordinary residuals associated with a linear regression model. Specifically, mar-
tingale residuals are numerical values, one for each patient in the data set, that
quantify the excess risk of the event not explained by the model. A large positive
(negative) martingale residual r_m for a patient corresponds to a fitted model that
underestimates (overestimates) the risk of the event for that patient. This fact
may be exploited to assess goodness-of-fit in terms of a martingale residual plot,
obtained by plotting a point for each patient that corresponds to the patient's
value of r_m on the vertical axis and the patient's value for the particular covariate,
Z, on the horizontal axis. The martingale residuals for this plot are computed
from a Cox model that includes only a baseline hazard function but no covariates,
so that r_m essentially adjusts the patients' observed times for censoring. This
produces a scattergram of points, as shown in Figure 1. Applying a local regression
smoother (14,20) to create a line through the scattergram then allows one
to examine visually the nature of the relationship between r_m and the covariate
Z. Subsequently, after a model with an appropriately transformed version of Z
has been fit, the plot on Z of the residual r_m based on this new model should show
no pattern other than random noise. In most applications, the pattern revealed by
the smoother is impossible to determine by visual inspection of the scattergram
alone. For larger data sets, say with 1000 patients or more, a plot of the smoothed
line alone may be visually clearer since the scattergram of points tends to over-
whelm the picture. Smoothed scattergrams may be constructed easily using sev-
eral widely available statistical software packages, including Splus (13,14,21).
If Z satisfies the proportional hazards assumption (i.e., if it has a log linear effect
on the hazard), then aside from random variation the smoothed line will be
straight. Nonlinear patterns correspond to violation of the proportional hazards
assumption, and in this case the shape of the smoothed line indicates the relation-
ship between outcome and Z that should produce a good fit. Thus, a simple,
routine way to fit Cox models is to first examine the martingale residual plot for
each covariate Z, if necessary fit a new model incorporating this relationship,
examine a residual plot for the new model, and repeat this process until one
obtains a good fit.
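The following sketch shows one way to produce such a plot in S-Plus/R with the survival package, using a hypothetical data frame dat containing a follow-up time, an event indicator, and a covariate wbc:

library(survival)
fit0 <- coxph(Surv(time, status) ~ 1, data = dat)   # baseline hazard only, no covariates
mres <- residuals(fit0, type = "martingale")        # one martingale residual per patient
plot(dat$wbc, mres, xlab = "WBC", ylab = "Martingale residual")
lines(lowess(dat$wbc, mres), lwd = 2)               # lowess smooth through the scattergram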

Figure 1 Martingale residual scatterplot on white blood cell count, from the caspase
data.

V. TIME-VARYING COVARIATE EFFECTS

Time-to-event data often deviate from the usual Cox model in that the effect of
a given covariate may vary over time. The extended Cox model allowing one or
more of the β_j's to vary with time has hazard function of the form
λ(t; Z) = λ_0(t) exp[β_1(t) Z_1 + . . . + β_k(t) Z_k]. Under this extended model, the log
linear effect and corresponding risk associated with Z_j at time t are given by β_j(t) Z_j
and exp[β_j(t) Z_j], respectively. An application where this extension often is appropriate
is in evaluating the effect of baseline performance status (PS) on survival
in acute leukemia, since the risk of regimen-related death for patients with poor
PS decreases once they survive chemotherapy and hence β_PS(t) may become
closer to 0 as t increases beyond the early period, including treatment. A related
extension is that which allows covariates to be evaluated repeatedly over time
rather than being recorded only at baseline (t = 0), so that Z_j(t) denotes the value
of the jth covariate at time t and the hazard function takes the extended form
λ(t; Z) = λ_0(t) exp[β_1 Z_1(t) + . . . + β_k Z_k(t)]. These two extensions are computationally
very similar, although we do not explore this point here.

When the effect of a covariate Z varies over time, it is essential to assess


the form of β(t) to determine how Z actually affects patient survival. A graphical
method for doing this, similar to the martingale residual plot, is the Grambsch-
Therneau-Schoenfeld (GTS) residual plot (5,11), also known as a scaled Schoen-
feld residual plot. A smoothed GTS plot provides a picture of β(t) as a function
of t. This plot has an accompanying statistical test, due to Grambsch and Therneau
(5), of whether β varies with time versus the null hypothesis that it is constant
over time; hence, it tests whether the ordinary Cox model is appropriate. This
test is very general in that for each of several transformations of the time axis,
it takes the form of a particular goodness-of-fit test for the Cox model.
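A minimal sketch of how the GTS plot and the accompanying test can be obtained with Therneau's cox.zph routine, assuming a hypothetical data frame dat:

library(survival)
fit <- coxph(Surv(time, status) ~ age + ps, data = dat)
zph <- cox.zph(fit, transform = "identity")   # untransformed time scale
zph                                           # Grambsch-Therneau test, per covariate and global
plot(zph[1])                                  # smoothed GTS plot of beta(t) for the first covariate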
We illustrate these methods for assessing goodness-of-fit by example, with
emphasis on the graphical displays described above. Most of our examples sim-
plify things by focusing on the effects of one covariate for the sake of illustration.
In practice, each application described here would be followed by fitting and
evaluating multivariate models incorporating whatever forms are determined in
the univariate fits.

VI. A COVARIATE WITH NO PROGNOSTIC VALUE

We begin with an illustration of what a martingale residual plot looks like for a
covariate that is of no value for predicting outcome. Figure 1 is a martingale
residual plot from the caspase data. We plotted r_m on white blood count (WBC)
for each patient, which produced the scattergram of points, and ran a local
weighted regression (‘‘lowess’’) smoother (18) through the points to obtain the
solid line. Note that very few patients have very large WBC values; consequently,
most of the points in the scattergram are forced into a small area in the left portion
of the figure, whereas the right portion of the smoothed line is very sensitive to
the locations of a few points. A simple way to deal with this common problem
is to truncate the WBC domain by excluding a small number of patients with
large WBC values. This produces Figure 2, which gives a clearer picture of the
true relationship between WBC and DFS for most of the data points. In each
illustration given below, we similarly truncate the domain of the covariate as
appropriate. Another method for dealing with a scattergram in which most of the
points occupy a small portion of the plot is to transform the covariate, for exam-
ple, by replacing Z with log(Z ). Aside from random fluctuations, the smoothed
line in Figure 2 is straight, suggesting that the assumption of a log linear hazard
is appropriate. A very important point in interpreting these graphs is that one
may see patterns in any plot if one stares at it long enough; hence, a statistical
test should accompany any graphical method to determine whether an apparent
pattern is real. Here, the Cox model assumptions are reasonable since the p value
of a Grambsch-Therneau goodness-of-fit test is p = 0.42. There is no relationship

Figure 2 Martingale residual scatterplot as in Figure 1, but with right truncation of the
white blood count domain.

between WBC and DFS since p = 0.83 for a test of β_WBC = 0 versus the alterna-
tive hypothesis β_WBC ≠ 0 under the usual Cox model with β_WBC WBC as its linear
component. Thus, WBC is of no value for predicting DFS in this data set.

VII. A QUADRATIC EFFECT

The following example illustrates both the importance of including relevant pa-
tient covariates when evaluating treatment effects and the importance of using
goodness-of-fit analyses to properly model covariate effects. A fit of the usual
Cox model with linear component β_ATRA ATRA to the ATRA data, summarized
as Model 1 in Table 1, yields a test of the hypothesis β_ATRA = 0 versus β_ATRA ≠
0 having p value 0.055. The estimated relative risk exp(−0.329) = 0.72 seems
to imply that ATRA reduces the risk of death or relapse in this patient group
and that this reduction is both statistically and medically significant. Since base-
line platelet count (platelets) often has a significant effect on DFS in treatment
of hematologic diseases, it also seems reasonable to include platelets in the model.

Table 1  Platelets, Cytogenetics, and the ATRA Effect

Model  Covariate    Estimated coefficient   SE      p            Model LR (df)
1      ATRA         −0.329                  0.172   0.055        3.69 (1)
2      ATRA         −0.297                  0.172   0.085        13.1 (2)
       Platelets    −0.299                  0.105   0.005
3      ATRA         −0.231                  0.173   0.18         24.4 (3)
       Platelets    −0.594                  0.134   9.0 × 10⁻⁶
       Platelets²    0.170                  0.044   1.1 × 10⁻⁴
4      ATRA         −0.204                  0.174   0.24         29.4 (4)
       Platelets    −0.546                  0.136   5.7 × 10⁻⁵
       Platelets²    0.163                  0.045   2.8 × 10⁻⁴
       m5m7          0.415                  0.181   0.022

The resulting fitted model is summarized as Model 2 in Table 1. This fit indicates
that a higher platelet count is a highly significant predictor of better DFS and
that after accounting for platelets the ATRA effect is still marginally significant
with p = 0.085. Moreover, this model provides a better overall fit since its likelihood
ratio (LR) statistic is 13.1 on 2 degrees of freedom (df), p = 0.0014, compared
with only 3.69 on 1 df, p = 0.055, for the model including the ATRA
effect alone.
    A closer analysis leads to rather different conclusions. The smoothed martingale
residual plot given in Figure 3 indicates that the risk of relapse or death
initially decreases as platelet count increases but that as platelet count rises above
roughly 150 × 10³, the risk of an event begins to increase. Thus, the plot suggests
that either a very low or very high platelet count is prognostically disadvantageous.
In particular, the relationship between platelet count and DFS is not log
linear, as assumed when the standard Cox model is fit, but rather appears to be
log parabolic. A Cox model that includes both platelets and platelets² as covariates
would account for the parabolic shape suggested by Figure 3. This is done by
fitting a Cox model with linear component β_1 ATRA + β_2 platelets + β_3 platelets²,
summarized as Model 3 in Table 1. Since the hypothesis β_3 = 0 reduces the log
parabolic model to the simpler log linear model, the test of this hypothesis addresses
the question of whether the bend in the line in Figure 3 is significant or
is merely an artifact of random variation in the data. The p value 1.1 × 10⁻⁴
corresponding to this test indicates that the log parabolic model is indeed appropriate.
It is worth noting that the question of whether β_2 = 0 under this model
is essentially irrelevant, and in general it is standard practice to include the lower

Figure 3 Martingale residual scatterplot on platelets, from the ATRA data.

order term Z whenever Z² has a significant coefficient. The martingale residual
plot based on the fitted Model 3, given in Figure 4, shows a pattern consistent
with random noise, indicating that the parabolic model has adequately described
the relationship between platelets and DFS. This is underscored by the LR test
for the overall fit of Model 3 (LR = 24.4 on 3 df, p = 2.0 × 10⁻⁵), which
shows quantitatively that Model 3 provides a substantially better fit than Model
2. Perhaps most importantly, now that platelet count has been modeled properly
the ATRA effect is no longer even marginally significant (p = 0.18). We take
this analysis one step further by adding to the linear component the indicator,
m5m7, that the patient has the cytogenetic abnormality characterized by the loss
of specific regions of the fifth and seventh chromosomes. This is summarized as
Model 4 in Table 1, which shows that m5m7 is a significant predictor of worse
DFS and that the inclusion of this covariate further reduces the prognostic significance
of ATRA (p = 0.24).
This example illustrates the common scientific phenomenon that an appar-
ently significant treatment effect may disappear entirely once patient prognostic
covariates are properly accounted for. It is also notable how easily the defect in
the log linear model was revealed and corrected. A final point is that there are

Figure 4 Martingale residual scatterplot on platelets, based on fitted model including


a parabolic model for platelets.

a number of models other than a parabola that may describe a curved line; we
chose a parabolic function because it is simple and achieves the goal of providing
a reasonable fit to the data.
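A sketch of the kind of fit labelled Model 3 in Table 1, with hypothetical variable names; the quadratic term is entered with I( ), and the residual plot from the refitted model corresponds to what is shown in Figure 4:

library(survival)
fit3 <- coxph(Surv(time, status) ~ atra + platelets + I(platelets^2), data = dat)
summary(fit3)                  # the test for I(platelets^2) addresses the bend in Figure 3
mres <- residuals(fit3, type = "martingale")
plot(dat$platelets, mres)      # should now resemble random noise
lines(lowess(dat$platelets, mres), lwd = 2)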

VIII. CUTPOINTS

A common practice in the medical literature is to dichotomize a numerical-valued


variable Z, such as white count or platelet count, by replacing it with a binary
indicator variable I_c = 0 if Z ≤ c and I_c = 1 if Z > c for some cutpoint c. The
cutpoint may be determined in various ways. One common practice is to set c
equal to the mean or median of Z. Another is to use the ''optimal'' cutpoint
which gives the smallest p value, among all possible cutpoints, for the test of
β_c = 0 under the model with linear term β_c I_c. One consequence of this practice
is that in evaluating the statistical analyses of two published studies where different
''optimal'' cutpoints were used to define I_c and the studies concluded, respectively,
that ''Z had a significant effect on survival'' and ''Z did not have a signifi-

cant effect on survival,’’ it is impossible to determine whether the conflicting


conclusions were due to a phenomenological difference between the two studies
or were simply artifacts of random variation in the data manifested in application
of the optimal cutpoint method.
The use of a model containing I_c in place of Z is appropriate only when it
describes the actual relationship between Z and the outcome. An illustration of
this is provided by the effect of hemoglobin on DFS in the caspase data set,
illustrated by the martingale residual plot given in Figure 5. This plot shows that
for a cutpoint c located somewhere between 6 and 8, patients having baseline
hemoglobin above c have a higher risk of relapse or death. That is, there is a
''threshold effect'' of hemoglobin on DFS. A search over values of c in this
range yields the optimal cutpoint 7, and a fit of the Cox model with linear term
β_HG I_HG, where I_HG = 1 if hemoglobin > 7 and 0 if hemoglobin ≤ 7, yields a
p value of 0.004 for the test of β_HG = 0 under this model. A more appropriate
test, which accounts for the fact that multiple preliminary tests were conducted to
locate the optimal cutpoint (22), yields the corrected p value 0.032. Other meth-
ods for correcting p values to account for an optimal cutpoint search have been
given by Altman et al. (23) and Faraggi and Simon (24).

Figure 5 Martingale residual scatterplot on hemoglobin, from the caspase data.



Unfortunately, in most cases where a cutpoint model is used the relationship


between Z and outcome simply is not of the form given by Figure 5. In such
cases, the practice of replacing a continuous variable Z with a binary indicator
I_c is just plain wrong. Figure 6 is a martingale residual plot obtained by simulating
an artificial covariate Z according to a standard normal distribution, so that Z is
''white noise'' and, as is apparent from Figure 6, has no relationship whatsoever
to the actual patient outcome data. The optimal cutpoint c = 0.623, chosen because
it gave the lowest p value among all possible values for c, produces a Cox
model with linear component β_c I_c under which the uncorrected p value for the
test of β_c = 0 is 0.018. This appears to show that the binary variable I_c = 1 if
Z ≥ 0.623 and I_c = 0 if Z < 0.623 is a ''significant'' predictor. This anomaly,
that white noise produces an apparently significant predictor based on the optimal
cutpoint, arises because many tests of hypotheses were conducted to determine the
optimal cutpoint; hence, the final ''significant'' test is merely an artifact of this
multiple testing procedure. The properly adjusted p value that accounts for this
is 0.370, which correctly reflects the facts that the cutpoint is artificial and that
Z is nothing more than noise. More fundamentally, because Figure 6 does not
exhibit the sharp vertical rise that indicates an actual threshold effect, it is inap-
propriate to fit a cutpoint model for Z to the data.
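The anomaly is easy to reproduce by simulation, as in the following sketch (all names and numbers are illustrative and are not those underlying Figure 6):

library(survival)
set.seed(7)
n      <- 200
time   <- rexp(n)                        # outcome unrelated to z
status <- rbinom(n, 1, 0.7)
z      <- rnorm(n)                       # white-noise covariate
cuts   <- quantile(z, seq(0.1, 0.9, by = 0.05))
pvals  <- sapply(cuts, function(cp) {
  f <- coxph(Surv(time, status) ~ I(z > cp))
  summary(f)$coefficients[1, "Pr(>|z|)"]
})
min(pvals)   # frequently ''significant'' despite z carrying no information;
             # a correction for the implicit multiple testing (22-24) is essential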

Figure 6 Martingale residual scatterplot on white noise.



IX. PLATEAU EFFECT

It is well known that the presence of an antecedent hematologic disorder (AHD)


is prognostically unfavorable in AML. We have previously considered an AHD
‘‘present’’ if a documented abnormality in blood exceeded 1 month in duration
before diagnosis of AML (25). This reduces the duration Z_AHD of an AHD to the
binary indicator variable I_AHD = 1 if Z_AHD ≥ 1 month and I_AHD = 0 if Z_AHD < 1
month. Others have used a cutpoint of 3 months rather than 1 month. A martingale
residual plot on Z_AHD for the fludarabine data is given in Figure 7, with the lowess
smooth given by the solid line and a parametric fit, described below, by the
dashed line. Figure 7 indicates that the risk of relapse or death increases sharply
for values of Z_AHD from 0 up to roughly 10 to 20 months and then stabilizes at
a constant level for larger values. Due to the plateau in the smoothed curve, the
relationship between Z_AHD and DFS is neither linear nor quadratic. There are
many functions that describe this pattern. We used the parametric function
minimum{β_1 log(Z_AHD + 0.5), β_2 arctan(Z_AHD)}, illustrated by the dashed line, and
this provides a reasonable fit that agrees with the lowess smooth. The estimates
0.204 for β_1 and 0.442 for β_2 each have p values < 10⁻⁸. While I_AHD, the lowess

Figure 7 Martingale residual scatterplot on duration of antecedent hematological disor-


der, from the fludarabine data.

function, and the parametric function each are highly significant predictors of
DFS with the associated p < 10⁻⁷ in each case, Figure 7 indicates that the cutpoint
model, I_AHD, only approximates either the lowess or the parametric function.

X. PARABOLIC TIME-VARYING EFFECT

As noted earlier, the usual Cox model assumes that covariate effects are constant
over time. For example, under the Cox model the hazard associated with an age
of 60 years at diagnosis is the same at either 1 month or 5 years after treatment.
This assumption frequently is not verified despite the fact that in some situations
it might appear tenuous. For example, in chemotherapy of hematologic malignan-
cies, the clinician might suspect that older patients are at greater risk of death
occurring during the first few weeks of therapy rather than later on.
A fit of the ordinary Cox model with linear term β_AGE AGE to the fludarabine
data yields an estimate of β_AGE equal to 0.0175 with p < 0.001. This indicates
that, for example, at any time after the start of treatment a 60-year-old patient has
twice the risk of death or relapse compared with a 20-year-old patient, since
exp[(60 − 20) × 0.0175] ≈ 2. Figure 8 is the smoothed GTS plot on age, including

Figure 8 Grambsch-Therneau-Schoenfeld residual scatterplot on age, from the fludar-


abine data.

a 95% confidence band for the graphical estimate of β_AGE(t). This plot and the
associated test were produced with the ''cox.zph'' computer subroutine of Therneau
(14) using the ''identity'' (untransformed) timescale. In Figure 8, as previously,
the horizontal line at β_AGE = 0 corresponds to age having no effect. Dots
above the horizontal line at β_AGE = 0 indicate an excess of deaths, whereas dots
below the line indicate a deficit of deaths. Whereas the effect of age would be
represented by a horizontal line at 0.0175 under the ordinary Cox model, the
smoothed GTS plot indicates that β_AGE(t) may be a parabolic function of t. Thus,
the GTS plot indicates that the proportional hazards assumption may not be appropriate.
A Grambsch-Therneau test (5) of the hypothesis that the data are compatible
with the proportional hazards assumption has p = 0.007, confirming the
graphical results. In particular, the effect of age on DFS cannot be quantified
adequately by the single estimate 0.0175 noted above. The GTS plot suggests
that the effect of age on the risk of relapse or death may be described by the
quadratic function (β_1 + β_2 t + β_3 t²) Z_AGE. The respective estimates of β_1, β_2, and
β_3 under this extended Cox model are 0.0150, −3.62 × 10⁻⁴, and 2.45 × 10⁻⁶,
respectively, with p < 10⁻⁶, 1.2 × 10⁻⁴, and 0.014, indicating that age really has
a parabolic time-varying effect.
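One way to fit a coefficient of the form (β_1 + β_2 t + β_3 t²) Z_AGE in current versions of the survival package is through the time-transform (tt) facility; the sketch below uses hypothetical variable names and is not the authors' original computation:

library(survival)
fit.tv <- coxph(Surv(time, status) ~ age + tt(age), data = dat,
                tt = function(x, t, ...) cbind(x * t, x * t^2))
summary(fit.tv)   # coefficients of age, tt(age)1, tt(age)2 estimate beta_1, beta_2, beta_3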

XI. NONLINEAR TIME-VARYING EFFECT WITH PLATEAU

The GTS plot is especially valuable when a time-dependent covariate effect is


not described easily by a parametric function. This is the case for pretreatment
Zubrod PS in the fludarabine data set. The ordinary Cox model fit with linear
term β_PS Z_PS gives an estimate for β_PS of 0.417 with p < 0.001. The GTS plot
based on PS, given in Figure 9, shows that the effect of poor PS decreases during
the first 3 months and then reaches a small but nonzero plateau thereafter. The
grouping of points into rows is characteristic of GTS plots for variables taking
on a small number of values. Here, PS takes the possible values 0, 1, 2, 3, and 4,
which correspond respectively to the five groups of points in the plot. The
Grambsch-Therneau goodness-of-fit test has p < 10⁻⁷, indicating that the proportional
hazards assumption is untenable.

XII. CONDITIONAL KAPLAN-MEIER PLOTS

A useful method for assessing covariate effects on survival or DFS is to construct a set of conditional Kaplan-Meier (KM) survival plots. These conditional plots, or "coplots," may be used to make inferences without resorting to any parametric model fit or conventional test of hypothesis, although in practice we have found it most useful to apply the graphical and model-based methods together. A general discussion of conditional plots, in the context of analyzing trivariate data, is given by Cleveland (26). An application of coplots to the caspase data is given in Figure 10, which is reproduced from Estrov et al. (17).

Figure 10 Conditional KM plots for varying C2 and C3 values, based on three nonoverlapping intervals for each covariate.

The purpose of this figure is to provide a visual representation of the joint effects of C2 and C3 on survival. This figure was constructed as follows. Each of the nine plots in Figure 10 is a usual KM plot, constructed from a particular subset of the data. Moving from left to right, the three columns correspond to the lowest one third, middle one third, and upper one third of the C2 values, which happen to be C2 ≤ 0.69, 0.69 < C2 ≤ 1.07, and 1.07 < C2 for this data set. Similarly, going from bottom to top, the three rows correspond to the lowest, middle, and upper thirds of the C3 sample values, which are C3 ≤ 1.09, 1.09 < C3 ≤ 1.57, and 1.57 < C3 for these data. Thus, for example, the KM plot in the center of the top row is constructed from the data of the 21 patients having intermediate C2 values (0.69 < C2 ≤ 1.07) and high C3 values (1.57 < C3). Each plot is thus the usual KM estimate of the survival probability curve but conditional on the patient having C2 and C3 values in their specified ranges. Thus, for example, moving from left to right along the bottom row shows how survival changes with increasing C2 given that C3 ≤ 1.09. The most striking message conveyed by this matrix of KM coplots is that survival is very poor for patients having both high C2 and high C3. It is important to bear in mind that the particular numerical cutoffs of the subintervals here are specific to this data set, so that any patterns revealed by the plots may hold generally for a similar data set but the particular numerical values very likely will differ. This sort of interactive effect, manifested on a particular subdomain of the two-dimensional set of (C2, C3) values, would not be revealed by the conventional approach of fitting a Cox model with linear component including the terms β2 C2 + β3 C3 + β23 C2 C3, since this parametric model assumes that the multiplicative interaction term β23 C2 C3 is in effect over the entire domain of both covariates. The coplots clearly show that this is not the case.
A slightly different way to construct this type of plot is to allow the adjacent subintervals of each covariate to overlap, in order to provide a smoother visual transition. Figure 11 is obtained by first defining subintervals in the domain of C2, each of which contains half of the C2 data, with the first interval running from the minimum to the 3/6th percentile, the second from the 1/6th to the 4/6th percentile, the third from the 2/6th to the 5/6th percentile, and the fourth from the 3/6th percentile to the maximum. Thus, adjacent intervals share 1/3rd of the C2 data. Four subintervals of C3 are defined similarly. The particular numerical values of the C2 and C3 interval end points are given along the bottom and left side of Figure 11. Another advantage of using intervals that overlap is that the sample size for each KM plot is larger than if the intervals are disjoint.

Figure 11 Conditional KM plots for varying C2 and C3 values, using overlapping intervals for each covariate.

Scanning each of the lower two rows from left to right shows that for lower values of C3, survival improves with increasing C2. This pattern changes slightly at the end of the third row, where survival seems to level off, and markedly in the top row, where survival drops as both C2 and C3 become large. The relatively poor survival shown in the upper right corner KM plot is notable in that this plot is based on 59 patients, comprising 27% of the sample values, whereas the considerably more striking drop in the upper right corner plot of Figure 10 is based on a more extreme subsample of 31 patients.

XIII. DISCUSSION

The importance of prognostic factor analyses in clinical research is widely acknowledged. In this chapter, we illustrate some general problems with these analyses as they are often conducted, and we describe some graphical methods and tests to address these problems. The aim of these methods is to determine the true relationship between one or more covariates and patient outcome in a particular data set. When a covariate is modeled incorrectly, which is the case if its
effect on patient risk is not log linear as assumed under the usual Cox model,
evaluation of the covariate’s effect may be misleading. In particular, dichotomiza-
tion of a numerical variable by use of a cutpoint without first determining the
actual form of the covariate’s effect typically leads to loss of information and in
many cases is completely wrong. Use of the optimum cutpoint often leads to
spurious inferences arising from nothing more than random variation in the data.
Our examples illustrate that these problems may be addressed easily and effec-
tively by the combined use of martingale residual plots, statistical tests, and trans-
formation of covariates as appropriate. We also provide examples of covariates
having effects that change over time, along with methods for revealing and for-
mally evaluating such time-varying effects. The use of these methods helps to
prevent flawed inferences from being drawn, as may happen with the typical
approach of fitting a Cox or logistic regression model without performing any
goodness-of-fit analyses. Finally, model-based regression analyses may be aug-
mented or even avoided entirely by the use of conditional KM plots.
An important caveat to keep in mind when interpreting the results of any
regression analysis is that the model-fitting process, including graphical methods
and tests based on intermediate models, is not accounted for by p values obtained
using conventional methods based only on the final fitted model. Formally, the
p value of any such final test should be adjusted for the process that produced
the model. This is due to the fact that both the model-fitting process and the final
tests are based on the same data set. For example, the adjusted p value computa-
tion required to test properly for an optimal cutpoint, noted in Section 8, recog-
nizes this problem. In fact, it applies more broadly to the entire model-fitting
process. This consideration leads to notions of cross-validation and bootstrapping,
which we do not pursue here. A basic reference is Efron and Tibshirani (27).
The practical point is that due to random variation, a particular regression model
fit to a given data set is not likely to provide as good a fit to another data set
based on a similar experiment.

Like medical research, statistical research is constantly evolving. Powerful
new techniques for modeling and analyzing data are currently becoming available
at an ever increasing rate. Difficulties in implementing graphical methods, such
as those described here, have decreased dramatically due to the widespread avail-
ability of high-speed computing platforms and flexible statistical software pack-
ages. These methods are of value to medical researchers for at least three reasons.
First, they suggest new directions for medical research. Why, for example, is a
platelet count above 200,000 associated with inferior DFS in patients with AML
or MDS (Fig. 3)? Certainly, this finding may be illusory, but the p value of 0.0001
for the quadratic term in Model 3 of Table 1 suggests otherwise. Is the sharp
drop in survival for high levels of both C 2 and C 3 (Fig. 10 and 11) due to a real
biological phenomenon, or is it merely an artifact of random variation? Second,
graphical methods provide a powerful means to determine if and how the Cox
model assumptions are violated, they lead quite easily to corrected models, and
they are perhaps the best method available for communicating the results of a
regression analysis to nonstatistical colleagues. Finally, because the methods of-
ten provide a greatly improved fit of the statistical model to the data, in turn they
provide more reliable inferences regarding covariate and treatment effects on
patient outcome.

ACKNOWLEDGMENT

We are grateful to Terry Therneau for his thoughtful comments on an earlier
draft of this manuscript.

REFERENCES

1. Cox DR. Regression models and life tables (with discussion). J Royal Stat Soc B
1972; 34:187–220.
2. Li Z, Begg CB. Random effects models for combining results from controlled and
uncontrolled studies in meta-analysis. J Am Stat Assoc 1994; 89:1523–1527.
3. Stangl DK. Modelling and decision making using Bayesian hierarchical models. Stat
Med 1995; 14:2173–2190.
4. Therneau TM, Grambsch PM, Fleming TR. Martingale-based residuals for survival
models. Biometrika 1990; 77:147–160.
5. Grambsch PM, Therneau TM. Proportional hazards tests and diagnostics based on
weighted residuals. Biometrika 1994; 81:515–526.
6. Grambsch PM. Goodness-of-fit diagnostics for proportional hazards regression mod-
els. In: Thall PF, ed. Recent Advances in Clinical Trial Design and Analysis. Boston:
Kluwer, 1995, pp. 95–112.

7. Fleming TR, Harrington DP. Counting Processes and Survival Analysis. New York:
Wiley, 1991.
8. Crowley JJ, Hu M. Covariance analysis of heart transplant survival data. J Am Stat
Assoc 1977; 72:27–36.
9. Crowley JJ, Storer BE. Comment on "A reanalysis of the Stanford heart transplant
data" by M. Aitkin, N. Laird and B. Francis. J Am Stat Assoc 1983; 78:277–281.
10. Kay R. Proportional hazards regression models and the analysis of censored survival
data. Applied Statistics 1977; 26:227–237.
11. Schoenfeld D. Chi-squared goodness-of-fit tests for the proportional hazards regres-
sion model. Biometrika 1980; 67:145–153.
12. Cain KC, Lange NT. Approximate case influence for the proportional hazards regres-
sion model with censored data. Biometrics 1984; 40:493–499.
13. Harrell FE. The PHGLM procedure. SAS Supplemental User's Guide, Version 5.
Cary, NC: SAS Institute, Inc. 1986.
14. Therneau TM. A Package for Survival in S. Mayo Foundation. 1995.
15. Harrell FE. Predicting Outcomes. Applied Survival Analysis and Logistic Regres-
sion. Charlottesville: University of Virginia. 1997.
16. Estey EH, Thall PF, Beran M, Kantarjian H, Pierce S, Keating M. Effect of diagnosis
(RAEB, RAEB-t, or AML) on outcome of AML-type chemotherapy. Blood 1997;
90:2969–2977.
17. Estrov Z, Thall PF, Talpaz M, Estey EH, Kantarjian HM, Andreeff M, Harris D,
Van Q, Walterscheid M, Kornblau S. Caspase 2 and caspase 3 protein levels as
predictors of survival in acute myelogenous leukemia. Blood 1998; 92:3090–3097.
18. Estey EH, Thall PF, Pierce S. Randomized phase II study of fludarabine + cytosine
arabinoside + idarubicin ± all trans retinoic acid ± granulocyte-colony stimulating
factor in poor prognosis newly diagnosed acute myeloid leukemia and myelodysplas-
tic syndrome. Blood 1999; 93:2478–2484.
19. Gentleman R, Crowley J. Local full likelihood estimation for the proportional haz-
ards model. Biometrics 1991; 47:1283–1296.
20. Cleveland WS. Robust locally-weighted regression and smoothing scatterplots. J
Am Stat Assoc 1979; 74:829–836.
21. Becker RA, Chambers JM, Wilks AR. The New S Language. Pacific Grove, CA:
Wadsworth, 1988.
22. Hilsenbeck SG. Practical p-value adjustments for optimally selected cutpoints. Stat
Med 1996; 15:103–112.
23. Altman DG, Lausen B, Sauerbrei W, Schumacher M. The dangers of using ‘‘opti-
mal’’ cutpoints in evaluation of prognostic factors. J Natl Cancer Inst 1994; 86:829–
835.
24. Faraggi D, Simon R. A simulation study of cross-validation for selecting an optimal
cutpoint in univariate survival analysis. Stat Med 1996; 15:2203–2214.
25. Estey EH, Thall PF, et al. Use of G-CSF before, during and after fludarabine + ara-
C induction therapy of newly diagnosed AML or MDS: comparison with fludara-
bine + ara-C without G-CSF. J Clin Oncol 1994; 12:671–678.
26. Cleveland WS. Visualizing Data. Summit, NJ: Hobart Press, 1993.
27. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman and
Hall, 1993.
21
Graphical Approaches to Exploring the Effects of Prognostic Factors on Survival

Peter D. Sasieni and Angela Winnett*
Imperial Cancer Research Fund, London, England
*Current affiliation: Imperial College School of Medicine, London, England.

I. INTRODUCTION

In this chapter we are interested in exploratory data analysis rather than precise
inference from a randomized controlled clinical trial. Although the quality of
data collection and follow-up is important, there is no need to have a randomized
trial to study prognostic factors; large series of patients receiving standard therapy
are important sources of information. Recent interest in molecular and genetic
markers create additional problems for the data analyst, but these are mostly
concerned with multiple testing and test reliability. They will not be discussed
in this chapter. We are concerned with methods appropriate for analyzing a small
number of prognostic factors.
The methods described in this chapter are illustrated using data from the
Medical Research Council’s fourth and fifth Myelomatosis Trials (1). A total of

*Current affiliation: Imperial College School of Medicine, London, England.

433
434
Table 1 Summary of Prognostic Variables

Variable Units Min. 1st quartile Median 3rd quartile Max.

Continuous variables
Age 10 years 3 5.7 6.3 6.9 8.1
log2 (sβ2m) log(mg/l) ⫺1.7 2.1 2.7 3.4 6.3
log2 (serum creatinine) log(mM) 5.1 6.5 6.9 7.5 11.0

Variable Description Freq. 1 Freq. 0

Indicator variables
ABCM 1 for trial 5 with ABCM, 0 otherwise 277 736
Cuzick index int. or poor 1 for Cuzick index ‘‘intermediate’’ or ‘‘poor,’’ 0 otherwise 758 255
Cuzick index poor 1 for Cuzick index ‘‘poor,’’ 0 otherwise 191 822

Sasieni and Winnett


Prognostic Factors and Survival 435

1013 patients are included, of whom 821 had died by the time the data set was
compiled and 192 were censored. Survival times range from 1 day to over 7
years. The patients in the fourth trial received treatment of either intermittent
courses of melphalan and prednisone (MP) or MP with vincristine given on the
first day of each course. In the fifth trial patients received either intermittent oral
melphalan (M7), or adriamycin, 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU),
cyclophosphamide, and melphalan (ABCM). A number of prognostic variables
were recorded—age, serum β2 microglobulin (sβ2m), and serum creatinine. Also
prognostic groups were defined according to the Cuzick index, which is based
on blood urea concentration, hemoglobin, and clinical performance status (2). No
significant differences were found between survival of patients with the different
treatments in trial 4 (3) or between survival of patients in trial 4 and patients
with M7 treatment in trial 5, so these three groups were pooled together. Loga-
rithms (to base 2) were used for sβ2m and serum creatinine since otherwise they
are very skewed. Age in years was divided by 10 so that the difference in the
interquartile range of each continuous variable was close to 1 (between 1 and
1.3). A summary of the variables is given in Table 1.

II. LOGISTIC REGRESSION

Clinically one may be interested in short-term (1 year), medium-term (5 year), and long-term (10 year) survival. In the absence of censoring, one could use three logistic models to examine the effect of various potentially prognostic factors on each end point. An advantage of this approach is the ease with which the results can be presented and interpreted. Not only can the importance of individual factors be evaluated in a multivariate model, but a prognostic score can be developed to quantify the probability that an individual with a given profile will survive to each of the three time points. A disadvantage of fitting three logistic models is that although any patient who survives 5 years must have survived 1 year, the models are not linked and estimation of the conditional probability of survival to 5 years given that the patient is alive at 1 year is not straightforward. Instead, one might consider a single model for the ordered multinomial end points: "early death" (<1 year), "medium death" (1–5 years), "late death" (5–10 years), or "long-term survivor" (>10 years). Although such models are attractive, it is rare that one has uncensored long-term follow-up, so they have their limitations and survival analysis models are required. However, short-term survival is often uncensored and logistic regression is a useful and underused technique for exploring the role of prognostic factors in such situations.
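As a concrete sketch of this approach, a short-term end point such as death within 6 months can be analyzed with an ordinary binomial GLM. The code below assumes a hypothetical data frame myeloma containing 0/1 indicators death6 and death2y and the covariates summarized in Table 1 (age10 = age/10, lsb2m = log2(sβ2m), lcreat = log2(serum creatinine), abcm, cuzick_ip, cuzick_p); it is not the analysis code used for the trials.

  ## Logistic model for death within 6 months (hypothetical variable names)
  fit6 <- glm(death6 ~ age10 + lsb2m + lcreat + abcm + cuzick_ip + cuzick_p,
              family = binomial, data = myeloma)

  ## Odds ratios with Wald 95% confidence intervals (estimate +/- 1.96 SE)
  est <- coef(summary(fit6))
  or  <- exp(est[, "Estimate"])
  ci  <- exp(cbind(est[, "Estimate"] - 1.96 * est[, "Std. Error"],
                   est[, "Estimate"] + 1.96 * est[, "Std. Error"]))
  round(cbind(OR = or, lower = ci[, 1], upper = ci[, 2]), 2)

  ## Conditional analysis: death by 2 years among patients alive at 6 months
  fit2y <- update(fit6, death2y ~ ., data = subset(myeloma, death6 == 0))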

Example None of the survival times in the myeloma data were censored at less than 2 years, whereas 454 of the deaths occurred in the first 2 years and 199 in the first 6 months. Therefore, logistic models were used to study the effects of the prognostic variables on the probability of death in the first 6 months and on the probability of death in the first 2 years. The results of the logistic regressions are in Table 2. Confidence intervals were calculated based on ±1.96 × standard error of the coefficients.

Table 2 Multivariate Logistic Regression Using Three End Points Compared with Longer Survival

                                                 Death within
                            0–6 months            0–2 years             6 months–2 years
Covariate                   O.R.  (95% CI)        O.R.  (95% CI)        O.R.  (95% CI)
Age (per 10 years)          1.38  (1.11–1.71)     1.19  (1.02–1.39)     1.08  (0.90–1.29)
log2(sβ2m)                  1.62  (1.31–2.00)     1.68  (1.41–2.00)     1.56  (1.28–1.90)
log2(serum creatinine)      1.20  (0.94–1.54)     1.06  (0.85–1.33)     0.94  (0.72–1.23)
ABCM                        0.74  (0.50–1.09)     0.62  (0.46–0.84)     0.63  (0.44–0.89)
Cuzick index int. or poor
  (vs. good)                2.27  (1.30–3.99)     1.62  (1.15–2.28)     1.37  (0.94–2.00)
Cuzick index poor
  (vs. int. or good)        1.46  (0.95–2.23)     1.11  (0.74–1.65)     0.90  (0.55–1.45)
Deviance                    860                   1255                  965
Null deviance               1004                  1393                  1012
From the logistic regression results it can be seen that sβ2m is a strong
prognostic factor for survival both to 6 months and to 2 years. The treatment
ABCM can be seen to improve survival to 2 years; although its effect on 6-month
survival is still beneficial, the effect is smaller in magnitude and not statistically
significant. The Cuzick prognostic index ‘‘good’’ does indeed indicate improved
survival, at least as far as 6 months, whereas the difference between the groups
‘‘poor’’ and ‘‘intermediate’’ is not statistically significant. The odds ratios for
age and the Cuzick prognostic index are closer to one for survival to 2 years
than for survival to 6 months, but these models by themselves do not indicate
whether there is any association with survival to 2 years within those patients
who survived to 6 months. This was investigated in a further logistic model for
survival up to 2 years, including only those patients who were still alive after 6
months. The results are also in Table 2. It is seen that neither age nor the Cuzick index has a statistically significant association with survival up to 2 years conditional on survival up to 6 months, whereas sβ2m and ABCM treatment are associated with differences both in survival from 6 months to 2 years and in survival up to 6 months.
The effect of serum creatinine is not statistically significant; this is in contrast to what is seen if only serum creatinine is included in the model (Table 3). There is strong correlation between serum creatinine and sβ2m, so that the strong prognostic value of serum creatinine by itself can be largely accounted for by the confounding effect of sβ2m, whereas the prognostic value of sβ2m is highly statistically significant even when serum creatinine has been included in the model. This can also be seen from Table 4, where the two variables have been divided into five categories each with roughly equal numbers of individuals. The correlation between the two variables can be seen by the high frequencies around the diagonal of the table. The prognostic value of each variable by itself can be seen by the proportions dead by 6 months in the row and column total cells, which increase as the value of each variable increases. The prognostic value of sβ2m after adjusting for serum creatinine can be seen by the increasing proportions dead in each row (i.e., as sβ2m increases in each category of serum creatinine). The lack of association between serum creatinine and survival after adjusting for sβ2m can be seen from the proportions dead, which do not increase steadily in each (internal) column of Table 4. This contrasts with the strong increasing trend seen in the column of marginal totals.

Table 3 Logistic Regression Models for Serum Creatinine and sβ2m Only

                            Death within 6 mo:              Death within 6 mo:
                            serum creatinine only           serum creatinine and sβ2m
Covariate                   O.R.   (95% CI)                 O.R.   (95% CI)
log2(serum creatinine)      2.14   (1.82–2.52)              1.33   (1.05–1.68)
log2(sβ2m)                                                  1.75   (1.42–2.16)
Deviance                    916                             886

Table 4 Number of Individuals in Categories Defined by Serum Creatinine and sβ2m, with Percentage Dying in the First 6 Months in Each Category (each cell shows n, with the percentage dead by 6 months in parentheses)

                                              log2(sβ2m)
log2(serum creatinine)    ≤1.89       1.89–2.44    2.44–2.93    2.93–3.73    >3.73        Total
≤6.46                      97 (5%)     59 (7%)      33 (18%)     16 (25%)      7 (29%)     212 (10%)
6.46–6.75                  56 (9%)     62 (11%)     41 (12%)     33 (21%)      5 (20%)     197 (13%)
6.75–7.04                  33 (3%)     57 (7%)      51 (22%)     42 (26%)     17 (18%)     200 (15%)
7.04–7.67                  20 (20%)    25 (4%)      56 (14%)     70 (26%)     31 (35%)     202 (21%)
>7.67                       1 (0%)      1 (0%)      16 (19%)     42 (29%)    142 (46%)     202 (40%)
Total                     207 (7%)    204 (8%)     197 (17%)    203 (26%)    202 (41%)    1013 (20%)

III. PROPORTIONAL HAZARDS

Hazard-based models are naturally adapted for use with (right) censored data and
have therefore become the standard approach for survival analysis. In particular,
the proportional hazards regression model introduced by Cox (4) has become
ubiquitous in medical journals. The (conditional) hazard (of death) at time t for
an individual with covariates Z is defined by

λ(t|Z) = lim_{h↓0} P(T ∈ [t, t + h] | T ≥ t, Z) / h    [1]

It is the death rate at time t among those who are alive (and uncensored) just
prior to time t.
Constant hazards correspond to exponential random variables and are con-
veniently described in terms of one death every so many person-years. Constant
hazards are, however, rarely observed in clinical studies. The usual form of the
proportional hazards model is
λ(t|Z) = λ0(t) exp(βᵀZ)    [2]

in which λ0(t) is an unspecified baseline hazard function (that corresponds to
individuals with Z = 0) and β is a vector of parameters. This model forms a
good starting point for analysis of prognostic factors for censored survival data.
However, it makes quite strong assumptions on the form of the effects and it is
always important to check the appropriateness of the model.
Even without questioning the form of the model, one has to apply a sensible
model-building strategy for selecting important prognostic factors from a pool
of potentially relevant covariates. We assume here that the goal is not simply
prediction (in which case one may prefer to use a ridge or shrinkage approach
over covariate selection (5)) but that the chosen model should be biologically
plausible. In many situations, several of the covariates will be correlated, and it
is a good idea to include certain basic covariates (factors known to be of prognos-
tic value from previous studies) in any model. After that one may wish to consider
a step forward or a step backward procedure to select a model. It is certainly
useful to properly document the model selection procedures employed and where
possible to validate the final model on a separate data set.
Example Cox regression was used to estimate the effects of prognostic variables on survival in the myeloma study; the results are in Table 5. Here, as in the logistic models (Table 2), higher values of age and sβ2m are associated with worse prognosis. As in the logistic models, ABCM treatment and Cuzick prognostic index good are associated with a reduced hazard, but the difference between Cuzick prognostic index intermediate and poor is not statistically significant. Notice that the hazard ratios are generally closer to one than the corresponding odds ratios from the logistic models in Table 2, but this is to be expected from the relationship between odds ratios and hazard ratios.

Table 5 Cox Proportional Hazards Model

Covariate                                Hazard ratio   95% CI
Age (per 10 years)                           1.12       (1.03–1.21)
log2(sβ2m)                                   1.26       (1.16–1.37)
log2(serum creatinine)                       1.06       (0.95–1.19)
ABCM                                         0.76       (0.65–0.89)
Cuzick index int. or poor (vs. good)         1.38       (1.15–1.65)
Cuzick index poor (vs. int. or good)         1.11       (0.91–1.36)
−2 × log partial likelihood (fitted model): 10,093
−2 × log partial likelihood (null model): 10,236
Prognostic value of the prognostic index After fitting the Cox model one has a function β̂ᵀZ that defines the effect of covariates on the baseline hazard. This is
not itself particularly useful clinically but should be combined with the estimated
baseline hazard to obtain estimates of the effects of the covariates on survival.
This will most conveniently be described in terms of the effect on the median
(or some other quantile) survival or on survival to some fixed time. Alternatively,
one can simply use the prognostic index to divide the study population into sub-
groups and estimate the survival of each group using standard Kaplan-Meier
techniques. An advantage of this latter approach is that the proportional hazards
model is only used to divide the population into subgroups with different progno-
ses: The actual survival of each subgroup is then estimated nonparametrically.
The disadvantages are the potential bias and loss of information associated with
discretizing a continuous prognostic factor and the loss of power resulting from
abandonment of the model.
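Both displays can be generated from a single fitted model. The sketch below (same hypothetical variable names as before, with time and status denoting overall survival) fits the Cox model, computes the prognostic index β̂ᵀZ as the linear predictor, and compares a model-based survival curve for a chosen covariate profile with Kaplan-Meier curves for prognostic-index groups.

  library(survival)

  cfit <- coxph(Surv(time, status) ~ age10 + lsb2m + lcreat + abcm +
                  cuzick_ip + cuzick_p, data = myeloma)

  ## Model-based survival curve for a specified covariate profile
  newpat <- data.frame(age10 = 6.9, lsb2m = 4.52, lcreat = 7.5,
                       abcm = 0, cuzick_ip = 1, cuzick_p = 1)
  plot(survfit(cfit, newdata = newpat), xlab = "Years", ylab = "Survival")

  ## Prognostic index groups and nonparametric (KM) estimates within them
  lp <- predict(cfit, type = "lp")                 # beta-hat' Z for each patient
  myeloma$pigrp <- cut(lp, quantile(lp, 0:10/10),
                       include.lowest = TRUE, labels = 1:10)
  km <- survfit(Surv(time, status) ~ pigrp, data = myeloma)
  plot(km, col = 1:10, xlab = "Years", ylab = "Survival")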
Example From Table 5, higher sβ 2m is strongly associated with an increased
hazard, but it is not clear what the clinical significance of this effect would be.
Figure 1 shows estimates from the fitted Cox model of the survival function for
patients with log 2 (sβ 2 m) equal to 1.54 and 4.52 (the 10% and 90% quantiles).
The first part of the figure is based on a patient with values of the other covariates corresponding to a relatively poor prognosis, whereas the second part of the figure is based on a patient with values of the other covariates corresponding to a relatively good prognosis.

Figure 1 Survival functions estimated from the Cox model, with 95% confidence intervals, for log2(sβ2m) equal to 1.54 and 4.52 and (a) age 69, log2(serum creatinine) = 7.5, not ABCM treatment, Cuzick index "poor"; (b) age 57, log2(serum creatinine) = 6.5, ABCM treatment, Cuzick index "good."

The prognostic index, β̂ᵀZ, from the fitted Cox model ranges from 0.45 to 3.22. By partitioning the prognostic index β̂ᵀZ, the sample was divided into 10 groups with equal numbers of individuals in each group, and Kaplan-Meier estimates of survival in each group were calculated. The covariate values of Figure 1a correspond to β̂ᵀZ = 1.98 and β̂ᵀZ = 2.66, which fall in the 6th and 10th of the 10 groups, and the covariate values of Figure 1b correspond to β̂ᵀZ = 1.08 and β̂ᵀZ = 1.77, which fall in the 1st and 4th of the 10 groups. Figure 2 shows the Kaplan-Meier estimates for these groups.

Figure 2 Kaplan-Meier estimates with 95% confidence intervals, for individuals with prognostic index from the Cox model (a) β̂ᵀZ in (1.91–2.03) and (2.52–3.22) and (b) β̂ᵀZ in (0.45–1.36) and (1.68–1.80).
The two methods produce survival estimates with quite different shapes
since Figure 1 is based on the assumption of proportional hazards and the specific
form of the Cox model, whereas in Figure 2 the survival functions are estimated
nonparametrically but based on grouping large numbers of patients together. For
comparison, 1-year survival probabilities for the four groups in Figure 1 are 51%,
71%, 76%, and 87% and in Figure 2 33%, 71%, 82%, and 94%. The correspond-
ing 5-year survival rates are 2%, 16%, 22%, and 54% compared with 11%, 9%,
20%, and 42%. Thus, it is seen that deviations from the model fit are greatest at
the extremes of the prognostic index range.

IV. TRANSFORMATION OF COVARIATES

Both the logistic model and the Cox model (2) impose a particular form on each
continuous covariate. As with all regression models one should consider the

possibility of transforming covariates before entering them in the model. Many


serum markers have positively skewed distributions and are traditionally log-
transformed. In other situations one may need to consider whether an extreme
value has undue influence on the parameter estimates or whether there is a (bio-
logically plausible) nonmonotone covariate effect.
A simple exploratory analysis of the effect of a single continuous covariate
on survival can be done using smooth estimates of quantiles of the conditional
survival function (6,7). The conditional survival function for covariate value z
is estimated using the Kaplan-Meier estimator with individual i weighted ac-
cording to the distance between z and Zi. Usually there will be more than one
covariate, and the following sections describe methods for exploring the relation-
ship between continuous covariates and survival within a logistic model or a Cox
model.

A. Transformation of Covariates in the Logistic Model


For the logistic model as in Section II the form of the covariate effects can be
investigated graphically simply using scatter plot smoothers of the response (i.e.,
death) against each covariate. Since the logistic model assumes that the logit
transformation of the probability of death given covariate vector Z is linear in
each component of Z, it may be more useful to plot the logit transformation of
a smooth of the response against a covariate.
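For example, a display like Figure 3 needs only a scatterplot smoother and the logit transform. The sketch below uses a cubic smoothing spline with 7 degrees of freedom, as in the example that follows, on the hypothetical myeloma data frame; fitted values are clipped away from 0 and 1 before taking logits.

  ## Smooth the 0/1 indicator of death within 2 years against log2(s-beta2m),
  ## then display the smooth on the logit scale (hypothetical variable names)
  sm <- smooth.spline(myeloma$lsb2m, myeloma$death2y, df = 7)
  p  <- pmin(pmax(sm$y, 0.001), 0.999)   # keep fitted probabilities inside (0, 1)
  plot(sm$x, qlogis(p), type = "l",
       xlab = "log2(s-beta2m)", ylab = "logit of smoothed P(death within 2 years)")
  rug(myeloma$lsb2m)                     # marks showing the covariate distribution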

Example Figure 3 shows the logit transformation of a smooth of the indicator of death up to 2 years against the covariate values for sβ2m and serum creatinine. The smoother used is a cubic smoothing spline with 7 degrees of freedom and the shaded histograms on the plots indicate the distribution of the covariates.

Figure 3 Logit transformation of smoothed indicator of death up to 2 years against prognostic variables.

Notice that the probability of death within 2 years of diagnosis is very high
(90%) for those with very high levels of log2(sβ2m) (>5) and very low (20%)
for those with extremely low values (<0). Despite this increasing trend, the rela-
tionship may not be monotone—there is certainly no evidence of increasing risk
associated with log 2 (sβ2 m) values of between 1.0 and 2.0. By contrast the proba-
bility of death by 2 years does seem to be a monotone function of serum creatinine
concentration (except possibly at the lowest few percent of concentrations). The
spread of risk is, however, less than for sβ 2m. The strong relationship between
dying and serum creatinine is interesting in that it largely disappears after ad-
justing for sβ 2m.

B. Transformation of Covariates in the Cox Model


The Cox model is hazard based and allows for censoring, so there is no simple
response that can be plotted to investigate the form of the covariate effects in
it, but other graphical methods have been developed. The simplest approach to
investigating covariate transformations is to partition the covariate of interest to
create about five ‘‘dummy variables.’’ Cutpoints should be chosen so that there
are roughly equal numbers of observations in each group wherever possible, but
standard cutpoints may be preferred. A plot of the estimated parameters against
the mean value of the observations in the interval is used to examine the appropri-
ateness of the linear fit associated with the basic model.
Discretizing a covariate and estimating a separate parameter for each inter-
val is equivalent to fitting a piecewise constant function. With a continuous covar-
iate, this is a very crude approximation to the logarithm of the hazard ratio that
may vary as a smooth function of the basic covariate. Consider the additive Cox model

λ(t|Z) = λ0(t) exp{ Σ_{j=1}^{p} s_j(Z_j) }    [3]

in which the hazard ratio associated with the jth covariate is equal to the function
exp{sj (Zj )} instead of simply exp(βj Zj ). Techniques exist for estimating the sj
directly using local estimation (8–10), regression splines (11–13), or penalized
partial likelihood (14,15).
In this chapter we are more interested in diagnostic plots to investigate
whether the chosen form exp(βj Zj ) is reasonable. Methods based on residuals
yield one-step approximations toward the underlying sj and have the advantage
of being easy to use and easy to apply using any software that can do Cox regres-
sion and smoothing. A very crude estimate of the sj can be obtained by applying
a scatterplot smoother to a plot of the so-called martingale residuals against Zj
(16). These are defined as
M̂_i = δ_i − exp(β̂ᵀZ_i) Λ̂0(T_i)

for individual i with covariate vector Z_i and survival time T_i, where exp(β̂ᵀZ_i) Λ̂0(T_i) is an estimate of the cumulative hazard function for individual i at T_i,

Λ̂0(T_i) = Σ_{T_j ≤ T_i} [ δ_j / Σ_{T_k ≥ T_j} exp(β̂ᵀZ_k) ]

and δ_i = 1 if the observation on individual i is a death, δ_i = 0 if it is censored. These residuals can then be smoothed against each component of the covariate vector; M̂_i is smoothed against Z_ij to estimate the form of s_j. Earlier approaches included plotting the terms Ê_i = exp(β̂ᵀZ_i) Λ̂0(T_i) = δ_i − M̂_i against Z_i (17,18); martingale residuals M̂_i are an improvement as the terms δ_i provide an adjustment for censoring.
The resulting estimates of the s_j's are not the best available diagnostics; a better diagnostic plot can be obtained by adjusting each martingale residual M̂_i by Ê_i and plotting

smooth( M̂_i / Ê_i against Z_ij ) with smoothing weights Ê_i    [4]

We call M̂_i/Ê_i the adjusted martingale residual. Alternatively, a diagnostic plot can be obtained by smoothing both δ_i and Ê_i against the covariate values and plotting the logarithm of the ratio of the two smooths,

log{ smooth(δ_i against Z_ij) / smooth(Ê_i against Z_ij) }    [5]

(19). Motivation for these plots comes from the fact that under the additive Cox model [3], E(δ_i | Z_i) = E[Λ0(T_i) exp{Σ_j s_j(Z_ij)} | Z_i], whereas E(Ê_i | Z_i) = E{Λ̂0(T_i) exp(Σ_j β̂_j Z_ij) | Z_i}. Thus, an estimate of E(δ_i | Z_ij)/E(Ê_i | Z_ij) approximately estimates the factor exp{s_j(Z_ij) − β̂_j Z_ij}, which leads to Eq. [5] as an estimate of s_j(Z_ij) − β̂_j Z_ij. The two methods [4] and [5] are similar, as log{E(δ_i | Z_ij)/E(Ê_i | Z_ij)} is approximately equal to E(M̂_i | Z_ij)/E(Ê_i | Z_ij) by the approximation log(1 + x) ≈ x for small x.
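A hedged sketch of the diagnostic in Eq. [4]: martingale residuals are extracted from the fitted Cox model, converted to adjusted residuals M̂/Ê, and smoothed against a covariate with weights Ê. Variable names are hypothetical, and the smoothing spline with 7 degrees of freedom mirrors the example that follows rather than being a required choice.

  library(survival)

  cfit <- coxph(Surv(time, status) ~ age10 + lsb2m + lcreat + abcm +
                  cuzick_ip + cuzick_p, data = myeloma)

  mres <- residuals(cfit, type = "martingale")  # M-hat_i
  ehat <- myeloma$status - mres                 # E-hat_i = delta_i - M-hat_i

  ## Eq. [4]: weighted smooth of the adjusted residuals against log2(s-beta2m)
  sm <- smooth.spline(myeloma$lsb2m, mres / ehat, w = ehat, df = 7)

  ## Display on the scale of s_j by adding back the fitted linear term
  plot(sm$x, sm$y + coef(cfit)["lsb2m"] * sm$x, type = "l",
       xlab = "log2(s-beta2m)", ylab = "estimated s(.)")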
Smoothing is needed in any plot of martingale residuals since they are
generally very skewed and are nearly uncorrelated; plots of martingale residuals
or adjusted martingale residuals themselves are not usually helpful. Note that
smooths should be mean based since they are estimating expected values; a robust
smoother is likely to be biased due to the skewness of Λ̂.
Some statisticians use martingale residuals before entering a new covariate
into the Cox model. Since residual methods yield one-step approximations toward

the underlying s j , it is always advisable to start with at least a linear approxima-


tion. If martingale residuals are based on a model including the covariate Z j , with
coefficient β̂j , the function s j (Z j ) is estimated by the residual estimate [4] or [5]
added to the linear term β̂j Z j. On the other hand, to determine whether the func-
tion s j deviates from linear, it may be more useful to simply plot the residual
estimate.
Confidence intervals Confidence intervals are always important when looking
at any estimate; they are particularly important in the case of smoothed residual
plots, since smoothers can make a plot appear to have some nonlinear structure
even from random data with no underlying structure. Estimating confidence inter-
vals for smoothed estimates presents various problems such as bias correction,
multiple testing, and determining the shape of an estimate as opposed to its value
(see for example Hastie and Tibshirani, Sect. 3.8 (20)). Pointwise confidence
intervals are relatively simple to estimate, at least if a linear smoother is used,
and are certainly useful, although care should be taken in interpreting them.
The variance of the adjusted residual M̂ i /Ê i can be estimated by 1/Ê i , but
there is an additional problem due to correlation between adjusted residuals for
different individuals and the variance due to adding the linear estimate from the
Cox model. However, the variance of the linear estimate, and the covariance of
M̂ i /Ê i , and M̂ i′ /Ê i′ for i ≠ i′, are usually small compared with the variance of
M̂ i /Ê i for each individual, so approximate confidence intervals can be found by
estimating the variance of the vector of adjusted residuals M̂ i /Ê i by the diagonal
matrix with diagonal elements equal to 1/Ê_i. If the weighted smooth against the kth covariate is represented by the linear smoothing matrix L_k, the variance of the smooth estimate [4] can be estimated by the diagonal matrix with diagonal elements equal to 1/Ê_i, premultiplied by L_k and postmultiplied by L_kᵀ.
Example Figure 4 shows the weighted smooth of the adjusted martingale residuals [4] with the linear term added for the continuous covariates, sβ2m, age, and serum creatinine in the Cox model of Table 5. The smoother is a smoothing spline with 7 degrees of freedom, and the shaded histograms on the plots indicate the distribution of the covariates. The additive Cox model [3] only makes sense if there is some constraint on the functions s_j; for example, if λ0 is taken to be the hazard function for an individual with Z = 0, then s_j(0) = 0 for each covariate j. In Figure 4, λ0 corresponds to the minimum observed value of each covariate.

Figure 4 Estimate of covariate transformation using smoothed adjusted martingale residuals, with approximate 95% confidence intervals.

As a result of Figure 4, the linear terms for log2(sβ2m), log2(serum creatinine), and age in the Cox model were replaced by continuous piecewise linear functions. The variable log2(sβ2m) was split into four variables, for values up to 2, between 2 and 3, between 3 and 5, and greater than 5. Similarly, log2(serum creatinine) was split into two variables for values greater than and less than 10, and age was split into two at 55 years. These cutpoints are shown as vertical dashed lines on Figure 4. A new variable was defined equal to age when age is less than 55 and equal to 55 otherwise, and a second variable was defined equal to age when age is greater than 55 and equal to 55 otherwise; similarly for the sβ2m and serum creatinine variables. The resulting estimates are shown in Table 6. The model in Table 6
does not give a particularly good estimate of the shape of the functions β j (z j ).
Figure 4 itself is more appropriate for that, but it does give a better idea of the
strength of the effects and the standard errors of the estimates. Thus, according
to this model it is both very young and older patients that have an increased
hazard. An increase in sβ2m is associated with an increased hazard both for very
high values and for more typical values, but extremely high values of sβ2m seem
to be associated with an additional increase in the hazard, whereas very low
values may not indicate a correspondingly lower hazard. In general, the hazard
seems to depend on the level of sβ2 m in a fairly complicated way. After adjusting
for sβ2 m, serum creatinine has no statistically significant association with in-
creased hazard either in general or for extreme values. The value of minus twice
the log partial likelihood for this model is 48 less than that for the model in Table
5, for the addition of five extra variables, so by a partial likelihood ratio test the
new model is certainly a better fit, even allowing for the data driven choice of
cut points.

Table 6 Cox Proportional Hazards Model Using Continuous Piecewise Linear Covariate Effects

Covariate                                 Hazard ratio   95% CI
Age up to 55 years (per 10 years)             0.80       (0.65–1.00)
Age after 55 years (per 10 years)             1.28       (1.13–1.44)
log2(sβ2m) up to 2                            1.00       (0.83–1.20)
log2(sβ2m) between 2 and 3                    2.02       (1.58–2.58)
log2(sβ2m) between 3 and 5                    0.93       (0.78–1.11)
log2(sβ2m) above 5                            5.87       (3.38–10.18)
log2(serum creatinine) up to 10               1.03       (0.92–1.17)
log2(serum creatinine) above 10               3.43       (0.97–12.08)
ABCM                                          0.79       (0.67–0.92)
Cuzick index int. or poor (vs. good)          1.34       (1.12–1.62)
Cuzick index poor (vs. int. or good)          1.14       (0.93–1.39)
−2 × log partial likelihood: 10,045
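The continuous piecewise linear terms used in Table 6 amount to simple pmin/pmax constructions. A sketch with hypothetical variable names (age_yr is age in years; lsb2m and lcreat are the log2-transformed markers):

  library(survival)

  ## Piecewise linear covariates built from the cutpoints suggested by Figure 4
  myeloma$age_lo   <- pmin(myeloma$age_yr, 55)  # equals age below 55, and 55 otherwise
  myeloma$age_hi   <- pmax(myeloma$age_yr, 55)  # equals age above 55, and 55 otherwise
  myeloma$sb2m_1   <- pmin(myeloma$lsb2m, 2)
  myeloma$sb2m_2   <- pmin(pmax(myeloma$lsb2m, 2), 3)
  myeloma$sb2m_3   <- pmin(pmax(myeloma$lsb2m, 3), 5)
  myeloma$sb2m_4   <- pmax(myeloma$lsb2m, 5)
  myeloma$creat_lo <- pmin(myeloma$lcreat, 10)
  myeloma$creat_hi <- pmax(myeloma$lcreat, 10)

  cfit_pw <- coxph(Surv(time, status) ~ age_lo + age_hi + sb2m_1 + sb2m_2 +
                     sb2m_3 + sb2m_4 + creat_lo + creat_hi + abcm +
                     cuzick_ip + cuzick_p, data = myeloma)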

Figure 5 shows adjusted martingale residual plots for sβ2 m and serum creat-
inine based on martingale residuals calculated from a model without these covari-
ates, that is, only age, hemoglobin, ABCM, and the Cuzick index indicator vari-
ables are in the model. In contrast to Figure 4, the plot for serum creatinine
indicates a strong effect (note the scales on the y-axes are not the same as in Fig.
4). This is due to the correlation between serum creatinine and sβ2 m as discussed
in Section II and illustrates the advantage of entering a covariate in a model at
least as a linear term before calculating residuals. Recall from Table 4 that of
the 202 individuals with log2 (serum creatinine) greater than 7.67, 184 (91%) had
log2 (sβ2 m) greater than 2.93.

Figure 5 Estimate of covariate transformation using smoothed adjusted martingale residuals based on the model without sβ2m or serum creatinine.

V. NONPROPORTIONAL HAZARDS:
TIME-VARYING COEFFICIENTS

In the Cox model, the hazard ratio of two individuals with covariates Z = z1 and Z = z2, respectively, is given by exp{βᵀ(z2 − z1)}, which does not vary with time. Covariate effects that are thought to change (on the hazard ratio scale) over time can be modeled by including user-defined time-dependent covariates, but that is not a particularly flexible approach. Rather, one may wish to consider the more general model

λ(t|Z) = λ0(t) exp{β(t)ᵀZ}

in which the hazard ratio exp{β(t)ᵀ(z2 − z1)} is allowed to vary over time through the vector of functions β(t).
Standard software for fitting a Cox model to time-dependent covariates can
be used to estimate β(t) if one is willing to use a parametric regression spline
so that a single covariate X is replaced by a vector Xb(t) where b is a vector
basis for the spline (21,22). However, regression splines are not the most flexible
approach, and here we are more interested in diagnostic plots that can be used
to examine the form of the functions β(t) rather than direct estimation.
One simple approach is to estimate the parameters of the Cox model locally
in time. Although this is computationally intensive, it is conceptually simple and
easily implemented in any package capable of doing Cox regression. To estimate
the parameters at some point t* one considers a window in time (t 0 , t 1] containing
t* and estimates β in a standard Cox model left truncating the data at t 0 and right
censoring at t 1. A disadvantage is that because the estimate is locally constant,
there is bias toward the ends of the range of event times, in the same way as
smoothing in general using a running mean smoother results in bias toward the
ends. This approach is discussed more fully by Valsecchi et al. (23).
In a similar way to the methods for transformations of covariates, one-step
estimates of time-varying coefficients can by found by smoothing residuals
against time. The appropriate residuals here are Schoenfeld residuals (24–26).
Let t_(1), . . . , t_(d) be the unique event times; then the Schoenfeld residual at event time t_(i) is defined as

r̂_(i) = Σ_{j: T_j = t_(i)} { Z_j − S_1(β̂, t_(i)) / S_0(β̂, t_(i)) }

where S_k(β̂, t) = Σ_{j: T_j ≥ t} Z_j^{⊗k} exp(β̂ᵀZ_j). Note that the residuals are only defined at
death times (there is not a residual that is identically zero at each censoring time).
Note also that this is a vector valued residual. Smoothing each component of the
residuals against the event times leads to estimates of each component of the
vector of functions β.
The residual r̂_(i) has variance V_(i) estimated by

V̂_(i) = Σ_{j: T_j = t_(i)} { S_2(β̂, t_(i))/S_0(β̂, t_(i)) − S_1(β̂, t_(i))^{⊗2}/S_0(β̂, t_(i))² }

where S_2(β̂, t) = Σ_{j: T_j ≥ t} Z_j^{⊗2} exp(β̂ᵀZ_j). Theory suggests that the expected value of V̂_(i)⁻¹ r̂_(i) is approximately equal to {β(t_(i)) − β̂} (24). If one only had a single covariate, the shape (but not the magnitude) of β(t) could be estimated without standardizing, but because the different components are not in general independent, it is necessary to standardize the residuals by premultiplying by the inverse of their variance before smoothing against time. The adjusted residual V̂_(i)⁻¹ r̂_(i) then has variance approximately V_(i)⁻¹. This variance can vary greatly, particularly for the later event times; therefore in smoothing, for the kth covariate, the inverses of the kth diagonal elements of the matrices V̂_(i)⁻¹ should be used as weights.
Often V̂ (i) will be nearly the same for each event time, in which case the
Schoenfeld residuals can be adjusted using V, the mean of the V̂ (i)s. This saves
computational effort, particularly since V is equal to the inverse of the Cox model
variance matrix divided by the number of events and the adjusted Schoenfeld
residuals are therefore available without any extra computation in statistics pack-
ages (such as S-Plus, Stata). On the other hand, if the risk set becomes very small
at later event times or, rather, if the range of covariate values in the risk set
becomes small at later event times, then V̂ (i) is likely to be smaller as well, and
using the mean V can lead to bias (27). Therefore, if V is used instead of V̂ (i) ,
care must be taken in interpreting the smooths once the risk set is small.
A number of points from Section IV also apply to the plots in this section.
Smoothing is needed for looking at plots of Schoenfeld residuals since the ad-
justed residuals themselves are often highly skewed and nearly uncorrelated. Ad-
ditionally, trends in Schoenfeld residuals may be obscured by the residuals being
in ‘‘bands’’ for different values of categorical covariates. The motivation for the
plots is based on using smoothing to estimate expected values; therefore, smooths
should be mean based and not robust.
The constant estimates from the Cox model can be added to the smoothed
adjusted residuals to estimate the functions β or the smoothed adjusted residuals
can be plotted by themselves to estimate the deviation of β from constant. The
residual plots are only one-step estimates, so the covariate should be entered in
the Cox model initially at least as constant, since a one-step estimate starting
from a constant estimate should be better than a one-step estimate starting from
zero.
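The scaled Schoenfeld residuals, their smooth, and the associated test are available from cox.zph in the survival package, which standardizes using an overall variance estimate in the spirit of the averaged V̂ discussed above. A sketch, again with hypothetical names (cfit_pw is the piecewise linear Cox fit from the earlier sketch):

  library(survival)

  ## Scaled Schoenfeld residuals against untransformed time, with the
  ## Grambsch-Therneau test for each covariate and globally
  zph <- cox.zph(cfit_pw, transform = "identity")
  print(zph)                            # per-covariate and global tests
  plot(zph[3], df = 7)                  # smoothed beta(t) for the third covariate, with 95% band
  abline(h = coef(cfit_pw)[3], lty = 2) # constant coefficient from the Cox fit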
Confidence intervals In the same way as described in Section IV.B, confidence intervals are important for interpreting smooths of adjusted Schoenfeld residuals. Approximate pointwise confidence intervals can be estimated fairly easily if a linear smoother is used (26). Let β̂¹_(i) be the adjusted Schoenfeld residuals plus the constant estimate, β̂¹_(i) = V̂_(i)⁻¹ r̂_(i) + β̂; then the variance of β̂¹_(i) can be estimated by V̂_(i)⁻¹, and for i ≠ i′, β̂¹_(i) and β̂¹_(i′) are approximately uncorrelated. Thus, for covariate k, the variance of the vector of estimates β̂¹_(i)k can be estimated by the diagonal matrix with (i)th diagonal element equal to the kth diagonal element of V_(i)⁻¹. If the weighted smoothing for covariate k is represented by a linear smoothing matrix L_k, then the variance of the smoothed estimate can be estimated by premultiplying the diagonal variance matrix by L_k and postmultiplying by L_kᵀ.
Example Figure 6 shows the weighted smooth of the adjusted Schoenfeld resid-
uals with the constant estimate added for sβ2 m, from the fitted Cox model in
Table 6. The smoother is a smoothing spline with 7 degrees of freedom. The
shaded histograms indicate the distribution of observed deaths, for patients with
log 2 (sβ2 m) between 2 and 3 in the first part and for those with log 2 (sβ2 m) above
5 in the second part.
From these plots it seems that high values of sβ2 m are associated with an
increased hazard initially, but the effect decreases over time, and possibly sβ2 m
has no prognostic value beyond 2 or 3 years. A third Cox model was fitted in
which the constant coefficients of log2 (sβ2 m) were replaced by a piecewise con-
stant coefficient, which was allowed to have different values for up to and beyond
2 years. The four variables for log2 (sβ2 m) were set to zero after 2 years and a
further variable was defined to have value zero up to 2 years and the value of
log2 (sβ2 m) after 2 years. The results are in Table 7; thus, there does not seem to
be any statistically significant association between sβ2 m and increased hazard
beyond 2 years. Again, the Cox model gives an idea of the prognostic effect of
sβ2 m with error estimates in each of the two time intervals but does not give any
idea of how quickly or in what manner the effect decreases over time; an idea

of this can be seen from the figure instead. The value of minus twice the log partial likelihood for this model is 32 less than that for the model in Table 6, with one extra variable.

Figure 6 Estimates of time-varying coefficients with 95% confidence intervals using smoothed Schoenfeld residuals.

Table 7 Cox Model Using Continuous Piecewise Linear Covariate Effects with Coefficients Piecewise Constant in Time

Covariate                                            Hazard ratio   95% CI
Age up to 55 years (per 10 years)                        0.81       (0.65–1.00)
Age after 55 years (per 10 years)                        1.28       (1.13–1.44)
log2(sβ2m) up to 2 in the first two years                0.98       (0.73–1.33)
log2(sβ2m) between 2 and 3 in the first two years        2.83       (2.02–3.96)
log2(sβ2m) between 3 and 5 in the first two years        1.11       (0.91–1.35)
log2(sβ2m) above 5 in the first two years                4.61       (2.65–8.02)
log2(sβ2m) after the first 2 years                       1.02       (0.92–1.14)
log2(serum creatinine) up to 10                          1.01       (0.90–1.14)
log2(serum creatinine) above 10                          2.92       (0.83–10.27)
ABCM                                                     0.79       (0.67–0.92)
Cuzick index int. or poor (vs. good)                     1.42       (1.18–1.71)
Cuzick index poor (vs. int. or good)                     1.15       (0.94–1.40)
−2 × log partial likelihood: 10,013
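One way to obtain coefficients that are piecewise constant in time, as in Table 7, is to split each patient's follow-up at 2 years and let the sβ2m terms apply only within the appropriate period. The sketch below uses survSplit from the survival package (recent versions) and the hypothetical variables introduced earlier.

  library(survival)

  ## Split follow-up at 2 years; 'period' is 1 before 2 years and 2 afterward
  mye2 <- survSplit(Surv(time, status) ~ ., data = myeloma, cut = 2,
                    episode = "period")
  mye2$first2 <- as.numeric(mye2$period == 1)

  ## s-beta2m terms active only in the first 2 years; one linear term afterward
  cfit_t <- coxph(Surv(tstart, time, status) ~ age_lo + age_hi +
                    I(sb2m_1 * first2) + I(sb2m_2 * first2) +
                    I(sb2m_3 * first2) + I(sb2m_4 * first2) +
                    I(lsb2m * (1 - first2)) + creat_lo + creat_hi +
                    abcm + cuzick_ip + cuzick_p, data = mye2)
  summary(cfit_t)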

VI. STRATIFICATION

A covariate in the Cox model may require both a transformation as in Section


IV and a time-dependent covariate as in Section V. Therefore it might be neces-
sary to study the effect of a continuous prognostic variable without assuming
either proportional hazards or a particular form with respect to the covariate. A simple way of doing this is to discretize the covariate as in Section IV and use a stratified Cox model. Denote by X the discretized covariate of interest, and model the conditional hazard as

λ(t | Z, X = k) = λ_k(t) exp(βᵀZ)    [6]

where there are now several "baseline hazards," one for each level of the discretized covariate. The functions λ_k(t) can be estimated using smoothing (28). However, estimating the set of functions λ_k now means estimating a function of both time and the covariate X, and therefore without strong constraints on the functions or a very large data set it can only be estimated with very limited accuracy.

Estimation of the corresponding survival functions is much easier than estimating the hazard functions; this can be done using product limit type estimators

Ŝ_k(t) = Π_{T_i ≤ t} {1 − ΔΛ̂_k(T_i)}

where the jumps are defined by

ΔΛ̂_k(T_i) = Σ_{j: X_j = k, T_j = T_i} δ_j / Σ_{j: X_j = k, T_j ≥ T_i} exp(β̂ᵀZ_j)

Note that it is important to center the covariates so that the baseline survival
functions correspond to a realistic combination of covariates (Z = 0). Extrapola-
tion from a group of individuals aged 40–65 years to age 0, for instance, is likely
to lead to nonsensical results. Such survival curves can be interpreted directly,
reading off x-year (e.g., 3 year) survival or the y-percentile (e.g., median) survival
in each group. Other plots based on stratified survival estimates can also be used;
for example, box plots can be used with stratified censored data and may provide
a useful visual summary of the survival functions (6,7).
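A stratified fit of this kind, with the remaining covariates centered, can be specified with strata() in the model formula; survfit then returns one survival curve per stratum at the reference (here, centered to zero) covariate values. Variable names remain hypothetical, and the cutpoints below simply illustrate the idea.

  library(survival)

  ## Strata defined by log2(s-beta2m); other covariates centered at their means
  myeloma$sb2m_grp <- cut(myeloma$lsb2m, c(-Inf, 2.6, 5, Inf))
  sfit <- coxph(Surv(time, status) ~ scale(age10, scale = FALSE) +
                  scale(lcreat, scale = FALSE) + scale(abcm, scale = FALSE) +
                  scale(cuzick_ip, scale = FALSE) + scale(cuzick_p, scale = FALSE) +
                  strata(sb2m_grp), data = myeloma)

  ## One baseline survival curve per stratum, with confidence intervals
  ssurv <- survfit(sfit)
  plot(ssurv, col = 1:3, conf.int = TRUE, xlab = "Years", ylab = "Survival")
  summary(ssurv, times = 2)   # e.g., estimated probability of surviving 2 years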
As in Section IV. B, discretizing a covariate and estimating a separate base-
line hazard for each resulting stratum is equivalent to estimating a hazard that
is piecewise constant in the covariate. A continuously stratified Cox model may
be used instead to provide a hazard estimate that is smooth in the covariate. Such
estimators are discussed in detail by Sasieni (29) and Dabrowska (30).
The survival estimates Ŝ k are based on a stratification that is determined
by a specific covariate of interest, X. Alternatively, interest might be in simply
estimating survival in prognostic groups, within which individuals have similar
survival functions. Tree-based methods can be used to divide data into strata
(prognostic groups) based on the values of several covariates so that survival
within each stratum is relatively homogeneous (31,32).
The baseline survival functions can also be used in exploratory analysis
for deciding on an appropriate model. For example, traditionally the proportional
hazards assumption can be examined by plotting log{−log Ŝ_k(t)} against t for all
strata on the same axes; under proportional hazards, the resulting curves should be
parallel (that is, the vertical distance between two curves should be constant for
all values of t). Unfortunately, it is surprisingly difficult to judge whether two
nonlinear curves are parallel, and the methods of Sections IV and V are more
useful for exploring possible models.
Example From Figures 4 and 6 it has been seen that the association between
sβ2 m and the hazard ratio is both nonlinear in the value of log2 (sβ2 m) and non-
constant in time. Therefore, to estimate the effect of sβ2 m on survival without
making either of these assumption, a stratified Cox model was used with strata
Prognostic Factors and Survival 453

Figure 7 Estimated baseline survival functions and 95% confidence intervals from
stratified Cox model. Strata are defined by log 2 (sβ2 m) ⱕ 2.6, 2.6 ⬍ log2 (sβ2 m) ⱕ 5,
and 5 ⬍ log2 (sβ2 m).

defined by log2 (sβ2 m) with the cutpoints 5 and 2.6. The latter cutpoint was chosen
so that the two strata with log2 (sβ2 m) ⱕ 5 had approximately equal numbers of
individuals. Figure 7 shows estimates of the survival functions corresponding to
the baseline hazard functions for the three strata; each of the other covariates
was centered to have mean zero, so that the estimates correspond to the mean
values of the other covariates. Individuals in the same strata are not expected to
have the same survival functions so the estimates are for a randomly selected
individual from the group. Summary statistics from Figure 7 are in Table 8.

VII. DISCUSSION

The standard approach to regression analysis of censored data is to use the semi-
parametric proportional hazards model. The model is extremely useful for ad-
justing for possible confounders when the main interest is on a binary covariate.
Formal inference can be based on the score test that is an adjusted log-rank test,
and survival in the two groups can be examined without assuming proportional
hazards by use of a stratified Cox model. If, however, the primary interest is in
one or more continuous covariates, one may wish to investigate more flexible
models. There are now a variety of graphical techniques that can be extremely
useful in pointing the data analyst toward a more appropriate model. It is often
tempting to overinterpret nonlinear effects detected by such plots, and one should
err on the side of caution unless a validation sample is available to check the
significance of trends found during exploratory analyses.
We have considered two main types of departure from the log linear propor-
tional hazards model. First, we looked at covariates whose effect may not be
linear on the prognostic index. This may arise due to a U- or J-shaped relationship
or when there is a large amount of information available so that more subtle
departures from linearity can be detected. In the example used throughout this
chapter, we observed a nonmonotone relationship with age (patients aged about
55 years had a better prognosis than both older and younger patients) and a mono-
tone, but distinctly nonlinear, relationship with the logarithm of sβ2 m concentra-
tion. We also saw how the apparent significance of serum creatinine depends
critically on whether or not sβ2 m is adjusted for.
The second form of departure considered was covariate effects that change
over time on the proportional hazards scale. In particular, we saw how sβ2 m,
which is so informative for short-term survival, contains little or no information
regarding subsequent survival of those who survive at least 2 years from diag-
nosis.

REFERENCES

1. MacLennan ICM, Chapman C, Dunn J, Kelly K. Combined chemotherapy with
ABCM versus melphalan for treatment of myelomatosis. The Lancet 1992; 339:
200–205.
2. Cuzick J, Galton DAG, Peto R for the MRC’s Working Party on Leukaemia in
adults. Prognostic features in the third MRC myelomatosis trial. Br J Cancer 1980;
43:831–840.
3. MRC Working Party on Leukaemia in Adults. Objective evaluation of the role of
vincristine in induction and maintenance therapy for myelomatosis. Br J Cancer
1985; 52:153–158.
4. Cox DR. Regression models and life tables [with discussion]. J R Stat Soc B 1972;
34:187–220.
5. Le Cessie S, van Houwelingen JC. Ridge estimators in logistic regression. Appl Stat
1992; 41:191–201.
6. Gentleman R, Crowley J. A graphical approach to the analysis of censored data.
Breast Cancer Res Treat 1992; 22:229–240.
7. Gentleman R, Crowley J. Graphical methods for censored data. J Am Stat Assoc
1991; 86:678–683.

8. Tibshirani R, Hastie T. Local likelihood estimation. J Am Stat Assoc 1987; 82:559–
567.
9. Gentleman R, Crowley J. Local full likelihood estimation for the proportional haz-
ards model. Biometrics 1991; 47:1283–1296.
10. Hastie T, Tibshirani R. Generalized additive models. Stat Sci 1986; 1:297–318.
11. Sleeper LA, Harrington P. Regression splines in the Cox model with application to
covariate effects in liver disease. J Am Stat Assoc 1990; 85:941–949.
12. Durrleman S, Simon R. Flexible regression models with cubic splines. Stat Med
1989; 8:551–561.
13. Gray RJ. Flexible methods for analyzing survival data, using splines, with applica-
tions to breast cancer prognosis. J Am Stat Assoc 1992; 87:942–951.
14. O’Sullivan F. Nonparametric estimation of relative risk using splines and cross-
validation. SIAM J Sci Stat Comput 1988; 9:531–542.
15. Hastie T, Tibshirani R. Exploring the nature of covariate effects in the proportional
hazards model. Biometrics 1990; 46:1005–1016.
16. Therneau TM, Grambsch PM, Fleming TR. Martingale based residuals for survival
models. Biometrika 1990; 77:147–160.
17. Lagakos SW. The graphical evaluation of explanatory variables in proportional haz-
ard regression models. Biometrika 1981; 68:93–98.
18. Crowley J, Storer BE. Comment on Aitkin M, Laird N, Francis B, A reanalysis of
the Stanford heart transplant data. J Am Stat Assoc 1983; 78:278–281.
19. Grambsch PM, Therneau TM, Fleming TR. Diagnostic plots to reveal functional
form for covariates in multiplicative intensity models. Biometrics 1995; 51:1469–
1482.
20. Hastie T, Tibshirani R. Generalized Additive Models. London: Chapman and Hall, 1990.
21. Hess KR. Assessing time-by-covariate interactions in proportional hazards regres-
sion models using cubic spline functions. Stat Med 1994; 13:1045–1062.
22. Abrahamowicz M, MacKenzie T, Esdaile JM. Time-dependent hazard ratio: model-
ling and hypothesis testing with application in lupus nephritis. J Am Stat Assoc
1996; 91:1432–1439.
23. Valsecchi MG, Silvestri D, Sasieni P. Evaluation of long-term survival: use of diag-
nostics and robust estimators with Cox’s proportional hazards model. Stat Med 1996;
15:2763–2780.
24. Schoenfeld D. Partial residuals for the proportional hazards regression model. Bio-
metrika 1982; 69:239–241.
25. Pettitt AN, Bin Daud I. Investigating time dependence in Cox’s proportional hazards
model. Appl Stat 1990; 39:313–329.
26. Grambsch PM, Therneau TM. Proportional hazards tests and diagnostics based on
weighted residuals. Biometrika 1994; 81:515–526. [Correction in Biometrika 82:
668.]
27. Winnett AS, Sasieni P. A note on scaled Schoenfeld residuals for the proportional
hazards model. Biometrika 2001. (In press.).
28. Wells MT. Nonparametric kernel estimation in counting processes with explanatory
variables. Biometrika 1994; 81:795–801.
29. Sasieni P. Information bounds for the conditional hazard ratio in a nested family of
regression models. J R Stat Soc B 1992; 54:617–635.

30. Dabrowska DM. Smoothed Cox regression. Ann Stat 1997; 25:1510–1540.
31. LeBlanc M, Crowley J. Survival trees by goodness of split. J Am Stat Assoc 1993;
88:457–467.
32. Crowley J, LeBlanc M, Jacobson J, Salmon SE. Some exploratory tools for survival
analysis. In: Lin DY, Fleming TR, eds. Proceedings of the First Seattle Symposium
in Biostatistics: Survival Analysis. New York: Springer, 1997:199–229.
22
Tree-Based Methods for
Prognostic Stratification

Michael LeBlanc
Fred Hutchinson Cancer Research Center, Seattle, Washington

I. INTRODUCTION

Identification of groups of patients with differing prognosis is often desired to
understand the association of patient characteristics and survival times and for
aiding in the design of clinical trials. Applications can include the development
of stratification schemes for future clinical trials and identification of patients suit-
able for studies involving therapy targeted at a specific prognostic group. For
instance, identification of patients with poor prognosis may be useful to determine
eligibility for studies involving high-dose therapy for that disease.
Cox’s (1) proportional hazards model is a flexible tool for the study of
covariate associations with survival time. It has been used to identify prognostic
groups of patients by using the linear component of the model (prognostic index)
or informally through counting up the number of poor prognostic factors corre-
sponding to terms in the fitted model. However, the model does not directly lead
to an easily interpretable description of patient prognostic groups. An alternative
to using scores constructed from the Cox model is a rule that can be expressed
as simple logical combinations of covariate values. For example, an individual
with some hypothetical disease may be said to have poor prognosis if (age > 60)
and (serum creatinine > 2) or (serum calcium < 5) and (sex = male). Tree-
based methods, also called recursive partitioning methods, are techniques for
adaptively deriving these logical rules based on patient data.
Tree-based methods were formalized and extensively studied by Breiman
et al. (2). Trees have also recently been of interest in the machine learning community; one
example is the C4.5 algorithm due to Quinlan (3). Tree-based methods recur-
sively split the data into groups, leading to a fitted model that is piecewise
constant over regions of the covariate space. Each region is represented by a
terminal node in a binary decision tree. Tree-based methods have been extended
to censored survival data for the goal of finding groups of patients with differing
prognosis (4–8). Some examples of tree-based methods for survival data in clini-
cal studies are given in Albain et al. (9), Ciampi et al. (10), and Kwak et al. (11).
In this chapter, we discuss some of the general methodological aspects of
tree-based modeling for survival data. We illustrate the methods using data from
a clinical trial for patients with myeloma conducted by the Southwest Oncology
Group (SWOG). SWOG study 8624 entered patients between 1987 and 1990 and
showed a significant treatment effect (12). However, we combined the data from
the treatment arms for the prognostic analyses presented here.

II. NOTATION AND PIECEWISE CONSTANT MODEL

We assume that X is the true survival time, C is the random censoring time,
and Z is a p-vector of covariates. The observed variables are the triple (T = X
∧ C, ∆ = I{X ≤ C}, Z), where T is the time under observation and ∆ is an indicator
of failure. Given Z, we assume that X and C are independent. The data consist
of a sample of independent observations {(t_i, δ_i, z_i) : i = 1, 2, . . . , N} distributed
as the vector (T, ∆, Z). The survival probability at time t is denoted by

S(t \mid z) = P(X > t \mid z)
Trees represent approximating models that are homogeneous over regions
of the prediction space. The model for survival can be represented by the piecewise
model

S(t \mid z) = \sum_{h \in \tilde{T}} S_h(t)\, I\{z \in B_h\}

where B_h is a ‘‘box’’-shaped region in the predictor space, represented by a ter-
minal node h, the function S_h(t) is the survival function corresponding to
region B_h, and T̃ is the set of terminal nodes. Each terminal node can be de-
scribed by a logical rule, for instance, (z_1 < 3) ∩ (z_2 ≥ 3) ∩ (z_5 < 3). With
sample sizes typically available for clinical applications, a piecewise constant
model can yield quite poor approximations to the true conditional survival func-
tion, which is likely a smooth function of the underlying covariates. Smooth


methods such as linear Cox regression likely yield better function approximations
than tree-based methods. However, the primary motivation for using tree-based
models is the easy interpretation of the resulting regions or groups of patients, and
such rules are not directly obtained by methods that assume S(t|z) is a smooth
function of the covariates.
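To make the piecewise representation concrete, here is a small hypothetical sketch: the cutpoints echo the myeloma example discussed later in the chapter, but the exponential node survival functions and the way the tree is stored are invented purely for illustration, not the chapter's software.

import numpy as np

terminal_nodes = [
    # (logical rule defining the box B_h, node survival function S_h)
    (lambda z: z["creatinine"] < 1.7 and z["sb2m"] < 5.1, lambda t: np.exp(-0.10 * t)),
    (lambda z: z["creatinine"] < 1.7 and z["sb2m"] >= 5.1, lambda t: np.exp(-0.25 * t)),
    (lambda z: z["creatinine"] >= 1.7,                     lambda t: np.exp(-0.45 * t)),
]

def tree_survival(t, z):
    """Evaluate S(t|z) = sum_h S_h(t) I{z in B_h} for disjoint terminal-node boxes."""
    for rule, surv in terminal_nodes:
        if rule(z):
            return surv(t)
    raise ValueError("z falls in no terminal node; the boxes should partition the space")

print(tree_survival(2.0, {"creatinine": 1.2, "sb2m": 6.3}))  # survival of the node containing z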
The myeloma data from SWOG study 8624 presented in this chapter are
from 478 patients with complete data for sex, age, performance status, calcium,
creatinine, albumin, and serum β2 microglobulin.

III. CONSTRUCTING A TREE RULE

A tree-based model is developed by recursively partitioning the data. At the first
step the covariate space is partitioned into two regions and the data are split into
two groups. The splitting rule is applied recursively to each of the resulting re-
gions until a large tree has been grown. Splits along a single covariate are used
because they are easy to interpret. For an ordered covariate, splits are of the form
‘‘Z_j < c’’ or ‘‘Z_j ≥ c’’ and for a nominal covariate splits are of the form
‘‘Z_j ∈ S’’ or ‘‘Z_j ∉ S,’’ where S is a nonempty subset of the set of labels for
the nominal predictor Z_j. Potential splits are evaluated for each of the covariates,
and the covariate and split value resulting in the greatest reduction in impurity
is chosen.
The improvement for a split at node h into left and right daughter nodes
l(h) and r(h) is
G(h) = R(h) − [R(l(h)) + R(r(h))]     (1)
where R(h) is the residual error at a node. For uncensored continuous response
problems, R(h) is typically the mean residual sum of squares or mean absolute
error. For survival data, it would be reasonable to use deviance corresponding
to an assumed survival model. For instance, the exponential model deviance for
node h is

R(h) = \sum_i 2 [ \delta_i \log( \delta_i / (\hat{\lambda}_h t_i) ) - ( \delta_i - \hat{\lambda}_h t_i ) ]

where λ̂ h is the maximum likelihood estimate of the hazard rate in node h. Often
an exponential assumption for survival times is not valid. However, a nonlinear
transformation of the survival times may make the distribution of survival times
closer to an exponential distribution. LeBlanc and Crowley (7) investigate a
‘‘full-likelihood’’ method that is equivalent to transforming time by the marginal
cumulative hazard function and using the exponential deviance.
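The following sketch (assumed helper functions, not the published software) computes the exponential deviance R(h) of a node and the improvement G(h) of equation (1) for a proposed split, using the node maximum likelihood estimate λ̂_h = (number of deaths)/(total follow-up time).

import numpy as np

def exp_deviance(time, status):
    """Exponential-model deviance of one node; time and status are 1-d arrays."""
    time, status = np.asarray(time, float), np.asarray(status, int)
    lam = status.sum() / time.sum()                 # MLE of the node hazard rate
    dev = -2.0 * (status - lam * time)              # the -(delta_i - lambda*t_i) term, doubled
    events = status == 1
    dev[events] += 2.0 * np.log(1.0 / (lam * time[events]))  # delta_i*log(delta_i/(lambda*t_i))
    return float(dev.sum())

def split_improvement(time, status, go_left):
    """G(h) of equation (1) for the split sending observations with go_left True to l(h)."""
    time, status, go_left = map(np.asarray, (time, status, go_left))
    parent = exp_deviance(time, status)
    left = exp_deviance(time[go_left], status[go_left])
    right = exp_deviance(time[~go_left], status[~go_left])
    return parent - (left + right)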

Figure 1 Logrank test statistics for splits at observed values of creatinine.

However, most recursive partitioning schemes for censored survival data
use the logrank test statistic of Mantel (13) for G(h) to measure the separation
in survival times between two groups. Simulation studies of the performance of
splitting with logrank test statistic and some other between node statistics are
given in LeBlanc and Crowley (8) and Crowley et al. (14).
Figure 1 shows the value of the logrank test statistic for groups defined by
(creatinine < c) and (creatinine ≥ c) for observed values of the covariate for
the entire data set. The largest logrank test statistic corresponds to a split at c =
1.7 and would lead to the first split in a tree-based model to be (creatinine <
1.7) versus (creatinine ≥ 1.7).
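A sketch of this scan is given below (hypothetical helper names, not the chapter's software): logrank_statistic computes the two-sample logrank chi-square statistic for a proposed grouping, and scan_splits evaluates every split of the form ''covariate < c,'' honoring the minimum node size and minimum number of uncensored cases discussed in Section III.B below. The best split is then, for example, max(results, key=lambda r: r[1]).

import numpy as np

def logrank_statistic(time, status, group):
    """Two-sample logrank chi-square statistic; group is a boolean array."""
    time, status, group = map(np.asarray, (time, status, group))
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(time[status == 1]):
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & group).sum()
        d = ((time == t) & (status == 1)).sum()
        d1 = ((time == t) & (status == 1) & group).sum()
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return (o_minus_e ** 2) / var if var > 0 else 0.0

def scan_splits(time, status, covariate, min_node=40, min_events=5):
    """Logrank statistic for every split 'covariate < c', honoring node-size limits."""
    time, status, covariate = map(np.asarray, (time, status, covariate))
    results = []
    for c in np.unique(covariate)[1:]:             # split between distinct observed values
        left = covariate < c
        if min(left.sum(), (~left).sum()) < min_node:
            continue
        if min(status[left].sum(), status[~left].sum()) < min_events:
            continue
        results.append((c, logrank_statistic(time, status, left)))
    return results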

A. Updating Splitting Statistics


It is also important that the splitting statistic can be calculated efficiently for all
possible split points for continuous covariates. While the logrank test is relatively
inexpensive to calculate, one way to improve computational efficiency is to
use a simple approximation to the logrank statistic that allows simple updating
algorithms to consider all possible splits (15). Updating algorithms can also be
constructed for exponential deviance; for instance, see Davis and Anderson (16)
and LeBlanc and Crowley (15).
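As a hedged sketch of the idea (the details differ from the algorithms in the cited references), the exponential deviance case is particularly simple: the per-observation log t_i terms cancel in G(h) = R(h) − R(l(h)) − R(r(h)), so every split of a sorted continuous covariate can be scored in one pass from running totals of deaths and follow-up time.

import numpy as np

def scan_splits_exponential(time, status, covariate):
    """Split improvement G(h) for every threshold, computed from running totals only."""
    time, status, covariate = map(np.asarray, (time, status, covariate))
    order = np.argsort(covariate)
    time, status, covariate = time[order], status[order], covariate[order]
    D, T = status.sum(), time.sum()                # parent totals (deaths, follow-up time)

    def node_term(d, t):
        # contribution d*log(d/t); the limit as d -> 0 is 0
        return d * np.log(d / t) if d > 0 else 0.0

    d_left = t_left = 0.0
    results = []
    for i in range(len(time) - 1):
        d_left += status[i]
        t_left += time[i]
        if covariate[i] == covariate[i + 1]:       # only split between distinct values
            continue
        d_right, t_right = D - d_left, T - t_left
        g = 2.0 * (node_term(d_left, t_left) + node_term(d_right, t_right) - node_term(D, T))
        results.append((covariate[i + 1], g))      # the split 'covariate < covariate[i+1]'
    return results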

B. Constraints on Splitting
If there are weak associations between the survival times and covariates, splitting
on a continuous covariate tends to select splits that send almost all the observa-
tions to one side of the split. This is called ‘‘end-cut’’ preference by Breiman et
al. (2). When growing survival trees, we restrict both the minimum total number
of observations and the minimum number of uncensored observations within any
potential node. This restriction is also important for prognostic stratification, since
very small groups of patients are usually not of clinical interest.
Figure 2 Unpruned survival tree. Below each split is the logrank test statistic and a permutation p value. Below each terminal node is the logarithm of the hazard ratio relative to the leftmost node in the tree and the number of cases in the node.

Figure 2 shows the tree grown using the logrank test statistic for splitting
and with a constraint of a minimum node size of 40 observations and 5 uncen-
sored cases. The tree has nine terminal nodes. Below each split in the tree the
logrank test statistic and permutation p value are presented. The p value is calcu-
lated at each node by permuting the responses over the covariates and recalculat-
ing the best split at the node 1000 times. At each terminal node, the logarithm
of the hazard ratio relative to the left-most node and the number of cases falling
into each terminal node are presented. The logarithm of the hazard ratio is ob-
tained by fitting a Cox (1) model with dummy variables defined by terminal nodes
in the tree. The worst prognostic group is patients with (creatinine ≥ 1.7) and
(albumin < 3.3) and corresponds to an estimated logarithm of the hazard ratio rela-
tive to the best prognostic group equal to 1.7. While the minimum node size was
set to be quite large (40 observations), the logrank test statistics near the bottom
of the tree (and permutation p values) indicate there may be several nodes that
should be combined to simplify the model.

IV. PRUNING AND SELECTING A TREE

Two general methods have been proposed for pruning trees for survival data.
The methods that use within-node error or deviance usually adopt the Classifica-
tion and Regression Tree (CART) pruning algorithm directly.

A. Methods Based on Within-Node Deviance


In the CART algorithm, the performance of a tree is based on the cost-complexity
measure

R_\alpha(T) = \sum_{h \in \tilde{T}} R(h) + \alpha |\tilde{T}|

of the binary tree T, where T̃ is the set of terminal nodes, |T̃| is the number of
terminal nodes, α is a nonnegative parameter, and R(h) is the cost (often devi-
ance) of node h.
A subtree (a tree obtained by removing branches) T_o is an optimally pruned
subtree of the tree T for any penalty α if

R_\alpha(T_o) = \min_{T' \preceq T} R_\alpha(T')

where ‘‘⪯’’ means ‘‘is a subtree of,’’ and T_o is the smallest optimally pruned
subtree if T_o ⪯ T″ for every optimally pruned subtree T″.
The cost-complexity pruning algorithm obtains the optimally pruned
subtree for any α. This algorithm finds the sequence of optimally pruned subtrees
by repeatedly deleting branches of the tree for which the average reduction in
impurity per split in the branch is small. The cost-complexity pruning algorithm
is necessary for finding optimal subtrees because the number of possible subtrees
grows very rapidly as a function of tree size.
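A compact sketch of the weakest-link idea follows (an illustrative Node class and cost field, not the CART implementation): each pass collapses the internal node whose branch gives the smallest average decrease in cost per additional terminal node, which generates the nested sequence of pruned subtrees.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    cost: float                        # R(h), e.g., the node deviance
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self):
        return self.left is None and self.right is None

def branch_summary(node):
    """Total terminal-node cost and number of terminal nodes in the branch rooted at node."""
    if node.is_leaf():
        return node.cost, 1
    lc, ln = branch_summary(node.left)
    rc, rn = branch_summary(node.right)
    return lc + rc, ln + rn

def weakest_link(node):
    """Internal node with smallest g(h) = (R(h) - R(branch)) / (#branch leaves - 1)."""
    if node.is_leaf():
        return None
    branch_cost, n_leaves = branch_summary(node)
    best = ((node.cost - branch_cost) / (n_leaves - 1), node)
    for child in (node.left, node.right):
        cand = weakest_link(child)
        if cand is not None and cand[0] < best[0]:
            best = cand
    return best

def prune_sequence(root):
    """Repeatedly collapse the weakest link, recording tree sizes until only the root remains."""
    sizes = [branch_summary(root)[1]]
    while not root.is_leaf():
        _, node = weakest_link(root)
        node.left = node.right = None              # collapse the branch into a terminal node
        sizes.append(branch_summary(root)[1])
    return sizes

Each collapse corresponds to one value of α in the cost-complexity sequence; selecting among the resulting subtrees is then done with a test sample or by resampling, as described next.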
The deviance will always decrease for larger trees in the nested sequence
based on the data used to construct the tree. Therefore, some way of honestly
estimating deviance for a new sample is required to select a tree that would have
small expected deviance. If a test sample is available, the deviance for the test
sample can be calculated for each of the pruned trees in the sequence using the
node estimates from the training sample. For instance, the deviance at a node
would be


R(h) = \sum_i 2 [ \delta_{Ti} \log( \delta_{Ti} / (\hat{\lambda}_h t_{Ti}) ) - ( \delta_{Ti} - \hat{\lambda}_h t_{Ti} ) ]

where (t_{Ti}, δ_{Ti}) are the test sample survival times and status indicators for test
sample observations falling into node h, z_{Ti} ∈ B_h, and λ̂_h is the node estimate
calculated from the learning sample.
However, usually a test sample is not available. Therefore, the selection
of the best tree can be based on resampling-based estimates of prediction error
(or expected deviance). The most popular method for tree-based models is the
K-fold cross-validation estimate of deviance. The training data, ℒ, are divided
up into K test samples ℒ_k and training samples ℒ^(k) = ℒ − ℒ_k, k = 1, . . . , K,
of about equal size. Trees are grown with each of the training samples ℒ^(k); each
test sample ℒ_k is used to estimate the deviance using the parameter estimates
from the training sample ℒ^(k). The K-fold cross-validation estimate of deviance is
the sum of the test sample estimates. The tree that minimizes the cross-validation
estimate of deviance (or a slightly smaller tree) is selected. While K-fold cross-
validation is a standard method for selecting tree size, it is subject to considerable
variability; this is noted for survival data in simulations given in LeBlanc and
Crowley (8). Therefore, other methods such as those based on bootstrap resam-
pling may be useful alternatives (17). One bootstrap method for methods based
on logrank splitting is given in the next section.

B. Methods Based on Between-Node Separation


LeBlanc and Crowley (8) developed an optimal pruning algorithm analogous to
the cost-complexity pruning algorithm of CART for tree performance based on
between-node separation. They define the split-complexity of a tree as

G_\alpha(T) = G(T) - \alpha |S|

where G(T) is the sum over the standardized splitting statistics, G(h), in the tree
T:

G(T) = \sum_{h \in S} G(h)

where S represents the set of internal nodes of T.


A tree T_o is an optimally pruned subtree of T for complexity parameter α if

G_\alpha(T_o) = \max_{T' \preceq T} G_\alpha(T')

and it is the smallest optimally pruned subtree if T_o ⪯ T′ for every optimally
pruned subtree. The algorithm repeatedly prunes off branches with the smallest aver-
age logrank test statistics in the branch. An alternative pruning method for trees
based on the maximum value of the test statistic within any branch was proposed
by Segal (6).
Since the same data are used to select the split point and variable as used to
calculate the test statistic, we use a bias-corrected version of the split complexity
described above,

G_\alpha(T) = \sum_{h \in S} G^*(h) - \alpha |S|

where the corrected split statistic is

G^*(h) = G(h) - \Delta^*(h)

and where the bias (optimism) is denoted by

\Delta^*(h) = E_{Y^*} G(h; \mathcal{L}^*, \mathcal{L}^*) - E_{Y^*} G(h; \mathcal{L}^*, \mathcal{L})

The function G(h; ℒ*, ℒ) denotes the test statistic where the data ℒ* were used
to determine the split variable and value and the data ℒ were used to evaluate
the statistic. The function G(h; ℒ*, ℒ*) denotes the statistic where the same
data were used to pick the split variable and value and to calculate the test statistic.
The difference ∆*(h) is the optimism due to adaptive splitting of the data. We
use the bootstrap to obtain an estimate ∆̂*(h), and then we select trees that minimize
the corrected goodness of split

\tilde{G}_\alpha(T) = \sum_{h \in S} (G(h) - \hat{\Delta}^*(h)) - \alpha |S|

Note that G̃_α(T) is similar to the bias-corrected version of split complexity
used in LeBlanc and Crowley (8), except that here we do the correction locally for
each split conditional on splits higher in the tree. We typically chose a complexity
parameter of α = 4. Note that if splits were not selected adaptively, α = 4
would correspond approximately to the 0.05 significance level for a split, and
α = 2 is in the spirit of AIC (18). Permutation sampling methods can also be
used to add an approximate p value to each split conditional on the tree structure
above the split to help the interpretation of individual splits.

Figure 3 Pruned survival tree. Below each split is the logrank test statistic and permutation p value. Below each terminal node is the median survival and the number of patients represented by that node.
Figure 3 shows a pruned tree based on the corrected goodness of split using
25 bootstrap samples with α = 4. There are five terminal nodes. For example,
the patients represented by (creatinine < 1.7) and (serum β2 microglobulin <
5.1) and (age < 62) have the best prognosis, with an estimated median survival
of 5.1 years.
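A minimal sketch of the bootstrap optimism estimate is given below; it reuses the logrank_statistic and scan_splits helpers sketched in Section III, and the number of bootstrap replicates (25, as in the example above) is a tuning choice, not a requirement.

import numpy as np

def split_optimism(time, status, covariate, n_boot=25, rng=None):
    """Bootstrap estimate of the optimism Delta*(h) for the best split on one covariate."""
    rng = np.random.default_rng(rng)
    time, status, covariate = map(np.asarray, (time, status, covariate))
    n, diffs = len(time), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                            # bootstrap sample L*
        candidates = scan_splits(time[idx], status[idx], covariate[idx])
        if not candidates:
            continue
        c_star, g_boot = max(candidates, key=lambda r: r[1])   # G(h; L*, L*)
        g_orig = logrank_statistic(time, status, covariate < c_star)  # G(h; L*, L)
        diffs.append(g_boot - g_orig)
    return float(np.mean(diffs)) if diffs else 0.0

# The corrected statistic for the observed best split at cutpoint c is then
# G*(h) = logrank_statistic(time, status, covariate < c) - split_optimism(time, status, covariate).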

V. FURTHER RECOMBINATION OF NODES

Usually, only a small number of prognostic groups are of interest. Therefore,
further recombination of nodes with similar prognosis from the pruned tree may
be required. We select a measure of prognosis (for instance, hazard ratios relative
to some node in the tree or median survival for each node) and rank each of the
terminal nodes in the pruned tree based on the measure of prognosis selected.
After ranking the nodes, there are several options for combining nodes in a pruned
tree. One method would be to grow another tree on the ranked nodes and only
allow the second tree to select three or four nodes; another method would be to
divide the nodes based on quantiles of the data; and a third method would be to
evaluate all possible recombinations of nodes into V groups and choose the parti-
tion that yields the largest partial likelihood or largest V-sample logrank test
statistic. The result of recombining to yield the largest partial likelihood for a
three-group combination of the pruned myeloma tree given in Figure 3 is pre-
sented in Figure 4.
Figure 4 Survival for the three prognostic groups derived from the pruned survival tree. The best prognostic group corresponds to nodes 1 and 2, the middle prognostic group corresponds to nodes 3 and 5, and the worst prognosis group corresponds to node 4 (numbered from the left on Fig. 3).

The results of the prognostic staging scheme can be compared with the currently
used prognostic staging system, the Durie-Salmon stage (19), which is based
on the number of tumor cells categorized into three groups (I, II, III) and a
classification of kidney function. Figure 5 shows the systems collapsed into three
groups (I–II, IIIA, IIIB). However, the tree-based prognostic stratification is
adaptively chosen based on the response, so we would expect some shrinkage
of the differences on a test data set.

Figure 5 Survival by Durie-Salmon stages (I–II, IIIA, IIIB).
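A sketch of the recombination step described in this section is shown below (hypothetical helper; the score function, e.g., a V-sample logrank statistic or the Cox partial likelihood for the grouping, is assumed to be supplied by the user): it enumerates every way of cutting the ranked terminal nodes into V contiguous groups and keeps the best-scoring grouping. For the five ranked nodes of Figure 3 and V = 3, only six candidate groupings need to be examined.

from itertools import combinations

def best_recombination(ranked_nodes, V, score):
    """ranked_nodes: node labels ordered by prognosis; score(groups) -> float (larger is better)."""
    best_groups, best_value = None, float("-inf")
    # choosing V groups of consecutive ranked nodes = choosing V - 1 cutpoints
    for cuts in combinations(range(1, len(ranked_nodes)), V - 1):
        bounds = (0,) + cuts + (len(ranked_nodes),)
        groups = [tuple(ranked_nodes[a:b]) for a, b in zip(bounds[:-1], bounds[1:])]
        value = score(groups)
        if value > best_value:
            best_groups, best_value = groups, value
    return best_groups, best_value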

VI. OTHER ISSUES


A. Competing Splits
The tree model only shows the split point and variable (and test statistic) for the
best split at each node. However, there may be alternative partitions that yield
only slightly smaller test statistics. Therefore, it can be desirable for a given node
to automatically investigate the performance of other splits on other variables.
Figure 6 shows the logrank test statistics for potential splits at the first split in
the tree. For instance, while the creatinine < 1.7 split yielded the largest logrank
test statistic, the plots show that many values of serum β2 microglobulin between
5 and 15 would also yield large test statistics. Given that the tree (Fig. 3) subse-
quently splits on serum β 2 microglobulin, it may be reasonable to investigate
alternative tree models that only use serum β 2 microglobulin. This could also be
supported by the biology since serum β 2 microglobulin measures the impact of
the disease on kidney function and tumor volume whereas creatinine focuses on
kidney function.

Figure 6 Logrank test statistics for splits on continuous variables at the root node of the tree.

B. Stability of a Split
It has been recognized that the mechanism of selecting a best split and the re-
cursive partitioning of data leading to smaller and smaller data sets to be consid-
ered can lead to instability in the tree structure in some data sets. Note this does
not necessarily mean that the predictions are unstable. However, one easily imple-
Tree-Based Methods 469

mented way to understand the stability of the structure of the tree is to draw
bootstrap samples of the observed data and recalculate the splitting statistics.
Figure 7 below shows the frequency of the variables and splitting values chosen
as the first split in the tree over 100 bootstrap samples. All the bootstrap splits
were on either creatinine or serum β2 microglobulin. This reflects both the prog-
nostic importance of these two variables and the significant correlation between
them. The best splits from the bootstrap show a strong peak for creatinine at 1.7;
however, for serum β2 microglobulin the best splits ranged widely between 5 and
15. Note that overall 63% of the bootstrap samples split on serum β2 microglobulin
and 37% split on creatinine. Thus, while split point selection on serum β2 micro-
globulin may be unstable, it still leads to important prognostic groupings over a
wide range of split points.
This may lead an investigator to define the groups not at the optimal split on
serum β2 microglobulin but at one chosen so that a sufficient number of patients
fall into the good or poor prognosis groups to meet the goals of the prognostic
analysis.
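The bootstrap tabulation just described can be sketched as follows (it reuses the scan_splits helper from Section III; the function name and the covariate dictionary layout are hypothetical).

import numpy as np
from collections import Counter

def first_split_frequencies(time, status, covariates, n_boot=100, rng=None):
    """Tabulate which variable (and cutpoint) gives the best first split in each bootstrap sample."""
    rng = np.random.default_rng(rng)
    time, status = np.asarray(time), np.asarray(status)
    n = len(time)
    chosen = Counter()
    cutpoints = {name: [] for name in covariates}
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                         # one bootstrap sample
        best_name, best_c, best_g = None, None, float("-inf")
        for name, x in covariates.items():
            for c, g in scan_splits(time[idx], status[idx], np.asarray(x)[idx]):
                if g > best_g:
                    best_name, best_c, best_g = name, c, g
        if best_name is not None:
            chosen[best_name] += 1
            cutpoints[best_name].append(best_c)
    return chosen, cutpoints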

Figure 7 Frequency of first splits for 100 bootstrap samples.



In addition, we investigated the potential for lack of stability resulting in
variable survival estimates for the prognostic groups. We grouped the nodes of
trees grown with a minimum node size of 40 observations into three prognostic
groups for each of 100 bootstrap samples. We did not prune the trees so that the
three prognostic groups included approximately the same numbers of patients
across bootstrap samples. Figure 8 shows histograms of the median survival esti-
mates for the prognostic groups obtained from the 100 bootstrap sample data
sets. The bootstrap distributions of the median survival estimates for the three groups
are widely separated.

Figure 8 Median survival estimates for three-group prognostic stratification based on 100 bootstrap samples.

VII. DISCUSSION

The primary objective of tree-based methods with censored survival data is to
provide an easy-to-understand description of groups of patients. The highly adaptive
tree-based procedures require resampling methods for model selection. Resam-
pling methods can be useful in understanding the stability of the tree structures.
Software implementing tree-based methods based on logrank splitting us-
ing SPLUS is available from the first author. Other software has been written
for tree-based modeling for survival data. The RPART program implements
recursive partitioning based on within-node error, which includes exponential
model-based survival trees (20). Another implementation, based on the logrank
test statistic and due to Segal (6), is available from that author and is called
TSSA.

ACKNOWLEDGMENTS

We thank Drs. Salmon, Coltman, and Barlogie for encouragement to use the
myeloma data. Supported by the U.S. NIH through NCI 2 P01 CA 53996.

REFERENCES

1. Cox DR. Regression models and life-tables [with discussion]. J R Stat Soc B 1972;
34:187–220.
2. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees.
Belmont, CA: Wadsworth, 1984.
3. Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
4. Gordon L, Olshen RA. Tree-structured survival analysis. Cancer Treat Rep 1985;
69:1065–1069.
5. Ciampi A, Hogg S, McKinney S, Thiffaut J. RECPAM: a computer program for
recursive partition and amalgamation for censored survival data. Computer Methods
Progr Biomed 1988; 26:239–256.
6. Segal MR. Regression trees for censored data. Biometrics 1988; 44:35–48.
7. LeBlanc M, Crowley J. Relative risk regression trees for censored survival data.
Biometrics 1992; 48:411–425.
8. LeBlanc M, Crowley J. Survival trees by goodness of split. J Am Stat Assoc 1993;
88:457–467.
9. Albain K, Crowley J, LeBlanc M, Livingston R. Determinants of improved outcome
in small cell lung cancer: an analysis of the 2580 patient Southwest Oncology Group
data base. J Clin Oncol 1990; 8:1563–1574.
10. Ciampi A, Thiffault J, Nakache JP, Asselain B. Stratification by stepwise regression,
correspondence analysis and recursive partition. Comput Stat Data Anal 1986; 4:
185–204.
11. Kwak LW, Halpern J, Olshen RA, Horning SJ. Prognostic significance of actual
dose intensity in diffuse large-cell lymphoma: results of a tree-structured survival
analysis. J Clin Oncol 1990; 8:963–977.
12. Salmon SE, Crowley J, Grogan TM, Finley P, Pugh RP, Barlogie B. Combination
chemotherapy, glucocorticoids, and interferon alpha in the treatment of multiple my-
eloma: a Southwest Oncology Group study. J Clin Oncol 1994; 12:2405–2414.
13. Mantel N. Evaluation of survival data and two new rank order statistics arising in
its consideration. Cancer Chemother Rep 1966; 50:163–170.
14. Crowley J, LeBlanc M, Gentleman R, Salmon S. Exploratory methods in survival
analysis. In: Koul HL, Deshpande JV (eds). Analysis of Censored Data, IMS Lecture
Notes. Monograph Series 27. Hayward, CA: Institute of Mathematical Statistics.
1995, pp. 55–77.
15. LeBlanc M, Crowley J. Step-function covariate effects in the proportional-hazards
model. Can J Stat 1995; 23:109–129.
16. Davis R, Anderson J. Exponential survival trees. Stat Med 1989; 8:947–962.
17. Efron B, Tibshirani R. Improvements on cross-validation: the .632 bootstrap. J Am
Stat Assoc 1997; 92:548–560.
18. Akaike H. A new look at the statistical model identification. IEEE Trans Automatic Control 1974;
19:716–723.
19. Durie BGM, Salmon SE. A clinical staging system for multiple myeloma. Correlation of
measured myeloma cell mass with presenting clinical features, response to treatment
and survival. Cancer 1975; 36:842–854.
20. Therneau TM, Atkinson EJ. An introduction to recursive partitioning using the
RPART routines. Technical report, Mayo Foundation (Distributed in PostScript with
the RPART package), 1997.
23
Problems in Interpreting Clinical Trials

Lillian L. Siu and Ian F. Tannock


Princess Margaret Hospital, Toronto, Ontario, Canada

I. INTRODUCTION

The oncological literature is overwhelmed by its unceasing abundance of pub-
lished clinical trials, yet unfortunately many of these trials are performed with
little or no chance of improving clinical practice. Some trials may be well con-
ducted but ask irrelevant questions, whereas others attempt to address important
issues with poor methodology. This chapter focuses on key elements that need
to be considered when planning or evaluating a clinical trial (Table 1).

Table 1 Key Questions when Evaluating a Clinical Trial

• Does the trial address an important question?
• Is the design of the study appropriate?
  Single arm versus randomized?
  Nonblinded versus blinded?
  Adequate sample size?
• Are the end points appropriate?
  Are they well defined?
• Does the report of the study reflect its results?
  What is the probability of false-positive or false-negative results?
  Was the analysis performed in a rigorous way?
  Are all the patients accounted for?
  Do the results address the primary end points and hypotheses?
• Do the results fit with clinical experience (external validity)?
• Are the results generalizable such that they should influence oncologic practice?

II. IMPORTANCE AND RELEVANCE OF THE QUESTION

Therapeutic advances in oncology that alter patient outcome and clinical practice
occur as a consequence of a multistep process that includes formulation of a perti-
nent question, generation of an appropriate hypothesis, testing of the hypothesis
in the setting of properly conducted clinical trial(s), and careful interpretation
and accurate presentation of trial results. Priorities in cancer research should be
directed toward strategies that lead to an improved therapeutic index, either
through the modification of existent methods or by the development of novel
approaches. Far too often time and resources are expended on trivial issues that
do not ultimately benefit patients. Thoughtful scrutiny of the research question
to determine its clinical relevance is an essential first step toward a successful
investigation.
How might one determine whether the question being raised in a clinical
trial is important? In a survey examining treatment preferences of physicians for
different presentations of non-small cell lung cancer (NSCLC), Mackillop et al.
(1) suggested that expert physicians might act as surrogates for their patients and
their input might be helpful for ethics committees in evaluating the appropriate-
ness of clinical trials before they are conducted. A subsequent survey showed that
lay surrogates were unable to discern differences in the acceptability of clinical
experiments that were clear to experts, and most would appreciate access to ex-
pert opinions before consenting to trial participation (2). For example, lay surro-
gates found initially that trials comparing lobectomy versus segmentectomy for
operable NSCLC and five different regimens of chemotherapy for treatment of
metastatic NSCLC to be equally acceptable. When preferences of expert physi-
cians to accept the first trial but reject the second were made known to the lay
surrogates, their acceptance of the second trial declined dramatically. Expert phy-
sicians are familiar with the benefits and risks of diagnostic or therapeutic inter-
ventions and can therefore make an informed decision about enrollment into a
clinical study. Ideally, clinical trials into which a majority of physician surrogates
would enter themselves are the ones addressing scientifically sound, clinically
relevant, and ethically acceptable questions.
Although expert physicians are the logical candidates to evaluate the valid-
ity of a trial question, several factors may influence their views, such as their
specialty training and their geographic location. Utilizing the physician surrogate
method in an attempt to define various controversies in the management of genito-
urinary malignancies, Moore et al. (3) distributed questionnaires to urologists,
radiation oncologists, and medical oncologists who treated genitourinary cancer
in Canada, Great Britain, and the United States. There was a tendency for respon-
dents to choose their own treatment modality, and British urologists were collec-
tively more conservative than their North American colleagues with respect to
their recommendation of radical surgery or chemotherapy. A follow-up survey
was mailed to the same physicians summarizing the treatment selections from
the initial questionnaire (4). Some of the scenarios in the questionnaire illustrated
clinical equipoise, a state of uncertainty in which there is no consensus within
the expert community about the comparative merits of the alternatives to be tested
(5). Recognition of the existence of controversy among experts did not substan-
tially alter physicians’ personal biases about entering themselves on trials, but a
greater proportion were willing to offer such options to their patients. Although
this disparity might reflect a double standard, it seems justifiable that at least
some physicians are not imposing their individual beliefs onto their patients, espe-
cially when these beliefs are not founded on objective information or scientific
proof.
Another notable observation that arose from the same survey is the readi-
ness of investigators to accrue to studies that are relatively easy to perform but
do not settle important clinical dilemmas (4). For example, a study in metastatic
renal cell carcinoma that compares interferon alone with interferon plus vinblas-
tine is feasible but uninspiring, since both treatment arms include agents of low
activity and are unlikely to yield a breakthrough in the management of this dis-
ease. In contrast, a study that strives to answer a fundamental question, such as
one comparing the use of radical prostatectomy and radical radiotherapy for local-
ized prostate cancer, will likely suffer from poor patient recruitment. Paradoxi-
cally, expert physicians agree almost unanimously that the comparison of these
two drastically different strategies represents a high priority for clinical research,
but many are reluctant to rectify it through recruitment of patients to a randomized
controlled trial. Nonevidence-based personal bias on the part of physicians can
be a deterrent to the execution of clinical trials that answer relevant questions,
and conscious efforts to eliminate this factor are warranted.

III. DESIGN OF THE STUDY

As in any rigorous scientific experiment, a clinical trial that is being undertaken
to address a pertinent issue in cancer research must be goal directed and hypothe-
sis driven. New treatments that appear promising after phase I studies of toxicity
and feasibility and phase II studies of biological activity need to be tested in
comparison with the currently accepted standard. These phase III studies gener-
ally require randomization of patients to receive either the new or the conven-
tional therapy, thus ensuring that patient characteristics and other known prognos-
tic factors are well balanced between the treatment arms.
The main objectives of early noncomparative clinical studies of a new anti-
cancer therapy are to establish feasibility and to estimate biological activity. Ran-
domized control patients are not needed, and their presence would only obscure
the real purposes of these studies and delay their completion (6,7). Results ob-
tained from nonrandomized trials may help to generate new hypotheses but
should not be compared with those from another institution or from historical
experience at the same institution to draw definitive conclusions. Patients in non-
randomized trials may differ substantially in several factors other than their treat-
ment, such that indirect comparisons of trial outcome often lead to false claims
of therapeutic superiority that are not supported by subsequent randomized trials
(8). Selection of patients with favorable characteristics such as good performance
status and normal organ function and the provision of close medical attention
while on trial can contribute to a better outcome in trial subjects than their histori-
cal or nonstudy counterparts. Furthermore, it is a general phenomenon that partic-
ipants in clinical trials tend to have better survival than nonparticipants (9,10).
Attempts to match nonrandomized groups of patients by stage do not ensure
their comparability. Stage migration refers to the shift in stage assignment produced
by newer and more sensitive imaging techniques such as computed tomography
and magnetic resonance scanning. Patients with disease that would have been
missed by previously used methods are upstaged, and the inclusion of patients
with lower tumor bulk into a higher stage category yields a better prognosis for
that stage. This effect can produce an apparent improvement in survival for each
stage but no change in overall survival (Fig. 1). Stage migration impedes the
comparison of currently staged patients with those staged with prior techniques.
Feinstein et al. (11) named this effect of stage migration the Will Rogers
phenomenon, based on the quotation that ‘‘when the Okies left Oklahoma and
moved to California they raised the average intelligence level in both states.’’

Figure 1 Stage migration. The diagram illustrates that a change in staging investigations may lead to the apparent improvement of results within each stage without changing the overall results. In the hypothetical example, patients with a given type of malignancy have a spectrum of disease and are represented by equal cohorts whose survival is 100% (least disease), 80%, 60%, 40%, 20%, and 0% (most disease). Older staging investigations classify these cohorts into stages as shown at the top. Newer and more sensitive staging investigations classify the cohorts into stages as shown at the bottom. There is an apparent improvement in survival when compared stage by stage, but overall survival of 50% remains unchanged. (Adapted from Ref. 67.)
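The arithmetic behind Figure 1 can be reproduced with the six equal cohorts; the particular regrouping below is hypothetical (the worst cohort in stage I–II moves to IIIA and the worst cohort in IIIA moves to IIIB under the newer investigations), but it shows each stage improving while overall survival stays at 50%.

cohorts = [100, 80, 60, 40, 20, 0]          # percent surviving, equal-sized cohorts

old_staging = {"I-II": [100, 80], "IIIA": [60, 40], "IIIB": [20, 0]}
new_staging = {"I-II": [100],     "IIIA": [80, 60], "IIIB": [40, 20, 0]}

def stage_survival(staging):
    """Average survival within each stage."""
    return {stage: sum(vals) / len(vals) for stage, vals in staging.items()}

print(stage_survival(old_staging))   # {'I-II': 90.0, 'IIIA': 50.0, 'IIIB': 10.0}
print(stage_survival(new_staging))   # {'I-II': 100.0, 'IIIA': 70.0, 'IIIB': 20.0}
print(sum(cohorts) / len(cohorts))   # overall survival stays 50.0 under either staging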
A further source of bias that exists among both randomized and nonran-
domized trials is the propensity for investigators to continue and to publish those
with positive outcome. For nonrandomized trials, comparison of such selected
results with historical data precludes any valid conclusions. Suppose a new treat-
ment that produces no net benefit is being tested in several pilot studies; then
even by chance alone the first few patients in some studies will do well, whereas
the first few patients in others will do poorly. The former studies are likely to
be continued, whereas the latter studies may be quickly abandoned. The adoption
of a ‘‘two-stage’’ clinical trial design may reduce this bias and allow a treatment
to be evaluated in an adequate number of subjects before the decision is made
to proceed or to stop. The error probabilities of treatment acceptance or rejection
are predetermined by the investigators (12). Publication bias is another cause of
inappropriate optimism concerning the value of a new treatment. It is a well-
recognized occurrence, as positive findings are much more frequently submitted,
presented, and published than negative results (13,14), creating a preponderance
of false-positive studies in the literature.
All clinical trials should address a hypothesis. A hypothesis is a presump-
tion made a priori, suggested usually by previous preclinical and/or clinical ex-
perience. A clear statement of the hypothesis of interest requires that both the de-
pendent (outcome) variables and the independent (treatment, prognostic factors)
variables are explicitly described (15). Ideally, a null hypothesis and an alterna-
tive hypothesis should be specified before the implementation of a research study
(16). In phase III studies, the null hypothesis assumes that there is no treatment
effect between the control and the experimental arms and that any observed dif-
ference is due to chance, whereas the alternative hypothesis states the opposite.
Predetermination of a decision rule is necessary, such that at the completion of
the study, one should be capable of disproving the null hypothesis and accepting
the alternative, or vice versa, based on the trial results obtained, and of drawing mean-
ingful conclusions. For example, in the first randomized trials of adjuvant chemo-
therapy in breast cancer patients with positive axillary nodes (17,18), the null
hypotheses proposed that no difference existed between the control and the che-
motherapy arms. Significant reductions in treatment failures observed in the che-
motherapy arms in these studies refuted the null hypotheses and supported the
alternative hypotheses, confirming the benefit of adjuvant chemotherapy in this
patient population.

IV. APPROPRIATENESS OF THE END POINT

The end point of a clinical trial should reflect the outcome that the research ques-
tion sets out to measure. Typically, phase II clinical trials that evaluate the biolog-
ical activity of a chemotherapeutic agent or combination of agents against specific
malignancies use tumor shrinkage or response as their main end point. It should
be emphasized that tumor response is a measure of biological activity and not
of benefit to patients; for example, a very toxic drug might cause tumor shrinkage
but might also decrease the patients’ quality of life without increasing survival.
Phase III clinical trials ask questions about benefit to patients; appropriate end
points for these trials are therefore duration and quality of survival (Table 2).

Table 2 End Points in Clinical Trials

1. Explanatory or biological end points (used in phase II clinical trials)
   a. Tumor response
   b. Marker reduction
2. Pragmatic end points (used in phase III clinical trials)
   a. Survival
   b. Quality of life
3. Surrogate end points (that predict pragmatic end points)
Anderson et al. (19) pointed out the fallacy of analyzing survival as a func-
tion of tumor response category. The association of better prognosis with re-
sponse to treatment does not imply cause and effect. Pretreatment characteristics
of patients that lead to a favorable prognosis (e.g., high performance status) are
often the same as those that yield a good response to therapy. Thus, response
may simply be a marker for the patient subgroup who were destined to fare

well regardless of therapy. Likewise, the assumption that response to treatment
automatically translates into an improvement in quality of life is erroneous.
Symptomatic relief obtained from anticancer therapy is likely to require tumor
shrinkage, but this may be offset by treatment-induced toxicity; consequently,
the overall effect on patients’ quality of life can be variable.
There is wide disparity in reported rates of tumor response when patients
with the same type of cancer are treated with similar chemotherapy. It is meaning-
less to talk about a given drug regimen (e.g., a combination of drugs A, B, and C)
having, for example, a 28% response rate for NSCLC. The reported response
rates using 5-fluorouracil for the treatment of metastatic colorectal cancer ranged
from 8% to 85% (20), whereas those of methotrexate for head and neck cancer
ranged from 8% to 60% (21). These large differences may be accounted for partly
by factors that exert a true impact on tumor response, such as differences in the
treatment protocol (e.g., dosage and schedule), differences in patient characteris-
tics (e.g., performance status and sites of metastatic disease), and differences in
quality of care (Table 3). Comparison of response rates between nonrandomized
series is rendered invalid, even if the same drug doses and schedule are used
because of the patient-based factors that influence response rates. Other causes
of diversity in reported response rates are artifactual and arise from differences
in the way investigators gather and report their data (22). Considerable heteroge-
neity exists in the criteria that have been used to determine response to therapy.
Such heterogeneity can create confusion in the interpretation of trial results and
render intercomparison of clinical trials difficult or impossible (23,24). The
World Health Organization and Cooperative Groups established more uniform
response criteria that have helped to standardize outcome assessment in clinical
trials of efficacy (25–27). Despite such efforts, limitations exist in the traditional
response criteria that are used to assess the biological activity of chemotherapeu-
tic agents. For example, the categorizations of minimal response (25–50% reduc-
tion in the sum of cross-sectional areas of index lesions) and stable disease
(<25% change in area), unless of meaningful duration, do not correlate with
patient benefit. Reports of transient or borderline reductions in tumor size are
susceptible to large measurement error and should not be used as criteria of re-
sponse: They are merely a guide to continue therapy, provided that treatment is
tolerated with acceptable levels of toxicity.

Table 3 Factors that Influence Response Rate in Clinical Trials

1. Treatment-based factors: e.g., drug dosage, drug schedule, quality of care
2. Patient-based factors: e.g., performance status, tumor stage, number of metastatic sites
3. Measurement biases: e.g., intraobserver and interobserver errors
4. Variations in response criteria

Even in the face of stringent response
definitions, measurements of malignant lesions by physical examination or radio-
logical evaluation are subject to intraobserver and interobserver errors. Warr et
al. (28) demonstrated considerable inaccuracies in tumor measurements per-
formed by physicians on real and simulated malignant lesions, particularly in the
case of small-sized nodules (Table 4). Clinical practice commonly involves the
comparison of serial measurements with baseline lesions, and errors in the initial
and sequential measurements can lead to false categorization of tumor response.
In addition, overestimation of tumor response can occur as physicians may be
subconsciously biased by their desire to see a response to therapy. Whenever
feasible, objective external review of tumor measurements should be instituted
to minimize this potential source of bias from investigators.

Table 4 False Categorization of Response from Comparison of All Pairs of Measurements on the Same Lesion

                                        Percent false categorization
Measurement                        PR      PR + MR    Progression
Simulated nodules (1.0–2.6 cm)     12.6    31.0       34.3
Simulated nodules (3.2–6.5 cm)     1.3     19.7       24.0
Neck nodes                         13.1    32.1       33.4
Lung metastases (CXR)              0.8     11.2       15.9
Liver size (A)                     8.5     28.7
Liver size (B)                     18.4

PR = partial response; MR = minor response; CXR = chest radiograph. PR was a >50% decrease in area, PR + MR was a >25% decrease in area, and progression was a >25% increase in area except for liver size. For liver size, PR was defined as a >50% decrease (A) or a >30% decrease (B) in the sum of the linear measurements of the liver edge below the costal margin in the midline and midclavicular line; progression was defined as a >25% increase in the same measurement.
Source: Ref. 28.
Surrogate or intermediate end points refer to events or observations oc-
curring in the course of a disease that are believed to be precursors of the ultimate
outcome of primary interest (29). There are clinical scenarios in which the appli-
cation of surrogate end points appears logical and practical. The major goal in
trials of adjuvant chemotherapy used with effective local treatment is to improve
survival. However, determination of survival (e.g., for breast cancer) in these
trials will take a long time, whereas the duration required for follow-up with a
surrogate end point such as relapse-free survival is shorter. As well, a low event
rate of the primary end point may preclude feasibility in a clinical trial, or feasibil-
ity may be limited only to a multiinstitutional setting. For example, in a primary
prevention trial to determine the effect of aspirin on the incidence of colorectal
cancer among male physicians, over 22,000 subjects were accrued with 118 new
cases of invasive colorectal cancers identified (30). Adenomatous polyps are neo-
plasms that appear to be the precursors for most invasive cancers in the large
bowel and therefore may represent a useful surrogate end point. In the Polyp
Prevention Study, aspirin use was assessed in 793 subjects, of whom 259 devel-
oped at least one colonic or rectal adenoma detected by colonoscopy 1 year after
study entry (31). The practical advantage of using a valid surrogate end point is
its need for a much smaller sample size (32). Surrogate end points can be very
useful, but caution must be exercised against using nonvalidated surrogate end
points with unknown predictive ability for the true end point. Another problem
arises if therapy has specific effects to influence a surrogate end point. For exam-
ple, suramin is known to inhibit release of prostate-specific antigen (PSA) (33),
and reduction in PSA is a surrogate end point commonly used in prostatic cancer
trials; this end point might therefore be misleading when evaluating the anticancer
properties of suramin. The duration and quality of survival of patients with this
disease remain the appropriate end points for trials with this drug (34,35).
End points of phase III studies should assess benefit to patients and should
therefore include the duration of survival and its quality. Unfortunately, achieve-
ments in oncology that prolong survival in patients with metastatic cancers do
not occur frequently. Thus, it is not realistic to base the design of trials that
compare treatments for metastatic cancer in adults on the expectation of a sub-
stantial difference in the duration of survival.
For patients with advanced malignancies who have a poor outlook, the aim
of treatment should be directed toward palliation of their symptoms. Whereas
duration of survival is easy to measure and therefore often considered as a ‘‘hard’’
end point, evaluation of quality of life in patients has traditionally been viewed
as vague and difficult. The development of validated quality of life instruments
has allowed accurate assessment of this important end point. In fact, the improve-
ment in pain control alone has led to the approval of the chemotherapeutic regi-
men of mitoxantrone plus prednisone for the treatment of symptomatic hormone-
refractory prostate cancer (36). Similarly, gemcitabine has been accepted as ther-
apy for the palliation of patients with advanced pancreatic cancer, based on a
randomized trial demonstrating its superiority over 5-fluorouracil in providing
clinical benefit to such patients (37). Quality of life should no longer be regarded
as a ‘‘soft’’ end point of patient benefit. Increasing familiarity with its measure-
ment in patients will provide clinicians with an appropriate measure to evaluate
palliative effects of treatment. When using quality of life as a primary or major
end point in clinical trials, it is important to establish a hypothesis about the
expected degree of change in a predefined important aspect of quality of life, as


summarized in Table 5. The primary measure of quality of life might be an overall
summary scale or a measure of the dominant symptom such as pain in patients
with metastatic prostate cancer. When used in a rigorous way, quality of life end
points can give important information about the palliative value of treatment.

Table 5 Key Elements in the Evaluation of Quality of Life in Clinical Trials

1. Patient based
2. Define primary end point using one quality of life measure that is relevant to patients in study (others exploratory)
3. Define hypothesis about change that is clinically important
4. Blind assessment where possible

V. FALSE-POSITIVE AND FALSE-NEGATIVE TRIALS

Besides publication bias, two additional factors increase the probability of falsely
declaring clinical trial results to be positive: the performance of multiple signifi-
cance tests and the low prevalence of true-positive studies leading to therapeutic
advances. In clinical trials, multiple comparisons are commonly undertaken for
various end points, for subgroups, and for serial interim evaluations during patient
accrual and follow-up. The type I error, also known as the α error, is the error
made by reporting a significant difference between two treatments when it does
not exist. Typically, the α error is set at the 5% level, and 1 in 20 p values for
comparison of equivalent outcomes will be less than 5% simply by chance alone.
The number of implicit and explicit statistical comparisons executed in reports
of clinical trials is often large (38–40). There is disagreement about the value of
correcting for multiple tests (41,42), and adjustments are rarely done in practice.
Many of these comparisons are done in a post hoc fashion, and therefore any
detected differences are to be regarded as hypothesis generating, rather than hy-
pothesis testing. Final conclusions from clinical trials should derive only from
comparisons of the major predefined end point(s). Excessive manipulation of
study data results in ‘‘data torturing,’’ which can lead to the dissemination of
incorrect information to the research community and to patients (39).
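To make the scale of the multiplicity problem concrete, a simple calculation (added here for illustration, not drawn from the references) shows the chance of at least one spuriously ‘‘significant’’ result among k independent comparisons of truly equivalent treatments, each tested at the 5% level. Real end points and subgroups are correlated, so the exact figures differ, but the direction of the inflation is the same.

# Chance of at least one p < 0.05 among k independent comparisons
# of truly equivalent treatments, each tested at the 5% level.
for k in (1, 5, 10, 20):
    print(k, 1 - 0.95 ** k)
# k = 1: 0.05;  k = 5: about 0.23;  k = 10: about 0.40;  k = 20: about 0.64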
Subgroup analyses of data from randomized trials seek to identify ‘‘effect
modifiers,’’ characteristics of the patients or treatment that modify the effect of
the intervention under study (43). Judicious application of subgroup analyses can
provide ideas for new trials, but they must not be confused with the primary
analysis that can give definitive information. Caution should be exercised against
the tendency to accept exploratory subset analyses that seem reasonable. For
example, in the trials of adjuvant chemotherapy for colorectal cancer, reductions
in the risk of recurrence were associated with younger and female patients in one
study but with older and male patients in another (44,45). These divergent results
of unplanned subgroup analysis are almost certainly due to statistical artifact.
Periodic monitoring of the accumulating data in a trial can give interim
information that might indicate early stopping because of a substantial difference
between the arms, because one might predict that the trial cannot show a differ-
ence if it continues to its preplanned accrual goals, or because of unacceptable
toxicity. Review of outcome information should not be undertaken by the investi-
gators since this is likely to introduce bias. Rather, interim data should be re-
viewed by an independent Data Monitoring Committee. A recommendation for
trial closure must be based on sufficient evidence: Early interim analyses of lim-
ited available data require very small p values for stopping, whereas later analyses
can have stopping p values closer to conventional levels of significance (46).
After completion of accrual, multiple reanalyses of the data can also generate
multiple significance tests with the possibility of showing apparent effects that
are not sustained. For example, in a European Organization for Research and
Treatment of Cancer trial of adjuvant therapy for breast cancer, early results
suggested survival benefit for patients receiving chemotherapy but not those
receiving tamoxifen, whereas the mature 8-year results showed the opposite
(47,48). A limited number of analyses should be undertaken at predefined times.
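The need for stringent early stopping boundaries can be seen in a small simulation (a sketch added for illustration; the five equally spaced looks, the group size, and the normally distributed outcome are arbitrary assumptions). Repeatedly testing a truly null two-arm comparison at a nominal two-sided 0.05 level at every look gives an overall false-positive rate of roughly 14% rather than 5%, which is why group-sequential rules demand very small p values at the earliest analyses.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def ever_significant(n_per_look=50, looks=5, alpha=0.05, reps=20000):
    """Monte Carlo estimate of the chance that a truly null two-arm comparison
    is declared 'significant' at any of several equally spaced interim looks."""
    z_crit = norm.ppf(1 - alpha / 2)
    crossed = 0
    for _ in range(reps):
        a = rng.standard_normal(n_per_look * looks)  # outcomes on arm A
        b = rng.standard_normal(n_per_look * looks)  # outcomes on arm B (same distribution)
        for k in range(1, looks + 1):
            n = k * n_per_look
            z = (a[:n].mean() - b[:n].mean()) / np.sqrt(2.0 / n)
            if abs(z) > z_crit:
                crossed += 1
                break
    return crossed / reps

print(ever_significant())  # roughly 0.14, not the nominal 0.05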
The low expectation of therapeutic advances may also contribute to false-
positive reporting (40,49–51). The probability that there will be superior efficacy
of a new treatment compared with the reference approach is very low (49). For
example, in a review of all breast cancer abstracts published in the Program/
Proceedings of the American Society of Clinical Oncology from 1984 to 1993,
only 16% of the adjuvant trials and 2% of the advanced disease trials reported
a survival benefit for the experimental treatment (52). Because of this low prevalence
of true differences in outcome, a trial that is declared positive has a relatively high
likelihood of being falsely positive (and a trial declared negative a low likelihood of
being falsely negative). This concept is not intuitively
obvious and may be clarified by the following example (Parmar MKB, personal
communication). Suppose 400 trials are undertaken with a significance level of
5% (α error = 0.05, two-sided) and with 90% power (the probability of detecting
a treatment difference if in truth there is one). If the prevalence of trials
with a real difference in favor of the experimental treatment is 10% (40 trials), then
36 (40 × 0.9) trials will be reported correctly as true positives and 9 (360 ×
0.025) trials will be reported erroneously as false positives. Altogether, among
the 45 trials that will declare positive results, 1 in 5 (or 9 in 45) are false-positive
trials. The remaining 355 of the 400 trials will be reported as negative; of
these, 342 (360 − 18) are true negatives, and 13 (about 1 in 27) are falsely or
misleadingly negative: 4 trials in which a real benefit is missed and 9 in which
chance produces a significant result favoring the control arm. In Bayesian
terminology, in the setting of a low pretest or prior probability
of true positive studies, a single positive study even when well designed and
performed has a substantial chance of being false positive, especially if the p
value is in the range of 0.01 to 0.05, with only a relatively small increment in
the post-test probability of true results.
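The arithmetic of this example is easy to reproduce for other assumptions about the prevalence of truly superior treatments; the short Python sketch below (added for illustration, not part of the original example) recovers the 36 true-positive and 9 false-positive trials and the resulting 1-in-5 rate, counting only chance results that favor the experimental arm as ‘‘positive.’’

def positive_trials(n_trials=400, prevalence=0.10, power=0.90, alpha=0.05):
    """Expected numbers of true- and false-positive trials when only results
    favoring the experimental arm are declared positive (two-sided alpha)."""
    with_benefit = n_trials * prevalence     # trials with a real difference (40)
    null_trials = n_trials - with_benefit    # trials with no real difference (360)
    true_pos = with_benefit * power          # 40 x 0.9 = 36
    false_pos = null_trials * alpha / 2      # 360 x 0.025 = 9
    ppv = true_pos / (true_pos + false_pos)  # proportion of positive reports that are real
    return true_pos, false_pos, ppv

print(positive_trials())  # (36.0, 9.0, 0.8): 1 in 5 positive trials is falsely positive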
The type II error, or the β error, represents the failure of a clinical trial to
recognize a significant difference between treatment groups when it truly exists.
False negativity occurs most often because the sample size is too small to detect
plausible differences in outcome (53). For example, to detect or rule out a 20%
absolute difference in expected outcome events with 90% power, more than
200 patients are required (the exact number depends on the expected outcome
in the control or standard therapy group). Clinical trials that seek differences in
outcome events in the range of 10% will require accrual of about 1000 patients
to obtain reasonable power (54).
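These figures follow from the usual normal-approximation sample size formula for comparing two proportions. The sketch below (added for illustration) assumes a 50% event rate in the control arm; as noted above, the exact requirement depends on the expected outcome in the control group.

from math import ceil
from scipy.stats import norm

def n_per_arm(p_control, p_experimental, alpha=0.05, power=0.90):
    """Approximate number of patients per arm to compare two proportions
    (normal approximation, two-sided test)."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_experimental * (1 - p_experimental)
    return ceil((z_a + z_b) ** 2 * variance / (p_control - p_experimental) ** 2)

print(2 * n_per_arm(0.50, 0.70))  # 20% absolute difference: roughly 240 patients in total
print(2 * n_per_arm(0.50, 0.60))  # 10% absolute difference: roughly 1030 patients in total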

VI. RIGOROUS VERSUS NONRIGOROUS ANALYSIS

At the completion of a clinical trial, the collected data must be assimilated using
appropriate methodologies and presented in an objective and thoughtful manner.
Many guidelines and recommendations are available to promote quality in the
reporting of clinical trials (55–59). Essential elements that should be clearly spec-
ified in every report include study hypothesis and design; patient population and
entry criteria that were used in its selection; actual therapy delivered, especially
if different from the intended therapy; treatment complications and toxicity;
methods of outcome assessment; statistical evaluation; and the accounting of all
study subjects. Failure to adhere to basic standards in the reporting of clinical
trials may lead to dissemination of misleading information and eventually to inap-
propriate treatment of patients. Baar and Tannock (60) illustrated the impact of
a rigorous in contrast to a nonrigorous approach in the analysis and reporting of
clinical trials.

Table 6  Guidelines for the Reporting of Clinical Trials
• Are the criteria for study entry and patient selection described?
• Does the report provide details of the control and the experimental therapies?
• Are protocol violations reported and the reasons for such violations explained?
• Are stringent response criteria used for disease evaluation?
• Are all patients accounted for and included in the analysis of results?
• Does the survival curve include all patients, and does it avoid comparing survival by treatment response?
• Does the report provide a comprehensive analysis of toxicity?
• Are measures of quality of life or costs of treatment included?

Using a demonstration model involving fictional patients and a
hypothetical chemotherapy regimen ‘‘CABOOM’’ for metastatic carcinoma of
the great toe, a single set of data conveyed dramatically opposite conclusions
depending on the way they were interpreted and presented. The inappropriate
methods of analysis that led to the spurious results were not fictional; each of
them had been used in one or more reports of clinical trials published in a single
year in the Journal of Clinical Oncology. The differences between rigorous and
nonrigorous analysis used to generate these reports can be used as a guide to
reporting (and how not to report) clinical trials and are summarized in Table 6.

VII. EXTERNAL VALIDITY AND GENERALIZABILITY

Trials that have rigid selection criteria may be limited in their applicability to
patients with similar characteristics, and extrapolation beyond such patients is
unsound. For example, in a Cancer and Leukemia Group B trial of radiotherapy
with or without induction chemotherapy for stage III NSCLC (61), only patients
with high performance status were allowed on study. As a result of this strict
inclusion criterion, multiple participating centers took 3 years to accrue 155 eligi-
ble patients, despite the high prevalence of this malignancy. Furthermore,
the survival benefit noted in favor of combined chemoradiation may not apply
to other stage III NSCLC patients who have poorer performance status. Simple
entry criteria allow the accrual of larger numbers of patients and increase the
heterogeneity of the study population but render the final results more generalizable.
Internally and externally valid clinical trials are those that are performed using
stringent methodology, reported in a prudent fashion, and produce meaningful
results that are consistent with clinical experience (Table 7). For example, in a
randomized study of only 60 patients which compared surgery with or without
preoperative chemotherapy for stage IIIA NSCLC, a substantial and significant
survival advantage was reported for the addition of chemotherapy (62). This study
had problems of design and analysis (i.e., poor internal validity) and generated
a result that is inconsistent with data from other trials, many of which were much
larger (i.e., poor external validity) (63–66). Despite these defects, the trial had
a substantial influence by virtue of being a lead article in the New England Journal
of Medicine, a flagrant example of publication bias.

Table 7  Internal and External Validity
• Internal validity: Were rigorous and appropriate methods used in the design, analysis, and reporting of the trial?
• External validity: How do the results of the trial compare with other data and with past clinical experience?

VIII. CONCLUSION

Many subtle problems can influence the design, analysis, and reporting of clinical
trials and hence the validity of their results. Quality control imposed by coopera-
tive groups has improved but not eliminated these problems. When reviewing
the results of a clinical trial, readers should consider two relatively simple ques-
tions: Were the design and analysis of the trial performed according to high stan-
dards (internal validity) and do the results fit with clinical experience and those
of other trials (external validity)? If the answers to these questions are ‘‘yes,’’
and the trial addresses an important question, it is appropriate to include its results
as part of the basis for making clinical decisions.

REFERENCES

1. Mackillop WJ, Ward GK, O’Sullivan B. The use of expert surrogates to evaluate
clinical trials in non-small cell lung cancer. Br J Cancer 1986; 54:661–667.
2. Mackillop WJ, Palmer MJ, O’Sullivan B, Ward GK, Steele R, Dotsikas G. Clinical
trials in cancer: the role of surrogate patients in defining what constitutes an ethically
acceptable clinical experiment. Br J Cancer 1989; 59:388–395.
3. Moore MJ, O’Sullivan B, Tannock IF. How expert physicians would wish to be
treated if they had genitourinary cancer. J Clin Oncol 1988; 6:1736–1745.
4. Moore MJ, O’Sullivan B, Tannock IF. Are treatment strategies of urologic oncolo-
gists influenced by the opinions of their colleagues? Br J Cancer 1990; 62:988–
991.
5. Freedman B. Equipoise and the ethics of clinical research. N Engl J Med 1987; 317:
141–145.
6. Gehan EA. The determination of the number of patients required in a preliminary
and a follow-up trial of a new chemotherapeutic agent. J Chronic Dis 1961; 13:346–
353.
7. Gehan EA, Freireich EJ. Non-randomized controls in cancer clinical trials. N Engl
J Med 1974; 290:198–203.
8. Sacks H, Chalmers TC, Smith H Jr. Randomized versus historical controls for clini-
cal trials. Am J Med 1982; 72:233–240.
9. Antman K, Amato D, Wood W, Carson J, Suit H, Proppek K, Carey R, Greenberger
J, Wilson R, Frei E III. Selection bias in clinical trials. J Clin Oncol 1985; 3:1142–
1147.
10. Davis S, Wright PW, Schulman SF, Hill LD, Pinkham RD, Johnson LP, Jones TW,
Kellog Jr HB, Radke HM, Sikkema WW, Jolly PC, Hammar SP. Participants in
prospective, randomized clinical trials for resected non-small cell lung cancer have
improved survival compared with nonparticipants in such trials. Cancer 1985; 56:
1710–1718.
11. Feinstein AR, Sosin DM, Wells CK. The Will Rogers phenomenon. Stage migration
and new diagnostic techniques as a source of misleading statistics for survival in
cancer. N Engl J Med 1985; 312:1604–1608.
12. Simon R. Optimal two-stage designs for phase II clinical trials. Controlled Clin Tri-
als 1989; 10:1–10.
13. Begg CB, Berlin JA. Publication bias and dissemination of clinical research. J Natl
Cancer Inst 1989; 81:107–115.
14. De Bellefeuille C, Morrison CA, Tannock IF. The fate of abstracts submitted to a
cancer meeting: factors which influence presentation and subsequent publication.
Ann Oncol 1992; 3:187–191.
15. Lyman GH, Kuderer NM. A primer for evaluating clinical trials. Cancer Control
1997; 4:413–418.
16. Neyman J, Pearson ES. On the use and interpretation of certain test criteria. Biome-
trika 1928; 20A:175–240.
17. Fisher B, Carbone P, Economou SG, Frelick R, Glass A, Lerner H, Redmond C,
Zelen M, Band P, Katrych DL, Wolmark N, Fisher ER. 1-Phenylalanine mustard
(l-PAM) in the management of primary breast cancer. N Engl J Med 1975; 292:
117–122.
18. Bonadonna G, Brusamolino E, Valagussa P, Rossi A, Brugnatelli L, Brambilla C,
De Lena M, Tancini G, Bajetta E, Musumeci R, Veronesi U. Combination chemo-
therapy as an adjuvant treatment in operable breast cancer. N Engl J Med 1976;
294:405–410.
19. Anderson JR, Cain KC, Gelber RD. Analysis of survival by tumor response. J Clin
Oncol 1983; 1:710–719.
20. Moertel CG, Thynne GS. Cancer Medicine. 2nd ed. Philadelphia: Lea & Febiger,
1982, pp. 1830–1859.
21. Tannock IF. Chemotherapy for head and neck cancer. J Otolaryngol 1984; 13:99–
104.
22. Rudnick SA, Feinstein AR. An analysis of the reporting of results in lung cancer
drug trials. J Natl Cancer Inst 1980; 64:1337–1343.
23. Miller AB, Hoogstraten B, Staquet M, Winkler A. Reporting results of cancer treat-
ment. Cancer 1981; 47:207–214.
24. Tonkin K, Tritchler D, Tannock IF. Criteria of tumor response used in clinical trials
of chemotherapy. J Clin Oncol 1985; 3:870–875.
25. World Health Organization. WHO handbook for reporting results of cancer treat-
ment. WHO Offset Publication No. 48, Geneva, 1979.
26. Oken MM, Creech RH, Tormey DC, Horton J, Davis TE, McFadden ET, Carbone
PP. Toxicity and response criteria of the Eastern Cooperative Oncology Group. Am
J Clin Oncol 1982; 5:649–655.
27. Green S, Weiss GR. Southwest Oncology Group standard response criteria, endpoint
definitions and toxicity criteria. Invest New Drugs 1992; 10:239–253.
28. Warr D, McKinney S, Tannock I. Influence of measurement error on assessment of
response to anticancer chemotherapy: proposal for new criteria of tumor response.
J Clin Oncol 1984; 2:1040–1046.
29. Ellenberg SS. Surrogate endpoints. Br J Cancer 1993; 68:457–459.
30. Gann PH, Manson JE, Glynn RJ, Buring JE, Hennekens CH. Low-dose aspirin and
incidence of colorectal tumors in a randomized trial. J Natl Cancer Inst 1993; 85:
1220–1224.
31. Greenberg ER, Baron JA, Freeman DH, Mandel JS, Haile R. Reduced risk of large-
bowel adenomas among aspirin users. J Natl Cancer Inst 1993; 85:912–916.
32. Wittes J, Lakatos E, Probstfield J. Surrogate endpoints in clinical trials: cardiovascu-
lar diseases. Stat Med 1989; 8:415–425.
33. La Rocca RV, Danesi R, Cooper MR, Jamis-Dow CA, Ewing MW, Linehan WM,
Myers CE. Effect of suramin on human prostate cancer cells in vitro. J Urol 1991;
145:393–398.
34. Clark JW, Chabner BA. Suramin and prostate cancer: where do we go from here?
J Clin Oncol 1995; 13:2155–2157.
35. Small EJ, Marshall ME, Reyno L, Meyers F, Natale R, Meyer M, Lenehan P,
Chen L, Eisenberger M. Superiority of suramin + hydrocortisone (S + H) over
placebo + hydrocortisone (P + H): results of a multi-center double-blind phase III
study in patients with hormone refractory prostate cancer (HRPC). Proc Am Soc
Clin Oncol 1997; 17:308a (abstract no. 1187).
36. Tannock IF, Osoba D, Stockler MR, Ernst S, Neville AJ, Moore MJ, Armitage GR,
Wilson JJ, Venner PM, Coppin CML, Murphy KC. Chemotherapy with mitoxan-
trone plus prednisone or prednisone alone for symptomatic hormone-refractory pros-
tate cancer: a Canadian randomized trial with palliative endpoints. J Clin Oncol
1996; 14:1756–1764.
37. Burris III HA, Moore MJ, Andersen J, Green MR, Rothenberg ML, Modiano MR,
Cripps MC, Portenoy RK, Storniolo AM, Tarassoff P, Nelson R, Dorr A, Stephens
CD, Von Hoff DD. Improvements in survival and clinical benefit with gemcitabine
as first-line therapy for patients with advanced pancreatic cancer: a randomized trial.
J Clin Oncol 1997; 15:2403–2414.
38. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical
trials. A survey of three medical journals. N Engl J Med 1987; 317:426–432.
39. Mills JL. Data torturing. N Engl J Med 1993; 329:1196–1199.
40. Tannock IF. False-positive results in clinical trials: multiple significance tests and
the problem of unreported comparisons. J Natl Cancer Inst 1996; 88:206–207.
41. Pocock SJ, Geller NL, Tsiatis AA. The analysis of multiple endpoints in clinical
trials. Biometrics 1987; 43:487–498.
42. Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology
1990; 1:43–46.
43. Oxman AD, Guyatt GH. A consumer’s guide to subgroup analyses. Ann Intern Med
1992; 116:78–82.
44. Laurie JA, Moertel CG, Fleming TR, Wieand HS, Leigh JE, Rubin J, McCormack
GW, Gerstner JB, Krook JE, Malliard J, Twito DI, Morton RF, Tschetter LK, Barlow
JF. Surgical adjuvant therapy for large-bowel carcinoma: an evaluation of levamisole
and the combination of levamisole and fluorouracil. J Clin Oncol 1989; 7:1447–
1456.
45. Moertel CG, Fleming TR, MacDonald JS, Haller DG, Laurie JA, Goodman PJ,
Ungerleider JS, Emerson WA, Tormey DC, Glick JH, Veeder MH, Mailliard JA.
Levamisole and fluorouracil for adjuvant therapy of resected colon carcinoma.
N Engl J Med 1990; 322:352–358.
46. Pocock SJ. When to stop a clinical trial. BMJ 1992; 305:335–340.
47. Rubens RD, Bartelink H, Engelsman E, Hayward JL, Rotmensz N, Sylvester R, van
der Schueren E, Papadiamantis J, Vassilaros SD, Wildiers J, Winter PJ. Locally
advanced breast cancer: the contribution of cytotoxic and endocrine treatment to
radiotherapy. An EORTC Breast Cancer Co-operative Group Trial (10792). Eur J
Cancer Clin Oncol 1989; 25:667–678.
48. Bartelink H, Rubens RD, van der Schueren E, Sylvester R. Hormonal therapy pro-
longs survival in irradiated locally advanced breast cancer: a European Organization
for Research and Treatment of Cancer randomized phase III trial. J Clin Oncol 1997;
15:207–215.
49. Staquet MJ, Rozencweig M, Von Hoff DD, Muggia FM. The delta and epsilon errors
in the assessment of cancer clinical trials. Cancer Treat Rep 1979; 63:1917–1921.
50. Ciampi A, Till JE. Null results in clinical trials: the need for a decision-theory ap-
proach. Br J Cancer 1980; 41:618–629.
51. Parmar MKB, Ungerleider RS, Simon R. Assessing whether to perform a confirma-
tory randomized clinical trial. J Natl Cancer Inst 1996; 88:1645–1651.
52. Chlebowski RT, Lillington LM. A decade of breast cancer clinical investigations:
results as reported in the program/proceedings of the American Society of Clinical
Oncology. J Clin Oncol 1994; 12:1789–1795.
53. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the
type II error and sample size in the design and interpretation of the randomized
control trial. Survey of 71 ‘‘negative’’ results. N Engl J Med 1978; 299:690–694.
54. Boag JW, Haybittle JL, Fowler JF, Emergy EW. The number of patients required
in a clinical trial. Br J Radiol 1971; 44:122–125.
55. DerSimonian R, Charette LJ, McPeek B, Mosteller F. Reporting on methods in clini-
cal trials. N Engl J Med 1982; 306:1332–1337.
56. Zelen M. Guidelines for publishing paper on cancer clinical trials: responsibilities
of editors and authors. J Clin Oncol 1983; 1:164–169.
57. Simon R, Wittes RE. Methodologic guidelines for reports of clinical trials. Cancer
Treat Rep 1985; 69:1–3.
58. Green SJ, Fleming TR. Guidelines for the reporting of clinical trials. Semin Oncol
1988; 15:455–461.
59. Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, Pitkin R, Rennie D,
Schulz KF, Simel D, Stroup DF. Improving the quality of reporting of randomized
control trials. The CONSORT statement. JAMA 1996; 276:637–639.
60. Baar J, Tannock I. Analyzing the same data in two ways: a demonstration model
to illustrate the reporting and misreporting of clinical trials. J Clin Oncol 1989; 7:
969–978.
61. Dillman RO, Seagren SL, Propert KJ, Guerra J, Eaton WL, Perry M, Carey RW,
Frei EF III, Green MR. A randomized trial of induction chemotherapy plus high-
dose radiation versus radiation alone in stage III non-small-cell lung cancer. N Engl
J Med 1990; 323:940–945.
62. Rosell R, Gomez-Codina J, Camps C, Maestre J, Padille J, Canto A, Mate JL, Li
S, Roig J, Olazabal A, Canela M, Ariza A, Skagel Z, Morera-Prat J, Abad A. A
randomized trial comparing preoperative chemotherapy plus surgery with surgery
alone in patients with non-small-cell lung cancer. N Engl J Med 1994; 330:153–
158.
63. Cocquyt V, De Neve W, Van Belle SJ-P. Chemotherapy and surgery versus surgery
alone in non-small-cell lung cancer. N Engl J Med 1994; 330:1756.
64. Chanarin N. Chemotherapy and surgery versus surgery alone in non-small-cell lung
cancer. N Engl J Med 1994; 330:1756.
65. Mills NE, Fishman CL, Jacobson DR. Chemotherapy and surgery versus surgery
alone in non-small-cell lung cancer. N Engl J Med 1994; 330:1756.
66. McLachlan SA, Stockler M. Chemotherapy and surgery versus surgery alone in non-
small-cell lung cancer. N Engl J Med 1994; 330:1757.
67. Bush RS. Cancer of the ovary: natural history. In: Peckham MJ, Carter RL, eds.
Malignancies of the Ovary, Uterus and Cervix: The Management of Malignant Dis-
ease, Series #2. London: Edward Arnold, 1979:26–37.
24
Commonly Misused Approaches in the
Analysis of Cancer Clinical Trials

James R. Anderson
University of Nebraska Medical Center, Omaha, Nebraska

I. INTRODUCTION

Analysis of cancer clinical trial data is sometimes not straightforward. In this
chapter, some commonly used incorrect or inappropriate methods of statistical
analysis are discussed, along with appropriate techniques where they exist.

II. COMPARISON OF SURVIVAL OR OTHER ‘‘TIME-TO-EVENT’’ BY OUTCOME VARIABLES

Many comparisons of survival, failure-free survival, or other ‘‘time-to-event’’
variables are made by grouping patients on some outcome of treatment. Examples
of these analyses include analysis of survival by best response to treatment, analy-
sis of survival by delivered chemotherapy dose intensity, survival by level of
toxicity experienced, and survival by compliance to some protocol specified treat-
ment (these compliance data often resulting from quality control review of radio-
therapy or surgical procedures).
All these analyses are problematic, first and foremost because patients are
classified into groups based on an outcome of treatment that unfolds over time
(response, toxicity, dose intensity, treatment compliance). Valid analyses of time-
to-event data measure the time from some ‘‘time of entry’’ to the occurrence of
some ‘‘event-of-interest’’ and compare groups based on characteristics or factors
known at the time of entry. Since none of these ‘‘outcomes’’ are known at the
time treatment starts, the standard survival analysis approaches are invalid and
other statistical techniques must be used (1). Nevertheless, even when appropriate
analyses are conducted, the interpretation of the results of these analyses can be
difficult.

A. Analysis of Survival by Tumor Response


The effectiveness of treatments for patients with advanced cancers is often as-
sessed by computing complete and partial response rates and overall survival
from the start of treatment. For many years, survival by tumor response category
was commonly included as part of the reports of these treatments. Patients were
characterized as ‘‘responders’’ or ‘‘nonresponders,’’ and Kaplan-Meier life-table
estimates of survival from the start of treatment were calculated for each re-
sponder group and survival by tumor response compared, with the statistical sig-
nificance of observed differences assessed using the log-rank or other significance
test appropriate for time-to-event data. A review of articles published in Cancer
or Cancer Treatment Reports found 228 articles presenting data on responders
and nonresponders, with 61% containing formal statistical comparisons of sur-
vival by tumor response category (2). The Journal of Clinical Oncology, in its
last six issues of 1984 and its first six issues of 1985, published 18 papers that
included analyses of survival by tumor response and 10 provided formal statisti-
cal comparisons of survival of responders and nonresponders (3).
These analyses appear to have been used in two ways. First, responders
surviving longer than nonresponders was used as a justification for more aggres-
sive therapy, since therapy that increased the response rate would then necessarily
result in increased survival. Second, some investigators considered analysis of
survival by response category as a surrogate for a placebo-controlled trial. Re-
sponders benefited from the therapy and thus constituted the ‘‘treatment’’ group.
Nonresponders did not benefit, and thus their survival could be considered similar
to that for untreated patients (placebo controls).
However, the usual methods of comparing responders to nonresponders are
wrong and should never be used (4). The estimates of the survival distributions
are biased, the statistical tests are invalid, and the conclusions drawn are mis-
leading. The bias results from the requirement that responders must live long
enough for the response to be observed; there is no such requirement for nonre-
sponders. This bias also affects the validity of the statistical tests. Responders
will be counted as ‘‘at risk’’ of death from the start of treatment to the time of
response, biasing the analysis in favor of responders.
There are a number of valid approaches to the statistical comparison of
survival by response category (4). The ‘‘landmark’’ method evaluates response
for all patients at some fixed time following the onset of treatment. For phase II
studies, this landmark might be a time at which most patients who are going to
respond have already done so (say, after three cycles of therapy). Patients who
progress or who are ‘‘off-study’’ prior to the landmark are excluded from the
analysis. Patients are analyzed according to their response at the landmark, re-
gardless of subsequent changes in response status. Survival estimates and statisti-
cal tests are conditional on the patients’ landmark response. A second approach
involves the application of the method of Mantel and Byar (5), first applied to
the analysis of heart transplant data. Patients accrue follow-up time in various
‘‘states’’ that they occupy during treatment. All patients begin in the nonresponse
state and patients move to the response state at the time of their response. This
time-dependent covariate analysis removes the bias inherent in the usual analysis
method. Simon and Makuch (6) proposed a method of obtaining estimates of
survival probabilities for responders and nonresponders, using ideas from the
landmark and Mantel-Byar approaches.
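As an illustration only, the landmark classification described above might be coded as in the sketch below. The 90-day landmark, the column names, and the use of the lifelines package are assumptions made for the example, and the handling of patients taken off study before the landmark is simplified to exclusion on the basis of survival time alone.

import pandas as pd
from lifelines.statistics import logrank_test

def landmark_comparison(df, landmark_days=90):
    """Compare survival of responders vs. nonresponders defined at a landmark.
    Assumed columns: 'survival_days', 'died' (1/0), 'response_days' (NaN if no response)."""
    # Exclude patients who die or are censored before the landmark
    # (progression or off-study exclusions would need additional columns).
    at_risk = df[df["survival_days"] >= landmark_days].copy()
    # Classify by response status at the landmark, ignoring later changes.
    at_risk["responder"] = (
        at_risk["response_days"].notna() & (at_risk["response_days"] <= landmark_days)
    )
    # Measure survival from the landmark, not from the start of treatment.
    at_risk["time"] = at_risk["survival_days"] - landmark_days
    resp = at_risk[at_risk["responder"]]
    nonresp = at_risk[~at_risk["responder"]]
    return logrank_test(resp["time"], nonresp["time"],
                        event_observed_A=resp["died"], event_observed_B=nonresp["died"])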
Even if appropriate statistical methods are used, longer survival for re-
sponders, as compared with nonresponders, cannot be used to conclude that re-
sponse ‘‘caused’’ longer survival. Response may act as a surrogate marker for
prognostically favorable patients. Thus, responders may survive longer than non-
responders, not because of an effect of response on survival but because response
identifies patients with pretreatment characteristics that favor longer survival. It
will generally be impossible to distinguish cases where response prolongs sur-
vival and cases where it simply acts as a marker for favorable prognosis patients.
Analyses of survival by tumor response category are therefore rarely help-
ful and should be avoided. Cancer Treatment Reports methodological guidelines
indicated they would not publish comparisons of survival by tumor response (7).
Few such analyses were published in 1998, although a few inappropriate analyses
still make their way past peer review (8–10). Some investigators interested in
assessing the effect of response on survival are applying appropriate statistical
techniques (11).

B. Analysis of Survival by Chemotherapy Dose Intensity


In experimental in vivo systems, there is substantial evidence for a steep dose–
response curve for many antineoplastic agents (12). However, clinical evidence
for a steep dose–response curve beyond those doses used in standard chemother-
apy regimens is limited (13). In a few studies where patients have been assigned
to doses of chemotherapeutic agents less than those considered standard (as op-
posed to receiving lower doses because of toxicity experienced), there is evidence
of disease control and survival less than that expected with standard doses (14–
16).
Clinical cancer investigators understandably have attempted to demonstrate
the importance of dose or dose intensity in the delivery of cancer treatment.
However, much of the clinical evidence for the importance of dose has come
either from the retrospective analysis of dose or dose intensity of therapy deliv-
ered on clinical trial or from the nonrandomized comparison of results of different
treatments classified by a dose-intensity score.
In the retrospective analyses of dose delivered, each patient’s actual dose
of the chemotherapeutic agents received is calculated as a fraction of the protocol-
specified dose (see, for instance, 17–19). Patients are then classified into groups
based on the percent of protocol-specified dose delivered (e.g., ≥85%, 65–84%,
<65%) and outcome is compared among dose groups. Percent of protocol-
specified dose delivered is sometimes replaced by relative dose intensity, cal-
culated as mg/m2/wk of the agents delivered as a percentage of the protocol-
specified dose intensity. A statistically significant association between high dose
or dose intensity and outcome is claimed to be evidence of a dose–response
effect.
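For concreteness, the quantities used in these retrospective analyses can be written out in a short sketch (illustrative only; the cutoffs mirror the grouping cited above, and a real analysis would average over the agents in a regimen).

def percent_of_protocol_dose(delivered_mg_per_m2, protocol_mg_per_m2):
    """Total delivered dose as a percentage of the protocol-specified total dose."""
    return 100.0 * delivered_mg_per_m2 / protocol_mg_per_m2

def relative_dose_intensity(delivered_mg_per_m2, weeks_on_treatment,
                            protocol_mg_per_m2, protocol_weeks):
    """Delivered dose intensity (mg/m2/wk) as a percentage of the
    protocol-specified dose intensity."""
    delivered = delivered_mg_per_m2 / weeks_on_treatment
    planned = protocol_mg_per_m2 / protocol_weeks
    return 100.0 * delivered / planned

def dose_group(percent):
    """Grouping of the kind used in retrospective dose analyses."""
    if percent >= 85:
        return ">=85%"
    if percent >= 65:
        return "65-84%"
    return "<65%"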
However, these analyses cannot demonstrate that drug dose is directly re-
lated to therapeutic effect. It is possible that toxicity leading to dose reduction
acts as a marker of patients with a poor prognosis generally and additional chemo-
therapy over and above that which can be delivered under normal circumstances
would be of no benefit (13,20,21). Evidence that this may be the case comes from
a randomized trial in pediatric nonlymphoblastic lymphoma comparing COMP
therapy to COMP plus daunorubicin (D-COMP) (22). The addition of daunoru-
bicin had no effect on failure-free survival overall. However, patients randomized
to D-COMP who received less than the protocol-specified dose of daunorubicin
had a significantly poorer failure-free survival than both those patients on D-
COMP receiving full-dose daunorubicin and those patients randomized to COMP
alone who received no daunorubicin by design (23).
Other dose–response analyses have resulted from the retrospective analysis
of published data, in which regimens are allocated a dose intensity score com-
pared with some standard regimen (like a hypothetical ‘‘gold standard’’ for inter-
mediate grade non-Hodgkin’s lymphoma [24,25] or CMF for adjuvant therapy
of breast cancer [26]). However, these analyses are particularly prone to problems
that are inherent to nonrandomized comparisons. These comparisons may be bi-
ased because of differences between series with respect to factors like patient
characteristics and/or supportive care measures that may impact substantially
on outcome (13). Second- and third-generation regimens for intermediate grade
lymphoma (m-BACOD, ProMACE-CytaBOM, MACOP-B), which would have
been expected to produce improved disease control based on these ‘‘dose-
intensity’’ analyses (24,25), were shown to be essentially equivalent to CHOP
in a randomized trial comparing all four regimens (27). Differences in patient
characteristics known to have a substantial impact on treatment outcome in non-
Hodgkin’s lymphoma (28) between the pilot experience with the second- and
third-generation regimens at single institutions and the CHOP experience in the
cooperative groups are the most likely explanation for the observed differences.
The best and most convincing evidence for a dose–response effect in cancer
treatment comes from studies that randomly assign patients to differing doses or
dose intensities of chemotherapeutic agents. However, only a few examples of
randomized clinical trials demonstrate that dose intensity greater than that of
conventional therapy but less than that requiring stem-cell support produces im-
proved disease control (e.g., 16,29,30). The available clinical research data from
randomized trials and elsewhere suggest that small or even moderate reductions
in drug dose for nontrivial reasons (like toxicity) do not compromise patient sur-
vival.

C. Analysis of Survival by Toxicity


Interest in the relationship of chemotherapy dose intensity to outcome has led
to an interest in the association between toxicity experienced by patients and
outcome. Sorensen et al. (31) reported that the occurrence of dose-limiting toxic-
ity during treatment was associated with higher response rates and longer sur-
vival. However, comparisons of toxicity and survival suffer the same problems
as survival by response: the longer a patient’s time on treatment, the greater the
chance that toxicity of a certain severity will be observed (32). Breslow and
Zandstra (33) observed a similar relationship between the level of bone marrow
lymphocyte toxicity observed during postinduction treatment and duration of re-
mission in childhood acute leukemia. The statistical techniques appropriate for
analysis of survival by response are appropriate here, as are the caveats regarding
interpretation. Toxicity can be a marker of adequate drug dosing and therefore
tumor kill, or it can simply be an indicator of more prognostically favorable pa-
tients who are better able to withstand the cytotoxic effects of therapy.

D. Analysis of Survival by Compliance to Protocol-specified Treatment
Investigators are often interested in assessing how deviations from protocol ther-
apy affect outcome. Analysis of survival by percent of protocol dose delivered
is an example of such an analysis. Other analyses of outcome by compliance often
result from quality control review of specific modality treatment. For instance, a
process of radiation therapy quality assurance might assess whether the volume,
dose, and timing of the radiation therapy delivered is consistent with protocol
requirements. Surgical quality control might assess the completeness of surgical
resection and evidence for negative margins.
Quality assurance processes are important in that they provide important
information on how therapy is delivered in practice. Feedback from the quality
control procedures to physicians treating patients may increase the likelihood that
future patients receive therapy closer to that specified by protocol. Feedback to
the protocol research team may lead to changes in protocol-specified therapy,
making it more likely to be delivered according to protocol in the field. For in-
stance, data from the radiation therapy quality assurance process for a recently
completed protocol of the Intergroup Rhabdomyosarcoma Study Group showed
that some radiation therapists or parents were unwilling to deliver hyperfraction-
ated radiotherapy to children less than 6 years of age, compromising the random-
ized comparison of conventional versus hyperfractionated radiation for these pa-
tients (34).
However, investigators are often interested in comparison of outcome be-
tween patients who comply with protocol-specified treatment and those who do
not or in comparisons of ‘‘treatment received’’ in addition to the randomized
treatment comparisons. These analyses have great potential for bias and may be
misleading (35). First, the characteristic of being compliant may have prognostic
importance, irrespective of the intervention. The Coronary Drug Project Research
Group showed improved survival for subjects who complied with protocol treat-
ment, as compared with noncompliers, for both the drug and placebo groups (36).
Second, lack of compliance may be related to patient or disease characteristics.
The dose of radiation may be decreased from protocol specified because of toxic-
ity; the field size may be reduced from protocol specified in an attempt to reduce
toxicity to vital organs. Hard-to-treat tumors may be hard to cure, and adjustment
of these compliance comparisons for prognostic factors may be inadequate. An
analysis of prognostic factors for children with bulky retroperitoneal embryonal
histology rhabdomyosarcoma suggested that patients whose tumors were ‘‘de-
bulked’’ experienced improved survival, as compared with patients whose tumors
were not debulked (37). Nevertheless, it does not necessarily follow that more
extensive surgery for these patients would lead to improved outcome. It is possi-
ble that only those patients whose tumors could be debulked actually were and
those patients whose tumors were initially biopsied only had disease that both
precluded a debulking procedure and made them at higher risk of treatment
failure.

E. Conclusions: Survival by Outcome Variables


The interest in the analysis of survival by other outcomes of treatment (response,
toxicity, dose intensity, treatment compliance) is understandable, but most of the
usual approaches to the analysis of these data produce survival estimates that
are biased and statistical tests that are invalid. Even when appropriate analytic
techniques are used, the accurate interpretation of the results of these analyses
can be difficult.

III. COMPETING RISKS AND ESTIMATION OF CAUSE-SPECIFIC FAILURE RATES

Competing risks arise when patients may experience a number of events of inter-
est, but the occurrence of some events preclude the observance of others. Exam-
ples of competing risks include

• Death from a cause ‘‘not tumor related’’ precluding the observation of
death from cancer;
• Death precluding the observation of time to diagnosis of a second can-
cer in lymphoma patients following high-dose chemotherapy and stem
cell transplant;
• Bone marrow relapse precluding the observation of central nervous sys-
tem relapse as a first event in children treated for pediatric acute lym-
phoblastic leukemia.

Sometimes, analyses focus inappropriately on cause-specific failures when the
overall event rate is the more suitable end point. When interest is appropriately
focused on cause-specific failures in the presence of competing risks, the tech-
niques used for the estimation of the event rates are often incorrect.

A. Inappropriate Focus of Cause-specific Failures


Since the goal of cancer therapy is to control the cancer and reduce the risk of
cancer death, there is sometimes an interest in focusing on disease-related events
and discounting events thought not to be tumor-related. For instance, tumor mor-
tality could be estimated using the usual survival analysis methodology, but treat-
ing only deaths due to tumor as failures and censoring deaths that were not tumor
related. DeVita and colleagues reported tumor mortality for advanced Hodgkin’s
disease patients treated with MOPP, censoring patients who died of causes other
than Hodgkin’s disease (38,39). However, Hodgkin’s disease patients often die
of causes other than Hodgkin’s disease, with many of the deaths a result of effects
of their treatment (infections, pulmonary complications, second malignancies).
The Cancer and Leukemia Group B reported that 28% of all deaths observed on a
study of the treatment of advanced Hodgkin’s disease were not Hodgkin’s disease
related (40). Censoring deaths attributable to causes thought not to be tumor
related can give an inappropriately favorable view of the likely survival of pa-
tients after treatment (41). In most cases, the most appropriate analysis of out-
come on clinical cancer trials is to count all deaths, irrespective of their cause.
Sometimes there is an interest in looking at cause-specific failures in the
context of a clinical trial, because therapy may be directed at preventing certain
kinds of failures. For instance, if therapies for the treatment of childhood acute
lymphoblastic leukemia differed by the intensity of prophylactic treatment of the
central nervous system (CNS), comparisons of outcome might focus on differ-
ences in the rates of CNS relapse as a first event. If treatments compare local
modalities to the primary tumor (e.g., with or without radiation therapy; with or
without surgery), the rate of local tumor control may be of interest.
However, interpretation of these analyses of ‘‘location-specific failures’’
can be difficult. First, these analyses focus attention away from what should be
the primary analysis objective: the comparison of the overall success of the treat-
ments. Second, these analyses can be misleading, because differences in therapeu-
tic strategies may lead to changes observed in the distribution of failures by site
without affecting the overall risk of failure (42). This phenomenon was first seen
in the treatment of childhood acute lymphoblastic leukemia and was designated
‘‘the dough-boy effect’’ by Dr. Mark Nesbit (43). A regimen with increased
intensity of CNS prophylaxis reduced the risk of CNS relapse as the first event
but increased the risk of bone marrow relapse as the first event, with the overall
failure-free survival being similar to that seen with the use of standard CNS
prophylaxis. This paradoxical effect, in which one kind of event is exchanged
for another, can occur when ‘‘successful’’ treatment of one compartment of dis-
ease allows the disease recurrence to be observed elsewhere. A similar effect is
sometimes seen in the treatment of advanced Hodgkin’s disease. Patients receiv-
ing radiotherapy to sites of bulk disease are less likely to relapse first in those
sites but have a higher rate of ‘‘relapse in previously uninvolved sites’’ and (in
most series) an equivalent failure-free survival.
Although it is natural for researchers who apply local therapies to evaluate
their effect on the risk of local failure, comparisons of treatment outcomes should
focus on differences in the distributions of time to first failure.

B. Appropriate Estimation of Cause-specific Failure Rates


Sometimes the focus on cause-specific failure is appropriate, for instance, in the
study of risk factors for the development of a second malignancy after cancer
treatment. However, estimation of the distribution of ‘‘time to second malig-
nancy’’ or other cause-specific failure using the Kaplan-Meier method and ‘‘cen-
soring’’ patients who die without experiencing the event of interest is problem-
atic. It assumes that the risk of experiencing the event of interest for those who
die would be the same as that for those patients followed beyond their time of
death had they not died, an assumption that is untestable. In addition, even if
this assumption were true, the ‘‘cause-specific time to event’’ curves produced
in this way describe the hypothetical risk of experiencing the event of interest if
all deaths could be prevented.
Most often what is of interest is an estimate of the proportion of the total
patient group treated who would be expected to develop the specific event of
interest (e.g., a second malignancy) by some time T. Cumulative incidence func-
tions provide the appropriate estimates (44). Additional methods of summarizing
competing risks failure time data also exist (45). Gooley et al. (46, and elsewhere
in this volume) showed that the estimates of cause-specific incidence rates incor-
rectly obtained from applying the Kaplan-Meier method are greater than or equal
to the estimates obtained from using cumulative incidence estimates, with the
difference often substantial when many patients experience the competing risk.
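A minimal sketch of both calculations on toy data (illustrative only; one subject per time point, no tied event times) shows the cumulative incidence estimate and why naively censoring competing events in a Kaplan-Meier calculation overstates the cause-specific risk.

import numpy as np

def cumulative_incidence(times, events):
    """Cumulative incidence of event type 1 in the presence of a competing event.
    events: 0 = censored, 1 = event of interest, 2 = competing event.
    Minimal estimator assuming no tied event times."""
    events = np.asarray(events)[np.argsort(times)]
    at_risk = len(events)
    surv = 1.0  # all-cause event-free probability just before the current time
    ci = 0.0    # running cumulative incidence of the event of interest
    for e in events:
        if e == 1:
            ci += surv / at_risk
        if e in (1, 2):
            surv *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return ci

def one_minus_km(times, events):
    """Naive 1 - Kaplan-Meier that (incorrectly) censors competing events."""
    events = np.asarray(events)[np.argsort(times)]
    at_risk = len(events)
    surv = 1.0
    for e in events:
        if e == 1:
            surv *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return 1.0 - surv

# Toy data: three competing events make the naive estimate noticeably larger.
t = [1, 2, 3, 4, 5, 6, 7, 8]
e = [2, 1, 2, 1, 2, 0, 1, 0]
print(cumulative_incidence(t, e), one_minus_km(t, e))  # about 0.44 vs. 0.66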
Comparisons of the risk of experiencing a cause-specific failure may be
made using standard statistical methods for the analysis of failure time data (log-
rank test, proportional hazards model), treating all other failures as censored ob-
servations (44). Darrington et al. (47) investigated the risk of secondary acute
myelogenous leukemia (AML) or myelodysplastic syndrome (MDS) in 511 pa-
tients who had received high-dose chemotherapy, with or without total body irra-
diation (TBI), and stem cell transplant for the treatment of relapsed lymphoma.
They estimated the cumulative incidence of AML/MDS to be 4% at 5 years for
both patients with Hodgkin’s disease and non-Hodgkin’s lymphoma (NHL). They
also showed TBI increased the risk of AML/MDS in patients with NHL who
were 40 years of age or older when transplanted.

C. Conclusions: Competing Risks and Estimation of Cause-Specific Failure Rates
The analysis of cause-specific failure rates can sometimes be misleading, either
because events related to the effects of treatment are ignored or because differ-
ences in therapeutic strategies can lead to a trade-off of site-specific failures,
without influencing overall treatment success. When interest is focused on the
overall outcome of cancer treatment, analyses should count all relevant events
as ‘‘failures.’’ When attention is appropriately focused on the occurrence of cer-
tain events subject to competing risks, the cumulative incidence curve approach
should be used to estimate cause-specific risk.

REFERENCES

1. Anderson JR, Cain KC, Gelber RD, Gelman RS. Analysis and interpretation of the
comparison of survival by treatment outcome variables in cancer clinical trials. Can-
cer Treat Rep 1985; 69:1139–1144.
2. Weiss GB, Bunce H III, Hokanson JA. Comparing survival of responders and non-
responders after treatment: a potential source of confusion in interpreting cancer
clinical trials. Control Clin Trials 1983; 4:43–52.
3. Anderson JR, Davis RB (Letter). Analysis of survival by tumor response. J Clin
Oncol 1986; 4:114–116.
4. Anderson JR, Cain KC, Gelber RD. Analysis of survival by tumor response. J Clin
Oncol 1983; 1:710–719.
5. Mantel N, Byar DP. Evaluation of response-time data involving transient states: an
illustration using heart-transplant data. JASA 1974; 69:81–86.
6. Simon R, Makuch RW. A non-parametric graphical representation of the relationship
between survival and the occurrence of an event: application to responder versus
non-responder bias. Stat Med 1984; 3:35–44.
7. Simon R, Wittes RE. Methodologic guidelines for reports of clinical trials. Cancer
Treat Rep 1985; 69:1–3.
8. Bonadonna G, Valagussa P, Brambilla C, et al. Primary chemotherapy in operable
breast cancer. J Clin Oncol 1998; 16:93–100.
9. Ellis P, Smith I, Ashley S, et al. Clinical prognostic and predictive factors for primary
chemotherapy in operable breast cancer. J Clin Oncol 1998; 16:107–114.
10. Dann EJ, Anatasi J, Larson RA. High-dose cladribine therapy for chronic myeloge-
nous leukemia in the accelerated or blast phase. J Clin Oncol 1998; 16:1498–1504.
11. Smith DC, Dunn RL, Strawderman MS, Pienta KJ. Change in serum prostate-
specific antigen as a marker of response to cytotoxic therapy for hormone-refractory
prostate cancer. J Clin Oncol 1998; 16:1835–1843.
12. Frei E III, Canellos GP. Dose: a critical factor in cancer chemotherapy. Am J Med
1980; 69:585–594.
13. Henderson IC, Hayes DF, Gelman R. Dose-response in the treatment of breast can-
cer. A critical review. J Clin Oncol 1988; 6:1501–1515.
14. Dixon DO, Neilan B, Jones SE, et al. Effect of age on therapeutic outcome in ad-
vanced diffuse histiocytic lymphoma: The Southwest Oncology Group Experience.
J Clin Oncol 1986; 3:295–305.
15. Tannock IF, Boyd NF, DeBoer G, et al. A randomized trial of two dose levels of
cyclophosphamide, methotrexate and fluorouracil chemotherapy for patients with
metastatic breast cancer. J Clin Oncol 1988; 6:1377–1387.
16. Wood WC, Budman DR, Korzun AH, et al. Dose and dose intensity of adjuvant therapy
for stage II, node-positive breast carcinoma. N Engl J Med 1994; 330:1253–1259.
17. Bonadonna G, Valagussa P. Dose-response effect of adjuvant chemotherapy in
breast cancer. N Engl J Med 1981; 304:10–15.
18. Carde P, MacKintosh FR, Rosenberg SA. A dose and time response analysis of the
treatment of Hodgkin’s disease with MOPP chemotherapy. J Clin Oncol 1983; 1:
146–153.
19. Kwak LW, Halpern J, Olshen RA, Horning SJ. Prognostic significance of actual
dose intensity in diffuse large-cell lymphoma: results of a tree-structured survival
analysis. J Clin Oncol 1990; 8:963–977.
20. Anderson JR, Santarelli MT, Peterson B. Dose intensity in the treatment of diffuse
large-cell lymphoma [letter]. J Clin Oncol 1990; 8:1927.
21. Redmond C, Fisher B, Weiand HS. The methodologic dilemma in retrospectively
correlating the amount of chemotherapy received in adjuvant therapy protocols with
disease-free survival. Cancer Treat Rep 1983; 67:519–526.
22. Chilcote RR, Krailo, M, Kjeldsberg C, et al., Daunomycin plus COMP vs
COMP therapy in childhood non-lymphoblastic lymphoma. Proc ASCO 1991; 10:
289(1011).
23. Chilcote RR, Krailo M. Unpublished observations.
24. DeVita VT Jr, Hubbard SM, Longo DL. The chemotherapy of lymphomas: looking
back, moving forward. Cancer Res 1987; 47:5810–5824.
25. Meyer RM, Hryniuk WM, Goodyear MD. The role of dose intensity in determining
outcome in intermediate-grade non-Hodgkin’s lymphoma. J Clin Oncol 1991; 9:
339–347.
26. Hryniuk W, Levine MN. Analysis of dose intensity for adjuvant chemotherapy trials
in stage II breast cancer. J Clin Oncol 1986; 4:1162–1170.
27. Fisher RI, Gaynor ER, Dahlberg S, et al. Comparison of a standard regimen (CHOP)
with three intensive chemotherapy regimens for advanced non-Hodgkin’s lym-
phoma. N Engl J Med 1993; 328:1002–1006.
28. Shipp MA, Harrington DP, Anderson JR, et al. A predictive model for aggressive
non-Hodgkin’s lymphoma. N Engl J Med 1993; 329:987–994.
29. Mayer RJ, Davis RB, Schiffer CA, et al. Comparative evaluation of intensive postre-
mission therapy with different dose schedules of Ara-C in adults with acute myeloid
leukemia (AML): Initial results of a CALGB Phase III study. Proc ASCO 1992; 11:
261(853).
30. Kaye SB, Lewis CR, Paul J, et al. Randomized study of two doses of cisplatin with
cyclophosphamide in epithelial ovarian cancer. Lancet 1992; 340:329–333.
31. Sorensen JB, Hansen HH, Dombernowsky P, et al. Chemotherapy for adenocarci-
noma of the lung (WHO III): a randomized study of vindesine versus lomustine,
cyclophosphamide and methotrexate versus all four drugs. J Clin Oncol 1987; 5:
1169–1177.
32. Propert KJ, Anderson JR. Assessing the effect of toxicity on prognosis: methods of
analysis and interpretation. J Clin Oncol 1988; 6:868–870.
33. Breslow N, Zandstra R. A note on the relationship between bone marrow lymphocy-
tosis and remission duration in acute leukemia. Blood 1970; 36:246–249.
34. Intergroup Rhabdomyosarcoma Study Group. (1998). Unpublished observations.
35. Shuster JJ, Gieser PW. Radiation in pediatric Hodgkin’s disease [reply]. J Clin Oncol
1998; 16:393.
36. Coronary Drug Project Research Group: Influence of adherence to treatment and
response of cholesterol on mortality in the Coronary Drug Project. N Engl J Med
1980; 303:1038–1041.
37. Blakely ML, Lobe TE, Anderson JR, et al. Does debulking improve survival in
advanced stage retroperitoneal embryonal rhabdomyosarcoma. J Ped Surg 1999; 34:
736–742.
38. DeVita VT Jr, Simon RM, Hubbard SM, et al. Curability of advanced Hodgkin’s
disease with chemotherapy: long-term follow-up of MOPP-treated patients at the
National Cancer Institute. Ann Intern Med 1980; 92:587–595.
39. Longo DL, Young RC, Wesley M, et al. Twenty years of MOPP therapy for Hodg-
kin’s disease. J Clin Oncol 1986; 4:1295–1306.
40. Canellos GP, Anderson JR, Propert KJ, et al. Chemotherapy of advanced Hodgkin’s
disease with MOPP, ABVD or MOPP alternating with ABVD. N Engl J Med 1992;
327:1478–1484.
41. Glick JH, Tsiatis A. MOPP/ABVD chemotherapy for advanced Hodgkin’s disease.
Ann Intern Med 1986; 104:876–878.
42. Gelman R, Gelber R, Henderson IC, et al. Improved methodology for analyzing
local and distant recurrence. J Clin Oncol 1990; 8:548–555.
43. Bleyer WA, Sather HN, Nickerson HJ, et al. Monthly pulses of vincristine and pred-
nisone prevent bone marrow and testicular relapse in low-risk childhood acute lym-
phoblastic leukemia: a report from the CCG-161 study by the Childrens Cancer
Study Group. J Clin Oncol 1991; 9:1012–1021.
44. Kalbfleisch JD, Prentice RL. Statistical Analysis of Failure Time Data. New York:
Wiley, 1980, pp. 163–171.
45. Pepe MS, Mori M. Kaplan-Meier, marginal or conditional probability curves in sum-
marizing competing risks failure time data. Stat Med 1993; 12:737–751.
46. Gooley TA, Leisenring W, Crowley J, Storer BE. Estimation of failure probabilities
in the presence of competing risks: new representations of old estimators. Stat Med
1999; 18:695–706.
47. Darrington DL, et al. Incidence and characterization of secondary myelodysplastic
syndrome and acute myelogenous leukemia following high-dose chemoradiotherapy
and autologous stem cell transplantation for lymphoid malignancies. J Clin Oncol
1994; 12:2527–2534.
25
Dose-Intensity Analysis

Joseph L. Pater
NCIC Clinical Trials Group, Queen’s University, Kingston,
Ontario, Canada

I. INTRODUCTION

Despite the inherent methodological difficulties to be discussed in this section,
analyses that have attempted to relate the ‘‘intensity’’ of cytotoxic chemotherapy
to its effectiveness have had substantial influence both on the interpretation of
data from cancer clinical trials and on recommendations for practice in oncology.
Prior to a discussion of analytic pitfalls, a brief description of the evolution of
the concept of ‘‘dose intensity’’ and its impact on the design and analysis of
trials is presented. It should be noted from the outset that the term dose intensity
has not had a consistent definition. In fact, precisely how the amount of treatment
given over time is quantified is an important problem in itself (see later).
The publication that first provoked interest in the issue of the importance
of delivering adequate doses of chemotherapy represented a secondary analysis
by Bonadonna and Valagussa (1) of randomized trials of cyclophosphamide,
methotrexate, and 5-fluorouracil in the adjuvant therapy of women with breast
cancer. This secondary analysis subdivided patients in the trials according to how
much of their protocol-prescribed chemotherapy they actually received. There
was a clear positive relationship between this proportion and survival. Patients
who received 85% or more of their planned protocol dose had a 77% 5-year
relapse-free survival compared with 48% in those who received less than 65%.
The authors concluded ‘‘it is necessary to administer combination chemotherapy
at a full dose to achieve clinical benefit.’’
Shortly afterward, interest in the role of dose intensity was further height-
ened by a series of publications by Hryniuk and colleagues. Instead of examining
outcomes of individual patients on a single trial, Hryniuk’s approach was to corre-
late outcomes of groups of patients on different trials with the dose intensity of
their treatment. In this case, the intensity of treatment was related not to the
protocol specified dose but to a standard regimen. In trials both in breast cancer
and ovarian cancer the results of these analyses indicated a strong positive corre-
lation between dose intensity and outcome. Thus, in the treatment of advanced
breast cancer the correlation between planned dose intensity and response was
0.74 (p < 0.001) and between actually delivered dose intensity and response
0.82 (p < 0.01) (2). In the adjuvant setting a correlation of 0.81 (p < 0.00001)
was found between dose intensity and 3-year relapse-free survival (3). In ovarian
cancer the correlation between dose intensity and response was 0.57 (p < 0.01)
(4).
Reaction to these publications was quite varied. Although some authors
pointed out alternate explanations for the results, most appeared to regard these
analyses as having provided insight into an extremely important principle. For
example, the then Director of the National Cancer Institute, Vincent Devita, in
an editorial commentary on Hryniuk and Levine’s paper on adjuvant therapy,
concluded ‘‘a strong case can be made for the point of view that the most toxic
maneuver a physician can make when administering chemotherapy is to arbi-
trarily and unnecessarily reduce the dose’’ (5). Hryniuk himself argued ‘‘since
dose intensity will most likely prove to be a major determinant of treatment out-
comes in cancer chemotherapy, careful attention should be paid to this factor
when protocols are designed or implemented even in routine clinical practice’’
(6).
Despite numerous trials designed to explore the impact of dose intensity
on outcomes of therapy over the nearly 20 years since Bonnadonna’s publication,
the topic remains controversial. Thus, a recent book (7) devoted a chapter to a
debate on the topic. Similar discussions continue for other disease sites.
The remainder of this section reviews the methodologic and statistical is-
sues involved in the three settings mentioned above, namely, studies aimed at
relating the delivery of treatment to outcomes in individual patients, studies at-
tempting to find a relationship between chemotherapy dose and outcome among
a group of trials, and finally trials aimed at demonstrating in a prospective ran-
domized fashion an impact of dose or dose intensity.

II. RELATING DELIVERED DOSE TO OUTCOME
IN INDIVIDUAL PATIENTS ON A TRIAL

Three years after the publication of Bonnadonna’s article, Redmond et al. (8)
published a methodological and statistical critique of the analysis and interpreta-

tion of their findings. In my view, subsequent discussions of this issue have been
mainly amplifications or reiterations of Redmond’s article, so it is summarized
in detail here.

A. Problems in Analyzing Results in Patients Who
Stopped Treatment Before Its Intended Conclusion
Cancer chemotherapy regimens are usually given over at least several months,
so no matter the setting, some patients are likely to develop disease recurrence
or progression prior to the planned completion of therapy. How to calculate the
‘‘expected’’ amount of treatment in such patients is a problem. If the expected
duration is considered to be that planned at the outset, patients who progress on
treatment will by definition receive less than ‘‘full’’ doses, and a positive relation
between ‘‘completeness’’ of treatment and outcome will be likely.
An obvious solution to this problem is to consider as expected only the
duration of treatment up to the time of recurrence, which was the approach taken
by Bonnadonna. However, this method leads to a potential bias in the opposite
direction since, generally speaking, toxicity from chemotherapy is cumulative
and doses tend to be reduced over time. Thus, patients who stop treatment early
may have received a higher fraction of their expected dose over that time period.
In fact, in applying this method to data from a National Surgical Adjuvant Breast
and Bowel Project (NSABP) trial, Redmond et al. found an inverse relationship
between dose and outcome, that is, patients who received higher doses over the
time they were on treatment were more likely to develop disease recurrence.
A third approach suggested by Redmond et al. was to use what has become
known as the ‘‘landmark’’ (9) method: The effect of delivered dose up to a given
point in time is assessed only among patients who were followed and were free
of recurrence until that time. This method avoids the biases mentioned above but
omits from the analysis patients who recur during the time of greatest risk of
treatment failure. The application of this method to the same NSABP trial indi-
cated no effect of variation in the amount of drug received up to 2 years on
subsequent disease-free survival.
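A minimal sketch may help fix ideas about the landmark approach; the patient records and field names below are hypothetical and are not taken from the NSABP data. The two defining features are that membership in the analysis cohort is determined at the landmark time and that only dose delivered before the landmark is used as the exposure.

```python
# A minimal sketch (hypothetical records, not the NSABP analysis) of the landmark
# method: only patients followed and recurrence-free at the landmark enter the
# analysis, and dose is summarized only up to the landmark time.
LANDMARK = 2.0   # years, matching the 2-year landmark described in the text

patients = [
    # time = years to recurrence or last follow-up; dose_by_landmark = fraction of
    # protocol dose received in the first 2 years (all values assumed)
    {"time": 1.2, "recurred": True,  "dose_by_landmark": 0.95},  # excluded: event before landmark
    {"time": 1.5, "recurred": False, "dose_by_landmark": 0.80},  # excluded: follow-up too short
    {"time": 4.0, "recurred": True,  "dose_by_landmark": 0.70},
    {"time": 6.5, "recurred": False, "dose_by_landmark": 0.90},
]

# Patients still at risk and under follow-up at the landmark form the analysis cohort.
cohort = [p for p in patients if p["time"] >= LANDMARK]
for p in cohort:
    # Outcome is measured from the landmark onward; dose is frozen at the landmark.
    print(p["dose_by_landmark"], p["time"] - LANDMARK, p["recurred"])
```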
Finally, Redmond et al. proposed using dose administered up to a point in
time as a time-dependent covariate recalculated at the time of each recurrence
in a Cox proportional hazards model. In their application of this technique to
NSABP data, however, they found a significant delivered dose effect in the pla-
cebo arm, a finding reminiscent of the well-known results of the clofibrate trial
(10). The apparent explanation for this result is that treatment may be delayed
or omitted in the weeks preceding the diagnosis of recurrence. Thus, when drug
delivery in the 2 months before diagnosis of recurrence was not included in the
calculation, the dose effect disappeared both in the placebo and active treatment
arms.

B. Confounding
The methodological problems of analyses that attempt to relate completeness of
drug administration to outcome are not confined to the difficulties of dealing
with patients who progress while on therapy. Citing the clofibrate trial mentioned
above, Redmond et al. also pointed out that even if one could calculate delivered
versus expected dose in an unbiased manner, there would still remain the problem
that patients who comply with or who tolerate treatment may well differ from
those who do not on a host of factors that themselves might relate to ultimate
outcome. Thus, the relationship between administered dose and outcome might
be real in the sense that it is not a product of bias in the way the data are assembled
or analyzed but might not be causal as it is the product of confounding by underly-
ing patient characteristics. Since the clinical application of an association of dose
delivery and outcome rests on the assumption that the relationship is causal, the
inability to draw a causal conclusion is a major concern (11).
Analyses similar to those of Bonnadonna have been repeated many times
(12), sometimes with positive and sometimes with negative results. However, it
is hard to argue with the conclusion of Redmond et al.’s initial publication, that
is, this issue will not be resolved by such analyses. Only a prospective examina-
tion of the question can produce definitive findings.

III. ANALYSES COMBINING DATA
FROM MULTIPLE STUDIES

As mentioned earlier, the approach taken by Hryniuk and colleagues was, instead
of relating outcomes of individual patients to the amount of treatment they re-
ceived, to assess whether the outcomes of groups of patients treated with regi-
mens that themselves varied in intended or delivered dose intensity correlated
with that measure. To compare a variety of regimens used, for example, for the
treatment of metastatic breast cancer, Hryniuk calculated the degree to which the
rate of drug administration in mg/m2/wk corresponded to a standard protocol and
expressed the average of all drugs in a given regimen as a single proportion—
relative dose intensity (see Table 1 for an example of such a calculation). He
then correlated this quantity calculated for individual arms of clinical trials or
for single arm studies with a measure of outcome, for example, response or me-
dian survival.
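The relative dose-intensity calculation illustrated in Table 1 can be sketched in a few lines; the snippet below simply reproduces the arithmetic of the table and is not taken from Hryniuk's work.

```python
# A minimal sketch of the relative dose-intensity calculation in Table 1: each
# drug's delivered dose rate in mg/m2/wk is divided by the rate in the reference
# regimen, and the per-drug ratios are averaged for the regimen.
regimen = {
    # drug: (delivered mg/m2/wk, reference mg/m2/wk), values from Table 1
    "cyclophosphamide": (350, 560),
    "methotrexate": (20, 28),
    "fluorouracil": (300, 480),
}

ratios = {drug: given / ref for drug, (given, ref) in regimen.items()}
dose_intensity = sum(ratios.values()) / len(ratios)

for drug, r in ratios.items():
    print(f"{drug}: {r:.2f}")                                  # 0.62, 0.71, 0.62
print(f"dose intensity for regimen: {dose_intensity:.2f}")     # 0.65
```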
Hryniuk’s work has been criticized on two primary grounds: first the
method of calculating dose intensity and second the appropriateness of his
method of combining data. With respect to the method of calculation, it has been
pointed out that the choice of a standard regimen is arbitrary, and different results
are obtained if different standards are used (12,13). Further, the approach ignores

Table 1 Calculation of Dose Intensity

Drug                          Dose                       mg/m2/wk   Reference mg/m2/wk   Dose intensity
Cyclophosphamide              100 mg/m2 days 1–14        350        560                  0.62
Methotrexate                  40 mg/m2 days 1 and 8      20         28                   0.71
Fluorouracil                  600 mg/m2 days 1 and 8     300        480                  0.62
Dose intensity for regimen                                                               0.65

the impact of drug schedule and makes untested assumptions about the therapeu-
tic equivalence of different drugs. In more recent work (14), Hryniuk et al. ac-
cepted these criticisms and addressed them by calculating a new index called
‘‘summation dose intensity.’’ In this approach the intensity of an individual drug
is calculated relative to the dose of that drug that produces a 30% response rate
as a single agent. This index is more empirically based and avoids reference to
an arbitrary standard. As Hryniuk points out, whether it will resolve the debate
about dose intensity will depend on its prospective validation in randomized
trials.
Hryniuk’s work has also been considered a form of meta-analysis and criti-
cized because of its failure to meet standards (15) for such studies. In my view,
with the exception of an article by Meyer et al. (16), these studies are not actually
meta-analyses. Meta-analyses are conventionally considered to involve combin-
ing data from a set of studies, each of which attempted to measure some common
parameter, for example, an effect size or odds ratio. Hryniuk’s work, on the other
hand, compares results from one study to the next and estimates a parameter—
the correlation between dose intensity and outcome—that is not measured in any
one study. Irrespective of this distinction, however, the criticism that Hryniuk’s
early studies did not clearly state, for example, how trials were identified and
selected seems valid.
Formal meta-analytic techniques have, however, been applied to this issue.
As mentioned, the article by Meyer et al. contains what I would consider a true
meta-analysis of trials assessing the role of dose intensity in patients with non-
Hodgkin’s lymphoma. An accompanying editorial pointed out (13), however,
that the arms of these trials differed in more than dose intensity. Thus, because its
conclusions were otherwise based on nonrandomized comparisons, the editorial
argued that even in this analysis no level I evidence (in Sackett’s [17] terminol-
ogy) had been developed to support the role of dose intensity (13). In fact, although
they drew on data from randomized trials, Hryniuk's studies generated
only what Sackett considers level V evidence, since they relied mostly on
comparisons involving nonconcurrent, nonconsecutive case series.
A much more extensive attempt to generate more conclusive evidence by
combining data from randomized trials testing dose intensity was carried out by
Torri et al. (18). Using standard search techniques, these authors identified all
published randomized trials from 1975 to 1989 dealing with the chemotherapy
of advanced ovarian cancer. For each arm of each trial they calculated dose with
an ‘‘equalised standard method’’ similar to the summation standard method de-
scribed above. They used a log linear model that included a separate fixed effect
term for each study to estimate the relationship between total dose intensity and
the log odds of response and log median survival. The inclusion of the fixed
effect terms ensured that comparisons were between the arms of the same trial,
not across trials. They also used a multiple linear regression model to assess the
relative impact of the intensity of various drugs. Their analysis showed a statisti-
cally significant relationship between dose intensity and both outcomes, although
the magnitude of the relationship was less than that observed in Hryniuk’s stud-
ies—a finding they attributed to the bias inherent in across trial comparisons.
They concluded ‘‘the validity of the dose intensity hypothesis in advanced ovar-
ian cancer is substantiated based on the utilisation of improved methodology for
analysis of available data. This approach suggests hypotheses for the intensifica-
tion of therapy and reinforces the importance of formally evaluating dose intense
regimens in prospective randomised clinical trials.’’
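The central device of this analysis, a separate fixed-effect term for each study so that the dose-intensity coefficient is driven only by within-trial contrasts, can be illustrated with a small sketch. The numbers, the choice of outcome, and the least-squares fit below are all hypothetical and are not Torri et al.'s data or model.

```python
# A minimal sketch of a regression with a fixed-effect term for each study, so
# that the dose-intensity coefficient reflects within-trial comparisons only.
import numpy as np

# one row per trial arm: (study id, total dose intensity, log odds of response) -- assumed values
arms = [(0, 0.6, -0.8), (0, 1.0, -0.4),
        (1, 0.7, -1.2), (1, 1.1, -0.7),
        (2, 0.5, -1.5), (2, 0.9, -1.1)]

studies = sorted({s for s, _, _ in arms})
# design matrix: one indicator column per study (the fixed effects) plus dose intensity
X = np.array([[float(s == k) for k in studies] + [di] for s, di, _ in arms])
y = np.array([logodds for _, _, logodds in arms])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"within-trial dose-intensity effect (log odds per unit intensity): {coef[-1]:.2f}")
```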

IV. DESIGNING TRIALS TO TEST THE
‘‘DOSE-INTENSITY HYPOTHESIS’’

Authors critical of Bonnadonna’s (8) or Hryniuk’s (13) methodology called for
prospective (randomized) tests of the underlying hypothesis—as did Hryniuk (3)
himself. However, it does not seem to have been widely appreciated how difficult
this clinical research task actually is (19). There are two fundamental problems:
the multitude of circumstances in which differences in dose intensity need to be
tested and the difficulty of designing individual trials.

A. Settings for Testing


To test in clinical trials the hypothesis that dose intensity of chemotherapy is an
important determinant of outcome of cancer treatment, there would need to be
consensus on what constitutes convincing evidence for or against the hypothesis.
Clearly, a single negative trial would not be sufficient since it could always be
argued that the clinical setting or the regimen used was not appropriate. Thus,
to build a strong case against this hypothesis, a large number of trials of different

regimens in different settings would have to be completed. Conversely, and per-
haps less obviously, a positive trial, although it would demonstrate that the hy-
pothesis is true in a specific circumstance, would not necessarily indicate that
maximizing dose intensity in other situations would achieve the same result. As
Siu and Tannock put it: ‘‘generalizations about chemotherapy dose intensification
confuse the reality that the impact of this intervention is likely to be disease, stage,
and drug specific (20).’’ It can be argued, in fact, that systematically attempting to
test hypotheses of this type in clinical trials is unlikely to be productive, at least
from the point of view of providing a basis for choosing treatment for patients
(21). In any case, given the number of malignancies for which chemotherapy is
used and the number of regimens in use in each of those diseases, the clinical
research effort required to explore fully the implications of this hypothesis is
enormous. It is perhaps not surprising that the issue is still unresolved (14,20).

B. Design of Individual Trials


Equally problematic is the design of a trial that is intended to demonstrate clearly
the independent role of dose intensity. Dose intensity, as defined by Hryniuk
and coworkers, is a rate, the numerator of which is the total dose given and the
denominator the time over which it is delivered. Because of the mathematical
relationship among the component variables (total dose, treatment duration, and
dose intensity), a comparative study that holds one constant will necessarily vary
the other two. That is, in order for two treatments that deliver the same total dose
to differ in dose intensity, they must also differ in treatment duration. If these
treatments produce different results in a randomized trial, it would not be possible
to separate the effect of dose intensity from treatment duration. In fact, a trial
would have to have a minimum of four arms for each of these three factors to
be held constant, whereas the other two are varied. Given the practical limitations
on the extent to which existing chemotherapy regimens can be safely modified,
designing such a trial is very challenging. Perhaps the best example of a success-
ful attempt to separate the contribution of these variables is the Cancer and
Leukemia Group B (CALGB) trial of 5-fluorouracil, cyclophosphamide, and dox-
orubicin (FAC) chemotherapy in the adjuvant treatment of women with node-
positive breast cancer. This study compared three versions of FAC: a standard
regimen that delivered conventional doses of the drugs over six three-weekly
cycles; a dose-intense regimen that delivered the same total dose in one half the
time; and a low-dose regimen delivering one half the standard dose over the conven-
tional six cycles. (The trial lacked a comparison in which dose intensity was held
constant but treatment duration and total dose were varied.) The results of this
study were updated recently (22). The two arms delivering a higher total dose
demonstrated superior relapse-free and overall survival compared with the low-
dose arm. However, there was no difference in outcome between these two

‘‘high-dose’’ arms. Thus, it is impossible to separate the impact of total dose
and dose intensity. This trial also illustrates another feature of the trials that have
been done in this area; namely, it has been much easier to show that lowering
doses below conventional levels has an adverse effect on outcome than to demon-
strate an advantage to increasing dose above conventional levels (20).
In my view (19), considering the difficulties outlined above, Siu and Tan-
nock (20) are also correct that ‘‘further research (in this area) should concentrate
on those disease sites and/or chemotherapeutic agents for which there is a reason-
able expectation of benefit.’’ Although the retrospective methods described above
are inadequate for drawing causal conclusions, they still may be useful in pointing
out where further trials will be most fruitful. Whether such studies should be
designed to test a hypothesis as in the case of the CALGB trial described above
or should test a regimen developed on the basis of that hypothesis as in a recent
National Cancer Institute of Canada Clinical Trials Group adjuvant trial (23) is,
like the role of dose intensity itself, a subject for debate (13,19).

REFERENCES

1. Bonnadonna G, Valagussa B. Dose-response effect of adjuvant chemotherapy in
breast cancer. N Engl J Med 1981; 304:10–15.
2. Hryniuk W, Bush H. The importance of dose intensity in chemotherapy of metastatic
breast cancer. J Clin Oncol 1984; 2:1281–1288.
3. Hryniuk W, Levine MN. Analysis of dose intensity for adjuvant chemotherapy trials
in stage II breast cancer. J Clin Oncol 1986; 4:1162–1170.
4. Levin L, Hryniuk WM. Dose intensity analysis of chemotherapy regimens in ovarian
carcinoma. J Clin Oncol 1987; 5:756–767.
5. Devita VT. Dose-response is alive and well. J Clin Oncol 1986; 4:1157–1159.
6. Hryniuk WM, Figueredo A, Goodyear M. Applications of dose intensity to problems
in chemotherapy of breast and colorectal cancer. Semin Oncol 1987; 4(suppl 4):
3–11.
7. Gersheson DM, McGuire WP. Ovarian Cancer—Controversies in Management.
New York: Churchill Livingstone, 1998.
8. Redmond C, Fisher B, Wieand HS. The methodologic dilemma in retrospectively
correlating the amount of chemotherapy received in adjuvant therapy protocols with
disease-free survival. Cancer Treat Rep 1983; 67:519–526.
9. Anderson JR, Cain KC, Gelber RD. Analysis of survival by tumour response. J Clin
Oncol 1983; 1:710–719.
10. The Coronary Drug Project Research Group. Influence of adherence to treatment
and response of cholesterol on mortality in the coronary drug project. N Engl J Med
1980; 303:1038–1041.
11. Geller NL, Hakes TB, Petrone GR, Currie V, Kaufman R. Association of disease-
free survival and percent of ideal dose in adjuvant breast chemotherapy. Cancer
1990; 66:1678–1684.

12. Henderson IC, Hayes DF, Gelman R. Dose-response in the treatment of breast can-
cer: a critical review. J Clin Oncol 1988; 6:1501–1513.
13. Gelman R, Neuberg D. Making cocktails versus making soup. J Clin Oncol 1991;
9:200–203.
14. Hryniuk W, Frei E, Wright FA. A single scale for comparing dose-intensity of all
chemotherapy regimens in breast cancer: summation dose-intensity. J Clin Oncol
1998; 16:3137–3147.
15. Labbe KA, Detsky AS, O’Rourke K. Meta-analysis in clinical research. Ann Intern
Med 1987; 107:224.
16. Meyer RM, Hryniuk WM, Goodyear MDE. The role of dose intensity in determining
outcome in intermediate-grade non-Hodgkin’s lymphoma. J Clin Oncol 1991; 9:
339–347.
17. Sackett DL. Rules of evidence and clinical recommendations in the use of antithrom-
botic agents. Chest 1989; 95(suppl):2S–4S.
18. Torri V, Korn EL, Simon R. Dose intensity analysis in advanced ovarian cancer
patients. Br J Cancer 1993; 67:190–197.
19. Pater JL. Introduction: implications of dose intensity for cancer clinical trials. Semin
Oncol 1987; 14(suppl 4):1–2.
20. Siu LL, Tannock IP. Chemotherapy dose escalation: case unproven. J Clin Oncol
1997; 15:2765–2768.
21. Pater JL, Eisenhauer E, Shelley W, Willan A. Testing hypotheses in clinical trials.
Experience of the National Cancer Institute of Canada Clinical Trials Group. Cancer
Treat Rep 1986; 70:1133–1136.
22. Budman DR, Berry D, Cirrincione CT, Henderson IC, Wood WC, Weiss RB, et al.
Dose and dose intensity as determinants of outcome in the adjuvant treatment of
breast cancer. J Natl Cancer Inst 1998; 90:1205–1211.
23. Levine MN, Bramwell VH, Pritchard KL, Norris BD, Shepherd LS, Abu-Zahra H,
et al. Randomized trial of intensive cyclophosphamide, epirubicin, and fluorouracil
chemotherapy compared with cyclophosphamide, methotrexate, and fluorouracil in
premenopausal women with node-positive breast cancer. J Clin Oncol 1998; 16:
2651–2658.
26
Why Kaplan-Meier Fails and
Cumulative Incidence Succeeds
When Estimating Failure Probabilities
in the Presence of Competing Risks

Ted A. Gooley, Wendy Leisenring, John Crowley, and Barry E. Storer
Fred Hutchinson Cancer Research Center, University of Washington, Seattle,
Washington

I. INTRODUCTION

In many fields of medical research, time-to-event end points are used to assess
the potential efficacy of a treatment. The outcome of interest associated with
some of these end points may be a particular type of failure, and it is often of
interest to estimate the probability of this failure by a specified time among a
particular population of patients. For example, the occurrence of end-stage renal
failure is an outcome of interest among patients with insulin-dependent diabetes
mellitus (IDDM). Given a sample drawn from the population of patients with
IDDM, one may therefore wish to obtain an estimate of the probability of devel-
oping end-stage renal failure. As other examples, consider the probability of death
due to prostate cancer among patients afflicted with this disease and the probabil-
ity of the occurrence of cytomegalovirus (CMV) retinitis among patients with
AIDS.


The examples above share two features. First, the outcome of interest in
each is a time-to-event end point, that is, one is interested not only in the occur-
rence of the outcome but also the time at which the outcome occurs. Second,
multiple modes of failure exist for each of the populations considered, namely
failure from the outcome of interest plus other types whose occurrence preclude
the failure of interest from occurring. In the IDDM example, death without end-
stage renal failure is a type of failure in addition to the failure of interest, and
if a patient with IDDM dies without renal failure, this failure precludes the out-
come of interest (end-stage renal failure) from occurring. For the prostate cancer
example, deaths from causes unrelated to this cancer comprise types of failure
in addition to death from the index cancer, and the occurrence of these alternate
failure types precludes the outcome of interest (death from prostate cancer) from
occurring. Finally, patients with AIDS who die without CMV retinitis are not
capable of going on to develop CMV retinitis, that is, to fail from the cause of
interest. In each of these examples, a competing type of failure exists in addition
to the failure of interest. These competing causes of failure are referred to as
competing risks for the failure of interest. Specifically, we define a competing
risk as an event whose occurrence precludes the occurrence of the failure type
of interest.
The method due to Kaplan and Meier (1) was developed to estimate the
probability of an event for time-to-event end points, but the assumptions required
to make the resulting estimate interpretable as a probability are not met when
competing risks are present. Nonetheless, this method is often used and the re-
sulting estimate misinterpreted as representing the probability of failure in the
presence of competing risks. Statistical methods for obtaining an estimate that
is interpretable in this way are not new (2–9), and this topic has also received
some attention in medical journals (10–13). We refer to such an estimate as a
cumulative incidence (CI) estimate (3), although it has also been referred to as
the cause-specific failure probability, the crude incidence curve, and cause-spe-
cific risk (14). Similarly, the term cumulative incidence has been used for various
purposes. Our reference to this term, however, will be consistent with its interpre-
tation as an estimate of the probability of failure.
Despite its recognition in both the statistical and medical literature as the
appropriate tool to use, CI is not uniformly used in medical research (15) for
purposes of estimation in the setting in which competing risks exist. We believe
the reason for this is due to both a lack of complete understanding of the Kaplan-
Meier method and a lack of understanding of how CI is calculated and hence
the resulting difference between the two estimators. In this article, we present,
in a nontechnical fashion, a description of the Kaplan-Meier estimate not com-
monly seen. We believe this expression is useful for purposes of understanding
why the Kaplan-Meier method results in an estimate that is not interpretable as
a probability when used in the presence of competing risks. In addition, this

alternate characterization will be extended in a way that allows us to represent
CI in a manner similar to that used to obtain the estimate from the Kaplan-Meier
method, and in so doing, the validity of CI and the difference between the estima-
tors will be clearly demonstrated.
In the next section, we describe the estimate associated with the Kaplan-
Meier method in the alternate manner alluded to above in the setting in which
competing risks do not exist. The discussion reviews the concept of censoring
and provides a heuristic description of the impact that a censored observation
has on the estimate. An example based on hypothetical data is used to illustrate
the concepts discussed. The subsequent section contains a description of how
the two estimates are calculated when competing risks are present, utilizing the
description of censoring provided in the preceding section. Data from a clinical
trial are then used to calculate each estimate for the end point of disease progres-
sion among patients with head and neck cancer, thus providing further demonstra-
tion of the concepts discussed previously. We close with a discussion that summa-
rizes and presents conclusions and recommendations.

II. ESTIMATION IN THE ABSENCE OF COMPETING RISKS:
KAPLAN-MEIER ESTIMATE

For time-to-event data without competing risks, each patient under study will
either fail or survive without failure to last contact. We use ‘‘failure’’ here and
throughout as a general term. The specific type of failure depends on the end
point analyzed. A patient without failure at last contact is said to be censored
due to lack of follow-up beyond this time, that is, it is known that such a patient
has not failed by last contact, but failure could occur at a later time.
The most reasonable and natural estimate of the probability of failure by
a prespecified point in time is the simple ratio of the number of failures divided
by the total number of patients, provided all patients without failure have follow-
up to this time. This simple ratio is appropriately interpreted as an estimate of
the probability of failure. This estimate is not only intuitive but is also unbiased
when all patients who have not failed have follow-up through the specified time;
unbiasedness is a desirable statistical property for estimators to possess.
If one or more patients are censored before the specified time, the simple
ratio is no longer adequate and methods that take into account data from the
censored patients are required to obtain an estimate consistent with this ratio.
The method due to Kaplan and Meier was developed for precisely this purpose,
and when competing risks are not present, this method leads to an estimate that
is consistent with the desired simple ratio. The resulting estimate is also exactly
equal to this ratio when all patients have either failed or been followed through
the specified follow-up time. When used to estimate the probability of failure,

one uses the complement of the Kaplan-Meier estimate, which we shall denote
by 1-KM, where the Kaplan-Meier estimate (KM) represents the probability of
surviving without failure. 1-KM can be interpreted as an estimate of the probabil-
ity of failure when competing risks are not present.
To appreciate how data from patients without follow-up to the specified
time are incorporated into 1-KM, it is necessary to understand how censored
observations are handled computationally. We present below a heuristic descrip-
tion of censoring not commonly seen. We believe that the use of this approach
leads to a clear understanding of what 1-KM represents. In addition, this alternate
explanation is used in the following section to explain why 1-KM fails as a valid
estimate of the probability of failure and to highlight how and when 1-KM and
CI differ when used in the presence of competing risks. What follows is a non-
technical description. The interested reader can find a detailed mathematical de-
scription elsewhere (16).
Note that the probability of failure depends on the timepoint at which the
associated estimate is desired, and as failures occur throughout time the estimate
will increase with each failure. Because an estimate that is consistent with the
simple ratio described above is desired, any estimate that achieves this goal will
change if and only if a patient fails. Moreover, if it is assumed that all patients
under study are equally likely to fail, it can be shown that each failure contributes
a prescribed and equal amount to the estimate, provided all patients have either
failed or been followed to a specified timepoint. This prescribed amount is simply
the inverse of the total number of patients under study.
If a patient is censored prior to the time point of interest, however, failure
may occur at a time beyond that at which censoring occurred, and this information
must be taken into account to obtain an estimate consistent with the desired sim-
ple ratio. One way to view the manner in which censored observations are handled
is as follows. As stated above, each patient under study possesses a potential
contribution to the estimate of the probability of failure, and each time a patient
fails the estimate is increased by the amount of the contribution of the failed
patient. Since patients who are censored due to lack of follow-up through a speci-
fied time remain capable of failure by this time, however, the potential contribu-
tion of these patients cannot be discounted. In particular, one can consider the
potential contribution of a censored patient as being redistributed among all pa-
tients known to be at risk of failure beyond the censored time, as noted by Efron
(17). It can be shown that this ‘‘redistribution to the right’’ makes the resulting
estimate consistent with the simple ratio and therefore interpretable as an estimate
of the probability of failure. Because of this redistribution, any failure that takes
place after a censored observation contributes slightly more to the estimate than
do failures prior to the censored observation, that is, the potential contribution
of a patient to the estimate increases after the occurrence of a censored patient.
Another way to understand the impact of censoring is to consider such an observa-
Table 1 Hypothetical Data to Illustrate Concept of Redistribution to the Right

Event time    No. known to    No. known failures (total no.    No.          Contribution of                   Incidence
(in years)*   be at risk†     known failures by time)‡         censored§    next failure¶                     estimate**
0             100             0                                0            1/100 = 0.01                      0
15            95              5 (5)                            0            0.01                              5(1/100) = 0.05
20            70              0 (5)                            25           0.01 + 25(0.01)(1/70) = 0.0136    0.05
30            56              14 (19)                          —            —                                 0.05 + 14(0.0136) = 0.24

* Denotes the time at which various events occurred. An event is either a failure or a censored observation.
† Denotes the number of patients who have survived to the associated time, i.e., the number at risk of failure beyond this time.
‡ Denotes the number of patients who failed at the associated time. Parenthetically is the total number of known failures by the associated time.
§ Denotes the number of patients censored at the associated time. Each of these patients could fail at a later time, so the potential contribution to the estimate due to these patients is redistributed evenly among all patients known to be at risk beyond this time.
¶ Denotes the amount that each failure occurring after the associated time will contribute to the estimate.
** Gives the estimate of the probability of failure by the associated time. This is derived by multiplying the number of failures at each time by the associated contribution and summing the results or by summing the number of known failures and the number of projected failures and dividing by 100, i.e., the number initially at risk.

tion as a projected failure or survivor to the time at which the next failure occurs,
where this projection is based on the experience of patients who have survived
to the time at which censoring occurred.
To help illustrate the above discussion, consider the following hypothetical
example summarized in Table 1. Suppose that a selected group of 100 patients
with IDDM is being followed for development of end-stage renal failure and that
none of these 100 dies without failure (i.e., no competing-risk events occur).
Assume that all patients have complete follow-up to 15 years after diagnosis of
IDDM and 5 of the 100 have end-stage renal failure by this time. The 15-year
estimate of renal failure is therefore 5 × (1/100) = 0.05, where 1/100 is the
potential contribution of each patient to the estimate of the probability of failure.
Suppose the remaining 95 patients survive without renal failure to 20 years but
25 of these 95 have follow-up only to 20 years (i.e., each of these 25 patients is
censored at 20 years). Since each censored patient could develop renal failure
beyond 20 years, the potential contribution of each patient cannot be discounted.
In particular, the potential contribution to the estimate for each can be thought
of as being uniformly redistributed among the 70 patients known to be at risk
of failure beyond the censored time. In other words, each remaining patient’s
potential contribution is increased over 1/100 by 25 × (1/100) divided among
the 70 remaining patients, or (1/100) + 25 × (1/100) × (1/70) = 0.0136. Sup-
pose 14 of these 70 go on to develop renal failure by 30 years, so that a total of
19 patients are known to have failed by this time. Since 25 patients had follow-
up less than 30 years, however, the estimate of failure by this time should be
larger than 19/100 ⫽ 0.19; 5 patients fail whose contribution to the estimate is
0.01 and 14 fail whose contribution is 0.0136. The estimate consistent with the
desired simple ratio is therefore 5 ⫻ (0.01) ⫹ 14 ⫻ (0.0136) ⫽ 0.24. An alternate
way to understand this is that since 20% (14/70) of the patients known to be at
risk of renal failure beyond 20 years failed by 30 years, it is reasonable to assume
that 20% of the 25 patients censored at 20 years will fail by 30 years. This leads
to a projected total of 24 failures (5 + 14 known failures + 5 projected failures)
and a 30-year estimate of renal failure of 24/100 = 0.24.
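The redistribution argument can be written out directly; the sketch below reproduces the contributions and estimates shown in Table 1 for the hypothetical IDDM cohort.

```python
# A minimal sketch of the "redistribution to the right" calculation for the
# hypothetical Table 1 data: each of 100 patients starts with a potential
# contribution of 1/100, and the contributions of the 25 patients censored at
# 20 years are spread evenly over the 70 patients still at risk.
n = 100
contribution = 1 / n                     # 0.01 per patient initially
estimate_15y = 5 * contribution          # 5 failures by 15 years -> 0.05

censored, still_at_risk = 25, 70
contribution += censored * contribution / still_at_risk   # 0.01 + 25(0.01)/70 = 0.0136

estimate_30y = estimate_15y + 14 * contribution            # 0.05 + 14(0.0136) = 0.24
print(round(estimate_15y, 2), round(estimate_30y, 2))      # 0.05 0.24
```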

III. ESTIMATION IN THE PRESENCE OF COMPETING
RISKS: CUMULATIVE INCIDENCE VERSUS
KAPLAN-MEIER

Consider now the situation in which competing risks exist. In this setting, three
outcomes are possible for each patient under study: each will fail from the event
of interest, fail from a competing risk, or survive without failure to last contact.
In this setting, 1-KM is not capable of appropriately handling failures from a
competing risk because in its calculation, patients who fail from a competing

risk are treated in the same manner as patients who are censored. Patients who
have not failed by last contact retain the potential to fail, however, whereas pa-
tients who fail from a competing risk do not. As a result of this, failures from
the event of interest that occur after failures from a competing risk contribute
more to the estimate than is appropriate in the calculation of 1-KM, as will be
demonstrated in the example below. This overestimate is a result of the fact that
the potential contribution from a patient who failed from a competing risk, and
who is therefore not capable of a later failure, is redistributed among all patients
known to be at risk of failure beyond this time. This redistribution has the effect
of inflating the estimate above what it should be, that is, 1-KM is not consistent
with the desired simple ratio that is an appropriate estimate of this probability.
An alternative to 1-KM in the presence of competing risks is the CI esti-
mate. This estimate is closely related to 1-KM, and patients who are censored
due to lack of follow-up are handled exactly as is done in the calculation of 1-
KM. Failures from a competing risk, however, are dealt with in a manner appro-
priate for purposes of obtaining an estimate interpretable as a probability of fail-
ure. In the calculation of CI, patients who fail from a competing risk are correctly
assumed to be unable to fail from the event of interest beyond the time of the
competing-risk failure. The potential contribution to the estimate for such patients
is therefore not redistributed among the patients known to be at risk of failure,
that is, failures from a competing risk are not treated as censored as in the calcula-
tion of 1-KM. The difference between 1-KM and CI, therefore, comes about from
the way in which failures from a competing risk are handled. If there are no
failures from a competing risk, 1-KM and CI will be identical. If failures from
a competing risk exist, however, 1-KM is always larger than CI at and beyond
the time of first failure from the event of interest that follows a competing-risk
failure.
Returning to the above example on end-stage renal failure, suppose that
competing-risk events do occur (i.e., there are deaths without renal failure). Sup-
pose all assumptions are the same as before with the exception that the 25 patients
censored above instead die without renal failure at 20 years. The estimate 1-KM
at 30 years is the same as previously because these 25 patients are treated as
censored at 20 years. Since competing risks have occurred, however, CI is the
appropriate estimate, and the 25 patients who die without renal failure should
not be treated as censored. In this simple example, all patients have complete
follow-up to 30 years, that is, each has either developed renal failure, died without
failure, or survived without failure by 30 years. The estimate of the probability
of renal failure by 30 years should therefore be the number of failures divided
by the number of patients, or 19/100 = 0.19 (i.e., the desired simple ratio). Due
to the inappropriate censoring of the patients who died without renal failure,
however, 1-KM = 0.24, an estimate that is not interpretable as the probability
of end-stage renal failure.

This simple example illustrates how 1-KM and CI differ when competing
risks are present. It also demonstrates why treating patients who fail from a com-
peting risk as censored leads to an estimate (i.e., 1-KM) that cannot be validly
interpreted as a probability of failure. In general, the calculation of 1-KM and
CI is more involved than shown in the above example due to a more complex
combination of event times, but the concepts detailed above are identical.
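The difference between the two estimators can also be verified numerically. The sketch below, which is not the authors' software, applies both calculations to the modified renal-failure example, treating the 25 deaths without renal failure as competing-risk events.

```python
# A minimal sketch comparing 1-KM with cumulative incidence on the modified
# hypothetical example: 5 renal failures at 15 years, 25 deaths without renal
# failure at 20 years, 14 renal failures at 30 years, 56 patients event-free at 30.
# status: 1 = renal failure (event of interest), 2 = death without renal failure, 0 = censored
data = [(15, 1)] * 5 + [(20, 2)] * 25 + [(30, 1)] * 14 + [(30, 0)] * 56

def one_minus_km(data, t):
    """1-KM for the event of interest, with competing-risk failures treated as censored."""
    surv, at_risk = 1.0, len(data)
    for time in sorted({u for u, _ in data}):
        if time > t:
            break
        events = sum(1 for u, s in data if u == time and s == 1)
        surv *= 1 - events / at_risk
        at_risk -= sum(1 for u, _ in data if u == time)
    return 1 - surv

def cumulative_incidence(data, t):
    """CI for the event of interest; competing-risk failures leave the risk set
    without their potential contribution being redistributed."""
    ci, overall_surv, at_risk = 0.0, 1.0, len(data)
    for time in sorted({u for u, _ in data}):
        if time > t:
            break
        events = sum(1 for u, s in data if u == time and s == 1)
        failures = sum(1 for u, s in data if u == time and s in (1, 2))
        ci += overall_surv * events / at_risk
        overall_surv *= 1 - failures / at_risk
        at_risk -= sum(1 for u, _ in data if u == time)
    return ci

print(round(one_minus_km(data, 30), 2))          # 0.24 -- overstates the failure probability
print(round(cumulative_incidence(data, 30), 2))  # 0.19 -- equals the simple ratio 19/100
```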

IV. EXAMPLE FROM REAL DATA: SQUAMOUS CELL
CARCINOMA

To further illustrate the differences between 1-KM and CI, consider the following
example taken from a phase III Southwest Oncology Group clinical trial. The
objectives of this study were to compare the response rates, treatment failure
rates, survival, and pattern of treatment failure between two treatments for pa-
tients with advanced-stage resectable squamous cell carcinoma of the head and
neck (18). A conventional (surgery and postoperative radiotherapy) and an exper-
imental (induction chemotherapy followed by surgery and postoperative radio-
therapy) treatment were considered. We use data from this clinical trial among

Figure 1 The complement of the Kaplan-Meier estimate (1-KM) and the cumulative
incidence estimate (CI) of disease progression among 76 patients with head and neck
cancer. The numerical values of each estimate are indicated.

patients treated with the conventional treatment to calculate both 1-KM and CI
for the outcome of disease progression.
Among 175 patients entered into the study, 17 were ruled ineligible. Of
the 158 eligible patients, 76 were randomized to receive the conventional treat-
ment and 32 had disease progression while 37 died without progression. There-
fore, 32 of 76 patients failed from the event of interest (disease progression),
whereas 37 of 76 patients failed from the competing risk of death without progres-
sion. The remaining seven patients were alive without progression at last follow-
up and were therefore censored. Each of the seven censored patients had follow-
up to at least 7.0 years, and all cases of progression occurred prior to this time.
All patients therefore have complete follow-up through 7.0 years after randomiza-
tion so that the natural estimate of the probability of progression by 7.0 years is
32/76 = 42.1%, precisely the value of CI (Fig. 1). On the other hand, the value
of 1-KM at this time is 51.6%, as shown in Figure 1, the discrepancy being due
to the difference in the way that patients who failed from the competing risk of
death without progression are handled.

V. DISCUSSION

We have shown that when estimating the probability of failure for end points
that are subject to competing risks, 1-KM and CI can result in different estimates.
If it is of interest to estimate the probability of a particular event that is subject
to competing risks, 1-KM should never be used, even if the number of competing-
risk events is relatively small (and therefore the two estimates not much dif-
ferent).
It is not clear what 1-KM in such situations represents. The only way to
interpret 1-KM in this setting is as the probability of failure from the cause of
interest in the hypothetical situation where there are no competing risks and the
risk of failure from the cause of interest remains unchanged when competing
risks are removed. Because of the way the Kaplan-Meier method handles failures
from a competing risk, 1-KM will not be a consistent estimate of the probability
of failure from the cause of interest. The discrepancy between 1-KM and CI is
dependent on the timing and frequency of the failures from a competing risk; the
earlier and more frequent the occurrences of such events, the larger the difference.
Regardless of the magnitude of this difference, however, reporting 1-KM in such
situations, if it is interpreted as an estimate of the probability of failure, is incor-
rect and can be very misleading.
The wide availability of statistical software packages that calculate
KM estimates but do not directly calculate CI estimates undoubtedly
contributes to the frequent misuse of 1-KM. Although we have not seen the CI
estimate offered commercially in any software packages, its calculation is not

computationally difficult, and programs that accomplish this are reasonably
straightforward to write.
The focus of this article was to demonstrate that the methods discussed
above lead to different estimates in the presence of competing risks, even though
each is intended to measure the same quantity. Nonetheless, we emphasize that
presenting only results describing the probability of the event of interest falls
short of what should be examined so that one can fully appreciate factors that
affect the outcome. When analyzing and presenting data where competing risks
occur, it is therefore important to describe probabilities of failure not only from
the event of interest but also failures due to competing-risk events. One approach
to dealing with this problem is to present an estimate of the time to the minimum
of the different types of failure. For a discussion of related topics, see Pepe et
al. (10).
We focused purely on the estimation of probability of failure in this chapter.
It is often of interest, however, to compare outcome between two treatment
groups. How such comparisons are made and exactly what is compared can be
complicated issues and have not been addressed here. Such topics are beyond
the scope of this chapter but have been addressed in previous work (19,20). If
estimation is the goal and competing risks exist, however, the use of 1-KM is
inappropriate and yields an estimate that is not interpretable. In such situations,
CI should always be used for purposes of estimation.

ACKNOWLEDGMENT

Supported by grants CA 18029, CA 38296, and HL 36444 awarded by the Na-
tional Institutes of Health.

REFERENCES

1. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am
Stat Assoc 1958; 53:457–481.
2. Aalen O. Nonparametric estimation of partial transition probabilities in multiple
decrement models. Ann Stat 1978; 6:534–545.
3. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. New
York: John Wiley, 1980.
4. Prentice RL, Kalbfleisch JD. The analysis of failure times in the presence of compet-
ing risks. Biometrics 1978; 34:541–554.
5. Benichou J, Gail MH. Estimates of absolute cause-specific risk in cohort studies.
Biometrics 1990; 46:813–826.
6. Dinse GE, Larson MG. A note on semi-Markov models for partially censored data.
Biometrika 1986; 73:379–386.

7. Lawless JF. Statistical Models and Methods for Lifetime Data. New York: John
Wiley; 1982.
8. Gaynor JJ, Feuer EJ, Tan CC, Wu DH, Little CR, Straus DJ, et al. On the use of
cause-specific failure and conditional failure probabilities: examples from clinical
oncology data. J Am Stat Assoc 1993; 88:400–409.
9. Pepe MS, Mori M. Kaplan-Meier, marginal or conditional probability curves in sum-
marizing competing risks failure time data? Stat Med 1993; 12:737–751.
10. Pepe MS, Longton G, Pettinger M, Mori M, Fisher LD, Storb R. Summarizing data
on survival, relapse, and chronic graft-versus-host disease after bone marrow trans-
plantation: motivation for and description of new methods. Br J Haematol 1993; 83:
602–607.
11. McGiffin DC, Naftel DC, Kirklin JK, Morrow WR, Towbin J, Shaddy R, et al.
Predicting outcome after listing for heart transplantation in children: comparison of
Kaplan-Meier and parametric competing risk analysis. Pediatrics 1997; 16:713–722.
12. Caplan RJ, Pajak TF, Cox JD. Analysis of the probability and risk of cause-specific
failure. Int J Radiat Oncol Biol Phys 1994; 29:1183–1186.
13. Gelman R, Gelber R, Henderson C, Coleman CN, Harris JR. Improved methodology
for analyzing local and distant recurrence. J Clin Oncol 1990; 8:548–555.
14. Cheng SC, Fine JP, Wei LJ. Prediction of cumulative incidence function under the
proportional hazard model. Biometrics 1998; 54:219–228.
15. Niland JC, Gebhardt JA, Lee J, Forman SJ. Study design, statistical analyses, and
results reporting in the bone marrow transplantation literature. Biology of Blood and
Marrow Transplantation 1995; 1:47–53.
16. Gooley TA, Leisenring W, Crowley JC, Storer BE. Estimation of failure probabili-
ties in the presence of competing risks: new representations of old estimators. Stat
Med 1999; 18:695–706.
17. Efron B. The two sample problem with censored data. Proceedings of the fifth Berke-
ley symposium in mathematical statistics, IV. New York: Prentice-Hall, 1967, pp.
831–853.
18. Schuller DE, Metch B, Stein DW, Mattox D, McCracken JD. Preoperative chemo-
therapy in advanced resectable head and neck cancer: final report of the Southwest
Oncology Group. Laryngoscope 1988; 98:1205–1211.
19. Pepe MS. Inference for events with dependent risks in multiple endpoint studies. J
Am Stat Assoc 1991; 86:770–778.
20. Gray RJ. A class of k-sample tests for comparing the cumulative incidence of a
competing risk. Ann Stat 1988; 16:1141–1154.
27
Meta-Analysis

Luc Duchateau and Richard Sylvester


European Organization for Research and Treatment of Cancer,
Brussels, Belgium

I. INTRODUCTION

Individual cancer clinical trials are often not powerful enough to provide a defini-
tive answer to the question of interest. However, data may be available from
several different trials that have studied the same or a similar question. A meta-
analysis (overview) is the process whereby the quantitative results of separate
(but similar) studies are combined together using formal statistical techniques to
make use of all the available information. Due to the larger sample size, this
provides a more powerful test of the hypothesis and increased precision (lower
variance) of the estimated treatment effect. Meta-analyses are often carried out
if the individual trials addressing a given question of interest are inconclusive or
if there are conflicting results from the various studies (1).
It is important that the clinical question addressed in a meta-analysis is
neither too broad nor too narrow. If the clinical question is too broad, the overall
results might just be an average of the effects of different treatment regimens
and/or patient populations and do not reflect any real effect in a specific treatment
regimen and/or patient population. On the other hand, if the clinical question is
too narrow, few eligible trials might be found and the results might not be of
general interest. As compared with a review of the literature in which personal
judgment may play a role in drawing conclusions (2), a meta-analysis is more
objective since it is quantitative in nature. To ensure objectivity, however, it is
important that a protocol be written describing how the meta-analysis will be per-
formed, for instance, clearly stating trial eligibility criteria and statistical methods
that will be used.

II. TYPES OF META-ANALYSES

There are three different types of meta-analyses. They may be based on the fol-
lowing:
1. The literature (MAL): A literature search is undertaken to find all rele-
vant publications. The results from these publications are then com-
bined based on the information available in the publication (e.g., p
value, log rank statistic, number of events in the treated and control
groups, etc.).
2. Summary data (MAS): Again, all relevant publications are identified
and a summary of the relevant statistics (e.g., Kaplan-Meier survival
estimates in the treated and control groups at a specified time or logrank
O, E, and V) is obtained from the authors of the publication.
3. Individual patient data (MAP): A search is not only done in the litera-
ture for published trials but also in the scientific community for unpub-
lished trials. For all trials, whether published or not, individual patient
data are obtained from the investigator. For each patient, data are re-
quested on the end point of interest, for example, the exact date of
death or censoring, along with additional information on the patient’s
treatment and prognostic factors.
Performing a MAL in the field of cancer is often difficult since the time
to an event, duration of survival, or time to progression is generally the main
end point (3). The information that can be obtained from the different publications
varies greatly, from the logrank statistic and Kaplan-Meier curves to virtually no
information except the number of patients randomized to treatment and control.
It is not always clear how this information can be combined for an
overall test. In most cases it is not feasible to perform a genuine time-to-event
analysis.
MAS is an alternative that could be considered. It shares, however, many
of the weaknesses of MAL. First, there are two important sources of bias in both
MAL and MAS but not in MAP (4):
1. Publication bias, which is caused by the fact that trials where there is
a statistically significant treatment effect are more likely to be pub-
lished than ‘‘negative’’ trials, thus leading to an overestimation of the
size of the difference.

2. Selection bias, which is caused by the fact that some patients may
have been excluded from the analysis presented in the publication for
reasons that are treatment related. These patients are not included in
the MAL or MAS, whereas a MAP is based on all randomized patients.
Second, as detailed by Stewart and Clarke (5), meta-analysis of individual
patient data presents numerous other advantages as compared with MAL and
MAS:
1. Quality control of the individual patient data. For instance the random-
ization sequence can be verified so that the trials that were not properly
randomized, and thus could be biased (6), can be excluded from the
MAP.
2. Updated follow-up data can be obtained from the investigator, thus
leading to a more powerful analysis as compared with MAL.
3. Subgroup and prognostic factor analyses can be carried out. These are
only feasible in MAP as individual patient data are needed for this
purpose.
For the above reasons, MAP is the only type of meta-analysis that can be
recommended in oncology, even though it is much more time-consuming than
the two other types of meta-analyses. In the remainder of this chapter, it is mainly
MAP that is discussed.

III. STATISTICAL TECHNIQUES FOR META-ANALYSES


A. General Principles for Combining Data
Historically, one of the first techniques for combining data from several experi-
ments was proposed by Fisher (7) and was based on the combination of p values.
Statistical techniques that combine either p values or standardized test statistics
such as the chi-square test statistic should be avoided in meta-analyses. Such
techniques do not take into consideration the size of the trial (number of events,
total person-years of follow-up, etc.), the precision of the estimate (variance), or
even the direction of the difference in the case of a two-sided test. A trial of 20
patients with a significant p value will thus contribute as much as a trial of 200
patients with the same p value.
Another extreme and possibly misleading solution to the problem of com-
bining data from different studies is to disregard the fact that data come from
different trials. An example can easily show how this might lead to incorrect
conclusions. Assume that the results of two trials, expressed as the number of
patients dead and alive in the treated and control groups, are to be combined.
The data are summarized in Table 1.

Table 1 Data From Two Trials With Binary Outcomes

               Dead    Alive        Total
Trial 1
  Treatment    18      42 (70%)     60
  Control      36      84 (70%)     120
  Total        54      126 (70%)    180
Trial 2
  Treatment    84      36 (30%)     120
  Control      42      18 (30%)     60
  Total        126     54 (30%)     180

Within each trial the survival is the same in the treated and in the control
group, but the overall survival in the first trial is 70% as compared with 30% in
the second trial. Thus, there is no difference at all between treatment and control.
The combination of the data from these two trials is shown in Table 2.
For the combined data, the survival in the treated group is only 43% as
compared with 57% for the control group, a difference that is significant at the
1% level. This paradoxical result, better known as Simpson’s paradox (8), is
caused by the fact that relatively more patients in the treated group come from
the study with a low survival rate.
Thus, in combining results from different trials, data should not just be
pooled and an analysis done on the pooled data. Rather, a statistic that compares
the treatment and control groups must be calculated for each trial. Subsequently,
the statistics are combined, weighting them by their precision.
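The paradox is easy to verify directly from the counts in Tables 1 and 2; the short check below uses those hypothetical numbers.

```python
# A quick check (counts copied from Tables 1 and 2) that pooling the two
# hypothetical trials manufactures a treatment difference that exists in
# neither trial on its own.
trials = [
    {"treated": (18, 60), "control": (36, 120)},   # trial 1: (deaths, total)
    {"treated": (84, 120), "control": (42, 60)},   # trial 2
]

for i, t in enumerate(trials, 1):
    pt = t["treated"][0] / t["treated"][1]
    pc = t["control"][0] / t["control"][1]
    print(f"trial {i}: {pt:.0%} dead on treatment vs {pc:.0%} on control")  # identical within trial

pooled_t = sum(t["treated"][0] for t in trials) / sum(t["treated"][1] for t in trials)
pooled_c = sum(t["control"][0] for t in trials) / sum(t["control"][1] for t in trials)
print(f"pooled: {pooled_t:.0%} dead on treatment vs {pooled_c:.0%} on control")  # 57% vs 43%
```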

B. Whitehead's Approach

A general parametric approach has been proposed by Whitehead and Whitehead (9). Assume that the results of I trials have to be combined. The true treatment effect in the ith trial is given by τ_i and an estimate of τ_i by τ̂_i, with asymptotic variance ν_i. The measure of treatment effect is taken such that no treatment effect means that τ_i = 0 (e.g., log odds ratio, log hazard ratio).

If it is assumed that asymptotically, for each trial,

$$\hat{\tau}_i \sim N(\tau_i,\, \nu_i) \qquad (1)$$

then under the null hypothesis of no treatment effect, τ_i = 0,

$$\hat{\tau}_i \nu_i^{-1} \sim N(0,\, \nu_i^{-1})$$

As this is true for each trial and the trial results are independent of each other, we have

$$\sum_{i=1}^{I} \hat{\tau}_i \nu_i^{-1} \sim N\!\left(0,\, \sum_{i=1}^{I} \nu_i^{-1}\right)$$

Under the null hypothesis we thus have

$$X = \frac{\left(\sum_{i=1}^{I} \hat{\tau}_i \nu_i^{-1}\right)^2}{\sum_{i=1}^{I} \nu_i^{-1}} \sim \chi^2_1 \qquad (2)$$

Large values of X lead to rejection of the null hypothesis.

To estimate the "overall" or "typical" effect, the assumption is made that the treatment effect is the same in each of the trials (τ_1 = ⋅⋅⋅ = τ_I = τ). Then

$$\sum_{i=1}^{I} \hat{\tau}_i \nu_i^{-1} \sim N\!\left(\tau \sum_{i=1}^{I} \nu_i^{-1},\, \sum_{i=1}^{I} \nu_i^{-1}\right)$$

and an unbiased estimate of τ is given by

$$\hat{\tau} = \sum_{i=1}^{I} \hat{\tau}_i \nu_i^{-1} \Big/ \sum_{i=1}^{I} \nu_i^{-1} \qquad (3)$$

Finally, an approximate maximum likelihood estimate for τ_i that is asymptotically normally distributed (10) can be obtained by taking the ratio of the score of the likelihood to the Fisher information, both evaluated at τ_i = 0. The approximate maximum likelihood estimates for the odds ratio and the hazard ratio obtained in this way are given in the next section.
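As a concrete illustration of expressions (1)-(3), the sketch below (our own helper, not from the chapter; the name combine_trials and the example numbers are assumptions) performs the inverse-variance combination of per-trial estimates τ̂_i with variances ν_i and the corresponding one-degree-of-freedom chi-square test.

```python
# Minimal sketch of the fixed-effect combination in expressions (2) and (3).
import math
from scipy.stats import chi2

def combine_trials(estimates, variances):
    """Inverse-variance combination of per-trial effect estimates."""
    weights = [1.0 / v for v in variances]                 # nu_i^{-1}
    sum_w = sum(weights)
    sum_wt = sum(t * w for t, w in zip(estimates, weights))
    tau_hat = sum_wt / sum_w                               # expression (3)
    x = sum_wt ** 2 / sum_w                                # expression (2): ~ chi-square(1) under H0
    se = math.sqrt(1.0 / sum_w)
    return {
        "tau_hat": tau_hat,
        "ci95": (tau_hat - 1.96 * se, tau_hat + 1.96 * se),
        "chi2": x,
        "p_value": chi2.sf(x, df=1),
    }

# Three hypothetical log hazard ratios and their asymptotic variances
print(combine_trials([-0.30, -0.10, -0.45], [0.04, 0.09, 0.06]))
```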

C. Testing the Hypothesis of No Treatment Effect and Estimating the Overall Treatment Effect
Most end points in cancer clinical trials are binary (presence or absence of an
event) or involve the time to an event (duration of survival, time to recurrence).
The next two sections focus on meta-analysis techniques for each of these two
types of end points.

1. Binary Data

For binary data, the most common measure of treatment effect is the odds ratio, given by

$$OR = \frac{\pi_t/(1-\pi_t)}{\pi_c/(1-\pi_c)}$$

where π_t (π_c) is the probability of an event, for example of dying, in the treated (control) group.

An approximate maximum likelihood estimate of the log odds ratio for a specific trial is given by

$$\log(\widehat{OR}) = \frac{D_t - N_t(D/N)}{D(N-D)N_tN_c \,/\, [N^2(N-1)]}$$

where D_t is the number of deaths in the treated group, N_t (N_c) is the number of patients in the treated (control) group, D is the total number of deaths, and N is the total sample size. The asymptotic variance of this estimate is given by

$$\mathrm{Var}(\log(\widehat{OR})) = \frac{1}{D(N-D)N_tN_c \,/\, [N^2(N-1)]}$$

Note that the estimate and the variance of the estimate have the same denominator. To simplify the notation, let V denote this common denominator. The numerator of the estimate of log(OR) can be expressed as (O − E), where O = D_t is the observed number of events in the treated group and E = N_t(D/N) is the expected number of events in the treated group if there is no difference between the treated and the control groups. Therefore,

$$\log(\widehat{OR}) = \frac{(O-E)}{V}$$

and asymptotically [see expression (1)]

$$\log(\widehat{OR}) \sim N(\log(OR),\, V^{-1})$$

To make a distinction between the different trials, we use a subscript i, so that the statistics for the ith trial are given by (O − E)_i, V_i, OR_i, and log(OR_i). Using expression (2), the overall test statistic is

$$X = \frac{\left(\sum_{i=1}^{I} \log(\widehat{OR}_i)\,V_i\right)^2}{\sum_{i=1}^{I} V_i} = \frac{\left(\sum_{i=1}^{I} (O-E)_i\right)^2}{\sum_{i=1}^{I} V_i} \sim \chi^2_1$$

and an estimate of the overall effect log(OR) [see expression (3)] is given by

$$\log(\widehat{OR}) = \sum_{i=1}^{I} \log(\widehat{OR}_i)\,V_i \Big/ \sum_{i=1}^{I} V_i = \sum_{i=1}^{I} (O-E)_i \Big/ \sum_{i=1}^{I} V_i$$

with 95% confidence interval

$$\log(\widehat{OR}) \pm 1.96 \left(\sum_{i=1}^{I} V_i\right)^{-1/2}$$

The estimate and the confidence interval for OR can be found by taking the exponential of log(OR̂) and of the lower and upper limits of its confidence interval:

$$\widehat{OR} = \exp(\log(\widehat{OR}))$$

with 95% confidence interval

$$\left[\exp\!\left(\log(\widehat{OR}) - 1.96\left(\sum_{i=1}^{I} V_i\right)^{-1/2}\right);\; \exp\!\left(\log(\widehat{OR}) + 1.96\left(\sum_{i=1}^{I} V_i\right)^{-1/2}\right)\right]$$

An additional parameter of interest is the percent reduction in the odds of the event in the treated group as compared with the control group, given by

$$100(1 - OR) = \frac{100\left[\pi_c/(1-\pi_c) - \pi_t/(1-\pi_t)\right]}{\pi_c/(1-\pi_c)}$$

which can thus be estimated by 100(1 − OR̂). Its variance is given by

$$\mathrm{Var}(100(1-\widehat{OR})) = \frac{(100(1-\widehat{OR}))^2}{\left(\sum_{i=1}^{I}(O-E)_i\right)^2 \Big/ \sum_{i=1}^{I} V_i}$$

with 95% confidence interval

$$100(1-\widehat{OR}) \pm 1.96\left(\mathrm{Var}(100(1-\widehat{OR}))\right)^{1/2}$$
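The binary-data calculations above are easily coded. In the sketch below (function names are ours), oe_and_v computes (O − E) and V for one 2 × 2 table and combined_odds_ratio pools the trials; applied to the four bladder cancer trials used in the example of the next subsection, it reproduces the odds-ratio columns of Table 3 and the combined OR of 0.67.

```python
# Minimal sketch of the odds-ratio meta-analysis of Section III.C.1.
import math

def oe_and_v(d_t, n_t, d_c, n_c):
    """(O - E) and V for one trial, given events and sample sizes per arm."""
    d, n = d_t + d_c, n_t + n_c
    o_minus_e = d_t - n_t * d / n
    v = d * (n - d) * n_t * n_c / (n ** 2 * (n - 1))
    return o_minus_e, v

def combined_odds_ratio(tables):
    """tables: list of (d_t, n_t, d_c, n_c) tuples, one per trial."""
    stats = [oe_and_v(*t) for t in tables]
    sum_oe = sum(oe for oe, _ in stats)
    sum_v = sum(v for _, v in stats)
    log_or = sum_oe / sum_v
    half_width = 1.96 / math.sqrt(sum_v)
    return {
        "OR": math.exp(log_or),
        "ci95": (math.exp(log_or - half_width), math.exp(log_or + half_width)),
        "chi2": sum_oe ** 2 / sum_v,
    }

# (events treated, patients treated, events control, patients control) from Table 3
trials = [(55, 82, 26, 35), (34, 56, 37, 55), (53, 114, 19, 25), (28, 76, 31, 80)]
for study, t in zip(["30751", "30781", "30791", "30863"], trials):
    print(study, [round(x, 2) for x in oe_and_v(*t)])  # matches (-1.77, 5.27), (-1.82, 6.45), ...
print(combined_odds_ratio(trials))                     # OR = 0.67, CI [0.46; 0.99], chi2 = 4.13
```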

2. Time-to-Event Data
The analysis of time-to-event (duration of survival) data is more complex, and additional notation is needed. Assume that the survival density function is given by f_t(t) (f_c(t)) for the treated (control) group. The survival distribution function, expressing the probability of surviving beyond time t, is given by S_t(t) (S_c(t)) for the treated (control) group. The hazard function, the instantaneous death rate or the conditional probability of dying immediately after time t given survival up to time t, is defined for the treated (control) group by

$$\lambda_t(t) = f_t(t)/S_t(t) \qquad \big(\lambda_c(t) = f_c(t)/S_c(t)\big)$$

In the analysis of survival data, it is often assumed that the hazard functions of the treated and the control group are proportional to each other over time. This does not mean that the hazard functions themselves are constant over time. The proportionality factor is given by the hazard ratio

$$HR = \lambda_t(t)/\lambda_c(t)$$

This constant is the most common measure of treatment effect for time-to-event studies.

Assume N_t (N_c) patients are randomly assigned to the treated (control) group. Patients are followed up until a certain time t: either the event (e.g., death) has already occurred by time t or the patient is censored at this timepoint. From the start to the end of the study, assume there are k distinct times of death. The number of patients at risk just before the jth death time in the treated (control) group, N_tj (N_cj), equals the initial number of patients minus the number of patients who died or were censored in the treated (control) group before the jth death time, and N_j = N_cj + N_tj. At the jth death time, the number of deaths in the treated (control) group equals D_tj (D_cj), and D_j = D_cj + D_tj.

An approximate maximum likelihood estimate of the log hazard ratio for a specific trial is given by

$$\log(\widehat{HR}) = \frac{\sum_{j=1}^{k} \left[D_{tj} - N_{tj}(D_j/N_j)\right]}{\sum_{j=1}^{k} D_j(N_j - D_j)N_{tj}N_{cj} \,/\, [N_j^2(N_j - 1)]}$$

The asymptotic variance of this estimate is given by

$$\mathrm{Var}(\log(\widehat{HR})) = \frac{1}{\sum_{j=1}^{k} D_j(N_j - D_j)N_{tj}N_{cj} \,/\, [N_j^2(N_j - 1)]}$$

Again, for the ith trial we can express the numerator of the estimator of log(HR) as (O − E)_i and the denominator as V_i. Thus,

$$\log(\widehat{HR}_i) = \frac{(O-E)_i}{V_i}$$

with, asymptotically,

$$\log(\widehat{HR}_i) \sim N(\log(HR_i),\, V_i^{-1})$$

In a similar way as for the combination of odds ratios, the test statistic is given by

$$X = \frac{\left(\sum_{i=1}^{I} \log(\widehat{HR}_i)\,V_i\right)^2}{\sum_{i=1}^{I} V_i} = \frac{\left(\sum_{i=1}^{I}(O-E)_i\right)^2}{\sum_{i=1}^{I} V_i} \sim \chi^2_1$$

and an estimate of the overall effect log(HR) is given by

$$\log(\widehat{HR}) = \sum_{i=1}^{I} \log(\widehat{HR}_i)\,V_i \Big/ \sum_{i=1}^{I} V_i = \sum_{i=1}^{I} (O-E)_i \Big/ \sum_{i=1}^{I} V_i$$

with 95% confidence interval

$$\log(\widehat{HR}) \pm 1.96\left(\sum_{i=1}^{I} V_i\right)^{-1/2}$$

The estimate and the confidence interval for HR can be found by taking the exponential of log(HR̂) and of the lower and upper limits of its confidence interval:

$$\widehat{HR} = \exp(\log(\widehat{HR}))$$

with 95% confidence interval

$$\left[\exp\!\left(\log(\widehat{HR}) - 1.96\left(\sum_{i=1}^{I}V_i\right)^{-1/2}\right);\; \exp\!\left(\log(\widehat{HR}) + 1.96\left(\sum_{i=1}^{I}V_i\right)^{-1/2}\right)\right]$$

An additional parameter of interest is the percent reduction in the hazard of the treated group as compared with the control group, given by

$$100(1 - HR) = \frac{100\left[\lambda_c(t) - \lambda_t(t)\right]}{\lambda_c(t)}$$

which can thus be estimated by 100(1 − HR̂). Its variance is given by

$$\mathrm{Var}(100(1-\widehat{HR})) = \frac{(100(1-\widehat{HR}))^2}{\left(\sum_{i=1}^{I}(O-E)_i\right)^2 \Big/ \sum_{i=1}^{I} V_i}$$

with 95% confidence interval

$$100(1-\widehat{HR}) \pm 1.96\left(\mathrm{Var}(100(1-\widehat{HR}))\right)^{1/2}$$
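For time-to-event data, the per-trial statistics are the familiar log-rank quantities. The sketch below is ours (the input format and the function name are assumptions); it accumulates (O − E) and V over the distinct death times of a single trial from a list of (time, death indicator, arm) records, censored observations being kept in the risk sets up to their censoring time.

```python
# Minimal sketch of the per-trial (O - E) and V of Section III.C.2.
# Each patient is a (time, died, arm) tuple: arm "t" = treated, "c" = control,
# died = 1 for a death and 0 for a censored observation.
def logrank_oe_v(patients):
    death_times = sorted({time for time, died, _ in patients if died})
    o_minus_e = v = 0.0
    for tj in death_times:
        at_risk = [arm for time, died, arm in patients if time >= tj]
        n_tj = sum(1 for arm in at_risk if arm == "t")
        n_cj = len(at_risk) - n_tj
        n_j = n_tj + n_cj
        d_tj = sum(1 for time, died, arm in patients if time == tj and died and arm == "t")
        d_j = sum(1 for time, died, _ in patients if time == tj and died)
        o_minus_e += d_tj - n_tj * d_j / n_j            # observed minus expected deaths, treated arm
        if n_j > 1:
            v += d_j * (n_j - d_j) * n_tj * n_cj / (n_j ** 2 * (n_j - 1))
    return o_minus_e, v

# Toy data, not from the chapter
data = [(2, 1, "t"), (5, 0, "t"), (7, 1, "t"), (3, 1, "c"), (4, 1, "c"), (9, 0, "c")]
print(logrank_oe_v(data))
```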

3. Example
For patients with superficial bladder cancer, four EORTC trials were identified that compared the disease-free interval in patients who were randomized after transurethral resection to either adjuvant treatment or no further treatment (control) (11). The results of these four trials for patients between 60 and 70 years of age are shown in Table 3. The summary results in the (O − E) and V columns are presented twice, once based on an analysis of whether or not an event occurred (odds ratio) and once based on an analysis of the actual time to the event (hazard ratio).
Odds Ratio. An estimate of the overall or typical OR is given by

$$\widehat{OR} = \exp\!\left(\frac{-1.77 - 1.82 - 6.05 - 0.74}{5.27 + 6.45 + 5.16 + 9.22}\right) = 0.67$$

with 95% confidence interval equal to [0.46; 0.99]. The chi-square test statistic is equal to

$$X = \frac{(-1.77 - 1.82 - 6.05 - 0.74)^2}{5.27 + 6.45 + 5.16 + 9.22} = 4.13$$

and

$$P(\chi^2_1 \ge 4.13) = 0.042$$
Table 3  Disease-free Interval Statistics From Four Bladder Cancer Trials Comparing
Adjuvant Treatment to Control for Patients Between 60 and 70 Years Old

                 Patients                 Events              Odds ratio       Hazard ratio
Study     Treatment  No treatment  Treatment  No treatment    O−E      V        O−E      V
30751          82         35           55          26        −1.77   5.27      −5.4   15.1
30781          56         55           34          37        −1.82   6.45      −3.9   17.4
30791         114         25           53          19        −6.05   5.16      −9.4    8.2
30863          76         80           28          31        −0.74   9.22      −0.9   14.7
Total         328        195          170         113       −10.4   26.1      −19.6   55.4

The treatment effect is thus significant at the 5% level. The odds reduction
is estimated to be 33% with 95% confidence interval [1%; 64%].
Hazard Ratio. Using the same reasoning, the (O − E) and V statistics
from each individual trial can be used to derive an estimate of the HR, 0.70,
with 95% confidence interval [0.54; 0.91]. The chi-square statistic for testing the
treatment difference equals 6.94 with a p value of 0.008. The estimate of the
hazard reduction is 30% with 95% confidence interval [8%; 52%].
The analysis based on the hazard ratio is the correct one to use in this case
since it takes into account not just whether or not an event occurred but also the
time at which it occurred. Use of this additional information provides a more
convincing result.
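These hazard-ratio results can be reproduced directly from the (O − E) and V columns of Table 3, as in the sketch below (variable names are ours). The chi-square statistic comes out as 6.93 rather than the 6.94 quoted above, presumably because the tabulated statistics are rounded.

```python
# Minimal sketch reproducing the hazard-ratio analysis of the example.
import math
from scipy.stats import chi2

oe = [-5.4, -3.9, -9.4, -0.9]   # hazard-ratio (O - E), one entry per trial
v = [15.1, 17.4, 8.2, 14.7]     # hazard-ratio V

log_hr = sum(oe) / sum(v)
half_width = 1.96 / math.sqrt(sum(v))
x = sum(oe) ** 2 / sum(v)

print(f"HR = {math.exp(log_hr):.2f}")                               # 0.70
print(f"95% CI = [{math.exp(log_hr - half_width):.2f}; "
      f"{math.exp(log_hr + half_width):.2f}]")                      # [0.54; 0.91]
print(f"chi-square = {x:.2f}, p = {chi2.sf(x, df=1):.3f}")          # 6.93, p = 0.008
print(f"hazard reduction = {100 * (1 - math.exp(log_hr)):.0f}%")    # 30%
```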

D. Testing for Heterogeneity


Unlike a multicenter clinical trial, the trials that contribute to a meta-analysis are
based on different protocols. As there can be differences between the trials with
respect to the treatment regimens, the patient population, and the end point assess-
ment, they can also differ in the size of the treatment effect. It is thus important
to investigate whether the results in the different trials are similar. If the results
of the different trials are heterogeneous, it is important to further investigate in
which trials they are different and to try to identify the reasons for the differences
(12).
Again a general framework can be constructed to test for heterogeneity
based on reasoning similar to that in Section III.B. Testing for heterogeneity
corresponds to testing the null hypothesis that the treatment effect is the same
in all I trials
$$H_0: \tau_1 = \cdots = \tau_I = \tau$$

The statistic by which heterogeneity can be tested is given by

$$X_H = \sum_{i=1}^{I} (\hat{\tau}_i - \hat{\tau})^2 \, \nu_i^{-1}$$

If the null hypothesis is true and the treatment effects are similar, their estimates will also be close together and close to the overall estimate τ̂, which is based on the individual estimates τ̂_i. It can be proven that under the null hypothesis

$$X_H \sim \chi^2_{I-1}$$

An alternative expression for X_H, which is easier to use in computations, can be derived:

$$\begin{aligned}
X_H &= \sum_{i=1}^{I} (\hat{\tau}_i - \hat{\tau})^2 \nu_i^{-1} \\
    &= \sum_{i=1}^{I} \hat{\tau}_i^2 \nu_i^{-1} + \hat{\tau}^2 \sum_{i=1}^{I} \nu_i^{-1} - 2\hat{\tau} \sum_{i=1}^{I} \hat{\tau}_i \nu_i^{-1} \\
    &= \sum_{i=1}^{I} \hat{\tau}_i^2 \nu_i^{-1} - \hat{\tau}^2 \sum_{i=1}^{I} \nu_i^{-1}
\end{aligned}$$

where the last step follows because, by the definition of τ̂ in expression (3), τ̂ Σ τ̂_i ν_i^{-1} = τ̂² Σ ν_i^{-1}. In terms of the statistics (O − E)_i and V_i, for both binary and time-to-event data this becomes

$$X_H = \sum_{i=1}^{I} \frac{(O-E)_i^2}{V_i} - \frac{\left(\sum_{i=1}^{I} (O-E)_i\right)^2}{\sum_{i=1}^{I} V_i}$$

Example. The statistic for testing heterogeneity for the time-to-event data described in Table 3 can be obtained as

$$X_H = \left(\frac{(-5.4)^2}{15.1} + \frac{(-3.9)^2}{17.4} + \frac{(-9.4)^2}{8.2} + \frac{(-0.9)^2}{14.7}\right) - \frac{(-5.4 - 3.9 - 9.4 - 0.9)^2}{15.1 + 17.4 + 8.2 + 14.7} = 6.7$$

The p value is given by

$$P(\chi^2_3 \ge 6.7) = 0.08$$

which is not significant at the 5% level.
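The heterogeneity statistic is computed from the same summary statistics; a minimal sketch (the function name is ours), applied here to the hazard-ratio columns of Table 3:

```python
# Minimal sketch of the heterogeneity test of Section III.D.
from scipy.stats import chi2

def heterogeneity_test(oe, v):
    x_h = sum(o ** 2 / vi for o, vi in zip(oe, v)) - sum(oe) ** 2 / sum(v)
    df = len(oe) - 1
    return x_h, chi2.sf(x_h, df=df)

oe = [-5.4, -3.9, -9.4, -0.9]   # per-trial (O - E) from Table 3 (hazard ratio)
v = [15.1, 17.4, 8.2, 14.7]     # per-trial V
x_h, p = heterogeneity_test(oe, v)
print(f"X_H = {x_h:.1f} on {len(oe) - 1} df, p = {p:.2f}")   # X_H = 6.7, p = 0.08
```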

E. Testing for Interaction and Linear Trend


It is sometimes thought that certain subgroups of patients may benefit more from
the treatment than other subgroups. If the division into subgroups is not based
on ordered categories, only an interaction can be tested. If the division is based
on an ordered variable (e.g., age groups), a test for linear trend can also be carried
out. Within each trial the patients are first divided into G subgroups of interest
(g = 1, . . . , G). For each of these subgroups within trial i, the statistics (O − E)_ig and V_ig are calculated as before. For each subgroup, the sum of these statistics over all the trials is calculated:

$$(O-E)_g = \sum_{i=1}^{I} (O-E)_{ig} \qquad \text{and} \qquad V_g = \sum_{i=1}^{I} V_{ig}$$

An estimate of the log odds ratio for the gth subgroup can be obtained by

$$\log(\widehat{OR}_g) = (O-E)_g / V_g$$

and a test for interaction is then given by

$$X_I = \sum_{g=1}^{G} \left(\log(\widehat{OR}_g)\right)^2 V_g - \left(\log(\widehat{OR})\right)^2 \sum_{g=1}^{G} V_g$$

Thus, if the estimates of the odds ratios in the different subgroups are similar to each other, the test statistic will be small. Under the null hypothesis

$$X_I \sim \chi^2_{G-1}$$

In the case of time-to-event data, exactly the same formulae apply with OR replaced by HR.

A computationally easier form in terms of (O − E)_g and V_g is given by

$$X_I = \sum_{g=1}^{G} \frac{(O-E)_g^2}{V_g} - \frac{\left(\sum_{g=1}^{G} (O-E)_g\right)^2}{\sum_{g=1}^{G} V_g}$$

Testing for a trend with an ordered variable is more complex. The same statistics are calculated for each subgroup. A number (score) is then assigned to each of the subgroups according to their order, the subgroup with the lowest value being assigned the score 0 and the subgroup with the highest value the score G − 1. Subsequently, the following statistics are calculated:
$$A = \sum_{g=1}^{G} V_g; \qquad B = \sum_{g=1}^{G} (g-1)V_g; \qquad C = \sum_{g=1}^{G} (g-1)^2 V_g;$$

$$D = \sum_{g=1}^{G} (O-E)_g; \qquad E = \sum_{g=1}^{G} (g-1)(O-E)_g$$

The statistic used to test for linear trend is given by (13)

$$X_T = \frac{\left(E - DB/A\right)^2}{C - B^2/A}$$

Under the null hypothesis of no trend between the subgroups,

$$X_T \sim \chi^2_1$$
Example. Patients in the four previous trials were divided into three age groups. The statistics (O − E) and V for each of the subgroups from each of the trials are given in the fourth and fifth columns of the forest plot (discussed in the next section). The sum of these statistics over all the trials in a particular subgroup is presented in the subtotals line.

The test statistic for interaction using the time-to-event data can then be derived from these statistics:

$$X_I = \left(\frac{(-11)^2}{49.7} + \frac{(-19.7)^2}{55.4} + \frac{(-8.8)^2}{46.2}\right) - \frac{(-11 - 19.7 - 8.8)^2}{49.7 + 55.4 + 46.2} = 0.81$$

The p value is given by

$$P(\chi^2_2 \ge 0.81) = 0.67$$

which is clearly not significant.

For the test for linear trend, the following intermediate results were found:

$$A = 151.3; \quad B = 147.8; \quad C = 240.2; \quad D = -39.5; \quad E = -37.3$$

Therefore, the test statistic equals 0.017 and the p value is given by

$$P(\chi^2_1 \ge 0.017) = 0.90$$
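Both tests can be computed directly from the subgroup totals, as in the sketch below (ours; the age groups are taken in increasing order, with the middle one being the 60- to 70-year group of Table 3, and scores 0, 1, 2 assigned as described above). Because the published subgroup sums are rounded, the interaction statistic comes out as 0.80 rather than 0.81.

```python
# Minimal sketch of the interaction and linear-trend tests of Section III.E.
from scipy.stats import chi2

oe_g = [-11.0, -19.7, -8.8]   # subgroup (O - E), ordered age groups
v_g = [49.7, 55.4, 46.2]      # subgroup V

# Interaction: do the subgroup-specific hazard ratios differ?
x_i = sum(o ** 2 / v for o, v in zip(oe_g, v_g)) - sum(oe_g) ** 2 / sum(v_g)
print(f"X_I = {x_i:.2f}, p = {chi2.sf(x_i, df=len(oe_g) - 1):.2f}")   # 0.80, p = 0.67

# Linear trend over the ordered subgroups, scores 0, 1, ..., G-1
scores = range(len(oe_g))
a = sum(v_g)
b = sum(s * v for s, v in zip(scores, v_g))
c = sum(s ** 2 * v for s, v in zip(scores, v_g))
d = sum(oe_g)
e = sum(s * o for s, o in zip(scores, oe_g))
x_t = (e - d * b / a) ** 2 / (c - b ** 2 / a)
print(f"X_T = {x_t:.3f}, p = {chi2.sf(x_t, df=1):.2f}")               # 0.017, p = 0.90
```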

IV. GRAPHICAL PRESENTATION OF THE RESULTS: FOREST PLOTS

A summary of the data can be presented in a forest plot (14): One line with the
corresponding statistics is generated for each trial. In the case of subgroups, each
subgroup is presented separately and each trial that contributes to a subgroup is
presented under the heading of that subgroup.
An example of such a forest plot is given in Figure 1. The results of the
bladder meta-analysis with patients divided into different age subgroups are pre-
sented here. Within each of the three age groups, the four contributing trials are
listed. On each line, the number of events as a fraction of the total number of
patients randomized are shown for both the treatment and control groups, along
with estimates of (O ⫺ E) and V. The estimate of the hazard ratio corresponds
to the middle of the square and the horizontal line gives the 99% confidence
interval. An arrow is used to show that the confidence interval extends beyond
the area used for showing the confidence interval.

Figure 1 Forest plot of the disease-free interval for the combined analysis of four super-
ficial bladder cancer trials comparing adjuvant treatment to control. Patients are divided
into subgroups according to age.

If the confidence interval does not contain the value 1, it means that there
is a significant difference between treatment and control at the 1% level. The
size of the square is proportional to the number of events in the subgroup within
that trial and thus represents the amount of information available. The solid verti-
cal line represents no treatment effect and corresponds to a hazard ratio of 1. If
the estimate of the HR is to the left of this line, the difference favors the treatment
arm. Note that a log scale is used for the hazard ratio so that a hazard ratio of
2 or 0.5 is the same distance from 1. A hazard ratio of 2 means that the hazard
of treatment is double that of No treatment, whereas a hazard ratio of 0.5 means
that the hazard of treatment is half the hazard of No treatment. Thus, they have
the same meaning but the direction of the treatment difference is reversed.
For each subgroup, the sum of the statistics is given along with the sum-
mary hazard ratio, represented by the middle of the diamond. The extremes of
the diamond are the lower and upper limits of the 95% confidence interval. A
test of heterogeneity between the trials within a subgroup is given below the
summary statistics of that subgroup. All the subgroups have a test for heterogene-
ity that is significant at the 10% level. The percent reduction of the hazard along
with its standard deviation is also given for each subgroup. The largest reduction,
30%, is found for patients between 60 and 70 years.
The lower part of the forest plot presents the overall tests. The bottom
diamond shows the overall hazard ratio with its 95% confidence interval. The dashed vertical line extends upward from this overall estimate so that the "typical" hazard ratio can be compared with the estimated hazard ratios for the subgroups and for the trials within each subgroup. Below the overall hazard ratio, the test for the treatment effect is given; it is significant (p = 0.0013). The lower left-hand corner presents tests for interaction and trend, neither of which is significant.
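A basic forest plot of this kind can be drawn with standard plotting software. The fragment below is only a sketch (ours, using matplotlib), showing the per-trial hazard ratios of Table 3 with 99% confidence intervals on a log scale, squares scaled by the information V, and vertical lines at 1 (no effect) and at the overall hazard ratio; the tabulated columns, subgroup diamonds, and heterogeneity tests of Figure 1 are omitted.

```python
# Minimal forest-plot sketch for the four bladder cancer trials (hazard-ratio scale).
import math
import matplotlib.pyplot as plt

labels = ["30751", "30781", "30791", "30863"]
oe = [-5.4, -3.9, -9.4, -0.9]
v = [15.1, 17.4, 8.2, 14.7]

fig, ax = plt.subplots()
for y, (o, vi) in enumerate(zip(oe, v)):
    hr = math.exp(o / vi)
    lo = math.exp(o / vi - 2.576 / math.sqrt(vi))          # 99% confidence interval
    hi = math.exp(o / vi + 2.576 / math.sqrt(vi))
    ax.plot([lo, hi], [y, y], color="black")
    ax.plot(hr, y, marker="s", markersize=4 + vi / 3, color="black")  # square size ~ information

ax.axvline(1.0, color="black")                             # solid line: no treatment effect
ax.axvline(math.exp(sum(oe) / sum(v)), linestyle="--")     # dashed line: overall hazard ratio
ax.set_xscale("log")
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel("Hazard ratio (treatment / control), log scale")
ax.invert_yaxis()
plt.show()
```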

V. ADVANCED TECHNIQUES IN META-ANALYSIS

More advanced techniques have been developed with regard to the methods of analysis, especially with respect to investigating and explaining heterogeneity.
Alternatives to the fixed effects model presented in this chapter have been pro-
posed, the most important one being the random effects model (15). As compared
with the assumption of a fixed treatment effect in a fixed effects model, in a
random effects model the treatment effects from the different studies are assumed
to come from a normal distribution with a variance describing the heterogeneity
between the studies. The advantage of the random effects model is that it takes the trial-to-trial variability into consideration when deriving the variance of the treatment effect estimate. The merit of the fixed effects model is that it is relatively simple and easy to apply, and in most practical situations it leads to treatment effect estimates similar to those from the random effects model.
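For comparison with the fixed effects calculations used throughout this chapter, the sketch below implements the DerSimonian and Laird random effects estimate (15) (our own code and naming), applied to the log hazard ratios and variances implied by the (O − E) and V columns of Table 3.

```python
# Minimal sketch of the DerSimonian-Laird random effects combination.
import math

def dersimonian_laird(estimates, variances):
    w = [1.0 / v for v in variances]                                   # fixed-effect weights
    tau_fixed = sum(wi * t for wi, t in zip(w, estimates)) / sum(w)
    q = sum(wi * (t - tau_fixed) ** 2 for wi, t in zip(w, estimates))  # heterogeneity statistic
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c)                    # between-trial variance
    w_star = [1.0 / (v + tau2) for v in variances]                     # random-effect weights
    est = sum(wi * t for wi, t in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return est, (est - 1.96 * se, est + 1.96 * se), tau2

# Per-trial log hazard ratios (O-E)/V and variances 1/V from Table 3
log_hr = [-5.4 / 15.1, -3.9 / 17.4, -9.4 / 8.2, -0.9 / 14.7]
var = [1 / 15.1, 1 / 17.4, 1 / 8.2, 1 / 14.7]
print(dersimonian_laird(log_hr, var))   # wider interval than the fixed effects analysis
```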

Bayesian methods have also been applied to meta-analyses (16). Two im-
portant graphical tools for checking assumptions are the Galbraith plot (17) and
the funnel plot (18). They both present the estimated odds ratios from different
studies. The Galbraith plot is used to study heterogeneity and detect outlying
studies while the funnel plot is used to assess publication bias.
Two additional meta-analytic techniques have been developed. Cumulative
meta-analyses analyze and plot studies cumulatively in time and provide a tool
to assess whether additional experimental evidence is needed to draw conclusions
(19).
Another proposed technique is the prospective meta-analysis, in which studies are designed from the outset to be incorporated into a later meta-analysis. This approach should, however, be used with great care, as it may encourage the conduct of many small trials with little power to detect a difference.

VI. DISCUSSION

While meta-analyses play a very important role in the decision-making process, they can be criticized, as can most scientific methods. Criticisms generally concern the selection of studies, the choice of end point, the interpretation of heterogeneity, and the generalization and application of the results. To a large extent these criticisms can be overcome by posing a well-formulated question, excluding improperly randomized trials, collecting the individual patient data, and using a hard end point such as survival. However, it must be recognized that meta-analyses are not a panacea. They most certainly should not replace large-scale randomized trials and should not be used as an excuse for conducting small, underpowered trials.
In the absence of large-scale trials, meta-analyses, when properly carried
out, provide the best overall evidence of treatment effect. They have now replaced
the traditional literature reviews and have been instrumental in stimulating inter-
national cooperation and research. They have gained considerable credibility,
largely due to the pioneering efforts of the Early Breast Cancer Trialists' Collaborative Group (13). Their work has shown that the adjuvant therapy of early breast
cancer improves survival and that for every 1 million such women treated, an
extra 100,000 10-year survivors could result (20).

REFERENCES

1. Gelber RD, Goldhirsch A. Meta-analysis: the fashion of summing-up evidence. Part


I. Rationale and conduct. Ann Oncol 1991; 2:461–468.

2. Olkin I. Meta-analyses: reconciling the results of independent studies. Stat Med 1995; 14:457–472.
3. Clarke M, Stewart L, Pignon JP, Bijnens L. Individual patient data meta-analyses
in cancer. Br J Cancer 1998; 77:2036–2044.
4. Stewart LA, Parmar MKB. Meta-analyses of the literature or of individual patient
data: is there a difference? Lancet 1993; 341:418–422.
5. Stewart LA, Clarke MJ. Practical methodology of meta-analyses (overviews) using
updated individual patient data. Stat Med 1995; 14:2057–2079.
6. Jeng GT, Scott JR, Burmeister LF. A comparison of meta-analytic results using
literature vs individual patient data. JAMA 1995; 274:830–836.
7. Fisher RA. Statistical methods for research workers. 4th ed. London: Oliver and
Boyd, 1932.
8. Simpson EH. The interpretation of interaction in contingency tables. J R Stat Soc Ser B 1951; 13:238–241.
9. Whitehead A, Whitehead J. A general parametric approach to the meta-analysis of
randomized clinical trials. Stat Med 1991; 10:1665–1677.
10. Yusuf S, Peto R, Lewis J, Collins R, Sleight P. Beta-blockade during and after myo-
cardial infarction: an overview of the randomized trials. Progr Cardiovasc Dis 1985;
27:335–371.
11. Pawinski A, Sylvester R, Kurth KH, Bouffioux C, van der Meijden A, Parmar MKB,
Bijnens L. A combined analysis of European Organization for Research and Treat-
ment of Cancer and Medical Research Council randomized clinical trials for the
prophylactic treatment of stage TaT1 bladder cancer. J Urol 1996; 156:1934–1941.
12. Thompson SG. Why sources of heterogeneity in meta-analysis should be investi-
gated. BMJ 1994; 309:1351–1355.
13. Early Breast Cancer Trialists’ Collaborative Group. Treatment of Early Breast Can-
cer. Vol. 1. Worldwide Evidence 1985–1990. Oxford: Oxford University Press,
1990.
14. DeMets DL. Methods for combining randomized clinical trials: strengths and limita-
tions. Stat Med 1987; 6:341–348.
15. DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clin Trials
1986; 7:177–188.
16. DuMouchel W. Bayesian meta-analysis. In: Berry DA (ed.). Statistical Methodology
in the Pharmaceutical Sciences. New York: Dekker, 1990.
17. Galbraith RF. A note on graphical presentation of estimated odds ratios from several
clinical trials. Stat Med 1988; 7:889–894.
18. Egger M, Davey Smith G. Misleading meta-analysis. BMJ 1995; 310:752–754.
19. Lau J, Antman EM, Jimenez-Silva J, Kupelnick B, Mosteller F, Chalmers TC. Cu-
mulative meta-analysis of therapeutic trials for myocardial infarction. N Engl J Med
1992; 327:248–254.
20. Early Breast Cancer Trialists’ Collaborative Group. Systemic treatment of early
breast cancer by hormonal, cytotoxic, or immune therapy. Lancet 1992; 339:1–15,
71–85.