The Veterans Administration Lung Cancer data set includes the following variables:
Question 1
(a) To assess the linear relationship between covariates and logarithm of hazard, we need to
plot martingale residuals versus a covariate of interest.
The LOWESS smooths for age and for diagnosis time show no obvious systematic trend in either
the martingale residuals or the deviance residuals. The diagnosis time plot is stretched out by
a few outliers; otherwise, its deviance residuals are symmetric. Karnofsky performance score,
on the other hand, shows a systematic decreasing trend. This suggests that the linearity
assumption may not hold for this variable, and that the relationship between mortality and
Karnofsky score may be non-linear. A Karnofsky score of 100 means that the patient can carry
out normal activity without special care, and a score of 0 means that the patient is dead.
Thus, Karnofsky score and mortality are negatively related. This negative relationship could be
exponential, quadratic, or linear; the plots alone do not tell us the form of the decrease. In
our study, if we assume that the relationship between Karnofsky score and mortality is linear,
then there is no need to modify the effects of age, diagnosis time, or Karnofsky score. If we
assume that the relationship is non-linear, we need to construct a model that takes the
non-linear relationship between mortality and Karnofsky score into account.
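The residual plots described above could be produced along the following lines. This is a minimal sketch, not the exact code behind our figures: the model formula is an assumption (its covariate order mirrors the cox.zph output in part (c)), and the names fit and r_mart are chosen here for illustration.

```r
library(survival)

# Assumed full model; the covariate order follows the cox.zph output in part (c).
fit <- coxph(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior,
             data = veteran)

# Martingale residuals, one per subject.
r_mart <- residuals(fit, type = "martingale")

# Plot the residuals against each continuous covariate and overlay a LOWESS smooth.
par(mfrow = c(1, 3))
for (v in c("age", "diagtime", "karno")) {
  plot(veteran[[v]], r_mart, xlab = v, ylab = "Martingale residual")
  lines(lowess(veteran[[v]], r_mart), col = "red")
  abline(h = 0, lty = 2)
}
```

A roughly flat smooth is consistent with a linear effect on the log hazard; a clear trend, like the one seen for karno, suggests trying a transformation of that covariate.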
(b) The score residuals measure the influence of the observation i on the parameter estimates
when i is removed from the model fit. These are calculated separately for each regression
coefficient.
To identify outliers, the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR were applied: residuals below the first or above the second were flagged.
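The fence rule can be sketched as follows. The helper flag_outliers is hypothetical (it is not part of the survival package), and the model formula is an assumption inferred from the output in part (c). The sketch applies the rule to each column of score residuals as a whole; for the grouped boxplots (treatment, cell type, prior) the same rule would be applied within each group.

```r
library(survival)

# Assumed full model; one column of score residuals per regression coefficient.
fit <- coxph(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior,
             data = veteran)
r_score <- residuals(fit, type = "score")

# Hypothetical helper: indices of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
flag_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  unname(which(x < q[1] - fence | x > q[2] + fence))
}

# Row numbers of flagged individuals, listed separately for each coefficient.
outliers <- lapply(as.data.frame(r_score), flag_outliers)
outliers
```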
Treatment: There are three outliers on the upper end in the standard group (in red), and there
are three outliers at the bottom end in the test group (in blue). Individuals 44, 17, and 21 are
red. Individuals 78, 75, and 73 are blue. The most noticeable outliers are individual 44 all the
way on the top in the standard group and individual 78 in the test group.
Celltype: Since celltype is represented by three indicator covariates, we have three sets of
score residuals. Each set of boxplots has different outliers, as seen above. The outliers common
to all three cell type plots are individuals 44 and 17.
Karnofsky performance score: There are a few outliers on the upper end (in red), and only one
outlier at the bottom (in blue). The outliers include individuals 100, 21, 118, 44, and 70. There
are several noticeable outliers. For instance, individual 118 has a Karnofsky score of 10, which is
the only case where the score equals 10 in the study. Individual 44 with a Karnofsky score of 40
has the highest score residual in the study.
Diagtime: There are several outliers on the upper end of the score residuals (in red), and a few
on the lower end (in blue). The most noticeable outlier is individual 12, at 58 months from
diagnosis to randomization, with a score residual of 45.77282. The outliers include
individuals 77, 118, 45, 81, 78, 58, 21, 8, 95, 44, 12, 106, 36, and 91.
Age: There are some outliers on the upper end of the score residuals (in red), and a few on the
lower end (in blue). Two blue outliers (individuals 44 and 118) stand out at the bottom right.
The outliers include individuals 44, 17, 91, 58, 75, 9, 11, 73, 118, 85, 18, 89,
36, 53, 95, and 111.
Prior: There are outliers in both prior-therapy groups, two in one group and a few in the
other. The outliers include individuals 44, 17, 75, 73, 36, 70, and 91. The most noticeable
outliers are individual 44, at the very top of the group with no prior therapy, and individual
75, in the group with prior therapy.
In conclusion: Individuals 17 and 44 are outliers for almost all of the covariates analyzed
above, and both have the small cell type. Thus, they are considered influential subjects that
affect the results of the analysis.
(c) Investigate the appropriateness of the proportional hazards assumptions through
appropriate residual plots, and statistical tests.
The Residual-time Correlations:
> cox.zph(fit)
                      rho  chisq        p
trt               -0.0284  0.131 0.717536
celltypesmallcell  0.0128  0.026 0.871960
celltypeadeno      0.1428  2.980 0.084304
celltypelarge      0.1708  4.082 0.043339
karno              0.3057 12.815 0.000344
diagtime           0.1498  2.923 0.087323
age                0.1890  5.313 0.021166
prior             -0.1756  4.386 0.036242
GLOBAL                 NA 27.641 0.000548
The cox.zph function tests proportionality for each predictor in the model by testing for an
interaction with time; rho is the correlation between the scaled Schoenfeld residuals and
(a transformation of) time. A p-value less than 0.05 is taken as evidence against the
proportionality assumption. Here, celltypelarge, karno, age, and prior all have p-values less
than 0.05, so each appears to violate the proportional hazards assumption. The GLOBAL test in
the last row tests all the interactions at once. Since its p-value is 0.000548, the model as a
whole violates the proportionality assumption.
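The flagged covariates can also be read off programmatically. The sketch below assumes the same full model as above (the formula is inferred from the output, not copied from our code). Note that newer versions of the survival package report one row per model term rather than one row per coefficient and omit the rho column, so the rows may not match the table above exactly.

```r
library(survival)

# Assumed full model, matching the covariate order in the output above.
fit <- coxph(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior,
             data = veteran)

# cox.zph returns a table with a p-value per row plus a GLOBAL row.
zph <- cox.zph(fit)
p_values <- zph$table[, "p"]

# Rows whose test rejects proportionality at the 0.05 level.
names(p_values)[p_values < 0.05]
```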
Schoenfeld Residuals Analysis:
Treatment: The residuals are mostly centered around the 0 line and there is no systematic
trend over time; thus, we conclude that the proportionality assumption holds.
Small Cell: The residuals are centering around 0 with no systematic trend over time; thus,
based on the plot, we conclude that the proportionality assumption holds.
Adeno: There does not seem to be a systematic trend for the residual for this covariate. Thus,
we can conclude that the proportionality assumption holds.
Large: There seems to be an upward trend over time for this covariate. Thus, the proportional
hazards assumption may not hold for this covariate.
Karno: The residuals are mostly on the negative side, so their mean is more likely to be
negative than zero. Thus, the proportional hazards assumption might not hold for this
covariate.
Diagtime: The residuals are centered around 0 with some random fluctuations. There are no
systematic trends over time for this covariate. Thus, the proportionality assumption holds.
Age: There is some increasing trend in the residual for age over time even though the line is
mostly centered around 0. Thus, the proportionality assumption might not hold.
Prior: There is a systematic decreasing trend over time for the prior therapy effect. Thus, the
proportional hazards assumption might not hold for this effect.
Covariate-time interactions:
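# Covariate-time interactions using the tt() function
# For each coefficient in fit, refit the model with an added covariate * time term
ctt = c()
for (i in seq_along(coef(fit))) {
  cox = coxph(Surv(time, status) ~ trt + celltype + karno + age + prior + diagtime +
                tt(as.numeric(model.matrix(fit)[, i])),
              tt = function(x, t, ...) I(t * x), data = veteran)
  # Row 9 is the tt() term (the 8 main-effect coefficients come first); column 5 is its p-value
  ctt = rbind(ctt, summary(cox)$coefficients[9, 5])
}
results = cbind(names(coef(fit)), ctt)
colnames(results) = c("covariate", "p_value")
results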
> results
covariate p_value
[1,] "trt" "0.97796842372455"
[2,] "celltypesmallcell" "0.0292373532064322"
[3,] "celltypeadeno" "0.00185399162105382"
[4,] "celltypelarge" "0.0163737188345113"
[5,] "karno" "0.0171885773543171"
[6,] "diagtime" "0.8106179045302"
[7,] "age" "0.380977018238588"
[8,] "prior" "0.222710266155446"
A covariate-time interaction with a significant p-value indicates that the covariate effect is
not constant over time. As a result, that effect violates the proportional hazards assumption.
Here, all three celltype indicators and karno are statistically significant, indicating a
violation of the proportional hazards assumption. Unlike the residual-time correlations, this
method does not flag age or prior. In addition, this method finds celltypesmallcell and
celltypeadeno significant, whereas the residual-time correlation method does not. Either method
can be used to test proportionality, and the two do not necessarily give the same results.
Question 2
In the Cox model, we see that the two survival curves are almost identical. Notice that the two
curves do not cross. This is because the model assumes proportional hazards. Thus, the fitted
curves show no proportionality violation, by construction.
Residual-time Correlation:
> cox.zph(cox.fit)
rho chisq p
trt -0.16 3.3 0.0691
From the cox.zph function, the p-value for the residual-time correlation is 0.0691. Whether this
indicates a proportionality violation for the treatment effect depends on the significance
level. At the 0.05 level, the test narrowly fails to reject proportionality. At the 0.1 level,
it indicates a proportionality violation.
Covariate-time Interaction:
> cox.mod = coxph(Surv(time, status) ~ trt + tt(trt), tt=function(x, t, ...) x * t, data=veteran)
> summary(cox.mod)
Call:
coxph(formula = Surv(time, status) ~ trt + tt(trt), data = veteran,
tt = function(x, t, ...) x * t)
The p-value of tt(trt) from the covariate-time interaction test is 0.0316, which indicates that
the treatment effect violates the proportionality assumption. Note that the negative tt(trt)
coefficient means that, as time increases, the risk in the treatment group decreases relative
to the standard group. This agrees with the Kaplan-Meier survival curves in Figure 5: as time
goes on, the treatment effect increases and the risk decreases.
Since the Kaplan-Meier survival curves in Figure 5 cross, this suggests a violation of the
proportional hazards assumption for the treatment effect. We can study the patterns of the
survival curves. The survival curve for the test group lies below that of the standard group
until the two curves cross, at roughly 160 days. This shows that the treatment effect is not
constant over time and thus violates the proportionality assumption.
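Kaplan-Meier curves by treatment group can be drawn as in the sketch below. This is an illustration under our own naming (km is a name chosen here) and is not necessarily the exact code used for Figure 5; the colors follow the convention used elsewhere in this report (red for standard, blue for test).

```r
library(survival)

# Kaplan-Meier estimate of survival, stratified by treatment (1 = standard, 2 = test).
km <- survfit(Surv(time, status) ~ trt, data = veteran)

# One curve per treatment group; crossing curves hint at non-proportional hazards.
plot(km, col = c("red", "blue"), xlab = "Time (days)", ylab = "Survival Probability",
     main = "Kaplan-Meier Survival Curves by Treatment")
legend("topright", c("standard", "test"), col = c("red", "blue"), lty = 1)
```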
Question 3
Add baseline age to the above model (Model 1)
coxph(formula = Surv(time, status) ~ trt + age, data = veteran)
Suppose we have the following Cox model and two individuals, i and j, who are both assigned to
the control group. Treating age as a time-dependent covariate, the Cox model for individuals i
and j, respectively, becomes:
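$$\lambda_i(t) = \lambda_0(t)\exp\left[\beta_1(0) + \beta_2\left(\text{age}_i + \frac{t}{365.25}\right)\right] = \lambda_0(t)\exp\left[\beta_2\left(\text{age}_i + \frac{t}{365.25}\right)\right]$$

$$\lambda_j(t) = \lambda_0(t)\exp\left[\beta_2\left(\text{age}_j + \frac{t}{365.25}\right)\right]$$

Then the hazard ratio becomes the following:

$$\frac{\lambda_i(t)}{\lambda_j(t)} = \frac{\exp\left[\beta_2\left(\text{age}_i + \frac{t}{365.25}\right)\right]}{\exp\left[\beta_2\left(\text{age}_j + \frac{t}{365.25}\right)\right]} = \exp\left[\beta_2\left(\text{age}_i - \text{age}_j\right)\right]$$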
The hazard ratio is the same whether age is treated as a baseline covariate or as a
time-dependent covariate. One explanation is that the age difference between two individuals
does not change over time, so it is the same for baseline age and for time-dependent age.
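This can be checked numerically by fitting both versions of the model and comparing the coefficients. The sketch below follows the two fits listed in the appendix, except that it divides by 365.25 (the appendix code uses 365); the choice of divisor does not affect the comparison, because the added time term is the same for everyone at risk at a given time.

```r
library(survival)

# Model with baseline age.
coxfit1 <- coxph(Surv(time, status) ~ trt + age, data = veteran)

# Model with current age: baseline age plus elapsed follow-up time in years.
coxfit2 <- coxph(Surv(time, status) ~ trt + tt(age),
                 tt = function(x, t, ...) x + t / 365.25, data = veteran)

# Side-by-side coefficients; the two columns should agree up to numerical error.
cbind(baseline = coef(coxfit1), time_dependent = coef(coxfit2))
```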
Question 4
(a) Separate treatment effects before and after 100 days of follow-up.
# Split each subject's follow-up at 100 days into (start, stop] intervals
vet = NULL
for (i in 1:nrow(veteran)) {
  if (veteran$time[i] <= 100) {
    # Follow-up ends by day 100: one row, early-period indicator only
    vet = rbind(vet, data.frame(veteran[i,], id=i, start=0, stop=veteran$time[i],
                                event=veteran$status[i], tr1=1, tr2=0))
  } else {
    # Follow-up passes day 100: censored early row (0, 100], then late row (100, time]
    vet = rbind(vet, data.frame(veteran[i,], id=i, start=0, stop=100, event=0, tr1=1, tr2=0))
    vet = rbind(vet, data.frame(veteran[i,], id=i, start=100, stop=veteran$time[i],
                                event=veteran$status[i], tr1=0, tr2=1))
  }
}
head(vet)
tail(vet)
Below are the results for the first and last two individuals:
> head(vet);tail(vet)
     trt celltype time status karno diagtime age prior  id start stop event tr1 tr2
1      1 squamous   72      1    60        7  69     0   1     0   72     1   1   0
2      1 squamous  411      1    70        5  64    10   2     0  100     0   1   0
21     1 squamous  411      1    70        5  64    10   2   100  411     1   0   1
     trt celltype time status karno diagtime age prior  id start stop event tr1 tr2
136    2    large  378      1    80        4  65     0 136     0  100     0   1   0
1361   2    large  378      1    80        4  65     0 136   100  378     1   0   1
137    2    large   49      1    30        3  37     0 137     0   49     1   1   0
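The same split could also be obtained with survSplit from the survival package instead of an explicit loop. The sketch below is an alternative to our code, not the code we ran; the names vet2, period, and cox.vet2 are chosen here for illustration, and survSplit names the interval start tstart while keeping time as the interval end.

```r
library(survival)

# Split follow-up at 100 days; period is 1 for (0, 100] and 2 for (100, time].
vet2 <- survSplit(Surv(time, status) ~ ., data = veteran, cut = 100,
                  episode = "period", id = "id")

# Period indicators corresponding to tr1 and tr2 in the loop above.
vet2$tr1 <- as.numeric(vet2$period == 1)
vet2$tr2 <- as.numeric(vet2$period == 2)

# Separate treatment effects before and after 100 days.
cox.vet2 <- coxph(Surv(tstart, time, status) ~ trt:tr1 + trt:tr2, data = vet2)
cox.vet2
```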
Now we construct the following Cox model:
$$\lambda_i(t) = \lambda_0(t)\exp\left\{\beta_1\,\text{trt}_i\,I\{t \le 100\} + \beta_2\,\text{trt}_i\,I\{t > 100\}\right\}$$
> cox.vet
Call:
coxph(formula = Surv(start, stop, event) ~ trt:tr1 + trt:tr2,
data = vet)
The two Cox models have the same coefficients and p-values. If we look at the Kaplan-Meier
curves, or any other summary of the same data, and then create new variables and new hypotheses
based on those observations, we inflate the type I error. This is because, as the number of
tests increases, there is a higher chance that one of them yields a significant result by chance
even though there is no difference in hazard rate between treatment and control. To do this
validly, we need to account for multiple comparisons, or apply the same hypothesis to an
independent dataset that has not yet been examined. The results from the new dataset will then
be valid.
Question 5
Hazard Model
$$\lambda_i(t) = \lambda_0(t)\exp\left[\sum_{k=1}^{p}\beta_k x_{ki}(t)\right] = \lambda_0(t)\exp\left[\beta_1\,\text{trt}_i\,I\{t \le 100\} + \beta_2\,\text{trt}_i\,I\{t > 100\}\right]$$
where $I\{t \le 100\}$ and $I\{t > 100\}$ are the indicator functions of follow-up time up to and
including 100 days and follow-up time after 100 days, respectively; $\text{trt}_i$ is an indicator
for whether the ith individual is in the treatment group or not; if $\text{trt}_i = 1$ then the ith
individual is in the treatment group and if $\text{trt}_i = 0$ then the ith individual is in the
standard group.
$\beta_1$ is the effect of the treatment on the hazard rate up to 100 days of follow-up.
$\beta_2$ is the effect of the treatment on the hazard rate after 100 days of follow-up.
Cox Partial Likelihood Function
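$$PL(\beta) = \prod_{i=1}^{137}\left[\frac{\exp\left[\beta_1\,\text{trt}_i\,I\{t_i \le 100\} + \beta_2\,\text{trt}_i\,I\{t_i > 100\}\right]}{\sum_{j=1}^{137} Y_j(t_i)\exp\left[\beta_1\,\text{trt}_j\,I\{t_i \le 100\} + \beta_2\,\text{trt}_j\,I\{t_i > 100\}\right]}\right]^{\delta_i}$$
where $t_i$ is the observed time for individual i, $\delta_i$ is the event indicator, and
$Y_j(t_i)$ indicates whether individual j is still at risk at time $t_i$.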
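For estimation it is usually more convenient to work with the logarithm of the partial likelihood. As a sketch, assuming no tied event times and writing $\eta_j(t) = \beta_1\,\text{trt}_j\,I\{t \le 100\} + \beta_2\,\text{trt}_j\,I\{t > 100\}$ for the linear predictor of individual j at time t (the symbol $\eta$ is introduced here only for brevity), with $\delta_i$ the event indicator and $Y_j(t_i)$ the at-risk indicator:

```latex
\ell(\beta) = \log PL(\beta)
  = \sum_{i=1}^{137} \delta_i \left[ \eta_i(t_i)
      - \log \sum_{j=1}^{137} Y_j(t_i) \exp\{\eta_j(t_i)\} \right]
```

Only individuals with an observed event ($\delta_i = 1$) contribute a term, and each term compares that individual's linear predictor with those of everyone still at risk at the same time.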
Appendix
library(survival)
veteran
veteran$prior[veteran$prior==10]=1
veteran$smallcell <- as.numeric(veteran$celltype=="smallcell")
veteran$adeno <- as.numeric(veteran$celltype=="adeno")
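veteran$large <- as.numeric(veteran$celltype=="large")
# Assumed definitions (not shown in this listing), inferred from the cox.zph output in Question 1
fit = coxph(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior, data=veteran)
r_score = residuals(fit, type="score")
r_dfbeta = residuals(fit, type="dfbeta")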
plot(veteran$trt, r_score[,1], ylab="Score Residuals", xlab="Treatment",
     main="Score Residuals by Treatment")
points(veteran$trt[44], r_score[,1][44], col = "red")
points(veteran$trt[17], r_score[,1][17], col = "red")
points(veteran$trt[21], r_score[,1][21], col = "red")
points(veteran$trt[78], r_score[,1][78], col = "blue")
points(veteran$trt[75], r_score[,1][75], col = "blue")
points(veteran$trt[73], r_score[,1][73], col = "blue")
points(veteran$diagtime[118], r_score[,6][118], col = "red")
points(veteran$diagtime[45], r_score[,6][45], col = "red")
points(veteran$diagtime[81], r_score[,6][81], col = "red")
points(veteran$diagtime[78], r_score[,6][78], col = "red")
points(veteran$diagtime[58], r_score[,6][58], col = "red")
points(veteran$diagtime[21], r_score[,6][21], col = "red")
points(veteran$diagtime[8], r_score[,6][8], col = "red")
points(veteran$diagtime[95], r_score[,6][95], col = "red")
points(veteran$diagtime[44], r_score[,6][44], col = "red")
points(veteran$diagtime[12], r_score[,6][12], col = "red")
points(veteran$diagtime[106], r_score[,6][106], col = "blue")
points(veteran$diagtime[36], r_score[,6][36], col = "blue")
points(veteran$diagtime[91], r_score[,6][91], col = "blue")
plot(veteran$trt, r_dfbeta[,1], ylab="Score Residuals", xlab="Treatment",
     main="Score Residuals by Treatment")
plot(veteran$celltype, r_dfbeta[,2], ylab="Score Residuals", xlab="Cell Type",
     main="Score Residuals by Cell Type", names=c("squa", "small", "adeno", "large"))
plot(veteran$celltype, r_dfbeta[,3], ylab="Score Residuals", xlab="Cell Type",
     main="Score Residuals by Cell Type", names=c("squa", "small", "adeno", "large"))
plot(veteran$celltype, r_dfbeta[,4], ylab="Score Residuals", xlab="Cell Type",
     main="Score Residuals by Cell Type", names=c("squa", "small", "adeno", "large"))
plot(veteran$karno, r_dfbeta[,5], ylab="Score Residuals", xlab="Karnofsky Performance Score",
     main="Score Residuals by Karnofsky Performance Score")
plot(veteran$diagtime, r_dfbeta[,6], ylab="Score Residuals", xlab="Diagnosis Time",
     main="Score Residuals by Diagnosis Time")
plot(veteran$age, r_dfbeta[,7], ylab="Score Residuals", xlab="Age",
     main="Score Residuals by Age")
plot(veteran$prior, r_dfbeta[,8], ylab="Score Residuals", xlab="Prior",
     main="Score Residuals by Prior")
# Scaledsch Residuals
par(mfrow=c(2,4))
residuals(fit)
residuals(fit, type="scaledsch")
cox.zph(fit)
for (i in 1:8){
plot(cox.zph(fit)[i])
abline(0,0,lty=2,col='red')
}
# Question 2
# Cox Model Survival Curves
cox.fit = coxph(Surv(time, status)~trt, data=veteran)
plot(survfit(cox.fit,newdata=data.frame(trt=c(1,2))),lty=c(1,1),col=c('red','blue'), xlab="Time",
ylab="Survival Probability", main="Survival Curves of the Cox Model")
legend(800, 1, c("standard", "test"), col=c("red", "blue"), lty=c(1:1))
cox.zph(cox.fit, global=TRUE)
cox.mod = coxph(Surv(time, status) ~ trt + tt(trt), tt=function(x,t, ...) x * t, data=veteran)
# Question 3
coxfit1 = coxph(Surv(time, status)~trt + age, data=veteran)
coxfit2 = coxph(Surv(time, status)~trt + tt(age),tt=function(x,t,...) x+t/365, data=veteran)
# Question 4
# Estimate separate treatment effects for early and later parts of the follow-up
vet = NULL
for (i in 1:nrow(veteran)) {
  if (veteran$time[i] <= 100) {
    # Follow-up ends by day 100: one row, early-period indicator only
    vet = rbind(vet, data.frame(veteran[i,], id=i, start=0, stop=veteran$time[i],
                                event=veteran$status[i], tr1=1, tr2=0))
  } else {
    # Follow-up passes day 100: censored early row (0, 100], then late row (100, time]
    vet = rbind(vet, data.frame(veteran[i,], id=i, start=0, stop=100, event=0, tr1=1, tr2=0))
    vet = rbind(vet, data.frame(veteran[i,], id=i, start=100, stop=veteran$time[i],
                                event=veteran$status[i], tr1=0, tr2=1))
  }
}
head(vet);tail(vet)