In this lecture we will do some hands-on examples covering survival analysis using R.
Note: This lecture is designed based on several resources. For details see the end-notes1.
Prerequisites: Familiarity with R is required (including working with data frames, installing/using packages, importing data, and saving results).
In cohort studies where you track samples or subjects from one time point (e.g., entry into a study, diagnosis, start of a treatment) until you observe some outcome event (e.g., death, onset of disease, relapse), it does not make sense to assume the survival rates are constant. For example:
The risk of death after heart surgery is highest immediately post-op, decreases as the patient recovers, then rises slowly again as the patient ages.
Recurrence rate of different cancers varies highly over time, and depends on tumor genetics, treatment, and other environmental factors.
Survival analysis is used to analyze the rates of occurrence of events over time, without assuming the rates are constant. Generally, survival analysis allows for
modeling the time until an event occurs,2
comparing the time-to-event between different groups, and
assessing how time-to-event correlates with quantitative variables.
Hazard, denoted by \(h(t)\), is the instantaneous event (death) rate at a particular time point t. If \(T\) denotes the time-to-event, then \[h(t) = \lim_{\delta \rightarrow 0} \frac{P(t<T<t+\delta|T>t)}{\delta}.\]
Survival analysis does not assume the hazard is constant over time.
Cumulative hazard, denoted by \(H(t)\), is the total hazard experienced up to time t. \[H(t) = \int_0^t h(u) du.\]
Survival function, denoted by \(S(t)\), is the probability that an individual survives (i.e., that the event of interest has not occurred) beyond time t.
Note: We will use death and event interchangeably.
Let \(T\) be the time of death, then \[ S(t) = P(T>t) \] is the probability that the time of death is greater than some time \(t\).
It is the same as the probability that the event has not occurred by time \(t\).
Since \(S\) is a probability, we have \(0 \leq S(t) \leq 1\), and \(T \geq 0\) as the survival times are always positive.
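These three quantities are tied together: since \(h(t) = -\frac{d}{dt}\log S(t)\), integrating from 0 to \(t\) gives \(H(t) = -\log S(t)\), and therefore \[S(t) = e^{-H(t)},\] so the survival function can be read off from the cumulative hazard and vice versa.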
Kaplan-Meier Curve is an estimator for \(S(t)\) which is given as \[\hat{S}(t) = \prod_{t_i<t} \Big( 1 - \frac{d_i}{n_i}\Big),\] where
\(n_i\) is the number of subjects at risk at time \(t_i\), and
\(d_i\) is the number of individuals who fail at that time.
The Kaplan-Meier curve is a step function illustrating the cumulative survival probability over time.
The curve is horizontal over periods where no event occurs, then drops vertically corresponding to a change in the survival function at each time an event occurs.
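To make the product-limit formula concrete, here is a minimal base-R sketch on made-up event times (five hypothetical subjects, not the lung data used later):

```r
# Toy illustration of the Kaplan-Meier product-limit estimator.
# Five made-up subjects: deaths at t = 2, 4, 4, 7; one censored at t = 5.
event_time <- c(2, 4, 4, 5, 7)
died       <- c(1, 1, 1, 0, 1)   # 1 = death observed, 0 = censored

surv <- 1
for (t in sort(unique(event_time[died == 1]))) {
  n_i  <- sum(event_time >= t)               # number at risk just before t
  d_i  <- sum(event_time == t & died == 1)   # deaths at t
  surv <- surv * (1 - d_i / n_i)             # product-limit update
  cat("t =", t, " n_i =", n_i, " d_i =", d_i, " S(t) =", round(surv, 2), "\n")
}
```

Notice that the censored subject at \(t=5\) still counts toward the risk set at \(t=2\) and \(t=4\) but contributes no drop of its own, exactly as the formula prescribes.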
Censoring is a type of missing data problem unique to survival analysis.
It occurs when you track the sample/subject through the end of the study and the event never occurs.
It could also happen due to the sample/subject dropping out of the study for reasons other than death, or some other loss to followup.
If the sample is censored, you only know that the individual survived up to the loss to followup, but you do not know anything about survival after that.3
One of the objectives in survival analysis is to compare the survival functions in different groups, e.g., leukemia patients as compared to cancer-free controls.
If both groups were followed until everyone died, both survival curves would end at 0%. However, one group might have survived on average a lot longer than the other group.
Proportional hazards assumption says that the ratio of hazards between groups is constant over time.4
Kaplan-Meier curves are good for visualizing differences in survival between two categories5, but they do not work well for assessing the effect of quantitative variables like age, gene expression, leukocyte count, etc.
Cox proportional hazards (PH) regression can assess the effect of both categorical and continuous variables, and can model the effect of multiple variables at once.
Cox PH regression models the natural log of the hazard at time t, denoted by \(h(t)\), as a function of the baseline hazard, \(h_0(t)\) (the hazard for an individual where all exposure variables are 0), and multiple exposure variables \(x_1, x_2,\ldots,x_p\).
Cox PH model is given as \[ log(h(t)) = log(h_0(t)) + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p. \]
If we limit the right hand side to just a single categorical exposure variable (\(x_1\)) with two groups (\(x_1=1\) for exposed and \(x_1=0\) for unexposed), the previous equation is equivalent to: \[\begin{eqnarray*} h_1(t) & = & h_0(t) e^{\beta_1 x_1}, \\ \Rightarrow HR(t) & := & \frac{h_1(t)}{h_0(t)} = e^{\beta_1}. \end{eqnarray*}\]
\(HR(t)\) is the hazard ratio, comparing the exposed to the unexposed individuals at time t.
This model shows that the hazard ratio is \(e^{\beta_1}\), and remains constant over time t (hence the name proportional hazards regression).
The \(\beta\)s are the regression coefficients that are estimated from the model, and represent the \(log(HR)\) for each unit increase in the corresponding predictor variable.
The interpretation of the hazard ratio depends on the measurement scale of the predictor variable. In simple terms for the variable in question,
a \(\beta > 0\) indicates more hazard than baseline and hence worse survival, and
a \(\beta < 0\) indicates less hazard than baseline and hence better survival.
The core survival analysis functions are in the survival package. The survival package is one of the few “core” packages that comes bundled with your basic R installation, so you probably did not need to use install.packages(). However, you will need to load it like any other library when you want to use it.
We will also want to load the survminer package, which provides much nicer Kaplan-Meier plots compared to base graphics.
# CRAN Packages needed
library(survival) # core survival analysis function
library(survminer) # recommended for visualizing survival curves
The core functions we will use out of the survival package include:
Surv(): creates the response variable (survival object); typical usage takes the time-to-event,6 and whether or not the event occurred (i.e., death vs censored).
survfit(): fits a survival curve using either a formula or a previously fitted Cox model. It creates a survival curve which can be displayed or plotted.
coxph(): fits a Cox proportional hazards regression model. Models are specified the same way as in regular linear models (e.g., lm()).
Other optional functions you might use include:
cox.zph(): tests the proportional hazards assumption of a Cox regression model.
survdiff(): tests for differences in survival between two groups using a log-rank / Mantel-Haenszel test.7
We will use the built-in lung cancer data set8 in the survival package. You can get some more information about the data set by running ?lung.
library(survival)
?lung
inst: Institution code
time: Survival time in days
status: Censoring status (1=censored, 2=dead)
age: Age in years
sex: Male=1, Female=2
ph.ecog: ECOG performance score (0=good, 5=dead)
ph.karno: Karnofsky performance score as rated by physician
pat.karno: Karnofsky performance score as rated by patient
meal.cal: Calories consumed at meals
wt.loss: Weight loss in last six months
Note: You can access the data just by running lung, as if you had read in a data set and called it lung. You can operate on it just like any other data frame.
We see that there are 228 observations and 10 variables in this data:
dim(lung) # returns the dimensions of the data frame
## [1] 228 10
head(lung) # returns the first few rows from the data
## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1 3 306 2 74 1 1 90 100 1175 NA
## 2 3 455 2 68 1 0 90 90 1225 15
## 3 3 1010 1 56 1 0 90 90 NA 15
## 4 5 210 2 57 1 1 90 60 1150 11
## 5 1 883 2 60 1 0 100 90 NA 0
## 6 12 1022 1 74 1 1 50 80 513 0
View(lung) # opens the data in the viewer pane
See the help for ?Surv. This is the main function we will use to create the survival object.
The help tells you that when there are two unnamed arguments, they will match time and event in that order.
This is the common shorthand for right-censored data. The alternative lets you specify interval data, where you give it the start and end times (time and time2).
The Surv() function tries to guess the coding of the status variable: whether you are using 0/1 or 1/2 to represent censored vs death, respectively.
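A small sketch of this guessing behavior, using toy vectors rather than the lung data:

```r
library(survival)  # bundled with base R, so no install needed

# Hypothetical toy data to show the two status codings
time   <- c(10, 20, 30)
status <- c(1, 2, 2)            # 1 = censored, 2 = dead (as in the lung data)

Surv(time, status)              # Surv() guesses the 1/2 coding
Surv(time, status - 1)          # the same object with explicit 0/1 coding
```

Both calls print the same survival object, with the censored observation marked by a +.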
Create a survival object called s and display it. It is a special type of vector that tells you
how long the subject was tracked for, and
whether the event occurred or the sample was censored (shown by the +).
s = Surv(lung$time, lung$status)
class(s)
## [1] "Surv"
s
## [1] 306 455 1010+ 210 883 1022+ 310 361 218 166 170 654
## [13] 728 71 567 144 613 707 61 88 301 81 624 371
## [25] 394 520 574 118 390 12 473 26 533 107 53 122
## [37] 814 965+ 93 731 460 153 433 145 583 95 303 519
## [49] 643 765 735 189 53 246 689 65 5 132 687 345
## [61] 444 223 175 60 163 65 208 821+ 428 230 840+ 305
## [73] 11 132 226 426 705 363 11 176 791 95 196+ 167
## [85] 806+ 284 641 147 740+ 163 655 239 88 245 588+ 30
## [97] 179 310 477 166 559+ 450 364 107 177 156 529+ 11
## [109] 429 351 15 181 283 201 524 13 212 524 288 363
## [121] 442 199 550 54 558 207 92 60 551+ 543+ 293 202
## [133] 353 511+ 267 511+ 371 387 457 337 201 404+ 222 62
## [145] 458+ 356+ 353 163 31 340 229 444+ 315+ 182 156 329
## [157] 364+ 291 179 376+ 384+ 268 292+ 142 413+ 266+ 194 320
## [169] 181 285 301+ 348 197 382+ 303+ 296+ 180 186 145 269+
## [181] 300+ 284+ 350 272+ 292+ 332+ 285 259+ 110 286 270 81
## [193] 131 225+ 269 225+ 243+ 279+ 276+ 135
## [ reached getOption("max.print") -- omitted 28 entries ]
Next, we fit a survival curve with the survfit() function. See the help for ?survfit.
Create a simple survival curve that does not consider any different groupings, so we will specify just an intercept (e.g., ~1) in the formula that survfit() expects.
We can also do this in one step by nesting the Surv() call within the survfit() call; we will use the data= argument to specify which data we are using.
sfit = survfit(s~1)
# or we can also use the following
sfit = survfit(Surv(time, status)~1, data=lung)
# sfit
Run summary() on the object sfit. This will show a life table.
summary(sfit)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 228 1 0.9956 0.00438 0.9871 1.000
## 11 227 3 0.9825 0.00869 0.9656 1.000
## 12 224 1 0.9781 0.00970 0.9592 0.997
## 13 223 2 0.9693 0.01142 0.9472 0.992
## 15 221 1 0.9649 0.01219 0.9413 0.989
## 26 220 1 0.9605 0.01290 0.9356 0.986
## 30 219 1 0.9561 0.01356 0.9299 0.983
## 31 218 1 0.9518 0.01419 0.9243 0.980
## 53 217 2 0.9430 0.01536 0.9134 0.974
## 54 215 1 0.9386 0.01590 0.9079 0.970
## 59 214 1 0.9342 0.01642 0.9026 0.967
## 60 213 2 0.9254 0.01740 0.8920 0.960
## 61 211 1 0.9211 0.01786 0.8867 0.957
## 62 210 1 0.9167 0.01830 0.8815 0.953
## 65 209 2 0.9079 0.01915 0.8711 0.946
## 71 207 1 0.9035 0.01955 0.8660 0.943
## 79 206 1 0.8991 0.01995 0.8609 0.939
## 81 205 2 0.8904 0.02069 0.8507 0.932
## 88 203 2 0.8816 0.02140 0.8406 0.925
## 92 201 1 0.8772 0.02174 0.8356 0.921
## 93 199 1 0.8728 0.02207 0.8306 0.917
## 95 198 2 0.8640 0.02271 0.8206 0.910
## 105 196 1 0.8596 0.02302 0.8156 0.906
## 107 194 2 0.8507 0.02362 0.8056 0.898
## 110 192 1 0.8463 0.02391 0.8007 0.894
## 116 191 1 0.8418 0.02419 0.7957 0.891
## 118 190 1 0.8374 0.02446 0.7908 0.887
## 122 189 1 0.8330 0.02473 0.7859 0.883
## [ reached getOption("max.print") -- omitted 111 rows ]
These tables show a row for each time point where either the event occurred or a sample was censored. It shows the number at risk (number still remaining), and the cumulative survival at that instant.
Let us now fit survival curves separately by sex.
sfit = survfit(Surv(time, status)~sex, data=lung)
sfit
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
##
## n events median 0.95LCL 0.95UCL
## sex=1 138 112 270 212 310
## sex=2 90 53 426 348 550
summary(sfit)
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
##
## sex=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 11 138 3 0.9783 0.0124 0.9542 1.000
## 12 135 1 0.9710 0.0143 0.9434 0.999
## 13 134 2 0.9565 0.0174 0.9231 0.991
## 15 132 1 0.9493 0.0187 0.9134 0.987
## 26 131 1 0.9420 0.0199 0.9038 0.982
## 30 130 1 0.9348 0.0210 0.8945 0.977
## 31 129 1 0.9275 0.0221 0.8853 0.972
## 53 128 2 0.9130 0.0240 0.8672 0.961
## 54 126 1 0.9058 0.0249 0.8583 0.956
## 59 125 1 0.8986 0.0257 0.8496 0.950
## 60 124 1 0.8913 0.0265 0.8409 0.945
## 65 123 2 0.8768 0.0280 0.8237 0.933
## 71 121 1 0.8696 0.0287 0.8152 0.928
## 81 120 1 0.8623 0.0293 0.8067 0.922
## 88 119 2 0.8478 0.0306 0.7900 0.910
## 92 117 1 0.8406 0.0312 0.7817 0.904
## 93 116 1 0.8333 0.0317 0.7734 0.898
## 95 115 1 0.8261 0.0323 0.7652 0.892
## 105 114 1 0.8188 0.0328 0.7570 0.886
## 107 113 1 0.8116 0.0333 0.7489 0.880
## 110 112 1 0.8043 0.0338 0.7408 0.873
## 116 111 1 0.7971 0.0342 0.7328 0.867
## 118 110 1 0.7899 0.0347 0.7247 0.861
## 131 109 1 0.7826 0.0351 0.7167 0.855
## 132 108 2 0.7681 0.0359 0.7008 0.842
## 135 106 1 0.7609 0.0363 0.6929 0.835
## 142 105 1 0.7536 0.0367 0.6851 0.829
## 144 104 1 0.7464 0.0370 0.6772 0.823
## [ reached getOption("max.print") -- omitted 71 rows ]
##
## sex=2
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 90 1 0.9889 0.0110 0.9675 1.000
## 60 89 1 0.9778 0.0155 0.9478 1.000
## 61 88 1 0.9667 0.0189 0.9303 1.000
## 62 87 1 0.9556 0.0217 0.9139 0.999
## 79 86 1 0.9444 0.0241 0.8983 0.993
## 81 85 1 0.9333 0.0263 0.8832 0.986
## 95 83 1 0.9221 0.0283 0.8683 0.979
## 107 81 1 0.9107 0.0301 0.8535 0.972
## 122 80 1 0.8993 0.0318 0.8390 0.964
## 145 79 2 0.8766 0.0349 0.8108 0.948
## 153 77 1 0.8652 0.0362 0.7970 0.939
## 166 76 1 0.8538 0.0375 0.7834 0.931
## 167 75 1 0.8424 0.0387 0.7699 0.922
## 182 71 1 0.8305 0.0399 0.7559 0.913
## 186 70 1 0.8187 0.0411 0.7420 0.903
## 194 68 1 0.8066 0.0422 0.7280 0.894
## 199 67 1 0.7946 0.0432 0.7142 0.884
## 201 66 2 0.7705 0.0452 0.6869 0.864
## 208 62 1 0.7581 0.0461 0.6729 0.854
## 226 59 1 0.7452 0.0471 0.6584 0.843
## 239 57 1 0.7322 0.0480 0.6438 0.833
## 245 54 1 0.7186 0.0490 0.6287 0.821
## 268 51 1 0.7045 0.0501 0.6129 0.810
## 285 47 1 0.6895 0.0512 0.5962 0.798
## 293 45 1 0.6742 0.0523 0.5791 0.785
## 305 43 1 0.6585 0.0534 0.5618 0.772
## 310 42 1 0.6428 0.0544 0.5447 0.759
## 340 39 1 0.6264 0.0554 0.5267 0.745
## [ reached getOption("max.print") -- omitted 23 rows ]
Check out the help for ?summary.survfit. You can give the summary() function an option for what times you want to show in the results.
?summary.survfit
You can find the range of the follow-up times with range(). You can create a sequence of numbers going from one number to another number by increments of yet another number with the seq() function.
range(lung$time)
## [1] 5 1022
seq(0, 1100, by=100)
## [1] 0 100 200 300 400 500 600 700 800 900 1000 1100
Now, pass those times to summary() on sfit to get life tables at those intervals separately for both males (1) and females (2).
summary(sfit, times=seq(0, 1000, 100))
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
##
## sex=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 138 0 1.0000 0.0000 1.0000 1.000
## 100 114 24 0.8261 0.0323 0.7652 0.892
## 200 78 30 0.6073 0.0417 0.5309 0.695
## 300 49 20 0.4411 0.0439 0.3629 0.536
## 400 31 15 0.2977 0.0425 0.2250 0.394
## 500 20 7 0.2232 0.0402 0.1569 0.318
## 600 13 7 0.1451 0.0353 0.0900 0.234
## 700 8 5 0.0893 0.0293 0.0470 0.170
## 800 6 2 0.0670 0.0259 0.0314 0.143
## 900 2 2 0.0357 0.0216 0.0109 0.117
## 1000 2 0 0.0357 0.0216 0.0109 0.117
##
## sex=2
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 90 0 1.0000 0.0000 1.0000 1.000
## 100 82 7 0.9221 0.0283 0.8683 0.979
## 200 66 11 0.7946 0.0432 0.7142 0.884
## 300 43 9 0.6742 0.0523 0.5791 0.785
## 400 26 10 0.5089 0.0603 0.4035 0.642
## 500 21 5 0.4110 0.0626 0.3050 0.554
## 600 11 3 0.3433 0.0634 0.2390 0.493
## 700 8 3 0.2496 0.0652 0.1496 0.417
## 800 2 5 0.0832 0.0499 0.0257 0.270
## 900 1 0 0.0832 0.0499 0.0257 0.270
From these tables we can start to see that males tend to have worse survival than females.
A Kaplan-Meier plot is used to visualize the survival curve fit to the data.
To do this we can use the plot() function on the fitted survival curve sfit.
Arguments to the plot() function can be used to modify the plot as necessary. See ?plot.survfit.
sfit = survfit(Surv(time, status)~sex, data=lung)
plot(sfit)
However, we will focus on the package called survminer, which provides a function called ggsurvplot(). Since it uses ggplot2 syntax, it is pretty easy to modify.
library(survminer)
ggsurvplot(sfit)
This plot is substantially more informative than the one created by plot(). Some cool things include
automatic color coding of the different groups,
axis labels, and
an automatic legend, and a lot more.
Let us also
add confidence intervals,
show the p-value for the log-rank test,
show a risk table below the plot, and
change the colors and the group labels.
ggsurvplot(sfit,
conf.int=TRUE, # add confidence intervals
pval=TRUE, # show the p-value for the log-rank test
risk.table=TRUE, # show a risk table below the plot
legend.labs=c("Male", "Female"), # change group labels
legend.title="Sex", # add legend title
palette=c("dodgerblue4", "orchid2"), # change colors of the groups
title="Kaplan-Meier Curve for Lung Cancer Survival", # add title to plot
risk.table.height=.2)
Kaplan-Meier curves are good for visualizing differences in survival between two categorical groups. The p-value for the log-rank test provided when pval=TRUE is also useful for identifying any differences in survival between different groups.
However, this does not generalize well for assessing the effect of quantitative variables.
Just try creating a K-M plot for the inst variable (institution code), which takes many distinct values.
ggsurvplot(survfit(Surv(time, status)~inst, data=lung))
At some point using a categorical grouping for K-M plots breaks down, more so when we might want to assess how multiple variables work together to influence survival.
For example, we might want to simultaneously examine the effect of race and socioeconomic status (so as to adjust for factors like income, access to care, etc.) before concluding that ethnicity influences some outcome.
Cox PH regression can assess the effect of both categorical and continuous variables, and can model the effect of multiple variables at once.
The coxph() function uses the same syntax as lm(), glm(), etc.
Let us run a Cox regression on sex.
fit = coxph(Surv(time, status)~sex, data=lung)
fit
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
##
## coef exp(coef) se(coef) z p
## sex -0.5310 0.5880 0.1672 -3.176 0.00149
##
## Likelihood ratio test=10.63 on 1 df, p=0.001111
## n= 228, number of events= 165
The exp(coef) column contains \(e^{\beta_1}\) (see the background section above for more info). This is the hazard ratio – the multiplicative effect of that variable on the hazard rate (for each unit increase in that variable).
For a categorical variable like sex, going from male (baseline) to female results in approximately a \(40\%\) reduction in hazard.
We could also flip the sign on the coef column and take exp(0.531), which can be interpreted as:
Males have a \(1.7\)-fold increase in hazard, or that
Males die at approximately \(1.7\times\) the rate per unit time as females (females die at \(0.588\times\) the rate per unit time as males).
Remember:
\(HR=1\): No effect
\(HR>1\): Increase in hazard
\(HR<1\): Decrease in hazard
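The arithmetic behind these statements can be checked directly from the coefficient \(-0.531\) reported in the model output above:

```r
beta <- -0.531      # coefficient for sex from the Cox model output above
exp(beta)           # hazard ratio, female vs male: ~0.588
exp(-beta)          # flipped comparison, male vs female: ~1.70
1 - exp(beta)       # ~0.41, i.e. roughly a 40% reduction in hazard
```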
There is a p-value on the sex term, and a p-value on the overall model.
That \(0.00111\) p-value is really close to the \(p=0.00131\) p-value we saw on the Kaplan-Meier plot.
That is because the KM plot is showing the log-rank test p-value.
We can get this out of the Cox model with a call to summary(fit).
You can directly calculate the log-rank test p-value using survdiff().
summary(fit)
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
##
## n= 228, number of events= 165
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex -0.5310 0.5880 0.1672 -3.176 0.00149 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex 0.588 1.701 0.4237 0.816
##
## Concordance= 0.579 (se = 0.021 )
## Likelihood ratio test= 10.63 on 1 df, p=0.001
## Wald test = 10.09 on 1 df, p=0.001
## Score (logrank) test = 10.33 on 1 df, p=0.001
survdiff(Surv(time, status)~sex, data=lung)
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = lung)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 138 112 91.6 4.55 10.3
## sex=2 90 53 73.4 5.68 10.3
##
## Chisq= 10.3 on 1 degrees of freedom, p= 0.001
Let us create another model where we analyze all the variables in the data set.
This shows us how all the variables, when considered together, act to influence survival.
Some are very strong predictors (sex, ECOG score).
Interestingly, the Karnofsky performance score as rated by the physician (ph.karno) was marginally significant, while the same score as rated by the patient (pat.karno) was not.
fit = coxph(Surv(time, status)~sex+age+ph.ecog+ph.karno+pat.karno+meal.cal+wt.loss, data=lung)
fit
## Call:
## coxph(formula = Surv(time, status) ~ sex + age + ph.ecog + ph.karno +
## pat.karno + meal.cal + wt.loss, data = lung)
##
## coef exp(coef) se(coef) z p
## sex -5.509e-01 5.765e-01 2.008e-01 -2.743 0.00609
## age 1.065e-02 1.011e+00 1.161e-02 0.917 0.35906
## ph.ecog 7.342e-01 2.084e+00 2.233e-01 3.288 0.00101
## ph.karno 2.246e-02 1.023e+00 1.124e-02 1.998 0.04574
## pat.karno -1.242e-02 9.877e-01 8.054e-03 -1.542 0.12316
## meal.cal 3.329e-05 1.000e+00 2.595e-04 0.128 0.89791
## wt.loss -1.433e-02 9.858e-01 7.771e-03 -1.844 0.06518
##
## Likelihood ratio test=28.33 on 7 df, p=0.0001918
## n= 168, number of events= 121
## (60 observations deleted due to missingness)
The Cox proportional hazards model makes several assumptions. Thus, it is important to assess whether a fitted Cox regression model adequately describes the data.
Here, we’ll discuss three types of diagnostics for the Cox model:
Detecting non-linearity in relationship between the log-hazard and the covariates.
Examining influential observations (or outliers).
Testing the proportional hazards assumption.
In order to check these model assumptions, residual-based methods are used. The common residuals for the Cox model include:
Martingale residual to assess non-linearity \[r_{Mi} = \delta_i - \hat{H}_0(t_i)e^{\mathbf{x}_i^\top \hat{\beta}},\] where \(\hat{H}_0(t_i)\) is the estimate of the baseline cumulative hazard at \(t_i\), and \(\delta_i\) is the event indicator for subject \(i\).
Deviance residual (symmetric transformation of the Martingale residuals), to examine influential observations. \[r_{Di} = sign(r_{Mi})[-2(r_{Mi}+\delta_i\log(\delta_i-r_{Mi}))]^{1/2}.\]
Schoenfeld residuals to check the proportional hazards assumption.
For any subject \(i \in D(t_k)\) (the set of \(d_k\) failures at time \(t_k\)), the Schoenfeld residual is the difference between the covariate for that subject and the weighted average of covariates in the risk set, namely \(x_i - \bar{x}(\beta, t_k)\).
The sum of the Schoenfeld residuals over all \(d_k\) subjects who fail at \(t_k\), also known as the Schoenfeld residual corresponding to \(t_k\), is \[r_{Sk} = \sum\limits_{i \in R(t_k)}\delta_{ik}[x_i - \bar{x}(\beta, t_k)],\] where \(\delta_{ik} = 1\) if subject \(i\) fails at \(t_k\) and zero otherwise; and \(R(t_k)\) is the risk set at \(t_k\).
Let us construct a proportional hazards model with the variables sex, age, and wt.loss.
fit = coxph(Surv(time, status)~sex+age+wt.loss, data=lung)
fit
## Call:
## coxph(formula = Surv(time, status) ~ sex + age + wt.loss, data = lung)
##
## coef exp(coef) se(coef) z p
## sex -0.5210319 0.5939074 0.1743541 -2.988 0.0028
## age 0.0200882 1.0202913 0.0096644 2.079 0.0377
## wt.loss 0.0007596 1.0007599 0.0061934 0.123 0.9024
##
## Likelihood ratio test=14.67 on 3 df, p=0.002122
## n= 214, number of events= 152
## (14 observations deleted due to missingness)
The proportional hazards (PH) assumption can be checked using statistical tests and graphical diagnostics based on the scaled Schoenfeld residuals.
In principle, the Schoenfeld residuals are independent of time. A plot that shows a non-random pattern against time is evidence of violation of the PH assumption.
The function cox.zph() (for help see ?cox.zph) provides a convenient way to test the proportional hazards assumption for each covariate included in a Cox regression model fit.
For each covariate, cox.zph() correlates the corresponding set of scaled Schoenfeld residuals with time, to test for independence between residuals and time.
Additionally, it performs a global test for the model as a whole.
The proportional hazard assumption is supported by a non-significant relationship between residuals and time, and refuted by a significant relationship.
test.ph = cox.zph(fit)
test.ph
## rho chisq p
## sex 0.1265 2.349 0.125
## age -0.0483 0.378 0.538
## wt.loss 0.0126 0.024 0.877
## GLOBAL NA 2.846 0.416
From the output above, the test is not statistically significant for each of the covariates, and the global test is also not statistically significant. Therefore, our proportional hazards assumption is reasonable.
We can also use the function ggcoxzph() (for help see ?ggcoxzph), which produces, for each covariate, graphs of the scaled Schoenfeld residuals against the transformed time.
ggcoxzph(test.ph)
In the figure above, the solid line is a smoothing spline fit to the plot, with the dashed lines representing a \(\pm 2\sigma\) band around the fit.
Systematic departures from a horizontal line are indicative of non-proportional hazards, since proportional hazards assumes that estimates \(\beta_1,\beta_2,\beta_3\) do not vary much over time.
From the graphical inspection, there is no pattern with time.
The assumption of proportional hazards appears to be supported for the covariates age, wt.loss and sex (which is a two-level factor, accounting for the two bands in the graph).
Violations of the proportional hazards assumption can be resolved by:
Adding \(covariate \times time\) interaction.
Stratification (useful for nuisance confounders).
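As a sketch of the stratification remedy, a strata() term in coxph() gives each level of a covariate its own baseline hazard, so that covariate's effect no longer needs to be proportional. Stratifying on sex here is purely illustrative, since the PH tests above were non-significant:

```r
library(survival)

# Stratify on sex: sex no longer receives a coefficient; instead each sex
# gets its own baseline hazard, so its effect need not be proportional.
fit.strat <- coxph(Surv(time, status) ~ age + wt.loss + strata(sex), data = lung)
fit.strat
```

The stratified fit reports coefficients only for age and wt.loss; the price of stratification is that you can no longer estimate a hazard ratio for the stratifying variable itself.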
To test for influential observations or outliers, we can visualize either:
the deviance residuals or
the dfbeta values
The function ggcoxdiagnostics() (for help see ?ggcoxdiagnostics) provides a convenient solution for checking influential observations.
General usage is as follows
ggcoxdiagnostics(fit, type = ???, linear.predictions = TRUE)
fit: an object of class coxph.object.
type: the type of residuals to present on the Y axis. Allowed values include one of c("martingale", "deviance", "score", "schoenfeld", "dfbeta", "dfbetas", "scaledsch", "partial").
Specifying the argument type = "dfbeta" plots the estimated changes in the regression coefficients upon deleting each observation in turn; likewise, type = "dfbetas" produces the estimated changes in the coefficients divided by their standard errors.
linear.predictions: a logical value indicating whether to show linear predictions for observations (TRUE) or just indexes of observations (FALSE) on the X axis.
First, let us use type = "dfbeta":
ggcoxdiagnostics(fit, type = "dfbeta",
linear.predictions = FALSE, ggtheme = theme_bw())
The figure shows index plots of dfbeta for the Cox regression of time to death on age, sex and wt.loss. None of the observations appears terribly influential individually, even though some of the dfbeta values for age and wt.loss are large compared with the others.
Next, let us use type = "deviance".
It is also possible to check outliers by visualizing the deviance residuals.
The deviance residual is a normalized transform of the martingale residual.
These residuals should be roughly symmetrically distributed about zero with a standard deviation of \(1\).
Positive values correspond to individuals that “died too soon” compared to expected survival times.
Negative values correspond to individuals that “lived too long”.
Very large or small values are outliers, which are poorly predicted by the model.
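You can also extract the deviance residuals directly with residuals(); this quick check on the same three-covariate model fit above shows how to inspect their distribution:

```r
library(survival)

# Same three-covariate model as above
fit  <- coxph(Surv(time, status) ~ sex + age + wt.loss, data = lung)
dres <- residuals(fit, type = "deviance")
summary(dres)   # should be roughly symmetric about zero if the model fits
```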
ggcoxdiagnostics(fit, type = "deviance",
linear.predictions = FALSE, ggtheme = theme_bw())
Plotting the Martingale residuals against continuous covariates is a common approach used to detect non-linearity or, in other words, to assess the functional form of a covariate.
Non-linearity is not an issue for categorical variables, so we only examine plots of martingale residuals and partial residuals against a continuous variable.
Martingale residuals may present any value in the range \((-\infty, 1)\):
A value of martingale residuals near 1 represents individuals that “died too soon”, and
large negative values correspond to individuals that “lived too long”.
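This range can be verified directly; as a sketch, fit a simple one-covariate model and inspect the residuals:

```r
library(survival)

fit  <- coxph(Surv(time, status) ~ age, data = lung)
mres <- residuals(fit, type = "martingale")
range(mres)     # the upper end never exceeds 1
```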
To assess the functional form of a continuous variable in a Cox PH model, we will use the function ggcoxfunctional() (for help see ?ggcoxfunctional).
ggcoxfunctional() displays graphs of continuous covariates against the martingale residuals of a null Cox PH model.
This might help to properly choose the functional form of a continuous variable in the Cox PH model.
Fitted lines with the loess function should be linear to satisfy the Cox PH model assumptions.
ggcoxfunctional(Surv(time, status) ~ age + log(age) + sqrt(age), data = lung)
In biomedical applications, non-parametric (e.g. the product-limit survival curve estimator) and semi-parametric (e.g. the Cox proportional hazards model) methods play the most important role, since they have the flexibility to accommodate a wide range of hazard function forms.
Parametric methods may be appropriate when survival data can be shown to approximately follow a particular parametric form.
Parametric models are often much easier to work with than the partial-likelihood-based models (e.g. Cox PH model), since the former are defined by a small and fixed number of unknown parameters.
Furthermore, accommodating complex censoring and truncation patterns is much more direct with parametric models than with partial likelihood models.
Of course, the validity of these techniques depends heavily on the appropriateness of the particular parametric model being used.
Some of the distributions that could potentially serve as survival distribution models are the exponential, Weibull, and gamma distributions.
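As a brief sketch, parametric models can be fit with the survreg() function in the survival package; the Weibull choice here is purely illustrative:

```r
library(survival)

# Parametric (Weibull) accelerated failure time model on the same lung data
wfit <- survreg(Surv(time, status) ~ sex, data = lung, dist = "weibull")
summary(wfit)
```

Note that survreg() fits accelerated failure time models, so its coefficients act on survival time rather than on the hazard and are not directly comparable to coxph() coefficients.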
For more details see Parametric Survival Models9, which examines a range of parametric survival distributions, their specifications in R, and the hazard shapes they support.
These notes are adapted from Cox Model Assumptions Tools, Proportional Hazards Regression Diagnostics, and Survival Analysis with R.↩
In the medical world, we typically think of survival analysis literally – tracking time until death. But, it is more general than that – survival analysis models time until an event occurs (any event). This might be death of a biological organism. But it could also be the time until a hardware failure in a mechanical system, time until recovery, time someone remains unemployed after losing a job, time until a ripe tomato is eaten by a grazing deer, time until someone falls asleep in a workshop, etc. Survival analysis also goes by reliability theory in engineering, duration analysis in economics, and event history analysis in sociology.↩
This describes the most common type of censoring – right censoring. Left censoring, which is less common, occurs when the “start” is unknown, such as when an initial diagnosis or exposure time is unknown.↩
And, following the definitions above, proportional hazards assumption says that the cumulative hazard ratio between two groups remains constant over time.↩
And there is a chi-square-like statistical test for these differences, called the log-rank test, that compares the survival functions of different categories.↩
Surv() can also take start and stop times, to account for left censoring. See the help for ?Surv.↩
Cox regression and the log-rank test from survdiff() are going to give you similar results most of the time. The log-rank test is asking if survival curves differ significantly between two groups. Cox regression is asking which of many categorical or continuous variables significantly affect survival.↩
Loprinzi et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.↩
Devin Incerti, Parametric Survival Models: https://devinincerti.com/2019/06/18/parametric_survival.html↩