In this lecture we will do some hands-on examples covering survival analysis using R.

Note: This lecture is designed based on several resources. For details see the end-notes1.

Prerequisites: Familiarity with R is required (including working with data frames, installing/using packages, importing data, and saving results).


1 Setup


2 Background

In cohort studies where you track samples or subjects from one time point (e.g., entry into a study, diagnosis, start of a treatment) until you observe some outcome event (e.g., death, onset of disease, relapse), it does not make sense to assume the survival rates are constant. For example:

  • The risk of death after heart surgery is highest immediately post-op, decreases as the patient recovers, then rises slowly again as the patient ages.

  • Recurrence rate of different cancers varies highly over time, and depends on tumor genetics, treatment, and other environmental factors.

2.1 Survival Analysis

Survival analysis is used to analyze the rates of occurrence of events over time, without assuming the rates are constant. Generally, survival analysis allows for

  • modeling the time until an event occurs,2 or

  • compare the time-to-event between different groups, or

  • assess how time-to-event correlates with quantitative variables.

2.2 Hazard and Cumulative Hazard

  • Hazard, denoted by \(h(t)\), is the instantaneous event (death) rate at a particular time point t. If \(T\) denotes the time-to-event, then \[h(t) = \lim_{\delta \rightarrow 0} \frac{P(t<T<t+\delta|T>t)}{\delta}.\]

  • Survival analysis does not assume the hazard is constant over time.

  • Cumulative hazard, denoted by \(H(t)\), is the total hazard experienced up to time t. \[H(t) = \int_0^t h(u) du.\]

2.3 Survival Function

  • Survival function, denoted by \(S(t)\), is the probability that an individual survives (or, the probability that the event of interest does not occur) up to and including time t.

  • Note: We will use death and event interchangeably.

  • Let \(T\) be the time of death, then \[ S(t) = P(T>t) \] is the probability that the time of death is greater than some time \(t\).

  • It is same as the probability that the event has not occurred until time \(t\).

  • Since \(S\) is a probability, we have \(0 \leq S(t) \leq 1\), and \(T \geq 0\) as the survival times are always positive.

2.4 Kaplan-Meier Curve

  • Kaplan-Meier Curve is an estimator for \(S(t)\) which is given as \[\hat{S}(t) = \prod_{t_i<t} \Big( 1 - \frac{d_i}{n_i}\Big),\] where

    • \(n_i\) is the number of subjects at risk at time \(t_i\), and

    • \(d_i\) is the number of individuals who fail at that time.

  • The Kaplan-Meier curve is a step function illustrating the cumulative survival probability over time.

  • The curve is horizontal over periods where no event occurs, then drops vertically corresponding to a change in the survival function at each time an event occurs.

2.5 Censoring

  • Censoring is a type of missing data problem unique to survival analysis.

  • It occurs when you track the sample/subject through the end of the study and the event never occurs.

  • It could also happen due to the sample/subject dropping out of the study for reasons other than death, or some other loss to followup.

  • If the sample is censored, you only know that the individual survived up to the loss to followup, but you do not know anything about survival after that.3

2.6 Proportional Hazards Assumption

  • One of the objectives in survival analysis is to compare the survival functions in different groups, e.g., leukemia patients as compared to cancer-free controls.

  • If both groups were followed until everyone died, both survival curves would end at 0%. However, one group might have survived on average a lot longer than the other group.

    • Survival analysis addresses this problem by comparing the hazard at different times over the observation period.
  • Proportional hazards assumption says that the ratio of hazards between groups is constant over time.4

2.7 Cox Proportional Hazards Model

  • Kaplan-Meier curves are good for visualizing differences in survival between two categories5, but they do not work well for assessing the effect of quantitative variables like age, gene expression, leukocyte count, etc.

  • Cox proportional hazards (PH) regression can assess the effect of both categorical and continuous variables, and can model the effect of multiple variables at once.

  • Cox PH regression models the natural log of the hazard at time t, denoted by \(h(t)\), as a function of the baseline hazard, \(h_0(t)\) (the hazard for an individual where all exposure variables are 0), and multiple exposure variables \(x_1, x_2,\ldots,x_p\).

  • Cox PH model is given as \[ log(h(t)) = log(h_0(t)) + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p. \]

    • If we limit the right hand side to just a single categorical exposure variable (\(x_1\)) with two groups (\(x_1=1\) for exposed and \(x_1=0\) for unexposed), the previous equation is equivalent to: \[\begin{eqnarray*} h_1(t) & = & h_0(t) e^{\beta_1 x_1}, \\ \Rightarrow HR(t) & := & \frac{h_1(t)}{h_0(t)} = e^{\beta_1}. \end{eqnarray*}\]

    • \(HR(t)\) is the hazard ratio, comparing the exposed to the unexposed individuals at time t.

    • This model shows that the hazard ratio is \(e^{\beta_1}\), and remains constant over time t (hence the name proportional hazards regression).

  • The \(\beta\)s are the regression coefficients that are estimated from the model, and represent the \(log(HR)\) for each unit increase in the corresponding predictor variable.

  • The interpretation of the hazard ratio depends on the measurement scale of the predictor variable. In simple terms for the variable in question,

    • a \(\beta > 0\) indicates more hazard than baseline and hence worse survival, and

    • a \(\beta < 0\) indicates less hazard than baseline and hence better survival.


3 Getting Started with R

  • The core survival analysis functions are in the survival package. The survival package is one of the few “core” packages that comes bundled with your basic R installation, so you probably did not need to use install.packages(). However, you will need to load it like any other library when you want to use it.

  • We will also want to load the survminer package, which provides much nicer Kaplan-Meier plots compared to base graphics.

# CRAN Packages needed
library(survival)  # core survival analysis function
library(survminer) # recommended for visualizing survival curves

The core functions we will use out of the survival package include:

  • Surv(): creates the response variable (survival object), and typical usage takes the time-to-event,6 and whether or not the event occurred (i.e., death vs censored).
  • survfit(): Fits a survival curve using either a formula, or a previously fitted Cox model. It creates a survival curve which could be displayed or plotted.

  • coxph(): Fits a Cox proportional hazards regression model. Models are specified the same way as in regular linear models (e.g. lm()).

Other optional functions you might use include:

  • cox.zph(): Tests the proportional hazards assumption of a Cox regression model.

  • survdiff(): Tests for differences in survival between two groups using a log-rank / Mantel-Haenszel test.7

We will use the built-in lung cancer data set8 in the survival package. You can get some more information about the data set by running ?lung.

library(survival)
?lung
  1. inst: Institution code

  2. time: Survival time in days

  3. status: censoring status 1=censored, 2=dead

  4. age: Age in years

  5. sex: Male=1 Female=2

  6. ph.ecog: ECOG performance score (0=good 5=dead)

  7. ph.karno: Karnofsky performance score as rated by physician

  8. pat.karno: Karnofsky performance score as rated by patient

  9. meal.cal: Calories consumed at meals

  10. wt.loss: Weight loss in last six months

Note: You can access the data just by running lung, as if you had read in a data set and called it lung. You can operate on it just like any other data frame.

We see that there are 228 observations and 10 variables in this data:

dim(lung) # returns the dimensions of the data frame
## [1] 228  10
head(lung) # returns the first few rows from the data
##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0
View(lung) # opens the data in the viewer pane

4 Survival Curves

4.1 Surv()

  • See the help for ?Surv. This is the main function we will use to create the survival object.

  • The help tells you that when there are two unnamed arguments, they will match time and event in that order.

  • This is the common shorthand for right-censored data. The alternative lets you specify interval data, where you give it the start and end times (time and time2).

  • Surv function tries to guess the coding of the status variable. It will try to guess whether you are using 0/1 or 1/2 to represent censored vs death, respectively.

4.1.1 Example

  • Create a survival object called s and display it. It is a special type of vector that tells you

    • how long the subject was tracked for, and

    • if the event occurred or the sample was censored (shown by the +).

s = Surv(lung$time, lung$status)
class(s)
## [1] "Surv"
s
##   [1]  306   455  1010+  210   883  1022+  310   361   218   166   170   654 
##  [13]  728    71   567   144   613   707    61    88   301    81   624   371 
##  [25]  394   520   574   118   390    12   473    26   533   107    53   122 
##  [37]  814   965+   93   731   460   153   433   145   583    95   303   519 
##  [49]  643   765   735   189    53   246   689    65     5   132   687   345 
##  [61]  444   223   175    60   163    65   208   821+  428   230   840+  305 
##  [73]   11   132   226   426   705   363    11   176   791    95   196+  167 
##  [85]  806+  284   641   147   740+  163   655   239    88   245   588+   30 
##  [97]  179   310   477   166   559+  450   364   107   177   156   529+   11 
## [109]  429   351    15   181   283   201   524    13   212   524   288   363 
## [121]  442   199   550    54   558   207    92    60   551+  543+  293   202 
## [133]  353   511+  267   511+  371   387   457   337   201   404+  222    62 
## [145]  458+  356+  353   163    31   340   229   444+  315+  182   156   329 
## [157]  364+  291   179   376+  384+  268   292+  142   413+  266+  194   320 
## [169]  181   285   301+  348   197   382+  303+  296+  180   186   145   269+
## [181]  300+  284+  350   272+  292+  332+  285   259+  110   286   270    81 
## [193]  131   225+  269   225+  243+  279+  276+  135 
##  [ reached getOption("max.print") -- omitted 28 entries ]

4.2 survfit()

  • Let us fit a survival curve with the survfit() function. See the help for ?survfit.

4.2.1 Example

  • Create a simple survival curve that does not consider any different groupings, so we will specify just an intercept (e.g., ~1) in the formula that survfit expects.

  • We can also do this in one step by nesting the Surv() call within the survfit() call, we will use the data= argument to specify which data we are using.

sfit = survfit(s~1)
# or we can also use the following
sfit = survfit(Surv(time, status)~1, data=lung)
# sfit

Run summary on the object sfit. This will show a life table.

summary(sfit)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     5    228       1   0.9956 0.00438       0.9871        1.000
##    11    227       3   0.9825 0.00869       0.9656        1.000
##    12    224       1   0.9781 0.00970       0.9592        0.997
##    13    223       2   0.9693 0.01142       0.9472        0.992
##    15    221       1   0.9649 0.01219       0.9413        0.989
##    26    220       1   0.9605 0.01290       0.9356        0.986
##    30    219       1   0.9561 0.01356       0.9299        0.983
##    31    218       1   0.9518 0.01419       0.9243        0.980
##    53    217       2   0.9430 0.01536       0.9134        0.974
##    54    215       1   0.9386 0.01590       0.9079        0.970
##    59    214       1   0.9342 0.01642       0.9026        0.967
##    60    213       2   0.9254 0.01740       0.8920        0.960
##    61    211       1   0.9211 0.01786       0.8867        0.957
##    62    210       1   0.9167 0.01830       0.8815        0.953
##    65    209       2   0.9079 0.01915       0.8711        0.946
##    71    207       1   0.9035 0.01955       0.8660        0.943
##    79    206       1   0.8991 0.01995       0.8609        0.939
##    81    205       2   0.8904 0.02069       0.8507        0.932
##    88    203       2   0.8816 0.02140       0.8406        0.925
##    92    201       1   0.8772 0.02174       0.8356        0.921
##    93    199       1   0.8728 0.02207       0.8306        0.917
##    95    198       2   0.8640 0.02271       0.8206        0.910
##   105    196       1   0.8596 0.02302       0.8156        0.906
##   107    194       2   0.8507 0.02362       0.8056        0.898
##   110    192       1   0.8463 0.02391       0.8007        0.894
##   116    191       1   0.8418 0.02419       0.7957        0.891
##   118    190       1   0.8374 0.02446       0.7908        0.887
##   122    189       1   0.8330 0.02473       0.7859        0.883
##  [ reached getOption("max.print") -- omitted 111 rows ]

These tables show a row for each time point where either the event occurred or a sample was censored. It shows the number at risk (number still remaining), and the cumulative survival at that instant.

4.2.2 Example

Let us now fit survival curves separately by sex.

sfit = survfit(Surv(time, status)~sex, data=lung)
sfit
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
## 
##         n events median 0.95LCL 0.95UCL
## sex=1 138    112    270     212     310
## sex=2  90     53    426     348     550
summary(sfit)
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
## 
##                 sex=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    11    138       3   0.9783  0.0124       0.9542        1.000
##    12    135       1   0.9710  0.0143       0.9434        0.999
##    13    134       2   0.9565  0.0174       0.9231        0.991
##    15    132       1   0.9493  0.0187       0.9134        0.987
##    26    131       1   0.9420  0.0199       0.9038        0.982
##    30    130       1   0.9348  0.0210       0.8945        0.977
##    31    129       1   0.9275  0.0221       0.8853        0.972
##    53    128       2   0.9130  0.0240       0.8672        0.961
##    54    126       1   0.9058  0.0249       0.8583        0.956
##    59    125       1   0.8986  0.0257       0.8496        0.950
##    60    124       1   0.8913  0.0265       0.8409        0.945
##    65    123       2   0.8768  0.0280       0.8237        0.933
##    71    121       1   0.8696  0.0287       0.8152        0.928
##    81    120       1   0.8623  0.0293       0.8067        0.922
##    88    119       2   0.8478  0.0306       0.7900        0.910
##    92    117       1   0.8406  0.0312       0.7817        0.904
##    93    116       1   0.8333  0.0317       0.7734        0.898
##    95    115       1   0.8261  0.0323       0.7652        0.892
##   105    114       1   0.8188  0.0328       0.7570        0.886
##   107    113       1   0.8116  0.0333       0.7489        0.880
##   110    112       1   0.8043  0.0338       0.7408        0.873
##   116    111       1   0.7971  0.0342       0.7328        0.867
##   118    110       1   0.7899  0.0347       0.7247        0.861
##   131    109       1   0.7826  0.0351       0.7167        0.855
##   132    108       2   0.7681  0.0359       0.7008        0.842
##   135    106       1   0.7609  0.0363       0.6929        0.835
##   142    105       1   0.7536  0.0367       0.6851        0.829
##   144    104       1   0.7464  0.0370       0.6772        0.823
##  [ reached getOption("max.print") -- omitted 71 rows ]
## 
##                 sex=2 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     5     90       1   0.9889  0.0110       0.9675        1.000
##    60     89       1   0.9778  0.0155       0.9478        1.000
##    61     88       1   0.9667  0.0189       0.9303        1.000
##    62     87       1   0.9556  0.0217       0.9139        0.999
##    79     86       1   0.9444  0.0241       0.8983        0.993
##    81     85       1   0.9333  0.0263       0.8832        0.986
##    95     83       1   0.9221  0.0283       0.8683        0.979
##   107     81       1   0.9107  0.0301       0.8535        0.972
##   122     80       1   0.8993  0.0318       0.8390        0.964
##   145     79       2   0.8766  0.0349       0.8108        0.948
##   153     77       1   0.8652  0.0362       0.7970        0.939
##   166     76       1   0.8538  0.0375       0.7834        0.931
##   167     75       1   0.8424  0.0387       0.7699        0.922
##   182     71       1   0.8305  0.0399       0.7559        0.913
##   186     70       1   0.8187  0.0411       0.7420        0.903
##   194     68       1   0.8066  0.0422       0.7280        0.894
##   199     67       1   0.7946  0.0432       0.7142        0.884
##   201     66       2   0.7705  0.0452       0.6869        0.864
##   208     62       1   0.7581  0.0461       0.6729        0.854
##   226     59       1   0.7452  0.0471       0.6584        0.843
##   239     57       1   0.7322  0.0480       0.6438        0.833
##   245     54       1   0.7186  0.0490       0.6287        0.821
##   268     51       1   0.7045  0.0501       0.6129        0.810
##   285     47       1   0.6895  0.0512       0.5962        0.798
##   293     45       1   0.6742  0.0523       0.5791        0.785
##   305     43       1   0.6585  0.0534       0.5618        0.772
##   310     42       1   0.6428  0.0544       0.5447        0.759
##   340     39       1   0.6264  0.0554       0.5267        0.745
##  [ reached getOption("max.print") -- omitted 23 rows ]

Check out the help for ?summary.survfit. You can give the summary() function an option for what times you want to show in the results.

?summary.survfit
  • Look at the range of followup times in the lung data set with range(). You can create a sequence of numbers going from one number to another number by increments of yet another number with the seq() function.
range(lung$time)
## [1]    5 1022
seq(0, 1100, by=100)
##  [1]    0  100  200  300  400  500  600  700  800  900 1000 1100
  • We use that sequence vector with a summary call on sfit to get life tables at those intervals separately for both males (1) and females (2).
summary(sfit, times=seq(0, 1000, 100))
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
## 
##                 sex=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0    138       0   1.0000  0.0000       1.0000        1.000
##   100    114      24   0.8261  0.0323       0.7652        0.892
##   200     78      30   0.6073  0.0417       0.5309        0.695
##   300     49      20   0.4411  0.0439       0.3629        0.536
##   400     31      15   0.2977  0.0425       0.2250        0.394
##   500     20       7   0.2232  0.0402       0.1569        0.318
##   600     13       7   0.1451  0.0353       0.0900        0.234
##   700      8       5   0.0893  0.0293       0.0470        0.170
##   800      6       2   0.0670  0.0259       0.0314        0.143
##   900      2       2   0.0357  0.0216       0.0109        0.117
##  1000      2       0   0.0357  0.0216       0.0109        0.117
## 
##                 sex=2 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0     90       0   1.0000  0.0000       1.0000        1.000
##   100     82       7   0.9221  0.0283       0.8683        0.979
##   200     66      11   0.7946  0.0432       0.7142        0.884
##   300     43       9   0.6742  0.0523       0.5791        0.785
##   400     26      10   0.5089  0.0603       0.4035        0.642
##   500     21       5   0.4110  0.0626       0.3050        0.554
##   600     11       3   0.3433  0.0634       0.2390        0.493
##   700      8       3   0.2496  0.0652       0.1496        0.417
##   800      2       5   0.0832  0.0499       0.0257        0.270
##   900      1       0   0.0832  0.0499       0.0257        0.270

From these tables we can start to see that males tend to have worse survival than females.


5 Kaplan-Meier Plots

  • Kaplan-Meier plot is used to visualize the survival curve fit to the data.

  • To do this we can use the plot() function on the fitted survival curve sfit.

The plot() function can be used to modify the plot as necessary. See ?plot.survfit.

sfit = survfit(Surv(time, status)~sex, data=lung)
plot(sfit)

5.1 ggsurvplot()

  • However, we will focus on the package called survminer that provides a function called ggsurvplot().

    • It is much easier to produce publication-ready survival plots, and if you are familiar with ggplot2 syntax it is pretty easy to modify.
library(survminer)
ggsurvplot(sfit)

  • This plot is substantially more informative than the one created by plot(). Some cool things include

    • automatic color codes the different groups,

    • adding axis labels, and

    • creating automatic legend, and lot more.

  • Let us also

    • add confidence intervals,

    • show the p-value for the log-rank test,

    • show a risk table below the plot, and

    • change the colors and the group labels.

5.1.1 Example

ggsurvplot(sfit,
           conf.int=TRUE, # add confidence intervals
           pval=TRUE, # show the p-value for the log-rank test
           risk.table=TRUE, # show a risk table below the plot
           legend.labs=c("Male", "Female"), # change group labels
           legend.title="Sex",  # add legend title
           palette=c("dodgerblue4", "orchid2"), # change colors of the groups
           title="Kaplan-Meier Curve for Lung Cancer Survival", # add title to plot
           risk.table.height=.2)


6 Cox Regression

  • Kaplan-Meier curves are good for visualizing differences in survival between two categorical groups. The p-value for the log-rank test provided when pval=TRUE is also useful for identifying any differences in survival between different groups.

  • However, this does not generalize well for assessing the effect of quantitative variables.

  • Just try creating a K-M plot for the nodes variable, which has values that range from \(0-33\).

    • What a mess! Don’t do this.
ggsurvplot(survfit(Surv(time, status)~inst, data=lung))

  • At some point using a categorical grouping for K-M plots breaks down, more so when we might want to assess how multiple variables work together to influence survival.

  • For example, we might want to simultaneously examine the effect of race and socioeconomic status (so as to adjust for factors like income, access to care, etc.) before concluding that ethnicity influences some outcome.

6.1 Model Fit

  • Cox PH regression can assess the effect of both categorical and continuous variables, and can model the effect of multiple variables at once.

  • The coxph() function uses the same syntax as lm(), glm(), etc.

6.1.1 Example

Let us run a Cox regression on sex.

fit = coxph(Surv(time, status)~sex, data=lung)
fit
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
## 
##        coef exp(coef) se(coef)      z       p
## sex -0.5310    0.5880   0.1672 -3.176 0.00149
## 
## Likelihood ratio test=10.63  on 1 df, p=0.001111
## n= 228, number of events= 165
  • The exp(coef) column contains \(e^{\beta_1}\) (see background section above for more info). This is the hazard ratio – the multiplicative effect of that variable on the hazard rate (for each unit increase in that variable).

  • For a categorical variable like sex, going from male (baseline) to female results in approximately ~\(40\%\) reduction in hazard.

  • We could also flip the sign on the coef column, and take exp(0.531), which can be interpreted as:

    • Males have a \(1.7\)-fold increase in hazard, or that

    • Males die at approximately \(1.7\times\) the rate per unit time as females (females die at \(0.588\times\) the rate per unit time as males).

  • Remember:

    • \(HR=1\): No effect

    • \(HR>1\): Increase in hazard

    • \(HR<1\): Decrease in hazard

  • There is a p-value on the sex term, and a p-value on the overall model.

    • That \(0.00111\) p-value is really close to the \(p=0.00131\) p-value we saw on the Kaplan-Meier plot.

    • That is because the KM plot is showing the log-rank test p-value.

    • We can get this out of the Cox model with a call to summary(fit).

    • You can directly calculate the log-rank test p-value using survdiff().

summary(fit)
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
## 
##   n= 228, number of events= 165 
## 
##        coef exp(coef) se(coef)      z Pr(>|z|)   
## sex -0.5310    0.5880   0.1672 -3.176  0.00149 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## sex     0.588      1.701    0.4237     0.816
## 
## Concordance= 0.579  (se = 0.021 )
## Likelihood ratio test= 10.63  on 1 df,   p=0.001
## Wald test            = 10.09  on 1 df,   p=0.001
## Score (logrank) test = 10.33  on 1 df,   p=0.001
survdiff(Surv(time, status)~sex, data=lung)
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = lung)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 138      112     91.6      4.55      10.3
## sex=2  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001

6.1.2 Example

Let us create another model where we analyze all the variables in the data set.

  • This shows us how all the variables, when considered together, act to influence survival.

  • Some are very strong predictors (sex, ECOG score).

  • Interestingly, the Karnofsky performance score as rated by the physician ph.karno was marginally significant, while the same score as rated by the patient pat.karno was not.

fit = coxph(Surv(time, status)~sex+age+ph.ecog+ph.karno+pat.karno+meal.cal+wt.loss, data=lung)
fit
## Call:
## coxph(formula = Surv(time, status) ~ sex + age + ph.ecog + ph.karno + 
##     pat.karno + meal.cal + wt.loss, data = lung)
## 
##                 coef  exp(coef)   se(coef)      z       p
## sex       -5.509e-01  5.765e-01  2.008e-01 -2.743 0.00609
## age        1.065e-02  1.011e+00  1.161e-02  0.917 0.35906
## ph.ecog    7.342e-01  2.084e+00  2.233e-01  3.288 0.00101
## ph.karno   2.246e-02  1.023e+00  1.124e-02  1.998 0.04574
## pat.karno -1.242e-02  9.877e-01  8.054e-03 -1.542 0.12316
## meal.cal   3.329e-05  1.000e+00  2.595e-04  0.128 0.89791
## wt.loss   -1.433e-02  9.858e-01  7.771e-03 -1.844 0.06518
## 
## Likelihood ratio test=28.33  on 7 df, p=0.0001918
## n= 168, number of events= 121 
##    (60 observations deleted due to missingness)

6.2 Diagnostics

The Cox proportional hazards model makes several assumptions. Thus, it is important to assess whether a fitted Cox regression model adequately describes the data.

Here, we’ll discuss three types of diagnostics for the Cox model:

  • Detecting non-linearity in relationship between the log-hazard and the covariates.

  • Examining influential observations (or outliers).

  • Testing the proportional hazards assumption.

In order to check these model assumptions, residuals method are used. The common residuals for the Cox model include:

  • Martingale residual to assess non-linearity \[r_{Mi} = \delta_i - \hat{H}_0(t_i)e^{\mathbf{x}_i^\top \hat{\beta}},\] where \(\hat{H}_0(t_i)\) is the estimate of the baseline cumulative hazard at \(t_i\), and \(\delta_i\) is the event indicator for subject \(i\).

  • Deviance residual (symmetric transformation of the Martingale residuals), to examine influential observations. \[r_{Di} = sign(r_{Mi})[-2(r_{Mi}+\delta_i\log(\delta_i-r_{Mi}))]^{1/2}.\]

  • Schoenfeld residuals to check the proportional hazards assumption.

    • For any subject \(i \in D(t_k)\) (the set of \(d_k\) failures at time \(t_k\)) is the difference between the covariate for that subject and the weighted average of covariates in the risk set, namely \(x_i - \bar{x}(\beta, t_k)\).

    • The sum of the Schoenfeld residuals over all \(d_k\) subjects who fail at \(t_k\), also known as the Schoenfeld residual corresponding to \(t_k\), is \[r_{Sk} = \sum\limits_{i \in R(t_k)}\delta_{ik}[x_i - \bar{x}(\beta, t_k)],\] where \(\delta_{ik} = 1\) if subject \(i\) fails at \(t_k\) and zero otherwise; and \(R(t_k)\) is the risk set at \(t_k\).

6.2.1 Example

Let us construct a proportional hazards model with the variables sex, age and wt.loss

fit = coxph(Surv(time, status)~sex+age+wt.loss, data=lung)
fit
## Call:
## coxph(formula = Surv(time, status) ~ sex + age + wt.loss, data = lung)
## 
##               coef  exp(coef)   se(coef)      z      p
## sex     -0.5210319  0.5939074  0.1743541 -2.988 0.0028
## age      0.0200882  1.0202913  0.0096644  2.079 0.0377
## wt.loss  0.0007596  1.0007599  0.0061934  0.123 0.9024
## 
## Likelihood ratio test=14.67  on 3 df, p=0.002122
## n= 214, number of events= 152 
##    (14 observations deleted due to missingness)

6.2.2 Testing Proportional Hazards Assumption

  • The proportional hazards (PH) assumption can be checked using statistical tests and graphical diagnostics based on the scaled Schoenfeld residuals.

  • In principle, the Schoenfeld residuals are independent of time. A plot that shows a non-random pattern against time is evidence of violation of the PH assumption.

  • The function cox.zph() (for help ?cox.zph) provides a convenient solution to test the proportional hazards assumption for each covariate included in a Cox regression model fit.

    • For each covariate, the function cox.zph() correlates the corresponding set of scaled Schoenfeld residuals with time, to test for independence between residuals and time.

    • Additionally, it performs a global test for the model as a whole.

    • The proportional hazard assumption is supported by a non-significant relationship between residuals and time, and refuted by a significant relationship.

test.ph = cox.zph(fit)
test.ph
##             rho chisq     p
## sex      0.1265 2.349 0.125
## age     -0.0483 0.378 0.538
## wt.loss  0.0126 0.024 0.877
## GLOBAL       NA 2.846 0.416

From the output above, the test is not statistically significant for each of the covariates, and the global test is also not statistically significant. Therefore, our proportional hazards assumption is reasonable.

  • We can also do a graphical diagnostic using the function ggcoxzph() (for help ?ggcoxzph), which produces, for each covariate, graphs of the scaled Schoenfeld residuals against the transformed time.
ggcoxzph(test.ph)

In the figure above, the solid line is a smoothing spline fit to the plot, with the dashed lines representing a \(\pm 2\sigma\) band around the fit.

  • Systematic departures from a horizontal line are indicative of non-proportional hazards, since proportional hazards assumes that estimates \(\beta_1,\beta_2,\beta_3\) do not vary much over time.

  • From the graphical inspection, there is no pattern with time.

  • The assumption of proportional hazards appears to be supported for the covariates age, wt.loss and sex (which is a two-level factor, accounting for the two bands in the graph).

  • Violations of proportional hazards assumption can be resolved by:

    • Adding \(covariate \times time\) interaction.

    • Stratification (useful for nuisance confounders).

6.2.3 Testing Influential Observations

  • To test influential observations or outliers, we can visualize either:

    • the deviance residuals or

    • the dfbeta values

  • ggcoxdiagnostics() (for help ?ggcoxdiagnostics) provides a convenient solution for checking influential observations.

General usage is as follows

ggcoxdiagnostics(fit, type = ???, linear.predictions = TRUE)
  • fit: an object of class coxph.object

  • type: the type of residuals to present on Y axis. Allowed values include one of c(“martingale”, “deviance”, “score”, “schoenfeld”, “dfbeta”, “dfbetas”, “scaledsch”, “partial”).

    • Specifying the argument type = “dfbeta”, plots the estimated changes in the regression coefficients upon deleting each observation in turn; likewise,

    • type=“dfbetas” produces the estimated changes in the coefficients divided by their standard errors.

  • linear.predictions: a logical value indicating whether to show linear predictions for observations (TRUE) or just indexes of observations (FALSE) on X axis.

6.2.3.1 Example: "dfbeta"

ggcoxdiagnostics(fit, type = "dfbeta",
                 linear.predictions = FALSE, ggtheme = theme_bw())

Index plots of dfbeta for the Cox regression of time to death on age, sex and wt.loss

  • The above index plots show that comparing the magnitudes of the largest dfbeta values to the regression coefficients suggests that none of the observations is terribly influential individually, even though some of the dfbeta values for age and wt.loss are large compared with the others.

6.2.3.2 Example: "deviance"

It is also possible to check outliers by visualizing the deviance residuals.

  • The deviance residual is a normalized transform of the martingale residual.

  • These residuals should be roughly symmetrically distributed about zero with a standard deviation of \(1\).

    • Positive values correspond to individuals that “died too soon” compared to expected survival times.

    • Negative values correspond to individual that “lived too long”.

    • Very large or small values are outliers, which are poorly predicted by the model.

ggcoxdiagnostics(fit, type = "deviance",
                 linear.predictions = FALSE, ggtheme = theme_bw())

6.2.4 Testing Non-Linearity

  • Plotting the Martingale residuals against continuous covariates is a common approach used to detect non-linearity or, in other words, to assess the functional form of a covariate.

    • For a given continuous covariate, patterns in the plot may suggest that the variable is not properly fit.
  • Non-linearity is not an issue for categorical variables, so we only examine plots of martingale residuals and partial residuals against a continuous variable.

  • Martingale residuals may present any value in the range \((-\infty, 1)\):

    • A value of martingale residuals near 1 represents individuals that “died too soon”, and

    • large negative values correspond to individuals that “lived too long”.

  • To assess the functional form of a continuous variable in a Cox PH model, we will use the function ggcoxfunctional() (for help ?ggcoxfunctional).

  • ggcoxfunctional() displays graphs of continuous covariates against martingale residuals of null Cox PH model.

    • This might help to properly choose the functional form of continuous variable in the Cox PH model.

    • Fitted lines with loess function should be linear to satisfy the Cox PH model assumptions.

ggcoxfunctional(Surv(time, status) ~ age + log(age) + sqrt(age), data = lung)


7 Paramteric Surival Models

  • In biomedical applications, non-parametric (e.g. the product-limit survival curve estimator) and semi-parametric (e.g. the Cox proportional hazards model) methods play the most important role, since they have the flexibility to accommodate a wide range of hazard function forms.

  • Parametric methods may be appropriate when survival data can be shown to approximately follow a particular parametric form.

  • Parametric models are often much easier to work with than the partial-likelihood-based models (e.g. Cox PH model), since the former are defined by a small and fixed number of unknown parameters.

    • This allows us to use standard likelihood theory for parameter estimation and inference.
  • Furthermore, accommodating complex censoring and truncation patterns is much more direct with parametric models than with partial likelihood models.

  • Of course, the validity of these techniques depends heavily on the appropriateness of the particular parametric model being used.

  • Some of the distributions that could potentially serve as survival distribution models are the exponential, Weibull, and gamma distributions.

  • For more details see Paramteric Survival Models9 which examines a range of parametric survival distributions, their specifications in R, and the hazard shapes they support.



  1. This notes is adapted from Cox Model Assumptions Tools, Proportional Hazards Regression Diagnostics, and Survival Analysis with R.

  2. In the medical world, we typically think of survival analysis literally – tracking time until death. But, it is more general than that – survival analysis models time until an event occurs (any event). This might be death of a biological organism. But it could also be the time until a hardware failure in a mechanical system, time until recovery, time someone remains unemployed after losing a job, time until a ripe tomato is eaten by a grazing deer, time until someone falls asleep in a workshop, etc. Survival analysis also goes by reliability theory in engineering, duration analysis in economics, and event history analysis in sociology.

  3. This describes the most common type of censoring – right censoring. Left censoring less commonly occurs when the “start” is unknown, such as when an initial diagnosis or exposure time is unknown.

  4. And, following the definitions above, proportional hazards assumption says that the cumulative hazard ratio between two groups remains constant over time.

  5. And there is a chi-square-like statistical test for these differences called the log-rank test that compare the survival functions of different categories.

  6. Surv() can also take start and stop times, to account for left censoring. See the help for ?Surv.

  7. Cox regression and the log-rank test from survdiff are going to give you similar results most of the time. The log-rank test is asking if survival curves differ significantly between two groups. Cox regression is asking which of many categorical or continuous variables significantly affect survival.

  8. Loprinzi et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.

  9. Devin Incerti, Paramteric Survival Models: https://devinincerti.com/2019/06/18/parametric_survival.html