In this lecture we will cover some more complex models for survival data, such as regularized Cox PH regression and random survival forests.

Note: This lecture draws on several resources; for details, see the end-notes1.


1 Introduction

  • Consider the usual survival analysis framework. We have data of the form \[(y_1, x_1, \delta_1), \ldots, (y_n, x_n, \delta_n),\] where \(y_i\), the observed time, is a time of failure if \(\delta_i\) is \(1\) or right-censoring if \(\delta_i\) is \(0\).

  • As in regression, \(x_i\) is a vector of potential predictors \((x_{i1}, x_{i2}, \ldots,x_{ip})\). We further let \(t_1 < t_2 < \ldots < t_m\) be the increasing list of unique failure times, and \(j(i)\) denote the index of the observation failing at time \(t_i\).

  • One potential problem of interest is to study the relationship between predictor variables and survival time.

  • This can be addressed with the Cox proportional hazards model, which assumes a semi-parametric form for the hazard \[h_i(t) = h_0(t) e^{x_i^\top \beta},\] where \(h_i(t)\) is the hazard for patient \(i\) at time \(t\), \(h_0(t)\) is a shared baseline hazard, and \(\beta\) is a fixed vector of length \(p\).

  • Inference is then made via the partial likelihood \[L(\beta) = \prod\limits_{i=1}^m \frac{e^{x_{j(i)}^\top\beta}}{\sum_{j \in R_i}e^{x_j^\top \beta}},\] where \(R_i\) is the set of indices, \(j\), with \(y_j \geq t_i\) (those at risk at time \(t_i\)).

  • For classical problems, with many more observations than predictors, the Cox model performs well. However, problems with \(p > n\) lead to degenerate behavior.

Note: For ease of notation, the above formula assumes that the failure times are unique; however, it can be suitably altered to handle ties.
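
To make the partial likelihood concrete, here is a minimal base-R sketch that evaluates the log partial likelihood for a fixed \(\beta\) on a small toy data set (all variable names and values are made up for illustration):

# Toy evaluation of the Cox log partial likelihood for a fixed beta
set.seed(1)
n = 6; p = 2
x = matrix(rnorm(n * p), n, p)  # covariates
y = c(2, 5, 3, 8, 1, 4)         # observed times (assumed unique)
delta = c(1, 0, 1, 1, 0, 1)     # 1 = failure, 0 = right-censored
beta = c(0.5, -0.3)

eta = drop(x %*% beta)          # linear predictors x_i' beta
fail = which(delta == 1)        # indices of observed failures
logL = sum(sapply(fail, function(i) {
  risk = which(y >= y[i])       # risk set R_i: still at risk at time y[i]
  eta[i] - log(sum(exp(eta[risk])))
}))
logL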

1.1 Regularization

  • We wish to find \(\beta\) which maximizes \(L(\beta)\) subject to the elastic net constraint \[P_\alpha (\beta) := \alpha\sum\limits_{k=1}^p |\beta_k| + (1 - \alpha)\sum\limits_{k=1}^p \beta_k^2 \leq c.\]

  • Maximizing the partial likelihood is equivalent to maximizing a scaled log partial likelihood \[\frac{2}{n}l(\beta) = \frac{2}{n} \sum\limits_{i=1}^m \Bigg[ x_{j(i)}^\top\beta - \log\Big(\sum_{j \in R_i}e^{x_j^\top \beta}\Big) \Bigg].\]

  • If we consider the Lagrangian formulation, our problem becomes \[\hat{\beta} = \underset{\beta}{\mathrm{argmax}} \Bigg\{ \frac{2}{n} \sum\limits_{i=1}^m \Bigg[ x_{j(i)}^\top\beta - \log\Big(\sum_{j \in R_i}e^{x_j^\top \beta}\Big) \Bigg] - \lambda P_\alpha(\beta) \Bigg\},\] where \(P_\alpha (\beta)\) is the elastic net penalty.

  • It is a mixture of the \(\mathbb{L}^1\) (LASSO) and \(\mathbb{L}^2\) (ridge regression) penalties.

  • Park and Hastie2 applied this to the Cox model and proposed an efficient algorithm to solve this problem along a path of \(c\) and \(\alpha\) values. However, here we cover the approach proposed by Simon et al.3

  • References (if interested): Simon et al.4 and Tibshirani et al.5.
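
As a small illustration of the penalty itself, the following sketch evaluates \(P_\alpha(\beta)\) exactly as defined above for a toy coefficient vector (the function name is chosen only for this example):

# Elastic net penalty P_alpha(beta) as defined above:
# a mixture of the L1 (lasso) and L2 (ridge) penalties
P.alpha = function(beta, alpha) {
  alpha * sum(abs(beta)) + (1 - alpha) * sum(beta^2)
}

beta = c(0.8, 0, -0.3, 0.1)
P.alpha(beta, alpha = 1)    # pure lasso (L1) penalty
P.alpha(beta, alpha = 0)    # pure ridge (L2) penalty
P.alpha(beta, alpha = 0.5)  # equal mixture of the two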

1.2 Random Forest

A Random Forest6 is grown by bagging7 a collection of classification and regression trees (CART)8.

  • The method uses a set of \(B\) bootstrap9 samples and grows an independent tree model on each bootstrap sample.
  • Each tree is grown by recursively partitioning the population based on optimization of a split rule over the \(p\)-dimensional covariate space.

  • At each split, a subset of \(m \leq p\) candidate variables are tested for the split rule optimization, dividing each node into two daughter nodes.

  • Each daughter node is then split again until the process reaches the stopping criteria of either node purity or node member size, which defines the set of terminal (unsplit) nodes for the tree.

  • In regression trees, node impurity is measured by mean squared error, whereas in classification problems, the Gini index is used10.

  • Random forest sorts each training set observation into one unique terminal node per tree.

  • Within each tree, the estimate for an observation is constructed from the members of the terminal node into which that observation falls.

  • The Random Forest estimate for each observation is then calculated by aggregating the terminal node results across the collection of \(B\) trees: averaging for regression, or majority vote for classification. A small bagging sketch is given below.
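
As a rough illustration of bagging, the sketch below grows a handful of CART trees on bootstrap samples with the rpart package and averages their predictions. This is a simplified sketch of the idea only: a true random forest would also sample \(m \leq p\) candidate variables at each split.

library(rpart) # CART implementation

set.seed(1)
n = 200
df = data.frame(x1 = runif(n), x2 = runif(n))
df$y = sin(2 * pi * df$x1) + df$x2 + rnorm(n, sd = 0.3)

B = 25
preds = sapply(seq_len(B), function(b) {
  idx = sample(n, replace = TRUE)              # bootstrap sample
  tree = rpart(y ~ x1 + x2, data = df[idx, ])  # grow a CART tree on the sample
  predict(tree, newdata = df)                  # predictions on the full data
})
bagged = rowMeans(preds)                       # aggregate (average) over the B trees
head(bagged)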


2 Regularized Cox Regression

  • Here we fit the Cox model regularized by an elastic net penalty.

  • It is used for under-determined (or nearly under-determined) systems and selects a small number of covariates to include in the model.

  • Because the Cox model is rarely used for actual prediction, we focus instead on finding and interpreting an appropriate model.

  • We give a simple example of how to format data and run the Cox model in glmnet with cross-validation.

2.1 Example

  • Load the data and set up the response. In this case

    • \(X\) must be an \(n \times p\)-matrix of covariate values - each row corresponds to a patient and each column a covariate,

    • \(y\) is an \(n\) length vector of failure/censoring times, and

    • status is an \(n\) length vector with each entry, \(1\) or \(0\), indicating whether the corresponding entry in \(y\) is indicative of a failure time or right censoring time (1 for failure, 0 for censoring).

  • The core package we use is glmnet, which provides functions for lasso and elastic-net regularized generalized linear models.

# CRAN Package needed
install.packages('glmnet') # install package for regularization in GLMs
  • We will also want to load the survival package and set the working directory.
# CRAN Packages needed
library(glmnet) # package for regularization in GLMs
library(survival)  # core survival analysis functions
setwd("C:/Users/username/Documents/CBSA") # setup working directory
  • We now load the patient data coxnet.RDS.
# should be stored in the working directory
data = readRDS("coxnet.RDS")
str(data)
## List of 3
##  $ x     : num [1:100, 1:1000] -0.221 -0.195 -0.318 0.427 0.512 ...
##  $ time  : num [1:100] 5 5.9 6.6 13.1 1.6 1.3 1.4 2.2 3.4 2.2 ...
##  $ status: num [1:100] 0 0 0 0 1 1 1 1 1 1 ...
  • We then call our functions to fit with the lasso penalty \((\alpha = 1)\), and cross validate.

    • We set maxit = 1000 (increasing the maximum number of iterations to 1000) because our data is relatively high dimensional, so more iterations are needed for convergence.

    • In practice, the function will spit out an error if convergence is not reached by the maximum number of iterations.

    • The Surv function packages the survival data into the form expected by glmnet.

cv.fit = cv.glmnet(data$x, # X matrix
                   Surv(data$time,data$status), # create survival object from the data
                   alpha = 1, # lasso: alpha = 1; ridge: alpha=0
                   family = "cox", # specify Cox PH model
                   maxit = 1000)
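
Alongside the cross-validated fit, it can be useful to inspect the full coefficient path from a plain glmnet() call. The sketch below uses the same data and response as above; note that, depending on the glmnet version, the Cox family may require the response as a two-column matrix with columns time and status rather than a Surv object.

fit = glmnet(data$x,
             Surv(data$time, data$status), # survival response, as above
             alpha = 1, # lasso penalty
             family = "cox",
             maxit = 1000)
plot(fit, xvar = "lambda", label = TRUE) # coefficient paths against log(lambda)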

The following cross-validated error plot helps evaluate our model; the \(\lambda\) values are shown on a \(\log\) scale.

plot(cv.fit)

The optimal value of \(\lambda\) which minimizes the cross-validation error is given by

log(cv.fit$lambda.min) # in log-scale
## [1] -1.557179
cv.fit$lambda.min
## [1] 0.2107298
  • The left vertical line in our plot shows us where the cross-validation error curve hits its minimum.

  • The right vertical line shows us the most regularized model with cross-validation error within one standard error of the minimum.

  • In this case, we see that the minimum was achieved by a fairly regularized model; however, the right line indicates that the null model (no covariates included) is within one standard error of the minimum.

  • This might lead us to believe that the covariates are not actually explaining much of the variability. For the time being, we will concern ourselves with the minimum CV-error model.

  • We can check which covariates our model chose as active, and see the coefficients of those covariates.

est.coef = coef(cv.fit, s = cv.fit$lambda.min) # returns the p-length coefficient vector
                                               # of the solution corresponding to lambda.min
active.k = which(est.coef != 0)
active.k
## [1]  80 394
active.k.vals = est.coef[active.k]
active.k.vals
## [1] 0.4012294 0.1151167
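
Following up on the one-standard-error discussion above, we can also inspect the coefficients at lambda.1se; if the 1-SE model is indeed the null model, no covariates should be active.

# coefficients of the most regularized model within one standard error of the minimum
est.coef.1se = coef(cv.fit, s = cv.fit$lambda.1se)
sum(est.coef.1se != 0) # number of active covariates; 0 would confirm the null model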

3 Random Survival Forest

  • Random forest11 (RF) is a non-parametric statistical method which requires no distributional assumptions on covariate relation to the response.

    • RF is a robust, nonlinear technique that optimizes predictive accuracy by fitting an ensemble of trees to stabilize model estimates.
  • Random Survival Forest (RSF)12 is an extension of RF techniques to survival settings, allowing efficient non-parametric analysis of time to event data.

    • A forest of survival trees is grown using a log-rank splitting rule to select the optimal candidate variables.

    • Survival estimates for each observation are constructed with a Kaplan–Meier (KM) estimator within each terminal node, at each event time.

    • Random Survival Forests adaptively discover nonlinear effects and interactions and are fully non-parametric.

    • Averaging over many trees enables RSF to approximate complex survival functions, including non-proportional hazards, while maintaining low prediction error.

  • In this section, we will mostly discuss the implementation of random survival forests and how to generate plots for inference.

Let us first install

  • randomForestSRC package, which provides the implementation of random survival forests, and

  • ggRandomForests package, which provides plotting utilities for random forest objects

# CRAN Package needed
install.packages('randomForestSRC') # install package for random forest
install.packages('ggRandomForests') # install package for plotting results
install.packages('ggplot2') # install package for plotting results

Load the package

# CRAN Packages needed
library(randomForestSRC) # package for random forest
library(ggRandomForests) # package for plotting random forest results
library(ggplot2) # general plotting

3.1 Example: PBC Data

  • The primary biliary cirrhosis of the liver (PBC) study consists of \(424\) PBC patients referred to the Mayo Clinic between 1974 and 1984 who met eligibility criteria for a randomized placebo-controlled trial of the drug D-penicillamine (DPCA). For further details see: Fleming and Harrington 199113.
  • The PBC data set, included in the randomForestSRC package, contains \(418\) observations, of which \(312\) patients participated in the randomized trial.

Load the PBC data:

data(pbc)
head(pbc)
##   days status treatment   age sex ascites hepatom spiders edema bili chol
## 1  400      1         1 21464   1       1       1       1   1.0 14.5  261
## 2 4500      0         1 20617   1       0       1       1   0.0  1.1  302
## 3 1012      1         1 25594   0       0       0       0   0.5  1.4  176
## 4 1925      1         1 19994   1       0       1       1   0.5  1.8  244
## 5 1504      0         2 13918   1       0       1       1   0.0  3.4  279
## 6 2503      1         2 24201   1       0       1       0   0.0  0.8  248
##   albumin copper    alk   sgot trig platelet prothrombin stage
## 1    2.60    156 1718.0 137.95  172      190        12.2     4
## 2    4.14     54 7394.8 113.52   88      221        10.6     3
## 3    3.48    210  516.0  96.10   55      151        12.0     4
## 4    2.54     64 6121.8  60.63   92      183        10.3     4
## 5    3.53    143  671.0 113.15   72      136        10.9     3
## 6    3.98     50  944.0  93.00   63       NA        11.0     3
## 
## Variable name | Description                    | Type
## days          | Time (days)                    | numeric
## status        | Event (F = censor, T = death)  | logical
## treatment     | Treatment (DPCA, Placebo)       | factor
## age           | Age (years)                    | numeric
## sex           | Female = T                     | logical
## ascites       | Presence of Ascites            | logical
## hepatom       | Presence of Hepatomegaly       | logical
## spiders       | Presence of Spiders            | logical
## edema         | Edema (0, 0.5, 1)              | factor
## bili          | Serum Bilirubin (mg/dl)        | numeric
## chol          | Serum Cholesterol (mg/dl)      | integer
## albumin       | Albumin (gm/dl)                | numeric
## copper        | Urine Copper (ug/day)          | integer
## alk           | Alkaline Phosphatase (U/liter) | numeric
## sgot          | SGOT (U/ml)                    | numeric
## trig          | Triglycerides (mg/dl)          | integer
## platelet      | Platelets per cubic ml/1000    | integer
## prothrombin   | Prothrombin time (sec)         | numeric
## stage         | Histologic Stage               | factor
  • If we restrict the data to the trial only, most of the missing values are removed. Therefore, we will focus on the \(312\) observations from the clinical trial.
# Create the trial and test data sets
pbc.trial = subset(pbc, subset=!is.na(treatment))
pbc.test = subset(pbc, subset=is.na(treatment))
  • The rfsrc() function (for help see ?rfsrc) fits the forest.

    • It determines the type of forest by the response supplied in the formula argument.

We now grow a random forest for survival by passing a survival (Surv) object as the response in the formula. The forest uses all variables in the pbc.trial data set to generate the RSF survival model.

rfsrc.pbc = rfsrc(Surv(days, status) ~ ., # survival object
                  data = pbc.trial, # data
                  nsplit = 10, # number of random splits to consider 
                  na.action = "na.impute", # imputing missing data
                  tree.err = TRUE, # track the error rate as trees are added
                  importance = TRUE) # variable importance
print(rfsrc.pbc)
##                          Sample size: 312
##                     Number of deaths: 125
##                     Was data imputed: yes
##                      Number of trees: 1000
##            Forest terminal node size: 15
##        Average no. of terminal nodes: 15.244
## No. of variables tried at each split: 5
##               Total no. of variables: 17
##        Resampling used to grow trees: swor
##     Resample size used to grow trees: 197
##                             Analysis: RSF
##                               Family: surv
##                       Splitting rule: logrank *random*
##        Number of random split points: 10
##                           Error rate: 17.05%

3.2 Generalization Error

One advantage of random forest is a built-in generalization error estimate.

  • Each bootstrap sample selects approximately \(63.2\%\) of the population on average.

  • The remaining \(36.8\%\) of observations, the Out-of-Bag14 (OOB) sample, can be used as a hold out test set for each tree.

  • An OOB prediction error estimate can be calculated for each observation by predicting the response over the set of trees which were not trained with that particular observation.

  • Out-of-Bag prediction error estimates have been shown to be nearly identical to n-fold cross-validation estimates15.

  • This feature of random forest allows us to obtain both model fit and validation in one pass of the algorithm.

We now plot this error as follows

plot(rfsrc.pbc)

## 
##               Importance   Relative Imp
## bili              0.0689         1.0000
## edema             0.0152         0.2207
## copper            0.0142         0.2067
## ascites           0.0128         0.1859
## albumin           0.0089         0.1290
## prothrombin       0.0073         0.1054
## age               0.0060         0.0869
## stage             0.0045         0.0655
## hepatom           0.0028         0.0408
## sgot              0.0028         0.0400
## chol              0.0023         0.0339
## spiders           0.0003         0.0039
## sex               0.0002         0.0033
## alk              -0.0001        -0.0011
## treatment        -0.0001        -0.0013
## trig             -0.0002        -0.0026
## platelet         -0.0008        -0.0122
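
The OOB error shown above can also be read off the forest object directly; a minimal sketch, assuming the err.rate component is populated (which it is here since tree.err = TRUE), and using gg_error() from ggRandomForests for the convergence plot.

tail(rfsrc.pbc$err.rate, 1) # final cumulative OOB error rate (matches print(rfsrc.pbc))
plot(gg_error(rfsrc.pbc))   # OOB error rate as a function of the number of trees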

3.3 Training-Set Predictions

We now plot the predicted survival from our RSF model.

  • Each line represents a single patient in the training data set, where censored patients are colored blue, and patients who have experienced the event (death) are colored in red.
ggRFsrc = plot(gg_rfsrc(rfsrc.pbc))+
  theme(legend.position = "none")+
  labs(y = "Survival Probability", x = "Time (days)")+
  coord_cartesian(ylim = c(-0.01, 1.01))

show(ggRFsrc)

  • Interpretation of general survival properties from the plot above is difficult because of the number of curves displayed.

To get more interpretable results, we plot a summary of the survival results.

  • The figure below shows the median survival with a \(95\%\) shaded confidence band for the DPCA group in red, and the placebo group in blue.
plot(gg_rfsrc(rfsrc.pbc, by = "treatment"))+
  theme(legend.position = c(0.2, 0.2))+
  labs(y = "Survival Probability", x = "Time (days)")+
  coord_cartesian(ylim = c(-0.01, 1.01))

3.4 Random Forest Imputation

There are two modeling issues when dealing with missing data values:

  • How does the algorithm build a model when values are missing from the training data, and

  • How does the algorithm predict a response when values are missing from the test data?

The randomForestSRC package imputes missing values using adaptive tree imputation. Rather than impute missing values before fitting the forest, the algorithm takes a just-in-time approach.

  • At each node split, the set of mtry candidate variables is checked for missing values, where mtry is the number of variables randomly selected as candidates for splitting a node (see ?rfsrc).

  • Missing values are then imputed by randomly drawing values from non-missing data within the node.

  • The split statistic is then calculated using only the observations with observed (non-missing) values.

  • The imputed values are used to sort observations into the subsequent daughter nodes and then discarded before the next split occurs.

  • The process is repeated until the stopping criteria are reached and all observations are sorted into terminal nodes.

  • A final imputation step can be used to fill in missing values from within the terminal nodes. This step uses a process similar to the previous imputation but uses the OOB non-missing terminal node data for the random draws.

  • These values are aggregated (averaging for continuous variables, voting for categorical variables) over the ntree trees in the forest to estimate an imputed data set.

  • By default, the missing values are not filled into the training data, but are available within the forest object for later use if desired.

  • Adaptive tree imputation still requires the missing-at-random assumption (Rubin 1976).

  • At each imputation step, the random forest assumes that similar observations are grouped together within each node. The random draws used to fill in missing data do not bias the split rule, but only sort observations similar in non-missing data into like nodes.

  • An additional feature of this approach is the ability to predict on test set observations with missing values.
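
If a fully imputed data set is needed up front, rather than the just-in-time imputation described above, randomForestSRC also provides a standalone impute() routine; a minimal sketch on the same trial data (arguments may vary by package version).

# Standalone random forest imputation of the trial data
# (an optional alternative to na.action = "na.impute" inside rfsrc())
pbc.trial.imputed = impute(Surv(days, status) ~ ., data = pbc.trial)
sum(is.na(pbc.trial.imputed)) # expected to be 0 after imputation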

3.5 Test-Set Predictions

The strength of adaptive tree imputation becomes clear when doing prediction on test set observations.

  • If we want to predict survival for patients that did not participate in the trial using the model we created, we need to somehow account for the missing values.

  • The predict.rfsrc() function (see ?predict.rfsrc) takes the forest object (rfsrc.pbc) and the test data set (pbc.test), and returns predicted survival using the same forest imputation method for missing values within the test data set (na.action = "na.impute").

rfsrc.pbc.test = predict(rfsrc.pbc, newdata = pbc.test,
                         na.action = "na.impute",
                         importance = TRUE)

plot(gg_rfsrc(rfsrc.pbc.test))+
  theme(legend.position = "none")+
  labs(y = "Survival Probability", x = "Time (days)")+
  coord_cartesian(ylim = c(-0.01, 1.01))

Blue curves correspond to censored patients, red curves correspond to patients experiencing a death event.
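
The predicted curves themselves can be pulled out of the prediction object; a minimal sketch, assuming the survival matrix and time.interest components of the rfsrc prediction object.

dim(rfsrc.pbc.test$survival)       # one row per test patient, one column per event time
head(rfsrc.pbc.test$time.interest) # event times at which the curves are evaluated
plot(rfsrc.pbc.test$time.interest, rfsrc.pbc.test$survival[1, ], # first test patient
     type = "s", xlab = "Time (days)", ylab = "Survival Probability")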

3.6 Variable Selection

  • Random forest is not a parsimonious method, but uses all variables available in the data set to construct the response predictor.

  • Also, unlike parametric models, random forest does not require the explicit specification of the functional form of covariates to the response. Therefore, there is no explicit p-value/significance test for variable selection with a random forest model.

  • Instead, RF ascertains which variables contribute to the prediction through the split rule optimization, optimally choosing variables which separate observations.

  • In survival settings we are also interested in how we can possibly improve the outcome of interest; hence, care should be taken when including variables in the model.

3.7 Variable Importance

  • Variable importance was originally defined in CART using a measure involving surrogate variables.

  • The most popular variable importance method uses a prediction error approach involving “noising-up” each variable in turn.

  • Variable importance for a variable \(x_k\) is the difference between prediction error when \(x_k\) is randomly permuted, compared to prediction error under the observed values.

    • A large positive variable importance value indicates that misspecification detracts from the predictive accuracy in the forest.

    • A variable importance value close to zero indicates the variable contributes nothing to predictive accuracy, and

    • A large negative variable importance value indicates the predictive accuracy improves when the variable is misspecified indicating that noise is more informative than the true variable.

    • As such, we ignore variables with negative and near zero values of VIMP, relying on large positive values to indicate that the predictive power of the forest is dependent on those variables.

plot(gg_vimp(rfsrc.pbc))+
  theme(legend.position = c(0.8, 0.2))+
  labs(fill = "Var_Imp > 0")

The plot above details variable importance ranking for the pbc.trial baseline variables, from the largest (Serum Bilirubin) at the top, to smallest (Treatment (DPCA, Placebo)) at the bottom.
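
The numeric VIMP values behind this plot are stored in the forest object (importance = TRUE was set when growing the forest), so they can also be inspected directly.

# variable importance values, sorted from largest to smallest
sort(rfsrc.pbc$importance, decreasing = TRUE)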


4 Open Source

4.1 TCGA

  • The Cancer Genome Atlas (TCGA) is a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) that has collected a large amount of clinical and genomic data across 33 cancer types.

  • The entire TCGA data set is over 2 petabytes worth of gene expression, CNV profiling, SNP genotyping, DNA methylation, miRNA profiling, exome sequencing, and other types of data.

  • You can learn more about TCGA at cancergenome.nih.gov. The data is now housed at the Genomic Data Commons Portal.

  • There are many ways to access TCGA data without actually downloading and parsing the raw data from the GDC.

First, let us look at an R package that provides convenient, direct access to TCGA data.

4.1.1 RTCGA

The RTCGA package (bioconductor.org/packages/RTCGA) and all the associated data packages provide convenient access to clinical and genomic data in TCGA.

# Install the BiocManager installer
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Install the main RTCGA package
BiocManager::install("RTCGA")

Each of the data packages is a separate package, and must be installed (once) individually.

# Install the clinical and mRNA gene expression data packages
BiocManager::install("RTCGA.clinical")
BiocManager::install("RTCGA.mRNA")

You can load the RTCGA package, and use the infoTCGA() function to get some information about the kind of data available for each cancer type.

library(RTCGA)
infoTCGA()
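
As one possible next step, clinical data can be reshaped into the (time, status) form used earlier in this lecture. The sketch below is only an illustration: it assumes the RTCGA.clinical data package installed above and the survivalTCGA() helper, whose arguments may differ across package versions.

# Sketch: extract survival times and event indicators for breast cancer patients
library(RTCGA.clinical) # clinical data package installed above
clin = survivalTCGA(BRCA.clinical, extract.cols = "admin.disease_code")
head(clin) # expected to contain survival times and vital status per patient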

4.2 Other TCGA Resources

RTCGA is not the only resource providing easy access to TCGA data. In fact, it is not even the only R/Bioconductor package. Take a look at some of the other resources shown below.

  • TCGAbiolinks: another R package that allows direct query and analysis from the NCI GDC.

  • cBioPortal: cbioportal.org

    • Nice graphical user interface.

    • Quick/easy summary info on patients, demographics, mutations, copy number alterations, etc.

    • Query individual genes, find co-expressed genes.

    • Survival analysis against different sub-types, expression, CNAs, etc.

  • OncoLnc: oncolnc.org

    • Focus on survival analysis and RNA-seq data.

    • Simple query interface across all cancers for any mRNA, miRNA, or lncRNA gene (try SERPINA1).

    • Precomputed Cox PH regression for every gene, for every cancer.

    • Kaplan-Meier plots produced on demand.

  • LinkedOmics: A portal that includes multi-omics data from all 32 TCGA cancer types.



  1. These notes are adapted from Coxnet: Regularized Cox Regression, Random Forests for Survival, Survival Analysis with R, and ggRandomForests: Exploring Random Forest Survival.

  2. Park, M.Y. and Hastie, T., 2007. L1‐regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4), pp.659-677.

  3. Simon, N., Friedman, J., Hastie, T. and Tibshirani, R., 2011. Regularization paths for Cox’s proportional hazards model via coordinate descent. Journal of Statistical Software, 39(5), p.1.

  4. Simon, N., Friedman, J. and Hastie, T., 2013. A block-wise descent algorithm for group-penalized multiresponse and multinomial regression. arXiv preprint arXiv:1311.6529.

  5. Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J. and Tibshirani, R.J., 2012. Strong rules for discarding predictors in lasso‐type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2), pp.245-266.

  6. Breiman, L., 2001. Random forests. Machine learning, 45(1), pp.5-32.

  7. Breiman, L., 1996. Bagging predictors. Machine learning, 24(2), pp.123-140.

  8. Breiman, L., Friedman, J., Olshen, R. and Stone, C., 1984. Classification and regression trees. Wadsworth Int. Group, 37(15), pp.237-251.

  9. Efron, B. and Tibshirani, R.J., 1994. An introduction to the bootstrap. CRC press.

  10. Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp.1189-1232.

  11. Breiman, L., 2001. Random forests. Machine learning, 45(1), pp.5-32.

  12. Ishwaran, H., Kogalur, U.B., Blackstone, E.H. and Lauer, M.S., 2008. Random survival forests. The Annals of Applied Statistics, 2(3), pp.841-860.

  13. Fleming, T.R. and Harrington, D.P., 2011. Counting processes and survival analysis (Vol. 169). John Wiley & Sons.

  14. Breiman, L., 1996. Out-of-bag estimation.

  15. Hastie, T., Tibshirani, R. and Friedman, J., 2009. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.