In this lecture we will cover some complex models for survival data such as regularization in the Cox PH regression and random survival forests.
Note: This lecture draws on several resources; for details see the endnotes [1].
Consider the usual survival analysis framework. We have data of the form \[(y_1, x_1, \delta_1), . . . , (y_n, x_n, \delta_n),\] where \(y_i\), the observed time, is a time of failure if \(\delta_i\) is \(1\) or right-censoring if \(\delta_i\) is \(0\).
As in regression, \(x_i\) is a vector of potential predictors \((x_{i1}, x_{i2}, \ldots,x_{ip})\). We further let \(t_1 < t_2 < \ldots < t_m\) be the increasing list of unique failure times, and \(j(i)\) denote the index of the observation failing at time \(t_i\).
One potential problem of interest is to study the relationship between predictor variables and survival time.
This can be addressed with the Cox proportional hazards model, which assumes a semi-parametric form for the hazard \[h_i(t) = h_0(t) e^{x_i^\top \beta},\] where \(h_i(t)\) is the hazard for patient \(i\) at time \(t\), \(h_0(t)\) is a shared baseline hazard, and \(\beta\) is a fixed vector of length \(p\).
Inference is then made via the partial likelihood \[L(\beta) = \prod\limits_{i=1}^m \frac{e^{x_{j(i)}^\top\beta}}{\sum_{j \in R_i}e^{x_j^\top \beta}},\] where \(R_i\) is the set of indices, \(j\), with \(y_j \geq t_i\) (those at risk at time \(t_i\)).
For classical problems, with many more observations than predictors, the Cox model performs well. However, problems with \(p > n\) lead to degenerate behavior.
Note: For ease of notation the above formula assumes that the \(y_i\) are unique, however it can be suitably altered for the case of ties.
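For the classical setting, the unpenalized model can be fit directly with the survival package's coxph() function; below is a minimal sketch on simulated data (the covariates and rates are purely illustrative).
library(survival) # provides Surv() and coxph()
set.seed(1)
n = 200
x1 = rnorm(n); x2 = rbinom(n, 1, 0.5) # two illustrative covariates
event.time = rexp(n, rate = exp(0.5 * x1 - 0.3 * x2)) # event times from a PH model
cens.time = rexp(n, rate = 0.3) # independent censoring times
time = pmin(event.time, cens.time) # observed time
status = as.numeric(event.time <= cens.time) # 1 = failure, 0 = right-censored
fit = coxph(Surv(time, status) ~ x1 + x2) # maximizes the partial likelihood
summary(fit) # estimated log hazard ratios, standard errors, tests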
We wish to find \(\beta\) which maximizes \(L(\beta)\) subject to the elastic net constraint \[P_\alpha (\beta) := \alpha\sum\limits_{k=1}^p |\beta_k| + (1 - \alpha)\sum\limits_{k=1}^p \beta_k^2 \leq c.\]
Maximizing the partial likelihood is equivalent to maximizing a scaled log partial likelihood \[\frac{2}{n}l(\beta) = \frac{2}{n} \sum\limits_{i=1}^m \Bigg[ x_{j(i)}^\top\beta - \log\Big(\sum_{j \in R_i}e^{x_j^\top \beta}\Big) \Bigg].\]
If we consider the Lagrangian formulation, our problem becomes \[\hat{\beta} = \underset{\beta}{\mathrm{argmax}} \Bigg\{ \frac{2}{n} \sum\limits_{i=1}^m \Big[ x_{j(i)}^\top\beta - \log\Big(\sum_{j \in R_i}e^{x_j^\top \beta}\Big) \Big] - \lambda P_\alpha(\beta) \Bigg\},\] where \(P_\alpha (\beta)\) is the elastic net penalty.
It is a mixture of the \(\mathbb{L}^1\) (LASSO) and \(\mathbb{L}^2\) (ridge regression) penalties.
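As a quick standalone illustration (not part of glmnet), the penalty can be evaluated directly for any coefficient vector:
elastic_net_penalty = function(beta, alpha) {
  # P_alpha(beta) = alpha * sum |beta_k| + (1 - alpha) * sum beta_k^2
  alpha * sum(abs(beta)) + (1 - alpha) * sum(beta^2)
}
beta = c(0.5, -1.2, 0, 0.3)
elastic_net_penalty(beta, alpha = 1) # pure lasso penalty: 2.0
elastic_net_penalty(beta, alpha = 0) # pure ridge penalty: 1.78
elastic_net_penalty(beta, alpha = 0.5) # an equal mixture of the two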
Park and Hastie [2] applied this to the Cox model and proposed an efficient algorithm to solve this problem along a path of \(c\) and \(\alpha\) values. Here, however, we cover the approach proposed by Simon et al. [3].
Further references (if interested): Simon et al. [4] and Tibshirani et al. [5].
A Random Forest [6] is grown by bagging [7] a collection of classification and regression trees (CART) [8].
Each tree is grown by recursively partitioning the population based on optimization of a split rule over the \(p\)-dimensional covariate space.
At each split, a subset of \(m \leq p\) candidate variables are tested for the split rule optimization, dividing each node into two daughter nodes.
Each daughter node is then split again until the process reaches a stopping criterion of either node purity or node member size, which defines the set of terminal (unsplit) nodes for the tree.
In regression trees, node impurity is measured by mean squared error, whereas in classification problems, the Gini index is used [10].
Random forest sorts each training set observation into one unique terminal node per tree.
Within each tree, the estimate for an observation is constructed from the members of its terminal node.
The Random Forest estimate for each observation is then calculated by aggregating the terminal node results across the collection of \(B\) trees, averaging for regression or majority vote for classification.
Here we try to implement an approach to fit the Cox Model regularized by an elastic net penalty.
It is useful for under-determined (or nearly under-determined) systems and chooses a small number of covariates to include in the model.
Because the Cox Model is rarely used for actual prediction, we will instead focus on finding and interpreting an appropriate model.
We give a simple example of how to format data and run the Cox Model in glmnet
with cross-validation.
Load the data and set up the response. In this case
\(X\) must be an \(n \times p\) matrix of covariate values, where each row corresponds to a patient and each column to a covariate,
\(y\) is an \(n\)-length vector of failure/censoring times, and
status is an \(n\)-length vector with entries \(1\) or \(0\), indicating whether the corresponding entry in \(y\) is a failure time (1) or a right-censoring time (0).
The core package we use is glmnet, which provides functions for lasso and elastic-net regularized generalized linear models.
# CRAN Package needed
install.packages('glmnet') # install package for regularization in GLMs
# CRAN Packages needed
library(glmnet) # package for regularization in GLMs
library(survival) # core survival analysis functions
setwd("C:/Users/username/Documents/CBSA") # setup working directory
coxnet.RDS should be stored in the working directory.
data = readRDS("coxnet.RDS")
str(data)
## List of 3
## $ x : num [1:100, 1:1000] -0.221 -0.195 -0.318 0.427 0.512 ...
## $ time : num [1:100] 5 5.9 6.6 13.1 1.6 1.3 1.4 2.2 3.4 2.2 ...
## $ status: num [1:100] 0 0 0 0 1 1 1 1 1 1 ...
We then call our functions to fit with the lasso penalty \((\alpha = 1)\), and cross validate.
We set maxit = 1000
(increasing the maximum number of iterations to 1000) because our data is relatively high dimensional, so more iterations are needed for convergence.
In practice, the function will report an error if convergence is not reached within the maximum number of iterations.
The Surv
function packages the survival data into the form expected by glmnet.
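A quick look at the object created by Surv (the printed values depend on the data):
y = Surv(data$time, data$status) # right-censored observations are printed with a trailing "+"
head(y)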
cv.fit = cv.glmnet(data$x, # X matrix
Surv(data$time,data$status), # create survival object from the data
alpha = 1, # lasso: alpha = 1; ridge: alpha=0
family = "cox", # specify Cox PH model
maxit = 1000)
The following cross-validated error plot, with the \(\lambda\)-values shown on a \(\log\) scale, helps us evaluate our model.
plot(cv.fit)
The optimal value of \(\lambda\) which minimizes the cross-validation error is given by
log(cv.fit$lambda.min) # in log-scale
## [1] -1.557179
cv.fit$lambda.min
## [1] 0.2107298
The left vertical line in our plot shows us where the cross-validation error curve hits its minimum.
The right vertical line shows us the most regularized model with cross-validation error within one standard error (1-SE) of the minimum.
In this case, we see that the minimum was achieved by a fairly regularized model; however, the right line indicates that the null model (no coefficients included) is within one standard error of the minimum.
This might lead us to believe that the covariates are not actually explaining any variability. For the time being we will concern ourselves with the minimum CV-error model.
We can check which covariates our model chose as active, and see the coefficients of those covariates.
est.coef = coef(cv.fit, s = cv.fit$lambda.min) # returns the p length coefficient vector
# of the solution corresponding to lambda
active.k = which(est.coef != 0)
active.k
## [1] 80 394
active.k.vals = est.coef[active.k]
active.k.vals
## [1] 0.4012294 0.1151167
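For comparison, the cv.glmnet object also stores lambda.1se, the most regularized \(\lambda\) whose cross-validation error is within one standard error of the minimum; a quick sketch of extracting it and its active set (expected to be empty here, per the discussion above):
log(cv.fit$lambda.1se) # 1-SE lambda on the log scale
coef.1se = coef(cv.fit, s = cv.fit$lambda.1se)
which(coef.1se != 0) # indices of active covariates under the 1-SE rule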
Random forest [11] (RF) is a non-parametric statistical method which requires no distributional assumptions about how the covariates relate to the response.
Random Survival Forest (RSF) [12] is an extension of RF techniques to survival settings, allowing efficient non-parametric analysis of time-to-event data.
A forest of survival trees is grown using a log-rank splitting rule to select the optimal candidate variables.
Survival estimates for each observation are constructed with a Kaplan–Meier (KM) estimator within each terminal node, at each event time.
Random Survival Forests adaptively discover nonlinear effects and interactions and are fully non-parametric.
Averaging over many trees enables RSF to approximate complex survival functions, including non-proportional hazards, while maintaining low prediction error.
Let us first install the randomForestSRC package, which implements random survival forests, and the ggRandomForests package, which provides plotting utilities for random forest objects.
# CRAN Package needed
install.packages('randomForestSRC') # install package for random forest
install.packages('ggRandomForests') # install package for plotting results
install.packages('ggplot2') # install package for plotting results
Load the packages:
# CRAN Packages needed
library(randomForestSRC) # package for random survival forests
library(ggRandomForests) # package for plotting random forest results
library(ggplot2) # general plotting
Load the PBC data:
data(pbc)
head(pbc)
## days status treatment age sex ascites hepatom spiders edema bili chol
## 1 400 1 1 21464 1 1 1 1 1.0 14.5 261
## 2 4500 0 1 20617 1 0 1 1 0.0 1.1 302
## 3 1012 1 1 25594 0 0 0 0 0.5 1.4 176
## 4 1925 1 1 19994 1 0 1 1 0.5 1.8 244
## 5 1504 0 2 13918 1 0 1 1 0.0 3.4 279
## 6 2503 1 2 24201 1 0 1 0 0.0 0.8 248
## albumin copper alk sgot trig platelet prothrombin stage
## 1 2.60 156 1718.0 137.95 172 190 12.2 4
## 2 4.14 54 7394.8 113.52 88 221 10.6 3
## 3 3.48 210 516.0 96.10 55 151 12.0 4
## 4 2.54 64 6121.8 60.63 92 183 10.3 4
## 5 3.53 143 671.0 113.15 72 136 10.9 3
## 6 3.98 50 944.0 93.00 63 NA 11.0 3
##
## Variable name | Description | Type
## days | Time (days) | numeric
## status | Event (F = censor, T = death) | logical
## treatment | Treatment (DPCA, Placebo) | factor
## age | Age (years) | numeric
## sex | Female = T | logical
## ascites | Presence of Asictes | logical
## hepatom | Presence of Hepatomegaly | logical
## spiders | Presence of Spiders | logical
## edema | Edema (0, 0.5, 1) | factor
## bili | Serum Bilirubin (mg/dl) | numeric
## chol | Serum Cholesterol (mg/dl) | integer
## albumin | Albumin (gm/dl) | numeric
## copper | Urine Copper (ug/day) | integer
## alk | Alkaline Phosphatase (U/liter) | numeric
## sgot | SGOT (U/ml) | numeric
## trig | Triglicerides (mg/dl) | integer
## platelet | Platelets per cubic ml/1000 | integer
## prothrombin | Prothrombin time (sec) | numeric
## stage | Histologic Stage | factor
# Create the trial and test data sets
pbc.trial = subset(pbc, subset=!is.na(treatment))
pbc.test = subset(pbc, subset=is.na(treatment))
The rfsrc()
function (for help see ?rfsrc
) fits the forest.
We now grow a random forest for survival, by passing a survival (Surv
) object to the forest. The forest uses all variables in the pbc.trial
data set to generate the RSF survival model.
rfsrc.pbc = rfsrc(Surv(days, status) ~ ., # survival object
data = pbc.trial, # data
nsplit = 10, # number of random splits to consider
na.action = "na.impute", # imputing missing data
tree.err = TRUE, # track cumulative OOB error as trees are added
importance = TRUE) # variable importance
print(rfsrc.pbc)
## Sample size: 312
## Number of deaths: 125
## Was data imputed: yes
## Number of trees: 1000
## Forest terminal node size: 15
## Average no. of terminal nodes: 15.244
## No. of variables tried at each split: 5
## Total no. of variables: 17
## Resampling used to grow trees: swor
## Resample size used to grow trees: 197
## Analysis: RSF
## Family: surv
## Splitting rule: logrank *random*
## Number of random split points: 10
## Error rate: 17.05%
One advantage of random forests is a built-in generalization error estimate.
Each bootstrap sample selects approximately \(63.2\%\) of the population on average.
The remaining \(36.8\%\) of observations, the Out-of-Bag [14] (OOB) sample, can be used as a hold-out test set for each tree.
An OOB prediction error estimate can be calculated for each observation by predicting the response over the set of trees which were not trained with that particular observation.
Out-of-Bag prediction error estimates have been shown to be nearly identical to \(n\)-fold cross-validation estimates [15].
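Because we set tree.err = TRUE when growing the forest, the cumulative OOB error rate is stored in the fitted object; a brief sketch of inspecting it (err.rate is the component name used by randomForestSRC):
head(rfsrc.pbc$err.rate) # OOB error (1 - concordance for survival) as trees are added
tail(rfsrc.pbc$err.rate, 1) # final OOB estimate, matching the error rate printed above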
We now plot this error as follows
plot(rfsrc.pbc)
##
## Importance Relative Imp
## bili 0.0689 1.0000
## edema 0.0152 0.2207
## copper 0.0142 0.2067
## ascites 0.0128 0.1859
## albumin 0.0089 0.1290
## prothrombin 0.0073 0.1054
## age 0.0060 0.0869
## stage 0.0045 0.0655
## hepatom 0.0028 0.0408
## sgot 0.0028 0.0400
## chol 0.0023 0.0339
## spiders 0.0003 0.0039
## sex 0.0002 0.0033
## alk -0.0001 -0.0011
## treatment -0.0001 -0.0013
## trig -0.0002 -0.0026
## platelet -0.0008 -0.0122
We now plot the predicted survival from our RSF model.
ggRFsrc = plot(gg_rfsrc(rfsrc.pbc))+
theme(legend.position = "none")+
labs(y = "Survival Probability", x = "Time (days)")+
coord_cartesian(ylim = c(-0.01, 1.01))
show(ggRFsrc)
To get more interpretable results, we plot the survival estimates grouped by treatment.
plot(gg_rfsrc(rfsrc.pbc, by = "treatment"))+
theme(legend.position = c(0.2, 0.2))+
labs(y = "Survival Probability", x = "Time (days)")+
coord_cartesian(ylim = c(-0.01, 1.01))
There are two modeling issues when dealing with missing data values:
How does the algorithm build a model when values are missing from the training data? and
How does the algorithm predict a response when values are missing from the test data?
The randomForestSRC
package imputes missing values using adaptive tree imputation. Rather than impute missing values before fitting the forest, the algorithm takes a just-in-time approach.
At each node split, the set of mtry
candidate variables is checked for missing values, where the mtry
is the number of variables randomly selected as candidates for splitting a node (see ?rfsrc
).
Missing values are then imputed by randomly drawing values from non-missing data within the node.
The split statistic is then calculated on the observations that were not missing values for that variable.
The imputed values are used to sort observations into the subsequent daughter nodes and then discarded before the next split occurs.
The process is repeated until the stopping criterion is reached and all observations are sorted into terminal nodes.
A final imputation step can be used to fill in missing values from within the terminal nodes. This step uses a process similar to the previous imputation but uses the OOB non-missing terminal node data for the random draws.
These values are aggregated (averaging for continuous variables, voting for categorical variables) over the ntree
trees in the forest to estimate an imputed data set.
By default, the missing values are not filled into the training data, but are available within the forest object for later use if desired.
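If we want to inspect those values, they can be pulled from the fitted forest; this sketch assumes the component names documented by randomForestSRC (imputed.indv and imputed.data):
head(rfsrc.pbc$imputed.indv) # row indices of pbc.trial that contained missing values
head(rfsrc.pbc$imputed.data) # the corresponding rows with missing values filled in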
Adaptive tree imputation still requires the missing-at-random assumption (Rubin 1976).
At each imputation step, the random forest assumes that similar observations are grouped together within each node. The random draws used to fill in missing data do not bias the split rule, but only sort observations similar in non-missing data into like nodes.
An additional feature of this approach is the ability to predict on test set observations that have missing values; this is where the strength of adaptive tree imputation becomes clear.
If we want to predict survival for patients that did not participate in the trial using the model we created, we need to somehow account for the missing values.
The predict.rfsrc() function (see ?predict.rfsrc) takes the forest object (rfsrc.pbc) and the test data set (pbc.test), and returns predicted survival, using the same forest imputation method for missing values within the test data set (na.action = "na.impute").
rfsrc.pbc.test = predict(rfsrc.pbc, newdata = pbc.test,
na.action = "na.impute",
importance = TRUE)
plot(gg_rfsrc(rfsrc.pbc.test))+
theme(legend.position = "none")+
labs(y = "Survival Probability", x = "Time (days)")+
coord_cartesian(ylim = c(-0.01, 1.01))
Blue curves correspond to censored patients, red curves correspond to patients experiencing a death event.
Random forest is not a parsimonious method, but uses all variables available in the data set to construct the response predictor.
Also, unlike parametric models, random forest does not require the explicit specification of the functional form of covariates to the response. Therefore, there is no explicit p-value/significance test for variable selection with a random forest model.
Instead, RF ascertains which variables contribute to the prediction through the split rule optimization, optimally choosing variables which separate observations.
In survival settings we are also interested in how we can possibly improve the outcome of interest. Hence care should be taken when including variables in the model.
Variable importance was originally defined in CART using a measure involving surrogate variables.
The most popular variable importance method uses a prediction error approach involving “noising-up” each variable in turn.
Variable importance for a variable \(x_k\) is the difference between prediction error when \(x_k\) is randomly permuted, compared to prediction error under the observed values.
A large positive variable importance value indicates that misspecification detracts from the predictive accuracy in the forest.
A variable importance value close to zero indicates the variable contributes nothing to predictive accuracy, and
A large negative variable importance value indicates the predictive accuracy improves when the variable is misspecified indicating that noise is more informative than the true variable.
As such, we ignore variables with negative and near zero values of VIMP, relying on large positive values to indicate that the predictive power of the forest is dependent on those variables.
plot(gg_vimp(rfsrc.pbc))+
theme(legend.position = c(0.8, 0.2))+
labs(fill = "Var_Imp > 0")
The plot above details variable importance ranking for the pbc.trial baseline variables, from the largest (Serum Bilirubin) at the top, to smallest (Treatment (DPCA, Placebo)) at the bottom.
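If the numeric importance values are needed directly rather than through the plot, they are stored in the fitted forest (since we set importance = TRUE); randomForestSRC's vimp() function can also re-compute them by the permutation procedure described above. A brief sketch:
sort(rfsrc.pbc$importance, decreasing = TRUE) # permutation VIMP values, largest first
# vimp(rfsrc.pbc)$importance # alternatively, re-compute importance from the grown forest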
The Cancer Genome Atlas (TCGA) is a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) that has collected a large amount of clinical and genomic data across 33 cancer types.
The entire TCGA data set is over 2 petabytes of gene expression, CNV profiling, SNP genotyping, DNA methylation, miRNA profiling, exome sequencing, and other types of data.
You can learn more about TCGA at cancergenome.nih.gov. The data is now housed at the Genomic Data Commons Portal.
There are many ways to access TCGA data without downloading and parsing the raw data from the GDC yourself.
First, let us look at an R package that provides convenient, direct access to TCGA data.
The RTCGA package (bioconductor.org/packages/RTCGA) and all the associated data packages provide convenient access to clinical and genomic data in TCGA.
# Install the BiocManager installer
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# Install the main RTCGA package
BiocManager::install("RTCGA")
Each of the data packages is a separate package, and must be installed (once) individually.
# Install the clinical and mRNA gene expression data packages
BiocManager::install("RTCGA.clinical")
BiocManager::install("RTCGA.mRNA")
You can load the RTCGA package, and use the infoTCGA()
function to get some information about the kind of data available for each cancer type.
library(RTCGA)
infoTCGA()
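As a quick illustration of pulling survival data from these packages, the sketch below uses RTCGA's survivalTCGA() helper with the RTCGA.clinical data package installed above; treat the exact column names as typical output rather than a guaranteed interface.
library(RTCGA.clinical) # clinical cohorts such as BRCA.clinical and OV.clinical
clin = survivalTCGA(BRCA.clinical, OV.clinical, # combine two cohorts
                    extract.cols = "admin.disease_code")
head(clin) # times, patient barcode, vital status (1 = dead, 0 = alive/censored), disease code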
RTCGA is not the only resource providing easy access to TCGA data. In fact, it is not even the only R/Bioconductor package. Take a look at some of the other resources shown below.
TCGAbiolinks: another R package that allows direct query and analysis from the NCI GDC.
R package: bioconductor.org/packages/TCGAbiolinks
Paper: Nucleic Acids Research 2015 DOI: 10.1093/nar/gkv1507.
cBioPortal: cbioportal.org
Nice graphical user interface.
Quick/easy summary info on patients, demographics, mutations, copy number alterations, etc.
Query individual genes, find co-expressed genes.
Survival analysis against different sub-types, expression, CNAs, etc.
OncoLnc: oncolnc.org
Focus on survival analysis and RNA-seq data.
Simple query interface across all cancers for any mRNA, miRNA, or lncRNA gene (try SERPINA1).
Precomputed Cox PH regression for every gene, for every cancer.
Kaplan-Meier plots produced on demand.
LinkedOmics: A portal that includes multi-omics data from all 32 TCGA cancer types.
[1] These notes are adapted from Coxnet: Regularized Cox Regression, Random Forests for Survival, Survival Analysis with R, and ggRandomForests: Exploring Random Forest Survival.
[2] Park, M.Y. and Hastie, T., 2007. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4), pp.659-677.
[3] Simon, N., Friedman, J., Hastie, T. and Tibshirani, R., 2011. Regularization paths for Cox's proportional hazards model via coordinate descent. Journal of Statistical Software, 39(5), p.1.
[4] Simon, N., Friedman, J. and Hastie, T., 2013. A block-wise descent algorithm for group-penalized multiresponse and multinomial regression. arXiv preprint arXiv:1311.6529.
[5] Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J. and Tibshirani, R.J., 2012. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2), pp.245-266.
[6] Breiman, L., 2001. Random forests. Machine Learning, 45(1), pp.5-32.
[7] Breiman, L., 1996. Bagging predictors. Machine Learning, 24(2), pp.123-140.
[8] Breiman, L., Friedman, J., Olshen, R. and Stone, C., 1984. Classification and regression trees. Wadsworth Int. Group, 37(15), pp.237-251.
[9] Efron, B. and Tibshirani, R.J., 1994. An Introduction to the Bootstrap. CRC Press.
[10] Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp.1189-1232.
[11] Breiman, L., 2001. Random forests. Machine Learning, 45(1), pp.5-32.
[12] Ishwaran, H., Kogalur, U.B., Blackstone, E.H. and Lauer, M.S., 2008. Random survival forests. The Annals of Applied Statistics, 2(3), pp.841-860.
[13] Fleming, T.R. and Harrington, D.P., 2011. Counting Processes and Survival Analysis (Vol. 169). John Wiley & Sons.
[14] Breiman, L., 1996. Out-of-bag estimation.
[15] Hastie, T., Tibshirani, R. and Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.