In this lecture we will cover methods for exploratory data analysis and some basic analysis with linear models. We will also cover several variable selection approaches, including stepwise selection, random forests, and regularization-based approaches.
We will use the cleaned data set variable_selection_data.csv
as our sample data for this lecture. This data set is included in the materials provided.
This data set has \(500\) subjects, and \(10\) variables including
\(1\) binary confounder \(z\),
\(7\) predictor variables (\(x_1,\ldots,x_7\)),
\(1\) response variable (\(Y\)), and
the variable OBS indicating the observation ID.
setwd("C:/Users/username/Documents/CBSA") #setup working directory
# should be stored in the working directory
data = read.csv("variable_selection_data.csv") # read in data
#change the datanames from UPPER to LOWER
names(data) = tolower(names(data))
names(data)
## [1] "obs" "y" "x1" "x2" "x3" "x4" "x5" "x6" "x7" "z"
# dimension of the data
dim.data = dim(data)
dim.data
## [1] 500 10
# frequency for categorical variable
table(data$z)
##
## 0 1
## 286 214
# percentage of each category
prop.table(table(data$z))
##
## 0 1
## 0.572 0.428
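Counts and percentages can also be combined into a single table; a minimal sketch (the column labels are chosen here for illustration):
cbind(count = table(data$z),
percent = round(100*prop.table(table(data$z)), 1))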
data.cont = data[, -c(1,10)] # remove "obs" and "z"
names(data.cont)
## [1] "y" "x1" "x2" "x3" "x4" "x5" "x6" "x7"
# simple way:
summary(data.cont) # gives min, q1, median, q3, max, and mean
## y x1 x2 x3
## Min. : 0.353 Min. :0.0482 Min. :0.242 Min. :0.0376
## 1st Qu.:14.374 1st Qu.:0.4127 1st Qu.:0.660 1st Qu.:0.4404
## Median :21.099 Median :0.8606 Median :0.930 Median :0.8454
## Mean :23.301 Mean :1.1421 Mean :0.993 Mean :1.1278
## 3rd Qu.:31.858 3rd Qu.:1.4920 3rd Qu.:1.286 3rd Qu.:1.5569
## Max. :53.013 Max. :4.6872 Max. :2.467 Max. :4.5648
## x4 x5 x6 x7
## Min. :0.072 Min. :0.0495 Min. :0.078 Min. :0.0427
## 1st Qu.:0.618 1st Qu.:0.4900 1st Qu.:0.527 1st Qu.:0.4128
## Median :0.972 Median :0.9178 Median :0.916 Median :0.8444
## Mean :1.220 Mean :1.2372 Mean :1.157 Mean :1.1304
## 3rd Qu.:1.537 3rd Qu.:1.6784 3rd Qu.:1.586 3rd Qu.:1.5120
## Max. :4.652 Max. :4.9680 Max. :4.805 Max. :4.8855
num.cont = dim(data.cont)[2] # 8 continuous variables
desc.cont = matrix(NA,num.cont, 4)
rownames(desc.cont) = names(data.cont)
colnames(desc.cont) = c('Mean', 'SD', '# of missing', "# in total")
desc.cont[,1] = apply(data.cont, 2, mean) # Mean
desc.cont[,2] = apply(data.cont, 2, sd) # Standard deviation (SD)
# number of missings and total available number
for(i in 1:num.cont){
desc.cont[i,4] = length(na.omit(data.cont[,i])) # data available excluding missingness
desc.cont[i,3] = dim.data[1]-desc.cont[i,4] # missingness
}
desc.cont
## Mean SD # of missing # in total
## y 23.3010 10.85264 0 500
## x1 1.1421 0.97913 0 500
## x2 0.9931 0.43424 0 500
## x3 1.1278 0.92698 0 500
## x4 1.2197 0.85501 0 500
## x5 1.2372 1.00616 0 500
## x6 1.1573 0.85140 0 500
## x7 1.1304 0.97948 0 500
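The same descriptive table can be built more compactly with sapply; a minimal sketch (desc.alt and its column names are introduced here for illustration):
desc.alt = t(sapply(data.cont, function(v)
c(Mean = mean(v, na.rm = TRUE), SD = sd(v, na.rm = TRUE),
n.missing = sum(is.na(v)), n.total = sum(!is.na(v)))))
desc.alt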
### correlation matrix of predictors
x = data[,-c(1:2,10)] # remove obs, the outcome, and the confounder; keep the 7 predictors
install.packages("Hmisc")
library(Hmisc)
rcorr(as.matrix(x), type="pearson") # type can be "pearson" or "spearman"
corrx = cor(x) # get the correlations
col.x = colnames(x)
num.x = ncol(x) # number of predictors
corrx_big = matrix(NA, nrow = num.x, ncol = num.x,
dimnames=list(col.x,col.x))
### corrx_big keeps the entries of corrx that have correlations>0.5 or <-0.5
index = which(abs(corrx)>0.5, arr.ind=TRUE)
corrx_big[index] = corrx[index]
#### save corrx and corrx_big to csv files
# write.csv(corrx, file="corrx.csv")
# write.csv(corrx_big, file="corrx_big.csv")
Here corrx.csv
contains the correlation matrix of the 7 predictors and corrx_big.csv
contains only the entries of the correlation matrix with correlations \(>0.5\) or \(<-0.5\).
install.packages("ggplot2")
install.packages("reshape2")
### draw heatmap
library(ggplot2)
library(reshape2)
qplot(x=Var1, y=Var2, data=melt(cor(x)), fill=value, geom="tile")
install.packages("corrplot")
## using corrplot
library(corrplot)
## calculate p-value matrix
p = cor.mtest(x, conf.level = .95)
round(p$p, digits = 4)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.0000 0.0000 0.0000 0.0000 0.3866 0.4202 0.0000
## [2,] 0.0000 0.0000 0.0000 0.0000 0.0137 0.1155 0.0000
## [3,] 0.0000 0.0000 0.0000 0.0000 0.2636 0.6147 0.0000
## [4,] 0.0000 0.0000 0.0000 0.0000 0.0096 0.0026 0.4043
## [5,] 0.3866 0.0137 0.2636 0.0096 0.0000 0.0000 0.0848
## [6,] 0.4202 0.1155 0.6147 0.0026 0.0000 0.0000 0.0053
## [7,] 0.0000 0.0000 0.0000 0.4043 0.0848 0.0053 0.0000
corrplot(corrx, method = "color")
corrplot(corrx, method = "color", type = "upper", order = "hclust")
install.packages("mgcv")
library(mgcv)
dat1 = data[,-1] # drop the obs column; remaining columns are y, x1-x7, z
par(mfrow = c(2, 4)) # split the plot display in 2 rows and 4 columns
for(i in 1:7) {
dat_temp = dat1[, c(1, i+1, 9)] # y, the i-th predictor, and z
colnames(dat_temp)[2] = "x"
mod = gam(y ~ s(x) + z, data = dat_temp) # smooth term for the predictor, adjusted for z
plot(mod, xlab = names(dat1)[1+i])
}
We see from the smooth plots that the effect of \(x_7\) has a typical \(\log\) shape.
Let us now take the \(\log\)-transform of \(x_7\) and standardize all \(7\) predictor variables.
dat1[, 8] = log(dat1[, 8]) # column 8 of dat1 is x7
# data standardization (mean 0, SD 1)
dat1[, 2:8] = scale(dat1[, 2:8])
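A quick check that the standardization worked (means numerically 0 and SDs equal to 1):
round(colMeans(dat1[, 2:8]), 8) # should all be 0 (up to rounding)
apply(dat1[, 2:8], 2, sd) # should all be 1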
# A quick reminder of matrix operations in R used below:
t(x) # transpose; when x is a vector, it is treated as a column,
# i.e., the result is a 1-row matrix
A %*% B # matrix multiplication
A * B # element-wise multiplication
# dat2 will hold y, the 8 main effects, and all 28 pairwise interactions (37 columns)
dat2 = data.frame(matrix(0, 500, 37))
colnames(dat2)[1:9] = colnames(dat1)
dat2[, 1:9] = dat1
k = 10 # interaction terms start in column 10
for(i in 1:7){
for(j in (i+1):8){
dat2[, k] = dat1[, i+1]*dat1[, j+1]
colnames(dat2)[k] = paste(colnames(dat1)[i+1], ".",
colnames(dat1)[j+1], sep = "")
k = k + 1
}
}
names(dat2)
## [1] "y" "x1" "x2" "x3" "x4" "x5" "x6" "x7" "z"
## [10] "x1.x2" "x1.x3" "x1.x4" "x1.x5" "x1.x6" "x1.x7" "x1.z" "x2.x3" "x2.x4"
## [19] "x2.x5" "x2.x6" "x2.x7" "x2.z" "x3.x4" "x3.x5" "x3.x6" "x3.x7" "x3.z"
## [28] "x4.x5" "x4.x6" "x4.x7" "x4.z" "x5.x6" "x5.x7" "x5.z" "x6.x7" "x6.z"
## [37] "x7.z"
names(data)
## [1] "obs" "y" "x1" "x2" "x3" "x4" "x5" "x6" "x7" "z"
X_sq = data[, 3:8]^2 # squared terms for x1-x6 (columns 3-8 of the original data)
colnames(X_sq) = paste("x", 1:6, "sq", sep="")
colnames(X_sq)
## [1] "x1sq" "x2sq" "x3sq" "x4sq" "x5sq" "x6sq"
dat.all = cbind(dat2, X_sq)
names(dat.all)
## [1] "y" "x1" "x2" "x3" "x4" "x5" "x6" "x7" "z"
## [10] "x1.x2" "x1.x3" "x1.x4" "x1.x5" "x1.x6" "x1.x7" "x1.z" "x2.x3" "x2.x4"
## [19] "x2.x5" "x2.x6" "x2.x7" "x2.z" "x3.x4" "x3.x5" "x3.x6" "x3.x7" "x3.z"
## [28] "x4.x5" "x4.x6" "x4.x7" "x4.z" "x5.x6" "x5.x7" "x5.z" "x6.x7" "x6.z"
## [37] "x7.z" "x1sq" "x2sq" "x3sq" "x4sq" "x5sq" "x6sq"
data2 = cbind(dat.all[,c(1,9)],dat.all[,c(2:8,10:43)]) # reorder so that y and z come first
x0 = data2[,-c(1:2,10:43)] # extract the 7 predictors
x1 = data2[,-c(1,10:43)] # extract the 8 main-effect variables (z and x1-x7)
x2 = data2[,-c(1)] # extract all predictors: main effects, interactions, and squared terms
Here we fit single-variable models with adjustment for confounders.
Note: Always keep confounders in the model.
y = data2$y # y as the outcome
num.x = ncol(x0)
univ = matrix(0, nrow = num.x, ncol = 2)
colnames(univ) = c('beta','pvalue')
rownames(univ) = names(x0)
univ = data.frame(univ)
for(i in 1:num.x){
fit = lm(y ~ x0[,i] + factor(z), data = data2) # always keep confounders in the model
univ[i,1] = coef(fit)[2]
univ[i,2] = anova(fit)$'Pr(>F)'[1]
}
univ = univ[order(univ$pvalue),]# sorted by pvalue in increasing order
univ
## beta pvalue
## x1 4.1947 2.2766e-138
## x3 2.8814 6.5621e-114
## x2 2.5996 3.9445e-110
## x7 3.5144 7.1499e-89
## x5 -3.3050 3.3736e-28
## x4 0.2402 3.7374e-19
## x6 -1.7112 9.8027e-08
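If confidence intervals for the adjusted single-predictor effects are also needed, confint() can be used inside the same loop; a minimal sketch for x1 alone (fit1 is a name introduced here for illustration):
fit1 = lm(y ~ x1 + factor(z), data = data2)
confint(fit1)["x1", ] # 95% CI for the x1 coefficient, adjusted for z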
Here we fit multi-variable models adjusted for confounders.
Since all predictors have p-values \(\leq 0.1\), we put all \(8\) main effect variables into the multiple regression model.
dat.ml = data.frame(cbind(y, x1))
fit.ml = lm(y ~ ., data = dat.ml) # . indicates include all variables in dat.ml except y
summary(fit.ml)
##
## Call:
## lm(formula = y ~ ., data = dat.ml)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.700 -1.753 0.019 1.700 8.613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.4477 0.2050 90.00 < 2e-16 ***
## z 11.3394 0.3872 29.29 < 2e-16 ***
## x1 2.9275 0.2462 11.89 < 2e-16 ***
## x2 1.0546 0.2331 4.52 7.6e-06 ***
## x3 0.0151 0.2485 0.06 0.95
## x4 -0.6751 0.1323 -5.10 4.8e-07 ***
## x5 -3.5940 0.1527 -23.53 < 2e-16 ***
## x6 -0.1266 0.1526 -0.83 0.41
## x7 3.3486 0.1288 26.00 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.7 on 491 degrees of freedom
## Multiple R-squared: 0.939, Adjusted R-squared: 0.938
## F-statistic: 948 on 8 and 491 DF, p-value: <2e-16
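Standard residual diagnostics for this main-effects model can be obtained directly from the fitted lm object:
par(mfrow = c(2, 2))
plot(fit.ml) # residuals vs fitted, QQ plot, scale-location, leverage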
x2 = data2[,-c(1)] # 42 variables, excluding outcome
dat.ml = data.frame(cbind(y, x2))
fit.ml2 = lm(y ~ ., data = dat.ml)
summary(fit.ml2)
##
## Call:
## lm(formula = y ~ ., data = dat.ml)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.019 -1.452 0.029 1.491 7.735
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.6719 1.8265 9.68 < 2e-16 ***
## z 10.8002 0.5366 20.13 < 2e-16 ***
## x1 3.7079 0.9989 3.71 0.00023 ***
## x2 1.4535 1.3874 1.05 0.29537
## x3 -0.2265 0.9887 -0.23 0.81888
## x4 -0.6060 0.4571 -1.33 0.18558
## x5 -5.2849 0.5444 -9.71 < 2e-16 ***
## x6 -0.4850 0.4955 -0.98 0.32822
## x7 3.5847 0.2117 16.93 < 2e-16 ***
## x1.x2 -0.3009 0.3418 -0.88 0.37906
## x1.x3 -0.1731 0.3246 -0.53 0.59413
## x1.x4 0.1718 0.2325 0.74 0.46041
## x1.x5 -0.0581 0.3051 -0.19 0.84916
## x1.x6 0.0580 0.3071 0.19 0.85024
## x1.x7 0.6606 0.2659 2.48 0.01333 *
## x1.z -0.0675 1.2296 -0.05 0.95625
## x2.x3 -0.0373 0.4800 -0.08 0.93811
## x2.x4 0.2341 0.2203 1.06 0.28858
## x2.x5 -0.4982 0.3060 -1.63 0.10421
## x2.x6 0.2820 0.3277 0.86 0.38990
## x2.x7 0.2747 0.2524 1.09 0.27711
## x2.z -0.6640 0.7247 -0.92 0.35998
## x3.x4 -0.4362 0.2302 -1.90 0.05872 .
## x3.x5 0.1785 0.3086 0.58 0.56333
## x3.x6 -0.2266 0.3033 -0.75 0.45533
## x3.x7 -0.0731 0.2749 -0.27 0.79042
## x3.z 0.4718 0.7535 0.63 0.53147
## x4.x5 0.3161 0.1794 1.76 0.07879 .
## x4.x6 0.0606 0.1983 0.31 0.76007
## x4.x7 -0.0830 0.1389 -0.60 0.55066
## x4.z -0.3605 0.4160 -0.87 0.38665
## x5.x6 0.1525 0.1663 0.92 0.35948
## x5.x7 -0.3676 0.1580 -2.33 0.02046 *
## x5.z -0.1815 0.4491 -0.40 0.68630
## x6.x7 0.0812 0.1573 0.52 0.60578
## x6.z -0.3445 0.4477 -0.77 0.44201
## x7.z -0.3468 0.4025 -0.86 0.38938
## x1sq -0.1287 0.2552 -0.50 0.61433
## x2sq -0.0870 1.5249 -0.06 0.95451
## x3sq 0.0964 0.3242 0.30 0.76638
## [ reached getOption("max.print") -- omitted 3 rows ]
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.46 on 457 degrees of freedom
## Multiple R-squared: 0.953, Adjusted R-squared: 0.949
## F-statistic: 220 on 42 and 457 DF, p-value: <2e-16
fit.ml: model including only the main-effect terms.
fit.ml2: model including the main effects, pairwise interactions, and squared terms.
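Since fit.ml is nested in fit.ml2, the two models can be compared with a partial F-test:
anova(fit.ml, fit.ml2) # tests the added interaction and squared terms jointly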
fit.start = lm(y ~ 1, data = dat.ml) # model with only intercept
fit.full = lm(y ~ ., data = dat.ml) # model with all variables
full = formula(fit.full)
fit.forward = step(fit.start, direction = 'forward',
trace = 0, scope = full)
## if you want to keep z in every model
fit.forward1 = step(fit.start, direction='forward', trace = 0,
scope = list(lower = ~z, upper = full)) # error: the starting model (intercept only) does not contain z
fit.forward2 = step(lm(y ~ z, data = dat.ml), direction = 'forward',
trace = 0, scope = list(upper = full))
summary(fit.forward2)
##
## Call:
## lm(formula = y ~ z + x7 + x5 + x1 + x5sq + x4sq + x2 + x1.x2 +
## x1.x7 + x5.x7 + x5.z + x3.x4 + x4.x5 + x4 + x2.x5 + x5.x6 +
## x2.x7, data = dat.ml)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.162 -1.461 0.061 1.490 7.825
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.6521 0.4278 41.26 < 2e-16 ***
## z 10.7754 0.3930 27.42 < 2e-16 ***
## x7 3.4145 0.1184 28.83 < 2e-16 ***
## x5 -5.4387 0.4209 -12.92 < 2e-16 ***
## x1 3.3516 0.2788 12.02 < 2e-16 ***
## x5sq 0.4910 0.1120 4.38 1.4e-05 ***
## x4sq 0.0508 0.1146 0.44 0.6576
## x2 1.2145 0.1875 6.48 2.3e-10 ***
## x1.x2 -0.5623 0.1312 -4.29 2.2e-05 ***
## x1.x7 0.4323 0.1935 2.23 0.0259 *
## x5.x7 -0.3361 0.1178 -2.85 0.0045 **
## x5.z -0.3803 0.3129 -1.22 0.2247
## x3.x4 -0.2677 0.1087 -2.46 0.0142 *
## x4.x5 0.3270 0.1264 2.59 0.0100 **
## x4 -0.7852 0.3804 -2.06 0.0395 *
## x2.x5 -0.2843 0.1719 -1.65 0.0988 .
## x5.x6 0.1827 0.1177 1.55 0.1214
## x2.x7 0.2800 0.2007 1.40 0.1636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.43 on 482 degrees of freedom
## Multiple R-squared: 0.952, Adjusted R-squared: 0.95
## F-statistic: 559 on 17 and 482 DF, p-value: <2e-16
fit.backward = step(fit.full, direction='backward',
trace = 0, scope=list(lower = ~z))
summary(fit.backward)
##
## Call:
## lm(formula = y ~ z + x1 + x2 + x4 + x5 + x7 + x1.x2 + x1.x7 +
## x2.x5 + x2.x6 + x3.x4 + x4.x5 + x5.x6 + x5.x7 + x6.z + x5sq,
## data = dat.ml)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.099 -1.573 0.036 1.443 7.801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.769 0.356 49.94 < 2e-16 ***
## z 10.912 0.383 28.48 < 2e-16 ***
## x1 3.208 0.257 12.50 < 2e-16 ***
## x2 1.231 0.185 6.67 7.1e-11 ***
## x4 -0.613 0.129 -4.74 2.8e-06 ***
## x5 -5.317 0.419 -12.70 < 2e-16 ***
## x7 3.433 0.118 29.05 < 2e-16 ***
## x1.x2 -0.503 0.124 -4.06 5.6e-05 ***
## x1.x7 0.638 0.122 5.21 2.8e-07 ***
## x2.x5 -0.477 0.155 -3.09 0.00214 **
## x2.x6 0.249 0.168 1.49 0.13742
## x3.x4 -0.258 0.102 -2.52 0.01214 *
## x4.x5 0.326 0.123 2.65 0.00825 **
## x5.x6 0.252 0.120 2.10 0.03644 *
## x5.x7 -0.324 0.116 -2.79 0.00555 **
## x6.z -0.523 0.234 -2.24 0.02579 *
## x5sq 0.440 0.114 3.88 0.00012 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.42 on 483 degrees of freedom
## Multiple R-squared: 0.952, Adjusted R-squared: 0.95
## F-statistic: 597 on 16 and 483 DF, p-value: <2e-16
fit.step = step(fit.full, direction='both',
trace = 0, scope = list(lower = ~z))
summary(fit.step) # this is exactly the same as backward
##
## Call:
## lm(formula = y ~ z + x1 + x2 + x4 + x5 + x7 + x1.x2 + x1.x7 +
## x2.x5 + x2.x6 + x3.x4 + x4.x5 + x5.x6 + x5.x7 + x6.z + x5sq,
## data = dat.ml)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.099 -1.573 0.036 1.443 7.801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.769 0.356 49.94 < 2e-16 ***
## z 10.912 0.383 28.48 < 2e-16 ***
## x1 3.208 0.257 12.50 < 2e-16 ***
## x2 1.231 0.185 6.67 7.1e-11 ***
## x4 -0.613 0.129 -4.74 2.8e-06 ***
## x5 -5.317 0.419 -12.70 < 2e-16 ***
## x7 3.433 0.118 29.05 < 2e-16 ***
## x1.x2 -0.503 0.124 -4.06 5.6e-05 ***
## x1.x7 0.638 0.122 5.21 2.8e-07 ***
## x2.x5 -0.477 0.155 -3.09 0.00214 **
## x2.x6 0.249 0.168 1.49 0.13742
## x3.x4 -0.258 0.102 -2.52 0.01214 *
## x4.x5 0.326 0.123 2.65 0.00825 **
## x5.x6 0.252 0.120 2.10 0.03644 *
## x5.x7 -0.324 0.116 -2.79 0.00555 **
## x6.z -0.523 0.234 -2.24 0.02579 *
## x5sq 0.440 0.114 3.88 0.00012 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.42 on 483 degrees of freedom
## Multiple R-squared: 0.952, Adjusted R-squared: 0.95
## F-statistic: 597 on 16 and 483 DF, p-value: <2e-16
fit.step1 = step(lm(y ~ z, data= dat.ml), direction='both',
trace = 0, scope = list(lower = ~z, upper = full))
summary(fit.step1)
##
## Call:
## lm(formula = y ~ z + x7 + x5 + x1 + x5sq + x2 + x1.x2 + x1.x7 +
## x5.x7 + x3.x4 + x4.x5 + x4 + x2.x5 + x5.x6 + x6.z + x2.x6,
## data = dat.ml)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.099 -1.573 0.036 1.443 7.801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.769 0.356 49.94 < 2e-16 ***
## z 10.912 0.383 28.48 < 2e-16 ***
## x7 3.433 0.118 29.05 < 2e-16 ***
## x5 -5.317 0.419 -12.70 < 2e-16 ***
## x1 3.208 0.257 12.50 < 2e-16 ***
## x5sq 0.440 0.114 3.88 0.00012 ***
## x2 1.231 0.185 6.67 7.1e-11 ***
## x1.x2 -0.503 0.124 -4.06 5.6e-05 ***
## x1.x7 0.638 0.122 5.21 2.8e-07 ***
## x5.x7 -0.324 0.116 -2.79 0.00555 **
## x3.x4 -0.258 0.102 -2.52 0.01214 *
## x4.x5 0.326 0.123 2.65 0.00825 **
## x4 -0.613 0.129 -4.74 2.8e-06 ***
## x2.x5 -0.477 0.155 -3.09 0.00214 **
## x5.x6 0.252 0.120 2.10 0.03644 *
## x6.z -0.523 0.234 -2.24 0.02579 *
## x2.x6 0.249 0.168 1.49 0.13742
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.42 on 483 degrees of freedom
## Multiple R-squared: 0.952, Adjusted R-squared: 0.95
## F-statistic: 597 on 16 and 483 DF, p-value: <2e-16
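The competing stepwise models can also be compared directly on AIC:
AIC(fit.forward2, fit.backward, fit.step, fit.step1) # lower AIC is better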
Here we demonstrate the code to grow and prune classification and regression trees (CART).
Install and load the packages rpart and rpart.plot.
install.packages("rpart")
install.packages("rpart.plot")
# using main effects only
library('rpart') # CART
library('rpart.plot') # Plotting for CART
# set up the random-number-generator seed for cross validation (CV)
set.seed(0)
fit.cart = rpart(y ~ ., data = x1, method = "anova") # x1 holds the main-effect predictors; y is taken from the workspace
summary(fit.cart)
fit.cart$cptable
## CP nsplit rel error xerror xstd
## 1 0.688354 0 1.00000 1.00240 0.048945
## 2 0.055184 1 0.31165 0.31519 0.019071
## 3 0.033570 2 0.25646 0.28043 0.016474
## 4 0.029894 3 0.22289 0.25053 0.014541
## 5 0.024777 4 0.19300 0.22507 0.013234
## 6 0.019229 5 0.16822 0.19354 0.012132
## 7 0.011355 6 0.14899 0.18473 0.011369
## 8 0.010071 7 0.13764 0.17936 0.011187
## 9 0.010000 8 0.12757 0.17532 0.011164
plotcp(fit.cart) # plot of CP table
The dotted line in the CP plot marks the threshold for the one-standard-deviation rule: the cross-validation error of the model with the smallest CV error plus one standard error.
When using the CP table to choose the tree size, there are two common approaches:
Use the tree that minimizes the cross-validation error.
Use the one-standard-deviation rule: choose the smallest tree whose cross-validation error is within one standard error of the minimum.
Here the CP with the minimum CV error (shown as "xerror") has \(nsplit=8\).
If we use approach \(1\) above, the tree keeps all \(8\) splits.
However, if we use the one-standard-deviation rule, we can choose row \(7\), i.e., a tree with \(6\) splits, which gives a smaller tree. Let us use the \(7^{th}\) row and apply \(CP=0.01135508\) for pruning.
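The one-standard-deviation choice can also be extracted from the CP table programmatically; a minimal sketch (cpt, i.min, thresh, and cp.1se are names introduced here for illustration):
cpt = fit.cart$cptable
i.min = which.min(cpt[, "xerror"]) # row with the smallest CV error
thresh = cpt[i.min, "xerror"] + cpt[i.min, "xstd"] # 1-SE threshold
cp.1se = cpt[which(cpt[, "xerror"] <= thresh)[1], "CP"] # smallest tree within 1 SE
cp.1se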
The tree can be plotted as
rpart.plot(fit.cart) # tree plot
* Note that the size of the tree (number of terminal nodes) \(=\) nsplit \(+1\).
vi.cart = fit.cart$variable.importance/sum(fit.cart$variable.importance)*100
vi.cart
## x1 z x3 x2 x7 x4 x5 x6
## 25.8944 24.9450 17.6825 15.6689 6.3012 4.5259 3.4237 1.5583
length(vi.cart)
## [1] 8
sum(vi.cart>1) # Number of variables with VI > 1
## [1] 8
prune.cart = prune(fit.cart, cp= fit.cart$cptable[7,"CP"])
rpart.plot(prune.cart) # tree plot
vi.prune = prune.cart$variable.importance/sum(prune.cart$variable.importance)*100
vi.prune
## x1 z x3 x2 x7 x4 x5 x6
## 25.8698 25.2775 17.7795 15.7391 6.3390 4.5862 3.0523 1.3566
length(vi.prune)
## [1] 8
The default CP is \(0.01\) and the default CV is \(10\)-fold. We can change them using rpart.control
(for help see ?rpart.control).
Below we fit a tree with \(CP=0.008\) and \(5\)-fold cross validation.
set.seed(0) # set up the random-number-generator seed for
# cross validation (CV)
fit.cart = rpart(y ~ ., data = x1, method = "anova",
control = rpart.control(cp = .008, xval = 5))
# cp=0.008, CV is 5-fold
fit.cart$cptable # the CP table
## CP nsplit rel error xerror xstd
## 1 0.688354 0 1.00000 1.00087 0.048929
## 2 0.055184 1 0.31165 0.31530 0.019081
## 3 0.033570 2 0.25646 0.27710 0.017097
## 4 0.029894 3 0.22289 0.23527 0.014224
## 5 0.024777 4 0.19300 0.22544 0.013696
## 6 0.019229 5 0.16822 0.18296 0.011661
## 7 0.011355 6 0.14899 0.17742 0.011058
## 8 0.010071 7 0.13764 0.17163 0.010930
## 9 0.008000 8 0.12757 0.16382 0.010406
plotcp(fit.cart) # plot of CP table
rpart.plot(fit.cart)
vi.cart=fit.cart$variable.importance/sum(fit.cart$variable.importance)*100
vi.cart
## x1 z x3 x2 x7 x4 x5 x6
## 25.8944 24.9450 17.6825 15.6689 6.3012 4.5259 3.4237 1.5583
length(vi.cart)
## [1] 8
sum(vi.cart>1) # number of variables with VI > 1
## [1] 8
prune.cart = prune(fit.cart, cp = fit.cart$cptable[8,"CP"])
rpart.plot(prune.cart) # tree plot
vi.prune = prune.cart$variable.importance/sum(prune.cart$variable.importance)*100
#the variable importance, rescaled to add to 100.
vi.prune
## x1 z x3 x2 x7 x4 x5 x6
## 25.7113 25.1226 17.6706 15.6427 6.3001 4.5581 3.4481 1.5464
length(vi.prune)
## [1] 8
sum(vi.prune>1) # number of variables with VI>1 in the pruned tree
## [1] 8
install.packages("rsample")
install.packages("randomForest")
install.packages("ranger")
install.packages("caret")
library(rsample) # data splitting
library(randomForest) # basic implementation
library(ranger) # a faster implementation of randomForest
library(caret) # an aggregator package for performing
# many machine learning models
# Use set.seed for reproducibility.
set.seed(1234)
data_split = initial_split(dat.all)
# default split is 3/4 training and 1/4 testing; change it with, e.g., prop = 0.5
data_split
## <375/125/500>
data_train = training(data_split)
data_test = testing(data_split)
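A quick check of the resulting split sizes:
dim(data_train) # 375 rows (3/4 of the data)
dim(data_test) # 125 rows (1/4 of the data)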
We fit a default random forest model with the randomForest function (for help see ?randomForest).
set.seed(1234) # for reproducibility
# default RF model
m1 = randomForest(y ~ ., data = data_train)
m1
##
## Call:
## randomForest(formula = y ~ ., data = data_train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 14
##
## Mean of squared residuals: 10.23
## % Var explained: 91.27
plot(m1)
# which.min(m1$mse) gives the index of the tree with the lowest MSE
sqrt(m1$mse[which.min(m1$mse)]) # RMSE at that point
## [1] 3.1985
To tune the randomForest model, we first build the vector of feature names: all columns of data_train except y.
features = setdiff(names(data_train), "y") # set difference between the column names and "y"
tuneRF (for help see ?tuneRF) starts at the mtry value you supply and increases it by a step factor until the out-of-bag (OOB) error stops improving by a specified amount.
For example, below it starts with mtry = 5 and increases by a factor of \(1.5\) until the OOB error stops improving by \(1\%\).
Here mtry is the number of variables randomly selected as candidates for splitting a node (see ?randomForest).
set.seed(1234)
m2 = tuneRF(x = data_train[features],
y = data_train$y,
ntreeTry = 500,
mtryStart = 5,
stepFactor = 1.5,
improve = 0.01,
trace = FALSE # to not show real-time progress
)
## -0.078966 0.01
## 0.050766 0.01
## 0.0074674 0.01
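tuneRF returns a matrix of the mtry values tried and their OOB errors; a minimal sketch of extracting the best value (assuming the usual two-column layout of mtry and OOB error):
m2[which.min(m2[, 2]), 1] # mtry with the smallest OOB error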
#### Full grid search with ranger
hyper_grid = expand.grid(mtry = seq(5, 17, by = 2),
node_size = seq(3, 9, by = 2),
sample_size = c(.55, .632, .70, .80),
OOB_RMSE = 0)
# total number of combinations
nrow(hyper_grid) # 7x4x4
## [1] 112
for(i in 1:nrow(hyper_grid)){
# train model
model = ranger(y ~ .,
data = data_train,
num.trees = 500,
mtry = hyper_grid$mtry[i],
min.node.size = hyper_grid$node_size[i],
sample.fraction = hyper_grid$sample_size[i],
seed = 1234)
# add OOB error to grid
hyper_grid$OOB_RMSE[i] = sqrt(model$prediction.error)
}
Here we show the 10 random forest models with the lowest OOB_RMSE.
Note: %>% is the pipe operator. It chains function calls, so that \(f(g(x))\) can be written as x %>% g() %>% f().
library(dplyr) # provides the pipe operator %>% and arrange()
hyper_grid_sorted = hyper_grid %>%
dplyr::arrange(OOB_RMSE) %>%
head(10)
hyper_grid_sorted
## mtry node_size sample_size OOB_RMSE
## 1 15 3 0.8 3.1927
## 2 15 5 0.8 3.1958
## 3 13 5 0.8 3.2013
## 4 17 3 0.8 3.2066
## 5 17 7 0.8 3.2066
## 6 17 5 0.8 3.2101
## 7 13 3 0.8 3.2112
## 8 17 9 0.8 3.2165
## 9 15 7 0.8 3.2167
## 10 15 9 0.8 3.2219
mtry_1 = hyper_grid_sorted$mtry[1]
node_size_1 = hyper_grid_sorted$node_size[1]
sample_size_1 = hyper_grid_sorted$sample_size[1]
The best random forest model we have found uses mtry
= 15, a minimum terminal node size of 3 observations, and a sample fraction of 80 \(\%\).
Let us refit this model repeatedly to get a better estimate of its error rate.
OOB_RMSE = vector(mode = "numeric", length = 100)
for(i in seq_along(OOB_RMSE)){
optimal_ranger = ranger(y ~ .,
data = data_train,
num.trees = 500,
mtry = mtry_1,
min.node.size = node_size_1,
sample.fraction = sample_size_1,
importance = 'impurity'
)
OOB_RMSE[i] = sqrt(optimal_ranger$prediction.error)
}
hist(OOB_RMSE, breaks = 20)
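The spread of the repeated OOB estimates can also be summarized numerically:
summary(OOB_RMSE)
sd(OOB_RMSE) # variability of the OOB RMSE across the 100 refits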
attributes(m1) # m1: Random Forest before tuning
attributes(m2) # m2: Random Forest after tuning
attributes(optimal_ranger) # Best random forest model with OOB_RMSE
optimal_ranger$variable.importance # variable importance
## x1 x2 x3 x4 x5 x6 x7 z
## 7081.423 156.350 400.374 51.085 783.338 78.949 822.483 5578.699
## x1.x2 x1.x3 x1.x4 x1.x5 x1.x6 x1.x7 x1.z x2.x3
## 164.806 253.107 50.409 852.496 120.272 1256.884 1700.209 118.929
## x2.x4 x2.x5 x2.x6 x2.x7 x2.z x3.x4 x3.x5 x3.x6
## 88.945 302.498 77.020 251.516 1027.427 63.531 273.633 75.270
## x3.x7 x3.z x4.x5 x4.x6 x4.x7 x4.z x5.x6 x5.x7
## 479.123 517.400 115.190 75.181 96.963 22.360 115.820 686.749
## x5.z x6.x7 x6.z x7.z x1sq x2sq x3sq x4sq
## 1580.116 124.500 241.278 1007.953 6753.955 161.893 275.829 47.534
## x5sq x6sq
## 789.629 78.746
sort(optimal_ranger$variable.importance) # sorted values of variable importance
## x4.z x4sq x1.x4 x4 x3.x4 x4.x6 x3.x6 x2.x6
## 22.360 47.534 50.409 51.085 63.531 75.181 75.270 77.020
## x6sq x6 x2.x4 x4.x7 x4.x5 x5.x6 x2.x3 x1.x6
## 78.746 78.949 88.945 96.963 115.190 115.820 118.929 120.272
## x6.x7 x2 x2sq x1.x2 x6.z x2.x7 x1.x3 x3.x5
## 124.500 156.350 161.893 164.806 241.278 251.516 253.107 273.633
## x3sq x2.x5 x3 x3.x7 x3.z x5.x7 x5 x5sq
## 275.829 302.498 400.374 479.123 517.400 686.749 783.338 789.629
## x7 x1.x5 x7.z x2.z x1.x7 x5.z x1.z z
## 822.483 852.496 1007.953 1027.427 1256.884 1580.116 1700.209 5578.699
## x1sq x1
## 6753.955 7081.423
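The top of the importance ranking can be visualized with a simple barplot; a minimal sketch (vip is a name introduced here for illustration):
vip = sort(optimal_ranger$variable.importance, decreasing = TRUE)[1:15]
barplot(rev(vip), horiz = TRUE, las = 1,
main = "Top 15 variables (impurity importance)")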
t(m1$importance) # importance of variables in m1
## x1 x2 x3 x4 x5 x6 x7 z x1.x2
## IncNodePurity 9013.5 206.66 520.96 61.372 1025.1 103.23 997.32 4315.6 215.99
## x1.x3 x1.x4 x1.x5 x1.x6 x1.x7 x1.z x2.x3 x2.x4 x2.x5 x2.x6
## IncNodePurity 364.26 67.655 1273 175.53 1501.1 2868 145.3 94.934 338.74 84.413
## x2.x7 x2.z x3.x4 x3.x5 x3.x6 x3.x7 x3.z x4.x5 x4.x6
## IncNodePurity 301.9 1523.4 79.698 306.23 118.87 629.41 868.83 140.04 100.67
## x4.x7 x4.z x5.x6 x5.x7 x5.z x6.x7 x6.z x7.z x1sq
## IncNodePurity 119.19 28.707 138.96 855.78 1934.7 145.72 176.88 1569.9 9646.4
## x2sq x3sq x4sq x5sq x6sq
## IncNodePurity 196.75 371.98 49.259 964.07 97.901
# predictions on the test set with m1 (randomForest)
pred_randomForest = predict(m1, data_test)
head(pred_randomForest)
## 4 10 17 19 22 36
## 20.5144 6.4587 19.1872 42.7845 36.0801 34.3820
# predictions on the test set with optimal_ranger (ranger)
pred_ranger = predict(optimal_ranger, data_test)
head(pred_ranger$predictions)
## [1] 20.2520 6.5332 19.3703 42.2882 36.3760 34.5744
head(data_test$y)
## [1] 15.60029 0.35266 21.49137 43.92268 34.88292 33.50203
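To compare the two fits on the held-out data, we can compute the test-set mean squared prediction error (MSPE); a minimal sketch (values not shown above):
mean((data_test$y - pred_randomForest)^2) # MSPE of the default randomForest
mean((data_test$y - pred_ranger$predictions)^2) # MSPE of the tuned ranger model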
In the following, we fit the
LASSO,
elastic net (E-Net),
adaptive LASSO, and
adaptive elastic net
using the R package gcdnet.
Note that the objective function in gcdnet
is \[\frac{1}{n}Loss(y,x,\beta) + \lambda P_{\mathbb{L}^1}(\beta) + \lambda_2 P_{\mathbb{L}^2}(\beta),\] where \(P_{\mathbb{L}^1}(\beta)\) and \(P_{\mathbb{L}^2}(\beta)\) are \(\mathbb{L}^1\) and \(\mathbb{L}^2\) penalties, respectively.
By default, gcdnet
chooses its own \(\lambda\) (the tuning parameter for \(\mathbb{L}^1\) penalty) sequence based on the given data.
install.packages("gcdnet")
# LASSO & E-Net
library('gcdnet') # for LASSO, adaptive LASSO, elastic net, and adaptive elastic net
## fold ID
set.seed(0)
n = nrow(x2)
foldid_mat = cbind(sample(n), rep(1:5, each = n/5)) #assign into 5 folds
X = data.matrix(x2)
Y = y
## LASSO
cv_lasso = cv.gcdnet(X, Y, lambda2 = 0, method="ls",
foldid = foldid_mat[, 2]) # calculate the CV error
plot(cv_lasso) # CV error = y axis
The CV plot reports two critical \(\lambda\) values:
Left dotted line: the \(\lambda\) value that minimizes the cross-validated mean squared error.
Right dotted line: the \(\lambda\) value with the greatest amount of shrinkage whose CV-MSE is within one standard error of the minimum.
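The value marked by the right dotted line can be extracted directly, assuming the cv.gcdnet object exposes lambda.1se in the same way glmnet does; the left one is cv_lasso$lambda.min, used below.
cv_lasso$lambda.1se # largest lambda whose CV error is within 1 SE of the minimum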
Get the \(\lambda\) with minimum CV error
lambda_opt_lasso = cv_lasso$lambda.min
lambda_opt_lasso
## [1] 0.04512
fit_lasso = gcdnet(X, Y, lambda = lambda_opt_lasso,
lambda2 = 0, method = "ls")
beta_lasso = coef(fit_lasso)
sum(beta_lasso!=0)
## [1] 23
row.names(beta_lasso)[which(beta_lasso!=0)]
## [1] "(Intercept)" "z" "x1" "x2" "x4"
## [6] "x5" "x6" "x7" "x1.x2" "x1.x3"
## [11] "x1.x7" "x2.x5" "x2.x7" "x3.x4" "x4.x5"
## [16] "x4.z" "x5.x6" "x5.x7" "x5.z" "x6.z"
## [21] "x3sq" "x4sq" "x5sq"
Y_pred_lasso = cbind(1, X)%*%as.vector(beta_lasso) # get predicted Y
cor(Y_pred_lasso, Y)^2 # R^2
## [,1]
## [1,] 0.95138
1 - (1- cor(Y_pred_lasso, Y)^2)*499/(499-(sum(beta_lasso!=0)-1)) #adj R^2
## [,1]
## [1,] 0.94914
mean((Y - Y_pred_lasso)^2) # MSE
## [1] 5.7276
order = order(abs(beta_lasso[,1]), decreasing = TRUE)
beta.lasso = data.frame(beta = beta_lasso[order],
row.names = rownames(beta_lasso)[order])
beta.lasso = subset(beta.lasso, beta.lasso[,1] != 0)
beta.lasso
## beta
## (Intercept) 18.496941
## z 10.508618
## x5 -4.408284
## x7 3.382300
## x1 3.344982
## x2 1.066532
## x1.x2 -0.496023
## x1.x7 0.424511
## x5.x7 -0.329496
## x5.x6 0.312452
## x5.z -0.298642
## x4 -0.265199
## x4.x5 0.252270
## x5sq 0.238471
## x3.x4 -0.229729
## x2.x5 -0.215640
## x6 -0.204192
## x2.x7 0.155193
## x4sq -0.085377
## x3sq 0.068312
## x1.x3 -0.060188
## x6.z -0.048129
## x4.z -0.038430
By default, cv.gcdnet is 5-fold; we can also change the fold assignment:
set.seed(0) # set up the random-number-generator seed for cross validation (CV)
n = nrow(data2)
foldid_mat = cbind(sample(n), rep(1:10, each = n/10)) # assign into 10 folds
cv_lasso = cv.gcdnet(X, Y, lambda2 = 0, method = "ls",
foldid = foldid_mat[, 2]) # calculate the CV error
plot(cv_lasso) # CV error = y axis
Alternatively, we can set the nfolds argument directly:
set.seed(0) # set up the random-number-generator seed for cross validation (CV)
cv_lasso = cv.gcdnet(X, Y, lambda2 = 0, nfolds = 10, method = "ls") # use 10-fold CV
plot(cv_lasso) # CV error = y axis
We introduce the E-NET method first, and then use the coefficient estimators obtained from E-NET to construct the adaptive weights in adaptive LASSO and adaptive E-NET.
We do tuning on \(\lambda_2\) (for \(\mathbb{L}^2\) penalty) from 0.0001 to 0.01 with an increment 0.001.
lambda2_list = seq(0.0001, 0.01, by=0.001) # candidate lambda2
num.lambda2 = length(lambda2_list) # the number of candidate lambda2
min_cv_enet = numeric(num.lambda2)
for(i in 1:length(lambda2_list)){
cv_enet = cv.gcdnet(X, Y, lambda2 = lambda2_list[i],
foldid = foldid_mat[, 2], method = "ls")
# lambda (for L1 penalty) will be automatically generated by default setting
min_cv_enet[i] = min(cv_enet$cvm) # collect minimum cv error for each lambda2 candidate
}
plot(lambda2_list, min_cv_enet, xlab = "lambda2", ylab = "minimum CV error", type = 'l')
lambda2_list[which.min(min_cv_enet)] # optimal lambda2, which gives the minimum CV error
## [1] 1e-04
Here from the plot we can see that the optimal \(\lambda_2\) is close to zero.
Sometimes the optimal \(\lambda_2\) is not close to zero; in any case we simply choose the \(\lambda_2\) that gives the lowest CV error.
Note: Choose an appropriate range for lambda2_list
based on the study.
lambda2_list[which.min(min_cv_enet)] # optimal lambda2, which gives the minimum CV error
## [1] 1e-04
lambda.list = cv_enet$lambda
lambda.list # candidate lambda (L1 penalty)
## [1] 27.6850512 25.2255875 22.9846159 20.9427259 19.0822319 17.3870190
## [7] 15.8424042 14.4350086 13.1526422 11.9841978 10.9195547 9.9494915
## [13] 9.0656061 8.2602427 7.5264255 6.8577986 6.2485706 5.6934649
## [19] 5.1876731 4.7268145 4.3068974 3.9242845 3.5756619 3.2580099
## [25] 2.9685773 2.7048571 2.4645650 2.2456199 2.0461252 1.8643531
## [31] 1.6987291 1.5478187 1.4103147 1.2850262 1.1708681 1.0668514
## [37] 0.9720752 0.8857187 0.8070339 0.7353393 0.6700137 0.6104916
## [43] 0.5562572 0.5068408 0.4618145 0.4207882 0.3834065 0.3493457
## [49] 0.3183108 0.2900330 0.2642672 0.2407905 0.2193993 0.1999085
## [55] 0.1821491 0.1659675 0.1512234 0.1377892 0.1255483 0.1143950
## [61] 0.1042324 0.0949727 0.0865356 0.0788480 0.0718434 0.0654610
## [67] 0.0596456 0.0543469 0.0495189 0.0451197 0.0411114 0.0374592
## [73] 0.0341314 0.0310993 0.0283365 0.0258192 0.0235255 0.0214355
## [79] 0.0195313 0.0177962 0.0162152 0.0147747 0.0134621 0.0122662
## [85] 0.0111765 0.0101836 0.0092789 0.0084546 0.0077035 0.0070192
## [91] 0.0063956 0.0058274 0.0053097 0.0048380 0.0044082 0.0040166
## [97] 0.0036598 0.0033347 0.0030384 0.0027685
log(lambda.list) # the values of log(lambda)
## [1] 3.320893 3.227859 3.134825 3.041791 2.948758 2.855724 2.762690
## [8] 2.669656 2.576623 2.483589 2.390555 2.297521 2.204488 2.111454
## [15] 2.018420 1.925386 1.832353 1.739319 1.646285 1.553252 1.460218
## [22] 1.367184 1.274150 1.181117 1.088083 0.995049 0.902015 0.808982
## [29] 0.715948 0.622914 0.529880 0.436847 0.343813 0.250779 0.157745
## [36] 0.064712 -0.028322 -0.121356 -0.214390 -0.307423 -0.400457 -0.493491
## [43] -0.586525 -0.679558 -0.772592 -0.865626 -0.958659 -1.051693 -1.144727
## [50] -1.237761 -1.330794 -1.423828 -1.516862 -1.609896 -1.702929 -1.795963
## [57] -1.888997 -1.982031 -2.075064 -2.168098 -2.261132 -2.354166 -2.447199
## [64] -2.540233 -2.633267 -2.726301 -2.819334 -2.912368 -3.005402 -3.098436
## [71] -3.191469 -3.284503 -3.377537 -3.470571 -3.563604 -3.656638 -3.749672
## [78] -3.842705 -3.935739 -4.028773 -4.121807 -4.214840 -4.307874 -4.400908
## [85] -4.493942 -4.586975 -4.680009 -4.773043 -4.866077 -4.959110 -5.052144
## [92] -5.145178 -5.238212 -5.331245 -5.424279 -5.517313 -5.610347 -5.703380
## [99] -5.796414 -5.889448
summary(log(lambda.list))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.89 -3.59 -1.28 -1.28 1.02 3.32
cv_enet = cv.gcdnet(X, Y, foldid = foldid_mat[, 2],
lambda2 = lambda2_list[which.min(min_cv_enet)],
method = "ls")
plot(cv_enet) # CV error given lambda and the optimal lambda2
lambda_opt_enet = cv_enet$lambda.min
lambda_opt_enet
## [1] 0.037459
summary(lambda.list)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0028 0.0277 0.2772 3.1161 2.7708 27.6851
Since \(\lambda_2\) is close to \(0\), the shrinkage result of E-Net is similar to that of LASSO.
Best model from E-NET
fit_enet = gcdnet(X, Y, lambda=lambda_opt_enet,
lambda2 = lambda2_list[which.min(min_cv_enet)],
method="ls")
beta_enet = coef(fit_enet)
beta_enet
## 43 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 18.3664736
## z 10.5310815
## x1 3.3514539
## x2 1.0697612
## x3 0.0000000
## x4 -0.3444847
## x5 -4.5354775
## x6 -0.1914132
## x7 3.3869822
## x1.x2 -0.5046198
## x1.x3 -0.0681791
## x1.x4 0.0000000
## x1.x5 0.0000000
## x1.x6 0.0000000
## x1.x7 0.4288911
## x1.z 0.0000000
## x2.x3 0.0000000
## x2.x4 0.0046394
## x2.x5 -0.2311197
## x2.x6 .
## x2.x7 0.1696795
## x2.z 0.0000000
## x3.x4 -0.2385012
## x3.x5 0.0000000
## x3.x6 0.0000000
## x3.x7 0.0000000
## x3.z 0.0000000
## x4.x5 0.2711996
## x4.x6 0.0000000
## x4.x7 -0.0043092
## x4.z -0.0563444
## x5.x6 0.3002017
## x5.x7 -0.3301774
## x5.z -0.3022551
## x6.x7 0.0000000
## x6.z -0.0670827
## x7.z 0.0000000
## x1sq 0.0000000
## x2sq 0.0000000
## x3sq 0.0721358
## x4sq -0.0620423
## x5sq 0.2715471
## x6sq .
sum(beta_enet!=0) # number of terms selected including intercept
## [1] 25
row.names(beta_enet)[which(beta_enet != 0)] # names of terms being selected
## [1] "(Intercept)" "z" "x1" "x2" "x4"
## [6] "x5" "x6" "x7" "x1.x2" "x1.x3"
## [11] "x1.x7" "x2.x4" "x2.x5" "x2.x7" "x3.x4"
## [16] "x4.x5" "x4.x7" "x4.z" "x5.x6" "x5.x7"
## [21] "x5.z" "x6.z" "x3sq" "x4sq" "x5sq"
Y_pred_enet = cbind(1, X)%*%as.vector(beta_enet) #calculate the predicted Y
cor(Y_pred_enet, Y)^2 # R^2
## [,1]
## [1,] 0.95161
1 - (1- cor(Y_pred_enet, Y)^2)*499/(499- (sum(beta_enet!=0)-1)) #adj R^2
## [,1]
## [1,] 0.94916
mean((Y - Y_pred_enet)^2) # MSE
## [1] 5.6977
order = order(abs(beta_enet[,1]), decreasing = TRUE)
beta.enet = data.frame(beta = beta_enet[order], row.names = rownames(beta_enet)[order])
beta.enet = subset(beta.enet, beta.enet[,1] != 0)
t(beta.enet)
## (Intercept) z x5 x7 x1 x2 x1.x2 x1.x7 x4
## beta 18.366 10.531 -4.5355 3.387 3.3515 1.0698 -0.50462 0.42889 -0.34448
## x5.x7 x5.z x5.x6 x5sq x4.x5 x3.x4 x2.x5 x6 x2.x7
## beta -0.33018 -0.30226 0.3002 0.27155 0.2712 -0.2385 -0.23112 -0.19141 0.16968
## x3sq x1.x3 x6.z x4sq x4.z x2.x4 x4.x7
## beta 0.072136 -0.068179 -0.067083 -0.062042 -0.056344 0.0046394 -0.0043092
Before moving to the Adaptive Lasso and Adaptive E-NET, we first introduce the way to construct the adaptive weights.
The default way gcdnet handles the predictors \(X=(x_{ij})_{n \times p}\)
is to standardize them as \[x_{ij}^* = \frac{x_{ij} - \bar{x}_j}{\sqrt{\frac{1}{n}\sum\limits_{i=1}^n \big(x_{ij} - \bar{x}_j\big)^2}}, \qquad \bar{x}_j = \frac{1}{n}\sum\limits_{i=1}^n x_{ij}.\]
After fitting the regularized linear regression to get the \(\beta^*_j\) for the standardized predictors, gcdnet
returns the \(\beta_j\) for the original predictors as \[\beta_j = \frac{\beta^*_j}{\sqrt{\frac{1}{n}\sum\limits_{i=1}^n \big(x_{ij} - \bar{x}_j\big)^2}}.\]
Denote the \(\beta^*\) obtained for the standardized predictors from E-NET as \(\beta^*_{e-net}\). The adaptive weights for \(\beta^*\) in Adaptive LASSO and Adaptive E-NET are constructed as \[w_j = \Big(|\beta^*_{e-net,j}| + \frac{1}{n}\Big)^{-\gamma},\] as recommended by Zou and Zhang (2009).
Here, the parameter \(\gamma > \frac{2\nu}{1-\nu}\) with \(0 \leq \nu = \lim\limits_{n \longrightarrow \infty} \frac{\log p}{\log n} < 1\).
To avoid tuning on \(\gamma\), we can choose \(\gamma = \lceil \frac{2\nu}{1-\nu} \rceil + 1\) as shown in the simulations of Zou and Zhang (2009).
By default, gcdnet tunes \(\lambda\) (for the \(\mathbb{L}^1\) penalty) by choosing a proper candidate sequence for \(\lambda\) based on the given adaptive weights; thus the candidate sequence differs for different adaptive weights.
Therefore, we do not tune \(\gamma\), and fix \(\gamma = \lceil \frac{2\nu^*}{1-\nu^*} \rceil + 1\) with \(\nu^* = \frac{\log p}{\log n}\).
The standard error and p-value for each non-zero estimated coefficient are obtained from Theorem 3.3 of Zou and Zhang (2009).
Note that Adaptive E-NET reduces to Adaptive Lasso when \(\lambda_2=0\).
v_star = log(sum(beta_enet!=0)) / log(n)
gamma = ceiling(2*v_star/(1 - v_star)) + 1
gamma #4
## [1] 4
X_sd = apply(X, 2, sd)
weights_ad = (abs(beta_enet[-1]*X_sd) + 1/n)^(-gamma)
X_non0 = X[, which(beta_enet[-1]!=0)]
X_non0
## z x1 x2 x4 x5 x6 x7
## [1,] 0 -0.7418453 -1.0645840 -0.07070489 1.21312727 -0.2507142 0.94783681
## [2,] 0 -0.6257728 -0.1361881 0.13484876 -0.92179518 -0.7496121 -1.23822366
## [3,] 0 -0.6710988 -0.6263737 -0.21789651 -0.28611988 -0.9289527 -0.84245696
## [4,] 0 -0.3212268 0.1211559 -0.36865535 -0.32437430 -0.8542872 0.27917229
## [5,] 0 -0.6363128 -0.4901829 -0.99033380 -0.72886332 -0.9393591 -0.28421582
## [6,] 0 -0.7569199 -0.2973188 -0.37264361 0.05327005 -0.4208111 0.71837037
## [7,] 0 -0.1474909 0.0267849 -0.70773993 -0.86080081 -0.8672893 -0.76337724
## [8,] 1 1.8620285 2.2524479 0.05763310 -0.13230711 -0.2647147 -0.09190870
## x1.x2 x1.x3 x1.x7 x2.x4 x2.x5 x2.x7
## [1,] 0.78975659 0.72455615 -0.70314827 0.07527129 -1.2915e+00 -1.00905187
## [2,] 0.08522281 0.00453198 0.77484669 -0.01836480 1.2554e-01 0.16863134
## [3,] 0.42035866 0.48134490 0.56537183 0.13648465 1.7922e-01 0.52769292
## [4,] -0.03891852 0.05470810 -0.08967763 -0.04466477 -3.9300e-02 0.03382337
## [5,] 0.31190966 0.35252281 0.18085016 0.48544472 3.5728e-01 0.13931774
## [6,] 0.22504655 0.47163183 -0.54374883 0.11079397 -1.5838e-02 -0.21358505
## [7,] -0.00395053 -0.14684483 0.11259119 -0.01895676 -2.3056e-02 -0.02044700
## [8,] 4.19412218 4.60321424 -0.17113663 0.12981556 -2.9801e-01 -0.20701957
## x3.x4 x4.x5 x4.x7 x4.z x5.x6 x5.x7
## [1,] 0.06905707 -8.5774e-02 -0.06701670 0.0000000 -3.0415e-01 1.14984668
## [2,] -0.00097660 -1.2430e-01 -0.16697293 0.0000000 6.9099e-01 1.14138861
## [3,] 0.15628605 6.2345e-02 0.18356843 0.0000000 2.6579e-01 0.24104368
## [4,] 0.06278565 1.1958e-01 -0.10291836 0.0000000 2.7711e-01 -0.09055631
## [5,] 0.54865353 7.2182e-01 0.28146853 0.0000000 6.8466e-01 0.20715449
## [6,] 0.23219180 -1.9851e-02 -0.26769613 0.0000000 -2.2417e-02 0.03826762
## [7,] -0.70463980 6.0922e-01 0.54027255 0.0000000 7.4656e-01 0.65711574
## [8,] 0.14247769 -7.6253e-03 -0.00529698 0.0576331 3.5024e-02 0.01216018
## x5.z x6.z x3sq x4sq x5sq
## [1,] 0.00000000 0.0000000 0.0494618 1.3438837 6.0405842
## [2,] 0.00000000 0.0000000 1.2567755 1.7822517 0.0959079
## [3,] 0.00000000 0.0000000 0.2142764 1.0679362 0.9011325
## [4,] 0.00000000 0.0000000 0.9407060 0.8181383 0.8295384
## [5,] 0.00000000 0.0000000 0.3772662 0.1391066 0.2538245
## [6,] 0.00000000 0.0000000 0.3026980 0.8119812 1.6660614
## [7,] 0.00000000 0.0000000 4.2053295 0.3777209 0.1376855
## [8,] -0.13230711 -0.2647147 11.6922964 1.6103356 1.2189043
## [ reached getOption("max.print") -- omitted 492 rows ]
weights_ad_non0 = weights_ad[which(beta_enet[-1] != 0)]
weights_ad_non0 # only add weights to those beta_enet ne 0
## z x1 x2 x4 x5 x6 x7
## 1.3490e-03 7.9073e-03 7.5789e-01 6.9385e+01 2.3591e-03 7.1459e+02 7.5809e-03
## x1.x2 x1.x3 x1.x7 x2.x4 x2.x5 x2.x7 x3.x4
## 5.3222e+00 6.7012e+03 3.0726e+01 4.0729e+08 4.0477e+02 1.4080e+03 1.3643e+02
## x4.x5 x4.x7 x4.z x5.x6 x5.x7 x5.z x6.z
## 1.8351e+02 5.9291e+08 3.1866e+05 4.3767e+01 8.5082e+01 5.6918e+02 2.8021e+05
## x3sq x4sq x5sq
## 2.4737e+02 4.9200e+02 7.5214e-01
cv_adlasso = cv.gcdnet(X_non0, Y, foldid = foldid_mat[, 2],
lambda = seq(0.000001, 0.01, length = 1000),
lambda2 = 0, pf = weights_ad_non0, method="ls")
plot(cv_adlasso)
lambda_opt_adlasso = cv_adlasso$lambda.min
lambda_opt_adlasso
## [1] 7.1063e-05
We use the optimal \(\lambda\) to fit the adaptive LASSO
fit_adlasso = gcdnet(X_non0, Y, lambda = lambda_opt_adlasso,
lambda2 = 0, pf = weights_ad_non0, method="ls")
beta_adlasso = coef(fit_adlasso)
sum(beta_adlasso!=0)
## [1] 18
row.names(beta_adlasso)[which(beta_adlasso!=0)]
## [1] "(Intercept)" "z" "x1" "x2" "x4"
## [6] "x5" "x6" "x7" "x1.x2" "x1.x7"
## [11] "x2.x5" "x3.x4" "x4.x5" "x5.x6" "x5.x7"
## [16] "x5.z" "x3sq" "x5sq"
Y_pred_adlasso = cbind(1, X[, which(beta_enet[-1] != 0)])%*%as.vector(beta_adlasso)
cor(Y_pred_adlasso, Y)^2 # R^2
## [,1]
## [1,] 0.9517
1 - (1- cor(Y_pred_adlasso, Y)^2)*499/(499- 21) #adj R^2
## [,1]
## [1,] 0.94958
mean((Y - Y_pred_adlasso)^2) # MSE
## [1] 5.6769
# order the selected variables (with beta ne 0)
order = order(abs(beta_adlasso[,1]), decreasing = TRUE)
beta_adlasso = data.frame(beta = beta_adlasso[order],
row.names = rownames(beta_adlasso)[order])
beta_adlasso # beta of all terms
## beta
## (Intercept) 17.681284
## z 10.860967
## x5 -5.383510
## x7 3.416588
## x1 3.179394
## x2 1.171567
## x4 -0.625752
## x1.x7 0.621786
## x1.x2 -0.547950
## x5sq 0.457772
## x5.x7 -0.331631
## x2.x5 -0.326898
## x4.x5 0.302307
## x3.x4 -0.271014
## x5.x6 0.227288
## x5.z -0.155125
## x6 -0.092843
## x3sq 0.055568
## x1.x3 0.000000
## x2.x4 0.000000
## x2.x7 0.000000
## x4.x7 0.000000
## x4.z 0.000000
## x6.z 0.000000
## x4sq 0.000000
beta_adlasso = subset(beta_adlasso, beta_adlasso[,1] != 0)
beta_adlasso # beta of terms selected in E-Net
## beta
## (Intercept) 17.681284
## z 10.860967
## x5 -5.383510
## x7 3.416588
## x1 3.179394
## x2 1.171567
## x4 -0.625752
## x1.x7 0.621786
## x1.x2 -0.547950
## x5sq 0.457772
## x5.x7 -0.331631
## x2.x5 -0.326898
## x4.x5 0.302307
## x3.x4 -0.271014
## x5.x6 0.227288
## x5.z -0.155125
## x6 -0.092843
## x3sq 0.055568
# use the original order of beta
beta_adlasso = coef(fit_adlasso)
# estimated sigma squared
sigma2_adlasso = crossprod(Y - Y_pred_adlasso)/(n - sum(beta_adlasso != 0) - 1)
# Selected variables
X_adlasso = X_non0[, which(beta_adlasso[-1] != 0)]
# scale each column of X by its standard deviation (the main effects were already standardized earlier)
X_std_adlasso = X_adlasso / (rep(1, n)%*%t(apply(X_adlasso, 2, sd)))
# solve(): matrix inverse
var_std_non0_adlasso = solve(crossprod(X_std_adlasso))*c(sigma2_adlasso)
# variance matrix for the non-zero estimated coefficients
# for original predictors
var_non0_adlasso = diag(1/apply(X_adlasso, 2, sd))%*%
var_std_non0_adlasso%*%
diag(1/apply(X_adlasso, 2, sd))
# variance for the intercept
var_int_adlasso = apply(X_adlasso, 2, mean)%*%var_non0_adlasso%*%apply(X_adlasso, 2, mean)
adlasso_tab = matrix(0, sum(beta_adlasso != 0), 3)
rownames(adlasso_tab) = rownames(beta_adlasso)[which(beta_adlasso!=0)]
colnames(adlasso_tab) = c("beta", "sd", "pvalue")
# beta
adlasso_tab[, 1] = beta_adlasso[which(beta_adlasso!=0), ]
# sd
adlasso_tab[, 2] = c(sqrt(var_int_adlasso), sqrt(diag(var_non0_adlasso)))
# p-value
adlasso_tab[, 3] = sapply(abs(adlasso_tab[, 1])/adlasso_tab[, 2],
function(x) 2*(1 - pnorm(x)))
adlasso_tab
## beta sd pvalue
## (Intercept) 17.681284 0.104078 0.0000e+00
## z 10.860967 0.320660 0.0000e+00
## x1 3.179394 0.228261 0.0000e+00
## x2 1.171567 0.197919 3.2308e-09
## x4 -0.625752 0.129935 1.4655e-06
## x5 -5.383510 0.362426 0.0000e+00
## x6 -0.092843 0.152097 5.4158e-01
## x7 3.416588 0.118368 0.0000e+00
## x1.x2 -0.547950 0.123966 9.8622e-06
## x1.x7 0.621786 0.123339 4.6244e-07
## x2.x5 -0.326898 0.171740 5.6982e-02
## x3.x4 -0.271014 0.107386 1.1611e-02
## x4.x5 0.302307 0.124238 1.4962e-02
## x5.x6 0.227288 0.125521 7.0178e-02
## x5.x7 -0.331631 0.116859 4.5416e-03
## x5.z -0.155125 0.313063 6.2024e-01
## x3sq 0.055568 0.055741 3.1881e-01
## x5sq 0.457772 0.085515 8.6454e-08
# Compared with lm model's coefficients, sd and p-values
summary(lm(Y ~ X_adlasso))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.725547 0.378019 46.8906 1.0199e-181
## X_adlassoz 10.867208 0.383529 28.3348 1.1010e-104
## X_adlassox1 3.165926 0.261814 12.0923 1.3890e-29
## X_adlassox2 1.165311 0.198993 5.8560 8.7814e-09
## X_adlassox4 -0.617476 0.129623 -4.7636 2.5197e-06
## X_adlassox5 -5.115505 0.485473 -10.5372 1.6701e-23
## X_adlassox6 -0.209993 0.157326 -1.3348 1.8258e-01
## X_adlassox7 3.418793 0.118365 28.8836 3.2488e-107
## X_adlassox1.x2 -0.545100 0.129677 -4.2035 3.1331e-05
## X_adlassox1.x7 0.618979 0.123235 5.0228 7.1857e-07
## X_adlassox2.x5 -0.293320 0.171545 -1.7099 8.7933e-02
## X_adlassox3.x4 -0.298507 0.107164 -2.7855 5.5549e-03
## X_adlassox4.x5 0.340799 0.123951 2.7495 6.1936e-03
## X_adlassox5.x6 0.264618 0.131290 2.0155 4.4404e-02
## X_adlassox5.x7 -0.321261 0.117741 -2.7285 6.5940e-03
## X_adlassox5.z -0.413665 0.312421 -1.3241 1.8611e-01
## X_adlassox3sq 0.063818 0.057092 1.1178 2.6421e-01
## X_adlassox5sq 0.430063 0.121799 3.5309 4.5389e-04
lambda2_list_adenet = seq(0.001, 0.025, by = 0.001)
# using the same weights created for the adaptive LASSO
# list of lambda2
min_cv_adenet = numeric(length(lambda2_list_adenet))
for(i in 1:length(lambda2_list_adenet)){
# lambda (for L1 penalty) will be automatically generated by default setting
cv_adenet = cv.gcdnet(X_non0, Y, foldid = foldid_mat[, 2],
lambda2 = lambda2_list_adenet[i], pf = weights_ad_non0,
method="ls")
# collect minimum cv error for each lambda2 candidate
min_cv_adenet[i] = min(cv_adenet$cvm)
}
plot(lambda2_list_adenet, min_cv_adenet,xlab = "lambda2",
ylab = "MSPE (5-fold CV)", type = 'l' )
lambda2_list_adenet[which.min(min_cv_adenet)]
## [1] 0.001
# candidate lambda (for L1 penalty)
lambda.list = cv_adenet$lambda
lambda.list
## [1] 3299.15721 3006.06918 2739.01828 2495.69144 2273.98109 2071.96688
## [7] 1887.89905 1720.18330 1567.36696 1428.12640 1301.25559 1185.65563
## [13] 1080.32525 984.35213 896.90500 817.22643 744.62629 678.47574
## [19] 618.20183 563.28248 513.24202 467.64701 426.10254 388.24876
## [25] 353.75781 322.33094 293.69595 267.60481 243.83154 222.17022
## [31] 202.43323 184.44962 168.06362 153.13331 139.52937 127.13397
## [37] 115.83974 105.54885 96.17218 87.62851 79.84384 72.75073
## [43] 66.28776 60.39893 55.03326 50.14426 45.68958 41.63064
## [49] 37.93229 34.56249 31.49205 28.69439 26.14526 23.82258
## [55] 21.70625 19.77793 18.02091 16.41998 14.96128 13.63216
## [61] 12.42112 11.31766 10.31223 9.39612 8.56139 7.80082
## [67] 7.10782 6.47638 5.90104 5.37680 4.89914 4.46392
## [73] 4.06736 3.70602 3.37679 3.07681 2.80347 2.55442
## [79] 2.32749 2.12072 1.93232 1.76066 1.60425 1.46173
## [85] 1.33188 1.21356 1.10575 1.00751 0.91801 0.83646
## [91] 0.76215 0.69444 0.63275 0.57654 0.52532 0.47865
## [97] 0.43613 0.39738 0.36208 0.32992
log(lambda.list)
## [1] 8.1014223 8.0083886 7.9153548 7.8223211 7.7292874 7.6362536
## [7] 7.5432199 7.4501861 7.3571524 7.2641187 7.1710849 7.0780512
## [13] 6.9850174 6.8919837 6.7989499 6.7059162 6.6128825 6.5198487
## [19] 6.4268150 6.3337812 6.2407475 6.1477138 6.0546800 5.9616463
## [25] 5.8686125 5.7755788 5.6825451 5.5895113 5.4964776 5.4034438
## [31] 5.3104101 5.2173763 5.1243426 5.0313089 4.9382751 4.8452414
## [37] 4.7522076 4.6591739 4.5661402 4.4731064 4.3800727 4.2870389
## [43] 4.1940052 4.1009715 4.0079377 3.9149040 3.8218702 3.7288365
## [49] 3.6358027 3.5427690 3.4497353 3.3567015 3.2636678 3.1706340
## [55] 3.0776003 2.9845666 2.8915328 2.7984991 2.7054653 2.6124316
## [61] 2.5193979 2.4263641 2.3333304 2.2402966 2.1472629 2.0542292
## [67] 1.9611954 1.8681617 1.7751279 1.6820942 1.5890604 1.4960267
## [73] 1.4029930 1.3099592 1.2169255 1.1238917 1.0308580 0.9378243
## [79] 0.8447905 0.7517568 0.6587230 0.5656893 0.4726556 0.3796218
## [85] 0.2865881 0.1935543 0.1005206 0.0074868 -0.0855469 -0.1785806
## [91] -0.2716144 -0.3646481 -0.4576819 -0.5507156 -0.6437493 -0.7367831
## [97] -0.8298168 -0.9228506 -1.0158843 -1.1089180
summary(log(lambda.list))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.11 1.19 3.50 3.50 5.80 8.10
cv_adenet = cv.gcdnet(X_non0, Y, foldid = foldid_mat[, 2],
lambda2 = lambda2_list_adenet[which.min(min_cv_adenet)],
method = "ls", pf = weights_ad_non0)
# CV error given lambda and the optimal lambda2
plot(cv_adenet,ylab='MSPE (5-fold CV)')
lambda_opt_adenet = cv_adenet$lambda.min
lambda_opt_adenet
## [1] 0.32992
# optimal lambda2
lambda2_list_adenet[which.min(min_cv_adenet)]
## [1] 0.001
fit_adenet = gcdnet(X_non0, Y, lambda = lambda_opt_adenet,
lambda2 = lambda2_list_adenet[which.min(min_cv_adenet)],
method="ls", pf = weights_ad_non0)
beta_adenet = coef(fit_adenet)
sum(beta_adenet != 0)
## [1] 7
# names of selected terms
row.names(beta_adenet)[which(beta_adenet!=0)]
## [1] "(Intercept)" "z" "x1" "x2" "x5"
## [6] "x7" "x5sq"
Y_pred_adenet = cbind(1, X[, which(beta_enet[-1] != 0)])%*%as.vector(beta_adenet)
cor(Y_pred_adenet, Y)^2 # R^2
## [,1]
## [1,] 0.93822
1 - (1- cor(Y_pred_adenet, Y)^2)*499/(499- (sum(beta_adenet!=0)-1)) # adj R^2
## [,1]
## [1,] 0.93747
mean((Y - Y_pred_adenet)^2) # MSE
## [1] 7.2632
# order the selected variables (with beta ne 0)
order = order(abs(beta_adenet[,1]), decreasing = TRUE)
beta_adenet = data.frame(beta=beta_adenet[order],
row.names=rownames(beta_adenet)[order])
beta_adenet
## beta
## (Intercept) 17.36857
## z 11.59789
## x5 -4.96295
## x7 3.51399
## x1 3.20387
## x5sq 0.38117
## x2 0.23683
## x4 0.00000
## x6 0.00000
## x1.x2 0.00000
## x1.x3 0.00000
## x1.x7 0.00000
## x2.x4 0.00000
## x2.x5 0.00000
## x2.x7 0.00000
## x3.x4 0.00000
## x4.x5 0.00000
## x4.x7 0.00000
## x4.z 0.00000
## x5.x6 0.00000
## x5.x7 0.00000
## x5.z 0.00000
## x6.z 0.00000
## x3sq 0.00000
## x4sq 0.00000
# calculate standard errors and p-values
beta_adenet = coef(fit_adenet)
# estimated sigma squared
sigma2_adenet = crossprod(Y - Y_pred_adenet)/(n - sum(beta_adenet!=0) - 1)
# selected variables
X_adenet = X_non0[, which(beta_adenet[-1] != 0)]
# scale each column of X by its standard deviation (the main effects were already standardized earlier)
X_std_adenet = X_adenet/(rep(1, n)%*%t(apply(X_adenet, 2, sd)))
# variance matrix for the standardized non-zero estimated coefficients
var_std_non0_adenet = solve(crossprod(X_std_adenet))*c(sigma2_adenet)
# variance matrix for the non-zero estimated coefficients of the original predictors
var_non0_adenet = diag(1/apply(X_adenet, 2, sd))%*%
var_std_non0_adenet%*%
diag(1/apply(X_adenet, 2, sd))
# variance for the intercept
var_int_adenet = apply(X_adenet, 2, mean)%*%var_non0_adenet%*%apply(X_adenet, 2, mean)
adenet_tab = matrix(0, sum(beta_adenet != 0), 3)
rownames(adenet_tab) = rownames(beta_adenet)[which(beta_adenet != 0)]
colnames(adenet_tab) = c("beta", "sd", "pvalue")
# beta
adenet_tab[, 1] = beta_adenet[which(beta_adenet!=0), ]
# sd
adenet_tab[, 2] = c(sqrt(var_int_adenet), sqrt(diag(var_non0_adenet)))
# p-value
adenet_tab[, 3] = sapply(abs(adenet_tab[, 1])/adenet_tab[, 2],
function(x) 2*(1 - pnorm(x)))
adenet_tab
## beta sd pvalue
## (Intercept) 17.36857 0.11324 0.0000e+00
## z 11.59789 0.33346 0.0000e+00
## x1 3.20387 0.21385 0.0000e+00
## x2 0.23683 0.20005 2.3648e-01
## x5 -4.96295 0.26590 0.0000e+00
## x7 3.51399 0.12887 0.0000e+00
## x5sq 0.38117 0.06296 1.4124e-09
# Compared with lm model's coefficients, sd and p-values
summary(lm(Y ~ X_adenet))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.91909 0.32978 51.3034 7.8912e-200
## X_adenetz 11.42790 0.38335 29.8107 2.1240e-112
## X_adenetx1 2.75902 0.21995 12.5439 1.6099e-31
## X_adenetx2 0.90969 0.19750 4.6061 5.2291e-06
## X_adenetx5 -5.79184 0.39929 -14.5053 5.8393e-40
## X_adenetx7 3.50440 0.12727 27.5343 9.2190e-102
## X_adenetx5sq 0.58670 0.10065 5.8294 1.0066e-08
For any general data set, we now discuss how to effectively perform variable selection by splitting the data into a training set and a test set.
Below we demonstrate the case of E-NET, but a similar procedure can be used for the other variable selection approaches.
Split the data into training and test sets and name the variables \(X\) and \(y\) appropriately.
set.seed(5541)
train_obs = sample(seq_len(nrow(dat.all)), size = 250) # randomly select 250 obs as training obs
train = dat.all[train_obs,]
test = dat.all[-train_obs,]
# head(train)
# head(test)
# dim(train)
# dim(test)
y.train = train$y # y as the outcome in training data
y.test = test$y # y as the outcome in test data
xcov.train = train[,-1] # all predictors in training data
xcov.test = test[,-1] # all predictors in test data
names(xcov.train)
## [1] "x1" "x2" "x3" "x4" "x5" "x6" "x7" "z" "x1.x2"
## [10] "x1.x3" "x1.x4" "x1.x5" "x1.x6" "x1.x7" "x1.z" "x2.x3" "x2.x4" "x2.x5"
## [19] "x2.x6" "x2.x7" "x2.z" "x3.x4" "x3.x5" "x3.x6" "x3.x7" "x3.z" "x4.x5"
## [28] "x4.x6" "x4.x7" "x4.z" "x5.x6" "x5.x7" "x5.z" "x6.x7" "x6.z" "x7.z"
## [37] "x1sq" "x2sq" "x3sq" "x4sq" "x5sq" "x6sq"
## fold ID
set.seed(0)
n = nrow(train)
foldid_mat = cbind(sample(n), rep(1:5, each = n/5)) # assign into 5 folds
We now run E-NET on the training data set.
We can use a large list of values to identify \(\lambda_2\) that gives the minimum CV error.
lambda2_list = seq(0.01, 2.0, by = 0.01) # candidates for lambda2
num.lambda2 = length(lambda2_list) # the number of candidates for lambda2
min_cv=c()
for(i in 1:num.lambda2){
cv_enet = cv.gcdnet(xcov.train, y.train, lambda2 = lambda2_list[i],
foldid = foldid_mat[, 2], method = "ls")
# lambda (for L1 penalty) will be automatically generated by default setting
min_cv[i] = min(cv_enet$cvm) # collect minimum cv error for each lambda2 candidate
}
plot(lambda2_list, min_cv, xlab = "lambda2",ylab = "Minimum CV error", type = 'l')
lambda2_list[which.min(min_cv)] # optimal lambda2, which gives the minimum CV error
## [1] 0.01
lambda2_enet_opt = lambda2_list[which.min(min_cv)]
## lambda 1
# cross validation for lambda given the optimal lambda2
cv_enet = cv.gcdnet(xcov.train, y.train, foldid = foldid_mat[, 2],
lambda2 = lambda2_enet_opt, method = "ls")
plot(cv_enet) # CV error given lambda and the optimal lambda2
lambda_opt_enet = cv_enet$lambda.min
lambda_opt_enet
## [1] 0.023319
# Regression
fit_enet = gcdnet(xcov.train, y.train, lambda=lambda_opt_enet,
lambda2 = lambda2_enet_opt, method="ls")
beta_enet = coef(fit_enet)
beta_enet
## 43 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 18.1013731
## x1 3.7419862
## x2 0.8325687
## x3 0.4113393
## x4 -0.4171876
## x5 -4.1416765
## x6 -0.9002629
## x7 3.4021254
## z 10.3853736
## x1.x2 -0.3576881
## x1.x3 -0.3443246
## x1.x4 0.0085007
## x1.x5 0.0000000
## x1.x6 0.1211567
## x1.x7 0.3773487
## x1.z 0.0000000
## x2.x3 0.0000000
## x2.x4 0.1047705
## x2.x5 -0.2400901
## x2.x6 -0.0289408
## x2.x7 0.2712493
## x2.z -0.0761900
## x3.x4 -0.0945836
## x3.x5 0.0487351
## x3.x6 0.0000000
## x3.x7 0.1492009
## x3.z -0.6661710
## x4.x5 0.5323473
## x4.x6 -0.0474661
## x4.x7 -0.1712154
## x4.z -0.2157479
## x5.x6 0.3114071
## x5.x7 -0.3726135
## x5.z -0.5260048
## x6.x7 0.0304243
## x6.z -0.1990544
## x7.z 0.0000000
## x1sq 0.0000000
## x2sq 0.0000000
## x3sq 0.1331426
## x4sq -0.0270691
## x5sq 0.2452894
## x6sq 0.1348027
sum(beta_enet!=0) # number of terms selected including intercept
## [1] 36
row.names(beta_enet)[which(beta_enet!=0)] # names of terms being selected
## [1] "(Intercept)" "x1" "x2" "x3" "x4"
## [6] "x5" "x6" "x7" "z" "x1.x2"
## [11] "x1.x3" "x1.x4" "x1.x6" "x1.x7" "x2.x4"
## [16] "x2.x5" "x2.x6" "x2.x7" "x2.z" "x3.x4"
## [21] "x3.x5" "x3.x7" "x3.z" "x4.x5" "x4.x6"
## [26] "x4.x7" "x4.z" "x5.x6" "x5.x7" "x5.z"
## [31] "x6.x7" "x6.z" "x3sq" "x4sq" "x5sq"
## [36] "x6sq"
## note that X should be a matrix to compute predicted values
X.train = data.matrix(xcov.train)
Y_pred_train = cbind(1, X.train)%*%as.vector(beta_enet) # calculate the predicted Y
cor(Y_pred_train, y.train)^2 # R^2
## [,1]
## [1,] 0.95881
1 - (1- cor(Y_pred_train, y.train)^2)*249/(249- (sum(beta_enet!=0)-1)) # adj R^2
## [,1]
## [1,] 0.95208
mean((y.train - Y_pred_train)^2) # MSE
## [1] 5.1756
### calculate MSPE (mean squared prediction error) in the test set
X.test <- data.matrix(xcov.test)
Y_pred_test <- cbind(1, X.test)%*%as.vector(beta_enet) #calculate the predicted Y
cor(Y_pred_test, y.test)^2 # R^2
## [,1]
## [1,] 0.94252
1 - (1- cor(Y_pred_test, y.test)^2)*249/(249- (sum(beta_enet!=0)-1)) #adj R^2
## [,1]
## [1,] 0.93312
mean((y.test - Y_pred_test)^2) # MSPE
## [1] 6.5292
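Since the same three performance measures are computed for both the training and the test data, a small helper function can keep the two calculations consistent. This is a sketch, not part of the original code; it uses the same adjusted \(R^2\) convention, \(1-(1-R^2)\frac{n-1}{n-1-p}\), as above.
perf = function(beta, X, y) {
  # beta: fitted coefficient vector (including intercept); X: predictors; y: outcome
  yhat = cbind(1, data.matrix(X)) %*% as.vector(beta) # predicted values
  p = sum(beta != 0) - 1 # number of selected terms, excluding the intercept
  n = length(y)
  r2 = cor(yhat, y)^2
  adj.r2 = 1 - (1 - r2) * (n - 1) / (n - 1 - p)
  c(R2 = r2, adjR2 = adj.r2, MSE = mean((y - yhat)^2))
}
perf(beta_enet, xcov.train, y.train) # training R^2, adjusted R^2, MSE
perf(beta_enet, xcov.test, y.test)   # test R^2, adjusted R^2, MSPE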
Super Learner is an algorithm that uses cross-validation to estimate the performance of multiple machine learning models, or the same model with different settings.
It then creates an optimal weighted average of those models, i.e. an "ensemble", based on their cross-validated performance.
This approach has been proven to be asymptotically as accurate as the best possible prediction algorithm among those tested.
A tutorial is available in the Guide to SuperLearner vignette: https://cran.r-project.org/web/packages/SuperLearner/vignettes/Guide-to-SuperLearner.html
Install and load the packages SuperLearner, caret, glmnet and ggplot2.
## Windows users may need to install Rtools first
## Go to https://cran.r-project.org/bin/windows/Rtools/ and install Rtools first
install.packages(c("SuperLearner","caret", "glmnet", "ggplot2"))
We now fit the Super Learner using the training and test data created in the previous subsection.
library(ggplot2)
library(SuperLearner)
listWrappers() # list the available prediction algorithm and screening wrappers
## [1] "SL.bartMachine" "SL.bayesglm" "SL.biglasso"
## [4] "SL.caret" "SL.caret.rpart" "SL.cforest"
## [7] "SL.earth" "SL.extraTrees" "SL.gam"
## [10] "SL.gbm" "SL.glm" "SL.glm.interaction"
## [13] "SL.glmnet" "SL.ipredbagg" "SL.kernelKnn"
## [16] "SL.knn" "SL.ksvm" "SL.lda"
## [19] "SL.leekasso" "SL.lm" "SL.loess"
## [22] "SL.logreg" "SL.mean" "SL.nnet"
## [25] "SL.nnls" "SL.polymars" "SL.qda"
## [28] "SL.randomForest" "SL.ranger" "SL.ridge"
## [31] "SL.rpart" "SL.rpartPrune" "SL.speedglm"
## [34] "SL.speedlm" "SL.step" "SL.step.forward"
## [37] "SL.step.interaction" "SL.stepAIC" "SL.svm"
## [40] "SL.template" "SL.xgboost"
## [1] "All"
## [1] "screen.corP" "screen.corRank" "screen.glmnet"
## [4] "screen.randomForest" "screen.SIS" "screen.template"
## [7] "screen.ttest" "write.screen.template"
SL.library = c("SL.bartMachine", "SL.caret", "SL.randomForest", "SL.glm", "SL.gam",
"SL.glm.interaction", "SL.stepAIC" , "SL.glmnet", "SL.xgboost",
"SL.ipredbagg", "SL.nnls", "SL.ridge", "SL.svm")
## Reduce this library if you think some learners are not helpful. This
## will reduce the computation time.
SL.library = c("SL.randomForest", "SL.stepAIC", "SL.glmnet")
set.seed(1)
sl = SuperLearner(Y = y.train, X = xcov.train, SL.library = SL.library)
sl
##
## Call: SuperLearner(Y = y.train, X = xcov.train, SL.library = SL.library)
##
##
## Risk Coef
## SL.randomForest_All 11.0901 0.082102
## SL.stepAIC_All 6.9954 0.437002
## SL.glmnet_All 6.9197 0.480897
The Coef column shows how much weight Super Learner puts on each model in the weighted average.
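As a quick check (a sketch, not in the original code): with the default method.NNLS meta-learner, the ensemble prediction is just this weighted combination of the individual learners' predictions.
w = coef(sl) # the Coef column above
manual_pred = sl$library.predict %*% w # weighted combination of the learners' predictions
head(cbind(manual_pred, sl$SL.predict)) # the two columns should agree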
attributes(sl)
## $names
## [1] "call" "libraryNames" "SL.library"
## [4] "SL.predict" "coef" "library.predict"
## [7] "Z" "cvRisk" "family"
## [10] "fitLibrary" "cvFitLibrary" "varNames"
## [13] "validRows" "method" "whichScreen"
## [16] "control" "cvControl" "errorsInCVLibrary"
## [19] "errorsInLibrary" "metaOptimizer" "env"
## [22] "times"
##
## $class
## [1] "SuperLearner"
sl$times$everything
## user system elapsed
## 14.16 0.07 14.24
# onlySL = TRUE: compute predictions only from learners with non-zero weight, saving computation.
pred = predict(sl, xcov.test, onlySL = TRUE)
str(pred) # Check the structure of this prediction object.
## List of 2
## $ pred : num [1:250, 1] 12.2 16.1 19.7 35.6 20.5 ...
## $ library.predict: num [1:250, 1:3] 12.5 16.2 18.5 34.3 20.4 ...
summary(pred$library.predict) # Review the columns of $library.predict.
## V1 V2 V3
## Min. : 6.47 Min. : 2.4 Min. : 2.42
## 1st Qu.:14.30 1st Qu.:14.5 1st Qu.:14.48
## Median :19.91 Median :21.1 Median :20.72
## Mean :23.54 Mean :23.6 Mean :23.64
## 3rd Qu.:32.85 3rd Qu.:32.9 3rd Qu.:33.25
## Max. :45.39 Max. :46.3 Max. :45.74
qplot(pred$pred[, 1]) + theme_minimal() # Histogram of our predicted values.
summary(pred$pred[, 1]) # predicted value of y in test
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.89 14.43 20.70 23.61 33.06 45.68
summary(y.test) # observed value of y in test
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.353 15.606 22.280 23.969 32.431 46.538
qplot(y.test, pred$pred[, 1], geom=c("point","smooth"), method=lm)
plot(pred$pred[, 1], y.test)
cor(pred$pred[, 1], y.test)^2 # R^2
## [1] 0.94065
## MSPE
mean((y.test - pred$pred)^2) # MSPE
## [1] 6.8953
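The risks above are cross-validated within the training folds, but the ensemble weights are estimated on those same folds. For an external cross-validation of the whole ensemble, the package provides CV.SuperLearner; a sketch follows (it refits the ensemble V times, so it can be slow, and is not run here):
set.seed(1)
cv_sl = CV.SuperLearner(Y = y.train, X = xcov.train, V = 5, SL.library = SL.library)
summary(cv_sl) # CV risk for the ensemble and for each learner
plot(cv_sl) + theme_minimal()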
install.packages("BMA")
#### Bayesian model averaging (BMA)
library(BMA)
x.bma = x2 # x2: the design matrix with main effects, pairwise interactions, and squared terms created earlier
fit.bma = bicreg(x.bma, y, maxCol = ncol(x.bma)+1) # intercept is included by default.
summary(fit.bma)
##
## Call:
## bicreg(x = x.bma, y = y, maxCol = ncol(x.bma) + 1)
##
##
## 186 models were selected
## Best 5 models (cumulative posterior probability = 0.155 ):
##
## p!=0 EV SD model 1 model 2 model 3
## Intercept 100.0 1.76e+01 0.65558 17.605 17.537 17.535
## z 100.0 1.07e+01 0.42462 10.761 10.840 10.839
## x1 100.0 3.47e+00 0.53363 3.327 3.261 3.253
## x2 88.4 1.11e+00 0.47144 1.244 1.268 1.272
## x3 0.0 0.00e+00 0.00000 . . .
## x4 75.0 -5.57e-01 0.34553 -0.779 -0.772 -0.784
## x5 100.0 -5.74e+00 0.40679 -5.830 -5.617 -5.860
## x6 0.0 0.00e+00 0.00000 . . .
## x7 100.0 3.41e+00 0.11975 3.398 3.387 3.394
## x1.x2 93.3 -5.80e-01 0.21938 -0.622 -0.609 -0.604
## x1.x3 0.4 -3.77e-04 0.01102 . . .
## x1.x4 2.1 -3.28e-03 0.02781 . . .
## x1.x5 1.1 -1.26e-03 0.02210 . . .
## x1.x6 0.2 -1.89e-04 0.00709 . . .
## x1.x7 88.2 5.48e-01 0.23699 0.618 0.634 0.646
## x1.z 1.3 -1.54e-02 0.16887 . . .
## x2.x3 0.0 0.00e+00 0.00000 . . .
## x2.x4 0.4 4.07e-04 0.01559 . . .
## x2.x5 30.7 -1.09e-01 0.18404 . . -0.277
## x2.x6 0.3 -2.50e-04 0.00785 . . .
## x2.x7 14.5 7.99e-02 0.20863 . . .
## x2.z 6.1 -6.46e-02 0.27752 . . .
## x3.x4 19.5 -4.39e-02 0.10095 . . .
## x3.x5 0.9 -1.42e-03 0.01885 . . .
## x3.x6 0.2 -1.57e-04 0.00659 . . .
## model 4 model 5
## Intercept 17.434 17.996
## z 10.871 10.825
## x1 3.274 3.226
## x2 1.273 1.223
## x3 . .
## x4 -0.797 .
## x5 -5.655 -5.623
## x6 . .
## x7 3.396 3.415
## x1.x2 -0.622 -0.587
## x1.x3 . .
## x1.x4 . .
## x1.x5 . .
## x1.x6 . .
## x1.x7 0.613 0.635
## x1.z . .
## x2.x3 . .
## x2.x4 . .
## x2.x5 . .
## x2.x6 . .
## x2.x7 . .
## x2.z . .
## x3.x4 . .
## x3.x5 . .
## x3.x6 . .
## [ reached getOption("max.print") -- omitted 23 rows ]
probne0.bma = fit.bma$probne0 # probability in percent
ord = order(probne0.bma, decreasing = TRUE) # sort by probne0 in decreasing order
probne0.bma = data.frame(probne0 = probne0.bma[ord], row.names = names(x.bma)[ord])
t(probne0.bma) # probability in percent
## z x1 x5 x7 x5sq x1.x2 x2 x1.x7 x5.x7 x4 x5.z x2.x5 x4sq x3.x4
## probne0 100 100 100 100 100 93.3 88.4 88.2 81.9 75 39 30.7 24.3 19.5
## x4.x5 x2.x7 x2sq x1sq x5.x6 x2.z x4.z x6.z x1.x4 x1.z x1.x5 x4.x6 x3.x5
## probne0 18 14.5 12.2 9.3 7 6.1 4.5 2.3 2.1 1.3 1.1 1.1 0.9
## x3.x7 x4.x7 x1.x3 x2.x4 x2.x6 x1.x6 x3.x6 x6.x7 x6sq x3 x6 x2.x3 x3.z
## probne0 0.8 0.8 0.4 0.4 0.3 0.2 0.2 0.2 0.2 0 0 0 0
## x7.z x3sq
## probne0 0 0
sum(probne0.bma > 5)
## [1] 20
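The BMA package can also display which terms enter the top models; a one-line sketch (not run in the original):
imageplot.bma(fit.bma) # rows are terms; each column is a selected model, with width proportional to its posterior probability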
postmean.bma = fit.bma$postmean
round(postmean.bma, digits=4) # get the coefficients
## (Intercept) z x1 x2 x3 x4
## 17.6407 10.7499 3.4729 1.1075 0.0000 -0.5574
## x5 x6 x7 x1.x2 x1.x3 x1.x4
## -5.7439 0.0000 3.4087 -0.5799 -0.0004 -0.0033
## x1.x5 x1.x6 x1.x7 x1.z x2.x3 x2.x4
## -0.0013 -0.0002 0.5479 -0.0154 0.0000 0.0004
## x2.x5 x2.x6 x2.x7 x2.z x3.x4 x3.x5
## -0.1090 -0.0002 0.0799 -0.0646 -0.0439 -0.0014
## x3.x6 x3.x7 x3.z x4.x5 x4.x6 x4.x7
## -0.0002 0.0032 0.0000 0.0511 0.0015 -0.0011
## x4.z x5.x6 x5.x7 x5.z x6.x7 x6.z
## -0.0238 0.0144 -0.2875 -0.2521 -0.0003 -0.0068
## x7.z x1sq x2sq x3sq x4sq x5sq
## 0.0000 -0.0341 0.1278 0.0000 -0.0521 0.5754
## x6sq
## 0.0001
postsd.bma = fit.bma$postsd
round(postsd.bma, digits=4) # get the sd
## [1] 0.6556 0.4246 0.5336 0.4714 0.0000 0.3455 0.4068 0.0000 0.1197 0.2194
## [11] 0.0110 0.0278 0.0221 0.0071 0.2370 0.1689 0.0000 0.0156 0.1840 0.0079
## [21] 0.2086 0.2775 0.1009 0.0188 0.0066 0.0422 0.0000 0.1235 0.0193 0.0162
## [31] 0.1311 0.0614 0.1743 0.3531 0.0084 0.0542 0.0000 0.1196 0.4028 0.0000
## [41] 0.0948 0.1013 0.0026
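The posterior SDs are printed as an unnamed vector in the same order as the posterior means. A small sketch (not part of the original code) lines the two up in one table:
bma.tab = data.frame(postmean = round(fit.bma$postmean, 4),
                     postsd = round(fit.bma$postsd, 4))
head(bma.tab) # row names follow the coefficient order shown above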
Principal component analysis (PCA) is a method for extracting important variables from a large set of variables.
The extracted components represent a major proportion of the variation in the data set and aim to capture its main patterns.
PCA produces a set of principal components that are linear combinations of the original variables.
Install and load the package FactoMineR.
install.packages("FactoMineR")
x1 = data2[, c(3:9)] # 7 predictors
library(FactoMineR)
result = PCA(x1) # graphs generated automatically
# conduct the PCA using base R's prcomp
pca_x = prcomp(x1, scale = TRUE, center = TRUE)
# centering and scaling standardize the variables so that no single variable dominates the components
names(pca_x)
## [1] "sdev" "rotation" "center" "scale" "x"
pca_x$var = (pca_x$sdev)^2
# To compute the proportion of variance explained by each component,
# we simply divide the variance by sum of total variance
pca_x$percentVar = (pca_x$var)/sum(pca_x$var)*100
# cumulative variance explained
pca_x$cumPerVar = cumsum(pca_x$percentVar)
# Scree plot: percent variance explained by each principal component
library(ggplot2)
pcaRes = pca_x # use a new name; reusing "y" here would overwrite the outcome vector
nVar = length(pcaRes$var)
varPlotData = data.frame(X = 1:nVar, Y = pcaRes$percentVar)
ggplot(varPlotData, aes(x = X, y = Y))+
geom_point()+
geom_line()+
geom_hline(yintercept = 0, colour = "gray65", linetype = 'dashed')+
geom_vline(xintercept = 0, colour = "gray65", linetype = 'dashed')+
theme_bw()+
labs (x = "No. of Principal Component", y = "Percent Variance" )
# or simply use the "plot" function to draw a scree plot
# plot(pca_x, type='l') # type = "lines"
plot(pca_x$cumPerVar, xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained", type = "b")
# biplot: observations and variables on the first two principal components
biplot(pca_x)
summary(pca_x)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.712 1.296 0.958 0.8316 0.6306 0.4694 0.4017
## Proportion of Variance 0.419 0.240 0.131 0.0988 0.0568 0.0315 0.0231
## Cumulative Proportion 0.419 0.659 0.790 0.8887 0.9455 0.9769 1.0000
# each principal component is a linear combination of the original variables
pca_x$rotation
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## x1 0.527308 -0.0430310 0.0029855 -0.245250 0.0978208 0.727094 -0.3488608
## x2 0.534988 0.0024486 0.0372759 -0.180592 -0.0534367 -0.670498 -0.4768178
## x3 0.540312 -0.0466598 0.0083292 -0.221124 0.0010474 -0.082509 0.8062953
## x4 0.271276 -0.2852336 0.5316786 0.748868 0.0031703 0.039045 0.0055813
## x5 0.067509 0.6602582 0.2542171 0.043954 -0.6942163 0.103984 0.0139405
## x6 0.054221 0.6701488 0.1784032 0.108238 0.7082157 -0.047403 0.0245172
## x7 0.250418 0.1719495 -0.7870172 0.532816 -0.0637080 0.019348 -0.0015428
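To make the phrase "linear combination" concrete, the component scores returned by prcomp can be reproduced by multiplying the centered and scaled data by this rotation (loading) matrix; a quick check, not in the original code:
scores_manual = scale(x1, center = pca_x$center, scale = pca_x$scale) %*% pca_x$rotation
all.equal(unname(scores_manual), unname(pca_x$x)) # TRUE: scores = standardized data times loadings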
The summary and plots show that PC1-PC5 together explain about 95% of the total variance.
Therefore, for modeling, we will use these five components as predictor variables.
# using PC1-PC5 as predictors
xnew = pca_x$x[,1:5]
# keep z in the model
summary(lm(y ~ z + xnew, data = data2))
##
## Call:
## lm(formula = y ~ z + xnew, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.017 -1.839 -0.034 1.711 9.405
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.176 0.209 87.11 <2e-16 ***
## z 11.974 0.390 30.74 <2e-16 ***
## xnewPC1 2.386 0.111 21.43 <2e-16 ***
## xnewPC2 -1.809 0.097 -18.65 <2e-16 ***
## xnewPC3 -3.880 0.131 -29.59 <2e-16 ***
## xnewPC4 0.252 0.155 1.63 0.1
## xnewPC5 2.421 0.199 12.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.81 on 493 degrees of freedom
## Multiple R-squared: 0.934, Adjusted R-squared: 0.933
## F-statistic: 1.16e+03 on 6 and 493 DF, p-value: <2e-16
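Note that scores for any new observations must be computed with the same centering, scaling, and rotation before they can be plugged into this regression. A minimal sketch, using the first five rows of x1 to stand in for "new" data:
new_scores = predict(pca_x, newdata = x1[1:5, ]) # applies the stored center, scale, and rotation
new_scores[, 1:5] # the PC1-PC5 values that would be used as xnew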
Synopsis
Environmental pollutant exposures have been associated with various health outcomes. Most epidemiologic studies have examined environmental pollutants individually, but in real life we are exposed to pollutant mixtures, not single chemicals. Although multi-pollutant approaches have gained recognition recently, analytical challenges remain, such as the large number of potential exposures of interest, high correlation among some of these exposures, and high dimensionality. New methods for epidemiologic analysis of mixtures are being developed, but it is not well understood how well they perform or how well they identify key, causal risk factors.
Dataset
exercise_data.csv: This is the dataset used in Park et al. (2017). It contains 9664 subjects and 40 variables from NHANES 2003-2014. See the Data Dictionary at the end.
Outcome: LBXSGTSI, gamma-glutamyl transferase (GGT)
Exposure: 20 metals measured in blood and urine
Potential confounders: age, sex, race/ethnicity, education, smoking status, body mass index (BMI), urinary creatinine
Objectives
To identify important predictors of GGT, using three different methods discussed above:
ONE from conventional methods (e.g., forward selection, backward elimination, stepwise)
ONE from regularized (shrinkage) regression methods (e.g., LASSO, ENET, adaptive LASSO, and adaptive ENET)
ONE from CART or random forests.
For the conventional and regularized regression methods, include quadratic terms and pairwise interactions among the 20 metal predictors (see the sketch below).
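A minimal sketch for building that design matrix, assuming the 20 metal variables are named as in the Data Dictionary below (lower-cased as in the earlier examples) and that missing values have already been handled:
ex = read.csv("exercise_data.csv")
names(ex) = tolower(names(ex))
metals = ex[, c("bpb", "bcd", "bhg", "utas", "uas3", "uas5", "uab", "uac", "udma", "umma",
                "uba", "ucd", "uco", "ucs", "umo", "upb", "usb", "utl", "utu", "uur")]
x.int = model.matrix(~ .^2, data = metals)[, -1] # 20 main effects + 190 pairwise interactions
x.sq = metals^2 # 20 quadratic terms
colnames(x.sq) = paste0(colnames(metals), "sq")
x.exercise = cbind(x.int, as.matrix(x.sq)) # 230 candidate terms for LASSO, ENET, etc.
dim(x.exercise)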
Variable Dictionary
No | Variable | Label |
---|---|---|
1 | seqn | Participant ID |
2 | cycle | NHANES cycle (3=2003-2004, 4=2005-2006, 5=2007-2008, 6=2009-2010, 7=2011-2012, 8=2013-2014) |
3 | sex | Sex (1 male, 2 female) |
4 | age | Age (years) |
5 | raceeth | Race/Ethnicity (1 Mexican American, 2 Other Hispanic, 3 Non-Hispanic white, 4 Non-Hispanic black, 5 Other) |
6 | psu | Primary Sampling Unit (PSU) |
7 | strata | Sampling Strata |
8 | bmi | Body Mass Index (kg/m2) |
9 | wt2yr | Two-year weights of subsample |
10 | ucr | Creatinine, urine (mg/dL) |
11 | ggt | Gamma-Glutamyl Transferase (GGT) (U/L) |
12 | bpb | Lead, blood (\(\mu g/dL\)) |
13 | bcd | Cadmium, blood (\(\mu g/L\)) |
14 | bhg | Total Mercury, blood (\(\mu g/L\)) |
15 | utas | Urinary total Arsenic (\(\mu g/L\)) |
16 | uas3 | Urinary Arsenous acid (\(\mu g/L\)) |
17 | uas5 | Urinary Arsenic acid (\(\mu g/L\)) |
18 | uab | Urinary Arsenobetaine (\(\mu g/L\)) |
19 | uac | Urinary Arsenocholine (\(\mu g/L\)) |
20 | udma | Urinary Dimethylarsinic acid (\(\mu g/L\)) |
21 | umma | Urinary Monomethylarsonic acid (\(\mu g/L\)) |
22 | uba | Barium, urine (\(\mu g/L\)) |
23 | ucd | Cadmium, urine (\(\mu g/L\)) |
24 | uco | Cobalt, urine (\(\mu g/L\)) |
25 | ucs | Cesium, urine (\(\mu g/L\)) |
26 | umo | Molybdenum, urine (\(\mu g/L\)) |
27 | upb | Lead, urine (\(\mu g/L\)) |
28 | usb | Antimony, urine (\(\mu g/L\)) |
29 | utl | Thallium, urine (\(\mu g/L\)) |
30 | utu | Tungsten, urine (\(\mu g/L\)) |
31 | uur | Uranium, urine (\(\mu g/L\)) |
32 | sbp | Systolic BP (mmHg) |
33 | dbp | Diastolic BP (mmHg) |
34 | htn | Hypertension status (1=yes, 0=no) |
35 | pmon | Person Months of Follow-up from Exam Date |
36 | d_total | Total mortality (0=alive, 1=death) |
37 | d_cvd | CVD mortality (0=alive, 1=death) |
38 | d_cancer | Cancer mortality (0=alive, 1=death) |
39 | smkstat | Smoking status (1 never, 2 former, 3 current) |
40 | educ | Education (1 <High School, 2 High School, 3 College+) |
Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics 2009;37(4):1733.
Park SK, Zhao Z, Mukherjee B. Construction of environmental risk score beyond standard linear models using machine learning methods: Application to metal mixtures, oxidative stress and cardiovascular disease in NHANES. Environmental Health 2017;16(1):102.