In this lecture we will cover methods for exploratory data analysis and some basic analysis with linear models. We will also cover several variable selection approaches, including step-wise selection, random forests, and regularization-based approaches.


1 Pre-Analysis

  • We will use the cleaned data set variable_selection_data.csv as our sample data for this lecture. This data set is included in the materials provided.

  • This data set has \(500\) subjects, and \(10\) variables including

    • \(1\) binary confounder \(z\),

    • \(7\) predictor variables (\(x_1,\ldots,x_7\)),

    • \(1\) response variable (\(y\)), and

    • the variable OBS indicating the observation ID.

1.1 Descriptive Statistics

setwd("C:/Users/username/Documents/CBSA") #setup working directory
# should be stored in the working directory
data = read.csv("variable_selection_data.csv") # read in data
#change the datanames from UPPER to LOWER
names(data) = tolower(names(data))
names(data)
##  [1] "obs" "y"   "x1"  "x2"  "x3"  "x4"  "x5"  "x6"  "x7"  "z"
# dimension of data
dim.data = dim(data)
dim.data
## [1] 500  10

1.1.1 For Binary Variable \(z\)

# frequency for categorical variable
table(data$z)
## 
##   0   1 
## 286 214
# percentage of each category
prop.table(table(data$z)) 
## 
##     0     1 
## 0.572 0.428
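  • A compact way to show the counts and percentages of \(z\) side by side (a sketch using the same table() call; z_tab is a hypothetical helper name):
# counts and percentages of z in one table
z_tab = table(data$z)
cbind(n = z_tab, pct = round(100 * prop.table(z_tab), 1))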

1.1.2 For Continuous Variables

data.cont = data[, -c(1,10)] # remove "obs" and "z"
names(data.cont)
## [1] "y"  "x1" "x2" "x3" "x4" "x5" "x6" "x7"
# simple way:
summary(data.cont) # gives min, q1, median, q3, max, and mean
##        y                x1               x2              x3        
##  Min.   : 0.353   Min.   :0.0482   Min.   :0.242   Min.   :0.0376  
##  1st Qu.:14.374   1st Qu.:0.4127   1st Qu.:0.660   1st Qu.:0.4404  
##  Median :21.099   Median :0.8606   Median :0.930   Median :0.8454  
##  Mean   :23.301   Mean   :1.1421   Mean   :0.993   Mean   :1.1278  
##  3rd Qu.:31.858   3rd Qu.:1.4920   3rd Qu.:1.286   3rd Qu.:1.5569  
##  Max.   :53.013   Max.   :4.6872   Max.   :2.467   Max.   :4.5648  
##        x4              x5               x6              x7        
##  Min.   :0.072   Min.   :0.0495   Min.   :0.078   Min.   :0.0427  
##  1st Qu.:0.618   1st Qu.:0.4900   1st Qu.:0.527   1st Qu.:0.4128  
##  Median :0.972   Median :0.9178   Median :0.916   Median :0.8444  
##  Mean   :1.220   Mean   :1.2372   Mean   :1.157   Mean   :1.1304  
##  3rd Qu.:1.537   3rd Qu.:1.6784   3rd Qu.:1.586   3rd Qu.:1.5120  
##  Max.   :4.652   Max.   :4.9680   Max.   :4.805   Max.   :4.8855
num.cont = dim(data.cont)[2] # 8 continuous variables
desc.cont = matrix(NA,num.cont, 4)
rownames(desc.cont) = names(data.cont)
colnames(desc.cont) = c('Mean', 'SD', '# of missing', "# in total")
desc.cont[,1] = apply(data.cont, 2, mean) # Mean
desc.cont[,2] = apply(data.cont, 2, sd) # Standard deviation (SD)

# number of missings and total available number
for(i in 1:num.cont){
  desc.cont[i,4] = length(na.omit(data.cont[,i])) # data available excluding missingness
  desc.cont[i,3] = dim.data[1]-desc.cont[i,4] # missingness
}
desc.cont
##       Mean       SD # of missing # in total
## y  23.3010 10.85264            0        500
## x1  1.1421  0.97913            0        500
## x2  0.9931  0.43424            0        500
## x3  1.1278  0.92698            0        500
## x4  1.2197  0.85501            0        500
## x5  1.2372  1.00616            0        500
## x6  1.1573  0.85140            0        500
## x7  1.1304  0.97948            0        500
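  • The same summary table can be built more compactly with sapply (a sketch; desc.cont2 is a hypothetical name, not part of the original code):
# mean, SD, number missing, and number observed for each continuous variable
desc.cont2 = t(sapply(data.cont, function(v)
  c(Mean = mean(v, na.rm = TRUE), SD = sd(v, na.rm = TRUE),
    n.missing = sum(is.na(v)), n.total = sum(!is.na(v)))))
desc.cont2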

1.1.3 Correlation Matrix of the 7 Predictors

### correlation matrix of predictors 
x = data[,-c(1:2,10)] # remove the outcome and confounder, extract 7 predictors
  • Install and load the Hmisc package.
install.packages("Hmisc")
library(Hmisc)
rcorr(as.matrix(x), type = "pearson") # type can be "pearson" or "spearman"
corrx = cor(x) # get the correlations
col.x = colnames(x)
num.x = ncol(x) # number of predictors

corrx_big = matrix(NA, nrow = num.x, ncol = num.x,
                   dimnames=list(col.x,col.x))
### corrx_big keeps the entries of corrx that have correlations>0.5 or <-0.5
index = which(abs(corrx)>0.5, arr.ind=TRUE)
corrx_big[index] = corrx[index]

#### save corrx and corrx_big to csv files
# write.csv(corrx, file="corrx.csv")
# write.csv(corrx_big, file="corrx_big.csv")
  • Here corrx.csv contains the correlation matrix of all the continuous variables, and corrx_big.csv contains the entries of the correlation matrix with correlations \(>0.5\) or \(<-0.5\).

  • Install and load the packages ggplot2 and reshape2.

install.packages("ggplot2")
install.packages("reshape2")
  • We see that \(x_1\), \(x_2\) and \(x_3\) are highly correlated with one another, which the heatmap below confirms.
### draw heatmap
library(ggplot2)
library(reshape2)
qplot(x=Var1, y=Var2, data=melt(cor(x)), fill=value, geom="tile")
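  • To list the highly correlated pairs explicitly (a sketch reusing the corrx, col.x, and index objects created above; high_pairs is a hypothetical name):
# each pair with |correlation| > 0.5, keeping only the upper triangle (no duplicates)
high_pairs = index[index[, 1] < index[, 2], , drop = FALSE]
data.frame(var1 = col.x[high_pairs[, 1]],
           var2 = col.x[high_pairs[, 2]],
           corr = round(corrx[high_pairs], 3))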

1.1.3.1 Heatmap of Correlation Matrix using corrplot()

  • Install and load the package corrplot.
install.packages("corrplot")
## using corrplot
library(corrplot)

## calculate p-value matrix
p = cor.mtest(x, conf.level = .95)
round(p$p, digits = 4)
##        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]
## [1,] 0.0000 0.0000 0.0000 0.0000 0.3866 0.4202 0.0000
## [2,] 0.0000 0.0000 0.0000 0.0000 0.0137 0.1155 0.0000
## [3,] 0.0000 0.0000 0.0000 0.0000 0.2636 0.6147 0.0000
## [4,] 0.0000 0.0000 0.0000 0.0000 0.0096 0.0026 0.4043
## [5,] 0.3866 0.0137 0.2636 0.0096 0.0000 0.0000 0.0848
## [6,] 0.4202 0.1155 0.6147 0.0026 0.0000 0.0000 0.0053
## [7,] 0.0000 0.0000 0.0000 0.4043 0.0848 0.0053 0.0000
corrplot(corrx, method = "color")

corrplot(corrx, method = "color", type = "upper", order = "hclust")


2 Basic Analysis

  • Install and load the package mgcv.
install.packages("mgcv")
library(mgcv)
  • Check the univariate relationship between \(y\) and each \(x_i\), adjusting for \(z\)
dat1 = data[,-1]
par(mfrow = c(2, 4)) # split the plot display in 2 rows and 4 columns
for(i in 1:7) {
  dat_temp = dat1[, c(1, i+1, 9)]
  colnames(dat_temp)[2] = "x"
  mod = gam(y ~ s(x) + z, data = dat_temp)
  plot(mod, xlab = names(dat1)[1+i])
}

  • We see that \(x_7\) appears to be approximately \(\log\)-normally distributed.

  • Let us now take the \(\log\)-transform of \(x_7\) and standardize all \(7\) predictor variables

dat1[, 8] = log(dat1[, 8])

# data standardization (center to mean 0, scale to SD 1)
dat1[, 2:8] = scale(dat1[, 2:8])
  • Matrix calculation (a small runnable example follows below):
t(x) # When x is a vector, it is treated as a column, i.e.,
     # the result is a 1-row matrix
A %*% B # Matrix multiplication
A * B # Element-wise multiplication
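  • A tiny runnable illustration of these operations (A and B here are small example matrices, not taken from the data set):
A = matrix(1:4, nrow = 2)            # 2 x 2 example matrix
B = matrix(c(0, 1, 1, 0), nrow = 2)  # 2 x 2 example matrix
t(1:3)    # a vector is treated as a column, so the transpose is a 1-row matrix
A %*% B   # matrix multiplication
A * B     # element-wise multiplication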
  • Data expansion: add interaction terms
dat2 = data.frame(matrix(0, 500, 37))
colnames(dat2)[1:9] = colnames(dat1)
dat2[, 1:9] = dat1
k = 10
for(i in 1:7){
  for(j in (i+1):8){
    dat2[, k] = dat1[, i+1]*dat1[, j+1]
    colnames(dat2)[k] = paste(colnames(dat1)[i+1], ".",
                              colnames(dat1)[j+1], sep = "")
    k = k + 1
  }
}

names(dat2)
##  [1] "y"     "x1"    "x2"    "x3"    "x4"    "x5"    "x6"    "x7"    "z"    
## [10] "x1.x2" "x1.x3" "x1.x4" "x1.x5" "x1.x6" "x1.x7" "x1.z"  "x2.x3" "x2.x4"
## [19] "x2.x5" "x2.x6" "x2.x7" "x2.z"  "x3.x4" "x3.x5" "x3.x6" "x3.x7" "x3.z" 
## [28] "x4.x5" "x4.x6" "x4.x7" "x4.z"  "x5.x6" "x5.x7" "x5.z"  "x6.x7" "x6.z" 
## [37] "x7.z"
  • Data expansion: add quadratic terms
names(data)
##  [1] "obs" "y"   "x1"  "x2"  "x3"  "x4"  "x5"  "x6"  "x7"  "z"
X_sq = data[, 3:8]^2
colnames(X_sq) = paste("x", 1:6, "sq", sep="")
colnames(X_sq)
## [1] "x1sq" "x2sq" "x3sq" "x4sq" "x5sq" "x6sq"
dat.all = cbind(dat2, X_sq)
names(dat.all)
##  [1] "y"     "x1"    "x2"    "x3"    "x4"    "x5"    "x6"    "x7"    "z"    
## [10] "x1.x2" "x1.x3" "x1.x4" "x1.x5" "x1.x6" "x1.x7" "x1.z"  "x2.x3" "x2.x4"
## [19] "x2.x5" "x2.x6" "x2.x7" "x2.z"  "x3.x4" "x3.x5" "x3.x6" "x3.x7" "x3.z" 
## [28] "x4.x5" "x4.x6" "x4.x7" "x4.z"  "x5.x6" "x5.x7" "x5.z"  "x6.x7" "x6.z" 
## [37] "x7.z"  "x1sq"  "x2sq"  "x3sq"  "x4sq"  "x5sq"  "x6sq"
  • We now have \(43\) variables in total, including \(y\), \(7\) main terms, \(1\) confounder, \(28\) interaction terms and \(6\) square terms.
data2 = cbind(dat.all[,c(1,9)],dat.all[,c(2:8,10:43)])
x0 = data2[,-c(1:2,10:43)] # extract the 7 predictors
x1 = data2[,-c(1,10:43)] # extract the 8 main effect predictors (z and x1-x7)
x2 = data2[,-c(1)] # extract all predictors: main effect and interactions
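  • As a quick sanity check (a sketch), the dimensions of these design matrices should match the counts given above:
dim(x0) # 500 x 7: the 7 predictors
dim(x1) # 500 x 8: z plus the 7 predictors
dim(x2) # 500 x 42: main effects, interactions, and squared terms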

2.1 Single-Variable Models

  • Here we fit single-variable models with adjustment for confounders.

  • Note: Always keep confounders in the model.

y = data2$y # y as the outcome
num.x = ncol(x0)
univ = matrix(0, nrow = num.x, ncol = 2)
colnames(univ) = c('beta','pvalue')
rownames(univ) = names(x0)
univ = data.frame(univ)
for(i in 1:num.x){
  fit = lm(y ~ x0[,i] + factor(z), data = data2) # always keep confounders in the model
  univ[i,1] = coef(fit)[2]
  univ[i,2] = anova(fit)$'Pr(>F)'[1]
}
univ = univ[order(univ$pvalue),]# sorted by pvalue in increasing order
univ
##       beta      pvalue
## x1  4.1947 2.2766e-138
## x3  2.8814 6.5621e-114
## x2  2.5996 3.9445e-110
## x7  3.5144  7.1499e-89
## x5 -3.3050  3.3736e-28
## x4  0.2402  3.7374e-19
## x6 -1.7112  9.8027e-08

2.2 Multiple-Variable Models

  • Here we fit multi-variable models adjusted for confounders.

  • Since all predictors have p-values \(\leq 0.1\), we put all \(8\) main effect variables into the multiple regression model.

dat.ml = data.frame(cbind(y, x1))
fit.ml = lm(y ~ ., data = dat.ml) # . indicates include all variables in dat.ml except y
summary(fit.ml)
## 
## Call:
## lm(formula = y ~ ., data = dat.ml)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.700 -1.753  0.019  1.700  8.613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.4477     0.2050   90.00  < 2e-16 ***
## z            11.3394     0.3872   29.29  < 2e-16 ***
## x1            2.9275     0.2462   11.89  < 2e-16 ***
## x2            1.0546     0.2331    4.52  7.6e-06 ***
## x3            0.0151     0.2485    0.06     0.95    
## x4           -0.6751     0.1323   -5.10  4.8e-07 ***
## x5           -3.5940     0.1527  -23.53  < 2e-16 ***
## x6           -0.1266     0.1526   -0.83     0.41    
## x7            3.3486     0.1288   26.00  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.7 on 491 degrees of freedom
## Multiple R-squared:  0.939,  Adjusted R-squared:  0.938 
## F-statistic:  948 on 8 and 491 DF,  p-value: <2e-16

2.2.1 Including Pairwise Interactions

  • Here we fit multi-variable models (adjusted for confounders) including pairwise interactions.
x2 = data2[,-c(1)] # 42 variables, excluding outcome
dat.ml = data.frame(cbind(y, x2))
fit.ml2 = lm(y ~ ., data = dat.ml)
summary(fit.ml2)
## 
## Call:
## lm(formula = y ~ ., data = dat.ml)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.019 -1.452  0.029  1.491  7.735 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.6719     1.8265    9.68  < 2e-16 ***
## z            10.8002     0.5366   20.13  < 2e-16 ***
## x1            3.7079     0.9989    3.71  0.00023 ***
## x2            1.4535     1.3874    1.05  0.29537    
## x3           -0.2265     0.9887   -0.23  0.81888    
## x4           -0.6060     0.4571   -1.33  0.18558    
## x5           -5.2849     0.5444   -9.71  < 2e-16 ***
## x6           -0.4850     0.4955   -0.98  0.32822    
## x7            3.5847     0.2117   16.93  < 2e-16 ***
## x1.x2        -0.3009     0.3418   -0.88  0.37906    
## x1.x3        -0.1731     0.3246   -0.53  0.59413    
## x1.x4         0.1718     0.2325    0.74  0.46041    
## x1.x5        -0.0581     0.3051   -0.19  0.84916    
## x1.x6         0.0580     0.3071    0.19  0.85024    
## x1.x7         0.6606     0.2659    2.48  0.01333 *  
## x1.z         -0.0675     1.2296   -0.05  0.95625    
## x2.x3        -0.0373     0.4800   -0.08  0.93811    
## x2.x4         0.2341     0.2203    1.06  0.28858    
## x2.x5        -0.4982     0.3060   -1.63  0.10421    
## x2.x6         0.2820     0.3277    0.86  0.38990    
## x2.x7         0.2747     0.2524    1.09  0.27711    
## x2.z         -0.6640     0.7247   -0.92  0.35998    
## x3.x4        -0.4362     0.2302   -1.90  0.05872 .  
## x3.x5         0.1785     0.3086    0.58  0.56333    
## x3.x6        -0.2266     0.3033   -0.75  0.45533    
## x3.x7        -0.0731     0.2749   -0.27  0.79042    
## x3.z          0.4718     0.7535    0.63  0.53147    
## x4.x5         0.3161     0.1794    1.76  0.07879 .  
## x4.x6         0.0606     0.1983    0.31  0.76007    
## x4.x7        -0.0830     0.1389   -0.60  0.55066    
## x4.z         -0.3605     0.4160   -0.87  0.38665    
## x5.x6         0.1525     0.1663    0.92  0.35948    
## x5.x7        -0.3676     0.1580   -2.33  0.02046 *  
## x5.z         -0.1815     0.4491   -0.40  0.68630    
## x6.x7         0.0812     0.1573    0.52  0.60578    
## x6.z         -0.3445     0.4477   -0.77  0.44201    
## x7.z         -0.3468     0.4025   -0.86  0.38938    
## x1sq         -0.1287     0.2552   -0.50  0.61433    
## x2sq         -0.0870     1.5249   -0.06  0.95451    
## x3sq          0.0964     0.3242    0.30  0.76638    
##  [ reached getOption("max.print") -- omitted 3 rows ]
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.46 on 457 degrees of freedom
## Multiple R-squared:  0.953,  Adjusted R-squared:  0.949 
## F-statistic:  220 on 42 and 457 DF,  p-value: <2e-16
  • fit.ml: model including only marginal effect terms.

  • fit.ml2: model including main and pairwise interaction terms.

2.3 Step-wise Selection Approaches

  • Here we enter the candidate terms (the \(7\) main effects, the confounder \(z\), and the interaction and squared terms) into step-wise regression, using both forward selection and backward elimination.

2.3.1 Forward Selection

  • Here we show code for forward selection of the variables in the model.
fit.start = lm(y ~ 1, data = dat.ml) # model with only intercept
fit.full = lm(y ~ ., data = dat.ml) # model with all variables

full = formula(fit.full)
  • Select a formula-based model by AIC.
fit.forward = step(fit.start, direction = 'forward',
                   trace = 0, scope = full)

## if you want to keep z
fit.forward1 = step(fit.start, direction='forward', trace = 0,
                    scope = list(lower = ~z, upper = full))  # error: the starting model must contain the lower-scope term z

fit.forward2 = step(lm(y ~ z, data = dat.ml), direction = 'forward',
                    trace = 0, scope = list(upper = full))

summary(fit.forward2)
## 
## Call:
## lm(formula = y ~ z + x7 + x5 + x1 + x5sq + x4sq + x2 + x1.x2 + 
##     x1.x7 + x5.x7 + x5.z + x3.x4 + x4.x5 + x4 + x2.x5 + x5.x6 + 
##     x2.x7, data = dat.ml)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.162 -1.461  0.061  1.490  7.825 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.6521     0.4278   41.26  < 2e-16 ***
## z            10.7754     0.3930   27.42  < 2e-16 ***
## x7            3.4145     0.1184   28.83  < 2e-16 ***
## x5           -5.4387     0.4209  -12.92  < 2e-16 ***
## x1            3.3516     0.2788   12.02  < 2e-16 ***
## x5sq          0.4910     0.1120    4.38  1.4e-05 ***
## x4sq          0.0508     0.1146    0.44   0.6576    
## x2            1.2145     0.1875    6.48  2.3e-10 ***
## x1.x2        -0.5623     0.1312   -4.29  2.2e-05 ***
## x1.x7         0.4323     0.1935    2.23   0.0259 *  
## x5.x7        -0.3361     0.1178   -2.85   0.0045 ** 
## x5.z         -0.3803     0.3129   -1.22   0.2247    
## x3.x4        -0.2677     0.1087   -2.46   0.0142 *  
## x4.x5         0.3270     0.1264    2.59   0.0100 ** 
## x4           -0.7852     0.3804   -2.06   0.0395 *  
## x2.x5        -0.2843     0.1719   -1.65   0.0988 .  
## x5.x6         0.1827     0.1177    1.55   0.1214    
## x2.x7         0.2800     0.2007    1.40   0.1636    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.43 on 482 degrees of freedom
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.95 
## F-statistic:  559 on 17 and 482 DF,  p-value: <2e-16

2.3.2 Backward Elimination

  • Here we show code for backward elimination of the variables in the model.
fit.backward = step(fit.full, direction='backward',
                    trace = 0, scope=list(lower = ~z))
summary(fit.backward)
## 
## Call:
## lm(formula = y ~ z + x1 + x2 + x4 + x5 + x7 + x1.x2 + x1.x7 + 
##     x2.x5 + x2.x6 + x3.x4 + x4.x5 + x5.x6 + x5.x7 + x6.z + x5sq, 
##     data = dat.ml)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.099 -1.573  0.036  1.443  7.801 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.769      0.356   49.94  < 2e-16 ***
## z             10.912      0.383   28.48  < 2e-16 ***
## x1             3.208      0.257   12.50  < 2e-16 ***
## x2             1.231      0.185    6.67  7.1e-11 ***
## x4            -0.613      0.129   -4.74  2.8e-06 ***
## x5            -5.317      0.419  -12.70  < 2e-16 ***
## x7             3.433      0.118   29.05  < 2e-16 ***
## x1.x2         -0.503      0.124   -4.06  5.6e-05 ***
## x1.x7          0.638      0.122    5.21  2.8e-07 ***
## x2.x5         -0.477      0.155   -3.09  0.00214 ** 
## x2.x6          0.249      0.168    1.49  0.13742    
## x3.x4         -0.258      0.102   -2.52  0.01214 *  
## x4.x5          0.326      0.123    2.65  0.00825 ** 
## x5.x6          0.252      0.120    2.10  0.03644 *  
## x5.x7         -0.324      0.116   -2.79  0.00555 ** 
## x6.z          -0.523      0.234   -2.24  0.02579 *  
## x5sq           0.440      0.114    3.88  0.00012 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.42 on 483 degrees of freedom
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.95 
## F-statistic:  597 on 16 and 483 DF,  p-value: <2e-16

2.3.3 Step-wise Selection

  • Here we show code for step-wise selection of the variables in the model, which includes both forward selection and backward elimination.
fit.step = step(fit.full, direction='both',
                trace = 0, scope = list(lower = ~z))
summary(fit.step)  # this is exactly the same as backward
## 
## Call:
## lm(formula = y ~ z + x1 + x2 + x4 + x5 + x7 + x1.x2 + x1.x7 + 
##     x2.x5 + x2.x6 + x3.x4 + x4.x5 + x5.x6 + x5.x7 + x6.z + x5sq, 
##     data = dat.ml)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.099 -1.573  0.036  1.443  7.801 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.769      0.356   49.94  < 2e-16 ***
## z             10.912      0.383   28.48  < 2e-16 ***
## x1             3.208      0.257   12.50  < 2e-16 ***
## x2             1.231      0.185    6.67  7.1e-11 ***
## x4            -0.613      0.129   -4.74  2.8e-06 ***
## x5            -5.317      0.419  -12.70  < 2e-16 ***
## x7             3.433      0.118   29.05  < 2e-16 ***
## x1.x2         -0.503      0.124   -4.06  5.6e-05 ***
## x1.x7          0.638      0.122    5.21  2.8e-07 ***
## x2.x5         -0.477      0.155   -3.09  0.00214 ** 
## x2.x6          0.249      0.168    1.49  0.13742    
## x3.x4         -0.258      0.102   -2.52  0.01214 *  
## x4.x5          0.326      0.123    2.65  0.00825 ** 
## x5.x6          0.252      0.120    2.10  0.03644 *  
## x5.x7         -0.324      0.116   -2.79  0.00555 ** 
## x6.z          -0.523      0.234   -2.24  0.02579 *  
## x5sq           0.440      0.114    3.88  0.00012 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.42 on 483 degrees of freedom
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.95 
## F-statistic:  597 on 16 and 483 DF,  p-value: <2e-16
fit.step1 = step(lm(y ~ z, data= dat.ml), direction='both',
               trace = 0, scope = list(lower = ~z, upper = full))
summary(fit.step1)
## 
## Call:
## lm(formula = y ~ z + x7 + x5 + x1 + x5sq + x2 + x1.x2 + x1.x7 + 
##     x5.x7 + x3.x4 + x4.x5 + x4 + x2.x5 + x5.x6 + x6.z + x2.x6, 
##     data = dat.ml)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.099 -1.573  0.036  1.443  7.801 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.769      0.356   49.94  < 2e-16 ***
## z             10.912      0.383   28.48  < 2e-16 ***
## x7             3.433      0.118   29.05  < 2e-16 ***
## x5            -5.317      0.419  -12.70  < 2e-16 ***
## x1             3.208      0.257   12.50  < 2e-16 ***
## x5sq           0.440      0.114    3.88  0.00012 ***
## x2             1.231      0.185    6.67  7.1e-11 ***
## x1.x2         -0.503      0.124   -4.06  5.6e-05 ***
## x1.x7          0.638      0.122    5.21  2.8e-07 ***
## x5.x7         -0.324      0.116   -2.79  0.00555 ** 
## x3.x4         -0.258      0.102   -2.52  0.01214 *  
## x4.x5          0.326      0.123    2.65  0.00825 ** 
## x4            -0.613      0.129   -4.74  2.8e-06 ***
## x2.x5         -0.477      0.155   -3.09  0.00214 ** 
## x5.x6          0.252      0.120    2.10  0.03644 *  
## x6.z          -0.523      0.234   -2.24  0.02579 *  
## x2.x6          0.249      0.168    1.49  0.13742    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.42 on 483 degrees of freedom
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.95 
## F-statistic:  597 on 16 and 483 DF,  p-value: <2e-16
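  • Since all of these models are selected by AIC, a quick way to compare them (a sketch using the fitted objects above) is:
# compare the AIC of the fitted models; lower is better
AIC(fit.ml, fit.ml2, fit.forward2, fit.backward, fit.step, fit.step1)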

3 Variable Selection Methods

3.1 Classification And Regression Trees (CART)

  • Here we demonstrate the code to fit/grow and prune the classification and regression trees.

  • Install and load the packages rpart and rpart.plot.

install.packages("rpart")
install.packages("rpart.plot")
# using main effect
library('rpart') # CART
library('rpart.plot') # Plotting for CART 

# set up the random-number-generator seed for cross validation (CV)
set.seed(0)

3.1.1 Grow the Tree

  • Since \(y\) is continuous, we fit a regression tree with method = "anova"; the default number of CV folds is \(10\).
fit.cart = rpart(y ~ ., data = x1, method = "anova")
  • Detailed results can be obtained by running
summary(fit.cart)
  • The complexity parameter (CP) table and its plot are given as
fit.cart$cptable
##         CP nsplit rel error  xerror     xstd
## 1 0.688354      0   1.00000 1.00240 0.048945
## 2 0.055184      1   0.31165 0.31519 0.019071
## 3 0.033570      2   0.25646 0.28043 0.016474
## 4 0.029894      3   0.22289 0.25053 0.014541
## 5 0.024777      4   0.19300 0.22507 0.013234
## 6 0.019229      5   0.16822 0.19354 0.012132
## 7 0.011355      6   0.14899 0.18473 0.011369
## 8 0.010071      7   0.13764 0.17936 0.011187
## 9 0.010000      8   0.12757 0.17532 0.011164
plotcp(fit.cart) # plot of CP table

  • The dotted line in the CP plot marks the one standard deviation upper limit: the smallest cross-validation error plus one standard deviation of that error.

  • There are two common ways to use the CP table to choose the tree size:

    1. Use the tree that minimizes the cross-validation error.

    2. Use the one standard deviation rule: choose the smallest tree whose cross-validation error is within one standard error of the minimum.

  • Here the row with the minimum CV error (shown as “xerror”) has \(nsplit=8\).

    • If we use method \(1\) above, the tree keeps all nodes, with \(8\) splits.

    • If instead we use the one standard deviation rule, we can choose row \(7\), i.e., a tree with \(6\) splits, which gives a smaller tree. Let us therefore use the \(7^{th}\) row and apply \(CP=0.01135508\) for pruning.

  • The tree can be plotted as

rpart.plot(fit.cart) # tree plot

  • Note that the size of the tree (the number of terminal nodes) \(=\) nsplit \(+1\).

  • The variable importance, rescaled to sum to \(100\), is computed as
vi.cart = fit.cart$variable.importance/sum(fit.cart$variable.importance)*100
vi.cart
##      x1       z      x3      x2      x7      x4      x5      x6 
## 25.8944 24.9450 17.6825 15.6689  6.3012  4.5259  3.4237  1.5583
  • Number of variables with non-zero variable importance (VI)
length(vi.cart)
## [1] 8
sum(vi.cart>1) # Number of variables with VI > 1
## [1] 8
  • All \(8\) variables have non-zero variable importance, and they all have variable importance \(>1\).

3.1.2 Prune the Tree

  • Here we show code of how to prune the tree using the one standard deviation rule.
prune.cart = prune(fit.cart, cp= fit.cart$cptable[7,"CP"])

rpart.plot(prune.cart) # tree plot

  • The variable importance, rescaled to sum to \(100\), is computed as
vi.prune = prune.cart$variable.importance/sum(prune.cart$variable.importance)*100
vi.prune
##      x1       z      x3      x2      x7      x4      x5      x6 
## 25.8698 25.2775 17.7795 15.7391  6.3390  4.5862  3.0523  1.3566
  • The number of variables with non-zero VI is
length(vi.prune)
## [1] 8
  • Remark: If we instead prune with \(CP=0.01\), the value that minimizes the CV error, the pruned tree is the same as the unpruned tree.
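  • The one standard deviation choice can also be made programmatically instead of reading it off the plot; a minimal sketch using the CP table above (cp_tab, thresh, and cp_1sd are hypothetical helper names):
# pick the smallest tree whose xerror is within 1 SD of the minimum xerror
cp_tab = fit.cart$cptable
i_min  = which.min(cp_tab[, "xerror"])                      # row with minimum CV error
thresh = cp_tab[i_min, "xerror"] + cp_tab[i_min, "xstd"]    # 1-SD upper limit
cp_1sd = cp_tab[min(which(cp_tab[, "xerror"] <= thresh)), "CP"]
prune(fit.cart, cp = cp_1sd) # for this CP table, the same as using row 7 above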

3.1.3 Optional Settings

  • The default setting for the CP is \(0.01\), and the default CV is \(10\)-fold. We can change them by using rpart.control (for help see ?rpart.control).

  • We fit a tree with \(CP=0.008\) and \(5\)-fold cross-validation

set.seed(0)  # set up the random-number-generator seed for
             # cross validation (CV)
fit.cart = rpart(y ~ ., data = x1, method = "anova",
                 control = rpart.control(cp = .008, xval = 5))
# cp=0.008, CV is 5-fold
  • The CP table and its plot are given as
fit.cart$cptable # the CP table
##         CP nsplit rel error  xerror     xstd
## 1 0.688354      0   1.00000 1.00087 0.048929
## 2 0.055184      1   0.31165 0.31530 0.019081
## 3 0.033570      2   0.25646 0.27710 0.017097
## 4 0.029894      3   0.22289 0.23527 0.014224
## 5 0.024777      4   0.19300 0.22544 0.013696
## 6 0.019229      5   0.16822 0.18296 0.011661
## 7 0.011355      6   0.14899 0.17742 0.011058
## 8 0.010071      7   0.13764 0.17163 0.010930
## 9 0.008000      8   0.12757 0.16382 0.010406
plotcp(fit.cart) # plot of CP table

  • The tree can be plotted as
rpart.plot(fit.cart)

  • The variable importance, rescaled to sum to \(100\), is computed as
vi.cart=fit.cart$variable.importance/sum(fit.cart$variable.importance)*100
vi.cart
##      x1       z      x3      x2      x7      x4      x5      x6 
## 25.8944 24.9450 17.6825 15.6689  6.3012  4.5259  3.4237  1.5583
  • The number of variables with non-zero VI is
length(vi.cart) 
## [1] 8
sum(vi.cart>1) # number of variables with VI > 1
## [1] 8
  • Prune the tree using the one standard deviation rule.
prune.cart = prune(fit.cart, cp = fit.cart$cptable[8,"CP"])

rpart.plot(prune.cart) # tree plot

  • The variable importance, rescaled to sum to \(100\), is computed as
vi.prune = prune.cart$variable.importance/sum(prune.cart$variable.importance)*100
vi.prune
##      x1       z      x3      x2      x7      x4      x5      x6 
## 25.7113 25.1226 17.6706 15.6427  6.3001  4.5581  3.4481  1.5464
  • The number of variables with non-zero VI is
length(vi.prune)
## [1] 8
sum(vi.prune>1) # number of variables with VI>1
## [1] 8

3.2 Random Forests

install.packages("rsample")
install.packages("randomForest")
install.packages("ranger")
install.packages("caret")
  • Load these packages
library(rsample)      # data splitting 
library(randomForest) # basic implementation
library(ranger)       # a faster implementation of randomForest
library(caret)        # an aggregator package for performing 
                      # many machine learning models
  • Split the data into training and testing.
# Use set.seed for reproducibility.
set.seed(1234)

data_split = initial_split(dat.all)  
# default 3/4 and 1/4, you can change it e.g., 'prop=0.5' 
data_split
## <375/125/500>
data_train = training(data_split)
data_test  = testing(data_split)
  • Run the random forest model using the randomForest function (for help see ?randomForest).
# for reproducibility
set.seed(1234)

# default RF model
m1 = randomForest(y ~ ., data = data_train)
m1
## 
## Call:
##  randomForest(formula = y ~ ., data = data_train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 14
## 
##           Mean of squared residuals: 10.23
##                     % Var explained: 91.27
plot(m1)

# which.min(m1$mse) gives the number of trees with the lowest OOB MSE
  • The OOB RMSE at this optimal number of trees
sqrt(m1$mse[which.min(m1$mse)])
## [1] 3.1985

3.2.1 Initial tuning with randomForest

  • We want the names of all variables in data_train except the outcome y.
# all predictor names: every column of data_train except y
features = setdiff(names(data_train), "y") 
  • tuneRF (for help see ?tuneRF) will start at the value of mtry that you supply and increase it by a step factor until the out-of-bag (OOB) error stops improving by a specified amount.

    • For example, it starts with mtry = 5 and increases by a factor of \(1.5\) until the OOB error stops improving by \(1\%\).

    • Here mtry is the number of variables randomly selected as candidates for splitting a node (see ?randomForest).

set.seed(1234)
m2 = tuneRF(x = data_train[features],
            y = data_train$y,
            ntreeTry = 500,
            mtryStart = 5,
            stepFactor = 1.5,
            improve = 0.01,
            trace = FALSE # to not show real-time progress
            )
## -0.078966 0.01 
## 0.050766 0.01 
## 0.0074674 0.01

  • Hyper-parameter grid search
#### Full grid search with ranger
hyper_grid = expand.grid(mtry = seq(5, 17, by = 2),
                         node_size  = seq(3, 9, by = 2),
                         sample_size = c(.55, .632, .70, .80),
                         OOB_RMSE = 0)

# total number of combinations
nrow(hyper_grid) # 7x4x4
## [1] 112
for(i in 1:nrow(hyper_grid)){
  # train model
  model = ranger(y ~ .,
                 data = data_train,
                 num.trees = 500,
                 mtry = hyper_grid$mtry[i],
                 min.node.size = hyper_grid$node_size[i],
                 sample.fraction = hyper_grid$sample_size[i],
                 seed = 1234)
  
  # add OOB error to grid
  hyper_grid$OOB_RMSE[i] = sqrt(model$prediction.error)
}
  • Here we show the 10 random forest models with the lowest OOB_RMSE.

  • Note: %>% is called the pipe operator. It chains function calls: \(f(g(x))\) can be written as x %>% g() %>% f().

library(dplyr)  # provides the %>% pipe and arrange()
hyper_grid_sorted = hyper_grid %>% 
  dplyr::arrange(OOB_RMSE) %>%
  head(10)
hyper_grid_sorted
##    mtry node_size sample_size OOB_RMSE
## 1    15         3         0.8   3.1927
## 2    15         5         0.8   3.1958
## 3    13         5         0.8   3.2013
## 4    17         3         0.8   3.2066
## 5    17         7         0.8   3.2066
## 6    17         5         0.8   3.2101
## 7    13         3         0.8   3.2112
## 8    17         9         0.8   3.2165
## 9    15         7         0.8   3.2167
## 10   15         9         0.8   3.2219
mtry_1 = hyper_grid_sorted$mtry[1]
node_size_1 = hyper_grid_sorted$node_size[1]
sample_size_1 = hyper_grid_sorted$sample_size[1]
  • The best random forest model we have found uses mtry = 15, a minimum terminal node size of \(3\) observations, and a sample fraction of \(80\%\).

  • Let's refit this model \(100\) times to get a better estimate of the expected OOB error.

OOB_RMSE = vector(mode = "numeric", length = 100)

for(i in seq_along(OOB_RMSE)){
  optimal_ranger = ranger(y ~ .,
                          data = data_train,
                          num.trees = 500,
                          mtry = mtry_1,
                          min.node.size = node_size_1,
                          sample.fraction = sample_size_1,
                          importance = 'impurity'
                          )
  
  OOB_RMSE[i] = sqrt(optimal_ranger$prediction.error)
}

hist(OOB_RMSE, breaks = 20)

3.2.2 Check Variable Importance

attributes(m1) # m1: Random Forest before tuning
attributes(m2) # m2: Random Forest after tuning
attributes(optimal_ranger) # Best random forest model with OOB_RMSE
optimal_ranger$variable.importance # variable importance
##       x1       x2       x3       x4       x5       x6       x7        z 
## 7081.423  156.350  400.374   51.085  783.338   78.949  822.483 5578.699 
##    x1.x2    x1.x3    x1.x4    x1.x5    x1.x6    x1.x7     x1.z    x2.x3 
##  164.806  253.107   50.409  852.496  120.272 1256.884 1700.209  118.929 
##    x2.x4    x2.x5    x2.x6    x2.x7     x2.z    x3.x4    x3.x5    x3.x6 
##   88.945  302.498   77.020  251.516 1027.427   63.531  273.633   75.270 
##    x3.x7     x3.z    x4.x5    x4.x6    x4.x7     x4.z    x5.x6    x5.x7 
##  479.123  517.400  115.190   75.181   96.963   22.360  115.820  686.749 
##     x5.z    x6.x7     x6.z     x7.z     x1sq     x2sq     x3sq     x4sq 
## 1580.116  124.500  241.278 1007.953 6753.955  161.893  275.829   47.534 
##     x5sq     x6sq 
##  789.629   78.746
sort(optimal_ranger$variable.importance) # sorted values of variable importance
##     x4.z     x4sq    x1.x4       x4    x3.x4    x4.x6    x3.x6    x2.x6 
##   22.360   47.534   50.409   51.085   63.531   75.181   75.270   77.020 
##     x6sq       x6    x2.x4    x4.x7    x4.x5    x5.x6    x2.x3    x1.x6 
##   78.746   78.949   88.945   96.963  115.190  115.820  118.929  120.272 
##    x6.x7       x2     x2sq    x1.x2     x6.z    x2.x7    x1.x3    x3.x5 
##  124.500  156.350  161.893  164.806  241.278  251.516  253.107  273.633 
##     x3sq    x2.x5       x3    x3.x7     x3.z    x5.x7       x5     x5sq 
##  275.829  302.498  400.374  479.123  517.400  686.749  783.338  789.629 
##       x7    x1.x5     x7.z     x2.z    x1.x7     x5.z     x1.z        z 
##  822.483  852.496 1007.953 1027.427 1256.884 1580.116 1700.209 5578.699 
##     x1sq       x1 
## 6753.955 7081.423
t(m1$importance) # importance of variables in m1
##                   x1     x2     x3     x4     x5     x6     x7      z  x1.x2
## IncNodePurity 9013.5 206.66 520.96 61.372 1025.1 103.23 997.32 4315.6 215.99
##                x1.x3  x1.x4 x1.x5  x1.x6  x1.x7 x1.z x2.x3  x2.x4  x2.x5  x2.x6
## IncNodePurity 364.26 67.655  1273 175.53 1501.1 2868 145.3 94.934 338.74 84.413
##               x2.x7   x2.z  x3.x4  x3.x5  x3.x6  x3.x7   x3.z  x4.x5  x4.x6
## IncNodePurity 301.9 1523.4 79.698 306.23 118.87 629.41 868.83 140.04 100.67
##                x4.x7   x4.z  x5.x6  x5.x7   x5.z  x6.x7   x6.z   x7.z   x1sq
## IncNodePurity 119.19 28.707 138.96 855.78 1934.7 145.72 176.88 1569.9 9646.4
##                 x2sq   x3sq   x4sq   x5sq   x6sq
## IncNodePurity 196.75 371.98 49.259 964.07 97.901
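  • For a quick visual comparison of the ranger importances (a sketch; vi_sorted is a hypothetical helper name):
# bar plot of the 15 largest impurity importances from the tuned ranger model
vi_sorted = sort(optimal_ranger$variable.importance, decreasing = TRUE)
barplot(head(vi_sorted, 15), las = 2, cex.names = 0.7,
        main = "Top 15 variables by impurity importance")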

3.2.3 Prediction

  • Check prediction performance in test data using m1
# randomForest
pred_randomForest = predict(m1, data_test)
head(pred_randomForest)
##       4      10      17      19      22      36 
## 20.5144  6.4587 19.1872 42.7845 36.0801 34.3820
  • Check prediction performance in test data using optimal_ranger
# ranger
pred_ranger = predict(optimal_ranger, data_test)
head(pred_ranger$predictions)
## [1] 20.2520  6.5332 19.3703 42.2882 36.3760 34.5744
  • The true responses are
head(data_test$y)
## [1] 15.60029  0.35266 21.49137 43.92268 34.88292 33.50203
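  • To summarize predictive accuracy on the test set (a sketch using the predictions above; rmse_rf and rmse_ranger are hypothetical names):
# test-set RMSE of the two fitted forests
rmse_rf     = sqrt(mean((data_test$y - pred_randomForest)^2))
rmse_ranger = sqrt(mean((data_test$y - pred_ranger$predictions)^2))
c(randomForest = rmse_rf, ranger = rmse_ranger)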

In the following, we fit the

  • LASSO,

  • elastic net (E-Net),

  • adaptive LASSO, and

  • adaptive elastic net

using the R package gcdnet.

  • Note that the objective function in gcdnet is \[\frac{1}{n}Loss(y,x,\beta) + \lambda P_{\mathbb{L}^1}(\beta) + \lambda_2 P_{\mathbb{L}^2}(\beta),\] where \(P_{\mathbb{L}^1}(\beta)\) and \(P_{\mathbb{L}^2}(\beta)\) are \(\mathbb{L}^1\) and \(\mathbb{L}^2\) penalties, respectively.

  • By default, gcdnet chooses its own \(\lambda\) (the tuning parameter for \(\mathbb{L}^1\) penalty) sequence based on the given data.
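  • Schematically, the four fits below differ only in \(\lambda_2\) and the penalty-weight argument pf (a sketch; lambda2_opt and w stand for the tuned \(\lambda_2\) and the adaptive weights constructed later in this section):
# schematic gcdnet calls (not run): how the four penalized fits differ
# LASSO:          gcdnet(X, Y, method = "ls", lambda2 = 0)
# E-Net:          gcdnet(X, Y, method = "ls", lambda2 = lambda2_opt)
# Adaptive LASSO: gcdnet(X, Y, method = "ls", lambda2 = 0,           pf = w)
# Adaptive E-Net: gcdnet(X, Y, method = "ls", lambda2 = lambda2_opt, pf = w)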

3.3 LASSO

  • Install and load the package gcdnet.
install.packages("gcdnet")
# LASSO & E-Net
library('gcdnet') # for LASSO, adaptive LASSO, elastic net, adaptive elastic net

## fold ID
set.seed(0)
n = nrow(x2)
foldid_mat = cbind(sample(n), rep(1:5, each = n/5)) #assign into 5 folds

X = data.matrix(x2)
Y = y
  • We now fit the LASSO with 5-fold cross-validation
## LASSO
cv_lasso = cv.gcdnet(X, Y, lambda2 = 0, method="ls",
                     foldid = foldid_mat[, 2]) # calculate the CV error
plot(cv_lasso) # CV error = y axis

  • The CV plot reports two critical \(\lambda\) values:

    • Left dotted line: the \(\lambda\) value that minimizes the cross-validated mean squared error.

    • Right dotted line: the \(\lambda\) value with the greatest amount of shrinkage whose CV-MSE is within one standard error of the minimum.

  • Get the \(\lambda\) with minimum CV error

lambda_opt_lasso = cv_lasso$lambda.min
lambda_opt_lasso
## [1] 0.04512
  • Re-fit the model with the best \(\lambda\)
fit_lasso = gcdnet(X, Y, lambda = lambda_opt_lasso,
                   lambda2 = 0, method = "ls") 
beta_lasso = coef(fit_lasso)
  • We now compute the number of terms with non-zero coefficients, including intercept
sum(beta_lasso!=0)
## [1] 23
row.names(beta_lasso)[which(beta_lasso!=0)]
##  [1] "(Intercept)" "z"           "x1"          "x2"          "x4"         
##  [6] "x5"          "x6"          "x7"          "x1.x2"       "x1.x3"      
## [11] "x1.x7"       "x2.x5"       "x2.x7"       "x3.x4"       "x4.x5"      
## [16] "x4.z"        "x5.x6"       "x5.x7"       "x5.z"        "x6.z"       
## [21] "x3sq"        "x4sq"        "x5sq"
  • Compute \(\hat{Y} = X\hat{\beta}\)
Y_pred_lasso = cbind(1, X)%*%as.vector(beta_lasso) # get predicted Y
  • Compute \(R^2\)
cor(Y_pred_lasso, Y)^2 # R^2
##         [,1]
## [1,] 0.95138
  • Compute adjusted-\(R^2\)
1 - (1- cor(Y_pred_lasso, Y)^2)*499/(499-(sum(beta_lasso!=0)-1))  #adj R^2
##         [,1]
## [1,] 0.94914
  • Compute MSE
mean((Y - Y_pred_lasso)^2) # MSE
## [1] 5.7276
  • Order the selected variables (with \(\beta \neq 0\))
order = order(abs(beta_lasso[,1]), decreasing = TRUE)
beta.lasso = data.frame(beta = beta_lasso[order],
                        row.names = rownames(beta_lasso)[order])
  • Estimated non-zero coefficients
beta.lasso = subset(beta.lasso, beta.lasso[,1] != 0)
beta.lasso
##                  beta
## (Intercept) 18.496941
## z           10.508618
## x5          -4.408284
## x7           3.382300
## x1           3.344982
## x2           1.066532
## x1.x2       -0.496023
## x1.x7        0.424511
## x5.x7       -0.329496
## x5.x6        0.312452
## x5.z        -0.298642
## x4          -0.265199
## x4.x5        0.252270
## x5sq         0.238471
## x3.x4       -0.229729
## x2.x5       -0.215640
## x6          -0.204192
## x2.x7        0.155193
## x4sq        -0.085377
## x3sq         0.068312
## x1.x3       -0.060188
## x6.z        -0.048129
## x4.z        -0.038430
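  • As a quick check (a sketch using objects fitted above; lasso_terms and step_terms are hypothetical names), we can compare the LASSO selection with the backward-elimination model from Section 2.3.2:
# compare the terms kept by LASSO and by backward elimination
lasso_terms = setdiff(rownames(beta.lasso), "(Intercept)")
step_terms  = names(coef(fit.backward))[-1]
intersect(lasso_terms, step_terms) # selected by both
setdiff(lasso_terms, step_terms)   # selected only by LASSO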

3.3.1 Optional Settings

  • The default CV in the function cv.gcdnet is 5-fold; we can also change the fold assignment:
set.seed(0) # set up the random-number-generator seed for cross validation (CV)
n = nrow(data2)
foldid_mat = cbind(sample(n), rep(1:10, each = n/10)) # assign into 10 folds

cv_lasso = cv.gcdnet(X, Y, lambda2 = 0, method = "ls",
                     foldid = foldid_mat[, 2]) # calculate the CV error
plot(cv_lasso) # CV error = y axis

  • Alternatively, we can change the number of CV folds with the nfolds argument
set.seed(0) # set up the random-number-generator seed for cross validation (CV)
cv.lasso = cv.gcdnet(X, Y, lambda2 = 0, nfolds = 10, method = "ls") # use 10-fold CV
plot(cv.lasso) # CV error = y axis

3.4 Elastic Net (E-NET)

  • We introduce the E-NET method first, and then use the coefficient estimators obtained from E-NET to construct the adaptive weights in adaptive LASSO and adaptive E-NET.

  • We do tuning on \(\lambda_2\) (for the \(\mathbb{L}^2\) penalty) from \(0.0001\) to \(0.01\) with an increment of \(0.001\).

lambda2_list = seq(0.0001, 0.01, by=0.001) # candidate lambda2
num.lambda2 = length(lambda2_list) # the number of candidate lambda2
min_cv_enet = numeric(num.lambda2)

for(i in 1:length(lambda2_list)){
  cv_enet = cv.gcdnet(X, Y, lambda2 = lambda2_list[i],
                      foldid = foldid_mat[, 2], method = "ls")
  # lambda (for L1 penalty) will be automatically generated by default setting
  min_cv_enet[i] = min(cv_enet$cvm)  # collect minimum cv error for each lambda2 candidate
}

plot(lambda2_list, min_cv_enet, xlab = "lambda2", ylab = "minimum CV error", type = 'l')

lambda2_list[which.min(min_cv_enet)] # optimal lambda2, which gives the minimum CV error
## [1] 1e-04
  • From the plot we can see that the optimal \(\lambda_2\) is close to zero.

  • Sometimes the optimal \(\lambda_2\) is not close to zero; in that case we simply use the \(\lambda_2\) that gives the lowest CV error.

  • Note: Choose appropriate range for lambda2_list based on the study.

lambda.list = cv_enet$lambda
lambda.list # candidate lambda (L1 penalty)
##   [1] 27.6850512 25.2255875 22.9846159 20.9427259 19.0822319 17.3870190
##   [7] 15.8424042 14.4350086 13.1526422 11.9841978 10.9195547  9.9494915
##  [13]  9.0656061  8.2602427  7.5264255  6.8577986  6.2485706  5.6934649
##  [19]  5.1876731  4.7268145  4.3068974  3.9242845  3.5756619  3.2580099
##  [25]  2.9685773  2.7048571  2.4645650  2.2456199  2.0461252  1.8643531
##  [31]  1.6987291  1.5478187  1.4103147  1.2850262  1.1708681  1.0668514
##  [37]  0.9720752  0.8857187  0.8070339  0.7353393  0.6700137  0.6104916
##  [43]  0.5562572  0.5068408  0.4618145  0.4207882  0.3834065  0.3493457
##  [49]  0.3183108  0.2900330  0.2642672  0.2407905  0.2193993  0.1999085
##  [55]  0.1821491  0.1659675  0.1512234  0.1377892  0.1255483  0.1143950
##  [61]  0.1042324  0.0949727  0.0865356  0.0788480  0.0718434  0.0654610
##  [67]  0.0596456  0.0543469  0.0495189  0.0451197  0.0411114  0.0374592
##  [73]  0.0341314  0.0310993  0.0283365  0.0258192  0.0235255  0.0214355
##  [79]  0.0195313  0.0177962  0.0162152  0.0147747  0.0134621  0.0122662
##  [85]  0.0111765  0.0101836  0.0092789  0.0084546  0.0077035  0.0070192
##  [91]  0.0063956  0.0058274  0.0053097  0.0048380  0.0044082  0.0040166
##  [97]  0.0036598  0.0033347  0.0030384  0.0027685
log(lambda.list) # the values of log(lambda)
##   [1]  3.320893  3.227859  3.134825  3.041791  2.948758  2.855724  2.762690
##   [8]  2.669656  2.576623  2.483589  2.390555  2.297521  2.204488  2.111454
##  [15]  2.018420  1.925386  1.832353  1.739319  1.646285  1.553252  1.460218
##  [22]  1.367184  1.274150  1.181117  1.088083  0.995049  0.902015  0.808982
##  [29]  0.715948  0.622914  0.529880  0.436847  0.343813  0.250779  0.157745
##  [36]  0.064712 -0.028322 -0.121356 -0.214390 -0.307423 -0.400457 -0.493491
##  [43] -0.586525 -0.679558 -0.772592 -0.865626 -0.958659 -1.051693 -1.144727
##  [50] -1.237761 -1.330794 -1.423828 -1.516862 -1.609896 -1.702929 -1.795963
##  [57] -1.888997 -1.982031 -2.075064 -2.168098 -2.261132 -2.354166 -2.447199
##  [64] -2.540233 -2.633267 -2.726301 -2.819334 -2.912368 -3.005402 -3.098436
##  [71] -3.191469 -3.284503 -3.377537 -3.470571 -3.563604 -3.656638 -3.749672
##  [78] -3.842705 -3.935739 -4.028773 -4.121807 -4.214840 -4.307874 -4.400908
##  [85] -4.493942 -4.586975 -4.680009 -4.773043 -4.866077 -4.959110 -5.052144
##  [92] -5.145178 -5.238212 -5.331245 -5.424279 -5.517313 -5.610347 -5.703380
##  [99] -5.796414 -5.889448
summary(log(lambda.list))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -5.89   -3.59   -1.28   -1.28    1.02    3.32
  • Cross-validation for \(\lambda\) given the optimal \(\lambda_2\)
cv_enet = cv.gcdnet(X, Y, foldid = foldid_mat[, 2],
                    lambda2 = lambda2_list[which.min(min_cv_enet)],
                    method = "ls") 
plot(cv_enet) # CV error given lambda and the optimal lambda2

lambda_opt_enet = cv_enet$lambda.min
lambda_opt_enet
## [1] 0.037459
summary(lambda.list)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0028  0.0277  0.2772  3.1161  2.7708 27.6851
  • Since \(\lambda_2\) is close to \(0\), the shrinkage result of E-Net is similar to that of LASSO.

  • Best model from E-NET

fit_enet = gcdnet(X, Y, lambda=lambda_opt_enet,
                  lambda2 = lambda2_list[which.min(min_cv_enet)],
                  method="ls")
beta_enet = coef(fit_enet)
beta_enet
## 43 x 1 sparse Matrix of class "dgCMatrix"
##                     s0
## (Intercept) 18.3664736
## z           10.5310815
## x1           3.3514539
## x2           1.0697612
## x3           0.0000000
## x4          -0.3444847
## x5          -4.5354775
## x6          -0.1914132
## x7           3.3869822
## x1.x2       -0.5046198
## x1.x3       -0.0681791
## x1.x4        0.0000000
## x1.x5        0.0000000
## x1.x6        0.0000000
## x1.x7        0.4288911
## x1.z         0.0000000
## x2.x3        0.0000000
## x2.x4        0.0046394
## x2.x5       -0.2311197
## x2.x6        .        
## x2.x7        0.1696795
## x2.z         0.0000000
## x3.x4       -0.2385012
## x3.x5        0.0000000
## x3.x6        0.0000000
## x3.x7        0.0000000
## x3.z         0.0000000
## x4.x5        0.2711996
## x4.x6        0.0000000
## x4.x7       -0.0043092
## x4.z        -0.0563444
## x5.x6        0.3002017
## x5.x7       -0.3301774
## x5.z        -0.3022551
## x6.x7        0.0000000
## x6.z        -0.0670827
## x7.z         0.0000000
## x1sq         0.0000000
## x2sq         0.0000000
## x3sq         0.0721358
## x4sq        -0.0620423
## x5sq         0.2715471
## x6sq         .
sum(beta_enet!=0) # number of terms selected including intercept
## [1] 25
row.names(beta_enet)[which(beta_enet != 0)] # names of terms being selected
##  [1] "(Intercept)" "z"           "x1"          "x2"          "x4"         
##  [6] "x5"          "x6"          "x7"          "x1.x2"       "x1.x3"      
## [11] "x1.x7"       "x2.x4"       "x2.x5"       "x2.x7"       "x3.x4"      
## [16] "x4.x5"       "x4.x7"       "x4.z"        "x5.x6"       "x5.x7"      
## [21] "x5.z"        "x6.z"        "x3sq"        "x4sq"        "x5sq"
  • Predictions from the best model from E-NET
Y_pred_enet = cbind(1, X)%*%as.vector(beta_enet) #calculate the predicted Y
  • Compute \(R^2\)
cor(Y_pred_enet, Y)^2 # R^2
##         [,1]
## [1,] 0.95161
  • Compute adjusted-\(R^2\)
1 - (1- cor(Y_pred_enet, Y)^2)*499/(499- (sum(beta_enet!=0)-1))  #adj R^2
##         [,1]
## [1,] 0.94916
  • Compute MSE
mean((Y - Y_pred_enet)^2) # MSE
## [1] 5.6977
  • Order the selected variables (with \(\beta \neq 0\))
order = order(abs(beta_enet[,1]), decreasing = TRUE)
beta.enet = data.frame(beta = beta_enet[order], row.names = rownames(beta_enet)[order])
beta.enet = subset(beta.enet, beta.enet[,1] != 0)
t(beta.enet)
##      (Intercept)      z      x5    x7     x1     x2    x1.x2   x1.x7       x4
## beta      18.366 10.531 -4.5355 3.387 3.3515 1.0698 -0.50462 0.42889 -0.34448
##         x5.x7     x5.z  x5.x6    x5sq  x4.x5   x3.x4    x2.x5       x6   x2.x7
## beta -0.33018 -0.30226 0.3002 0.27155 0.2712 -0.2385 -0.23112 -0.19141 0.16968
##          x3sq     x1.x3      x6.z      x4sq      x4.z     x2.x4      x4.x7
## beta 0.072136 -0.068179 -0.067083 -0.062042 -0.056344 0.0046394 -0.0043092
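  • A compact side-by-side summary of the two penalized fits (a sketch built from quantities computed above):
# number of selected terms (including intercept), R^2, and MSE for LASSO and E-Net
data.frame(model = c("LASSO", "E-Net"),
           terms = c(sum(beta_lasso != 0), sum(beta_enet != 0)),
           R2    = c(cor(Y_pred_lasso, Y)^2, cor(Y_pred_enet, Y)^2),
           MSE   = c(mean((Y - Y_pred_lasso)^2), mean((Y - Y_pred_enet)^2)))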

Before moving to the Adaptive Lasso and Adaptive E-NET, we first introduce the way to construct the adaptive weights.

  • The default way gcdnet handles the predictors \(X=((x_{ij}))_{n \times p}\) is to standardize them by \[x_{ij}^* = \frac{x_{ij} - \bar{x}_j}{\sqrt{\frac{1}{n}\sum\limits_{i=1}^n \Big(x_{ij} - \bar{x}_j\Big)^2}}, \qquad \text{where } \bar{x}_j = \frac{1}{n}\sum\limits_{i=1}^n x_{ij}.\]

  • After fitting the regularized linear regression to get the coefficients \(\beta_j^*\) for the standardized predictors, gcdnet returns the coefficients for the original predictors as \[\beta_j = \frac{\beta_j^*}{\sqrt{\frac{1}{n}\sum\limits_{i=1}^n \Big(x_{ij} - \bar{x}_j\Big)^2}}.\]

  • Denote the \(\beta^*\) obtained for the standardized predictors from E-NET as \(\beta^*_{e-net}\). The adaptive weights for \(\beta^*\) in Adaptive Lasso and Adaptive E-NET are constructed as \[w_j = \Big(|\beta^*_{e-net,j}| + \frac{1}{n}\Big)^{-\gamma},\] as recommended by Zou and Zhang (2009).

  • Here, the parameter \(\gamma > \frac{2\nu}{1-\nu}\) with \(0 \leq \nu = \lim\limits_{n \longrightarrow \infty} \frac{\log p}{\log n} < 1\).

  • To avoid tuning on \(\gamma\), we can choose \(\gamma = \lceil \frac{2\nu}{1-\nu} \rceil + 1\) as shown in the simulations of Zou and Zhang (2009).

  • By default, gcdnet chooses the \(\lambda\)-candidate sequence (for the \(\mathbb{L}^1\) penalty) based on the given adaptive weights; thus the \(\lambda\)-candidate sequence differs for different adaptive weights.

  • Therefore, we do not do tuning on \(\gamma\), and fix \(\gamma = \lceil \frac{2\nu^*}{1-\nu^*} \rceil + 1\) with \(\nu^* = \frac{\log p}{\log n}\).

  • The standard error and p-value for each non-zero estimated coefficient are obtained from Theorem 3.3 of Zou and Zhang (2009).

  • Note that Adaptive E-NET reduces to Adaptive Lasso when \(\lambda_2=0\).

3.5 Adaptive Lasso

  • To reduce redundancy, we only calculate weights for those terms selected by E-Net.
v_star = log(sum(beta_enet!=0)) / log(n)

gamma = ceiling(2*v_star/(1 - v_star)) + 1
gamma #4
## [1] 4
X_sd = apply(X, 2, sd)
weights_ad = (abs(beta_enet[-1]*X_sd) + 1/n)^(-gamma)
X_non0 = X[, which(beta_enet[-1]!=0)]
X_non0
##        z         x1         x2          x4          x5         x6          x7
##   [1,] 0 -0.7418453 -1.0645840 -0.07070489  1.21312727 -0.2507142  0.94783681
##   [2,] 0 -0.6257728 -0.1361881  0.13484876 -0.92179518 -0.7496121 -1.23822366
##   [3,] 0 -0.6710988 -0.6263737 -0.21789651 -0.28611988 -0.9289527 -0.84245696
##   [4,] 0 -0.3212268  0.1211559 -0.36865535 -0.32437430 -0.8542872  0.27917229
##   [5,] 0 -0.6363128 -0.4901829 -0.99033380 -0.72886332 -0.9393591 -0.28421582
##   [6,] 0 -0.7569199 -0.2973188 -0.37264361  0.05327005 -0.4208111  0.71837037
##   [7,] 0 -0.1474909  0.0267849 -0.70773993 -0.86080081 -0.8672893 -0.76337724
##   [8,] 1  1.8620285  2.2524479  0.05763310 -0.13230711 -0.2647147 -0.09190870
##              x1.x2       x1.x3       x1.x7       x2.x4       x2.x5       x2.x7
##   [1,]  0.78975659  0.72455615 -0.70314827  0.07527129 -1.2915e+00 -1.00905187
##   [2,]  0.08522281  0.00453198  0.77484669 -0.01836480  1.2554e-01  0.16863134
##   [3,]  0.42035866  0.48134490  0.56537183  0.13648465  1.7922e-01  0.52769292
##   [4,] -0.03891852  0.05470810 -0.08967763 -0.04466477 -3.9300e-02  0.03382337
##   [5,]  0.31190966  0.35252281  0.18085016  0.48544472  3.5728e-01  0.13931774
##   [6,]  0.22504655  0.47163183 -0.54374883  0.11079397 -1.5838e-02 -0.21358505
##   [7,] -0.00395053 -0.14684483  0.11259119 -0.01895676 -2.3056e-02 -0.02044700
##   [8,]  4.19412218  4.60321424 -0.17113663  0.12981556 -2.9801e-01 -0.20701957
##              x3.x4       x4.x5       x4.x7       x4.z       x5.x6       x5.x7
##   [1,]  0.06905707 -8.5774e-02 -0.06701670  0.0000000 -3.0415e-01  1.14984668
##   [2,] -0.00097660 -1.2430e-01 -0.16697293  0.0000000  6.9099e-01  1.14138861
##   [3,]  0.15628605  6.2345e-02  0.18356843  0.0000000  2.6579e-01  0.24104368
##   [4,]  0.06278565  1.1958e-01 -0.10291836  0.0000000  2.7711e-01 -0.09055631
##   [5,]  0.54865353  7.2182e-01  0.28146853  0.0000000  6.8466e-01  0.20715449
##   [6,]  0.23219180 -1.9851e-02 -0.26769613  0.0000000 -2.2417e-02  0.03826762
##   [7,] -0.70463980  6.0922e-01  0.54027255  0.0000000  7.4656e-01  0.65711574
##   [8,]  0.14247769 -7.6253e-03 -0.00529698  0.0576331  3.5024e-02  0.01216018
##               x5.z       x6.z       x3sq       x4sq       x5sq
##   [1,]  0.00000000  0.0000000  0.0494618  1.3438837  6.0405842
##   [2,]  0.00000000  0.0000000  1.2567755  1.7822517  0.0959079
##   [3,]  0.00000000  0.0000000  0.2142764  1.0679362  0.9011325
##   [4,]  0.00000000  0.0000000  0.9407060  0.8181383  0.8295384
##   [5,]  0.00000000  0.0000000  0.3772662  0.1391066  0.2538245
##   [6,]  0.00000000  0.0000000  0.3026980  0.8119812  1.6660614
##   [7,]  0.00000000  0.0000000  4.2053295  0.3777209  0.1376855
##   [8,] -0.13230711 -0.2647147 11.6922964  1.6103356  1.2189043
##  [ reached getOption("max.print") -- omitted 492 rows ]
weights_ad_non0 = weights_ad[which(beta_enet[-1] != 0)] 
weights_ad_non0 # only add weights to those beta_enet ne 0
##          z         x1         x2         x4         x5         x6         x7 
## 1.3490e-03 7.9073e-03 7.5789e-01 6.9385e+01 2.3591e-03 7.1459e+02 7.5809e-03 
##      x1.x2      x1.x3      x1.x7      x2.x4      x2.x5      x2.x7      x3.x4 
## 5.3222e+00 6.7012e+03 3.0726e+01 4.0729e+08 4.0477e+02 1.4080e+03 1.3643e+02 
##      x4.x5      x4.x7       x4.z      x5.x6      x5.x7       x5.z       x6.z 
## 1.8351e+02 5.9291e+08 3.1866e+05 4.3767e+01 8.5082e+01 5.6918e+02 2.8021e+05 
##       x3sq       x4sq       x5sq 
## 2.4737e+02 4.9200e+02 7.5214e-01
  • Cross-validation for \(\lambda\).
cv_adlasso = cv.gcdnet(X_non0, Y,  foldid = foldid_mat[, 2],
                       lambda = seq(0.000001, 0.01, length = 1000),
                       lambda2 = 0, pf = weights_ad_non0, method="ls")  
plot(cv_adlasso)

  • The optimal \(\lambda\), which gives the minimum CV error
lambda_opt_adlasso = cv_adlasso$lambda.min
lambda_opt_adlasso 
## [1] 7.1063e-05

We use the optimal \(\lambda\) to fit the adaptive LASSO

fit_adlasso = gcdnet(X_non0, Y, lambda = lambda_opt_adlasso,
                     lambda2 = 0, pf = weights_ad_non0, method="ls")
beta_adlasso = coef(fit_adlasso)
  • Number of selected terms including intercept
sum(beta_adlasso!=0) 
## [1] 18
  • Selected terms with \(\beta \neq 0\)
row.names(beta_adlasso)[which(beta_adlasso!=0)]
##  [1] "(Intercept)" "z"           "x1"          "x2"          "x4"         
##  [6] "x5"          "x6"          "x7"          "x1.x2"       "x1.x7"      
## [11] "x2.x5"       "x3.x4"       "x4.x5"       "x5.x6"       "x5.x7"      
## [16] "x5.z"        "x3sq"        "x5sq"
  • Predicted \(Y\)
Y_pred_adlasso = cbind(1, X[, which(beta_enet[-1] != 0)])%*%as.vector(beta_adlasso)
  • Compute \(R^2\)
cor(Y_pred_adlasso, Y)^2 # R^2
##        [,1]
## [1,] 0.9517
  • Compute adjusted-\(R^2\)
1 - (1- cor(Y_pred_adlasso, Y)^2)*499/(499- 21)  #adj R^2
##         [,1]
## [1,] 0.94958
  • Compute MSE
mean((Y - Y_pred_adlasso)^2) # MSE
## [1] 5.6769
# order the selected variables (with beta ne 0)
order = order(abs(beta_adlasso[,1]), decreasing = TRUE)
beta_adlasso = data.frame(beta = beta_adlasso[order],
                          row.names = rownames(beta_adlasso)[order])
beta_adlasso # beta of all terms
##                  beta
## (Intercept) 17.681284
## z           10.860967
## x5          -5.383510
## x7           3.416588
## x1           3.179394
## x2           1.171567
## x4          -0.625752
## x1.x7        0.621786
## x1.x2       -0.547950
## x5sq         0.457772
## x5.x7       -0.331631
## x2.x5       -0.326898
## x4.x5        0.302307
## x3.x4       -0.271014
## x5.x6        0.227288
## x5.z        -0.155125
## x6          -0.092843
## x3sq         0.055568
## x1.x3        0.000000
## x2.x4        0.000000
## x2.x7        0.000000
## x4.x7        0.000000
## x4.z         0.000000
## x6.z         0.000000
## x4sq         0.000000
beta_adlasso = subset(beta_adlasso, beta_adlasso[,1] != 0)
beta_adlasso # beta of terms selected in E-Net
##                  beta
## (Intercept) 17.681284
## z           10.860967
## x5          -5.383510
## x7           3.416588
## x1           3.179394
## x2           1.171567
## x4          -0.625752
## x1.x7        0.621786
## x1.x2       -0.547950
## x5sq         0.457772
## x5.x7       -0.331631
## x2.x5       -0.326898
## x4.x5        0.302307
## x3.x4       -0.271014
## x5.x6        0.227288
## x5.z        -0.155125
## x6          -0.092843
## x3sq         0.055568
  • In the following, we give the standard errors and p-values for the non-zero estimated coefficients (see Theorem 3.3 of Zou and Zhang (2009)).
# use the original order of beta
beta_adlasso = coef(fit_adlasso)

# estimated sigma squared
sigma2_adlasso = crossprod(Y - Y_pred_adlasso)/(n - sum(beta_adlasso != 0) - 1)

# Selected variables
X_adlasso = X_non0[, which(beta_adlasso[-1] != 0)]

# standardize X by its column SDs (the main-effect columns were already standardized earlier)
X_std_adlasso = X_adlasso / (rep(1, n)%*%t(apply(X_adlasso, 2, sd)))

# solve(): matrix inverse 
var_std_non0_adlasso = solve(crossprod(X_std_adlasso))*c(sigma2_adlasso)

# variance matrix for the non-zero estimated coefficients
# for original predictors
var_non0_adlasso = diag(1/apply(X_adlasso, 2, sd))%*%
  var_std_non0_adlasso%*%
  diag(1/apply(X_adlasso, 2, sd))

# variance for the intercept
var_int_adlasso = apply(X_adlasso, 2, mean)%*%var_non0_adlasso%*%apply(X_adlasso, 2, mean)

adlasso_tab = matrix(0, sum(beta_adlasso != 0), 3)
rownames(adlasso_tab) = rownames(beta_adlasso)[which(beta_adlasso!=0)]
colnames(adlasso_tab) = c("beta", "sd", "pvalue")

# beta
adlasso_tab[, 1] = beta_adlasso[which(beta_adlasso!=0), ] 
# sd
adlasso_tab[, 2] = c(sqrt(var_int_adlasso), sqrt(diag(var_non0_adlasso)))

# p-value
adlasso_tab[, 3] = sapply(abs(adlasso_tab[, 1])/adlasso_tab[, 2],
                          function(x) 2*(1 - pnorm(x))) 
adlasso_tab
##                  beta       sd     pvalue
## (Intercept) 17.681284 0.104078 0.0000e+00
## z           10.860967 0.320660 0.0000e+00
## x1           3.179394 0.228261 0.0000e+00
## x2           1.171567 0.197919 3.2308e-09
## x4          -0.625752 0.129935 1.4655e-06
## x5          -5.383510 0.362426 0.0000e+00
## x6          -0.092843 0.152097 5.4158e-01
## x7           3.416588 0.118368 0.0000e+00
## x1.x2       -0.547950 0.123966 9.8622e-06
## x1.x7        0.621786 0.123339 4.6244e-07
## x2.x5       -0.326898 0.171740 5.6982e-02
## x3.x4       -0.271014 0.107386 1.1611e-02
## x4.x5        0.302307 0.124238 1.4962e-02
## x5.x6        0.227288 0.125521 7.0178e-02
## x5.x7       -0.331631 0.116859 4.5416e-03
## x5.z        -0.155125 0.313063 6.2024e-01
## x3sq         0.055568 0.055741 3.1881e-01
## x5sq         0.457772 0.085515 8.6454e-08
# Compare with the lm model's coefficients, SDs and p-values
summary(lm(Y ~ X_adlasso))$coef
##                 Estimate Std. Error  t value    Pr(>|t|)
## (Intercept)    17.725547   0.378019  46.8906 1.0199e-181
## X_adlassoz     10.867208   0.383529  28.3348 1.1010e-104
## X_adlassox1     3.165926   0.261814  12.0923  1.3890e-29
## X_adlassox2     1.165311   0.198993   5.8560  8.7814e-09
## X_adlassox4    -0.617476   0.129623  -4.7636  2.5197e-06
## X_adlassox5    -5.115505   0.485473 -10.5372  1.6701e-23
## X_adlassox6    -0.209993   0.157326  -1.3348  1.8258e-01
## X_adlassox7     3.418793   0.118365  28.8836 3.2488e-107
## X_adlassox1.x2 -0.545100   0.129677  -4.2035  3.1331e-05
## X_adlassox1.x7  0.618979   0.123235   5.0228  7.1857e-07
## X_adlassox2.x5 -0.293320   0.171545  -1.7099  8.7933e-02
## X_adlassox3.x4 -0.298507   0.107164  -2.7855  5.5549e-03
## X_adlassox4.x5  0.340799   0.123951   2.7495  6.1936e-03
## X_adlassox5.x6  0.264618   0.131290   2.0155  4.4404e-02
## X_adlassox5.x7 -0.321261   0.117741  -2.7285  6.5940e-03
## X_adlassox5.z  -0.413665   0.312421  -1.3241  1.8611e-01
## X_adlassox3sq   0.063818   0.057092   1.1178  2.6421e-01
## X_adlassox5sq   0.430063   0.121799   3.5309  4.5389e-04
  • Adaptive LASSO and linear regression provide different beta coefficients, SDs and p-values, as the side-by-side sketch below illustrates.
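  • As a quick side-by-side check, here is a small sketch reusing adlasso_tab and the lm fit above (the object names lm_tab and compare_tab are my own):
lm_tab = summary(lm(Y ~ X_adlasso))$coef # unpenalized fit on the selected terms
compare_tab = data.frame(beta_adlasso = adlasso_tab[, "beta"],
                         beta_lm      = lm_tab[, "Estimate"],
                         sd_adlasso   = adlasso_tab[, "sd"],
                         sd_lm        = lm_tab[, "Std. Error"],
                         row.names    = rownames(adlasso_tab))
round(compare_tab, 4) # shrunken vs. unpenalized estimates and standard errors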

3.6 Adaptive E-NET

  • We tune \(\lambda_2\) (for the \(\mathbb{L}^2\) penalty) over a grid from \(0.001\) to \(0.025\) with an increment of \(0.001\).
lambda2_list_adenet = seq(0.001, 0.025, by = 0.001)

# using the same adaptive weights created for the adaptive LASSO
# list of lambda2
min_cv_adenet = numeric(length(lambda2_list_adenet))
for(i in 1:length(lambda2_list_adenet)){
  # lambda (for L1 penalty) will be automatically generated by default setting
  cv_adenet = cv.gcdnet(X_non0, Y, foldid = foldid_mat[, 2],
                        lambda2 = lambda2_list_adenet[i], pf = weights_ad_non0,
                        method="ls")
  # collect minimum cv error for each lambda2 candidate
  min_cv_adenet[i] = min(cv_adenet$cvm)
}
plot(lambda2_list_adenet, min_cv_adenet,xlab = "lambda2",
     ylab = "MSPE (5-fold CV)", type = 'l' )

lambda2_list_adenet[which.min(min_cv_adenet)]
## [1] 0.001
# candidate lambda (for L1 penalty)
lambda.list = cv_adenet$lambda 
lambda.list
##   [1] 3299.15721 3006.06918 2739.01828 2495.69144 2273.98109 2071.96688
##   [7] 1887.89905 1720.18330 1567.36696 1428.12640 1301.25559 1185.65563
##  [13] 1080.32525  984.35213  896.90500  817.22643  744.62629  678.47574
##  [19]  618.20183  563.28248  513.24202  467.64701  426.10254  388.24876
##  [25]  353.75781  322.33094  293.69595  267.60481  243.83154  222.17022
##  [31]  202.43323  184.44962  168.06362  153.13331  139.52937  127.13397
##  [37]  115.83974  105.54885   96.17218   87.62851   79.84384   72.75073
##  [43]   66.28776   60.39893   55.03326   50.14426   45.68958   41.63064
##  [49]   37.93229   34.56249   31.49205   28.69439   26.14526   23.82258
##  [55]   21.70625   19.77793   18.02091   16.41998   14.96128   13.63216
##  [61]   12.42112   11.31766   10.31223    9.39612    8.56139    7.80082
##  [67]    7.10782    6.47638    5.90104    5.37680    4.89914    4.46392
##  [73]    4.06736    3.70602    3.37679    3.07681    2.80347    2.55442
##  [79]    2.32749    2.12072    1.93232    1.76066    1.60425    1.46173
##  [85]    1.33188    1.21356    1.10575    1.00751    0.91801    0.83646
##  [91]    0.76215    0.69444    0.63275    0.57654    0.52532    0.47865
##  [97]    0.43613    0.39738    0.36208    0.32992
log(lambda.list)
##   [1]  8.1014223  8.0083886  7.9153548  7.8223211  7.7292874  7.6362536
##   [7]  7.5432199  7.4501861  7.3571524  7.2641187  7.1710849  7.0780512
##  [13]  6.9850174  6.8919837  6.7989499  6.7059162  6.6128825  6.5198487
##  [19]  6.4268150  6.3337812  6.2407475  6.1477138  6.0546800  5.9616463
##  [25]  5.8686125  5.7755788  5.6825451  5.5895113  5.4964776  5.4034438
##  [31]  5.3104101  5.2173763  5.1243426  5.0313089  4.9382751  4.8452414
##  [37]  4.7522076  4.6591739  4.5661402  4.4731064  4.3800727  4.2870389
##  [43]  4.1940052  4.1009715  4.0079377  3.9149040  3.8218702  3.7288365
##  [49]  3.6358027  3.5427690  3.4497353  3.3567015  3.2636678  3.1706340
##  [55]  3.0776003  2.9845666  2.8915328  2.7984991  2.7054653  2.6124316
##  [61]  2.5193979  2.4263641  2.3333304  2.2402966  2.1472629  2.0542292
##  [67]  1.9611954  1.8681617  1.7751279  1.6820942  1.5890604  1.4960267
##  [73]  1.4029930  1.3099592  1.2169255  1.1238917  1.0308580  0.9378243
##  [79]  0.8447905  0.7517568  0.6587230  0.5656893  0.4726556  0.3796218
##  [85]  0.2865881  0.1935543  0.1005206  0.0074868 -0.0855469 -0.1785806
##  [91] -0.2716144 -0.3646481 -0.4576819 -0.5507156 -0.6437493 -0.7367831
##  [97] -0.8298168 -0.9228506 -1.0158843 -1.1089180
summary(log(lambda.list))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1.11    1.19    3.50    3.50    5.80    8.10
  • Cross-validation for \(\lambda\) given the optimal \(\lambda_2\)
cv_adenet = cv.gcdnet(X_non0, Y, foldid = foldid_mat[, 2],
                      lambda2 = lambda2_list_adenet[which.min(min_cv_adenet)],
                      method = "ls", pf = weights_ad_non0)

# CV error given lambda and the optimal lambda2
plot(cv_adenet,ylab='MSPE (5-fold CV)')

lambda_opt_adenet = cv_adenet$lambda.min
lambda_opt_adenet
## [1] 0.32992
# optimal lambda2
lambda2_list_adenet[which.min(min_cv_adenet)]
## [1] 0.001
  • Fit the adaptive E-NET with the optimal \(\lambda\) and \(\lambda_2\) to get the final estimates of \(\beta\).
fit_adenet = gcdnet(X_non0, Y, lambda = lambda_opt_adenet,
                    lambda2 = lambda2_list_adenet[which.min(min_cv_adenet)],
                    method="ls", pf = weights_ad_non0)
beta_adenet = coef(fit_adenet)
  • Number of selected terms including intercept
sum(beta_adenet != 0)
## [1] 7
# names of selected terms
row.names(beta_adenet)[which(beta_adenet!=0)]
## [1] "(Intercept)" "z"           "x1"          "x2"          "x5"         
## [6] "x7"          "x5sq"
  • Prediction
Y_pred_adenet = cbind(1, X[, which(beta_enet[-1] != 0)])%*%as.vector(beta_adenet)
  • Compute \(R^2\)
cor(Y_pred_adenet, Y)^2 # R^2
##         [,1]
## [1,] 0.93822
  • Compute adjusted-\(R^2\)
1 - (1- cor(Y_pred_adenet, Y)^2)*499/(499- (sum(beta_adenet!=0)-1)) # adj R^2
##         [,1]
## [1,] 0.93747
  • Compute MSE
mean((Y - Y_pred_adenet)^2) # MSE
## [1] 7.2632
# order all terms by |beta| in decreasing order (non-zero terms will be kept below)
order = order(abs(beta_adenet[,1]), decreasing = TRUE)
beta_adenet = data.frame(beta=beta_adenet[order],
                         row.names=rownames(beta_adenet)[order])
beta_adenet
##                 beta
## (Intercept) 17.36857
## z           11.59789
## x5          -4.96295
## x7           3.51399
## x1           3.20387
## x5sq         0.38117
## x2           0.23683
## x4           0.00000
## x6           0.00000
## x1.x2        0.00000
## x1.x3        0.00000
## x1.x7        0.00000
## x2.x4        0.00000
## x2.x5        0.00000
## x2.x7        0.00000
## x3.x4        0.00000
## x4.x5        0.00000
## x4.x7        0.00000
## x4.z         0.00000
## x5.x6        0.00000
## x5.x7        0.00000
## x5.z         0.00000
## x6.z         0.00000
## x3sq         0.00000
## x4sq         0.00000
  • In the following, we compute the standard error and p-value for each non-zero estimated coefficient (see Theorem 3.3 of Zou and Zhang (2009)).
# calculate standard errors and p-values
beta_adenet = coef(fit_adenet)

# estimated sigma squared
sigma2_adenet = crossprod(Y - Y_pred_adenet)/(n - sum(beta_adenet!=0) - 1) 

# selected variables
X_adenet = X_non0[, which(beta_adenet[-1] != 0)]

# standardize X (the predictors were already standardized earlier, so this changes little)
X_std_adenet = X_adenet/(rep(1, n)%*%t(apply(X_adenet, 2, sd)))

# solve(): matrix inverse
var_std_non0_adenet = solve(crossprod(X_std_adenet))*c(sigma2_adenet)

# variance matrix for the non-zero estimated coefficients of the original predictors
var_non0_adenet = diag(1/apply(X_adenet, 2, sd))%*%
  var_std_non0_adenet%*%
  diag(1/apply(X_adenet, 2, sd))

# variance for the intercept
var_int_adenet = apply(X_adenet, 2, mean)%*%var_non0_adenet%*%apply(X_adenet, 2, mean)

adenet_tab = matrix(0, sum(beta_adenet != 0), 3)
rownames(adenet_tab) = rownames(beta_adenet)[which(beta_adenet != 0)]
colnames(adenet_tab) = c("beta", "sd", "pvalue")

# beta
adenet_tab[, 1] = beta_adenet[which(beta_adenet!=0), ]  
# sd
adenet_tab[, 2] = c(sqrt(var_int_adenet), sqrt(diag(var_non0_adenet)))
# p-value
adenet_tab[, 3] = sapply(abs(adenet_tab[, 1])/adenet_tab[, 2],
                         function(x) 2*(1 - pnorm(x)))
adenet_tab
##                 beta      sd     pvalue
## (Intercept) 17.36857 0.11324 0.0000e+00
## z           11.59789 0.33346 0.0000e+00
## x1           3.20387 0.21385 0.0000e+00
## x2           0.23683 0.20005 2.3648e-01
## x5          -4.96295 0.26590 0.0000e+00
## x7           3.51399 0.12887 0.0000e+00
## x5sq         0.38117 0.06296 1.4124e-09
  • Compare with the lm model's coefficients, SDs and p-values
summary(lm(Y ~ X_adenet))$coef
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept)  16.91909    0.32978  51.3034 7.8912e-200
## X_adenetz    11.42790    0.38335  29.8107 2.1240e-112
## X_adenetx1    2.75902    0.21995  12.5439  1.6099e-31
## X_adenetx2    0.90969    0.19750   4.6061  5.2291e-06
## X_adenetx5   -5.79184    0.39929 -14.5053  5.8393e-40
## X_adenetx7    3.50440    0.12727  27.5343 9.2190e-102
## X_adenetx5sq  0.58670    0.10065   5.8294  1.0066e-08

3.7 Training/Test Splitting

  • For a general data set, we now discuss how we can effectively perform variable selection by splitting the data into two parts: a training set and a test set.

  • Below we demonstrate the case of E-NET, but a similar procedure can be used for the other variable selection approaches.

  • Split the data into training and test data sets and name the variables \(X\) and \(y\) appropriately.

set.seed(5541)
train_obs = sample(seq_len(nrow(dat.all)), size = 250) # randomly select 250 obs as training obs
train = dat.all[train_obs,]
test = dat.all[-train_obs,]

# head(train)
# head(test)
# dim(train)
# dim(test)

y.train = train$y # y as the outcome in training data
y.test = test$y # y as the outcome in test data
xcov.train = train[,-1] # all predictors in training data
xcov.test = test[,-1] # all predictors in test data
names(xcov.train)
##  [1] "x1"    "x2"    "x3"    "x4"    "x5"    "x6"    "x7"    "z"     "x1.x2"
## [10] "x1.x3" "x1.x4" "x1.x5" "x1.x6" "x1.x7" "x1.z"  "x2.x3" "x2.x4" "x2.x5"
## [19] "x2.x6" "x2.x7" "x2.z"  "x3.x4" "x3.x5" "x3.x6" "x3.x7" "x3.z"  "x4.x5"
## [28] "x4.x6" "x4.x7" "x4.z"  "x5.x6" "x5.x7" "x5.z"  "x6.x7" "x6.z"  "x7.z" 
## [37] "x1sq"  "x2sq"  "x3sq"  "x4sq"  "x5sq"  "x6sq"
  • We want to use the training data set to do a 5-fold cross-validation.
## fold ID
set.seed(0)
n = nrow(train)
foldid_mat = cbind(sample(n), rep(1:5, each = n/5)) # assign into 5 folds
  • We now run E-NET on the training data set.

  • We use a large list of candidate values to identify the \(\lambda_2\) that gives the minimum CV error.

lambda2_list = seq(0.01, 2.0, by = 0.01) # candidates for lambda2
num.lambda2 = length(lambda2_list) # the number of candidates for lambda2
min_cv=c()

for(i in 1:num.lambda2){
  cv_enet = cv.gcdnet(xcov.train, y.train, lambda2 = lambda2_list[i],
                      foldid = foldid_mat[, 2], method = "ls")
  # lambda (for L1 penalty) will be automatically generated by default setting
  min_cv[i] = min(cv_enet$cvm) # collect minimum cv error for each lambda2 candidate
}
plot(lambda2_list, min_cv, xlab = "lambda2",ylab = "Minimum CV error", type = 'l')

lambda2_list[which.min(min_cv)] # optimal lambda2 which gives the minimum cv error
## [1] 0.01
lambda2_enet_opt = lambda2_list[which.min(min_cv)]
  • Now we do a cross-validation for an optimal value of \(\lambda_1\) given the optimal value of \(\lambda_2\).
## lambda 1
# cross validation for lambda given the optimal lambda2
cv_enet = cv.gcdnet(xcov.train, y.train, foldid = foldid_mat[, 2],
                    lambda2 = lambda2_enet_opt, method = "ls") 
plot(cv_enet) # CV error given lambda and the optimal lambda2

lambda_opt_enet = cv_enet$lambda.min
lambda_opt_enet
## [1] 0.023319
  • Use the optimal values of \(\lambda_1\) and \(\lambda_2\) to get the final estimates of \(\beta\).
# Regression
fit_enet = gcdnet(xcov.train, y.train, lambda=lambda_opt_enet,
                  lambda2 = lambda2_enet_opt, method="ls")
beta_enet = coef(fit_enet)
beta_enet
## 43 x 1 sparse Matrix of class "dgCMatrix"
##                     s0
## (Intercept) 18.1013731
## x1           3.7419862
## x2           0.8325687
## x3           0.4113393
## x4          -0.4171876
## x5          -4.1416765
## x6          -0.9002629
## x7           3.4021254
## z           10.3853736
## x1.x2       -0.3576881
## x1.x3       -0.3443246
## x1.x4        0.0085007
## x1.x5        0.0000000
## x1.x6        0.1211567
## x1.x7        0.3773487
## x1.z         0.0000000
## x2.x3        0.0000000
## x2.x4        0.1047705
## x2.x5       -0.2400901
## x2.x6       -0.0289408
## x2.x7        0.2712493
## x2.z        -0.0761900
## x3.x4       -0.0945836
## x3.x5        0.0487351
## x3.x6        0.0000000
## x3.x7        0.1492009
## x3.z        -0.6661710
## x4.x5        0.5323473
## x4.x6       -0.0474661
## x4.x7       -0.1712154
## x4.z        -0.2157479
## x5.x6        0.3114071
## x5.x7       -0.3726135
## x5.z        -0.5260048
## x6.x7        0.0304243
## x6.z        -0.1990544
## x7.z         0.0000000
## x1sq         0.0000000
## x2sq         0.0000000
## x3sq         0.1331426
## x4sq        -0.0270691
## x5sq         0.2452894
## x6sq         0.1348027
sum(beta_enet!=0) # number of terms selected including intercept
## [1] 36
row.names(beta_enet)[which(beta_enet!=0)] # names of terms being selected
##  [1] "(Intercept)" "x1"          "x2"          "x3"          "x4"         
##  [6] "x5"          "x6"          "x7"          "z"           "x1.x2"      
## [11] "x1.x3"       "x1.x4"       "x1.x6"       "x1.x7"       "x2.x4"      
## [16] "x2.x5"       "x2.x6"       "x2.x7"       "x2.z"        "x3.x4"      
## [21] "x3.x5"       "x3.x7"       "x3.z"        "x4.x5"       "x4.x6"      
## [26] "x4.x7"       "x4.z"        "x5.x6"       "x5.x7"       "x5.z"       
## [31] "x6.x7"       "x6.z"        "x3sq"        "x4sq"        "x5sq"       
## [36] "x6sq"
  • Use the estimated \(\beta\) values to compute the predictions on the training set.
## note that X should be matrix to compute predicted values
X.train = data.matrix(xcov.train)
Y_pred_train = cbind(1, X.train)%*%as.vector(beta_enet) # calculate the predicted Y
  • Compute \(R^2\).
cor(Y_pred_train, y.train)^2 # R^2
##         [,1]
## [1,] 0.95881
  • Compute adjusted-\(R^2\).
1 - (1- cor(Y_pred_train, y.train)^2)*249/(249- (sum(beta_enet!=0)-1))  # adj R^2
##         [,1]
## [1,] 0.95208
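  • For reference, the adjusted \(R^2\) computed here is \(1 - (1 - R^2)\,\frac{n-1}{n-p-1}\), where \(n = 250\) is the number of training observations and \(p\) is the number of selected non-intercept terms.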
  • Compute the mean-squared error
mean((y.train - Y_pred_train)^2) # MSE
## [1] 5.1756
  • Using the estimated \(\beta\), we now want to compute the predicted values on the test set, as well as \(R^2\), adjusted-\(R^2\) and the mean squared prediction error.
### calculate MSPE (mean squared prediction error) in the test set
X.test <- data.matrix(xcov.test)
Y_pred_test <- cbind(1, X.test)%*%as.vector(beta_enet) #calculate the predicted Y
cor(Y_pred_test, y.test)^2 # R^2
##         [,1]
## [1,] 0.94252
1 - (1- cor(Y_pred_test, y.test)^2)*249/(249- (sum(beta_enet!=0)-1))  #adj R^2
##         [,1]
## [1,] 0.93312
mean((y.test - Y_pred_test)^2) # MSPE
## [1] 6.5292
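  • As an optional summary (a small sketch reusing the objects created above), the training and test metrics can be collected into a single table; the modest drop from training to test performance suggests the selected E-NET model generalizes reasonably well.
perf = rbind(
  train = c(R2  = cor(Y_pred_train, y.train)^2,
            MSE = mean((y.train - Y_pred_train)^2)),
  test  = c(R2  = cor(Y_pred_test, y.test)^2,
            MSE = mean((y.test - Y_pred_test)^2)))
round(perf, 4)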

3.8 Super Learner

## Windows users may need to install Rtools first
## Go to https://cran.r-project.org/bin/windows/Rtools/ and install Rtools first

install.packages(c("SuperLearner","caret", "glmnet", "ggplot2"))
  • We now review the available models (wrappers) in SuperLearner; we will use the training and test data created in the previous subsection.
library(ggplot2)
library(SuperLearner)
listWrappers()
##  [1] "SL.bartMachine"      "SL.bayesglm"         "SL.biglasso"        
##  [4] "SL.caret"            "SL.caret.rpart"      "SL.cforest"         
##  [7] "SL.earth"            "SL.extraTrees"       "SL.gam"             
## [10] "SL.gbm"              "SL.glm"              "SL.glm.interaction" 
## [13] "SL.glmnet"           "SL.ipredbagg"        "SL.kernelKnn"       
## [16] "SL.knn"              "SL.ksvm"             "SL.lda"             
## [19] "SL.leekasso"         "SL.lm"               "SL.loess"           
## [22] "SL.logreg"           "SL.mean"             "SL.nnet"            
## [25] "SL.nnls"             "SL.polymars"         "SL.qda"             
## [28] "SL.randomForest"     "SL.ranger"           "SL.ridge"           
## [31] "SL.rpart"            "SL.rpartPrune"       "SL.speedglm"        
## [34] "SL.speedlm"          "SL.step"             "SL.step.forward"    
## [37] "SL.step.interaction" "SL.stepAIC"          "SL.svm"             
## [40] "SL.template"         "SL.xgboost"         
## [1] "All"
## [1] "screen.corP"           "screen.corRank"        "screen.glmnet"        
## [4] "screen.randomForest"   "screen.SIS"            "screen.template"      
## [7] "screen.ttest"          "write.screen.template"
  • Generate a library of models and run the Super Learner
SL.library = c("SL.bartMachine", "SL.caret", "SL.randomForest", "SL.glm", "SL.gam",
               "SL.glm.interaction", "SL.stepAIC" , "SL.glmnet", "SL.xgboost",
               "SL.ipredbagg", "SL.nnls", "SL.ridge", "SL.svm")

## Reduce this library if you think some learners are not helpful. This
## will reduce the computation time.
SL.library = c("SL.randomForest", "SL.stepAIC", "SL.glmnet")

set.seed(1)
sl = SuperLearner(Y = y.train, X = xcov.train, SL.library = SL.library)
sl
## 
## Call:  SuperLearner(Y = y.train, X = xcov.train, SL.library = SL.library) 
## 
## 
##                        Risk     Coef
## SL.randomForest_All 11.0901 0.082102
## SL.stepAIC_All       6.9954 0.437002
## SL.glmnet_All        6.9197 0.480897
  • The coefficient is how much weight the Super Learner puts on that model in the weighted average.

    • So a coefficient equal to \(0\) implies that the model is not used at all.
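  • A quick way to inspect these weights (a minimal sketch; both elements appear in the attribute list printed below):
sl$coef   # ensemble weights, one per learner
sl$cvRisk # cross-validated risk (mean squared error here) for each learner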
attributes(sl)
## $names
##  [1] "call"              "libraryNames"      "SL.library"       
##  [4] "SL.predict"        "coef"              "library.predict"  
##  [7] "Z"                 "cvRisk"            "family"           
## [10] "fitLibrary"        "cvFitLibrary"      "varNames"         
## [13] "validRows"         "method"            "whichScreen"      
## [16] "control"           "cvControl"         "errorsInCVLibrary"
## [19] "errorsInLibrary"   "metaOptimizer"     "env"              
## [22] "times"            
## 
## $class
## [1] "SuperLearner"
  • Review how long it took to run the Super Learner:
sl$times$everything
##    user  system elapsed 
##   14.16    0.07   14.24
  • Let us use the ensemble of models to predict on the test data set, compute \(R^2\) and mean squared prediction error.
# onlySL is set to TRUE so we don't fit algorithms that had weight = 0, saving computation.
pred = predict(sl, xcov.test, onlySL = TRUE)
str(pred) # Check the structure of this prediction object.
## List of 2
##  $ pred           : num [1:250, 1] 12.2 16.1 19.7 35.6 20.5 ...
##  $ library.predict: num [1:250, 1:3] 12.5 16.2 18.5 34.3 20.4 ...
summary(pred$library.predict) # Review the columns of $library.predict.
##        V1              V2             V3       
##  Min.   : 6.47   Min.   : 2.4   Min.   : 2.42  
##  1st Qu.:14.30   1st Qu.:14.5   1st Qu.:14.48  
##  Median :19.91   Median :21.1   Median :20.72  
##  Mean   :23.54   Mean   :23.6   Mean   :23.64  
##  3rd Qu.:32.85   3rd Qu.:32.9   3rd Qu.:33.25  
##  Max.   :45.39   Max.   :46.3   Max.   :45.74
qplot(pred$pred[, 1]) + theme_minimal() # Histogram of our predicted values.

summary(pred$pred[, 1]) # predicted value of y in test
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.89   14.43   20.70   23.61   33.06   45.68
summary(y.test) # observed value of y in test
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.353  15.606  22.280  23.969  32.431  46.538
qplot(y.test, pred$pred[, 1], geom=c("point","smooth"), method=lm)

plot(pred$pred[, 1], y.test)

cor(pred$pred[, 1], y.test)^2 # R^2
## [1] 0.94065
## MSPE
mean((y.test - pred$pred)^2) # MSPE
## [1] 6.8953

4 Other Methods

4.1 Bayesian Model Averaging (BMA)

  • Install and load the package BMA.
install.packages("BMA")
  • We include all \(7\) predictors \(+ z +\) interaction \(+\) squared terms in the BMA model.
#### Bayesian model averaging (BMA) model
library('BMA')
x.bma = x2
fit.bma = bicreg(x.bma, y, maxCol = ncol(x.bma)+1) # intercept is included by default.
summary(fit.bma)
## 
## Call:
## bicreg(x = x.bma, y = y, maxCol = ncol(x.bma) + 1)
## 
## 
##   186  models were selected
##  Best  5  models (cumulative posterior probability =  0.155 ): 
## 
##            p!=0    EV        SD       model 1    model 2    model 3  
## Intercept  100.0   1.76e+01  0.65558     17.605     17.537     17.535
## z          100.0   1.07e+01  0.42462     10.761     10.840     10.839
## x1         100.0   3.47e+00  0.53363      3.327      3.261      3.253
## x2          88.4   1.11e+00  0.47144      1.244      1.268      1.272
## x3           0.0   0.00e+00  0.00000      .          .          .    
## x4          75.0  -5.57e-01  0.34553     -0.779     -0.772     -0.784
## x5         100.0  -5.74e+00  0.40679     -5.830     -5.617     -5.860
## x6           0.0   0.00e+00  0.00000      .          .          .    
## x7         100.0   3.41e+00  0.11975      3.398      3.387      3.394
## x1.x2       93.3  -5.80e-01  0.21938     -0.622     -0.609     -0.604
## x1.x3        0.4  -3.77e-04  0.01102      .          .          .    
## x1.x4        2.1  -3.28e-03  0.02781      .          .          .    
## x1.x5        1.1  -1.26e-03  0.02210      .          .          .    
## x1.x6        0.2  -1.89e-04  0.00709      .          .          .    
## x1.x7       88.2   5.48e-01  0.23699      0.618      0.634      0.646
## x1.z         1.3  -1.54e-02  0.16887      .          .          .    
## x2.x3        0.0   0.00e+00  0.00000      .          .          .    
## x2.x4        0.4   4.07e-04  0.01559      .          .          .    
## x2.x5       30.7  -1.09e-01  0.18404      .          .         -0.277
## x2.x6        0.3  -2.50e-04  0.00785      .          .          .    
## x2.x7       14.5   7.99e-02  0.20863      .          .          .    
## x2.z         6.1  -6.46e-02  0.27752      .          .          .    
## x3.x4       19.5  -4.39e-02  0.10095      .          .          .    
## x3.x5        0.9  -1.42e-03  0.01885      .          .          .    
## x3.x6        0.2  -1.57e-04  0.00659      .          .          .    
##            model 4    model 5  
## Intercept     17.434     17.996
## z             10.871     10.825
## x1             3.274      3.226
## x2             1.273      1.223
## x3             .          .    
## x4            -0.797      .    
## x5            -5.655     -5.623
## x6             .          .    
## x7             3.396      3.415
## x1.x2         -0.622     -0.587
## x1.x3          .          .    
## x1.x4          .          .    
## x1.x5          .          .    
## x1.x6          .          .    
## x1.x7          0.613      0.635
## x1.z           .          .    
## x2.x3          .          .    
## x2.x4          .          .    
## x2.x5          .          .    
## x2.x6          .          .    
## x2.x7          .          .    
## x2.z           .          .    
## x3.x4          .          .    
## x3.x5          .          .    
## x3.x6          .          .    
##  [ reached getOption("max.print") -- omitted 23 rows ]
  • The posterior probability that each variable has a non-zero coefficient is computed as follows.
probne0.bma = fit.bma$probne0 # probability in percent
ord = order(probne0.bma, decreasing = TRUE) # sort by probne0 in decreasing order
probne0.bma = data.frame(probne0 = probne0.bma[ord], row.names = names(x.bma)[ord])
t(probne0.bma) # probability in percent
##           z  x1  x5  x7 x5sq x1.x2   x2 x1.x7 x5.x7 x4 x5.z x2.x5 x4sq x3.x4
## probne0 100 100 100 100  100  93.3 88.4  88.2  81.9 75   39  30.7 24.3  19.5
##         x4.x5 x2.x7 x2sq x1sq x5.x6 x2.z x4.z x6.z x1.x4 x1.z x1.x5 x4.x6 x3.x5
## probne0    18  14.5 12.2  9.3     7  6.1  4.5  2.3   2.1  1.3   1.1   1.1   0.9
##         x3.x7 x4.x7 x1.x3 x2.x4 x2.x6 x1.x6 x3.x6 x6.x7 x6sq x3 x6 x2.x3 x3.z
## probne0   0.8   0.8   0.4   0.4   0.3   0.2   0.2   0.2  0.2  0  0     0    0
##         x7.z x3sq
## probne0    0    0
  • The total number of variables that have a posterior probability \(> 5\%\) is
sum(probne0.bma > 5)
## [1] 20
postmean.bma = fit.bma$postmean
round(postmean.bma, digits=4) # get the coefficients
## (Intercept)           z          x1          x2          x3          x4 
##     17.6407     10.7499      3.4729      1.1075      0.0000     -0.5574 
##          x5          x6          x7       x1.x2       x1.x3       x1.x4 
##     -5.7439      0.0000      3.4087     -0.5799     -0.0004     -0.0033 
##       x1.x5       x1.x6       x1.x7        x1.z       x2.x3       x2.x4 
##     -0.0013     -0.0002      0.5479     -0.0154      0.0000      0.0004 
##       x2.x5       x2.x6       x2.x7        x2.z       x3.x4       x3.x5 
##     -0.1090     -0.0002      0.0799     -0.0646     -0.0439     -0.0014 
##       x3.x6       x3.x7        x3.z       x4.x5       x4.x6       x4.x7 
##     -0.0002      0.0032      0.0000      0.0511      0.0015     -0.0011 
##        x4.z       x5.x6       x5.x7        x5.z       x6.x7        x6.z 
##     -0.0238      0.0144     -0.2875     -0.2521     -0.0003     -0.0068 
##        x7.z        x1sq        x2sq        x3sq        x4sq        x5sq 
##      0.0000     -0.0341      0.1278      0.0000     -0.0521      0.5754 
##        x6sq 
##      0.0001
postsd.bma = fit.bma$postsd
round(postsd.bma, digits=4) # get the sd
##  [1] 0.6556 0.4246 0.5336 0.4714 0.0000 0.3455 0.4068 0.0000 0.1197 0.2194
## [11] 0.0110 0.0278 0.0221 0.0071 0.2370 0.1689 0.0000 0.0156 0.1840 0.0079
## [21] 0.2086 0.2775 0.1009 0.0188 0.0066 0.0422 0.0000 0.1235 0.0193 0.0162
## [31] 0.1311 0.0614 0.1743 0.3531 0.0084 0.0542 0.0000 0.1196 0.4028 0.0000
## [41] 0.0948 0.1013 0.0026
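  • As an optional convenience, here is a minimal sketch combining the inclusion probabilities, posterior means and posterior SDs from fit.bma into a single table (the object name bma_tab is my own):
# bicreg() reports no probne0 for the intercept, so prepend an NA
bma_tab = data.frame(probne0  = c(NA, fit.bma$probne0),
                     postmean = fit.bma$postmean,
                     postsd   = fit.bma$postsd,
                     row.names = names(fit.bma$postmean))
round(head(bma_tab[order(-bma_tab$probne0), ], 10), 4) # top 10 by inclusion probability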

4.2 Principal Component Analysis

  • Principal component analysis (PCA) is a method for extracting a small number of important derived variables from a large set of variables.

    • The extracted components represent a major proportion of the variation in the data set and capture its main patterns.

    • PCA gives a set of principal components that are linear combinations of the original variables, as in the expression below.
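    • On standardized predictors, the first principal component is \(\mathrm{PC}_1 = w_1 x_1 + w_2 x_2 + \cdots + w_7 x_7\) with \(\|w\|_2 = 1\), where the loadings \(w\) (my notation) are chosen to maximize \(\operatorname{Var}(\mathrm{PC}_1)\); each later component maximizes the remaining variance subject to being uncorrelated with the earlier components.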

  • Install and load the package FactoMineR.

install.packages("FactoMineR")
  • Conduct PCA
x1 = data2[, c(3:9)] # 7 predictors
  • PCA variable factor map
library(FactoMineR)
result = PCA(x1) # graphs generated automatically

# conduct the pca analysis
pca_x = prcomp(x1, scale = TRUE, center = TRUE) 
# scale and center the predictors so that each contributes comparably to the PCA
names(pca_x)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
pca_x$var = (pca_x$sdev)^2

# To compute the proportion of variance explained by each component, 
# we simply divide the variance by sum of total variance
pca_x$percentVar = (pca_x$var)/sum(pca_x$var)*100

# cumulative variance explained
pca_x$cumPerVar = cumsum(pca_x$percentVar)

# Plot the percent variance explained by each principal component
library(ggplot2)
pcaRes = pca_x # use a new name to avoid overwriting the outcome variable y
nVar = length(pcaRes$var)
varPlotData = data.frame(X = 1:nVar, Y = pcaRes$percentVar)
ggplot(varPlotData, aes(x = X, y = Y))+
  geom_point()+
  geom_line()+
  geom_hline(yintercept = 0, colour = "gray65", linetype = 'dashed')+
  geom_vline(xintercept = 0, colour = "gray65", linetype = 'dashed')+
  theme_bw()+
  labs (x = "No. of Principal Component", y = "Percent Variance" )

# or simply use the "plot" function to draw a scree plot
# plot(pca_x, type='l') # type = "lines"
plot(pca_x$cumPerVar, xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained", type = "b")

# biplot of the observations and variables on the first two principal components
biplot(pca_x)

summary(pca_x)
## Importance of components:
##                          PC1   PC2   PC3    PC4    PC5    PC6    PC7
## Standard deviation     1.712 1.296 0.958 0.8316 0.6306 0.4694 0.4017
## Proportion of Variance 0.419 0.240 0.131 0.0988 0.0568 0.0315 0.0231
## Cumulative Proportion  0.419 0.659 0.790 0.8887 0.9455 0.9769 1.0000
# each principal component is a linear combination of the original variables
pca_x$rotation
##         PC1        PC2        PC3       PC4        PC5       PC6        PC7
## x1 0.527308 -0.0430310  0.0029855 -0.245250  0.0978208  0.727094 -0.3488608
## x2 0.534988  0.0024486  0.0372759 -0.180592 -0.0534367 -0.670498 -0.4768178
## x3 0.540312 -0.0466598  0.0083292 -0.221124  0.0010474 -0.082509  0.8062953
## x4 0.271276 -0.2852336  0.5316786  0.748868  0.0031703  0.039045  0.0055813
## x5 0.067509  0.6602582  0.2542171  0.043954 -0.6942163  0.103984  0.0139405
## x6 0.054221  0.6701488  0.1784032  0.108238  0.7082157 -0.047403  0.0245172
## x7 0.250418  0.1719495 -0.7870172  0.532816 -0.0637080  0.019348 -0.0015428
  • The summary and plots show that PC1-PC5 together explain over \(90\%\) of the total variance.

  • Therefore, for modeling, we will use these 5 components as predictor variables.

# using PC1-PC5 as predictors
xnew = pca_x$x[,1:5]
# keep z in the model
summary(lm(y ~ z + xnew, data = data2))
## 
## Call:
## lm(formula = y ~ z + xnew, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.017 -1.839 -0.034  1.711  9.405 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   18.176      0.209   87.11   <2e-16 ***
## z             11.974      0.390   30.74   <2e-16 ***
## xnewPC1        2.386      0.111   21.43   <2e-16 ***
## xnewPC2       -1.809      0.097  -18.65   <2e-16 ***
## xnewPC3       -3.880      0.131  -29.59   <2e-16 ***
## xnewPC4        0.252      0.155    1.63      0.1    
## xnewPC5        2.421      0.199   12.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.81 on 493 degrees of freedom
## Multiple R-squared:  0.934,  Adjusted R-squared:  0.933 
## F-statistic: 1.16e+03 on 6 and 493 DF,  p-value: <2e-16

5 Real Data Exercise

Synopsis

Environmental pollutant exposures have been associated with various health outcomes. Most epidemiologic studies have examined environmental pollutants individually, but in real life we are exposed to pollutant mixtures, not single chemicals. Although multi-pollutant approaches have gained recognition recently, analytical challenges remain, such as the large number of potential exposures of interest, high correlations among some of these exposures, and high dimensionality. New methods for epidemiologic analysis of mixtures are being developed, but it is not well understood how well they perform or how well they identify key, causal risk factors.

Dataset

exercise_data.csv: This is the dataset used in Park et al. (2017)2. It contains 9664 subjects and 40 variables from NHANES 2003-2014. See the Data Dictionary at the end.

  • Outcome: LBXSGTSI, gamma-glutamyl transferase (GGT)

  • Exposure: 20 metals measured in blood and urine

  • Potential confounders: age, sex, race/ethnicity, education, smoking status, body mass index (BMI), urinary creatinine

Objectives

To identify important predictors of GGT, using three different methods discussed above:

  • ONE from conventional methods (e.g., forward selection, backward elimination, stepwise)

  • ONE from regularized (shrinkage) regression methods (e.g., LASSO, ENET, adaptive LASSO, and adaptive ENET)

  • ONE from CART or random forests.

For the conventional methods and regularized (shrinkage) regression methods, include quadratic terms and pairwise interactions among the 20 predictors.
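  • Below is a minimal starting sketch, not a required part of the exercise: variable names are taken from the Variable Dictionary that follows, and the log transform of GGT and the handling of missing values are optional choices you may want to revisit.
nhanes = read.csv("exercise_data.csv") # should be stored in the working directory

# 20 metals measured in blood and urine (see the Variable Dictionary)
metals = c("bpb", "bcd", "bhg", "utas", "uas3", "uas5", "uab", "uac", "udma", "umma",
           "uba", "ucd", "uco", "ucs", "umo", "upb", "usb", "utl", "utu", "uur")
# potential confounders to adjust for in the models
confounders = c("age", "sex", "raceeth", "educ", "smkstat", "bmi", "ucr")

# outcome: GGT (often right-skewed, so a log transform is one reasonable choice)
y.ex = log(nhanes$ggt)

# design matrix with main effects, all pairwise interactions and quadratic terms
# for the 20 metals; "( . )^2" adds pairwise interactions and I( .^2 ) adds squares
f = as.formula(paste("~ (", paste(metals, collapse = " + "), ")^2 +",
                     paste0("I(", metals, "^2)", collapse = " + ")))
X.ex = model.matrix(f, data = nhanes)[, -1] # drop the intercept column
# note: rows with missing metal values are dropped by the default NA handling,
# so subset y.ex and the confounders to the same rows before model fitting
dim(X.ex) # 20 main effects + 190 pairwise interactions + 20 squares = 230 columns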

Variable Dictionary

No Variable Label
1 seqn Participant ID
2 cycle NHANES cycle (3=2003-2004, 4=2005-2006, 5=2007-2008, 6=2009-2010, 7=2011-2012, 8=2013-2014)
3 sex Sex (1 male, 2 female)
4 age Age (years)
5 raceeth Race/Ethnicity (1 Mexican American, 2 Other Hispanic, 3 Non-Hispanic white, 4 Non-Hispanic black, 5 Other)
6 psu Primary Sampling Unit (PSU)
7 strata Sampling Strata
8 bmi Body Mass Index (kg/m2)
9 wt2yr Two-year weights of subsample
10 ucr Creatinine, urine (mg/dL)
11 ggt Gamma-Glutamyl Transferase (GGT) (U/L)
12 bpb Lead, blood (\(\mu g/dL\))
13 bcd Cadmium, blood (\(\mu g/L\))
14 bhg Total Mercury, blood (\(\mu g/L\))
15 utas Urinary total Arsenic (\(\mu g/L\))
16 uas3 Urinary Arsenous acid (\(\mu g/L\))
17 uas5 Urinary Arsenic acid (\(\mu g/L\))
18 uab Urinary Arsenobetaine (\(\mu g/L\))
19 uac Urinary Arsenocholine (\(\mu g/L\))
20 udma Urinary Dimethylarsinic acid (\(\mu g/L\))
21 umma Urinary Monomethylarsonic acid (\(\mu g/L\))
22 uba Barium, urine (\(\mu g/L\))
23 ucd Cadmium, urine (\(\mu g/L\))
24 uco Cobalt, urine (\(\mu g/L\))
25 ucs Cesium, urine (\(\mu g/L\))
26 umo Molybdenum, urine (\(\mu g/L\))
27 upb Lead, urine (\(\mu g/L\))
28 usb Antimony, urine (\(\mu g/L\))
29 utl Thallium, urine (\(\mu g/L\))
30 utu Tungsten, urine (\(\mu g/L\))
31 uur Uranium, urine (\(\mu g/L\))
32 sbp Systolic BP (mmHg)
33 dbp Diastolic BP (mmHg)
34 htn Hypertension status (1=yes, 0=no)
35 pmon Person Months of Follow-up from Exam Date
36 d_total Total mortality (0=alive, 1=death)
37 d_cvd CVD mortality (0=alive, 1=death)
38 d_cancer Cancer mortality (0=alive, 1=death)
39 smkstat Smoking status (1 never, 2 former, 3 current)
40 educ Education (1 <High School, 2 High School, 3 College+)


  1. Zou, H., & Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics, 37(4), 1733.

  2. Park SK, Zhao Z, Mukherjee B. Construction of environmental risk score beyond standard linear models using machine learning methods: Application to metal mixtures, oxidative stress and cardiovascular disease in NHANES. Environ Health 2017;16(1):102.