Goals

After working through this handout, you should:

- know how to install and load R packages;
- be able to fit a LASSO regression model with the glmnet package;
- be able to use cross-validation to choose the shrinkage parameter λ;
- be able to visualize and interpret the estimated coefficients and model predictions.

R packages

Many functions are available in R by default, but many, many more have been written by users and are available as packages that can be installed and loaded in R. Many of these are hosted on the Comprehensive R Archive Network (CRAN). We will install a package that provides functions for LASSO regression. You only have to install a package once (unless you update to a new version of R), but you need to load it with library() in every R session where you want to use it. I had you install all the necessary packages at the start of the semester, but if you’d like to try it for yourself, run the following code:

#Install glmnet package for LASSO regression 
install.packages("glmnet")
#Load package and make its functions available. Note the lack of quotes 
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
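
If you are not sure whether a package is already installed, you can guard the installation. Here is a minimal sketch using base R’s requireNamespace(), which checks for a package without attaching it:

#Install glmnet only if it is missing, then load it
if (!requireNamespace("glmnet", quietly = TRUE)) {
    install.packages("glmnet")
}
library(glmnet)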

Fitting a LASSO regression model in R with glmnet

Now, let’s use the glmnet function to fit a LASSO regression model. We will fit the model to a simulated data set in which 10 of 100 simulated covariates affect our response variable, with a sample size of 100 observations. We will use 10-fold cross-validation to choose the shrinkage parameter λ, and then visualize the results. (Because the data are generated with random numbers, your exact values will differ from those shown here.)
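
As a reminder, the LASSO estimates the coefficients by minimizing a penalized least-squares criterion; in the form glmnet uses for a Gaussian response,

\[ \min_{\beta_0, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \]

so larger values of λ shrink more coefficients exactly to zero.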

#Load the package to make its functions available (no quotes needed)
library(glmnet)
#Simulate the data: 100 covariates (columns) for 100 observations (rows)
X <- matrix(rnorm(100 * 100, mean = 0, sd = 1), nrow = 100, ncol = 100)
#Generate non-zero beta coefficients for the first 10 covariates and
#0s for the other 90
betas <- c(rnorm(10, 0, 1), rep(0, 90))
#Generate Y from X, the betas, and some noise; %*% is matrix
#multiplication
Y <- X %*% betas + rnorm(100, 0, 2)
#Choose lambda by 10-fold cross-validation; alpha = 1 specifies the LASSO
cv_model <- cv.glmnet(x = X, y = Y, alpha = 1, nfolds = 10)
lam <- cv_model$lambda.min
lam
## [1] 0.2219202
plot(cv_model)
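
Note that cv.glmnet reports two candidate values of λ: lambda.min, which minimizes the cross-validated error, and lambda.1se, the largest λ whose error is within one standard error of the minimum (a sparser, more conservative choice). A quick sketch of the comparison:

#lambda.1se is the largest lambda within 1 SE of the minimum CV error
cv_model$lambda.1se
#Compare the number of non-zero coefficients under each choice
#(these counts include the intercept)
sum(coef(cv_model, s = "lambda.min") != 0)
sum(coef(cv_model, s = "lambda.1se") != 0)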

#Now fit the model with lambda set to lam
lasfit <- glmnet(x = X, y = Y, alpha = 1, lambda = lam)
#Let's look at the coefficient estimates, all 100 betas
plot(lasfit$beta, type = "h", ylab = "Regression coefficient", xlab = "Covariate number",
cex.lab = 1.3)

sum(lasfit$beta != 0) #Number that were retained
## [1] 26
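Because we simulated the data, we know the first 10 covariates are the truly non-zero ones, so we can check how the selected set compares to the truth. A small sketch in base R:

#Covariates LASSO retained vs. the truly non-zero covariates
kept <- which(as.numeric(lasfit$beta) != 0)
truth <- which(betas != 0)
sum(kept %in% truth) #True positives: real effects that were retained
sum(!(kept %in% truth)) #False positives: noise covariates that slipped in
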
#Compare the true values for the betas to the estimated values
plot(betas, lasfit$beta, pch = 19, xlab = "True beta", ylab = "Est. beta",
cex.lab = 1.2)

cor.test(betas, as.numeric(lasfit$beta))
## 
##  Pearson's product-moment correlation
## 
## data:  betas and as.numeric(lasfit$beta)
## t = 14.596, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7537080 0.8808378
## sample estimates:
##       cor 
## 0.8276058
#Model predictions of Y
Y_pred <- predict(lasfit, s = lam, newx = X)
plot(Y, Y_pred, pch = 19, xlab = "Obs. values", ylab = "Predicted values",
cex.lab = 1.2)

#We can compute r2, which glmnet does not report by default
#Total sum of squares
sst <- sum((Y - mean(Y)) ^ 2)
#Residual or error sum of squares
sse <- sum((Y_pred - Y) ^ 2)
#r2 = prop. variance explained by model
r2 <- 1 - sse/sst
r2
## [1] 0.7146257
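
One caution: this r2 is computed on the same observations used to fit the model, so it is an optimistic estimate of predictive performance. A minimal sketch of a train/test split gives an honest out-of-sample estimate (holding out 30 of the 100 observations here; the split sizes are just illustrative):

#Fit on a random 70 observations, evaluate on the other 30
train <- sample(1:100, 70)
cv_tr <- cv.glmnet(x = X[train, ], y = Y[train], alpha = 1, nfolds = 10)
Y_hat <- predict(cv_tr, s = "lambda.min", newx = X[-train, ])
#Out-of-sample r2 on the held-out observations
1 - sum((Y[-train] - Y_hat)^2) / sum((Y[-train] - mean(Y[-train]))^2)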
