--- title: "HDS 6. Penalized regression" author: "Matti Pirinen, University of Helsinki" date: "6.11.2021" output: html_document: default --- {r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE)  In HDS5 notes, we considered multiple regression models with different numbers of predictors$p$. We had two information criteria, AIC and BIC, that balance the model fit (to the training data), measured by the likelihood function$L(\pmb\beta\,|\,\pmb y)$, and the flexibility of the model, mesured by a penalty term for the parameters$\pmb\beta$. Both criteria are instances of a minimization problem of an **objective function** of the form $$-2\log(L(\pmb \beta\,|\,\pmb y)) + \lambda\cdot \textrm{penalty} (\pmb \beta),$$ where the penalty function is only a function of the model parameters and$\lambda \geq 0$is a constant. For example, we get AIC, by setting$\lambda=2$and$\textrm{penalty} (\pmb \beta)= \sum_{j=1}^p \textrm{I}(\beta_j)$, where$\textrm{I}$is an indicator function taking value 1 if$\beta_j \neq 0$and 0 otherwise. This week, we will consider a more general family of non-negative penalties of the form $$\textrm{penalty}_q (\pmb \beta)= \sum_{j=1}^p |\beta_j|^q,\, q > 0.$$ We write the penalized objective function as $$O_{q,\lambda}(\pmb \beta\,|\,\pmb y) = -2\log(L(\pmb \beta\,|\,\pmb y)) + \lambda \cdot \textrm{penalty}_q (\pmb \beta).$$ The goal is to find the minimum value for this objective function with respect to$\pmb \beta$as a candidate solution that appropriately balances bias and variance of the model. Note that by agreeing that$0^0=0$, we could extend the above framework to include both AIC ($q=0,\lambda=2$) and BIC ($q=0, \lambda = \log(n))$. **Example 6.1.** For standard linear regression model the objective function is (up to an additive constant) $$O_{q,\lambda}(\pmb \beta\,|\,\pmb y) = (\pmb y-\pmb X \pmb \beta)^T (\pmb y-\pmb X \pmb \beta) + \lambda \cdot \textrm{penalty}_q (\pmb \beta) = \text{RSS} + \lambda\cdot\textrm{penalty}_q (\pmb \beta),$$ and by minimizing this function we will get a compromise between a small residual sum of squares (RSS) and a small penalty term. In other words, we will get a compromise between the ordinary least squares (OLS) estimates and the requirement that the cumulative magnitude of the coefficients is small. The parameter$\lambda$determines the importance of each component: When$\lambda=0$, we repeat the OLS and when$\lambda \to \infty$, we shrink all coefficients to 0. For intermediate values of$\lambda$, the solution lies somewhere between OLS and 0. Because minimizers of$O_{q, \lambda}(\pmb \beta\,|\,\pmb y)$depend on the scaling of the predictors, as a **pre-processing step**, we standardize the predictors before optimizing for$\pmb \beta$. This way the magnitude of the penalty attached to each$\beta$-coefficient is equal with respect to the variance explained in the outcome variable. Typically, the intercept term is not penalized. For linear model, we can leave the intercept$\beta_0$out from the model after the predictors and outcome have been mean-centered. For logistic or Poisson regression models, we do not include the intercept among the coefficients that form the penalty term even though we always have the intercept in the model. How should we determine the values of$q$and$\lambda$? Let's next consider what kinds of models we will get with commonly used values of$q=2$and$q=1$. ### Ridge regression ($q=2$) When$q=2$and the penalty term is the sum of squares, the regression model is called **ridge regression**. For this case, we can derive the solution analytically. Let's first expand the objective function: $$(\pmb y -\pmb X \pmb\beta)^T (\pmb y -\pmb X \pmb\beta) + \lambda\, \pmb \beta^T \pmb \beta = \pmb y^T \pmb y - 2\pmb y^T \pmb X \pmb \beta + \pmb \beta^T \left( \pmb X^T \pmb X + \lambda \pmb I \right) \pmb \beta.$$ By solving for$\widehat{\pmb \beta}_{\textrm {ridge}}$, that makes the derivative with respect to$\pmb \beta$vanish, we have $$0 - 2 \pmb y^T \pmb X + 2 \widehat{\pmb \beta}_{\textrm {ridge}}^T \left( \pmb X^T \pmb X + \lambda \pmb I \right) = 0 \iff \widehat{\pmb \beta}_{\textrm {ridge}} = \left( \pmb X^T \pmb X + \lambda \pmb I \right)^{-1} \pmb X^T \pmb y.$$ We see that, compared to the least square solution where$\lambda=0$, the ridge solution adds a positive diagonal term to$(\pmb X^T \pmb X)$before the matrix inversion. Numerically, this adds stability to the solution and makes computation of the solution possible even when the original matrix$\pmb X$was not of full rank. **Example 6.2.** Suppose that predictors in columns of$\pmb X$are **uncorrelated** and standardized and thus$(\pmb X^T \pmb X)= n \pmb I.$(When sample variance is used for standardization the scaling factor would be$n-1$rather than$n$but for simplicity we scale by$n$here.) Then $$\widehat{\pmb \beta}_{\textrm {ridge}} = (n +\lambda)^{-1} \pmb X^T \pmb y = \frac{n}{n+\lambda} \widehat{\pmb \beta}_{OLS},$$ where$\widehat{\pmb \beta}_{OLS}= \pmb X^T \pmb y /n$is the least squares solution of$\pmb y = \pmb X \beta + \pmb{\varepsilon}$. We see that the ridge regression simply shrinks the least squares solution towards 0 by a multiplier that decreases as the ratio$\lambda/n$increases. As we saw in HDS5, an important aspect of the multiple regression model is that it can tease apart the effects of correlated predictors. However, when there are many highly correlated variables in a linear regression model, their coefficients become poorly determined and exhibit high variance, which is reflected in large standard errors of each coefficient. This is because a large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated counterpart. By imposing a penalty term for the joint magnitude of the coefficients, this problem is alleviated. This was a main motivation to consider ridge regression in the first place [(Hoerl and Kennard, Technometrics 1970)](https://www.math.arizona.edu/~hzhang/math574m/Read/RidgeRegressionBiasedEstimationForNonorthogonalProblems.pdf). **Example 6.3.** Consider the case of two predictors$X_1$and$X_2$and outcome variable$Y$. All three are standardized to have mean 0 and variance 1. Correlation between$X_1$and$X_2$is$r$. We want to understand the behavior of the least squares estimator$\widehat{\pmb \beta} = (\widehat{\beta}_1,\widehat{\beta}_2)$of the model$Y=X_1\beta_1 + X_2\beta_2 + \varepsilon$as a function of$r$. We know from the course in linear models that the sampling distribution is $$\widehat{\pmb \beta} \sim \mathcal{N}\left(\pmb \beta, \sigma^2 \left(\pmb X^T \pmb X \right )^{-1}\right),$$ where$\sigma^2=\textrm{Var}(\varepsilon)$is the error variance. Note that in the observed data$\pmb x_j^T \pmb x_j \approx n\, \textrm{Var}(X_j) = n$for$j=1,2$and that$\pmb x_1^T \pmb x_2 \approx n\, \textrm{Cov}(X_1, X_2) = n\, \textrm{Cor}(X_1, X_2) = n\cdot r,$since the variables are standardized. By using the formula for the inversion of a 2-by-2 matrix, we have that $$\sigma^2 \left(\pmb X^T \pmb X\right)^{-1} = \frac{\sigma^2}{n} \left(\begin{array}{cc} 1 & r \\ r & 1 \end{array} \right)^{-1} = \frac{\sigma^2}{n(1-r^2)} \left(\begin{array}{cc} 1 & -r \\ -r & 1 \end{array} \right).$$ We see that the sampling variance related to the each coefficient (either$\beta_1$or$\beta_2$) alone is$\propto \frac{1}{n(1-r^2)}$, so will go DOWN as sample size$n$grows, but will go UP when$|r|$gets closer to 1. This means that the information to learn the true value of parameter$\beta_1$will grow as$n$grows but will decrease as$|r|$grow, and the total effect of these two properties on the amount of information is determined by their product$n (1-r^2)$. This explains why highly correlated variables in the linear model cause unstability to the estimates through large standard errors. However, there is no fundamental reason why linear model could not tell the effects of correlated variables apart from each other, but we simply would need a larger sample size to get the same accuracy (as measured by SEs) for correlated variables than we have for uncorrelated ones already with a smaller sample size. (To be precise, with correlation$r$, we will need$1/(1-r^2)$times the sample size compared to the uncorrelated case to get the same precision for the estimates.) We also see that the shape of the sampling distribution of$\widehat{\pmb \beta}$has correlation of$-r$, so has the opposite sign compared to the correlation of the original variables$X_1$and$X_2$. This is reflecting that if$X_1$and$X_2$are highly positively correlated (say$r>0.99$), then for the model fit, it is almost insignificant how we divide the true total value$B = \beta_1 + \beta_2$between$\widehat{\beta}_1$and$\widehat{\beta}_2$. By increasing the first coefficient and simultaneously decreasing the second coefficient by almost the same amount, the model fit stays almost the same. For example, we can get almost the same fitted values by a choice of$\widehat{\beta}_1 = B/2 + 100, \widehat{\beta}_2 = B/2 - 100$as we get by a choice of$\widehat{\beta}_1 = B/2, \widehat{\beta}_2 = B/2$. Obviously, if we want to interpret the coefficients of the model, then the estimates$\widehat{\beta}_1 = 0.02, \widehat{\beta}_2 = -0.01$would be much preferred over, say,$\widehat{\beta}_1 = 100.02, \widehat{\beta}_2 = -100.01$in case when truly$\beta_1 = \beta_2 = 0$. From that point of view, the unstability of estimates with highly correlated variables is a problem in the standard (unpenalized) regression. **Questions.** How does the 2-dimensional likelihood function of two highly correlated predictors look like? Where does the unstability of parameter estimates manifest in the likelihood function? How does an increasing sample size affect the shape of the likelihood surfaces of constant value and help in bringing more stability to estimates? **Example 6.4.** Let's reuse the generation of a pair of correlated variables from beginning of HDS5 notes. We generate two highly correlated ($0.99$) predictors and then generate the outcome$y = x_1 \cdot 1.5 + x_2 \cdot 1.0 + \varepsilon$. The sample size is small ($n=50$) and so we do not expect to have information to separate well which one of the two variables is having which effect size. Let's see how OLS (corresponding to$\lambda=0$) and ridge regression (here we set$\lambda = 10$) behave. We use lm.ridge() from MASS to fit the ridge regression. {r, fig.height = 4, fig.width = 7} library(MASS) set.seed(17102017) #function to generate a pair of correlated variables gen.cor.pair <- function(r, n){ #r is target correlation, n is sample size u = rnorm(n, 0, sqrt(abs(r))) #temporary variable that is used for generating x1 and x2 x1 = scale(u + rnorm(n, 0, sqrt(1 - abs(r)))) x2 = scale(sign(r)*u + rnorm(n, 0, sqrt(1 - abs(r)))) cbind(x1, x2) #return matrix } n = 50 X = gen.cor.pair(r = 0.99, n = n) #generate one pair of predictors cor(X) #check that now cor(x,z) ~ r. true.b = c(1.5, 1.0) #true values of coeffs for cols of X R = 100 #replications. predictors are fixed but outcome resampled for each replication OLS.res = matrix(NA, ncol = 2, nrow = R) #least squares estimates ridge.res = matrix(NA, ncol = 2, nrow = R) #ridge estimates for(ii in 1:R){ y = as.numeric( X %*% true.b + rnorm(n,0,1) ) y = y - mean(y) #mean center so that can ignore intercept OLS.res[ii,] = as.numeric(lm(y ~ 0 + X)$coeff) #joint model using least squares fit #ridge regression with lambda=10 ridge.res[ii,] = as.numeric(coefficients(lm.ridge( y ~ 0 + X, lambda = 10 ))) } #plot estimates across replications and mark true values by dotted lines par(mfrow = c(1, 2)) plot(OLS.res, col = "red", pch = 4, xlab = expression(beta[1]), ylab = expression(beta[2])) points(ridge.res, col = "blue", pch = 1) grid() abline(h = true.b[2], lty = 2, lwd = 2) abline(v = true.b[1], lty = 2, lwd = 2) points(0, 0, pch = 19, cex = 0.6) text(-0.3, -0.3, "0") #mark origin legend("topright", leg = c("OLS","RIDGE"), col = c("red","blue"), pch = c(4,1)) #plot sum of two coefficients across methods and mark true value by dotted line plot(rowSums(OLS.res), rowSums(ridge.res), pch = 19, xlab = "beta sum (OLS)", ylab = "beta sum (RIDGE)") abline(h = sum(true.b), lty = 2, lwd = 2) abline(v = sum(true.b), lty = 2, lwd = 2)  We see that OLS are unstable and vary widely across data sets along the line $\beta_1 + \beta_2 \approx 2.5$ whereas ridge regression estimates are much more stable and concentrate near to the closest point between the above mentioned line and the origin, because that point minimizes the ridge objective function among all points with the same distance from origin (panel top left). This behavior is a clear difference between OLS and ridge regression. With small amount of data and highly correlated predictors, we simply do not have information to split the effect between the two variables. We can see how well we estimate the sum of the effects of the two variables, rather than how well we can split it into its parts (panel top right). We see that OLS solution has an average the correct value (2.5 is in the middle of estimated) whereas ridge regression is underestimating the sum of $\beta$s (the average is around 2.2). Let's have a closer look at the estimates of the individual coefficients. {r, fig.width = 7, fig.height = 6} par(mfrow = c(2, 2)) for(ii in 1:2){ res = OLS.res[,ii] hist(res, breaks = 15, col = "red", xlab = paste0("beta_",ii," (OLS)"), main = "OLS", sub = paste("mean =",signif(mean(res), 2)))} for(ii in 1:2){ res = ridge.res[,ii] hist(res, breaks = 15, col = "blue", xlab = paste0("beta_",ii," (RIDGE)"), main = "RIDGE", sub = paste("mean =",signif(mean(res), 2)))}  We know that OLS is unbiased, that is, its mean converges to the correct value across simulations (red histograms). Ridge regression, instead, is biased (blue histograms). By penalizing likelihood we cause some shrinkage to the coefficients towards each other and towards 0. This bias is the price to pay for reduced variance compared to OLS. Let's see which method the bias-variance tradeoff favors here by computing MSEs of individual coefficients. {r} mse <- function(x,y){mean((x-y)^2)} data.frame(mse.b1 = c(mse(OLS.res[,1], true.b[1]), mse(ridge.res[,1], true.b[1])), mse.b2 = c(mse(OLS.res[,2], true.b[2]), mse(ridge.res[,2], true.b[2])), row.names = c("OLS","ridge"))  While ridge regression has a larger bias than OLS, it also has a much lower variance than OLS, and the combined effect of these properties is a dramatically lower MSE for ridge regression. This shows the utility of penalized regression to add robustness to analyses with highly correlated predictors with too little information to split between them. Why, technically, does ridge regression produce stable estimates? The likelihood part of the objective function (which is RSS) is exactly the same as for the OLS and would therefore be minimized by OLS. However, when there is little information to decide between a wide variety of combinations of $\left(\widehat{\beta}_1,\widehat{\beta}_2\right)$ that all produce almost the same likelihood value as long as their sum remains constant (see figure above where red Xs are spread over a line on which $\widehat{\beta}_1 + \widehat{\beta}_2 \approx 2.5$), then the OLS is very unstable because even small changes in data can make the minimum of the RSS change widely across this large region of flat likelihood. Ridge regression instead prefers those points in the parameter space, where, in addition to low RSS, the magnitudes of all the parameters are fairly small. Thus, from among the points in the large region of almost flat likelihood, we will choose a one that is fairly close to the origin (0,0) to keep the ridge penalty low. What would happen if true values of $\beta_1$ and/or $\beta_2$ had a large magnitude? Would ridge regression then fail badly because it is shrinking the effects? Ridge regression is a biased method and the bias can in some cases be large. However, the tuning parameter $\lambda$ determines how biased the method is, and in cases where less bias is enough to get good predictive accuracy, then $\lambda$ will be kept smaller and ridge regression is closer to OLS. The example above used a fixed value of $\lambda=10$ without justifying this choice. Typically, $\lambda$ is chosen by cross-validation and can therefore adapt to the data. In addition to handling correlated variables, another benefit of penalized regression is to handle cases where $p$ is large (compared to $n$). Here the unpenalized least squares methods tend to overfit to the training data and adding some robustness by penalizing model complexity is needed. Next we consider another version of penalized regression called LASSO that is particularly suitable for variable selection. ### LASSO ($q=1$) The Least Absolute Shrinkage and Selection Operator (LASSO), introduced by [Tibshirani 1996](https://www.jstor.org/stable/2346178), is achieved by minimizing the objective function $$O_{1,\lambda}(\pmb \beta\,|\,\pmb y) = -2\log(L(\pmb \beta\,|\,\pmb y)) + \lambda \cdot \textrm{penalty}_1 (\pmb \beta) = -2\log(L(\pmb \beta\,|\,\pmb y)) + \lambda \cdot \sum_{j=1}^p |\beta_j|.$$ Thus, the difference to ridge regression is the change from the $\ell^2$ penalty $\beta_j^2$ to the $\ell^1$ penalty $|\beta_j|$ per each coefficient. Because the LASSO penalty includes the absolute value operator, the objective function is not differentiable and, as a result, lacks a closed form solution in general. However, in the special case of uncorrelated predictors ($\pmb X^T \pmb X = n\pmb I$) it is possible to obtain the closed form solution for LASSO using the soft-thresolding operator $(t)_+$, which equals $t$, when $t>0$, and 0 otherwise. Suppose that predictors in columns of $\pmb X$ are uncorrelated. Then the LASSO solution is $$\widehat{\beta}_j^{\textrm{(LASSO)}} = \textrm{sign} \left( \widehat{\beta}_{j}^{\textrm{(OLS)}} \right) \left(\left|\widehat{\beta}_{j}^{\textrm{(OLS)}}\right|-\lambda\right)_+ =\left\{ \begin{array}{cc} \left(\widehat{\beta}_{j}^{\textrm{(OLS)}}-\lambda\right), & \textrm{ if }\, \widehat{\beta}_{j}^{\textrm{(OLS)}} > \lambda\\ 0, & \textrm{ if }\, \left|\widehat{\beta}_{j}^{\textrm{(OLS)}}\right| \leq \lambda \\ \left(\widehat{\beta}_{j}^{\textrm{(OLS)}}+\lambda\right), & \textrm{ if }\, \widehat{\beta}_{j}^{\textrm{(OLS)}} < -\lambda \end{array}\right .,$$ where OLS refers to the ordinary least squares estimate from the linear model. Let's check this using glmnet package. You can install it from CRAN and detailed instructions of installation and other aspects are at https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html. {r, warning = F} library(glmnet) n = 100 X = as.matrix(scale(cbind(rnorm(n), rnorm(n)))) #two uncorrelated columns y = as.numeric(scale(X %*% c(0.5, -0.2) + rnorm(n))) OLS.coef = coefficients(lm(y ~ X)) OLS.coef #Least squares estimate lambda = c(0.6, 0.4, 0.15, 0.1, 0.0) #Let's choose these to cover the range of OLS values. g.fit = glmnet(X, y, lambda = lambda, alpha = 1) #alpha=1 means LASSO, we come back to that later. data.frame(lambda = g.fit$lambda, t(as.matrix(coef(g.fit))))  glmnet uses symbol s for$\lambda$, so e.g. s0 refers to the largest$\lambda$value of 0.6 and s4 to the smallest value of 0.0. They are always ordered in descending order no matter how they were ordered in the original lambda vector. We see that the number of non-zero coeffs increases from 0 ($\lambda=0.6$) to 2 (starting from$\lambda=0.15$). We also see that whenever there is a non-zero coefficient, its distance to the OLS of the same parameter is approximately$\lambda$. For example, OLS for$\beta_2$is -0.15 and therefore it does not survive penalization in models with$\lambda=0.6$or$0.4$as these are clearly$> 0.15$, but it starts to appear in the model when$\lambda$drops to around 0.15. Just like with ridge regression, we get back to OLS when$\lambda=0$. We can also plot glmnet results. {r, fig.width =8} plot(g.fit, xvar = "lambda", lwd = 1.5) abline(h = 0, col = "gray", lty = 2)  Each curve corresponds to a predictor. It shows the path of its coefficient (y-axis) against$\log(\lambda)$at the grid points used in glmnet above (from log(0.1) ~ -2.3 to log(0.6) ~ -0.51). The axis above the plot indicates the number of nonzero coefficients at each$\lambda$, which is the effective degrees of freedom (df) for the LASSO. There are also other options for the x-axis (see ?plot.glmnet). {r} coef(g.fit, s = 0.1) #get coefficients at s = 0.1. coef(g.fit, s = 0.2) #interpolation at s = 0.2 (since 0.2 was not computed above) #exact coefs at s = 0.2 from original data. Dot means 0 in sparse matrix format. coef(g.fit, s = 0.2, exact = T, x = X, y = y)  Let's see how LASSO performs on our earlier data set from HDS5 where we had$p=70$predictors of which only the first four had a non-zero effect (and together explained about 20% of variance) but since training data$n=150$was small, pure OLS led to bad overfitting where the full model explained 50% of variance in training data but essentially 0% in test data. Since these predictors are generated independently, we expect that LASSO corresponds to soft-thresholding in these data. The key question is what should$\lambda$be? To determine this we can use cv.glmnet() that does an automatic (10-fold) cross-validation to find the optimal$\lambda$. {r} set.seed(20102017) #We repeat exactly the dataset from HDS5 p = 70 n.tr = 150 #training set n.te = 150 #test set n = n.tr + n.te i.tr = 1:n.tr; i.te = (n.tr + 1):n #indexes for training and testing phi = 0.05 #variance explained by x_1, should be 0 < phi < 1. b = rep(c(sqrt(phi / (1-phi)), 0), c(4, p-4)) #effects 1,2,3,4 are non-zero, see Lect. 0.1 for "phi" X = matrix(rnorm(n*p), nrow = n) #columns 1,...,4 of X have effects, other 66 cols are noise eps = scale(rnorm(n, 0, 1)) #epsilon, error term y = scale(X%*%b + eps) #makes y have mean = 1, var = 1 y.tr = y[i.tr]; y.te = y[i.te] X.tr = data.frame(scale(X[i.tr,])); X.te = data.frame(scale(X[i.te,])) lm.fit = lm(y.tr ~ 0 + . , data = X.tr) #fit the model lasso.cv = cv.glmnet(x = as.matrix(X.tr), y = y.tr, alpha = 1, type.measure = "mse") lasso.cv$lambda.min #this is lambda corresponding to minimum MSE seen during CV lasso.cv$lambda.1se #this is largest lambda with CV'd MSE < 1sd of min CV'd MSE plot(lasso.cv) #shows CV'd MSEs across lambdas, lines for lambda.min and lambda.1se  CV plot shows how CV'd MSE is about 2.0 for$\lambda\approx 0$and drops around 0.9 at the minimum. This agrees with results in HDS5 where we saw that CV with cv.glm gave MSE estimate of about 2.0 and this was considerably larger than the variance within the training data (1.08). A reason one might prefer lambda.1se over lambda.min is that the former produces more sparsity (larger penalty) while still being acceptably close to the minimum observed MSE. predict.cv.glmnet() uses lambda.1se by default. Cross-validation for linear model can be done using error measure specified by type.measure parameter. The default value mse applies mean-square error and the other option is mae (mean absolute error). **Questions.** 1. Where in the CV MSE plot the method has high variance and where high bias? 2. What would it mean if your CV MSE plot were monotonic (either increasing or decreasing) from left to right and the reported minimum were at the end of the plot? Let's plot the LASSO solution paths for each coefficient and let's plot the model for lambda.1se. {r} plot(lasso.cv$glmnet.fit, label = T, xvar = "lambda") #plot LASSO sequence abline(v = log(lasso.cv$lambda.min), lty = 2) abline(v = log(lasso.cv$lambda.1se), lty = 2)  The glmnet plot drawn from the glmnet.fit component of cv.glmnet object shows how the 70 predictors evolve from unpenalized OLS (left side) to LASSO solution (dotted lines on right side). We see that only 4 or 12 variables are non-zero in LASSO solution depending whether we look at the lambda.1se or lambda.min model. {r} #how many non-zero coefficients in each model data.frame(model = c("min","1se"), lambda = c(lasso.cv$lambda.min,lasso.cv$lambda.1se), nzero = c(lasso.cv$nzero[c(which(lasso.cv$lambda == lasso.cv$lambda.min), which(lasso.cv$lambda == lasso.cv$lambda.1se))]))  To see the coefficients, we can call coef(glmnet.fit, s=lambda). Let's do that for model lambda.1se. {r} coef(lasso.cv$glmnet.fit, s = lasso.cv$lambda.1se)  Since the first coefficient corresponds to the intercept, we will remove the first coefficient and then our indexes correspond to the predictors 1...70. {r} b.1se = as.numeric(coef(lasso.cv$glmnet.fit, s = lasso.cv$lambda.1se))[-1] data.frame(predictor = which(b.1se != 0.0), coeff = b.1se[which(b.1se != 0.0)])  Thus lambda.1se kept 3 out of the 4 correct variables (1,2,3,4) and additionally included variable 45 with a small coefficient. Let's see how well LASSO models trained on the training data work on the test data. Let's compute MSE in test data and compare it to the values from HDS5. {r} pred.lasso.1se = predict(lasso.cv, newx = as.matrix(X.te)) #default is at lambda.1se pred.lasso.min = predict(lasso.cv, newx = as.matrix(X.te), s = "lambda.min") data.frame(MSE = c(mse(y.te, pred.lasso.1se), mse(y.te, pred.lasso.min)), row.names = c("1se","min"))  Remember that in HDS5 the variance in test data was 0.92 and MSEs in test data were OLS|AIC back|AIC fwd|BIC back|BIC fwd -|-|-|-|-|- 1.60|1.32|1.00|0.82|0.86 It seems that with LASSO we do quite similarly to BIC models. What about ridge regression on the same problem? It can also be done with glmnet by setting alpha=0. {r} ridge.cv = cv.glmnet(x = as.matrix(X.tr), y = y.tr, alpha = 0) plot(ridge.cv) #shows CV'd MSEs across lambdas, lines for lambda.min and lambda.1se plot(ridge.cv$glmnet.fit, label = T, xvar = "lambda") abline(v = log(ridge.cv$lambda.min), lty = 2) abline(v = log(ridge.cv$lambda.1se), lty = 2) pred.ridge.1se = predict(ridge.cv, newx = as.matrix(X.te)) #default is at lambda.1se pred.ridge.min = predict(ridge.cv, newx = as.matrix(X.te), s = "lambda.min") data.frame (MSE = c( mse(y.te, pred.ridge.1se), mse(y.te, pred.ridge.min)), row.names = c("1se","min"))  We see that ridge regression model does not work on these data well as its prediction accuracy measured by MSE is no smaller than variance of test data, 0.92. We also see from the coefficient plot that the lambda.1se model essentially shrinks all coefficients to very small values so would correspond to a pure intercept model that predicts everything by the mean of the training data which explains the numerical similarity of MSE to the variance of the outcome variable. Remember that predicted values from pure OLS had much higher MSE than the variance of the test data, whereas ridge regression coefficients at lambda.1se are shrunk in such a way that no such overfitting happens. This particular example is a case where sparse models, such as LASSO, are preferred over ridge regression. Typically, however, when there are many non-zero coefficients, then ridge regression often provides a better prediction accuracy than LASSO. ### Ridge regression vs. LASSO #### Constrained optimization interpretation It can be shown (by using a constrained optimization technique called Lagrange multipliers) that ridge regression and LASSO can be formulated as optimization problems within a certain region around the origin $\pmb 0$, whose size is controlled by a parameter $t$ that is a function of parameter $\lambda$ and the likelihood function. * Ridge regression (with $t=t(\lambda, \pmb y))$:$$\textrm{Find minimum of } -\log(L(\pmb \beta|\pmb y)) \textrm{ when } \sum_{j=1}^p \beta_j^2 \leq t.$$ * LASSO (with $t=t(\lambda, \pmb y))$:$$\textrm{Find minimum of } -\log(L(\pmb \beta|\pmb y)) \textrm{ when } \sum_{j=1}^p |\beta_j| \leq t.$$ In both cases, for large enough $t\geq t_0$, the maximum likelihood estimate (MLE) will belong to the search region and therefore the MLE will be the solution of the problem. This threshold value $t_0$ corresponds to $\lambda=0$ and values $t0$. Obviously, for ridge regression $t_0 = \sum_{j=1}^p \widehat{\beta}_j^2$ and for LASSO $t_0 = \sum_{j=1}^p \left|\widehat{\beta}_j\right|$, where $\widehat{\beta}_j$ is the MLE of $\beta_j$. #### Bayesian interpretation Let's consider the linear model with Gaussian errors with known variance $\sigma^2$. $$y_i \sim \mathcal{N}(\beta_0 + \pmb x_i^T \pmb \beta, \sigma^2 ), \textrm{ for } i=1,\ldots,n.$$ Let's assume that the parameters $\beta_j$ have a Gaussian prior with known variance $\tau^2$, $$\beta_j \sim \mathcal{N}(0,\tau^2), \textrm{ for } j=1,\ldots,p.$$ By Bayes rule, the posterior distribution will be proportional to the product of prior and likelihood. Consequently, the log-posterior is, up to an additive constant, the sum of logarithms of prior and likelihood. Thus, up to a multiplicative constant $C$ that does not depend on $\pmb \beta$, $$\log(\textrm{Pr}(\pmb \beta | \pmb y)) = C -\frac{1}{2 \tau^2} \pmb \beta^T \pmb \beta -\frac{1}{2 \sigma^2} \left(\pmb y - \beta_0 - \pmb X \pmb \beta\right)^T \left(\pmb y - \beta_0 - \pmb X \pmb \beta\right)$$ The maximum value of the posterior distribution (maximum a posteriori estimate, MAP estimate) is the same value as the solution to the ridge regression problem with $\lambda = \sigma^2 / \tau^2$. Thus ridge regression is nothing else than a MAP estimate from a Bayesian linear model with zero mean Gaussian prior on the parameters. This gives a new interpretation for $\lambda$ values. If we assume a priori that the linear model coefficients $\pmb \beta$ are quite close to zero then we make $\tau^2$ small and consequently $\lambda\propto \tau^{-2}$ large. If instead we assume a flat prior ($\tau^2 \to \infty$) then $\lambda \to 0$ and we are considering the pure maximum likelihood estimate, which coincides with the OLS for this linear model. This relationship shows how a prior assumption about the parameters of the statistical model leads to ridge regression, whereas our original definition of ridge regression through the use of a "penalized likelihood" was an optimization formulation of an ad-hoc (although common sense) objective function. In Bayesian statistics, the penalized likelihood is simply the posterior distribution of the parameters corresponding to a particular prior distribution. We can do a similar interpretation for the LASSO model. There the prior of each coefficient is the Laplace distribution (a.k.a double exponential distribution) with scale $\tau>0$. So $\beta_j \sim \textrm{Laplace}(0, \tau)$ which means that the prior density is $$\textrm{Pr}(\beta_j) = \frac{1}{2 \tau}\exp(-|\beta_j|/\tau).$$ By combining this with the likelihood from the linear model with Gaussian errors, we see that the MAP estimate is the LASSO estimate with $\lambda = 2\sigma^2/\tau.$ Let's plot both priors so that they have the same variance. {r} par(mfrow=c(1,1)) x = seq(-2.5, 2.5, 0.001) #plot priors on these grid points eff.var = 0.8^2 #both distributions will have this same variance y.ridge = dnorm(x, 0, sqrt(eff.var)) #prior for ridge y.lasso = 1/2/sqrt(eff.var/2) * exp(-abs(x) / sqrt(eff.var/2) )#prior for LASSO plot(x, y.lasso, t = "l", lty = 1, lwd = 2, col = "orange", xlab = expression(beta), ylab = "prior density") lines(x, y.ridge, lty = 1, lwd = 2, col = "forestgreen") legend("topright", leg = c("LASSO","RIDGE"), col = c("orange","forestgreen"), lwd = 2)  For the original formulation of the penalized regression problem, $\lambda$ was simply a nuisance parameter that needed to be chosen to estimate the $\beta$-coefficients that are our main interest. However, with the Bayesian interpetation we know how the value of $\lambda$ is related to the parameters controlling the coefficients' prior distribution ($\beta_j \sim \mathcal{N}(0,\sigma^2/\lambda)$ for ridge and $\beta_j \sim \textrm{Laplace}(0,2\sigma^2/\lambda)$ for LASSO). Therefore, by learning a value of $\lambda$ from the data (e.g. by cross-validation) we learn about the coefficient distribution that best describes the data. Hence $\lambda$ is not only a nuisance parameter but also tells about the magnitude of the effects that we see in the data. While ridge regression and LASSO are connected to Bayesian models via being MAP estimates, we should keep in mind that the Bayesian regression models typically aim to estimate the whole posterior distribution of the coefficients, not just a point estimate such as the MAP. This is computationally a much more challenging task, and ridge and LASSO are not necessarily the best models for such a complete Bayesian analysis, since they lack flexibility to tune the shrinkage factor of each individual coefficient and instead they only consider a single global $\lambda$ parameter. Some discussion about more advanced Bayesian approaches by [M. Betancourt](https://betanalpha.github.io/assets/case_studies/bayes_sparse_regression.html). #### Correlated predictors Earlier we figured out how each method works in the simplest case of uncorrelated predictors: * Ridge regression shrinks the OLS towards zero by a multiplicative factor that gets smaller (more shrinkage) as $\lambda$ grows * LASSO does soft-thresholding where each coefficient is pulled towards zero by a constant amount determined by $\lambda$ until the coefficient reaches zero But how do they work with correlated predictors? Ridge regression is known to shrink the coefficients of correlated predictors towards each other. In the extreme case of $k$ identical predictors, they each get identical coeffcients with $1/k$th the size that any single one would get if fit alone. From a Bayesian point of view, the ridge penalty is ideal if there are many predictors, and all have non-zero coeffcients (drawn from a Gaussian distribution). LASSO, on the other hand, is somewhat indifferent to very correlated predictors, and will tend to pick one and ignore the rest. In the extreme case above, the LASSO problem breaks down because several combinations of parameter values give same value for the objective function. To overcome this problem, a combination of ridge and LASSO penalties have been introduced (*elastic net*). #### Other penalties? What about other forms of penalties than $q=1$ (LASSO) and $q=2$ (ridge)? Values of $q \in (1, 2)$ suggest a compromise between the LASSO and ridge regression. Although this is the case, with $q > 1$, $|\beta_j|^q$ is differentiable at 0, and so does not share the ability of LASSO for setting coefficients exactly to zero. Partly for this reason as well as for computational tractability, Zou and Hastie (2005) introduced the *elastic net* penalty $$\textrm{penalty}_{\textrm{enet}}^\alpha(\pmb \beta) = \sum_{j=1}^p ((1-\alpha) \beta_j^2 + \alpha |\beta_j|).$$ This also explains why in glmnet() $\alpha=1$ corresponds to LASSO and $\alpha=0$ to ridge regression. Let's test elastic net with $\alpha=0.5$ on the data we have been considering where $n=150$ and $p=70$ and only the first four predictors have truly an effect. {r} enet.cv = cv.glmnet(x = as.matrix(X.tr), y = y.tr, alpha = 0.5) plot(enet.cv) #shows CV'd MSEs across lambdas, lines for lambda.min and lambda.1se plot(enet.cv$glmnet.fit, label = T, xvar = "lambda") #plot coefficients abline(v = log(enet.cv$lambda.min), lty = 2) abline(v = log(enet.cv$lambda.1se), lty = 2) pred.enet.1se = predict(enet.cv, newx = as.matrix(X.te)) #default is at lambda.1se pred.enet.min = predict(enet.cv, newx = as.matrix(X.te), s = "lambda.min") data.frame(MSE = c(mse(y.te, pred.enet.1se), mse(y.te, pred.enet.min)), row.names = c("1se","min"))  With this particular elastic net model ($\alpha=0.5$), we have results slightly worse than LASSO but better than ridge regression in this data set that truly benefits from a sparse model. It would be possible to use cross-validation also to estimate$\alpha$to given an optimal elastic net model. #### Logistic regression We can apply glmnet to fit also logistic, multinomial, Poisson or Cox regression models. The syntax is similar and the distribution is specified by family= parameter whose options are "gaussian","binomial","poisson","multinomial","cox" or "mgaussian". Let's study logistic regression here and leave the other models to be introduced by the [glmnet manual](https://glmnet.stanford.edu/articles/glmnet.html). In logistic regression we model the binary outcome$y_i \in \{0,1\}$for$i=1,\ldots,n,$using the$p$predictors in columns of matrix$\pmb X$whose row$i$is denoted as$\pmb x_i$. The model assumes that log-odds of event$y_i=1$depends linearly on the predictors $$\log\left( \frac{\text{Pr}(y_i = 1)}{1- \text{Pr}(y_i = 1)}\right ) = \beta_0 + \pmb x_i^T\pmb \beta$$ or equivalently that the probability of event$y_i = 1$is $$\text{Pr}(y_i = 1) = \frac{\exp(\beta_0 + \pmb x_i^T\pmb \beta)}{1+\exp(\beta_0 + \pmb x_i^T\pmb \beta)} = \frac{1}{1+\exp(-\beta_0 - \pmb x_i^T\pmb \beta)}.$$ The likelihood function is from the Bernoulli distribution $$L(\pmb \beta \, | \, \pmb y) = \prod_i \text{Pr}(y_i = 1)^{y_i}\, \text{Pr}(y_i = 0)^{(1-y_i)} = \prod_i \frac{\exp(\beta_0 + \pmb x_i^T\pmb \beta)^{y_i}}{1+\exp(\beta_0 + \pmb x_i^T\pmb \beta)}$$ and the log-likelihood is $$\log(L(\pmb \beta \, | \, \pmb y)) = \sum_i \left[y_i(\beta_0 + \pmb x_i^T\pmb \beta)-\log(1+\exp(\beta_0 + \pmb x_i^T\pmb \beta))\right].$$ Thus, with glmnet, we are searching for$\pmb \beta$that minimizes the function $$-2 \sum_i \left[y_i(\beta_0 + \pmb x_i^T\pmb \beta)-\log(1+\exp(\beta_0 + \pmb x_i^T\pmb \beta))\right] + \lambda \cdot \text{penalty}_\text{enet}^\alpha (\pmb \beta).$$ Let's use existing data ($p=70$,$n=150$training and test samples) to illustrate logistic regression. {r} logist.coeff = c(rnorm(10, 1, 0.2), rnorm(10, -1, 0.2), rep(0, p-20) ) #coeff 1,...,20 are nonzero prob.tr = 1/(1 + exp(-as.matrix(X.tr) %*% logist.coeff)) #prob of 1 in log. regression bin.tr = rbinom(n.tr, size = 1, prob = prob.tr) # generate binary outcome prob.te = 1/(1 + exp(-as.matrix(X.te) %*% logist.coeff)) bin.te = rbinom(n.te, size = 1, prob = prob.te)  The default CV error measure for logistic regression is "deviance", which is defined as$2(\log(L(\widehat{\pmb \beta}_\textrm{saturated}|\pmb y))-\log(L(\widehat{\pmb \beta}|\pmb y)))$, where saturated model has so many parameters that it fits the data perfectly. Thus the deviance measures distance ("deviance") from the highest possible likelihood value, and smaller deviance means better fit. Other error measures that could be used with the logistic model include MSE (mse) and mean absolute error (mae) that both are computed from the errors on the observed scale, i.e., outcome is 0 or 1 and prediction is probability value between 0 and 1. Additionally, error can be measured by misclassification rate (class) where probabilities are turned into best guesses (0 or 1) or by area under curve of ROC curve (auc). Let's compare$\lambda$s from all these possibilities on a plot that shows CV results of LASSO as measured by deviance. {r} types = c("mae","class","auc","deviance") lambda = c() for(ii in 1:length(types)){ cv.bin = cv.glmnet(x = as.matrix(X.tr), y = bin.tr, family = "binomial", type.measure = types[ii], alpha = 1) #LASSO lambda[ii] = cv.bin$lambda.1se } plot(cv.bin) #uses deviance as error measure abline(v = log(lambda), col = c("blue","green","orange","black")) data.frame(types, lambda)  These seem close to each other and we are happy to use deviance. Question: Why does the deviance in plot above not go down when $\lambda \to 0$, even though the model becomes less penalized and can fit the data better then? The coefficient plots can be shown on three scales. So far, we have used option lambda which puts $\log(\lambda)$ on x-axis. The other two options are the scaled deviance dev (1 means the fit is perfect and 0 means the intercept-only model), and norm that shows how large proportion of the unpenalized model's coefficients is remaining in the penalized model. {r, fig.width = 9, fig.height = 4} par(mfrow = c(1,3)) plot(cv.bin$glmnet.fit, xvar = "lambda") abline(v = log(cv.bin$lambda.1se), lty = 2) #put CV chosen level of penalty plot(cv.bin$glmnet.fit, xvar = "dev") plot(cv.bin$glmnet.fit, xvar = "norm")  Note in the deviance plot (middle panel) how at the right hand side end the deviance is not anymore increasing much but coefficients still get very quickly larger in magnitude. Such behavior suggest overfitting to the training data that happens when penalization is becoming so weak that the model is free to find the perfect fit in the training data. Prediction in logistic regression model can be output on the log-odds scale (default), on the proability scale response, or as the best guess outcome values class. {r} predict(cv.bin, type = "response", newx = as.matrix(X.te[1:2,])) #probability of being 1 predict(cv.bin, type = "class", newx = as.matrix(X.te[1:2,])) #best guess (0 or 1) predict(cv.bin, newx = as.matrix(X.te[1:2,])) #log-odds, i.e., log( p/(1-p) )  #### Standardization of $\pmb X$ and $\pmb y$ in glmnet In the beginning of these notes it was mentioned that predictors in penalized regression models should be standardized. This is because in penalized regression the penalty terms are defined as sums over $|\beta_j|^q$, and therefore it is important that the scales of different predictors are chosen appropriately with respect to the other predictors. If we have no reason to do otherwise, then the default choice is that every predictor has the same variance (=1), whence same value of every $\beta_j$ means the same proportion of $\textrm{Var}(y)$ explained by the predictor. Glmnet does the standardization of $\pmb X$ automatically, whether or not you gave the predictors as standardized (unless you explicitly said standardize=FALSE). Therefore, the default model fit from glmnet does not depend on whether you input $\pmb X$ as standardized or not. However, the coefficients that glmnet outputs are always on the scale on which you gave the predictors. The crucial thing is that when you do predictions from a glmnet model, the test data predictors must be given on the same scale as the training data predictors were in the original glmnet call. Thus, in the exercises, you don't need to standardize $\pmb X$ yourself (except if explicitly asked to), but you can just let glmnet do it automatically (as long as you don't change the default option: standardize=TRUE). $\pmb y$ variable does not need to be standardized. In practice it will be mean-centered by the default intercept term in the model (unless you said intercept=FALSE). If both $\pmb y$ and all columns of $\pmb X$ are mean-centered prior to calling glmnet, then the intercept in linear model will be 0, but even in that case, it does not harm to have the intercept in the model. In non-linear models, you should always have the intercept in the model. ### READ: "6.2 Shrinkage Methods" from ISL **Questions.** * How would you need to pre-process the predictors and outcome variable before applying penalized regression? * What do ridge regression, LASSO, and subset selection do when predictors are uncorrelated? * What are advantages of penalized regression methods over stepwise selection methods? * If you want to choose a subset of important variables would you use ridge regression or LASSO? * In which cases would you prefer ridge regression over LASSO?