In the HDS5 notes, we considered multiple regression models with different numbers of predictors \(p\). We had two information criteria, AIC and BIC, that balance the model fit (to the training data), measured by the likelihood function \(L(\pmb\beta\,|\,\pmb y)\), and the flexibility of the model, measured by a penalty term for the parameters \(\pmb\beta\). Both criteria are instances of a minimization problem of an objective function of the form \[-2\log(L(\pmb \beta\,|\,\pmb y)) + \lambda\cdot \textrm{penalty} (\pmb \beta),\] where the penalty function depends only on the model parameters and \(\lambda \geq 0\) is a constant. For example, we get AIC by setting \(\lambda=2\) and \(\textrm{penalty} (\pmb \beta)= \sum_{j=1}^p \textrm{I}(\beta_j \neq 0)\), where \(\textrm{I}\) is the indicator function taking value 1 if \(\beta_j \neq 0\) and 0 otherwise.

This week, we will consider a more general family of non-negative penalties of the form \[\textrm{penalty}_q (\pmb \beta)= \sum_{j=1}^p |\beta_j|^q,\, q > 0.\] We write the penalized objective function as \[O_{q,\lambda}(\pmb \beta\,|\,\pmb y) = -2\log(L(\pmb \beta\,|\,\pmb y)) + \lambda \cdot \textrm{penalty}_q (\pmb \beta).\] The goal is to find the value of \(\pmb \beta\) that minimizes this objective function, as a candidate solution that appropriately balances the bias and the variance of the model.

Note that by agreeing that \(0^0=0\), so that \(|\beta_j|^0 = \textrm{I}(\beta_j \neq 0)\), we could extend the above framework to include both AIC (\(q=0,\lambda=2\)) and BIC (\(q=0, \lambda = \log(n)\)).

Example 6.1. For the standard linear regression model, the objective function is (up to an additive constant, and assuming a fixed error variance that can be absorbed into \(\lambda\)) \[O_{q,\lambda}(\pmb \beta\,|\,\pmb y) = (\pmb y-\pmb X \pmb \beta)^T (\pmb y-\pmb X \pmb \beta) + \lambda \cdot \textrm{penalty}_q (\pmb \beta) = \text{RSS} + \lambda\cdot\textrm{penalty}_q (\pmb \beta), \] and by minimizing this function we get a compromise between a small residual sum of squares (RSS) and a small penalty term. In other words, we get a compromise between the ordinary least squares (OLS) estimates and the requirement that the cumulative magnitude of the coefficients is small. The parameter \(\lambda\) determines the importance of each component: when \(\lambda=0\), we recover the OLS estimates, and when \(\lambda \to \infty\), we shrink all coefficients to 0. For intermediate values of \(\lambda\), the solution lies somewhere between OLS and 0.
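
To make this concrete, here is a minimal sketch of the objective in R for the linear model case (assuming, as above, a Gaussian likelihood with fixed error variance so that \(-2\log L\) reduces to the RSS up to a constant; the function name obj is our own choice):

obj <- function(beta, X, y, lambda, q){
  #penalized objective for a linear model: RSS + lambda * sum(|beta_j|^q)
  rss = sum((y - X %*% beta)^2)
  rss + lambda * sum(abs(beta)^q)
}
#e.g. obj(beta = rep(0, ncol(X)), X, y, lambda = 2, q = 1) would evaluate the objective at beta = 0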

Because the minimizers of \(O_{q, \lambda}(\pmb \beta\,|\,\pmb y)\) depend on the scaling of the predictors, we standardize the predictors as a pre-processing step before optimizing for \(\pmb \beta\). This way, the penalty attached to each \(\beta\)-coefficient is comparable with respect to the variance that the corresponding predictor explains in the outcome variable.

Typically, the intercept term is not penalized. For the linear model, we can leave the intercept \(\beta_0\) out of the model once the predictors and the outcome have been mean-centered. For logistic or Poisson regression models, we do not include the intercept among the coefficients that form the penalty term, even though the intercept is always included in the model.
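
As a quick illustration of the mean-centering argument, here is a small simulated check (variable names are our own): the slope from a model that includes the intercept equals the slope from the intercept-free model fitted to the mean-centered data.

set.seed(1)
n = 100
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)
xc = x - mean(x); yc = y - mean(y) #mean-centered versions
coef(lm(y ~ x))["x"]  #slope when the intercept is in the model
coef(lm(yc ~ 0 + xc)) #the same slope from the intercept-free model on centered data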

How should we determine the values of \(q\) and \(\lambda\)? Let’s next consider what kinds of models we will get with commonly used values of \(q=2\) and \(q=1\).

Ridge regression (\(q=2\))

When \(q=2\) and the penalty term is the sum of squares of the coefficients, the regression model is called ridge regression. For this case, we can derive the solution analytically. Let’s first expand the objective function: \[ (\pmb y -\pmb X \pmb\beta)^T (\pmb y -\pmb X \pmb\beta) + \lambda\, \pmb \beta^T \pmb \beta = \pmb y^T \pmb y - 2\pmb y^T \pmb X \pmb \beta + \pmb \beta^T \left( \pmb X^T \pmb X + \lambda \pmb I \right) \pmb \beta. \] Solving for the value \(\widehat{\pmb \beta}_{\textrm {ridge}}\) that makes the derivative with respect to \(\pmb \beta\) vanish, we have \[ 0 - 2 \pmb y^T \pmb X + 2 \widehat{\pmb \beta}_{\textrm {ridge}}^T \left( \pmb X^T \pmb X + \lambda \pmb I \right) = 0 \iff \widehat{\pmb \beta}_{\textrm {ridge}} = \left( \pmb X^T \pmb X + \lambda \pmb I \right)^{-1} \pmb X^T \pmb y. \] We see that, compared to the least squares solution where \(\lambda=0\), the ridge solution adds a positive diagonal term to \((\pmb X^T \pmb X)\) before the matrix inversion. Numerically, this adds stability to the solution and makes computation of the solution possible even when the original matrix \(\pmb X\) is not of full rank.
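
To make the formula concrete, here is a minimal simulated sketch (object names are our own) that computes the closed-form ridge solution directly and compares it to OLS. Note that lm.ridge() used later applies its own internal standardization, so its output need not match this raw formula exactly.

set.seed(2)
n = 100; p = 3
X = matrix(rnorm(n * p), n, p)
y = X %*% c(1, 0.5, 0) + rnorm(n)
lambda = 5
beta.ridge = solve(t(X) %*% X + lambda * diag(p), t(X) %*% y) #(X^T X + lambda*I)^{-1} X^T y
beta.ols = solve(t(X) %*% X, t(X) %*% y)                      #lambda = 0 gives OLS
cbind(ridge = as.numeric(beta.ridge), ols = as.numeric(beta.ols))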

Example 6.2. Suppose that the predictors in the columns of \(\pmb X\) are uncorrelated and standardized and thus \((\pmb X^T \pmb X)= n \pmb I.\) (When the sample variance is used for standardization the scaling factor would be \(n-1\) rather than \(n\), but for simplicity we scale by \(n\) here.) Then \[\widehat{\pmb \beta}_{\textrm {ridge}} = (n +\lambda)^{-1} \pmb X^T \pmb y = \frac{n}{n+\lambda} \widehat{\pmb \beta}_{OLS},\] where \(\widehat{\pmb \beta}_{OLS}= \pmb X^T \pmb y /n\) is the least squares solution of \(\pmb y = \pmb X \pmb\beta + \pmb{\varepsilon}\). We see that ridge regression simply shrinks the least squares solution towards 0 by a multiplier that decreases as the ratio \(\lambda/n\) increases.
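
A quick numerical check of this shrinkage factor, as a sketch with artificially orthogonalized predictors so that \(\pmb X^T \pmb X = n\pmb I\) holds exactly (object names are our own):

set.seed(3)
n = 100; p = 2; lambda = 25
Q = qr.Q(qr(matrix(rnorm(n * p), n, p))) #orthonormal columns: Q^T Q = I
X = sqrt(n) * Q                          #now X^T X = n*I
y = X %*% c(1, -0.5) + rnorm(n)
beta.ols = as.numeric(t(X) %*% y / n)
beta.ridge = as.numeric(solve(t(X) %*% X + lambda * diag(p), t(X) %*% y))
rbind(ridge = beta.ridge, shrunken.ols = n / (n + lambda) * beta.ols) #the two rows should agree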

As we saw in HDS5, an important aspect of the multiple regression model is that it can tease apart the effects of correlated predictors. However, when there are many highly correlated variables in a linear regression model, their coefficients become poorly determined and exhibit high variance, which is reflected in large standard errors for each coefficient. This is because a large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated counterpart. Imposing a penalty on the joint magnitude of the coefficients alleviates this problem, and this was a main motivation for considering ridge regression in the first place (Hoerl and Kennard, Technometrics, 1970).

Example 6.3. Consider the case of two predictors \(X_1\) and \(X_2\) and outcome variable \(Y\). All three are standardized to have mean 0 and variance 1. The correlation between \(X_1\) and \(X_2\) is \(r\). We want to understand the behavior of the least squares estimator \(\widehat{\pmb \beta} = (\widehat{\beta}_1,\widehat{\beta}_2)\) of the model \(Y=X_1\beta_1 + X_2\beta_2 + \varepsilon\) as a function of \(r\). We know from the course in linear models that the sampling distribution is \[ \widehat{\pmb \beta} \sim \mathcal{N}\left(\pmb \beta, \sigma^2 \left(\pmb X^T \pmb X \right )^{-1}\right),\] where \(\sigma^2=\textrm{Var}(\varepsilon)\) is the error variance. Note that in the observed data \(\pmb x_j^T \pmb x_j \approx n\, \textrm{Var}(X_j) = n\) for \(j=1,2\) and \(\pmb x_1^T \pmb x_2 \approx n\, \textrm{Cov}(X_1, X_2) = n\, \textrm{Cor}(X_1, X_2) = n\cdot r,\) since the variables are standardized. By using the formula for the inverse of a 2-by-2 matrix, we have \[ \sigma^2 \left(\pmb X^T \pmb X\right)^{-1} \approx \frac{\sigma^2}{n} \left(\begin{array}{cc} 1 & r \\ r & 1 \end{array} \right)^{-1} = \frac{\sigma^2}{n(1-r^2)} \left(\begin{array}{cc} 1 & -r \\ -r & 1 \end{array} \right). \] We see that the sampling variance of each coefficient (either \(\widehat{\beta}_1\) or \(\widehat{\beta}_2\)) alone is \(\propto \frac{1}{n(1-r^2)}\), so it goes down as the sample size \(n\) grows, but goes up as \(|r|\) gets closer to 1. In other words, the information available to learn the true value of \(\beta_1\) grows with \(n\) but shrinks as \(|r|\) grows, and the combined effect of these two properties is determined by their product \(n (1-r^2)\). This explains why highly correlated variables in the linear model cause instability in the estimates through large standard errors. However, there is no fundamental reason why the linear model could not tell the effects of correlated variables apart from each other; we simply need a larger sample size to reach the same accuracy (as measured by the SEs) for correlated variables as we would have for uncorrelated ones with a smaller sample. (To be precise, with correlation \(r\), we need \(1/(1-r^2)\) times the sample size of the uncorrelated case to get the same precision for the estimates.)
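
We can check this formula numerically with a small simulated sketch (mvrnorm() from the MASS package is used to draw the correlated predictors; object names and parameter values are our own choices). The standard errors reported by lm() should be close to \(\sigma/\sqrt{n(1-r^2)}\):

library(MASS) #for mvrnorm()
set.seed(4)
n = 400; r = 0.9; sigma = 1
X = scale(mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, r, r, 1), 2, 2)))
y = as.numeric(X %*% c(0.5, 0.5) + rnorm(n, 0, sigma))
summary(lm(y ~ 0 + X))$coefficients[, "Std. Error"] #empirical SEs from lm()
sigma / sqrt(n * (1 - r^2))                         #theoretical approximation from the formula above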

We also see that the sampling distribution of \(\widehat{\pmb \beta}\) has correlation \(-r\), i.e., of the opposite sign compared to the correlation of the original variables \(X_1\) and \(X_2\). This reflects the fact that if \(X_1\) and \(X_2\) are highly positively correlated (say \(r>0.99\)), then for the model fit it matters very little how we divide the total value \(B = \beta_1 + \beta_2\) between \(\widehat{\beta}_1\) and \(\widehat{\beta}_2\): by increasing the first coefficient and simultaneously decreasing the second coefficient by almost the same amount, the model fit stays almost the same. For example, we get almost the same fitted values with the choice \(\widehat{\beta}_1 = B/2 + 100, \widehat{\beta}_2 = B/2 - 100\) as with the choice \(\widehat{\beta}_1 = B/2, \widehat{\beta}_2 = B/2\). Obviously, if we want to interpret the coefficients of the model, then the estimates \(\widehat{\beta}_1 = 0.02, \widehat{\beta}_2 = -0.01\) would be much preferred over, say, \(\widehat{\beta}_1 = 100.02, \widehat{\beta}_2 = -100.01\) when truly \(\beta_1 = \beta_2 = 0\). From this point of view, the instability of the estimates with highly correlated variables is a problem in standard (unpenalized) regression.

Questions. What does the 2-dimensional likelihood function of two highly correlated predictors look like? Where does the instability of the parameter estimates manifest itself in the likelihood function? How does an increasing sample size affect the shape of the contours of constant likelihood and help to bring more stability to the estimates?

Example 6.4. Let’s reuse the generation of a pair of correlated variables from the beginning of the HDS5 notes. We generate two highly correlated (\(0.99\)) predictors and then generate the outcome \(y = x_1 \cdot 1.5 + x_2 \cdot 1.0 + \varepsilon\). The sample size is small (\(n=50\)), so we do not expect to have enough information to determine which of the two variables has which effect size. Let’s see how OLS (corresponding to \(\lambda=0\)) and ridge regression (here we set \(\lambda = 10\)) behave. We use lm.ridge() from MASS to fit the ridge regression.

library(MASS)
set.seed(17102017)

#function to generate a pair of correlated variables
gen.cor.pair <- function(r, n){ #r is target correlation, n is sample size
 u = rnorm(n, 0, sqrt(abs(r))) #temporary variable that is used for generating x1 and x2
 x1 = scale(u + rnorm(n, 0, sqrt(1 - abs(r))))
 x2 = scale(sign(r)*u + rnorm(n, 0, sqrt(1 - abs(r)))) 
 cbind(x1, x2) #return matrix
}
n = 50
X = gen.cor.pair(r = 0.99, n = n) #generate one pair of predictors
cor(X) #check that cor(x1, x2) ~ r
##           [,1]      [,2]
## [1,] 1.0000000 0.9890795
## [2,] 0.9890795 1.0000000
true.b = c(1.5, 1.0) #true values of coeffs for cols of X
R = 100 #replications. predictors are fixed but outcome resampled for each replication
OLS.res = matrix(NA, ncol = 2, nrow = R) #least squares estimates
ridge.res = matrix(NA, ncol = 2, nrow = R) #ridge estimates
for(ii in 1:R){
  y = as.numeric( X %*% true.b + rnorm(n,0,1) ) 
  y = y - mean(y) #mean center so that can ignore intercept
  OLS.res[ii,] = as.numeric(lm(y ~ 0 + X)$coeff) #joint model using least squares fit 
  #ridge regression with lambda=10
  ridge.res[ii,] = as.numeric(coefficients(lm.ridge( y ~ 0 + X, lambda = 10 )))
}

#plot estimates across replications and mark true values by dotted lines
par(mfrow = c(1, 2))
plot(OLS.res, col = "red", pch = 4, xlab = expression(beta[1]), ylab = expression(beta[2]))
points(ridge.res, col = "blue", pch = 1)
grid()
abline(h = true.b[2], lty = 2, lwd = 2)
abline(v = true.b[1], lty = 2, lwd = 2)
points(0, 0, pch = 19, cex = 0.6)
text(-0.3, -0.3, "0") #mark origin
legend("topright", leg = c("OLS","RIDGE"), col = c("red","blue"), pch = c(4,1))

#plot sum of two coefficients across methods and mark true value by dotted line
plot(rowSums(OLS.res), rowSums(ridge.res), pch = 19, xlab = "beta sum (OLS)", 
     ylab = "beta sum (RIDGE)")
abline(h = sum(true.b), lty = 2, lwd = 2)
abline(v = sum(true.b), lty = 2, lwd = 2)

We see that the OLS estimates are unstable and vary widely across data sets along the line \(\beta_1 + \beta_2 \approx 2.5\), whereas the ridge regression estimates are much more stable and concentrate near the point on that line that is closest to the origin, because among all the points on the line that point has the smallest ridge penalty (left panel). This behavior is a clear difference between OLS and ridge regression.

With a small amount of data and highly correlated predictors, we simply do not have the information to split the effect between the two variables. We can, however, see how well we estimate the sum of the effects of the two variables, rather than how well we split it into its parts (right panel). We see that the OLS solution is on average correct (2.5 lies in the middle of the estimates), whereas ridge regression underestimates the sum of the \(\beta\)s (the average is around 2.2). Let’s have a closer look at the estimates of the individual coefficients.

par(mfrow = c(2, 2))
for(ii in 1:2){
  res =  OLS.res[,ii]  
  hist(res, breaks = 15, col = "red", 
     xlab = paste0("beta_",ii," (OLS)"), main = "OLS", 
     sub = paste("mean =",signif(mean(res), 2)))}

for(ii in 1:2){
  res =  ridge.res[,ii]  
  hist(res, breaks = 15, col = "blue", 
     xlab = paste0("beta_",ii," (RIDGE)"), main = "RIDGE", 
     sub = paste("mean =",signif(mean(res), 2)))}

We know that OLS is unbiased, that is, the average of its estimates across simulations converges to the true value (red histograms). Ridge regression, instead, is biased (blue histograms). By penalizing the likelihood, we shrink the coefficients towards each other and towards 0. This bias is the price we pay for the reduced variance compared to OLS. Let’s see which method the bias-variance tradeoff favors here by computing the MSEs of the individual coefficients.

mse <- function(x,y){mean((x-y)^2)}
data.frame(mse.b1 = c(mse(OLS.res[,1], true.b[1]), mse(ridge.res[,1], true.b[1])),
           mse.b2 = c(mse(OLS.res[,2], true.b[2]), mse(ridge.res[,2], true.b[2])),
           row.names = c("OLS","ridge"))
##          mse.b1     mse.b2
## OLS   1.2972966 1.29587169
## ridge 0.1354325 0.02106637

While ridge regression has a larger bias than OLS, it also has a much lower variance, and the combined effect of these properties is a dramatically lower MSE for ridge regression. This shows the utility of penalized regression in adding robustness to analyses with highly correlated predictors when there is too little information to separate their effects.

Why, technically, does ridge regression produce stable estimates? The likelihood part of the objective function (the RSS) is exactly the same as for OLS and would therefore be minimized by the OLS solution. However, when there is little information to choose among a wide variety of combinations \(\left(\widehat{\beta}_1,\widehat{\beta}_2\right)\) that all produce almost the same likelihood value as long as their sum remains constant (see the figure above, where the red crosses are spread over a line on which \(\widehat{\beta}_1 + \widehat{\beta}_2 \approx 2.5\)), OLS is very unstable because even small changes in the data can move the minimum of the RSS widely across this large region of almost flat likelihood. Ridge regression instead prefers those points in the parameter space where, in addition to a low RSS, the magnitudes of all the parameters are fairly small. Thus, from among the points in the large region of almost flat likelihood, we choose one that is fairly close to the origin (0,0) in order to keep the ridge penalty low.
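
To visualize this flat region, we can plot the RSS contours over a grid of \((\beta_1, \beta_2)\) values for one simulated data set (a sketch reusing gen.cor.pair() and the settings of Example 6.4; the seed and object names are our own choices). The contours form a long, narrow valley along the line \(\beta_1 + \beta_2 \approx 2.5\):

set.seed(18102017)
X.sim = gen.cor.pair(r = 0.99, n = 50)
y.sim = as.numeric(X.sim %*% c(1.5, 1.0) + rnorm(50))
b1 = seq(-2, 5, length = 100)
b2 = seq(-2, 5, length = 100)
rss = outer(b1, b2, Vectorize(function(a, b) sum((y.sim - X.sim %*% c(a, b))^2)))
contour(b1, b2, rss, nlevels = 40, xlab = expression(beta[1]), ylab = expression(beta[2]))
points(1.5, 1.0, pch = 19, col = "red") #true coefficient values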

What would happen if the true values of \(\beta_1\) and/or \(\beta_2\) had a large magnitude? Would ridge regression then fail badly because it shrinks the effects? Ridge regression is a biased method and the bias can in some cases be large. However, the tuning parameter \(\lambda\) determines how biased the method is, and in cases where good predictive accuracy requires less bias, \(\lambda\) will be kept small and ridge regression stays close to OLS. The example above used a fixed value of \(\lambda=10\) without justifying this choice. Typically, \(\lambda\) is chosen by cross-validation and can therefore adapt to the data.
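
For illustration, here is a sketch of how \(\lambda\) could be chosen by cross-validation with cv.glmnet() from the glmnet package, applied to X and the last simulated y of Example 6.4 (alpha = 0 selects the ridge penalty in glmnet; note that glmnet parameterizes \(\lambda\) on a different scale than the objective above, so its values are not directly comparable to \(\lambda = 10\)):

library(glmnet)
set.seed(5) #for reproducibility of the random CV folds
cv.fit = cv.glmnet(X, y, alpha = 0) #10-fold cross-validation over a grid of lambda values
cv.fit$lambda.min                   #lambda value with the smallest cross-validated error
coef(cv.fit, s = "lambda.min")      #ridge coefficients at that lambda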

In addition to handling correlated variables, another benefit of penalized regression is its ability to handle cases where \(p\) is large (compared to \(n\)). There the unpenalized least squares methods tend to overfit to the training data, and adding some robustness by penalizing model complexity is needed. Next we consider another version of penalized regression, called LASSO, which is particularly suitable for variable selection.

LASSO (\(q=1\))

The Least Absolute Shrinkage and Selection Operator (LASSO), introduced by Tibshirani (1996), is obtained by minimizing the objective function \[O_{1,\lambda}(\pmb \beta\,|\,\pmb y) = -2\log(L(\pmb \beta\,|\,\pmb y)) + \lambda \cdot \textrm{penalty}_1 (\pmb \beta) = -2\log(L(\pmb \beta\,|\,\pmb y)) + \lambda \cdot \sum_{j=1}^p |\beta_j|.\]

Thus, the difference from ridge regression is the change from the \(\ell^2\) penalty \(\beta_j^2\) to the \(\ell^1\) penalty \(|\beta_j|\) for each coefficient.

Because the LASSO penalty includes the absolute value function, the objective function is not differentiable and, as a result, lacks a closed-form solution in general. However, in the special case of uncorrelated predictors (\(\pmb X^T \pmb X = n\pmb I\)) it is possible to obtain a closed-form LASSO solution by soft-thresholding, written below using the positive-part operator \((t)_+\), which equals \(t\) when \(t>0\) and 0 otherwise.

Suppose that predictors in columns of \(\pmb X\) are uncorrelated. Then the LASSO solution is \[\widehat{\beta}_j^{\textrm{(LASSO)}} = \textrm{sign} \left( \widehat{\beta}_{j}^{\textrm{(OLS)}} \right) \left(\left|\widehat{\beta}_{j}^{\textrm{(OLS)}}\right|-\lambda\right)_+ =\left\{ \begin{array}{cc} \left(\widehat{\beta}_{j}^{\textrm{(OLS)}}-\lambda\right), & \textrm{ if }\, \widehat{\beta}_{j}^{\textrm{(OLS)}} > \lambda\\ 0, & \textrm{ if }\, \left|\widehat{\beta}_{j}^{\textrm{(OLS)}}\right| \leq \lambda \\ \left(\widehat{\beta}_{j}^{\textrm{(OLS)}}+\lambda\right), & \textrm{ if }\, \widehat{\beta}_{j}^{\textrm{(OLS)}} < -\lambda \end{array}\right ., \]

where OLS refers to the ordinary least squares estimate from the linear model. (The exact threshold depends on how the residual sum of squares is scaled relative to \(\lambda\); the formula above is written in the parameterization used by glmnet below, which divides the RSS by \(2n\).)

Let’s check this using the glmnet package. You can install it from CRAN; detailed installation instructions and documentation are available at https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-3
n = 100
X = as.matrix(scale(cbind(rnorm(n), rnorm(n)))) #two uncorrelated columns
y = as.numeric(scale(X %*% c(0.5, -0.2) + rnorm(n)))
OLS.coef = coefficients(lm(y ~ X))
OLS.coef #Least squares estimate
##   (Intercept)            X1            X2 
## -2.437083e-17  5.195129e-01 -1.500554e-01
lambda = c(0.6, 0.4, 0.15, 0.1, 0.0) #Let's choose these to cover the range of OLS values.

g.fit = glmnet(X, y, lambda = lambda, alpha = 1) #alpha=1 means LASSO, we come back to that later.
data.frame(lambda = g.fit$lambda, t(as.matrix(coef(g.fit))))
##    lambda X.Intercept.        V1          V2
## s0   0.60 6.938894e-18 0.0000000  0.00000000
## s1   0.40 6.938894e-18 0.1450647  0.00000000
## s2   0.15 6.781402e-18 0.3921546 -0.02269691
## s3   0.10 6.486827e-18 0.4346076 -0.06514971
## s4   0.00 5.897675e-18 0.5195130 -0.15005541

glmnet uses the symbol s for \(\lambda\), so e.g. s0 refers to the largest \(\lambda\) value of 0.6 and s4 to the smallest value of 0.0. The values are always ordered in descending order, no matter how they were ordered in the original lambda vector. We see that the number of non-zero coefficients increases from 0 (\(\lambda=0.6\)) to 2 (starting from \(\lambda=0.15\)). We also see that whenever a coefficient is non-zero, its distance to the OLS estimate of the same parameter is approximately \(\lambda\). For example, the OLS estimate for \(\beta_2\) is -0.15, so it does not survive penalization in the models with \(\lambda=0.6\) or \(0.4\), as these clearly exceed 0.15, but it enters the model when \(\lambda\) drops to around 0.15. Just like with ridge regression, we get back to OLS when \(\lambda=0\).
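
As a quick check of the soft-thresholding formula against this output (reusing OLS.coef and lambda from the chunk above; the helper name soft.threshold is our own), we can apply the formula directly to the OLS estimates. Small discrepancies are expected because the two simulated predictors are not exactly uncorrelated and because glmnet standardizes internally.

soft.threshold <- function(b, lambda){ sign(b) * pmax(abs(b) - lambda, 0) }
#columns correspond to the lambda values 0.6, 0.4, 0.15, 0.1, 0.0 (i.e., rows s0,...,s4 above)
sapply(lambda, function(l) soft.threshold(OLS.coef[-1], l))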

We can also plot glmnet results.

plot(g.fit, xvar = "lambda", lwd = 1.5)
abline(h = 0, col = "gray", lty = 2)