--- title: "HDS 5. Multiple regression with growing $p$" author: "Matti Pirinen, University of Helsinki" date: "31.10.2021" output: html_document: default --- {r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) set.seed(17)  So far we have considered statistical inference methods that are based on univariate models. That is, models where we estimate/test only one variable at a time. Such methods are computationally simple and can therefore be applied for very high dimensional problems. However, they ignore dependencies between predictors. To illustrate why this is a problem, let's consider two correlated predictors$X$and$Z$of which only one directly affects the outcome$Y$. **Example 5.1.** To create variables$x$and$z$with correlation$r$, let's write$X = U + \varepsilon_x$and$Z = U + \varepsilon_z$, where$\textrm{Var}(U) = |r|$and$\textrm{Var}(\varepsilon_x) = \textrm{Var}(\varepsilon_z) = 1-|r|$with$\varepsilon_x$and$\varepsilon_z$independent of each other and of$U$. Then \begin{eqnarray*} \textrm{Cor}(X,Z) &=& \frac{\textrm{Cov}(X,Z)}{\sqrt{\textrm{Var}(X)\cdot \textrm{Var}(Z)}} = \frac{\textrm{Cov}(U + \varepsilon_x, U + \varepsilon_z)}{\sqrt{\textrm{Var}(U + \varepsilon_x)\cdot \textrm{Var}(U + \varepsilon_z)}}\\ &=& \frac{\textrm{Cov}(U , U)}{\sqrt{(\textrm{Var}(U) + \textrm{Var}(\varepsilon_x))(\textrm{Var}(U) + \textrm{Var}(\varepsilon_z))}}\\ &=&\frac{|r|}{\left(\sqrt{|r| + 1 - |r|}\right)^2} = |r|. \end{eqnarray*} If$r$was negative, we would write$Z=-U + \varepsilon_z$instead. Let's generate$x$and$z$for$n=1000$samples with correlation$r=0.7.${r} n = 1000 r = 0.7 u = rnorm(n, 0, sqrt(abs(r))) #temporary variable that is used for generating x and z x = scale(u + rnorm(n, 0, sqrt(1-abs(r)))) z = scale(sign(r)*u + rnorm(n, 0, sqrt(1-abs(r)))) cor(x, z) #check that now cor(x,z) ~ r  Let's generate$y = \sqrt{0.2} \cdot x + \mathcal{N}(0, 0.8)$in which case$\textrm{Var}(Y)=\sqrt{0.2}^2 + 0.8 = 1$and$X$will explain 20% of the variance of$Y$(as$\text{Var}(\sqrt{0.2} X) / \text{Var}(Y) = 0.2 \cdot 1 / 1 = 0.2$). {r} y = scale(sqrt(0.2)*x + rnorm(n, 0, sqrt(0.8))) # x explains 20% of y summary(lm(y ~ 0 + x))  Note that we did not use$z$to generate$y$, but$z$and$y$are still highly correlated: $$\textrm{Cor}(Y,Z) = \frac{\textrm{Cov}(Y,Z)}{\sqrt{\textrm{Var}(Y)\cdot \textrm{Var}(Z)}} = \frac{\textrm{Cov}(\sqrt{0.2}\cdot U + \sqrt{0.2} \cdot \varepsilon_x + \varepsilon_y, U + \varepsilon_z)}{\sqrt{1 \cdot 1}} = \sqrt{0.2}\cdot \textrm{Cov}(U , U)=\sqrt{0.2}\cdot |r|$$ Thus, we expect the regression coefficient of$Y$on$Z$to be$\sqrt{0.2}\cdot 0.7\approx 0.31.${r} summary(lm(y ~ 0 + z)) # z does not have a direct effect on y but will be significant here  Luckily, the joint model of y ~ x + z can tease the effects of x and z apart. {r} summary(lm(y ~ 0 + x + z))  We have seen that if we test$z$univariately, we observe a clear effect on$y$, but that effect disappears when we do multiple regression of$y$on both$x$and$z$. Thus, regression is able to correctly split the effect among correlated variables. This is very important in order to separate possible *direct effects*$X$on$Y$from *confounders*$Z$, that are simply correlated with both$X$and$Y$, but not anymore important in explaining$Y$, once we have measured predictors$X$with direct effects on$Y$. **Example 5.2.** In medical literature, the role of HDL and LDL cholesterol levels in the risk of heart disease has been studied a lot over decades. Both are marginally correlated with the risk, HDL negatively and LDL positively. Recent studies and medical trials suggest that LDL is "causal", in the sense that medication lowering directly LDL levels will reduce disease risk, whereas HDL is not causal in that sense. This is extremely important information for developing preventive treatment for heart disease. Note that with statistical models alone, we cannot ever *prove* a causal relationship between$X$and$Y$but we can show that$Z$is unlikely to have a causal relationship if its effect goes away once a better predictor$X$is found. If variables are highly correlated, there will be little information about how the effect should be split among them and therefore the SEs will become large and effect size estimates will become unstable. Therefore, in practice, we would not want to include several very highly correlated variables (say$|r|>0.95$) into a regression model. We will come back to this point later when we discuss ridge regression as a way to stabilize models with highly correlated variables. What about if predictors were independent of each other? Do we also then need to use a joint model to get the effect estimates correct? If$X$and$Z$are independent, then their marginal effects (from separate models$Y=X \beta_x + \varepsilon$and$Y= Z \beta_z + \varepsilon$) are estimating the same quantities as the joint model$Y = X \gamma_x + Z \gamma_z + \varepsilon$, i.e.,$\beta_x=\gamma_x$and$\beta_z = \gamma_z$. However, the joint model is produces more precise estimates than the marginal models in case that both$X$and$Z$have an effect on$Y$, because the joint model has lower residual variance, and hence also smaller standard errors for estimates. Let's next consider how we can use multiple regression models in higher dimensional (HD) problems. We begin from a problem that results from a naive approach. This example also takes us through central issues in HD data analysis where we need to balance between the variance and bias of our statistical method in order to avoid **overfitting** while still getting useful information out from the model. #### READ: *2.2.1 Measuring the Quality of Fit* from book [ISL](https://web.stanford.edu/~hastie/ISLR2/ISLRv2_website.pdf) and see slides HDS5. * Mean squared error (MSE) for measuring fit * Training and test data sets, training and test MSE * How training and test MSE behave as a function of model flexibility? * Overfitting #### Overfitting To make overfitting problem concrete in a linear regression setting, let's consider$n_r=150$training samples that were measured on$p=70$variables and an outcome variable$y$. Four variables,$x_1,x_2,x_3,x_4$will truly affect$y$but the 66 others are just random noise. Let's build a prediction model for$y$using linear regression and let's test that model in an additional set of$n_t=150$test samples. {r} set.seed(20102017) p = 70 m = 4 #effect predictors n.tr = 150 # training set n.te = 150 # test set n = n.tr + n.te i.tr = 1:n.tr #indexes for training i.te = (n.tr + 1):n #indexes for testing phi = 0.05 # variance explained by each relevant predictor, should be 0 < phi < 1/m. b = rep(c(sqrt(phi / (1 - phi)), 0), c(m, p-m)) #effects 1,2,3,4 are non-zero, see Lect. 0.1 for "phi" X = matrix(rnorm(n*p), nrow = n) #columns 1,...,4 of X have effects, other 66 cols are noise eps = scale(rnorm(n, 0, 1)) #epsilon, error term y = scale(X%*%b + eps) # y has mean = 1, var = 1 y.tr = y[i.tr] y.te = y[i.te] X.tr = data.frame(scale(X[i.tr,])) X.te = data.frame(scale(X[i.te,])) lm.fit = lm(y.tr ~ 0 + . , data = X.tr) #fit the model to training data #(remove intercept because formulas below become easier, # only possible because columns of X and y have mean = 0.) summary(lm.fit)  Let's see how well the fitted values correspond to the true outcome values in training data and compare that to how well the predicted values in test data approximate the outcome values in test data. {r, fig.width = 8, fig.height = 4} fitted.tr = predict(lm.fit) # same as 'as.matrix(X.tr) %*% lm.fit$coeff' fitted.te = predict(lm.fit, newdata = X.te) # same as 'as.matrix(X.te) %*% lm.fit$coeff' par(mfrow = c(1, 2)) plot(fitted.tr, y.tr, xlab = "fitted", ylab = "y", main = "Training", pch = 19, col = "peru" ) plot(fitted.te, y.te, xlab = "fitted", ylab = "y", main = "Test", pch = 19, col = "maroon1" )  Plots show how the fitted values from the model predict well in training data but not at all in test data. Let's compute measures of fit by mean squared error (MSE) and the proportion of the total variance explained by the model (R2 value). Let's write a function mse(x,y) that computes MSE between two vectors of same length. {r} mse <- function(x,y){mean((x-y)^2)} data.frame(MSE = c(round(mse(y.tr, fitted.tr), 3), round(mse(y.te, fitted.te), 3)), R2 = c(signif(1 - mse(y.tr, fitted.tr)/mse(y.tr, mean(y.tr)), 3), signif(1 - mse(y.te, fitted.te)/mse(y.te, mean(y.te)), 3)), variance = c(round(var(y.tr), 3), round(var(y.te), 3)), row.names = c("Train","Test"))  The model explains about 50% of variance in training data, but$R^2$estimate is negative in test data! This is because the MSE in test data is far larger than the variance of the test observations, meaning that by predicting each test data value by their common mean would give lower MSE than this model fit! This seems a completely useless model in the test data. Not only has the model fit to the noise of training data but it also does not clearly pinpoint the real effects$x_1,\ldots,x_4$that together would explain around 20% of$y$. (This can be seen in weak statistical significance of the 4 effect variables in the model summary above.) We say that the model has **overfit** to the training data. It has found artificial patterns between the outcome and predictors that are only present in the training data and do not generalize to the other data sets. This happens when we have many possible predictors and a small sample size. With many predictors, the model is too flexible and fits the coefficients to model the noise. This is why we will later use shrinkage methods, that do not just greedily try to explain the patterns in the training data, but will also penalize models for their greater flexibility to fit varying patterns. Note that in the above model summary, the adjusted-$R^2$(0.07) as well as the P-value (0.25) based on the F-statistic are able to tell that the model is not that good. These quantities are able to take into account the number of predictors used (also called degrees of freedom). Instead, the standard$R^2$or the model's likelihood value do not have any penalization for the model dimension and therefore should not be used for evaluating the generalizability of the model to other data sets. **Questions.** 1. Linear model fit can be geometrically seen as a projection of$n$-dimensional outcome vector$\pmb{y}$on the subspace spanned by the columns of the predictor matrix$\pmb{X}$(see notes on Linear model). From this point of view, what happens to the model fit in the training data as we will add more (linearly independent) predictors in the model? How does overfitting look like in this geometric interpretation? 2. We always expect training error to be smaller than test error, so that property alone is not yet overfitting. What should be a definition for overfitting? #### READ: *2.2.2 The Bias-Variance Trade-Off* from [ISL](http://faculty.marshall.usc.edu/gareth-james/ISL/). * What is *Bias* and *Variance* of a statistical prediction method in intuitive terms? * What is the formula that combines bias and variance to MSE? * Which kinds of methods have high bias, which have high variance? * Which kinds of methods overfit, which "underfit"? ### Bias-variance trade-off Let's decompose the expected test error into parts. Assume that we are looking at regression problem where$Y = f(X) + \varepsilon$and$\textrm{E}(\varepsilon)=0$and$\textrm{Var}(\varepsilon)=\sigma^2$. We denote by$\widehat{f}$the regression function that is learned on training data and we apply it to predict a new test observation whose predictors are$X=x_0. We have \begin{aligned} \textrm{Test-err}(x_0) &= \textrm{E}( (Y-\widehat{f}(x_0))^2\, |\, X=x_0) \\ &= \textrm{E}((Y-f(x_0)+f(x_0)- \widehat{f}(x_0))^2 \,|\, X=x_0) \\ &= \textrm{E}((Y-f(x_0))^2 \,|\, X=x_0) + \textrm{E}((f(x_0)- \widehat{f}(x_0))^2) + 2\cdot \textrm{E}((Y-f(x_0))(f(x_0)- \widehat{f}(x_0))) \\ &= \sigma^2 + \textrm{E}((f(x_0)- \textrm{E}(\widehat{f}(x_0)) + \textrm{E}(\widehat{f}(x_0)) - \widehat{f}(x_0))^2) + 2\cdot \textrm{E}(\varepsilon) \textrm{E}(f(x_0)- \widehat{f}(x_0)) \\ &= \sigma^2 + \textrm{E}((f(x_0)- \textrm{E}(\widehat{f}(x_0)))^2 + \textrm{E}(\textrm{E}(\widehat{f}(x_0)) - \widehat{f}(x_0))^2 + 2\cdot \textrm{E}((f(x_0)- \textrm{E}(\widehat{f}(x_0)))(\textrm{E}(\widehat{f}(x_0)) - \widehat{f}(x_0))) + 2\cdot 0 \\ &=\sigma^2 + \textrm{Bias}^2(x_0) + \textrm{Var}(\widehat{f}(x_0))+2\cdot \textrm{Bias}(x_0)\cdot 0\\ &=\sigma^2 + \textrm{Bias}^2(x_0) + \textrm{Var}(\widehat{f}(x_0)) \end{aligned} where Bias(x_0) = \textrm{E}(\widehat{f}(x_0)) - f(x_0)$. (The covariance term with$\varepsilon$disappeared because error$\varepsilon$in test data has expectation 0 and is independent of both$f(x_0)$and of$\widehat{f}(x_0)$which is learned in the training data.) We see that the test MSE is bounded from below by *irreducible error*$\sigma^2$that the regression model does not account for and therefore cannot decrease. We also see that to make test error as small as possible we should make both squared bias and variance small. However, these two terms are in an inverse relation where a decrease of one leads to an increase of the other. In short, bias is increased when model is not flexible enough to model the true$f$and variance is increased when the model is so flexible to fit data that small changes in data cause large variation in estimated$\widehat{f}$. The problem of overfitting follows when we optimise the model fit (e.g. by maximising the likelihood function) but do not penalize for the number of parameters (or more generally, the flexibility of the model). By penalizing for the flexibility of the model we cause some bias, i.e., model is not completely free to go where (training) data points to but instead we prefer sparser models that hopefully are better generalizable to test data. This bias means that we will not be able to fit perfectly to settings where best models are not sparse. However, by inducing this bias, we simultaneously decrease the variance of the model fit, because the model fit is not anymore equally free to follow all the weak patterns in training data and therefore the model fit is more robust to small changes in data and less prone to overfit to training data. In high-dimensional problems, it is crucial to look for an approriate tradeoff between bias and variance to keep the model flexible enough to truly learn from the data (which requires lower bias) while at the same time minimizing overfitting (which requires lower variance). **Questions.** (From ISL) 1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer. (a) The sample size$n$is extremely large, and the number of predictors$p$is small. (b) The number of predictors$p$is extremely large, and the number of observations$n$is small. (c) The relationship between the predictors and the response is highly non-linear. (d) The variance of the error terms, i.e.$\sigma^2 = \text{Var}(\varepsilon)$, is extremely high. 2. Provide a sketch of typical (squared) bias, variance, training error, test error, and irreducible error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Explain why each of the five curves has the shape displayed. #### How to choose suitable models: Information criteria As we just saw, the model containing all of the predictors in training data will always have the smallest residual error and the largest R2, since these quantities are related to the training error. Instead, we wish to choose a model with a low test error and training error or R2 are not suitable for selecting the best model among a collection of models with different numbers of predictors. Let's have a look at two commonly used information criteria that attempt to better estimate how model performs also outside the training data, while still using only the training data as input. #### Akaike Information Criterion (AIC) AIC is a general quantity to compare different models defined on the same data set. It is named according to Hirotugu Akaike (published early 1970s). Suppose we consider models$M_0,\ldots,M_J$where model$M_j$has$k_j$parameters. Let$L_j(\widehat{\theta}_j)$be the maximum value of the likelihood function for the model$M_j$. Then the AIC value of model$M_j$is $$\textrm{AIC}_j = 2 k_j - 2 \log(L_j(\widehat{\theta}_j))$$ The smaller the AIC, the better the model (as measured by AIC). We know that by increasing the model complexity (e.g. number of parameters) we can make model fit better to data and hence get higher values of likelihood. AIC penalizes the maximized log-likelihood by the number of parameters of the model. Thus, if two models have quite similar maximized log-likelihoods, then the one with less parameters is preferred. Intuitively, this favors sparser models and therefore helps to decrease overfitting. AIC is derived by an information theoretic argument. Suppose that the data are generated by a probability density$f$. We want to determine which one of two candidate densities$g_1$and$g_2$is a better approximation to$f$. The information lost when using$g_1$to represent$f$is quantified by the Kullback-Leibler divergence, $$\textrm{D}_{\textrm{KL}}(f\,||\, g_1) = \int f(x)\log\left(\frac{f(x)}{g_1(x)}\right) \mathrm{d}x.$$ Similarly,$\textrm{D}_{\textrm{KL}}(f\,||\, g_2)$measures information lost when approximating$f$by$g_2.$In practice, we do not know$f$and hence cannot compute divergencies above. Akaike (1974) showed, that differences in AIC can be used to estimate how much more (or less) information is lost by$g_1$than by$g_2$. This estimate, though, is only valid asymptotically and makes many quite crude approximations. In particular, if the number of data points is small, then some correction is often necessary (see e.g. AICc). **Connection to likelihood ratio test in nested models.** If model$M_1$is a submodel of model$M_2$(e.g. both are regression models but in$M_2$we use$k_2-k_1$more predictors on top of what is used in$M_1$), then, if the simpler model$M_1$holds, Wilks thorem says that the likelihood ratio test statistic $$\textrm{LRT}_{12} = -2 \log\left(\frac{L_1(\widehat{\theta}_1)}{L_2(\widehat{\theta}_2)}\right) \sim \chi^2_{k_2-k_1},$$ which can be used to derive P-values to test the additional parameters. (This result is asymptotic and requires some regularity conditions and, for example, that the additional parameters of model$M_2$are not at boundary values of their range in$M_1$.) We see that the AIC difference can be written in terms of LRT as $$\Delta_{12} = \textrm{AIC}_1 - \textrm{AIC}_2 = 2(k_1 - k_2) + \textrm{LRT}_{12}.$$ Given that$\textrm{E}\left(\chi^2_{k_2-k_1}\right) = k_2-k_1$, if model$M_1$holds, then the expectation of$\Delta_{12}$is $$\textrm{E}(\Delta_{12} \,|\, M_1) = 2(k_1 - k_2) + (k_2-k_1) = k_1-k_2 < 0$$ showing that AIC is expected to correctly favor the model$M_1$. Note that AIC can be compared for any pair of models on the same data, whereas the$\chi^2_{k_2-k_1}$distribution for LRT holds only for *nested* models. #### Bayesian Information Criterion (BIC) BIC was introduced by Gideon E. Schwarz in 1978, and is also called as Schwarz criterion. BIC has a similar form to AIC, but imposes a stronger penalty on the number of parameters. With the same notation as was used above for AIC, we define BIC value of model$M_j$as $$\textrm{BIC}_j = \log(n) k_j - 2 \log(L_j(\widehat{\theta}_j)),$$ where$n$is the number of (indepedent) data points, often the sample size. The BIC is derived using a Bayesian argument assuming that the prior distribution of the parameters is flat, i.e. a constant, in the region of interest. Then we have a following approximation for the marginal likelihood by using Laplace approximation to compute the integral: $$\textrm{Pr}(y\,|\,M_j) = \int \textrm{Pr}(\theta_j\,|\,M_j) \textrm{Pr}(y\,|\,\theta_j,M_j)\, \mathrm{d}\theta_j \approx C\, n^{-k_j/2}\, L_j(\widehat{\theta}_j),$$ where$C$is a constant not depending on$n$. Thus, we have the approximation $$2\, \log(\textrm{Pr}(y\,|\,M_j)) \approx 2\, \log \left( L_j(\widehat{\theta}_j) \right) - k_j\, \log(n) + C'= - \textrm{BIC}_j + C'$$ Hence BIC can be seen as an approximation to (negative) log-marginal likelihood of the model and smaller BIC corresponds to larger marginal likelihood, and hence better explanation of the data, *after* the model complexity has been taken into account. By ignoring the constants, we have an approximation to the Bayes factor between models 1 and 2 as $$\log (\textrm{BF}_{12}) = \log\left(\frac{\textrm{Pr}(y\,|\,M_1)}{\textrm{Pr}(y\,|\,M_2)}\right) \approx \frac{1}{2}(\textrm{BIC}_2 - \textrm{BIC}_1).$$ This gives an interpretation for comparing models via difference in BIC values. In particular, note how Bayesian marginal likelihood automatically accounts for the number of parameters of the model, and can get higher *or lower* as we move to more complex models (see "The role of marginal likelihood in Bayesian inference" from lecture HDS4). This is in contrast to the maximum likelihood value that can only get higher as we increase the number of parameters of the model. The difference between marginal likelihood and maximum likelihood is that the former has been computed by averaging the likelihood function with respect to the prior distribution of the parameters, whereas the latter only maximizes the likelihood and ignores the prior distribution. In BIC, the approximation that takes us from the maximized likelihood to an approximate marginal likelihood is the addition of the penalty term$\log(n) k$, where$k$is the number of parameters. For a detailed derivation of BIC see, for example, . **Example 5.3.** We can determine the P-value thresholds at which AIC and BIC start to favor the more complex model$H_1$(with maximized likelihood$L_1$) over the simpler submodel$H_0$(with maxmized likelihood$L_0$), when the difference in parameters is$k$and the models are nested (such that LRT is applicable). $$\textrm{AIC}_0 > \textrm{AIC}_1 \iff -2\log(L_0) > 2k -2\log(L_1) \iff \textrm{LRT}_{01} > 2k \iff \textrm{P-value} < 1-F_{\chi^2_k}(2k)$$ and for$k=1$this threshold is 0.157. $$\textrm{BIC}_0 > \textrm{BIC}_1 \iff -2\log(L_0) > k\log(n) -2\log(L_1) \iff \textrm{LRT}_{01} > k\log(n) \iff \textrm{P-value} < 1-F_{\chi^2_k}(k\log(n))$$ This threshold depends on$n$as follows: {r} k = 1 ns = c(10, 25, 50, 75, seq(100,10000,100)) plot(ns, -log10(pchisq(k*log(ns), df = 1, lower = F)), t = "l", lwd = 2, ylab = "-log10 P-value", xlab = "n", ylim = c(0,3), main = "k=1") #BIC threshold abline(h = -log10(0.157), col = "red", lwd = 2) #AIC threshold legend("bottomright", leg = c("AIC", "BIC"), col = c("red", "black"), lwd = 2)  This shows that AIC is much more liberal than BIC in letting additional variables in the model and the difference increases as information in data ($n$) grows. **Example 5.4.** Let's see how AIC and BIC work when we fit different polynomial models for Boston data set trying to predict medv (median value of a property) with lstat (proportion of lower status people in area). {r} library(MASS) medv = Boston$medv lstat = Boston$lstat plot(lstat, medv, pch = "+", col = "gray") x.points = seq(min(lstat), max(lstat), length = 1000) #grid on which to evaluate cols = topo.colors(8) Bos.res = data.frame(matrix(NA, ncol = 2, nrow = 8)) names(Bos.res) = c("AIC","BIC") for(ii in 1:8){ Bos.fit = lm(medv ~ poly(lstat, ii, raw = T), data = Boston) #ii deg polynomial y.at.x = predict(Bos.fit, data.frame(lstat = x.points)) lines(x.points, y.at.x, col = cols[ii], lty = ii, lwd = 2) Bos.res[ii, ] = c(AIC(Bos.fit), BIC(Bos.fit)) } legend("topright", leg = 1:8, col = cols, lwd = 2, lty = 1:8) Bos.res  We see that both AIC and BIC are minimized by the 5th degree polynomial, so this would be the best model according to these criteria. This would be the bias-variance compromise here: The smaller degree polynomials don't explain data well enough and higher degree polynomials start to overfit. We could also check whether 5th degree polynomial gives the smallest test error in a test set, but for that we would need to first split our data into test and training sets. (We will later look how to do this with cross-validation.) ### READ: 6.1.1-6.1.2 *Subset selection* from [ISL](https://web.stanford.edu/~hastie/ISLR2/ISLRv2_website.pdf). * What are problems with best subset selection? * What are advantages of stepwise selection methods? ### Stepwise regression By using, for example, either of AIC or BIC, we could compare and rank different regression models even with different numbers of parameters. When there are$p$predictors, there are$2^p$different subset models, and *best subset selection* becomes impossible already when$p$gets to the range of 20-30. Instead, a simple way to search for a top model is through a stepwise procedure where we start from the empty model, we evaluate putative additions of all possible predictors that are not yet in the model and choose the one which gives the optimal value for the criterion. We continue this until no new predictor increases the criterion. This is called *forward selection*. We can also start from the full model and by *backward selection* start dropping terms that are not needed. We may also add a possibility at each step to either drop existing predictors or add new ones in *hybrid selection*. Note that in both forward and backward selection there are at most$p$steps and hence at most$\frac{1}{2}(p^2+p)$models to fit, compared to$2^p$models in best subset selection. Let's see how to do this in R using step() function. We have our regression model in lm.fit where we had fit$p=70$predictors of which only the first four truly affected the outcome. We don't plot the output of these functions in whole here, but you can see it yourself by removing trace=0 from the code blocks. {r} aic.back = step(lm.fit, direction = "backward", k = 2, trace = 0) #k = 2 corresponds to AIC i.e. penalty is 2*#params #which variables are in the final model chosen by AIC aic.back.vars = names(coefficients(aic.back)) aic.back.vars #we get BIC by setting k = log(n) where n is the number of individuals in regression bic.back = step(lm.fit, direction = "backward", k = log(n.tr), trace = 0) bic.back.vars = names(coefficients(bic.back)) bic.back.vars  So we see that BIC is much more stringent and keeps only 4 variables whereas AIC keeps 16. Both include the true effects and BIC finds exactly the true model. We have the test set, so let's see how these two models work in out-of-sample data, which should be our ultimate goal to judge what is a good model. We use predict() function to apply the final model from step() function to test data. {r} aic.back.test = predict(aic.back, newdata = X.te) bic.back.test = predict(bic.back, newdata = X.te) cat(paste("Test MSE for AIC back:",signif(mse(y.te, aic.back.test), 2), "\nTest MSE for BIC back:",signif(mse(y.te, bic.back.test), 2) ))  So BIC is better, as expected, because it was using exactly the correct variables here. Note that the variance in the test data was about 0.92, so the variance explained by BIC model here is$1-0.82/0.92 = 10.8$%. (We would expect close to 20% variance explained if the coefficients were accurately estimated, but here the sample size is too small for very accurate estimates.) Let's do forward selection instead. Here we need to specify the starting model and the largest possible model. {r} aic.fwd = step(lm(y.tr ~ 0, data = X.tr), scope = formula(lm(y.tr ~ 0 + . , data = X.tr)), direction = "forward", k = 2, trace = 0) aic.fwd.vars = names(coefficients(aic.fwd)) aic.fwd.vars bic.fwd = step(lm(y.tr ~ 0, data = X.tr), scope = formula(lm(y.tr ~ 0 + . , data = X.tr)), direction = "forward", k = log(n.tr), trace = 0); bic.fwd.vars = names(coefficients(bic.fwd)) bic.fwd.vars  Now, on top of the 4 correct variables, BIC gave one additional variable and AIC gave 7 variables. Let's test these models in test data. {r} aic.fwd.test = predict(aic.fwd, newdata = X.te) bic.fwd.test = predict(bic.fwd, newdata = X.te) cat(paste("Test MSE for AIC fwd:",signif(mse(y.te, aic.fwd.test), 2), "\nTest MSE for BIC fwd:",signif(mse(y.te, bic.fwd.test), 2) ))  For BIC, forward search gave a slightly worse model than backward search whereas the opposite is true for AIC. Our example here was about regression where the true model was very sparse (only 4/70 predictors were relevant). Here BIC worked much better than AIC, because BIC leads to sparser models in general. In cases when there are more relevant predictors, each with smaller effect size, the conclusion may be different. In general, there is no clear choice between AIC and BIC. Theoretically, BIC is asymptotically consistent as a selection criterion. This means that given a family of models, including the true model, the probability that BIC will select the correct model approaches one as the sample size$n\to \infty$. This is not the case for AIC, which tends to choose models which are too complex as$n \to \infty$. However, this results assumes that the true model is among the tested ones, which is unlikely to be the case in real world applications. In practice, AIC is preferred in situations when a false negative finding would be considered more misleading than a false positive, and BIC is preferred in situations where a false positive is at least as misleading as a false negative. **Example 5.5.** Let's look at the Boston data set about variables related to Housing in suburbs of Boston in 1970s. Let's predict chas that is a Binary variable (= 1 if area is located on Charles river; 0 otherwise) using other variables in a logistic regression model. (With logistic regression we should always include the intercept.) Let's do forward and backward regressions. {r} library(MASS) #Forward AIC step(glm(chas ~ 1, family = "binomial", data = Boston), scope = formula(glm(chas ~ ., family = "binomial", data = Boston)), trace = 0, k = 2, direction = "forward") step(glm(chas ~ ., family = "binomial", data = Boston), trace = 0, k = 2, direction = "backward")  Backward selection found a model with lower AIC than forward selection. Why is this? The model has variables rad and tax on top of what the forward selection model found. Let's see what happens when we put these two variables in the model either separately or together. {r} AIC(glm(chas ~ medv + nox + crim + indus, family = "binomial", data = Boston)) #AIC-fwd model AIC(glm(chas ~ medv + nox + crim + indus + rad, family = "binomial", data = Boston)) # add only rad AIC(glm(chas ~ medv + nox + crim + indus + tax, family = "binomial", data = Boston)) # add only tax AIC(glm(chas ~ medv + nox + crim + indus + rad + tax, family = "binomial", data = Boston)) #add rad and tax  We see that adding either tax or rad alone would not decrease AIC but adding them together would decrease AIC. This shows that stepwise search may not find the optimal model because a single step may not always be enough to get from the current model to a region of better models, even when such a region existed. Thus running both forward and backward search might be useful, although even together they would not guarantee that we would find the optimal model. What about adding interaction terms? Compact notation .^2 means to include all main terms and all pairwise interaction terms. {r warning=FALSE} aic.interaction = step(glm(chas ~ 1, family = "binomial", data = Boston), scope = formula(glm(chas ~ .^2, family = "binomial", data = Boston)), trace=0, k = 2, direction = "forward") aic.interaction  Why it might not be a good idea to do backward search from model with all interaction terms in it? For the chosen interaction model we see a lot lower AIC than with the main effects. Can we see whether this model truly performs well and does not just overfit to the data? That is a bit late now, since the model has been chosen in the whole data set. Even if we re-estimated the effect sizes of the predictors and interactions in only some subset of the data and applied it to another subset, we would still underestimate the true test error that we would expect in an independent test data set. Question: How could we compare performance of the models when we **do not have an extra test data set** and do not want to split our valuable data for a separate test set, which would decrease our power to estimate a good model because the sample size in training data would be smaller after a test data set has been removed? ### READ: 5.1.1-5.1.4 *Cross-validation* from [ISL](http://faculty.marshall.usc.edu/gareth-james/ISL/). * What does cross-validation attempt to do? * What is LOOCV and K-fold CV and how do they compare? ### Cross-validation The idea of cross-validation (CV) is to mimic the split of data into training and validation sets, without completely ignoring any part of the data in the training part. (For us here "validation set" is a synonym for "test set".) In K-fold CV, we split the data into K parts, and carry out the following procedure for each part$j=1,\ldots,K$: * Use other parts of the data except part$j$to train the model. * Validate the trained model on part$j$and record the test measure (such as MSE for continuous variables or correct classification for logistic regression model). After all$K$parts have been used as a validation set * Average the test measure over all$K$runs. By carrying CV out for different models, we approximate each model's performance on out-of-sample test data. Typically$K=5$or$K=10$are used and by keeping$K$rather small we avoid large variance due to small validation sets, but still have multiple parts in order to avoid the bias present in any one split of the data set. The case$K = n$is known as leave-one-out cross-validation. In this case, for observation$i$, the fit is computed using all the data except$i$. Let's try cross-validation on our original data with$n=150$training samples and$p=70$predictors. The goal is to estimate test error of the model that includes all variables by using just the training data set. CV is implemented in cv.glm() function of the boot package and requires model fitted by glm(), that by default fits a linear model with Gaussian errors. Thus, by default, glm() fits the same model as lm() but the output is a bit different as glm() is also applicable to other regression models such as logistic regression (using glm(,family="binomial")). {r} library(boot) #has cv.glm function to do CV for (generalized) linear models glm.fit = glm(y.tr ~ 0 + ., data = X.tr) #linear model w/o intercept fitted by glm cv.glm( data.frame(y.tr, X.tr), glm.fit, K = 10)$delta #K=10 fold CV using MSE as cost function #outputs 2 values: 1 = raw CV estimate and 2 = adjusted CV estimate of prediction error. var(y.tr) #for comparison with CV MSE.  We estimate a test MSE of about 2.0 which is nearly 100% larger than the variance in training data $y$. CV is telling that it is not a good idea to fit this kind of linear model ($p=70$) with this small sample size ($n=150$) as the estimated MSE is much larger than variance of the outcome variable. In other words, the model is useless when applied to new data. Input for cv.glm() can also include a cost function of two vector arguments. The first argument to cost should correspond to the observed responses and the second argument should correspond to the predicted or fitted responses from the generalized linear model. The default is MSE, and it is used here. #### The Wrong and Right Way to Do Cross-validation (from ESL 7.10.2) Consider a problem with a large number of predictors. A typical strategy for analysis might be as follows: 1. Screen the predictors: find a subset of "good"" predictors that show fairly strong (univariate) correlation with the outcome. 2. Using just this subset of predictors, build a multivariate model. 3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model. Is this a correct application of cross-validation? The problem is that the predictors have an unfair advantage, as they were chosen in step (1) on the basis of all of the samples. Leaving samples out after the variables have been selected does not correctly mimic the application of the model to a completely independent test set, since these predictors "have already seen" the left out samples. Here is the correct way to carry out cross-validation in this example: 1. Divide the samples into $K$ cross-validation folds at random. 2. For each fold $k = 1, 2,\ldots,K$ * Find a subset of "good" predictors that show fairly strong (univariate) correlation with the outcome, using all of the samples except those in fold $k$. * Using just this subset of predictors, build a multivariate model, using all of the samples except those in fold $k$. * Use the model to predict the outcome for the samples in fold $k$. The error estimates from step 2(c) are then accumulated over all $K$ folds, to produce the cross-validation estimate of prediction error. In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, validation samples must be "left out" before any selection or filtering steps that use outcome variable are applied. ### Using observed data for deriving sampling distributions Cross-validation is an example of *resampling methods* where we use our observed data to not only fit some model, but also evaluate its performance. When we have a large number of observations from the target population, then several other methods with similar spirit are also very useful. When we considered Q-values, we already saw how the observed P-value distribution can be used to infer $\pi_0$, the proportion of true nulls, and further to approximate local false discovery rate for any one P-value. Two other common applications are permutation testing and Bootstrap. Note the difference between these methods and the traditional way of stipulating a theoretical distribution for a test statistic and then comparing the observed to the theoretical. #### Permutation testing In permutation testing we permute the indexes of sets of variables with respect to each other in order to break all associations *between* the sets while maintaining all the structure *within* the sets. For example, suppose that I want to fit a non-linear spline model to predict disease risk ($y$) from set of 20 clinically measured variables ($X$). We don't have a reliable theoretical null distribution to know what kind of MSE this complex and highly non linear prediction method would give already when there is no association between $y$ and $X$. In permutation testing, we would permute the elements of $y$ vector and run the method on permuted data exactly as it was run on the original unpermuted data. By collecting MSE values from 1000s of permuted data sets we would have an idea of the null distribution of MSE from this method. Importantly, we would have realistic structure among the predictors $X$, since the columns of $X$ were not permuted. We only permuted away the association between $y$ and $X$, not between any two columns of $X$. #### Bootstrap We consider bootstrap later on this course. It is explained in *5.2 The Bootstrap* in [ISL](https://web.stanford.edu/~hastie/ISLR2/ISLRv2_website.pdf).