So far we have considered statistical inference methods that are based on univariate models, that is, models where we estimate/test only one predictor at a time. Such methods are computationally simple and can therefore be applied to very high-dimensional problems. However, they ignore dependencies between the predictors. To illustrate why this is a problem, let’s consider two correlated predictors \(X\) and \(Z\), of which only one directly affects the outcome \(Y\).

Example 5.1. To create variables \(x\) and \(z\) with correlation \(r\), let’s write \(X = U + \varepsilon_x\) and \(Z = U + \varepsilon_z\), where \(\textrm{Var}(U) = |r|\) and \(\textrm{Var}(\varepsilon_x) = \textrm{Var}(\varepsilon_z) = 1-|r|\) with \(\varepsilon_x\) and \(\varepsilon_z\) independent of each other and of \(U\). Then \[\begin{eqnarray*} \textrm{Cor}(X,Z) &=& \frac{\textrm{Cov}(X,Z)}{\sqrt{\textrm{Var}(X)\cdot \textrm{Var}(Z)}} = \frac{\textrm{Cov}(U + \varepsilon_x, U + \varepsilon_z)}{\sqrt{\textrm{Var}(U + \varepsilon_x)\cdot \textrm{Var}(U + \varepsilon_z)}}\\ &=& \frac{\textrm{Cov}(U , U)}{\sqrt{(\textrm{Var}(U) + \textrm{Var}(\varepsilon_x))(\textrm{Var}(U) + \textrm{Var}(\varepsilon_z))}}\\ &=&\frac{|r|}{\left(\sqrt{|r| + 1 - |r|}\right)^2} = |r|. \end{eqnarray*}\] If \(r\) was negative, we would write \(Z=-U + \varepsilon_z\) instead.

Let’s generate \(x\) and \(z\) for \(n=1000\) samples with correlation \(r=0.7.\)

n = 1000
r = 0.7
u = rnorm(n, 0, sqrt(abs(r))) #temporary variable that is used for generating x and z
x = scale(u + rnorm(n, 0, sqrt(1-abs(r))))
z = scale(sign(r)*u + rnorm(n, 0, sqrt(1-abs(r)))) 
cor(x, z) #check that now cor(x,z) ~ r
##           [,1]
## [1,] 0.7036026

Let’s generate \(y = \sqrt{0.2} \cdot x + \mathcal{N}(0, 0.8)\) in which case \(\textrm{Var}(Y)=\sqrt{0.2}^2 + 0.8 = 1\) and \(X\) will explain 20% of the variance of \(Y\) (as \(\text{Var}(\sqrt{0.2} X) / \text{Var}(Y) = 0.2 \cdot 1 / 1 = 0.2\)).

y = scale(sqrt(0.2)*x + rnorm(n, 0, sqrt(0.8))) # x explains 20% of y
summary(lm(y ~ 0 + x))
## 
## Call:
## lm(formula = y ~ 0 + x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0237 -0.6078 -0.0290  0.6301  2.8429 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x  0.44291    0.02837   15.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8966 on 999 degrees of freedom
## Multiple R-squared:  0.1962, Adjusted R-squared:  0.1954 
## F-statistic: 243.8 on 1 and 999 DF,  p-value: < 2.2e-16

Note that we did not use \(z\) to generate \(y\), but \(z\) and \(y\) are still highly correlated: \[ \textrm{Cor}(Y,Z) = \frac{\textrm{Cov}(Y,Z)}{\sqrt{\textrm{Var}(Y)\cdot \textrm{Var}(Z)}} = \frac{\textrm{Cov}(\sqrt{0.2}\cdot U + \sqrt{0.2} \cdot \varepsilon_x + \varepsilon_y, U + \varepsilon_z)}{\sqrt{1 \cdot 1}} = \sqrt{0.2}\cdot \textrm{Cov}(U , U)=\sqrt{0.2}\cdot |r| \] Thus, we expect the regression coefficient of \(Y\) on \(Z\) to be \(\sqrt{0.2}\cdot 0.7\approx 0.31.\)

summary(lm(y ~ 0 + z)) # z does not have a direct effect on y but will be significant here
## 
## Call:
## lm(formula = y ~ 0 + z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.06774 -0.66576  0.02115  0.63622  2.81372 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## z  0.30243    0.03016   10.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9532 on 999 degrees of freedom
## Multiple R-squared:  0.09147,    Adjusted R-squared:  0.09056 
## F-statistic: 100.6 on 1 and 999 DF,  p-value: < 2.2e-16

Luckily, the joint model y ~ x + z can tease apart the effects of x and z.

summary(lm(y ~ 0 + x + z)) 
## 
## Call:
## lm(formula = y ~ 0 + x + z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.01014 -0.60618 -0.02546  0.62552  2.84596 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x  0.45573    0.03993  11.412   <2e-16 ***
## z -0.01822    0.03993  -0.456    0.648    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8969 on 998 degrees of freedom
## Multiple R-squared:  0.1963, Adjusted R-squared:  0.1947 
## F-statistic: 121.9 on 2 and 998 DF,  p-value: < 2.2e-16

We have seen that when we test \(z\) univariately, we observe a clear effect on \(y\), but that effect disappears when we do multiple regression of \(y\) on both \(x\) and \(z\). Thus, regression is able to correctly split the effect among correlated variables. This is very important for separating possible direct effects of \(X\) on \(Y\) from confounders \(Z\) that are simply correlated with both \(X\) and \(Y\), but are no longer important in explaining \(Y\) once we have measured the predictors \(X\) with direct effects on \(Y\).

Example 5.2. In the medical literature, the roles of HDL and LDL cholesterol levels in the risk of heart disease have been studied extensively over the decades. Both are marginally correlated with the risk, HDL negatively and LDL positively. Recent studies and medical trials suggest that LDL is “causal”, in the sense that medication that directly lowers LDL levels will reduce disease risk, whereas HDL is not causal in that sense. This is extremely important information for developing preventive treatments for heart disease.

Note that with statistical models alone we can never prove a causal relationship between \(X\) and \(Y\), but we can show that \(Z\) is unlikely to have a causal relationship if its effect goes away once a better predictor \(X\) is found.

If variables are highly correlated, there is little information about how the effect should be split among them, so the SEs become large and the effect size estimates unstable. In practice, we would therefore not want to include several very highly correlated variables (say \(|r|>0.95\)) in a regression model. We will come back to this point later when we discuss ridge regression as a way to stabilize models with highly correlated variables.
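
As a quick illustration (a sketch that reuses the simulation scheme of Example 5.1, with a hypothetical correlation of 0.99 chosen only to make the point), we can regenerate the data with a very high correlation and compare the standard errors of the joint model to those seen above:

r.hi = 0.99 # very high correlation, chosen only for illustration
u.hi = rnorm(n, 0, sqrt(r.hi))
x.hi = scale(u.hi + rnorm(n, 0, sqrt(1 - r.hi)))
z.hi = scale(u.hi + rnorm(n, 0, sqrt(1 - r.hi)))
y.hi = scale(sqrt(0.2)*x.hi + rnorm(n, 0, sqrt(0.8))) # again only x.hi has a direct effect
summary(lm(y.hi ~ 0 + x.hi + z.hi))$coefficients # SEs are several times larger than with r = 0.7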

What if the predictors were independent of each other? Do we then still need a joint model to get the effect estimates correct? If \(X\) and \(Z\) are independent, then their marginal effects (from separate models \(Y=X \beta_x + \varepsilon\) and \(Y= Z \beta_z + \varepsilon\)) estimate the same quantities as the joint model \(Y = X \gamma_x + Z \gamma_z + \varepsilon\), i.e., \(\beta_x=\gamma_x\) and \(\beta_z = \gamma_z\). However, the joint model produces more precise estimates than the marginal models when both \(X\) and \(Z\) have an effect on \(Y\), because the joint model has a lower residual variance, and hence also smaller standard errors for the estimates.
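
We can see this in a small simulation (a sketch with hypothetical effect sizes where two independent predictors each explain 20% of the variance of the outcome), comparing the estimate and standard error of the coefficient of x from the marginal and the joint model:

x.ind = rnorm(n) # two independent predictors
z.ind = rnorm(n)
y.ind = sqrt(0.2)*x.ind + sqrt(0.2)*z.ind + rnorm(n, 0, sqrt(0.6)) # both explain 20% of variance
summary(lm(y.ind ~ x.ind))$coefficients["x.ind", ] # marginal model
summary(lm(y.ind ~ x.ind + z.ind))$coefficients["x.ind", ] # joint model: similar estimate, smaller SE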

Let’s next consider how we can use multiple regression models in higher-dimensional (HD) problems. We begin with a problem that results from a naive approach. This example also takes us through the central issues in HD data analysis, where we need to balance the variance and bias of our statistical method in order to avoid overfitting while still extracting useful information from the model.

READ: 2.2.1 Measuring the Quality of Fit from book ISL and see slides HDS5.

Overfitting

To make the overfitting problem concrete in a linear regression setting, let’s consider \(n_r=150\) training samples measured on \(p=70\) variables and an outcome variable \(y\). Four variables, \(x_1,x_2,x_3,x_4\), truly affect \(y\), while the other 66 are just random noise. Let’s build a prediction model for \(y\) using linear regression and test that model on an additional set of \(n_t=150\) test samples.

set.seed(20102017) 
p = 70
m = 4 #effect predictors
n.tr = 150 # training set 
n.te = 150 # test set
n = n.tr + n.te
i.tr = 1:n.tr #indexes for training
i.te = (n.tr + 1):n #indexes for testing
phi = 0.05 # variance explained by each relevant predictor, should be 0 < phi < 1/m. 
b = rep(c(sqrt(phi / (1 - phi)), 0), c(m, p-m)) #effects 1,2,3,4 are non-zero, see Lect. 0.1 for "phi"
X = matrix(rnorm(n*p), nrow = n) #columns 1,...,4 of X have effects, other 66 cols are noise 
eps = scale(rnorm(n, 0, 1)) #epsilon, error term
y = scale(X%*%b + eps) # y has mean = 0, var = 1
y.tr = y[i.tr]
y.te = y[i.te]
X.tr = data.frame(scale(X[i.tr,]))
X.te = data.frame(scale(X[i.te,]))
lm.fit =  lm(y.tr ~ 0 + . , data = X.tr) #fit the model to training data
#(remove intercept because formulas below become easier,
# only possible because columns of X and y have mean = 0.)
summary(lm.fit)
## 
## Call:
## lm(formula = y.tr ~ 0 + ., data = X.tr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.08623 -0.53466 -0.09445  0.43393  2.43027 
## 
## Coefficients:
##      Estimate Std. Error t value Pr(>|t|)   
## X1   0.229311   0.107648   2.130  0.03623 * 
## X2   0.194258   0.123795   1.569  0.12055   
## X3   0.260994   0.111172   2.348  0.02136 * 
## X4   0.308204   0.113932   2.705  0.00834 **
## X5  -0.093432   0.108425  -0.862  0.39142   
## X6  -0.063471   0.120010  -0.529  0.59835   
## X7   0.035323   0.125730   0.281  0.77948   
## X8  -0.042481   0.115028  -0.369  0.71287   
## X9   0.085748   0.113017   0.759  0.45025   
## X10  0.147267   0.119053   1.237  0.21971   
## X11  0.043963   0.104601   0.420  0.67540   
## X12 -0.037629   0.120575  -0.312  0.75579   
## X13 -0.040844   0.121283  -0.337  0.73718   
## X14  0.030316   0.126425   0.240  0.81110   
## X15  0.239408   0.112427   2.129  0.03629 * 
## X16  0.081389   0.130025   0.626  0.53313   
## X17  0.171138   0.133239   1.284  0.20269   
## X18  0.040340   0.108379   0.372  0.71072   
## X19 -0.050519   0.110614  -0.457  0.64912   
## X20  0.103824   0.108994   0.953  0.34368   
## X21 -0.093361   0.127151  -0.734  0.46494   
## X22 -0.062987   0.109931  -0.573  0.56827   
## X23  0.119597   0.119399   1.002  0.31953   
## X24  0.200846   0.125858   1.596  0.11447   
## X25 -0.064935   0.115757  -0.561  0.57639   
## X26  0.042791   0.105519   0.406  0.68617   
## X27 -0.067227   0.107897  -0.623  0.53501   
## X28 -0.037153   0.123721  -0.300  0.76473   
## X29 -0.064788   0.127722  -0.507  0.61337   
## X30 -0.142176   0.114476  -1.242  0.21788   
## X31  0.187589   0.116862   1.605  0.11239   
## X32 -0.101127   0.127979  -0.790  0.43175   
## X33 -0.009285   0.110950  -0.084  0.93352   
## X34 -0.135615   0.108108  -1.254  0.21333   
## X35 -0.033501   0.105570  -0.317  0.75182   
## X36  0.006990   0.116591   0.060  0.95234   
## X37 -0.002173   0.105583  -0.021  0.98363   
## X38 -0.164300   0.108265  -1.518  0.13306   
## X39 -0.017016   0.109143  -0.156  0.87650   
## X40 -0.033981   0.114801  -0.296  0.76800   
## X41 -0.088039   0.118300  -0.744  0.45894   
## X42  0.166009   0.119484   1.389  0.16857   
## X43 -0.099728   0.114277  -0.873  0.38545   
## X44  0.046778   0.114974   0.407  0.68520   
## X45  0.139730   0.119479   1.169  0.24568   
## X46  0.133005   0.110356   1.205  0.23167   
## X47  0.082510   0.111679   0.739  0.46218   
## X48  0.060008   0.119797   0.501  0.61781   
## X49  0.071564   0.117202   0.611  0.54319   
## X50  0.025570   0.114068   0.224  0.82320   
## X51  0.017773   0.111185   0.160  0.87340   
## X52 -0.012569   0.120808  -0.104  0.91740   
## X53  0.127866   0.118374   1.080  0.28330   
## X54 -0.063649   0.104397  -0.610  0.54380   
## X55  0.029813   0.110874   0.269  0.78871   
## X56 -0.069541   0.103458  -0.672  0.50341   
## X57 -0.183767   0.124261  -1.479  0.14310   
## X58  0.125368   0.115853   1.082  0.28245   
## X59  0.013123   0.107267   0.122  0.90293   
## X60 -0.049127   0.112818  -0.435  0.66441   
## X61 -0.062615   0.101760  -0.615  0.54009   
## X62 -0.049920   0.115007  -0.434  0.66541   
## X63  0.073498   0.109782   0.669  0.50511   
## X64  0.069620   0.100568   0.692  0.49077   
## X65  0.014010   0.108928   0.129  0.89799   
## X66 -0.023163   0.113457  -0.204  0.83875   
## X67  0.001113   0.107764   0.010  0.99178   
## X68 -0.253880   0.122114  -2.079  0.04082 * 
## X69  0.023563   0.111688   0.211  0.83344   
## X70 -0.012839   0.105635  -0.122  0.90357   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.001 on 80 degrees of freedom
## Multiple R-squared:  0.5047, Adjusted R-squared:  0.0713 
## F-statistic: 1.165 on 70 and 80 DF,  p-value: 0.2541

Let’s see how well the fitted values correspond to the true outcome values in training data and compare that to how well the predicted values in test data approximate the outcome values in test data.

fitted.tr = predict(lm.fit) # same as 'as.matrix(X.tr) %*% lm.fit$coeff'
fitted.te = predict(lm.fit, newdata = X.te) # same as 'as.matrix(X.te) %*% lm.fit$coeff'
par(mfrow = c(1, 2))
plot(fitted.tr, y.tr, xlab = "fitted", ylab = "y", main = "Training", pch = 19, col = "peru" )
plot(fitted.te, y.te, xlab = "fitted", ylab = "y", main = "Test", pch = 19, col = "maroon1" )

The plots show that the fitted values from the model predict the outcome well in the training data but not at all in the test data.

Let’s compute measures of fit: the mean squared error (MSE) and the proportion of the total variance explained by the model (the \(R^2\) value). We write a function mse(x,y) that computes the MSE between two vectors of the same length.

mse <- function(x,y){mean((x-y)^2)}
data.frame(MSE = c(round(mse(y.tr, fitted.tr), 3),
                   round(mse(y.te, fitted.te), 3)),
           R2 = c(signif(1 - mse(y.tr, fitted.tr)/mse(y.tr, mean(y.tr)), 3),
                  signif(1 - mse(y.te, fitted.te)/mse(y.te, mean(y.te)), 3)),
           variance = c(round(var(y.tr), 3), round(var(y.te), 3)),
           row.names = c("Train","Test"))
##         MSE     R2 variance
## Train 0.534  0.504    1.083
## Test  1.596 -0.748    0.919

The model explains about 50% of the variance in the training data, but the \(R^2\) estimate is negative in the test data! This is because the MSE in the test data is far larger than the variance of the test observations, meaning that predicting each test value simply by the common mean would give a lower MSE than this model fit! The model seems completely useless in the test data.
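
We can check this claim directly (reusing the quantities computed above):

mse(y.te, fitted.te) # test MSE of the fitted model
mse(y.te, mean(y.te)) # test MSE when every test value is predicted by the test mean; this is smaller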

Not only has the model fit the noise of the training data, but it also fails to clearly pinpoint the real effects \(x_1,\ldots,x_4\), which together would explain around 20% of the variance of \(y\). (This can be seen from the weak statistical significance of the 4 effect variables in the model summary above.) We say that the model has overfit to the training data: it has found artificial patterns between the outcome and the predictors that are only present in the training data and do not generalize to other data sets. This happens when we have many possible predictors and a small sample size. With many predictors, the model is too flexible and fits the coefficients to model the noise. This is why we will later use shrinkage methods, which do not just greedily try to explain the patterns in the training data but also penalize models for their greater flexibility to fit varying patterns.
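
As a sanity check (a sketch reusing the objects defined above), we can refit the model using only the four truly relevant predictors and evaluate it in the test data:

lm.fit4 = lm(y.tr ~ 0 + X1 + X2 + X3 + X4, data = X.tr) # use only the four true effect variables
summary(lm.fit4)$coefficients # the true effects are now estimated more precisely
pred4.te = predict(lm.fit4, newdata = X.te) # predictions in the test data
signif(1 - mse(y.te, pred4.te) / mse(y.te, mean(y.te)), 3) # test R2: now positive and close to the ~20% signal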

Note that in the above model summary, the adjusted \(R^2\) (0.07) as well as the P-value (0.25) based on the F-statistic are able to tell that the model is not that good. These quantities take into account the number of predictors used (also called the degrees of freedom). In contrast, the standard \(R^2\) and the model’s likelihood value include no penalization for the model dimension and therefore should not be used for evaluating how well the model generalizes to other data sets.
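
These quantities can also be extracted programmatically from the fitted model (a small sketch using the lm.fit object from above):

s = summary(lm.fit)
s$adj.r.squared # adjusted R2, penalized for the number of predictors
fstat = s$fstatistic # F-statistic with its degrees of freedom
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE) # P-value of the overall F-test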

Questions.

  1. Linear model fit can be seen geometrically as a projection of the \(n\)-dimensional outcome vector \(\pmb{y}\) onto the subspace spanned by the columns of the predictor matrix \(\pmb{X}\) (see the notes on the Linear model). From this point of view, what happens to the model fit in the training data as we add more (linearly independent) predictors to the model? What does overfitting look like in this geometric interpretation?

  2. We always expect the training error to be smaller than the test error, so that property alone does not yet constitute overfitting. What would be a good definition of overfitting?

READ: 2.2.2 The Bias-Variance Trade-Off from ISL.

Bias-variance trade-off

Let’s decompose the expected test error into parts. Assume a regression problem where \(Y = f(X) + \varepsilon\) with \(\textrm{E}(\varepsilon)=0\) and \(\textrm{Var}(\varepsilon)=\sigma^2\). We denote by \(\widehat{f}\) the regression function learned on the training data, and we apply it to predict a new test observation whose predictors are \(X=x_0\). We have \[ \begin{aligned} \textrm{Test-err}(x_0) &= \textrm{E}( (Y-\widehat{f}(x_0))^2\, |\, X=x_0) \\ &= \textrm{E}((Y-f(x_0)+f(x_0)- \widehat{f}(x_0))^2 \,|\, X=x_0) \\ &= \textrm{E}((Y-f(x_0))^2 \,|\, X=x_0) + \textrm{E}((f(x_0)- \widehat{f}(x_0))^2) + 2\cdot \textrm{E}((Y-f(x_0))(f(x_0)- \widehat{f}(x_0))) \\ &= \sigma^2 + \textrm{E}((f(x_0)- \textrm{E}(\widehat{f}(x_0)) + \textrm{E}(\widehat{f}(x_0)) - \widehat{f}(x_0))^2) + 2\cdot \textrm{E}(\varepsilon)\, \textrm{E}(f(x_0)- \widehat{f}(x_0)) \\ &= \sigma^2 + \textrm{E}((f(x_0)- \textrm{E}(\widehat{f}(x_0)))^2) + \textrm{E}((\textrm{E}(\widehat{f}(x_0)) - \widehat{f}(x_0))^2) + 2\cdot \textrm{E}((f(x_0)- \textrm{E}(\widehat{f}(x_0)))(\textrm{E}(\widehat{f}(x_0)) - \widehat{f}(x_0))) + 2\cdot 0 \\ &=\sigma^2 + \textrm{Bias}^2(x_0) + \textrm{Var}(\widehat{f}(x_0))+2\cdot \textrm{Bias}(x_0)\cdot 0\\ &=\sigma^2 + \textrm{Bias}^2(x_0) + \textrm{Var}(\widehat{f}(x_0)), \end{aligned} \] where \(\textrm{Bias}(x_0) = \textrm{E}(\widehat{f}(x_0)) - f(x_0)\). (The covariance term with \(\varepsilon\) disappears because the error \(\varepsilon\) in the test data has expectation 0 and is independent of both \(f(x_0)\) and \(\widehat{f}(x_0)\), which is learned from the training data.) We see that the test MSE is bounded from below by the irreducible error \(\sigma^2\), which the regression model does not account for and therefore cannot decrease. We also see that to make the test error as small as possible we should make both the squared bias and the variance small. However, these two terms trade off against each other: decreasing one typically increases the other. In short, bias increases when the model is not flexible enough to capture the true \(f\), and variance increases when the model is so flexible that small changes in the data cause large variation in the estimated \(\widehat{f}\).
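
This decomposition can be explored with a small simulation (a sketch under hypothetical choices: a quadratic true \(f\), a test point \(x_0 = 1\), training sets of 30 points, and two fits of different flexibility). Repeating the training process on new training sets, the average prediction at \(x_0\) minus \(f(x_0)\) estimates the bias, and the spread of the predictions estimates the variance:

set.seed(1)
f.true = function(x){x - 0.5*x^2} # hypothetical true regression function, chosen only for illustration
x0 = 1 # test point at which bias and variance are evaluated
sigma = 0.5 # sd of the irreducible error
K = 2000 # number of simulated training sets
pred = matrix(NA, K, 2) # predictions at x0 from a linear fit (col 1) and a degree-8 polynomial fit (col 2)
for(k in 1:K){
  x.sim = runif(30, -2, 2) # a small training set
  y.sim = f.true(x.sim) + rnorm(30, 0, sigma)
  pred[k, 1] = predict(lm(y.sim ~ x.sim), data.frame(x.sim = x0))
  pred[k, 2] = predict(lm(y.sim ~ poly(x.sim, 8)), data.frame(x.sim = x0))
}
bias2 = (colMeans(pred) - f.true(x0))^2 # squared bias at x0: larger for the inflexible linear fit
vars = apply(pred, 2, var) # variance at x0: typically larger for the flexible degree-8 fit
rbind(bias2 = bias2, variance = vars, exp.test.err = sigma^2 + bias2 + vars)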

The problem of overfitting arises when we optimize the model fit (e.g., by maximizing the likelihood function) but do not penalize for the number of parameters (or, more generally, the flexibility of the model). By penalizing the flexibility of the model we introduce some bias, i.e., the model is not completely free to go where the (training) data point to; instead we prefer sparser models that hopefully generalize better to test data. This bias means that we will not be able to fit perfectly in settings where the best models are not sparse. However, by inducing this bias, we simultaneously decrease the variance of the model fit, because the fit is no longer equally free to follow all the weak patterns in the training data; it is therefore more robust to small changes in the data and less prone to overfitting. In high-dimensional problems, it is crucial to look for an appropriate trade-off between bias and variance, keeping the model flexible enough to truly learn from the data (which requires lower bias) while at the same time minimizing overfitting (which requires lower variance).

Questions. (From ISL)

  1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
     (a) The sample size \(n\) is extremely large, and the number of predictors \(p\) is small.
     (b) The number of predictors \(p\) is extremely large, and the number of observations \(n\) is small.
     (c) The relationship between the predictors and the response is highly non-linear.
     (d) The variance of the error terms, i.e. \(\sigma^2 = \text{Var}(\varepsilon)\), is extremely high.
  2. Provide a sketch of typical (squared) bias, variance, training error, test error, and irreducible error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Explain why each of the five curves has the shape displayed.

How to choose suitable models: Information criteria

As we just saw, the model containing all of the predictors will always have the smallest residual error and the largest \(R^2\) in the training data, since these quantities are related to the training error. Instead, we wish to choose a model with a low test error, and the training error or \(R^2\) are not suitable for selecting the best model among a collection of models with different numbers of predictors. Let’s have a look at two commonly used information criteria that attempt to estimate how the model performs outside the training data, while still using only the training data as input.

Akaike Information Criterion (AIC)

AIC is a general quantity for comparing different models defined on the same data set. It is named after Hirotugu Akaike, who published it in the early 1970s.

Suppose we consider models \(M_0,\ldots,M_J\) where model \(M_j\) has \(k_j\) parameters. Let \(L_j(\widehat{\theta}_j)\) be the maximum value of the likelihood function for the model \(M_j\). Then the AIC value of model \(M_j\) is \[\textrm{AIC}_j = 2 k_j - 2 \log(L_j(\widehat{\theta}_j))\] The smaller the AIC, the better the model (as measured by AIC).
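
To make the definition concrete, here is a small check (a sketch using R’s built-in cars data; note that for a Gaussian linear model R’s parameter count includes the residual variance, so here \(k = 3\)):

fit = lm(dist ~ speed, data = cars) # a simple example model
k = attr(logLik(fit), "df") # number of estimated parameters (here 3: intercept, slope, residual sd)
2*k - 2*as.numeric(logLik(fit)) # AIC from the definition above
AIC(fit) # same value from the built-in function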

We know that by increasing the model complexity (e.g., the number of parameters) we can make the model fit the data better and hence obtain higher values of the likelihood. AIC penalizes the maximized log-likelihood by the number of parameters of the model. Thus, if two models have quite similar maximized log-likelihoods, the one with fewer parameters is preferred. Intuitively, this favors sparser models and therefore helps to decrease overfitting.

AIC is derived by an information theoretic argument. Suppose that the data are generated by a probability density \(f\). We want to determine which one of two candidate densities \(g_1\) and \(g_2\) is a better approximation to \(f\). The information lost when using \(g_1\) to represent \(f\) is quantified by the Kullback-Leibler divergence, \[\textrm{D}_{\textrm{KL}}(f\,||\, g_1) = \int f(x)\log\left(\frac{f(x)}{g_1(x)}\right) \mathrm{d}x.\] Similarly, \(\textrm{D}_{\textrm{KL}}(f\,||\, g_2)\) measures information lost when approximating \(f\) by \(g_2.\)

In practice, we do not know \(f\) and hence cannot compute the divergences above. Akaike (1974) showed that differences in AIC can be used to estimate how much more (or less) information is lost by \(g_1\) than by \(g_2\). This estimate, though, is only valid asymptotically and relies on fairly crude approximations. In particular, if the number of data points is small, then some correction is often necessary (see, e.g., AICc).

Connection to the likelihood ratio test in nested models. If model \(M_1\) is a submodel of model \(M_2\) (e.g., both are regression models but \(M_2\) uses \(k_2-k_1\) more predictors on top of what is used in \(M_1\)), then, if the simpler model \(M_1\) holds, Wilks’ theorem says that the likelihood ratio test statistic \[\textrm{LRT}_{12} = -2 \log\left(\frac{L_1(\widehat{\theta}_1)}{L_2(\widehat{\theta}_2)}\right) \sim \chi^2_{k_2-k_1},\] which can be used to derive P-values for testing the additional parameters. (This result is asymptotic and requires some regularity conditions, for example, that the additional parameters of model \(M_2\) are not at boundary values of their range under \(M_1\).) We see that the AIC difference can be written in terms of the LRT as \[\Delta_{12} = \textrm{AIC}_1 - \textrm{AIC}_2 = 2(k_1 - k_2) + \textrm{LRT}_{12}.\] Given that \(\textrm{E}\left(\chi^2_{k_2-k_1}\right) = k_2-k_1\), if model \(M_1\) holds, then the expectation of \(\Delta_{12}\) is \[\textrm{E}(\Delta_{12} \,|\, M_1) = 2(k_1 - k_2) + (k_2-k_1) = k_1-k_2 < 0,\] showing that AIC is expected to correctly favor the model \(M_1\).
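
This relation between the AIC difference and the LRT is easy to verify numerically for nested linear models (a sketch using the built-in mtcars data; the chosen predictors are arbitrary and only for illustration):

M1 = lm(mpg ~ wt, data = mtcars) # simpler model
M2 = lm(mpg ~ wt + hp + qsec, data = mtcars) # adds 2 more predictors
lrt = as.numeric(-2*(logLik(M1) - logLik(M2))) # likelihood ratio test statistic LRT_12
k.diff = attr(logLik(M2), "df") - attr(logLik(M1), "df") # k_2 - k_1
pchisq(lrt, df = k.diff, lower.tail = FALSE) # asymptotic P-value for the additional parameters
AIC(M1) - AIC(M2) # AIC difference Delta_12 ...
-2*k.diff + lrt # ... equals 2*(k_1 - k_2) + LRT_12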

Note that AIC can be compared for any pair of models on the same data, whereas the \(\chi^2_{k_2-k_1}\) distribution for LRT holds only for nested models.

Bayesian Information Criterion (BIC)

BIC was introduced by Gideon E. Schwarz in 1978 and is also called the Schwarz criterion. BIC has a similar form to AIC but imposes a stronger penalty on the number of parameters. With the same notation as above for AIC, we define the BIC value of model \(M_j\) as \[\textrm{BIC}_j = \log(n) k_j - 2 \log(L_j(\widehat{\theta}_j)),\] where \(n\) is the number of (independent) data points, often the sample size.
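
Again, we can compute BIC directly from its definition and compare with R’s built-in function (a sketch reusing the cars example from above; as before, R’s parameter count includes the residual variance):

fit = lm(dist ~ speed, data = cars) # same example model as above
k = attr(logLik(fit), "df") # number of estimated parameters
log(nobs(fit))*k - 2*as.numeric(logLik(fit)) # BIC from the definition above
BIC(fit) # same value from the built-in function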

The BIC is derived using a Bayesian argument assuming that the prior distribution of the parameters is flat, i.e., a constant, in the region of interest. Then we have the following approximation for the marginal likelihood, obtained by using a Laplace approximation to compute the integral: \[\textrm{Pr}(y\,|\,M_j) = \int \textrm{Pr}(\theta_j\,|\,M_j) \textrm{Pr}(y\,|\,\theta_j,M_j)\, \mathrm{d}\theta_j \approx C\, n^{-k_j/2}\, L_j(\widehat{\theta}_j),\] where \(C\) is a constant not depending on \(n\). Thus, we have the approximation \[2\, \log(\textrm{Pr}(y\,|\,M_j)) \approx 2\, \log \left( L_j(\widehat{\theta}_j) \right) - k_j\, \log(n) + C'= - \textrm{BIC}_j + C'.\] Hence BIC can be seen as an approximation to minus twice the log-marginal likelihood of the model (up to a constant), and a smaller BIC corresponds to a larger marginal likelihood, and hence to a better explanation of the data, after the model complexity has been taken into account.

By ignoring the constants, we obtain an approximation to the Bayes factor between models 1 and 2: \[\log (\textrm{BF}_{12}) = \log\left(\frac{\textrm{Pr}(y\,|\,M_1)}{\textrm{Pr}(y\,|\,M_2)}\right) \approx \frac{1}{2}(\textrm{BIC}_2 - \textrm{BIC}_1).\] This gives an interpretation for comparing models via the difference in their BIC values.
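
In R, the approximate Bayes factor implied by this formula is obtained by exponentiating half of the BIC difference (a sketch reusing the nested mtcars models from above; recall that the approximation can be crude for small \(n\)):

M1 = lm(mpg ~ wt, data = mtcars) # simpler model (as above)
M2 = lm(mpg ~ wt + hp + qsec, data = mtcars) # more complex model
exp((BIC(M2) - BIC(M1))/2) # approximate BF_12 = Pr(y|M1)/Pr(y|M2); values < 1 favor M2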

In particular, note how the Bayesian marginal likelihood automatically accounts for the number of parameters of the model and can get higher or lower as we move to more complex models (see “The role of marginal likelihood in Bayesian inference” from lecture HDS4). This is in contrast to the maximized likelihood value, which can only get higher as we add parameters to the model. The difference between marginal likelihood and maximum likelihood is that the former is computed by averaging the likelihood function with respect to the prior distribution of the parameters, whereas the latter only maximizes the likelihood and ignores the prior distribution. In BIC, the approximation that takes us from the maximized log-likelihood to an approximate (twice) log-marginal likelihood is the subtraction of the penalty term \(k \log(n)\), where \(k\) is the number of parameters.

For a detailed derivation of BIC see, for example, https://www.researchgate.net/publication/265739543_On_the_Derivation_of_the_Bayesian_Information_Criterion.

Example 5.3. We can determine the P-value thresholds at which AIC and BIC start to favor the more complex model \(H_1\) (with maximized likelihood \(L_1\)) over the simpler submodel \(H_0\) (with maximized likelihood \(L_0\)), when the difference in the number of parameters is \(k\) and the models are nested (so that the LRT is applicable).

\[\textrm{AIC}_0 > \textrm{AIC}_1 \iff -2\log(L_0) > 2k -2\log(L_1) \iff \textrm{LRT}_{01} > 2k \iff \textrm{P-value} < 1-F_{\chi^2_k}(2k)\] and for \(k=1\) this threshold is 0.157.
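
The value 0.157 is easy to check:

pchisq(2*1, df = 1, lower.tail = FALSE) # P-value threshold at which AIC starts to favor H1 when k = 1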

\[\textrm{BIC}_0 > \textrm{BIC}_1 \iff -2\log(L_0) > k\log(n) -2\log(L_1) \iff \textrm{LRT}_{01} > k\log(n) \iff \textrm{P-value} < 1-F_{\chi^2_k}(k\log(n))\] This threshold depends on \(n\) as follows:

k = 1
ns = c(10, 25, 50, 75, seq(100,10000,100))
plot(ns, -log10(pchisq(k*log(ns), df = 1, lower = F)), 
     t = "l", lwd = 2, ylab = "-log10 P-value", xlab = "n", ylim = c(0,3), main = "k=1") #BIC threshold
abline(h = -log10(0.157), col = "red", lwd = 2) #AIC threshold
legend("bottomright", leg = c("AIC", "BIC"), col = c("red", "black"), lwd = 2)

This shows that AIC is much more liberal than BIC in letting additional variables into the model, and the difference increases as the amount of information in the data (\(n\)) grows.

Example 5.4. Let’s see how AIC and BIC behave when we fit polynomial models of different degrees to the Boston data set, trying to predict medv (the median value of a property) with lstat (the proportion of lower-status people in the area).

library(MASS)
medv = Boston$medv
lstat = Boston$lstat
plot(lstat, medv, pch = "+", col = "gray")
x.points = seq(min(lstat), max(lstat), length = 1000) #grid on which to evaluate
cols = topo.colors(8)
Bos.res = data.frame(matrix(NA, ncol = 2, nrow = 8))
names(Bos.res) = c("AIC","BIC")
for(ii in 1:8){
  Bos.fit = lm(medv ~ poly(lstat, ii, raw = T), data = Boston) #ii deg polynomial
  y.at.x = predict(Bos.fit, data.frame(lstat = x.points)) 
  lines(x.points, y.at.x, col = cols[ii], lty = ii, lwd = 2)
  Bos.res[ii, ] = c(AIC(Bos.fit), BIC(Bos.fit))
}
legend("topright", leg = 1:8, col = cols, lwd = 2, lty = 1:8)