---
title: "HDS 0.1 Linear model"
author: "Matti Pirinen, University of Helsinki"
date: "28.12.2023"
output:
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Suppose that we want to model how the value of a numerical outcome variable $y$ could be predicted from $p$ predictors $\pmb{x}=(x_1,\ldots,x_p).$ For example, $y$ could be the price of an apartment, $x_1$ could be the district within Helsinki and $x_2$ could be the number of bedrooms. Thus, we want to find a function $f$ that gives a good approximation for $y$ when we input the predictor values $\pmb{x}$ to the function $f$. We denote our prediction by $\widehat{y} = f(\pmb{x})$. We call such a function $f$ a **regression function** and we talk about "regressing $y$ on $\pmb{x}$" when we estimate the regression function.

A common approach in statistical modeling is to use the conditional expectation as the regression function: $f(\pmb{x})=\textrm{E}(Y\,|\,\pmb{X}=\pmb{x})$. (Note: Capital letters denote random variables and lower case letters are observed values of the random variables.) A theoretical justification is that this function has the smallest mean squared error among all possible regression functions. That is, the conditional expectation is the function $f$ that, for any given value $\pmb{x}$, solves the minimization problem
$$\min_{f(\pmb{x})} \textrm{E}\left((Y-f(\pmb{x}))^2\,|\,\pmb{X}=\pmb{x}\right),$$
where the expectation is over the random variable $Y$ for the fixed value $\pmb{X}=\pmb{x}$.

The simplest of the regression models is the linear regression model, which is our basic building block on this course.

1. The linear model gives a baseline against which to compare more complex models: if your complex model does not do better than the linear model, most likely you want to use the linear model because of its simplicity!

2. The linear model has very good interpretability.

3. Weighted and correlated linear regression, generalized linear models (e.g. logistic regression) and high-dimensional regression (e.g. LASSO, ridge regression) are built on top of the linear model.

4. The linear model is a great vehicle to demonstrate statistical concepts (e.g. uncertainty, model fit, predictions, inference).

To refresh the linear model,

* read the excellent chapter "**3. Linear Regression**" from [ISLR](https://www.statlearning.com/). Below is a brief summary of the linear model including key formulas and R commands.

* If you have not applied linear regression much *in practice* using **R**, go through the additional notes for [Practical linear regression](HDS0_practical_linearmodel.html).

### Definition

Suppose an outcome $y$ and $p$ other variables (or predictors) $\pmb{x} = (x_1,\ldots,x_p)$ have been measured on $n$ units/individuals: the data for unit $i$ is $(y_i,x_{i1},\ldots,x_{ip})$. We want to study the relationship between $y$ and the $x$'s. The standard linear model assumes that there is a relationship
$$ y_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} + \varepsilon_i, $$
where $\varepsilon_i$ is an **error term** that captures the difference between the outcome and the **linear predictor**: $\varepsilon_i = y_i - \left(\beta_0 + \sum_j \beta_{j} x_{ij}\right)$.

We can turn this linear model formulation into a useful statistical model by assigning some distributional assumptions to the error terms $\varepsilon_i$. We always assume that $\textrm{E}(\varepsilon_i)=0$, i.e., there is no systematic bias in the errors towards either negative or positive values.
One typical assumption is **homoscedasticity**, i.e., that the error terms have a constant variance $\textrm{Var}(\varepsilon_i) = \sigma^2$. Another is that the error terms for observations $i$ and $h$ are uncorrelated with each other, $\textrm{E}(\varepsilon_i \varepsilon_h) = 0$, and that the error terms are independent of the predictors $\pmb{x}$. Often the model is further restricted to the case where the error terms are assumed to follow a **Normal distribution**, also called the **Gaussian distribution**. Note, however, that there are also widely-used linear models that have correlated errors, possibly with heteroscedasticity (varying error variance), and/or linear models where the errors are not Gaussian (i.e. are not distributed according to a Normal distribution).

With the assumption $\textrm{E}(\varepsilon_i)=0$, we see that the conditional expectation, which is our natural candidate for the regression function, takes the form of the linear predictor:
$$ \textrm{E}(Y\,|\,\pmb{X} = \pmb{x}) = \textrm{E}\left(\beta_0 + \sum_{j=1}^p \beta_j x_{j} + \varepsilon\,\middle|\,\pmb{X} = \pmb{x} \right) = \beta_0 + \sum_{j=1}^p \beta_j x_{j}, $$
and hence our goal is to estimate the unknown **parameters** $\pmb{\beta} = (\beta_0,\ldots,\beta_p)^T$ of this function.

**Terminology.** It is conceptually important to distinguish the model parameters $\pmb{\beta} = (\beta_0,\ldots,\beta_p)^T$ from their estimates $\pmb{\widehat{\beta}} = \left(\widehat{\beta_0},\ldots,\widehat{\beta_p}\right)^T$, as well as to separate the error terms $\varepsilon_i$ from their estimates, which are called **residuals**: $\widehat{\varepsilon_i} = y_i - \left(\widehat{\beta_0} + \sum_{j=1}^p \widehat{\beta_j} x_{ij}\right).$ The parameters and error terms are unknown quantities that we will never know exactly, but we can estimate them, with varying levels of precision, using parameter estimates and residuals.

When we have parameter estimates available, we can make a (linear model) **prediction** of an unobserved outcome value $y_u$ of unit $u$, assuming that we know the predictors $(x_{u1},\ldots,x_{up})$ of unit $u$, as
$$\widehat{y_u} = \widehat{\beta_0} + \sum_{j=1}^p \widehat{\beta_j} x_{uj}.$$
With this notation, the residual can be written as $\widehat{\varepsilon_i} = y_i - \widehat{y_i}$.

#### Example 0.1

Let's generate some data from a linear model with $p=1$. (A linear model with only one predictor is also called a **simple linear regression** model.) We want our single predictor $x_1$ to explain a proportion $\phi$ of the total variance of the outcome $y$ (whereas the remaining proportion $1-\phi$ is left for the error terms to account for). In the data generation, we can first choose any variances for the predictor $x_1$ and the error $\varepsilon$ in the population, and then determine $\beta_1$ based on those variances and the target value of $\phi$.
For simplicity, let's choose to set
$$\textrm{Var}(x_1)=\textrm{Var}(\varepsilon)=1.$$
This leads to the following formulas
\begin{eqnarray*}
\textrm{Var}(x_1 \beta_1) &=& \beta_1^2\, \textrm{Var}(x_1) = \beta_1^2 \cdot 1 = \beta_1^2\\
\textrm{Var}(y) &=& \textrm{Var}(x_1 \beta_1 + \varepsilon) = \textrm{Var}(x_1 \beta_1) + \textrm{Var}(\varepsilon) + 2\,\textrm{Cov}(x_1\beta_1, \varepsilon) = \beta_1^2 + 1 + 0 = \beta_1^2 + 1\\
\phi &=& \textrm{Var}(x_1 \beta_1) / \textrm{Var}(y) = \beta_1^2/(\beta_1^2 +1)\\
\beta_1 &=& \pm \sqrt{\phi/(1-\phi)}
\end{eqnarray*}
where the first uses $\textrm{Var}(aX)=a^2\textrm{Var}(X)$ for a constant $a$ and random variable $X$, the second uses the fact that $x_1$ and $\varepsilon$ are independent and hence their covariance term is zero, and the fourth solves $\beta_1$ from the second order polynomial given by the third formula.

```{r c1, fig.width = 5}
set.seed(3102017)
n = 200 # units (or samples) to be simulated
phi = 0.2 # variance explained by x_1, should be 0 < phi < 1.
b.0 = 0 # set intercept to 0, could equally well be any other value.
b.1 = sqrt(phi/(1-phi)) # could equally well have minus sign.
x = rnorm(n, 0, 1) # use Normal distribution, could be any other as long as var = 1
eps = rnorm(n, 0, 1) # should be Normal in order that the standard tests are valid (or should it? see Exercises.)
y = b.0 + x*b.1 + eps # use linear model formula to generate outcome y.
# check variance explained by x:
var(x*b.1) / var(y) # should be close to phi.
plot(x, y)
lm.fit = lm(y ~ x) # Fit a linear model by the least squares method.
abline(lm.fit, col = "red") # Add the estimated regression line in red.
abline(a = b.0, b = b.1, col = "blue", lty = 2) # Add the true population regression line in blue.
```

Question: Where can you find the error terms and where the residuals in the figure above?

**Exercise.** Above we derived $\beta_1$ to yield a given variance explained. Sometimes we might want to derive $\beta_1$ to yield a particular correlation between $x_1$ and $y$. Show that by choosing $\beta_1 = \textrm{sign}(r)\sqrt{r^2/(1-r^2)},$ we have that $\textrm{cor}(x_1,y) = r$, when $\textrm{Var}(x_1)=\textrm{Var}(\varepsilon)=1$.

Let's see the basic output of the linear model in R using `summary()`.

```{r}
summary(lm.fit) # Show output from lm().
```

#### Estimates

Let's remind ourselves what is in this output and how it is computed. The linear model is typically fit by the least squares criterion. That is, we look for the parameter vector estimate $\pmb{\widehat{\beta}} = (\widehat{\beta}_0,\ldots,\widehat{\beta}_p)^T$ that minimizes the **residual sum of squares** (RSS):
$$ RSS = (\pmb{y}-\pmb{X} \pmb{\widehat{\beta}})^T (\pmb{y}-\pmb{X} \pmb{\widehat{\beta}}) = \sum_i \left(y_i - \widehat{\beta}_0 - \sum_j x_{ij} \widehat{\beta}_j\right)^2 $$
where $\pmb{X}$ is the $n \times (p+1)$ matrix whose first column contains ones to represent the constant **intercept** (or baseline) value $\beta_0$. Using the rules for derivatives of matrix expressions,
$$\frac{\partial RSS}{\partial \pmb{\widehat{\beta}}} = -2 \pmb{X}^T\pmb{y} + 2(\pmb{X}^T\pmb{X})\pmb{\widehat{\beta}},$$
which is zero at the least squares solution
$$ \pmb{\widehat{\beta}} = \left(\pmb{X}^T\pmb{X}\right)^{-1}\pmb{X}^T \pmb{y}, $$
assuming that $\pmb{X}^T\pmb{X}$ is of full rank.
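As a quick sanity check (a sketch added here, using the simulated data of Example 0.1; the variable names `X` and `beta.hat` are just illustrative), we can recompute $\pmb{\widehat{\beta}}$ from this closed-form formula and compare it with the coefficients returned by `lm()`:

```{r}
# Recompute the least squares estimate directly from the matrix formula.
X = cbind(1, x)                          # n x (p+1) design matrix: intercept column + predictor x
beta.hat = solve(t(X) %*% X, t(X) %*% y) # solves (X^T X) beta = X^T y
cbind(formula = as.vector(beta.hat), lm = coef(lm.fit)) # the two columns should agree
```

(Note that `lm()` does not literally invert $\pmb{X}^T\pmb{X}$ internally; it uses a numerically more stable decomposition, but the resulting estimates are the same.)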
By treating $\pmb{X}$ as a fixed matrix and assuming that the error term $\pmb{\varepsilon}$ has mean 0, we have that the least squares estimator (LSE) is unbiased:
$$\textrm{E}\left(\widehat{\pmb{\beta}} \,\big\rvert\, \pmb{\beta}\right) = \textrm{E}\left(\left( \pmb{X}^T \pmb{X} \right)^{-1}\pmb{X}^T \left(\pmb{X}\pmb{\beta}+\pmb{\varepsilon} \right)\,\big\rvert\, \pmb{\beta}\right) = \pmb{\beta} + \left( \pmb{X}^T \pmb{X} \right)^{-1}\pmb{X}^T \textrm{E}(\pmb{\varepsilon}) = \pmb{\beta}. $$

When we consider a linear model in which the error terms are uncorrelated and have the same variance $\sigma^2$, i.e., the variance matrix of $\pmb{\varepsilon}$ is $\textrm{Var}(\pmb{\varepsilon}\,\rvert\,\sigma^2)=\sigma^2 \pmb{I}$, then the LSE has sampling variance
\begin{eqnarray*}
\textrm{Var}\left(\pmb{\widehat{\beta}} \,\big\rvert\, \pmb{\beta}\right) &=& \textrm{Var} \left(\left( \pmb{X}^T \pmb{X} \right)^{-1}\pmb{X}^T \left(\pmb{X}\pmb{\beta}+\pmb{\varepsilon} \right) \,\big\rvert\, \pmb{\beta} \right) = \textrm{Var} \left( \pmb{\beta} + \left(\pmb{X}^T\pmb{X}\right)^{-1}\pmb{X}^T\pmb{\varepsilon}\,\big\rvert\, \pmb{\beta} \right) \\
&=& \textrm{Var} \left(\left(\pmb{X}^T\pmb{X}\right)^{-1}\pmb{X}^T\pmb{\varepsilon} \right)= \left(\pmb{X}^T\pmb{X}\right)^{-1} \pmb{X}^T \left(\sigma^2 \pmb{I}\right) \pmb{X} \left(\pmb{X}^T \pmb{X}\right)^{-1}\\
&=&\sigma^2 \left(\pmb{X}^T\pmb{X}\right)^{-1},
\end{eqnarray*}
where the second row uses the fact that $\textrm{Var}(\pmb{A}\pmb{\varepsilon})=\pmb{A}\, \textrm{Var}(\pmb{\varepsilon})\, \pmb{A}^T$ for a constant matrix $\pmb{A}$ and random vector $\pmb{\varepsilon}$.

Often $\sigma^2$ is also estimated, using $\widehat{\sigma^2}=\textrm{RSS}/(n-p-1)$. The residual standard error, $\sqrt{\widehat{\sigma^2}}$, describes the average distance of an observation from the regression line. (It is not exactly the expected distance, but rather the square root of the expected squared distance.)

The `summary(lm.fit)` command produced

* the parameter estimates (Coefficients),
* their standard errors (SE) (estimates of the square root of the sampling variance of the parameter estimates),
* the t-statistic (estimate/SE) and
* the P-value under the null hypothesis that the parameter is 0 and the errors are uncorrelated and have a Normal distribution $N(0,\sigma^2)$.

Under the above assumptions, the sampling distribution of the t-statistic is a $t$-distribution, and hence q% confidence intervals are determined as $\widehat{\beta}\pm a\times \textrm{SE}$, where $a$ is the $(100+q)/2$% quantile of the $t$-distribution with $n-p-1$ degrees of freedom (for example, the 97.5% quantile for a 95% interval). When $\sigma^2$ is known, the $t$-distribution is replaced by a Normal distribution, and the same is approximately true when $n$ becomes large, even if the estimate $\widehat{\sigma^2}$ is used in computing the SE. In these cases, we often talk about the z-statistic instead of the t-statistic.

The last paragraph of the output describes the full model fit. $R^2$ is the proportion of variance explained by the linear model, i.e., $R^2 = 1-\frac{RSS/(n-1)}{\widehat{\textrm{Var}}(y)}$. The adjusted version penalizes for additional predictors and is defined as $R_{adj}^2 = 1-\frac{RSS/(n-p-1)}{\widehat{\textrm{Var}}(y)}$. Note that if there is only the intercept parameter $\beta_0$ in the model ($p=0$), then $R^2=R_{adj}^2=0$, and if the model explains the data perfectly ($RSS=0$), then $R^2=R_{adj}^2=1$. In all other cases, $R^2$ values are between 0 and 1 and larger values mean that more variance is explained by the model.
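To make the connection to the formulas concrete, here is a small sketch (not part of the standard `lm()` output) that recomputes the coefficient table and the $R^2$ values for the `lm.fit` object of Example 0.1 directly from the matrix formulas above; the intermediate variable names are arbitrary.

```{r}
# Recompute SEs, t-statistics, P-values and R^2 for Example 0.1 from the formulas.
X = model.matrix(lm.fit)                        # n x (p+1) design matrix (intercept + x)
n = nrow(X); p = ncol(X) - 1
rss = sum(residuals(lm.fit)^2)                  # residual sum of squares
sigma2.hat = rss / (n - p - 1)                  # estimate of the error variance
se = sqrt(diag(sigma2.hat * solve(t(X) %*% X))) # SEs = sqrt of diagonal of sigma^2 (X^T X)^{-1}
t.stat = coef(lm.fit) / se                      # t-statistics
p.val = 2 * pt(-abs(t.stat), df = n - p - 1)    # two-sided P-values
cbind(est = coef(lm.fit), se, t.stat, p.val)    # compare to summary(lm.fit)$coefficients
c(R2 = 1 - (rss/(n-1)) / var(y),                # R^2
  R2.adj = 1 - (rss/(n-p-1)) / var(y))          # adjusted R^2
```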
($R_{adj}^2$ can take negative values when the model explains the data very poorly compared to the number of parameters in the model.)

$R^2$ should not be the only value used to judge how suitable the model is for the data. One should always plot the data and the model fit in different ways to assess this question. For a simple linear model ($p=1$), the scatter plot and regression line above are a good start. More generally, the `plot` command applied to an `lm` object gives the following four diagnostic plots.

```{r c2, fig.height=7, fig.width=7}
par(mfrow = c(2,2)) # split the plot area into 2 rows and 2 columns, i.e., into 4 plots
plot(lm.fit) # make diagnostic plots
```

#### Validity of linear model

Let's list the assumptions behind the standard linear model and the properties of its least squares estimates (LSE).
$$y_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} + \varepsilon_i$$

1. Additivity and linearity. We assume that each predictor acts **additively** (there is a + between terms that involve different predictors) and that the effect of each predictor on the outcome is **linear** (the predictor is simply multiplied by a $\beta$ coefficient). (How could we extend the linear model beyond these assumptions?)

2. The error terms are independent of each other and of the predictors. If this does not hold, then the amount of information in the data does not correspond to the number of observations, and statistical inference based on the theoretical distributions will be invalid. Additionally, the LSE is then no longer an optimal unbiased point estimate; generalized least squares estimation, which takes the correlation between the errors into account, gives more precise estimates.

3. The errors have the same variance (homoscedasticity). If this does not hold, then weighted linear regression would give more precise estimates.

4. The errors are Gaussian (i.e. have a normal distribution). Under this assumption the LSE coincides with the maximum likelihood estimate and hence has many optimality properties. However, the LSE has several optimality properties even without this assumption. For example, the Gauss-Markov theorem says that the LSE $\widehat{\beta}$ has the smallest sampling variance among all linear and unbiased estimators of $\beta$ as long as the errors are homoscedastic and uncorrelated, no matter what their distribution is. Gaussian errors are in practice the least important of the assumptions listed here.

#### Example 0.2 (`Auto`)

Let's have a look at the `Auto` dataset that is included in the `ISLR` package.

```{r}
#install.packages("ISLR") # do this when you run the first time
library("ISLR") # this loads several datasets from the book ISLR
#?Auto # to check the help page
str(Auto)
# plot pairs except "name":
#pairs(subset(Auto, select = -name), cex = 0.5, pch = "+")
# We want to explain miles per gallon, mpg.
# Let's start with horsepower.
lm.fit = lm(mpg ~ horsepower, data = Auto)
summary(lm.fit)
plot(Auto$horsepower, Auto$mpg, cex = 0.6, pch = 3)
abline(lm.fit, col = "red", lwd = 2)
```

This model explains 60.5% of all variation in the data and is very clearly statistically significant (the P-values for horsepower and for the whole model are very low). But does it fit the data well, that is, does the linear model adequately describe the relationship between the two variables? Let's draw a residual plot by showing the 1st of the standard diagnostic plots.

```{r}
plot(lm.fit, 1) # show only the 1st diagnostic plot, i.e., the residuals against the fitted values
```

It seems that there is some pattern whereby the residuals do not agree with the assumption of the errors being independent of the linear predictor.
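As a rough numeric companion to the residual plot (an informal sketch of an idea, not a formal test), we can compare the mean residual across thirds of the fitted values; under the model assumptions each group mean should be close to zero, whereas a systematic high-low-high pattern points to curvature left in the residuals.

```{r}
# Split the observations into thirds by fitted value and compute the mean residual in each group.
grp = cut(fitted(lm.fit), quantile(fitted(lm.fit), c(0, 1/3, 2/3, 1)), include.lowest = TRUE)
tapply(resid(lm.fit), grp, mean)
```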
In such cases adding some polynomial terms may help. Let's add a square term and plot the model fit and the residual plot again.

```{r}
lm.fit.2 = lm(mpg ~ horsepower + I(horsepower^2), data = Auto) # note the I() notation
summary(lm.fit.2)
plot(Auto$horsepower, Auto$mpg, cex = 0.6, pch = 3)
abline(lm.fit, col = "red", lwd = 2) # first order model in red
x.val = seq(min(Auto$horsepower), max(Auto$horsepower), length = 100) # grid of 100 points on the x-axis
# y values from the lm.fit.2 model for the grid values in x.val:
y.val = predict.lm(lm.fit.2, newdata = data.frame(horsepower = x.val))
lines(x.val, y.val, col = "blue", lwd = 2) # 2nd order model in blue
legend("topright", legend = c("linear","quadratic"), col = c("red","blue"), lwd = 2)
plot(lm.fit.2, 1)
```

Now we explain 69% of the variation in `mpg` and the residual plot is much more even across the fitted values. There is maybe some increased variance for larger fitted values. The points labeled in the residual plot are possible outliers.

```{r}
pick = c("334","155") # note the quotes since these are row names here, not indexes
cbind(Auto[pick,], lm.fit.2$fitted.values[pick], lm.fit.2$residuals[pick])
```

Would a cubic polynomial give an even better fit?

```{r}
lm.fit.3 = lm(mpg ~ poly(horsepower, 3), data = Auto)
summary(lm.fit.3)
```

The answer is no. (The $R^2$ is similar to that of the quadratic fit.)

Let's think about the interpretation of the parameters. In the linear fit without the quadratic term, each increase of 1 `horsepower` changes `mpg` by the same amount:

```{r}
signif(lm.fit$coeff[2], 2) # for the linear model this is the change in mpg per 1 unit of horsepower
```

whereas for the quadratic fit the change varies with `horsepower` (here evaluated near 50 and 150):

```{r}
diff(predict(lm.fit.2, newdata = data.frame(horsepower = c(49.5, 50.5))))
diff(predict(lm.fit.2, newdata = data.frame(horsepower = c(149.5, 150.5))))
```

This ends our short summary of the linear regression model with **R**. If you have not applied linear regression much in practice using **R**, go through the additional lecture notes about [Practical linear regression](HDS0_practical_linearmodel.html).

### Centering the variables

In practice, we often mean-center the variables to simplify the linear model. Let's write down the motivation for that.

The least squares solution was derived by minimizing the residual sum of squares (RSS). The solution $\pmb{\widehat{\beta}}$, derived by setting the first derivative to 0, satisfies the *normal equations*
$$(\pmb{X}^T\pmb{X})\pmb{\widehat{\beta}} = \pmb{X}^T \pmb{y} \quad \Leftrightarrow \quad \pmb{X}^T(\pmb{y}-\pmb{X}\pmb{\widehat{\beta}})=\pmb{0} \quad \Leftrightarrow \quad \pmb{X}^T \pmb{\widehat{\varepsilon}}=\pmb{0}.$$
When the model includes the intercept term, the first column of $\pmb{X}$ is full of 1s, and the first element of the vector equation $\pmb{X}^T \pmb{\widehat{\varepsilon}}=\pmb{0}$ is $\pmb{1}^T \pmb{\widehat{\varepsilon}} = \sum_i \widehat{\varepsilon}_i=0$. This shows that **the sum of the residuals of the least squares solution is 0** (when the model includes the intercept term). By summing each term of the residual over the units we thus get 0, hence
$$\sum_{i=1}^n(y_i-\widehat{\beta}_0-x_{i1}\widehat{\beta}_1- \ldots -x_{ip}\widehat{\beta}_p) = n (\overline{y} - \widehat{\beta}_0 - \overline{x_1} \widehat{\beta}_1-\ldots - \overline{x_p} \widehat{\beta}_p)=0,$$
where the bar denotes a mean value over the units.
Hence we see that, at the least squares solution, the linear model equation is satisfied exactly for an "average" unit whose values of the outcome $y$ and of each predictor $x_j$ are the corresponding mean values over the samples:
$$\overline{y} = \widehat{\beta}_0 + \overline{x_1} \widehat{\beta}_1 + \ldots + \overline{x_p} \widehat{\beta}_p.$$
Suppose that we have mean-centered each predictor before fitting the model. Then $\overline{x_j}=0$ for all $j$ and $\widehat{\beta}_0 = \overline{y}$. If we have also mean-centered the outcome $y$, then $\widehat{\beta}_0 = 0.$ Thus, by mean-centering the predictors and the outcome, we can simplify the computation by dropping the intercept term from the model, since it equals zero in the least squares solution.

Consider such a mean-centered model and use $\alpha_j$ as its coefficients to distinguish them from the $\beta_j$ that correspond to the original model (without mean-centered variables). We know that $\alpha_0=0$ and
\begin{eqnarray*}
y_i-\overline{y} &=& (x_{i1}-\overline{x}_1)\alpha_1 + \ldots + (x_{ip}-\overline{x}_p)\alpha_p + \varepsilon_i\\
y_i &=& (\overline{y}-\overline{x}_1\alpha_1-\ldots -\overline{x}_p\alpha_p) + x_{i1}\alpha_1 + \ldots + x_{ip}\alpha_p + \varepsilon_i.
\end{eqnarray*}
We see that the latter form equals the non-centered model
$$ y_i = \beta_0 + x_{i1}\beta_1 + \ldots + x_{ip}\beta_p + \varepsilon_i , $$
if we set $\beta_0 = \overline{y}-\overline{x}_1\alpha_1-\ldots -\overline{x}_p\alpha_p$ and $\alpha_j = \beta_j$ for $j=1,\ldots ,p$. Since the least squares solution is unique (when the model is of full rank), it follows that $\widehat{\alpha}_j = \widehat{\beta}_j$ for $j=1,\ldots,p$, and hence **mean-centering does not change the values of the coefficients other than the intercept**.

We conclude that by mean-centering both the outcome and all predictors in a linear regression model, we can drop the intercept term from the model (since its value will be 0) while the other coefficients keep the same values as without mean-centering. In R, we can fit a linear model without the intercept term by using `0` or `-1` in the call: `lm(y ~ 0 + x)` or `lm(y ~ -1 + x)`.

### Geometric interpretation in $n$-dimensional space

We can also interpret the least squares method as projecting the outcome vector $\pmb{y}$ onto the subspace spanned by the intercept vector $\pmb{x}_0$ (a vector of ones) and the $p$ predictor vectors $\pmb{x}_1,\ldots,\pmb{x}_p$. Let's define the matrix $\pmb{H} = \pmb{X}\left(\pmb{X}^T \pmb{X}\right)^{-1}\pmb{X}^T$, with which we can represent the fitted values as
$$\pmb{\widehat{y}} = \pmb{X} \pmb{\widehat{\beta}} = \pmb{X}\left(\pmb{X}^T \pmb{X}\right)^{-1}\pmb{X}^T\pmb{y} = \pmb{H} \pmb{y}.$$
$\pmb{H}$ is also called the "hat matrix" because it puts a hat on $\pmb{y}$. In particular, $\pmb{H}$ is the matrix of an orthogonal projection because $\pmb{H}=\pmb{H}^2$ and $\pmb{H}=\pmb{H}^T$. Thus we conclude that the least squares method finds the fitted values by orthogonally projecting the observed values onto the subspace spanned by the intercept and the predictors.
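As a small numeric illustration (a sketch using the simulated `x` and `y` of Example 0.1, not part of the original notes), we can form $\pmb{H}$ explicitly and verify the projection properties and the fitted values:

```{r}
# Form the hat matrix for the simple linear model of Example 0.1 and check its properties.
X = cbind(1, x)                          # design matrix: intercept column + predictor x
H = X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix H = X (X^T X)^{-1} X^T
all.equal(H %*% H, H)                    # H is idempotent: H^2 = H
all.equal(H, t(H))                       # H is symmetric: H = H^T
all.equal(as.vector(H %*% y), as.vector(fitted(lm(y ~ x)))) # H y gives the fitted values
```

The checks use `all.equal()` rather than exact comparison because the quantities agree only up to floating-point precision.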