Suppose that we want to model how the value of a numerical outcome variable \(y\) could be predicted from \(p\) predictors \(\pmb{x}=(x_1,\ldots,x_p).\) For example, \(y\) could be the price of an apartement and \(x_1\) could be the district within Helsinki and \(x_2\) could be the number of bedrooms. Thus, we want to find such a function \(f\) that our prediction \(\widehat{y} = f(\pmb{x})\) is a good approximation to \(y\). We call such a function \(f\) a regression function and we talk about “regressing \(y\) on \(\pmb{x}\)” when we estimate the regression function. A common approach in statistical modeling is to aim for using the conditional expectation as the regression function: \(f(\pmb{x})=\textrm{E}(Y\,|\,\pmb{X}=\pmb{x})\). A theoretical justification is that this function has the smallest mean squared error among all possible regression functions. This means that the conditional expectation is the function \(f\) that solves the minimization problem \[\min_{f(\pmb{x})} \textrm{E}(Y-f(\pmb{x}))^2,\] for any given \(\pmb{x}\), where the expectation is over random variable \(Y\) for fixed \(\pmb{x}\).

The simplest of regression models is the linear regression model that is our basic building block on this course.

  1. Linear model gives a baseline to which compare more complex models: if your complex model does not do better than the linear model, most likely you want to use the linear model because of simplicity!

  2. Linear model has very good interpretability.

  3. Weighted and correlated linear regression, generalized linear models (e.g. logistic regresison) and high-dimensional regression (e.g. LASSO, ridge regression) are built on top of the linear model.

  4. Linear model is a great vehicle to demonstrate statistical concepts (e.g. uncertainty, model fit, predictions, inference).

To refresh the linear model,

Below is a brief summary of linear model including key formulas and R commands.


Suppose outcome \(y\) and \(p\) other variables (or predictors) \(\pmb{x} = (x_1,\ldots,x_p)\) have been measured on \(n\) individuals: data for individual \(i\) is \((y_i,x_{i1},\ldots,x_{ip})\). We want to study relationship between \(y\) and \(x\)’s. The standard linear model assumes that there is a relationship
\[ y_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} + \varepsilon_i, \] where \(\varepsilon_i\) is an error term that captures the difference between the outcome and the linear predictor: \(\varepsilon_i = y_i - \left(\beta_0 + \sum_j \beta_{ij} x_{ij}\right)\). We can turn this linear model formulation into a useful statistical model by assigning some distributional assumptions on the error terms \(\varepsilon_i\). We always assume that \(\textrm{E}(\varepsilon_i)=0\). One typical assumption is homoscedasticity i.e, that the error terms have a constant variance \(\textrm{Var}(\varepsilon_i) = \sigma^2\). Another is that the error terms for observations \(i\) and \(h\) are uncorrelated with each other, \(\textrm{E}(\varepsilon_i \varepsilon_h) = 0\), and that the error terms are independent of predictors \(\pmb{x}\). Often the model is further restricted to the case where the error terms are assumed to follow a Normal distribution, also called the Gaussian distribution. Note, however, that there are also widely-used linear models that have correlated errors, possibly with heteroscedasticity (varying error variance) and/or linear models where the errors are not Gaussian (i.e. are not distributed according to a Normal distribution).

With the assumption \(\textrm{E}(\varepsilon_i)=0\), we see that the conditional expectation, which is our natural candidate for the regression function, takes the form of the linear predictor: \[ \textrm{E}(Y\,|\,X = \pmb{x}) = \textrm{E}\left(\beta_0 + \sum_{j=1}^p \beta_j x_{ij} + \varepsilon_i\,\middle|\,X = \pmb{x} \right) = \beta_0 + \sum_{j=1}^p \beta_j x_{ij}, \] and hence our goal is to estimate the unknown parameters \(\pmb{\beta} = (\beta_0,\ldots,\beta_p)^T\) of this function.

Terminology. It is conceptually important to distinguish the model parameters \(\pmb{\beta} = (\beta_0,\ldots,\beta_p)^T\) from their estimates \(\pmb{\widehat{\beta}} = \left(\widehat{\beta_0},\ldots,\widehat{\beta_p}\right)^T\) as well as to separate the error terms \(\varepsilon_i\) from their estimates that are called residuals \(\widehat{\varepsilon_i} = y_i - \left(\widehat{\beta_0} + \sum_{j=1}^p \widehat{\beta_j} x_{ij}\right).\) The parameters and error terms are unknown quantities that we will not know exactly, but we can estimate them, with varying levels of precision, using parameter estimates and residuals.

When we have parameter estimates available, we can make a (linear model) prediction of outcome value \(y_u\) of an unobserved individual \(u\) as \[\widehat{y_u} = \widehat{\beta_0} + \sum_{j=1}^p \widehat{\beta_j} x_{uj}.\] With this notation, the residual can be written as \(\widehat{\varepsilon_i} = y_i - \widehat{y_i}\).

Example 0.1

Let’s generate some data from a linear model with \(p=1\). We want that our single predictor \(x_1\) explains proportion \(\phi\) of the total variance of outcome \(y\) (whereas the remaining proportion \(1-\phi\) is left for the error terms to account for). In data generation, we can first choose whichever variances for the predictor \(x_1\) and error \(\varepsilon\) in the population, and then determine \(\beta_1\) based on those variances and the target value of \(\phi\). For simplicity, let’s use \[\textrm{Var}(x_1)=\textrm{Var}(\varepsilon)=1.\] This leads to following formulas \[\begin{eqnarray*}\textrm{Var}(x_1 \beta_1) &=& \beta_1^2\, \textrm{Var}(x_1) = \beta_1^2\\ \textrm{Var}(y) &=& \textrm{Var}(x_1 \beta_1 + \varepsilon) = \textrm{Var}(x_1 \beta_1) + \textrm{Var}(\varepsilon) + 2\,\textrm{Cov}(x_1\beta_1, \varepsilon)= \beta_1^2 + 1 + 0 = \beta_1^2 + 1\\ \phi &=& \textrm{Var}(x_1 \beta_1) / \textrm{Var}(y) = \beta_1^2/(\beta_1^2 +1)\\ \beta_1 &=& \pm \sqrt{\phi/(1-\phi)} \end{eqnarray*}\] where the first uses \(\textrm{Var}(aX)=a^2\textrm{Var}(X)\) for a constant \(a\) and random variable \(X\), the second uses the fact that \(x_1\) and \(\varepsilon\) are independent and hence their covariance term is zero, and the fourth solves \(\beta_1\) from the second order polynomial of formula 3.

n = 200 # individuals (or samples) to be simulated
phi = 0.2 # variance explained by x_1, should be 0 < phi < 1. 
b.0 = 0 # set intercept to 0, could equally well be any other value.
b.1 = sqrt(phi/(1-phi)) # could equally well have minus sign.
x = rnorm(n, 0, 1) # use Normal distribution, could be any other as long as var=1
eps = rnorm(n, 0, 1) # should be Normal in order that the standard tests are valid (or should it? see Exercises.)
y = b.0 + x*b.1 + eps #use linear model formula to generate outcome y.
# check variance explained by x:
var(x*b.1) / var(y) # should be close to phi.
## [1] 0.1983043
plot(x, y) = lm( y ~ x ) # Fit a linear model by least squares method.
abline(, col = "red") # Add estimated regression line in red.
abline( a = b.0, b = b.1, col = "blue", lty = 2) # Add true population regression line in blue.

Where are the error terms and where are the residuals in the figure above?

Exercise. Above we derived \(\beta_1\) to yield a given variance explained. Sometimes we might want to derive \(\beta_1\) to yield a particular correlation between \(x_1\) and \(y\). Show that by choosing \(\beta_1 = \textrm{sign}(r)\sqrt{r^2/(1-r^2)},\) we have that \(\textrm{cor}(x_1,y) = r\) when \(\textrm{Var}(x_1)=\textrm{Var}(\varepsilon)=1\).

Let’s see the basic output of the linear model in R using summary().

summary( # Show output from lm().
## Call:
## lm(formula = y ~ x)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.05209 -0.75674  0.00068  0.72571  2.82434 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.11214    0.07310  -1.534    0.127    
## x            0.43150    0.07367   5.858 1.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.033 on 198 degrees of freedom
## Multiple R-squared:  0.1477, Adjusted R-squared:  0.1434 
## F-statistic: 34.31 on 1 and 198 DF,  p-value: 1.931e-08


Let’s remind ourselves what is in this output and how it is computed. Linear model is typically fit by the least squares criterion. That is, we look for the parameter vector estimate \(\pmb{\widehat{\beta}} = (\widehat{\beta}_0,\ldots,\widehat{\beta}_p)^T\) that minimizes the residual sum of squares (RSS): \[ RSS = (\pmb{y}-\pmb{X} \pmb{\widehat{\beta}})^T (\pmb{y}-\pmb{X} \pmb{\widehat{\beta}}) = \sum_i \left(y_i - \widehat{\beta}_0 - \sum_j x_{ij} \widehat{\beta}_j\right)^2 \] where \(\pmb{X}\) is \(n \times (p+1)\) matrix whose first column contains ones to represent the constant intercept (or baseline) value \(\beta_0\). Using rules for derivatives for matrix calculations \[\frac{\partial RSS}{\partial \pmb{\widehat{\beta}}} = -2 \pmb{X}^T\pmb{y}-2(\pmb{X}^T\pmb{X})\pmb{\widehat{\beta}},\] that has the zero at the Least Squares solution \[ \pmb{\widehat{\beta}} = \left(\pmb{X}^T\pmb{X}\right)^{-1}\pmb{X}^T \pmb{y}, \] assuming that \(\pmb{X}^T\pmb{X}\) is of full rank. By treating \(\pmb{X}\) as a fixed matrix and assuming that the error term \(\pmb{\varepsilon}\) has mean 0, we have that the least squares estimator (LSE) is unbiased: \[\textrm{E}\left(\widehat{\pmb{\beta}} \,\big\rvert\, \pmb{\beta}\right) = \textrm{E}\left(\left( \pmb{X}^T \pmb{X} \right)^{-1}\pmb{X}^T \left(\pmb{X}\pmb{\beta}+\pmb{\varepsilon} \right)\,\big\rvert\, \pmb{\beta}\right) = \pmb{\beta} + \left( \pmb{X}^T \pmb{X} \right)^{-1}\pmb{X}^T \textrm{E}(\pmb{\varepsilon}) = \pmb{\beta}, \]

When we consider a linear model in which error terms are uncorrelated and have the same variance \(\sigma^2\), i.e, the variance matrix of \(\pmb{\varepsilon}\) is \(\textrm{Var}(\pmb{\varepsilon}\,\rvert\,\sigma^2)=\sigma^2 \pmb{I}\), then LSE has a sampling variance \[\begin{eqnarray*} \textrm{Var}\left(\pmb{\widehat{\beta}} \,\big\rvert\, \pmb{\beta}\right) &=& \textrm{Var} \left(\left( \pmb{X}^T \pmb{X} \right)^{-1}\pmb{X}^T \left(\pmb{X}\pmb{\beta}+\pmb{\varepsilon} \right) \,\big\rvert\, \pmb{\beta} \right) = \textrm{Var} \left( \pmb{\beta} + \left(\pmb{X}^T\pmb{X}\right)^{-1}\pmb{X}^T\pmb{\varepsilon}\,\big\rvert\, \pmb{\beta} \right) \\ &=& \textrm{Var} \left(\left(\pmb{X}^T\pmb{X}\right)^{-1}\pmb{X}^T\pmb{\varepsilon} \right)= \left(\pmb{X}^T\pmb{X}\right)^{-1} \pmb{X}^T \left(\sigma^2 \pmb{I}\right) \pmb{X} \left(\pmb{X}^T \pmb{X}\right)^{-1}\\ &=&\sigma^2 \left(\pmb{X}^T\pmb{X}\right)^{-1}, \end{eqnarray*}\] where the second row uses the fact that \(\textrm{Var}(\pmb{A}\pmb{\varepsilon})=\pmb{A} \textrm{Var}(\pmb{\varepsilon}) \pmb{A}^T\) for a constant matrix \(\pmb{A}\) and random vector \(\pmb{\varepsilon}\).

Often \(\sigma^2\) is also estimated, using \(\widehat{\sigma^2}=\textrm{RSS}/(n-p-1)\). Residual standard error, \(\sqrt{\widehat{\sigma^2}}\), describes average distance of an observation from the regression line. (It does not model exactly the expected distance, but rather the square root of the expected squared distance.)

The summary( command produced

  • parameter estimates (or Coefficients),
  • their standard errors (SE) (estimates for square root of the sampling variance of the parameter estimates),
  • t-statistic (estimate/SE) and
  • P-value under the null hypothesis that the parameter is 0 and errors are uncorrelated and have a Normal distribution \(N(0,\sigma^2)\).

Under the above assumptions, the sampling distribution of t-statistic is \(t\)-distribution and hence q% confidence intervals are determined as \(\widehat{\beta}\pm a\times \textrm{SE}\), where \(a\) is the q/2% quantile of \(t\)-distribution with \(n-p-1\) degrees of freedom. When \(\sigma^2\) is known, the \(t\)-distribution is replaced by a Normal distribution, and same is approximately true when \(n\) becomes large, even if an estimate \(\widehat{\sigma^2}\) is used in computing SE. In these cases, we often talk about z-statistic instead of t-statistic.

Last paragraph in output tells about the full model fit. \(R^2\) is the proportion of variance explained by the linear model, i.e., \(R^2 = 1-\frac{RSS}{n-1} / \widehat{\textrm{Var}}(y)\). The adjusted version penalizes for additional predictors and is defined as \(R_{adj}^2 = 1-\frac{RSS}{n-p-1} / \widehat{\textrm{Var}}(y)\). Note that if there is only the intercept parameter \(\beta_0\) in the model (\(p=0\)), then \(R^2=R_{adj}^2=0\), and if the model explains data perfectly (\(RSS=0\)), then \(R^2=R_{adj}^2=1\). In all other cases, \(R^2\) values are between 0 and 1 and larger values mean more variance explained by the model. (\(R_{adj}^2\) can get negative values in cases where the model explains very poorly compared to the number of parameters in the model.)

\(R^2\) should not be the only value used to judge how suitable the model is for the data. One should always plot the data and the model fit in different ways to assess this question. For a simple linear model (\(p=1\)), the above scatter plot and regression line is a good start. More generally, plotcommand applied to an lm-object gives the following four diagnostic plots (read

par(mfrow=c(2,2)) #split the plot area in 2 rows and 2 columns, i.e. in 4 plots
plot( #make diagnostic plots