---
title: 'Lecture 7: Linear regression II'
author: "Matti Pirinen"
date: "23.9.2021"
output:
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(19)
```
In this lecture, we will study more details about the linear regression model.
In particular, how to make predictions from the model using R, and
how to assess whether the model seems approriate for the data set at hand.
Finally, we will study the interpretation of the coefficients
in multiple regression setting where many predictors are simultaneously included
in the model.
Let's start by reading in the same height-weight data set as previously.
```{r}
measures = read.table("Davis_height_weight.txt", as.is = T, header = T)
head(measures, n = 3) #show first three rows
```
The columns `repwt`and `repht` are self-reported weight and self-reported height, respectively.
Let's fit separate linear regression models `lm.m` and `lm.f` in males and females, respectively.
```{r, fig.width = 9}
ii.m = (measures$sex == "M") #logical vector with 'TRUE' for Males and 'FALSE' for Females
lm.m = lm( weight ~ height, data = measures[ii.m,]) #fit model in males
lm.f = lm( weight ~ height, data = measures[!ii.m,]) #fit model in females
#use same range on the axes in both sexes to get a clear picture how they compare
x.range = c(145, 200) #height in cm
y.range = c(35, 120) #weight in kg
x.name = "height (cm)"
y.name = "weight (kg)"
par(mfrow = c(1,2)) #split plotting area to 1 x 2 grid
plot(measures$height[!ii.m], measures$weight[!ii.m], pch = 18, main = "FEMALES", col = "red",
ylab = y.name, xlab = x.name, xlim = x.range, ylim = y.range)
abline(lm.f, col = "darkred", lwd = 1.5) #add the regression line
plot(measures$height[ii.m], measures$weight[ii.m], pch = 20, main = "MALES", col = "blue",
ylab = y.name, xlab = x.name, xlim = x.range, ylim = y.range)
abline(lm.m, col = "darkblue", lwd = 1.5) #add the regression line
```
### Predictions from the regression model
On previous Lecture, we looked up the coefficients of the linear model and used them to
compute the predicted value of `weight` for an individual with a given `height`.
For example, to compute the predicted `weight` for a 180 cm tall male, we did
```{r}
sum(lm.m$coeff*c(1,180)) #this computes a*1 + b*180 for a and b taken from male regression 'lm.m'
```
Very generally, in R, the fitted models can be used for prediction via
the `predict(model, newdata)` function.
Let's predict `weight` for two new individuals who are 160 cm and 180 cm tall.
We need to give the input data as parameter `newdata =` formatted as
a data.frame that contains (at least) the same column names that were used in the model
fitted by the `lm()` function. In the current case, we need to define only the column `height`. We will call the data that goes in to the prediction as
`data.in`.
```{r}
data.in = data.frame(height = c(160, 180) )
data.in # show what we'll put in as newdata parameter for predict(, newdata = )
predict(lm.m, newdata = data.in) # predictions for males of 160cm and 180cm
```
We can present these results in a `data.frame` together with the input data:
```{r}
data.frame(data.in, predicted.weight = predict(lm.m, newdata = data.in))
```
The uncertainty of prediction can be asked at the level of individual
value `interval = "prediction"` or as a confidence interval
for the population mean at the given
parameter values `interval = "confidence"`.
Note that the option names in R commands can be abbreviated as long as the abbreviation
remains unique among all possible options; hence here
we can use `pred` and `conf`, for example.
```{r}
predict(lm.m, newdata = data.in, interval = "pred") #for males of 160cm and 180cm
```
The interpretation of the 2nd prediction interval here
is that, in 95% of the data sets, when we do this prediction
procedure for a new male who is 180 cm tall,
the new male will fall within the prediction interval,
which here is from 58 kg to 98 kg.
```{r}
predict(lm.m, newdata = data.in, interval = "conf") #for males of 160cm and 180cm
```
The interpretation of the confidence interval for the mean here
is that, in 95% of the data sets, when we do this prediction
procedure for male population of height of 180 cm,
the true population mean will be within the confidence interval,
which here is from 75.6 kg to 81.1 kg.
**WARNING**: A common error happens when `newdata` parameter
is not correctly specified because then `predict()` will
not complain but simply
outputs the fitted values of the original `lm()` model fit.
For example, if we misspell `newdata` as `new.data`,
R will not complain and the first two predicted values
are actually for the first 2 males in the original data,
NOT for 160cm and 180cm as the user might think!
```{r}
predict(lm.m, new.data = data.in)[1:2] #GOES WRONG because newdata has not been specified!
predict(lm.m, newdata = data.in)[1:2] #This is correct
```
#### Example 7.1
1. What is the predicted weight of a male who is 170 cm tall? What is the
predicted population average weight of all males who are 170 cm tall?
Give appropriate 95% intervals in both cases.
```{r}
#For completeness, let's re-fit the model first in males only
lm.m = lm( weight ~ height, data = measures[measures$sex == "M",])
#predicted weight for a male 170 cm
predict(lm.m, newdata = data.frame(height = c(170)), interval = "pred")
#predicted population average at 170 cm
predict(lm.m, newdata = data.frame(height = c(170)), interval = "conf")
```
We estimate that the population average weight for a male who is 170 cm tall is
67.9 kg (95%CI: 64.5,...,71.3).
We predict that 95% CI for weight values for 170 cm males is (47.6,...,88.2)
(and the mean weight is the same 67.9 kg as the estimate for the population mean).
2. Fit a linear model regressing `height` on `weight` in females. What is the predicted
height of a woman who weighs 60kg?
```{r}
lm.h.on.w = lm( height ~ weight, data = measures[measures$sex == "F",])
#predicted weight for a female of 60kg
predict(lm.h.on.w, newdata = data.frame(weight = c(60)), interval = "pred")
```
Predicted height is 166 cm with 95%CI of (156,...,176).
### Assessing model fit
Francis Anscombe generated [four data sets](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)
of 11 observations that had similar
summary statistics (means, variances, correlations) but look very different when plotted.
His point was to emphasize the importance to plot the data.
In R, these data sets are in the columns of the data.frame called `anscombe`.
Let's plot the datasets, fit linear model
and compute correlations and slopes that we put as a subtitle for each plot.
```{r, fig.width = 8, fig.height = 8}
anscombe[1:2,] # show first two rows
par(mfrow = c(2,2)) #2 x 2 plotting area for 4 plots
for(ii in 1:4){
x = anscombe[,ii] #x-coordinates for Anscombe's data set ii
y = anscombe[,4 + ii] #y-coordinates for Anscombe's data set ii
lm.fit = lm(y ~ x)
plot(x, y, main = paste("Anscombe",ii), pch = 19, xlab = "x", ylab = "y",
sub = paste0("cor = ",signif(cor(x,y),3),
"; slope = ",signif(coefficients(lm.fit)[2],3)))
abline(lm.fit, col = "red", lwd = 1.5)
}
```
We see that the data sets are identical with respect to the x-y correlation and the
linear model slope, but we also see that the data sets are very different
in other respects, and, in particular, the linear model does not seem an
appropriate description of the X-Y relationship for the data sets 2, 3 and 4.
How can we know whether a linear model is an approriate model
for our data set? First we should consider this conceptually:
Is there a reason to think that a linear relationship
between the variables is appropriate for the research question
we have in mind?
If the answer is positive, then we should ask how
is the linear model fitting the data and are the modeling assumptions
valid? The main assumptions for the linear model are:
1. The relationship between $Y$ and $X$ is linear, that is, there are
values $a$ and $b$ for which $Y = a + b\cdot X + \varepsilon$ and
the **error term** $\varepsilon$ does not depend on the value of $X$.
2. Errors $\varepsilon$ have constant variance across all values of X.
3. Errors $\varepsilon$ are independent of each other.
4. Errors $\varepsilon$ have a Normal distribution (in order that SE and P-value from `lm()` are valid).
There are 4 standard plots that we can use to assess these
assumptions and they are generated simply by calling `plot()`
on the regression model object.
Let's first check how our model of `weight ~ height` in females
behaves in these diagnostic plots:
```{r, fig.height = 8}
par(mfrow = c(2,2)) #plot diagnostics to 2 x 2 area
plot(lm.f) #calling plot on an lm-object generates 4 diagnostic plots
```
To interpret these plot, read the detailed explanation from here:
.
A short summary of the four plots:
1. Residuals ($y_i-\widehat{y}_i$) vs. fitted ($\widehat{y}_i$) should not show a
pattern where the distribution of residuals varies depending on the fitted values.
2. QQ-plot should ideally be on the line, whence the residuals ($y_i-\widehat{y}_i$)
are Normally distributed and SEs and P-values of the model coefficients are reliable.
However, small deviations from the line are not a problem.
3. Standardized residuals vs. fitted values should show a similar magnitude
along y-axis for each region on the x-axis.
Otherwise, the variance of the residuals is changing
and a more approriate model (than the standard `lm`) should take this variation into account.
4. Standardized residuals vs. leverage should not show points that reach
outside the red curved areas of Cook's distance 1. Otherwise,
such points would have a high influence on the slope of the linear model,
and hence the `lm()` estimates from such data would not be reliable.
For our test data,
the diagnostics look ok.
There are 4 slightly outlying observations, that have to such a degree
higher weight than predicted by their height, that they separate slightly
from the line on the QQ-plot. However, these are still minor deviations
and, importantly, they are not having a
large influence on the fitted line, which can be seen on the lower right plot,
where no points reach even a Cook's distance of 0.5.
(Only a small region beyond Cook's distance of 0.5 can be seen at the top right corner of the plot,
but no region with Cook's distance > 1 can be seen in this plot.)
#### Examples 7.2.
Let's demonstrate how deviations from the linear model assumptions show up in
the diagnostic plots. For that, let's change the existing female height-weight data
in a few different ways.
1. Let's make weight to have an extra contribution of `+0.1*(height-150)^2`
whence weight increases also quadratically and not just linearly, as a function of height.
Let's then fit the linear model and plot the model fit and the diagnostic plot 1.
```{r, fig.width = 9}
new.weight = measures$weight + 0.1*(measures$height - 150)^2
ii.f = (measures$sex == "F")
lm.new = lm(new.weight[ii.f] ~ measures$height[ii.f])
par(mfrow = c(1,2))
plot(measures$height[ii.f], new.weight[ii.f], pch = 19, col = "orange", main = "Changed data")
abline(lm.new, col = "red", lwd = 1.5)
plot(lm.new, 1) #plot only diagnostic plot no. 1
```
On left, we see the new data and the linear model fit that tends to underestimate
weight at both ends of the height distribution.
On right, we see a parabolic pattern of residuals vs. fitted values, which indicates
that a linear model is not adequate and suggests that a quadratic term of the predictor
should be included in the model.
(We will come back to how to do this later in this lecture.)
2. Let's make weight to have more variance with taller individuals
by adding an extra contribution of `+ rnorm(, mean = 0, sd = 2*|height-150|/100)`.
Let's then fit the linear model and plot the model fit and the diagnostic plot 3.
```{r, fig.width = 9}
new.weight = measures$weight + rnorm(nrow(measures), 0, 2*abs(measures$height - 150))
new.weight[new.weight < 40] = 40 #Let's set minimum new weight to 40kg
lm.new = lm(new.weight[ii.f] ~ measures$height[ii.f])
par(mfrow = c(1,2))
plot(measures$height[ii.f], new.weight[ii.f], pch = 19, col = "orange", main = "Changed data")
abline(lm.new, col = "red", lwd = 1.5)
plot(lm.new, 3) #plot only diagnostic plot no. 3
```
On left, we see that in the new data, the variance inclreases with height.
On right, we see an increasing pattern in standardized residuals,
which indicates that variance of residuals is changing.
3. Let's add one new individual to data with `height = 200` and
`weight = 50`. This individual will be an outlier (meaning that residual value is large)
because weight is so small compared to height.
The individual will also be influential for the slope of the line,
which is shown by the observation
reaching Cook's distance of 1 in Stndardized residuals vs. Leverage plot.
Let's show the original fitted line without the outlier from `lm.f` object
as a blue dashed line, and the new fitted line with the outlier individual included
as a red line.
```{r, fig.width = 9}
new.weight = c(measures$weight[ii.f], 50) #existing females + new_value 50
new.height = c(measures$height[ii.f], 200) #existing females + new_value 200
lm.new = lm(new.weight ~ new.height)
par(mfrow = c(1,2))
plot(new.height, new.weight, pch = 19, col = "orange", main = "Changed data")
abline(lm.new, col = "red", lwd = 1.5)
abline(lm.f, col = "blue", lwd = 1.5, lty = 2)
plot(lm.new, 5) #plot only diagnostic plot no. 5
```
On left, we see the outlier point at x=200, y=50. We also see how much that one point
changes the fitted line from blue to red.
On right, we see how the diagnostic plot no. 5 shows that the outlier point (indexed 112)
reaches Cook's distance 1, which indicates that it is very influential for the slope of the
line. In these cases, it is important to understand what is the reason
for the outlier points
and whether they could/should be excluded from the analysis.
A typical reason for coarse outliers is a measurement error of
some kind at some step of the data collection,
but outlier points can also be completely real and valid observations.
### Example analysis: Social factors in 1998
Let's study some social factors from around the world at the turn
of the millenium using a UN data sets from 1998. This file is
in csv format (comma separated values) and we can read it in
using `read.csv()` function that is simply a version of `read.table()`
function with default parameter values set to correspond a typical csv file.
```{r}
y = read.csv("UN98.csv", as.is = TRUE, header = TRUE)
head(y) #show some first rows
```
Read the description of the variables from .
Let's give these variables shorter names to avoid writing long names.
```{r}
colnames(y) = c("country","region","tfr","contr","eduM","eduF","lifeM",
"lifeF","infMor","GDP","econM","econF","illiM","illiF")
```
Let's plot the pairwise correlations using `corrplot` to an increased plotting area of size `fig.width = 8` and
`fig.height = 8` (defined in the Rmd code block definition).
```{r, fig.width = 8, fig.height = 8}
#install.packages("corrplot")
library(corrplot)
x = as.matrix(y[, 3:14]) #make a matrix of numeric columns of data.frame y for cor()
corr = cor(x, use = "pairwise")
corrplot.mixed(corr, order = "hclust")
```
We see two clear blocks of highly correlated variables. First block
is about positive things like life expectancy, education and
availability of contraception and the second contains
negative things like illiteracy and infant mortality.
We observe that "percentage of economically active women"
does not correlate much with the other variables.
Perhaps surprisingly,
GDP (gross domestic product, bruttokansantuote)
doesn't correlate positively with economic activity in males either.
Let's have a more careful look at these two variables.
```{r, fig.width = 10, fig.height = 4}
par(mfrow = c(1,3))
hist(y$econM, col = "dodgerblue", breaks = 25)
hist(y$GDP, col = "orange", breaks = 25)
hist(log10(y$GDP), col = "khaki", breaks = 25)
```
We see that activity is roughly symmetric and very roughly
Normally distributed but GDP is an example
of a highly skewed distribution with a strong tail to right. This is typical of
variables describing wealth or, more generally, variables that take only non-negative values.
Often logaríthm of such a variable is a much better input variable to linear regression
as it is often more symmetric and Normally distributed, as is the case here as well.
Thus we will use log10 of GDP.
(Note that variables need not be Normally distributed in order to be used in
linear regression, but an approximate Normality often makes the model fit better.)
```{r}
plot(y$econM, log10(y$GDP), pch = 19)
```
There does not seem to be clear patterns between these two variables, which agrees with
the small magnitude of correlation.
Let's print out statistics for countries with over 90% Males "economically active".
```{r}
y[which(y$econM > 90),] #we use which() here because that automatically filters out NAs
```
We have poor countries like Burundi and Chad
and rich countries like Qatar and United Arab Emirates
that both have a high proportion of male economic activity.
One clear factor separating these groups is infant mortality.
Let's look at the same plot but now coloring the points by infant mortality.
We make as many colors as there are countries by using the `heat.colors(n)` palette
that returns $n$ colors from red to yellow.
We index these colors by `rank` of each country in `infMor` variable (rank = 1
for the smallest `infMor` and rank = $n$ for the largest `infMor`).
```{r}
n = nrow(y)
plot(y$econM, log10(y$GDP), pch = 20, col = heat.colors(n)[rank(y$infMor)],
main = "Colored by infant mortality (red = low, yellow = high)")
```
We see that `infMor` seems predictive of GDP
(red associates with high GDP and yellow with low GDP).
What does the linear model say when we predict GDP
by both Economic activity and Infant mortality?
```{r}
lm.0 = lm(log10(y$GDP) ~ econM + infMor, data = y, na.action = "na.exclude")
summary(lm.0)
```
Economic activity is not an important predictor whereas Infant mortality is,
and the model explains about 50% of the variation in GDP.
Let's do a linear model using Infant mortality as the only predictor.
```{r}
lm.1 = lm(log10(GDP) ~ infMor, data = y, na.action = "na.exclude")
plot(y$infMor, log10(y$GDP), pch = 20)
abline(lm.1, col = "red", lwd = 1.5)
summary(lm.1)
```
We see that
log10 of GDP is decreasing by 0.0133 for each unit of infant deaths (per 1000 infants).
In other words, when infant mortality increases one per mille, GDP drops multiplicatively by a factor
of 10^(-0.0133157)=`r 10^(-0.0133157)`, i.e., it drops about 3%. This model alone
explains about 50% of the variation in GDP.
Does this model fit the data well? It looks like there is a tendency of residuals
being positive at the ends, which would suggest adding the second order term in the model.
Let's check the diagnostic plots.
```{r, fig.height = 8}
par(mfrow = c(2,2))
plot(lm.1)
```
No other clear problems except that the residuals having a shape of a prabola in the top-left plot.
Let's add the quadratic term in the model
whence our model becomes `log10(GDP) ~ InfMor + InfMor^2`.
The quadratic term needs to be input through `I()` notation in the model formula
as is done below.
```{r}
lm.2 = lm(log10(GDP) ~ infMor + I(infMor^2), data = y, na.action = "na.exclude")
summary(lm.2)
plot(y$infMor, log10(y$GDP), pch = 20)
abline(lm.1, col = "red", lwd = 2) #linear model fit without squared term
#In order to show the lm.2 model fit in the same Figure, we connect the fitted values via lines,
# but we need to do that by ordering the points from the smallest infMor to the largest
x = y$infMor
x[is.na(y$GDP)] = NA #add NAs where GDP is missing
ii = order(x) #indexes from smallest infMor to largest infMor
lines(x[ii], fitted(lm.2)[ii], col = "magenta", lwd = 2) #show model fit with the squared term
legend("topright", lwd = 2, legend = c("linear","linear + quadratic"), col = c("red","magenta"))
```
This quadratic term clearly improves model fit (R2 increases from 50% to 62%) and
the fitted values take into account the curved pattern in the data. Let's see
the diagnostic plots.
```{r, fig.height = 8}
par(mfrow = c(2,2))
plot(lm.2)
```
They look good.
We have found very strong (inverse) association between Infant mortality and GDP.
Can we now say that Infant mortality is **the cause** of differences in GDP?
The answer is **NO**. Causal statements can never be made from regression analysis alone!
**Correlation is not causation** is an important fact to remember in all
analyses of observational data!
This is because there can be some third factor Z,
that could be causally influencing both X and Y,
and hence the effect of Z could cause the observed correlation between X and Y
even though neither X nor Y was causally influencing the other.
Such a third variable Z is called a **confounder** when it is *confounding* the association
between X and Y.
A common statistics textbook example is the positive correlation between ice-cream sales and deaths by drowning.
Is ice-cream causing drowning? No, it is sun and hot temperature that increase
ice-cream sales and also increase activity on water, which then also leads to more accidents on water,
and consequently deaths by drowning. So weather is confounding the association between
ice-cream sales and drowning and the positive correlation between these two
variables is not due to a causal relationship.
Since we can never measure all variables in observational studies, it remains very
difficult to end up in causal inference using observational data as additional, unmeasured
confounders could influence the regression estimates.
Luckily, in clinical medicine and experimental science,
there is an important method to overcome the problem of
unmeasured variables: [**randomized controlled trial (RCT)**](https://en.wikipedia.org/wiki/Randomized_controlled_trial).
The situation is more challenging for observational
population-level epidemiology or social sciences,
for example, and there it is difficult to justify causal statements.
#### Multiple regression with social factors
Let's think more about what happens when we include multiple predictors in the same
linear regression model. Above we established that Infant Mortality was a strong predictor
of GDP. What about Total Fertility Rate (`tfr`)?
```{r}
lm.3 = lm(log10(GDP) ~ tfr, data = y, na.action = "na.exclude")
summary(lm.3)
plot(y$tfr, log10(y$GDP), pch = 20)
abline(lm.3, col = "red", lwd = 1.5)
```
A very clear association where higher fertility associates with decreasing GDP.
The model explains about 38% of the variation. Let's then include both
Infant Mortality `infMor` and `tfr` in the same model. This is an example of a multiple regression model.
```{r}
lm.4 = lm(log10(GDP) ~ tfr + infMor, data = y, na.action = "na.exclude")
summary(lm.4)
```
In this model, only `infMor` seems important (effect is many SEs away from zero as denoted
by small P-value) whereas the association with `tfr` has disappeared (effect dropped from -0.25 to -0.02
and P-value went from <2e-16 to 0.57).
Such dramatic changes are possible because `infMor` and `tfr` are correlated
and therefore part of what they each explain alone in a regression model,
can be explained by the other when they are jointly included in a multiple regression model.
In these data, it seems that
the effect of `tfr` could be completely explained by
the correlated variable `InfMor`. Multiple regression has here produced new insights over
pairwise correlations. Namely, out of the two highly correlated variables
`infMor` and `tfr` ($r=0.8$),
the former seems much better to predict changes in GDP than the latter.
We say that the effect size of `tfr` from multiple regression model "has been adjusted for"
`infMor` whereas the effect of `tfr` from the simple regression model was not so adjusted.
The interpretation for the effect estimates from multiple regression is that they tell
what happens to the outcome variable (here log10(GDP)) per a unit change in the
focal predictor (here `tfr`), when the other predictors (here `infMor`) are kept constant.
Thus, here we have learned that if we fix the level of infant mortality,
then the fertility rate does not anymore noticeably affect GDP.
In Finnish: Hedelmällisyysluvulla ei ole tilastollista yhteyttä BKT:hen kun
analyysi on *vakioitu* lapsikuolleisuudella.