---
title: "HDS 0.2 Correlation and linear regression in practice"
author: "Matti Pirinen, University of Helsinki"
date: "29.9.2021"
output:
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(19)
```
In this lecture we study relationship between two or more variables.
First we establish how to quantify the strength of the linear relationship
between continuous variables
and then we learn how to use the relationship to predict
value of an unknown variable
given the values of observed variables.
## 1. Correlation
We read in a data set of heights and weights of 199 individuals (88 males and 111 females).
This dataset originates from .
```{r}
y = read.table("https://www.mv.helsinki.fi/home/mjxpirin/HDS_course/material/Davis_height_weight.txt",as.is = T, header=T)
head(y)
```
The last two columns are self-reported values of weight and height.
Let's plot weight against height.
```{r}
plot(y$height, y$weight, pch = 20, ylab = "weight (kg)", xlab = "height (cm)")
```
Unsurprisingly, there is a clear pattern where taller individuals weight more.
To quantify the (linear part of the) relationship, we compute a Pearson's
**correlation coefficient** between the variables. This happens in two parts:
compute the covariance between the variables and scale the covariance by the variability of both variables to
get a dimensionless correlation coefficient.
**Covariance** measures the amount of variation in the variables that is linearly shared between the variables.
Technically, it is the average product of deviations from the means of the variables, and from
a sample it is computed as
$$\textrm{cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})
\textrm{, where }\overline{x}, \overline{y} \textrm{ are the means of }x_is \textrm{ and } y_is, \textrm{resp}.$$
When both X and Y tend to be above their mean values simultaneously, then covariance is positive, whereas
the covariance is negative if when X is above its mean, Y tends to be below its mean. Covariance
of 0 says that there is no linear relationship between X and Y.
Note that if we compute the covariance between the variable with itself,
that is, $X=Y$ in the formula above, the result is simply
the variance of that one variable.
Thus, covariance is a generalization of
the concept of variance to two variables.
**Correlation** coefficient results when covariance is normalized by the product of
the standard deviations of the variables:
$$\textrm{cor}(X,Y) = \frac{\textrm{cov}(X,Y)}{\textrm{SD}(X) \textrm{SD}(Y)}.$$
Correlation is always between -1 and +1, and it denotes the strength of the linear relationship
between the variables. If correlation is +1, then values of X and Y are on a line that has a positive slope
and if correlation is -1, then X and Y are on a line that has a negative slope. When correlation is 0,
there is no linear association between the variables. (See
Figures from [Wikipedia](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).)
Note that if correlation between X and Y is 1 it does not necessarily
mean that $Y = X$, but that only that
there is a linear relationship of the form
$Y= a+ bX$ for some constants $a$ and $b$, and any such linear relationship
leads to the correlation of 1.
Suppose we have an estimate $\widehat{r}$ of the correlation between X and Y, based on a sample of
$n$ observation ($n=199$ in the example above).
To do statistical inference with correlation coefficient we need to know the
standard error of the estimate of the correlation.
An approximation for Normally distributed data is:
$$\textrm{SD}(\widehat{r}) \approx \sqrt{\frac{1-\widehat{r}^2}{n-2}}$$
This is accurate when correlation is not far from 0 (say $|r|<0.5$), but is becoming less acurate
near $\pm 1$. For more accurate confidence intervals you can use `r.con()` from package `psych`.
We expect a clear positive correlation between weight and height based on the Figure above.
Let's see what it is.
```{r}
n = nrow(y) #sample size
r = cor(y$height, y$weight) #correlation is symmetric cor(weight,height) = cor(height,weight)
se = sqrt((1-r^2)/(n-2))
library(psych) #DO FIRST: install.packages("psych")
ival = r.con(r, n, p = 0.95, twotailed = TRUE)
cbind(r = r, low95CI = ival[1], up95CI = ival[2],
low95CI.norm.appr = r-1.96*se, up95CI.norm.appr = r+1.96*se)
```
In this case, `r.con()` gives 95CI as (0.707, 0.822) showing some difference from the less accurate normal approximation.
**Exercises 1**
1. Plot scatter plot of `weight` on x-axis and `repwt` on y-axis.
Make a similar plot for `height` and `repht`. Do the reported values
seem highly correlated with the measured values?
2. Compute correlation between `weight` and `repwt` (self-reported weight)
and between `height` and `repht` (self-reported height).
Note that you need to use `use = "complete.obs"` to get rid of NAs.
3. Compute a **correlation matrix** of the four variables `weight`, `height`,
`repwt` and `repht` by giving `cor()` function those four columns of the data matrix
as input. Note that you need to use `use = "pairwise"` in `cor()` to get rid of NA
values.
4. Plot the curve $y=x^2$ in the range $x \in (-1,\ldots,1)$ using 1000 equally spaced values for $x$. Guess what would be the correlation between $x$ and $y$. Compute correlation.
5. Plot the curve $y=x^3$ in the range $x \in (-1,\ldots,1)$ using 1000 equally spaced values for $x$. Guess what would be the correlation between $x$ and $y$. Compute correlation.
## 2. Linear regression
Correlation coefficient $r$ describes the strength of linear relationship between two variables.
Both variables are treated symmetrically in definition of $r$.
However, we may also want to utilize the linear relationship to predict the value of Y given that
we know the value of X. For example, we may use our observed population above to make a
statistical prediction of weights of individuals who are exactly 170 cm tall.
From the Figure above we see that such individuals have weights roughly in range from 50 kg to 80 kg.
To characterize the relationship more mathematically,
we want to fit a line through the observed data that describes how Y depends on X.
The **linear regression** model assumes that
$$y_i = a + b x_i + \varepsilon_i,$$
where $a$ (intercept, vakio) and $b$ (slope, kulmakerroin)
are model parameters that define the line and error term
$\varepsilon_i$
is the difference between the observed value $y_i$ and the value
predicted by the line
$\varepsilon_i = y_i - (a + b x_i)$.
Any pair of parameters $a$ and $b$ define one line (namely $y = a + bx$) in
X-Y coordinates and among all those
lines we will choose the one that fits best the observations as described next.
We denote the estimates by $\widehat{a}$ and $\widehat{b}$
and the prediction for observation $i$ by
$\widehat{y}_i = \widehat{a} + \widehat{b}x_i.$
The estimates are chosen by the least squares method
that minimizes the residual sum of squares
$$\textrm{RSS} = \sum_{i=1}^n(y_i-\widehat{y}_i)^2 = \sum_{i=1}^n(y_i-(\widehat{a}+\widehat{b}x_i))^2$$
In R the linear model is estimated using `lm()` function where formula `y~x`
says that "regress y on x". The regression line can be added to a figure by `abline()` function.
```{r}
lm.1 = lm( weight ~ height, data = y)
plot(y$height, y$weight, pch = 20, ylab = "weight (kg)", xlab = "height (cm)", col="gray")
abline(lm.1, col="red", lwd=1.5)
summary(lm.1)
```
In output of the `summary`:
* Residuals are the observed vertical differences between the line and the observed weight value
* Coefficients show the least squares parameter estimates with their SEs and P-values
(here highly significantly different from 0 with P-values < 2e-16). The slope tells that each cm
in height corresponds to an average increase of 1.15 kg in weight.
* Residual standard error is an estimate for standard deviation of errors (often denoted by $\sigma:$ $\text{Var}(\varepsilon)=\sigma^2$)
* R-squareds tell how large proportion of the variance in Y the line explains compared to the total
variance in Y. Typically, we should be looking at Adjusted Rsquared as it is
more reliable than the raw Rsquared if there are many predictors in the model.
In this case, the linear model on height explains almost 60% of the variation in weight and therefore
about 40% of variation in weight is something that cannot be explained
by (a linear effect of) height only. Note that in the case of simple linear regression with
one predictor (here height), Rsquared is exactly the square of the correlation between X and Y.
```{r}
cor(y$weight, y$height)^2
summary(lm.1)$r.squared
````
This explains why correlation coefficient is a measure of strength of a linear association between variabels:
When $r=\pm1$, then the linear regression model explains all variation in Y and hence the observations
are on a single line in X-Y coordinates.
Let's see what columns we have in the `lm` object:
```{r}
names(lm.1)
```
We could get both the fitted values (`lm.1$fitted.values`) as well as residuals (`lm.1$residuals`)
for each individual from the regression model object. For example, let's find for which individuals the
predicted weight is more than 20kg off.
```{r}
ii = which(abs(lm.1$residuals) > 20) #returns row indexes for which condition is TRUE
cbind(y[ii,2:4], lm.1$fitted.values[ii], lm.1$residuals[ii])
```
We notice 2 things. First there is one outlier case where weight is 119 kg whereas it is predicted
to be 76 kg based on the height of 180 cm. Second, all of the worst predictions happened for males.
Indeed, it is unlikely that the same linear regression model would be good for both
males and females.
Let's add sex as an another predictor to the model to allow different mean
weights in males and females.
Since `sex` is of type "character" with two values ("M" and "F"),
R automatically treats it as *factor* in regression model.
For factor variables, regression model will get
different baseline values for different levels of the factor
(here males and females).
```{r}
lm.2 = lm(weight ~ height + sex, data = y)
summary(lm.2)
```
Now our model says that after we account for difference in
average weight between males and females,
each cm in height corresponds to 0.81 kg in height (it was 1.15 kg when sex wasn't accounted for).
Additionally we see that for the same height, a male weights on average 8.2 kg more than a female.
This model is a better description of the data as adjusted Rsquared has increased to
63% from 59% in model `lm.1`.
But are there differences in the height-weight slope between the sexes?
Let's simply fit separate linear models in males and in females and see the slopes.
```{r}
males = y$sex == "M"
lm.m = lm(weight ~ height, data = y[males,])
lm.f = lm(weight ~ height, data = y[!males,])
```
We see that slope in females is
```{r}
b = summary(lm.f)$coeff[2,1] #coeff matrix has one row for each parameter
s = summary(lm.f)$coeff[2,2] #row 2 is for slope and row 1 for intercept
cbind(b=b, low95 = b-1.96*s, up95 =b+1.96*s)
```
And in males
```{r}
b = summary(lm.m)$coeff[2,1]
s = summary(lm.m)$coeff[2,2]
cbind(b=b, low95 = b-1.96*s, up95 = b+1.96*s)
```
It seems like slope in females might be considerably smaller than in males, but since the 95%CIs
are quite wide, this cannot be concluded reliably from these data alone.
Let's see how much regression explains within each sex.
```{r, fig.height=4.5}
par(mfrow=c(1,2))
plot(y[males,"height"], y[males,"weight"], pch = 20, ylim = c(35,120), xlim=c(140,200),
ylab = "weight (kg)", xlab = "height (cm)", col="gray")
abline(lm.m, col="cyan", lwd=1.5)
title(paste0("MALES R2=",signif(summary(lm.m)$r.squared,3)))
plot(y[!males,"height"], y[!males,"weight"], pch = 20, ylim = c(35,120), xlim=c(140,200),
ylab = "weight (kg)", xlab = "height (cm)", col="gray")
abline(lm.f, col="magenta", lwd=1.5)
title(paste0("FEMALES R2=",signif(summary(lm.f)$r.squared,3)))
```
We see a dramatic decrease in variance explained after we have adjusted the analyses for sex.
This is because there is a strong difference in both height and weight distributions between males and females,
and therefore a large part of the explanatory power of the orginal model was the effect of sex.
After sex is accounted for, height explains about 25-30% of variation in weight.
This shows how important the additional variables, also called
covariates, can be for the
interpretation of the regression model parameters.
We talk more about these effects of *covariate adjustment*
later in this lecture.
**Exercises 2**
1. Generate 10 values for $x$ from interval $(-1,\ldots,1)$ and generate
corresponding values for $y$ using `rnorm(10,x,sigma)` where
`sigma` takes sequentially values of 0.1, 1 and 10.
For each value of `sigma`, plot the points on X-Y coordinates,
fit the linear model and add the fitted line to the picture.
Use `summary()` command to see whether there is a statsitically
significant linear relationship between X and Y in each data set.
2. Plot scatter plot of `weight` on x-axis and `repwt` on y-axis.
Fit a linear model regressing `repwt` on `weight`.
Add the regression line to the Figure.
What is the slope of the model?
What is the value of `repwt` for individual 19 and what is the
fitted value of `repwt` based on the model for individual 19?
What is the residual error for individual 19?
## 3. Predictions from the regression model
Let's use `predict()` function to predict a weight for two new individuals who are 160 cm and 180 cm tall.
We need to give the input data in the same kind of data.frame that was used in the original
model fit with `lm()` function. The uncertainty of prediction can be asked on the level of individual
value `interval="prediction"` or as a confidence interval for the population average at the given
parameters `interval="confidence"`. Typically we are interested in predicting new observations, whence we use `"prediction"` and
the uncertainty is larger than in predicting the population mean with
`"confidence"` option.
```{r}
data.in = data.frame(height = c(160,180) )
predict(lm.f, newdata = data.in, interval = "prediction") #for females of 160 and 180 cm
```
```{r}
predict(lm.m, newdata = data.in, interval = "pred") #for males of 160 and 180 cm
```
We see that a female of 180 cm are predicted at 66 kg with 95% interval (54,79) whereas male of same height
at 78kg (58,98).
**Exercises 3.**
1. What is the predicted weight of a male who is 170 cm tall? What is the
predicted population average weight of all males who are 170 cm tall?
Give prediction intervals in both cases.
2. Fit a linear model regressing `height` on `weight` in females. What is the predicted
height of a woman who is 170cm tall?
## 4. Assessing model fit
How do we know whether a linear model is an approriate model
for our data set? First we should consider this conceptually:
Is there a reason to think that a linear relationship
between the variables is appropriate for the research question
we have in mind?
If answer is positive, then we should ask how
is the linear model fitting the data and are the modling assumptions
valid? The main assumptions for linear model are:
1. The relationship between Y and X is linear.
2. Errors of Y have constant variance for all values of X.
3. Errors of Y are independent of each other.
4. Errors have Normal distribution (for SE and P-value to be valid).
There are 4 standard plots that we can use to assess these
assumptions and they are made by calling `plot()`
on the regression model object.
Let's first check how our first model without adjustment for `sex`
behaves in these diagnostic plots:
```{r,fig.height = 8}
par(mfrow=c(2,2)) #plot diagnostics to 2x2 area
plot(lm.1)
```
To interpret these plot, read .
For our test data,
the diagnostics look quite good, except that the observation 20 seems a bit outlying from the rest.
However, the lower right plot shows that it does not have a
large influence on the fitted line
(as explained in the link above) and therefore we are not that concerned
about it.
### Anscombe's quartet
Francis Anscombe generated [four data](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)
sets of 11 observations that had similar
summary statistics (means, variances correlations) but look very different when plotted.
His point was to emphasize the importance to plot the data. Here we use
these four sets to assess how linear model fits different data sets.
Let's start by plotting the datasets and fitting linear model and computing correlations
and slopes.
```{r, fig.width = 8, fig.height = 8}
par(mfrow=c(2,2))
lm.ans = list() #this is empty list where linear models are collected
for(ii in 1:4){
x = anscombe[,ii] #x-coordinates for Anscombe's data set ii
y = anscombe[,4+ii] #y-coordinates
lm.ans[[ii]] = lm(y ~ x)
plot(x,y, main=paste("Anscombe",ii), pch = 19, xlab = "x", ylab = "y",
sub = paste0("r=",signif(cor(x,y),3)," b=",signif(coefficients(lm.ans[[ii]])[2],3)))
abline(lm.ans[[ii]], col = "red", lwd = 1.5)
}
```
We see that the data sets are identical with respect to x-y correlation and in the
linear model slope, but we also see that the data sets are very different in other respects,
and, in particular, the linear model does not seem appropriate for the
data sets 2,3,4.
Let's see diagnostic plots for the linear models.
```{r,fig.width = 9, fig.height = 9}
par(mfrow = c(4,4))
par(mar = c(3,3,2,1))
for(ii in 1:4){plot(lm.ans[[ii]])}
```
## 5. Example analysis: Social factors in 1998
Let's study some social factors from around the world using a UN data sets from 1998.
For description of variables see .
```{r}
y = read.csv("https://www.mv.helsinki.fi/home/mjxpirin/HDS_course/material/UN98.csv", as.is = T)
head(y) #show some first rows
```
Let's give these variables shorter names to avoid writing long
names below:
```{r}
colnames(y) = c("country","region","tfr","contr","eduM","eduF","lifeM",
"lifeF","infMor","GDP","econM","econF","illiM","illiF")
```
Let's plot the pairwise correlations using `corrplot`
```{r}
#install.packages("corrplot")
library(corrplot)
x = as.matrix(y[,3:14]) #make a matrix of numeric columns for cor()
corr = cor(x, use = "complete") #"complete" means remove NAs
corrplot.mixed(corr, order="hclust")
```
We see two clear blocks of highly correlated variables. First block
is about positive things like life expectancy, education and
availability of contraception and the second about
negative things like illiteracy, infant mortality and high fertility rate.
There are also some surprising correlations, or rather surprising
lack of correlation, between, for example, GDP and economic activity. Let's have a more careful look. Let's first check the distributions of these two variables.
```{r, fig.width = 9, fig.height = 4}
par(mfrow = c(1,3))
hist(y$econM, col = "blue", breaks = 25)
hist(y$GDP, col = "orange", breaks = 25)
hist(log10(y$GDP), col = "yellow", breaks = 25)
```
We see that activity is roughly symmetric and Normally distributed but GDP is an example
of highly skewed distribution with a strong tail to right. This is typical to
variables describing wealth or, more generally, variables that take only non-negative values.
Often logaríthm of such a variable is a much better input variable to linear regression
as it is often more symmetric and Normally distributed (as is the case here).
Thus we will use log10 of GDP.
```{r}
plot(y$econM, log10(y$GDP))
y[which(y$econM > 90),] #print countries with over 90% Males "economically active"
```
We have poor countries like Burundi and Chad and rich countries like Qatar and United Arab Emirates
that both have high proportion of economic activity. One clear factor separating these is infant mortality.
Let's look the same plot but color points by infant mortality.
```{r}
plot(y$econM, log10(y$GDP), pch = 20, col = heat.colors(nrow(y))[nrow(y)-rank(y$infMor)],
main = "Colored by infant mortality (dark=high, light=low)")
```
Based on this
we can guess what linear model says when we predict GDP by both Economic activity and Infant mortality:
```{r}
lm.0 = lm(log10(GDP) ~ econM + infMor, data = y)
summary(lm.0)
```
Economic activity is not important predictor whereas Infant mortality is.
Let's do linear model using Infant mortality alone as predictor.
```{r}
lm.1 = lm(log10(GDP) ~ infMor, data = y)
summary(lm.1)
plot(y$infMor, log10(y$GDP), pch = 20)
abline(lm.1, col = "red", lwd = 1.5)
```
We see that
log10 of GDP is decreasing by 0.014 for each unit of infant deaths (per 1000 infants).
In other words, when infant mortality increases one per mille, GDP drops multiplicatively by a factor
of 10^(-0.014051)=`r 10^(-0.014051)`, i.e., it drops about 3%. This model alone
explains about 50% of variation in GDP.
Does this model fit the data well? It looks like there is tendency of errors being positive at the
ends, which would suggest adding second order term in the model. Let's make diagnostic plots.
```{r, fig.height=8}
par(mfrow = c(2,2))
plot(lm.1)
```
No other clear problems except that residuals have a shape of a prabola in the first plot.
Let's add quadratic term whence our model becomes `log10(GDP) ~ InfMor + InfMor^2`.
The quadratic term needs to be input through `I()` notation as below.
```{r}
lm.2 = lm(log10(GDP) ~ infMor + I(infMor^2), data = y)
summary(lm.2)
plot(y$infMor, log10(y$GDP), pch = 20)
#To plot the fitted values from lm.2, we need the correct x-coordinates, and these we get as:
included = setdiff(1:nrow(y), as.numeric(lm.2$na.action)) #these indexes of y were incl'd in regression
points(y$infMor[included], lm.2$fitted.values, col = "red", lwd = 1.5)
legend("topright", pch = 1, legend = c("fitted"), col = "red")
```
This quadratic term clearly improves model fit (R2 increases from 50% to 62%) and
fitted values take into account the curved pattern in data. Let's see
diagnostic plots.
```{r, fig.height=8}
par(mfrow=c(2,2))
plot(lm.2)
```
Look good.
We have found very strong (inverse) association between Infant mortality and GDP.
Can we now say that Infant mortality is **the cause** of differences in GDP?
The answer is **NO**. Causal statements can never be made from regression analysis alone!
**Correlation is not causation** is an important fact to remember in all fields of
data analysis!
This is because there can be some third factor Z,
that could be causally influencing both X and Y,
and hence the effect of Z could cause the observed correlation between X and Y
even though neither X nor Y was causally influencing the other.
Such a third variable Z is called a **confounder** when it is *confounding* the association
between X and Y.
A common statistics textbook example is the positive correlation between ice-cream sales and deaths by drowning.
Is ice-cream causing drowning? No, it is sun and hot temperature that increase
ice-cream sales and also increase activity on water, which then also leads to more accidents on water,
and consequently deaths by drowning. So weather is confounding the association between
ice-cream sales and drowning. Correlation is not causation.
Since we can never measure all variables in observational studies, it remains very
difficult to end up in causal inference using observational data as additional, unmeasured
confounders could influence the correlation estimates.
Luckily, in clinical medicine and experimental science,
there is an important method to overcome the problem of
unmeasured variables: **randomized controlled trial (RCT)** ([link](https://en.wikipedia.org/wiki/Randomized_controlled_trial)).
The situation is more challenging for observational
population-level epidemiology or social sciences,
for example, and there causal statements are difficult to justify.
## 6. Multiple regression with social factors
Let's think a bit more about what happens when we include multiple predictors in the same
linear regression model. Above we established that Infant Mortality was a strong predictor
of GDP. What about Total Fertility Rate (tfr)?
```{r}
lm.3 = lm(log10(GDP) ~ tfr, data = y)
summary(lm.3)
plot(y$tfr, log10(y$GDP), pch = 20)
abline(lm.3, col = "red", lwd = 1.5)
```
A very clear association where higher fertility associates with decreasing GDP.
The model explains about 38% of the variation. Let's then include both
Infant Mortality and tfr in the same model. This is an example of multiple regression model.
```{r}
lm.4 = lm(log10(GDP) ~ tfr + infMor, data = y)
summary(lm.4)
```
In this model, only `infMor` seems important (effect is many SEs away from zero as denoted
by small P-value) whereas the association with `tfr` has disappeared (effect dropped from -0.25 to -0.02
and P-value went from <2e-16 to 0.57). This is possible because `infMor` and `tfr` are correlated
and therefore their effect estimates in multiple regression model can be different from their effects in
simple regression models where only one of them is included at once. In these data it seems that
the effect that `tfr` explained when it was alone in the model, can be completely explained by
a correlated variable `InfMor`. Multiple regression has here produced new insights over
pairwise correlations.
We say that the effect size of `tfr` from multiple regression model "has been adjusted for"
`infMor` whereas the effect of `tfr` from the simple regression model was not so adjusted.
The interpretation for the effect estimates from multiple regression is that they tell
what happens to the outcome variable (here log10(GDP)) per a unit change in the
focal predictor (here `tfr`), when other predictors (here `infMor`) are kept constant.
Thus, here we have learned that if we fix the level of infant mortality,
then the fertility rate does not anymore noticeably affect GDP.
Note that, in general,
when several highly correlated variables are simultaneously
included in a multiple regression model,
then the standard errors of individual parameters may become large
and it is difficult to interpret the parameters separately from the other
parameters with a meaningful precision.
In particular, individual P-values of predictors
tend to become large but they cannot be used for inferring
whether individual predictors are important or not for the total
model fit!