In this document, we study relationships between two or more variables. First, we establish how to quantify the strength of a linear relationship between continuous variables, and then we learn how to use that relationship to predict the value of an unknown variable given the values of the observed variables.
Let’s read in a data set of heights and weights of 199 individuals (88 males and 111 females). This dataset originates from http://vincentarelbundock.github.io/Rdatasets/doc/carData/Davis.html.
y = read.table("https://www.mv.helsinki.fi/home/mjxpirin/HDS_course/material/Davis_height_weight.txt",
as.is = TRUE, header = TRUE)
head(y)
## X sex weight height repwt repht
## 1 1 M 77 182 77 180
## 2 2 F 58 161 51 159
## 3 3 F 53 161 54 158
## 4 4 M 68 177 70 175
## 5 5 F 59 157 59 155
## 6 6 M 76 170 76 165
The last two columns are self-reported values of weight and height. Let’s plot weight against height.
plot(y$height, y$weight, pch = 20, ylab = "weight (kg)", xlab = "height (cm)")
Unsurprisingly, there is a clear pattern where taller individuals weigh more. To quantify the (linear part of the) relationship, we compute the Pearson correlation coefficient between the variables. This is done in two steps: (1) compute the covariance between the variables and (2) scale the covariance by the variability of both variables to get a dimensionless correlation coefficient.
Covariance measures the amount of variation in the variables that is linearly shared between the variables. Technically, it is the expectation of the product of the deviations of the variables from their means, and from a sample it is computed as \[\textrm{cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y}),\] where \(\overline{x}\) and \(\overline{y}\) are the means of the \(x_i\)s and \(y_i\)s, respectively. When both X and Y tend to be above their mean values simultaneously, the covariance is positive, whereas the covariance is negative if, when X is above its mean, Y tends to be below its mean. A covariance of 0 says that there is no linear relationship between X and Y. Note that if we compute the covariance of a variable with itself, that is, set \(X=Y\) in the formula above, the result is simply the variance of that one variable. Thus, covariance is a generalization of the concept of variance to two variables.
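As a quick sanity check, we can compute the sample covariance of height and weight from the formula above by hand and compare it to R's built-in cov() function (a minimal sketch; h and w are just local shorthands):
h = y$height #shorthand for the height column
w = y$weight #shorthand for the weight column
sum((h - mean(h)) * (w - mean(w))) / (length(h) - 1) #covariance from the formula
cov(h, w) #R's built-in function gives the same value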
The correlation coefficient results when the covariance is normalized by the product of the standard deviations of the variables: \[\textrm{cor}(X,Y) = \frac{\textrm{cov}(X,Y)}{\textrm{SD}(X)\, \textrm{SD}(Y)}.\] Correlation is always between -1 and +1, and it denotes the strength of the linear relationship between the variables. If the correlation is +1, then the values of X and Y are on a line that has a positive slope, and if the correlation is -1, then X and Y are on a line that has a negative slope. When the correlation is 0, there is no linear association between the variables. (See Figures from Wikipedia.) Note that a correlation of 1 between X and Y does not necessarily mean that \(Y = X\), but only that there is a linear relationship of the form \(Y = a + bX\) for some constants \(a\) and \(b > 0\); any such linear relationship leads to a correlation of 1.
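Again as a minimal sketch, we can verify in R that the normalized covariance equals cor() and that an exact linear relationship with a positive slope gives a correlation of 1:
cov(y$height, y$weight) / (sd(y$height) * sd(y$weight)) #equals cor(y$height, y$weight)
cor(y$height, 100 + 2 * y$height) #Y = a + bX with b > 0 gives exactly 1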
Suppose we have an estimate \(\widehat{r}\) of the correlation between X and Y, based on a sample of \(n\) observations (\(n=199\) in the example above). To do statistical inference with the correlation coefficient, we need to know the standard error of the estimate. An approximation for Normally distributed data is \[\textrm{SD}(\widehat{r}) \approx \sqrt{\frac{1-\widehat{r}^2}{n-2}}.\] This is accurate when the correlation is not far from 0 (say \(|r|<0.5\)), but becomes less accurate near \(\pm 1\). For more accurate confidence intervals, one can use the r.con() function from the psych package.
We expect a clear positive correlation between weight and height based on the Figure above. Let’s see what it is.
n = nrow(y) #sample size
r = cor(y$height, y$weight) #correlation is symmetric cor(weight,height) = cor(height,weight)
se = sqrt((1-r^2)/(n-2))
library(psych) #DO FIRST: install.packages("psych")
ival = r.con(r, n, p = 0.95, twotailed = TRUE)
c(r = r, low95CI = ival[1], up95CI = ival[2],
low95CI.norm.appr = r-1.96*se, up95CI.norm.appr = r+1.96*se)
## r low95CI up95CI low95CI.norm.appr
## 0.7707306 0.7074835 0.8217303 0.6817547
## up95CI.norm.appr
## 0.8597065
In this case, r.con() gives the 95% CI as (0.707, 0.822), showing some difference from the less accurate normal approximation.
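For reference, an interval like this can also be computed by hand using Fisher's z-transformation (to my understanding, this is also what r.con() is based on); a sketch using the r and n values defined above:
z = atanh(r) #Fisher's z-transformation of the correlation estimate
se.z = 1 / sqrt(n - 3) #approximate SE on the transformed scale
tanh(z + c(-1.96, 1.96) * se.z) #back-transformed 95% CI, approximately (0.707, 0.822)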
Exercises 1
1. Plot a scatter plot with weight on the x-axis and repwt on the y-axis. Make a similar plot for height and repht. Do the reported values seem highly correlated with the measured values?
2. Compute the correlation between weight and repwt (self-reported weight) and between height and repht (self-reported height). Note that you need to use use = "complete.obs" to get rid of NAs.
3. Compute a correlation matrix of the four variables weight, height, repwt and repht by giving the cor() function those four columns of the data matrix as input. Note that you need to use use = "pairwise" in cor() to get rid of NA values.
4. Plot the curve \(y=x^2\) in the range \(x \in (-1,\ldots,1)\) using 1000 equally spaced values for \(x\). Guess what the correlation between \(x\) and \(y\) is. Compute the correlation.
5. Plot the curve \(y=x^3\) in the range \(x \in (-1,\ldots,1)\) using 1000 equally spaced values for \(x\). Guess what the correlation between \(x\) and \(y\) is. Compute the correlation.
The correlation coefficient \(r\) describes the strength of a linear relationship between two variables, and both variables are treated symmetrically in its definition. However, we may also want to utilize the linear relationship to predict the value of Y given that we know the value of X. For example, we may use our observed population above to make a statistical prediction of the weights of individuals who are exactly 170 cm tall. From the Figure above we see that such individuals have weights roughly in the range from 50 kg to 80 kg. To characterize the relationship more mathematically, we want to fit a line through the observed data that describes how Y depends on X. The linear regression model assumes that \[y_i = a + b x_i + \varepsilon_i,\] where \(a\) (intercept) and \(b\) (slope) are model parameters that define the line, and the error term \(\varepsilon_i\) is the difference between the observed value \(y_i\) and the value predicted by the line: \(\varepsilon_i = y_i - (a + b x_i)\). Any pair of parameters \(a\) and \(b\) defines one line (namely \(y = a + bx\)) in X-Y coordinates, and among all those lines we will choose the one that best fits the observations, as described next. We denote the estimates by \(\widehat{a}\) and \(\widehat{b}\) and the prediction for observation \(i\) by \(\widehat{y}_i = \widehat{a} + \widehat{b}x_i.\) The estimates are chosen by the least squares method, which minimizes the residual sum of squares \[\textrm{RSS} = \sum_{i=1}^n(y_i-\widehat{y}_i)^2 = \sum_{i=1}^n(y_i-(\widehat{a}+\widehat{b}x_i))^2.\]
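In this simple case the least squares estimates also have a closed form, \(\widehat{b} = \textrm{cov}(x,y)/\textrm{var}(x)\) and \(\widehat{a} = \overline{y} - \widehat{b}\,\overline{x}\), which we can sketch directly before turning to R's model-fitting function below:
b.hat = cov(y$height, y$weight) / var(y$height) #slope estimate
a.hat = mean(y$weight) - b.hat * mean(y$height) #intercept estimate
c(a = a.hat, b = b.hat) #should match the coefficients from lm() below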
In R, the linear model is estimated using the lm() function, where the formula y ~ x says "regress y on x". The regression line can be added to a figure with the abline() function.
lm.1 = lm( weight ~ height, data = y)
plot(y$height, y$weight, pch = 20, ylab = "weight (kg)", xlab = "height (cm)", col="gray")
abline(lm.1, col="red", lwd=1.5)
summary(lm.1)
##
## Call:
## lm(formula = weight ~ height, data = y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.650 -5.419 -0.576 4.857 42.887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -130.74698 11.56271 -11.31 <2e-16 ***
## height 1.14922 0.06769 16.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.523 on 197 degrees of freedom
## Multiple R-squared: 0.594, Adjusted R-squared: 0.592
## F-statistic: 288.3 on 1 and 197 DF, p-value: < 2.2e-16
In the output of summary(), the Coefficients table gives the parameter estimates together with their standard errors, t-values and P-values, and the last rows report the residual standard error and the R-squared values.
In this case, the linear model on height explains almost 60% of the variation in weight, and therefore about 40% of the variation in weight cannot be explained by (a linear effect of) height alone. Note that in the case of simple linear regression with one predictor (here height), R-squared is exactly the square of the correlation between X and Y.
cor(y$weight, y$height)^2
## [1] 0.5940256
summary(lm.1)$r.squared
## [1] 0.5940256
This explains why the correlation coefficient is a measure of the strength of a linear association between variables: when \(r=\pm1\), the linear regression model explains all variation in Y, and hence the observations lie on a single line in X-Y coordinates.
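A tiny numeric illustration of this point (x.tmp and y.tmp are arbitrary made-up values):
x.tmp = seq(0, 1, length = 50)
y.tmp = 2 - 3 * x.tmp #observations fall exactly on a line with a negative slope
cor(x.tmp, y.tmp) #exactly -1, so a linear model would explain all variation in y.tmp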
Let's see what components we have in the lm object:
names(lm.1)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
We can get both the fitted values (lm.1$fitted.values) and the residuals (lm.1$residuals) for each individual from the regression model object. For example, let's find the individuals for whom the predicted weight is more than 20 kg off.
ii = which(abs(lm.1$residuals) > 20) #returns row indexes for which condition is TRUE
cbind(y[ii,2:4], lm.1$fitted.values[ii], lm.1$residuals[ii])
## sex weight height lm.1$fitted.values[ii] lm.1$residuals[ii]
## 20 M 119 180 76.11303 42.88697
## 29 M 101 183 79.56070 21.43930
## 53 M 102 185 81.85914 20.14086
## 96 M 103 185 81.85914 21.14086
## 191 M 89 173 68.06848 20.93152
We notice two things. First, there is one outlier case where weight is 119 kg whereas it is predicted to be 76 kg based on the height of 180 cm. Second, all of the worst predictions happened for males. Indeed, it is unlikely that a single linear regression model would be good for both males and females.
Let's add sex as another predictor to the model to allow for different mean weights in males and females. Since sex is of type "character" with two values ("M" and "F"), R automatically treats it as a factor in the regression model. For factor variables, the regression model estimates a different baseline value for each level of the factor (here males and females).
lm.2 = lm(weight ~ height + sex, data = y)
summary(lm.2)
##
## Call:
## lm(formula = weight ~ height + sex, data = y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.131 -4.889 -0.404 5.205 41.490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -76.63620 15.75543 -4.864 2.36e-06 ***
## height 0.81072 0.09555 8.485 5.24e-15 ***
## sexM 8.21621 1.71726 4.784 3.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.086 on 196 degrees of freedom
## Multiple R-squared: 0.6365, Adjusted R-squared: 0.6328
## F-statistic: 171.6 on 2 and 196 DF, p-value: < 2.2e-16
Now our model says that, after we account for the difference in average weight between males and females, each cm in height corresponds to +0.81 kg in weight (it was +1.15 kg when sex wasn't accounted for). Additionally, we see that, for the same height, a male weighs on average 8.2 kg more than a female. This model is a better description of the data, as the adjusted R-squared has increased to 63% from 59% in model lm.1.
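To visualize what lm.2 says, we can draw its two parallel lines (common slope, sex-specific intercepts) on the scatter plot; a minimal sketch:
a = coef(lm.2) #named components: "(Intercept)", "height", "sexM"
plot(y$height, y$weight, pch = 20, ylab = "weight (kg)", xlab = "height (cm)", col = "gray")
abline(a["(Intercept)"], a["height"], col = "magenta", lwd = 1.5) #females (the baseline level)
abline(a["(Intercept)"] + a["sexM"], a["height"], col = "cyan", lwd = 1.5) #males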
But are there differences in the height-weight slopes between the sexes? Let’s simply fit separate linear models in males and in females and check the slopes.
males = y$sex == "M"
lm.m = lm(weight ~ height, data = y[males,])
lm.f = lm(weight ~ height, data = y[!males,])
We see that the slope in females is
b = summary(lm.f)$coeff[2,1] #coeff matrix has one row for each parameter
s = summary(lm.f)$coeff[2,2] #row 2 is for slope and row 1 for intercept
cbind(b=b, low95 = b-1.96*s, up95 =b+1.96*s)
## b low95 up95
## [1,] 0.6229425 0.4276806 0.8182044
And in males
b = summary(lm.m)$coeff[2,1]
s = summary(lm.m)$coeff[2,2]
cbind(b=b, low95 = b-1.96*s, up95 = b+1.96*s)
## b low95 up95
## [1,] 0.9955981 0.6670175 1.324179
It seems like the slope in females might be considerably smaller than in males, but since the 95% CIs are quite wide, this cannot be concluded reliably from these data alone.
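A standard way to assess the difference in slopes within a single model is to include an interaction term between height and sex; a sketch:
lm.int = lm(weight ~ height * sex, data = y) #expands to height + sex + height:sex
summary(lm.int)$coeff["height:sexM", ] #slope difference (males - females), with its SE, t and P-value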
Let’s see how much the regression model explains within each sex.
par(mfrow = c(1,2))
plot(y[males,"height"], y[males,"weight"], pch = 20, ylim = c(35, 120), xlim = c(140, 200),
ylab = "weight (kg)", xlab = "height (cm)", col="gray")
abline(lm.m, col = "cyan", lwd = 1.5)
title(paste0("MALES R2=",signif(summary(lm.m)$r.squared,3)))
plot(y[!males,"height"], y[!males,"weight"], pch = 20, ylim = c(35, 120), xlim = c(140, 200),
ylab = "weight (kg)", xlab = "height (cm)", col = "gray")
abline(lm.f, col = "magenta", lwd = 1.5)
title(paste0("FEMALES R2=", signif(summary(lm.f)$r.squared,3)))
We see a dramatic decrease in the variance explained after we have adjusted the analyses for sex. This is because there is a strong difference in both the height and weight distributions between males and females, and therefore a large part of the explanatory power of the original model was the effect of sex. After sex is accounted for, height explains about 25-30% of the variation in weight. This shows how important the additional variables, also called covariates, can be for the interpretation of the regression model parameters. We talk more about these effects of covariate adjustment later in this document.
Exercises 2
1. Generate 10 values for \(x\) from the interval \((-1,\ldots,1)\) and generate corresponding values for \(y\) using rnorm(10, x, sigma), where sigma sequentially takes the values 0.1, 1 and 10. For each value of sigma, plot the points in X-Y coordinates, fit the linear model and add the fitted line to the picture. Use the summary() command to see whether there is a statistically significant linear relationship between X and Y in each data set.
2. Make a scatter plot with weight on the x-axis and repwt on the y-axis. Fit a linear model regressing repwt on weight. Add the regression line to the figure. What is the slope of the model? What is the value of repwt for individual 19, and what is the fitted value of repwt based on the model for individual 19? What is the residual error for individual 19?
Let's use the predict() function to predict the weight of two new individuals who are 160 cm and 180 cm tall. We need to give the input data in the same kind of data.frame that was used in the original model fit with the lm() function. The uncertainty of the prediction can be requested at the level of an individual value (interval = "prediction") or as a confidence interval for the population average at the given predictor values (interval = "confidence"). Typically, we are interested in predicting new observations, and then we use "prediction"; the uncertainty is then larger than when predicting the population mean with the "confidence" option.
data.in = data.frame(height = c(160,180) )
predict(lm.f, newdata = data.in, interval = "prediction") #for females of 160 and 180 cm
## fit lwr upr
## 1 53.96238 42.10332 65.82143
## 2 66.42123 54.21885 78.62361
predict(lm.m, newdata = data.in, interval = "pred") #for males of 160 and 180 cm
## fit lwr upr
## 1 57.96565 36.95574 78.97555
## 2 77.87761 57.73251 98.02271
We see that a female of 180 cm is predicted at 66 kg with 95% interval (54, 79), whereas a male of the same height is predicted at 78 kg with 95% interval (58, 98).
Exercises 3
1. What is the predicted weight of a male who is 170 cm tall? What is the predicted population average weight of all males who are 170 cm tall? Give the corresponding 95% intervals in both cases.
2. Fit a linear model regressing height on weight in females. What is the predicted height of a woman who weighs 60 kg?
How do we know whether a linear model is an appropriate model for our data set? First, we should consider this conceptually: is there a reason to think that a linear relationship between the variables is appropriate for the research question we have in mind? If the answer is yes, then we should ask how well the linear model fits the data and whether the modeling assumptions are valid. The main assumptions of the linear model are: (1) the relationship between X and Y is linear; (2) the errors are independent of each other; (3) the errors have a constant variance (homoscedasticity); and (4) the errors are approximately Normally distributed (needed for the standard inference).
There are four standard plots that we can use to assess these assumptions, and they are made by calling plot() on the regression model object. Let's first check how our first model, without an adjustment for sex, behaves in these diagnostic plots:
par(mfrow = c(2,2)) #plot diagnostics to 2x2 area
plot(lm.1)
To interpret these plots, see https://data.library.virginia.edu/diagnostic-plots/. For our test data, the diagnostics look quite good, except that observation 20 seems a bit of an outlier from the rest. However, the lower right plot shows that it does not have a large influence on the fitted line (as explained in the link above), and therefore we are not that concerned about it.
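To see what a clear violation looks like, we can simulate data where the error variance grows with x (the names x.sim and y.sim are made up for this sketch) and draw the same diagnostic plots:
set.seed(1)
x.sim = runif(200, 1, 10)
y.sim = 1 + 0.5 * x.sim + rnorm(200, 0, 0.3 * x.sim) #error SD grows with x: heteroscedasticity
par(mfrow = c(2,2))
plot(lm(y.sim ~ x.sim)) #the Scale-Location plot should show an increasing trend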
Francis Anscombe generated four data sets of 11 observations each that have similar summary statistics (means, variances and correlations) but look very different when plotted. His point was to emphasize the importance of plotting the data whenever possible. Here we use these four sets to assess how a linear model fits different data sets. Let's start by plotting the data sets, fitting the linear models and computing the correlations and slopes.
par(mfrow = c(2,2))
lm.ans = list() #an empty list where the fitted linear models are collected
for(ii in 1:4){
  x = anscombe[,ii] #x-coordinates for Anscombe's data set ii
  y = anscombe[,4+ii] #y-coordinates
  lm.ans[[ii]] = lm(y ~ x)
  plot(x, y, main = paste("Anscombe",ii), pch = 19, xlab = "x", ylab = "y",
       sub = paste0("r=", signif(cor(x,y),3), " b=", signif(coefficients(lm.ans[[ii]])[2],3)))
  abline(lm.ans[[ii]], col = "red", lwd = 1.5)
}
We see that the data sets are identical with respect to x-y correlation and the linear model slope, but we also see that the data sets are very different in other respects, and, in particular, the linear model does not seem appropriate for the data sets 2, 3 and 4. Let’s see the diagnostic plots for these linear models.
par(mfrow = c(4,4))
par(mar = c(3,3,2,1))
for(ii in 1:4){plot(lm.ans[[ii]])}
## Warning: not plotting observations with leverage one:
## 8
Question: Find the problematic patterns in the diagnostic plots on rows 2, 3 and 4 that indicate that the linear model is not appropriate for those data sets.