In this lecture we study the relationship between two or more variables. First we establish how to quantify the strength of the linear relationship between continuous variables, and then we learn how to use that relationship to predict the value of an unknown variable given the values of the observed variables.

1. Correlation

We read in a data set of heights and weights of 199 individuals (88 males and 111 females). This dataset originates from http://vincentarelbundock.github.io/Rdatasets/doc/carData/Davis.html.

y = read.table("https://www.mv.helsinki.fi/home/mjxpirin/HDS_course/material/Davis_height_weight.txt",as.is = T, header=T)
head(y)
##   X sex weight height repwt repht
## 1 1   M     77    182    77   180
## 2 2   F     58    161    51   159
## 3 3   F     53    161    54   158
## 4 4   M     68    177    70   175
## 5 5   F     59    157    59   155
## 6 6   M     76    170    76   165

The last two columns are self-reported values of weight and height. Let’s plot weight against height.

plot(y$height, y$weight, pch = 20, ylab = "weight (kg)", xlab = "height (cm)")

Unsurprisingly, there is a clear pattern where taller individuals weigh more. To quantify the (linear part of the) relationship, we compute the Pearson correlation coefficient between the variables. This is done in two steps: compute the covariance between the variables and then scale the covariance by the variability of both variables to get a dimensionless correlation coefficient.

Covariance measures the amount of variation in the variables that is linearly shared between the variables. Technically, it is the average product of the deviations from the means of the variables, and from a sample it is computed as \[\textrm{cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y}) \textrm{, where } \overline{x}, \overline{y} \textrm{ are the means of the } x_i\textrm{s and } y_i\textrm{s, respectively}.\] When X and Y tend to be above their mean values simultaneously, the covariance is positive, whereas the covariance is negative if, when X is above its mean, Y tends to be below its mean. A covariance of 0 says that there is no linear relationship between X and Y. Note that if we compute the covariance of a variable with itself, that is, set \(X=Y\) in the formula above, the result is simply the variance of that one variable. Thus, covariance is a generalization of the concept of variance to two variables.
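
To make the formula concrete, here is a minimal sketch (the names ht and wt are introduced only for this illustration) that computes the covariance of height and weight directly from the deviations and compares it with R’s built-in cov() function:

ht = y$height; wt = y$weight
sum( (ht - mean(ht)) * (wt - mean(wt)) ) / (length(ht) - 1) #average product of deviations
cov(ht, wt) #should give the same value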

The correlation coefficient results when the covariance is normalized by the product of the standard deviations of the variables: \[\textrm{cor}(X,Y) = \frac{\textrm{cov}(X,Y)}{\textrm{SD}(X) \textrm{SD}(Y)}.\] Correlation is always between -1 and +1, and it denotes the strength of the linear relationship between the variables. If the correlation is +1, then the values of X and Y are on a line that has a positive slope, and if the correlation is -1, then X and Y are on a line that has a negative slope. When the correlation is 0, there is no linear association between the variables. (See Figures from Wikipedia.) Note that a correlation of 1 between X and Y does not necessarily mean that \(Y = X\), but only that there is a linear relationship of the form \(Y = a + bX\) for some constants \(a\) and \(b > 0\), and any such linear relationship leads to a correlation of 1.
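
As a quick sanity check of these properties, here is a small sketch where the values of \(a\) and \(b\) are arbitrary and chosen only for illustration:

x = seq(-5, 5, length = 100) #equally spaced values
cor(x, 2 + 3 * x) #exact linear relation with positive slope gives +1
cor(x, 2 - 3 * x) #exact linear relation with negative slope gives -1
cor(x, rnorm(100)) #independent noise gives a value near 0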

Suppose we have an estimate \(\widehat{r}\) of the correlation between X and Y, based on a sample of \(n\) observations (\(n=199\) in the example above). To do statistical inference with the correlation coefficient, we need to know the standard error of the estimate. An approximation for Normally distributed data is \[\textrm{SD}(\widehat{r}) \approx \sqrt{\frac{1-\widehat{r}^2}{n-2}}.\] This is accurate when the correlation is not far from 0 (say \(|r|<0.5\)), but becomes less accurate near \(\pm 1\). For more accurate confidence intervals you can use r.con() from the package psych.

We expect a clear positive correlation between weight and height based on the Figure above. Let’s see what it is.

n = nrow(y) #sample size
r = cor(y$height, y$weight) #correlation is symmetric cor(weight,height) = cor(height,weight)
se = sqrt((1-r^2)/(n-2)) 
library(psych) #DO FIRST: install.packages("psych")
ival = r.con(r, n, p = 0.95, twotailed = TRUE)
cbind(r = r, low95CI = ival[1], up95CI = ival[2], 
      low95CI.norm.appr = r-1.96*se, up95CI.norm.appr = r+1.96*se)
##              r   low95CI    up95CI low95CI.norm.appr up95CI.norm.appr
## [1,] 0.7707306 0.7074835 0.8217303         0.6817547        0.8597065

In this case, r.con() gives the 95% CI as (0.707, 0.822), showing some difference from the less accurate normal approximation (0.682, 0.860).
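
For intuition on where the more accurate interval comes from: r.con() is based on Fisher’s z-transformation (an assumption here; see the documentation of the psych package for details). A sketch of the same computation by hand, using the r and n defined above:

z = atanh(r) #Fisher's z-transformation: 0.5*log((1+r)/(1-r))
z.ci = z + c(-1, 1) * 1.96 / sqrt(n - 3) #normal approximation is accurate on the z scale
tanh(z.ci) #back-transform to the correlation scale; close to r.con()'s interval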

Exercises 1

  1. Plot a scatter plot with weight on the x-axis and repwt on the y-axis. Make a similar plot for height and repht. Do the reported values seem highly correlated with the measured values?

  2. Compute correlation between weight and repwt (self-reported weight) and between height and repht (self-reported height). Note that you need to use use = "complete.obs" to get rid of NAs.

  3. Compute a correlation matrix of the four variables weight, height, repwt and repht by giving cor() function those four columns of the data matrix as input. Note that you need to use use = "pairwise" in cor() to get rid of NA values.

  4. Plot the curve \(y=x^2\) in the range \(x \in (-1,\ldots,1)\) using 1000 equally spaced values for \(x\). Guess what the correlation between \(x\) and \(y\) would be, and then compute it.

  5. Plot the curve \(y=x^3\) in the range \(x \in (-1,\ldots,1)\) using 1000 equally spaced values for \(x\). Guess what the correlation between \(x\) and \(y\) would be, and then compute it.

2. Linear regression

The correlation coefficient \(r\) describes the strength of the linear relationship between two variables, and both variables are treated symmetrically in its definition. However, we may also want to use the linear relationship to predict the value of Y given that we know the value of X. For example, we may use our observed population above to make a statistical prediction of the weights of individuals who are exactly 170 cm tall. From the figure above we see that such individuals have weights roughly in the range from 50 kg to 80 kg. To characterize the relationship more mathematically, we want to fit a line through the observed data that describes how Y depends on X.

The linear regression model assumes that \[y_i = a + b x_i + \varepsilon_i,\] where \(a\) (intercept, vakio) and \(b\) (slope, kulmakerroin) are model parameters that define the line, and the error term \(\varepsilon_i\) is the difference between the observed value \(y_i\) and the value predicted by the line: \(\varepsilon_i = y_i - (a + b x_i)\). Any pair of parameters \(a\) and \(b\) defines one line (namely \(y = a + bx\)) in the X-Y coordinates, and among all those lines we will choose the one that best fits the observations, as described next. We denote the estimates by \(\widehat{a}\) and \(\widehat{b}\), and the prediction for observation \(i\) by \(\widehat{y}_i = \widehat{a} + \widehat{b}x_i.\) The estimates are chosen by the least squares method, which minimizes the residual sum of squares \[\textrm{RSS} = \sum_{i=1}^n(y_i-\widehat{y}_i)^2 = \sum_{i=1}^n(y_i-(\widehat{a}+\widehat{b}x_i))^2.\]
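
For simple linear regression, the minimization of RSS has a well-known closed-form solution, \(\widehat{b} = \textrm{cov}(x,y)/\textrm{var}(x)\) and \(\widehat{a} = \overline{y} - \widehat{b}\overline{x}\) (a standard result, not derived here). As a minimal sketch, we can apply it to our data and compare with R’s own estimation routine used below:

b.hat = cov(y$height, y$weight) / var(y$height) #slope estimate
a.hat = mean(y$weight) - b.hat * mean(y$height) #intercept estimate
c(a.hat = a.hat, b.hat = b.hat) #should match the coefficients reported by lm() below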

In R, the linear model is estimated using the lm() function, where the formula y ~ x says “regress y on x”. The regression line can be added to a figure with the abline() function.

lm.1 = lm( weight ~ height, data = y)
plot(y$height, y$weight, pch = 20, ylab = "weight (kg)", xlab = "height (cm)", col="gray")
abline(lm.1, col="red", lwd=1.5)

summary(lm.1)
## 
## Call:
## lm(formula = weight ~ height, data = y)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.650  -5.419  -0.576   4.857  42.887 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -130.74698   11.56271  -11.31   <2e-16 ***
## height         1.14922    0.06769   16.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.523 on 197 degrees of freedom
## Multiple R-squared:  0.594,  Adjusted R-squared:  0.592 
## F-statistic: 288.3 on 1 and 197 DF,  p-value: < 2.2e-16

In the output of the summary, the Coefficients table gives the parameter estimates together with their standard errors, t-statistics and P-values, and below it are listed the residual standard error, the (adjusted) R-squared and the overall F-statistic of the model.

In this case, the linear model on height explains almost 60% of the variation in weight, and therefore about 40% of the variation in weight cannot be explained by (a linear effect of) height alone. Note that in the case of simple linear regression with one predictor (here height), R-squared is exactly the square of the correlation between X and Y.

cor(y$weight, y$height)^2
## [1] 0.5940256
summary(lm.1)$r.squared
## [1] 0.5940256

This explains why the correlation coefficient is a measure of the strength of a linear association between variables: when \(r=\pm1\), the linear regression model explains all variation in Y, and hence the observations fall on a single line in the X-Y coordinates.

Let’s see what components we have in the lm object:

names(lm.1)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

We can get both the fitted values (lm.1$fitted.values) and the residuals (lm.1$residuals) for each individual from the regression model object. For example, let’s find the individuals for whom the predicted weight is more than 20 kg off.

ii = which(abs(lm.1$residuals) > 20) #returns row indexes for which condition is TRUE
cbind(y[ii,2:4], lm.1$fitted.values[ii], lm.1$residuals[ii])
##     sex weight height lm.1$fitted.values[ii] lm.1$residuals[ii]
## 20    M    119    180               76.11303           42.88697
## 29    M    101    183               79.56070           21.43930
## 53    M    102    185               81.85914           20.14086
## 96    M    103    185               81.85914           21.14086
## 191   M     89    173               68.06848           20.93152

We notice two things. First, there is one outlier case where the weight is 119 kg whereas it is predicted to be 76 kg based on the height of 180 cm. Second, all of the worst predictions happened for males. Indeed, it is unlikely that the same linear regression model is a good description of both males and females.

Let’s add sex as another predictor to the model to allow for different mean weights in males and females. Since sex is of type “character” with two values (“M” and “F”), R automatically treats it as a factor in the regression model. For factor variables, the regression model gets a different baseline value for each level of the factor (here males and females).

lm.2 = lm(weight ~ height + sex, data = y) 
summary(lm.2)
## 
## Call:
## lm(formula = weight ~ height + sex, data = y)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.131  -4.889  -0.404   5.205  41.490 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -76.63620   15.75543  -4.864 2.36e-06 ***
## height        0.81072    0.09555   8.485 5.24e-15 ***
## sexM          8.21621    1.71726   4.784 3.37e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.086 on 196 degrees of freedom
## Multiple R-squared:  0.6365, Adjusted R-squared:  0.6328 
## F-statistic: 171.6 on 2 and 196 DF,  p-value: < 2.2e-16

Now our model says that, after we account for the difference in average weight between males and females, each cm in height corresponds to 0.81 kg in weight (it was 1.15 kg when sex wasn’t accounted for). Additionally, we see that for the same height, a male weighs on average 8.2 kg more than a female. This model is a better description of the data, as the adjusted R-squared has increased to 63% from 59% in model lm.1.
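
As an illustration, the fitted model can be used to predict the average weight at a given height for each sex with base R’s predict() function (the height of 170 cm is an arbitrary example value):

#predicted average weights for a male and a female who are 170 cm tall
predict(lm.2, newdata = data.frame(height = 170, sex = c("M", "F")))
#the two predictions differ by the sexM coefficient, about 8.2 kg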

But are there differences in the height-weight slope between the sexes? Let’s simply fit separate linear models in males and in females and compare the slopes.

males = y$sex == "M"
lm.m = lm(weight ~ height, data = y[males,])
lm.f = lm(weight ~ height, data = y[!males,])

We see that the slope in females is

b = summary(lm.f)$coeff[2,1] #coeff matrix has one row for each parameter
s = summary(lm.f)$coeff[2,2] #row 2 is for slope and row 1 for intercept
cbind(b=b, low95 = b-1.96*s, up95 =b+1.96*s)
##              b     low95      up95
## [1,] 0.6229425 0.4276806 0.8182044

And in males

b = summary(lm.m)$coeff[2,1]
s = summary(lm.m)$coeff[2,2]
cbind(b=b, low95 = b-1.96*s, up95 = b+1.96*s)
##              b     low95     up95
## [1,] 0.9955981 0.6670175 1.324179

It seems that the slope in females might be considerably smaller than in males, but since the 95% CIs are quite wide, this cannot be concluded reliably from these data alone.
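
A natural follow-up, not pursued further here, is to fit both sexes jointly with an interaction term between height and sex; the coefficient height:sexM then directly estimates the difference between the male and female slopes (the name lm.int is introduced only for this sketch):

lm.int = lm(weight ~ height * sex, data = y) #expands to height + sex + height:sex
summary(lm.int)$coefficients["height:sexM", ] #estimated slope difference (M - F) with its SE and P-value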

Let’s see how much the regression explains within each sex.

par(mfrow=c(1,2))
plot(y[males,"height"], y[males,"weight"], pch = 20, ylim = c(35,120), xlim=c(140,200),
     ylab = "weight (kg)", xlab = "height (cm)", col="gray")
abline(lm.m, col="cyan", lwd=1.5)
title(paste0("MALES R2=",signif(summary(lm.m)$r.squared,3)))
plot(y[!males,"height"], y[!males,"weight"], pch = 20, ylim = c(35,120), xlim=c(140,200),
     ylab = "weight (kg)", xlab = "height (cm)", col="gray")
abline(lm.f, col="magenta", lwd=1.5)
title(paste0("FEMALES R2=",signif(summary(lm.f)$r.squared,3)))