In this lecture, we study the relationship between two or more variables. First, we establish how to quantify the strength of the linear relationship between continuous variables, and then we learn how to use the relationship to predict the value of an unknown variable given the values of observed variables.

What is a linear relationship?

Mathematically, two variables \(X\) and \(Y\) are linearly related if there are some numbers \(a\) and \(b\) such that \(Y = a + b \cdot X\). Here, \(b\) is the coefficient that links the changes in \(X\) to changes in \(Y\): A change of one unit in variable \(X\) always corresponds to a change of \(b\) units in variable \(Y\). Additionally, \(a\) is the value that allows a shift between the ranges of \(X\) and \(Y\).

Let’s plot three linear relationships with parameters \(a = 0, b = 2\) in green, \(a = 1, b = -1\) in orange and \(a = -1, b = 0\) in blue. Let’s use 5 points to demonstrate these three lines when the x-coordinate is between -1 and 1.

n = 5
x = seq(-1, 1, length = n) 
y = 0 + 2*x
plot(x, y, t = "b", col = "darkgreen", lwd = 2) #t="b" uses "b"oth lines and points 
grid() #put a grid on the background
y = 1 + (-1)*x
lines(x, y, t = "b", col = "orange", lwd = 2) # add line to the existing plot
y = -1 + 0*x
lines(x, y, t = "b", col = "blue", lwd = 2) # add line to the existing plot

We see that depending on the sign of \(b\), the line is either increasing (\(b = 2\), green), decreasing (\(b = -1\), orange), or flat (\(b = 0\), blue). We call \(b\) the slope (kulmakerroin) and \(a\) the intercept (vakiotermi) of the linear relationship.

Example. Friedewald’s formula is an example of a linear relationship. It tells how to estimate LDL cholesterol values from total cholesterol, HDL cholesterol and triglycerides (when measured in mmol/l): \[\text{LDL}\approx \text{TotalC} - \text{HDL} - 0.45\cdot \text{TriGly}.\]
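As a concrete example, suppose an individual’s total cholesterol were 5.0, HDL cholesterol 1.3 and triglycerides 1.0 mmol/l (hypothetical values chosen only for illustration). Then the formula would be applied as follows.

total.c = 5.0 #total cholesterol in mmol/l (hypothetical value)
hdl = 1.3 #HDL cholesterol in mmol/l (hypothetical value)
trigly = 1.0 #triglycerides in mmol/l (hypothetical value)
total.c - hdl - 0.45 * trigly #estimated LDL by Friedewald's formula: 3.25 mmol/l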

In practice, we never observe perfect linear relationships between measurements. Rather, we observe relationships that are linear only to some degree, and that are further diluted by noise in the measurements. We can model such imperfect linear relationships by adding Normally distributed random variation on top of the perfect linear relationships of the previous figure. Let’s add the most noise to the green line and the least to the blue line. The amount of noise is determined by the standard deviation (SD) of the Normal variable that is added on top of the perfect linear relationship: a larger SD means a more noisy observation of the underlying line. Here we use SDs of 0.8 (green), 0.4 (orange) and 0.1 (blue).

n = 50
x = seq(-1, 1, length = n) 
y = 0 + 2*x + rnorm(n, 0, 0.8)
plot(x, y, t = "p", col = "darkgreen", lwd = 2)  
grid() 
y = 1 + (-1)*x + rnorm(n, 0, 0.4)
points(x, y, col = "orange", lwd = 2) # add points to the existing plot
y = -1 + 0*x + rnorm(n, 0, 0.1)
points(x, y, col = "blue", lwd = 2) # add points to the existing plot

In this figure, the quality of the linear model as an explanation of the relationship between \(X\) and \(Y\) varies between the three cases (blue best, green worst). Next, we want to quantify such differences.

Correlation

We read in a data set of heights and weights of 199 individuals (88 males and 111 females). This dataset originates from http://vincentarelbundock.github.io/Rdatasets/doc/carData/Davis.html. You can download it from the course webpage. We will call it measures.

measures = read.table("Davis_height_weight.txt", as.is = T, header = T) #using 'T' as a shorthand for 'TRUE'
head(measures)
##   X sex weight height repwt repht
## 1 1   M     77    182    77   180
## 2 2   F     58    161    51   159
## 3 3   F     53    161    54   158
## 4 4   M     68    177    70   175
## 5 5   F     59    157    59   155
## 6 6   M     76    170    76   165

The last two columns are self-reported values of weight and height. Let’s plot weight against height.

plot(measures$height, measures$weight, pch = 20, #pch = 20 means a solid round symbol
     ylab = "weight (kg)", xlab = "height (cm)") 

Unsurprisingly, there is a clear pattern where taller individuals weigh more. To quantify the (linear part of the) relationship, we compute Pearson’s correlation coefficient between the variables. This happens in two parts: first compute the covariance between the variables, and then scale the covariance by the variability of both variables to get a dimensionless correlation coefficient.

Covariance measures the amount of variation in the variables that is linearly shared between the variables. Technically, it is the average product of deviations from the means of the variables, and from a sample it is computed as \[\textrm{cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y}), \textrm{ where } \overline{x}, \overline{y} \textrm{ are the means of } x_i \textrm{ and } y_i, \textrm{ respectively}.\] When both \(X\) and \(Y\) tend to be above their mean values simultaneously, the covariance is positive, whereas the covariance is negative if, when \(X\) is above its mean, \(Y\) tends to be below its mean. In other words, covariance is positive when \(X\) and \(Y\) tend to increase together and decrease together, and negative when they tend to move in opposite directions. A covariance of 0 says that there is no linear relationship between \(X\) and \(Y\). Note that if we compute the covariance between a variable and itself, that is, \(X = Y\) in the formula above, the result is simply the variance of that one variable. Thus, covariance is a generalization of the concept of variance to two variables.
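To connect the formula to R, here is a small check, using the height and weight columns of measures, that computes the covariance directly from the formula above and compares it to R’s built-in cov() function; the two numbers should be identical.

x = measures$height
y = measures$weight
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1) #covariance from the formula above
cov(x, y) #the built-in function gives the same value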

The correlation coefficient results when the covariance is normalized by the product of the standard deviations of the variables: \[\textrm{cor}(X,Y) = \frac{\textrm{cov}(X,Y)}{\textrm{SD}(X) \textrm{SD}(Y)}.\] Correlation is always between -1 and +1, and it denotes the strength of the linear relationship between the variables. If the correlation is +1, then the values of \(X\) and \(Y\) are on a line that has a positive slope, and if the correlation is -1, then \(X\) and \(Y\) are on a line that has a negative slope. When the correlation is 0, there is no linear association between the variables. (See Figures from Wikipedia.) Note that a correlation of 1 between \(X\) and \(Y\) does not necessarily mean that \(Y = X\), but only that there is a perfect linear relationship of the form \(Y = a + b\cdot X\) for some constants \(a\) and \(b > 0\), and any such linear relationship leads to a correlation of 1.
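As a quick sketch of the normalization, we can check that dividing the covariance of height and weight by the product of their standard deviations reproduces the value returned by cor(), and that a perfect linear relationship with a positive slope indeed gives a correlation of 1 (the constants 10 and 2 below are arbitrary).

cov(measures$height, measures$weight) / (sd(measures$height) * sd(measures$weight)) #same as cor()
cor(measures$height, measures$weight)
cor(measures$height, 10 + 2*measures$height) #perfect linear relationship with b > 0 gives correlation 1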

Let’s compute an estimate \(\widehat{r}\) of the correlation between height and weight based on our sample of \(n=199\) observations.

r = cor(measures$height, measures$weight) #correlation is symmetric cor(weight,height) = cor(height,weight)
r
## [1] 0.7707306

A correlation of ~0.8 is quite high. To get a confidence interval for the correlation coefficient, we can use the r.con() function from the psych package. (Install the psych package with the command install.packages("psych") first if RStudio hasn’t done it automatically for you.) Let’s see what the 95% CI is.

n = nrow(measures) #sample size is the number of rows of measures
library(psych) #DO FIRST: install.packages("psych")
r.res = r.con(r, n, p = 0.95, twotailed = TRUE) #returns lower and upper bounds of CI
c(r = r, low95CI = r.res[1], up95CI = r.res[2])
##         r   low95CI    up95CI 
## 0.7707306 0.7074835 0.8217303

In this case, r.con() gives the 95% CI as (0.707, 0.822), so we can conclude that the height-weight correlation is about 0.7-0.8 in this population.
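As a cross-check, base R’s cor.test() also reports a 95% confidence interval for Pearson’s correlation (together with a P-value for the null hypothesis of zero correlation), and it should give essentially the same interval as r.con().

cor.test(measures$height, measures$weight) #prints the correlation estimate, a 95% CI and a P-value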

Examples 6.1

  1. Make a scatter plot of weight on x-axis and repwt on y-axis. Make a similar plot for height and repht. Do the reported values seem highly correlated with the measured values?

Let’s plot them next to each other by splitting the plotting area into 1 row and 2 columns with par(mfrow = c(1,2)).

par(mfrow = c(1,2))
plot(measures$weight, measures$repwt, pch = 20) #pch = 20 means a solid round symbol
plot(measures$height, measures$repht, pch = 20)
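To put numbers on the visual impression, we could also compute the correlations between the measured and the self-reported values. (If the self-reported columns contain missing values, as in the original Davis data, the argument use = "complete.obs" tells cor() to ignore the incomplete pairs; otherwise it has no effect.)

cor(measures$weight, measures$repwt, use = "complete.obs") #ignore possible missing values
cor(measures$height, measures$repht, use = "complete.obs")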