In this lecture, we study the relationship between two or more variables. First, we establish how to quantify the strength of a linear relationship between continuous variables, and then we learn how to use that relationship to predict the value of an unknown variable given the values of observed variables.

Mathematically, two variables \(X\) and \(Y\) are linearly related if there are some numbers \(a\) and \(b\) such that \(Y = a + b \cdot X\). Here, \(b\) is the coefficient that links changes in \(X\) to changes in \(Y\): a change of one unit in variable \(X\) always corresponds to a change of \(b\) units in variable \(Y\). The constant \(a\) shifts the level of \(Y\): it is the value of \(Y\) when \(X = 0\).

Let’s plot three linear relationships: with parameters \(a = 0, b = 2\) in green, \(a = 1, b = -1\) in orange, and \(a = -1, b = 0\) in blue. Let’s use 5 points to draw each of the three lines for x-coordinates between -1 and 1.

```
n = 5
x = seq(-1, 1, length = n)
y = 0 + 2*x
plot(x, y, t = "b", col = "darkgreen", lwd = 2) #t="b" uses "b"oth lines and points
grid() #put a grid on the background
y = 1 + (-1)*x
lines(x, y, t = "b", col = "orange", lwd = 2) # add line to the existing plot
y = -1 + 0*x
lines(x, y, t = "b", col = "blue", lwd = 2) # add line to the existing plot
```

We see that depending on the sign of \(b\), the line is either increasing (\(b = 2\), green), decreasing (\(b = -1\), orange), or flat (\(b = 0\), blue). We call \(b\) the **slope** (kulmakerroin) and \(a\) the **intercept** (vakiotermi) of the linear relationship.

**Example.** Friedewald’s formula is an example of a
linear relationship. It tells how to estimate LDL cholesterol values
from total cholesterol, HDL cholesterol and triglycerides (when measured
in mmol/l): \[\text{LDL}\approx \text{TotalC}
- \text{HDL} - 0.45\cdot \text{TriGly}.\]
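As a quick illustration, here is the formula in R with made-up example values (the numbers below are hypothetical, not from any real measurement):

```
totalc = 5.2 # total cholesterol (mmol/l), hypothetical value
hdl = 1.3 # HDL cholesterol (mmol/l), hypothetical value
trigly = 1.4 # triglycerides (mmol/l), hypothetical value
ldl = totalc - hdl - 0.45 * trigly # Friedewald's formula
ldl # about 3.27 mmol/l
```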

In practice, we never observe perfect linear relationships between measurements. Rather, we observe relationships that are linear to some degree, and that are further diluted by noise in the measurements. We can model such imperfect linear relationships by adding Normally distributed random variation on top of the perfect linear relationships of the previous figure. Let’s add the most noise to the green line and the least to the blue line. The amount of noise is determined by the standard deviation (SD) of the Normal variable that is added on top of the perfect linear relationship: a larger SD means a noisier observation of the underlying line. Here we use SDs of 0.8 (green), 0.4 (orange) and 0.1 (blue).

```
n = 50
x = seq(-1, 1, length = n)
y = 0 + 2*x + rnorm(n, 0, 0.8)
plot(x, y, t = "p", col = "darkgreen", lwd = 2)
grid()
y = 1 + -1*x + rnorm(n, 0, 0.4)
lines(x, y, t = "p", col = "orange", lwd = 2) # add line to the existing plot
y = -1 + 0*x + rnorm(n, 0, 0.1)
lines(x, y, t = "p", col = "blue", lwd = 2) # add line to the existing plot
```

In this figure, the quality of the linear model as an explanation of the relationship between \(X\) and \(Y\) varies between the three cases (blue best, green worst). Next we want to quantify such differences.

We read in a data set of heights and weights of 199 individuals (88 males and 111 females). This dataset originates from http://vincentarelbundock.github.io/Rdatasets/doc/carData/Davis.html. You can download it from the course webpage. We will call it `measures`.

```
measures = read.table("Davis_height_weight.txt", as.is = T, header = T) #using 'T' as a shorthand for 'TRUE'
head(measures)
```

```
## X sex weight height repwt repht
## 1 1 M 77 182 77 180
## 2 2 F 58 161 51 159
## 3 3 F 53 161 54 158
## 4 4 M 68 177 70 175
## 5 5 F 59 157 59 155
## 6 6 M 76 170 76 165
```

The last two columns are self-reported values of weight and height. Let’s plot weight against height.

```
plot(measures$height, measures$weight, pch = 20, #pch = 20 means a solid round symbol
ylab = "weight (kg)", xlab = "height (cm)")
```

Unsurprisingly, there is a clear pattern where taller individuals weigh more. To quantify the (linear part of the) relationship, we compute Pearson’s **correlation coefficient** between the variables. This happens in two steps: compute the covariance between the variables, and then scale the covariance by the variability of both variables to get a dimensionless correlation coefficient.

**Covariance** measures the amount of variation in the variables that is linearly shared between the variables. Technically, it is the average product of deviations from the means of the variables, and from a sample it is computed as \[\textrm{cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y}) \textrm{, where }\overline{x}, \overline{y} \textrm{ are the means of }x_i \textrm{ and } y_i, \textrm{ respectively}.\] When both \(X\) and \(Y\) tend to be above their mean values simultaneously, the covariance is positive, whereas the covariance is negative when \(X\) tends to be above its mean while \(Y\) is below its mean. In other words, covariance is positive when \(X\) and \(Y\) tend to increase and decrease together, and negative when they tend to move in opposite directions. A covariance of 0 says that there is no linear relationship between \(X\) and \(Y\). Note that if we compute the covariance between a variable and itself, that is, \(X = Y\) in the formula above, the result is simply the variance of that variable. Thus, covariance is a generalization of the concept of variance to two variables.
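To make the formula concrete, we can compute the covariance of height and weight directly from the definition and check that it matches R’s built-in `cov()` function (a sanity check using the `measures` data read in above):

```
x = measures$height
y = measures$weight
sum( (x - mean(x)) * (y - mean(y)) ) / (length(x) - 1) # covariance from the definition
cov(x, y) # same value from R's built-in function
c(cov(x, x), var(x)) # covariance of a variable with itself is its variance
```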

The **correlation** coefficient results when the covariance is normalized by the product of the standard deviations of the variables: \[\textrm{cor}(X,Y) = \frac{\textrm{cov}(X,Y)}{\textrm{SD}(X)\, \textrm{SD}(Y)}.\] Correlation is always between -1 and +1, and it denotes the strength of the linear relationship between the variables. If the correlation is +1, then the values of \(X\) and \(Y\) lie on a line with a positive slope, and if the correlation is -1, then \(X\) and \(Y\) lie on a line with a negative slope. When the correlation is 0, there is no linear association between the variables. (See Figures from Wikipedia.) Note that a correlation of 1 between \(X\) and \(Y\) does not necessarily mean that \(Y = X\), but only that there is a perfect linear relationship of the form \(Y = a + b\cdot X\) for some constants \(a\) and \(b > 0\); any such linear relationship leads to a correlation of 1.
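We can verify these properties numerically. Below is a small check with simulated data that the normalized covariance equals `cor()`, and that an exact linear relationship gives a correlation of +1 or -1 depending on the sign of the slope:

```
x = rnorm(100) # some random values
y = 2 + 5*x # an exact linear relationship with positive slope
cov(x, y) / (sd(x) * sd(y)) # correlation as normalized covariance: equals 1
cor(x, y) # same value from R's built-in function
cor(x, 2 - 5*x) # a negative slope gives correlation -1
```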

Let’s compute an estimate \(\widehat{r}\) of the correlation between height and weight based on our sample of \(n=199\) observations.

```
r = cor(measures$height, measures$weight) #correlation is symmetric cor(weight,height) = cor(height,weight)
r
```

```
## [1] 0.7707306
```

A correlation of ~0.8 is quite high. To get a confidence interval for the correlation coefficient, we can use the `r.con()` function from the `psych` package. (Install the `psych` package with the command `install.packages("psych")` first if RStudio hasn’t done it automatically for you.) Let’s see what the 95%CI is.

```
n = nrow(measures) #sample size is the number of rows of the data
library(psych) #DO FIRST: install.packages("psych")
r.res = r.con(r, n, p = 0.95, twotailed = TRUE) #returns lower and upper bounds of CI
c(r = r, low95CI = r.res[1], up95CI = r.res[2])
```

```
## r low95CI up95CI
## 0.7707306 0.7074835 0.8217303
```

In this case, `r.con()` gives the 95%CI as (0.707, 0.822), so we can conclude that the height-weight correlation is roughly 0.7-0.8 in this population.
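As a sketch of what happens under the hood: confidence intervals for a correlation are typically computed via Fisher’s z-transformation, where \(z = \textrm{atanh}(r)\) is approximately Normally distributed with standard error \(1/\sqrt{n-3}\). Assuming this is also the approach behind `r.con()`, we can reproduce the interval manually:

```
z = atanh(r) # Fisher's z-transformation of the correlation
se = 1 / sqrt(n - 3) # approximate standard error on the z-scale
tanh( z + c(-1, 1) * qnorm(0.975) * se ) # back-transform the 95%CI to the r-scale
```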

- Make a scatter plot of `weight` on the x-axis and `repwt` on the y-axis. Make a similar plot for `height` and `repht`. Do the reported values seem highly correlated with the measured values?

Let’s plot them next to each other by splitting the plotting area into 1 row and 2 columns with `par(mfrow = c(1,2))`.

```
par(mfrow = c(1,2))
plot(measures$weight, measures$repwt, pch = 20) #pch = 20 means a solid round symbol
plot(measures$height, measures$repht, pch = 20)
```
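To put numbers behind the visual impression, we can also compute the correlations between the measured and self-reported values. Since some individuals may be missing self-reported values, we use the `use = "complete.obs"` argument of `cor()` to drop such pairs:

```
cor(measures$weight, measures$repwt, use = "complete.obs") # measured vs. reported weight
cor(measures$height, measures$repht, use = "complete.obs") # measured vs. reported height
```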