In this lecture we study the relationship between two or more variables. First we establish how to quantify the strength of a linear relationship between continuous variables, and then we learn how to use such a relationship to predict the value of an unknown variable given the values of observed variables.

We read in a data set of heights and weights of 199 individuals (88 males and 111 females). This dataset originates from http://vincentarelbundock.github.io/Rdatasets/doc/carData/Davis.html.

```
y = read.table("https://www.mv.helsinki.fi/home/mjxpirin/HDS_course/material/Davis_height_weight.txt",as.is = T, header=T)
head(y)
```

```
## X sex weight height repwt repht
## 1 1 M 77 182 77 180
## 2 2 F 58 161 51 159
## 3 3 F 53 161 54 158
## 4 4 M 68 177 70 175
## 5 5 F 59 157 59 155
## 6 6 M 76 170 76 165
```

The last two columns are self-reported values of weight and height. Let’s plot weight against height.

`plot(y$height, y$weight, pch = 20, ylab = "weight (kg)", xlab = "height (cm)")`

Unsurprisingly, there is a clear pattern where taller individuals weigh more. To quantify the (linear part of the) relationship, we compute the Pearson **correlation coefficient** between the variables. This happens in two steps: compute the covariance between the variables, and then scale the covariance by the variability of both variables to get a dimensionless correlation coefficient.

**Covariance** measures the amount of variation that is linearly shared between the variables. Technically, it is the average product of deviations from the means of the variables, and from a sample it is computed as \[\textrm{cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y}), \textrm{ where } \overline{x} \textrm{ and } \overline{y} \textrm{ are the means of the } x_i \textrm{ and } y_i\textrm{, respectively.}\] When X and Y tend to be above their mean values simultaneously, the covariance is positive, whereas the covariance is negative if, when X is above its mean, Y tends to be below its mean. A covariance of 0 says that there is no linear relationship between X and Y. Note that if we compute the covariance of a variable with itself, that is, set \(X=Y\) in the formula above, the result is simply the variance of that variable. Thus, covariance is a generalization of the concept of variance to two variables.
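As a quick numerical check of the formula, we can compute the covariance manually and compare it to R's built-in `cov()`. Here is a sketch with simulated data; the variable names `u` and `v` are just for illustration:

```r
set.seed(1)
u = rnorm(100, mean = 170, sd = 10)   # simulated "heights"
v = 0.5 * u + rnorm(100, sd = 5)      # simulated "weights" that depend on u
# covariance from the formula: average product of deviations from the means
cov.manual = sum((u - mean(u)) * (v - mean(v))) / (length(u) - 1)
cov.manual - cov(u, v)  # difference to R's built-in cov() is essentially 0
# covariance of a variable with itself equals its variance
cov(u, u) - var(u)      # essentially 0
```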

The **correlation** coefficient results when the covariance is normalized by the product of the standard deviations of the variables: \[\textrm{cor}(X,Y) = \frac{\textrm{cov}(X,Y)}{\textrm{SD}(X)\, \textrm{SD}(Y)}.\] Correlation is always between -1 and +1, and it measures the strength of the linear relationship between the variables. If the correlation is +1, the values of X and Y lie on a line with a positive slope, and if the correlation is -1, they lie on a line with a negative slope. When the correlation is 0, there is no linear association between the variables. (See Figures from Wikipedia.) Note that a correlation of 1 between X and Y does not necessarily mean that \(Y = X\), but only that there is a linear relationship of the form \(Y = a + bX\) for some constants \(a\) and \(b > 0\), and any such linear relationship leads to a correlation of 1.
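Both the normalization and the claim about exact linear relationships can be verified numerically. A sketch with simulated data (names are illustrative):

```r
set.seed(1)
u = rnorm(100)
v = 2 * u + rnorm(100)
# correlation is the covariance scaled by both standard deviations
cov(u, v) / (sd(u) * sd(v)) - cor(u, v)  # essentially 0
# an exact linear relationship with a positive slope gives correlation 1
cor(u, 3 + 5 * u)  # 1
```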

Suppose we have an estimate \(\widehat{r}\) of the correlation between X and Y, based on a sample of \(n\) observations (\(n=199\) in the example above). To do statistical inference with the correlation coefficient, we need to know the standard error of the estimate. An approximation for Normally distributed data is \[\textrm{SD}(\widehat{r}) \approx \sqrt{\frac{1-\widehat{r}^2}{n-2}}.\] This is accurate when the correlation is not far from 0 (say \(|r|<0.5\)), but becomes less accurate near \(\pm 1\). For more accurate confidence intervals you can use `r.con()` from the package `psych`.

We expect a clear positive correlation between weight and height based on the Figure above. Let’s see what it is.

```
n = nrow(y) #sample size
r = cor(y$height, y$weight) #correlation is symmetric cor(weight,height) = cor(height,weight)
se = sqrt((1-r^2)/(n-2))
library(psych) #DO FIRST: install.packages("psych")
ival = r.con(r, n, p = 0.95, twotailed = TRUE)
cbind(r = r, low95CI = ival[1], up95CI = ival[2],
low95CI.norm.appr = r-1.96*se, up95CI.norm.appr = r+1.96*se)
```

```
## r low95CI up95CI low95CI.norm.appr up95CI.norm.appr
## [1,] 0.7707306 0.7074835 0.8217303 0.6817547 0.8597065
```

In this case, `r.con()` gives the 95% CI as (0.707, 0.822), showing some difference from the less accurate normal approximation.

**Exercises 1**

1. Plot a scatter plot of `weight` on the x-axis and `repwt` on the y-axis. Make a similar plot for `height` and `repht`. Do the reported values seem highly correlated with the measured values?

2. Compute the correlation between `weight` and `repwt` (self-reported weight) and between `height` and `repht` (self-reported height). Note that you need to use `use = "complete.obs"` to get rid of NAs.

3. Compute a **correlation matrix** of the four variables `weight`, `height`, `repwt` and `repht` by giving the `cor()` function those four columns of the data matrix as input. Note that you need to use `use = "pairwise"` in `cor()` to get rid of NA values.

4. Plot the curve \(y=x^2\) in the range \(x \in (-1,\ldots,1)\) using 1000 equally spaced values for \(x\). Guess what the correlation between \(x\) and \(y\) would be. Compute the correlation.

5. Plot the curve \(y=x^3\) in the range \(x \in (-1,\ldots,1)\) using 1000 equally spaced values for \(x\). Guess what the correlation between \(x\) and \(y\) would be. Compute the correlation.

The correlation coefficient \(r\) describes the strength of a linear relationship between two variables, and both variables are treated symmetrically in the definition of \(r\). However, we may also want to use the linear relationship to predict the value of Y given that we know the value of X. For example, we may use our observed population above to make a statistical prediction of the weights of individuals who are exactly 170 cm tall. From the Figure above we see that such individuals have weights roughly in the range from 50 kg to 80 kg. To characterize the relationship more mathematically, we want to fit a line through the observed data that describes how Y depends on X.

The **linear regression** model assumes that \[y_i = a + b x_i + \varepsilon_i,\] where \(a\) (intercept) and \(b\) (slope) are model parameters that define the line, and the error term \(\varepsilon_i\) is the difference between the observed value \(y_i\) and the value predicted by the line: \(\varepsilon_i = y_i - (a + b x_i)\). Any pair of parameters \(a\) and \(b\) defines one line (namely \(y = a + bx\)) in the X-Y coordinates, and among all those lines we choose the one that best fits the observations, as described next. We denote the estimates by \(\widehat{a}\) and \(\widehat{b}\), and the prediction for observation \(i\) by \(\widehat{y}_i = \widehat{a} + \widehat{b}x_i\). The estimates are chosen by the least squares method, which minimizes the residual sum of squares \[\textrm{RSS} = \sum_{i=1}^n(y_i-\widehat{y}_i)^2 = \sum_{i=1}^n(y_i-(\widehat{a}+\widehat{b}x_i))^2.\]
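For simple linear regression, minimizing RSS has a closed-form solution: \(\widehat{b} = \textrm{cov}(X,Y)/\textrm{var}(X)\) and \(\widehat{a} = \overline{y} - \widehat{b}\,\overline{x}\). A sketch with simulated data confirms that these formulas agree with R's least squares fit:

```r
set.seed(2)
x = rnorm(50, mean = 170, sd = 10)       # simulated heights
w = -130 + 1.15 * x + rnorm(50, sd = 8)  # simulated weights
b.hat = cov(x, w) / var(x)               # slope from the closed form
a.hat = mean(w) - b.hat * mean(x)        # intercept from the closed form
rbind(closed.form = c(a.hat, b.hat),
      lm.fit = as.numeric(coef(lm(w ~ x))))  # the two rows agree
```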

In R, a linear model is estimated with the `lm()` function, where the formula `y ~ x` says "regress y on x". The regression line can be added to a figure with the `abline()` function.

```
lm.1 = lm( weight ~ height, data = y)
plot(y$height, y$weight, pch = 20, ylab = "weight (kg)", xlab = "height (cm)", col="gray")
abline(lm.1, col="red", lwd=1.5)
```

`summary(lm.1)`

```
##
## Call:
## lm(formula = weight ~ height, data = y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.650 -5.419 -0.576 4.857 42.887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -130.74698 11.56271 -11.31 <2e-16 ***
## height 1.14922 0.06769 16.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.523 on 197 degrees of freedom
## Multiple R-squared: 0.594, Adjusted R-squared: 0.592
## F-statistic: 288.3 on 1 and 197 DF, p-value: < 2.2e-16
```

In the output of `summary()`:

- Residuals are the observed vertical differences between the line and the observed weight values.
- Coefficients show the least squares parameter estimates with their SEs and P-values (here both are highly significantly different from 0, with P-values < 2e-16). The slope shows that each additional cm in height corresponds to an average increase of 1.15 kg in weight.
- Residual standard error is an estimate of the standard deviation of the errors (often denoted by \(\sigma\): \(\text{Var}(\varepsilon)=\sigma^2\)).
- The R-squared values tell what proportion of the total variance in Y is explained by the line. Typically we should look at the Adjusted R-squared, as it is more reliable than the raw R-squared when there are many predictors in the model.
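Both the residual standard error and R-squared can be reconstructed directly from the residuals, which shows where the `summary()` numbers come from. A sketch with simulated data:

```r
set.seed(3)
x = rnorm(100)
w = 2 + 3 * x + rnorm(100)
fit = lm(w ~ x)
rss = sum(residuals(fit)^2)              # residual sum of squares
sigma.hat = sqrt(rss / fit$df.residual)  # residual standard error
r2 = 1 - rss / sum((w - mean(w))^2)      # R-squared = 1 - RSS/TSS
# both differences are essentially 0
c(sigma.hat - summary(fit)$sigma, r2 - summary(fit)$r.squared)
```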

In this case, the linear model on height explains almost 60% of the variation in weight, and therefore about 40% of the variation in weight cannot be explained by (a linear effect of) height alone. Note that in the case of simple linear regression with one predictor (here height), R-squared is exactly the square of the correlation between X and Y.

`cor(y$weight, y$height)^2`

`## [1] 0.5940256`

`summary(lm.1)$r.squared`

`## [1] 0.5940256`

This explains why the correlation coefficient is a measure of the strength of a linear association between variables: when \(r=\pm1\), the linear regression model explains all variation in Y and hence the observations lie on a single line in the X-Y coordinates.

Let’s see what components we have in the `lm` object:

`names(lm.1)`

```
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
```

We can get both the fitted values (`lm.1$fitted.values`) and the residuals (`lm.1$residuals`) for each individual from the regression model object. For example, let’s find the individuals for whom the predicted weight is more than 20 kg off.

```
ii = which(abs(lm.1$residuals) > 20) #returns row indexes for which condition is TRUE
cbind(y[ii,2:4], lm.1$fitted.values[ii], lm.1$residuals[ii])
```

```
## sex weight height lm.1$fitted.values[ii] lm.1$residuals[ii]
## 20 M 119 180 76.11303 42.88697
## 29 M 101 183 79.56070 21.43930
## 53 M 102 185 81.85914 20.14086
## 96 M 103 185 81.85914 21.14086
## 191 M 89 173 68.06848 20.93152
```

We notice two things. First, there is one outlier, a case whose weight is 119 kg whereas it is predicted to be 76 kg based on the height of 180 cm. Second, all of the worst predictions are for males. Indeed, it is unlikely that the same linear regression model would be good for both males and females.

Let’s add sex as another predictor to the model to allow different mean weights for males and females. Since `sex` is of type “character” with two values (“M” and “F”), R automatically treats it as a *factor* in the regression model. For a factor variable, the regression model fits a different baseline value for each level of the factor (here males and females).
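To see how R codes a two-level factor, we can inspect the design matrix: the baseline level (here “F”, the alphabetically first) is absorbed into the intercept, and a 0/1 dummy column `sexM` is added for the other level. A small illustrative sketch:

```r
d = data.frame(sex = c("M", "F", "F", "M"))  # illustrative data
model.matrix(~ sex, data = d)                # columns: (Intercept), sexM
```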

```
lm.2 = lm(weight ~ height + sex, data = y)
summary(lm.2)
```

```
##
## Call:
## lm(formula = weight ~ height + sex, data = y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.131 -4.889 -0.404 5.205 41.490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -76.63620 15.75543 -4.864 2.36e-06 ***
## height 0.81072 0.09555 8.485 5.24e-15 ***
## sexM 8.21621 1.71726 4.784 3.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.086 on 196 degrees of freedom
## Multiple R-squared: 0.6365, Adjusted R-squared: 0.6328
## F-statistic: 171.6 on 2 and 196 DF, p-value: < 2.2e-16
```

Now our model says that, after we account for the difference in average weight between males and females, each additional cm in height corresponds to 0.81 kg in weight (it was 1.15 kg when sex wasn’t accounted for). Additionally, we see that for the same height, a male weighs on average 8.2 kg more than a female. This model is a better description of the data, as the adjusted R-squared has increased to 63% from the 59% of model `lm.1`.

But are there differences in the height-weight slope between the sexes? Let’s simply fit separate linear models in males and in females and see the slopes.

```
males = y$sex == "M"
lm.m = lm(weight ~ height, data = y[males,])
lm.f = lm(weight ~ height, data = y[!males,])
```

We see that the slope in females is

```
b = summary(lm.f)$coeff[2,1] #coeff matrix has one row for each parameter
s = summary(lm.f)$coeff[2,2] #row 2 is for slope and row 1 for intercept
cbind(b=b, low95 = b-1.96*s, up95 =b+1.96*s)
```

```
## b low95 up95
## [1,] 0.6229425 0.4276806 0.8182044
```

And in males

```
b = summary(lm.m)$coeff[2,1]
s = summary(lm.m)$coeff[2,2]
cbind(b=b, low95 = b-1.96*s, up95 = b+1.96*s)
```

```
## b low95 up95
## [1,] 0.9955981 0.6670175 1.324179
```

It seems that the slope in females might be considerably smaller than in males, but since the 95% CIs are quite wide, this cannot be concluded reliably from these data alone.
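The slope difference could also be tested formally within a single model by adding an interaction term, `lm(weight ~ height * sex, data = y)`, where the coefficient `height:sexM` directly estimates the male-female difference in slope together with its SE and P-value. A sketch of the idea on simulated data:

```r
set.seed(4)
grp = factor(rep(c("F", "M"), each = 100))
x = rnorm(200, mean = 170, sd = 10)
# simulate a smaller slope in group "F" (0.6) than in group "M" (1.0)
w = ifelse(grp == "M", -90 + 1.0 * x, -40 + 0.6 * x) + rnorm(200, sd = 8)
fit = lm(w ~ x * grp)                  # expands to x + grp + x:grp
summary(fit)$coefficients["x:grpM", ]  # estimated slope difference (M - F)
```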

Let’s see how much the regression explains within each sex.

```
par(mfrow=c(1,2))
plot(y[males,"height"], y[males,"weight"], pch = 20, ylim = c(35,120), xlim=c(140,200),
ylab = "weight (kg)", xlab = "height (cm)", col="gray")
abline(lm.m, col="cyan", lwd=1.5)
title(paste0("MALES R2=",signif(summary(lm.m)$r.squared,3)))
plot(y[!males,"height"], y[!males,"weight"], pch = 20, ylim = c(35,120), xlim=c(140,200),
ylab = "weight (kg)", xlab = "height (cm)", col="gray")
abline(lm.f, col="magenta", lwd=1.5)
title(paste0("FEMALES R2=",signif(summary(lm.f)$r.squared,3)))
```