In this lecture, we will study more details about the linear regression model. In particular, how to make predictions from the model using R, and how to assess whether the model seems approriate for the data set at hand. Finally, we will study the interpretation of the coefficients in multiple regression setting where many predictors are simultaneously included in the model.

Letâ€™s start by reading in the same height-weight data set as previously.

```
measures = read.table("Davis_height_weight.txt", as.is = T, header = T)
head(measures, n = 3) #show first three rows
```

```
## X sex weight height repwt repht
## 1 1 M 77 182 77 180
## 2 2 F 58 161 51 159
## 3 3 F 53 161 54 158
```

The columns `repwt`

and `repht`

are self-reported weight and self-reported height, respectively.

Letâ€™s fit separate linear regression models `lm.m`

and `lm.f`

in males and females, respectively.

```
ii.m = (measures$sex == "M") #logical vector with 'TRUE' for Males and 'FALSE' for Females
lm.m = lm( weight ~ height, data = measures[ii.m,]) #fit model in males
lm.f = lm( weight ~ height, data = measures[!ii.m,]) #fit model in females
#use same range on the axes in both sexes to get a clear picture how they compare
x.range = c(145, 200) #height in cm
y.range = c(35, 120) #weight in kg
x.name = "height (cm)"
y.name = "weight (kg)"
par(mfrow = c(1,2)) #split plotting area to 1 x 2 grid
plot(measures$height[!ii.m], measures$weight[!ii.m], pch = 18, main = "FEMALES", col = "red",
ylab = y.name, xlab = x.name, xlim = x.range, ylim = y.range)
abline(lm.f, col = "darkred", lwd = 1.5) #add the regression line
plot(measures$height[ii.m], measures$weight[ii.m], pch = 20, main = "MALES", col = "blue",
ylab = y.name, xlab = x.name, xlim = x.range, ylim = y.range)
abline(lm.m, col = "darkblue", lwd = 1.5) #add the regression line
```

On previous Lecture, we looked up the coefficients of the linear model and used them to compute the predicted value of `weight`

for an individual with a given `height`

. For example, to compute the predicted `weight`

for a 180 cm tall male, we did

`sum(lm.m$coeff*c(1,180)) #this computes a*1 + b*180 for a and b taken from male regression 'lm.m'`

`## [1] 77.87761`

Very generally, in R, the fitted models can be used for prediction via the `predict(model, newdata)`

function. Letâ€™s predict `weight`

for two new individuals who are 160 cm and 180 cm tall. We need to give the input data as parameter `newdata =`

formatted as a data.frame that contains (at least) the same column names that were used in the model fitted by the `lm()`

function. In the current case, we need to define only the column `height`

. We will call the data that goes in to the prediction as `data.in`

.

```
data.in = data.frame(height = c(160, 180) )
data.in # show what we'll put in as newdata parameter for predict(, newdata = )
```

```
## height
## 1 160
## 2 180
```

`predict(lm.m, newdata = data.in) # predictions for males of 160cm and 180cm`

```
## 1 2
## 57.96565 77.87761
```

We can present these results in a `data.frame`

together with the input data:

`data.frame(data.in, predicted.weight = predict(lm.m, newdata = data.in))`

```
## height predicted.weight
## 1 160 57.96565
## 2 180 77.87761
```

The uncertainty of prediction can be asked at the level of individual value `interval = "prediction"`

or as a confidence interval for the population mean at the given parameter values `interval = "confidence"`

. Note that the option names in R commands can be abbreviated as long as the abbreviation remains unique among all possible options; hence here we can use `pred`

and `conf`

, for example.

`predict(lm.m, newdata = data.in, interval = "pred") #for males of 160cm and 180cm`

```
## fit lwr upr
## 1 57.96565 36.95574 78.97555
## 2 77.87761 57.73251 98.02271
```

The interpretation of the 2nd prediction interval here is that, in 95% of the data sets, when we do this prediction procedure for a new male who is 180 cm tall, the new male will fall within the prediction interval, which here is from 58 kg to 98 kg.

`predict(lm.m, newdata = data.in, interval = "conf") #for males of 160cm and 180cm`

```
## fit lwr upr
## 1 57.96565 51.59499 64.33631
## 2 77.87761 75.64286 80.11236
```

The interpretation of the confidence interval for the mean here is that, in 95% of the data sets, when we do this prediction procedure for male population of height of 180 cm, the true population mean will be within the confidence interval, which here is from 75.6 kg to 81.1 kg.

**WARNING**: A common error happens when `newdata`

parameter is not correctly specified because then `predict()`

will not complain but simply outputs the fitted values of the original `lm()`

model fit. For example, if we misspell `newdata`

as `new.data`

, R will not complain and the first two predicted values are actually for the first 2 males in the original data, NOT for 160cm and 180cm as the user might think!

`predict(lm.m, new.data = data.in)[1:2] #GOES WRONG because newdata has not been specified!`

```
## 1 4
## 79.86881 74.89082
```

`predict(lm.m, newdata = data.in)[1:2] #This is correct`

```
## 1 2
## 57.96565 77.87761
```

- What is the predicted weight of a male who is 170 cm tall? What is the predicted population average weight of all males who are 170 cm tall? Give appropriate 95% intervals in both cases.

```
#For completeness, let's re-fit the model first in males only
lm.m = lm( weight ~ height, data = measures[measures$sex == "M",])
#predicted weight for a male 170 cm
predict(lm.m, newdata = data.frame(height = c(170)), interval = "pred")
```

```
## fit lwr upr
## 1 67.92163 47.61119 88.23207
```

```
#predicted population average at 170 cm
predict(lm.m, newdata = data.frame(height = c(170)), interval = "conf")
```

```
## fit lwr upr
## 1 67.92163 64.50355 71.3397
```

We estimate that the population average weight for a male who is 170 cm tall is 67.9 kg (95%CI: 64.5,â€¦,71.3). We predict that 95% CI for weight values for 170 cm males is (47.6,â€¦,88.2) (and the mean weight is the same 67.9 kg as the estimate for the population mean).

- Fit a linear model regressing
`height`

on`weight`

in females. What is the predicted height of a woman who weighs 60kg?

```
lm.h.on.w = lm( height ~ weight, data = measures[measures$sex == "F",])
#predicted weight for a female of 60kg
predict(lm.h.on.w, newdata = data.frame(weight = c(60)), interval = "pred")
```

```
## fit lwr upr
## 1 166.0199 156.2594 175.7805
```

Predicted height is 166 cm with 95%CI of (156,â€¦,176).

Francis Anscombe generated four data sets of 11 observations that had similar summary statistics (means, variances, correlations) but look very different when plotted. His point was to emphasize the importance to plot the data. In R, these data sets are in the columns of the data.frame called `anscombe`

. Letâ€™s plot the datasets, fit linear model and compute correlations and slopes that we put as a subtitle for each plot.

`anscombe[1:2,] # show first two rows `

```
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
```

```
par(mfrow = c(2,2)) #2 x 2 plotting area for 4 plots
for(ii in 1:4){
x = anscombe[,ii] #x-coordinates for Anscombe's data set ii
y = anscombe[,4 + ii] #y-coordinates for Anscombe's data set ii
lm.fit = lm(y ~ x)
plot(x, y, main = paste("Anscombe",ii), pch = 19, xlab = "x", ylab = "y",
sub = paste0("cor = ",signif(cor(x,y),3),
"; slope = ",signif(coefficients(lm.fit)[2],3)))
abline(lm.fit, col = "red", lwd = 1.5)
}
```