---
title: 'Lecture 7: Linear regression II'
author: "Matti Pirinen"
date: "23.9.2021"
output:
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(19)
```

In this lecture, we will study more details of the linear regression model: in particular, how to make predictions from the model in R, and how to assess whether the model seems appropriate for the data set at hand. Finally, we will study the interpretation of the coefficients in the multiple regression setting, where many predictors are included in the model simultaneously.

Let's start by reading in the same height-weight data set as previously.

```{r}
measures = read.table("Davis_height_weight.txt", as.is = TRUE, header = TRUE)
head(measures, n = 3) # show the first three rows
```

The columns `repwt` and `repht` are self-reported weight and self-reported height, respectively. Let's fit separate linear regression models `lm.m` and `lm.f` in males and females, respectively.

```{r, fig.width = 9}
ii.m = (measures$sex == "M") # logical vector: 'TRUE' for males and 'FALSE' for females
lm.m = lm(weight ~ height, data = measures[ii.m,])  # fit model in males
lm.f = lm(weight ~ height, data = measures[!ii.m,]) # fit model in females
# Use the same axis ranges for both sexes to get a clear picture of how they compare.
x.range = c(145, 200) # height in cm
y.range = c(35, 120)  # weight in kg
x.name = "height (cm)"
y.name = "weight (kg)"
par(mfrow = c(1, 2)) # split plotting area into a 1 x 2 grid
plot(measures$height[!ii.m], measures$weight[!ii.m], pch = 18, main = "FEMALES",
     col = "red", ylab = y.name, xlab = x.name, xlim = x.range, ylim = y.range)
abline(lm.f, col = "darkred", lwd = 1.5) # add the regression line
plot(measures$height[ii.m], measures$weight[ii.m], pch = 20, main = "MALES",
     col = "blue", ylab = y.name, xlab = x.name, xlim = x.range, ylim = y.range)
abline(lm.m, col = "darkblue", lwd = 1.5) # add the regression line
```

### Predictions from the regression model

In the previous lecture, we looked up the
coefficients of the linear model and used them to compute the predicted value of weight for an individual with a given height. For example, to compute the predicted weight for a 180 cm tall male, we did

```{r}
sum(lm.m$coeff * c(1, 180)) # computes a*1 + b*180 for a and b taken from the male regression 'lm.m'
```

More generally, in R, fitted models can be used for prediction via the `predict(model, newdata)` function. Let's predict weight for two new individuals who are 160 cm and 180 cm tall. We need to give the input data as the parameter `newdata = `, formatted as a data.frame that contains (at least) the same column names that were used in the model fitted by `lm()`. In the current case, we need to define only the column `height`. We will call the data that goes into the prediction `data.in`.

```{r}
data.in = data.frame(height = c(160, 180))
data.in # show what we'll put in as the newdata parameter of predict(, newdata = )
predict(lm.m, newdata = data.in) # predictions for males of 160 cm and 180 cm
```

We can present these results in a data.frame together with the input data:

```{r}
data.frame(data.in, predicted.weight = predict(lm.m, newdata = data.in))
```

The uncertainty of the prediction can be asked at the level of an individual value (`interval = "prediction"`) or as a confidence interval for the population mean at the given predictor values (`interval = "confidence"`). Note that option names in R commands can be abbreviated as long as the abbreviation remains unique among all possible options; hence here we can use `pred` and `conf`, for example.

```{r}
predict(lm.m, newdata = data.in, interval = "pred") # for males of 160 cm and 180 cm
```

The interpretation of the 2nd prediction interval here is that, in 95% of the data sets, when we apply this prediction procedure to a new male who is 180 cm tall, the new male will fall within the prediction interval, which here is from 58 kg to 98 kg.
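To see where such an interval comes from, here is a self-contained sketch (with simulated data, NOT the Davis measurements) that reproduces the output of `predict()` from the textbook formula for a 95% prediction interval at a new point $x_0$: fit $\pm\, t_{0.975,\,n-2} \cdot s \cdot \sqrt{1 + 1/n + (x_0-\bar{x})^2/\sum_i(x_i-\bar{x})^2}$, where $s$ is the residual standard deviation.

```r
# Simulated data (not the Davis measurements); heights ~ 170 cm, weights on a linear model.
set.seed(1)
n = 100
x = rnorm(n, mean = 170, sd = 10)
y = -100 + 0.9 * x + rnorm(n, sd = 5)
fit = lm(y ~ x)
x0 = 180
auto = predict(fit, newdata = data.frame(x = x0), interval = "prediction")
s = summary(fit)$sigma # residual standard deviation
se.pred = s * sqrt(1 + 1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
point = sum(coef(fit) * c(1, x0)) # a*1 + b*x0, as earlier in the lecture
manual = point + c(-1, 1) * qt(0.975, df = n - 2) * se.pred
all.equal(as.numeric(auto[2:3]), manual) # TRUE: predict() gives the same interval
```

The confidence interval for the population mean has the same form but without the leading `1 +` under the square root, which is why it is so much narrower.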
```{r}
predict(lm.m, newdata = data.in, interval = "conf") # for males of 160 cm and 180 cm
```

The interpretation of the confidence interval for the mean here is that, in 95% of the data sets, when we apply this procedure to the population of males who are 180 cm tall, the true population mean will be within the confidence interval, which here is from 75.6 kg to 81.1 kg.

**WARNING**: A common error happens when the `newdata` parameter is not correctly specified, because then `predict()` will not complain but simply outputs the fitted values of the original `lm()` model fit. For example, if we misspell `newdata` as `new.data`, R will not complain, and the first two predicted values are actually for the first 2 males in the original data, NOT for 160 cm and 180 cm as the user might think!

```{r}
predict(lm.m, new.data = data.in)[1:2] # GOES WRONG because newdata has not been specified!
predict(lm.m, newdata = data.in)[1:2]  # this is correct
```

#### Example 7.1

1. What is the predicted weight of a male who is 170 cm tall? What is the predicted population average weight of all males who are 170 cm tall? Give appropriate 95% intervals in both cases.

```{r}
# For completeness, let's re-fit the model first in males only
lm.m = lm(weight ~ height, data = measures[measures$sex == "M",])
# predicted weight for a male of 170 cm
predict(lm.m, newdata = data.frame(height = c(170)), interval = "pred")
# predicted population average at 170 cm
predict(lm.m, newdata = data.frame(height = c(170)), interval = "conf")
```

We estimate that the population average weight for a male who is 170 cm tall is 67.9 kg (95%CI: 64.5,...,71.3). We predict that the 95% interval for weight values of 170 cm males is (47.6,...,88.2) (and the predicted weight is the same 67.9 kg as the estimate for the population mean).

2. Fit a linear model regressing height on weight in females. What is the predicted height of a woman who weighs 60 kg?
```{r}
lm.h.on.w = lm(height ~ weight, data = measures[measures$sex == "F",])
# predicted height for a female of 60 kg
predict(lm.h.on.w, newdata = data.frame(weight = c(60)), interval = "pred")
```

Predicted height is 166 cm with a 95% interval of (156,...,176).

### Assessing model fit

Francis Anscombe generated [four data sets](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) of 11 observations that have similar summary statistics (means, variances, correlations) but look very different when plotted. His point was to emphasize the importance of plotting the data. In R, these data sets are in the columns of the data.frame called `anscombe`. Let's plot the data sets, fit a linear model to each, and compute the correlations and slopes that we put as a subtitle for each plot.

```{r, fig.width = 8, fig.height = 8}
anscombe[1:2,] # show first two rows
par(mfrow = c(2, 2)) # 2 x 2 plotting area for 4 plots
for(ii in 1:4){
  x = anscombe[, ii]     # x-coordinates for Anscombe's data set ii
  y = anscombe[, 4 + ii] # y-coordinates for Anscombe's data set ii
  lm.fit = lm(y ~ x)
  plot(x, y, main = paste("Anscombe", ii), pch = 19, xlab = "x", ylab = "y",
       sub = paste0("cor = ", signif(cor(x, y), 3),
                    "; slope = ", signif(coefficients(lm.fit)[2], 3)))
  abline(lm.fit, col = "red", lwd = 1.5)
}
```

We see that the data sets are identical with respect to the x-y correlation and the linear model slope, but they are very different in other respects, and, in particular, the linear model does not seem an appropriate description of the X-Y relationship for data sets 2, 3 and 4.

How can we know whether a linear model is an appropriate model for our data set? First we should consider this conceptually: is there a reason to think that a linear relationship between the variables is appropriate for the research question we have in mind? If the answer is positive, then we should ask how the linear model is fitting the data and whether the modeling assumptions are valid. The main assumptions of the linear model are:

1. The relationship between $Y$ and $X$ is linear, that is, there are values $a$ and $b$ for which $Y = a + b\cdot X + \varepsilon$ and the **error term** $\varepsilon$ does not depend on the value of $X$.
2. Errors $\varepsilon$ have constant variance across all values of $X$.
3. Errors $\varepsilon$ are independent of each other.
4. Errors $\varepsilon$ have a Normal distribution (so that the SEs and P-values from `lm()` are valid).

There are 4 standard plots that we can use to assess these assumptions, and they are generated simply by calling `plot()` on the regression model object. Let's first check how our model of weight ~ height in females behaves in these diagnostic plots:

```{r, fig.height = 8}
par(mfrow = c(2, 2)) # plot diagnostics in a 2 x 2 area
plot(lm.f) # calling plot() on an lm object generates 4 diagnostic plots
```

To interpret these plots, read the detailed explanation from here: . A short summary of the four plots:

1. Residuals ($y_i-\widehat{y}_i$) vs. fitted values ($\widehat{y}_i$) should not show a pattern where the distribution of the residuals varies depending on the fitted values.
2. The QQ-plot should ideally be on the line, in which case the residuals ($y_i-\widehat{y}_i$) are Normally distributed and the SEs and P-values of the model coefficients are reliable. Small deviations from the line are not a problem, however.
3. Standardized residuals vs. fitted values should show a similar magnitude along the y-axis for each region of the x-axis. Otherwise, the variance of the residuals is changing and a more appropriate model (than the standard lm) should take this variation into account.
4. Standardized residuals vs. leverage should not show points that reach outside the red curves marking Cook's distance 1. Such points would have a high influence on the slope of the linear model, and hence the `lm()` estimates from such data would not be reliable.

For our test data, the diagnostics look OK.
There are 4 slightly outlying observations that have so much higher weight than predicted by their height that they separate slightly from the line in the QQ-plot. However, these are still minor deviations and, importantly, they do not have a large influence on the fitted line, which can be seen in the lower right plot, where no points even reach a Cook's distance of 0.5. (Only a small region beyond Cook's distance 0.5 can be seen in the top right corner of the plot, and no region with Cook's distance > 1 is visible at all.)

#### Examples 7.2

Let's demonstrate how deviations from the linear model assumptions show up in the diagnostic plots. For that, let's change the existing female height-weight data in a few different ways.

1. Let's make weight have an extra contribution of +0.1*(height-150)^2, so that weight increases also quadratically, and not just linearly, as a function of height. Let's then fit the linear model and show the model fit and diagnostic plot 1.

```{r, fig.width = 9}
new.weight = measures$weight + 0.1*(measures$height - 150)^2
ii.f = (measures$sex == "F")
lm.new = lm(new.weight[ii.f] ~ measures$height[ii.f])
par(mfrow = c(1, 2))
plot(measures$height[ii.f], new.weight[ii.f], pch = 19, col = "orange",
     main = "Changed data")
abline(lm.new, col = "red", lwd = 1.5)
plot(lm.new, 1) # show only diagnostic plot no. 1
```

On the left, we see the new data and the linear model fit that tends to underestimate weight at both ends of the height distribution. On the right, we see a parabolic pattern of residuals vs. fitted values, which indicates that a linear model is not adequate and suggests that a quadratic term of the predictor should be included in the model. (We will come back to how to do this later in this lecture.)

2. Let's make weight have more variance for taller individuals by adding an extra contribution of + rnorm(, mean = 0, sd = 2*|height-150|).
Let's then fit the linear model and show the model fit and diagnostic plot 3.

```{r, fig.width = 9}
new.weight = measures$weight + rnorm(nrow(measures), 0, 2*abs(measures$height - 150))
new.weight[new.weight < 40] = 40 # let's set the minimum new weight to 40 kg
lm.new = lm(new.weight[ii.f] ~ measures$height[ii.f])
par(mfrow = c(1, 2))
plot(measures$height[ii.f], new.weight[ii.f], pch = 19, col = "orange",
     main = "Changed data")
abline(lm.new, col = "red", lwd = 1.5)
plot(lm.new, 3) # show only diagnostic plot no. 3
```

On the left, we see that in the new data the variance increases with height. On the right, we see an increasing pattern in the standardized residuals, which indicates that the variance of the residuals is changing.

3. Let's add to the data one new individual with height = 200 and weight = 50. This individual will be an outlier (meaning that the residual value is large) because the weight is so small compared to the height. The individual will also be influential for the slope of the line, which is shown by the observation reaching Cook's distance 1 in the Standardized residuals vs. Leverage plot. Let's show the original fitted line without the outlier (from the lm.f object) as a blue dashed line, and the new fitted line with the outlier individual included as a red line.

```{r, fig.width = 9}
new.weight = c(measures$weight[ii.f], 50)  # existing females + new value 50
new.height = c(measures$height[ii.f], 200) # existing females + new value 200
lm.new = lm(new.weight ~ new.height)
par(mfrow = c(1, 2))
plot(new.height, new.weight, pch = 19, col = "orange", main = "Changed data")
abline(lm.new, col = "red", lwd = 1.5)
abline(lm.f, col = "blue", lwd = 1.5, lty = 2)
plot(lm.new, 5) # show only diagnostic plot no. 5
```

On the left, we see the outlier point at x = 200, y = 50. We also see how much that one point changes the fitted line from blue to red. On the right, we see how diagnostic plot no.
5 shows that the outlier point (indexed 112) reaches Cook's distance 1, which indicates that it is very influential for the slope of the line. In such cases, it is important to understand what the reason for the outlier points is and whether they could or should be excluded from the analysis. A typical reason for coarse outliers is a measurement error of some kind at some step of the data collection, but outlier points can also be completely real and valid observations.

### Example analysis: Social factors in 1998

Let's study some social factors from around the world at the turn of the millennium using a UN data set from 1998. This file is in csv format (comma-separated values), and we can read it in using the read.csv() function, which is simply a version of read.table() with default parameter values set to correspond to a typical csv file.

```{r}
y = read.csv("UN98.csv", as.is = TRUE, header = TRUE)
head(y) # show some first rows
```

Read the description of the variables from . Let's give these variables shorter names to avoid writing long names.

```{r}
colnames(y) = c("country", "region", "tfr", "contr", "eduM", "eduF", "lifeM",
                "lifeF", "infMor", "GDP", "econM", "econF", "illiM", "illiF")
```

Let's plot the pairwise correlations using corrplot in an enlarged plotting area of size fig.width = 8 and fig.height = 8 (defined in the Rmd code block definition).

```{r, fig.width = 8, fig.height = 8}
#install.packages("corrplot")
library(corrplot)
x = as.matrix(y[, 3:14]) # make a matrix of the numeric columns of data.frame y for cor()
corr = cor(x, use = "pairwise")
corrplot.mixed(corr, order = "hclust")
```

We see two clear blocks of highly correlated variables. The first block is about positive things like life expectancy, education and availability of contraception, and the second contains negative things like illiteracy and infant mortality. We observe that "percentage of economically active women" does not correlate much with the other variables.
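A side note on the `use = "pairwise"` argument above: when there are missing values, each correlation is computed from the rows where both variables are observed, so different entries of the matrix can be based on different subsets of countries. A small self-contained illustration with made-up vectors (not the UN data):

```r
# Made-up toy vectors with missing values (not the UN data).
a = c(1, 2, 3, 4, NA)
b = c(2, 1, 4, 3, 5)
d = c(NA, 1, 2, 4, 3)
m = cbind(a, b, d)
cor(m)                   # default: any NA makes the whole correlation NA
cor(m, use = "pairwise") # each pair of columns uses its own complete rows
```

Without `use = "pairwise"`, every correlation involving a column with a missing value would be `NA` in the corrplot.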
Perhaps surprisingly, GDP (gross domestic product) doesn't correlate positively with economic activity in males either. Let's have a more careful look at these two variables.

```{r, fig.width = 10, fig.height = 4}
par(mfrow = c(1, 3))
hist(y$econM, col = "dodgerblue", breaks = 25)
hist(y$GDP, col = "orange", breaks = 25)
hist(log10(y$GDP), col = "khaki", breaks = 25)
```

We see that economic activity is roughly symmetric and very roughly Normally distributed, but GDP is an example of a highly skewed distribution with a strong tail to the right. This is typical of variables describing wealth or, more generally, variables that take only non-negative values. Often the logarithm of such a variable is a much better input variable for linear regression, as it is more symmetric and more Normally distributed, as is the case here as well. Thus we will use log10 of GDP. (Note that variables need not be Normally distributed in order to be used in linear regression, but approximate Normality often makes the model fit better.)

```{r}
plot(y$econM, log10(y$GDP), pch = 19)
```

There do not seem to be clear patterns between these two variables, which agrees with the small magnitude of the correlation. Let's print out the statistics for countries where over 90% of males are "economically active".

```{r}
y[which(y$econM > 90),] # we use which() here because it automatically filters out NAs
```

We have poor countries like Burundi and Chad and rich countries like Qatar and United Arab Emirates that both have a high proportion of male economic activity. One clear factor separating these groups is infant mortality. Let's look at the same plot but now coloring the points by infant mortality. We make as many colors as there are countries by using the heat.colors(n) palette, which returns $n$ colors from red to yellow. We index these colors by the rank of each country in the infMor variable (rank = 1 for the smallest infMor and rank = $n$ for the largest infMor).
```{r}
n = nrow(y)
plot(y$econM, log10(y$GDP), pch = 20, col = heat.colors(n)[rank(y$infMor)],
     main = "Colored by infant mortality (red = low, yellow = high)")
```

We see that infMor seems predictive of GDP (red associates with high GDP and yellow with low GDP). What does the linear model say when we predict GDP by both economic activity and infant mortality?

```{r}
lm.0 = lm(log10(GDP) ~ econM + infMor, data = y, na.action = "na.exclude")
summary(lm.0)
```

Economic activity is not an important predictor, whereas infant mortality is, and the model explains about 50% of the variation in GDP. Let's fit a linear model using infant mortality as the only predictor.

```{r}
lm.1 = lm(log10(GDP) ~ infMor, data = y, na.action = "na.exclude")
plot(y$infMor, log10(y$GDP), pch = 20)
abline(lm.1, col = "red", lwd = 1.5)
summary(lm.1)
```

We see that log10 of GDP decreases by 0.0133 for each unit of infant deaths (per 1000 infants). In other words, when infant mortality increases by one per mille, GDP drops multiplicatively by a factor of 10^(-0.0133157) = `r 10^(-0.0133157)`, i.e., it drops by about 3%. This model alone explains about 50% of the variation in GDP. Does this model fit the data well? It looks like there is a tendency for the residuals to be positive at both ends of the infMor range, which would suggest adding a second-order term to the model. Let's check the diagnostic plots.

```{r, fig.height = 8}
par(mfrow = c(2, 2))
plot(lm.1)
```

No other clear problems except that the residuals have the shape of a parabola in the top-left plot. Let's add the quadratic term to the model, whence our model becomes log10(GDP) ~ infMor + infMor^2. The quadratic term needs to be input through the I() notation in the model formula, as is done below.
```{r}
lm.2 = lm(log10(GDP) ~ infMor + I(infMor^2), data = y, na.action = "na.exclude")
summary(lm.2)
plot(y$infMor, log10(y$GDP), pch = 20)
abline(lm.1, col = "red", lwd = 2) # linear model fit without the squared term
# In order to show the lm.2 model fit in the same figure, we connect the fitted values via lines,
# but we need to do that by ordering the points from the smallest infMor to the largest.
x = y$infMor
x[is.na(y$GDP)] = NA # add NAs where GDP is missing
ii = order(x) # indexes from the smallest infMor to the largest
lines(x[ii], fitted(lm.2)[ii], col = "magenta", lwd = 2) # model fit with the squared term
legend("topright", lwd = 2, legend = c("linear", "linear + quadratic"),
       col = c("red", "magenta"))
```

The quadratic term clearly improves the model fit (R2 increases from 50% to 62%), and the fitted values take into account the curved pattern in the data. Let's see the diagnostic plots.

```{r, fig.height = 8}
par(mfrow = c(2, 2))
plot(lm.2)
```

They look good.

We have found a very strong (inverse) association between infant mortality and GDP. Can we now say that infant mortality is **the cause** of differences in GDP? The answer is **NO**. Causal statements can never be made from regression analysis alone! **Correlation is not causation** is an important fact to remember in all analyses of observational data! This is because there can be some third factor Z that causally influences both X and Y, and hence the effect of Z could produce the observed correlation between X and Y even though neither X nor Y causally influences the other. Such a third variable Z is called a **confounder** when it is *confounding* the association between X and Y. A common statistics textbook example is the positive correlation between ice-cream sales and deaths by drowning. Is ice-cream causing drowning?
No; it is sun and hot weather that increase ice-cream sales and also increase activity on water, which then leads to more accidents on water and, consequently, deaths by drowning. So weather is confounding the association between ice-cream sales and drowning, and the positive correlation between these two variables is not due to a causal relationship. Since we can never measure all variables in observational studies, it remains very difficult to carry out causal inference using observational data, as additional, unmeasured confounders could influence the regression estimates. Luckily, in clinical medicine and experimental science, there is an important method to overcome the problem of unmeasured variables: the [**randomized controlled trial (RCT)**](https://en.wikipedia.org/wiki/Randomized_controlled_trial). The situation is more challenging for observational population-level epidemiology or the social sciences, for example, and there it is difficult to justify causal statements.

#### Multiple regression with social factors

Let's think more about what happens when we include multiple predictors in the same linear regression model. Above we established that infant mortality was a strong predictor of GDP. What about total fertility rate (tfr)?

```{r}
lm.3 = lm(log10(GDP) ~ tfr, data = y, na.action = "na.exclude")
summary(lm.3)
plot(y$tfr, log10(y$GDP), pch = 20)
abline(lm.3, col = "red", lwd = 1.5)
```

A very clear association, where higher fertility associates with lower GDP. The model explains about 38% of the variation. Let's then include both infant mortality (infMor) and tfr in the same model. This is an example of a multiple regression model.

```{r}
lm.4 = lm(log10(GDP) ~ tfr + infMor, data = y, na.action = "na.exclude")
summary(lm.4)
```

In this model, only infMor seems important (its effect is many SEs away from zero, as denoted by the small P-value), whereas the association with tfr has disappeared (the effect dropped from -0.25 to -0.02 and the P-value went from <2e-16 to 0.57).
Such dramatic changes are possible because infMor and tfr are correlated, and therefore part of what each of them explains alone in a regression model can be explained by the other when they are jointly included in a multiple regression model. In these data, it seems that the effect of tfr can be completely explained by the correlated variable infMor.

Multiple regression has here produced new insights over the pairwise correlations: namely, of the two highly correlated variables infMor and tfr ($r=0.8$), the former seems much better at predicting changes in GDP than the latter. We say that the effect size of tfr from the multiple regression model "has been adjusted for" infMor, whereas the effect of tfr from the simple regression model was not so adjusted. The interpretation of the effect estimates from multiple regression is that they tell what happens to the outcome variable (here log10(GDP)) per unit change in the focal predictor (here tfr) when the other predictors (here infMor) are kept constant. Thus, here we have learned that if we fix the level of infant mortality, then the fertility rate no longer noticeably affects GDP. In Finnish: Hedelmällisyysluvulla ei ole tilastollista yhteyttä BKT:hen, kun analyysi on *vakioitu* lapsikuolleisuudella. (The fertility rate has no statistical association with GDP when the analysis is *adjusted* for infant mortality.)
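This adjustment phenomenon can be demonstrated with a small simulation (hypothetical variables z, x and y, not the UN data): z drives both x and y, so x predicts y when used alone, but its effect essentially vanishes once z is included in the model.

```r
# Hypothetical simulated example of confounding / adjustment (not the UN data):
# z causally drives both x and y; x itself has no causal effect on y.
set.seed(42)
n = 1000
z = rnorm(n)          # confounder (plays the role of infMor)
x = z + rnorm(n)      # correlated with z (plays the role of tfr)
y = 2 * z + rnorm(n)  # outcome depends only on z (plays the role of log10(GDP))
b.alone = unname(coef(lm(y ~ x))["x"])     # clearly non-zero, although x has no causal effect
b.adj   = unname(coef(lm(y ~ x + z))["x"]) # close to zero after adjusting for z
round(c(alone = b.alone, adjusted = b.adj), 2)
```

Note that the simulation also shows why adjustment alone does not prove causality: here z happens to be the causal variable, but in real observational data the adjusted model can still be confounded by variables that were never measured.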