In this lecture we study the relationship between a binary variable, such as having a disease vs. being healthy, and one or more predictors, such as risk factors of the disease. We extend the linear regression model to binary outcomes through a modeling approach called logistic regression.

Modeling risk for binary outcomes through log-odds

When the outcome \(Y\) of an experiment/observational study is binary (0 or 1), we typically want to model how the probability \(p_i\), that the outcome is 1 for individual \(i\), depends on the values of the predictor variables \(X_1,\ldots,X_k\) available for individual \(i\). We call this probability “risk”, although in general it can be any probability, not necessarily related to a disease risk.

Intuitively, extending our experience from the linear regression model, we would like to have a model like \[p_i = a + b\, x_i + \varepsilon_i,\] from which we could estimate how a change in the predictor value \(x_i\) changes the risk \(p_i\). This approach has two problems. First, we cannot measure \(p_i\); we only observe the binary outcome \(y_i\), which is either 0 or 1 (e.g. we set \(y_i=1\) for patients and \(y_i=0\) for healthy individuals). Second, the risk \(p_i\) is a probability and hence always between 0 and 1, whereas the linear model above imposes no such restriction on the right-hand side of the equation, which could result in unacceptable risk values from the model.

In this lecture, we leave the first problem for R to handle. The second problem is solved by transforming the outcome value \(p_i\) from the interval (0,1) to the whole real line by the log-odds transformation, where \(p_i\) is replaced by \[ \log\left(\frac{p_i}{1-p_i}\right).\]

par(mfrow = c(1,2)) # two panels side by side
p = seq(0.001, 0.999, 0.001)
y = log( p/(1-p) ) # log-odds transformation of p
plot(p, y, xlab = "risk p", ylab = "log-odds: log( p/(1-p) )", t = "l", lwd = 1.5)
plot(y, p, xlab = "log-odds: log( p/(1-p) )", ylab = "risk p", t = "l", lwd = 1.5)

When \(p_i\) is the probability that an event of interest happens, then \(1-p_i\) is the probability that the event does not happen. The odds of the event are \(p_i / (1-p_i)\), and they tell how many times more likely the event is to happen than not to happen. For example, if the risk is 50%, then the odds are 1:1 = 1, and if the risk is 2/3, then the odds are 2:1 = 2. Large odds (and also large log-odds) correspond to very probable events, and small odds (and also small log-odds) correspond to very unlikely events. Mathematically, the inverse of the log-odds transformation \[ \ell = \log\left( \frac{p}{1-p}\right) \qquad \textrm{ is } \qquad p = \frac{\exp(\ell)}{1+\exp(\ell)}.\]
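As a quick numerical check of these formulas (a small sketch of our own), we can compute the odds and log-odds for the two risk values above and verify that the inverse transformation takes us back to the original risks:

p = c(0.5, 2/3)                # risks from the examples above
odds = p / (1 - p)             # 1 and 2
l = log(odds)                  # log-odds: 0 and log(2)
p.back = exp(l) / (1 + exp(l)) # inverse transformation recovers p
data.frame(risk = p, odds = odds, logodds = l, back = p.back)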

By taking the logarithm of the odds, we have a quantity that can take any real number as its value. Hence, it is conceptually valid to use a regression model where \[ \log\left(\frac{p_i}{1-p_i}\right) = a + b\,x_i. \] If \(b\) in this model is positive, then, as \(x\) increases, the odds of the event increase and therefore the probability of the event also increases. If \(b\) is negative, then, as \(x\) increases, the odds, and hence also the probability, of the event decrease. With the logistic regression model, for each pair of estimates of \(a\) and \(b\), we can predict the risk \(p_i\) for individual \(i\) given his/her value of the predictor \(x_i\).

Example 8.1

  1. Risk of a disease is 10%. What are the odds of the disease? And the log-odds of the disease? How do the odds and log-odds change when the risk changes from 10% to (100% - 10%) = 90%?

Odds are \(p/(1-p) = 0.1/(1-0.1) = 0.1/0.9 = 1/9 \approx 0.111\). Log-odds are \(\log(1/9) = -2.197\). If the risk changes to 90%, then the odds become \(9/1 = 9\) and the log-odds become \(\log(9) = -\log(1/9) = 2.197\). Thus the odds change to their inverse value, and the log-odds change sign.
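The same numbers can be verified in R:

p = c(0.1, 0.9)
odds = p / (1 - p) # 1/9 and 9: each other's inverses
log(odds)          # -2.197 and 2.197: equal magnitude, opposite signs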

  2. Suppose that the log-odds of a disease follow a linear model \(a + b\, x\), where \(x\) is measured BMI and the parameter values are \(a=-1\) and \(b=0.01\). Is increasing BMI a risk factor or a protective factor for the disease? What is the risk of the disease for an individual with bmi = 30? What about bmi = 40?

Since larger log-odds mean larger odds, and larger odds mean larger risk, a positive value of \(b\) means that increasing BMI increases the risk.

a = -1
b = 0.01
x = c(30, 40)
logodds = a + b*x
p = exp(logodds)/(1 + exp(logodds)) # inverse log-odds transformation
data.frame(bmi = x, risk = p)
##   bmi      risk
## 1  30 0.3318122
## 2  40 0.3543437

Logistic regression in R

Let’s use the dataset Pima.tr on 200 Pima Indian women from the MASS package.

library(MASS)
y = Pima.tr
str(y)
## 'data.frame':    200 obs. of  8 variables:
##  $ npreg: int  5 7 5 0 0 5 3 1 3 2 ...
##  $ glu  : int  86 195 77 165 107 97 83 193 142 128 ...
##  $ bp   : int  68 70 82 76 60 76 58 50 80 78 ...
##  $ skin : int  28 33 41 43 25 27 31 16 15 37 ...
##  $ bmi  : num  30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
##  $ ped  : num  0.364 0.163 0.156 0.259 0.133 ...
##  $ age  : int  24 55 35 26 23 52 25 24 63 31 ...
##  $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...

This data frame contains the following columns:

Variable  Explanation
npreg     number of pregnancies
glu       plasma glucose concentration in an oral glucose tolerance test
bp        diastolic blood pressure (mm Hg)
skin      triceps skin fold thickness (mm)
bmi       body mass index (weight in kg / (height in m)^2)
ped       diabetes pedigree function
age       age in years
type      Yes or No, for diabetic according to WHO criteria

Our outcome variable is the binary diabetes status type.

table(y$type)
## 
##  No Yes 
## 132  68

Let’s start by seeing how bmi associates with type using a boxplot and a t-test. Since type is a factor (see the output from str() above), we can make a boxplot and a t-test of bmi with respect to the two levels of the factor by using the formula notation bmi ~ type.

boxplot(bmi ~ type, data = y, xlab = "bmi", ylab = "Diabetes",
        horizontal = TRUE, # horizontal = TRUE turns the boxplot horizontal
        col = c("blue", "red"))

t.test(bmi ~ type, data = y)
## 
##  Welch Two Sample t-test
## 
## data:  bmi by type
## t = -4.512, df = 171.46, p-value = 1.188e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.224615 -2.044547
## sample estimates:
##  mean in group No mean in group Yes 
##          31.07424          34.70882

There seems to be a difference of about 3.5 kg/m^2 (95% CI 2.0 to 5.2) in mean bmi between the groups.

If we are given a value of BMI, what would be the risk of diabetes (in these data)? Let’s use logistic regression. We use the glm() function that fits “generalized linear models”, and we specify logistic regression by the parameter family = "binomial". This means that the outcome variable in the regression is modeled as having a binomial distribution, where the success probability may depend on the values of the predictors. The model formula syntax is similar to that of the linear model lm().

Important: If you forget to specify family = "binomial" in glm(), then the function fits a linear model, not a logistic model!

glm.1 = glm(type ~ bmi, data = y, family = "binomial")
summary(glm.1) #Always check from summary that you have fitted correct model
## 
## Call:
## glm(formula = type ~ bmi, family = "binomial", data = y)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5797  -0.9235  -0.6541   1.2506   1.9377  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.11156    0.92806  -4.430 9.41e-06 ***
## bmi          0.10482    0.02738   3.829 0.000129 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 256.41  on 199  degrees of freedom
## Residual deviance: 239.97  on 198  degrees of freedom
## AIC: 243.97
## 
## Number of Fisher Scoring iterations: 4

The coefficient for bmi is the increase in the log-odds of diabetes per one unit increase in bmi, and it shows a clear statistical association with P ≈ 1e-4. Let’s demonstrate what this means in terms of the risk of diabetes by applying the predict() function to a sequence of bmi values. The default output from predict() on a logistic regression object is on the log-odds scale, but we can ask for the predictions directly as probabilities by specifying type = "response".

bmi.grid = 18:55
data.in = data.frame(bmi = bmi.grid)
y.pred = predict(glm.1, newdata = data.in, type="response") #type = "response" predicts probabilities
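As a sanity check (our own addition, not part of the original analysis), the default log-odds output of predict() transforms to the same probabilities, and consecutive bmi values differ on the log-odds scale by exactly the bmi coefficient:

logodds.pred = predict(glm.1, newdata = data.in) # default output is on log-odds scale
range(exp(logodds.pred)/(1 + exp(logodds.pred)) - y.pred) # essentially zero
diff(logodds.pred[1:2]) # one bmi unit adds the bmi coefficient (about 0.105) to the log-odds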

Let’s plot these predictions as a function of bmi, and let’s also add to the figure the bmi distributions of the observed diabetes cases and controls using stripchart().

plot(bmi.grid, y.pred, ylim = c(-0.05, 1.05), t = "l",
     main = "prediction", ylab = "risk of diabetes", xlab = "bmi")
stripchart(y$bmi[y$type == "No"], at = -0.05, add = TRUE, method = "jitter", jitter = 0.02, col = "blue")
stripchart(y$bmi[y$type == "Yes"], at = 1.05, add = TRUE, method = "jitter", jitter = 0.02, col = "red")
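As an aside, base R’s plogis() implements exactly the inverse log-odds transformation \(\ell \mapsto \exp(\ell)/(1+\exp(\ell))\), so a single risk value can also be read directly off the fitted coefficients; a minimal sketch:

# plogis(l) = exp(l)/(1 + exp(l)) is the inverse log-odds transformation
plogis(sum(coef(glm.1) * c(1, 30))) # risk of diabetes at bmi = 30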