---
title: 'Lecture 8: Logistic regression'
author: "Matti Pirinen"
date: "12.9.2022"
output:
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(19)
```
In this lecture, we study a relationship between a binary variable, such as
a disease status, and one or more predictors,
such as risk factors of a disease.
We extend the linear regression model to binary outcomes by a modeling approach
called **logistic regression**.
### Modeling risk for binary outcomes through log-odds
When the outcome $Y$ of an experiment/observational study is binary
(can take only two values, denoted by 0 and 1),
we typically want to model how the probability $p_i$, that the outcome is 1
for individual $i$, depends on the available knowledge of the
values of predictor variables $X_1,\ldots,X_k$ for individual $i$.
We call this probability "risk" although in general it can be any probability,
not necessarily related to a disease risk.
Intuitively, extending our experience from the linear regression model,
we would like to have a model for risk like
$$p_i = a + b\, x_i + \varepsilon_i,$$
from which we could estimate how a change in the predictor value $x_i$ changes the risk $p_i$.
This has two problems. First, we can't measure $p_i$ but we only observe binary outcome
$y_i$ which is either 0 or 1 (e.g. for patients we set $y_i=1$ and for healthy $y_i=0$).
Second, the risk value $p_i$ is a probability and hence always between 0 and 1,
whereas the linear model above doesn't impose such a restriction to the right
hand side of the equation, which could result in inacceptable risk values from the model.
In this lecture, we leave the first problem for R to handle.
The second problem is solved by transforming the
outcome value $p_i$ from the interval (0,1) to the whole
real line by the **log-odds transformation**, where $p_i$ is
replaced by $$ \log\left(\frac{p_i}{1-p_i}\right).$$
```{r, fig.width=10}
par(mfrow = c(1,2))
p = seq(0.001, 0.999, 0.001)
y = log( p/(1-p) )
plot(p, y, xlab = "risk p", ylab = "log-odds: log( p/(1-p) )", t = "l", lwd = 1.5)
plot(y, p, xlab = "log-odds: log( p/(1-p) )", ylab = "risk p", t = "l", lwd = 1.5)
```
When $p_i$ is the probability that an event of interest happens,
then $1-p_i$ is the probability that the event does not happen.
The **odds** of the event are $p_i / (1-p_i)$ and they tell how many times more
likely the event is to happen than not to happen.
For example, if risk is 50%,
then odds are 1:1 = 1 and if risk is 2/3 then odds are 2:1 = 2.
Large odds (and also large log-odds) correspond to very probable events
and small odds (and also small log-odds) correspond to very unlikely events.
Mathematically, the inverse of log-odds transformation
$$ \ell = \log\left( \frac{p}{1-p}\right) \qquad \textrm{ is } \qquad p = \frac{\exp(\ell)}{1+\exp(\ell)}.$$
By taking the logarithm of odds, we have a quantity that can take any real number
as its value.
Hence, it is conceptually valid to use a regression model where
$$
\log\left(\frac{p_i}{1-p_i}\right) = a + b\,x_i.
$$
If $b$ in this model is positive, then, as $x$ increases, also the odds of the event
increases and therefore also the probability of the event increases.
If $b$ is negative, then
as $x$ increases, also the odds of the event and hence also the probability of the event decreases.
With the logistic regression model, for each pair of estimates of $a$ and $b$,
we can do a prediction of risk $p_i$ for individual $i$ given his/her value
for predictor $x_i$.
#### Example 8.1
1. Risk of a disease is 10%. What are the odds of the disease?
And log-odds of the disease? How does odds and log-odds change
when risk changes from 10% to (100% - 10%) = 90%?
Odds are $p/(1-p) = 0.1/(1-0.1) = 0.1/0.9 = 1/9 \approx 0.111$.
Log-odds are $\log(1/9) = -2.197.$
If risk changes to 90%, then odds become $9/1 = 9$ and log-odds become $\log(9) = -\log(1/9) = 2.197$.
Thus odds change to their inverse value, and log-odds changes the sign.
2. Suppose that log-odds of a disease follows a
linear model $a + b\, x$, where $x$ is measured BMI and parameter
values are $a=-1$ and $b=0.01$.
Is increasing BMI a risk or protective factor for the disease?
What is the risk of disease for an individual with bmi = 30?
What about bmi = 40?
Since larger log-odds mean larger odds mean larger risk, a positive value
of $b$ means that increasing BMI increases the risk.
```{r}
a = -1
b = 0.01
x = c(30, 40)
logodds = a + b*x
p = exp(logodds)/(1 + exp(logodds))
data.frame(bmi = x, risk = p)
```
## Logistic regression in R
Let's use a dataset `Pima.tr` on 200 women from Pima population
(a Native American population) from `MASS` package.
```{r}
library(MASS)
y = Pima.tr
str(y)
```
This data frame contains the following columns:
| Variable | Explanation |
|----------|-------------|
|npreg|number of pregnancies|
|glu|plasma glucose concentration in an oral glucose tolerance test|
|bp|diastolic blood pressure (mm Hg)|
|skin|triceps skin fold thickness (mm)|
|bmi|body mass index |
|ped|diabetes pedigree function|
|age|age in years|
|type|Yes or No, for diabetic according to WHO criteria|
Our outcome variable is the binary diabetes status `type`.
```{r}
table(y$type)
```
Let's start by seeing how `bmi` associates to `type` using a boxplot and a t-test.
Since `type` is a factor (see above output from `str()`), we can make boxplots and t-test
of `bmi` with respect to the two levels of the factor by using notation `bmi ~ type`.
```{r}
boxplot(bmi ~ type, data = y, xlab = "bmi", ylab = "Diabetes",
horizontal = TRUE, col = c("blue", "red")) #turns boxplot horizontal
t.test(bmi ~ type, data = y)
```
There seems to be a difference of about 3.5 kg/m^2,
95%CI (2.0..5.2) between the groups.
If we are given a value of BMI, what would be a risk of diabetes (in these data)?
Let's use logistic regression.
We use `glm( )` function that fits "generalized linear models",
and we specify logistic regression
by parameter `family = "binomial"`.
This means that the outcome variable in regression is modeled as having a binomial
distribution, where the success probability may depend on the values of predictors.
For the model formula, the syntax is similar to the linear model `lm()`.
**Important**: If you forget to
specify `family = "binomial"` in `glm( )`, then the function fits a linear model,
not a logistic model!
```{r}
glm.1 = glm(type ~ bmi, data = y, family = "binomial")
summary(glm.1) #Always check from summary that you have fitted correct model
```
Coefficient for `bmi` is the increase in log-odds of diabetes per one unit increase in `bmi`
and shows a clear statistical association with P ~ 1e-4.
Let's demonstrate what this means in terms of the risk of diabetes by applying
`predict( )` function for a sequence of `bmi` values.
The default output from `predict( )` on logistic regression object is in log-odds
but we can ask the prediction directly as probabilities by specifying `type = "response"`.
```{r}
bmi.grid = 18:55
data.in = data.frame(bmi = bmi.grid)
y.pred = predict(glm.1, newdata = data.in, type = "response") #type = "response" predicts probabilities
```
Let's plot these predictions as a function of bmi,
and let's add to the figure also the bmi distributions
of the observed cases and controls of diabetes using `stripchart()`.
```{r}
plot(bmi.grid, y.pred, ylim = c(-0.05,1.05), t = "l",
main = "prediction", ylab = "risk of diabetes", xlab = "bmi")
stripchart(y$bmi[y$type == "No"], at = -0.05, add = T, method = "jitter", jitter = 0.02, col = "blue")
stripchart(y$bmi[y$type == "Yes"], at = 1.05, add = T, method = "jitter", jitter = 0.02, col = "red")
```
We drew the predictions up to `bmi = 55`, but we see that original data included few
observations above 45, and hence we shouldn't expect the model to be reliable after that.
We see that the risk raises from about 10% at `bmi < 20` up to 60% at `bmi > 40`.
While the relationship in this BMI range is close to linear,
we still see that there is some non-linearity in the prediction curve,
which is a difference from the corresponding linear regression model.
Is this a good model?
Let's define some measures for how well the model explains the data.
We use **misclassification error** and **ROC curve** for this purpose.
#### Missclassification error
We can predict the class (0 or 1) of each sample based on the model (e.g. by predicting class to be
1 if predicted risk > 0.5 and otherwise predicting class to be 0), and then see which proportion of the
predictions are wrong. This proportion of labels predicted wrongly is called
the missclassification rate.
Let's apply it to the current data. We can use `predict(glm.1, type = "response")` to produce
risks and then threshold them into 0 or 1 according to whether the risk is < 0.5 or > 0.5.
```{r}
pred.1 = predict(glm.1, type = "response") #if no newdata specified, then predictions are given for original data
y.pred.1 = ifelse(pred.1 < 0.5, 0, 1) #if pred.1[i] < 0.5 then set y.pred.1[i]=0 else set y.pred.1[i]=1
table(y.pred.1, y$type) #cross-tabulate predictions vs. true type
```
We see that missclassification error is
$\frac{57+12}{200} = 0.345$. [Two measures](https://en.wikipedia.org/wiki/Sensitivity_and_specificity) often used in medical research are
* *sensitivity*, the proportion of true 1s that are correctly labeled as 1s: $\frac{11}{57+11} = 0.162$ and
* *specificity*, the proportion of true 0s that are correctly labeled as 0s: $\frac{120}{120+12} = 0.909$.
The classifier is perfect if and only if it has both sensitivity and specificity equal to 1.
Here our model has very low sensitivity of 16% and higher specificity of 91%. This means that
the model doesn't really detect well the true diabetes cases.
Let's add `glu` to the model, which is expected to predict strongly diabetes,
since it is used to diagnose diabetes.
```{r}
boxplot(glu ~ type, data = y, xlab = "glu", ylab = "Diabetes", horizontal = T, col = c("blue","red"))
t.test(glu ~ type, data = y)
```
```{r}
glm.2 = glm(type ~ bmi + glu, data = y, family = "binomial")
summary(glm.2)
```
Let's see the missclassification rate in the updated model.
```{r}
pred.2 = predict(glm.2, type = "response")
y.pred.2 = ifelse(pred.2 < 0.5, 0, 1)
table(y.pred.2, y$type) #cross-tabulate predictions vs. true type
```
Missclassification rate is now $\frac{31+16}{200} = 0.235$
and has clearly fallen from its earlier value of 0.345 when BMI was alone in the model.
Missclassification rate is an intuitive measure,
but it has a downside that it is
defined by a particular fixed risk threshold (usually 0.5).
We can get more information about a classifier, if
we can see its performance throughout all possible risk cutoffs.
The ROC curve gives us such information.
#### ROC curve
In [*receiver operating characteristics* (ROC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) curve, we run through all possible
risk thresholds to classify samples into 1s and 0s and, for each threshold,
we compute the true positive rate (= sensitivity, the proportion of true 1s that are labeled as 1s)
against false positive rate (=1-specificity, the proportion of true 0s that are labeled as 1s).
Let's draw ROC curves from models 1 and 2 above, and discuss the curves.
```{r, fig.width = 5}
p.thresh = seq(-0.001, 1.001, 0.001) #risk thresholds for which ROC is computed
#We make roc.1 and roc.2 matrices here and then fill them in the for-loop below
roc.1 = matrix(NA, ncol = 2, nrow = length(p.thresh)) #will have col1 = FP rate and col2 = TP rate
roc.2 = matrix(NA, ncol = 2, nrow = length(p.thresh))
for(ii in 1:length(p.thresh)){ #goes through all risk thresholds
p = p.thresh[ii] # now p is the current risk threshold
y.pred.1 = ifelse(pred.1 < p, 0, 1) #0-1 predictions based on current risk threshold 'p'
y.pred.2 = ifelse(pred.2 < p, 0, 1)
roc.1[ii,1] = sum( y$type == "No" & y.pred.1 == 1) / sum( y$type == "No" ) #false positive rate
roc.1[ii,2] = sum( y$type == "Yes" & y.pred.1 == 1) / sum( y$type == "Yes") #true positive rate
roc.2[ii,1] = sum( y$type == "No" & y.pred.2 == 1) / sum( y$type == "No" )
roc.2[ii,2] = sum( y$type == "Yes" & y.pred.2 == 1) / sum( y$type == "Yes")
}
plot(NULL, xlim = c(0,1), ylim = c(0,1), main = "ROC",
xlab = "False positive rate", ylab = "True positive rate") #Makes an empty plot
lines(roc.1, col = "skyblue", lwd=1.5) #adds ROC of model 1
lines(roc.2, col = "darkblue", lwd=1.5) #adds ROC of model 2
legend("bottomright", leg = c("bmi", "bmi+glu"), col = c("skyblue", "darkblue"), lwd = 1.5)
abline(a = 0, b = 1, lty = 2) #adds diagonal line Y = X
```
An ROC curve always starts from the origin (0,0),
corresponding to rhe risk threhsold of 1.0, at where
no sample is yet classified as outcome 1 (because all estimated risks are below 1.0).
If the two groups were completely separable in their estimated risk values,
then ROC curve would climb straight up to point (0,1) at the
threshold that the first time completely separated the two groups.
From there, it would then
move horizontally towards the point (1,1) as the risk thresholds get closer to 0.0.
It would reach point (1,1) at latest when the risk thresholds reaches 0.0 because then
every observation is classified as 1.
If the risk estimates were just some random values, we would expect
the ROC curve to follow closely the diagonal.
As a conclusion, the better the classifier is,
more quickly its ROC jumps up vertically and more slowly it moves
horizontally.
Another way to express this is that we would like our classifier to have
simultaneously a high true positive rate AND a low false positive rate
and such classifiers reside near the top left corner of the ROC plot.
Mathematically, it can be shown
that the **area under the curve (AUC)** of an ROC curve
coresponds to the probability that a randomly chosen case (a true 1)
has a higher risk estimate than a randomly chosen control (a true 0).
Thus, AUC is a single number to summarize the performance of a classifier.
Above we see that both classifiers seem better than random (they are clearly above the diagonal)
and the model with `glu` is a better classifier
since it is above the other curve (and hence also covers more area under the curve).
See illustration of ROC curves:
.
#### ROCR package
Let's do the same things with `ROCR` package
(install if first, see the code below).
We need two vectors for each ROC curve: `predictions` which in logistic regression
in the predicted risk, and `labels` which are true 0/1 labels.
These are given to `predictions( )` function to produce
an ROCR object for performance evaluation.
We can then assess `performance( )`, for example, by
making an ROC curve using `measure = "tpr"` and `x.measure = "fpr"`.
```{r, fig.width = 5}
#install.packages("ROCR")
library(ROCR)
pred.1 = predict(glm.1, type = "response") #repeat risk predictions from model glm.1
rocr.pred.1 = prediction(pred.1, labels = y$type) #ROCR prediction object
roc.perf.1 = performance(rocr.pred.1, measure = "tpr", x.measure = "fpr") #ROCR performance object
plot(roc.perf.1, col = "skyblue")
pred.2 = predict(glm.2, type = "response")
rocr.pred.2 = prediction(pred.2, labels = y$type)
roc.perf.2 = performance(rocr.pred.2, measure = "tpr", x.measure = "fpr")
plot(roc.perf.2, col = "darkblue", add = TRUE) #add to existing plot
abline(a = 0, b = 1, lty = 2) #diagonal corresponding to a random assignment
legend("bottomright", leg = c("bmi", "bmi+glu"), col = c("skyblue", "darkblue"), lwd = 1.5)
```
We can also get missclassification rate using `ROCR` by asking `measure = "err"`.
Let's plot it as a function of the risk cutoff and mark the value 0.235, that we
computed manually above for `glm.2` model at cutoff 0.5.
```{r}
plot(performance(rocr.pred.2, measure = "err"), col = "darkblue")
plot(performance(rocr.pred.1, measure = "err"), col = "skyblue", add = T)
legend("topright", leg = c("bmi", "bmi+glu"), col = c("skyblue", "darkblue"), lwd = 1.5)
abline(h = 0.235, lty = 2)
```
We see that the smallest missclassification rates occur around
the risk threshold region (0.4,...,0.6).
## Test data and cross-validation
Our model assessment above has a potential problem because we first fitted the regression
model in the data set `y` and then we tested how well the model was doing using the **same** data set `y`.
This procedure may yield an overestimation of the performance of the model compared to
what the performance would be in some **different** test data set. This is because the method to
fit the model maximizes the fit of the model to the given **training data**.
It could, however, perform much worse in
an unseen **test data** set, because the test data was not available to fit the model in the first place.
The severity of this **overfitting problem** depends on the context and is likely very small
in our case where the model has only a couple of predictors, but could become very severe,
if we had used hundreds of predictors in the regression model.
Therefore, in general, a reliable performance assessment of a model should be done in an independent
test data set that was not used in training of the model.
This principle is a cornerstone of modern data science and
machine learning applications.
Luckily, we have an independent test data `Pima.te` that contains 332 more women from the same Pima
population that we used above as our training data.
Let's see how our models perform in the test data.
```{r, fig.width = 5}
library(ROCR)
pred.te.1 = predict(glm.1, newdata = Pima.te, type = "response") #.te = "test"
rocr.pred.te.1 = prediction(pred.te.1, labels = Pima.te$type)
roc.perf.te.1 = performance(rocr.pred.te.1, measure = "tpr", x.measure = "fpr")
plot(roc.perf.te.1, col = "springgreen")
pred.te.2 = predict(glm.2, newdata = Pima.te, type = "response")
rocr.pred.te.2 = prediction(pred.te.2, labels = Pima.te$type)
roc.perf.te.2 = performance(rocr.pred.te.2, measure = "tpr", x.measure = "fpr")
plot(roc.perf.te.2, col = "forestgreen", add = T)
abline(a = 0, b = 1, lty = 2)
plot(roc.perf.1, col = "skyblue", add = T, lty = 2)
plot(roc.perf.2, col = "darkblue", add = T, lty = 2)
legend("bottomright", legend = c("tr bmi","tr bmi+glu","te bmi","te bmi+glu"),
col = c("skyblue","darkblue","springgreen","forestgreen"), lty = c(2,2,1,1))
```
ROC curves show that performance is fairly similar in the training and test data,
which shows that we have not overfitted our model in this example but the model
performance generalizes as such also to unseen data from the same population.
It is important to be able to test the model performance in test data that are independent
from the training data.
But in the same time it seems wasteful if we need to put aside a large proportion of the available data for
testing purposes since then we can't use all available information to fit the model.
To tackle this problem, we can use the technique of **cross-validation**.
#### Cross-validation
In cross-validation, we split the data set into K parts (typically K=5 or K=10).
Then we fit the target model K separate times, by using one of the K parts as test data and the rest
as training data. Finally, we average the performance over the K different test data results.
This procedure gives a reliable estimate how the particular model would perform in independent
test data, but in the same time we have used all the data to inform the model fitting.
We would then choose the model that performs best in cross-validated performance metric
and fit that using the whole data set to serve as our model of choice.
(If you are interested to learn more, watch ).
### Example 8.2: Logistic regression on biopsy data
Here we consider `biopsy` data set from `MASS` package.
We aim to develop a logistic regression model in order to assist
the patientâ€™s medical team
in determining whether the tumor is malignant or not.
This breast cancer database was obtained from the University of Wisconsin Hospitals,
Madison from Dr. William H. Wolberg.
He assessed biopsies of breast tumours for 699 patients
up to 15 July 1992; each of nine attributes has been scored on a scale of 1 to 10,
and the outcome is also known.
|Name|Explanation|
|----|-----------|
|ID | sample code|
|thick|clump thickness|
|u.size| uniformity of cell size|
|u.shape|uniformity of cell shape|
|adhsn| marginal adhesion|
|s.size| single epithelial cell size|
|nucl| bare nuclei (16 values are missing)|
|chrom|bland chromatin|
|n.nuc|normal nucleoli|
|mit|mitoses|
|type|"benign" or "malignant"|
```{r}
library(MASS)
y = biopsy #working copy to y
names(y) = c("ID","thick", "u.size", "u.shape", "adhsn", "s.size", "nucl", "chrom", "n.nuc", "mit", "type")
str(y)
```
We will remove ID column and we will remove 16 samples with some missing data using `na.omit()`.
The working data are assigned to `y`.
```{r}
y = na.omit(y) #remove rows with missing observations
y$ID = NULL #remove ID column
head(y)
nrow(y)
```
Let's visualize the variables.
```{r, fig.width=10, fig.height=10}
par(mfrow = c(3,3)) #3x3 plotting area for 9 variables
par(mar = c(3,4,4,1)) #crop a bit away from default margins for each figure
for(ii in 1:9){
tt = t.test(y[,ii] ~ y$type)
boxplot(y[,ii] ~ y$type, xlab = "", ylab = names(y)[ii],
cex.lab = 1.5, cex.main = 1.3, cex.axis = 1.2, col ="gray",
main = paste0(names(y)[ii]," D=",signif(tt$estimate[2]-tt$estimate[1],2),
" p=",signif(tt$p.value,2)))
}
```
We see that for all 9 variables there is a clear difference between malignant and benign cases.
Let's see the correlations between the 9 variables.
```{r}
library(corrplot)
corr = cor(y[,1:9])
corrplot.mixed(corr)
```
Quite strong positive correlations among the variables.
Let's split the data to 30% for testing and 70% for training.
It is important to do the split at random since the order of the samples in the data frame
may be associated with some of the variables whereas we want that our training and test data
are similar to each other with respect to the distribution of every variable.
```{r}
ind = sample(1:nrow(y), size = round(0.3*nrow(y))) #random sample from 1,...,683 of size 0.3*683
y.tr = y[-ind,] #70% for training
y.te = y[ind,] #30% for testing
````
Let's fit the model with all variables. We can use dot "." to denote using all variables except the
outcome variable as predictors.
```{r}
glm.full = glm(type ~ ., data = y.tr, family = "binomial") # . uses all variables as predictors (except type)
summary(glm.full)
pred.full = predict(glm.full, type = "response")
y.pred.full = ifelse(pred.full < 0.5, 0, 1)
tab.full = table(y.pred.full, y.tr$type)
tab.full
(tab.full[1,2] + tab.full[2,1]) / sum(tab.full)
```
The missclassification rate is only 1.7%.
Model looks very good in training data.
But how is it possible that
the individual predictors were not associated with the outcome
with very low P-values in the logistic regression summary?
This is because when the variables are highly correlated, their effects
are difficult to separate from each other statistically, and
hence the uncertainty of the individual model coefficients remains large
when all of the variables are included in the same model.
This leads to high P-values for each variable. However, when the model
is applied to the data, it can still make very
good predictions even though
the values of individual coefficients remain uncertain.
For comparison,
let's also make a simpler model by including only two variables `thick` and `chrom`.
```{r}
glm.2 = glm(type ~ thick + chrom, data = y.tr, family = "binomial")
summary(glm.2)
pred.2 = predict(glm.2, type = "response")
y.pred.2 = ifelse(pred.2 < 0.5, 0, 1)
tab.2 = table(y.pred.2, y.tr$type)
tab.2
(tab.2[1,2] + tab.2[2,1])/sum(tab.2)
```
This model has a bit worse missclassification rate (but much lower P-values
for the individual predictors).
Let's then see how they work in test data.
```{r}
pred.te.full = predict(glm.full, newdata = y.te, type = "response")
y.pred.te.full = ifelse(pred.te.full < 0.5, 0, 1)
tab.full = table(y.pred.te.full, y.te$type)
tab.full
(tab.full[1,2] + tab.full[2,1]) / sum(tab.full)
pred.te.2 = predict(glm.2, newdata = y.te, type = "response")
y.pred.te.2 = ifelse(pred.te.2 < 0.5, 0, 1)
tab.2 = table(y.pred.te.2, y.te$type)
tab.2
(tab.2[1,2] + tab.2[2,1]) / sum(tab.2)
```
Error rates ares slightly larger in test data than in training data
but the difference between the models stays similar:
full model has a smaller missclassification error than the simpler model.
Let's look at the ROCs.
```{r, fig.width = 5}
#Test data ROCs:
# Full model
rocr.pred.te.full = prediction(pred.te.full, labels = y.te$type)
roc.perf.te.full = performance(rocr.pred.te.full, measure = "tpr", x.measure = "fpr")
plot(roc.perf.te.full, col = "forestgreen")
# 2 predictor model
rocr.pred.te.2 = prediction(pred.te.2, labels = y.te$type)
roc.perf.te.2 = performance(rocr.pred.te.2, measure = "tpr", x.measure = "fpr")
plot(roc.perf.te.2, col = "springgreen", add = T)
#Training data ROCs:
# Full model
rocr.pred.tr.full = prediction(pred.full, labels = y.tr$type)
roc.perf.tr.full = performance(rocr.pred.tr.full, measure = "tpr", x.measure = "fpr")
plot(roc.perf.tr.full, col="darkblue", add = T, lty = 2)
# 2 predictor model
rocr.pred.tr.2 = prediction(pred.2, labels = y.tr$type)
roc.perf.tr.2 = performance(rocr.pred.tr.2, measure = "tpr", x.measure = "fpr")
plot(roc.perf.tr.2, col="skyblue", add = T, lty = 2)
abline(a = 0, b = 1, lty = 2) #diagonal for random assignment
legend("bottomright", legend = c("tr full","tr 2","te full","te 2"),
col = c("darkblue","skyblue","forestgreen","springgreen"), lty = c(2,2,1,1), lwd = 1.5)
```
We see that the full model performs better also in terms of ROC curve,
both in training and test data.
In general, the full model seems a very good classifier.
### Extra: Understanding overfitting and difference between training and testing error
In our examples, there was no dramatic difference between missclassification errors
in training vs. testing sets. This is because our logistic regression model was simple with only a few
predictors compared to the number of available observations. To understand better why it is
so important to consider a separate testing set, you can watch a video from StatsQuest: