---
title: "HDS 2. False discovery rate"
author: "Matti Pirinen, University of Helsinki"
date: "20.10.2021"
output:
html_document: default
params:
kableExtra: 0
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
In the previous lecture, we considered a multiple testing problem where
thousands of tests were done while only a few of them were expected
to be non-null. There the family-wise error control (FWER)
seemed a reasonable way to filter a few most prominent candidates for true positives from the vast set of null variables.
Let's now consider situations where we may have
a considerable amount of true
positives $m=p-p_0$ among the $p$ tested variable, e.g., $m/p = 10\%$. Let's start
by writing up easy ways to generate P-values for such a situation.
#### Distribution of z-scores
Suppose that in linear regression $y = \mu + x \beta +\varepsilon$ the true slope
is $\beta$. If $\textrm{Var}(x) = v_x$ and error variance is $\sigma^2$, then
sampling variance of $\widehat{\beta}$ is $v_\beta = \sigma^2/(n v_x)$ where
$n$ is the sample size, and z-score
for testing the slope is
$$z=\frac{\widehat{\beta}}{\sqrt{v_\beta}}\sim
\mathcal{N}\left(\frac{\beta}{\sqrt{v_\beta}},1\right). $$ We can compute
P-values for such z-scores using `pchisq(z^2, df = 1, lower = F)`, because under the
null $z \sim \mathcal{N}(0,1)$ and hence $z^2 \sim \chi^2_1$, i.e., $z^2$ follows the
central chi-square distribution with one degree of freedom.
With this method, we can
easily generate P-values for $p_0$ null variables ($\beta=0$) and $m$ non null
variables $\beta \neq 0$ and test various inference methods on those
P-values. Note that the P-value distribution of the non-null variables will
depend on sample size $n$, true value $\beta$ and the ratio of predictor and
noise variances $v_x/\sigma^2$, but the P-value distribution of the null
variables is always the same, namely, Uniform(0,1).
Let's generate P-values for $p=1000$ variables that we assume each have been tested
independently for a linear association with its own outcome variable.
Let's also assume that
$m=100$ of them actually do have an effect each explaining 1% of the variance
of the corresponding outcome.
We follow the procedure from Lecture 0 to generate such
data, and set $\sigma^2 = 1$ and $v_x = 1$.
```{r, fig.width = 5}
set.seed(11102017)
n = 1000 #sample size
p = 1000 #number of variables to be tested
m = 100 #number of non-null tests
b = sqrt( 0.01 / (1 - 0.01) ) #This means each predictor explains 1%, See Lecture 0.
#Generating P-values: 1,...,m are true effects, m+1,...,p are null.
eff = c(rep(T, m), rep(F, p-m)) #indicator for non-null effects
pval = pchisq( rnorm(p, b*sqrt(n)*as.numeric(eff), 1)^2, df = 1, lower = F)
boxplot(-log10(pval)[eff], -log10(pval)[!eff], col = c("limegreen","gray"),
names = c("EFFECTS","NULL"), ylab = "-log10 Pval")
abline(h = -log10(0.05), col = "blue", lty = 2) #significance threshold 0.05
abline(h = -log10(0.05 / p), col = "red", lty = 2) #Bonferroni corrected threshold of 0.05
```
We see that true effects tend to have smaller P-values (i.e. larger -log10
P-values) than null effects, but there is some overlap between the distributions
and therefore, from P-values alone, we cannot have a rule that would detect all true
positives but would not report any false positives.
Remember the characterization of test results according to this table:
```{r, echo = F}
dt = data.frame(
"Test result" =
c("negative (not significant)",
"positive (significant, discovery)",
"Total"),
"not null" = c("FN","TD","p-p0"),
"null" = c("TN", "FD", "p0"),
"Total" = c("p-D","D","p")
)
if(params$kableExtra != 1) {
knitr::kable(dt, col.names = gsub("[.]", " ", names(dt)))
}else{
library(kableExtra)
kbl(dt, col.names = gsub("[.]", " ", names(dt))) %>%
kable_classic(full_width = F, position = "left") %>%
add_header_above(c(" " = 1, "True state of the world" = 2, " " = 1))
}
```
Let's print such a table when we first do inference based on P-value threshold
(say 0.05) and then use Bonferroni correction.
```{r}
alpha = 0.05
dis = (pval < alpha) #logical indicating discoveries
table(dis, !eff, dnn = c("discov","null"))
dis = (p.adjust(pval, method = "bonferroni") < alpha)
table(dis, !eff, dnn = c("discov","null"))
```
The problem with the raw P-value threshold is that there are many
false discoveries (over one third of all discoveries are false). The problem
with the Bonferroni correction is that only a third of all true effects are
discovered. We want something in between. We want to directly control the
**proportion** of false discoveries made out of all discoveries.
### Definition of false discovery rate
Let's define **False Discovery Proportion (FDP)** as a random variable
$$ \textrm{FDP}=\frac{\text{FD}}{\max\{1,D\}}=\left\{\begin{array}{cc}
\frac{\text{FD}}{D}, \text{ if }D>0.\\
0, \text{ if }D=0. \end{array}\right.$$
**False Discovery Rate (FDR)** is the expectation of FDP:
$$ \textrm{FDR}=\textrm{E}(\textrm{FDP}).$$
Thus, it is the
(expected) proportion of false discoveries among all discoveries. By controlling
FDR at a given level $\alpha_{F}$, we will tolerate more false discoveries as
the number of tests increases as long as we will also keep on doing more true
discoveries. If a method guarantees to keep $\alpha_F = 0.1$, then,
in expectation, when
we make 10 discoveries we allow 1 of them to be a false discovery, whereas when
we make 100 discoveries we allow 10 false discoveries among them. Note the
difference to FWER control where we always allow at most 1 false discovery in
the experiment, no matter whether we are doing 1, 10, 1000 or 100,000 discoveries
altogether.
Given a rule for calling variables significant, the false
positive rate is the rate that truly null variables are called significant
and is measured by P-value.
The FDR, instead, is the rate at which significant variables
are truly null. For example,
a false positive rate of 5% means that, on average, 5% of
the truly null variables in the study will be called significant. A FDR
of 5% means that, on average, among all variables called significant,
5% of them are truly null.
Let's see which P-value threshold would yield a false discovery proportion (FDP)
of $\alpha_F=0.05$ in our ongoing example. (Note FDP is the realized value of
FD/D whereas FDR is its expectation. In real life we can't know FDP, but in
these simulations we can!)
```{r}
sort.pval = sort(pval) #sorted from smallest to largest
sort.eff = eff[order(pval)] #whether each sorted pval is from a true positive
fdp = cumsum(!sort.eff)/(1:p) #which proportion of discoveries are false
cols = rep("red",p); cols[sort.eff] = "orange" #true pos. in orange
plot(log10(sort.pval), fdp, xlab = "log10 P-value", ylab = "FDP",
col = cols, cex = 0.7, pch = 20)
alpha = 0.05
i = max( which(fdp < alpha) )
print( paste("fdp <",alpha,"when P-value is <",signif(sort.pval[i],3)) )
abline(v = log10(sort.pval[i]), col = "blue")
abline(h = alpha, col = "red")
#shows the step where fdp < alpha breaks
cbind(D=i:(i+1), FD=fdp[i:(i+1)]*c(i,i+1), fdp=fdp[i:(i+1)], pval=sort.pval[i:(i+1)])
```
So if we had a method that controlled FDR at 0.05, we expect it to give roughly
D=75 discoveries of which FD=3 are false. Compare this to D=139 and FD=49 with
unadjusted P-value threshold of 0.05 and to D=27 and FD=0 with Bonferroni
adjusted threshold of 0.05/p. But how can we in general control FDR given the
set of P-values?
Such a method was formulated by Yoav Benjamini and Yosef Hochberg in 1995.
#### Benjamini-Hochberg procedure (1995)
Let $H_j$ be the null
hypothesis for test $j$ and let $P_j$ be the corresponding P-value. Denote the
ordered sequence of P-values as $P_{(1)} \leq P_{(2)} \leq \ldots \leq P_{(p)}$
and let $H_{(j)}$ be the hypothesis corresponding to the $j$th P-value.
Benjamini-Hochberg procedure at level $\alpha_F$ (BH($\alpha_F$)) is to
$$\textrm{reject
the null hypotheses } H_{(1)},\ldots,H_{(k)}\textrm{, where }
k \textrm{ is the largest index } j \textrm{ for which } P_{(j)} \leq \frac{j}{p}\alpha_F.$$
**Theorem (BH).** For independent tests and for any configuration of false null hypotheses, BH($\alpha_F$) controls the FDR at level $\alpha_F$.
Original proof in Benjamini and Hochberg [(1995, J. R. Statist. Soc. B. 57(1):289-300)](https://www.jstor.org/stable/2346101).
**Intuitive explanation why BH works.**
Consider P-value $P_{(j)}$. Since $P_{(j)}$ is the $j$th smallest P-value, if we draw a significance threshold at $P_{(j)}$ we have made exactly $j$ discoveries.
On the other hand, we expect that out of all $p_0$ null effects
about $p_0\cdot P_{(j)}$ give a P-value $\leq P_{(j)}$. Thus we have an
approximation for false discovery proportion at threshold $P_{(j)}$
$$\textrm{FDP}(P_{(j)}) \approx \frac{p_0\cdot P_{(j)}}{j} \leq \frac{p \cdot P_{(j)}}{j}.$$
If we simply ask that for which $j$ is this estimated FDP$(P_{(j)})\leq \alpha_F$,
we get
$$\frac{p \cdot P_{(j)}}{j} \leq \alpha_F\, \Leftrightarrow\, P_{(j)} \leq \frac{j}{p}\alpha_F,$$
which is the condition of BH procedure.
**Proof of BH.** (Adapted from [Aditya Guntuboyina](https://statpacking.wordpress.com/2015/08/03/fdr-and-benjamini-hochberg/).)
Assume we have $p$ hypotheses and corresponding
P-values $\pmb{P}=(P_1,\ldots,P_p)$ and that BH($\alpha_F$) method
rejects $k$ of these hypotheses. This means that when the P-values are in
the ascending order $(P_{(1)},\ldots,P_{(p)})$,
then $k$ is the largest index that satisfies the BH-condition
$$P_{(k)} \leq {\frac{k}{p}\alpha_F}.\qquad \textrm{(1)}$$
We can write FDP as
$\frac{1}{\max\{k,1\}}\sum_{j \in I_0} \textrm{I}\{P_j \leq \alpha_F\cdot k / p\}$, where
$I_0$ is the set of $p_0$ true null hypotheses, and $\textrm{I}$ is an indicator function.
$$
\textrm{FDR} = \textrm{E}(\textrm{FDP}) =
\textrm{E}\left(\sum_{j \in I_0} \frac{\textrm{I}\{P_j \leq \alpha_F\cdot k / p\}}{\max\{k,1\}}\right) \qquad (2)
$$
The problem with trying to simplify this further is that $P_j$ and $k$ are not independent.
The idea in this proof is to replace $k$ with a closely related quantity
$\tilde{k}_j$ that is independent of $P_j$.
For any $j$, denote by
$\tilde{\pmb{P}}_j = (P_1,\ldots,P_{j-1},0,P_{j+1},\ldots P_p)$ the sequence of
P-values where $P_j$ has been changed to 0.
Denote by $\tilde{k}_j$ the number of rejections that BH($\alpha_F$) method
does when it is applied to $\tilde{\pmb{P}}_j$. Importantly,
$\tilde{k}_j$ only depends on $(P_i)_{i\neq j}$ but not on $P_j$, that,
by an assumption of the BH theorem, is independent of $(P_i)_{i\neq j}$.
Clearly $k \leq \tilde{k}_j$ since the difference between the two sequences
of P-values is that the element $j$ in $\tilde{\pmb{P}}_j$ is $0 \leq P_j$
while all other elements are the same between the two sequences. It is also
clear that $\tilde{k}_j \geq 1$ since BH($\alpha_F$) always rejects the hypothesis
corresponding to P-value 0.
We next show that
$$
\frac{\textrm{I}\{P_j \leq \alpha_F\cdot k / p\}}{\max\{k,1\}} \leq
\frac{\textrm{I}\{P_j \leq \alpha_F\cdot \tilde{k}_j / p\}}{\tilde{k}_j},\qquad
\textrm{for all } j=1,\ldots,p.\qquad (3)
$$
Since the right-hand side is non-negative, the
inequality (3) holds when
$\textrm{I}\{P_j \leq \alpha_F\cdot k / p\}=0.$
Consider then the remaining case $\textrm{I}\{P_j \leq \alpha_F\cdot k / p\} = 1$.
Since $k$ is the largest index for which the BH-condition (1) holds,
$P_j \leq P_{(k)},$ and hence the ordered sequences of P-values $\pmb{P}$ and
$\tilde{\pmb{P}}_j$ agree on every element that is $\geq P_j$.
In particular, they agree on every element that is $\geq P_{(k)}$.
Since $k$ was the largest index in ordered sequence $\pmb{P}$
for which the BH-condition (1) holds, it follows that
BH($\alpha_F$) rejects exactly the same hypotheses from both sequences.
Thus $\tilde{k}_j = k$ and
$\textrm{I}\{P_j \leq \alpha_F\cdot \tilde{k}_j / p\} = \textrm{I}\{P_j \leq \alpha_F\cdot k / p\} = 1,$
which proves the inequality (3) also in this case.
Let's get back to formula (2) and apply (3).
\begin{eqnarray*}
\textrm{FDR} &=& \textrm{E}(\textrm{FDP}) =
\textrm{E}\left(\sum_{j \in I_0} \frac{\textrm{I}\{P_j \leq \alpha_F\cdot k / p)\}}{\max\{k,1\}}\right) \leq
\textrm{E}\left( \sum_{j \in I_0} \frac{\textrm{I}\{P_j \leq \alpha_F\cdot \tilde{k}_j / p)\}}{\tilde{k}_j} \right) \\
&=&
\textrm{E}\left(\textrm{E}\left( \sum_{j \in I_0} \frac{\textrm{I}\{P_j \leq \alpha_F\cdot \tilde{k}_j / p\}}{\tilde{k}_j}\,\middle|\,\tilde{k}_j \right)\right) =
\textrm{E}\left( \sum_{j \in I_0} \frac{\textrm{E}(\textrm{I}\{P_j \leq \alpha_F\cdot \tilde{k}_j / p\} \,|\, \tilde{k}_j)}{\tilde{k}_j} \right)\\
&=& \textrm{E}\left( \sum_{j \in I_0} \frac{ \alpha_F \tilde{k}_j}{p \tilde{k}_j} \right)
= \frac{p_0}{p}\alpha_F \leq \alpha_F.
\end{eqnarray*}
The last line follows since the sum is over $p_0$ true null hypotheses whose
P-values are uniformly distributed in (0,1) and because $P_j$ is independent of
$\tilde{k}_j$.
This proves that BH($\alpha_F$) controls FDR at level $\alpha_F$
(actually even more strictly at level $\alpha_F \cdot p_0/p$).$\qquad \square$
Like in Holm procedure, in BH the rejection of the tested hypotheses depends not only on their P-values but also on their rank among all the P-values. We see that the critical threshold increases from the Bonferroni threshold $\alpha_F/p$ for the smallest P-value to the unadjusted threshold $\alpha_F$ for the largest P-value. Crucially, if any P-value $P_{(j)}$ is below its own threshold $j\cdot \alpha_F/p$,
it means that also ALL the hypotheses corresponding to smaller P-values will be rejected,
no matter whether they are below their own thresholds.
Let's apply BH to our current data, first manually and then through `p.adjust`.
```{r}
alpha = 0.05
i.BH = max(which(sort.pval <= ((1:p)*alpha)/p))
print(paste("Reject 1,...,",i.BH," i.e, if P-value <=",signif(sort.pval[i.BH],3)))
print(paste0("Discoveries:",i.BH,
"; False Discoveries:",sum(!sort.eff[1:i.BH]),
"; fdp=",signif(sum(!sort.eff[1:i.BH])/i.BH,2)))
pval.BH = p.adjust(pval,method = "BH")
#These are pvals adjusted by a factor p/rank[i]
#AND by the fact that no adjusted P-value can be larger than any of other adjusted
#P-values that come later in the ranking of the original P-values (See home exercises)
sum(pval.BH < alpha) #should be D given above
```
Let's draw a picture with P-value threshold (blue), Bonferroni threshold (purple) and BH threshold (cyan).
Let's sort the P-values from smallest to largest and draw P-values on -log10 scale.
```{r}
plot(1:p, -log10(sort.pval), pch = 3, xlab = "rank", ylab = "-log10 pval", col = cols, cex = 0.7)
abline(h = -log10(alpha), col = "blue", lwd = 2)
abline(h = -log10(alpha/p), col = "purple", lwd = 2)
lines(1:p, -log10((1:p)*alpha/p), col = "cyan", lwd = 2)
legend("topright", legend = paste(c("P=","BH","Bonferr."), alpha),
col = c("blue","cyan","purple"), lwd = 2)
```
The Figure tells that BH quite nicely draws the threshold around the point
where true effects start to mix with null effects, whereas Bonferroni is
too stringent and raw P-value based significance level is too liberal
when the goal is to separate orange and red.
*Exercise.* Suppose that all your $p=1000$ P-values were between $(0.01,0.05)$,
so they are not very small but still they are at the left tail of the whole
Uniform(0,1) distribution.
How would the three methods (Bonferroni, BH, "P<0.05")
label variables significant in this case,
when we control each at level $0.05$ as in the Figure above?
Earlier we saw that by using significance level $P\leq 0.05$ we tend to get a lot of
false positives when $p$ is large. How can our FDR method, that should keep
false discovery rate small, then agree with the traditional significance
threshold in this case? Why does it think that there are not a lot of false positives
among these 1000 variables?
#### Benjamini-Yekutieli procedure
It is important to remember that BH procedure was proven for **independent**
test statistics and therefore it is not as generally applicable as Bonferroni
and Holm methods.
(In practice, however, BH has been shown to be quite robust to violations of independence
assumption.)
An extension of BH has been proven to control FDR in all cases
by Benjamini and Yekutieli
[(The Annals of Statistics 2001, Vol. 29, No. 4, 1165-1188)](https://projecteuclid.org/euclid.aos/1013699998).
**Theorem.** When the Benjamini-Hochberg procedure is conducted with
$\alpha_F/\left(\sum_{j=1}^p \frac{1}{j}\right)$ in place of $\alpha_F$, it always controls the FDR at level $\leq \alpha_F$. (Proof skipped.)
This Benjamini-Yekutieli (BY) procedure can also be done by `p.adjust`.
```{r}
alpha = 0.05
i.BY = max(which(sort.pval <= ((1:p)*alpha/sum(1/(1:p))/p)))
print(paste("Reject 1,...,",i.BY," i.e, if P-value <",signif(sort.pval[i.BY],2)))
print(paste0("Discoveries:",i.BY,
"; False Discoveries:",sum(!sort.eff[1:i.BY]),
"; fdp=",signif(sum(!sort.eff[1:i.BY])/i.BY,2)))
pval.BY = p.adjust(pval,method = "BY") #these are pvals adjusted by factor p/rank[i]*sum(1/(1:p))
sum(pval.BY < alpha) #should be D given above
```
As we see, BY is more conservative than BH, which is the price to pay for proven guarantees of control of FDR in case of all possible dependency structures.
#### Relationship between FWER and FDR
We have
$$\textrm{FDR} = \textrm{E}(\textrm{FDP}) =
\textrm{Pr}(D=0) \cdot 0 + \textrm{Pr}(D > 0) \cdot \textrm{E}\left(\frac{\text{FD}}{D}\,\middle|\, D>0 \right) = \textrm{Pr}(D > 0) \cdot \textrm{E}\left(\frac{\text{FD}}{D}\,\middle|\,D>0\right).$$
**FDR weakly controls FWER.**
If all null hypotheses are true, then the concept of FDR is the same as FWER.
To see this note that here $\text{FD}=D$ and therefore if $\text{FD}=0$ then
$\textrm{FDP}=0$ and if $\text{FD}>0$ then $\textrm{FDP}=1$.
Thus, $\textrm{FDR} = \textrm{E}(\textrm{FDP}) = \textrm{Pr}(\text{FD}>0) = \textrm{FWER}$.
This means that FDR *weakly controls FWER*: If all null hypotheses are true, then
any method that controls FDR at level $\alpha$ also controls FWER at level $\alpha$.
However, if some null hypotheses are false, then FDR doesn't typically
control FWER.
**FWER controls FDR.**
Because $\textrm{FDP} \leq \textrm{I}(\text{FD}>0)$, where $\textrm{I}$ is the indicator function,
by taking expectation,
$$\textrm{FDR} = \textrm{E}(\textrm{FDP}) \leq \textrm{Pr}(\text{FD} > 0) = \textrm{FWER}.$$
Thus, any method that controls FWER at level $\alpha$ also controls FDR at level $\alpha.$
#### Examples
**Example 2.1.** Let's see how BH controls FWER under the global null hypothesis.
```{r}
p = 1000 #variables for each data set
alpha = 0.1 #target FWER
R = 1000 #replications of data set
res = matrix(NA, ncol = 3, nrow = R) #collect the number of discoveries by BH, Holm, Bonferr.
for(rr in 1:R){
#Generate P-values that are null.
pval = runif(p)
res[rr,] = c(sum(p.adjust(pval, method = "BH") < alpha),
sum(p.adjust(pval, method = "holm") < alpha),
sum(p.adjust(pval, method = "bonferroni") < alpha))
}
apply(res > 0, 2, mean) #which proportion report at least one discovery?
```
All are close to the target FWER value $\alpha=0.1.$
**Example 2.2.**
BH method has a property that status of being a discovery at level $\alpha$
can change if a new variable is included among the tested variables.
```{r}
pval = c(0.014, 0.09, 0.05, 0.16)
p.adjust(pval, method = "BH")
p.adjust(c(pval, 0.001), method = "BH")
```
Now the status at FDR level $\alpha = 0.05$ of P-value
0.014 has changed when we added one more variable with lower P-value.
This is because this new variable is taken as a discovery and, intuitively,
this inclusion then allows some more false discoveries to be included among
the discoveries without breaking the idea of controlling FDR.
**Example 2.3.**
Bonferroni, Holm and BH behave differently in whether the
P-values close to each other can become the same after the adjustment.
Let's demonstrate this with one tie in the middle of 4 P-values.
```{r}
pval = c(0.01, 0.02, 0.02, 0.03)
rbind(pval, p.adjust(pval, "bonferroni"))
rbind(pval, p.adjust(pval, "holm"))
rbind(pval, p.adjust(pval, "BH"))
```
Bonferroni multiplies each P-value by same value (here 4) and hence
tie will stay as it is but not unite with its neighbors.
Holm first multiplies by $(p-j+1)$, i.e., the 2nd value by 3 (to get 0.06)
and 3rd value by 2 (to get 0.04) and the 4th value stays as it is (0.03).
But then it needs to make sure that the adjusted P-values are in ascending order
and it does this, for each P-value, by taking the maximum with P-values on the
left. Hence all last 3 P-values become 0.06.
BH first multiplies by $(p/j)$, i.e., the 1st value by 4 (to get 0.04),
the 2nd value by 2 (to get 0.04)
and 3rd value by 4/3 (to get 0.0266667) and the 4th value stays as it is (0.03).
It also makes sure that the adjusted P-values become in ascending
order but it makes this by taking, for each P-value, the minimum with all
other P-values on the right hand side. Hence also 1st and 2nd P-values become
0.0266667.
We say that Holm method is a step-down procedure
and BH method is a step-up procedure.
Note that in this terminology,
"up" and "down" refer to the ordering of the test statistics and
not to the ordering of the P-values.
**Step-down procedures** start from the hypothesis with the
largest test statistic (and lowest P-value)
and step down through the sequence of hypotheses in descending order of
their test statistics while rejecting the hypotheses.
The procedure stops at the first non-rejection and labels
all remaining hypotheses as not-rejected.
**Step-up procedures** start from the hypothesis with the
lowest test statistic value (and highest P-value)
and step up through the sequence of hypotheses in ascending
order of their test statistics while retaining
hypotheses. The procedure stops at the first rejection and labels all remaining
hypotheses as rejected.
**Single step procedures** are the ones where
the same criterion is used for each hypothesis,
independent of its rank among the test statistics.
Bonferroni correction and fixed significance level testing
are examples of single step procedures.
**Questions.**
1. If you use BH($\alpha_F$) method to choose the significant variables,
are you guaranteed to have at most $\alpha_F D$ false discoveries among the chosen variables, where $D$ is the total number of discoveries made by the procedure?
2. If you have $p=1000$ variables to test, when should you use FDR control
and when FWER control?
3. What could be a situation when you would rather use BY procedure
than BH procedure to control for FDR?