So far, we have used P-values for inference based on the false positive rate (this is the definition of the P-value itself), on the family-wise error rate (FWER; Bonferroni and Holm) and on the false discovery rate (FDR; Benjamini-Hochberg and Benjamini-Yekutieli). See the slides for a summary of these three approaches.
Here we will consider a quantity called Q-value that
can be attached to each test and that gives an empirical estimate of FDR
among all tests whose Q-values are smaller than or equal to the given
Q-value. We will follow the exposition of Storey and Tibshirani (2003, PNAS 100:
9440-9445), whose work led to the qvalue
R-package.
Before defining Q-value, we will first refine the BH method by
empirically estimating \(p_0\), the
number of null tests. Remember that the proof of the BH theorem showed that
the BH(\(\alpha_F\)) method actually
controls FDR at level \(p_0/p \cdot
\alpha_F.\) Thus, when \(p_0\)
is considerably smaller than \(p\), we
can improve the accuracy of an FDR method by estimating \(p_0.\)
(Technical note: In this text, we are assuming that the procedure to define discoveries has produced at least one discovery. That is, we assume that \(D>0\) even though we do not explicitly write it out in the formulas below. To be precise, this quantity is called the positive false discovery rate (pFDR). Even though the pFDR is a slightly modified version of the FDR we defined previously, in practice the two become equal when \(\text{Pr}(D>0) \approx 1,\) which is the case in any interesting analysis setting where at least some discoveries are expected to be made.)
Let’s define, for each P-value threshold \(t\in[0,1],\) \[\textrm{FDR}(t) = \textrm{E}\left(\frac{\textrm{FD}(t)}{\max\{{D(t),1}\}}\right),\] where random variables \(\textrm{FD}(t)=\#\{\textrm{null P-values} \leq t\}\) and \(D(t)=\#\{\textrm{P-values} \leq t\}\) in an experiment where in total \(p\) P-values are available.
To refine our understanding of the FDR methods, let’s think that the random variables \(\textrm{FD}(t)\) and \(D(t)\) result from \(p\) draws of P-values from a mixture distribution between Uniform(0,1) (for null P-values) and an alternative distribution with cdf \(\Phi_1\) and pdf \(\phi_1\) (for non-null P-values), with a mixture proportion \(\pi_0\) for the null distribution. In other words, the cdf \(\Phi\) and pdf \(\phi\) of the P-values are \[\begin{eqnarray} \Phi(t)&=&\pi_0 \cdot t + (1-\pi_0) \Phi_1(t), \qquad t\in[0,1],\\ \phi(t)&=&\pi_0 \cdot 1 + (1-\pi_0) \phi_1(t), \qquad t\in[0,1]. \end{eqnarray}\] We can interpret sampling from such a mixture distribution as a two-step process. Namely, we first choose between the null distribution (with probability \(\pi_0\)) and the alternative distribution (with probability \(1-\pi_0\)), and second, conditional on the chosen distribution, we sample a P-value from the chosen distribution.
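This two-step interpretation can be checked with a short simulation. Here the choices \(\pi_0 = 0.8\) and the Beta(0.1, 4.9) alternative are illustrative only (the same ones used in Example 3.1 below):

```r
# Sketch of the two-step sampling interpretation of the mixture:
# first choose the component, then draw a P-value from it, and compare
# the empirical cdf at a threshold t to the mixture cdf Phi(t).
set.seed(1)
n = 1e5
pi.0 = 0.8
from.null = rbinom(n, size = 1, prob = pi.0) == 1 # step 1: choose the component
pv = ifelse(from.null, runif(n), rbeta(n, 0.1, 4.9)) # step 2: draw from chosen one
t = 0.05
Phi.t = pi.0 * t + (1 - pi.0) * pbeta(t, 0.1, 4.9) # mixture cdf at t
c(empirical = mean(pv <= t), theoretical = Phi.t) # these two should be close
```

With \(n = 10^5\) draws the empirical proportion of P-values \(\leq t\) matches the mixture cdf to about two decimal places.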
Example 3.1. Suppose we do \(p=10,000\) tests of which \(m=2000\) are non-null, i.e., we can model this as a mixture distribution with \(\pi_0=\frac{p-m}{p} = 0.80\). The null P-values come from Uniform(0,1) and the non-null P-values come from the distribution Beta(0.1,4.9). (Note that Beta(0.1,4.9) gives values in [0,1] that have a mean of \(0.1/(0.1 + 4.9)=0.02\) and are heavily right-skewed; there is no particular reason why in real data the non-null P-values would follow exactly this distribution, but it serves our demonstration purposes here; see the figures below.) Let’s plot the theoretical (orange) and empirical (green) density functions for (1) null tests, (2) non-null tests and (3) all tests combined, i.e., the mixture distribution of the null and alternative distributions.
p = 10000
m = 2000
beta.1 = 0.1 # weight for unit interval's end point 1
beta.0 = 4.9 # weight for unit interval's end point 0
null.pval = runif(p - m, 0, 1)
alt.pval = rbeta(m, beta.1, beta.0) # non-null = alternative distribution
pval = c(alt.pval, null.pval) # all P-values together
eff = c(rep(TRUE, m), rep(FALSE, p - m)) # indicator for non-null effects
par(mfrow = c(1,3)) # Empirical histogram and theoretical curve for (1), (2), (3)
hist(null.pval, breaks = 20, prob = TRUE, col = "limegreen",
     main = "null", xlab = "", xlim = c(0,1), sub = "",
     lwd = 1.5, lty = 2, xaxs = "i", border = NA)
curve(dunif(x, 0, 1), 0, 1, add = TRUE, col = "orange", lwd = 2)
hist(alt.pval, breaks = 20, prob = TRUE, col = "limegreen",
     main = "non-null", xlab = "", xlim = c(0,1), sub = "",
     lwd = 1.5, lty = 2, xaxs = "i", border = NA)
curve(dbeta(x, shape1 = beta.1, shape2 = beta.0), 0, 1, add = TRUE,
      col = "orange", lwd = 2)
legend("topright", fill = c("limegreen","orange"),
       legend = c("empir.","theor."), cex = 1.5)
hist(pval, breaks = 20, prob = TRUE, col = "limegreen",
     main = "combined", xlab = "", xlim = c(0, 1), sub = "",
     lwd = 1.5, lty = 2, xaxs = "i", border = NA)
curve((p - m)/p * 1 + m/p * dbeta(x, shape1 = beta.1, shape2 = beta.0), 0, 1,
      add = TRUE, col = "orange", lwd = 2)
Density plots show how the null P-values follow the uniform distribution and the non-null P-values are enriched for small values near 0.
summary(null.pval)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000108 0.2521960 0.4999947 0.5002426 0.7467176 0.9998005
summary(alt.pval)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0000001 0.0001608 0.0196240 0.0081581 0.7202532
When these two distributions are combined in the rightmost panel, the enrichment of small P-values is still present but has a smaller weight of 20% compared to the weight of 80% given to the null distribution.
Since \(\Phi(t)\) is the probability that a randomly sampled P-value from this mixture distribution is \(\leq t\), the random variables \(D(t)\) and \(\textrm{FD}(t)\) are distributed as \[\begin{eqnarray} D(t) &\sim& \textrm{Bin}(p,\Phi(t)) \\ \textrm{FD}(t)\,|\,D(t)&\sim& \textrm{Bin}(D(t), \theta_t), \text{ where } \\ \theta_{t} &=& \textrm{Pr}(\textrm{NULL} \, | \, P \leq t ) = \frac{\textrm{Pr}(\textrm{NULL})\,\textrm{Pr}(P \leq t \, | \, \textrm{NULL})}{\textrm{Pr}(P \leq t)}\\ &=&\frac{\pi_0 t} {\pi_0 t + (1-\pi_0)\Phi_1(t)}. \end{eqnarray}\] Using the law of total expectation, \(\textrm{E}(Y) = \textrm{E}(\textrm{E}(Y\,|\,X))\), and assuming that \(D(t)>0\), we have that \[\textrm{FDR}(t) = \textrm{E}\left(\frac{\textrm{FD}(t)}{D(t)}\right)= \textrm{E}\left(\textrm{E}\left(\frac{\textrm{FD}(t)}{D(t)}\,\middle|\,D(t)\right)\right)= \textrm{E}\left(\frac{1}{D(t)} \textrm{E}(\textrm{FD}(t)\,|\,D(t)) \right)= \textrm{E}\left(\frac{1}{D(t)} \theta_{t}\,D(t) \right)= \theta_{t}.\] On the other hand, \(\textrm{E}(\textrm{FD}(t)) = \textrm{E}(\textrm{E}(\textrm{FD}(t)\,|\,D(t))) =\textrm{E}(D(t)\theta_t) = \theta_t\cdot \textrm{E}(D(t)).\) Thus \[\frac{\textrm{E}(\textrm{FD}(t))}{\textrm{E}(D(t))} = \frac{\theta_t\cdot\textrm{E}(D(t))}{\textrm{E}(D(t))} = \theta_t= \textrm{FDR}(t). \] So we can estimate \(\textrm{FDR}(t)\) by the ratio of the expectations of \(\textrm{FD}(t)\) and \(D(t)\). And what could we use as estimates for these expectations?
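As a sanity check of this derivation, a small simulation (using the mixture of Example 3.1 as an assumed setting) can confirm that both \(\textrm{E}(\textrm{FD}(t))/\textrm{E}(D(t))\) and the direct Monte Carlo estimate of \(\textrm{FDR}(t)\) agree with \(\theta_t\):

```r
# Monte Carlo check that E(FD(t))/E(D(t)) and E(FD(t)/max(D(t),1))
# both equal theta_t, using the mixture from Example 3.1
# (pi_0 = 0.8, alternative Beta(0.1, 4.9)).
set.seed(2)
p = 10000; m = 2000; t = 0.01
R = 200
fd = d = numeric(R)
for (r in 1:R) {
  pv = c(rbeta(m, 0.1, 4.9), runif(p - m))
  fd[r] = sum(pv[(m + 1):p] <= t) # false discoveries (null P-values <= t)
  d[r]  = sum(pv <= t)            # all discoveries
}
pi.0 = (p - m) / p
theta.t = pi.0 * t / (pi.0 * t + (1 - pi.0) * pbeta(t, 0.1, 4.9))
c(mean(fd)/mean(d), mean(fd / pmax(d, 1)), theta.t) # all three should agree
```

In this setting \(D(t)\) is always large, so the ratio of means and the mean of ratios are essentially identical, as the derivation predicts.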
For each P-value threshold \(t\), denote the number of all discoveries at threshold \(t\) by \(\widehat{D}(t) = \#\{\textrm{P-values} \leq t\}.\) We use this to estimate \(\textrm{E}(D(t)) \approx \widehat{D}(t)\).
To estimate \(\textrm{E}(FD(t))\), we remember that the null P-values are uniformly distributed and hence \(\textrm{E}(FD(t)) = p_0 \cdot t = \pi_0\cdot p \cdot t.\) In estimating \(\pi_0\), we again rely on the fact that the null P-values are uniformly distributed and that most P-values near 1 are expected to come from the null distribution. Let’s recall what our P-value distribution looked like.
hist(pval, breaks = 40, prob = TRUE, col = "limegreen", main = "All P-values",
     xlab = "", xlim = c(0,1), sub = "", lwd = 1.5, lty = 2, xaxs = "i", border = NA)
abline(h = 1, lty = 2, lwd = 2)
We can see that the density of P-values > 0.2 looks fairly flat.
If we assume that the density of the non-null P-values is \(\approx 0\) in this region, then the
density function of P-values in this region is approximately \(\pi_0\cdot 1 + (1-\pi_0)\cdot 0 = \pi_0\).
Thus the value of the density function in this flat part gives an
estimate of the overall proportion of null P-values \(\pi_0\). The constant density function over
any interval \((\lambda, 1]\) is the
probability mass in the interval divided by the length of the interval
(\(1-\lambda\)). With empirical data,
this probability mass is
\(\#\{P_j >\lambda \,|\,
j=1,\ldots,p\}/p\). We can apply such an estimate for any value
\(\lambda\) to yield an estimator \[
\widehat{\pi}_0(\lambda) = \frac{\#\{P_j >\lambda\, |\,
j=1,\ldots,p\}}{(1-\lambda) p}.
\] Parameter \(\lambda\) should
be large enough that there are not many non-null effects among the
P-values \(>\lambda\) but,
simultaneously, \(\lambda\) should be
small enough that there are enough null P-values \(> \lambda\) so that the estimate has a
reasonable precision. The inclusion of a few non-null P-values \(>\lambda\) makes this estimate
conservative, i.e., an overestimate of \(\pi_0\). That keeps us on the safe side and
leads to an overestimate of \(\textrm{FDR}(t)\).
If we take \(\lambda = 0\), then \(\widehat{\pi}_0(\lambda)=1\) which is too conservative when a considerable proportion of the tests are expected to be truly non-null. On the other hand, as we set \(\lambda\) closer to 1, the variance of \(\widehat{\pi}_0(\lambda)\) increases due to the decreasing number of P-values exceeding \(\lambda\), which makes the estimate more unreliable. Therefore, as a simple, practical choice for \(\lambda\), we could use 0.5, but an optimal threshold would depend on the data set. Let’s see how the choice of \(\lambda\) affects the estimate \(\widehat{\pi}_0(\lambda)\) in our current example data.
lambda = seq(0, 1, by = 0.05)
pi.0 = sapply(lambda, function(x) {sum(pval > x)}) / p / (1 - lambda)
plot(lambda, pi.0, t = "b", xlab = expression(lambda), ylab = expression(hat(pi)[0]))
abline(h = 1 - m/p, col = "red", lty = 2) #this is the true value
Indeed, we see that with values near \(\lambda = 0.5\), the estimate \(\widehat{\pi}_0(\lambda)\) is a good one whereas its quality deteriorates towards either of the endpoints of the interval.
With this estimator for \(\widehat{\pi}_0\), we have the estimate \[\widehat{\textrm{FDR}}(t) = \frac{\widehat{\textrm{FD}}(t)}{\widehat{D}(t)}= \frac{p\cdot \widehat{\textrm{Pr}}(\textrm{NULL P-value})\cdot \textrm{Pr}(P \leq t \,|\, \textrm{NULL P-value})}{\widehat{D}(t)} = \frac{p\cdot \widehat{\pi}_0 \cdot t}{\widehat{D}(t)}.\]
Let’s plot our estimate of \(\widehat{\textrm{FDR}}(t)\) in orange as a function of \(t\) by using the value \(\widehat{\pi}_0(0.5)\) estimated above. Let’s also plot a curve that corresponds to a conservative assumption \(\pi_0=1\) in red and let’s add the true empirical \(\textrm{FDP}(t)\) curve in blue, which we are able to do here since we know which P-values were simulated from the null.
pval.sorted = sort(pval)
pi.0 = pi.0[which(lambda == 0.5)]
par(mar = c(5,6,2,1)) # widen left margin for text
plot(pval.sorted, p * pi.0 * pval.sorted / (1:p), xlab = "t",
     ylab = expression(paste(hat(FDR),"(t) or FDP(t)")), t = "l", col = "orange",
     xlim = c(0,1), ylim = c(0,1), lwd = 1.5)
# Add line for the conservative assumption pi.0 = 1:
lines(pval.sorted, p * 1 * pval.sorted / (1:p), col = "red", lwd = 1.5)
# Add the empirical FDP line:
fdp = cumsum(!eff[order(pval)]) / (1:p) # true false discovery proportion in data
lines(pval.sorted, fdp, lty = 3, col = "blue", lwd = 1)
legend("bottomright", leg = expression(hat(pi)[0], pi[0] == 1, paste("true fdp")),
       col = c("orange", "red", "blue"), lty = c(1, 1, 3), lwd = c(2, 2, 1))
Note that the derivation of BH approach (in lecture notes HDS2) made the conservative assumption \(\pi_0 = 1\) (or equivalently \(p_0 = p\)) which corresponds to the red curve in the figure above. Now we have improved that estimate by empirically estimating \(\widehat{\pi}_0 < 1\) (orange curve). In this particular simulated data set, we know the actual false discovery proportion for any threshold \(t\) (blue curve). We see that, as expected, the red line is clearly conservative: \(\widehat{\textrm{FDR}}(t) > \textrm{FDP}(t)\). The orange line instead becomes an accurate estimate for \(\textrm{FDP}(t)\). In practice, an empirically estimated \(\widehat{\pi}_0\) will sometimes lead to higher and sometimes lower values of \(\widehat{\textrm{FDR}}(t)\) compared to the true \(\textrm{FDP}(t)\), whereas assumption \(\pi_0=1\) is expected to be conservative in general.
Let’s also check whether \(\widehat{\textrm{FDR}}(t)\) estimated using the empirical \(\widehat{\pi}_0\) averages to FDP(t) over many data sets.
R = 1000 # replications of data generation
lambda = 0.5
res.fdr.lambda = matrix(NA, nrow = R, ncol = p) # for pi.0 estimated at lambda = 0.5
res.fdr.1 = matrix(NA, nrow = R, ncol = p) # for the conservative pi.0 = 1
for (rr in 1:R) {
  pvs = c(rbeta(m, beta.1, beta.0), runif(p - m)) # m non-nulls and p-m nulls
  pvs.sorted = sort(pvs)
  pi.0 = sum(pvs > lambda) / p / (1 - lambda)
  fdr.lambda = p * pi.0 * pvs.sorted / (1:p)
  fdr.1 = p * 1 * pvs.sorted / (1:p)
  fdp = cumsum(!eff[order(pvs)]) / (1:p)
  res.fdr.lambda[rr,] = fdr.lambda - fdp
  res.fdr.1[rr,] = fdr.1 - fdp
}
par(mar = c(5,6,2,1)) # widen left margin for text
plot(1:p, colSums(res.fdr.lambda) / R, col = "orange", t = "l", lwd = 1.5,
     ylim = c(-0.01, 0.2), ylab = expression(paste(hat(FDR),"(t) - FDP(t)")),
     xlab = "P-values in increasing order")
lines(1:p, colSums(res.fdr.1) / R, col = "red", t = "l", lwd = 1.5)
abline(h = 0, lty = 3, col = "blue", lwd = 1)
legend("topleft", leg = expression(hat(pi)[0], pi[0] == 1, paste("true fdp")),
       col = c("orange","red","blue"), lty = c(1, 1, 3), lwd = c(2, 2, 1))
We see that, on average, \(\widehat{\textrm{FDR}}(t)\) estimated using \(\widehat{\pi}_0\) is very close to the true FDP(t) at any point \(t\) (though in any one instance it can be either smaller or larger). As expected, the figure also shows that using \(\widehat{\pi}_0(\lambda = 0) = 1\) results in overestimates of FDP throughout the threshold values \(t\). Why does the red curve end at value 0.2?
Questions.
What was the piece of information from the observed P-value distribution that together with the number of tests \(p\) can be used to estimate \(\pi_0\), the proportion of null P-values?
Which assumption about \(\pi_0\) was done in the earlier derivation of the BH method?
In which case could an empirical estimate of \(\pi_0\) be misleading?
We define the Q-value of a variable/test as the minimum FDR expected if we call that variable a discovery. To compute it we only use P-values. Thus for P-value \(P_j\) the Q-value is \[Q_j = Q(P_j) = \min_{t \geq P_j} \textrm{FDR}(t).\] Note that the minimum is needed because we will call \(j\) a discovery whenever we call any other variable with an at least as large P-value a discovery. This idea is exactly the same as the step-up property of the BH method. In practice, we use the estimate \(\widehat{\textrm{FDR}}(t)\) to compute Q-values.
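The step-up minimization can be sketched with a running minimum over the sorted P-values. The function name qvalue.sketch below is just an illustrative helper, not part of any package; with a fixed \(\widehat{\pi}_0\), its result coincides with \(\widehat{\pi}_0\) times the BH-adjusted P-values given by p.adjust:

```r
# Minimal sketch of Q-value computation from P-values: evaluate
# FDRhat(t) at t = each sorted P-value and take the running minimum
# from the largest P-value downwards (the minimization over t >= P_j).
qvalue.sketch = function(pv, pi.0 = 1) {
  p = length(pv)
  ord = order(pv)
  fdr = pi.0 * p * pv[ord] / (1:p) # FDRhat at t = each sorted P-value
  q = rev(cummin(rev(fdr)))        # min over thresholds t >= P_j
  q[order(ord)]                    # back to the original order
}
set.seed(3)
pv = c(rbeta(200, 0.1, 4.9), runif(800))
qv = qvalue.sketch(pv, pi.0 = 0.8)
# With a fixed pi.0, this coincides with pi.0 * BH-adjusted P-values:
all.equal(qv, 0.8 * p.adjust(pv, method = "BH"))
```

This makes the connection to the BH method explicit: with the conservative choice \(\widehat{\pi}_0 = 1\), the Q-values are exactly the BH-adjusted P-values.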
The Q-value of a particular hypothesis test is the expected proportion of false positives incurred when we call significant all tests with Q-values at most as large as that test’s Q-value. Therefore, calculating the Q-value for each test and thresholding the Q-values at level \(\alpha_F\) produces a set of significant variables among which a proportion of \(\alpha_F\) is expected to be false positives. Typically, a P-value is described as the probability of a null test statistic deviating from the null hypothesis at least as much as the observed one. Analogously, the Q-value of a particular test can be described as the expected proportion of false positives among all tests with at least as extreme a Q-value as the observed one.
Suppose that variables with Q-values \(\leq 5\%\) are called significant. This results in an FDR of 5% among the significant variables. A P-value threshold of 5% yields a false positive rate of 5% among all null variables in the data set. In light of the definition of the false positive rate, a P-value cutoff says little about the content of the variables actually called significant. The Q-values provide a more meaningful measure among the variables called significant. Because significant variables will likely undergo some subsequent verification, a Q-value threshold can be phrased in practical terms as the proportion of significant variables that turn out to be false leads.
qvalue package

The definition of Q-value is independent of any algorithm to compute Q-values, and therefore using ‘Q-values’ doesn’t, strictly speaking, imply which method has been used to compute them. However, Q-values are often associated with a common approach to compute them from P-values using the qvalue package by John Storey. qvalue() uses an empirical estimate \(\widehat{\pi}_0\) in defining Q-values through \(\widehat{\textrm{FDR}}(t)\). The only difference to our derivation above, where we computed \(\widehat{\pi}_0\) at \(\lambda=0.5\), is that by default qvalue() first computes \(\widehat{\pi}_0(\lambda)\) on a grid of \(\lambda\) values < 0.95 and then fits a spline from which it estimates \(\widehat{\pi}_0\) as the spline-predicted value \(\widehat{\pi}_0(\lambda=1)\). (One can override this default smoothing by passing a fixed value of \(\lambda\) to the qvalue function through the input parameter lambda.)
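The smoothing idea can be sketched as follows on the simulated data of Example 3.1; the exact details of qvalue() (grid, spline settings, boundary handling) may differ, so this is only an illustration:

```r
# Sketch of the smoother-based pi_0 estimate: compute pi0hat(lambda)
# on a grid, fit a cubic smoothing spline and read off the fitted value
# at the largest lambda, which approximates pi0hat(lambda -> 1).
set.seed(4)
p = 10000; m = 2000
pval = c(rbeta(m, 0.1, 4.9), runif(p - m)) # data as in Example 3.1
lambda = seq(0.05, 0.95, by = 0.05)
pi0.lambda = sapply(lambda, function(l) mean(pval > l) / (1 - l))
fit = smooth.spline(lambda, pi0.lambda, df = 3)
pi0.smooth = predict(fit, x = max(lambda))$y # smoothed estimate near lambda = 1
pi0.smooth # should be close to the true pi_0 = 0.8
```

The smoothing stabilizes the noisy raw estimates \(\widehat{\pi}_0(\lambda)\) near \(\lambda = 1\), where the estimator has the least bias but the largest variance.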
Let’s apply qvalue
and see what its basic output
functions produce.
#do this the first time to install qvalue package from Bioconductor:
#if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
#BiocManager::install("qvalue")
library(qvalue)
q = qvalue(pval) #this makes Q-values out of P-values
str(q)
## List of 8
## $ call : language qvalue(p = pval)
## $ pi0 : num 0.795
## $ qvalues : num [1:10000] 9.10e-08 2.24e-01 4.52e-01 1.52e-14 2.24e-04 ...
## $ pvalues : num [1:10000] 3.92e-09 6.47e-02 2.03e-01 1.15e-16 2.33e-05 ...
## $ lfdr : num [1:10000] 7.17e-07 7.56e-01 9.39e-01 7.17e-07 2.15e-03 ...
## $ pi0.lambda: num [1:19] 0.826 0.82 0.811 0.806 0.807 ...
## $ lambda : num [1:19] 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 ...
## $ pi0.smooth: num [1:19] 0.819 0.816 0.814 0.811 0.808 ...
## - attr(*, "class")= chr "qvalue"
We have fields for the estimated value of \(\pi_0\) (pi0), for the P-values (pvalues) that were given as input and for the estimated Q-values (qvalues). Additionally, there are the values of \(\widehat{\pi}_0(\lambda)\) at the grid given in lambda, and their smoothed estimates from the spline fitted to these points. The local false discovery rate (lfdr) will be discussed in the next lecture.
These Q-values were computed using the following estimate of \(\pi_0\):
signif(q$pi0, 3)
## [1] 0.795
Basic plots of qvalue
plot(q)
hist(q)
summary(q)
##
## Call:
## qvalue(p = pval)
##
## pi0: 0.7947102
##
## Cumulative number of significant calls:
##
## <1e-04 <0.001 <0.01 <0.025 <0.05 <0.1 <1
## p-value 960 1224 1598 1859 2150 2624 10000
## q-value 747 979 1296 1434 1598 1839 10000
## local FDR 589 747 986 1114 1188 1307 4671
This table tells us, for example, that there are 2150 P-values below 0.05, but it is estimated that if we want to make a set of discoveries that has at most 5% false positives among them, we should only choose the 1598 smallest P-values. From the P-value vs Q-value plot above, we see that the Q-value 0.05 corresponds to a P-value of about 0.01 in these data.
In the previous lecture, we used the Benjamini-Hochberg method to
define thresholds for a given FDR level. Let’s see how
qvalue
compares to the BH method in determining the
discoveries at a given FDR level. Here we will use \(\alpha_F = 0.1\).
alpha = 0.1
BH.pval = p.adjust(pval, method = "BH")
bh = c(sum(BH.pval < alpha), sum(BH.pval[(m+1):p] < alpha))
qval = c(sum(q$qvalues < alpha), sum(q$qvalues[(m+1):p] < alpha))
data.frame(D = c(bh[1], qval[1]),
           TD = c(bh[1], qval[1]) - c(bh[2], qval[2]),
           FD = c(bh[2], qval[2]),
           FDP = c(bh[2], qval[2]) / c(bh[1], qval[1]),
           row.names = c("BH", "qvalue"))
## D TD FD FDP
## BH 1736 1611 125 0.07200461
## qvalue 1839 1661 178 0.09679173
Here qvalue
gave 50 more true positives and 53 more
false discoveries than BH, and was the more accurate method to
approximate the target false discovery proportion of 10% (9.6%
vs. 7.2%). We will study more systematically the differences between BH
and qvalue
in the home exercises.
The content of the course so far is well summarized by a 7-page document about multiple testing and FDR by John Storey: False Discovery Rates, 2010. The last part there, about the “Bayesian Derivation”, is the topic of our next lecture.
Questions:
What does it mean when a test statistic has Q-value of 0.05 and how is it different from having a P-value of 0.05?
Which one is typically smaller, Q-value or P-value? (Or neither?)
Recently, the standard FDR methods (BH, Storey’s Q-values) have been extended to include additional covariate information. For example, when studying differences between brain measurements under two conditions, the physical coordinates of each measured spot in the brain may associate with the proportion of true positive differences between the conditions. In such a case, tests could be binned depending on the coordinates, and a dependency between \(\widehat{\pi}_0\) and the physical coordinates could be estimated. Examples of such methods are Independent Hypothesis Weighting (IHW) and swfdr.
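As a rough illustration of the binning idea (a toy sketch only, not the actual IHW or swfdr algorithm; the bin compositions below are hypothetical), we can estimate \(\widehat{\pi}_0\) separately within covariate bins that have different proportions of non-nulls:

```r
# Toy sketch: tests split into two covariate bins with different
# proportions of non-nulls; pi_0 is estimated within each bin using
# the lambda = 0.5 estimator from above.
set.seed(5)
# Hypothetical setup: bin 1 has 40% non-nulls, bin 2 has 5% non-nulls.
pv1 = c(rbeta(2000, 0.1, 4.9), runif(3000)) # bin 1: true pi_0 = 0.60
pv2 = c(rbeta(250, 0.1, 4.9), runif(4750))  # bin 2: true pi_0 = 0.95
pi0.hat = function(pv, lambda = 0.5) mean(pv > lambda) / (1 - lambda)
c(bin1 = pi0.hat(pv1), bin2 = pi0.hat(pv2)) # roughly 0.60 and 0.95
```

A covariate-aware method can then use the bin-specific \(\widehat{\pi}_0\) (here clearly lower in bin 1 than in bin 2) to gain power where non-nulls are enriched, instead of using a single global estimate.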