So far, we have used P-values for inference based on the false positive rate (this is the definition of the P-value itself), on the family-wise error rate (FWER; Bonferroni and Holm) and on the false discovery rate (FDR; Benjamini-Hochberg and Benjamini-Yekutieli). See the slides for a summary of these three approaches.
Here we will consider a quantity called Q-value that
can be attached to each test and that gives an empirical estimate of FDR
among all tests whose Q-values are smaller than or equal to the given
Q-value. We will follow the exposition of Storey and Tibshirani (2003, PNAS 100:
9440-9445), whose work led to the qvalue
R-package.
Before defining Q-value, we will first refine the BH method by
empirically estimating \(p_0\), the
number of null tests. Remember that the proof of the BH theorem showed that
the BH(\(\alpha_F\)) method actually
controls FDR at level \(p_0/p \cdot
\alpha_F.\) Thus, when \(p_0\)
is considerably smaller than \(p\), we
can improve the accuracy of an FDR method by estimating \(p_0.\)
(Technical note: In this text, we are assuming that the procedure to define discoveries has produced at least one discovery. That is, we assume that \(D>0\) even though we do not explicitly write it out in the formulas below. To be precise, this quantity is called the positive false discovery rate (pFDR). Even though the pFDR is a slightly modified version of the FDR we defined previously, in practice the two become equal when \(\text{Pr}(D>0) \approx 1,\) which is the case in any interesting analysis setting where at least some discoveries are expected to be made.)
Let’s define, for each P-value threshold \(t\in[0,1],\) \[\textrm{FDR}(t) = \textrm{E}\left(\frac{\textrm{FD}(t)}{\max\{{D(t),1}\}}\right),\] where random variables \(\textrm{FD}(t)=\#\{\textrm{null P-values} \leq t\}\) and \(D(t)=\#\{\textrm{P-values} \leq t\}\) in an experiment where in total \(p\) P-values are available.
To refine our understanding of the FDR methods, let’s think that the random variables \(\textrm{FD}(t)\) and \(D(t)\) result from \(p\) draws of P-values from a mixture distribution between Uniform(0,1) (for null P-values) and an alternative distribution with cdf \(\Phi_1\) and pdf \(\phi_1\) (for non-null P-values), with a mixture proportion \(\pi_0\) for the null distribution. In other words, the cdf \(\Phi\) and pdf \(\phi\) of the P-values are \[\begin{eqnarray} \Phi(t)&=&\pi_0 \cdot t + (1-\pi_0) \Phi_1(t), \qquad t\in[0,1],\\ \phi(t)&=&\pi_0 \cdot 1 + (1-\pi_0) \phi_1(t), \qquad t\in[0,1]. \end{eqnarray}\] We can interpret sampling from such a mixture distribution as a two-step process. Namely, we first choose between the null distribution (with probability \(\pi_0\)) and the alternative distribution (with probability \(1-\pi_0\)), and second, conditional on the chosen distribution, we sample a P-value from the chosen distribution.
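This two-step interpretation can be checked with a short simulation. Here the choices \(\pi_0 = 0.8\) and the Beta(0.1, 4.9) alternative are illustrative only (the same ones used in Example 3.1 below):

```r
# Sketch of the two-step sampling interpretation of the mixture:
# first choose the component, then draw a P-value from it, and compare
# the empirical cdf at a threshold t to the mixture cdf Phi(t).
set.seed(1)
n = 1e5
pi.0 = 0.8
from.null = rbinom(n, size = 1, prob = pi.0) == 1 # step 1: choose the component
pv = ifelse(from.null, runif(n), rbeta(n, 0.1, 4.9)) # step 2: draw from chosen one
t = 0.05
Phi.t = pi.0 * t + (1 - pi.0) * pbeta(t, 0.1, 4.9) # mixture cdf at t
c(empirical = mean(pv <= t), theoretical = Phi.t) # these two should be close
```

With \(n = 10^5\) draws the empirical proportion of P-values \(\leq t\) matches the mixture cdf to about two decimal places.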
Example 3.1. Suppose we do \(p=10,000\) tests of which \(m=2000\) are non-null, i.e., we can model this as a mixture distribution with \(\pi_0=\frac{p-m}{p} = 0.80\). The null P-values come from Uniform(0,1) and the non-null P-values come from the distribution Beta(0.1,4.9). (Note that Beta(0.1,4.9) gives values in [0,1] that have a mean of \(0.1/(0.1 + 4.9)=0.02\) and are heavily right-skewed; there is no particular reason why in real data the non-null P-values would follow exactly this distribution, but it serves our demonstration purposes here; see the figures below.) Let’s plot the theoretical (orange) and empirical (green) density functions for (1) null tests, (2) non-null tests and (3) all tests combined, i.e., the mixture distribution of the null and alternative distributions.
p = 10000
m = 2000
beta.1 = 0.1 # weight for unit interval's end point 1
beta.0 = 4.9 # weight for unit interval's end point 0
null.pval = runif(p - m, 0, 1)
alt.pval = rbeta(m, beta.1, beta.0) # non-null = alternative distribution
pval = c(alt.pval, null.pval) # all P-values together
eff = c(rep(TRUE, m), rep(FALSE, p - m)) # indicator for non-null effects
par(mfrow = c(1,3)) # Empirical histogram and theoretical curve for (1), (2), (3)
hist(null.pval, breaks = 20, prob = TRUE, col = "limegreen",
     main = "null", xlab = "", xlim = c(0,1), sub = "",
     lwd = 1.5, lty = 2, xaxs = "i", border = NA)
curve(dunif(x, 0, 1), 0, 1, add = TRUE, col = "orange", lwd = 2)
hist(alt.pval, breaks = 20, prob = TRUE, col = "limegreen",
     main = "non-null", xlab = "", xlim = c(0,1), sub = "",
     lwd = 1.5, lty = 2, xaxs = "i", border = NA)
curve(dbeta(x, shape1 = beta.1, shape2 = beta.0), 0, 1, add = TRUE,
      col = "orange", lwd = 2)
legend("topright", fill = c("limegreen","orange"),
       legend = c("empir.","theor."), cex = 1.5)
hist(pval, breaks = 20, prob = TRUE, col = "limegreen",
     main = "combined", xlab = "", xlim = c(0, 1), sub = "",
     lwd = 1.5, lty = 2, xaxs = "i", border = NA)
curve((p - m)/p * 1 + m/p * dbeta(x, shape1 = beta.1, shape2 = beta.0), 0, 1,
      add = TRUE, col = "orange", lwd = 2)
Density plots show how the null P-values follow the uniform distribution and the non-null P-values are enriched for small values near 0.
summary(null.pval)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000108 0.2521960 0.4999947 0.5002426 0.7467176 0.9998005
summary(alt.pval)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0000001 0.0001608 0.0196240 0.0081581 0.7202532
When these two distributions are combined in the rightmost panel, the enrichment of small P-values is still present but has a smaller weight of 20% compared to the weight of 80% given to the null distribution.
Since \(\Phi(t)\) is the probability that a randomly sampled P-value from this mixture distribution is \(\leq t\), the random variables \(D(t)\) and \(\textrm{FD}(t)\) are distributed as \[\begin{eqnarray} D(t) &\sim& \textrm{Bin}(p,\Phi(t)) \\ \textrm{FD}(t)\,|\,D(t)&\sim& \textrm{Bin}(D(t), \theta_t), \text{ where } \\ \theta_{t} &=& \textrm{Pr}(\textrm{NULL} \, | \, P \leq t ) = \frac{\textrm{Pr}(\textrm{NULL})\,\textrm{Pr}(P \leq t \, | \, \textrm{NULL})}{\textrm{Pr}(P \leq t)}\\ &=&\frac{\pi_0 t} {\pi_0 t + (1-\pi_0)\Phi_1(t)}. \end{eqnarray}\] Using the law of total expectation, \(\textrm{E}(Y) = \textrm{E}(\textrm{E}(Y\,|\,X))\), and assuming that \(D(t)>0\), we have that \[\textrm{FDR}(t) = \textrm{E}\left(\frac{\textrm{FD}(t)}{D(t)}\right)= \textrm{E}\left(\textrm{E}\left(\frac{\textrm{FD}(t)}{D(t)}\,\middle|\,D(t)\right)\right)= \textrm{E}\left(\frac{1}{D(t)} \textrm{E}(\textrm{FD}(t)\,|\,D(t)) \right)= \textrm{E}\left(\frac{1}{D(t)} \theta_{t}\,D(t) \right)= \theta_{t}.\] On the other hand, \(\textrm{E}(\textrm{FD}(t)) = \textrm{E}(\textrm{E}(\textrm{FD}(t)\,|\,D(t))) =\textrm{E}(D(t)\theta_t) = \theta_t\cdot \textrm{E}(D(t)).\) Thus \[\frac{\textrm{E}(\textrm{FD}(t))}{\textrm{E}(D(t))} = \frac{\theta_t\cdot\textrm{E}(D(t))}{\textrm{E}(D(t))} = \theta_t= \textrm{FDR}(t). \] So we can estimate \(\textrm{FDR}(t)\) by the ratio of the expectations of \(\textrm{FD}(t)\) and \(D(t)\). And what could we use as estimates for these expectations?
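As a sanity check of this derivation, a small simulation (using the mixture of Example 3.1 as an assumed setting) can confirm that both \(\textrm{E}(\textrm{FD}(t))/\textrm{E}(D(t))\) and the direct Monte Carlo estimate of \(\textrm{FDR}(t)\) agree with \(\theta_t\):

```r
# Monte Carlo check that E(FD(t))/E(D(t)) and E(FD(t)/max(D(t),1))
# both equal theta_t, using the mixture from Example 3.1
# (pi_0 = 0.8, alternative Beta(0.1, 4.9)).
set.seed(2)
p = 10000; m = 2000; t = 0.01
R = 200
fd = d = numeric(R)
for (r in 1:R) {
  pv = c(rbeta(m, 0.1, 4.9), runif(p - m))
  fd[r] = sum(pv[(m + 1):p] <= t) # false discoveries (null P-values <= t)
  d[r]  = sum(pv <= t)            # all discoveries
}
pi.0 = (p - m) / p
theta.t = pi.0 * t / (pi.0 * t + (1 - pi.0) * pbeta(t, 0.1, 4.9))
c(mean(fd)/mean(d), mean(fd / pmax(d, 1)), theta.t) # all three should agree
```

In this setting \(D(t)\) is always large, so the ratio of means and the mean of ratios are essentially identical, as the derivation predicts.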
For each P-value threshold \(t\), denote the number of all discoveries at threshold \(t\) by \(\widehat{D}(t) = \#\{\textrm{P-values} \leq t\}.\) We use this to estimate \(\textrm{E}(D(t)) \approx \widehat{D}(t)\).
To estimate \(\textrm{E}(FD(t))\), we remember that the null P-values are uniformly distributed and hence \(\textrm{E}(FD(t)) = p_0 \cdot t = \pi_0\cdot p \cdot t.\) In estimating \(\pi_0\), we again rely on the fact that the null P-values are uniformly distributed and that most P-values near 1 are expected to come from the null distribution. Let’s recall what our P-value distribution looked like.
hist(pval, breaks = 40, prob = TRUE, col = "limegreen", main = "All P-values",
     xlab = "", xlim = c(0,1), sub = "", lwd = 1.5, lty = 2, xaxs = "i", border = NA)
abline(h = 1, lty = 2, lwd = 2)
We can see that the density of P-values > 0.2 looks fairly flat.
If we assume that the density of the non-null P-values is \(\approx 0\) in this region, then the
density function of P-values in this region is approximately \(\pi_0\cdot 1 + (1-\pi_0)\cdot 0 = \pi_0\).
Thus the value of the density function in this flat part gives an
estimate of the overall proportion of null P-values \(\pi_0\). The constant density function over
any interval \((\lambda, 1]\) is the
probability mass in the interval divided by the length of the interval
(\(1-\lambda\)). With empirical data,
this probability mass is
\(\#\{P_j >\lambda \,|\,
j=1,\ldots,p\}/p\). We can apply such an estimate for any value
\(\lambda\) to yield an estimator \[
\widehat{\pi}_0(\lambda) = \frac{\#\{P_j >\lambda\, |\,
j=1,\ldots,p\}}{(1-\lambda) p}.
\] Parameter \(\lambda\) should
be large enough that there are not many non-null effects among the
P-values \(>\lambda\) but,
simultaneously, \(\lambda\) should be
small enough that there are enough null P-values \(> \lambda\) so that the estimate has a
reasonable precision. The inclusion of a few non-null P-values \(>\lambda\) makes this estimate
conservative, i.e., an overestimate of \(\pi_0\). That keeps us on the safe side and
leads to an overestimate of \(\textrm{FDR}(t)\).
If we take \(\lambda = 0\), then \(\widehat{\pi}_0(\lambda)=1\) which is too conservative when a considerable proportion of the tests are expected to be truly non-null. On the other hand, as we set \(\lambda\) closer to 1, the variance of \(\widehat{\pi}_0(\lambda)\) increases due to the decreasing number of P-values exceeding \(\lambda\), which makes the estimate more unreliable. Therefore, as a simple, practical choice for \(\lambda\), we could use 0.5, but an optimal threshold would depend on the data set. Let’s see how the choice of \(\lambda\) affects the estimate \(\widehat{\pi}_0(\lambda)\) in our current example data.
lambda = seq(0, 1, by = 0.05)
pi.0 = sapply(lambda, function(x) {sum(pval > x)}) / p / (1 - lambda)
plot(lambda, pi.0, t = "b", xlab = expression(lambda), ylab = expression(hat(pi)[0]))
abline(h = 1 - m/p, col = "red", lty = 2) #this is the true value
Indeed, we see that with values near \(\lambda = 0.5\), the estimate \(\widehat{\pi}_0(\lambda)\) is a good one whereas its quality deteriorates towards either of the endpoints of the interval.
With this estimator for \(\widehat{\pi}_0\), we have the estimate \[\widehat{\textrm{FDR}}(t) = \frac{\widehat{\textrm{FD}}(t)}{\widehat{D}(t)}= \frac{p\cdot \widehat{\textrm{Pr}}(\textrm{NULL P-value})\cdot \textrm{Pr}(P \leq t \,|\, \textrm{NULL P-value})}{\widehat{D}(t)} = \frac{p\cdot \widehat{\pi}_0 \cdot t}{\widehat{D}(t)}.\]
Let’s plot our estimate of \(\widehat{\textrm{FDR}}(t)\) in orange as a function of \(t\) by using the value \(\widehat{\pi}_0(0.5)\) estimated above. Let’s also plot a curve that corresponds to a conservative assumption \(\pi_0=1\) in red and let’s add the true empirical \(\textrm{FDP}(t)\) curve in blue, which we are able to do here since we know which P-values were simulated from the null.
pval.sorted = sort(pval)
pi.0 = pi.0[which(lambda == 0.5)]
par(mar = c(5,6,2,1)) # widen left margin for text
plot(pval.sorted, p * pi.0 * pval.sorted / (1:p), xlab = "t",
     ylab = expression(paste(hat(FDR),"(t) or FDP(t)")), t = "l", col = "orange",
     xlim = c(0,1), ylim = c(0,1), lwd = 1.5)
# Add line for the conservative assumption pi.0 = 1:
lines(pval.sorted, p * 1 * pval.sorted / (1:p), col = "red", lwd = 1.5)
# Add the empirical FDP line:
fdp = cumsum(!eff[order(pval)]) / (1:p) # true false discovery proportion in data
lines(pval.sorted, fdp, lty = 3, col = "blue", lwd = 1)
legend("bottomright", leg = expression(hat(pi)[0], pi[0] == 1, paste("true fdp")),
       col = c("orange", "red", "blue"), lty = c(1, 1, 3), lwd = c(2, 2, 1))
Note that the derivation of BH approach (in lecture notes HDS2) made the conservative assumption \(\pi_0 = 1\) (or equivalently \(p_0 = p\)) which corresponds to the red curve in the figure above. Now we have improved that estimate by empirically estimating \(\widehat{\pi}_0 < 1\) (orange curve). In this particular simulated data set, we know the actual false discovery proportion for any threshold \(t\) (blue curve). We see that, as expected, the red line is clearly conservative: \(\widehat{\textrm{FDR}}(t) > \textrm{FDP}(t)\). The orange line instead becomes an accurate estimate for \(\textrm{FDP}(t)\). In practice, an empirically estimated \(\widehat{\pi}_0\) will sometimes lead to higher and sometimes lower values of \(\widehat{\textrm{FDR}}(t)\) compared to the true \(\textrm{FDP}(t)\), whereas assumption \(\pi_0=1\) is expected to be conservative in general.
Let’s also check whether \(\widehat{\textrm{FDR}}(t)\) estimated using the empirical \(\widehat{\pi}_0\) averages to FDP(t) over many data sets.
R = 1000 # replications of data generation
lambda = 0.5
res.fdr.lambda = matrix(NA, nrow = R, ncol = p) # for pi.0 estimated at lambda = 0.5
res.fdr.1 = matrix(NA, nrow = R, ncol = p) # for the conservative pi.0 = 1
for (rr in 1:R) {
  pvs = c(rbeta(m, beta.1, beta.0), runif(p - m)) # m non-nulls and p-m nulls
  pvs.sorted = sort(pvs)
  pi.0 = sum(pvs > lambda) / p / (1 - lambda)
  fdr.lambda = p * pi.0 * pvs.sorted / (1:p)
  fdr.1 = p * 1 * pvs.sorted / (1:p)
  fdp = cumsum(!eff[order(pvs)]) / (1:p)
  res.fdr.lambda[rr,] = fdr.lambda - fdp
  res.fdr.1[rr,] = fdr.1 - fdp
}
par(mar = c(5,6,2,1)) # widen left margin for text
plot(1:p, colSums(res.fdr.lambda) / R, col = "orange", t = "l", lwd = 1.5,
     ylim = c(-0.01, 0.2), ylab = expression(paste(hat(FDR),"(t) - FDP(t)")),
     xlab = "P-values in increasing order")
lines(1:p, colSums(res.fdr.1) / R, col = "red", t = "l", lwd = 1.5)
abline(h = 0, lty = 3, col = "blue", lwd = 1)
legend("topleft", leg = expression(hat(pi)[0], pi[0] == 1, paste("true fdp")),
       col = c("orange","red","blue"), lty = c(1, 1, 3), lwd = c(2, 2, 1))
We see that, on average, \(\widehat{\textrm{FDR}}(t)\) estimated using \(\widehat{\pi}_0\) is very close to the true FDP(t) at any point \(t\) (though in any one instance it can be either smaller or larger). As expected, the figure also shows that using \(\widehat{\pi}_0(\lambda = 0) = 1\) results in overestimates of FDP throughout the threshold values \(t\). Why does the red curve end at value 0.2?
Questions.
What was the piece of information from the observed P-value distribution that together with the number of tests \(p\) can be used to estimate \(\pi_0\), the proportion of null P-values?
Which assumption about \(\pi_0\) was done in the earlier derivation of the BH method?
In which case could an empirical estimate of \(\pi_0\) be misleading?
We define the Q-value of a variable/test as the minimum FDR expected if we call that variable a discovery. To compute it we only use P-values. Thus for P-value \(P_j\) the Q-value is \[Q_j = Q(P_j) = \min_{t \geq P_j} \textrm{FDR}(t).\] Note that the minimum is needed because we will call \(j\) a discovery whenever we call any other variable with an at least as large P-value a discovery. This idea is exactly the same as the step-up property of the BH method. In practice, we use the estimate \(\widehat{\textrm{FDR}}(t)\) to compute Q-values.
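The step-up minimization can be sketched with a running minimum over the sorted P-values. The function name qvalue.sketch below is just an illustrative helper, not part of any package; with a fixed \(\widehat{\pi}_0\), its result coincides with \(\widehat{\pi}_0\) times the BH-adjusted P-values given by p.adjust:

```r
# Minimal sketch of Q-value computation from P-values: evaluate
# FDRhat(t) at t = each sorted P-value and take the running minimum
# from the largest P-value downwards (the minimization over t >= P_j).
qvalue.sketch = function(pv, pi.0 = 1) {
  p = length(pv)
  ord = order(pv)
  fdr = pi.0 * p * pv[ord] / (1:p) # FDRhat at t = each sorted P-value
  q = rev(cummin(rev(fdr)))        # min over thresholds t >= P_j
  q[order(ord)]                    # back to the original order
}
set.seed(3)
pv = c(rbeta(200, 0.1, 4.9), runif(800))
qv = qvalue.sketch(pv, pi.0 = 0.8)
# With a fixed pi.0, this coincides with pi.0 * BH-adjusted P-values:
all.equal(qv, 0.8 * p.adjust(pv, method = "BH"))
```

This makes the connection to the BH method explicit: with the conservative choice \(\widehat{\pi}_0 = 1\), the Q-values are exactly the BH-adjusted P-values.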
The Q-value of a particular hypothesis test is the expected proportion of false positives incurred when we call significant all tests with Q-values at most as large as that test’s Q-value. Therefore, calculating the Q-value for each test and thresholding the Q-values at level \(\alpha_F\) produces a set of significant variables among which a proportion of \(\alpha_F\) is expected to be false positives. Typically, a P-value is described as the probability of a null test statistic deviating from the null hypothesis at least as much as the observed one. Analogously, the Q-value of a particular test can be described as the expected proportion of false positives among all tests with at least as extreme a Q-value as the observed one.
Suppose that variables with Q-values \(\leq 5\%\) are called significant. This results in an FDR of 5% among the significant variables. A P-value threshold of 5% yields a false positive rate of 5% among all null variables in the data set. In light of the definition of the false positive rate, a P-value cutoff says little about the content of the variables actually called significant. The Q-values provide a more meaningful measure among the variables called significant. Because significant variables will likely undergo some subsequent verification, a Q-value threshold can be phrased in practical terms as the proportion of significant variables that turn out to be false leads.
qvalue package

The definition of Q-value is independent of any algorithm to compute Q-values, and therefore using ‘Q-values’ doesn’t, strictly speaking, imply which method has been used to compute them. However, Q-values are often associated with a common approach to compute them from P-values using the qvalue package by John Storey. qvalue() uses an empirical estimate \(\widehat{\pi}_0\) in defining Q-values through \(\widehat{\textrm{FDR}}(t)\). The only difference to our derivation above, where we computed \(\widehat{\pi}_0\) at \(\lambda=0.5\), is that by default qvalue() first computes \(\widehat{\pi}_0(\lambda)\) on a grid of \(\lambda\) values < 0.95 and then fits a spline from which it estimates \(\widehat{\pi}_0\) as the spline-predicted value \(\widehat{\pi}_0(\lambda=1)\). (One can override this default smoothing by passing a fixed value of \(\lambda\) to the qvalue function through the input parameter lambda.)
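The smoothing idea can be sketched as follows on the simulated data of Example 3.1; the exact details of qvalue() (grid, spline settings, boundary handling) may differ, so this is only an illustration:

```r
# Sketch of the smoother-based pi_0 estimate: compute pi0hat(lambda)
# on a grid, fit a cubic smoothing spline and read off the fitted value
# at the largest lambda, which approximates pi0hat(lambda -> 1).
set.seed(4)
p = 10000; m = 2000
pval = c(rbeta(m, 0.1, 4.9), runif(p - m)) # data as in Example 3.1
lambda = seq(0.05, 0.95, by = 0.05)
pi0.lambda = sapply(lambda, function(l) mean(pval > l) / (1 - l))
fit = smooth.spline(lambda, pi0.lambda, df = 3)
pi0.smooth = predict(fit, x = max(lambda))$y # smoothed estimate near lambda = 1
pi0.smooth # should be close to the true pi_0 = 0.8
```

The smoothing stabilizes the noisy raw estimates \(\widehat{\pi}_0(\lambda)\) near \(\lambda = 1\), where the estimator has the least bias but the largest variance.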
Let’s apply qvalue
and see what its basic output
functions produce.
#do this the first time to install qvalue package from Bioconductor:
#if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
#BiocManager::install("qvalue")
library(qvalue)
q = qvalue(pval) #this makes Q-values out of P-values
str(q)
## List of 8
## $ call : language qvalue(p = pval)
## $ pi0 : num 0.795
## $ qvalues : num [1:10000] 9.10e-08 2.24e-01 4.52e-01 1.52e-14 2.24e-04 ...
## $ pvalues : num [1:10000] 3.92e-09 6.47e-02 2.03e-01 1.15e-16 2.33e-05 ...
## $ lfdr : num [1:10000] 7.17e-07 7.56e-01 9.39e-01 7.17e-07 2.15e-03 ...
## $ pi0.lambda: num [1:19] 0.826 0.82 0.811 0.806 0.807 ...
## $ lambda : num [1:19] 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 ...
## $ pi0.smooth: num [1:19] 0.819 0.816 0.814 0.811 0.808 ...
## - attr(*, "class")= chr "qvalue"
We have fields for the estimated value of \(\pi_0\) (pi0), for the P-values (pvalues) that were given as input and for the estimated Q-values (qvalues). Additionally, there are the values of \(\widehat{\pi}_0(\lambda)\) at the grid given in lambda, and their smoothed estimates from the spline fitted to these points. The local false discovery rate (lfdr) will be discussed in the next lecture.
These Q-values were computed using the following estimate of \(\pi_0\):
signif(q$pi0, 3)
## [1] 0.795
Basic plots of qvalue
plot(q)
hist(q)
summary(q)
##
## Call:
## qvalue(p = pval)
##
## pi0: 0.7947102
##
## Cumulative number of significant calls:
##
## <1e-04 <0.001 <0.01 <0.025 <0.05 <0.1 <1
## p-value 960 1224 1598 1859 2150 2624 10000
## q-value 747 979 1296 1434 1598 1839 10000
## local FDR 589 747 986 1114 1188 1307 4671
This table tells us, for example, that there are 2150 P-values below 0.05, but it is estimated that if we want to make a set of discoveries that has at most 5% false positives among them, we should only choose the 1598 smallest P-values. From the P-value vs Q-value plot above, we see that the Q-value 0.05 corresponds to a P-value of about 0.01 in these data.
In the previous lecture, we used the Benjamini-Hochberg method to
define thresholds for a given FDR level. Let’s see how
qvalue
compares to the BH method in determining the
discoveries at a given FDR level. Here we will use \(\alpha_F = 0.1\).
alpha = 0.1
BH.pval = p.adjust(pval, method = "BH")
bh = c(sum(BH.pval < alpha), sum(BH.pval[(m+1):p] < alpha))
qval = c(sum(q$qvalues < alpha), sum(q$qvalues[(m+1):p] < alpha))
data.frame(D = c(bh[1], qval[1]),
           TD = c(bh[1], qval[1]) - c(bh[2], qval[2]),
           FD = c(bh[2], qval[2]),
           FDP = c(bh[2], qval[2]) / c(bh[1], qval[1]),
           row.names = c("BH", "qvalue"))
## D TD FD FDP
## BH 1736 1611 125 0.07200461
## qvalue 1839 1661 178 0.09679173
Here qvalue
gave 50 more true positives and 53 more
false discoveries than BH, and was the more accurate method to
approximate the target false discovery proportion of 10% (9.6%
vs. 7.2%). We will study more systematically the differences between BH
and qvalue
in the home exercises.
The content of the course so far is well summarized by a 7-page document about multiple testing and FDR by John Storey: False Discovery Rates, 2010. The last part there, about the “Bayesian Derivation”, is the topic of our next lecture.
Questions:
What does it mean when a test statistic has Q-value of 0.05 and how is it different from having a P-value of 0.05?
Which one is typically smaller, Q-value or P-value? (Or neither?)
Recently, the standard FDR methods (BH, Storey’s Q-values) have been extended to include additional covariate information. For example, when studying differences between brain measurements under two conditions, the physical coordinates of each measured spot in the brain may associate with the proportion of true positive differences between the conditions. In such a case, tests could be binned depending on the coordinates, and a dependency between \(\widehat{\pi}_0\) and the physical coordinates could be estimated. Examples of such methods are Independent Hypothesis Weighting (IHW) and swfdr.
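As a rough illustration of the binning idea (a toy sketch only, not the actual IHW or swfdr algorithm; the bin compositions below are hypothetical), we can estimate \(\widehat{\pi}_0\) separately within covariate bins that have different proportions of non-nulls:

```r
# Toy sketch: tests split into two covariate bins with different
# proportions of non-nulls; pi_0 is estimated within each bin using
# the lambda = 0.5 estimator from above.
set.seed(5)
# Hypothetical setup: bin 1 has 40% non-nulls, bin 2 has 5% non-nulls.
pv1 = c(rbeta(2000, 0.1, 4.9), runif(3000)) # bin 1: true pi_0 = 0.60
pv2 = c(rbeta(250, 0.1, 4.9), runif(4750))  # bin 2: true pi_0 = 0.95
pi0.hat = function(pv, lambda = 0.5) mean(pv > lambda) / (1 - lambda)
c(bin1 = pi0.hat(pv1), bin2 = pi0.hat(pv2)) # roughly 0.60 and 0.95
```

A covariate-aware method can then use the bin-specific \(\widehat{\pi}_0\) (here clearly lower in bin 1 than in bin 2) to gain power where non-nulls are enriched, instead of using a single global estimate.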