---
title: "HDS 3. Q-value"
author: "Matti Pirinen, University of Helsinki"
date: "21.10.2021"
output:
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(566)
```

So far we have used P-values for inference based on the false positive rate (this is the definition of the P-value itself), on the family-wise error rate (FWER; Bonferroni and Holm) and on the false discovery rate (FDR; Benjamini-Hochberg and Benjamini-Yekutieli). See [slides](https://www.mv.helsinki.fi/home/mjxpirin/HDS_course/material/HDS_error_control.pdf).

Here we will consider a quantity called the Q-value that can be attached to each test and that gives an empirical estimate of the FDR among all tests whose Q-values are smaller than or equal to the given Q-value. We will follow Storey and Tibshirani [(2003, PNAS 100: 9440-9445)](https://www.pnas.org/content/100/16/9440), whose work led to the qvalue R package.

Before defining the Q-value, we will first refine the BH method by empirically estimating $p_0$, the number of null tests. Remember that the proof of the BH theorem showed that the BH($\alpha_F$) method actually controls the FDR at level $p_0/p \cdot \alpha_F$. When $p_0$ is considerably smaller than $p$, we can improve the accuracy of the FDR method by estimating $p_0$.

Let's define, for each P-value threshold $t\in[0,1]$,
$$\textrm{FDR}(t) = \textrm{E}\left(\frac{\textrm{FD}(t)}{D(t)}\right),$$
where the random variables $\textrm{FD}(t)=\#\{\textrm{null P-values} \leq t\}$ and $D(t)=\#\{\textrm{P-values} \leq t\}$ in an experiment where in total $p$ P-values are available.

To refine our understanding of FDR methods, let's think of the random variables $\textrm{FD}(t)$ and $D(t)$ as resulting from $p$ draws of P-values from a mixture distribution between Uniform(0,1) (for null P-values) and an alternative distribution with cdf $\Phi_1$ and pdf $\phi_1$ (for non-null P-values), with mixture proportion $\pi_0$ for the null.
In other words, the cdf $\Phi$ and pdf $\phi$ of the P-values are
\begin{eqnarray}
\Phi(t)&=&\pi_0 \cdot t + (1-\pi_0) \Phi_1(t), \qquad t\in[0,1],\\
\phi(t)&=&\pi_0 \cdot 1 + (1-\pi_0) \phi_1(t), \qquad t\in[0,1].
\end{eqnarray}
We can interpret sampling from such a mixture distribution as a 2-step process. Namely, we first choose between the null distribution (with probability $\pi_0$) and the alternative distribution (with probability $1-\pi_0$), and second, conditional on the chosen distribution, we sample a P-value from it.

**Example 3.1.** Suppose we do $p=10000$ tests of which $m=2000$ are non-null, i.e., we can model this as a mixture distribution with $\pi_0=\frac{p-m}{p} = 0.80$. Null P-values come from Uniform(0,1) and non-null P-values come from the distribution Beta(0.1, 4.9). (Beta(0.1, 4.9) gives values in [0,1] that have mean $0.1/(0.1+4.9)=0.02$ and are skewed to the right; there is no particular reason why in real data the non-null P-values would have exactly this distribution, but it serves here for demonstration purposes; see the figures below.) Let's plot the theoretical (orange) and empirical (green) density functions for (1) null tests, (2) non-null tests and (3) all tests combined (that is, the mixture distribution of the null and alternative distributions).
```{r, fig.width = 9, fig.height = 3}
p = 10000
m = 2000
beta.1 = 0.1 # weight for unit interval's end point 1
beta.0 = 4.9 # weight for unit interval's end point 0
null.pval = runif(p - m, 0, 1)
alt.pval = rbeta(m, beta.1, beta.0) # non-null = alternative distribution
pval = c(alt.pval, null.pval) # all P-values together
eff = c(rep(T, m), rep(F, p - m)) # indicator for non-null effects
par(mfrow = c(1,3))
# Empirical histogram and theoretical curve for (1), (2), (3)
hist(null.pval, breaks = 20, prob = T, col = "limegreen", main = "null", xlab = "",
     xlim = c(0,1), sub = "", lwd = 1.5, lty = 2, xaxs = "i", border = NA)
curve(dunif(x, 0, 1), 0, 1, add = T, col = "orange", lwd = 2)
hist(alt.pval, breaks = 20, prob = T, col = "limegreen", main = "non-null", xlab = "",
     xlim = c(0,1), sub = "", lwd = 1.5, lty = 2, xaxs = "i", border = NA)
curve(dbeta(x, shape1 = beta.1, shape2 = beta.0), 0, 1, add = T, col = "orange", lwd = 2)
legend("topright", fill = c("limegreen","orange"), legend = c("empir","theor"), cex = 1.5)
hist(pval, breaks = 20, prob = T, col = "limegreen", main = "combined", xlab = "",
     xlim = c(0,1), sub = "", lwd = 1.5, lty = 2, xaxs = "i", border = NA)
curve((p-m)/p * 1 + m/p * dbeta(x, shape1 = beta.1, shape2 = beta.0), 0, 1,
      add = T, col = "orange", lwd = 2)
```

The density plots show how the null P-values follow the uniform distribution and the non-null P-values are enriched for small values near 0.

```{r}
summary(null.pval)
summary(alt.pval)
```

When these two distributions are combined in the rightmost panel, the enrichment of small P-values is still present but has a smaller weight of 20% compared to the weight of 80% given to the null distribution.
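The 2-step interpretation of the mixture can also be checked directly by simulation. Below is a minimal sketch (the sample size `n.sim`, threshold `t0` and variable names are illustrative choices, not from the lecture): first pick the component for each test, then draw the P-value from it, and compare the empirical cdf with $\Phi(t)=\pi_0 t + (1-\pi_0)\Phi_1(t)$.

```{r}
# 2-step sampling from the mixture: first choose the component, then draw the P-value.
set.seed(17)
n.sim = 1e5
pi0.mix = 0.8
from.null = runif(n.sim) < pi0.mix          # step 1: null with probability pi0
pv.sim = ifelse(from.null,
                runif(n.sim),               # step 2a: null ~ Uniform(0,1)
                rbeta(n.sim, 0.1, 4.9))     # step 2b: non-null ~ Beta(0.1, 4.9)
# The empirical cdf should match Phi(t) = pi0 * t + (1 - pi0) * pbeta(t, 0.1, 4.9):
t0 = 0.05
c(empirical = mean(pv.sim <= t0),
  theoretical = pi0.mix * t0 + (1 - pi0.mix) * pbeta(t0, 0.1, 4.9))
```

The two numbers agree up to Monte Carlo error, confirming that drawing components first and P-values second indeed samples from the mixture cdf $\Phi$.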
Since $\Phi(t)$ is the probability that a particular P-value from this mixture distribution is $\leq t$, the random variables $D(t)$ and $\textrm{FD}(t)$ are distributed as
\begin{eqnarray}
D(t) &\sim& \textrm{Bin}(p,\Phi(t)) \\
\textrm{FD}(t)\,|\,D(t) &\sim& \textrm{Bin}(D(t), \theta_t), \text{ where } \\
\theta_{t} &=& \textrm{Pr}(\textrm{NULL} \, | \, P \leq t ) = \frac{\textrm{Pr}(\textrm{NULL})\,\textrm{Pr}(P \leq t \, | \, \textrm{NULL})}{\textrm{Pr}(P \leq t)}\\
&=& \frac{\pi_0 t}{\pi_0 t + (1-\pi_0)\Phi_1(t)}.
\end{eqnarray}
Using the law of total expectation, $\textrm{E}_Y(Y) = \textrm{E}_X(\textrm{E}_Y(Y\,|\,X))$, we have that
$$\textrm{FDR}(t) = \textrm{E}\left(\frac{\textrm{FD}(t)}{D(t)}\right)
= \textrm{E}\left(\textrm{E}\left(\frac{\textrm{FD}(t)}{D(t)}\,\middle|\,D(t)\right)\right)
= \textrm{E}\left(\frac{1}{D(t)} \textrm{E}(\textrm{FD}(t)\,|\,D(t)) \right)
= \textrm{E}\left(\frac{1}{D(t)}\, \theta_{t}\, D(t) \right) = \theta_{t}.$$
On the other hand, $\textrm{E}(\textrm{FD}(t)) = \textrm{E}(\textrm{E}(\textrm{FD}(t)\,|\,D(t))) = \textrm{E}(D(t)\theta_t) = \theta_t \cdot \textrm{E}(D(t))$. Thus
$$\frac{\textrm{E}(\textrm{FD}(t))}{\textrm{E}(D(t))} = \frac{\theta_t \cdot \textrm{E}(D(t))}{\textrm{E}(D(t))} = \theta_t = \textrm{FDR}(t).$$
So we can estimate $\textrm{FDR}(t)$ by the ratio of the expectations of $\textrm{FD}(t)$ and $D(t)$. And what could we use as estimates for these expectations?

For each P-value threshold $t$, denote the number of all discoveries at threshold $t$ by $\widehat{D}(t) = \#\{\textrm{P-values} \leq t\}$. We use this to estimate $\textrm{E}(D(t)) \approx \widehat{D}(t)$. To estimate $\textrm{E}(\textrm{FD}(t))$, we remember that the null P-values are uniformly distributed and hence $\textrm{E}(\textrm{FD}(t)) = p_0 \cdot t = \pi_0 \cdot p \cdot t$. In estimating $\pi_0$ we again rely on the fact that the null P-values are uniformly distributed and that most P-values near 1 are expected to be from the null distribution. Let's recall what our P-value distribution looked like.
```{r}
hist(pval, breaks = 40, prob = T, col = "limegreen", main = "All P-values", xlab = "",
     xlim = c(0,1), sub = "", lwd = 1.5, lty = 2, xaxs = "i", border = NA)
abline(h = 1, lty = 2, lwd = 2)
```

We can see that the density of P-values > 0.2 looks fairly flat. If we assume that the density of non-null P-values is $\approx 0$ in this region, then the density function of the P-values in this region is approximately $\pi_0 \cdot 1 + (1-\pi_0) \cdot 0 = \pi_0$. Thus the value of the density function in this flat part gives an estimate of the overall proportion of null P-values, $\pi_0$. The constant density function over any interval $(\lambda, 1]$ is the probability mass in the interval divided by the length of the interval ($1-\lambda$). With empirical data, this probability mass is $\#\{P_j > \lambda \,|\, j=1,\ldots,p\}/p$. We can apply such an estimate for any value $\lambda$ to yield the estimator
$$\widehat{\pi}_0(\lambda) = \frac{\#\{P_j > \lambda \,|\, j=1,\ldots,p\}}{(1-\lambda)\, p}.$$
The parameter $\lambda$ should be large enough that there are not many true effects with P-values $> \lambda$, but simultaneously $\lambda$ should be small enough that there are enough null P-values $> \lambda$ for the estimate to have good precision. The inclusion of a few non-null P-values $> \lambda$ makes this estimate conservative, that is, an overestimate of $\pi_0$. That keeps us on the safe side, since it leads to an overestimate of $\textrm{FDR}(t)$. If we take $\lambda = 0$, then $\widehat{\pi}_0(\lambda) = 1$, which is usually much too conservative when a considerable proportion of the tests are expected to be truly non-null. On the other hand, as we set $\lambda$ closer to 1, the variance of $\widehat{\pi}_0(\lambda)$ increases due to the decreasing number of P-values exceeding $\lambda$, which makes the estimate more unreliable. Therefore, as a simple, practical choice for $\lambda$ we could use 0.5, but the optimal threshold depends on the data set.
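In code, the estimator $\widehat{\pi}_0(\lambda)$ is a one-liner; here is a minimal sketch (the function name `estimate.pi0` is ours, not from the lecture):

```{r}
# pi0 estimate: share of P-values above lambda, rescaled by the interval length,
# because null P-values are uniformly distributed over (lambda, 1].
estimate.pi0 = function(pval, lambda = 0.5) {
  mean(pval > lambda) / (1 - lambda)
}
# Sanity check: with purely null (uniform) P-values the estimate should be close to 1.
set.seed(3)
estimate.pi0(runif(1e5), lambda = 0.5)
```

With purely null data the estimate is near 1 for any $\lambda$ not too close to 1; mixing in non-null P-values concentrated near 0 pulls it down towards the true $\pi_0$.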
Let's see how the choice of $\lambda$ affects the estimate $\widehat{\pi}_0(\lambda)$ in our ongoing example data.

```{r}
lambda = seq(0, 1, by = 0.05)
pi.0 = sapply(lambda, function(x) {sum(pval > x)}) / p / (1 - lambda)
plot(lambda, pi.0, t = "b", xlab = expression(lambda), ylab = expression(hat(pi)[0]))
abline(h = 1 - m/p, col = "red", lty = 2) # this is the true value
```

Indeed, we see that for values near $\lambda = 0.5$ the estimate $\widehat{\pi}_0(\lambda)$ is a good one, whereas its quality deteriorates towards either endpoint of the interval. With this estimator of $\widehat{\pi}_0$, we have the estimate
$$\widehat{\textrm{FDR}}(t) = \frac{\widehat{\textrm{FD}}(t)}{\widehat{D}(t)} = \frac{p \cdot \widehat{\textrm{Pr}}(\textrm{NULL P-value}) \cdot \textrm{Pr}(P \leq t \,|\, \textrm{NULL P-value})}{\widehat{D}(t)} = \frac{p \cdot \widehat{\pi}_0 \cdot t}{\widehat{D}(t)}.$$
Let's plot our estimate $\widehat{\textrm{FDR}}(t)$ in orange as a function of $t$, using the value $\widehat{\pi}_0(0.5)$ estimated above. Let's also plot the curve that corresponds to the conservative assumption $\pi_0 = 1$ in red, and let's add the true empirical $\textrm{FDP}(t)$ curve in blue, which we are able to do since we know which P-values were simulated from the null.
```{r}
pval.sorted = sort(pval)
pi.0 = pi.0[which(lambda == 0.5)]
par(mar = c(5,6,2,1)) # widen left margin for text
plot(pval.sorted, p * pi.0 * pval.sorted / (1:p), xlab = "t",
     ylab = expression(paste(hat(FDR),"(t) or FDP(t)")), t = "l", col = "orange",
     xlim = c(0,1), ylim = c(0,1), lwd = 1.5)
# Add line for the conservative assumption pi.0 = 1:
lines(pval.sorted, p * 1 * pval.sorted / (1:p), col = "red", lwd = 1.5)
# Add the empirical FDP line
fdp = cumsum(!eff[order(pval)]) / (1:p) # true false discovery proportion in data
lines(pval.sorted, fdp, lty = 3, col = "blue", lwd = 1)
legend("bottomright", leg = expression(hat(pi)[0], pi[0] == 1, paste("true fdp")),
       col = c("orange", "red", "blue"), lty = c(1, 1, 3), lwd = c(2, 2, 1))
```

Note that the derivation of the BH approach (in lecture notes HDS2) made the conservative assumption $\pi_0 = 1$ (or, equivalently, $p_0 = p$), which corresponds to the red curve in the figure above. Now we have improved on that estimate by empirically estimating $\widehat{\pi}_0 < 1$ (orange curve). In this particular simulated data set, we know the actual false discovery proportion for any threshold $t$ (blue curve). We see that, as expected, the red line is clearly conservative: $\widehat{\textrm{FDR}}(t) > \textrm{FDP}(t)$. The orange line, instead, is an accurate estimate of $\textrm{FDP}(t)$. In practice, an empirically estimated $\widehat{\pi}_0$ will sometimes lead to higher and sometimes to lower values of $\widehat{\textrm{FDR}}(t)$ compared to the true $\textrm{FDP}(t)$, whereas the assumption $\pi_0 = 1$ is expected to be conservative in general. Let's also check whether $\widehat{\textrm{FDR}}(t)$ estimated using the empirical $\widehat{\pi}_0$ averages to $\textrm{FDP}(t)$ over many data sets.
```{r}
R = 1000 # replications of data generation
lambda = 0.5
res.fdr.lambda = matrix(NA, nrow = R, ncol = p) # lambda = 0.5
res.fdr.1 = matrix(NA, nrow = R, ncol = p) # lambda = 1
for (rr in 1:R) {
  pvs = c(rbeta(m, beta.1, beta.0), runif(p - m)) # m non-nulls and p-m nulls
  pvs.sorted = sort(pvs)
  pi.0 = sum(pvs > lambda) / p / (1 - lambda)
  fdr.lambda = p * pi.0 * pvs.sorted / (1:p)
  fdr.1 = p * 1 * pvs.sorted / (1:p)
  fdp = cumsum(!eff[order(pvs)]) / (1:p)
  res.fdr.lambda[rr,] = fdr.lambda - fdp
  res.fdr.1[rr,] = fdr.1 - fdp
}
par(mar = c(5,6,2,1)) # widen left margin for text
plot(1:p, colSums(res.fdr.lambda) / R, col = "orange", t = "l", lwd = 1.5,
     ylim = c(-0.01, 0.2), ylab = expression(paste(hat(FDR),"(t) - FDP(t)")),
     xlab = "P-values in increasing order")
lines(1:p, colSums(res.fdr.1) / R, col = "red", t = "l", lwd = 1.5)
abline(h = 0, lty = 3, col = "blue", lwd = 1)
legend("topleft", leg = expression(hat(pi)[0], pi[0] == 1, paste("true fdp")),
       col = c("orange","red","blue"), lty = c(1, 1, 3), lwd = c(2, 2, 1))
```

We see that, on average, $\widehat{\textrm{FDR}}(t)$ estimated using $\widehat{\pi}_0$ is very close to the true $\textrm{FDP}(t)$ at any point $t$ (but in any one instance it can be either smaller or larger). As expected, the figure also shows that using $\widehat{\pi}_0(\lambda = 0) = 1$ results in overestimates of FDP throughout the threshold values $t$. Why does the red curve end at the value 0.2?

**Questions.**

1. What was the piece of information from the observed P-value distribution that, together with the number of tests $p$, can be used to estimate $\pi_0$, the proportion of null P-values?
2. Which assumption about $\pi_0$ was made in the earlier derivation of the BH method?
3. In which case could an empirical estimate of $\pi_0$ be misleading?

#### Definition of Q-value

We define the Q-value of a variable/test as **the minimum FDR expected if we call that variable a discovery**. To compute it we only use P-values.
Thus for P-value $P_j$ the Q-value is
$$Q_j = Q(P_j) = \min_{t \geq P_j} \textrm{FDR}(t).$$
Note that the minimum is needed because we will call $j$ a discovery if we call any other variable with an at least as large P-value a discovery. This idea is exactly the same as the step-up procedure property of the BH method. In practice, we use the estimate $\widehat{\textrm{FDR}}(t)$ to compute Q-values.

The Q-value of a particular hypothesis test is the expected proportion of false positives incurred when calling all tests with at most as large a Q-value significant/discoveries. Therefore, calculating the Q-value for each test and thresholding them at Q-value level $\alpha_F$ produces a set of significant variables of which a proportion $\alpha_F$ is expected to be false positives.

Typically, a P-value is described as the probability that a null test statistic is at least as extreme as the observed one. The Q-value of a particular test can be described as the expected proportion of false positives among all tests with an at least as extreme Q-value as the observed one. Suppose that variables with Q-values $\leq 5\%$ are called significant. This results in an FDR of 5% among the significant variables. A P-value threshold of 5%, in contrast, yields a false positive rate of 5% among all null variables in the data set. In light of the definition of the false positive rate, a P-value cutoff says little about the content of the variables actually called significant. Q-values provide a more meaningful measure among the variables called significant. Because significant variables will likely undergo some subsequent verification, a Q-value threshold can be phrased in practical terms as the proportion of significant variables that turn out to be false leads.

#### qvalue package

The definition of the Q-value is independent of any algorithm to compute Q-values, and therefore using "Q-values" does not, strictly speaking, imply which method has been used to compute them.
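Indeed, the definition translates directly into a few lines of R: sort the P-values, evaluate $\widehat{\textrm{FDR}}(t)$ at each observed P-value, and take a running minimum from the largest P-value downwards. A minimal sketch (the function name is ours; this is not the qvalue package's implementation, but with $\widehat{\pi}_0 = 1$ it reproduces the BH-adjusted P-values):

```{r}
# Q-values from P-values, given an estimate pi0 (pi0 = 1 corresponds to BH).
compute.qvalues = function(pval, pi0 = 1) {
  n = length(pval)
  o = order(pval)
  fdr.hat = pi0 * n * pval[o] / (1:n)  # FDR-hat at thresholds t = sorted P-values
  q = rev(cummin(rev(fdr.hat)))        # running minimum over t >= P_j
  pmin(q, 1)[order(o)]                 # cap at 1 and return in the original order
}
# With pi0 = 1 this agrees with the BH-adjusted P-values:
pv.demo = c(0.001, 0.01, 0.02, 0.8)
all.equal(compute.qvalues(pv.demo), p.adjust(pv.demo, method = "BH"))
```

Since $\widehat{\textrm{FDR}}(t)$ is linear in $\widehat{\pi}_0$, an empirical estimate $\widehat{\pi}_0 < 1$ simply scales all Q-values down by that factor (before capping at 1).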
However, the Q-value is often associated with a common approach to computing Q-values from P-values using the qvalue [package](http://www.bioconductor.org/packages/release/bioc/html/qvalue.html) by John Storey. `qvalue()` uses an empirical estimate $\widehat{\pi}_0$ in defining Q-values through $\widehat{\textrm{FDR}}(t)$. The only difference from our derivation above, where we computed $\widehat{\pi}_0$ at $\lambda = 0.5$, is that by default `qvalue()` first computes $\widehat{\pi}_0(\lambda)$ on a grid of $\lambda$ values < 0.95 and then fits a spline from which it estimates $\widehat{\pi}_0$ as the spline-predicted value $\widehat{\pi}_0(\lambda = 1)$. (One can override this default smoothing by passing a fixed value of $\lambda$ to the qvalue function through the parameter `lambda`.)

Let's apply qvalue and see what its basic output functions produce.

```{r}
# do this the first time to install the qvalue package from Bioconductor:
#if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
#BiocManager::install("qvalue")
library(qvalue)
q = qvalue(pval) # this makes Q-values out of P-values
str(q)
```

We have fields for the estimated value of $\pi_0$ (`pi0`), for the P-values (`pvalues`) that were given as input, and for the estimated Q-values (`qvalues`). Additionally, there are the values of $\widehat{\pi}_0(\lambda)$ at the grid given in `lambda`, and their smoothed estimates from the spline fitted to these points. The local false discovery rate (`lfdr`) will be discussed in the next lecture.

These Q-values were computed using the $\pi_0$ value

```{r}
signif(q$pi0, 3)
```

Basic plots of qvalue:

```{r}
plot(q)
hist(q)
```

```{r}
summary(q)
```

This table tells, for example, that there are 2150 P-values below 0.05, but it is estimated that if we want a set of discoveries with at most 5% false positives among them, we should only choose the 1596 smallest P-values. From the P-value vs. Q-value plot above, we see that a Q-value of 0.05 would correspond to a P-value of about 0.01 in these data.
In the previous lecture, we used the Benjamini-Hochberg method to define thresholds for a given FDR level. Let's see how qvalue compares to the BH method in determining the discoveries at a given FDR level. Here we will use $\alpha_F = 0.1$.

```{r}
alpha = 0.1
BH.pval = p.adjust(pval, method = "BH")
bh = c(sum(BH.pval < alpha), sum(BH.pval[(m+1):p] < alpha))
qval = c(sum(q$qvalues < alpha), sum(q$qvalues[(m+1):p] < alpha))
data.frame(D = c(bh[1], qval[1]),
           TD = c(bh[1], qval[1]) - c(bh[2], qval[2]),
           FD = c(bh[2], qval[2]),
           FDP = c(bh[2], qval[2]) / c(bh[1], qval[1]),
           row.names = c("BH","qvalue"))
```

Here qvalue gave 50 more true positives and 53 more false discoveries than BH, and was the more accurate method at approximating the target false discovery proportion of 10% (9.6% vs. 7.2%). We will study BH and qvalue more systematically in the home exercises.

The contents of the course so far are summarized in a 7-page document on multiple testing and FDR by John Storey: [False Discovery Rates, 2010](http://genomics.princeton.edu/storeylab/papers/Storey_FDR_2010.pdf). The last bit in there about "Bayesian Derivation" is the topic of our next lecture.

1. What does it mean when a test statistic has a Q-value of 0.05, and how is that different from having a P-value of 0.05?
2. Which one is typically smaller, the Q-value or the P-value? (Or neither?)

#### Extra: Using covariate information with FDR

Recently, the standard FDR methods (BH, Storey's Q-values) have been extended to use additional covariate information. For example, when studying differences between brain measurements under two conditions, the physical coordinates of each measured spot in the brain may be associated with the proportion of true positive differences between the conditions. In such a case, tests could be binned depending on the coordinates, and a dependency between $\widehat{\pi}_0$ and the physical coordinates could be estimated.
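As a toy illustration of the binning idea (this is not the algorithm of any particular covariate-aware method, and all names and parameter values below are our own choices), we can simulate data where the null proportion differs between two covariate bins and estimate $\widehat{\pi}_0$ per bin with the $\lambda = 0.5$ estimator from above:

```{r}
# Toy example: pi0 varies with a covariate; estimate it separately within bins.
set.seed(11)
n.cov = 20000
covariate = runif(n.cov)                        # e.g., a physical coordinate in [0,1]
pi0.of.x = ifelse(covariate < 0.5, 0.95, 0.60)  # more true effects in the second bin
from.null = runif(n.cov) < pi0.of.x
pv.cov = ifelse(from.null, runif(n.cov), rbeta(n.cov, 0.1, 4.9))
bins = cut(covariate, breaks = c(0, 0.5, 1))
# The lambda = 0.5 estimator of pi0, applied within each covariate bin:
pi0.by.bin = tapply(pv.cov, bins, function(x) mean(x > 0.5) / (1 - 0.5))
pi0.by.bin  # recovers roughly 0.95 and 0.60
```

A covariate-aware FDR method can then spend its discoveries where the estimated null proportion is lower, instead of using one global $\widehat{\pi}_0$ for all tests.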
Examples of such methods are Independent Hypothesis Weighting [(IHW)](https://www.nature.com/articles/nmeth.3885) and [swfdr](https://peerj.com/articles/6035/).