---
title: 'GWAS 3: Statistical power'
author: "Matti Pirinen, University of Helsinki"
date: 'Updated: 10-Nov-2020; 1st version 23-Jan-2019'
output:
html_document: default
urlcolor: blue
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(19)
```
This document is licensed under a
[Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).
The slide set referred to in this document is "GWAS 3".
We saw previously how a strict significance threshold (like 5e-8) is needed
in GWAS in order to avoid reporting false positives
(i.e., null variants that reach the threshold).
On the other hand, a strict threshold
makes it relatively difficult to get even the true positives to reach the threshold.
(In other words, we have many false negatives.)
Next we will study the properties of the variants and study design
that determine how likely we are to catch the true positives:
We study the statistical power in GWAS. (See slides 1-7.)
A review on the topic
by [Sham and Purcell 2014](https://www.nature.com/articles/nrg3706).
**Statistical power** of a statistical significance test
is the probability that the test will reject the null hypothesis $H_0$
(in GWAS, $H_0$ says that $\beta=0$) at the given significance threshold
when the data follow a specific *alternative hypothesis* $H_1$.
In the GWAS setting, $H_1$ is specified by fixing the study design (total sample size,
case and controls counts) and parameters describing the variant (MAF and effect size).
To compute the P-value, we only needed to consider the null hypothesis $H_0$.
For power analysis, we also need
to define explicitly how the true effects look like, i.e.,
we need a quantitative alternative hypothesis $H_1$.
Of course, not all true effects are the same, and therefore power
analysis is often represented as a power curve over a plausible range of
parameter values.
#### 3.1 Test statistic under the alternative
By assuming that the sampling distribution of effect size estimate $\widehat{\beta}$
is Normal (which assumption is based on the Central Limit Theorem and works well for
large sample sizes and common variants),
we have that $\widehat{\beta} \sim \mathcal{N}(\beta,\textrm{SE}^2)$,
where $\beta$ is the true effect size and SE the standard error of the estimator.
As we saw last week,
it follows that under the null, i.e. when $\beta=0$,
the Wald's test statistic $z = \widehat{\beta}/\textrm{SE}$
is distributed approximately as $z\sim \mathcal{N}(0,1)$, which is used to derive a P-value.
The other way to get a P-value is from the chi-square distribution
as $z^2 \sim \chi^2_1$.
We will often use the chi-square distribution
because then we need to consider only the upper tail of the distribution to compute
(the two-sided) P-value,
and this makes visualization of the distibutions simpler than with the Normal distribution.
What about if $\beta \neq 0$, that is, if the variant has a non-zero effect?
Then $z\sim \mathcal{N}(\beta/\textrm{SE}, 1)$ and
$z^2 \sim \chi^2_1((\beta/\textrm{SE})^2)$, which is called a chi-square distribution
with 1 degree of freedom and non-centrality parameter NCP=$(\beta/\textrm{SE})^2$.
When $\beta=0$ this reduces to the standard *central* chi-square distribution $\chi^2_1 = \chi^2_1(0)$.
**Example 3.1.** Let's illustrate these distributions by a simulation of
GWAS results under the alternative and under the null.
To save time,
we don't do regressions with genotype-phenotype data
but simulate the effect estimates directly
from their known distributions.
First, however, we need to find the standard error, and that we do by fitting one regression model.
We will visualize the distributions both for the Wald statistic (having a Normal distribution)
and for its square (having a chi-square distribution).
```{r, fig.height = 5, fig.width = 10}
n = 500 #individuals
p = 5000 #SNPs for both null and alternative
f = 0.5 #MAF
b.alt = 0.2 #effect size under the alternative hypothesis
x = rbinom(n, 2, f) #genotypes at 1 SNP for n ind
y = scale( rnorm(n) ) #random phenotype normalized to have sample sd=1
se = summary( lm( y ~ x ) )$coeff[2,2] #pick se, and assume it stays constant and independent of beta
b.hat.null = rnorm(p, 0, se) #estimates under null
b.hat.alt = rnorm(p, b.alt, se) #estimates under alternative
par(mfrow=c(1,2))
# Draw observed densities of z-scores
plot(NULL, xlim = c(-3,6), ylim = c(0,0.5), xlab = "z",
ylab = "density", col = "white") #empty panel for plotting
lines(density( (b.hat.null/se) ), col = "black", lwd = 2) #Wald stat for null variants
lines(density( (b.hat.alt/se) ), col = "red", lwd = 2) #Wald stat for alternative variants
# add theoretical densities for z-scores
x.seq = seq(-3, 6, 0.01)
lines(x.seq, dnorm(x.seq, 0, 1), col = "blue", lty = 2) #for null
lines(x.seq, dnorm(x.seq, b.alt/se, 1), col = "orange", lty = 2) #for alternative
# Draw observed densities of z^2
plot(NULL, xlim = c(0,35), ylim = c(0,1), xlab = expression(z^2),
ylab = "density", col = "white") #empty panel for plotting
lines(density( (b.hat.null/se)^2 ), col = "black", lwd = 2) #chi-square stat for null variants
lines(density( (b.hat.alt/se)^2 ), col = "red", lwd = 2) #chi-square stat for alternative variants
# Let's add theoretical densities of the chi-square distributions
x.seq = seq(0, 35, 0.01)
lines(x.seq, dchisq(x.seq, df = 1, ncp = 0), col = "blue", lty = 2) #ncp=0 for null
lines(x.seq, dchisq(x.seq, df = 1, ncp = (b.alt/se)^2), col = "orange", lty = 2) #ncp = (beta/se)^2 for alternative
legend("topright", leg = c("NULL obs'd","ALT obs'd","NULL theor","ALT theor"),
col = c("black","red","blue","orange"),
lty = c(1,1,2,2), lwd = c(2,2,1,1) )
#Let's add significance thresholds corresponding to 0.05 and 5e-8
#By definition, the thresholds are always computed under the null.
q.thresh = qchisq( c(0.05, 5e-8), df = 1, ncp = 0, lower = FALSE)
abline(v = q.thresh, col = c("darkgreen", "springgreen"), lty = 3)
text( q.thresh+2, c(0.4,0.4), c("P<0.05","P<5e-8") )
```
Theoretical distributions match well with the observed ones so
we trust that we now understand the relevant
parameters also of the theoretical chi-square distribution.
We also see that almost all of the alternative distribution is to the right of
the significance threhsold of 0.05 but to the left of threshold 5e-8.
Thus, we would discover almost all variants with these parameters at level 0.05
but almost none at the genome-wide significance level of 5e-8.
How to compute the exact proportion of the distribution to the right of a
given threshold value?
```{r}
q.thresh = qchisq(c(0.05,5e-8), df = 1, ncp = 0, lower = FALSE) #repeating thresholds in chi-square units
pchisq(q.thresh, df = 1, ncp = (b.alt/se)^2, lower = FALSE) #correspond to right tail probabilities
```
So we have a probability of 90% to detect a variant at significance threshold 0.05,
when effect size is 0.2 sd units of a quantitative phenotype,
MAF is 50% and sample size is 500. This probability
is also called statistical **power** corresponding to the given parameters
(significance threshold, $\beta$, MAF and $n$).
On the other hand, with these parameters, we only have a power of 1.3% at
the genome-wide significance level 5e-8.
A frequentist interpretation is that we are likely to discover
90 out of 100 variants having the parameters considered at the significance
level 0.05 but only about 1 in 100 at the level 5e-8.
**Question.** What is "power" under the null? This may be a confusing question
since we typically use power to describe the ability to
detect *non-zero effects*, but, technically, we can interpret this question
as asking for the probability of getting a significant result when the
null hypothesis holds.
And, by definition, this probability is the significance threshold $\alpha$,
and does not depend on $n$ or MAF.
Consequently, the power to detect any non-zero effect
can never be less than $\alpha$ and will get close to $\alpha$
for tiny effects ($\beta \approx 0$)
which are almost indistinguishable from 0.
Typically, we would like to design studies to have a good power (say at least 80%)
to detect the types of variants that we are interested in. How can we do that?
### 3.2 Ingredients of power
The parameters affecting power are
1. Sample size $n$; increasing sample size increases power
because it increases accuracy of effect size estimate.
2. Effect size $\beta$; increasing the absolute value of effect size increases
power because it increases difference from the null model.
3. Minor allele frequency $f$; increasing MAF increases power because it increases accuracy of effect size estimate.
4. Significance threshold $\alpha$; increasing the threshold increases power because larger significance thresholds are easier to achieve.
5. In case-control studies, the proportion of cases $\phi$; moving $\phi$
closer to 0.5 increases power because it increases accuracy of effect size estimate.
We'll soon discuss intuition why each of these parameters affects power.
But let's first write down how these parameters and power are tied together.
For a given significance level $\alpha$, power is determined by
the non-centrality parameter NCP$=(\beta/\textrm{SE})^2$ of the chi-square distribution.
The mean of the distribution is 1+NCP; hence
the larger the NCP, the larger the power, because the distribution moves to right with increasing NCP.
We see that the NCP increases with $\beta^2$ and this explains why
increasing $|\beta|$ increases power.
We also see that the NCP increases as SE decreases and therefore we need
to know how SE depends on $n$, $f$ and $\phi$.
#### 3.2.1 Formulas for standard errors
For the linear model
$$y = \mu + x\beta + \varepsilon,$$
SE of $\widehat{\beta}$ is
$$\textrm{SE}_\textrm{lin} = \frac{\sigma}{\sqrt{\textrm{Var}(x) n}} \approx \frac{\sigma}{\sqrt{2 f (1-f) n}},$$
where the variance of genotype $x$ is, under Hardy-Weinberg
equilibrium, approximately $2f(1-f)$, and $\sigma$ is the standard deviation
of the error term $\varepsilon$: $\sigma^2 = \textrm{Var}(y) - \beta^2 \textrm{Var}(x).$
This form for SE is a direct consequence of the variance
estimate of the mean-centered linear model: $\textrm{Var}(\widehat{\beta})=\sigma^2/ \sum_{i=1}^n x_i^2 .$
In a typical QT-GWAS, the effects of variants on the total phenotypic variance are small (< 1%) and then we can
assume that the error variance $\sigma^2 \approx \textrm{Var}(y),$ which is 1 if the phenotype is processed
by quantile normalisation or scaling of the residuals after regressing out the covariate effects.
For *binary* case-control data analyzed by logistic regression,
$$\textrm{SE}_\textrm{bin} \approx \frac{1}{\sqrt{\textrm{Var}(x) n \phi (1-\phi)}} \approx \frac{1}{\sqrt{2 f (1-f) n \phi (1-\phi)}}.$$
Thus, the difference to the linear model formula is that $\sigma$ is replaced by 1
and $n$ is replaced by the **effective sample size** $n \phi (1-\phi).$
For derivation, see Appendix A of [Vukcevic et al. 2012](https://onlinelibrary.wiley.com/doi/full/10.1002/gepi.20576).
Smaller SE means higher precision of the effect size estimate.
Both of the formulas show how SE decreases with increasing sample size $n$,
with increasing MAF $f$ and, for binary data, SE decreases as $\phi$ gets closer
to 0.5.
These formulas work well for typical GWAS settings, but may not hold when
some parameter ($n$ or $f$ or $\phi(1-\phi)$) gets close to zero.
To know exactly when the formulas start to break down, it is best to do simulations.
In particular, the formula may not be good for the rare variant testing ($f<0.001$).
Now we can write down the NCPs of additive GWAS models as
$$\textrm{NCP}_\textrm{lin} = (\beta/\textrm{SE}_\textrm{lin})^2 \approx 2 f (1-f) n \beta^2/\sigma^2 \qquad \textrm{ and }
\qquad \textrm{NCP}_\textrm{bin} = (\beta/\textrm{SE}_\textrm{bin})^2 \approx 2 f (1-f) n \phi(1-\phi) \beta^2.$$
Let's next discuss how and why each parameter affects the NCPs, and hence power,
the way it does.
#### 3.2.2 Sample size
Out of the parameters affecting power, the sample size is most
clearly under the control of study design.
Therefore, it is the primary parameter by which we
can design studies of sufficient power.
Increasing $n$ decreases SE in the regression models in proportion to $1/\sqrt{n}$.
In the GWAS context, we can think that larger $n$ leads to
more accurate estimate of the phenotypic means in each of the genotype classes.
Therefore, as $n$ grows, we also have more accurate estimate of the phenotypic
difference between the genotype classes, which means that we are
better able to distinguish a true phenotypic
difference between genotype groups from zero.
In other words, we have a larger power
to detect a genotype's effect on the phenotype.
**Example 3.2.** Above we saw that with $n=500$ (and MAF = 0.5, $\beta=0.2$) we had only
1% power at significance level $\alpha=$ 5e-8. Let's determine how large $n$ should be
to achieve 90% power.
```{r}
f = 0.5
b.alt = 0.2
sigma = sqrt(1 - 2*f*(1-f)*b.alt^2) #error sd after SNP effect is accounted for (see next part for explanation)
ns = seq(500, 4000, 10) #candidate values for n
ses = sigma/sqrt(ns*2*f*(1-f)) #SEs corresponding to each candidate n
q.thresh = qchisq(5e-8, df = 1, ncp = 0, lower = F) #chi-sqr threshold corresp alpha=5e-8
pwr = pchisq(q.thresh, df = 1, ncp=(b.alt/ses)^2, lower=F) #power at alpha=5e-8 for VECTOR of SE values
plot(ns, pwr, col = "darkgreen", xlab = "n", ylab = "power",
main = paste0("QT sd=1; MAF=",f,"; beta=",b.alt), t = "l", lwd = 1.5)
abline(h = 0.9, lty = 2)
#Let's output the first n that gives power >= 90%
ns[min(which(pwr >= 0.9))]
```
So, we need $n=2230$ in order to have power of 90%.
#### 3.2.3 Effect size and variance explained
When effect size is $\beta$ and MAF is $f$, then
variance explained by the additive effect on the genotype is
$\textrm{Var}(x\beta) = \textrm{Var}(x) \beta^2 \approx 2f(1-f)\beta^2$.
When total phenotypic variance of a quantitative trait is 1,
then $2f(1-f)\beta^2$ is also the proportion of
the variance explained by the variant.
For example, in our ongoing example setting,
the variance explained by the variant is
```{r}
2*f*(1-f)*b.alt^2.
```
That is, the variant explains 2% of the variation of the phenotype.
This is a very large variance explained compared to
a typical common variant association with
complex traits, such as BMI or height,
but is more realistic for some molecular traits, such as metabolite levels,
that are less complex genetically and may have larger effects from individual variants.
**Example 3.3.** What if we wanted to find a suitable $n$ that gives 90% power
for MAF=50% when the variant explained only 0.5% of the phenotype?
```{r}
f = 0.5
y.explained = 0.005
b.alt = sqrt(y.explained / (2*f*(1-f)) ) #this is beta that explains 0.5%
sigma = sqrt(1 - y.explained) #error sd after SNP effect is accounted for
ns = seq(1000,12000,10) #candidate n
ses = sigma / sqrt( ns*2*f*(1-f) ) #SE corresponding to each n
q.thresh = qchisq(5e-8, df = 1, ncp = 0, lower = F) #threshold corresp alpha=5e-8
pwr = pchisq(q.thresh, df = 1, ncp = (b.alt/ses)^2, lower = F) #power at alpha=5e-8
plot(ns,pwr, col = "darkgreen", xlab = "n", ylab = "power",
main = paste0("QT sd=1; MAF=",f,"; beta=",b.alt), t = "l", lwd = 1.5)
abline( h = 0.9, lty = 2 )
#Let's output n that is the first that gives power >= 90%
ns[min(which(pwr >= 0.9))]
```
So, we needed to multiply the sample size by a factor of ~4.
Note that this really makes a difference in practice.
It is very different to collect 2230 individuals than to collect 9030.
Power calculations are important!
Could we maybe had guessed the factor of 4 without doing the actual power calculation?
For a fixed $\alpha$, power is determined by NCP.
Earlier we determined the parameters that give 90% power.
If we equate NCP defined by these parameters with an NCP with
a new effect size and with an unknown sample size $n_2$, we can solve for $n_2.$
Furthermore, as $f=0.5$ remains constant, it cancels out
and we get
$$
\frac{2f(1-f)\beta_1^2 n_1}{\sigma_1^2} = \frac{2f(1-f)\beta_2^2 n_2}{\sigma_2^2}
\longrightarrow
n_2 = \frac{\beta_1^2 n_1 \sigma_2^2}{\beta_2^2 \sigma_1^2} =
n_1 \frac{0.2^2}{0.1^2} \frac{1-0.005}{1-0.02} \approx 4.0612\,\cdot n_1.
$$
We conclude that when the variance explained by the variant is small (say < 2%),
and we drop the variance explained by a factor of $a$, (which is the same
as dropping the effect size by a factor of $\sqrt{a}$), we must increase
the sample size approximately by a factor of $a$ to maintain constant power.
If variance explained by the variant is larger than
a few percents, then it will have a bigger effect
on the result, but in GWAS, typically, the variance explained remains < 1%.
#### 3.2.4 Minor allele frequency
When other parameters than MAF $f$ remain constant, the NCPs are proportional to
$f(1-f)$ which is maximised at $f=0.5$ and which decreases to zero
as $f\to 0$.
Technically, this can be explained by a general property that,
in regression models, the increased variance in the explanatory variable
makes the effect estimates more precise.
Thinking more concretely the GWAS setting,
if MAF $f$ is very small, then
almost all individuals in the sample have the major homozygote genotype.
Consequently, we will estimate the mean phenotype very accurately for that
homozygote genotype, but we will have almost no information about
the phenotypic mean of the heterozygote group. Therefore, we have little
information whether the two groups are *different* from each other.
The mathematics show that, from the point of view of power,
the best balance between the precisions of the phenotypic mean estimates
of different genotype groups is achieved when MAF=0.5.
**Example 3.4.** Consider a situation where MAF of the same variant is 0.5 in
population A but only 0.2 in population B.
How much larger sample size would we need in B to achieve the same power
as in A assuming all other parameters were the same?
**Answer.**
Let's equate the NCPs and since other parameters than $f$ and $n$ cancel out
we are left with
$$
n_B f_B(1-f_B) = n_A f_A(1-f_A) \longrightarrow n_B = n_A \frac{f_A(1-f_A)}{f_B(1-f_B)} = n_A \frac{0.5 \times 0.5}{0.2 \times 0.8} = 1.5625\,\cdot n_A.
$$
So we need 56% more samples in B than in A.
We can get a simpler approximative
formula in cases where MAF is low (<5%) in both populations,
because then $(1-f_A)/(1-f_B) \approx 1$ and we are left with
$n_B \approx f_A/f_B \, n_A$.
**Example 3.5.** Last week we encountered *PCSK9* gene's
missense variant rs11591147 that had a strong effect on LDL-cholesterol levels.
GnomAD database shows that it has MAF 4.2% (713/16994) in Finns whereas
MAF is 1.5% (1235/81394) in non-Finnish Europeans (NFE). Hence
we estimate that we would need to collect about 4.2/1.5 = 2.8 times larger cohort in
NFE than in FIN to detect this variant at a given significance threshold.
(The exact factor is $0.042\cdot (1-0.042) / (0.015\cdot (1-0.015)) = 2.723.$)
#### 3.2.5 Proportion of cases
In a case-control analysis of a binary trait,
the proportion of cases in the sample, $\phi$, affects power.
Mathematically, the effect is similar to that of MAF, i.e., the
NCP is proportional to $\phi(1-\phi).$
Hence, all other things being fixed (including the total sample size),
the largest power is achieved when we have equal number of
cases and controls ($\phi=0.5$),
and power goes to zero when $\phi\to 0$ or $\phi \to 1.$
The intuition is that if we had only few cases, then the allele frequency
information in cases will be very inaccurate
and therefore it is very difficult to
determine whether case and control frequencies
are *different* from each other, no matter
how accurately we could estimate the control allele frequency.
**Example 3.6.**
Consider two GWAS on migraine.
1. The UK Biobank study of 500,000 individuals
of whom 15,000 have self-reported migraine.
2. Case-control analysis of 60,000 individuals of whom 30,000 are migraineurs and 30,000 are controls.
Which of the studies gives larger power?
**Answer.**
The power is determined by NCP=$2f(1-f)n\phi(1-\phi)\beta^2$, and the two studies
are assumed to differ only in their effective sample size $n \phi (1-\phi).$
Therefore, the one that has the larger effective sample size also has larger power.
```{r}
n = c(500000, 60000)
phi = c(15000, 30000)/n
cbind(n, phi, eff.n = n*phi*(1-phi))
```
These studies are very similarly powered but the 2nd study has a slightly higher power.
This is true even though the total sample size of the 2nd study is only
12% of the total sample size of the 1st study.
#### 3.2.6 Power calculators
[Genetic Association Study Power Calculator](https://csg.sph.umich.edu/abecasis/gas_power_calculator/index.html)
at University of Michigan specifies case-control study effect sizes by
genotype relative risk (GRR) and disease prevalence in population.
For their multiplicative model, GRR is the relative risk between genotype 1 and 0 as well
as between genotype 2 and 1. It can be shown that for low prevalence (<1%) GRR is approximately
the odds-ratio.
**Example 3.7.**
Let's test it. Make a study with 10000 cases and 10000 controls, with disease allele frequency 40%,
GRR of 1.1 and disease prevalence of 0.001, in which case GRR is ~ OR.
The calculator gives power 22.5% at $\alpha=5$e-8.
Compare it to the way we have done it:
```{r}
b = log(1.1) #b is log-odds, approximately GRR for a low prevalence disease
n = 20000
f = 0.4
phi = 0.5
pchisq(qchisq(5e-8, df = 1, lower = F), df = 1, ncp = 2*f*(1-f)*n*phi*(1-phi)*b^2, lower = F)
```
Methods agree well and estimate 22..23% power.
#### 3.2.7 Risk allele frequency (RAF) vs minor allele frequency (MAF)
Continue with the Power calculator and set GRR to 1.15 and RAF to 0.2: power is 60%.
Change RAF to 0.8; power drops to 51%.
How can that be since in our formula the power depends only
on MAF which is the same (0.2) in both cases?
Let's consider a disease with low prevalence (say < 1%).
Then the RAF in disease-free controls is almost the same as RAF in the whole population
since over 99% of the population are controls. By definition, risk allele is more frequent in cases
than in controls and hence RAF in the ascertained case-control sample, that is heavily enriched
for cases compared to their population frequency, will be higher than RAF in the population.
Hence, if the risk allele is also the minor allele, then MAF in the ascertained case-control
sample will be higher than in the population increasing power. However, if the risk allele
is a major allele, then MAF in the ascertained case-control sample will be lower than in the population
decreasing power. We conclude that in an ascertained case-control sample, we have more power
to detect risk increasing minor alleles than protective minor alleles even when their
MAFs were the same in population [Chan et al. 2014 AJHG](https://www.sciencedirect.com/science/article/pii/S0002929714000640?via%3Dihub).
This reminds us about the importance of considering $f$ as the MAF in the sample that we are analysing,
which for ascertained case control studies may be different from the population MAF.
### 3.3 Why well powered studies are so important?
The GWAS evolution over the last 15 years gives
an illuminating lesson on the
influence of statistical power on scientific conclusions.
Since the effect sizes of common variants affecting complex diseases
are relatively small, the first years of GWAS of any one disease
were able to detect only few GWS associations because the sample
sizes were only a few thousands.
However, when sample size grew to tens of thousands, the number
of associations started a steep increase.
This pattern has occured to pretty much every trait and disease studied.
(Demonstrated with schizophrenia on slides 8-10.)
**Example 3.8.**
Let's look at the results of the Schizophrenia study
"Biological insights from 108 schizophrenia-associated genetic loci",
Nature 511 421-427
taken from Supplementary table 2 at http://www.nature.com/nature/journal/v511/n7510/extref/nature13595-s1.pdf
```{r, warning=F}
sz.res = read.table("http://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/material/sz_res.txt",
as.is = TRUE, header = TRUE)
sz.res[1,] #see what data we have
#Let's plot the known SZ variants on frequency - effect size coordinates
#And draw some power curves there at genome-wide significance threshold
maf = sz.res[,"Frq_control"] #Not yet maf but allele 1 frequency
maf[maf > 0.5] = 1 - maf[maf > 0.5] #Make it to MAF: always less than 0.5
b = abs(log(sz.res[,"Combined_OR"])) #effect size on log-odds-ratio scale with positive sign
pw.thresh = 0.5
p.threshold = 5e-8
plot(maf, b, ylim = c(0,0.3), xlim = c(0.01,0.5), xlab = "MAF",
ylab = "EFFECT SIZE (in log-odds-ratio)", xaxt = "n", yaxt = "n", log = "x", #make x-axis logarithmic
main = substitute(paste("Power = ", pw.thresh ," at ", alpha ," = ",p.threshold),
list(pw.thresh = pw.thresh, p.threshold = p.threshold)),
cex.main = 1.8, cex.lab = 1.3, pch = 19)
axis(1, at = c(0.01, 0.02, 0.05, 0.10, 0.25, 0.5), cex.axis = 1.3)
axis(2, at = c(0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3), cex.axis = 1.3)
grid()
q = qchisq(p.threshold, df = 1, lower = F) #chi-square value corresp. significance threshold
#matrix of numbers of cases (col1) and controls (col2):
Ns = matrix( c(3332,3587, 10000,10000, 34000,45600), ncol = 2 , byrow = T)
cols=c("green", "cyan", "blue")
f = seq(0.01, 0.5, length = 200)
b = seq(0, 0.3, length = 200)
legends = c()
par(mar = c(6,6,5,1))
for(set in 1:nrow(Ns)){
pw = rep(NA, length(b)) #power at each candidate b
b.for.f = rep(NA,length(f)) #for each f gives the b value that leads to target power
for(i in 1:length(f)){
pw = pchisq(q, df = 1, ncp = Ns[set,1]*Ns[set,2] / sum(Ns[set,])*2*f[i]*(1-f[i])*b^2, lower = F)
b.for.f[i] = b[ min( which(pw > pw.thresh) ) ]
}
lines(f, b.for.f, t = "l", col = cols[set], lwd = 1.6)
legends = c(legends, paste(Ns[set,],collapse = "/") ) #make a "#cases/#controls" tag for legend
}
legend("bottomleft", lty = c(1,1), col = cols, legend = legends, lwd = 2, cex = 1.3)
```
No wonder that the 2009 study with 3300 cases and 3600 controls (slide 8) didn't find any individual SNPs
associated with SZ because it had insufficient power for any now know (common) SZ variant.
Note that none of these GWS variants has MAF < 0.02, most likely because
rare variants have not yet been comprehensively analyzed in these large meta-analyses.
#### 3.3.1 Absence of evidence is not evidence of absence
It is important to understand that
a "non-significant" P-value should not be interpreted as evidence that there is no effect.
It can be so interpreted only for those effect sizes for which the power to detect them
was ~100%.
But a "non-significant" P-value does not rule out smaller effects.
Therefore, any claims that argue based on statistical evidence that
an effect does not exist must be made precise by stating
which power there would had been to detect
effects, as a function of interesting quantities such as $\beta$ and MAF.
Additionally, the effect size estimate and its SE should be reported in addition to
the "non-significant" P-value.
**Example 3.9.** Suppose that we study migraine and find a variant $v$ that has an association
P-value 3.2e-15 in a large migraine GWAS, a P-value of 2.1e-9 in a subset of
cases with *migraine without aura* and a P-value of 0.08
in a subset of cases with *migraine with aura*.
Would you interpret this as evidence that the variant $v$ is associated with
migraine, and, in particular, that this effect is specific to migraine without aura
and is not present in migraine with aura?
What other information would you need to make this conclusion?
Describe some example study setting (by describing sample sizes and effect sizes)
where the above conclusion would be appropriate,
and another study setting where it would not be appropriate.
#### 3.3.2 Proportion of true positives
Let's recall the formula for the odds of a significant P-value indicating a true positive rather
than a false positive:
$$\frac{P(T|S)}{P(N|S)}=\frac{P(T)P(S|T)}{P(N)P(S|N)} =\textrm{prior-odds} \times
\frac{\textrm{power}}{\textrm{significance threshold}}.$$
Thus, for a fixed significance threshold and prior-odds of association,
the probability of a significant result being true increases proportional to the power of the study.
Hence, a larger proportion of the significant findings from a well powered study is
likely to be true positives than from an underpowered study.
Another way to think this is that all studies (independent of their power)
have the same rate of labelling null effects as significant
(and this rate is $\alpha$, the significance level, the Type I error rate),
but different rates of labelling true effects as significant
(and this rate is the power of the study).
Hence, by increasing power, a larger proportion of all significant results
will be true positives.
#### 3.3.3 Winner's curse
Suppose that we have 50% power to detect MAF=5% variants with effect $\beta=0.2$
with our GWAS sample at GWS threshold.
This means that, in our data, half of the variants whose
true effect is 0.2 will reach the threshold and other half does not.
How are these two groups different in our data? They are different in that
those that reach the threshold have estimated $\widehat{\beta}>0.2$ whereas
those that don't reach the threshold have estimated $\widehat{\beta}<0.2$.
Consequently, out of all those variants with true $\beta=0.2$ and
MAF=5%, the ones that become GWS in our data have their effect sizes **overestimated**.
This is the *winner's curse*: when power is low enough, the variants can reach
the GWS threshold only if their effect sizes are overestimated.
We are "winners" because we are able to detect some of these true effects,
but we are simultaneously "cursed" because we have upwardly biased effect size
estimates. And this curse gets worse as the power decreases.
Originally, the term winner's curse relates to auctions where
the winner is the one who has made the biggest offer and the biggest offer
tends to be higher than the consensus value of the item;
the winner is always cursed to pay too much.
Note that the winner's curse does not occur for variants that have
so large effect sizes that power to detect them is ~100%.
For such variants, we still overestimate the effect
size in 50% of variants and underestimate it in the remaining 50% of variants,
but now we will detect essentially all of those variants
at the given significance threshold,
even those whose effects we underestimate.
Hence, with high power, there is no general trend
of the significant variants' effects being overestimated.
In practice this means that we are less concern with the winner's curse for
GWS variants that have P-values, say < 1e-20, than for those that have P-values
just slightly below 5e-8.
How to get rid of winner's curse? There are some statistical methods for that (of course!),
but the simplest way is to get an effect size estimate from an independent replication data set.
Since those replication data were not used in making
the original GWS finding, there is no reason to expect that the effect size estimate
would be (upwardly) biased also in the replication data.
When deciding on the size of the replication study, it is good to remember
that, due to the winner's curse in the original study, the raw effect size
estimate from the discovery GWAS (i.e., the GWAS that made the GWS discovery in the first place)
may result in much too optimistic sample size estimate if used in the power calculation.
#### 3.3.4 Reasonable study design
If a study proposal has a low power to answer the question of interest,
it is difficult to argue why that study should be funded.
The funding agencies want to know that the studies they fund
will have a reasonable chance to provide solid conclusions.
Power calculations are crucial in medical studies that study treatment options
on patients or animals because there either too small or too large cohorts are unethical.
Too small cohorts sacrifice some individuals by exposing them to potentially harmful
treatments without ever producing solid results, whereas
too large cohorts will expose unnecessarily many individuals to a suboptimal treatment,
even after available statistical evidence could had already told which is the optimal treatment.
In GWAS studies, it is also typical that each new GWAS data collection will be combined with the
already existing ones, and hence the joint power of all existing data
is the (scientifically) most relevant quantity.