---
title: 'GWAS 9: Meta-analysis and summary statistics'
author: "Matti Pirinen, University of Helsinki"
date: 'Latest update: 2.12.2020; first version: 20-Feb-2019'
output:
html_document: default
urlcolor: blue
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(19)
```
This document is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).
The slide set referred to in this document is "GWAS 9".
Consider GWAS data on $n$ individuals and $p$ SNPs.
**GWAS summary statistics** can include a varying combination of the following information,
for each variant $l$ or region $\textrm{reg}$:
* **Association statistics**
$A_l = \left(\text{EA}_l, \widehat{\beta}_{l}, \textrm{SE}_{l}, P_{l} \right)$ where
EA is the effect allele for which $\widehat{\beta}_{l}$ is reported.
These statistics are of size $4p$, and hence their proportion compared to the
whole data is $4/n$.
* **Information statistics** $I_l = (\textrm{EAF}_l, \textrm{INFO}_l, \textrm{QC}_l,\ldots),$
where EAF is the effect allele frequency (sometimes given only in controls),
INFO is an imputation information score, and QC includes
quality control measures, such as Hardy-Weinberg P-value and missingness rate of the
genotype calls. Their proportion relative to the raw data is around $10/n$ (assuming 10 pieces of information
per variant).
* **LD-matrix** $R_{\textrm{reg}}$ for certain region(s) of the genome.
For a region of size $p_{\text{reg}}$, the size of $R_{\textrm{reg}}$
is about $\frac{1}{2}p_{\text{reg}}^2$, whereas that of the raw data is
$n p_{\text{reg}}$.
Thus, the ratio is $p_{\text{reg}}/(2n)$.
When $n$ is of order 1e+5, the summary statistics take only a tiny fraction
of the space required by the raw data. Additionally, raw genotype-phenotype
data are sensitive, personal data and cannot be shared freely,
whereas usually there are no legal restrictions on sharing the GWAS summary statistics.
For these reasons, large consortia, such as [GIANT](https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium) or [CARDIoGRAMplusC4D](http://www.cardiogramplusc4d.org/data-downloads/), are distributing their results as summary statistics. Therefore there is a need for methods that can
further analyze the summary statistics, e.g., in fine-mapping, in imputation,
in heritability estimation or in gene-level testing.
The utilization of summary statistics is
reviewed by [Pasaniuc & Price 2017](https://www.nature.com/articles/nrg.2016.142).
**Example 9.1.** A reminder of how the central association statistics
are related to each other.
* If we are given `b.est=`$\widehat{\beta}$ and `se=`$\textrm{SE}$, we can compute
the P-value as
`pchisq((b.est/se)^2, df=1, lower=F)`.
* If we are given $\widehat{\beta}$ and P-value `pval`, we can compute SE as
`sqrt(b.est^2 / qchisq(pval, df=1, lower=F))`.
* If we are given SE and P-value, we can compute `b.est=`$\widehat{\beta}$
for the trait-increasing allele
as `sqrt(se^2*qchisq(pval, df=1, lower=F))`.
(We need to know which allele is the trait-increasing one from external information.)
* Assuming that no (strong) covariates have been applied,
we can further infer one of sample size, MAF or case-proportion
from SE given that the other two are known, according to the formulas in GWAS3.
We can also estimate an SE from these study parameters, and use it with a known
effect estimate to derive a P-value.
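As a quick numeric check of the first three relationships (with made-up values for `b.est` and `se`, not from any real study):

```r
#Hypothetical example values:
b.est = 0.12 #effect estimate
se = 0.03 #its standard error
#beta and SE --> P-value:
pval = pchisq((b.est/se)^2, df = 1, lower = F)
#beta and P-value --> SE:
se.from.p = sqrt(b.est^2 / qchisq(pval, df = 1, lower = F))
#SE and P-value --> |beta| (the sign needs external allele information):
b.from.p = sqrt(se^2 * qchisq(pval, df = 1, lower = F))
```

Each conversion recovers the quantity it targets, up to floating point accuracy.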
In these notes, we go through how the association summary statistics are produced
through *meta-analysis* of individual GWAS and how some analyses that
we have so far done from raw data could be done using only the summary data.
### 9.1. Meta-analysis
Suppose we have results from $K$ independent GWAS on the same phenotype.
(Here independent means that the samples of the GWAS are not overlapping.)
Hence,
for each variant $l$, we have $K$ sets of GWAS association statistics
$A_{kl} = \left(\widehat{\beta}_{kl}, \textrm{SE}_{kl}, P_{kl} \right).$
How do we combine these $K$ pieces of information
into a single combined estimate of the effect size, SE and P-value?
A combination of summary-level results from two or more studies into a single
estimate is called **meta-analysis**
("meta" refers to something happening at a higher level; a meta-analysis is an "analysis of analyses"). A review of meta-analysis in GWAS
is given by [Evangelou & Ioannidis 2013](https://www.nature.com/articles/nrg3472).
In practice, all large GWAS are nowadays meta-analyses carried out by
international consortia, and a consortium may contain even over
a hundred individual studies.
Often each study runs a GWAS and shares the summary statistics with a
centralized analysis group that carries out the meta-analysis.
While this approach avoids sharing sensitive individual-level genotype-phenotype data
and also operates only with the light-weight summary data,
it also restricts the set of possible downstream analyses,
since only the marginal additive effect estimates are available.
It has become clear that, in order
to maximize the scientific output from these consortium efforts,
future meta-analyses should be designed so that all raw data
will be collected in a single place.
Unfortunately, this is not always easy because of legal issues related to
the sharing of genotype-phenotype data.
Let's get back to the question of how
we combine $K$ sets of
GWAS association statistics on the same (or at least similar) phenotype:
$A_{kl} = \left(\widehat{\beta}_{kl}, \textrm{SE}_{kl}, P_{kl} \right)$.
The answer depends on what we assume about the possible variation between
the true underlying effects $\beta_{kl}$ for $k=1,\ldots,K$.
#### 9.1.1 Fixed-effects model
The most common assumption is that all studies are measuring the same underlying
quantity, i.e., $\beta_{1l}=\ldots = \beta_{Kl}$ and there are no (noticeable)
differences in phenotype definitions and no distinct biases between the studies.
This is called the fixed-effects model because the effect size is the same across the studies.
In this case, the statistically most efficient unbiased linear estimator of the
common effect size $\beta$ is the **inverse-variance weighted (IVW)** estimator
(here denoted by F, for the Fixed-effects estimator):
\begin{align}
\widehat{\beta}_{l,F} &= \frac{w_{1l}\widehat{\beta}_{1l} + \ldots + w_{Kl}\widehat{\beta}_{Kl}}{w_{1l}+\ldots +w_{Kl}}\\
\text{SE}_{l,F} &= \left(w_{1l} + \ldots + w_{Kl} \right)^{-\frac{1}{2}},\quad \text{where the weight}\\
w_{kl}&= \frac{1}{\text{SE}_{kl}^2} \text{ is the inverse-variance of study } k.
\end{align}
In statistics, the inverse of the variance is called **precision**.
With that in mind, the above formulas are easy to remember:
* The weight given to each study in the IVW estimator is proportional to the precision
of the study, and the normalized weights sum to 1.
* The precision of the IVW estimator is the sum of the precisions of the individual studies.
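These two properties are easy to check numerically (using made-up estimates and SEs for three hypothetical studies):

```r
#Hypothetical summary statistics from 3 studies:
b.est = c(0.10, 0.14, 0.08)
se = c(0.05, 0.04, 0.08)
w = 1/se^2 #precisions
b.F = sum(w*b.est) / sum(w) #IVW estimate
se.F = 1/sqrt(sum(w)) #SE of the IVW estimate
w.norm = w/sum(w) #normalized weights, sum to 1
prec.F = 1/se.F^2 #precision of IVW = sum of the study precisions
```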
**Efficiency of IVW.** It sounds intuitively reasonable to weight each estimate
by its precision, but what is the mathematical argument behind this?
Let's consider the case of two studies and assume that both
yield unbiased estimates of the common effect size $\beta$, with precisions $w_i=1/\text{SE}_i^2$
for $i=1,2$. This means that
$$\text{E}\left(\widehat{\beta}_i\, |\, \beta\right) = \beta \quad\text{ and }\quad
\text{Var}(\widehat{\beta}_i) = \text{SE}_i^2 = \frac{1}{w_i}, \quad \text{ for } i=1,2.
$$
Consider all possible linear estimators $t(u) = u\widehat{\beta}_1 + (1-u)\widehat{\beta}_2$
determined by the value of $u \in[0,1]$.
Estimator $t(u)$ is unbiased for all $u$ as
$$\text{E}(t(u)\,|\,\beta) = u\text{E}\left(\widehat{\beta}_1\,|\,\beta\right) + (1-u)\text{E}\left(\widehat{\beta}_2\,|\,\beta\right) = u\beta + (1-u)\beta = \beta.$$
Thus, on average, any weighting scheme gives the correct answer, and
the question is, which one of these weightings gives the most precise
combined estimate (= smallest variance).
$$\text{Var}(t(u)) = \text{Var}\left(u\widehat{\beta}_1\right) + \text{Var}\left((1-u)\widehat{\beta}_2\right)
= u^2 \frac{1}{w_1} + (1-u)^2 \frac{1}{w_2} = u^2\left(\frac{1}{w_1} + \frac{1}{w_2} \right)
-\frac{2}{w_2}u+\frac{1}{w_2}.
$$
This is a second order polynomial with respect to $u$ and has its minimum where the derivative is 0, i.e.,
at $u_0 = \frac{2/w_2}{2(1/w_1 + 1/w_2)} = \frac{w_1}{w_1 + w_2},$ which is the IVW.
We conclude that IVW is **the minimum variance unbiased linear estimator** of the
fixed effect model.
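We can also verify numerically that the variance of $t(u)$ is minimized at the IVW weight (a sketch with arbitrary, hypothetical precisions):

```r
w1 = 400; w2 = 100 #hypothetical precisions (SEs 0.05 and 0.10)
u = seq(0, 1, by = 0.001) #grid of possible weights for study 1
var.t = u^2/w1 + (1-u)^2/w2 #variance of t(u)
u.min = u[which.min(var.t)] #grid minimizer of the variance
u0 = w1/(w1 + w2) #IVW weight for study 1, should match u.min
```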
**Example 9.2.** Suppose that we do a fixed-effect meta-analysis using
IVW of two studies on LDL-cholesterol
where the sample sizes of the studies are $n_1=5,000$ and $n_2 =10,000$.
If both studies have applied similar covariates and hence have similar
error variance $\sigma_\varepsilon^2$, then the
precisions of the studies are $w_i = 2n_if_i(1-f_i)/\sigma^2_\varepsilon$ for $i=1,2.$
At a SNP
that has the same MAF $f$ in both studies, the precisions are proportional to $n_i$
and hence the weights of the IVW are $\frac{w_1}{w_1+w_2}=0.333$
and $\frac{w_2}{w_1+w_2}=0.667$ and the
precision of the IVW is $w_F = w_1 + w_2 = 2(n_1+n_2)f(1-f)/\sigma^2_\varepsilon$,
which is the same as precision from a study with $n_1+n_2=15,000$ samples.
Indeed, with the linear model, the precision from splitting the data into
any subsets and then combining them with IVW is (approximately) the same as
doing a joint analysis of all the data with separate intercept terms for each
subset (Exercise). (If subsets are small and there are
covariates involved, then random noise causes some numerical differences
between the approaches.)
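A simulation sketch of this (near) equivalence, using one simulated SNP, a quantitative trait, and no covariates (all parameter values here are made up):

```r
set.seed(1)
n1 = 5000; n2 = 10000; f = 0.3
x = rbinom(n1 + n2, 2, f) #genotypes at one SNP
y = 0.1*x + rnorm(n1 + n2) #quantitative phenotype
grp = rep(1:2, c(n1, n2)) #split the data into two "studies"
#GWAS in each subset separately (rows: estimate, SE):
est = sapply(1:2, function(k){summary(lm(y[grp == k] ~ x[grp == k]))$coeff[2, 1:2]})
w = 1/est[2,]^2 #precisions of the subsets
b.ivw = sum(w*est[1,]) / sum(w) #IVW estimate from the subsets
se.ivw = 1/sqrt(sum(w))
#joint analysis of all data with subset-specific intercepts:
joint = summary(lm(y ~ x + factor(grp)))$coeff[2, 1:2]
```

The IVW estimate and SE agree closely with the joint analysis, as claimed.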
**Example 9.3.** Suppose that we do IVW meta-analysis of two studies on
Parkinson's disease where $n_1=10,000$ of which $3,000$ are cases ($\phi_1=0.3$)
and $n_2 = 6,000$ of which $3,000$ are cases ($\phi_2=0.5$).
Thus the effective sample sizes are $n_{e1}=10000\cdot 0.3\cdot 0.7=2100$
and $n_{e2}=6000\cdot 0.5\cdot 0.5=1500.$
If we assume that the MAF of the SNP is the same in both studies,
then the precisions of the studies are $w_i = 2n_{ei}f(1-f)$ for $i=1,2$
and the weights of the IVW are $\frac{w_1}{w_1+w_2}=\frac{2100}{3600}=0.583$
and $\frac{w_2}{w_1+w_2}=\frac{1500}{3600}=0.417$
and the precision of the IVW estimator is $w_F = w_1 + w_2 = 2(n_{e1}+n_{e2})f(1-f),$
which is the same as precision from a study with effective sample size of $n_{e1}+n_{e2}$.
It can be shown that by splitting the case-control data into subsets, the
sum of the effective sample sizes over the subsets is always $\le$ the effective
sample size of the whole data (and the equality holds when the case proportion is constant
across the subsets). This gives a technical explanation of why,
in logistic regression, the inclusion of a
binary covariate, such as sex or population label, causes a decrease in precision, and hence an increase in SE, compared to a single joint analysis of all data without the covariate.
This follows because the use of a binary covariate is approximately equivalent to splitting
the data by the covariate value, analyzing subsets separately and
combining the results using IVW (Exercise).
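The numbers of this example also illustrate the inequality, treating the whole data as the union of the two studies:

```r
#whole data = union of the two studies: 16,000 samples, 6,000 cases
n = 16000; phi = 6000/n
n.e.whole = n*phi*(1 - phi) #effective sample size of the joint data
#sum of the effective sample sizes of the two subsets (Example 9.3):
n.e.split = 10000*0.3*0.7 + 6000*0.5*0.5
#n.e.split (3600) <= n.e.whole (3750), with equality only if the
#case proportion were constant across the subsets
```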
**Example 9.4.** Consider the association statistics at SNP rs11984041 on the large vessel
subtype of Ischemic Stroke in
[Bellenguez et al. (2012)](https://www.nature.com/articles/ng.1081).
They reported a discovery OR 1.50 (1.25-1.79) and replication1 OR
1.38 (1.17-1.63) and replication2 OR 1.39 (1.15-1.68), all for the same allele.
What is the combined estimate, SE
and P-value using the fixed-effects meta-analysis?
(They report 1.42 (1.28-1.57), P=1.87e-11.)
**Answer.**
Let's make a function `meta.F()` that does the
IVW meta-analysis for given estimates and SEs.
```{r}
meta.F <- function(b.est, se){
#returns inverse-variance weighted meta-analysis estimate, SE and P-value.
b.F = sum(b.est / se^2) / sum(1 / se^2)
se.F = 1 / sqrt(sum(1 / se^2))
p.F = pchisq( (b.F / se.F)^2, df = 1, lower = F)
return(list(b.F = b.F, se.F = se.F, p.F = p.F))
}
```
With these data, we need to compute the SEs from the 95%CIs and then use IVW.
```{r}
b.est = log(c(1.50, 1.38, 1.39)) #beta is logOR for case-control data
ci = log(matrix(c(1.25, 1.79,
1.17, 1.63,
1.15, 1.68), byrow = T, ncol = 2))
se = (ci[,2] - ci[,1])/(2*1.96) #length of 95%CI is 2*1.96*SE
meta.res = meta.F(b.est, se)
meta.res
cbind(OR = exp(meta.res$b.F), low = exp(meta.res$b.F - 1.96*meta.res$se.F),
up = exp(meta.res$b.F + 1.96*meta.res$se.F), pval = meta.res$p.F)
```
The results match well, given that the SEs were computed from CIs reported to only two decimals.
Let's visualize them by a forest plot.
```{r}
forest.plot <- function(x, intervals, labels = NULL, main = NULL, xlab = "Effect size",
pchs = rep(19,length(x)), cols = rep("black", length(x)),
cexs = rep(1,length(x))){
K = length(x)
stopifnot(nrow(intervals) == K)
plot(0, col="white", xlim = c( min(c(intervals[,1],0) - 0.05), max(c(intervals[,2],0) + 0.05)),
ylim = c(0, K+1), xlab = xlab, ylab = "", yaxt = "n",main = main)
axis(2, at = K:1, labels = labels, cex.axis = 0.8)
arrows(intervals[,1], K:1, intervals[,2], K:1,
code = 3, angle = 90, length = 0.02, col = cols)
points(x, K:1, pch = pchs, cex = cexs, col = cols)
abline(v = 0,lty = 2)
}
```
```{r,fig.width=6}
b.est = c(b.est, meta.res$b.F)
ci = rbind(ci, c(meta.res$b.F + c(-1,1)*1.96*meta.res$se.F))
labs = c("Discv", "Rep1", "Rep2", "Meta")
main.txt = "rs11984041 Stroke/LVD"
forest.plot(b.est, ci, labels = labs, main = main.txt, xlab = "logOR",
pchs = c(19, 19, 19, 18), cexs = c(.8, .8, .8, 1.3), cols = c(1, 1, 1, 4))
```
The plot shows that the estimates are very consistent with each other.
We also see how the uncertainty has decreased in the combined estimate
compared to the individual studies.
(In practice, it is recommended to make such visualizations with
existing R packages, such as `meta`, that have many more
options.)
#### 9.1.2 Heterogeneity
When we talk about *heterogeneity* in meta-analysis, we mean that
the true effect sizes differ between the studies.
How can we assess that from the observed data?
Let's first list heterogeneity measures that are typically reported in meta-analyses.
**Cochran's Q.** Assume as the null hypothesis that all $K$
studies are measuring the same effect $\beta$ and
that the errors are Normally distributed:
$\widehat{\beta}_k \sim \mathcal{N}(\beta,\text{SE}_k^2)$, for study $k$.
Then each $z_k = (\widehat{\beta}_k - \beta)/\text{SE}_k \sim \mathcal{N}(0,1)$,
and, assuming that the studies are independent, the sum of the squares of
these $K$ independent standard Normals has a chi-square distribution with
$K$ degrees of freedom:
$\sum_{k=1}^K z_k^2 \sim \chi_K^2$.
When we replace the true $\beta$ with the fixed-effect estimate $\widehat{\beta}_F$
from the IVW method, we obtain a heterogeneity measure called **Cochran's Q**:
$$Q = \sum_{k=1}^K\left(\frac{\widehat{\beta}_k - \widehat{\beta}_F}{\text{SE}_k}\right)^2
=\sum_{k=1}^K w_k \left(\widehat{\beta}_k - \widehat{\beta}_F\right)^2,
$$
which, under the null hypothesis of the fixed-effect model, has the distribution $\chi_{K-1}^2$
(where one degree of freedom is lost as we have used the data to estimate the common mean $\widehat{\beta}_F$ to replace the true $\beta$).
This distribution can be used to derive P-values for heterogeneity.
However,
when $K$ is small (say < 5), there is little power to see any heterogeneity with this test,
and when $K$ is large (say > 100), then even a small heterogeneity becomes statistically
significant.
**$I^2$.** To change the focus from P-value to the amount of heterogeneity,
a heterogeneity index $I^2$ has been [proposed](https://www.bmj.com/content/327/7414/557):
$$I^2 = \frac{Q-(K-1)}{Q} = 1 - \frac{K-1}{Q}.$$
This value is between 0 and 1 and often reported as percentage
(negative values are rounded up to 0).
The idea is that if $Q$ shows only the null hypothesis expectation of variation
(which is $K-1$),
then $I^2=0\%$ indicating no heterogeneity, whereas values > 50% are interpreted as
a moderate amount of heterogeneity and values > 75% as high heterogeneity.
With a small number of studies, the uncertainty around the estimate of $I^2$ is large
and little can be inferred statistically.
**Between study variance $T^2$.**
Often a model for heterogeneity between the true effects is a two-stage
hierarchical model where the heterogeneity is defined by a variance parameter $T^2$.
We assume that first each true effect is sampled from
$(\beta_k|\beta, T^2) \sim \mathcal{N}(\beta,T^2)$
and then our estimates are formed by adding some noise around these values as
$(\widehat{\beta}_k | \beta_k) \sim \mathcal{N}(\beta_k, \text{SE}_k^2).$
From this model, a commonly-used estimate for $T^2$ is
$$\widehat{T^2} = \frac{Q-(K-1)}{\sum_{k=1}^K w_k - \frac{\sum_{k=1}^K w_k^2}{\sum_{k=1}^K w_k}}.$$
This is on the same scale as the (squared) effect sizes, which makes it different
from $Q$ and $I^2$ that are independent of the effect size scale.
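The three measures can be collected into one helper function, a sketch that follows the formulas above:

```r
het.stats <- function(b.est, se){
  #returns Cochran's Q, its P-value, I^2 and the estimate of T^2
  K = length(b.est)
  w = 1/se^2 #precisions
  b.F = sum(w*b.est) / sum(w) #fixed-effect (IVW) estimate
  Q = sum(w*(b.est - b.F)^2) #Cochran's Q
  p.Q = pchisq(Q, df = K - 1, lower = F)
  I2 = max(0, 1 - (K - 1)/Q) #negative values rounded up to 0
  T2 = max(0, (Q - (K - 1)) / (sum(w) - sum(w^2)/sum(w))) #between-study variance
  return(list(Q = Q, p.Q = p.Q, I2 = I2, T2 = T2))
}
```

For example, two studies with estimates 0 and 0.5 and both SEs 0.1 give $Q = 12.5$, $I^2 = 0.92$ and $\widehat{T^2} = 0.115$, whereas identical estimates give $Q = 0$ and $I^2 = 0$.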
When $T^2>0$ in the model formulation above, we have defined the standard
**random-effects** meta-analysis model where the effect sizes across studies
are not assumed to be exactly the same but still they are possibly quite similar
(namely, when $T$ is small compared to the common mean $\beta$ of all effects).
In the statistics literature, it is common to derive a P-value assuming $\beta=0$ from such a model
and call that the random-effects model's P-value.
Such a test is not suitable for GWAS
where the relevant null hypothesis is that all effects are 0, rather than that
only their mean is 0. This issue is explained by [Han & Eskin 2011](https://www.sciencedirect.com/science/article/pii/S0002929711001558)
and they also propose a modification that tests the null hypothesis of exactly
0 effect in every study.
In conclusion, the three quantities listed above to measure heterogeneity
are routinely reported in meta-analyses, but
they are often not that informative in cases where there are only a handful
of studies.
#### 9.1.3 Bayesian meta-analysis
The question of heterogeneity between studies can be addressed more flexibly
in the Bayesian framework, where a set of models with different assumptions
about heterogeneity can be directly compared against each other,
and where the interpretation
of the results of the model comparison does not depend on the sample size.
Let's remind ourselves
how, in GWAS 4, we compared the model with a non-zero effect to the null model
using the approximate Bayes factor, ABF.
We assumed that under the alternative hypothesis $H_1$, there was a non-zero
effect $\beta \sim \mathcal{N}(0,\tau^2)$
whereas under the null hypothesis $\beta=0$.
Then we derived the marginal likelihoods
that these models give for the observed data, and these marginal likelihoods
were proportional to Normal densities:
\begin{align}
P(\text{Data}\,|\,H_1) &= c\cdot \mathcal{N}(\widehat{\beta};0, \text{SE}^2 + \tau^2)\\
P(\text{Data}\,|\,H_0) &= c\cdot \mathcal{N}(\widehat{\beta};0, \text{SE}^2)
\end{align}
From this we got an approximate Bayes factor in favor of $H_1$ vs. $H_0$ as
$$
\text{ABF}_{1:0} = \frac{P(\text{Data}\,|\,H_1)}{P(\text{Data}\,|\,H_0)}
= \frac{\mathcal{N}(\widehat{\beta};0, \text{SE}^2 + \tau^2)}{\mathcal{N}(\widehat{\beta};0, \text{SE}^2 )}.
$$
Finally, we derived the posterior probability of $H_1$ under the assumption that
one of $H_0$ and $H_1$ is true and that the prior probability of it being
$H_1$ was $p_1 = 1-p_0$:
$$
P(H_1\,|\, \text{Data}) = \frac{p_1\cdot \text{ABF}_{1:0}}{p_0+p_1 \cdot \text{ABF}_{1:0}}.
$$
Thus, in order to get to a probabilistic model comparison,
we needed to set values for two parameters:
$\tau$ that determines how large effects
we expect to see under $H_1$, and $p_1$ that determines how probable we think that
the alternative hypothesis is *a priori*, before we have seen the data.
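This single-study computation can be sketched as a small function (the default values of `tau` and `p1` below are illustrative choices, not recommendations):

```r
abf.single <- function(b.est, se, tau = 0.2, p1 = 1e-5){
  #ABF of H1 (beta ~ N(0, tau^2)) vs H0 (beta = 0), and posterior P(H1 | Data)
  log.abf = dnorm(b.est, 0, sqrt(se^2 + tau^2), log = TRUE) -
            dnorm(b.est, 0, se, log = TRUE)
  abf = exp(log.abf)
  posterior = p1*abf / ((1 - p1) + p1*abf)
  return(list(abf = abf, posterior = posterior))
}
```

With these settings, an estimate of 0 (SE 0.05) gives ABF < 1 (favoring $H_0$), while an estimate of 0.3 with the same SE gives a huge ABF and a posterior probability of $H_1$ near 1.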
Let's see how we generalize this to multiple studies.
We use a multivariate Normal distribution as the prior distribution of the
effect size vector
$\pmb{\beta} = (\beta_1,\ldots,\beta_K)^T \sim \mathcal{N}_K(0,\pmb{\Theta}),$
where the prior matrix $\pmb{\Theta}$ is assumed to take the form
$$\pmb{\Theta} = \tau^2 \left[
\begin{array}{cccc}
1& \theta_{12}& \ldots &\theta_{1K}\\
\theta_{12}& 1& \ldots &\theta_{2K}\\
\vdots &\vdots&\ddots &\vdots \\
\theta_{1K}& \theta_{2K}&\ldots& 1
\end{array}
\right].
$$
Thus, the parameter $\tau^2$ still defines the prior variance of any one effect size
parameter, but now effect sizes from two studies may be correlated as defined by
prior correlation $\theta_{ij}$. For example,
* the fixed-effect model $H_F$ results if
we set all $\theta_{ij}=1$,
* the independent-effect model $H_I$
results if we set $\theta_{ij}=0$,
* the standard random-effect model $H_R(\rho)$
results if we set $\theta_{ij}=\rho$ for
some value of $\rho>0,$ where values close to 1 assume only little heterogeneity
and values close to 0 assume almost independent effects; our default is $\rho=0.9$,
* the null model $H_0$ is defined by setting $\tau^2=0$ (and then the values of $\theta_{ij}$
do not matter).
The likelihood function defined by the observed data is also proportional
to a multivariate Normal density
$\mathcal{N}_K(\widehat{\pmb{\beta}};\pmb{\beta}, \pmb{\Sigma}).$
If we assume that the studies are independent
(no overlapping samples), then $\pmb{\Sigma}$ is simply a diagonal matrix where
the diagonal is $(\text{SE}_1^2,\ldots,\text{SE}_K^2)$.
The marginal likelihood for data given the model $m$ (defined by prior variance matrix
$\pmb{\Theta}_m$) is
$$P(\text{Data} \, | \, H_m) = c\cdot \mathcal{N}_K(\widehat{\pmb{\beta}};
\pmb{0}, \pmb{\Sigma} + \pmb{\Theta}_m ).
$$
Approximate Bayes factor between any two models $m$ and $\ell$ is
$$\text{ABF}_{m:\ell} = \frac{P(\text{Data} \, | \, H_m)}{P(\text{Data} \, | \, H_\ell) }
=\frac{\mathcal{N}_K(\widehat{\pmb{\beta}};
\pmb{0}, \pmb{\Sigma} + \pmb{\Theta}_m )}{\mathcal{N}_K(\widehat{\pmb{\beta}};
\pmb{0}, \pmb{\Sigma} + \pmb{\Theta}_\ell )}.
$$
If model $m$ is given a prior probability $p_m$ (with $p_0+\ldots +p_M=1$, where the models are indexed $0,1,\ldots,M$ and model 0 is the null), then we can
compute the posterior probability for model $m$ as
$$
P(H_m \, | \, \text{Data}) = \frac{p_m\cdot \text{ABF}_{m:0} }{\sum_{\ell=0}^M p_\ell\cdot \text{ABF}_{\ell:0}},
$$
where we have computed ABFs between all models and the null model
(and $\text{ABF}_{0:0}=1$). Thus, the posterior probability of a model
is proportional to the product of the prior probability of the model
and ABF of the model.
Let's write a function `abf.mv()` that computes ABFs and posterior probabilities
for any given set of prior matrices and prior probabilities.
First, we need a density for the multivariate normal.
(There is also a package `mvtnorm` for that.)
```{r}
log.dmvnorm <- function(x, mu = rep(0, length(x)), S = diag(1, length(x)) ){
#returns log of density of MV-Normal(mean = mu, var = S) at x
K = length(mu)
stopifnot(all(dim(S) == K))
stopifnot(length(x) == K)
chol.S = chol(S) #Cholesky decomposition
log.det = 2*sum(log(diag(chol.S))) #log of det(S)
inv.chol.S = solve(t(chol.S)) #inverse of cholesky^T
return(-K/2*log(2*pi) - 0.5*(log.det + crossprod(inv.chol.S %*% (x-mu))))
}
abf.mv <- function(b.est, Sigmas, prior = rep(1,length(Sigmas))){
#Returns posterior probabilities of the models listed in Sigmas by their
# total variance matrix (= sum of prior + likelihood variance matrices)
#Returns also ABFs w.r.t the first model in Sigmas.
M = length(Sigmas) #number of models
K = length(b.est) #number of studies
prior = prior/sum(prior)
log.abf = sapply(Sigmas, function(x){log.dmvnorm(b.est, S = x)})
abf = exp(log.abf - log.abf[1]) #abf w.r.t the first model
posterior = prior*abf
posterior = posterior/sum(posterior)
return(list(posterior = posterior, abf = abf))
}
```
**Example 9.5.**
Let's generate 4 data sets for 10 case-control
studies where the effective sample size varies between 250 and 2500 and
MAF varies between 0.4 and 0.5.
* 1st data set: all studies estimate the same effect $\beta=0.1$.
* 2nd data set: there is heterogeneity and
the true effects come from $\mathcal{N}(0.1, 0.04^2)$.
* 3rd data set: there is heterogeneity but no correlation in effects as they come
from $\mathcal{N}(0, 0.1^2).$
* 4th data set has a null SNP in all studies.
```{r}
K = 10
n.eff = runif(K, 250, 2500)
f = runif(K, 0.4, 0.5)
se = 1/sqrt(2*n.eff*f*(1-f)) #SEs
w = 1/se^2 #precisions
b = 0.1 #true mean of effects in 1 and 2
B.est = cbind(rnorm(K, b, se), #fixed effects
rnorm(K, b, sqrt(se^2 + 0.04^2)), #correlated random effects
rnorm(K, rep(0,K), sqrt(se^2 + 0.1^2)), #independent effects
rnorm(K, rep(0,K), se)) #null model
```
Let's then compare 4 models in these data sets:
(1) fixed-effects, (2) correlated-effects, (3) independent-effects and (4) the null.
We specify these models by their matrices ($\pmb{\Sigma} + \pmb{\Theta}_m$),
run `abf.mv()` and print the forest plot of the data as well as a barplot
of the posterior probability across the 4 competing models, assuming the
prior probability of each model is the same (= 0.25).
Let's also print the standard heterogeneity measures:
value of $I^2$ and the P-value from Cochran's $Q$-statistic.
```{r, fig.height = 10, fig.width = 6}
Sigma = diag(se^2) #this is the variance of likelihood function -- same for all models
tau2 = 0.2^2 #prior variance of effect size -- same for all non-null models
S.fix = tau2 * matrix(1, K, K) #fixed-effect model has 1s in the correlation matrix
S.cor = tau2 * matrix(0.9, K, K) #correlated effects has corr of 0.9 as off-diagonal
diag(S.cor) = tau2 #... and corr of 1 on the diagonal
S.ind = tau2 * diag(K) #diagonal matrix for independent effects model, off-diagonals = 0
S.null = matrix(0, K, K) #null model has 0 effects
Var.matrices = list(Sigma + S.null, Sigma + S.fix, Sigma + S.cor, Sigma + S.ind)
par(mfrow = c(4,2))
for(ii in 1:ncol(B.est)){
#Standard heterogeneity measures:
b.F = sum(w*B.est[,ii]) / sum(w) #IVW estimate under fixed-effect model
Q = sum( w * (B.est[,ii] - b.F)^2 ) #Cochran's Q
pval.Q = pchisq(Q, df = K-1, lower = F)
I2 = 1 - (K-1)/Q #I^2 from Q
#Bayesian model comparison:
abf.out = abf.mv(B.est[,ii], Sigmas = Var.matrices) #by default, prior is uniform
ci = cbind(B.est[,ii] - 1.96*se, B.est[,ii] + 1.96*se) #95%CIs
forest.plot(B.est[,ii], ci, main = paste("Data set",ii), xlab = "logOR")
barplot(abf.out$posterior, ylim = c(0,1), cex.sub = 1.3,
sub = paste0("I2=",max(c(0,round(I2*100))),"% het P=",signif(pval.Q,2)),
names.arg = c("NULL", "FIX", "COR", "IND"),
col = c("gray","limegreen","orange","dodgerblue"))
}
```
The Bayesian approach gives a way to assess whether there is heterogeneity
in the effect sizes by comparing the correlated-effects model and/or the
independent-effects model to the fixed-effects model.
Above, it indicates heterogeneity in data sets 2 and 3, as expected.
Similarly, the $I^2$ value together with the P-value from $Q$ indicates heterogeneity
in sets 2 and 3.
Extensions of the Bayesian approach to overlapping samples between studies, e.g., due
to shared controls,
and to the subset models, where the effect is non-zero only in
particular studies, are discussed by [Trochet et al. 2019](https://doi.org/10.1002/gepi.22202).
As the final comment about heterogeneity:
Whenever you have an interesting variant in a GWAS meta-analysis, make
a forest plot over all the cohorts to see what the effects look like,
and don't rely only on quantitative heterogeneity measures,
especially if there are only a couple of studies included.
#### 9.1.4 Publication bias
A crucial part of all meta-analyses is to include in the analysis *all* the data
available on the particular research question. In particular, one should never
use the results of the studies to decide which studies to include or leave out
since that will obviously bias the results of the meta-analysis.
Studies can be left out because of quality issues or differences in phenotypes,
for example, but these must be objective criteria that are not based on the
results of the study in any one SNP. In general, meta-analyses in epidemiology
and social science etc. are hampered by **publication bias**
which means that only studies reporting
statistically significant results are published, whereas null studies never find their
way to the public. Consequently, a meta-analysis may report a significant effect
based on published studies even though there could be another set of unpublished
studies showing that, when all information is combined,
there is no effect. Publication bias
is less of a problem in GWAS, because GWAS results are published simultaneously
genome-wide, not separately for the "significant" SNPs.
### 9.2. Further analyses with summary statistics
Meta-analysis yields a set of association statistics ($\widehat{\beta}$, SE, P-value).
Let's look at how we can do some of the downstream analyses with these pieces of
information without access to the full raw genotype-phenotype data.
#### 9.2.1. Joint model of multiple SNPs
Let's consider the joint linear model with $p$ SNPs, with
the mean-centered phenotype $y$ and standardized
genotypes (which lets us drop the intercept term from the model):
$$\pmb{y} = \pmb{X^*} \pmb{\lambda}^* + \pmb{\varepsilon}.$$
The least squares estimator and its variance are
\begin{align}
\widehat{\pmb{\lambda}}^* &= \left(\pmb{X^*}^T\pmb{X^*}\right)^{-1} \pmb{X^*}^T\pmb{y},\\
\text{Var}\left(\widehat{\pmb{\lambda}}^*\right) &= \sigma_J^2 \left(\pmb{X^*}^T\pmb{X^*}\right)^{-1},
\end{align}
where
* $\sigma_J^2 = \text{Var}(\varepsilon) = \text{Var}(y) - (\widehat{\pmb{\lambda}}^*)^T \pmb{R} \widehat{\pmb{\lambda}}^*$ is the error variance from the Joint model and $\pmb{R}$ is the LD matrix of the SNPs in $\pmb{X}$.
It turns out that these quantities can be written using summary data from the marginal models $\pmb{y}=\pmb{x}^*_l\beta_l+\pmb{\varepsilon}_l$, since
\begin{align}
\left(\pmb{X^*}^T\pmb{X^*}\right) &= n \pmb{R}\\
\pmb{X^*}^T\pmb{y} &= n \widehat{\pmb{\beta}}^*\\
\text{Var}(y) &= (\widehat{\beta}^*_l)^2 + \widehat{\sigma}_l^2 = (\widehat{\beta}_l^*)^2 + 2 n f_l(1-f_l)\cdot \text{SE}_l^2,
\end{align}
where $f_l$ is the MAF of SNP $l$
and $\text{SE}_l$ is the standard error of the *allelic* marginal effect
$\widehat{\beta}_l$, as reported by GWAS.
With these formulas we have that
\begin{align}
\widehat{\pmb{\lambda}}^* &= \pmb{R}^{-1} \widehat{\pmb{\beta}}^*,\\
\text{Var}\left(\widehat{\pmb{\lambda}}^*\right) &= \frac{\widehat{\sigma_J^2}}{n} \pmb{R}^{-1},\\
\widehat{\sigma_J^2} &=
\text{median}_{l=1}^p\left\{(\widehat{\beta}_l^*)^2 + 2 n f_l(1-f_l)\cdot \text{SE}_l^2\right\} - (\widehat{\pmb{\beta}}^*)^T \pmb{R}^{-1} \widehat{\pmb{\beta}}^*,
\end{align}
where the median is taken over all the available SNPs in $\pmb{X}$
and its function is to reduce noise compared to the corresponding
variance estimate taken from any one $l$.
In particular, we do not need access to the raw $\pmb{X}$ and $\pmb{y}$
in order to do a stepwise forward selection or probabilistic fine-mapping
as long as we have the marginal GWAS association statistics and the LD matrix available.
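Let's check these formulas with a simulation sketch of two SNPs in LD and a quantitative trait (simulating haplotypes directly here rather than with `geno.2loci()`; all parameter values are made up). With standardized genotypes, $\widehat{\pmb{\lambda}}^* = \pmb{R}^{-1}\widehat{\pmb{\beta}}^*$ should reproduce the joint fit from the raw data:

```r
set.seed(2)
n = 10000
#haplotype alleles at two SNPs, correlated within a haplotype:
a1 = rbinom(2*n, 1, 0.3)
a2 = ifelse(runif(2*n) < 0.7, a1, rbinom(2*n, 1, 0.3))
X = cbind(a1[1:n] + a1[(n+1):(2*n)], a2[1:n] + a2[(n+1):(2*n)]) #genotypes
y = 0.2*X[,1] + rnorm(n) #only SNP1 is causal
y = y - mean(y) #mean-center the phenotype
X.std = scale(X) #standardize the genotypes
R = cor(X) #LD matrix
b.marg = as.vector(t(X.std) %*% y) / (n - 1) #marginal standardized effects
lambda.sumstat = solve(R, b.marg) #joint effects from summary statistics
lambda.raw = as.vector(coef(lm(y ~ 0 + X.std))) #joint effects from raw data
```

The summary-statistics computation recovers the raw-data joint estimates (here exactly, since the same $n$ individuals underlie both $\pmb{R}$ and $\widehat{\pmb{\beta}}^*$).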
If the association statistics come from an IVW fixed-effect meta-analysis, then the LD-matrix is a
weighted sum of the LD-matrices of the individual studies, where the weights are proportional
to the sample sizes of the studies. Before the meta-analysis, one must make sure that
all the studies have measured the effects on the same scale in order for the fixed-effect
meta-analysis of the effect sizes to make sense. Typically, this is ensured by normalizing the
trait to have a variance of 1 in each study before running the GWAS.
If the association statistics are coming from a **logistic regression model
applied to case-control data**,
we modify the above formulas by setting $\sigma_J^2=1$ and replacing the sample size
$n$ with the effective sample size $n_e$.
If summary data come from one study, then $n_e = n \phi(1-\phi)$,
where $\phi$ is the proportion of cases in data.
If summary data are from a meta-analysis over several studies, then $n_e$ is the sum of
the effective sample sizes $n_{e}^{(i)} = n^{(i)} \phi_i (1-\phi_i)$ over the individual studies,
where $n^{(i)}$ is the total sample size (cases + controls) of study $i$ and
$\phi_i$ is the proportion of cases in study $i$.
In this meta-analysis case, the LD-matrix is a
weighted sum of the LD-matrices of individual studies, where the weights are proportional
to the effective sample sizes of the studies.
This approximation works well when the effect sizes are not very large,
MAF is not very small and $\phi$ is quite balanced, say, within (0.2,0.8).
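The effective sample size calculation is a one-liner; here is a small sketch with made-up study sizes and case proportions, applying the formula $n_e = \sum_i n^{(i)} \phi_i (1-\phi_i)$ from above.

```{r}
#Effective sample size for a case-control meta-analysis (illustrative values):
n.study = c(4000, 10000, 6000) #total sample sizes (cases + controls) per study
phi = c(0.5, 0.2, 0.35) #proportion of cases per study
n.e = sum(n.study*phi*(1-phi)) #sum of per-study effective sample sizes
n.e
```

Note how the balanced study ($\phi = 0.5$) contributes $n/4$ whereas the unbalanced one ($\phi = 0.2$) contributes only $0.16\,n$.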
The idea of computing the joint model from the summary statistics was introduced by
[Yang et al. 2012](https://www.nature.com/articles/ng.2213)
and has been widely used through the conditional and joint analysis (COJO)
module of the software package [GCTA](https://cnsgenomics.com/software/gcta/#Overview).
The same idea is used in many fine-mapping software packages such as [FINEMAP](http://www.christianbenner.com/).
Let's try it with two SNPs and a quantitative trait. (Uses `geno.2loci()` from Section 7 to generate genotypes.)
```{r, echo = F}
geno.2loci <- function(n, r, mafs, return.geno = TRUE){
#INPUT:
# n, individuals
# r, correlation coefficient between the alleles on the same haplotype of the two loci
# mafs, MAFs of the two loci
#OUTPUT:
# if return.geno=TRUE: n x 2 matrix of GENOTYPES of n individuals at 2 loci
# if return.geno=FALSE: (2n) x 2 matrix of HAPLOTYPES of n individuals (2n haplotypes) at 2 loci
stopifnot( r >= (-1) & r <= 1 )
stopifnot( length(mafs) == 2 )
stopifnot( all(mafs > 0) & all(mafs <= 0.5) )
stopifnot( n > 0)
#Label SNPs and alleles so that a and b are minor alleles and freq a <= freq b.
#At the end, possibly switch the order of SNPs back to the one given by the user.
f.a = min(mafs) # maf at SNP1
f.b = max(mafs) # maf at SNP2
#With these parameters, the possible LD coefficient r has values in the interval:
r.min = max( -1, -sqrt(f.a/(1-f.a)*f.b/(1-f.b)) )
r.max = min( 1, sqrt(f.a/(1-f.a)/f.b*(1-f.b)) )
#c(r.min,r.max)
#Check that r is from this interval
if(r < r.min | r > r.max) stop(paste0("with these mafs r should be in (",r.min,",",r.max,")"))
# Alleles SNP1: A (major) and a (minor); SNP2: B (major) and b (minor).
# Compute conditional probabilities for allele 'a' given allele at locus 2:
q0 = f.a - r*sqrt(f.a*(1-f.a)*f.b/(1-f.b)) #P(a|B)
q1 = f.a + (1-f.b)*r*sqrt(f.a*(1-f.a)/f.b/(1-f.b)) #P(a|b)
#Compute the four haplotype frequencies:
f.ab = f.b*q1
f.aB = (1-f.b)*q0
f.Ab = f.b*(1-q1)
f.AB = (1-f.b)*(1-q0)
f = c(f.ab,f.aB,f.Ab,f.AB)
f #These are the haplotype frequencies in the population.
haps = matrix(c(1,1,1,0,0,1,0,0), nrow = 4, ncol = 2, byrow = T) #4 haplotypes in the population.
#Generate data for n individuals where each individual is measured at these two SNPs:
hap.ind = sample(1:4, size = 2*n, replace = T, prob = f) #There are 2*n haplotypes, 2 for each individual
if(mafs[1] > mafs[2]) haps = haps[,2:1] #Whether to change the order of loci?
#Either make genotype matrix by summing the two haplotypes for each individual...
if(return.geno) X = haps[hap.ind[1:n],] + haps[hap.ind[(n+1):(2*n)],]
if(!return.geno) X = haps[hap.ind,] #...or return haplotypes as such.
return(X)
}
```
```{r}
n = 2000
maf = c(0.3, 0.4)
lambda = c(0.2, -0.05) #true causal effects
r = 0.5 #LD btw SNPs 1 & 2
X = geno.2loci(n, r, mafs = maf, return.geno = TRUE)
R = cor(X) #LD in data
f = colSums(X)/2/n #MAFs in data
sc = sqrt(2*f*(1-f)) #scaling constants
y = scale(X %*% lambda + rnorm(n, 0, sqrt(1 - var(X %*% lambda))))
b.est = rbind(summary(lm(y ~ X[,1]))$coeff[2,1:2],
summary(lm(y ~ X[,2]))$coeff[2,1:2]) #col1 estimate, col2 SE
b.s = b.est[,1]*sc #scaled betas
l.s.sumstat = solve(R, b.s) #computes scaled lambdas as R^-1 * b.s
sigma2.J.sumstat = median(b.s^2 + n*sc^2*b.est[,2]^2) - as.vector(t(b.s) %*% solve(R, b.s))
l.s.se.sumstat = sqrt(sigma2.J.sumstat/n * diag(solve(R))) #SEs of scaled lambdas
cbind(lambda = l.s.sumstat/sc, se = l.s.se.sumstat/sc) #show on allelic scale
#Compare to the joint model on raw data
summary(lm(y ~ X))$coeff[2:3,1:2]
```
Same results up to the 3rd decimal. (Binary trait case left as an exercise.)
Given that the association summary statistics
from large meta-analyses are publicly available,
it would be very nice if we could do joint models and fine-mapping by
combining those statistics with
LD-information from some **reference database**,
without needing access to the original genotypes.
Indeed, this is how GCTA-COJO analyses are done.
Unfortunately, with recent large datasets, such as the UK Biobank,
it has become clear that the accuracy of the
LD-estimates must increase together with the GWAS sample size.
Otherwise, the summary statistic methods start reporting false positives
because of the inconsistency between the highly precise effect estimates
and the LD information from the reference data
([Benner et al. 2017](https://www.sciencedirect.com/science/article/pii/S0002929717303348)).
Hence, in general, we will need LD-information from the same data from which the GWAS
summary statistics were calculated in order to do reliable fine-mapping and joint analysis.
This is one reason why future meta-analyses
should be planned in such a way that all data are collected in one place, and why we will
need new ways to seamlessly distribute LD-information as another type of GWAS summary statistics.
#### 9.2.2 Polygenic scores
Our goal so far has been to identify **causal variants** that tell about the
biology of the phenotype and propose ways for targeted treatments.
Another way to utilize GWAS results is to **predict** phenotypes.
There is a difference between understanding the causes of a phenomenon
and an ability to predict the phenomenon:
While understanding typically implies a good prediction,
a good prediction does not necessarily require understanding.
For example, we do not need to know which of the two variants in high LD
with each other is a causal one in order to do a good prediction:
Either of the variants will
do almost equally well when used in a prediction model,
because, due to high LD, they carry almost the same information about
the genetic differences between individuals.
Let's consider the standard additive model for the phenotype across the whole genome:
$$
y_i = \eta_i + \varepsilon_i = \sum_{k=1}^p x_{ik} \lambda_{k} + \varepsilon_i.
$$
If we knew the true causal effects $\lambda_{k}$, then we could do the perfect prediction
of the genetic component $\eta_i = \sum_k x_{ik} \lambda_{k}$ for individual $i$ given her/his
genotypes. In the population, this perfect genetic prediction would explain the proportion
$h^2$ (=additive heritability) of the total phenotypic variance and this would be as good as
an additive genetic prediction ever gets.
By a **polygenic score** (PS) we mean an instance of the additive genetic predictor defined
by a set of weights $\pmb{\alpha} = (\alpha_k)_{k=1}^p$ that predicts the genetic
component of individual $i$ as
$$
\text{PS}_i(\pmb{\alpha}) = \sum_{k=1}^p x_{ik} \alpha_{k}.
$$
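The definition above is a single matrix product. A tiny sketch with made-up genotypes and weights (rows of `X.test` are individuals, columns are SNPs):

```{r}
#Polygenic score PS_i = sum_k x_ik * alpha_k for 2 individuals and 3 SNPs:
X.test = matrix(c(0, 1, 2,
                  2, 0, 1), nrow = 2, byrow = TRUE) #genotypes (0/1/2)
alpha = c(0.2, -0.1, 0.05) #PS weights, e.g., derived from GWAS betas
ps = as.vector(X.test %*% alpha) #one score per individual
ps #0.00 0.45
```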
Typically, the weights $\alpha_k$ are obtained from the GWAS summary statistics $\widehat{\beta}_k$,
possibly with some variable selection and/or shrinkage of the effect sizes and/or LD-adjustments
to approximate the causal effects $\lambda_k$ rather than the marginal effects $\beta_k$.
Such PS can then be tested against the known phenotype values in a test cohort to see
how much phenotypic variation it explains.
A recent guide for making PS is given by [Choi et al. 2020](https://www.nature.com/articles/s41596-020-0353-1),
together with the corresponding online [tutorial](https://choishingwan.github.io/PRS-Tutorial/).
Other articles discuss
developing and evaluating polygenic risk prediction models
([Chatterjee et al. 2016](https://www.nature.com/articles/nrg.2016.27)) and
the personal and clinical utility of PS
([Torkamani et al. 2018](https://www.nature.com/articles/s41576-018-0018-x)).
Methods to derive the best possible PS weights $\alpha_k$ are currently a hot
topic in the GWAS field, since recent results have shown that current GWAS
summary statistics can already provide useful predictive discrimination for the risk of several diseases
(slides 17-21). The hope is that more advanced modeling can further improve the accuracy.
New methods are typically compared to two reference methods:
**P-value clumping** and [LDpred](https://github.com/bvilhjal/ldpred).
Both of these take in the GWAS association statistics and produce PS-weights by accounting for LD.
P-value clumping prunes away variants that are
in high LD with each other whereas LDpred
is a Bayesian method that outputs estimates for the causal effect sizes
for each variant by accounting for
the LD-structure around the variant.
**P-value clumping ($r^2$,$d$,$P_{\text{thr}}$).**
The simplest PS uses the marginal GWAS effect estimates $\widehat{\beta}_k$ as weights.
Suppose that we have two variants in high LD. Their marginal effects are almost the same
and including them both in the PS is likely to
overestimate the joint contribution from these two SNPs and, hence, reduce the PS accuracy.
To avoid this, we do some **LD-pruning** meaning that we will only include non-zero weights for
SNPs that are not in high LD with each other. For example, we may require that $r^2<0.1$ between
all pairs of variants that have non-zero weights in PS.
In practice, such LD-pruning is applied only within certain window size (e.g., $d=1$Mb)
for computational reasons and because
LD decays quickly with distance in homogeneous populations.
**P-value clumping** means LD-pruning that prefers to leave in the data
the variant with the lowest GWAS
P-value and prune away its LD-friends that have higher GWAS P-values.
This way our final LD-pruned data set contains as many of
the top GWAS variants (in terms of the lowest P-values) as possible given the pruning parameter $r^2$.
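A greedy version of P-value clumping can be sketched as follows. This is a simplification that ignores the window size $d$; the function name `clump` and its arguments are made up for illustration, and `R2` denotes the matrix of pairwise $r^2$ values.

```{r}
#Greedy P-value clumping: repeatedly keep the candidate SNP with the lowest
#P-value and prune away its LD friends with r^2 at or above the threshold.
clump <- function(pval, R2, r2.thr = 0.1, p.thr = 1){
  keep = c()
  cand = which(pval < p.thr) #candidates passing the P-value threshold
  while(length(cand) > 0){
    top = cand[which.min(pval[cand])] #lowest P-value among remaining candidates
    keep = c(keep, top)
    cand = cand[R2[top, cand] < r2.thr] #drop 'top' and its LD friends
  }
  sort(keep)
}
#Example: SNPs 1 and 2 are LD friends (r^2 = 0.8); SNP 1 has the lowest P-value.
pval = c(1e-8, 1e-3, 0.2)
R2 = matrix(c(1.00, 0.80, 0.05,
              0.80, 1.00, 0.10,
              0.05, 0.10, 1.00), nrow = 3)
clump(pval, R2, r2.thr = 0.1) #keeps SNPs 1 and 3, prunes SNP 2
```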
Typically, there is also a P-value threshold $P_{\text{thr}}$ (between 0.05 and 5e-8) to ensure that
all variants included in PS with non-zero weights will have GWAS P-value $< P_{\text{thr}}$.