---
title: 'GWAS 8: Heritability and mixed models'
author: "Matti Pirinen, University of Helsinki"
date: "Latest update: 12-Apr-2023."
output:
html_document: default
urlcolor: blue
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(19)
```
This document is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).
The slide set referred to in this document is "GWAS 8".
We have seen Manhattan plots for BMI, migraine, schizophrenia etc.
where the number of association regions rises to hundreds.
How much these regions explain of the population
variation in a trait or its liability?
How many other genomic regions, or additional variants in the current regions,
might also contribute to these phenotypes?
Is there something special about the genomic regions that
harbor causal variants for a particular phenotype?
How well can we predict these phenotypes from genetic data
and how larger GWAS are likely to improve these predictions?
These are questions about **genetic architecture** of the phenotypes.
A primary parameter of genetic architecture is **heritability**.
### 8.1 Heritability
Heritability measures, in a particular population,
the proportion of variance of the phenotype that is due to genetic differences between individuals.
A review by [Visscher et al.](https://www.nature.com/articles/nrg2322)
is an excellent overview of the concept and its interpretation.
An important point is that heritability is always a property of a particular population,
and its value can vary between different environmental conditions as the relative roles of
genes and environment change.
Heritabilities have been estimated for a long time using phenotypic correlations between relatives
of different degrees. (Slides 3-4)
For example, the traditional twin estimates compare phenotypic similarity
in monozygotic twin pairs (who share whole genome IBD = 2)
to similarity in dizygotic twin pairs
(who are genetically like any other pairs of full-siblings: IBD0 = 25%,
IBD1 = 50%, IBD2 = 25%).
Under some (strong!) assumptions, such as that the environmental
contribution to the phenotypic similarity would be the same
for a monozygotic as for a dizygotic twin pair,
and that the genetic effects act additively over loci,
it follows that the difference between the correlation
estimates of these two types of twin pairs
leads to an estimate of heritability.
More generally, any pedigree records can been used for estimating how the observed
phenotypic correlations can be explained by the estimated genetic sharing,
which, under (strong!) assumptions
about covariance of the environmental effects between different relative types,
again leads to estimates of heritability.
Here our interest is in how much heritability do we explain with the already
discovered GWAS regions,
and whether we could use GWAS data, that typically have only few close relatives,
to estimate heritability of a trait.
In particular, such estimates should be immune to the confounding
factors of the shared environment that occurrs between close relatives.
### 8.2 Heritability of GWAS loci
We consider the measure of
**heritability in the narrow sense**, $h^2$, which is the heritability due to
the **additive genetic effects**. Thus, *dominance effects* within one locus
(i.e. the amount by which the heterozygotes' phenotype mean deviates
from the average of the two groups of homozygotes), or interaction
effects (*epistasis*) between multiple loci, are not included in this estimate.
The broad-sense
heritability $H^2$, which includes all variance due to genetics, is
more difficult to estimate than $h^2$.
If SNP $l$ has MAF $f_l$ and allelic causal effect
$\lambda_l$, then the phenotypic variance explained (causally) by
the SNP is $2f_l(1-f_l)\lambda_l^2 = {\lambda_l^*}^2$.
(Here $\lambda_l^*$ denotes the scaled causal effect introduced in Section 7.)
If the trait variance in population is $\sigma_Y^2$,
then the (additive) heritability contributed by the SNP is
$$h_l^2 = \frac{{\lambda_l^*}^2}{\sigma_Y^2} = \frac{2f_l(1-f_l)\lambda_l^2}{\sigma_Y^2}.$$
In practice, the variance explained is often estimated
using an estimate of the marginal effect as
$(\widehat{\beta}_l^*)^2 = 2 f_l (1-f_l) \widehat{\beta}_l^2$.
This is the phenotypic variance that is explained by genetic variation
tagged by variant $l$, (similar interpretation applies to heritability),
but is still often interpeted as "variance explained by variant $l$".
Let's assume that we have standardized the phenotype ($\sigma_Y^2=1$) before the GWAS.
According to the simplest model, assuming no LD or deviation from
the additivity between the causal effects of different variants, the
additive heritability over all variants is
$h^2 = \sum_{l=1}^p {\lambda^*_l}^2.$
More generally, allowing LD, $p$ variants in a region
contribute to the phenotype of one individual by the quantity
${\pmb{x}^*}^T \pmb{\lambda^*},$
and the variance contributed by this region is
$$
\begin{aligned}
h^2_{\textrm{reg}} &= \textrm{Var}({\pmb{x}^*}^T \pmb{\lambda^*}) =
\pmb{\lambda^*}^T \textrm{Var}(\pmb{x}^*) \pmb{\lambda^*} =
\pmb{\lambda^*}^T \pmb{R} \pmb{\lambda^*} \\
&=
(\pmb{R}^{-1}\pmb{\beta^*})^T \pmb{R} (\pmb{R}^{-1}\pmb{\beta^*})=
\pmb{\beta^*}^T \pmb{R}^{-1} \pmb{\beta^*},
\end{aligned}
$$
where $\pmb{R}$ is the LD-matrix (Pearson correlations) of the $p$ variants.
The above formula shows how to compute the regional heritability using either
the causal effect sizes or the marginal effect sizes (latter of which are directly estimated
by the standard GWAS).
If there is no LD ($\pmb{R} = \pmb{I}$) between the variants,
then the heritability is the sum of the squared scaled effects
(either causal or marginal effects as they are the same in this case).
For example, [FINEMAP](http://www.christianbenner.com/) (from Section 7 of course material)
gives an estimate of regional heritability for each causal configuration in its .config file
according to the formula above.
**Example 8.1.**
Consider SNPs 1 and 2 whose minor allele correlation is $r_{12}$,
MAFs are 0.24 and the marginal estimates are
$\widehat{\beta}_1 =\widehat{\beta}_2 = 0.08$
in a QT GWAS with a phenotypic variance of 1.
What is the estimate of heritability explained by
these SNPs jointly, and how much it differs from the case where they were independent,
when $r_{12}$ is 0.99 (almost the same variant twice),
0.3 (positively correlated effect alleles),
0 (independece) or -0.2 (effect alleles are masking each other)?
```{r}
b.est = c(0.08, 0.08)
f = c(0.24, 0.24)
b.s = b.est * sqrt(2*f*(1-f)) #scaled marginal effects
res = c()
for(r.12 in c(0.99, 0.3, 0, -0.2)){
R = matrix(c(1, r.12, r.12, 1),2,2)
#regional heritability b.s^T R^-1 b.s
h2.reg = t(b.s) %*% solve(R) %*% b.s #solve(R) is the inverse of R
h2.ind = sum(b.s^2)
res = rbind(res,c(r.12, h2.reg, h2.ind))
}
colnames(res) = c("r.12","h2.reg","h2.ind")
res
```
When the effect alleles are 0.99 correlated, then they are explaining the same
signal and their joint variance explained is, correctly, only half of the sum of the marginals.
With a negative correlation, and a similar direction and size in the observed effects,
the SNPs must been masking each other's marginal effects and their total contribution
is larger than the sum of their marginal contributions.
For diseases, the heritability is often measured on the **liability scale**, which requires
an estimate of the disease prevalence
(see [Box 5](https://www.nature.com/articles/nrg2322) of Visscher et al.).
A few examples of the heritability estimates summed over GWAS loci
(defined by a variant having P < 5e-8), in decreasing order
of heritability:
* Paraoxonase-1 level has a heritability of 70% at a single locus
(unsurprisingly, harboring the gene *PON1*) [Benner et al. 2018](https://www.biorxiv.org/content/10.1101/318618v1).
* Height study by [Yengo et al. 2022](https://www.nature.com/articles/s41586-022-05275-y)
found that 12,111 SNPs in 7,209 loci accounted for ~40% of variation in height. Sample size was 5.4 million!
The GWAS loci covered 21% of the genome and it seems that these loci cover all effects on height from common SNPs.
* LDL-cholesterol study by [Graham et al. 2021](https://www.nature.com/articles/s41586-021-04064-3)
explained ~13% of the variance in HDL-C and LDL-C by up to 1750 variants in 923 loci.
* Schizophrenia study by [Tubetskoy et al. 2022](https://www.nature.com/articles/s41586-022-04434-5) explained
about 2.4% of variance in liability using 277 independent GWS variants.
* Crohn's disease study by [Jostins et al. 2015](https://www.nature.com/articles/nature11582)
explained about 14% of variance in liability from 160 loci.
* BMI study by [Locke et al. 2015](https://www.nature.com/articles/nature14177) found that the 97 GWS loci
account for 2.7% of BMI variation.
In particular, for many complex diseases, such as schizophrenia, the variance
explained by GWS loci is still very small. On the other hand, traditional ways to estimate
heritability suggest a high heritability for schizophrenia.
Where is that heritability, if the top GWAS loci show this little traces of it?
Or are the traditional estimates grossly overestimating heritability?
This gap is called the [missing heritability problem](https://en.wikipedia.org/wiki/Missing_heritability_problem) (Slides 6-7).
Next we will look at two approaches that use GWAS data to estimate
the genome-wide **SNP heritability** $h^2_{\textrm{SNP}}$ that considers the heritability
contribution of all SNPs that are included in the study,
not just those that happen to reach the genome-wide significance level.
The first method, the **linear mixed model**, is based
on an efficient way of correlating the variant sharing with the phenotypic similarity
in the population sample.
The second method, **LD-score regression (LDSC)**, is based on a link between
the amount of tagging by LD and the amount of GWAS signal seen at a variant,
which link is induced by a highly polygenic genetic architecture.
The linear mixed model requires original phenotype-genotype data whereas
LDSC works with the GWAS summary data (i.e. the marginal effect estimates and their SEs).
### 8.3. Linear Mixed Model
Let's write down the full linear model for additive effects across the genome for a quantitative
phenotype $Y$:
$$y_i = \mu + \pmb{z}_i^T\pmb{\alpha} + {\pmb{x}_i^*}^T\pmb{\lambda}^* + \varepsilon_i= \mu + \pmb{z}_i^T\pmb{\alpha} + \sum_{l=1}^p x_{il}^* \lambda_l^* + \varepsilon_i,$$
where $\pmb{z}_i$ is the vector of covariate values for $i$ and $\varepsilon_i\sim\mathcal{N}(0,\sigma_E^2)$
is the (environmental) error term that is assumed to be uncorrelated across individuals.
We saw in GWAS 7 that if we try to estimate the parameters $\lambda_l^*$ using the ordinary
least squares estimator, we run into problems because of high correlations between SNPs and,
more generally, overfitting.
This is because the standard linear regression model
is too flexible to adapt to the data when $p$ grows large
if the effect sizes are not restricted in any way.
To overcome these problems, the linear mixed model treats the effect sizes
$\lambda_l^*$ as **random effects** that share a common (prior) distribution,
here chosen to be $\mathcal{N}(0,\tau^2)$, where $\tau^2$ will be estimated from the data.
Now the parameters $\lambda_l^*$ are not allowed freely to choose their
values but their magnitude is restricted by a *shared* variance parameter $\tau^2$.
The model is able to learn from the data how the whole set of values of $\lambda^*$
look like when considered together, and then apply that information to keep the
magnitude of $\lambda^*$s appropriate by adjusting a single variance parameter $\tau^2$.
Another way to think about the difference between this random effects model and the
least squares estimation is that our focus changes from estimating
each of the $p$ values $\lambda_l^*$ to estimating their *shared* distribution,
as determined by the variance parameter $\tau^2$.
Hence we reduce the number of parameters estimated from $p$ to 1,
and will avoid overfitting.
The name *mixed model* reflects that the model is a *mix* of both **fixed effects** $\alpha$,
whose individual values are estimated as in the standard linear model,
and **random effects** $\lambda^*_l$, whose joint *distribution*
is estimated, rather than the individual values of $\lambda_l^*$s.
How can we link the new parameter $\tau^2$ to the observed values of $y$?
The answer is to write down what the random effect
assumption means in terms of the observed similarity (mathematically covariance)
of the phenotypes of individuals $i$ and $j$.
If we follow the random effect formulation, and independently
draw each $\lambda_l^* \sim \mathcal{N}(0,\tau^2)$, what is the
consequence on the phenotypic covariance between $i$ and $j$, induced
by the terms $\eta_i = \sum_{l=1}^p x_{il}^* \lambda_l^*$
and $\eta_j = \sum_{l=1}^p x_{jl}^* \lambda_l^*$?
$$
\begin{aligned}
\textrm{Cov}(\eta_i, \eta_j) &=
\textrm{Cov}\left(\sum_{l=1}^p x_{il}^* \lambda_l^*, \sum_{l=1}^p x_{jl}^* \lambda_l^* \right) =
\sum_{l=1}^p \sum_{k=1}^p x_{il}^* x_{jk}^*\, \textrm{Cov}(\lambda_l^*,\lambda_k^*) =
\sum_{l=1}^p x_{il}^* x_{jl}^*\, \textrm{Cov}(\lambda_l^*,\lambda_l^*) \\
&= \sum_{l=1}^p x_{il}^* x_{jl}^* \tau^2
= p\tau^2 \pmb{G}_{ij},
\end{aligned}
$$
where $\pmb{G}$ is the GRM-cor from chapter 5 of the course material,
i.e., $\pmb{G} = \frac{1}{p}\pmb{X}^*{\pmb{X}^*}^T$
is the $n\times n$ empirical correlation matrix for individuals computed across all SNPs.
This is saying that the **additive genetic components** $\eta_i$ of the trait are correlated
across the individuals according to the genetic relatedness of the individuals,
as measured by GRM-cor, and are scaled so that their
variance is $p\tau^2$, where $\tau^2$ is the variance of the causal effect sizes.
Note that $\tau^2$ is also the expected phenotypic variance contributed by any one causal effect
as $\textrm{E}({\lambda_l^*}^2) = \textrm{Var}(\lambda_l^*) = \tau^2$.
If we ignore LD between nearby variants, then $\sigma_G^2 = p\tau^2$ is the expected phenotypic variance
contributed by all $p$ variants together,
and we would estimate the heritability as $\widehat{h}^2 = {\sigma_G^2}/{(\sigma_G^2 + \sigma_E^2)}.$
The last step is to write down the joint distribution of the
phenotype vector $\pmb{y}$, as defined by the variance components $\eta$ and $\varepsilon$,
from the relationship $y_i = \mu + \pmb{z}_i^T\pmb{\alpha} + \eta_i + \varepsilon_i$.
The phenotype vector is an $n$-dimensional multivariate normal vector,
whose mean $\pmb{\mu}$ has components $\mu_i = \mu + \pmb{z}_i^T\pmb{\alpha}$
and whose covariance matrix
$\pmb{\Sigma}(\pmb{\sigma^2}) = \sigma_G^2 \pmb{G} + \sigma_E^2 \pmb{I}_n$
is a function of two unknown variance parameters $\pmb{\sigma^2} = (\sigma^2_G,\sigma_E^2).$
(Slide 8.)
A naive computation of such $n$-dimensional multivariate Normal likelihoods is
expensive -- $\mathcal{O}(n^3)$ operations -- and in recent years
many new ways to speed up the computation have been introduced.
Currently,
[GCTA](https://cnsgenomics.com/software/gcta/#Overview) with its **fastGWA** module
remains a widely-used method and [BOLT-REML](https://data.broadinstitute.org/alkesgroup/BOLT-LMM/) is
an efficient implementation of the mixed model applicable to 100,000s of samples.
Recently, [REGENIE](https://rgcgithub.github.io/regenie/overview/) has implemented an efficient
version of a similar model.
In what follows, we will
experiment with the mixed model by using a simple trick that can be done easily in R, but
which would not generalize to multiple variance components, and therefore differs
from more complex methods such as GCTA and BOLT-REML.
#### 8.3.1 A Mixed model estimation method
To simplify the setting, let's assume we first regress out the covariates from $y$
using linear regression
and consider the (quantile normalized) residuals from that regression as our
covariate-adjusted phenotype $y'$.
Our task is to maximize the multivariate Normal likelihood function of
$$\pmb{y'}\sim \mathcal{N}\left(0,\pmb{\Sigma}(\pmb{\sigma^2})\right), \textrm{ with respect to } \pmb{\sigma^2}. $$
The log-likelihood of the multivariate Normal is
$$L(\pmb{\sigma^2}) = -\frac{1}{2} \log \det\pmb{\Sigma}(\pmb{\sigma^2})
-\frac{1}{2} \pmb{y'}^T \pmb{\Sigma}(\pmb{\sigma^2})^{-1} \pmb{y'}.$$
If we make an eigendecomposition of the GRM-cor matrix $\pmb{G}=\pmb{U}\pmb{D}\pmb{U}^T$,
where $\pmb{U}$ is an orthonormal matrix of eigenvectors and $\pmb{D}$ is a diagonal matrix
of eigenvalues, then the inverse and determinant of the $n$ dimensional matrix
can be transformed to those of diagonal matrices by rotating the phenotype with the
eigenvectors into a new phenotype $\widetilde{\pmb{y}} = \pmb{U}^T\pmb{y}':$
$$L(\pmb{\sigma^2}) = -\frac{1}{2} \sum_{i=1}^n \log(\sigma_G^2 D_i + \sigma_E^2) -\frac{1}{2}
\sum_{i=1}^n \frac{\widetilde{y}^2_i}{\sigma_G^2 D_i + \sigma_E^2} =
-\frac{1}{2} \sum_{i=1}^n \left( \log(\sigma_G^2 D_i + \sigma_E^2) +
\frac{\widetilde{y}^2_i}{\sigma_G^2 D_i + \sigma_E^2} \right),$$
where $D_i$ is the diagonal element $i$ of $\pmb{D}$.
(The derivation of the above transformation
in not explained in detail here but is given by
[Pirinen et al. 2013](https://projecteuclid.org/euclid.aoas/1365527203).)
This version of the log-likelihood is easy to optimize in R by using `optim()`, and
we can help `optim()` by giving it also the gradient of the log-likelihood.
Here is a function that returns the log-likelihood which we can then maximize by `optim()`.
```{r}
lmm.loglik <- function(sigma, y, d)
{
sigma.sum = sigma[1] * d + sigma[2]
res = -0.5 * sum(log(sigma.sum) + y ^ 2 / sigma.sum)
return(res) #returns log likelihood
}
lmm.gradient <- function(sigma, y, d)
{
sigma.sum = sigma[1] * d + sigma[2]
tmp = y ^ 2 / sigma.sum - 1
dsigmaG = 0.5 * sum(d / sigma.sum * tmp)
dsigmaE = 0.5 * sum(1 / sigma.sum * tmp)
return(c(dsigmaG, dsigmaE)) #returns gradient
}
```
**Example 8.2.**
Let's try the mixed model with $n=2000$ samples and $p=10,000$ independent common variants
with MAF 0.5 and simulate trait with $h^2=0.5$.
We expect that each variant here would explain only $h^2/p = 0.00005$
of the trait variance!
We don't expect to have any genome-wide significant findings with $n=2000$ samples.
```{r, fig.width = 4, fig.height=4}
p = 10000
n = 2000
f = 0.5
h2 = 0.5
X.s = scale(replicate(p, rbinom(n, size = 2, p = f) )) #use scaled genotypes
#tau^2 = var(lambda.s) = h2 / p
lambda.s = rnorm(p, 0, sqrt( h2 / p)) #scaled effects
#generate phenotype: SNP effects + random noise with var=1-h2
y = scale( X.s %*% lambda.s + rnorm(n, 0, sqrt(1-h2) )) #scaling makes mean(y)=0 as our LMM ignores intercept!
#test individual SNPs and make a QQ-plot
pval = as.numeric(apply(X.s, 2, function(x){summary(lm( y ~ x))$coefficients[2,4]}))
expect.stats = qchisq(ppoints(p), df = 1, lower = F)
obs.stats = qchisq(pval, df = 1, lower = F)
lambda = median(obs.stats) / median(expect.stats) #GC lambda = ratio at medians
qqplot(expect.stats, obs.stats, xlab = "chisq expected", ylab = "chisq observed",
sub = paste0("lambda=",signif(lambda,3)), cex = 0.8)
abline(0,1)
sort(pval)[1:5] #show the 5 smallest P-values
```
We have no genome-wide significant SNPs and the QQ-plot doesn't look very inflated.
However, the data were simulated in such a way that we expect that the SNPs explain together
half of the phenotypic variance, even if we can't see any individual SNP having a clear effect.
What does the mixed model say?
```{r}
#G = cor(t(X.s)) #make GRM-cor matrix
G = (X.s %*% t(X.s))/p #make GRM-cor matrix (see Chapter 5.1.2.)
eig = eigen(G) #decompose G = U D t(U)
y.Ut = t(eig$vectors) %*% y #transform y to y.Ut by using U^t y
d = eig$values #eigenvalues of G
start.val = c(1, 1) / 2 #starting values for the optimization of the two variance parameters
#1st optim run with robust Nelder-Mead method without gradient
res = optim(start.val, fn = lmm.loglik, y = y.Ut, d = d,
method = 'Nelder-Mead', control = list(fnscale = -1)) #fnscale=-1 to maximize, not minimize
#2nd optim run refining the estimate by BFGS method using gradient
res = optim(res$par, fn = lmm.loglik, gr = lmm.gradient, y = y.Ut, d = d,
method = 'BFGS', hessian = T, control = list(fnscale = -1)) #fnscale=-1 to maximize, not minimize
sigma2 = res$par #estimates of sigma.G^2 and sigma.E^2
sigma2.SE = sqrt(diag(solve(-res$hessian))) #SE of sigma_G^2 and sigma_E^2
res = cbind(sigma2, sigma2.SE) # genetic variance and environmental variance with SEs
rownames(res) = c("sigma.G^2","sigma.E^2")
res
h2.est = sigma2[1]/sum(sigma2)
h2.est # heritability is ~sigma.G^2 since var(y)=1. Then we also have SE(h2) = SE(sigma.G^2).
```
Our estimate is close to the true value of 50%.
(Note that given the SE of 0.07 we are lucky to get *this* close here.)
The mixed model can indeed pick up the
joint contribution of all those tiny effects!
Note that we haven't derived SE for $\widehat{h}^2$ but
in cases where total variance of phenotype is 1,
we can use SE of $\widehat{\sigma}_G^2$ as SE of $\widehat{h}^2$.
**Example 8.3.**
LMM worked nicely above, but what happens when the true genetic architecture is not 100% polygenic,
that is, when only a subset of all variants contribute to the phenotype.
The mixed model was derived assuming that all effects are non-zero.
What happens if we simulated non-zero effects only for, say 20% of the SNPs?
Let's recycle our genotype data and $G$ matrix and its decomposition since
those take some time to make. Let's simply simulate new sets of
$\lambda^*$s for a scenario where 20% of SNPs have non-zero
effects and together they explain 30%
of the trait variance. Let's repeat this 10 times and plot the estimates.
```{r, eval=T}
set.seed(12)
phi = 0.2 # proportion phi of the SNPs are non-zero
h2 = 0.3
n.iter = 10 #how many simulations -- all use the same genotype data
h2.res = matrix(NA, ncol = 2, nrow = n.iter)
for(iter in 1:n.iter){
#choose which SNPs have an effect
c.ind = sort(sample(1:p, size = round(phi*p)))
#var(lambda.s) = h2 / (phi*p)
lambda.s = rnorm(length(c.ind), 0, sqrt( h2 / (phi*p)))
#generate phenotype: SNP effects + random noise with var = 1 - h2
y = scale( X.s[,c.ind] %*% lambda.s + rnorm(n, 0, sqrt(1-h2) ))#scale: mean(y) = 0, sd(y) = 1
y.Ut = t(eig$vectors) %*% y #transform y to y.Ut = U^t y
start.val = c(1, 1) / 2 #starting values for the optimization of the two variance parameters
#1st optim run with robust Nelder-Mead method without gradient
res = optim(start.val, fn = lmm.loglik, y = y.Ut, d = d,
method = 'Nelder-Mead', control = list(fnscale = -1))
#2nd optim run refining the estimate by BFGS method using gradient
res = optim(res$par, fn = lmm.loglik, gr = lmm.gradient, y = y.Ut, d = d,
method = 'BFGS', hessian = T, control = list(fnscale = -1))
sigma2 = res$par #estimates of sigma.G^2 and sigma.E^2
sigma2.SE = sqrt(diag(solve(-res$hessian))) #SE of sigma_G^2 and sigma_E^2
h2.res[iter,] = c(sigma2[1]/sum(sigma2),sigma2.SE[1]) # h2.est with sigma.G^2 SE
}
plot(1:n.iter, h2.res[,1], pch = 19,
main = paste0(" n = ",n," p = ",p," h2 = ",h2," phi = ",phi),
xaxt="n", ylab="h2 estimate", ylim = c(0,1), xlab = "")
arrows(1:n.iter, h2.res[,1]-1.96*h2.res[,2], 1:n.iter, h2.res[,1]+1.96*h2.res[,2],
code = 3, length = 0.05, angle = 90)
abline(h = h2, lty = 2, col = "green")
```
It works fine even when only 20% of the SNPs have non-zero effects.
It seems that LMM is robust to a considerable proportion of zero effects among the SNPs,
which is great news!
#### 8.3.2 Mixed model heritability estimates
LMMs have been used for a long time in animal breeding and pedigree
analyses and this approach became popular in GWAS data in humans by the landmark
publication of [Yang et al. (2010)](https://www.nature.com/articles/ng.608)
where they showed that 45% of the variation in height
could be explained by about 300,000 SNPs on a genotyping chip.
This was an important piece of
information for the discussion surrounding the missing heritability problem, because
after that publication it became more widely considered plausible that a lot of the gap between
the variance explained by GWS regions (only ~5% for height at that time)
and the heritability of height as estimated by twin and sibling studies (~70-80%),
may well be just smaller effects not able to become GW-significant
with given sample sizes.
The authors have explained in their later publication [Visscher et al. 2010](https://www.ncbi.nlm.nih.gov/pubmed/21142928)
that "During the refereeing process
(the paper was rejected by two other journals before publication in Nature Genetics)
and following the publication of Yang et al. (2010) it became clear to us
that the methodology we applied, the interpretation of the results and
the consequences of the findings on the genetic architecture of human height
and that for other traits such as complex disease are not well understood or
appreciated."
Nowadays, the LMM is routinely used for SNP heritability estimation of quantitative traits.
More recent applications suggest that for height and BMI there is not much of a
gap anymore between the SNP heritability from very dense marker sets and
the estimates of the total heritability by other means than through GWAS data
[Yang et al. 2015](https://www.nature.com/articles/ng.3390),
[Young 2019](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1008222),
[Yengo et al. 2022](https://www.nature.com/articles/s41586-022-05275-y).
At the same time, the analysis has got more complex compared to the original
version that only included one variance component for all the SNPs,
as in our example above.
It is now clear that a better model should allow
SNPs with different MAFs and different amounts of LD-tagging to
have different variance parameters of the effect size distribution.
In practice, this means that one should compute separate GRM-cor
matrices for each group of SNPs that needs to be modeled separately
and then estimate their
variance contribution jointly, in a single LMM, that can include
tens of different GRM-cor matrices. This is possible with both
GCTA and BOLT-REML.
Typical heritability analysis using LMM
filters out close relatives (at least 2nd degree or closer).
This is because closer relatives usually are also positively correlated
in some of their environmental factors, which could create correlation
in their phenotypes that LMM would falsely pick up as heritability.
When the analysis is done in a homogeneous population sample
of "unrelated" individuals, this concern is greatly reduced.
Thus, LMM with GWAS data
does something that has been impossible to imagine before:
Estimating heritability using "unrelated" individuals.
A downside of restricting the analysis to unrelateds is that the precision of
the variance components is much smaller than if there were a larger
range of possible relationships in the data. This means that large
samples are needed to get useful estimates of
SNP heritability from population data.
What about the disease studies? The GCTA approach was also quickly [applied
to case-control data](https://www.sciencedirect.com/science/article/pii/S0002929711000206),
but the results are much more complicated
to interpret than for the quantitative phenotypes.
First, there is a transformation from binary phenotype to liability model.
Second, there is the case-control ascertainment which complicates statistical
modeling (as we saw with the simple covariate adjustments) and makes
LMM in general behave unfavorably,
as reported by [Golan et al. 2014](https://www.pnas.org/content/111/49/E5272).
Third, typically cases and controls have been
genotyped/handled/collected differently, and while a careful quality control
can make GWS findings reliable and replicable, it remains a concern that the
tiny effects picked up by the mixed model are not only polygenic effects but can
also contain confounding effects.
Hence interpreting the variance parameters of
a case-control GWAS data as heritabilities of diseases
makes quite a many quite strong assumptions.
We will come back to this important question about whether the inflation
in a QQ-plot is confoundig or polygenicity
in section 8.4 below.
A perspective of using LMM to estimate heritability by
[Yang et al. 2017](https://www.nature.com/articles/ng.3941).
#### 8.3.3 Mixed model in GWAS
All the discussion above has been about estimating the variance parameters
using the LMM. But mixed models have also become widely-used for
running the primary GWAS analysis.
Suppose we want to test the effect of SNP $s$ on the phenotype $Y$.
The LMM approach does the linear regression by including in the model a random
effect from **other variants except $s$ and its LD-friends**:
$$y_i = \mu + \pmb{z}_i^T\pmb{\alpha} + x_{is}\beta_s + \eta_i + \varepsilon_i,$$
where $\pmb{z}_i$ is the vector of covariate values for $i$,
$\eta_i = \sum_{l=1}^{p_s} x_{il}^*\lambda_l^*$ is the additive contribution
of the rest of the genome *except variant $s$ and its LD-friends*
and $\varepsilon_i \sim \mathcal{N}(0,\sigma_E^2)$
is the (environmental) error term that is assumed to be uncorrelated across individuals.
The random effect assumption $\lambda_l^*\sim \mathcal{N}(0,\tau^2)$
will then lead to similar computations as above with the heritability estimation
except that the GRM-cor matrix $G$ can now be different for different variants tested,
and there is an additional fixed effect in the model corresponding to
the marginal effect $\beta_s$ of SNP $s$.
In practice, when testing SNPs on, say, chromosome 1,
GRM-cor can be computed for all other chromosomes and then used as the covariance
structure of the random effect for all SNPs on chr 1 (e.g. BOLT-LMM and GCTA do this).
Local updating of the random effect has also been studied
[(Listgarten et al. 2012)](https://www.nature.com/articles/nmeth.2037).
There are (at least) two benefits for adding the rest of the genome
as a random effect in linear regression:
1. The precision of the estimator $\widehat{\beta}_s$
will increase as the (often substantial) variation in phenotype
from the rest of the genome is explained away by the model.
This also leads to increased statistical power to detect new associations.
2. The rest of the genome captures confounding effects
that are due to relatedness and/or
population structure and hence $\widehat{\beta}_s$ from a LMM
has been automatically adjusted for these confounders.
Thus, LMM is an alternative for the standard linear regression with leading
PCs as covariates, and since LMM also increases power and also accounts for
related individuals, it is a very useful in QT GWAS.
To be on the safe side, when the data have both close relatives and
clear population structure, then these patterns may not be correctly modelled by a single
joint random component, and, in general, one should generate different random effects for
each major source of phenotypic covariance in the sample.
Typically, we still remove one from each pair of closely related individuals
because there are not many of these pairs and it is difficult to know whether
their environmental correlation has been correctly modelled by LMM.
While LMMs have also been applied to GWAS of binary traits, until recently
they have only been applicable to the cleanest cases of
common variants and balanced case-control ratios.
Recent research on efficient mixed model for
binary phenotypes have been implemented by
[Zhou et al. 2018](https://www.nature.com/articles/s41588-018-0184-y)
into [SAIGE](https://github.com/weizhouUMICH/SAIGE) software and
by [Mbatchou et al. 2020](https://www.biorxiv.org/content/10.1101/2020.06.19.162354v2)
into [REGENIE](https://rgcgithub.github.io/regenie/overview/).
Now, let's return to the important question that puzzled us above.
We know that both confounding and true polygenic effects can cause
inflation in QQ-plots and the genomic-control parameter $\lambda$.
How could we tell whether
the inflation is true signal from thousands of small effects, or whether it is
a result of some confounding bias?
### 8.4. LD-score regression (LDSC)
LDSC developed by [Bulik-Sullivan et al. 2015](https://www.nature.com/articles/ng.3211)
can separate confounding from polygenicity by utilizing LD in a clever way.
Let's think about consequences of high polygenicity
(a lot of non-zero causal effect sizes) on the observed marginal effect sizes
when we take into account the differences in LD patterns between the variants.
We know that for a region with $p$ variants and LD-matrix $\pmb{R},$
$\pmb{\beta}^* = \pmb{R}\pmb{\lambda}^*$. If we assume a highly polygenic
model, with each $\lambda_l^* \sim \mathcal{N}(0,\tau^2)$, we get that
$$\begin{aligned}
\textrm{E}(\beta_l^*) &= \textrm{E}\left(\sum_{k=1}^p r_{lk} \lambda_k^*\right) =
\sum_{k=1}^p r_{lk} \textrm{E}(\lambda_k^*) = 0,\\
\textrm{E}({\beta_l^*}^2) &= \textrm{E}\left(\sum_{k=1}^p r_{lk} \lambda_k^*\right)^2 =
\sum_{k=1}^pr_{lk}^2\textrm{E}({\lambda_k^*}^2) + 2 \sum_{k0$),
* regardless of the possible confounding bias,
the slope of the regression gives an estimate of the SNP heritability
when multiplied by $p/n$.
**Binary traits.** It is a custom to apply LDSC (and other heritability estimation methods)
to binary traits in two steps. First, pretend that the trait is quantitative and compute
heritability $h^2_\text{obs}$ on the **observed scale**, that is, when the binary trait is
treated as a quantitative trait with trait values 0 and 1. Second, turn that estimate on the
**liability scale** by accounting for the population prevalence of the binary trait,
and by also accounting for a possible case-control ascertainment.
This is explained by [Lee et al. 2011](https://www.sciencedirect.com/science/article/pii/S0002929711000206?via%3Dihub).
LDSC sounds like a simple and useful tool!
Let's see some results (slides 9-11).