This document is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The slide set referred to in this document is “GWAS 6”.

When we do a GWAS, our hope is to identify variants that point to the biological mechanisms behind the phenotypes, since by knowing those mechanisms we could better understand how each phenotype is generated and, importantly, how harmful phenotypes could be influenced.

Often this agenda is summarized by saying that we look for
**causal variants**, i.e., variants whose biological
functions contribute to the system that determines the phenotype of
interest.

Next we look at extensions of our basic regression models that allow
us to decrease the number of both false positives (i.e., non-causal
variants labelled as interesting) and false negatives (i.e., causal
variants not labelled as interesting) by including some additional
variables in the regression models. Those additional variables (such
as sex, age or population structure), whose effects on the phenotype are
not of our primary interest, but which we may include in the analysis for
other reasons, are called **covariates**.

To conceptually separate different settings with respect to how and why we should make use of covariates, we define two concepts about relationships between variables:

**Independence.**

**Intuition**: Variables \(Y\) and \(X\) are independent (in a population \(S\)) if knowing the value of one of them tells us nothing more about the value of the other than what we can already learn from the population distribution of the other.

**Definition**: \(Y\) and \(X\) are independent (denoted by \(Y \perp X\)) if \(P(X,Y)=P(X)P(Y)\), or, equivalently, \(P(Y\,|\,X=x)=P(Y)\) for all values of \(x\).

**Example**: Consider a SNP and denote by \(A_1\) and \(A_2\) the 0-1 indicators of the minor allele at the SNP in the two genomes of an individual. Assume we know that the MAF is 0.3 in the population, i.e., \(P(A_i=1)=0.3\) and \(P(A_i=0)=0.7\) for \(i=1,2\). The alleles in the two genomes of an individual from the population are independent if and only if the joint probability distribution of \(A_1\) and \(A_2\) in the population is

\(A_1\) | \(A_2\) | \(P(A_1,A_2)=P(A_1)P(A_2)\) | \(X=A_1 + A_2\) |
---|---|---|---|
0 | 0 | \(0.7 \cdot 0.7 = 0.49\) | 0 |
0 | 1 | \(0.7 \cdot 0.3 = 0.21\) | 1 |
1 | 0 | \(0.3 \cdot 0.7 = 0.21\) | 1 |
1 | 1 | \(0.3 \cdot 0.3 = 0.09\) | 2 |

That is, the genotype \(X\) follows the Hardy-Weinberg equilibrium proportions \(P(X)=(0.49,0.42,0.09).\) The independence means that even if we learned that individual \(i\) had inherited allele \(1\) from the father, our estimate of the probability that the maternal allele of \(i\) is also 1 remains 0.3, i.e., the same as it was before we knew about the state of \(i\)’s paternal allele.

In short: Independent variables do not carry additional information about each other on top of what can be learned from their marginal distribution in the population.
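The table above can be checked numerically. A quick sketch in R (assuming MAF 0.3 as in the example): under independence, the joint distribution is the outer product of the marginals, and the resulting genotype distribution is exactly Binomial(2, MAF), i.e., the Hardy-Weinberg proportions.

```r
maf = 0.3
a = c(1 - maf, maf)                     # P(A_i = 0), P(A_i = 1)
joint = outer(a, a)                     # P(A_1, A_2) under independence
geno = c(joint[1, 1],                   # P(X = 0)
         joint[1, 2] + joint[2, 1],     # P(X = 1)
         joint[2, 2])                   # P(X = 2)
geno                                    # 0.49 0.42 0.09
all.equal(geno, dbinom(0:2, size = 2, prob = maf)) # HWE = Binomial(2, MAF)
```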

**Causal relationship.**

**Intuition**: Variable \(X\) causally affects variable \(Y\) if a hypothetical intervention that directly changes only the value of \(X\), and of no other variable, will also (indirectly) affect \(Y\).

**Example**. You come home every day at 8pm and you (A) press the light switch and (B) take off your shoes. Observation: (C) Your dark home becomes illuminated. Out of A and B, only A is a cause of C, since if you do *everything else as usual* except not press the light switch, there will be no light. However, B is not a cause of C, since if you do *everything else as usual* except that you leave your shoes on, there will still be light.

**Warning:** Causality is a thorny concept that we can’t ever quite define with the same level of mathematical rigor as other concepts in statistics. However, “causality” is what we would like to have in the end, for example, when we aim to build an effective medical intervention, and therefore we need to educate our intuition and understanding about what causality means.

The good news is that it is easy to understand how genetic variants can causally affect traits, which is our main business here. Things get much more complicated when we ask what we should do with the other available covariates, which can have all kinds of causal and non-causal relationships with both the genetic variants and the traits of interest.

**Confounder (intuitive definition).** When we study the
relationship between a predictor \(X\)
and an outcome \(Y\), a confounder
\(Z\) is a variable that is
associated with both \(X\) and
\(Y\) and can therefore cause a
spurious association between \(X\) and
\(Y\) in an analysis that ignores \(Z\).

The mantra “correlation does not imply causation” means that the observed association / correlation between X and Y could be due to a third variable Z that is associated with both X and Y, and hence an X-Y association can appear even without any causal relationship between X and Y. And indeed, this is how most correlations are: Not causal.

Confounding in the Catalogue of Bias.

**Example 6.1.** If in a heart disease case-control
GWAS, the cases are heart disease patients from Helsinki and controls
are population samples from the Estonian biobank, then a standard GWAS
analysis comparing allele frequencies between cases and controls would
indicate statistical differences throughout the genome. However, these
associations would not be only heart disease related biologically causal
signals, but also spurious associations caused by genetic population
structure. Indeed, all variants that have different allele frequencies
between Finland and Estonia would show an association with case-control
status in this sample, independent of whether the variant has anything
to do with the biology of heart disease. Here population structure (Z)
between Finland and Estonia is a confounder that is associated with both
the SNP frequencies (X) and the outcome (Y). Z is associated with X
because population membership affects allele frequencies and Z is
associated with Y because our case-control sample was collected in such
a way that there is an association with case status and population label
(cases were from Helsinki, controls were from Estonia). This example is a
caricature of a badly designed case-control GWAS that suffers from the
population stratification problem (Slides 2-4).

**Example 6.2.** Since there is a difference (of about
1.5 cm) in average (sex and age adjusted) adult height between East and
West Finland, and there is a relatively strong allele frequency gradient
between EF and WF, any GWAS on adult height with Finnish samples that
does not account for the confounding effect of population structure,
risks producing false positives. Note that since variation in height is
to a large part genetic, it may well be that some of the E-W allele
frequency differences are actually causing some of the observed height
differences, but the interpretation of the results would be difficult.
This is because the known population structure would confound the
results for all those variants that are not biologically linked to
height, resulting in false positive associations, and the population
structure would also bias the effect estimates at the true height
variants.

**What can we do with such confounders?** The idea is
that the analyses should be **stratified** according to the
possible levels of the confounder. For example, if we have samples from
Finland and from Estonia, we should do separate analyses in both
countries and combine the results. The confounder cannot confound the
results when all allele frequency comparisons are done among samples
that share the level of the confounder, e.g., the two levels of
confounder could be having a Finnish background or having an Estonian
background. (But note that in Example 6.1, the study design is such that
we would have no power left after adjusting for population label because
we had no controls from Finland and no cases from Estonia! See slides
6-9 for a more realistic example.) More generally, this idea of
stratifying the analysis by the levels of the confounder is implemented
by including the confounders as **covariates** in the GWAS
regression model. Then we talk about **adjusting the analysis for
the covariate**. Technically, a multi-level discrete confounder
is expanded into indicator variables, one per level of the
confounder, and continuous confounders are simply included as continuous
covariates in the regression model. To understand confounding from the
regression model’s point of view, let’s define it more precisely.

**Confounder (mathematical definition for GWAS
setting).** Suppose that for fixed values of \(\mu, \beta\) and \(\gamma\) and a joint distribution of the
random variables \(X\), \(Y\) and \(Z\) in a population, the phenotype \(Y\) truly follows the regression model M:
\[\textrm{Model M:}\qquad Y \sim \mu +
Z\gamma + X\beta.\] We call the covariate \(Z\) a **confounder** of the
association between \(X\) and \(Y\), if \(\beta =
0\) in model M, but, when the covariate \(Z\) is omitted and the association between
\(X\) and \(Y\) is described by a simpler model M’:
\[\textrm{Model M':}\qquad Y \sim
\mu' + X\beta',\] then \(\beta' \neq 0\).

Essentially, this definition says that a confounder is a variable
whose omission from a GWAS regression model will cause a spurious
association between the genotype and the phenotype. For example, when
the population structure (Z) is included as a covariate in a height GWAS
in Finland (Example 6.2), it explains away some of the east-west
difference in height in Finland. Consequently, those genetic variants
(X), whose apparent association with height (Y) is **only**
through their geographic allele frequency differences, will not show
association in this model (of type M) where population structure is
already explicitly included as a covariate. However, when the population
structure variable is omitted from the model (of type M’), then all
variants that have an east-west difference in allele frequency will also
show a statistical association with height, even if their
height-association is only through the link between population structure
and height and not through a direct biological effect. Thus, population
structure is a confounder, and we must include it as a covariate or
otherwise we get false positives.
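This mechanism is easy to demonstrate with a small simulation (a sketch with made-up parameter values, not from the slides): generate data where \(Y\) depends only on \(Z\) while the genotype \(X\) correlates with \(Z\), then fit models of types M’ and M.

```r
set.seed(2)
n = 5000
z = rnorm(n)                          # confounder, e.g., a population structure axis
x = rbinom(n, size = 2, prob = plogis(-1 + z)) # genotype frequency depends on z
y = 1 + 0.8 * z + rnorm(n)            # model M holds with beta = 0 (no effect of x)

b.marg = coef(summary(lm(y ~ x)))["x", 1]     # model M': z omitted
b.adj = coef(summary(lm(y ~ z + x)))["x", 1]  # model M: z included
c(b.marg, b.adj) # b.marg is clearly nonzero while b.adj is near zero
```

In the model without \(Z\) the genotype picks up a spurious effect, while including \(Z\) as a covariate shrinks the genotype coefficient to about zero.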

Note that a more general definition of confounding could say that a
confounder of X-Y association is any variable that when omitted from the
model will *change* the effect estimate of X on Y (i.e. \(\beta'\neq \beta\)). Our GWAS-motivated
definition is narrower: We are only worried about cases where there is
no true effect (\(\beta=0\)), but where
a regression model without appropriate covariates will mislead us by
producing an effect (\(\beta' \neq
0\)).

(A word of warning: Even though we always want to avoid confounding, we still shouldn’t always include a covariate \(Z\) in the model simply because \(Z\) is associated with both \(X\) and \(Y\)! There is a possible problem of collider bias, which we will study later.)
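To make the stratification idea concrete, here is a minimal simulated sketch in the spirit of Example 6.1 (frequencies and effect sizes are made up): a pooled analysis ignoring the population label shows a spurious association, while per-population analyses combined with inverse-variance weights do not.

```r
set.seed(1)
n = 2000
pop = rep(0:1, each = n/2)             # 0 and 1 are two population labels
f = ifelse(pop == 0, 0.2, 0.4)         # allele frequency differs between populations
x = rbinom(n, size = 2, prob = f)      # genotypes
y = 0.5 * pop + rnorm(n)               # trait differs by population; no effect of x

b.pooled = coef(summary(lm(y ~ x)))["x", 1]   # spurious nonzero estimate

# stratified analyses combined by inverse-variance weighting
est = se = numeric(2)
for (g in 0:1) {
  fit = coef(summary(lm(y[pop == g] ~ x[pop == g])))
  est[g + 1] = fit[2, 1]
  se[g + 1] = fit[2, 2]
}
w = 1 / se^2
b.strat = sum(w * est) / sum(w)        # combined estimate, consistent with zero
c(b.pooled, b.strat)
```

Including `pop` as a covariate in a single regression, `lm(y ~ pop + x)`, gives essentially the same protection as the explicit stratification above.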

The large scale of data in GWAS provides us with an opportunity to
assess whether a genome-wide confounding effect seems to be present. We
should be worried if there is too large a number of apparent
associations throughout the genome. A traditional way to assess this has
been to look at the QQ-plot at the middle point of the distribution (at
the median of the test statistic values) and compare it to the
corresponding value under the null hypothesis that there are no true
associations at all. The ratio of the median of the observed
distribution to the theoretical median of the null distribution is
called the **genomic control parameter** \(\lambda\). The test statistic used in this
QQ-plot is often the chi-square statistic of association, but also
-log10(P-value) is used. If \(\lambda\)
is much > 1, that could indicate that we have a problem with some
omitted confounder (such as population structure) that causes elevated
signals across the genome. Unfortunately, this approach is no longer
very useful with very large and statistically powerful GWAS, since
there \(\lambda\) can rise
substantially above 1 simply because of a true polygenic signal
distributed across the genome. Instead, LD-score regression
(that we will discuss later) is a better way to indicate whether we seem
to have missed some important confounders.

**Genomic control (GC).** Many GWAS have used *genomic
control* to adjust for observed inflation (i.e. too many too
large values) of the test statistic distribution. The procedure is to
divide each chi-square test statistic value by the GC parameter \(\lambda\) (defined above) and this way
force the observed test statistic distribution to match the null
expectation at the median value. Such a method can reasonably adjust for a
confounder only if that confounder affects all variants in a similar
way. However, considering population structure as an example confounder,
some variants are much more geographically stratified than others and
hence an adjustment of all variants by the same constant \(\lambda\) is unlikely to be an appropriate
solution. GC also keeps the order of the variants the same before and
after its application, which seems too inflexible a property to properly
capture the fact that confounders can affect different variants differently.
Furthermore, GC can give a misleading impression that there is no
confounding left because the “corrected” distribution has \(\lambda=1\), whereas in reality GC may have
severely undercorrected the most biased variants and overcorrected the
less biased variants. Sometimes GC has also been applied to the standard
error estimates by multiplying them with \(\sqrt{\lambda}\), which has the same effect
on the test statistic as dividing the chi-square statistic by \(\lambda\). This approach again reveals the
inflexibility of GC: The effect size estimates are not adjusted at all
even though it feels that they should be adjusted according to each
variant’s correlation with the confounders.
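As a numerical sketch of the GC procedure (simulated statistics with a made-up inflation factor, not real GWAS data): estimate \(\lambda\) as the ratio of the observed median chi-square statistic to the theoretical null median, and divide the statistics by it.

```r
set.seed(3)
chisq = 1.2 * rchisq(1e5, df = 1)      # null statistics inflated by a constant factor
lambda = median(chisq) / qchisq(0.5, df = 1)  # GC parameter, close to 1.2
chisq.gc = chisq / lambda              # GC-corrected statistics
lambda.gc = median(chisq.gc) / qchisq(0.5, df = 1) # equals 1 by construction
c(lambda, lambda.gc)
```

Note that this correction only fixes the median: if the true confounding had affected some variants much more than others, dividing all statistics by the same constant would not repair the individual tests.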

As opposed to GC, regression models that include a confounder as covariate allow each variant to be adjusted according to how correlated the variant is with the covariate.

How can we know that we have adjusted for all the relevant confounders? Unfortunately, we can’t ever be sure about this in an observational study. But we can at least avoid the common pitfalls causing typical confounding effects in GWAS:

- **Population structure** that is associated with the phenotype will cause inflated P-values across the genome, and this can be dealt with by including population structure as a covariate or by using *mixed models* (that we will talk about later).
- **Sample ascertainment** that has been done based on the phenotype causes a risk that different phenotypic groups may not match each other well with respect to their genetic backgrounds, because different phenotypic groups may have been collected from different places/times/circumstances. Case-control sampling is the most prominent example of this potential problem. This does not mean that case-control sampling is not a good strategy; it just means that it opens up possible problems that must be taken care of. In the GWAS setting, confounding due to sample ascertainment can be seen as a subcategory of general confounding by population structure.
- **Genotyping** of the samples has been done in batches that are not independent of the phenotype, which causes a risk that errors and biases in sample handling and the genotyping process lead to spurious genotype-phenotype associations, because such biases may affect different batches differently, and different batches have different phenotype distributions. For example, if cases were genotyped in Lab 1 in 2009 and controls in Lab 2 in 2015, then technical differences in the genotyping process could lead to spurious genotype-disease associations. Often genotyping batches are used as covariates, which adjusts the analysis for the differences in the phenotypic means of the batches, but this strategy cannot be followed if the case-control status is to a large part associated with the batches. In those cases, one simply needs to do particularly careful quality control between the genotyping batches. To avoid the batch problem, one should ideally randomise the samples to genotyping batches, because then the batch could not be associated with the phenotype. Unfortunately, this is rarely possible in practice, since studies tend to combine many sources of existing genotyping data rather than genotyping all samples anew.
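When randomisation to batches is possible, it is simple to implement; a minimal sketch with hypothetical sample and batch counts:

```r
set.seed(4)
n.samples = 96                         # hypothetical study size
n.batches = 4                          # hypothetical number of genotyping batches
# a random permutation of balanced batch labels makes batch independent of phenotype
batch = sample(rep(1:n.batches, length.out = n.samples))
table(batch)                           # 24 samples in each batch
```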

Let’s assume that our individuals are geographically spread around the line from Turku (in Southwest Finland) to Kajaani (in Eastern Finland) that is the gradient of the major population structure in Finland. Each individual \(i\) has a coordinate \(u_i\in[0,1]\), where 0=Turku, 1=Kajaani. For each variant \(k\), we determine its allele 1 frequency on the line by first sampling a Turku-Kajaani difference \(d_k \sim \textrm{Uniform}(-0.5,0.5),\) and then sampling Turku frequency \(t_k \sim \textrm{Uniform}(0.5,1)\), if \(d_k<0\); and \(t_k \sim \textrm{Uniform}(0,0.5)\), if \(d_k \geq 0.\) Now allele 1 frequency at position \(u \in [0,1]\) is \(t_k + u\cdot d_k\). So allele frequency changes smoothly from \(t_k\) in Turku to \(t_k + d_k\) in Kajaani.

Let’s simulate \(p=1000\) such variants for \(n=1000\) individuals randomly picked along the line from Turku to Kajaani.

```
n = 1000 #individuals
p = 1000 #SNPs
u = runif(n, 0, 1) #coordinates of inds
d = runif(p, -0.5, 0.5) #allele freq difference btw T and K
tu = runif(p, 0, 0.5) #when d>=0 then tu is in (0,0.5)
tu[ d < 0 ] = 1 - tu[ d < 0 ] #when d<0 then tu is in (0.5,1)
X = matrix(NA, nrow = n, ncol = p) #SNP data matrix, rows individuals, columns SNPs
for(k in 1:p){
f = tu[k] + u*d[k] #allele 1 frequency for each individual
X[,k] = rbinom(n, size = 2, prob = f) #genotypes from varying frequencies
}
```

Here we have a set of genotypes that carry a population structure effect.

Let’s assume that we study a phenotype \(Y\) that depends on the environment described by \(u\) as \(Y= u + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(0,1)\), but there is no direct effect of any of our \(p\) variants on the phenotype.

```
y = u + rnorm(n) #phenotype has a geographical cline but no genetic effect from X
y = scale(y) #standardize trait to have mean = 0 and var = 1 for convenience
#Check how it looks like
plot(u, y, xlab = "Location from Turku to Kajaani", ylab = "phenotype", pch = 3, col = "gray")
abline( lm(y ~ u), col = "brown", lwd = 2 ) #add the regression line to plot
```

So we have a cline in the phenotype where the mean increases as we move from Turku towards Kajaani.

Let’s do a “GWAS” of phenotype \(Y\) on our \(p=1000\) variants. We collect only P-values from this GWAS.

```
#collect only P-values, i.e., element [2,4] of lm's summary()
pval.1 = apply(X, 2, function(x){summary(lm(y ~ x))$coeff[2,4]})
summary(pval.1) #median of P-values should be 0.5 under the null
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000001 0.0232523 0.1213125 0.2608591 0.4376539 0.9980830
```

```
plot(1:p, -log10(pval.1), xlab = "variant", ylab = "-log10 P-value.1", col = "red")
```