This document is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The slide set referred to in this document is “GWAS 6”.

When we do a GWAS, our hope is to identify variants that point to biological mechanisms behind the phenotypes, since by knowing those mechanisms we could understand better how each phenotype is generated and, importantly, how harmful phenotypes could be influenced.

Often this agenda is summarized by saying that we look for **causal variants**, i.e., variants whose biological functions contribute to the system that determines the phenotype of interest.

Next we look at extensions of our basic regression models that allow us to decrease the amount of both false positives (i.e. non-causal variants labelled as interesting) and false negatives (i.e. causal variants not labelled as interesting), by including some additional variables into the regression models. Those additional variables, (such as sex, age or population structure), whose effects on the phenotype are not of our primary interest, but that we may include in the analysis for other reasons, are called **covariates**.

To conceptually separate different settings with respect to how and why we should make use of covariates, we define two concepts about relationships between variables:

**Independence.**

**Intuition**: Variables \(Y\) and \(X\) are independent (in a population \(S\)) if knowing the value of one of them does not tell anything more about the value of the other than what we can already learn by the population distribution of the other.**Definition**: \(Y\) and \(X\) are independet (denoted by \(Y \perp X\)) if \(P(X,Y)=P(X)P(Y)\), or, equivalently, \(P(Y\,|\,X=x)=P(Y)\) for all values of \(x\).**Example**: Consider a SNP and denote by \(A_1\) and \(A_2\) the 0-1 indicator of the minor allele at the SNP in the two genomes of an individual. Assume we know that MAF is 0.3 in the population, i.e., \(P(A_i=1)=0.3\) and \(P(A_i=0)=0.7\) for \(i=1,2\). The alleles in the two genomes of an individual from the population are independent if and only if the joint probability distribution of \(A_1\) and \(A_2\) in the population is

\(A_1\) | \(A_2\) | \(P(A_1,A_2)=P(A_1)P(A_2)\) | \(X=A_1 + A_2\) |
---|---|---|---|

0 | 0 | \(0.7 \cdot 0.7 = 0.49\) | 0 |

0 | 1 | \(0.7 \cdot 0.3 = 0.21\) | 1 |

1 | 0 | \(0.3 \cdot 0.7 = 0.21\) | 1 |

1 | 1 | \(0.3 \cdot 0.3 = 0.09\) | 2 |

That is, the genotype \(X\) follows the Hardy-Weinberg equilibrium proportions \(P(X)=(0.49,0.42,0.09).\) The independence means that even if we learned that individual \(i\) had inherited allele \(1\) from the father, our estimate of the probability that the maternal allele of \(i\) is also 1 remains 0.3, i.e., the same as it was before we knew about the state of \(i\)’s paternal allele.

In short: Independent variables do not carry additional information about each other on top of what can be learned from their marginal distribution in the population.

**Causal relationship.**

**Intuition**: Variable \(X\) causally affects variable \(Y\) if a hypothetical intervention that directly changes only the value of \(X\) but not the value of any other variable than \(X\), will also (indirectly) affect \(Y\).**Example**. You come home every day at 8pm and you (A) press the light switch and (B) take off your shoes. Observation: (C) Your dark home becomes illuminated. Out of A and B, A is a causal effect of C, since if you do*everything else as usual*except not hit the light switch, there will be no light. However, B is not a causal effect of C since if you do*everything else as usual*except that you leave your shoes on, there will still be light.**Warning:**Causality is a thorny concept that we can’t quite ever define precisely at the same level of mathematical rigor as we can do with other concepts in statistics. However, “causality” is what we would like to have in the end, for example, when we aim to build an effective medical intervention, and therefore we need to educate our intuition and understanding about what that causality means.

Good news is that it is easy to understand how genetic variants can causally affect traits, which is our main business here; Things get much more complicated when we wonder what should we do with other covariates available that can have all kinds of causal and non-causal relationships between both the genetic variants and the traits of interest.

**Confounder (intuitive definition).** When we study the relationship between a predictor \(X\) and an outcome \(Y\), a confounder \(Z\) is a variable that is independently associated with \(X\) and \(Y\) and can therefore cause a spurious association between \(X\) and \(Y\) in an analysis that ignores \(Z\).

The mantra “correlation does not imply causation” means that the observed association / correlation between X and Y could be due to a third variable Z that is associated with both X and Y, and hence an X-Y association can appear even without any causal relationship between X and Y. And indeed, this is how most correlations are: Not causal.

Confounding in the Catalogue of Bias.

**Example 6.1.** If in a heart disease case-control GWAS, the cases are heart disease patients from Helsinki and controls are population samples from the Estonian biobank, then a standard GWAS analysis comparing allele frequencies between cases and controls would indicate statistical differences throughout the genome. However, these associations would not be only heart disease related biologically causal signals, but also spurious associations caused by genetic population structure. Indeed, all variants that have different allele frequencies between Finland and Estonia would show an association for case-control status in this sample, independent of whether the variant has anything to do with the biology of heart disease. Here population structure (Z) between Finland and Estonia is a confounder that is associated with both the SNP frequencies (X) and the outcome (Y). Z is associated with X because population membership affects allele frequencies and Z is associated with Y because our case-control sample was collected in such a way that there is an association with case status and population label (cases were from Helsinki, controls were from Estonia). This exampe is a caricature of a badly designed case-control GWAS that suffers from the population stratification problem (Slides 2-4).

**Example 6.2.** Since there is a difference (of about 1.5 cm) in average (sex and age adjusted) adult height between East and West Finland, and there is a relatively strong allele frequency gradient between EF and WF, any GWAS on adult height with Finnish samples that does not account for the confounding effect of population structure, risks producing false positives. Note that since variation in height is to a large part genetic, it my well be that some of the E-W allele frequency differences are actually causing some of the observed height differences, but the interpretation of the results would be difficult. This is because the known population structure would confound the results for all those variants that are not biologically linked to height and result in false positive association, and the population structure would also bias the effect estimates at the true height variants.

**What can we do with such confounders?** The idea is that the analyses should be **stratified** according to the possible levels of the confounder. For example, if we have samples from Finland and from Estonia, we should do separate analyses in both countries and combine the results. The confounder cannot confound the results when all allele frequency comparisons are done among samples that share the level of the confounder, e.g., the two levels of confounder could be having a Finnish background or having an Estonian background. (But note that in Example 6.1, the study design is such that we would have no power left after adjusting for population label because we had no controls from Finland and no cases from Estonia! See slides 6-9 for a more realistic example.) More generally, this idea of stratifying the analysis by the levels of the confounder is implemented by including the confounders as **covariates** in the GWAS regression model. Then we talk about **adjusting the analysis for the covariate**. Technically, multi-level discrete confounders are expanded as indicator variables, one per each level of the confounder, and continuous confounders are simply included as continuous covariates in the regression model. To understand the confounding from regression model’s point of view, let’s define it more precisely.

**Confounder (mathematical definition for GWAS setting).** Suppose that for fixed values of \(\mu, \beta\) and \(\gamma\) and a joint distribution of the random variables \(X\), \(Y\) and \(Z\) in a population, the phenotype \(Y\) truly follows the regression model M: \[\textrm{Model M:}\qquad Y \sim \mu + Z\gamma + X\beta.\] We call the covariate \(Z\) a **confounder** of the association between \(X\) and \(Y\), if \(\beta = 0\) in model M, but, when the covariate \(Z\) is omitted and the association between \(X\) and \(Y\) is described by a simpler model M’: \[\textrm{Model M':}\qquad Y \sim \mu' + X\beta',\] then \(\beta' \neq 0\).

Essentially, this definition says that a confounder is a variable whose omission from a GWAS regression model will cause a spurious association between the genotype and the phenotype. For example, when the population structure (Z) is included as a covariate in a height GWAS in Finland (Example 6.2), it explains away some of the east-west difference in height in Finland. Consequently, those genetic variants (X), whose apparent association with height (Y) is **only** through their geographic allele frequency differences, will not show association in this model (of type M) where population structure is already explicitly included as a covariate. However, when the population structure variable is omitted from the model (of type M’), then all variants that have east-west difference in allele frequency, will also show a statistical association with height, even if their height-association is only through the link between population structure and height and not through a direct biological effect. Thus, population structure is a confounder, and we must include it as a covariate or otherwise we get false positives.

Note that a more general definition of confounding could say that a confounder of X-Y association is any variable that when omitted from the model will *change* the effect estimate of X on Y (i.e. \(\beta'\neq \beta\)). Our GWAS-motivated definition is narrower: We are only worried about cases where there is no true effect (\(\beta=0\)), but where a regression model without appropriate covariates will mislead us by producing an effect (\(\beta' \neq 0\)).

(A word of warning: Even though we always want to avoid confounding, we still shouldn’t always include covariate \(Z\) in the model simply because \(Z\) is associated with both \(X\) and \(Y\)! There is a possible problem of collider bias there, that we will study later.)

The large scale of data in GWAS provides us with an opportunity to assess whether a genome-wide confounding effect seems to be in action. We should be worried if there is a too large a number of apparent associations throughout the genome. Standard way to assess this is to look at the QQ-plot at the middle point of the distribution (at the median of the test statistic values) and compare it to the corresponding value under the null hypothesis that there are no true associations at all. The ratio of the median of the observed distribution to the theoretical median of the null distribution is called the **genomic control parameter** \(\lambda\). Traditionally, the test statistic is the chi-square statistic of association, but also -log10(P-value) is used. If \(\lambda\) is much > 1, that could indicate that we have a problem with some omitted confounder (such as population structure) that causes elevated signals across the genome. Unfortunately, this approach is not that useful anymore with very large and statistically powerful GWAS, since there \(\lambda\) can elevate substantially from 1 already because of a true polygenic signal distributed across the genome. Instead, LD-score regression (that we will discuss later) is a better way to indicate whether we seem to have missed some important confounders.

**Genomic control (GC).** Many GWAS have used *genomic control* to adjust for observed inflation (i.e. too many too large values) of the test statistic distribution. The procedure is to divide each chi-square test statistic value by the GC parameter \(\lambda\) (defined above) and this way force the observed test statistic distribution to match the null expectation at the median value. Such method can reasonably adjust for a confounder only if that confounder affects all variants in a similar way. However, considering population structure as an example confounder, some variants are much more geographically stratified than others and hence an adjustment of all variants by the same constant \(\lambda\) is unlikely to be an appropriate solution. GC also keeps the order of the variants the same before and after its application, which seems a too inflexible property to properly model that confounders can work differently with different variants. Furthermore, GC can give a misleading impression that there is no confounding left because the “corrected”" distribution has \(\lambda=1\), whereas in reality GC may have severely undercorrected the most biased variants and overcorrected the less biased variants. Sometimes GC has also been applied to the standard error estimates by multiplying them with \(\sqrt{\lambda}\), which has the same effect on the test statistic as dividing the chi-square statistic by \(\lambda\). This approach again reveals the inflexibility of GC: The effect size estimates are not adjusted at all even though it feels that they should be adjusted according to each variant’s correlation with the confounders.

As opposed to GC, regression models that include a confounder as covariate allow each variant to be adjusted according to how correlated the variant is with the covariate.

How can we know that we have adjusted for all the relevant confounders? Unfortunately, we can’t ever be sure about this in an observational study. But we can at least avoid the common pitfalls causing typical confounding effects in GWAS:

**Population structure**that is associated with the phenotype will cause inflated P-values across the genome and this can be dealt with by including population structure as a covariate or by using*mixed models*(that we will talk about later).**Sample ascertainment**that has been done based on phenotype causes a risk that different phenotypic groups may not match well each other with respect to their genetic backgrounds, because different phenotypic groups may have been collected from different places/times/circumstances. Case-control sampling is the most prominent example of this potential problem. This does not mean that case-control sampling is not a good strategy. It just means that it opens up possible problems that must be taken care of. In GWAS setting, confounding due to sample ascertainment can be seen as a subcategory of general confounding by population structure.**Genotyping**of the samples has been done in batches that are not independent of the phenotype, which causes a risk that errors and biases in sample handling and genotyping process lead to spurious genotype-phenotype associations, because such biases may affect differently different batches, and different batches have different phenotype distributions. For example, if cases have been genotyped in Lab 1 in 2009 and controls in Lab 2 in 2015 then technical differences in genotyping process could lead to spurious genotype-disease associations. Often genotyping batches are used as covariates, which adjusts the analysis for the differences in the phenotypic means of the batches, but this strategy cannot be followed if the case-control status is to a large part associated with batches. In those cases, one simply needs to do particularly careful quality control between the genotyping batches. To avoid the batch problem, ideally, one should randomise the samples to genotyping batches, because then the batch could not be associated with the phenotype. Unfortunately, this is rarely possible in practice, since studies tend to combine many sources of existing genotyping data rather than genotyping all samples anew.

Let’s assume that our individuals are geographically spread around the line from Turku (in Southwest Finland) to Kajaani (in Eastern Finland) that is the gradient of the major population structure in Finland. Each individual \(i\) has a coordinate \(u_i\in[0,1]\), where 0=Turku, 1=Kajaani. For each variant \(k\), we determine its allele 1 frequency on the line by first sampling a Turku-Kajaani difference \(d_k \sim \textrm{Uniform}(-0.5,0.5),\) and then sampling Turku frequency \(t_k \sim \textrm{Uniform}(0.5,1)\), if \(d_k<0\); and \(t_k \sim \textrm{Uniform}(0,0.5)\), if \(d_k \geq 0.\) Now allele 1 frequency at position \(u \in [0,1]\) is \(t_k + u\cdot d_k\). So allele frequency changes smoothly from \(t_k\) in Turku to \(t_k + d_k\) in Kajaani.

Let’s simulate \(p=1000\) such variants for \(n=1000\) individuals randomly picked along the line from Turku to Kajaani.

```
n = 1000 #individuals
p = 1000 #SNPs
u = runif(n, 0, 1) #coordinates of inds
d = runif(p, -0.5, 0.5) #allele freq difference btw T and K
tu = runif(p, 0, 0.5) #when d>=0 then tu is in (0,0.5)
tu[ d < 0 ] = 1 - tu[ d < 0 ] #when d<0 then tu is in (0.5,1)
X = matrix(NA, nrow = n, ncol = p) #SNP data matrix, rows individuals, columns SNPs
for(k in 1:p){
f = tu[k] + u*d[k] #allele 1 frequency for each individual
X[,k] = rbinom(n, size = 2, prob = f) #genotypes from varying frequencies
}
```

Here we have a set of genotypes that carry a population structure effect.

Let’s assume that we study a phenotype \(Y\) that depends on the environment described by \(u\) as \(Y= u + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(0,1)\), but there is no direct effect of any of our \(p\) variants on the phenotype.

```
y = u + rnorm(n) #phenotype has a geographical cline but no genetic effect from X
y = scale(y) #standardize trait to have mean = 0 and var = 1 for convenience
#Check how it looks like
plot(u, y, xlab = "Location from Turku to Kajaani", ylab = "phenotype", pch = 3, col = "gray")
abline( lm(y ~ u), col = "brown", lwd = 2 ) #add the regression line to plot
```

So we have a cline in the phenotype where the mean increases as we move from Turku towards Kajaani.

Let’s do a “GWAS” of phenotype \(Y\) on our \(p=1000\) variants. We collect only P-values from this GWAS.

```
#collect only P-values, i.e., element [2,4] of lm's summary()
pval.1 = apply(X, 2, function(x){summary(lm(y ~ x))$coeff[2,4]})
summary(pval.1) #median of P-values should be 0.5 under the null
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000001 0.0232523 0.1213125 0.2608591 0.4376539 0.9980830
```

`plot(1:p, -log10(pval.1), xlab = "variant", ylab = "-log10 P-value.1", col = "red") `