This document is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The slide set referred to in this document is “GWAS 1”.

This course is about statistical/computational ideas and methods that are used in genome-wide association studies (GWAS). (Note that on these notes ‘GWAS’ is both singular and plural while other texts may use ‘GWASs’ or ‘GWASes’ for plural.) More generally, these same methods are useful for analyzing large data sets of many other types, but, on this course, we will approach these from the angle of GWAS.

A GWAS quantifies statistical association between genetic variation and phenotypes. A phenotype, also called a trait, can be any measured or observed property of an individual. Examples of phenotypes are quantitative traits like standing height or concentration of cholesterol particles in circulation, or binary traits like diagnoses of multiple sclerosis or schizophrenia.

Why do we do GWAS? (slides 2-6)

We do GWAS because a statistical association between a particular physical region of the genome and the phenotype

These results may further benefit

The genome of an individual remain (nearly) constant throughout the individual’s lifetime. This is a truly remarkable property compared to, e.g., other molecular sources of information (such as metabolomics, metagenomics, transcriptomics, proteomics or epigenomics) or environmental factors that may vary widely across time. Therefore, the genome seems an ideal starting point for scientific research: it needs to be measured only once for an individual and there is no reverse causation from the phenotype to genome (with cancer as an important exception).

Ethical aspects

As with any powerful technique, the utilization of results from GWAS also raises many new and difficult ethical questions and the legislation of utilization of genome information is under active development around the world. For example, we need concrete answers to

When working with human genome data, we should always keep it clear in mind that there are such profound questions related to these data, and that the data we handle will likely turn out to be more powerful than most of us can imagine today. Human genome data are never ‘just data’ but include highly personal information, and they need to be handled with high respect and care. Access to genetic data requires a written agreement between the researcher and the data provider about how and for which purpose the data can be used.

Contents of this course (slides 24-25)

The plan is to discuss the following topics (in varying level of detail):

  1. What is a GWAS?

  2. Statistics of GWAS (regression coefficients, P-values, statistical power, Bayes factors)

  3. Genetic relatedness and population structure

  4. Confounding and covariates in GWAS

  5. Haplotypes, linkage disequilibrum, imputation, fine-mapping

  6. Linear mixed models and heritability

  7. Summary statistics and meta-analysis

  8. Mendelian randomization

  9. Success and critisism of GWAS

  10. Human genetics research at FIMM

1.1 Genetic variation (slides 7-12)

We all carry two nuclear genomes (i.e. genomes located in cell nucleus), one inherited from each of our two parents. Additionally, we have a small mitochondrial genome, assumed to be inherited exclusively from the mother, but on this course the term ‘genome’ refers to the nuclear genome.

Human genome is 3.2 billion nucleotides (or base pairs or DNA letters A,C,G,T) long sequence (see yourgenome.org), that is divided into separate physical pieces called chromosomes (see yourgenome.org). There are 22 autosomal (non-sex related) chromosomes and two sex chromosomes (X chromosome and Y chromosome). Normally, humans have two copies of each autosome and individuals with one copy of X and one of Y are males whereas individuals who have two copies of X are females. Abnormal number of chromosomes (called aneuploidies) typically cause severe consequences or an early death if present in all cells of an individual. The most common non-lethal exception is the Down syndrome (3 copies of chr 21). Mosaicism, where some cells have abnormal chromosome numbers also exist and are often present in cancer cells.

There are three types of pairings that come up when we analyse genomes.

Terms

1.1.1 Genetic variants

At any one position of the genome (e.g. nucleotide site at position 13,475,383 of chromosome 1, denoted by chr1:13,475,383) variation can exist between the genomes in the population. For example, my paternal chromosome can have a base A and maternal chromosome can have a base G (on the +strand of the DNA) at that position. (slide 10) Such a one-nucleotide variation is called a single-nucleotide variant (SNV) and the two versions are called alleles. So in the example case, I would be carrying both an allele A and an allele G at that SNV, whereas you might be carrying two copies of allele A at the same SNV. My genotype would be AG and yours AA. An individual having different alleles on his/her two genomes is heterozygous at that locus, and an individual having two copies of the same allele is homozygous at that locus. If neither of the alleles is very rare in the population, say, the minor allele frequency (MAF) is > 1% in the population, the variant is called a polymorphism, single-nucleotide polymorphism (SNP). There are over 10 million SNPs in the human genome. More complex genetic variation (slide 9) include structural variation (SV) such as copy number variants (CNVs), that include duplications or deletions of genomic regions, or rearrangements of the genome, such as inversions or translocations of DNA segments (see yourgenome.org).

A predefined set of 500,000 - 1,000,000 SNPs can be measured reliably and fairly cheaply (< 50 euros/sample) by DNA microarrays, which has been the single most important factor making GWAS possible (slide 11; Illustration). On this course, we consider SNPs as the canonical type of genetic variation. Typically, the SNPs are biallelic, i.e., there are only two alleles present in the population and this is what we assume in the following. In principle, however, all four possible alleles of a SNP could be present in the population.

Ambiguous SNPs. If the two alleles of a SNP are either C,G or A,T we call the SNP ambiguous because the strand information must be available (and correct) in order to make sense of the genotypes at this SNP. This is because allele C on + strand would be called allele G on - strand and if this SNP is reported with respect to different strands in different studies, the results get mixed up. The same problem does not happen with the other SNPs, e.g., a SNP with alleles A,C, because this SNP contains alleles T,G on the opposite strand and we could unambiguously match A to T and C to G between the studies. Note that we can resolve most ambiguous SNPs reliably based on the allele frequencies as long as the minor allele frequency is not close to 50%. If we are combining several studies, we should always start by plotting the allele frequencies between the studies after the alleles should be matching each other in order to see that the frequencies indeed match across the studies.

Some catalogues of genetic variation

A large part of the genetics research over the last 30 years have been driven by international projects aiming to catalogue genetic variation in public domain.

  • The Human Genome Project 1990-2003 established a first draft of a human genome sequence.

  • The HapMap project 2002-2009 studied the correlation structure of the common SNPs.

  • The 1000 Genomes project 2008-2015, expanded HapMap to genome sequence information across the globe and currently remains a widely-used reference for global allele frequency information. 1000G project was able to characterize well common variation in different populations, but missed many rare variants of single individuals because the costs of very accurate sequencing were too high. The tremendous impact of the 1000G project stems from the fact that everyone can download the individual level genome data of the 1000G samples from the project’s website and use it in their own research.

  • Exome Aggregation Consortium (ExAC) 2014-2016 concentrated only on the protein coding parts of the genome, so called exons, that make up less than 2% of the genome and was able to provide accurate sequence data for the exomes of over 60,000 individuals. This effort has been particularly important for medical interpretation of rare variants seen in clinics that diagnose patients with severe disease. ExAC provides summary level information through browser and downloads but individual level data cannot be downloaded.

  • Genome Aggregation Database (gnomAD) is expanding the ExAC database and also includes additional whole genome sequencing information. It is the current state-of-the-art among the public genome variation databases.

1.1.2 Genotypes and Hardy-Weinberg equilibrium

Let’s consider one SNP in the population. The SNP has two alleles that could be called by their nucleotides, but with quantitative analyses in mind, we name the alleles in such a way that the minor allele (the one that is less common in the population) is called allele 1 and the major allele (the one that is more common in the population) is called allele 0. (There is no general rule that GWAS results are reported as allele 1 corresponding to the minor allele, and even if there was, the minor allele could differ between two data sets/populations, so consistency across studies needs always to be checked.) Let’s denote the minor allele frequency (MAF) by \(f\). Since each individual has two copies of the genome, there are individuals with three possible genetic types (called genotypes) at this SNP. We denote each genotype by the number of copies of allele 1 that the genotype contains (i.e. genotype can be 0, 1 or 2). If we assume that, at this SNP, the two alleles in one individual are sampled at random from the population, then the genotype frequencies in the population follow the binomial distribution \(\textrm{Bin}(2,f)\):

genotype expected frequency from Bin(2,f)
0 \((1-f)^2\)
1 \(2\,f(1-f)\)
2 \(f^2\)

These frequencies are called the Hardy-Weinberg equilibrium (HWE) genotype frequencies (or HW proportions) and they define the theoretical equilibrium genotype frequencies given the value of \(f\) in an ideal randomly mating population without selection, migration, or genetic drift (=statistical fluctuations due to finite population size). In practice, most variants in human populations do approximately follow HWE. Clear deviations from HWE could point to, for example, recent population structure (e.g. two populations have admixed), assortative mating (individuals tend to mate with partners of their kind) or natural selection (e.g. genotype 1 is very advantageous whereas genotype 2 is lethal and hence completely absent from the population). On the other hand, technical problems in genotype calling (i.e. in determining genotypes from the intensity measures from a genotyping chip) can also cause deviations from HWE, either because of bad quality data or because the variation is not a biallelic SNP and has more than two alleles (slide 12). Therefore, often variants which do not follow HW frequencies are excluded from many GWAS analyses as part of the quality control (QC) procedure.

Testing HWE. To test for (deviations from) HWE a 1-df chi-square test can be used where the expected counts are derived under HWE given the allele frequencies.

Suppose that we have observed genotype counts \(n_0,n_1,n_2\) for genotypes 0,1 and 2, respectively, and \(N = n_0 + n_1 + n_2.\) Our estimate for population frequency of allele 1 is \(\widehat{f} = (n_1+2\,n_2)/(2\,N)\), and the expected genotype counts under HWE are \(h_0 = N(1-\widehat{f})^2\), \(h_1 = 2 N \widehat{f}(1-\widehat{f})\) and \(h_2 = N \widehat{f}^2\). The test statistic measures the deviation between the observed counts and the expected counts: \[ t_{HWE} = \sum_{i=0}^2 \frac{(n_i-h_i)^2}{h_i}.\] If HWE holds, then \(t_{HWE}\) follows approximately a Chi-square distribution with 1 degree of freedom, which is used for deriving a P-value. If theoretical chi-square distribution is used, the test is asymptotic and hence not necessarily valid for small sample size or very rare variants, and a test statistic distribution based on permutations should be used, e.g., by R package HardyWeinberg.

Example 1.1. Look up the SNP rs429358 from the Ensmbl browser https://www.ensembl.org/. Choose Human and type ‘rs429358’; you’ll see the variant’s chr (19), position (44,908,684 in the version GRCh38 of the Human Genome coordinates mentioned on top left) and the two alleles, T and C, of which C is predicted to be the ‘ancestral’ that is, the older allele, and C is also the minor allele with average MAF of 15% across human populations. Next click ‘Population Genetics’ to see allele and genotype frequencies in different human populations. (Familiarize with given populations by hovering mouse above them.) Scrolling down, in the 1000 Genomes project Phase 3 (1000G) Finnish data, the minor allele C has frequency \(37/198 \approx 18.7\%\) and the observed genotype counts are 66 (TT), 29 (TC) and 4 (CC) individuals. Let’s use these values to visualize the genotype distribution and apply the standard test for HWE.

geno = c(66,29,4)
n = sum(geno) #number of individuals
f = sum(geno*c(0,1,2))/(2*n) #(66*0 + 29*1 + 4*2) / (2*(66+29+4))
f #MAF
## [1] 0.1868687
hwe.prop = c((1-f)^2, 2*f*(1-f), f^2) #these would be the genotype freqs under HWE
rbind(obs = geno/n, hwe = hwe.prop) #print the observed genotype freqs and the HWE.
##          [,1]      [,2]       [,3]
## obs 0.6666667 0.2929293 0.04040404
## hwe 0.6611825 0.3038976 0.03491991
#For testing HWE we use chi-square test even though counts are quite small in last cell:
hwe.test = sum((geno - n*hwe.prop)^2/(n*hwe.prop)) #HWE test statistic
hwe.p = pchisq(hwe.test, df = 1, lower = FALSE) #P-value from the test
barplot(geno, main = paste("rs429358 FIN in 1000G Phase3; HWE P=",signif(hwe.p,3)),
        names = c(0,1,2), xlab = "genotype", col = "skyblue")

To get familiar with how to generate realistic genotype data, let’s also make example genotype data for \(n = 1000\) additional Finns at this variant, both by sampling from the genotype frequencies and by sampling from the allele frequencies (assuming HWE). Since this variant seems to follow HWE, we do not expect qualitative differences between the two simulation approaches.

set.seed(19) #setting seed guarantees the same simulation results every time this code is run
n = 1000
sample.from.geno = sample(c(0,1,2), prob = geno, size = n, replace = T) #sample from genotype frequencies
# replace = TRUE means sampling with replacement, that is, 
# each genotype can be sampled many times, always with the same probabilities given in 'prob'
tab = table(sample.from.geno) #table counts how many times each value is present
counts.from.geno = rep(0,3) #How many carriers of each genotype
counts.from.geno[ 1 + as.numeric( names(tab) )] = as.numeric(tab) #works even if some count is 0

#To sample from HWE frequencies, we could use:
#sample.from.hwe = sample(c(0,1,2), prob = c((1-f)^2, 2*f*(1-f), f^2), size = n, replace = T)
#but a simpler way is to sample n genotypes directly from Bin(2,f) distribution:
sample.from.hwe = rbinom(n, size = 2, p = f)
counts.from.hwe = rep(0,3) #Let's count how many carriers of each genotype
for(ii in 0:2){ #this is another way to do the counting compared to table() above
  counts.from.hwe[ii+1] = sum(sample.from.hwe == ii)}

rbind(geno = counts.from.geno/n, hwe = counts.from.hwe/n)
##       [,1]  [,2]  [,3]
## geno 0.651 0.313 0.036
## hwe  0.672 0.298 0.030
barplot(cbind(counts.from.geno/n, counts.from.hwe/n), 
        names = c("geno","HWE"), beside = F, horiz = T)

They look pretty similar but to do statistical inference we should also quantify the uncertainty of the estimates (e.g. by 95% intervals). For small counts, a Bayesian credible interval called Jeffreys interval is preferred over the standard 95% confidence interval, whereas for larger counts the two approaches agree. Details of the two approaches are here.

Let’s make Jeffreys intervals for the estimates of each genotype frequency in both of the data sets.

interval.from.geno = matrix(NA, ncol = 2, nrow = 3) #empty matrix
interval.from.hwe = matrix(NA, ncol = 2, nrow = 3)
for(ii in 1:3){ #find intervals while looping over 3 genotypes
  interval.from.geno[ii,] = qbeta(c(0.025,0.975), counts.from.geno[ii]+0.5, n-counts.from.geno[ii]+0.5)
  interval.from.hwe[ii,] = qbeta(c(0.025,0.975), counts.from.hwe[ii]+0.5, n-counts.from.hwe[ii]+0.5)
}

Now we can print out the observed genotype frequency (1st col) and its 95% interval (2nd and 3rd cols) for both data sets and compare whether the estimates seem similar given the uncertainty described by the intervals:

cbind(geno.est = counts.from.geno/n,interval.from.geno,
      hwe.est = counts.from.hwe/n,interval.from.hwe ) 
##      geno.est                       hwe.est                      
## [1,]    0.651 0.62105469 0.68007266   0.672 0.64243720 0.70056879
## [2,]    0.313 0.28483127 0.34224942   0.298 0.27026622 0.32690115
## [3,]    0.036 0.02576052 0.04891794   0.030 0.02074395 0.04196844

All estimates are within other data set’s 95% Jeffreys credible interval and we have no reason to suspect frequency differences between the two data sets.

A standard two-sample chi-square test can also be carried out to quantify the frequency difference as a P-value:

chisq.test(rbind(counts.from.geno,counts.from.hwe)) #tests whether rows have same distribution
## 
##  Pearson's Chi-squared test
## 
## data:  rbind(counts.from.geno, counts.from.hwe)
## X-squared = 1.247, df = 2, p-value = 0.5361

Unsurprisingly, this does not indicate any frequency difference between the two data sets as P-value is large (0.5361). We’ll talk more about interpretation of P-values later.

1.2 What is a genome-wide association study?

Let’s look at some recent examples of GWAS. Two main types of GWAS are studying quantitative traits or disease phenotypes.

Example 1.2. QT-GWAS (slides 14-18) GWAS on body-mass index (BMI) by Locke et al. (2015) combined data of 339,000 individuals from 125 studies around the world to study the association of SNPs and BMI. It highlighted 97 regions of the genome with convincing statistical association with BMI. Pathway analyses provided support for a role of the central nervous system in obesity susceptibility and implicated new genes and pathways related to synaptic function, glutamate signalling, insulin secretion/action, energy metabolism, lipid biology and adipogenesis.

Example 1.3. Disease GWAS (slides 20-22) GWAS on migraine by Gormley et al. (2016) combined genetic data on 60,000 cases (individuals with migraine) and 315,000 controls (individuals with no known migraine) originating from 22 studies. Genetic data was available on millions of genetic variants. At each variant, the genotype distribution between cases and controls were compared. 38 regions of the genome showed a convincing statistical association with migraine. Downstream analyses combined the genes into pathways and cell types and highlighted enrichment of signals near genes that are active in vascular system.

Terms:

GWAS have shown us that, very generally, complex traits and common diseases are highly polygenic, and many common variants with only small effects influence these phenotypes. We don’t yet know which are the exact causal variants for each phenotype because of the correlation structure among genetic variants (this is the fine-mapping problem we’ll look later). We also don’t yet know very accurately how rare variants affect each phenotype because that requires very large sample sizes interrogated by genome sequencing techniques, not only by SNP arrays.

1.2.1 Quantitative traits

Let’s mimick the data we see on slide 3. The phenotype is LDL-cholesterol level and we assume that the trait distributions of individuals with 0, 1 or 2 copies of allele T at SNP rs11591147 are Normal distributions with SD=1 and with means of 0.02, -0.40 and -2.00, respectively. Allele T frequency is 4% in Finland. Let’s simulate \(n=10,000\) individuals and boxplot them by genotype.

n = 10000
f = 0.04
mu = c(0.02, -0.40, -2.00) #mean of each genotype
sigma = c(1, 1, 1) #SD of each genotype

x = rbinom(n, size = 2, p = f) #genotypes for 'n' individuals assuming HWE
table(x)/n #(always check that simulated data is ok before starting to work with it!)
## x
##      0      1      2 
## 0.9212 0.0772 0.0016
y = rep(NA,n) #make empty phenotype vector
for(ii in 0:2){ #go through each genotype group: 0, 1, 2.
  y[x == ii] = rnorm(sum(x == ii), mu[1+ii], sigma[1+ii]) } #generate trait for group ii

boxplot(y ~ x, main = "Simulated rs11591147 in Finns", ylab = "LDL", 
        xlab = "Copies of T", col = "limegreen")

We see that the phenotype varies with genotype in such a way that each additional copy of allele T decreases the level of LDL.

Additive model

The simplest way to analyze these data statistically is to use an additive model, that makes the assumption that the means of the groups depend additively on the number of allele 1 in the genotype, and that the SDs of the genotype groups are constant. Thus, we fit a linear model \(y = \mu + x \beta + \varepsilon\), where \(y\) is the phenotype, \(x\) is the genotype (0,1 or 2) and parameters to be estimated are

  • \(\mu\), the mean of genotype 0 and
  • \(\beta\), the effect of each copy of allele 1 on the mean phenotype.

The error terms \(\varepsilon\) are assumed to have an identical Normal distribution \(N(0,\sigma^2)\) where \(\sigma^2\) is not known and will be estimated from the data. Let’s fit this linear model in R using lm().

lm.fit = lm(y ~ x)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7634 -0.6652 -0.0119  0.6759  3.8529 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01358    0.01032   1.316    0.188    
## x           -0.44553    0.03570 -12.480   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9915 on 9998 degrees of freedom
## Multiple R-squared:  0.01534,    Adjusted R-squared:  0.01524 
## F-statistic: 155.7 on 1 and 9998 DF,  p-value: < 2.2e-16

The summary(lm.fit) command produced

  • parameter estimates (or Coefficients) \(\widehat{\mu}\) and \(\widehat{\beta}\),
  • their standard errors (SE) (estimates for square root of the sampling variance of the parameter estimates),
  • t-statistic (estimate/SE) and
  • P-value under the null hypothesis that the parameter is 0 and errors are uncorrelated and have distribution \(N(0,\sigma^2)\).

Under the assumptions of linear model, sampling distribution of t-statistic is \(t\)-distribution and hence q% confidence intervals are determined as \(\widehat{\beta}\pm a\times \textrm{SE}\), where \(a\) is the q/2% quantile of \(t\)-distribution with \(n-2\) degrees of freedom. When \(\sigma^2\) is known, the \(t\)-distribution is replaced by a Gaussian, and same is approximately true when \(n\) becomes large, even if estimate \(\widehat{\sigma^2}\) is used in computing SE. In these cases, we often talk about z-statistic instead of t-statistic. In GWAS analyses, we typically have thousands of samples and use z-scores and the Normal approximation by default.

The last paragraph in the output tells about the full model fit. We can measure how much variation in \(y\) is left unexplained by the model by computing residual sum of squares (RSS): \[ RSS = \sum_{i=1}^n \left(y_i - \widehat{\mu} - x_{i} \widehat{\beta}\right)^2 \] \(R^2\) is the proportion of variance explained by the linear model, i.e., \(R^2 = 1-\frac{RSS}{n-1} / \widehat{\textrm{Var}}(y)\). Adjusted version penalizes for additional predictors and is defined here as \(R_{adj}^2 = 1-\frac{RSS}{n-2} / \widehat{\textrm{Var}}(y)\). Note that if there is only the intercept parameter \(\mu\) in the model, then \(R^2=R_{adj}^2=0\), and if the model explains data perfectly (\(RSS=0\)), then \(R^2=R_{adj}^2=1\). In other cases \(R^2\) values are between 0 and 1 and larger values mean more variance explained by the model.

However, \(R^2\) should not be the only value used to judge how suitable the model is for the data. One should also plot the data and the model fit in different ways to assess this question. (Although, of course not for all variants in GWAS, but for the most interesting ones.) For this simple linear model, a scatter plot and a regression line is a good way to assess whether there seem to be deviations from the additivity assumption. Additionally, the differences in residual variation between the groups could indicate interaction effecs between the genetic variant and some other genetic or environmental variable.

plot( x + runif(n, -0.05, 0.05), y, xlab = "genotype", ylab = "LDL", xaxt = "n", 
      pch = 3, cex = 0.5, col = "gray") 
#runif() adds some jitter to x so that all points are not on top of each other
axis(1, at = 0:2, labels = 0:2)
points(0:2, c(mean(y[x==0]), mean(y[x==1]), mean(y[x==2])), col = "red", pch = "X", cex = 1.3)
abline(lm.fit, col = "orange", lwd = 2)
legend("topright", pch = "X", legend ="group means", col = "red")

Conclusion: We see a statistically highly significant association between the genotype and phenotype where a copy of allele T decreases LDL levels by 0.45 units. This variant explains about 1.5% of the variation in LDL-cholesterol levels. We also see that individuals homozygous for allele T (genotype 2) have on average lower levels of LDL than the model predicts, which indicates a deviation from the additivity assumption. Let’s next fit a full 2-parameter model to quantify this deviation.

Full model

Let’s add a new parameter \(\gamma\) to the model that describes the residual effect for group 2 after the additive effect is accounted for. The model is \(y = \mu + x\beta + z\gamma + \varepsilon,\) where \(z\) is indicator of genotype 2, i.e., \(z_i=1\) if individual \(i\) has genotype 2 and otherwise \(z_i=0\). This is the full model where each genotype group has its own mean (genotype 0: \(\mu\); genotype 1: \(\mu + \beta\) and genotype 2: \(\mu + 2\beta + \gamma\)).

z = as.numeric( x == 2 ) #z is indicator for genotype group 2
lm.full = lm( y ~ x + z )
summary(lm.full)
## 
## Call:
## lm(formula = y ~ x + z)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8095 -0.6675 -0.0128  0.6760  3.8548 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01165    0.01032   1.129    0.259    
## x           -0.39750    0.03711 -10.711  < 2e-16 ***
## z           -1.20614    0.25789  -4.677 2.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9905 on 9997 degrees of freedom
## Multiple R-squared:  0.01749,    Adjusted R-squared:  0.01729 
## F-statistic: 88.97 on 2 and 9997 DF,  p-value: < 2.2e-16

It seems that also the new variable is useful (large effect and small P-value). Now the interpretation of coefficients is that genotype 1 has avg. phenotype of -0.40 and genotype 2 has avg. phenotype -0.398*2-1.206 = -2.00.

Note also that the full model above gives the same model fit and is simply a different parameterization of the linear regression that treats the genotype as a factor with three levels.

lm.full2 = lm( y ~ as.factor(x) )
summary(lm.full2)
## 
## Call:
## lm(formula = y ~ as.factor(x))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8095 -0.6675 -0.0128  0.6760  3.8548 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.01165    0.01032   1.129    0.259    
## as.factor(x)1 -0.39750    0.03711 -10.711  < 2e-16 ***
## as.factor(x)2 -2.00115    0.24784  -8.074 7.56e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9905 on 9997 degrees of freedom
## Multiple R-squared:  0.01749,    Adjusted R-squared:  0.01729 
## F-statistic: 88.97 on 2 and 9997 DF,  p-value: < 2.2e-16

We should use this latter parameterisation of the full model if we are interested in SE of the genetic effect of genotype 2, because that information is not given by the first parameterisation of the full model. If we instead are interested to quantify how much the data show deviation from the additivity then the first parameterisation is the most suitable.

(The latter parameterisation is also the same model as the traditional Analysis of Variance ANOVA but we are working within the regression model framework for reasons that become clear once we get to confounders and covariates.)

All these models make the same assumption that SD is constant across the genotype groups.

Current approach to quantitative phenotypes in GWAS

Almost all GWAS analyze quantitative traits using the additive model, i.e., a linear regression model with a single parameter for genetic effect. The full model is typically only used for a small group of interesting variants identified by the additive model to check them manually for possible deviations from the additive effects. The main reason for this is that the additive model is usually almost as powerful to find associations as the full model even when deviations from additivity are present in the data, since typically one of the genotype groups is much smaller than the other two and hence does not affect much the statistical model fit. Additionally, our current understanding is that most associations follow well the additive model and the additive model has more power than the full model, for these cases. (But note that our current understanding may be biased in favor of the additive model since we do not usually look very carefully for non-additive effects.) It would seem useful to run both the additive and full model in GWAS, but this is often not done because there are millions of variants to be analyzed and hence there are a lot of work and output already when only the simplest model is applied.

Quantile normalisation (QN). Often in large GWAS the quantitative phenotype is forced to follow a Normal distribution by quantile normalisation (also inverse-Normal transfromation). This way the effect of phenotypic outliers is diminished while keeping the ranking of the trait values constant. This also harmonizes the trait distributions across multiple cohorts by forcing them to look similar.

To apply QN to a set of \(n\) trait values, we first regress out from the trait values the central covariates (such as sex and age) using a linear model. Then we order the residuals of the regression in ascending order \(\widehat{r}_{(1)}\leq \ldots \leq \widehat{r}_{(n)}\). Now the QN’ed trait values for the sample is taken from the inverse of the cumulative distribution function of the Normal distribution at \(n\) equally spaced values between 0 and 1, and the resulting values \(q_1 \leq \ldots \leq q_n\) are matched to the individuals so that the value \(q_i\) becomes the trait value for individual who corresponds to residual \(\widehat{r}_{(i)}\). Advantage of QN is robustness to outliers and to systematic differences in measurements between studies. A disadvantage is that we lose some information of the phenotype distribution, which could be useful if it was modeled properly.

Let’s do QN for 100 males and 100 females where the phenotype in males follows \(2+\textrm{Gamma}(\textrm{shape}=1.5,\textrm{scale}=1.5)\) and in females \(6+\textrm{Gamma}(\textrm{shape}=1.5,\textrm{scale}=1.5)\).

n = 200 #males + females
fem = rep( c(0,1), each = n/2) #who is female
y = 2 + rgamma(n, shape = 1.5, scale = 1.5) #males have shift of 2
y[fem == 1] = 4 + y[fem == 1] #females have shift of 6 = 2 + 4
hist(y, breaks = 30, col = "khaki") #shows some outliers compared to mixture of 2 Normals