---
title: "GWAS 7: Linkage disequilibrium (LD) and fine-mapping"
author: "Matti Pirinen, University of Helsinki"
date: 'Latest update: March 31, 2025'
output:
  pdf_document: default
  html_document: default
urlcolor: blue
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(19)
```

This document is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).

The slide set referred to in this document is "GWAS 7".

### 7.1 Haplotypes 

Terms

* **ploidy**, the number of copies of the genome in the cell nucleus.
* **haploid**, one copy of the genome; gametes (sperm and egg cells) are haploid cell types.
* **diploid**, two copies of the genome; human cell types with a nucleus, other than gametes, are diploid. 
* **haplotype**, **haploid genotype**; in a diploid cell/individual, a genome inherited from one parent.
* **diploid genotype**, a complete set of two genomes of a diploid cell/individual without
 haplotype information.
* **diplotype**, diploid genotype with haplotype information; a pair of haplotypes for a diploid genotype.

We know that chromosomes are inherited from the parents as recombined segments of grandparents' chromosomes (slide 2). 
From a genotyping array, we observe (diploid) genotypes at each locus,
and we can't always be certain which are the two underlying haplotypes (slide 3).

Consider two SNPs 1 and 2 on the same chromosome, with alleles A and C at SNP1 and 
G and T at SNP2. If we observe an individual whose genotype is SNP1 = A/A, SNP2 = G/T, 
then her haplotypes are definitely \{A-G, A-T\}. 
That is, the two chromosomes that she
has inherited from her two parents read A-G and A-T.
But if we instead observe genotype SNP1 = A/C, SNP2 = G/T, then
the haplotypes are not determined without external information:
they can be either \{A-G, C-T\} or \{A-T, C-G\}. The process of figuring out
the haplotypes at multiple loci from the (unordered) genotypes is called 
haplotype **phasing**. Historically, the phase was determined from pedigree data,
where the parents' genotypes inform us about the possible offspring haplotypes.
Experimental methods for phasing also exist but are costly.
Nowadays, phasing is more often done in large population samples
by using computational methods, and the result is probabilistic.
For a review on phasing, see
[Browning & Browning (2011)](https://www.nature.com/articles/nrg3054).

How can we phase genotypes into haplotypes? If an individual is
heterozygous at $p$ SNPs,
then he/she has $2^p$ possible haplotypes and $2^{p-1}$ possible phasings
(as one haplotype already determines the other haplotype given the genotypes).
These numbers become huge already for tens of SNPs. 
However, empirical data from the early 2000s started to show 
that when many individuals were genotyped at SNPs residing
near each other in the genome, the observed
genotypes suggested that only a few ($\ll 2^p$) 
different haplotypes were present in the population in any one 
region of the genome.
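
For small $p$, we can verify this counting by brute force. Below is a quick sketch (not part of the original analysis) for $p = 3$ heterozygous SNPs:

```{r}
#An individual heterozygous at p SNPs: enumerate all 2^p possible haplotypes.
#Each phasing is an unordered pair of a haplotype and its complement,
#so there are 2^p / 2 = 2^(p-1) possible phasings.
p.het = 3
haplos = as.matrix(expand.grid(rep(list(0:1), p.het))) #all 2^p.het haplotypes
c(haplotypes = nrow(haplos), phasings = nrow(haplos)/2, "2^(p-1)" = 2^(p.het - 1))
```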

**Example 7.1.** (Slide 4)
Go to [LDlink](https://ldlink.nci.nih.gov/?tab=home) that gives access to the haplotype
data from the 1000 Genomes project.
Choose LDhap tab and input the following list of 9 SNPs that come 
from a 2500 bps long region on chr 1:

rs115037027<br>
rs12409788<br>
rs1576517<br>
rs151240271<br>
rs12752436<br>
rs76864380<br>
rs6586443<br>
rs35213023<br>
rs34910942

* Choose first the CHS (Southern Han Chinese) population and press `Calculate`.
The table shows that only two of these SNPs have MAF > 5% and four are monomorphic,
i.e., show only one allele in these data.
There are only 4 haplotypes present in the population out of the theoretically possible $2^5=32$.

* Change population to FIN. There are 5 haplotypes present and one SNP is monomorphic.
Compare SNPs 1 and 6. They have the same pattern of variation across the haplotypes:
Whenever one has **T**, the other also has **T**. They are completely associated
with each other and, consequently, in a GWAS they would give exactly the same results.
The triple of SNPs 5, 8 and 9 is similarly fully correlated.

* Change population to LWK (Kenya). Now we see 8 haplotypes, no monomorphic SNPs
and only the SNPs 5 and 8 are perfectly correlated with each other.

This example shows how some variants in a narrow region of 2500 bps 
are highly correlated with their neighbors when it comes to the 
haplotype patterns that exist in any one population. Typically, African
populations show more diversity than Asian and European populations
due to more recent genetic bottlenecks that have reduced variation 
in the history of the latter two. But also in the African populations,
the haplotype patterns are very much restricted compared to what they would be
if the alleles at nearby loci were distributed nearly independently in
the population.

The human genome has a
**haplotype block structure** (slide 5), which suggests that the recombination process
shuffles the chromosomes mainly at only a relatively small number of sites on the 
genome, the so-called *recombination hotspots*. Between these hotspots, there are larger
regions of genome with low recombination rates (slide 6). Consequently,

* Phasing, especially in the low recombination areas, can be done accurately 
by collecting a large number of genotyped individuals and using a probability model
to estimate which are the most likely haplotypes in this population. 
Since only a few haplotypes exist in the population, 
we can get a fairly reliable phasing for individuals by sharing haplotype
information among them, and we simultaneously also get accurate estimates for the 
population frequencies of the haplotypes. 
Existing high quality reference haplotype data from the 
target population also improves phasing.
Phasing software:
[Beagle5](http://faculty.washington.edu/browning/beagle/beagle.html), 
[Eagle2](https://data.broadinstitute.org/alkesgroup/Eagle/),  [SHAPEIT5](https://odelaneau.github.io/shapeit5/).

* SNPs in the same haplotype block tag each other well because each allele
of a particular SNP sits on only a limited number of haplotypes in that block.
In particular, by typing the set of the most informative tag-SNPs, we can already
cover a lot of the neighboring genetic variation that we don't directly genotype.
This was a property that boosted the first GWAS: in order to see statistical
associations between genomic regions and phenotypes, it is enough that we manage to genotype 
tag-SNPs of causal variants. 

* The exact causal variants 
are difficult to pinpoint based on the statistical information alone 
since so many variants are so highly correlated with each other.
This is the **fine-mapping problem**.


### 7.2 Linkage disequilibrium (LD)
We know from empirical data that the human genome has a haplotype block structure.
This correlation structure
arises because when a new variant (assumed here to be an SNV) is introduced into the population
as a mutation, it emerges on one of the existing haplotypes of the population.
Because haplotypes are inherited as large segments of grandparental haplotypes,
with often just 1 or 2 recombination events per chromosome, the new mutation
will be, for many subsequent generations,
passed on to the next generation together with the same alleles that were
close to it on its original haplotype background.
However, as more generations pass, recombinations 
cut down the original haplotype
background to smaller and smaller shared segments between the 
descendants of the original haplotype,
and hence the correlation at the population level
between the variant and its neighbors tends to decrease over time (slides 8-9).
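
A standard population-genetics result (stated here without derivation) is that, under random mating, the expected value of $D$ decays by a factor of $(1-c)$ per generation, where $c$ is the recombination fraction between the two loci. A quick illustration of this decay:

```{r}
#Expected proportion of the initial D remaining after t generations is (1-c)^t.
c.frac = c(0.001, 0.01, 0.1) #recombination fractions between the loci
t.gen = c(10, 100, 1000)     #numbers of generations
decay = outer(c.frac, t.gen, function(cc, tt){ (1 - cc)^tt })
dimnames(decay) = list(paste0("c=", c.frac), paste0("t=", t.gen))
signif(decay, 2)
```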

Consider any two SNPs at distinct sites of the genome. 
Suppose that we had perfect haplotype information.
Then we could observe
how often their alleles 1 are on the same haplotype in 
the population and compare that
to the expected proportion if the two variants were independently distributed.
If the allele 1 frequencies in the population are $f_k$ and $f_l$ at SNPs $k$ and $l$,
and if the frequency of the haplotype 1-1 is 
$f_{kl}$, then we define 

* disequilibrium coefficient $D_{kl}=f_{kl}-f_k\cdot f_l$,

which, theoretically, is 0 if the two loci are independent, also said to be 
in **linkage equilibrium**, LE.
If the loci are not independent ($D_{kl}\neq 0$), then we say that the loci are
in **linkage disequilibrium** (LD). Naturally, we never know the
frequencies exactly in the population,
and we use sample estimates to compute $D_{kl}$.
Hence, we will always observe some non-zero value for $D_{kl}$ in our sample and 
the interest is in the magnitude of $D_{kl}$.

We can also measure the amount of LD by using the Pearson correlation coefficient
$r_{kl} = \textrm{cor}(a_k,a_l)$, where $a_k$ and $a_l$ are the indicators
of the alleles at SNPs $k$ and $l$ on the same haplotype. It can be shown that
$r$ and $D$ are related through

* $r_{kl} = \frac{D_{kl}}{\sqrt{f_k(1-f_k)f_l(1-f_l)}}.$

Note that, theoretically, 
both $r$ and $D$ are 0 if and only if the loci are in linkage equilibrium.
In what follows, we are interested in $r$, 
because it determines the statistical consequences of LD on GWAS analyses.

A statistical way to think about LD is that
if SNPs $k$ and $l$ are in LD, then observing the allele of a particular
haplotype at SNP $k$ gives us more information about its allele
at SNP $l$ than we would have based solely on the population allele frequencies
at SNP $l$.
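
As a small worked example (the frequencies below are illustrative values, not from real data), we can compute $D$ and $r$ directly from the definitions above:

```{r}
f.k = 0.15; f.l = 0.40  #allele 1 frequencies at SNPs k and l
f.kl = 0.10             #frequency of the haplotype 1-1
D.kl = f.kl - f.k * f.l #disequilibrium coefficient
r.kl = D.kl / sqrt(f.k*(1 - f.k) * f.l*(1 - f.l)) #correlation between the alleles
c(D = D.kl, r = r.kl)
```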


#### 7.2.1 Generating two-locus haplotype data
Let's follow [Vukcevic et al. (2011)](https://onlinelibrary.wiley.com/doi/full/10.1002/gepi.20576) who derived
formulas for the 4 haplotype frequencies in the population,
given the MAFs at the two SNPs (1 and 2), 
and LD as measured by $r = r_{12}$ between the SNPs.
(A more comprehensive treatment of the subject is available in [Damjan Vukcevic's D.Phil thesis, Oxford, 2009](https://ora.ox.ac.uk/objects/uuid:8f89593e-a4ab-4df0-b297-74194be7891c).)

Suppose that SNP1 has alleles **a** (minor) and **A** (major)
and SNP2 has alleles **b** (minor) and **B** (major) and that $f_a \leq f_b.$
First, in order for the allele frequencies and the given level of LD, as measured by $r$,
to correspond to a valid haplotype distribution, the following inequalities must hold:
$$
-\sqrt{\left(\frac{f_a}{1-f_a}\right) \left(\frac{f_b}{1-f_b}\right)} \leq
r \leq
\sqrt{\left(\frac{f_a}{1-f_a}\right) \bigg/ \left(\frac{f_b}{1-f_b}\right)}.
$$
For example, the correlation between the two SNPs can be really high ($r\approx 1$) 
only if both have very
similar MAFs, and in order for the correlation to be highly negative ($r\approx -1$),
both MAFs must be close to 0.5.
This is because if we were to assign the minor allele at both SNPs to haplotypes,
then the only way that we can make the correlation $\approx 1$ 
is that **a** and **b** are almost always on the same haplotype
and **A** and **B** are almost always on the same haplotype. 
This is possible only if $f_a \approx f_b$ because otherwise
there are either some extra **a**s that don't find any **b**s 
to pair with and therefore must pair with **B**s,
or there are some extra **b**s that don't find **a**s to pair with and therefore 
must pair with **A**s, 
both of which will lead to a reduced correlation compared to the maximum value of 1.
To get a high negative correlation for the minor alleles, each minor allele
must be paired up with the major allele at the other locus, and this is possible
only if MAF is close to 0.5 at both loci. Note that if we change the allele coding
at one locus from allele 1 being the minor allele to allele 1 being the major allele, 
that simply changes the sign, but not the magnitude, of the corresponding value of the correlation 
$r$.
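
The following sketch evaluates these bounds for a few example MAF pairs, illustrating that $r$ can approach $+1$ only for similar MAFs and $-1$ only when both MAFs are near 0.5:

```{r}
#Admissible range of r given the two MAFs (assuming f.a <= f.b).
r.range <- function(f.a, f.b){
  c(r.min = max(-1, -sqrt(f.a/(1 - f.a) * f.b/(1 - f.b))),
    r.max = min(1, sqrt((f.a/(1 - f.a)) / (f.b/(1 - f.b)))))
}
rbind("MAFs 0.10, 0.10" = r.range(0.1, 0.1), #similar MAFs: r can reach +1
      "MAFs 0.10, 0.50" = r.range(0.1, 0.5), #different MAFs: |r| is limited
      "MAFs 0.50, 0.50" = r.range(0.5, 0.5)) #both MAFs 0.5: r can reach -1 and +1
```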

The haplotype frequencies, and samples from them,
can be computed using the R-function
`geno.2loci()` given below.

```{r}
geno.2loci <- function(n, r, mafs, return.geno = TRUE){
  #INPUT:
  # n, individuals  
  # r, correlation coefficient between the alleles on the same haplotype of the two loci
  # mafs, MAFs of the two loci
  #OUTPUT: 
  # if return.geno == TRUE: n x 2 matrix of GENOTYPES of n individuals at 2 loci
  # if return.geno == FALSE: (2n) x 2 matrix of HAPLOTYPES of n individuals (2n haplotypes) at 2 loci
  stopifnot( r >= (-1) & r <= 1 )
  stopifnot( length(mafs) == 2 )
  stopifnot( all(mafs > 0) & all(mafs <= 0.5) )
  stopifnot( n > 0)
              
  #Label SNPs and alleles so that a and b are minor alleles and freq a <= freq b.
  #At the end, possibly switch the order of SNPs back to the one given by the user.
  f.a = min(mafs) # maf at SNP1
  f.b = max(mafs)  # maf at SNP2

  #With these parameters, the admissible values of the correlation 'r' are in the interval:
  r.min = max( -1, -sqrt(f.a/(1-f.a)*f.b/(1-f.b)) )
  r.max = min( 1, sqrt(f.a/(1-f.a)/f.b*(1-f.b)) )
  #c(r.min,r.max)
  #Check that the given 'r' is from this interval
  if(r < r.min | r > r.max) stop(paste0("with these mafs, r should be in (",r.min,",",r.max,")"))

  # Alleles SNP1: A (major) and a (minor); SNP2: B (major) and b (minor).
  # Compute conditional probabilities for allele 'a' given allele at locus 2:
  q0 = f.a - r*sqrt(f.a*(1-f.a)*f.b/(1-f.b))         #Pr(a|B) 
  q1 = f.a + (1-f.b)*r*sqrt(f.a*(1-f.a)/f.b/(1-f.b)) #Pr(a|b)

  #Compute the four haplotype frequencies:
  f.ab = f.b*q1
  f.aB = (1 - f.b)*q0
  f.Ab = f.b*(1 - q1)
  f.AB = (1 - f.b)*(1 - q0)
  f = c(f.ab, f.aB, f.Ab, f.AB) #the frequencies of the 4 possible haplotypes in the population
  haps = matrix(c(1,1,1,0,0,1,0,0), nrow = 4, ncol = 2, byrow = TRUE) 
  
  #Generate data for n individuals where each individual is measured at these two SNPs.
  #There are 2*n haplotypes, 2 for each individual.
  hap.ind = sample(1:4, size = 2*n, replace = TRUE, prob = f) 

  if(mafs[1] > mafs[2]) haps = haps[,2:1] #Whether to change the order of the loci?
  #Either make a genotype matrix by summing the two haplotypes for each individual...
  if(return.geno) X = haps[hap.ind[1:n],] + haps[hap.ind[(n+1):(2*n)],] 
  if(!return.geno) X = haps[hap.ind,]  #...or return the haplotypes as such.
  return(X)
}
```

Let's check that this works and gives a correlation $\approx r$. 

```{r}
params = matrix(c(0.3, 0.3, 0.5, #each row has: r, MAF at SNP1, MAF at SNP2 
                  0.8, 0.4, 0.3,
                  -0.5, 0.34, 0.4), byrow = TRUE, ncol = 3)
n = 10000
for(ii in 1:nrow(params)){
  X = geno.2loci(n, r = params[ii,1], mafs = params[ii,2:3], return.geno = FALSE ) #haplotype level data
  print(c(cor(X[,1],X[,2]), colSums(X)/nrow(X)))
} #and compare to params above
```

It works, so let's use it to demonstrate what haplotypes look like under different amounts of LD.
We start from high LD (r = 0.9),
then move to intermediate LD (r = 0.666) and finally to no LD (r = 0).
The MAFs at the two SNPs must be similar for the high-LD case. (Here we use 0.4 for both.)

```{r, fig.width = 9}
n = 50
rs = c(0.9, 0.666, 0.0) #correlation values
mafs = matrix(c(0.4,0.4, 0.4,0.5, 0.4,0.4), byrow = TRUE, ncol = 2)
cbind(r = rs, mafs) #check that we put correct values in matrix
par(mfrow = c(3,1))
par(mar = c(2,4,4,1))
for(ii in 1:nrow(mafs)){
  X = geno.2loci(n, r = rs[ii], mafs = mafs[ii,],  return.geno = FALSE) #return haplotypes
  X = X[order(X[,1]),] #order so that haps with major allele at SNP1 come first
  image(X, yaxt = "n", xaxt = "n", xlab = "", ylab = "", 
        col = c("floralwhite","dodgerblue"), cex.main = 1.3,
        main = paste0(2*n," haplotypes, observed r=",signif(cor(X[,1],X[,2]), 3) ))
  abline(h = 0.5, lwd = 1.5, lty = 2, col = "black")
  axis(2, at = c(0,1), labels = c("SNP1","SNP2"), cex.lab = 1.3 )
}
```

In the first case, we see that allele **A** (white) at SNP1 almost always indicates
allele **B** (white) at SNP2, and ALSO allele **a** (blue) indicates allele **b** (blue) 
at the other SNP.
Thus, knowing the allele at either of the SNPs predicts the allele at the 
other SNP with very high accuracy. 
These two SNPs are in **high LD**:
the population parameter was $r=0.9$; the observed value
from the sample is shown in the title.

In the second case, the MAF at SNP1 is lower than at SNP2, and even
though a blue allele at SNP1
predicts well that SNP2 also carries a blue allele,
many white alleles at SNP1 also go together
with a blue allele at SNP2. Hence LD, as measured by $r$,
is lower than in the first case.
This case demonstrates that SNPs with very different 
MAFs cannot have very high correlation because
there is no way to make all blue match blue AND all white match white 
when there are different numbers of blue alleles at the two SNPs.
These SNPs are clearly in LD with each other, although not in as 
strong LD as the SNPs in the first case. 

In the third case, the SNPs are independent in the population 
($r=0$ in simulation), and also 
in the observed data, there is little correlation between the loci. 
Thus, knowing the allele at SNP1 does not help in
predicting the allele at SNP2 on the same haplotype.
The best prediction we can make statistically is to guess the allele at SNP2
based on the population allele frequencies,
ignoring which allele was observed at SNP1, since it does
not give additional information over the population allele frequencies.
These SNPs are **not in LD** with each other.

Let's also check what happens to the estimate of $r$ 
if we estimate it at the genotype level, not at the haplotype level, 
as we have done so far.
Thus, we'll generate haplotypes as previously, 
but before computing $r$, we collapse the haplotypes
into genotypes (by `return.geno = TRUE`) 
and then compute the correlation between the genotypes at the two loci.

```{r}
params = matrix(c(0.3, 0.3, 0.5, #each row has: r, MAF at SNP1, MAF at SNP2 
                  0.8, 0.4, 0.35,
                  -0.5, 0.34, 0.4), byrow = TRUE, ncol = 3)
n = 10000
for(ii in 1:nrow(params)){
  X = geno.2loci(n, r = params[ii,1], mafs = params[ii,2:3], return.geno = TRUE )
  print(c(cor(X[,1],X[,2]), colSums(X)/2/nrow(X)))
} #and compare to params above
```

This seems to estimate the haplotype-level correlations even though the input data
are genotypes. This is good to know, as
the data we observe in practice are at the level of genotypes, and
phasing them into haplotypes is not always straightforward.

**Example 7.2.**
We can also get LD estimates for human populations from
[LDlink](https://ldlink.nci.nih.gov/?tab=home).

* Choose LDhap and consider the two SNPs rs7837688 and rs4242382 in the CEU population. You'll see the haplotype
distribution in the 1000G CEU data. There are 3 haplotypes (out of all 4 possible) and one is quite rare (1%).
So these SNPs are almost in perfect LD, where the allele at one SNP tells the allele at the other.

* While you could compute $r$ from the info given by LDhap, you can also query it directly from
the LDpair tab, which gives $r^2=0.8791$, i.e., $r=0.938.$ (Slide 10.)

* Look at the same SNPs in LWK population. All 4 haplotypes are present and there is 
no LD in terms of $r^2$. Often the populations of African ancestry
have less LD than those of non-African ancestry since
the latter have gone through more recent bottlenecks. Such bottlenecks
increase LD because when relatively few haplotypes
have founded the subsequent generations,
those haplotypes will determine the LD patterns. (Slide 11.)

* Go to the LDproxy tab and look for rs7837688 in the Chinese CHB population.
You'll see a figure and a list of
the variants that are in the highest LD with rs7837688.
Note also how the high LD is concentrated within a haplotype block
that lies between recombination hotspots
(the region between the high peaks of the recombination rate, in cM/Mb). (Slide 12.)

* LDmatrix (slide 13) can be made for a given set of SNPs.

**Example 7.3.** 
Let's plot an LD matrix for 74 variants around the *APOE* gene on chr 19 using
the 1000 Genomes Finnish samples (genomic coordinates are in GRCh37). 

```{r,fig.width = 6}
path = "https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/material/APOE_1000G_FIN_74SNPS."
haps = read.table(paste0(path,"txt"))
info = read.table(paste0(path,"legend.txt"),header = TRUE, as.is = TRUE)
dim(haps) #rows haplotypes, cols SNPs
info[1:3,] #info for first three SNPs
haps[1:3,] #data for first three haplotypes 

R = cor(haps) #LD matrix is the pxp correlation matrix
n.cols = 100
layout(matrix(c(1,2), nrow = 1), widths = c(9,1))
par(mar = c(2,2,3,1)) #plot LD matrix using image
image(abs(R), col = heat.colors(n.cols), 
      breaks = seq(0, 1, length = n.cols + 1),
      asp = 1, xaxt = "n", yaxt = "n", bty = "n", 
      main = paste("|r| of ", dim(R)[1],"SNPs around APOE")) 
par(mar = c(2,1,3,1)) #plot scale for colors
plot.window(xlim = c(0,1), ylim = c(0,n.cols)) 
points(x = rep(1,n.cols + 1),y=(0:n.cols), col = heat.colors(n.cols + 1), pch = 15, cex = 2)
axis(4, at = c(0, n.cols / 2, n.cols),
     labels = c(0, 0.5, 1) )
```

LD matrices from two Finnish study samples show a highly 
consistent LD structure on slide 14.


### 7.3 Effect of LD on GWAS results 
Let's then experiment with a GWAS setting on a quantitative phenotype
whose variance is 1 in the population: SNP1 is a causal variant
with the causal effect size $\lambda_1 = 0.2$ and MAF = 0.2,
whereas SNP2 does not have
a causal effect ($\lambda_2=0$) and has MAF = 0.4.
The LD between the SNPs is $r_{12}=0.6$.
So far on this course, we have used $\beta$ to denote the effect size
of allele 1.
The reason why we now use $\lambda$ rather than $\beta$
is that we want to separate the **causal effect** ($\lambda$) from
the **marginal effect** ($\beta$), as will become clear soon.

Let's make 1000 simulations of such a setting, using $n=1000$ individuals in each
simulation, and let's do the typical single-SNP GWAS on each variant and collect
the effect estimate $\widehat{\beta}_l$, its SE and P-value for both variants $l=1,2.$
```{r, fig.height=4}
n.iter = 1000
n = 1000
r = 0.6
mafs = c(0.2, 0.4)
lambda = c(0.2, 0) #causal effects of each SNP
res = matrix(NA, ncol = 6, nrow = n.iter) #6 cols: beta1, SE1, P1; beta2, SE2, P2
colnames(res) = c("beta1", "SE1", "pval1", "beta2", "SE2", "pval2") 
for(iter in 1:n.iter){
  X = geno.2loci(n, r, mafs, return.geno = TRUE) #generate 2-locus genotypes
  y = X %*% lambda + rnorm(n,0, sqrt(1-var(X %*% lambda)) ) #var(y) = 1, 
  res[iter, 1:3] = summary(lm(y ~ X[,1]))$coeff[2, c(1,2,4)] #collect beta, SE, P-val of SNP1
  res[iter, 4:6] = summary(lm(y ~ X[,2]))$coeff[2, c(1,2,4)] #collect beta, SE, P-val of SNP2
}
```

Let's plot the effect estimates $\widehat{\beta}_1$ and $\widehat{\beta}_2$ 
against each other and also as separate distributions.
Let's add vertical lines at the expected mean values: $\lambda_1$ 
for SNP1 and 
$r_{12}\,\lambda_1\,\sqrt{\frac{f_1(1-f_1)}{f_2(1-f_2)}}$ for SNP2 
(the derivation of this will be explained later).
```{r, fig.height = 5, fig.width = 9}
par(mfrow = c(1,2))
plot(res[,"beta1"], res[,"beta2"], xlab = expression(beta[1]), ylab = expression(beta[2]),
     main = paste("r =",signif(cor(res[,"beta1"], res[,"beta2"]), 3))) 
abline(0,1)
plot(density(res[,"beta1"]), xlab = expression(beta), col = "black", main = "", 
     xlim = c(0,0.4), ylim = c(0,9))
lines(density(res[,"beta2"]), col = "red")
abline(v = lambda[1], lty = 2)
abline(v = lambda[1]*r*sqrt(mafs[1]*(1-mafs[1])/mafs[2]/(1-mafs[2])), lty = 2)
legend("topright", col = c("black","red"), lwd = 1, legend = c(1,2) )
```

We see that there is a clear correlation in the GWAS effect estimates between SNPs 1 and 2,
and that the correlation seems to match well with the LD correlation between 
the SNPs ($r_{12}=0.6$).
The distributions of the effect estimates show that $\widehat{\beta}_1$ is centered
at the true causal effect $\lambda_1$ and $\widehat{\beta}_2$ is centered
at the value $r_{12}\,\lambda_1 \sqrt{\frac{f_1(1-f_1)}{f_2(1-f_2)}}$ (and not at 0,
which would be the true *causal* effect of SNP2; derivation on slide 17). 

Let's look at the P-values:

```{r, fig.height = 5, fig.width = 9}
par(mfrow = c(1,2))
plot(-log10(res[,"pval1"]), -log10(res[,"pval2"]), xlab = "-log10P1", ylab = "-log10P2")
abline(0,1)
plot(density(-log10(res[,"pval1"])), xlab = "-log10", col = "black", main = "", ylim = c(0,0.4))
lines(density(-log10(res[,"pval2"])), col = "red")
legend("topright", col = c("black","red"), lwd = 1, legend = c(1,2) )
```

The P-values are also correlated, and SNP2 has P-values that are clearly much smaller
than expected under the null even though SNP2 does not have a causal effect on the phenotype.
We also see that there are a few cases where
SNP2 actually has a lower P-value (higher -log10 P-value) than SNP1 even though
SNP1 is the causal variant that is driving the genotype-phenotype association here.
Consequently,

* When SNPs 1 and 2 are LD-friends (in LD with each other), and only SNP1 is causal,
then the causal effect of SNP1 will leak to the marginal effect of SNP2 
in the sense that the standard
one-SNP-at-a-time analysis at SNP2 is estimating the marginal effect 
$$\beta_2 = r_{12} \lambda_1 \sqrt{\frac{f_1(1-f_1)}{f_2(1-f_2)}},$$
rather than the true causal effect of SNP2 (which would be 0).

* This means that if we had not genotyped SNP1 but only genotyped SNP2,
we would still see the association signal in this region,
but its effect size would be reduced by a factor of
$r_{12}\sqrt{\frac{f_1(1-f_1)}{f_2(1-f_2)}}$,
and its NCP would be reduced by a factor of $r_{12}^2$,
compared to the values that we would estimate at the true causal variant
(a numerical check follows after this list).

* Hence, when we see peaks on Manhattan
plots, we cannot be sure whether we have the causal variation itself included in
our analysis, or whether we are just tagging it with our SNPs. In principle,
the possible hidden causal variation does not need to be a single SNV, 
it can be a structural variant, or a haplotype-level effect, for example.

* The top variant in terms of P-value does not need to be the causal
variant because of tagging and statistical sampling effects. In particular,
when there are many variants in high LD with the causal variant(s), the association
region looks messy (slide 15), and we can't know for sure which variants are causal
by looking at the marginal effect sizes and P-values.
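
Here is the numerical check referenced in the list above, using the simulation results `res` and the population value `r` from the chunk above. The noncentrality parameter (NCP) of a 1-df chi-square statistic equals its mean minus 1, so we can compare the observed NCP ratio to $r_{12}^2 = 0.36$:

```{r}
ncp1 = mean((res[,"beta1"] / res[,"SE1"])^2) - 1 #NCP estimate at the causal SNP1
ncp2 = mean((res[,"beta2"] / res[,"SE2"])^2) - 1 #NCP estimate at the non-causal SNP2
c(observed.ratio = ncp2 / ncp1, predicted = r^2)
```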

#### 7.3.1 Relation between causal effect $\lambda$ and marginal effect $\beta$

Consider a region with $p$ SNPs with their causal effects in vector
$\pmb{\lambda} = (\lambda_1,\ldots,\lambda_p)^T.$ 
This means that the region contributes to 
the phenotype of individual $i$ by the amount $\pmb{x}_i^T\pmb{\lambda} = \sum_{l=1}^p x_{il}\lambda_l.$

When we do a standard GWAS that includes one variant at a time in the regression model,
we estimate the marginal effects $\beta_l$ rather than the causal effects $\lambda_l$.
Above we saw how the non-causal variant ($\lambda_2=0$) had non-zero marginal effect
size ($\beta_2\neq 0$), whose value was determined by the causal effect
$\lambda_1$, $r_{12}$ and the allele frequencies at the two SNPs. 
This relationship generalizes to an arbitrary number $p$
of SNPs in a region and the marginal effect at SNP $l$ is
$$\beta_l = \sum_{k=1}^p r_{kl} \lambda_k \sqrt{\frac{f_k(1-f_k)}{f_l(1-f_l)}}
= \lambda_l + \sum_{k\neq l} r_{kl} \lambda_k \sqrt{\frac{f_k(1-f_k)}{f_l(1-f_l)}},$$
where the first sum is over all variants in the region, and, in the second
formulation, we have separated the variant's own causal effect from the sum (as $r=1$
between the variant and itself) (Slide 18). Thus, if $l$ is not in LD with any other (causal)
variant, then the marginal effect $\beta_l$ equals the causal effect $\lambda_l$.
Otherwise, the marginal effect at SNP $l$ combines the causal effect of $l$
with the causal effects from all other variants, weighted by their LD with the SNP $l$.
In particular, when a non-causal SNP $l$ is in high LD with one or more causal variants,
then we estimate its marginal effect as non-zero even when the SNP has no biologically 
causal effect on the phenotype.
This phenomenon explains why Manhattan plots show piles of nearby SNPs 
that together reach highly significant
P-values: a few of them may be truly causal, but most of them 
are only in LD with those few causals, and most of the non-zero marginal effects 
simply reflect the causal effects of their LD-friends.

We want the formulas to be compact and easy to remember, and we are not happy with the
above formula in this respect,
so we introduce some scaling to help us here.

#### Scaled genotypes and scaled effects.
Suppose that the MAF at SNP $l$ is $f_l$ and individual $i$ has genotype $x_{il}$.
We define the **scaled genotype** of individual $i$ as 
$$x_{il}^* = \frac{x_{il}-2f_l}{\sqrt{2f_l(1-f_l)}}.$$
Thus, while $x_{il}$ has three possible values 0, 1 or 2 independent of MAF,
$x_{il}^*$ also has three possible values,
but these values are chosen so that the average scaled genotype in the population is 0,
its standard deviation and variance are 1,
and, consequently, the numerical difference between two genotype classes grows as the
MAF decreases. 
This is shown in the Table below, where the columns present the
numerical values of the scaled genotypes 
corresponding to the genotypes 0,1 and 2, 
for four values of MAF $f$.

|genotype | $f=0.50$ | $f=0.10$|$f=0.01$|$f=0.001$|
|:---------:|:---------:|:---------:|:---------:|:---------:|
|0 |-1.41 | -0.47 |-0.14 |-0.045 |
|1 |0 |1.89 | 6.96| 22.33|
|2 |1.41 | 4.24| 14.07| 44.70|
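
The values in the table can be reproduced directly from the definition of the scaled genotype:

```{r}
#Scaled genotype values (x - 2f)/sqrt(2f(1-f)) for genotypes 0, 1, 2 at four MAFs.
f.tab = c(0.5, 0.1, 0.01, 0.001)
sc.geno = sapply(f.tab, function(ff){ ((0:2) - 2*ff) / sqrt(2*ff*(1 - ff)) })
dimnames(sc.geno) = list(paste("genotype", 0:2), paste0("f=", f.tab))
round(sc.geno, 2)
```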

We further define the scaled effects $\beta_l^*$ by running 
the (linear or logistic) regression model using the scaled genotypes:
$$Y \sim \mu^* + X^*\beta^*.$$
This model yields exactly the same P-value for $\beta^*_l$ as 
the regression using the original genotypes gives
for $\beta_l$, but the effect size that is being estimated
is now $\beta^*_l = \beta_l \sqrt{2f_l(1-f_l)}$ and its standard error is 
$\textrm{SE}^*_l = \textrm{SE}_l \sqrt{2f_l(1-f_l)}$. 
(Note that the z-score $\beta_l/\textrm{SE}_l = \beta^*_l/\textrm{SE}^*_l$ doesn't change
under this scaling.)
The **allelic effect** $\beta_l$ describes the change in 
the phenotype per each copy of allele 1, 
whereas the **scaled effect** $\beta_l^*$ describes the change in the phenotype per each
standard deviation unit of the genotype variable in the population. 
This linear transformation does not change the statistical model fit 
($R^2$ or P-value) at all.
In particular,
we need not run any new GWAS in order to estimate the scaled effects,
but we can simply transform the existing allelic effect estimates $\widehat{\beta}_l$
to the scaled effect estimates $\widehat{\beta^*_l}$ 
by multiplication with $\sqrt{2f_l(1-f_l)}.$
Note also that, for quantitative traits,
the square of the scaled effect ${\beta_l^*}^2 = 2f_l(1-f_l)\beta_l^2$ is directly the
phenotypic variance explained by the SNP.
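
As a quick check (a sketch using the objects `X` and `y` left over from the simulation loop of Section 7.3, and plugging in the sample allele frequency for $f_1$), the directly estimated scaled effect and the rescaled allelic estimate should agree exactly, and the P-value should not change:

```{r}
x = X[,1]                               #genotypes at SNP1 from the last simulated data set
f1 = mean(x)/2                          #sample allele 1 frequency
x.sc = (x - 2*f1) / sqrt(2*f1*(1 - f1)) #scaled genotypes
raw = summary(lm(y ~ x))$coeff[2, c(1,2,4)]       #allelic beta, SE, P-value
scaled = summary(lm(y ~ x.sc))$coeff[2, c(1,2,4)] #scaled beta*, SE*, P-value
rbind(raw, scaled, raw.rescaled = c(raw[1:2] * sqrt(2*f1*(1 - f1)), raw[3]))
```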

We do the same scaling with the causal effects where the **scaled causal effect** 
$\lambda_l^* = \lambda_l \sqrt{2f_l(1-f_l)}.$

With the scaled effects, 
the relationship between the marginal and causal effects is simply
$$\beta^*_l = \sum_{k=1}^p r_{kl}\lambda_k^* = \lambda_l^* + \sum_{k\neq l} r_{kl}\lambda_k^*,$$
where $r_{kl}$ is the correlation between SNPs $k$ and $l$, 
and that correlation is exactly the same whether we consider original genotypes 
or the scaled genotypes. Thus, we have simplified our earlier formula by replacing the
complicated allele-frequency-ratio-square-root by a few * marks.

It is clear that when we are interested in the biology behind the phenotype, 
then we are only interested in the 
causal variants and their causal effects. 
How can we estimate these causal effects ($\lambda$)? 
So far our GWAS results have only been about the marginal effects ($\beta$).

#### 7.3.2 Joint model

Let's write the phenotype as an additive combination of the (causal) 
effects of all $p$ variants in the genome:
$$y_i = \mu + \pmb{x}_i^T \pmb{\lambda} + \varepsilon_i = \mu + \sum_{l=1}^p x_{il}\lambda_l + \varepsilon_i,$$
where $\pmb{x}_i$ is the row $i$ of the full genotype matrix $\pmb{X}$ that contains the $p$ genotypes
of individual $i$, and $\pmb{\lambda}$ is the vector of the (unknown) causal effects of the $p$ variants.
For binary phenotype, $y_i$ is to be interpreted as the log-odds of the disease for individual $i$
(like in logistic regression).

The difference between this **joint model** of all variants and the **marginal model** of a single variant
($y_i = \mu + x_{il}\beta_l + \varepsilon_i$) is that the joint model accounts for the effects of all
variants on the phenotype simultaneously whereas the marginal model only considers one variant at a time.
Consequently, an estimator of an effect from the marginal model will estimate the marginal effect 
$\beta_l$ whereas an estimator of the joint model is estimating the causal effects $\pmb{\lambda}$ simultaneously.
A standard GWAS applies $p$ separate marginal regression models, 
one for each variant, and if we collect
the marginal effects into a single vector 
$\pmb{\beta}=(\beta_1,\ldots,\beta_p)^T$, then by the formulas
above, we have a linear relationship between the marginal and causal effects:
$$\pmb{\beta^*} = \pmb{R}\pmb{\lambda^*} \quad \textrm{ or equivalently } \quad \pmb{\lambda^*} = \pmb{R}^{-1} \pmb{\beta^*},$$
where $\pmb{R}$ is the $p\times p$ **LD-matrix** whose element $(l,k)$ is the pairwise
correlation $r_{lk}$ of variants $l$ and $k$.
We give this formula only for the scaled effects rather than the allelic effects because 
the same formula for the allelic effects involves ratios of allele 
frequencies for each pair of variants, which would
complicate the notation compared to this version where the differences in 
allele frequencies have been scaled away. 
But remember that we can always get from the scaled effect of variant $l$ 
to the allelic effect of $l$ by dividing by $\sqrt{2f_l(1-f_l)}$.

These considerations suggest two ways to estimate the causal effects $\pmb{\lambda}$.

1. Use a multiple regression model where the phenotype is regressed on multiple genotypes simultaneously. 
In R, this is done by simply using a matrix $X$ of all genotypes in the formula, e.g., `lm(y ~ X)`
or `glm(y ~ X, family = "binomial")`.

2. Take the estimates of the (scaled) marginal GWAS effects and transform them by the inverse of
the LD matrix: $\pmb{\lambda^*} = \pmb{R}^{-1} \pmb{\beta^*}$ (a sketch of this follows below).
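
Before trying the first approach in simulations below, here is a minimal sketch of the second approach on the last simulated 2-SNP data set (the objects `X` and `y` from Section 7.3, where the true causal effects were $\pmb{\lambda} = (0.2, 0)$). We approximate the scaling factors by the sample allele frequencies, so the recovery of $\pmb{\lambda}$ is only approximate:

```{r}
f.hat = colMeans(X)/2                #sample allele 1 frequencies of the two SNPs
scale.f = sqrt(2*f.hat*(1 - f.hat))  #scaling factors of the two SNPs
b.marg = apply(X, 2, function(x){summary(lm(y ~ x))$coeff[2,1]}) #marginal allelic betas
R.hat = cor(X)                       #LD matrix estimated from the genotypes
lambda.hat = solve(R.hat, b.marg * scale.f) / scale.f #R^{-1} beta*, back on the allelic scale
rbind(marginal = b.marg, from.LD.matrix = lambda.hat) #compare to lambda = (0.2, 0)
```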

Let's try the first one with the settings of our previous example, 
where we had MAFs 0.2 and 0.4, 
the causal effects $\pmb{\lambda} = (0.2, 0)$ 
and $r_{12}=0.6$ in the population.
Let's fit both the marginal models and
the joint model and plot the estimates of $\beta$s and $\lambda$s.

```{r, fig.height = 5, fig.width = 9}
n.iter = 500
n = 1000
r = 0.6
mafs = c(0.2, 0.4)
lambda = c(0.2, 0) #causal effects of each SNP
res = matrix(NA, ncol = 12, nrow = n.iter) 
colnames(res) = c("beta1","SE.b1","pval.b1","beta2","SE.b2","pval.b2",
                  "lambda1","SE.l1","pval.l1", "lambda2","SE.l2","pval.l2") 
for(iter in 1:n.iter){
  X = geno.2loci(n, r, mafs, return.geno = TRUE) #generate 2-locus genotypes
  y = X %*% lambda + rnorm(n, 0, sqrt(1 - var(X %*% lambda)) ) #var(y) = 1, 
  res[iter, 1:3] = summary(lm(y ~ X[,1]))$coeff[2, c(1,2,4)] #collect beta, SE.b1, P.b1 of SNP1
  res[iter, 4:6] = summary(lm(y ~ X[,2]))$coeff[2, c(1,2,4)] #collect beta, SE.b1, P.b1 of SNP2
  joint.coeff = summary(lm(y ~ X))$coeff
  res[iter, 7:9] = joint.coeff[2, c(1,2,4)] #collect lambda1, SE.l1, P.l1 of SNP1 from joint model
  res[iter, 10:12] = joint.coeff[3, c(1,2,4)] #collect lambda2, SE.l2, P.l2 of SNP2 from joint model
}

par(mfrow = c(1,2))
#1st row (2 panels):
#Plot marginal effects as separate distributions
#Add blue lines to the true causal effects lambda
#Add black line to the expected marginal effects of beta2: 
#r*sqrt(maf1*(1-maf1)/maf2/(1-maf2))*lambda1 
for(ii in 1:2){
  eff = c("beta","lambda")[ii]
  title.txt = c("marginal model", "joint model")
  plot(density(res[,paste0(eff,"1")]), xlab = eff, col = "black", 
       main = title.txt[ii], ylim = c(0,9), xlim = c(-0.2,0.5))
  lines(density(res[,paste0(eff,"2")]), col="red")
  abline(v = lambda[1], lty = 2, col = "blue")
  abline(v = lambda[2], lty = 2, col = "blue")
  abline(v = lambda[1]*r*sqrt(mafs[1]*(1-mafs[1])/mafs[2]/(1-mafs[2])), lty = 2, col = "black")
  legend("topright", col = c("black","red"), lwd = 1, legend = c(1,2) )
}
```

We see that the estimates from the joint model
are centered on the true causal effects (blue lines, right panel), 
whereas the marginal effect of the non-causal SNP2 is estimating the value $r_{12}\,\lambda_1 \sqrt{\frac{f_1(1-f_1)}{f_2(1-f_2)}}$ (black line, left panel).

It also seems that the distributions of estimates are wider for $\lambda$s than for
$\beta$s. This is because when the two variables are correlated,
the regression model tries to figure out how to divide the total effect between the two
correlated variables, 
which increases uncertainty about how large the effect should be 
*for any one of them separately*. This uncertainty becomes larger the higher
the magnitude of the correlation. At the extreme, for two perfectly correlated variables,
there is no way to separate their effects statistically; the standard regression model
breaks down if you try to include perfectly correlated variables in the model.

The price to pay for an unbiased estimate of the causal effect by the joint model
is an increased uncertainty of the effect estimate compared to the marginal model.
It can be shown that
$$\textrm{SE}\left(\widehat{\lambda_1}\right) = \frac{\sigma_\varepsilon}{\sqrt{2n\left(1-r_{12}^2\right)f_1(1-f_1)}} \approx \frac{\textrm{SE}\left(\widehat{\beta_1}\right)}{\sqrt{\left(1-r_{12}^2\right)}}.$$
Thus, when the SNPs are independent ($r_{12}=0$), the precision of the $\widehat{\lambda}$s is the
same as for the $\widehat{\beta}$s, and when the correlation approaches $\pm 1$, the joint model becomes
more and more unsure how the joint effect should be split between the two candidates, and
the SE of the $\widehat{\lambda}$s grows larger and larger.
In practice, it is not useful to include very highly
correlated variables in a regression model simultaneously 
because their effects cannot be
separated from each other (except with extremely large amounts of data).
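
We can check this SE inflation factor from the simulations above: with $r_{12} = 0.6$, the predicted ratio is $1/\sqrt{1 - 0.36} = 1.25$.

```{r}
c(observed = mean(res[,"SE.l1"] / res[,"SE.b1"]), predicted = 1/sqrt(1 - r^2))
```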

It follows that if we look at the P-values from the Wald statistics,
then we would expect, on average, lower P-values for $\beta_1$ than for $\lambda_1$:
both estimators are centered on the correct causal effect of 0.2,
but the SE of $\widehat{\beta}_1$ is lower than the SE of $\widehat{\lambda}_1$.

This is not a problem in a standard GWAS, where we first identify the
regions where variants show high association based on the marginal models
($\widehat{\beta}$s and their P-values),
and then we turn to a more detailed analysis of those regions using some
version of a joint model.

A joint (linear or logistic) regression model of several SNPs
gives unbiased estimates
of the causal effects of the SNPs, assuming that ALL causal variants of the region
are included in the model.
This is a nice property, but at the same time
these standard regression models
are not suitable for large-scale variable selection
among thousands of candidate variables, and they do not handle
highly correlated variables well.
Therefore, we can only use them for joint models of a handful of variants at a time.
Next, we will look at how this is typically done for a GWAS region
by stepwise forward selection.
Later, we will see how
more recent fine-mapping models combine 
sparse causal models with efficient computation.


### 7.4 Stepwise forward selection 

Let's aim to reduce the set of $p$ SNPs in the region into a
(much) smaller subset $S$ of SNPs that can already statistically explain the association
pattern that we see in the region. Technically, this means that we want to find 
as small a subset $S$ as possible for which the joint regression model 
$$Y \sim \pmb{X}_S$$ explains the data $Y$ already so well that any extended model 
$$Y \sim \pmb{X}_S + X_a,$$
with one (or more) additional SNP(s) $a \notin S$ included,
would not lead to a statistically better fit to the observed data.
(Here $\pmb{X}_S$ is a matrix whose columns contain the genotypes of the SNPs 
included in set $S$.)

There could be many versions of such a set $S$, and here we start by trying 
to find one of them. 

We will demonstrate this with two data sets generated using the *APOE* region data
on 186 Finnish haplotypes from the 1000 Genomes project,
and we call these data sets `APOE.1` and `APOE.2`.

**Example 7.4: APOE.1 data.** 
Let's generate 1000 individuals by sampling their haplotype pairs from the reference pool
with replacement and adding them up to get individual-level genotypes. This reflects what
we would observe from genotyping.

```{r}
path = "https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/material/APOE_1000G_FIN_74SNPS."
haps = as.matrix(read.table(paste0(path,"txt")))
dim(haps) #rows haplotypes, cols SNPs
n = 1000
X = haps[sample(1:nrow(haps), size = n, repl = TRUE),] + 
    haps[sample(1:nrow(haps), size = n, repl = TRUE),] #n individ.
```

Let's then choose three causal variants: 7, 37 and 51, 
and give them all a causal effect of
$\lambda = 0.3$. Let's plot the causal and marginal effects for the region.

```{r,fig.height = 4, fig.width = 9}
c.ind.1 = c(7, 37, 51) # causal SNP indexes for APOE.1 data
p = ncol(X)
lambda = rep(0, p)
lambda[c.ind.1] = 0.3
f = colSums(X)/2/nrow(X) # allele 1 freqs
f.sc = sqrt(2*f*(1 - f)) #scaling factor for genotypes
R = cor(X) # R is computed from genotypes
beta = (R %*% (lambda * f.sc)) / f.sc # from lambdas to betas
par(mfrow = c(1,2))
par(mar = c(4.5, 4.5, 2, 0.5))
plot(lambda, xlab = "SNP", ylab = expression(lambda), col = "dodgerblue", 
     pch = 19, ylim = c(-0.5, 0.5), main = "causal effects")
plot(beta, xlab = "SNP", ylab = expression(beta), col = "red", 
     pch = 1, ylim = c(-0.5, 0.5), main = "marginal effects")
```

We see that many marginal effects are larger than any of the causal effects because
some SNPs are in high LD with several causal variants, and therefore their
marginal effects are weighted sums of the causals.
In particular, some non-causal SNPs have larger marginal effects than some causals,
and therefore it could happen that the top SNP in this region is not a causal variant.
Let's generate a trait with variance 1, do a standard marginal GWAS model, 
and highlight the causal variants in the Manhattan plot of this region.

```{r,fig.height = 4, fig.width = 9}
set.seed(18)
y.1 = X %*% lambda + rnorm(n, 0, sqrt(1 - var(X %*% lambda)) ) #trait for APOE.1 data 
res.1 = apply(X, 2, function(x){summary(lm(y.1 ~ x))$coeff[2,c(1,2,4)]}) #3 rows: beta,SE,pval
par(mar = c(4.5, 4.5, 2, 1))
plot(-log10(res.1[3,]), xlab = "SNP", ylab = "-log10 P", main="APOE.1 marginal analysis")
points(c.ind.1, -log10(res.1[3,c.ind.1]), cex = 2, lwd = 1.4, col = "dodgerblue") #highlight causals
```

The top variant is one of the causals (37) but some non-causal variants are 
also nearby with similar P-values.
Could we fit a joint model for all the variants in the region and get a clearer picture
of the causal ones?

```{r, fig.height = 4, fig.width = 9}
joint.coeff = summary(lm(y.1 ~ X))$coeff[-1, c(1,2,4)] 
#joint model isn't able to return results for all SNPs
nrow(joint.coeff) # how many coefficients did we get out of 74?
par(mar = c(4.5, 4.5, 2, 1))
plot(-log10(joint.coeff[,3]), xlab = "SNP", ylab = "-log10 P", 
     main = "APOE.1 joint analysis")
ring.i = which(row.names(joint.coeff) %in% paste0("XV",c.ind.1)) # indexes of the causals present 
points(ring.i, -log10(joint.coeff[ring.i,3]), cex = 2, lwd = 1.4, col = "dodgerblue")
```

Unfortunately, the standard linear regression is unable to compute statistics
for 16 out of 74 variants
because of high correlations -- we also miss one of the causals --
and the joint model gives nearly useless
results for the remaining SNPs as the effects are split between all correlated
SNPs simultaneously, whereas in reality there are only three causal variants here.
Technically, the joint model would work if we had a large enough sample size,
but for many SNPs with high correlations, those sample sizes would need
to be unrealistically large.


```{r, eval = F, include = F}
#Write files for FINEMAP
data.name = "APOE_1"
FM.path.1 = paste0("FINEMAP/FINEMAP_",data.name)
FM.path.2 = paste0("FINEMAP_",data.name)
write(R, ncol = ncol(R), file = paste0(FM.path.1,".ld"))
write.table(cbind(rsid = as.numeric(1:p), chromosome = rep(19,p), position = info[,2],
                  allele1 = info[,3], allele2 = info[,4], 
                  maf = apply( cbind(info[,5], 1-info[,5]), 1, min), 
                  beta = res.1[1,], se = res.1[2,]),
            file = paste0(FM.path.1,".z"), quote = F, row.names = F)
cat("z;ld;snp;config;cred;log;n_samples\n", file = paste0(FM.path.1,".master"))
cat(paste0(FM.path.2,c(".z;",".ld;",".snp;",".config;",".cred;",".log;"),collapse = ""),n,"\n",
    sep ="",file = paste0(FM.path.1,".master"), append = T)
```


A more robust way is to use **stepwise forward selection** (also called **iterative conditioning** or
**conditional analysis**) to iteratively build a joint model for a small subset $S$ of SNPs by
choosing one SNP at a time to be included in $S$.
This approach was made popular by [GCTA](https://cnsgenomics.com/software/gcta/#Overview)'s
Conditional & joint (COJO) analysis of GWAS results.
The algorithm works as follows:

```
initially S is empty
repeat until all P-values outside S are > threshold
  add SNP with the lowest P-value to S
  update P-values of all SNPs 'l' outside S using joint model Y ~ X.S + X.l
end repeat
```
After the algorithm finishes, 
$S$ contains a set of SNPs whose P-values in the joint model with their
predecessors in $S$ are below the given threshold, and there are no additional SNPs with this
property outside $S$.

The idea is that $S$ will contain only one of the SNPs that tag
the causals in a similar way, because when two such SNPs are in the model simultaneously,
they no longer have low P-values, which is
saying that the second one is not needed *after the first one is already in the model*.
This *conditioning* on the current contents of $S$ gives the procedure the
names of iterative conditioning and conditional analysis.
The name we use here, *stepwise forward selection*, means that we proceed *forwards* from the empty
model, step-by-step, where at each step one SNP can be selected for inclusion in $S$.
(*Backward selection* would start with all variables in the model and would proceed by eliminating
them one-by-one. This would not be a feasible strategy when there are
a lot of candidate variables available.)

On the other hand,
if there are two causal variants in the region, then the P-value of a SNP tagging primarily 
the second causal variant will remain low after we condition on the top-SNP that tags primarily 
the first causal signal. Hence, both SNPs may be included in $S$. 
Sometimes the P-value of the second SNP may even get much lower after the first SNP is included in
the model, if the two SNPs were masking each other in their marginal effects.
Slide 19 has an example of such a case from a [Parkinson's disease locus 4q22/*SNCA*](http://hmg.oxfordjournals.org/content/20/2/345.long).

Thus, when stepwise selection includes several variants in $S$, it suggests that either there are
multiple causal variants in the region, or at least that we need more than one variant to tag
a single causal variant in the region well, e.g., because the actual causal variant is not genotyped.

Let's write a function that does the stepwise selection 
and produces a plot of P-values after each step 

* showing the newly chosen SNP in red,

* showing the preceding SNPs in $S$ in gray,

* highlighting the true causals with blue.

```{r}
stepwise.fwd <- function(y, X, p.thresh = 1e-4, plot.path = TRUE, ring.i = NULL){
  #Stepwise forward selection where at most one variant is included in S at each iteration
  ## INPUT
  # y, quantitative trait for n individuals
  # X, genotypes for n individuals x p SNPs
  # p.thresh, P-value threshold that is used for deciding whether to include variant in S
  # plot.path, TRUE / FALSE, whether to produce a plot of every step of the algorithm
  # ring.i, set of indexes of SNPs that should always be highlighted in plots by a ring
  ## OUTPUT
  # chosen.i, indexes of chosen variants in S, 
  # chosen.p, the P-values at the iteration when each chosen variant was chosen, 
  # next.p, the smallest P-value left outside S when finished.
  
  p = ncol(X) #number of SNPs
  col.new = "red"
  col.old = "gray"
  col.ring = "dodgerblue"
  pval = apply(X, 2, function(x){summary(lm(y ~ x))$coeff[2,4]}) #start from marginal P-values
  cols = rep("black", p)
  pchs = rep(1, p)
  chosen.i = c() #collect here the chosen SNPs
  chosen.p = c() #collect here the P-values from the iteration where the choice was made
  if(plot.path) {
    par(mfrow = c(1, 3))
    par(mar = c(1, 2, 4, 0.1))
    txt = "Initial state"
    plot(-log10(pval), xaxt = "n", xlab = "", ylab = "-log10P", 
         main = txt, col = cols)
    if(!is.null(ring.i)) points(ring.i, -log10(pval[ring.i]), cex = 2, lwd = 1.4, col = col.ring) # highlight
    abline(h = -log10(p.thresh), lty = 2, col = "red")
  }
  ii = 0 #iteration index
  while( min(pval) < p.thresh ){ #continue as long as something is below the threshold
    ii = ii + 1 
    chosen.i[ii] = which.min(pval)[1] # add 1 SNP with min P-val to the chosen ones...
    chosen.p[ii] = pval[chosen.i[ii]] #... and store its P-value at this iteration.
    #test other SNPs except the already chosen ones -- and include the chosen ones as covariates
    tmp = apply(X[,-chosen.i], 2, function(x){summary(lm(y ~ x + X[,chosen.i]))$coeff[2,4]}) 
    pval[-chosen.i] = tmp # we have P-values for other SNPs except already chosen ones in 'tmp'
    if(plot.path) {
      pval[chosen.i] = chosen.p # add causals with P-values temporarily for plotting
      cols[chosen.i] = col.old # earlier chosen remain gray --
      cols[chosen.i[ii]] = col.new # -- the one newly chosen will be red
      pchs[chosen.i[ii]] = 19 #solid for chosen ones
      txt = paste0(ii,": chose ",chosen.i[ii])
      plot(-log10(pval), xaxt = "n", xlab = "", ylab = "-log10P", 
           main = txt, col = cols, pch = pchs)
      if(!is.null(ring.i)) points(ring.i,-log10(pval[ring.i]), cex = 2, lwd = 1.4, col = col.ring)
      abline(h = -log10(p.thresh), lty = 2, col = "red")
    }
    pval[chosen.i] = 1 #mark chosen ones by P-value = 1 for the while-loop condition
  }
  return(list(chosen.i = chosen.i, chosen.p = chosen.p, next.p = min(pval)))
}
```

We use a fairly liberal threshold of $P=$1e-4 here for two reasons:

1. We want the set $S$ to fully explain the association pattern in the region so that 
no peaks of P-values remain.

2. When we know that there is a robust GWS association in the region, 
then the variants in the region
have a higher prior probability to be associated than random variants in the genome.

There is no consensus threshold for stepwise conditioning.
Sometimes it is run at the GWS level of 5e-8,
and sometimes it is continued until P > 0.05.
Luckily, the path of the algorithm doesn't depend on the threshold used,
so one can also cut the path later at a more stringent threshold as needed for any particular application.

```{r,fig.height = 3, fig.width = 9}
step.1 = stepwise.fwd(y.1, X, p.thresh = 1e-4, plot.path = TRUE, ring.i = c.ind.1)
```

Here, the stepwise selection worked well in the sense that it managed to pick exactly the 
three causal variants. Note also how the P-values of the 
true causal variants 7 and 51 get smaller 
in steps 1 and 2, respectively, compared to the P-values from the previous iteration.
This shows how conditioning on the other causal variants can reveal the true causal effects more clearly.

**Example 7.5: APOE.2 data.** 
Stepwise forward selection doesn't always work quite so nicely:

```{r,fig.height=3, fig.width=9}
set.seed(20)
c.ind.2 = c(23,36,51)
lambda = rep(0, p)
lambda[c.ind.2] = 0.4
y.2 = X %*% lambda + rnorm(n, 0, sqrt(1-var(X %*% lambda)) )
res.2 = apply(X, 2, function(x){summary(lm(y.2 ~ x))$coeff[2,c(1,2,4)]})
step.2 = stepwise.fwd(y.2, X, p.thresh = 1e-4, plot.path = TRUE, ring.i = c.ind.2)
```

Let's see how the chosen variants 20 and 66 are related to the causal ones 23, 36 and 51:

```{r}
ind = c(step.2$chosen.i, c.ind.2)
R[ind,ind]
```

It seems that none of the causal variants is very highly tagged by either of 
the chosen ones, but all 
causal variants show considerable correlation with the chosen ones. 
Let's check whether the chosen
set explains the phenotype similarly to the causal variants.

```{r}
lm.chosen = lm( y.2 ~ X[,step.2$chosen.i]) #chosen SNPs 20 and 66
summary(lm.chosen)
lm.causal = lm( y.2 ~ X[,c.ind.2]) #causal SNPs 23, 36, 51
summary(lm.causal)
```

The chosen variants explain about 12.4% (R^2-adj.) of the trait variance whereas the three causals
together explain about 15.0%. 
Note that we compare the adjusted-R^2 measure, which attempts to take into account the
number of parameters in the models, whereas the multiple-R^2 tends to always favor the model with more predictors.
Another way to compare models with possibly different numbers of parameters is 
the [Bayesian Information Criterion](https://en.wikipedia.org/wiki/Bayesian_information_criterion) 
(BIC) that penalizes models for additional parameters.
In short, a lower value of BIC suggests a better model.
```{r}
cbind(chosen = BIC(lm.chosen), causal = BIC(lm.causal))
```

We see that also BIC favors the true causal model over the one chosen by the stepwise selection
(23 units in BIC is a clear difference).

What if we put all 5 variants in the same model?
```{r}
lm.joint = lm( y.2 ~ X[,ind]) #all 5 SNPs
summary(lm.joint)
```

We see that, in this joint model, 
the two non-causals (20,66) picked by the conditional analysis are 
not needed in the model once we have the true causal variants there. 
(This model explains 15.1% while the model with 3 causals already explained 15.0%.)


We conclude that even though the stepwise selection picked a model that had a reasonable explanatory power,
the model comparison shows that the chosen model has a statistically worse fit than the true causal model.

Let's next move to **fine-mapping** methods to see how we can improve 
the search and quantification of causal variants.

```{r, eval = F, include = F}
#Write files for FINEMAP
res = apply(X, 2, function(x){summary(lm(y.2 ~ x))$coeff[2,1:2]})
data.name = "APOE_2"
FM.path.1 = paste0("FINEMAP/FINEMAP_",data.name)
FM.path.2 = paste0("FINEMAP_",data.name)
write(R, ncol = ncol(R), file = paste0(FM.path.1,".ld"))
write.table(cbind(rsid = 1:p, chromosome = rep(19,p), position = info[,2],
                  allele1 = info[,3], allele2 = info[,4], 
                  maf = apply( cbind(info[,5], 1-info[,5]), 1, min), 
                  beta = res.2[1,], se = res.2[2,]),
            file = paste0(FM.path.1,".z"), quote = F, row.names = F)
cat("z;ld;snp;config;cred;log;n_samples\n", file = paste0(FM.path.1,".master"))
cat(paste0(FM.path.2,c(".z;",".ld;",".snp;",".config;",".cred;",".log;"),collapse = ""),n,"\n",
    sep ="",file = paste0(FM.path.1,".master"), append = T)
```


### 7.5 Fine-mapping

When we **fine-map** a GWAS region, our goal is to identify the set of causal variants $C$. 
Since high levels of LD among the candidate variants often make it impossible to give a definite answer
based on statistical evidence alone, our answers will be probabilistic.
In particular, we work with

* probability for variant $l$ to be one of the causal variants in the 
region, $p_l = \text{Pr}(l\in C \,|\, \textrm{Data})$. 
This is also called a **posterior inclusion probability (PIP)**, 
where *inclusion* refers to the inclusion of the variant in the causal model.

* probability for a set $S$ of variants to be 
the **causal set / causal configuration** of variants in the region, 
$p_S = \text{Pr}(S = C\, |\, \textrm{Data})$. 

* **credible sets** of causal variants, which have the property that,
at a fixed coverage level, such as 95% probability,
the credible set contains at least one of the causal variants, or, alternatively,
the credible set contains all of the causal variants;
the choice between these two definitions of the credible set depends on the context.

* the probability distribution of the number of causal variants in the region,
$1, 2, \ldots, K$, up to some maximum value $K$, typically $\leq 20$.

Note the difference between such quantities and the results from the stepwise selection, which 
resulted in only one set of variants but did not estimate how probable each of those
variants is to be a causal variant 
*given the LD patterns and the other candidate variants in the region*.
Even a variant with a very low marginal P-value, if it shares the signal with 10 highly correlated variants, 
will likely have a PIP $\leq 0.1$, and its credible set will contain all 10 of its
LD-friends. Neither of these properties can be read from the output of the stepwise selection. 
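
To make the connection between configuration probabilities and PIPs concrete, here is a toy sketch in R; the configurations and their probabilities are made up for illustration:
```{r}
#Hypothetical posterior over 3-SNP causal configurations:
configs = list(c(7,37,51), c(7,37,52), c(7,40,51), c(7,40,52))
config.prob = c(0.4, 0.3, 0.2, 0.1) #sums to 1
#PIP of variant l = sum of the probabilities of the configs that contain l
pip = function(l){sum(config.prob[sapply(configs, function(s){l %in% s})])}
sapply(c(7, 37, 40, 51, 52), pip)
#SNP 7 is in every config (PIP = 1) while 37 & 40 and 51 & 52 share the signal
```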

A review of fine-mapping methods is given by [Schaid et al.](https://www.nature.com/articles/s41576-018-0016-z).
A common theme is the use of a Bayesian model that allows quantifying the probabilities
of causality for each set and each variant under 
the assumptions that all causal variants 
are included in the analysis and that there are no interaction effects between the variants.

Slides 22-31 give some more details of the model behind the widely used 
[FINEMAP](http://www.christianbenner.com/) software, 
written by Christian Benner during his PhD work at FIMM, Helsinki. 
Another popular fine-mapping method is [SuSiE](https://stephenslab.github.io/susieR/).

Let's see what happens when FINEMAP
is run on our two data sets `APOE.1` and `APOE.2` 
where we observed varying success with the stepwise selection.

#### FINEMAP on APOE.1

FINEMAP gives three output files:

* **.snp** includes summary data (probabilities and causal effect estimates) for each SNP.

* **.config** includes a list of the most probable sets of causal variants 
with their probabilities and variance explained.

* **.cred3** includes the credible sets for 3 causal variants corresponding to the top-configuration in the .config file.

Let's read in FINEMAP's results for individual SNPs and print some columns.
```{r}
fm.path = "https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/material/FINEMAP/FINEMAP_APOE_"
fm.1.snp = read.table(paste0(fm.path,"1.snp"), as.is = T, header = T)
fm.1.snp[1,]
fm.1.snp[1:10,c(1,11,13:16)]
```

Here `prob` is the PIP, i.e., the probability that this is one of the causal variants,
`mean` is the estimated causal effect $\lambda$ (on the allelic scale) 
averaged over all possible causal sets, and `mean_incl` 
is the estimate of $\lambda$ over only those causal sets in which this SNP is included. 
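
If these columns follow the standard Bayesian model-averaging convention (the effect is set to 0 in configurations that exclude the SNP), then `mean` should approximately satisfy `mean = prob * mean_incl`; a quick check:
```{r}
#Posterior mean over all configs = PIP * posterior mean among including configs
head(cbind(mean = fm.1.snp$mean, prob_x_mean_incl = fm.1.snp$prob * fm.1.snp$mean_incl))
```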

The three causal variants were simulated at positions 7, 37 and 51.
The first is labelled as definitely causal with a PIP of ~100%. 
Variant 51 cannot be perfectly singled out from
the highly correlated variant 52 but gets a PIP of ~50%.
Variant 37 shares its probability with several other 
correlated variants and has a PIP of only 11%.
The causal effect estimates in `mean_incl` agree well with 
the true value of 0.3, given their 
uncertainty as measured by the posterior SD. 

Let's plot the marginal GWAS P-value and the FINEMAP's PIPs 
and highlight the true causal variants.
```{r,fig.height = 4, fig.width = 9}
par(mfrow = c(1,2))
par(mar = c(4.5,4.5,3,1))
plot(-log10(res.1[3,]), xlab = "SNP", ylab = "-log10 P", main = "MARGINAL GWAS APOE.1")
points(c.ind.1, -log10(res.1[3,c.ind.1]), cex = 2, lwd = 1.4, col = "dodgerblue")
plot(fm.1.snp$index, fm.1.snp$prob, xlab = "SNP", 
     ylab = "prob of causality",main = "FINEMAP APOE.1")
i = which(fm.1.snp$index %in% c.ind.1)
points(fm.1.snp[i,"index"], fm.1.snp[i,"prob"], cex = 2, lwd = 1.4, col = "dodgerblue")
```

FINEMAP has refined the associations as there are fewer variants with interesting 
PIPs after the fine-mapping than there are interesting associations among the marginal P-values.
Note how FINEMAP is most certain that variant 7 is a causal variant 
even though the marginal P-value
of variant 37 was smaller than that of variant 7. 
A likely reason for this difference 
is that 37 has more LD-friends with which 
it is forced to share some probability of being causal, 
whereas 7 is less correlated with the other variants 
and hence more clearly a causal variant itself.
Such information cannot be extracted from the marginal P-values.

The results in the .snp file are SNP-wise summaries over all candidate 
causal configurations that FINEMAP has evaluated. 
By default the .config file reports the top 50,000 of these configs.
Let's show a few top ones, with the posterior probabilities and the 
estimated proportion of variance explained by each configuration ($h^2$, heritability).

```{r}
fm.1.config = read.table(paste0(fm.path, "1.config"), as.is = TRUE, header = TRUE)
fm.1.config[1:10, c(2:4,8:9)]
```


Let's also see the credible sets given by FINEMAP.
```{r}
fm.1.cred = read.table(paste0(fm.path,"1.cred3"), as.is = TRUE, header = TRUE)
fm.1.cred[,-1]
```

For the credible sets (with 3 causal variants), 
FINEMAP takes the top 3-SNP causal configuration (here 7,51,54) 
and then asks, for each variant in the top configuration, 
which other candidate variants could plausibly replace that variant in the causal configuration.

For example, the column `cred_1` shows that 
the credible set $A_1 = \{7\}$ includes only SNP 7
as it already covers 0.99999 of the posterior probability 
of being in the causal set with 51 and 54.
In other words, the sum of 
the probabilities of any other variant replacing 7 is < 0.00001 
in this causal set that additionally contains SNPs 51 and 54. 

For the second credible set, $A_{2}=\{51,52\}$, 
the correct causal variant 51 is the best candidate
but there is also another candidate, 52, that has a slightly smaller
probability than 51 to be the second causal variant.

For the third credible set, $A_{3}$, the picture is less clear and we need
16 variants before we cover 95% of the probability of including a causal variant in $A_{3}$. 
The true causal variant, 37, is the third variant in the list.

A credible set collects the most likely causal variants 
and is a natural place to search for known functional variants that could be responsible
for the association signal.
If the credible set is small, it can provide a useful 
list of candidate variants to be taken 
further to experimental studies that could then identify 
the causal variant among the candidates via its biological function.
The size of the credible set depends 
mainly on the GWAS sample size and the LD structure: 
a larger sample size and lower levels of LD lead to smaller credible sets.

#### FINEMAP on APOE.2
 
In the APOE.2 data, the causal variants were 23, 36 and 51 but, due to LD, 
many other variants had much smaller P-values than the true causals.
The conditional analysis chose 20 and 66, 
which gave a clearly worse fit than the true causal set. 
Let's see FINEMAP's results. (Suppressing the plotting commands from the document.)

```{r,fig.height = 4, fig.width = 9, echo = F}
fm.2.snp = read.table(paste0(fm.path,"2.snp"), as.is = TRUE, header = TRUE)
#fm.2.snp[1:10,c(1,11,17:20)]
pchs = rep(1,p)
par(mfrow = c(1,2))
par(mar = c(4.5,4.5,3,1))
plot(-log10(res.2[3,]), xlab = "SNP", ylab = "-log10 P", pch = pchs, main = "MARGINAL GWAS APOE.2")
points(c.ind.2, -log10(res.2[3,c.ind.2]), cex = 2, lwd = 1.4, col = "dodgerblue")
plot(fm.2.snp$index, fm.2.snp$prob, xlab = "SNP", ylab = "prob of causality", 
     pch = pchs, main = "FINEMAP APOE.2", ylim = c(0,1))
i = which(fm.2.snp$index %in% c.ind.2)
points(fm.2.snp[i,"index"], fm.2.snp[i,"prob"], cex = 2, lwd = 1.4, col = "dodgerblue")
```

We see that FINEMAP gives PIPs between 20% and 50% to the true causals, 
and that two non-causals get even higher PIPs. 
Even though the true causals do not have 
the highest PIPs here, the FINEMAP results still capture 
the true causal variants more reliably than the marginal P-values.

Let's check the top 15 configs.

```{r, echo = F}
fm.2.config = read.table(paste0(fm.path,"2.config"), as.is = TRUE, header = TRUE)
fm.2.config[1:15,c(2:4,8:9)]
```

The top FINEMAP config is (24,42,51) whereas the true causal config (23,36,51)
has rank 12. Let's compare the statistical evidence between the top FINEMAP config,
the true causal config and the conditional analysis config (20,66).

```{r}
lm.fm = lm(y.2 ~ X[,c(24,42,51)]) 
lm.causal = lm(y.2 ~ X[,c(23,36,51)]) 
lm.step = lm(y.2 ~ X[,c(20,66)]) 
data.frame(finemap = BIC(lm.fm), true_causal = BIC(lm.causal), stepwise = BIC(lm.step))
```


Here the top config from FINEMAP happens to describe 
the data slightly better than the true causal config, which is a result of statistical sampling 
effects in a finite data set. We also see that, compared to FINEMAP, the conditional analysis 
has failed badly here in terms of explaining the data (a difference of 26 BIC units).

We conclude that FINEMAP has made an efficient search through the space of configurations
and identified a config with more statistical evidence than the true causal config.
Given the data available, we cannot expect to get better fine-mapping results based on
the statistical evidence alone.


#### 7.5.1 Fine-mapping assuming one causal variant

Sometimes we do not have access to the original GWAS data and therefore
we cannot accurately estimate LD that corresponds to the GWAS summary statistics.
This means that we cannot do complete fine-mapping with FINEMAP or other similar methods 
that require LD information.
In these cases, it is still possible to compute, for each variant in the region,
the posterior probability of causality under the assumption that **there is at most
one causal variant in the region and it is included in the study**.
The approach was introduced by [Maller et al. 2012](https://www.nature.com/articles/ng.2435).

Denote by $H_l,\, l=1,\ldots,p,$ the hypothesis that variant $l$ is 
the only causal variant, and let
$H_0$ be the null hypothesis that there is no causal variant in the region.
Denote by $H_C = H_1 \cup \ldots \cup H_p$ the hypothesis that there is exactly
one causal variant in the region.
We assume that, *a priori*, each variant is equally likely to be the causal variant, i.e., 
$\text{Pr}(H_l\,|\,H_C) = \frac{1}{p}$. 

It follows that the regional Bayes factor is
$$\textrm{BF}_{\textrm{reg}} = \frac{\text{Pr}(\textrm{Data}\, | \, H_C)}{\text{Pr}(\textrm{Data}\, | \, H_0)}=
\frac{\sum_{l=1}^p \text{Pr}(\textrm{Data}\, | \, H_l)\, \text{Pr}(H_l\, | \, H_C)}{\text{Pr}(\textrm{Data}\, | \, H_0)}=
\frac{\sum_{l=1}^p \frac{1}{p} \text{Pr}(\textrm{Data}\, | \, H_l)}{\text{Pr}(\textrm{Data}\, | \, H_0)}=
\frac{1}{p} \sum_{l=1}^p \textrm{BF}_l,$$
where $\textrm{BF}_l$ is the single variant Bayes factor that does not depend 
on other variants or on the LD structure of the region. 
In practice, it can be computed as an Approximate Bayes Factor (ABF)
from GWAS summary statistics as was explained in Section 4 of the course material.
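
For reference, this is the form used in the code of Example 7.6 below: with the GWAS effect estimate $\widehat{\beta}_l$, its standard error $s_l$, and a Normal prior $N(0, \tau^2)$ for the effect size,
$$\textrm{ABF}_l = \frac{N(\widehat{\beta}_l;\, 0,\, \tau^2 + s_l^2)}{N(\widehat{\beta}_l;\, 0,\, s_l^2)},$$
where $N(x;\, 0,\, v)$ denotes the Normal density with mean 0 and variance $v$ evaluated at $x$; this ratio is exactly what the `dnorm` difference on the log scale computes in the example.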

The posterior probability for variant $l$ is then, by Bayes' formula, 
$$\text{Pr}(H_l \, | \, \textrm{Data}, H_C) = \frac{\text{Pr}(H_l\, | \, H_C) \text{Pr}(\textrm{Data} \, | \, H_l)}
{\text{Pr}(\textrm{Data} \, | \, H_C)} = 
\frac{1}{p} 
\frac{\frac{\text{Pr}(\textrm{Data} \, | \, H_l)}{\text{Pr}(\textrm{Data}\, | \, H_0)}}
{\frac{\text{Pr}(\textrm{Data} \, | \, H_C)}{\text{Pr}(\textrm{Data}\, | \, H_0)}} = 
\frac{1}{p} \frac{\textrm{BF}_l}{\textrm{BF}_{\textrm{reg}}}
=\frac{\textrm{BF}_l}{\sum_{k=1}^p \textrm{BF}_k} \propto  \textrm{BF}_l.$$
Thus, the posterior probability that SNP $l$ is the single causal variant in the region,
under the assumption that there is exactly one causal variant, is proportional
to the single-SNP Bayes factor of SNP $l$.
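
For example, if the region had only $p = 3$ variants with $\textrm{BF}_1 = 8$, $\textrm{BF}_2 = 4$ and
$\textrm{BF}_3 = 1$, then the posterior probabilities would be $8/13 \approx 0.62$, 
$4/13 \approx 0.31$ and $1/13 \approx 0.08$.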

In practice, we get the posterior probabilities by computing ABF for each variant and 
then normalizing them to sum to one.
This derivation is independent of LD in the region because, under the assumption of only a
single causal SNP, the marginal effect at the causal SNP is the causal effect itself, independent
of the LD with other variants.
With strongly associated but also highly correlated SNPs, 
we are left with the result that any one of them could be the causal variant. 
This will manifest through high single-SNP and regional BFs, 
but with the posterior probability distributed fairly evenly 
across the highly correlated SNPs.

We have assumed that the causal SNP is included in the study. When this is not true,
the above approach is still applicable if there is a good surrogate SNP for the causal effect.
In the presence of multiple causal SNPs, this approach is no longer recommended. 
It will tend to pick up the SNP with the best marginal effect, 
which may or may not be one of the causal SNPs.


**Example 7.6: APOE.3 data.** Let's make yet another phenotype data set for 
the *APOE* region genotype data that we currently 
have in `X` matrix. Now we include only one causal variant, SNP 42. 

```{r, fig.width = 9, fig.height = 4}
set.seed(6)
c.ind = c(42)
lambda = rep(0, p)
lambda[c.ind] = 0.3
y.3 = X %*% lambda + rnorm(n, 0, sqrt(1 - var(X %*% lambda)) )
#Data has been generated. Now run a GWAS one variant at a time.
res.3 = apply(X, 2, function(x){summary(lm(y.3 ~ x))$coeff[2, c(1,2,4)]}) 
#Compute single-SNP ABFs using the GWAS summary statistics
tau = 0.2 #prior SD of effect size
b.est = res.3[1,]
se = res.3[2,]
log.abf = dnorm(b.est, 0 , sqrt(tau^2 + se^2), log = T) - dnorm(b.est, 0, se, log = T)
abf = exp(log.abf)
#Normalize ABFs to get PIPs of the variants assuming that there is exactly one causal variant
posterior = abf/sum(abf)
#Let's see which variants have probability > 1%
ind = which(posterior > 0.01)
ind = ind[order(posterior[ind], decreasing = T)] #Order them from largest to smallest
data.frame(SNP = ind, prob = posterior[ind], Pvalue = res.3[3,ind])
#Let's plot the -log10 P-values and posterior probabilities
par(mfrow = c(1,2))
cols = rep("gray", p)
cols[ind] = "indianred2"
plot(-log10(res.3[3,]), xlab = "SNP", ylab = "-log10(P-value)", main = "APOE.3",
     cex = 0.7, col = cols, pch = 19)
plot(posterior, xlab = "SNP", ylab = "posterior probability", main = "", 
     cex = 0.7, col = cols, pch = 19)
```

We see that the three SNPs with the smallest P-values also have the largest posterior probabilities
of being causal. SNP 42 (78%) has about 4 times higher probability than SNP 36 (18%)
and about 50 times higher probability than SNP 30 (1.6%) of being the causal variant in the region 
(when we assume that there is exactly one causal variant).

To determine the 95% credible set, we would include the SNPs with the highest posterior probabilities
until their cumulative sum exceeds 95%. Here this set would contain only
SNPs 42 and 36, as their sum is already 78% + 18% = 96%.
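
A minimal sketch of this computation on the `posterior` vector from above:
```{r}
#Order SNPs by decreasing posterior and keep the smallest set whose
#cumulative probability exceeds 0.95
ord = order(posterior, decreasing = TRUE)
cred.95 = ord[1:which(cumsum(posterior[ord]) > 0.95)[1]]
data.frame(SNP = cred.95, prob = posterior[cred.95])
```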

If we had several conditionally independent association signals in the region,
as returned, for example, by the stepwise forward search, 
then we could fine-map each of them separately.
For this, we would compute the association statistics 
(beta and SE) by including the SNPs representing
the other signals in the region as covariates,
and then carry out the fine-mapping for each signal separately, 
as above, by assuming that only one
causal variant remains after conditioning on the other signals in the region. 
(In Exercise 5.2 we do that for one signal conditioned on two other signals.)
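
A minimal sketch of that procedure, where `s1` and `s2` are hypothetical index SNPs for the other two signals (placeholders for illustration, not estimated from our data):
```{r, eval = F}
s1 = 10; s2 = 60 #hypothetical index SNPs of the other signals
#GWAS of each variant with the other signals' index SNPs as covariates:
res.cond = apply(X, 2, function(x){
  summary(lm(y.3 ~ x + X[,s1] + X[,s2]))$coeff[2, 1:2]})
#Single-SNP ABFs from the conditional summary statistics, as above:
b.est = res.cond[1,]
se = res.cond[2,]
log.abf = dnorm(b.est, 0, sqrt(tau^2 + se^2), log = T) - dnorm(b.est, 0, se, log = T)
posterior.cond = exp(log.abf) / sum(exp(log.abf)) #PIPs for this signal
```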