The slide set referred to in this document is “GWAS 7”.

### 7.1 Haplotypes

Terms

• ploidy, the number of copies of the genome in the cell nucleus.
• haploid, one copy of the genome; germ cells: sperm and egg are haploid cell types.
• diploid, two copies of the genome; other human cell types with nucleus than germ cells are diploid.
• haplotype, haploid genotype, in diploid cell/individual, a genome inherited from one parent.
• diploid genotype, a complete set of two genomes of a diploid cell/individual without additional haplotype information.
• diplotype, diploid genotype with haplotype information; a pair of haplotypes for a diploid genotype.

We know that chromosomes are inherited from the parents as recombined segments of grandparents’ chromosomes (slide 2). From a genotyping array we observed (diploid) genotypes at each locus, and we can’t always be certain which are the two underlying haplotypes (slide 3).

Consider two SNPs 1 and 2 on the same chromosome, with alleles A and C at SNP1 and G and T at SNP2. If we observe an individual whose genotype is SNP1 = A/A, SNP2 = G/T, then her haplotypes are definitely {A-G, A-T}. That is, the two chromosomes that she has inherited from her two parents read A-G and A-T. But if we instead observe genotype SNP1 = A/C, SNP2 = G/T, then the haplotypes are not determined without external information: they can be either {A-G, C-T} or {A-T, C-G}. The process of figuring out the haplotypes at multiple loci from the (unordered) genotypes is called haplotype phasing. Historically, the phase was determined from pedigree data where parents’ genotypes inform about possible offspring haplotypes. Also experimental methods for phasing exist (given enough time and money). More often, nowadays, phasing is done in large population samples by using computational methods and the result is probabilistic. A review on phasing by Browning & Browning (2011).

How can we phase genotypes into haplotypes? When we consider genotypes at $$p$$ SNPs where the individual is heterozygous, then he/she has $$2^p$$ possible haplotypes and $$2^{p-1}$$ possible phasings (as one haplotype already determines the other haplotype given the genotypes). These numbers become huge already for tens of SNPs. However, empirical data started to show from the early 2000s, that when many individiduals were genotyped at SNPs residing near each other in the genome, the observed genotypes suggested that only a few ($$\ll 2^p$$) different haplotypes were present in the population in any one region of the genome.

Example 7.1. (Slide 4) Go to LDlink that gives access to the haplotype data from the 1000 Genomes project. Choose LDhap tab and input the following list of 9 SNPs that come from a 2500 bps long region on chr 1:

rs115037027
rs12409788
rs1576517
rs151240271
rs12752436
rs76864380
rs6586443
rs35213023
rs34910942

• Choose first CHS (Southern Han Chinese) population and press Calculate. It shows that only two of these SNPs have MAF>5% and four are monomorphic in the Table shown, i.e., show only one allele in these data. There are only 4 haplotypes present in the population out of theoretically possible $$2^5=32$$.

• Change population to FIN. There are 5 haplotypes present and one SNP is monomorphic. Compare SNPs 1 and 6. They have the same pattern of variation across the haplotypes: Whenever one has T also the other has T. They are completely associated
with each other and, consequently, in a GWAS they would give exactly the same results. Also triple 5, 8 and 9 is similarly fully correlated.

• Change population to LWK (Kenya). Now we see 8 haplotypes, no monomorphic SNPs and only the SNPs 5 and 8 are completely linked with each other.

This example shows how some variants in a narrow region of 2500 bps are highly correlated with their neighbors when it comes to the haplotype patterns that exist in any one population. Typically, African populations show more diversity than Asian and European populations due to more recent genetic bottlenecks that have reduced variation in the history of the latter two. But also in African populations, the haplotype patterns are very much restricted compared to what they would be if the alleles at nearby loci would be anywhere close to independence at the population level.

The human genome has a haplotype block structure (slide 5), which suggests that the recombination process shuffles the chromosomes mainly at only a relatively small number of sites on the genome, so called recombination hotspots. Between these hotspots, there are larger regions of genome with low recombination rates (slide 6). Consequently,

• Phasing, especially in the low recombination areas, can be done accurately by collecting a large number of genotyped individuals and using a probability model to estimate which are the most likely haplotypes in this population. Since only a few haplotypes exist in the population, we can get a fairly reliable phasing for individuals by sharing haplotype information among them, and we simultaneously also get accurate estimates for the population frequencies of the haplotypes. Existing high quality reference haplotype data from the target population also improves phasing. Phasing software: Beagle5, Eagle2, SHAPEIT3.

• SNPs in the same haplotype block tag each other well because each allele of a particular SNP sits on only a limited number of haplotypes in that block. In particular, by typing the set of the most informative tag-SNPs, we can already cover also a lot of neighboring genetic variation that we don’t directly genotype. This was a property that boosted the first GWAS: in order to see statistical associations with phenotypes, it is enough that we manage to genotype any tag-SNP of a causal SNP.

• The exact causal SNPs (or other variants) are difficult to pinpoint based on the statistical information alone since so many SNPs are so highly correlated with each other. This is the fine-mapping problem.

We know from empirical data that the human genome has a haplotype block structure. This correlation structure arises because when a new variant (assume to be SNV) is introduced to the population as a mutation, it emerges on one of the existing haplotypes of the population. Because haplotypes are inherited as large segments of grandparental haplotypes, with often just 1 or 2 recombination events per chromosome, the new mutation will be, for many subsequent generations, passed on to the next generation with the same alleles that were close by to it on its orginal haplotype background. However, as more generations pass, recombinations cut down the original haplotype background to smaller and smaller shared segments between the descendants of the original haplotype, and hence the correlation at the population level between the variant and its neighbors tends to decrease over time (slides 8-9).

Consider any two SNPs at distinct sites of the genome. Suppose that we had perfect haplotype information. Then we could observe how often their alleles 1 are on the same haplotype in the population and compare that to the expected proportion if the two variants were independent. If allele 1 frequencies in population are $$f_k$$ and $$f_l$$ at SNPs $$k$$ and $$l$$, and if the frequency of the haplotype 1-1 is $$f_{kl}$$, then we define

• disequilibrium coefficient $$D_{kl}=f_{kl}-f_k\cdot f_l$$,

which, theoretically, is 0 if the two loci are independent, also said to be in linkage equilibrium, LE. If the loci are not independent ($$D_{kl}\neq 0$$), then we say that the loci are in linkage disequilibrium (LD). Naturally, we never know the frequencies exactly in population and we use sample estimates to compute $$D_{kl}$$. Hence we will always observe some non-zero value for $$D_{kl}$$ in our sample. The interest is in the magnitude of $$D_{kl}$$.

We can measure the amount of LD also by using the Pearson’s correlation coefficient $$r_{kl} = \textrm{cor}(a_k,a_l)$$, where $$a_k$$ and $$a_l$$ are the indicators of the alleles at SNPs $$k$$ and $$l$$ on the same haplotype. It can be shown that $$r$$ and $$D$$ are related through

• $$r_{kl} = \frac{D_{kl}}{\sqrt{f_k(1-f_k)f_l(1-f_l)}}.$$

Note that, theoretically, both $$r$$ and $$D$$ are 0 if and only if the loci are in linkage equilibrium. In what follows, we are interested in $$r$$, because it determines the statistical consequences of LD on GWAS analyses.

A statistical way to think about LD is that if SNPs $$k$$ and $$l$$ are in LD, then by observing for a particular haplotype its allele at SNP $$k$$, we also gain more information about its allele at SNP $$l$$ than what we had solely based on the population allele frequencies at SNP $$l$$.

#### 7.2.1 Generating two-locus haplotype data

Let’s follow Vukcevic et al. (2011) who derived formulas for the 4 haplotype frequencies in the population, given the MAFs at the two SNPs (1 and 2), and LD as measured by $$r = r_{12}$$ between the SNPs. (More comprehensive treatment of the subject is available in Damjan Vukcevic’s D.Phil thesis, Oxford, 2009.)

Suppose that SNP1 has alleles a (minor) and A (major) and SNP2 has alleles b (minor) and B (major) and that $$f_a \leq f_b.$$ First, in order that the allele frequencies and the given level of LD, as measured by $$r$$, correspond to a haplotype distribution, the following inequalities must hold: $-\sqrt{\left(\frac{f_a}{1-f_a}\right) \left(\frac{f_b}{1-f_b}\right)} \leq r \leq \sqrt{\left(\frac{f_a}{1-f_a}\right) \bigg/ \left(\frac{f_b}{1-f_b}\right)}.$ For example, correlation between two SNPs can be really high ($$r\approx 1$$) only if both have very similar MAFs, and in order for correlation to be highly negative ($$r\approx -1$$), both MAFs must be close to 0.5. This is because if we were to assign the minor allele at both SNPs to haplotypes, then the only way that we can make correlation $$\approx 1$$ is that a and b are almost always on the same haplotype and A and B are almost always on the same haplotype. This is only possible if $$f_a \approx f_b$$ because otherwise there are either some extra as that don’t find any b to pair with and therefore must pair with B, or there are some extra bs that don’t find an a to pair with and therefore must pair with A, both of which will lead to a reduction of correlation from $$\approx 1$$. To get a high negative correlation for the minor alleles, each minor allele must be paired up with the major allele at the other locus, and this is possible only if MAF is close to 0.5 at both loci. Note that if we change the allele coding at one locus from allele 1 being the minor allele to allele 1 being the major allele, that simply changes the sign, but not the magnitude, of the corresponding $$r$$.

The haplotype frequencies, and samples from them, can then be computed using the R-function geno.2loci() given below.

geno.2loci <- function(n, r, mafs, return.geno = TRUE){
#INPUT:
# n, individuals
# r, correlation coefficient between the alleles on the same haplotype of the two loci
# mafs, MAFs of the two loci
#OUTPUT:
# if return.geno=TRUE: n x 2 matrix of GENOTYPES of n individuals at 2 loci
# if return.geno=FALSE: (2n) x 2 matrix of HAPLOTYPES of n individuals (2n haplotypes) at 2 loci
stopifnot( r >= (-1) & r <= 1 )
stopifnot( length(mafs) == 2 )
stopifnot( all(mafs > 0) & all(mafs <= 0.5) )
stopifnot( n > 0)

#Label SNPs and alleles so that a and b are minor alleles and freq a <= freq b.
#At the end, possibly switch the order of SNPs back to the one given by the user.
f.a = min(mafs) # maf at SNP1
f.b = max(mafs)  # maf at SNP2

#With these parameters, the possible LD coefficient r has values in the interval:
r.min = max( -1, -sqrt(f.a/(1-f.a)*f.b/(1-f.b)) )
r.max = min( 1, sqrt(f.a/(1-f.a)/f.b*(1-f.b)) )
#c(r.min,r.max)
#Check that r is from this interval
if(r < r.min | r > r.max) stop(paste0("with these mafs r should be in (",r.min,",",r.max,")"))

# Alleles SNP1: A (major) and a (minor); SNP2: B (major) and b (minor).
# Compute conditional probabilities for allele 'a' given allele at locus 2:
q0 = f.a - r*sqrt(f.a*(1-f.a)*f.b/(1-f.b))         #P(a|B)
q1 = f.a + (1-f.b)*r*sqrt(f.a*(1-f.a)/f.b/(1-f.b)) #P(a|b)

#Compute the four haplotype frequencies:
f.ab = f.b*q1
f.aB = (1-f.b)*q0
f.Ab = f.b*(1-q1)
f.AB = (1-f.b)*(1-q0)
f = c(f.ab,f.aB,f.Ab,f.AB)
f #These are the haplotype frequencies in the population.
haps = matrix(c(1,1,1,0,0,1,0,0), nrow = 4, ncol = 2, byrow = T) #4 haplotypes in the population.

#Generate data for n individuals where each individual is measured at these two SNPs:
hap.ind = sample(1:4, size = 2*n, replace = T, prob = f) #There are 2*n haplotypes, 2 for each individual

if(mafs[1] > mafs[2]) haps = haps[,2:1] #Whether to change the order of loci?
#Either make genotype matrix by summing the two haplotypes for each individual...
if(return.geno) X = haps[hap.ind[1:n],] + haps[hap.ind[(n+1):(2*n)],]
if(!return.geno) X = haps[hap.ind,]  #...or return haplotypes as such.
return(X)
}

Let’s check that this seems to work and gives a correlation of $$r$$.

params = matrix(c(0.3,0.3,0.5, #each row has: r, MAF at SNP1, MAF at SNP2
0.8, 0.4, 0.3,
-0.5, 0.34, 0.4), byrow = T, ncol = 3)
n = 10000
for(ii in 1:nrow(params)){
X = geno.2loci(n, r = params[ii,1], mafs = params[ii,2:3], return.geno = FALSE ) #haplotype level data
print(c(cor(X[,1],X[,2]), colSums(X)/nrow(X)))
} #and compare to params above
## [1] 0.2964545 0.2983500 0.4993500
## [1] 0.8038991 0.3997500 0.3024000
## [1] -0.5016151  0.3387000  0.4026500

It works, so let’s use it to demonstrate how haplotypes with different LD look like. Start from high LD (r=0.9), then intermediate LD (r=0.66) and finally no LD (r=0). MAFs at the two SNPs must be similar for the high LD case. (Here we use 0.4.)

n = 50
rs = c(0.9, 0.666, 0.0) #correlation values
mafs = matrix(c(0.4,0.4, 0.4,0.5, 0.4,0.4), byrow = T, ncol = 2)
cbind(r = rs, mafs) #check that we put correct values in matrix
##          r
## [1,] 0.900 0.4 0.4
## [2,] 0.666 0.4 0.5
## [3,] 0.000 0.4 0.4
par(mfrow = c(3,1))
par(mar = c(2,4,4,1))
for(ii in 1:nrow(mafs)){
X = geno.2loci(n, r = rs[ii], mafs = mafs[ii,],  return.geno = F) #return haplotypes
X = X[order(X[,1]),] #order so that haps with major allele at SNP1 come first
image(X, yaxt = "n", xaxt = "n", xlab = "", ylab = "",
col = c("floralwhite","dodgerblue"), cex.main = 1.3,
main = paste0(2*n," haplotypes, observed r=",signif(cor(X[,1],X[,2]), 3) ))
abline(h = 0.5, lwd = 1.5, lty = 2, col = "black")
axis(2, at = c(0,1), labels = c("SNP1","SNP2"), cex.lab = 1.3 )
}

In the first case, we see that allele A (white) at SNP1 almost always indicates allele B (white) at SNP2, and ALSO allele a (blue) indicates allele b (blue) at the other SNP. Thus, knowing the allele at either of the SNPs predicts the allele at the other SNP with very high accuracy. These two SNPs are in high LD: population parameter was $$r=0.9$$, the observed value from the sample is shown in the title.

In the second case, MAF at SNP1 is lower than at SNP2 and even though blue at SNP1 predicts well that SNP2 also has blue allele, also many white alleles at SNP1 go together with a blue allele at SNP2. Hence LD, as measured by $$r$$, is less than in the first case. This case demonstrates that SNPs with very different MAFs cannot have very high correlation because there is no way to make all blue match blue AND all white match white when there are different numbers of blue and white at the two SNPs. These SNPs are clearly in LD with each other, although not as strongly as the SNPs in the first case.

In the third case, the SNPs are independent in the population ($$r=0$$ in simulation), and also in the observed data there is little correlation between the loci. Thus, by knowing the allele at SNP1 does not help predicting what is the allele at SNP2 on the same haplotype. The best predicton we can do statistically is to guess the allele at SNP2 based on the population allele frequencies, and ignore which allele was observed at SNP1, since it does not give additional information over the allele frequencies in population. These SNPs are not in LD with each other.

Let’s also check what happens to the estimate of $$r$$ if we estimate it at the genotype level, not at the haplotype level, as we have done so far. Thus, we’ll generate haplotypes as previously, but before computing $$r$$, we collapse the haplotypes into genotypes (by return.geno = TRUE) and then compute correlation between the genotypes at the two loci.

params = matrix(c(0.3,0.3,0.5, #each row has: r, MAF at SNP1, MAF at SNP2
0.8, 0.4, 0.35,
-0.5, 0.34, 0.4), byrow = T, ncol = 3)
n = 10000
for(ii in 1:nrow(params)){
X = geno.2loci(n, r = params[ii,1], mafs = params[ii,2:3], return.geno = TRUE )
print(c(cor(X[,1],X[,2]), colSums(X)/2/nrow(X)))
} #and compare to params above
## [1] 0.2793206 0.3005500 0.4936000
## [1] 0.7971017 0.4030500 0.3517000
## [1] -0.4914978  0.3377500  0.4011000

This seems to estimate the haplotype level correlations even though the input data are genotypes. This is good to know as the data we observe in practice are at the level of genotypes and phasing it to haplotypes is not always straightforward.

Example 7.2. We can also get LD estimates for human populations from LDlink.

• Choose LDhap and consider two SNPs rs7837688, rs4242382 in the CEU population. You’ll see the haplotype distribution in 1000G CEU data. There are 3 haplotypes (out of all 4 possible) and one is quite rare (1%). So these are almost in perfect LD where allele at one SNP tells the allele at the other.

• While you could compute $$r$$ from the info given by LDhap, you can also query it directly from LDpair tab, which gives $$r^2=0.8791$$, i.e., $$r=0.938.$$ (Slide 10.)

• Look at the same SNPs in LWK population. All 4 haplotypes are present and there is no LD in terms of $$r^2$$. Often African populations have less LD than non-African since the latter have been through more recent bottlenecks. Such bottlenecks increase LD because when there is a relatively few haplotypes that found the subsequent generations then those haplotypes will determine the LD patterns. (Slide 11.)

• Go to LDproxy tab and look rs7837688 in the Chinese CHB population. You’ll see the Figure and list of the variants that are in highest LD with rs7837688. Note also how the high LD is concentrated within a haplotype block that is between recombination hotspots (region between high peaks of recombination rate cM/Mb). (Slide 12.)

• LDmatrix (slide 13) can be made for a user given set of SNPs.

Example 7.3. Let’s plot an LD matrix for 74 variants around APOE gene on chr 19 using 1000 Genomes Finnish samples (coordinates GRCh37).

path = "https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/material/APOE_1000G_FIN_74SNPS."
dim(haps) #rows haplotypes, cols SNPs
## [1] 186  74
info[1:3,] #info for first three SNPs
##           id position a0 a1    af1
## 1   rs387976 45379060  A  C 0.3065
## 2  rs3852859 45379309  T  C 0.1720
## 3 rs73050293 45379746  A  G 0.2312
haps[1:3,] #data for first three haplotypes 
##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1  0  0  1  0  0  0  0  0  0   0   0   0   0   0   1   0   1   0   1   1   1
## 2  1  0  0  1  1  0  1  1  1   1   0   1   1   0   0   0   1   0   0   1   1
## 3  1  0  0  1  1  0  1  1  1   1   0   1   1   0   0   0   1   0   0   0   0
##   V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
## 1   0   0   0   1   0   0   1   1   1   0   1   0   1   1   1   0   0   1   0
## 2   0   0   0   0   1   0   0   1   0   1   0   0   0   1   0   1   0   0   1
## 3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59
## 1   1   1   1   0   0   1   0   0   1   1   1   1   0   1   1   1   0   1   0
## 2   1   0   1   0   1   1   1   0   1   1   1   1   0   1   1   1   0   1   0
## 3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71 V72 V73 V74
## 1   0   1   1   0   0   0   1   0   0   0   0   0   0   0   0
## 2   0   1   0   0   0   0   0   1   0   1   0   1   0   1   1
## 3   0   0   0   0   0   0   0   1   0   0   1   0   1   0   1
R = cor(haps) #LD matrix is the pxp correlation matrix
n.cols = 100
layout(matrix(c(1,2), nrow = 1), widths = c(9,1))
par(mar = c(2,2,3,1)) #plot LD matrix using image
image(abs(R), col = heat.colors(n.cols),
breaks = seq(0, 1, length = n.cols + 1),
asp = 1, xaxt = "n", yaxt = "n", bty = "n",
main = paste("|r| of ", dim(R)[1],"SNPs around APOE"))
par(mar = c(2,1,3,1)) #plot scale for colors
plot.window(xlim = c(0,1), ylim = c(0,n.cols))
points(x = rep(1,n.cols + 1),y=(0:n.cols), col = heat.colors(n.cols + 1), pch = 15, cex = 2)
axis(4, at = c(0, n.cols / 2, n.cols),
labels = c(0, 0.5, 1) )

LD matrices from two Finnish cohorts show highly similar LD structure to each other on slide 14.

### 7.3 Effect of LD on GWAS results

Let’s then experiment with a GWAS setting where SNP1 is a causal variant with causal effect size of $$\lambda_1 = 0.2$$ and MAF = 0.2 whereas SNP2 does not have a causal effect ($$\lambda_2=0$$) and has MAF = 0.4, on a quantitative phenotype whose variance is 1 in population. LD between the SNPs is $$r_{12}=0.6$$. So far on this course we have used $$\beta$$ to denote the effect size of allele 1. Reason why we now use $$\lambda$$ rather than $$\beta$$ is because we want to separate the causal effect ($$\lambda$$) from the marginal effect ($$\beta$$), as will become clear soon.

Let’s make 1000 simulations of such setting, using $$n=1000$$ individuals in each simulation, and let’s do the typical single-SNP GWAS on each variant and collect the effect estimate $$\widehat{\beta}_l$$ its SE and P-value for both variants $$l=1,2.$$

n.iter = 1000
n = 1000
r = 0.6
mafs = c(0.2, 0.4)
lambda = c(0.2, 0) #causal effects of each SNP
res = matrix(NA, ncol = 6, nrow = n.iter) #6 cols: beta1, SE1, P1; beta2, SE2, P2
colnames(res) = c("beta1","SE1","pval1","beta2","SE2","pval2")
for(iter in 1:n.iter){
X = geno.2loci(n, r, mafs, return.geno = T) #generate 2-locus genotypes
y = X %*% lambda + rnorm(n,0, sqrt(1-var(X %*% lambda)) ) #var(y) = 1,
res[iter, 1:3] = summary(lm(y ~ X[,1]))$coeff[2, c(1,2,4)] #collect beta, SE, P-val of SNP1 res[iter, 4:6] = summary(lm(y ~ X[,2]))$coeff[2, c(1,2,4)] #collect beta, SE, P-val of SNP2
}

Let’s plot effect estimates $$\widehat{\beta}_1$$ and $$\widehat{\beta}_2$$ against each other and as separate distributions. Let’s add vertical lines at the expected mean values: $$\lambda_1$$ for SNP1 and $$r_{12}\,\lambda_1\,\sqrt{\frac{f_1(1-f_1)}{f_2(1-f_2)}}$$ for SNP2 (explained later where this comes from).

par(mfrow = c(1,2))
plot(res[,"beta1"], res[,"beta2"], xlab = expression(beta[1]), ylab = expression(beta[2]),
main = paste("r =",signif(cor(res[,"beta1"], res[,"beta2"]), 3)))
abline(0,1)
plot(density(res[,"beta1"]), xlab = expression(beta), col = "black", main = "",
xlim = c(0,0.4), ylim = c(0,8))
lines(density(res[,"beta2"]), col = "red")
abline(v = lambda[1], lty = 2)
abline(v = lambda[1]*r*sqrt(mafs[1]*(1-mafs[1])/mafs[2]/(1-mafs[2])), lty = 2)
legend("topright", col = c("black","red"), lwd = 1, legend = c(1,2) )