This document is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The slide set referred to in this document is “GWAS 5”.

5.1 Genetic relatedness

All humans are genetically related to each other, but the level of relatedness varies between pairs of individuals. These differences in the levels of relatedness need to be accounted for in GWAS, since otherwise they could bias the statistical association between genotypes and phenotypes. Here, we define what is relatedness and how it is estimated, and later in the course we’ll talk more about why and how we use estimates of relatedness and population stucture in GWAS.

See slides (3-6) that summarize the process that generates variation in the relatedness among humans (or other diploid species).

A review of the relatedness estimates: Speed & Balding 2015, Relatedness in the post-genomic era: is it still useful?.

5.1.1 Defining relatedness

Consider two genomes \(G_1\) and \(G_2\). Suppose that we could determine, for each genomic position \(x\), the number of generations \(t(x;G_1,G_2)\) that have passed since the most recent common ancestor (MRCA) genome of both \(G_1\) and \(G_2\) at site \(x\) was alive. From such information, we could define a relatedness value between \(G_1\) and \(G_2\) as the average time of MRCAs over some selected set of \(L\) loci. (Slide 5)

The relatedness measures used in practice try to capture part of this perfect relatedness information, although not in terms of the number of generations since MRCA, but rather as a relative measure compared to the other pairs of genomes in the study population. This shortcut is made because estimating time to MRCA would require multiple assumptions about population genetic history, which are difficult to verify, and would involve demanding computation. Instead, some simple relatedness statistics used in GWAS can be easily computed for millions of pairs of individuals, and we will study these next.

Terms

  • Identical-by-state (IBS). Two genomes are IBS at locus \(x\) if they have identical DNA sequence at locus \(x\). Locus can be one or more nucelotides long. The observed IBS distribution for a pair of individuals is described by three values \((\textrm{IBS}_0, \textrm{IBS}_1, \textrm{IBS}_2)\) that tell which proportion of the genome between these individuals is of IBS 0,1 or 2, respectively. (Slide 7)

  • Identical-by-descent (IBD). Two genomes are IBD(\(t\)) at locus \(x\), if they have inherited identical DNA sequence at locus \(x\) from a common ancestor who has lived at most \(t\) generations ago. This is a well-defined concept, if pedigree information for a few generations is available. Typically in GWAS data, no such info is available, and the concept of IBD is used without explicitly defining what is the reference timeframe \(t\). Then it is expected that a reasonable definition of the reference level is implicitly implied by the method that is used for estimating whether genomes are IBD. Most often, IBD is defined by assuming that population allele frequencies are known and that there is IBD sharing between a pair of individuals if the pair shares more allele IBS than would be expected from a random pair from the population. IBD implies IBS but not vice versa. (Slide 7.) The IBD distribution for a pair of individuals is described by three values \((\textrm{IBD}_0, \textrm{IBD}_1, \textrm{IBD}_2)\) that tell which proportion of the genome is estimated to be of IBD 0,1 or 2, respectively. The known pedigree relationships (like full-sibs) determine only the expected value of IBD sharing; the realized IBD sharing has variation within any other relationship category except parent-offspring. (Slides 8-10).

  • Kinship coefficient \(\phi_{ij}\) is the probability that one allele sampled from individual \(i\) and one allele sampled from the same locus from individual \(j\) are IBD.

  • Inbreeding coefficient \(f_i=\phi_{MF} = 2\phi_{ii}-1\), where \(M\) and \(F\) are the mother and father of \(i\). It describes how closely the two genomes of an individual \(i\) are related. Outbred means not inbred.

  • Relatedness coefficient \(r_{ij} = 2\phi_{ij}.\) Assuming no inbreeding, \(r_{ij}\) can be interpreted as the proportion of the genome that is shared between \(i\) and \(j\).

In practical applications, make always clear in mind whether you are considering kinship or relatedness coefficients as the values differ by a factor of 2. Here are some values for relatives.

If the purpose is to assign individuals into categories according to close relationships, a commonly-used inference procedure of the KING software is based on the powers of \(\frac{1}{2}\). According to it, a pair is inferred to be a \(k\)th degree relative pair based on the following cut points for the estimated \(r_{ij}\) or \(\phi_{ij}\) values:

degree \(r_{ij}\) within \(\phi_{ij}\) within
mono twins \(>\frac{1}{2^{0.5}} = 0.707\) \(> \frac{1}{2^{1.5}} = 0.354\)
1st deg. \((\frac{1}{2^{1.5}},\frac{1}{2^{0.5}}) = (0.354, 0.707)\) \((\frac{1}{2^{2.5}},\frac{1}{2^{1.5}}) = (0.177, 0.354)\)
2nd deg. \((\frac{1}{2^{2.5}},\frac{1}{2^{1.5}}) = (0.177, 0.354)\) \((\frac{1}{2^{3.5}},\frac{1}{2^{2.5}}) = (0.088, 0.177)\)
3rd deg. \((\frac{1}{2^{3.5}},\frac{1}{2^{2.5}}) = (0.088, 0.177)\) \((\frac{1}{2^{4.5}},\frac{1}{2^{3.5}}) = (0.044, 0.088)\)
unrelated \(<\frac{1}{2^{3.5}}=0.088\) \(<\frac{1}{2^{4.5}}=0.044\)

Note that the expected value of relatedness for each category is the lower bound multiplied by \(\sqrt{2} = 2^{0.5}\) and is simultaneously also the upper bound divided by \(\sqrt{2}\).

5.1.2 Estimating relatedness

Example 5.1. Let’s generate genotype data for 5 pairs of full siblings, 5 pairs of half-siblings and 10 unrelated individuals, from a single population using \(p=10000\) SNPs whose MAF range from 0.2 to 0.5. We will then demonstrate different relatedness estimates on these data by showing the 30x30 relatedness matrix of the individuals.

Let’s put the within family data generation into its own function that allows generating pairs of offspring that share 2 parents (full-sibs), 1 parent (half-sibs) or 0 parent (unrelated).

offspring.geno <- function(n.families, n.snps, fs = rep(0.5,n.snps), n.shared.parents=2){
  #INPUT: 
  # n.families, number of families where each family produces two offspring (>0)
  # n.snps, number of independent SNPs used in simulation (>0)
  # fs, vector of allele 1 freqs for SNPs, length == n.snps, values >0 & <1
  # n.shared.parents, 0,1,2 shared parents for the two offspring in each family
  #OUTPUT:
  # X, the genotypes of (2*n.families) offspring, (2*n.families) x n.snps matrix with 0,1,2 entries
  
  stopifnot(n.families > 0)
  stopifnot(n.snps > 0)
  stopifnot(all(fs > 0 & fs < 1) & length(fs) == n.snps)
  stopifnot(n.shared.parents %in% 0:2)
  
  if(n.shared.parents == 2) parents = list(c(1,2), c(1,2)) #parents[[1]] are the parents of offspring 1
  if(n.shared.parents == 1) parents = list(c(1,2), c(3,2))
  if(n.shared.parents == 0) parents = list(c(1,2), c(3,4))
  n.parents = 4 - n.shared.parents #4, 3 or 2 for values of n.shared.parent of 0, 1 or 2
  
  X = matrix(0, nrow = 2*n.families, ncol = n.snps)
  for(ii in 1:n.families){ #each "family" means a pair of offspring that share 'n.shared.parents'
    x.parents = t(replicate(2*n.parents, rbinom(n.snps, size = 1, prob = fs) )) #2*n.parents parental genomes
    for(offs in 1:2){ #for two offspring within "family"
      #phase is the indicator of whether offs inherits each parents' 1st allele or not
      phase = t(replicate(2, rbinom(n.snps, size = 1, prob = 0.5))) #phase has one row for each parent
        for(i.parent in 1:2){
          for(ph in 0:1){
            loci = (phase[i.parent,] == ph) #which loci from i.parent have phase ph?
            #add to current offs' genotype i.parent's allele from the correct phase
            X[2*(ii-1) + offs, loci] = 
              X[2*(ii-1) + offs, loci] + x.parents[2*parents[[offs]][i.parent] - ph, loci] 
          }
        }
    }
  }
  return (X)
}

And now the simulation of 30 samples x 10000 SNPs genotype matrix X.

p = 10000 #SNPs
fs = runif(p, 0.2, 0.5) #MAF at each SNP is Uniform(0.2, 0.5)

X = rbind(offspring.geno(n.families = 5, n.snps = p, fs = fs, n.shared.parents = 0),
             offspring.geno(n.families = 5, n.snps = p, fs = fs, n.shared.parents = 1),
             offspring.geno(n.families = 5, n.snps = p, fs = fs, n.shared.parents = 2))

X = X[,apply(X,2,var) > 0] #remove possible monomorphic variants 

IBS estimator

It is straightforward to count the proportion of loci where a pair of individuals share 0,1 or 2 alleles IBS. These IBS values already make a distinction between different levels of relatedness and a plot of, say, IBS0 against IBS2 shows often a clear structure. However, the absolute amount of IBS sharing depends heavily on allele frequencies in the population and methods that estimate kinship or relatedness coefficients typically aim for estimating IBD.

We generate each IBSz-matrix by multiplication over possible sharing patterns: There are three possible ways to have IBS2: either sharing genotype 2 or genotype 1 or genotype 0. And there are two ways to be IBS1: one individual must be heterozygous while the other must be one of the homozygotes. Finally, the overall IBS estimate is IBS2 + 0.5 * IBS1.

n.cols = 50 #number of colors
IBS.2 = ( (X==2) %*% t(X==2) + (X==1) %*% t(X==1) + (X==0) %*% t(X==0) )/p #prop. of loci with same genotypes
IBS.1 = ( (X==1) %*% t(X==0 | X==2) + (X==0 | X==2) %*% t(X==1) )/p #prop. of loci with one allele shared
IBS = IBS.2 + 0.5*IBS.1 #prop. of genome IBS
layout(matrix(c(1,2), nrow = 1), width = c(9,1)) #plotting area divided into 9/10 and 1/10 wide regions
par(mar = c(2,2,3,1)) #plot IBS matrix using image
image(IBS, col = topo.colors(n.cols), breaks = seq(min(IBS), max(IBS), length = n.cols + 1),
      asp = 1, xaxt = "n", yaxt = "n", bty = "n", 
      main = paste("Avg. IBS from",ncol(X),"SNPs")) 
par(mar = c(2,1,3,1)) #plot scale for colors
plot.window(xlim = c(0,1), ylim = c(0,n.cols)) 
points(x = rep(1, n.cols + 1), y = (0:n.cols), col = topo.colors(n.cols + 1), pch = 15, cex = 2)
axis(4, at = c(0, n.cols / 2, n.cols),
     labels = c(round(min(IBS),2), round((min(IBS) + max(IBS))/2,2), round(max(IBS),2)) )

The numerical values of this IBS plot depend on the population allele frequencies and are not as easily mapped to the known relationship categories as with IBD based measures for which, for example, \(r\sim50\%\) for full sibs and \(r\sim25\%\) for half sibs. Additionally, it seems that separation of the half-sibs from unrelateds in IBS matrix is not very clear.

Correlation estimator (GCTA’s Genetic relationship matrix)

Let’s look at how a standard measure of correlation applied to a large number \(p\) of variants can be interpreted as an estimate of IBD.

Consider a locus \(l\) where frequency of allele 1 is \(f_l\) and denote by \(A_{hl}\in\{0,1\}\) the (random variable describing) allele of genome \(h\) at locus \(l\). Under the assumption that all genomes in the sample come from the same homogeneous population, we have that \(\textrm{E}(A_{hl})=f_l\) and \(\textrm{Var}(A_{hl}) = f_l(1-f_l)\) for all \(h\). Denote by \(r_{hk}\) the (unknown) probability that genomes \(h\) and \(k\) are IBD at a locus. We have \[ P(A_{hl} = 1 = A_{kl}\, |\, f_l, r_{hk}) = r_{hk}\, f_l + (1-r_{hk})\,f_l^2,\] where the first term describes the case where \(h\) and \(k\) are IBD at \(l\) (with probability \(r_{hk}\)) and happen to share the same copy of allele 1 (with probability \(f_l\)), and the second term describes the case where \(h\) and \(k\) are not IBD at \(l\) (prob. \(1-r_{hk}\)) and happen to carry independent copies of allele 1 (prob. \(f_l^2\)). We can solve for \[r_{hk} = \frac{P(A_{hl} = 1 = A_{kl} )-f_l^2}{f_l(1-f_l)} = \frac{\textrm{E}(A_{hl} A_{kl})-\textrm{E}(A_{hl})\textrm{E}(A_{kl}) }{ \sqrt{\textrm{Var}(A_{hl})\textrm{Var}(A_{kl})}}=\textrm{cor}(A_{hl},A_{kl}). \] Thus, for genomes, their IBD proportion can be estimated by the correlation of their allele states.

Let \(G_{il} = A_{i_1l} + A_{i_2l} \in \{0,1,2\}\) be the genotype of individual \(i\) at locus \(l\) and assume HWE: \(\textrm{Var}(G_{il})=2f_l(1-f_l)\). \[ \textrm{cor}(G_{il},G_{jl}) = \frac{\textrm{Cov}(A_{i_1l} + A_{i_2l}, A_{j_1l} + A_{j_2l})}{2f_l(1-f_l)}=\frac{(r_{i_1j_1} +r_{i_1j_2} +r_{i_2j_1} +r_{i_2j_2})f_l(1-f_l)}{2f_l(1-f_l)}=\frac{4\phi_{ij}}{2}=2\phi_{ij} = r_{ij}. \] Thus, also for genotypes, correlation can be used to estimate the IBD sharing of the individuals.

We use all available SNPs (after some pruning we’ll talk more later) to estimate the pairwise correlations using formula \[ \widehat{r_{ij}} = \frac{1}{p} \sum_{l=1}^p \widehat{\textrm{cor}}(G_{il},G_{jl}) = \frac{1}{p} \sum_{l=1}^p \frac{\left(g_{il}-2\,\widehat{f_l}\right) \left(g_{jl}-2\,\widehat{f_l}\right)}{2\,\widehat{f_l}\left(1-\widehat{f_l}\right) }, \] where \(g_{il}\in\{0,1,2\}\) is the observed genotype of \(i\) and \(\widehat{f_l}\) is the allele 1 frequency estimate from the sample, both at locus \(l\).

For \(n\) individuals, these pairwise relatedness coefficients form an \(n\times n\) genetic relationship matrix (GRM) for which we use notation GRM-cor to emphasise that it is the correlation based GRM. This GRM-cor has been widely used in conjuction with linear mixed models software GCTA. Also PLINK has an efficent implementation for computing GRM-cor.

The interpretation of negative values for \(\widehat{r_{ij}}\) is that those pairs are less related than an expected pair of unrelated (no-IBD sharing) individuals. This is possible, for example, if there is population structure in the data because then the individuals from different populations can look less related than the theoretical construction of a pair of “unrelated individuals” based on the allele frequency data from the combined sample. A related, and possibly more severe, problem is that when the sample is not from a single homogeneous population, then individuals from the same population seem more related than they actually are (for example 3rd degree relatives can look like 2nd degree relatives). Thus, correlation estimator, or any other estimator that defines the reference level of IBD based on the sample’s allele frequencies, is not robust to population structure.

Note also a possible numerical instability of the estimator at SNPs where MAF is close to 0. Then the denominator \(\widehat{f_l}(1-\widehat{f_l}) \approx 0\), and therefore a pair of carriers of the rare allele will get a huge positive contribution to their relatedness estimate from the rare allele. For example, if \(\widehat{f_l} = 0.001\), then a pair of individuals who share the minor allele will get a huge contribution of \((1-0.002)^2/(2\cdot 0.001 \cdot 0.999) = +498.5\) to their relatedness estimate, whereas individuals who share the common genotype 0, will get only a tiny contribution of \(+0.002\) and a pair who have different genotypes 0 and 1 will get contribution of only \(-1.00.\) Intuitively, it feels correct to assume that individuals who share a rare allele indeed are likely to have recent IBD sharing at that locus, but the unboundedness of the estimator with rare alleles seems a technical problem that leads to a huge variance of the estimator. Therefore, we typically restrict GRM-cor estimator to common variants (say MAF>5%).

Let’s make GRM-cor from our current data set.

X.scaled = scale(X) #standardize each SNP at columns of X
GRM = (X.scaled %*% t(X.scaled))/p #correlation matrix of individuals based on standardized SNPs

layout(matrix(c(1,2), nrow = 1), width = c(9,1)) 
par(mar = c(2,2,3,1)) #margins for matrix
image(GRM, col = topo.colors(n.cols), breaks = seq(min(GRM),max(GRM), length = n.cols + 1), 
      asp = 1, xaxt = "n", yaxt = "n", bty = "n", 
      main = paste("GRM from",ncol(X),"SNPs")) 
par(mar = c(2,1,3,1)) #margins for scale
plot.window(xlim = c(0,1), ylim = c(0,n.cols))
points(x = rep(1,n.cols +1 ), y = (0:n.cols), col = topo.colors(n.cols + 1), pch = 15, cex = 2)
axis(4, at = c(0, n.cols/2, n.cols),
     labels = c(round(min(GRM),2), round((min(GRM)+max(GRM))/2,2), round(max(GRM),2)) )

Let’s print out which pairs seem like full sibs, \(0.4 < r_{ij} < 0.6\) or half sibs, \(0.15 < r_{ij} < 0.35.\)

full.sibs = which( GRM > 0.4 & GRM < 0.6, arr.ind = T) #each pair twice here since R is symmetric
full.sibs = full.sibs[full.sibs[,1] < full.sibs[,2], ] #pick only pairs where 1st index < 2nd index
rbind(full = as.vector(t(full.sibs)))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## full   21   22   23   24   25   26   27   28   29    30
half.sibs = which( GRM > 0.15 & GRM < 0.35, arr.ind = T) 
half.sibs = half.sibs[half.sibs[,1] < half.sibs[,2], ] 
rbind(half = as.vector(t(half.sibs)))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## half   11   12   13   14   15   16   17   18   19    20

Seems correct.

KING

IBD relatedness estimators explained above assume that individuals come from homogeneous population. KING by Manichaikul et al. 2010 is a relatedness estimation method that is more robust to population structure and is very efficiently implemented. It derives a kinship estimate for a pair of individuals without reference to the population allele frequencies. Instead, it gets the relatedness information from the difference between the counts of loci where both individuals are heterozygotes (\(N_{1,1}\)) and counts of loci where they are different homozygotes (\(N_{2,0}\)), normalized by the sum of the heterozygous loci of the individuals: \[ \widehat{\phi_{ij}} = \frac{N_{1,1} - N_{2,0}}{N_{1}^{(i)} + N_{1}^{(j)}}. \] Let’s apply KING estimate to the current dataset with the full-sibs, half-sibs and unrelateds, and let’s multiply KING’s kinship estimate by 2 to get from the KING’s default kinship scale to the scale of relatedness coefficient \(r\).

denominator = matrix(rep(rowSums(X==1), nrow(X)), nrow = nrow(X), byrow = T) +
              matrix(rep(rowSums(X==1), nrow(X)), nrow = nrow(X), byrow = F)
king.r = 2*((X==1) %*% t(X==1) - 2*((X==0) %*% t(X==2) + (X==2) %*% t(X==0)) ) / denominator  
#(suppressing commands of matrix printing from output)

Looks correct. We’ll later come back to the difference between KING and GRM-cor when there is some population structure in the data.

KING was recently used to estimate relatedness for all 1.2e+11 pairs from the UK Biobank data (~500,000 samples)(Slide 12).

IBD segments

Genome is inherited in chunks and therefore individuals sharing an allele IBD at a locus are also likely to share a larger region IBD around that locus. Most accurate relatedness estimates come from methods that can model this spatial structure of IBD segments. Those methods are usually computationally intensive and not applicable to GWAS data sets with \(10^4\) or \(10^5\) individuals and we don’t study them on this course. A recent study by Ramstetter et al. 2017 lists and compare many of those methods.

5.1.3 Use of IBD estimates in quality control

(From Purcell et al. 2007.)

IBD sharing estimates can be used for QC and to indicate and diagnose errors in records or genotype data, including sample swaps, sample duplications, and
sample contamination events, as well as misspecified or undetected familial relationships. For example, values of IBD2 near 1 indicate duplicated samples (or monozygotic twins).

If DNA from some individual(s) contaminates other samples, this can lead to a distinctive pattern of contaminated samples showing high IBD with all other individuals. This is because contamination induces false heterozygote calls (e.g., AA pooled with CC may well be genotyped as AC), and heterozygotes cannot ever be IBS=0 with any other SNP genotype, which artificially inflates the IBD estimates. Furthermore, contaminated samples will show strong, negative inbreeding coefficients, indicating more heterozygotes than expected. This is why excessive heterozygocity is a QC criterion in GWAS data.

See also slides 16-17 for an application of relatedness estimates in QC.

5.1.4 Uses of relatedness estimates outside GWAS

Several companies offer direct-to-customer service to analyze customer’s DNA, and over 15 million people have already subscribed to such services globally. One of the uses is to reveal genetic relatives based on the estimated IBD sharing among the customers. The purpose of uniting distant relatives is nice, but the power of genetics can also cause complicated situations, as described in this Vox News article “With genetic testing, I gave my parents the gift of divorce”.

The rapid growth of the public databases, where voluntary individuals have uploaded their genetic data and other information (such as age and location), have already been used to identify from a database a relative of an external DNA sample (e.g. from a crime scene), which has led to a complete identification of the external DNA sample. A famous example is the case of the Golden State Killer in CA, US, where in April 2018 a suspect was arrested for murders and rapes commited in 1970s and 1980s. The police got the lead by uploading the DNA found from a crime scene to a GEDmatch database and identified 10-20 3rd degree cousins whose additional information helped to identify the suspect. How lucky were the police? by Edge and Coop. A recent Science paper by Erlich et al. 2018 summarizes this approach and its profound future prospects (Slides 13-15):

"Consumer genomics databases have reached the scale of millions of individuals. Recently, law enforcement 
authorities have exploited some of these databases to identify suspects via distant familial relatives. 
Using genomic data of 1.28 million individuals tested with consumer genomics, we investigated the power 
of this technique. We project that about 60% of the searches for individuals of European-descent will 
result in a third cousin or closer match, which can allow their identification using demographic identifiers. 
Moreover, the technique could implicate nearly any US-individual of European-descent in the near future. 
We demonstrate that the technique can also identify research participants of a public sequencing project. 
Based on these results, we propose a potential mitigation strategy and policy implications to human 
subject research."

5.2 Population structure

Several relatedness estimation methods considered above were base on the assumption of homogeneous population, i.e., a population without genetic structure. Genetic population structure is present in the sample, if the sample can be divided into groups in such a way that individuals from one group are more genetically similar among themselves than with individuals from different groups.

5.2.1 Sources of population structure

Most common variants among the humans are shared across the world, in the sense that those common alleles are present all around the world. Still, even the common variants carry information about the geographic location of their carrier since their allele frequencies are different in different areas. (Slides 19-20.)

Example 5.2. Genetic drift

Suppose that a population with \(100\) individuals is living in a particular area. Suppose that half of the population migrates to a new area and the two subpopulations do not have any contact with each other anymore, and in particular, do not have any genetic exhange in the following generations. Let’s visualize how the allele frequency of a particular variant evolves in the two populations as a function of generations they have been separated. The assumption is that at time 0 the allele frequency is the same in both populations, and, in each of the following generations, the offspring alleles are randomly sampled from the existing alleles with replacement (one individual can have none, one or multiple offspring). This sampling means that Hardy-Weinberg equilibrium holds in each subpopulation. There are 50 individuals in each population, so 100 alleles in each population. The population size is assumed constant.

n = 100 #alleles in each generation of each subpopulation
f0 = 0.5 #starting allele frequency
K = 100 #number of generations in simulation
npop = 10 #Let's make 10 rather than just 2 pops to illustrate more general behavior
f = matrix(NA, ncol = K+1, nrow = npop) #results of allele freqs across pops and generations
for(pop in 1:npop){
  a = rbinom(n,1,f0) #starting configuration of alleles
  f[pop,1] = mean(a) #allele frequency at generation 0
  for(ii in 1:K){
    a = sample(a, size = n, replace = T) #resample generation ii
    f[pop,1+ii] = mean(a) #allele frequency at generation ii (index ii+1 since generation 0 is at index 1)
  }
}
plot(NULL, xlab = "generation", ylab = "allele frequency", 
     main = paste("n =",n), xlim = c(0,K), ylim = c(0,1))
for(pop in 1:npop){lines(0:K, f[pop,], lwd = 1.4, col = topo.colors(npop)[pop])}