This document is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
The slide set referred to in this document is “GWAS 5”.
Several relatedness estimation methods considered above were based on the assumption of a homogeneous population, i.e., a population without genetic structure. Genetic population structure is present in a sample if the sample can be divided into groups in such a way that individuals from one group are genetically more similar to each other than to individuals from the other groups.
Most common variants among humans are shared across the world, in the sense that those common alleles are present all around the world. Still, even the common variants carry information about the geographic location of their carrier, since their allele frequencies differ between areas. (Slides 19-20.)
Suppose that a population of \(100\) individuals is living in a particular area. Suppose that half of the population migrates to a new area and the two subpopulations have no contact with each other anymore and, in particular, no genetic exchange in the following generations. Let’s visualize how the allele frequency of a particular variant evolves in the two populations as a function of the number of generations they have been separated. The assumption is that at time 0 the allele frequency is the same in both populations and, in each of the following generations, the offspring alleles are randomly sampled from the existing alleles with replacement (one individual can have zero, one or multiple offspring). This sampling means that Hardy-Weinberg equilibrium holds in each subpopulation. There are 50 individuals in each subpopulation, so 100 alleles in each population. The population size is assumed constant.
n = 100 #alleles in each generation of each subpopulation
f0 = 0.5 #starting allele frequency
K = 100 #number of generations in simulation
npop = 10 #let's make 10 rather than just 2 pops to illustrate more general behavior
f = matrix(NA, ncol = K + 1, nrow = npop) #allele freqs across pops and generations
for(pop in 1:npop){
  a = rbinom(n, 1, f0) #starting configuration of alleles
  f[pop, 1] = mean(a) #allele frequency at generation 0
  for(ii in 1:K){
    a = sample(a, size = n, replace = TRUE) #resample generation ii from generation ii-1
    f[pop, 1 + ii] = mean(a) #allele frequency at generation ii (index ii+1 since generation 0 is at index 1)
  }
}
plot(NULL, xlab = "generation", ylab = "allele frequency",
     main = paste("n =", n), xlim = c(0, K), ylim = c(0, 1))
for(pop in 1:npop){lines(0:K, f[pop,], lwd = 1.4, col = topo.colors(npop)[pop])}
The figure shows ten possible evolutions for our subpopulations. Each population starts near the common allele frequency value (here 0.5), but, as time goes by, the populations drift apart in their allele frequencies. If we were to genotype individuals from generation 80 at this variant, we could tell, for example, that heterozygotes are definitely from one of the 5 populations not yet fixed for either allele, and for each homozygote we could list the populations they may come from. Thus, pure random variation in allele frequencies between generations generates marked allele frequency differences between populations over time. This is called genetic drift.
The amount of genetic drift depends strongly on the population size. A statistician might think that each new generation “estimates” its parents’ generation’s allele frequency by sampling alleles from the parents’ generation and computing their frequency, and the precision of this estimate decreases as the sample size decreases.
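This intuition can be quantified: if the next generation samples \(n\) alleles and the current allele frequency is \(f\), the next generation’s frequency is \(\textrm{Bin}(n,f)/n\), whose standard deviation is \(\sqrt{f(1-f)/n}\). A quick sketch (not part of the simulation above) comparing \(n = 100\) and \(n = 1000\) alleles:

```r
f0 = 0.5 #current allele frequency
for(n in c(100, 1000)){ #number of alleles sampled into the next generation
  theor = sqrt(f0 * (1 - f0) / n) #theoretical sd of next generation's frequency
  empir = sd(rbinom(10000, n, f0) / n) #empirical sd over 10,000 simulated generations
  cat("n =", n, " theoretical sd =", round(theor, 4),
      " empirical sd =", round(empir, 4), "\n")
}
```

With 10 times more alleles, the per-generation fluctuation shrinks by a factor of \(\sqrt{10} \approx 3.2\), which is why drift accumulates much more slowly in large populations.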
Let’s repeat a similar analysis but with a 10 times larger population size. (And let’s not repeat the code in the output, just the updated Figure.)
Within the same time span, the larger populations were much less affected by genetic drift. This phenomenon also explains why some human populations have diverged genetically more strongly from their neighbours than some other, geographically equally distant populations. The Finnish genetic ancestry is an example of relatively strong genetic isolation among the European populations, to a large part due to a relatively small historical “founder population” size in Finland. Consequently, the overall genetic background among individuals with Finnish genetic ancestry is less heterogeneous than in many other European countries (which makes some genetic analyses simpler in Finland), and, due to a strong historical genetic drift effect in Finland, some functional, and possibly harmful, alleles have survived in Finland at much higher frequency than elsewhere (say, e.g., 1% in Finland and 0.01% elsewhere) even though selection might act against them. We already know what a huge difference such a frequency difference makes in terms of statistical power to discover phenotype associations! These are reasons why the international research community has put much focus on Finland when it comes to genetics research.
Above we saw that already a single variant can reveal genetic population structure between groups that have been separated for a while. And when we combine 100,000s of variants from across the genome, we can expect to see quite a clear structure, if we just have the tools to pick it up.
Suppose we have \(n\) individuals measured on \(p\) SNPs. Thus our data matrix is an \(n \times p\) matrix \(\mathbf{X}\). We would like to summarize the main structure of these data with respect to the individuals by using far fewer than \(p\) dimensions. Such dimension reduction is most intuitive if it ends up in one or two dimensions, since then we can easily inspect the results by simply plotting the individuals at their new 1- or 2-dimensional coordinates.
Imagine each individual as a point in \(p\)-dimensional space, where each dimension corresponds to a SNP and the individual’s coordinate on the axis is the genotype at that SNP. If we draw one line through the \(p\)-dimensional space and project each point on that line, then we can represent each individual by using only one value, the individual’s coordinate on the chosen line. Now each individual is represented by one value instead of the original \(p\) values, so we have done an efficient dimension reduction from \(p\) to 1. Is this useful? Only if we can choose that one line in such a way that it captures a useful amount of the information in the data. In PCA, first defined in 1901 by K. Pearson, our criterion for choosing the line is that the individuals’ projections on the line should have the largest variance possible, among all possible lines that we could draw through the \(p\)-dimensional space.
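The variance-maximization criterion can be demonstrated numerically. The following sketch uses hypothetical correlated 2D data (not genotypes): no direction through the data gives projections with a larger variance than the PC1 direction found by prcomp().

```r
set.seed(1)
X = matrix(rnorm(200), ncol = 2) %*% matrix(c(1, 0.9, 0.9, 1), 2, 2) #100 correlated 2D points
pc1 = prcomp(X)$rotation[, 1] #unit vector along PC1
var.pc1 = as.numeric(var(X %*% pc1)) #variance of projections on the PC1 line
#variance of projections on 1,000 random unit vectors (directions through the origin):
var.rand = sapply(runif(1000, 0, pi), function(a) var(X %*% c(cos(a), sin(a))))
max(var.rand) <= var.pc1 + 1e-8 #TRUE: no direction beats PC1 (small numerical tolerance)
```

Note that the variance of a projection does not depend on where the line is located, only on its direction, so it is enough to consider lines through the origin.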
Let’s make an example data set of 2 SNPs where we have 50 individuals from each of two populations. In the blue population, the allele 1 frequencies are 5% and 95% at the two SNPs, and in the red population they are the opposite: 95% and 5%. We don’t expect either SNP alone to completely separate the two populations, but maybe PCA is able to combine their information and present it in one dimension that shows the population structure better than either of the SNPs alone. Let’s draw a picture.
set.seed(20)
npop = 2
cols = c("blue", "red")
f = matrix(c(0.05, 0.95, 0.95, 0.05), byrow = TRUE, ncol = npop) #rows are SNPs, cols are pops
p = nrow(f) #number of SNPs
n = rep(50, npop) #number of samples from each population
pop = rep(1:npop, n) #from which population each individual comes
X = c() #empty genotype data matrix
for(ii in 1:p){
  x = c() #empty genotype vector for SNP ii
  for(jj in 1:npop){
    x = c(x, rbinom(n[jj], size = 2, f[ii, jj])) #add genotypes of pop jj to x
  }
  X = cbind(X, x) #add SNP x as a new column to genotype matrix X
}
jitt.1 = runif(n[1], -0.03, 0.03) #add some noise to coordinates in plot to avoid overlap
jitt.2 = runif(n[2], -0.03, 0.03)
plot(X[,1] + jitt.1, X[,2] + jitt.2,
     col = cols[pop], xlab = "SNP1", ylab = "SNP2")
We have clear patterns in the 2D SNP space: the red population is mainly in the lower right and the blue in the upper left corner. But neither SNP alone can separate the two populations, and if we colored all dots black, we would not see a clear population separation in this plot. Let’s see which is the line on which the projections of these points have the largest variance. We get that from PCA, computed by prcomp().
X = scale(X) #always standardize each variant before PCA
pca = prcomp(X) #do PCA; we look later what this function returns
plot(X[,1] + jitt.1, X[,2] + jitt.2, asp = 1, #Plot the points, now after scaling
col = cols[pop], xlab = "SNP1 (stnd'd)",
ylab = "SNP2 (stnd'd)")
abline(a = 0, b = pca$rotation[2,1]/pca$rotation[1,1]) #add the PC1 line
If we now project all points on that PC1 line, as is done in pca$x[,1] (and use just random y-coordinates for visualization), we have:
plot(pca$x[, 1], runif(sum(n), -1, 1), col = cols[pop], yaxt = "n", xlab = "PC1", ylab = "")
The direction picked by PC1 also happens to separate the two populations. So even with only two SNPs, neither of which alone separates the populations, PCA can combine their information to capture the main structure of the data, which here matches the existence of the two populations.
Although the 2-SNP example above is just a toy demonstration, it gives us a good reason to expect that when we use 100,000s of SNPs, the PCs, i.e., the directions of the largest genetic variation in the data, will capture population structure, if such exists.
Terms:
Principal components of the genetic data are the linear 1D-subspaces of the original genotype space that have the following properties: The 1st PC explains the maximum variance possible by any linear 1D-subspace. For any \(k>1\), the \(k^{th}\) PC explains the maximum variance possible by any linear 1D-subspace conditional on being orthogonal to all preceding PCs. The number of PCs is \(\min\{n,p\}.\)
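These properties are easy to verify with prcomp() on a small random data set (a sketch; the data here are arbitrary, not genotypes):

```r
set.seed(5)
X = matrix(rnorm(40), nrow = 5, ncol = 8) #n = 5 individuals, p = 8 variables
pca = prcomp(X)
ncol(pca$x) == min(nrow(X), ncol(X)) #TRUE: the number of PCs is min(n,p)
all(diff(pca$sdev) <= 1e-12) #TRUE: PCs are ordered by decreasing variance
round(crossprod(pca$x), 2) #scores on different PCs are orthogonal (off-diagonals are 0)
```

(When \(n \le p\) and the data are mean-centered, the last PC has essentially zero variance, because centering removes one dimension of variation.)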
Scores or principal component scores are the coordinates of the individuals when they have been projected on the PCs. They are computed using the PC loadings and the individuals’ (standardized) genotypes. In the object returned by prcomp(), the scores are in matrix x, so, e.g., the scores on PC 6 are in vector pca$x[,6].
Loadings are the coefficients that determine how scores are computed from the (standardized) genotypes. Each PC \(k\) has a loading vector \(\mathbf{l}_k = (l_{k1},\ldots,l_{kp})^T\) and the score of individual \(i\) on PC \(k\) is \[\textrm{score}_{ik} = \mathbf{l}_k^T \mathbf{x}_i = \sum_{m=1}^p l_{km}x_{im},\] where \(x_{im}\) is the standardized genotype of individual \(i\) at SNP \(m\). Note that when we have the loadings at hand, we can also project any external individual on the existing PCs generated by our reference data set. Loadings from the prcomp() object pca are in matrix pca$rotation.
Example 5.3. Let’s generate data from three populations P1, P2, P3, of which P1 and P2 are more closely related to each other and more distant from P3. A standard measure of differentiation is the Fst value, which describes how large a proportion of the genetic variation lies between populations compared to within populations. For example, we have Fst ~ 0.003 between Eastern and Western Finland and ~ 0.10 between populations from different continents. The Balding-Nichols model generates allele frequencies for two populations with an Fst value of \(F\) by sampling a background allele frequency \(f\), for example, from a Uniform distribution, and then sampling the subpopulation allele frequencies as independent draws from the distribution \[\textrm{Beta}\left(\frac{1-F}{F}f, \frac{1-F}{F}(1-f)\right ).\] Let’s generate data so that Fst between P1 and P2 is 0.003 and Fst between P3 and the shared ancestral population of P1 and P2 is 0.05. (We might be looking at differences between Eastern (P1) and Western Finns (P2) and Northern Africans (P3).) Let’s sample \(n=100\) individuals from each population using \(p = 3000\) SNPs and show PCs 1-2.
n = 100 #per population
p = 3000 #SNPs
fst.12 = 0.003
fst.12.3 = 0.05
f = runif(p, 0.1, 0.5) #common SNPs in background population
f.3 = rbeta(p, (1-fst.12.3)/fst.12.3*f, (1-fst.12.3)/fst.12.3*(1-f))
f.12 = rbeta(p, (1-fst.12.3)/fst.12.3*f, (1-fst.12.3)/fst.12.3*(1-f)) #P1&P2's shared ancestor
f.1 = rbeta(p, (1-fst.12)/fst.12*f.12, (1-fst.12)/fst.12*(1-f.12))
f.2 = rbeta(p, (1-fst.12)/fst.12*f.12, (1-fst.12)/fst.12*(1-f.12))
#Let's check that f.1 and f.2 look similar compared to f.1 vs f.3 or f.2 vs f.3
par(mfrow = c(1,3))
plot(f.1,f.2, main = paste("Fst", fst.12), xlim = c(0,1), ylim = c(0,1), pch = 3)
plot(f.1,f.3, main = paste("Fst >", fst.12.3), xlim = c(0,1), ylim = c(0,1), pch = 3)
plot(f.2,f.3, main = paste("Fst >", fst.12.3), xlim = c(0,1), ylim = c(0,1), pch = 3)
Let’s then generate the genotype data and do PCA.
x = cbind(
  replicate(n, rbinom(p, size = 2, prob = f.1)), #generate n inds from P1
  replicate(n, rbinom(p, size = 2, prob = f.2)), # from P2
  replicate(n, rbinom(p, size = 2, prob = f.3))) # from P3
x = t(x) #each replicate (=ind) is now in a column, but we want inds to rows and SNPs to cols
pca = prcomp(x, scale = T) #do PCA
cols = rep( c("cyan","skyblue","magenta"), each = n) #color for each ind according to pop
plot(pca$x[,1], pca$x[,2], col = cols, xlab = "PC1", ylab = "PC2")
We see that 1st PC picks P3 apart from P1&P2, but does not separate P1 and P2, whereas PC2 starts to separate P1 and P2. Separation would become clear if we added a few thousand more SNPs or if we increased Fst between P1 and P2 (left as exercise).
Example 5.4. Let’s also do PCA with real allele frequency data from 1000 Genomes project. Let’s read in a frequency file (that is based on the files from IMPUTE2 webpage) and see what’s in it.
af = read.table("http://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/material/afreq_1000G_phase1_chr15-22.txt",
as.is = T, header = T)
dim(af)
## [1] 5266 19
af[1,]
## chr id position a0 a1 ASW CEU CHB CHS CLM FIN GBR
## 1 15 rs11248847 20101049 G A 0.2377 0.1882 0.4278 0.335 0.1917 0.1613 0.1461
## IBS JPT LWK MXL PUR TSI YRI
## 1 0.2143 0.3876 0.2165 0.25 0.2091 0.1888 0.09659
We have allele 1 frequency information for 5,266 SNPs in 14 populations. The SNPs have been chosen from chromosomes 15-22 and are at least 100,000 bps apart to avoid linkage disequilibrium blocks. Additionally, the global MAF of all of them is > 5%, but some of them may still be very rare in any one population.
Note that the sample sizes of some populations (IBS in particular) are small, so we won’t use them. Let’s demonstrate a European PCA using GBR, TSI, CEU and FIN. We simply simulate a number of individuals (\(n=50\)) from each population and do PCA. (We could use the original individual-level 1000 Genomes data as well, but that requires large files, so here we just work with the allele frequencies and simulate our individuals from them.)
p = nrow(af) #number of SNPs
n = 50 #samples per population
pop.labs = c("GBR","TSI","CEU","FIN")
pop = rep(1:length(pop.labs), each=n)
x = c()
for(ii in 1:length(pop.labs)){
  x = rbind(x, t(replicate(n, rbinom(p, size = 2, prob = af[, pop.labs[ii]]))))
}
x = x[, apply(x, 2, var) > 0 ] #remove monomorphic variants
pca = prcomp(x, scale = T)
plot(pca$x[,1], pca$x[,2], col = pop, pch = 19, xlab = "PC1", ylab= "PC2",
main = "Simulation from 1000 Genomes Phase 1")
legend("topright", leg = pop.labs, col = 1:length(pop.labs), pch = 19, cex = 1)
It looks like PC1 picks up the North-South direction and PC2 separates Central Europeans and the British from Finns and Italians.
plot(pca$x[,1], pca$x[,3], col = pop, pch = 19, xlab = "PC1", ylab= "PC3",
main = "Simulation from 1000 Genomes Phase 1")
And PC3 then separates GBR and CEU.
Time to see some colorful pictures that PCA has produced with real data (slides 21-25).
Technically, PCA is the eigendecomposition of the correlation-based genetic relatedness matrix (GRM) that we discussed earlier, in the sense that the eigenvectors of the GRM are proportional to the scores on the PCs (the 1st eigenvector corresponds to the 1st PC etc.). This means that the GRM can be seen as being built up from the PCs, one by one, as shown on slide 27.
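This correspondence is straightforward to verify numerically (a sketch with simulated genotypes): the leading eigenvector of the GRM \(\mathbf{R} = \mathbf{X}\mathbf{X}^T/p\), computed from the standardized genotype matrix \(\mathbf{X}\), is proportional to the PC1 scores, so their correlation is \(\pm 1\).

```r
set.seed(7)
n = 15; p = 100
X = matrix(rbinom(n * p, size = 2, prob = 0.4), nrow = n) #simulated genotypes
X = scale(X[, apply(X, 2, var) > 0]) #standardize the polymorphic SNPs
grm = X %*% t(X) / ncol(X) #correlation-based GRM (n x n)
pca = prcomp(X)
#1st eigenvector of the GRM is proportional to the scores on PC1:
abs(cor(eigen(grm)$vectors[, 1], pca$x[, 1])) #1 (up to numerical precision)
```

(The sign of an eigenvector is arbitrary, hence the absolute value.)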
The tutorial on PCA by Jon Shlens is excellent, but unfortunately it orients the data matrix the other way around from ours: the data matrix is given as a \(p \times n\) matrix, and \(p\) (the number of variables) is called \(m\).
Most GWAS software packages can do PCA. Typically, the steps are as in Kerminen 2015:
Identify a suitable set of individuals by excluding one individual from each pair of closely related individuals. (Close relatives can drive some of the leading PCs and hence mix up the analysis if the purpose is to find population structure, as we will see soon.)
Identify a suitable set of SNPs; typically MAF > 5% and strict LD pruning are applied (we’ll talk about pruning in the next chapter). Also remove the known regions of high LD (Price et al. 2008).
Make sure that the method uses mean-centered genotypes; often the SNPs are further standardized to have variance 1 in the sample. Note: the GRM also uses standardized genotypes, as it is based on the genotype correlation.
Plot the loadings along the genome to see whether some regions have much larger contributions than the genome average. Spikes somewhere in the genome indicate that those regions are driving the PC, most likely because of inadequate pruning in that region (see Figure 9 in Kerminen 2015).
Visualize the PCs, e.g., by plotting two PCs against each other. Observe whether there are outliers that seem to drive some of the leading (say 20) PCs, possibly remove those outliers, and do the PCA anew. They could be individuals with quality problems or with genetic ancestry that differs from the rest of the sample, which can be a problem in GWAS as we’ll discuss soon. If you have geographic location information about your samples, color each PC plot by the expected geographic populations. If the PCs pick up the populations, the PCA is likely done correctly; if not, more investigation is needed.
Project the excluded relatives on the PCs. Some relatives may have been excluded while generating the PCs, but they will still get valid scores on the PCs when projected using the loadings that were computed while they were excluded.
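As an illustration of the loadings check in the steps above, here is a sketch with simulated data in which 20 SNPs form a perfectly correlated block, mimicking an unpruned high-LD region; the squared PC1 loadings spike at that block:

```r
set.seed(8)
n = 100; p = 200
X = matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n) #independent SNPs
X[, 101:120] = X[, 101] #make SNPs 101-120 identical: an unpruned "LD block"
pca = prcomp(X, scale = TRUE) #in real data, remove monomorphic SNPs before scaling
plot(pca$rotation[, 1]^2, type = "h", xlab = "SNP index",
     ylab = "squared loading on PC1",
     main = "PC1 is driven by the LD block at SNPs 101-120")
```

Such a spike tells us that PC1 reflects local LD structure rather than genome-wide population structure, and the region should be pruned or removed before redoing the PCA.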
Let’s make a data set that has 10 individuals from each of populations 1 and 2. The Fst between populations is 0.01 and the data includes one pair of full sibs from population 1, while the other individuals are unrelated within each population. Let’s see how the 1st and the 2nd PC behave.
set.seed(20)
p = 10000 #SNPs
fst.12 = 0.01
cols = c("black","limegreen")
f = runif(p, 0.2,0.5) #common SNPs in background population
f.1 = rbeta(p, (1-fst.12)/fst.12*f, (1-fst.12)/fst.12*(1-f))
f.2 = rbeta(p, (1-fst.12)/fst.12*f, (1-fst.12)/fst.12*(1-f))
X = rbind(offspring.geno(n.families = 1, n.snps = p, fs = f.1, n.shared.parents = 2), #full-sibs from 1
offspring.geno(n.families = 4, n.snps = p, fs = f.1, n.shared.parents = 0), #unrel from 1
offspring.geno(n.families = 5, n.snps = p, fs = f.2, n.shared.parents = 0)) #unrel from 2
pop = c(rep(1,2*5), rep(2,2*5)) #population labels: 10x Pop1 and 10x Pop2.
X = X[, apply(X, 2, var) > 0] #remove monomorphic variants before scaling
pca = prcomp(X, scale = T) #do PCA
plot(pca$x[,1], pca$x[,2], asp = 1, col = cols[pop], xlab = "PC1", ylab = "PC2", pch = 19)
Thus, the 1st PC picks up the relative pair and not the population structure; only the 2nd PC picks up the population structure. This is not what we want in GWAS, where we want to account for relatedness in other ways and use PCA to get an idea of the population structure. Let’s redo the PCA by first removing one of the sibs, and then projecting him/her back among the others in the PC plot.
X.unrel = X[-1,] #remove row 1
pca = prcomp(X.unrel, scale = T)
plot(pca$x[,1], pca$x[,2], asp = 1, col = cols[pop[-1]], xlab = "PC1", ylab = "PC2",
main = "Unrelated PCA", pch = 19)
#project back individual 1 to the PCs spanned by the 19 other individuals
pca.1 = predict(pca, newdata = matrix(X[1,], nrow = 1)) #projects newdata to PCs defined in 'pca'
points(pca.1[1], pca.1[2], col = "violet", pch = 19, cex = 1.2) #ind 1 on PCs
We see that the excluded sibling (in violet) is projected among his/her population, and now PCs are not affected by any close relationships.
Note that when projecting samples using predict() on an object from prcomp(), the new data must be in the same units as the original data that was used to do PCA with prcomp(). If the original data went into prcomp() as unscaled genotypes, then the new data must also be unscaled genotypes with the same allele coding. And if the data were scaled outside prcomp() first, then the new data must be scaled with the same means and SDs before the prediction.
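When prcomp() itself does the scaling, predict() applies the centering and scaling values stored in pca$center and pca$scale to the new data before multiplying by the loadings. The equivalence can be checked by hand (a sketch with simulated genotypes):

```r
set.seed(9)
X = matrix(rbinom(300, size = 2, prob = 0.4), nrow = 30) #30 individuals x 10 SNPs
X = X[, apply(X, 2, var) > 0] #remove monomorphic SNPs
pca = prcomp(X[-1, ], scale = TRUE) #PCA without individual 1
auto = predict(pca, newdata = matrix(X[1, ], nrow = 1)) #project individual 1
manual = ((X[1, ] - pca$center) / pca$scale) %*% pca$rotation #the same by hand
all.equal(as.numeric(auto), as.numeric(manual)) #TRUE
```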
Those interested in current level of fine-scale genetic structure that can be extracted from genetic data can see results from
Estonia (Pankratov et al. 2020).
France (Saint Pierre et al. 2020).
Netherlands (Byrne et al. 2020).
Scotland (Gilbert et al. 2019).
Spain (Bycroft et al. 2019).
Finland (Kerminen et al. 2017).
Ireland (Gilbert et al. 2017).
These are not done by PCA, but by an approach that estimates genome segment sharing.