We have seen how PCA extracts such linear combinations of the \(p\) original variables that are maximally informative among all linear combinations. Often the leading PCs have a clear and interpretable structure and therefore PCA is a widely-used method to visualize and reduce high-dimensional data.

PCs are **global linear** functions of data and hence the leading PCs tend to capture such directions from the input space on which the distant data points remain distant from each other also in the leading PCs as such directions maximize the variance of the projected points. However, for high-dimensional data that happen to be structured in some non-linear way on some lower dimensional subspace, it would also be important to keep the similar samples close together in the low-dimensional representation, which may not be possible by any global linear function such as a PC.

Many methods for dimension reduction that try to capture more of the local structure are non-linear and are not guaranteed to yield a globally optimal solution (result may change with the seed of the random number generator that initializes the algorithm).

Here we study two methods: **t-SNE** and **UMAP**. Letâ€™s first see what they produce in practice and then come back to what is going on under the hood.

The 1000 Genomes Project has produced genotype data from across the world. Here we consider a subset of \(n=1092\) individuals from the following 14 **populations**, divided into 4 continental **groups**,

- ASW [AFR] (61) - African Ancestry in Southwest US
- CEU [EUR] (85) - Utah residents (CEPH) with Northern and Western European ancestry
- CHB [ASN] (97) - Han Chinese in Beijing, China
- CHS [ASN] (100) - Southern Han Chinese
- CLM [AMR] (60) - Colombian in Medellin, Colombia
- FIN [EUR] (93) - Finnish from Finland
- GBR [EUR] (89) - British from England and Scotland
- IBS [EUR] (14) - Iberian population in Spain
- JPT [ASN] (89) - Japanese in Toyko, Japan
- LWK [AFR] (97) - Luhya in Webuye, Kenya
- MXL [AMR] (66) - Mexican Ancestry in Los Angeles, CA
- PUR [AMR] (55) - Puerto Rican in Puerto Rico
- TSI [EUR] (98) - Toscani in Italia
- YRI [AFR] (88) - Yoruba in Ibadan, Nigeria

Each individual has been measured on \(p=4212\) genetic variants (each can have value 0, 1 or 2) from chromosomes 15-22.

```
X = read.table("geno_1000G_phase1_chr15-22.txt", as.is = T, header = T)
dim(X)
```

`## [1] 1092 4215`

`X[1:4, 1:10]`

```
## id population group X1 X2 X3 X4 X5 X6 X7
## 1 HG00096 GBR EUR 0 0 0 0 2 0 0
## 2 HG00097 GBR EUR 1 1 0 0 1 0 0
## 3 HG00099 GBR EUR 0 0 0 0 1 1 0
## 4 HG00100 GBR EUR 0 0 1 0 2 0 1
```

`table(X[,"group"], X[,"population"])`

```
##
## ASW CEU CHB CHS CLM FIN GBR IBS JPT LWK MXL PUR TSI YRI
## AFR 61 0 0 0 0 0 0 0 0 97 0 0 0 88
## AMR 0 0 0 0 60 0 0 0 0 0 66 55 0 0
## ASN 0 0 97 100 0 0 0 0 89 0 0 0 0 0
## EUR 0 85 0 0 0 93 89 14 0 0 0 0 98 0
```

Letâ€™s do PCA and plot 12 leading PCs in pairwise plots coloring each individual by the continental group (Africa, Americas, Asia, Europe) given by the `group`

variable.

```
x = as.matrix(X[, 4:ncol(X)])
n = nrow(x)
p = ncol(x)
pr = prcomp(x, scale = T)
grs = names(table(X$group))
grs.col = hsv(c(0.1, 0.3, 0.9, 0.6), 1, 1) #find distinct colors for groups
cols.gr = rep(NA, n) #color of the group of each individual
for(ii in 1:length(grs)) cols.gr[X$group == grs[ii]] = grs.col[ii]
par(mfrow = c(2,3))
for(ii in 1:6){
plot(pr$x[,2*ii-1], pr$x[,2*ii], col = cols.gr, pch = 3,
xlab = paste0("PC", 2*ii-1), ylab = paste0("PC", 2*ii))
if(ii == 1) legend("bottomright", col = grs.col, leg = grs, pch = 3, cex = 1.3)
}
```