---
title: "HDS 10. Nonlinear Dimension Reduction with t-SNE and UMAP"
author: "Matti Pirinen, University of Helsinki"
date: "30.11.2021"
output:
  html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
We have seen how PCA extracts the linear combinations of the $p$ original
variables that are maximally informative among all linear combinations.
Often the leading PCs have a clear and interpretable structure, and therefore
PCA is a widely-used method to visualize and reduce high-dimensional data.
PCs are **global linear** functions of the data,
and hence the leading PCs tend to capture directions of the input space
along which distant data points remain
distant from each other also in the projection,
because such directions maximize the variance of the projected points.
However, for high-dimensional data that happen to be structured
in some non-linear way on some lower dimensional subspace,
it would also be important to keep the similar samples close together
in the low-dimensional representation,
which may not be possible by any global linear function such as a PC.
Many methods for dimension reduction that try to capture more of the
local structure are non-linear and
are not guaranteed to yield a globally optimal solution
(result may change with the seed of the random number generator
that initializes the algorithm).
Here we study two methods: [**t-SNE**](https://lvdmaaten.github.io/tsne/)
and [**UMAP**](https://umap-learn.readthedocs.io/en/latest/index.html).
Let's first see what they produce in practice and
then come back to what is going on under the hood.
### 1000 Genomes data
The [1000 Genomes Project](https://www.internationalgenome.org/)
has produced genotype data from across the world.
Here we consider a subset of $n=1092$ individuals from
the following 14 **populations**,
divided into 4 continental **groups** (sample sizes in parentheses):
* ASW [AFR] (61) - African Ancestry in Southwest US
* CEU [EUR] (85) - Utah residents (CEPH) with Northern and Western European ancestry
* CHB [ASN] (97) - Han Chinese in Beijing, China
* CHS [ASN] (100) - Southern Han Chinese
* CLM [AMR] (60) - Colombian in Medellin, Colombia
* FIN [EUR] (93) - Finnish from Finland
* GBR [EUR] (89) - British from England and Scotland
* IBS [EUR] (14) - Iberian population in Spain
* JPT [ASN] (89) - Japanese in Tokyo, Japan
* LWK [AFR] (97) - Luhya in Webuye, Kenya
* MXL [AMR] (66) - Mexican Ancestry in Los Angeles, CA
* PUR [AMR] (55) - Puerto Rican in Puerto Rico
* TSI [EUR] (98) - Toscani in Italia
* YRI [AFR] (88) - Yoruba in Ibadan, Nigeria
Each individual has been measured on $p=4212$ genetic variants
(each can have value 0, 1 or 2) from chromosomes 15-22.
```{r}
X = read.table("geno_1000G_phase1_chr15-22.txt", as.is = T, header = T)
dim(X)
X[1:4, 1:10]
table(X[,"group"], X[,"population"])
```
Let's do PCA and plot 12 leading PCs in pairwise plots coloring
each individual by the continental group (Africa, Americas, Asia, Europe)
given by the `group` variable.
```{r, fig.width = 9.5, fig.height = 7}
x = as.matrix(X[, 4:ncol(X)])
n = nrow(x)
p = ncol(x)
pr = prcomp(x, scale = T)
grs = names(table(X$group))
grs.col = hsv(c(0.1, 0.3, 0.9, 0.6), 1, 1) #find distinct colors for groups
cols.gr = rep(NA, n) #color of the group of each individual
for(ii in 1:length(grs)) cols.gr[X$group == grs[ii]] = grs.col[ii]
par(mfrow = c(2,3))
for(ii in 1:6){
plot(pr$x[,2*ii-1], pr$x[,2*ii], col = cols.gr, pch = 3,
xlab = paste0("PC", 2*ii-1), ylab = paste0("PC", 2*ii))
if(ii == 1) legend("bottomright", col = grs.col, leg = grs, pch = 3, cex = 1.3)
}
```
Visually, PCs 1-8 seem to capture broader structure whereas
PCs from 9 onwards seem to
separate small groups of possibly more closely related pairs or triples.
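This visual impression can be complemented by the proportion of variance
explained by each PC, computed from the standard deviations that `prcomp()`
returns (a quick check using the `pr` object computed above):

```{r}
pve = pr$sdev^2 / sum(pr$sdev^2)  # proportion of variance explained by each PC
round(100 * pve[1:12], 2)         # percentages for the 12 leading PCs
round(100 * cumsum(pve)[1:12], 2) # cumulative percentages
```
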
Let's next color points by population rather than continent.
```{r, fig.width = 8.5, fig.height = 5.5}
#3 AFR, 3 AMR, 3 ASN and 5 EUR populations given colors and plotting symbols
pops = c("ASW","LWK","YRI","CLM","MXL","PUR","CHB","CHS","JPT","CEU","GBR","IBS","TSI","FIN")
pops.col = hsv(c(0.15, 0.1, 0.125, 0.25, 0.35, 0.4,
0.85, 0.9, 0.95, 0.6, 0.65, 0.7, 0.75, 0.5), 1, 1)
pops.pch = c(3,2,0, 0,2,3, 0,2,3, 0,2,1,5,3) #plotting symbol for each population
cols.pop = rep(NA, n)
pchs.pop = rep(NA, n)
for(ii in 1:length(pops)) {
cols.pop[X$population == pops[ii]] = pops.col[ii]
pchs.pop[X$population == pops[ii]] = pops.pch[ii]
}
par(mar = c(4,4,0.5,8), xpd = TRUE)
plot(pr$x[,1], pr$x[,2], col = cols.pop, pch = pchs.pop, xlab = "PC1", ylab = "PC2")
legend("topright", inset = c(-0.15,0), leg = pops, col = pops.col, pch = pops.pch)
```
```{r, fig.width = 6.5, fig.height = 6.5 }
par(mfrow=c(2,2), mar = c(4,4,1,0.5))
for(ii in 1:4){
plot(pr$x[,2*ii-1], pr$x[,2*ii], col = cols.pop, pch = pchs.pop,
xlab = paste0("PC",2*ii-1), ylab = paste0("PC",2*ii))
}
legend("bottomright", leg = pops, col = pops.col, pch = pops.pch, cex = 0.8)
```
Let's then compare the PCA plots to t-SNE and UMAP, using the R packages
`Rtsne` and `umap`, respectively,
applied here to compress the first 8 PCs to 2 dimensions.
```{r, fig.width = 8, fig.height = 6, eval = T}
#install.packages("Rtsne")
library(Rtsne)
set.seed(67)
tsne = Rtsne(X = pr$x[,1:8], perplexity = 10, theta = 0.0, pca = FALSE)
par(mar = c(4,4,4,8), xpd = TRUE)
plot(tsne$Y, col = cols.pop, pch = pchs.pop, main = "t-SNE", xlab ="", ylab = "")
legend("topright", inset = c(-0.15,0), leg = pops, col = pops.col, pch = pops.pch)
```
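t-SNE's `perplexity` parameter sets the effective number of neighbors used when
calibrating each $\sigma_i^2$, and the resulting map can change considerably
with it. A quick sweep over a few values, reusing the objects defined above
(the values 5, 10 and 30 are an arbitrary illustration):

```{r, fig.width = 9.5, fig.height = 3.5, eval = T}
par(mfrow = c(1,3))
set.seed(67)
for(perp in c(5, 10, 30)){
  tsne = Rtsne(X = pr$x[,1:8], perplexity = perp, theta = 0.0, pca = FALSE)
  plot(tsne$Y, col = cols.pop, pch = pchs.pop,
       main = paste0("perplexity = ", perp), xlab = "", ylab = "")
}
```
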
```{r, fig.width = 8, fig.height = 6, eval = T}
#install.packages("umap")
library(umap)
set.seed(67)
umap.res = umap(pr$x[,1:8])
par(mar = c(4,4,4,8), xpd = TRUE)
plot(umap.res$layout, col = cols.pop, pch = pchs.pop, main = "UMAP", xlab ="", ylab = "")
legend("topright", inset = c(-0.15,0), leg = pops, col = pops.col, pch = pops.pch)
```
We see that t-SNE and UMAP largely
group individuals from the same population close together
and separate them from other populations, whereas they
put less emphasis on making, for example,
the two African populations YRI and LWK
equally distant from all 5 European populations or all 3 Asian populations,
as PCs 1 and 2 did above.
Thus, we see that t-SNE and UMAP may indeed preserve more of the
local structure around the neighborhood of each sample but consequently cannot
be completely faithful to the overall global structure as defined by leading PCs.
In a sense, t-SNE and UMAP try to present
both the local and global structure in only 2 dimensions,
and for this they need to find a trade-off between these two goals.
### t-SNE: t-distributed Stochastic Neighbor Embedding
t-SNE was introduced in
["Visualizing Data using t-SNE"](http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)
by van der Maaten & Hinton (2008).
It builds on earlier work on Stochastic Neighbor Embedding (SNE),
where the idea is to measure distance in
the high-dimensional input space by a conditional probability.
If $\pmb{x}_i$ and $\pmb{x}_j$
are two $p$-dimensional data points,
we can compute the conditional probability $p_{j|i}$ that
$\pmb{x}_i$ would pick $\pmb{x}_j$ as its neighbor,
if neighbors were chosen in proportion to a Gaussian density
centered on $\pmb{x}_i$ with the same variance $\sigma_i^2$ in each dimension
(we come back later to how $\sigma_i^2$ is chosen):
$$
p_{j|i} \propto \exp \left(- \frac{\|\pmb{x}_i - \pmb{x}_j\|_2^2}{2\,\sigma_i^2} \right)
\quad \textrm{ and } \quad \sum_{j\neq i} p_{j|i} = 1.
$$
This probability is larger for points that are closer to $\pmb{x}_i$ than for
those that are farther away.
Similarly, we can define $p_{i|j}$.
Finally, we can average (and normalize by $n$) these to get
$p_{ij} = p_{ji} = \frac{1}{2n}(p_{i|j} + p_{j|i})$
to represent the similarity between $\pmb{x}_i$ and $\pmb{x}_j$ by a single value.
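As a concrete sketch of these definitions (with every $\sigma_i^2$ fixed to 1
for illustration; in practice t-SNE calibrates each $\sigma_i^2$ separately):

```{r}
set.seed(1)
x.toy = matrix(rnorm(5 * 3), nrow = 5)   # n = 5 toy points in p = 3 dimensions
n.toy = nrow(x.toy)
D2 = as.matrix(dist(x.toy))^2            # squared Euclidean distances
P.cond = exp(-D2 / 2)                    # unnormalized p_{j|i} with sigma_i^2 = 1
diag(P.cond) = 0                         # a point cannot pick itself as neighbor
P.cond = P.cond / rowSums(P.cond)        # row i now satisfies sum_j p_{j|i} = 1
P = (P.cond + t(P.cond)) / (2 * n.toy)   # symmetrized p_ij; sums to 1 over all pairs
```
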
The goal of SNE is to map the $p$-dimensional input points $\pmb{x}_i$ to
2-dimensional (or 3-dimensional) points $\pmb{y}_i$ in such a way that
the similarities $q_{ij}$, defined by a similar density evaluation
in the output space, optimally match the input-space similarities $p_{ij}$.
Here "optimally" means in terms of the Kullback-Leibler divergence of distribution
$Q = (q_{ij})$ from distribution $P = (p_{ij})$:
$$
\textrm{KL}(P\,||\,Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.
$$
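As a sketch, for symmetric similarity matrices `P` and `Q` (zero diagonal,
off-diagonal entries summing to 1), the objective could be evaluated as below;
the function `kl.divergence` and its small `eps` guard against `log(0)` are our
own illustration, not part of any package:

```{r}
# KL(P || Q) summed over the off-diagonal entries of similarity matrices P and Q
kl.divergence = function(P, Q, eps = 1e-12) {
  off = row(P) != col(P)  # exclude the diagonal terms
  sum(P[off] * log((P[off] + eps) / (Q[off] + eps)))
}
```

Because each term is weighted by $p_{ij}$, representing nearby input points
(large $p_{ij}$) by distant output points (small $q_{ij}$) is penalized
heavily, while the reverse costs little; this asymmetry is why SNE-type
methods concentrate on preserving local neighborhoods.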
Let's see how UMAP's `n_neighbors` parameter compares to the effect of perplexity in t-SNE (that we saw above).
```{r, fig.width = 9.5, fig.height = 3.5, eval = T, warning = F}
par(mfrow=c(1,3))
set.seed(81)
umap.config = umap.defaults #umap takes parameters in a config object
for(nbors in c(3, 5, 20)){
umap.config$n_neighbors = nbors
umap.res = umap(pr$x[,1:8], config = umap.config)
plot(umap.res$layout, col = cols.pop, pch = pchs.pop,
main = paste0("neighbors = ",nbors), xlab ="", ylab = "")
}
```
#### Discussion
* Most non-linear dimension reduction techniques (including t-SNE and UMAP) lack the strong interpretability of Principal Component Analysis where the dimensions are the directions of greatest variance in the source data. If strong interpretability is needed, PCA is recommended.
* As t-SNE and UMAP are based on the distances between observations rather than the source features, they do not produce the easily interpretable per-variable loadings that PCA provides for each output dimension.
* A core assumption of UMAP is that there exists manifold structure in the
data. Because of this, UMAP may find manifold structure within the noise of
a dataset, a type of overfitting.
As more data are sampled, UMAP becomes more robust; however, care must be taken with
small samples of noisy data, or with data that have only large-scale manifold structure.
* If data are high-dimensional, say $p>100$, it is often useful to first
apply PCA and take some tens of the leading PCs as input for t-SNE or UMAP to further
reduce to 2 or 3 dimensions for visualization.
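The last point can be written as a small pipeline sketch on simulated data
(the toy dimensions and the choice of 20 PCs are arbitrary illustrations):

```{r}
library(umap)
set.seed(1)
x.sim = matrix(rnorm(200 * 150), nrow = 200) # toy data: n = 200, p = 150
pcs = prcomp(x.sim, scale = TRUE)$x[, 1:20]  # first reduce to 20 leading PCs
layout = umap(pcs)$layout                    # then to 2 dimensions with UMAP
dim(layout)                                  # n x 2 matrix, ready for plotting
```
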