---
title: 'GWAS 11: Mendelian randomization'
author: "Matti Pirinen, University of Helsinki"
date: "27-Feb-2019"
output:
  pdf_document: default
  html_document: default
urlcolor: blue
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(19)
```

This document is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).

The slide set referred to in this document is "GWAS 11".

Mendelian randomization uses genetic variants to study whether a particular
exposure is a **causal** risk factor for a disease. 
See slides 2-?
A soft [introduction](https://www.technologyreview.com/s/611713/researchers-find-way-to-mimic-clinical-trials-using-genetics/) by G.Taubes. 
A [review](https://academic.oup.com/hmg/article/23/R1/R89/2900899) by Davey Smith & Hemani 2014.

Let's consider the example from [Do et al. 2013](http://www.nature.com/ng/journal/v45/n11/full/ng.2795.html) Nat Gen.
They studied the relationship between LDL-C, HDL-C and triglyserides (TG) on the
risk of coronary artery disease (CAD). 
They collected 185 SNPs that were associated with some of the risk factors 
(LDL, HDL or TG, or several of them).

The idea in MR is to see how SNPs associated some risk factor of a disease
are associated with the disease.
If the risk factor has a causal effect on the disease, then the SNPs affecting
the risk factor should also affect the disease. Furthermore, we could have
a quantitative expectation how much each SNP increases the risk based on
how much the SNP changes the risk factor, 
and how much the risk factor affects the disease risk.

```{r}
x = read.table("https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/2019/material/do_2013_frq_data.txt",
               as.is = T, header = T)
x[1,] 
nrow(x) #how many SNPs?
```

So we have 185 SNPs, and for each we know its allele1's effect on
HDL, LDL and TG as well as
effect on risk for CAD.

Suppose we want to know whether LDL has a causal effect on CAD.
In standard MR we would look at SNPs that affect ONLY LDL, and 
we would study whether they also affect CAD.

```{r, fig.height = 3}
#Pick SNPs that affect only one trait, and see whether they affect CAD
p.aff = 5e-8 #P-value < this means SNP is affecting 
p.not.aff = 1e-2 #P-value > this means SNP is not affecting
cols = c("red","orange","dodgerblue")
traits = c("TG","LDL","HDL")
par(mfrow=c(1,3))
for(set in 1:3){
 foc = traits[1 + set %% 3] #which trait is the focal one, which are side ones 
 s1 = traits[1 + (set+1) %% 3]
 s2 = traits[1 + (set+2) %% 3]

 ind = (x[,paste0(foc,"_p")] < p.aff & 
        x[,paste0(s1,"_p")] > p.not.aff &
        x[,paste0(s2,"_p")] > p.not.aff)

 plot(x[ind,paste0(foc,"_beta")],x[ind,"CAD_beta"],
      xlab = paste0(foc,"_beta"), ylab = "CAD_beta", pch = 19, col = cols[set],
      main = paste(sum(ind),"SNPs"), cex.axis = 1.3, cex.lab = 1.5)
 grid()
 abline(h=0,lty = 2)
 abline(v=0,lty = 2)
}
```

What would you think about these traits being causal for CAD?

The need to remove all SNPs with effects on several traits limits 
information from 185 to only a handful of SNPs.
Let's do as Do et al. (2013) and use all the SNPs and regress 
the CAD effects on the effects on traits.

First regression for one trait at a time:
```{r, fig.height = 3}
res = data.frame(matrix(NA, ncol=5, nrow=3))
par(mfrow=c(1,3))
for(set in 1:3){
 foc=c("LDL","HDL","TG")[set] #which trait is the focal one, which are side ones 
 y.lab = "CAD_beta"; x.lab = paste0(foc,"_beta")
 y = x[,y.lab]; z = x[,x.lab]
 lm.fit = lm( y ~ z )
 plot(z, y, main = paste0("r2=", signif( cor(z,y)^2, 2)),
      xlab = x.lab, ylab = y.lab, col = cols[set], pch = 19)
 grid()
 abline(h = 0)
 abline(v = 0)
 res[set,] = c(foc, signif(summary(lm.fit)$coeff[2,],3))
}
res
```

All three traits have a significant correlation with CAD effect size.
What happens when all three are in the same model ? 

```{r}
lm.fit=lm(CAD_beta ~ HDL_beta + LDL_beta + TG_beta, data = x)
summary(lm.fit)$coeff[-1,]
```

HDL is not important predictor when LDL and TG are in the model.
So HDL is not likely to be causal. But LDL and TG might be.
However, if we had measured also some other correlated lipids or other biomarkers
then we might find a more precise cause for the association between our 
current LDL and TG effects and CAD effects.
For example, **ApoB** has been suggested as a possible factor
behind both LDL-C and TG effects on CAD by [Ference et al. 2019](https://jamanetwork.com/journals/jama/fullarticle/2722770).

We cannot ever prove causality using only MR, but we can learn
about the stregth of evidence for the causality hypothesis.