--- title: 'GWAS 11: Mendelian randomization' author: "Matti Pirinen, University of Helsinki" date: "27-Feb-2019" output: pdf_document: default html_document: default urlcolor: blue --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) set.seed(19) ``` This document is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/). The slide set referred to in this document is "GWAS 11". Mendelian randomization uses genetic variants to study whether a particular exposure is a **causal** risk factor for a disease. See slides 2-? A soft [introduction](https://www.technologyreview.com/s/611713/researchers-find-way-to-mimic-clinical-trials-using-genetics/) by G.Taubes. A [review](https://academic.oup.com/hmg/article/23/R1/R89/2900899) by Davey Smith & Hemani 2014. Let's consider the example from [Do et al. 2013](http://www.nature.com/ng/journal/v45/n11/full/ng.2795.html) Nat Gen. They studied the relationship between LDL-C, HDL-C and triglyserides (TG) on the risk of coronary artery disease (CAD). They collected 185 SNPs that were associated with some of the risk factors (LDL, HDL or TG, or several of them). The idea in MR is to see how SNPs associated some risk factor of a disease are associated with the disease. If the risk factor has a causal effect on the disease, then the SNPs affecting the risk factor should also affect the disease. Furthermore, we could have a quantitative expectation how much each SNP increases the risk based on how much the SNP changes the risk factor, and how much the risk factor affects the disease risk. ```{r} x = read.table("https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/2019/material/do_2013_frq_data.txt", as.is = T, header = T) x[1,] nrow(x) #how many SNPs? ``` So we have 185 SNPs, and for each we know its allele1's effect on HDL, LDL and TG as well as effect on risk for CAD. Suppose we want to know whether LDL has a causal effect on CAD. In standard MR we would look at SNPs that affect ONLY LDL, and we would study whether they also affect CAD. ```{r, fig.height = 3} #Pick SNPs that affect only one trait, and see whether they affect CAD p.aff = 5e-8 #P-value < this means SNP is affecting p.not.aff = 1e-2 #P-value > this means SNP is not affecting cols = c("red","orange","dodgerblue") traits = c("TG","LDL","HDL") par(mfrow=c(1,3)) for(set in 1:3){ foc = traits[1 + set %% 3] #which trait is the focal one, which are side ones s1 = traits[1 + (set+1) %% 3] s2 = traits[1 + (set+2) %% 3] ind = (x[,paste0(foc,"_p")] < p.aff & x[,paste0(s1,"_p")] > p.not.aff & x[,paste0(s2,"_p")] > p.not.aff) plot(x[ind,paste0(foc,"_beta")],x[ind,"CAD_beta"], xlab = paste0(foc,"_beta"), ylab = "CAD_beta", pch = 19, col = cols[set], main = paste(sum(ind),"SNPs"), cex.axis = 1.3, cex.lab = 1.5) grid() abline(h=0,lty = 2) abline(v=0,lty = 2) } ``` What would you think about these traits being causal for CAD? The need to remove all SNPs with effects on several traits limits information from 185 to only a handful of SNPs. Let's do as Do et al. (2013) and use all the SNPs and regress the CAD effects on the effects on traits. First regression for one trait at a time: ```{r, fig.height = 3} res = data.frame(matrix(NA, ncol=5, nrow=3)) par(mfrow=c(1,3)) for(set in 1:3){ foc=c("LDL","HDL","TG")[set] #which trait is the focal one, which are side ones y.lab = "CAD_beta"; x.lab = paste0(foc,"_beta") y = x[,y.lab]; z = x[,x.lab] lm.fit = lm( y ~ z ) plot(z, y, main = paste0("r2=", signif( cor(z,y)^2, 2)), xlab = x.lab, ylab = y.lab, col = cols[set], pch = 19) grid() abline(h = 0) abline(v = 0) res[set,] = c(foc, signif(summary(lm.fit)$coeff[2,],3)) } res ``` All three traits have a significant correlation with CAD effect size. What happens when all three are in the same model ? ```{r} lm.fit=lm(CAD_beta ~ HDL_beta + LDL_beta + TG_beta, data = x) summary(lm.fit)$coeff[-1,] ``` HDL is not important predictor when LDL and TG are in the model. So HDL is not likely to be causal. But LDL and TG might be. However, if we had measured also some other correlated lipids or other biomarkers then we might find a more precise cause for the association between our current LDL and TG effects and CAD effects. For example, **ApoB** has been suggested as a possible factor behind both LDL-C and TG effects on CAD by [Ference et al. 2019](https://jamanetwork.com/journals/jama/fullarticle/2722770). We cannot ever prove causality using only MR, but we can learn about the stregth of evidence for the causality hypothesis.