and 95% CI for OR is $(\exp(\textrm{logOR}-1.96\cdot \textrm{SE}),\, \exp(\textrm{logOR}+1.96\cdot \textrm{SE}))$.

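For example, with small hypothetical counts (chosen only for illustration, not from real data), the interval can be computed as follows; the names `s1, s0, r1, r0` follow the count notation of this section (cases and controls with genotypes 1 and 0):

```{r}
# Hypothetical genotype 1 vs 0 counts (for illustration only):
s1 = 60; s0 = 40   # cases with genotype 1 and genotype 0
r1 = 45; r0 = 55   # controls with genotype 1 and genotype 0
or.est = s1*r0 / (s0*r1)               # OR estimate
se = sqrt(1/s1 + 1/s0 + 1/r1 + 1/r0)   # SE of logOR
c(or = or.est,
  low95 = exp(log(or.est) - 1.96*se),
  up95 = exp(log(or.est) + 1.96*se))
# about 1.83 with 95%CI (1.05, 3.21)
```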
Thus, the endpoints of the 95%CI must always be computed on the log-odds scale and then transformed to the OR scale. Similar inference can be done for the $\textrm{OR}_2$ parameter, measuring the odds ratio between genotypes 2 and 0, by substituting the counts of genotype 1 with the counts of genotype 2 in the formulas above. If any of the counts are very small, or even 0, the SE is not reliable. One can add a value of 0.5 to each of the observed counts to get an OR estimate even in these cases, but one shouldn't rely on the SE estimate.

#### Logistic regression

The logistic regression model takes the place of linear regression as the basic GWAS model when the phenotype is binary. It explains the logarithm of the odds of the disease by the genotype. The simplest model is the **additive** model:
$$\log\left( \frac{\textrm{Pr}(Y=1\,|\,X=x)}{\textrm{Pr}(Y=0\,|\,X=x)} \right) = \mu + x\beta.$$
Thus, $\mu$ is the logarithm of the odds ('log-odds') for genotype 0 and $\beta$ is the log of the odds ratio (logOR) between genotypes 1 and 0 (and $\exp(\beta)$ is the corresponding odds ratio). Similarly, $2\beta$ is the logOR between genotypes 2 and 0. This model is additive on the log-odds scale and hence multiplicative on the odds scale. Due to this duality, it is sometimes called the additive model and sometimes the multiplicative model, which is a source of confusion. In these notes, it is called the additive model. In R, such a logistic regression model can be fitted by the command `glm(y ~ x, family = "binomial")`.

To try out logistic regression, we should learn how to simulate case-control data that follow the logistic regression model.

**Example 1.5.** Let's assume that our risk allele $A$ has frequency 13% in controls, and that it is in HWE in controls. If the risk model is additive on the log-odds scale with an odds ratio of 1.43 per each copy of allele $A$, what are the genotype frequencies in cases?
Let's denote case frequencies by $f_0, f_1, f_2$ and control frequencies by $q_0, q_1, q_2.$ From the formulas above, we get that
$$f_1 = \textrm{Pr}(A=1\,|\,Y=1) = \textrm{OR}_1 \,\frac{\textrm{Pr}(A=1\,|\,Y=0)} {\textrm{Pr}(A=0\,|\,Y=0) }\, \textrm{Pr}(A=0\,|\,Y=1) = \textrm{OR}_1 \frac{q_1 f_0}{q_0},$$
$$f_2 = \textrm{Pr}(A=2\,|\,Y=1) = \textrm{OR}_2\, \frac{\textrm{Pr}(A=2\,|\,Y=0)} {\textrm{Pr}(A=0\,|\,Y=0) } \,\textrm{Pr}(A=0\,|\,Y=1) = \textrm{OR}_2\, \frac{q_2 f_0}{q_0}.$$
Since $f_0+f_1+f_2=1,$ we get
$$f_0 = \left(1 + \textrm{OR}_1 \frac{q_1}{q_0} +\textrm{OR}_2 \frac{q_2 }{q_0}\right)^{-1}.$$
Now we can compute the genotype frequencies in cases. (Note that $\textrm{OR}_2 = \textrm{OR}_1^2$ under the additive model.) Let's write a function that computes the case and control frequencies given the control allele frequencies and the OR.

```{r}
case.control.freqs <- function(q, or){
  # if dimension of 'q' is 1, then 'q' is taken as the allele 1 freq. in controls and HWE is assumed in controls
  # if dimension of 'q' is 3, then 'q' is taken as the genotype (0,1,2) frequencies in controls
  # if dimension of 'or' is 1, then 'or' is per each copy of allele 1
  # if dimension of 'or' is 2, then 'or[1]' is for genotype 1 vs. 0 and 'or[2]' for genotype 2 vs. 0
  if(length(q) == 1) q = c((1-q)^2, 2*q*(1-q), q^2) # assumes HWE in controls
  stopifnot(length(q) == 3)
  if(length(or) == 1) or = c(or, or^2)
  stopifnot(length(or) == 2)
  f = q / q[1] * c(1, or)
  f = f / sum(f)
  data.frame(cases = f, controls = q, row.names = c(0,1,2))
}
```

```{r}
or = 1.43
a.cntrl = 0.13
cc.f = case.control.freqs(a.cntrl, or)
cc.f
```

Let's generate 2000 cases and 2000 controls from these genotype frequencies and estimate the genetic effect using logistic regression.
```{r}
n = 2000
x.cases = sample(c(0, 1, 2), prob = cc.f$cases, size = n, replace = TRUE)
x.controls = sample(c(0, 1, 2), prob = cc.f$controls, size = n, replace = TRUE)
x = c(x.cases, x.controls) # genotypes of all samples
y = c(rep(1, n), rep(0, n)) # binary phenotype corresponding to genotypes: first cases, then controls
glm.fit = glm(y ~ x, family = "binomial")
summary(glm.fit)
```

What are the estimate and the 95%CI on the odds-ratio scale?

```{r}
b = summary(glm.fit)$coeff[2,1] # estimate, beta-hat
se = summary(glm.fit)$coeff[2,2] # standard error
# endpoints computed on logOR scale, then transformed to OR scale:
c(or.est = exp(b), low95 = exp(b - 1.96*se), up95 = exp(b + 1.96*se))
```

Let's compare the result from the additive logistic regression model to the result that we get from the raw genotype counts between genotypes 1 and 0 using the formulas above (see "Inference for OR based on counts"):

```{r}
s1 = sum(x.cases == 1); s0 = sum(x.cases == 0)
r1 = sum(x.controls == 1); r0 = sum(x.controls == 0)
or.1.counts = s1*r0 / (s0*r1)
se.1.counts = sqrt(sum( 1 / c(s1, s0, r1, r0) ))
c(or.est = or.1.counts,
  low95 = exp(log(or.1.counts) - 1.96 * se.1.counts),
  up95 = exp(log(or.1.counts) + 1.96 * se.1.counts))
```

These are a bit different from the additive model results above. The reason is that the additive model also uses the data from individuals with genotype 2 to estimate the OR parameter, while the count-based formulas do not.
It turns out that if we applied the additive regression model only to the individuals having either genotype 0 or genotype 1, then we would get essentially the same results as we get from the raw counts:

```{r}
glm.fit = glm(y[x != 2] ~ x[x != 2], family = "binomial")
b = summary(glm.fit)$coeff[2,1] # estimate, beta-hat
se = summary(glm.fit)$coeff[2,2] # standard error
exp(c( b, b - 1.96 * se, b + 1.96 * se ))
```

Remember that even though the count-based inference and the logistic regression inference based on the additive model give similar results here, they may give very different results when there are confounding covariates included in the regression model, and in such situations we trust the regression model results more. We come back to this later in the material.

Similarly to the quantitative phenotypes, we can also use the full model for the binary data:

```{r}
z = as.numeric( x == 2 )
glm.full = glm( y ~ x + z, family = "binomial")
summary(glm.full)
```

In this example, there is no deviation from additivity, as the data were simulated assuming the additive model.

#### Ascertained case-control studies

Suppose we are studying MS disease, whose prevalence is about 1/1000. Even if we collected 100,000 individuals from the population, we would still get only about 100 cases! In ascertained case-control studies we enrich for cases by collecting a case sample directly from the diseased people, and similarly we collect controls either from the general population or, even better, from individuals we know are disease free. (For diseases with prevalence < 1%, there is little difference between these two control ascertainment strategies.) Thus, by a phenotype-based ascertainment, we may have a GWAS of 10,000 individuals that is divided into sets of 5,000 cases and 5,000 controls. This approach gives much higher statistical power to detect associations than a population collection of 100 cases and ~100,000 controls. (We will do power analyses later in the course.)
Can we analyse such ascertained case-control samples using the same logistic regression model as we applied above, or does the ascertainment cause some issues? The answer is that yes, we can. The parameter $\beta$ is the logOR, and we showed earlier that this parameter can be estimated also by ascertaining individuals based on their phenotypes; a similar result extends to the use of logistic regression. However, the parameter $\mu$, which determines the absolute odds of disease for genotype class 0, depends on the sampling strategy, i.e., on the proportion of cases in the data. Thus, in ascertained data, $\mu$ does NOT estimate the population prevalence parameter of genotype group 0. However, we can still apply the logistic regression model to the ascertained case-control sample and estimate the three central association statistics: the genetic effect $\beta$, its uncertainty (standard error), and the P-value.

#### GWAS software

As current GWAS consider 100,000s of individuals and millions of variants, those analyses are done with specialized software packages that read in specific file formats. The most popular software is PLINK in its recent, efficient version [2.0](https://www.cog-genomics.org/plink/2.0/). Another widely used software is [SNPTEST](https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html). In this course, we do not focus on the commands or input file formats of any particular GWAS software, since that alone would already take all our time, and the software packages are in constant development, which means that data input formats and commands tend to change over time. Instead, the goal of this course is to understand why each analysis is done and how to interpret the output of the analyses, especially from a statistical point of view. These skills are independent of any particular GWAS software.
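Returning to the ascertainment discussion above, we can check with a small simulation that ascertainment leaves $\beta$ intact while $\mu$ reflects the case proportion in the sample rather than the population prevalence. The parameter values below (risk allele frequency 13%, OR 1.43 per allele, prevalence about 1/1000) reuse the numbers from the earlier examples, but the simulation itself is just an illustrative sketch:

```{r}
# Simulate a large population under the additive logistic model,
# then ascertain 5000 cases and 5000 controls from it.
set.seed(16)
q = 0.13                 # risk allele frequency, HWE in the population
b.true = log(1.43)       # true logOR per copy of the risk allele
mu.pop = log(1/1000)     # population log-odds of disease for genotype 0 (rare disease)
N = 5e6                  # population size
x.pop = rbinom(N, 2, q)                              # genotypes
y.pop = rbinom(N, 1, plogis(mu.pop + b.true*x.pop))  # disease status
cases = sample(which(y.pop == 1), 5000)    # ascertain cases among the diseased
controls = sample(which(y.pop == 0), 5000) # and controls among the healthy
x.cc = c(x.pop[cases], x.pop[controls])
y.cc = rep(c(1, 0), each = 5000)
fit.cc = glm(y.cc ~ x.cc, family = "binomial")
summary(fit.cc)$coeff
# The slope estimate is close to the true logOR log(1.43) ~ 0.358,
# whereas the intercept is near 0, far from the population value mu.pop ~ -6.9.
```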