and 95% CI for OR is $(\exp(\textrm{logOR}-1.96\cdot \textrm{SE}),\, \exp(\textrm{logOR}+1.96\cdot \textrm{SE}))$.

Thus, the endpoints of the 95%CI must always be computed on the log-odds scale and then transformed to the OR scale. Similar inference can be done for the $\textrm{OR}_2$ parameter measuring the odds-ratio between genotypes 2 and 0 by substituting the counts of genotype 1 with the counts of genotype 2 in the formulas above. If any counts are very small, or even 0, the SE is not reliable. One can add a value of 0.5 to each of the observed counts to get an OR estimate even in these case, but one shouldn't rely on the SE estimate. #### Logistic regression Logistic regression model takes the place of linear regression as the basic GWAS model when the phenotype is binary. It explains the logarithm of the odds of the disease by the genotype. The simplest model is the **additive** model: $$\log\left( \frac{\textrm{Pr}(Y=1|X=x)}{\textrm{Pr}(Y=0|X=x)} \right) = \mu + x\beta.$$ Thus, $\mu$ is the logarithm of odds ('log-odds') for genotype 0 and $\beta$ is the log of odds ratio (logOR) between genotype 1 and 0 (and $\exp(\beta)$ is the corresponding odds ratio). Similarly $2\beta$ is the logOR between genotypes 2 and 0. This model is additive on the log-odds scale and hence multiplicative on the odds scale. Due to this duality, it is sometimes called additive model and sometimes called multiplicative model, which is a source of confusion. In these notes, it is called the additive model. In R, such a logistic regression model can be fitted by `glm(y~x, family="binomial")`. To try out logistic regression, we should learn how to simulate some case-control data according to the logistic regression model. **Example 1.5.** Let's assume that our risk allele $A$ has frequency 13% in controls, and that it is in HWE in controls. If the risk model is additive on log-odds scale with odds-ratio 1.43 per each copy of allele A, what are the genotype frequencies in cases? Let's denote case frequencies by $f_0, f_1, f_2$ and control frequencies by $q_0, q_1, q_2.$ From formulas above, we get that $$f_1 = \textrm{Pr}(A=1|Y=1) = \textrm{OR}_1 \frac{\textrm{Pr}(A=1|Y=0)} {\textrm{Pr}(A=0|Y=0) } \textrm{Pr}(A=0|Y=1) = \textrm{OR}_1 \frac{q_1 f_0}{q_0},$$ $$f_2 = \textrm{Pr}(A=2|Y=1) = \textrm{OR}_2 \frac{\textrm{Pr}(A=2|Y=0)} {\textrm{Pr}(A=0|Y=0) } \textrm{Pr}(A=0|Y=1) = \textrm{OR}_2 \frac{q_2 f_0}{q_0}.$$ Since $f_0+f_1+f_2=1,$ we get $$f_0 = \left(1 + \textrm{OR}_1 \frac{q_1}{q_0} +\textrm{OR}_2 \frac{q_2 }{q_0}\right)^{-1}.$$ Now we can compute the genotype frequencies in cases (note $\textrm{OR}_2 = \textrm{OR}_1^2$ under the additive model): ```{r} or = 1.43 a.cntrl = 0.13 q = c((1-a.cntrl)^2, 2*a.cntrl*(1-a.cntrl), a.cntrl^2) #HWE holds in controls f.0 = 1/(1 + or*q[2]/q[1] + or^2*q[3]/q[1]) f = c(f.0, or*q[2]/q[1]*f.0, or^2*q[3]/q[1]*f.0) rbind(controls = q, cases = f) #print out cases and controls ``` Let's generate 2000 cases and controls from these genotype frequencies and estimate the genetic effect using logistic regression. ```{r} n = 2000 x.cases = sample(c(0,1,2), prob = f, size = n, replace = T) x.controls = sample(c(0,1,2), prob = q, size = n, replace = T) x = c(x.cases, x.controls) #genotypes of all samples y = c(rep(1,n), rep(0,n)) #binary phenotype corresponding to genotypes: 1st cases, then controls glm.fit = glm(y ~ x, family = "binomial") summary(glm.fit) ``` What is the estimate and 95%CI on odds ratio scale? ```{r} b = summary(glm.fit)$coeff[2,1] #estimate, beta-hat se = summary(glm.fit)$coeff[2,2] #standard error exp(c(b, b - 1.96*se, b + 1.96*se)) #endpoints always on logOR scale, then transform to OR scale ``` Let's compare the result from the additive logistic regression model to the results that we get from the raw genotype counts between the genotypes 1 and 0 using the formulas above (see "Inference for OR based on counts"): ```{r} s1 = sum(x.cases == 1); s0 = sum(x.cases == 0); r1 = sum(x.controls == 1); r0 = sum(x.controls == 0) or.1.counts = s1*r0/s0/r1 se.1.counts = sqrt(sum( 1/c(s1,s0,r1,r0) ) ) c(or.1.counts, exp(log(or.1.counts) - 1.96*se.1.counts), exp(log(or.1.counts) + 1.96*se.1.counts) ) ``` These are a bit different from the additive model results above. The reason is that the additive model above also uses the data from individuals with genotype 2 to estimate the OR parameter. It turns out that if we apply the additive regression model only to the individuals having either genotype 0 or genotype 1, then we get essentially the same results as we get from the raw counts: ```{r} glm.fit = glm(y[x != 2] ~ x[x != 2], family = "binomial") b = summary(glm.fit)$coeff[2,1] #estimate, beta-hat se = summary(glm.fit)$coeff[2,2] #standard error exp(c( b, b - 1.96 * se, b + 1.96 * se )) ``` Remember that even though the count based inference and logistic regression inference based on additive model gives similar results here, they may give very different results when there are confounding covariates included in the regression model, and in such situations we trust more in the regression model results. We come back to this later. Analogous to the quantitative phenotypes, we can use the full model also for the binary data: ```{r} z = as.numeric( x == 2 ) glm.full = glm( y ~ x + z, family = "binomial") summary(glm.full) ``` In this example, we have no deviation from additivity. #### Ascertained case-control studies Suppose we are studying MS disease whose prevalence is about 1/1000. Even if we collect 100,000 individuals from the population, we still would get only about 100 cases! In ascertained case-control studies we enrich cases by collecting a case sample directly from the diseased people and similarly we collect controls either from the general population or, even better, from the individuals we know are disease free. (For diseases with prevalence < 1% there is little difference between these control ascertainment strategies.) Thus by a phenotype-based ascertainment we may have a GWAS of 10,000 individuals that is divided into sample of 5,000 cases and 5,000 controls. This approach gives much higher power than population collection of 100 cases and ~100,000 controls. (We will do power analyses next week.) Can we analyse such ascertained case-control samples using the same logistic regression model as applied above? The answer is yes. The parameter $\beta$ is the logOR and we showed earlier that this parameter can be estimated also by ascertaining individuals based on their phenotypes. This result also extends to the use of logistic regression. However, the parameter $\mu$ that determines the absolute odds of disease for genotype class 0 depends on the sampling strategy, i.e., on the proportion of cases in the data. Thus, in ascertained data, $\mu$ does NOT estimate the population prevalence parameter of the genotype group 0. However, we can still apply the logistic regression model to the ascertained case-control sample and estimate the three central association statistics: genetic effect $\beta$, its uncertainty (standard error), and P-value. #### GWAS software As current GWAS consider 10,000s of individuals and millions of variants, those analyses are done with a specific software that read the specific file formats. Most popular software is PLINK in its recent, efficient version [2.0](https://www.cog-genomics.org/plink/2.0/). Another widely used software is [SNPTEST](https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html). In this course, we do not focus on the commands or input file formats of any particular software, since that would take already all our time and the software are in constant development. The goal of this course is to understand why each analysis is done and how to interpret the output, in particular from a statistical point of view. These skills are independent of the GWAS software.