Intro to statistical power

Let’s return to the setting of Lectures 2 & 3: \(n\) pairs of same-sex twins discordant for psoriasis, and in \(x\) of the pairs the psoriatic individual has a larger BMI than the non-psoriatic one.

Let’s see how the sample size \(n\) affects our inference about the proportion \(p\) of twin pairs where the psoriatic twin has a higher BMI than the non-psoriatic one. Let’s consider two data sets, with sample sizes of 20 and 100, and assume that the point estimate of the proportion is 70% in both cases. (That is, we have observed 14 and 70 successes in data sets of sizes 20 and 100, respectively.) Let’s run the binomial test on both data sets.

p.est = 0.70 #our point estimate, x/n, is 70% in both cases
n = c(20, 100) #two sample sizes 
x = n*p.est #two observations, one for each sample size, both correspond to proportion of 70%
x #14 and 70
## [1] 14 70
binom.test(x[1], n[1])
## 
##  Exact binomial test
## 
## data:  x[1] and n[1]
## number of successes = 14, number of trials = 20, p-value = 0.1153
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.4572108 0.8810684
## sample estimates:
## probability of success 
##                    0.7
binom.test(x[2], n[2])
## 
##  Exact binomial test
## 
## data:  x[2] and n[2]
## number of successes = 70, number of trials = 100, p-value = 7.85e-05
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.6001853 0.7875936
## sample estimates:
## probability of success 
##                    0.7

We see that the confidence interval around the point estimate 0.70 shrinks as the sample size \(n\) grows. This is because with larger \(n\) we have more information about where exactly the proportion parameter \(p\) lies, which is reflected in a smaller standard error and a narrower CI. With more information there is less uncertainty about \(p\). This is why statisticians generally want larger sample sizes.
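
One way to see this is via the normal approximation: the standard error of the estimate is \(\sqrt{p(1-p)/n}\), so the width of the interval shrinks like \(1/\sqrt{n}\). A rough sketch (an approximation, not the exact interval that binom.test computes):

se = sqrt(p.est*(1 - p.est)/n) #approximate standard errors for n = 20 and n = 100
cbind(n = n, lower = p.est - 1.96*se, upper = p.est + 1.96*se) #approximate 95% CIs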

There is also a striking difference in the P-values (0.12 vs 8e-5). Let’s try to understand why the sample size affects statistical testing and the P-value.

The null hypothesis is that BMI and psoriasis are independent. This is equivalent to assuming that the observed count follows the Bin(n, p = 0.5) distribution, that is, it behaves like the total number of heads in \(n\) tosses of a fair coin. Let’s plot, in the same figure, the probability mass functions under the null hypothesis over the possible point estimate values (of the proportion p = x/n) for the two sample sizes.

p.range = seq(0, 1, 0.05) #proportion values at which the distributions are evaluated, at steps of 1/20 = 0.05
n = c(20, 100)
y.1 = dbinom(p.range*n[1], size = n[1], p = 0.5) #null distribution when n = 20
y.2 = dbinom(p.range*n[2], size = n[2], p = 0.5) #null distribution when n = 100
#renormalize so that both distributions sum to 1 and are thus comparable
y.1 = y.1 / sum(y.1)
y.2 = y.2 / sum(y.2) 
plot(p.range, y.2, col = "darkblue",t = "l", lwd = 2, xlab = "proportion", ylab = "probability")
lines(p.range, y.1, col = "orange",t = "l", lwd = 2) #add line for n=20
legend("topright", legend = paste0("n=",n), col = c("orange","darkblue"), lty = 1, lwd = 2)
abline(v = 0.7, col = "red") # Mark observation 70% in red

We see that the null distribution for the larger sample size n=100 is more concentrated near the null hypothesis value 0.5 than the distribution for n=20. By looking at the probability mass outside the red line, we can conclude the following: if the true association between BMI and psoriasis were such that in 70% of discordant twin pairs the psoriatic one had a larger BMI, then with sample size \(n=100\) we would expect to distinguish that association from the null hypothesis with clear statistical evidence (a small P-value, because the observation is very improbable under the blue null), but for \(n=20\) we might not distinguish it clearly (the P-value can well be > 0.05, because an observation near 70% happens every now and then also under the orange null). We say that a larger sample size brings more statistical power to detect true deviations from the null hypothesis.

This logic is verified by the binomial tests above, where we saw that the P-values for the observation 70% with these two sample sizes are quite different: 0.12 for \(n=20\) but less than 1e-4 for \(n=100\).
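
We could also estimate this power directly by simulation: generate many data sets from the alternative (true proportion 0.7) and count how often the binomial test reaches P < 0.05. A rough sketch (the number of replications, 5000, is an arbitrary choice here):

set.seed(17) #for reproducibility of this sketch
p.true = 0.7 #assumed true proportion under the alternative
for(n in c(20, 100)){
  x.sim = rbinom(5000, n, p.true) #simulate observed counts under the alternative
  pvals = sapply(x.sim, function(x){binom.test(x, n, p = 0.5)$p.value})
  print(paste("n =", n, ": estimated power at level 0.05 is about", mean(pvals < 0.05)))
}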

Example 5.1.

Suppose that the biomarker B is distributed like N(1,1) in the healthy population and like N(2,1) in diabetes cases. Generate a random sample of \(n\) individuals from the healthy population and another sample of the same size from the diabetics. We know that the distributions have different means (namely, 1 and 2), and we want to see how clearly we could detect that difference statistically with different sample sizes. Test whether the means are different using a t-test when

  1. \(n=5\)

  2. \(n=25\)

  3. \(n=50\)

Do the means look statistically different based on the P-values in these three cases?

ns = c(5, 25, 50)
for(n in ns){ #With for loop we do not need to copy-paste the code separately for each value
  x.1 = rnorm(n, 1, 1) #healthy
  x.2 = rnorm(n, 2, 1) #diabetics
  print(t.test(x.1, x.2)) #Needs "print()" inside for-loop to print out the result of t-test.
}
## 
##  Welch Two Sample t-test
## 
## data:  x.1 and x.2
## t = -2.2419, df = 7.0118, p-value = 0.05985
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.09460386  0.05545062
## sample estimates:
## mean of x mean of y 
##  0.857512  1.877089 
## 
## 
##  Welch Two Sample t-test
## 
## data:  x.1 and x.2
## t = -2.3704, df = 43.329, p-value = 0.02229
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3441528 -0.1085158
## sample estimates:
## mean of x mean of y 
##  1.297058  2.023392 
## 
## 
##  Welch Two Sample t-test
## 
## data:  x.1 and x.2
## t = -5.4251, df = 97.976, p-value = 4.185e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.522608 -0.707022
## sample estimates:
## mean of x mean of y 
##  1.041062  2.155877

The P-values get smaller with increasing sample size, reflecting the fact that we detect the same true difference in means with stronger statistical evidence when the sample size is larger.
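
Since these data sets are randomly generated, the exact P-values vary from run to run. To remove that randomness, we could also compute the theoretical power with power.t.test(); note that it assumes the classical equal-variance two-sample t-test, so it is only an approximation to the Welch test used above.

for(n in c(5, 25, 50)){
  pw = power.t.test(n = n, delta = 1, sd = 1, sig.level = 0.05)$power #power for mean difference 1 when SD = 1
  print(paste("n =", n, ": theoretical power is about", round(pw, 2)))
}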

P-values and Type I error rate

So far, we have observed data and computed the P-value of observing such data under the null hypothesis. The idea has been that if the P-value is small, then the observed data seem unlikely to arise under the null hypothesis, and hence we have a reason to suspect the null hypothesis.

We can also imagine a statistical testing procedure where, before seeing the data and computing the P-value, we set a significance level (\(\alpha\), “alpha”, e.g. 0.05, 0.01 or 0.001), and if the observed data reach that significance level (i.e. the computed P-value is \(\leq \alpha\)), we “reject” the null hypothesis because it seems implausible. The motivation for pre-defining the significance level is that the testing procedure then has the following property: if the null hypothesis were true, then in repeated experiments we would falsely reject the null hypothesis in a proportion \(\alpha\) of the experiments. By controlling \(\alpha\), we can thus control the rate of false rejections of the null hypothesis. If \(\alpha = 0.05\), then we would make a false call in 1/20 of the experiments where the null hypothesis was actually true; if \(\alpha = 0.001\), then 1/1000 of the true null hypotheses would be falsely rejected, etc. The significance level \(\alpha\) of the testing procedure is also called the Type I error rate, since a false rejection of the null hypothesis is called a Type I error. “Type I error: Falsely declaring an effect found when there truly is none.”
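
This property is easy to check by simulation: if we repeat a coin-flipping experiment many times with a fair coin and test each data set at level 0.05, roughly 5% of the experiments get rejected (slightly fewer, because the exact binomial test is discrete and hence conservative). A minimal sketch, where the 2000 replications and n = 100 flips are arbitrary choices:

set.seed(5) #for reproducibility of this sketch
x.null = rbinom(2000, 100, 0.5) #2000 experiments where the null hypothesis holds
pvals = sapply(x.null, function(x){binom.test(x, 100, p = 0.5)$p.value})
mean(pvals <= 0.05) #proportion of false rejections, close to (at most) 0.05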

Let’s draw a picture of the null distribution for 100 flips of a fair coin (analogous to 100 twin pairs, counting those where higher BMI and psoriasis go together), and mark the two-sided significance thresholds for 0.05, 0.001 and 1e-8.

n = 100 # twin pairs
p0 = 0.5 # null hypothesis value for proportion
x = 0:n # plotting range 0,...,n
#Plot the null distribution
plot(x, dbinom(x, n, p0), t = "l", lwd = 2, main = "null distribution Bin(100,0.5)",
     xlab = "observed count", ylab = "probability", xaxs = "i", yaxs = "i")
#Mark the two-sided cutoff points for each significance threshold under the null distribution
sig.threshold = c(0.05, 0.001, 1e-8) 
cols = c("blue", "green", "orange")
for(ii in 1:length(sig.threshold) ){ #goes through all sig.thresholds
  abline(v = qbinom(sig.threshold[ii]/2, n, p0), 
         col = cols[ii], lwd = 2, lty = 2) #mark left cutpoint
  abline(v = qbinom(1-sig.threshold[ii]/2, n, p0), 
         col = cols[ii], lwd = 2, lty = 2) # mark also right cutpoint
}
legend("topright", col = cols, legend = sig.threshold, lwd = 2, lty = 2)

We see that in order to get a significant deviation from the null hypothesis at significance level 0.05, we need to have either <40 or >60 successes. On the other hand, to get statistical significance at level 1e-8, we would need as extreme a result as <22 or >78 successes. This reflects the fact that we need much more evidence against the null in order to reject it at level 1e-8 than at level 0.05. Consequently, we will make fewer Type I errors if our significance level is 1e-8 than if it is 0.05. But at the same time, at level 1e-8 we will miss some real deviations from the null hypothesis that we would detect at level 0.05. That is, we make more Type II errors with a more stringent (smaller) significance level than with a more lenient one. “Type II error: Not rejecting the null hypothesis when it does not hold.”

Typically the significance level is chosen purely by the Type I error rate, that is, by our tolerance for falsely rejecting the null hypothesis. This tolerance varies between contexts. In clinical trials, the significance level could be 0.05, but, for example, in genome-wide analyses we get excited about a new association between a genetic variant and a disease only when it reaches a significance level of 5e-8 = 0.00000005! Similarly, in 2012 the physicists claimed that they had “found” the Higgs boson only after they had gathered statistical evidence that reached their “5 sigma rule” (an observation 5 standard deviations away from the null hypothesis value), which corresponds to a significance level of about 6e-7.
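
The correspondence between the 5 sigma rule and the quoted significance level can be checked from the normal distribution as a two-sided tail probability:

2*pnorm(-5) #two-sided tail probability 5 standard deviations from the mean, about 5.7e-7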

A rationale for these differences could be that in a clinical trial we study only one hypothesis (that a therapy works), which possibly already has some prior evidence behind it, since the therapy has been developed in the first place. So we do not require a lot of statistical evidence at this point to take the process further. In genome-wide analyses, we start with a fishing expedition where we have millions of genetic variants and no prior evidence of any real association with the disease. Thus, we need a lot of statistical evidence before we start believing that this one variant (out of millions of possible ones) is truly associated with the disease. Finally, with the Higgs boson, we again have only one hypothesis to test, but here the price of making a wrong claim is high for science: physicists do not want to publish a result about the existence of a new particle unless they are (very) certain. Hence, they require much more statistical evidence than clinical trials do.

Examples 5.2.

  1. Your null hypothesis is that a coin is fair (p=50%). If you toss the coin \(n\) times and count the number of heads, what are the critical points below and above the null value 50% at which the outcome starts to show, in a statistically significant way, that the coin is biased, when
  1. \(n=10, \alpha=0.05\)

  2. \(n=100, \alpha=0.05\)

  3. \(n=10, \alpha=0.0001\)

  4. \(n=100, \alpha=0.0001\).

p = 0.5
ns = rep(c(10, 100), 2)
a = rep(c(0.05, 0.0001), each = 2)
for(ii in 1:length(ns)){ #for-loop over 4 sets of parameters n and alpha
  print(paste("n =",ns[ii],"alpha =",a[ii],": at most",
              qbinom(a[ii]/2,ns[ii],p)-1,"or at least",
              qbinom(1-a[ii]/2,ns[ii],p)+1))
}
## [1] "n = 10 alpha = 0.05 : at most 1 or at least 9"
## [1] "n = 100 alpha = 0.05 : at most 39 or at least 61"
## [1] "n = 10 alpha = 1e-04 : at most -1 or at least 11"
## [1] "n = 100 alpha = 1e-04 : at most 30 or at least 70"

We see that for \(n=10\), no observation can reach significance level \(\alpha = 0.0001\), as the critical points fall outside the possible range 0,…,10.
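
We can confirm this from the most extreme possible outcome: even 0 heads (or 10 heads) out of 10 tosses does not give a small enough P-value.

binom.test(0, 10, p = 0.5)$p.value #about 0.002, still larger than 1e-4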

  2. Compute these critical points for a fixed \(\alpha=0.05\) as \(n\) goes through 10, 50, 100, 200, …, 1000. Compute the proportions they correspond to and show the results on screen.
a = 0.05
ns = c(10, 50, seq(100,1000,100))
res = c() #empty vector that will be filled to be a matrix with 3 columns:
#1st col = n; 2nd col = lower critical point; 3rd col = upper critical point
for(ii in 1:length(ns)){
  res = rbind(res, 
              c(ns[ii], qbinom(a/2, ns[ii], p)-1, qbinom(1-a/2, ns[ii], p)+1))
}
res[,2:3] = res[,2:3]/res[,1] #proportions are columns 2 and 3 divided by column 1
res #show results on screen
##       [,1]      [,2]      [,3]
##  [1,]   10 0.1000000 0.9000000
##  [2,]   50 0.3400000 0.6600000
##  [3,]  100 0.3900000 0.6100000
##  [4,]  200 0.4250000 0.5750000
##  [5,]  300 0.4400000 0.5600000
##  [6,]  400 0.4475000 0.5525000
##  [7,]  500 0.4540000 0.5460000
##  [8,]  600 0.4583333 0.5416667
##  [9,]  700 0.4614286 0.5385714
## [10,]  800 0.4637500 0.5362500
## [11,]  900 0.4666667 0.5333333
## [12,] 1000 0.4680000 0.5320000

We see that in order to get a significant result at significance level 0.05, with a sample size of 10 we need a success rate of at most 10% or at least 90%, while with a sample size of 1000 a success rate of at most 46.8% or at least 53.2% is already enough. Sample size really matters when it comes to determining which point estimate values are significant at a fixed significance level!
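
The rate at which the critical proportions approach 50% can be understood through the normal approximation: the critical proportions are roughly \(0.5 \pm 1.96\sqrt{0.5 \cdot 0.5 / n}\), so their distance from 0.5 shrinks like \(1/\sqrt{n}\). A rough sketch (an approximation, not the exact binomial computation above):

ns = c(10, 50, seq(100, 1000, 100))
cbind(n = ns, lower = 0.5 - 1.96*sqrt(0.25/ns), upper = 0.5 + 1.96*sqrt(0.25/ns))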

Statistical Power

Statistical power is the probability of detecting a given effect size with a given sample size at a given significance level.

A typical context where statistical power comes up is the following. Let’s say that, in order to be convinced that there is an association between psoriasis and BMI, we want to see a P-value of 0.001 (or less) if the true effect size is \(p=0.70\) (or larger), when we compare the observed twin-pair data to the null hypothesis value of \(p=0.50\). How large a study do we need in order for the power with these parameters to be at least 90%?

Before computing the power, let’s add to our picture of the significance regions also the distribution under the alternative hypothesis, which is defined by the true effect size parameter. Here the alternative hypothesis is Bin(100, 0.7).

n = 100 #twin pairs
p0 = 0.5 #null hypothesis proportion
x = 0:n #plotting range 0,...,n
#Plot null distribution
plot(x, dbinom(x, n, p0), t = "l", lwd = 2, main = "null Bin(100,0.5); altern Bin(100,0.7)", 
     xlab = "observed count", ylab = "probability", xaxs = "i", yaxs = "i", ylim = c(0,0.09))
#Mark the two-sided cutoff points for each significance threshold under the null distribution
sig.threshold = c(0.05, 0.001, 1e-8) 
cols = c("blue", "green", "orange")
for(ii in 1:length(sig.threshold)){
  abline(v = qbinom(sig.threshold[ii]/2, n, p0), 
         col = cols[ii], lwd = 2, lty = 2) #add left cutpoint
  abline(v = qbinom(1-sig.threshold[ii]/2, n, p0),
         col=cols[ii],lwd=2,lty=2) #add also right cutpoint
}
legend("topright", col = cols, legend = sig.threshold, lwd = 2, lty = 2)
#And add alternative hypothesis distribution in magenta
p1 = 0.7
lines(x, dbinom(x, n, p1), lwd = 2, col = "magenta")
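
From this picture we could already read off the power at \(n=100\): it is the probability mass of the magenta alternative distribution that falls outside the critical points of the chosen significance level. A minimal sketch for level 0.001, using the same critical-point convention as in the examples above (qbinom(alpha/2, n, p0) - 1 and qbinom(1 - alpha/2, n, p0) + 1):

alpha = 0.001
crit.lo = qbinom(alpha/2, n, p0) - 1     #largest count in the lower rejection region
crit.hi = qbinom(1 - alpha/2, n, p0) + 1 #smallest count in the upper rejection region
pbinom(crit.lo, n, p1) + (1 - pbinom(crit.hi - 1, n, p1)) #power when the true proportion is p1 = 0.7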