In the previous lecture, we learned to use the cumulative distribution function pbinom() to compute tail probabilities of the binomial distribution. Such tail probabilities can answer questions such as: what is the probability that (i) at most 10 patients survive or (ii) at least 20 patients survive, out of 50 patients, when the average survival rate is 30%? In this lecture, we use such tail probabilities to define the concept of a P-value, which is typically reported with every statistical test in the medical literature.
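As a reminder, those two tail probabilities could be computed with pbinom() as follows (a quick sketch using the numbers mentioned above).

pbinom(10, size = 50, prob = 0.3)     # P(at most 10 survive out of 50) when survival rate is 30%
1 - pbinom(19, size = 50, prob = 0.3) # P(at least 20 survive out of 50) = 1 - P(at most 19 survive)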

Definition of P-value

Suppose we have 100 pairs of same-sex twins that are discordant for psoriasis (i.e. one has been diagnosed with psoriasis and the other hasn’t) and we know that BMI is higher for the psoriatic individual than for his/her non-psoriatic twin sibling in 71 out of 100 pairs. Does that suggest that higher BMI tends to go together with psoriasis status in the population of twin pairs from which these pairs come? How could we quantify statistically the association between BMI and psoriasis in these data?

Idea: If the psoriasis status were independent of BMI in the target population, then we would expect that there is a 50% chance that the individual with psoriasis happens to have a higher BMI than his/her twin sibling. In other words, the number of twin pairs where the psoriatic one has the higher BMI is distributed as Bin(100, 0.5), i.e., has a binomial distribution with 100 trials, each having a success probability of 0.5. Think of this as the distribution of tails in 100 tosses of a fair coin, where each toss corresponds to a twin pair. Under the independence assumption, we would expect about 50 successes (out of 100 trials) in this binomial experiment, but with some variation around the exact value 50. And we did observe 71. Is 71 already so far from the expected value of 50 that we have a good reason to question the independence assumption? Or could it easily happen “by chance” if the success probability was 0.5? How could we tell?
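One way to get a first feeling for this is to simulate the coin-tossing experiment under the independence assumption and see how large the counts typically get. (A small simulation sketch; rbinom() draws random values from the binomial distribution and the seed value 42 is an arbitrary choice to make the simulation reproducible.)

set.seed(42) # arbitrary seed so that the same random numbers come out every time
rbinom(20, size = 100, prob = 0.5) # 20 simulated "experiments" under the independence assumption; do any get near 71?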

We want to quantify (or “test”) how consistent our observation (71 out of 100) is with the independence assumption. In statistics terms, we define a null hypothesis (denoted by \(H_0\)) saying that “psoriasis status is independent of BMI”. This can be expressed as \[H_0:p=0.5,\] which says that the null hypothesis (\(H_0\)) is that the probability \(p\) that the psoriatic twin has a higher BMI than his/her twin sibling is 50%. Often the null hypothesis is defined in such a way that it describes the default situation in the target population which, if it holds true, is not very interesting or surprising. Instead, if the null hypothesis turns out not to hold true, that is usually interesting or surprising. A standard way to test the null hypothesis is by quantifying how likely the observed data are under the null hypothesis. The logic is that if the kind of data that we have observed are very unlikely under the null hypothesis, then we have a good reason to think that the null hypothesis might not be true.

Let’s plot the probability distribution of possible observations under the theoretical null hypothesis (that assumes psoriasis and BMI are independent in the target population). We could use a barplot as in Lecture 1, but now we will instead draw a continuous curve that joins the probabilities together, using the plot() function.

n = 100 #pairs of twins
#Let's stay in range 25...75 as tails outside this are very improbable and hence unnecessary to show here
x = 25:75
y = dbinom(x, size = n, prob = 0.5)
#Let's plot points in x-y coordinates where x is number of twin pairs 25,...,75
# for which psoriasis and higher BMI occur in same individual
# and y is probability of that observation under the null hypothesis.  
plot(x, y, xlab="obs'd count", ylab = "probability", main = "Bin(100,0.5)", t = "l", yaxs = "i") 
# main = gives the title for the plot. It should be in quotes " ".
# t = "l" means that plotting type is "line", that is, points are connected by line.
# yaxs = "i" forces the plotting area strictly to the range of observed values on y-axis.

# Let's also mark our real observation that was 71.
x.obs = 71
# Let's add a vertical line at our observation using abline().
abline(v = x.obs, col = "red") # v = x.obs means vertical line at position x.obs
abline(v = n - x.obs, col = "red", lty = 2) # lty = 2 to make a dashed line at 29 = 100 - 71.

The observation 71 (red line) is very much at the right hand tail of the null distribution, so it is pretty improbable to get this many or even more observations under the null hypothesis (that is, if BMI and psoriasis were independent).

Let’s compute the probability that, under the null hypothesis, one would get an observation at least as far in the right tail as we have observed.

#pbinom(70, n, 0.5) reports probability mass cumulated from 0 to 70, 
#and '1-pbinom(70, n, 0.5)' is then the mass cumulating from 71 to 100. 
1 - pbinom(x.obs - 1, size = n, prob = 0.5)  
## [1] 1.608001e-05

So in less than 2 out of 100,000 experiments under the null hypothesis would we get 71 or more observations. What’s your conclusion about independence between psoriasis and BMI?

Conclusion: Either psoriasis status and higher BMI are linked, or a very improbable event has happened “by chance”: we have observed 71/100 twin pairs where high BMI and psoriasis go together, which is already so far from the 50% expectation that an observation at least this extreme has a very small probability if BMI and psoriasis were independent.
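We could also approximate this tail probability by simulation: generate a large number of experiments under the null hypothesis and record how often the count reaches 71 or more. (A sketch; the seed is an arbitrary choice, and with a tail probability this small even a million simulations gives only a rough estimate.)

set.seed(1) # arbitrary seed for reproducibility
sims = rbinom(1e6, size = 100, prob = 0.5) # one million experiments under the null hypothesis
mean(sims >= x.obs) # proportion of experiments reaching 71 or more; should be roughly 1.6e-5, with simulation noise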

Let’s list these right-hand tail probabilities also for even more extreme possible outcomes.

#Let's show values 71,...,100 and combine them with corresponding tail probabilities
# using cbind() function, "column bind"", combines columns into a matrix.
x = x.obs:n
cbind(value = x, tailprob = 1 - pbinom(x - 1, size = n, prob = 0.5) ) 
##       value     tailprob
##  [1,]    71 1.608001e-05
##  [2,]    72 6.289575e-06
##  [3,]    73 2.346206e-06
##  [4,]    74 8.336813e-07
##  [5,]    75 2.818141e-07
##  [6,]    76 9.050013e-08
##  [7,]    77 2.756790e-08
##  [8,]    78 7.952664e-09
##  [9,]    79 2.168683e-09
## [10,]    80 5.579545e-10
## [11,]    81 1.351381e-10
## [12,]    82 3.073897e-11
## [13,]    83 6.548984e-12
## [14,]    84 1.302958e-12
## [15,]    85 2.412515e-13
## [16,]    86 4.141132e-14
## [17,]    87 6.550316e-15
## [18,]    88 8.881784e-16
## [19,]    89 1.110223e-16
## [20,]    90 0.000000e+00
## [21,]    91 0.000000e+00
## [22,]    92 0.000000e+00
## [23,]    93 0.000000e+00
## [24,]    94 0.000000e+00
## [25,]    95 0.000000e+00
## [26,]    96 0.000000e+00
## [27,]    97 0.000000e+00
## [28,]    98 0.000000e+00
## [29,]    99 0.000000e+00
## [30,]   100 0.000000e+00

We see that the probabilities get smaller and smaller as we would observe more extreme deviations from the value 50 (the expectation under the null hypothesis). Intuitively, more extreme observations also give more evidence that the deviation from 50 is not just a “chance event” or a “statistical fluctuation”, but rather that there is some real link between BMI and psoriasis. This is because the probability of getting such extreme values purely by “chance” goes essentially to 0.

P-value

We have just encountered a “P-value”. The P-value is defined as the probability, under the null hypothesis, of getting a data set that is at least as extreme as the one observed. The logic is that if the P-value is very small, then it would be very improbable to observe a data set at least as extreme as the one observed if the null hypothesis were true. Therefore, an observation with a small P-value may make us question the null hypothesis.

Two questions emerge:

  1. How do we define which data sets are “at least as extreme” as the observed one under the null hypothesis?

  2. How small a P-value should be in order that we can conclude that the null hypothesis is not true? Or can we ever do that?

Which data sets are “at least as extreme” as the observed one?

Let’s repeat the previous density plot of the possible outcomes in range 25 to 75 of a binomial experiment Bin(100, 0.5) with the two red lines at 71 and 29. (Let’s suppress the code from output by using echo = FALSE in the code block definition so that we only see the resulting plot but do not repeat the code in the knitted document.)

In this example, the most probable number of twin pairs where higher BMI and psoriasis occur in the same individual is 50 (out of 100) under the null hypothesis that BMI and psoriasis are independent. It seems intuitive that the larger the deviation from 50, the more “extreme” the observation is under the null hypothesis. A quantitative definition could be that, out of two putative observations, the one with the smaller probability (under the null) is the “more extreme” one.
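We can check this idea numerically: under Bin(100, 0.5), the probability of an outcome decreases as it moves away from 50, and by the symmetry of this distribution the outcomes 29 and 71 are equally improbable. (A quick sketch.)

dbinom(c(50, 60, 71), size = 100, prob = 0.5) # probabilities decrease as we move away from 50
dbinom(c(29, 71), size = 100, prob = 0.5)     # 29 and 71 are equally probable by symmetry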

Above we computed that the probability under the null of observing 71 or more pairs is about 2 in 100,000. If we had not fixed beforehand that we would only be interested in deviations towards values higher than 50, then observing at least 71 out of 100 would be as extreme a deviation from 50 as observing at most 29 out of 100 (dashed red line above). So the probability under the null of seeing “data as extreme as observed” would mean observing “at least 71 or at most 29 cases out of 100 trials”. Let’s compute the P-value under such a two-sided alternative hypothesis.

pbinom(29, size = n, prob = 0.5) + (1 - pbinom(71-1, size = n, prob = 0.5))
## [1] 3.216002e-05

This is a two-sided P-value (no prior hypothesis about which direction defines more extreme cases, so we consider both directions), and here it equals 2 x the P-value from the one-sided test that considers only one tail of the distribution. Typically, we use two-sided tests, since we want to detect any deviation from the null hypothesis, no matter in which direction. A one-sided test ignores the other direction and is conceptually applicable only rarely, when we have a strong reason to pre-define the tail of interest, e.g., when we try to replicate a result from an earlier study.
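Because Bin(100, 0.5) is symmetric around 50, we can verify that the two-sided P-value here is exactly twice the one-sided tail probability computed earlier.

2 * (1 - pbinom(x.obs - 1, size = n, prob = 0.5)) # 2 x the one-sided P-value; equals the two-sided value above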

More about one-sided vs two-sided tests in BMJ 1994;309:248 http://www.bmj.com/content/309/6949/248.full.

How small a P-value should be in order that we can conclude that the null hypothesis is not true?

The short answer is that we can never conclude truth based on statistics alone. Moreover, even to properly quantify the level of evidence for or against the null hypothesis, we would need other information in addition to the P-value.

However, the P-value does provide some information and, very generally, smaller P-values provide more evidence against the null hypothesis than larger P-values. This is linked to the technical term statistical significance, which is used when a statistical test has produced a certain pre-defined level of evidence against the null hypothesis. Often this evidence is measured by the P-value, where smaller P-values are interpreted as “more statistically significant” results.

For example, based on the previous psoriasis-BMI example, we could conclude that there is a statistically significant association between psoriasis and BMI at a significance level of 0.001, because the P-value was about 3e-5 < 0.001. The association is also significant at the level of 1e-4 (because 3e-5 < 1e-4), but no longer at the level of 1e-5 (because 3e-5 > 1e-5). By far the most common significance level in medicine, and elsewhere in applied statistics, is 0.05. This is unfortunate, since a P-value around 0.05 is typically not at all strong evidence against the null hypothesis. This is a problem because it means that many “statistically significant” results in the literature are actually false positives, where the P-value has simply got below 0.05 “by chance” even though the null hypothesis holds. Note that the probability of getting a P-value smaller than 0.05 “by chance” when the null hypothesis holds is 5%, or 1 in 20, so it indeed happens quite often just by chance even when the null hypothesis is true.
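We can illustrate this last point by simulation: generate many data sets under the null hypothesis, compute a P-value for each, and see how often the P-value falls below 0.05. (A sketch; the seed, the number of simulations and the variable names are arbitrary choices, and we use the simple 2 x smaller tail approximation of the two-sided P-value.)

set.seed(7) # arbitrary seed for reproducibility
n.sims = 2000 # number of simulated experiments under the null hypothesis
sim.counts = rbinom(n.sims, size = 100, prob = 0.5) # counts simulated under the null Bin(100, 0.5)
# Two-sided P-value for each simulated count: 2 x the smaller of the two tail probabilities.
sim.pvals = 2 * pmin(pbinom(sim.counts, 100, 0.5), 1 - pbinom(sim.counts - 1, 100, 0.5))
mean(sim.pvals < 0.05) # proportion of "significant" results under the null;
# should be close to, but at most, 5% (here a bit less, because the binomial distribution is discrete)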

All thresholds for calling P-values significant (such as the commonly used 0.05) are arbitrary, and they should be only one piece of evidence when we determine which hypothesis is the most likely to hold. A bad practice is to dichotomize P-values by reporting only whether they were “significant/non-significant” at a certain significance level, rather than reporting the exact numerical value of the P-value. Unfortunately, the pursuit of “statistical significance” is often so strong in the medical literature that other, often more important, aspects of the statistical analyses do not get enough attention. Partly this is because, in order to get publications, one needs to find some “statistically significant” results. Consequently, there are a lot of published studies where the P-value has just got slightly below 0.05, but when a similar experiment is repeated, no supporting evidence for the claimed result can be found, possibly because the result was a false positive in the first place.

Another important aspect in medical field is that even if there is a true statistically significant effect, it may not be significant for clinical practice: https://www.students4bestevidence.net/statistical-significance-vs-clinical-significance/.

Summary: The null hypothesis is a statement of interest about the target population. The P-value is the probability that, if the null hypothesis were true, a data set sampled from the target population would be at least as extreme as the one we have observed. Its purpose is to quantify the possible inconsistency between the null hypothesis and the observed data. If the P-value is small, then it seems unlikely that such extreme data would have occurred under the null hypothesis. Thus, small P-values give us a reason to suspect that the null hypothesis may not be true. However, it may still be the case that the null hypothesis is true but a rare event has occurred, and hence we see a small P-value.
Statistical significance at level \(\alpha\) is a technical term used when the P-value falls below a pre-defined significance level \(\alpha\); it does not conclusively show that the null hypothesis does not hold, and P-values should be reported as they are, not dichotomized based on a significance level.

IMPORTANT: A P-value is always a probability, under the null hypothesis, of observing certain kinds of data sets; it is NOT a probability of the null hypothesis itself. Thus, based on the P-value alone, you cannot say what the probability is that the null hypothesis holds. All you can say with the P-value is that, if the null hypothesis holds, then the probability of getting data like these, or even more extreme, is given by the P-value.

Want to read another explanation of P-value from BMJ? https://www.bmj.com/content/340/bmj.c2203.

Example 2.1

  1. You want to study whether the appendicitis cases operated at Meilahti Hospital show any bias from a 50%-50% split between men and women. What is your null hypothesis?

The null hypothesis is that 50% of the population of appendicitis cases at Meilahti Hospital are males. If we consider this setting as a binomial experiment, then, given that we observe \(n\) patients, the number of male patients among them would be distributed as Bin(\(n\), 0.5) if the null hypothesis holds.

  2. Over the last year, there have been 316 patients, of which 180 were women and 136 were men. What is the P-value under the null hypothesis? Is the observed proportion of men statistically significantly different from 50% at significance level 0.05? Do you suspect a bias from the 50%-50% proportions in the patient population based on this P-value?
n = 316 # all patients
x = 136 # men
x/n # proportion of men is somewhat < 0.5
## [1] 0.4303797
# Let's compute 2-sided P-value. Since x/n < 0.5, we compute the left tail and multiply by 2
2*pbinom(x, size = n, prob = 0.5) #2-sided P-value
## [1] 0.0154331

The result is statistically significant at level 0.05 (since 0.015 < 0.05). This could indicate a bias where the patient population has fewer men than women. However, the statistical evidence is not (yet) very strong, as a P-value of 0.015 happens by chance more often than once in a hundred experiments even under the null hypothesis, and more data would be useful to get more convincing evidence in either direction.

  3. If you do a comparison between a treatment group and a placebo group and the statistics show a difference with a P-value of 0.003, what does that mean exactly, explained in plain language? What if the P-value were 0.00008? Which one of these two P-values would show stronger evidence for a true treatment effect in the underlying population, and why?

A P-value of 0.003 from a clinical trial means that we have observed a difference of such a magnitude that it, or a more extreme difference, would be observed only in 3 out of 1000 clinical trials like this if the null hypothesis holds, i.e., if there were no true difference between the effects of the treatment and placebo in the target population. So we tend to think that it is quite improbable (probability 0.003) to observe data at least this extreme “by chance” when the null hypothesis holds. But still, it can happen, even under the null hypothesis.

IMPORTANT: It would be WRONG to say that the probability of this observation being produced “by chance” is 0.003. That would suggest that 0.003 was the probability of the null hypothesis while it actually is a probability of certain sets of data under the null hypothesis, and it is NOT a probability of the null hypothesis itself. Based on only the P-value, it is never possible to say how probable a null hypothesis is.

A P-value of 0.00008 is clearly < 0.003, and it does provide more evidence than the P-value of 0.003 that there is a true treatment effect in the population. This is because, typically, a more extreme deviation from the expectation under the null hypothesis is more evidence against the null hypothesis, even though we cannot derive the exact probability of the null hypothesis based on the P-value alone.

Binomial test and Fisher’s test

binom.test()

Above we have used the binomial distribution function pbinom() to compute P-values. In practice, we can do the same more simply by using an existing function called binom.test().

Let’s redo the statistical testing of 71 successes out of 100 trials when the null hypothesis is that success probability is 0.5. (We hope to get 3.216e-5 to be consistent with our earlier result.)

binom.test(71, n = 100, p = 0.5)
## 
##  Exact binomial test
## 
## data:  71 and 100
## number of successes = 71, number of trials = 100, p-value = 3.216e-05
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.6107340 0.7964258
## sample estimates:
## probability of success 
##                   0.71

The output includes the input data (successes and trials) and the P-value under the null hypothesis of \(p=0.5\), computed using the two-sided test (the “alternative hypothesis” here is that the population proportion can differ from 0.5 in either direction).

We could also do a one-sided test, where the alternative hypothesis is that the observed value is greater than the expected under the null hypothesis. To do this, we use alternative = "greater" in binom.test().

binom.test(71, n = 100, p = 0.5, alternative = "greater")
## 
##  Exact binomial test
## 
## data:  71 and 100
## number of successes = 71, number of trials = 100, p-value = 1.608e-05
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.6262716 1.0000000
## sample estimates:
## probability of success 
##                   0.71

Note how the “alternative hypothesis” has changed in the output and now only the upper tail probability is computed. Alternatively, only the lower tail would be considered if alternative = "less". If alternative is not specified, it defaults to "two.sided".
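For completeness, the lower-tail version on the same data would look like this (since the observation 71 is in the upper tail, this P-value is close to 1, as expected).

binom.test(71, n = 100, p = 0.5, alternative = "less") # P-value is approximately 0.99999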

Type ?binom.test to see the details of how to use this function.

Example 2.2

  1. Over the last year, there have been 316 appendicitis patients, of which 180 were women and 136 were men. Do a binomial test to see whether this indicates a significant deviation from the null hypothesis that the proportion of males among the appendicitis patient population is 50%. Use the significance level of 0.05.
binom.test(136, n = 316, p = 0.5, alt = "two.sided") #136 successes out of 316 trials when prob = 0.5
## 
##  Exact binomial test
## 
## data:  136 and 316
## number of successes = 136, number of trials = 316, p-value = 0.01543
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.3750857 0.4869912
## sample estimates:
## probability of success 
##              0.4303797

As seen in Example 2.1, this is a statistically significant result at significance level 0.05.

  2. Above we studied the population of appendicitis patients. Let’s now study the entire population of the hospital’s catchment area. Our question is whether the incidence of appendicitis is similar in males and females. Suppose that we know that the underlying population has more females (60% of people) than males (40% of people). Our null hypothesis is that the incidence is the same in both sexes. This means that, according to the null hypothesis, an appendicitis patient is male with a probability of 40%. Let’s test the observed data against this null hypothesis \(p=0.4.\)
binom.test(136, n = 316, p = 0.4, alt = "two.sided")
## 
##  Exact binomial test
## 
## data:  136 and 316
## number of successes = 136, number of trials = 316, p-value = 0.2756
## alternative hypothesis: true probability of success is not equal to 0.4
## 95 percent confidence interval:
##  0.3750857 0.4869912
## sample estimates:
## probability of success 
##              0.4303797

We see that the P-value is larger now (> 0.25) and we have no reason to question the null hypothesis of equal incidence in males and females based on this result. Thus, we conclude that the unequal (40%:60%) population split between the sexes would already be enough to explain the apparent 136:180 difference in appendicitis cases, since there is no longer any statistically significant deviation from the null hypothesis once the null hypothesis takes into account the different population sizes of males and females.

Example 2.3

Suppose that you are treating patients and you expect that the success rate is 30% in that patient population. You treat 50 patients and observe 23 successes and 27 failures. What is the two-sided P-value for the observation under the null hypothesis that the success rate is 30%? Do these data make you question the null hypothesis?

Answer: We expect a rate of 30% and we have observed an empirical rate of \(0.46=23/50\) from a sample of size 50. Let’s draw a picture of the null distribution and the observation, adapting the code we used earlier for distribution Bin(100,0.5).

n = 50
p.null = 0.3
x = 1:n
y = dbinom(x, size = n, prob = p.null)
plot(x, y, xlab="obs'd count", ylab = "probability", main = "Bin(50,0.3)", t = "l", yaxs = "i") 
x.obs = 23
abline(v = x.obs, col = "red") 

We see that the probability of getting a count at least this large under the null is the probability of the right-hand tail:

p.right = 1 - pbinom(x.obs-1, size = n, prob = p.null) # one-sided P-value
p.right
## [1] 0.01227613

What would be the two-sided P-value? As opposed to our earlier example with Bin(100, 0.5), this distribution, Bin(50, 0.3), is not symmetric with respect to its expected value 15, and thus there is not (necessarily) any value < 15 such that the left-hand tail probability at that value would be exactly p.right = 0.0123. Thus we need to choose between two options to determine a two-sided P-value:

  1. Sum up the probabilities of all those outcome values whose probabilities are less than or equal to the probability of the observed value. This is an exact way to define a two-sided P-value, and this is how the binom.test() function does it (a manual version of this computation is sketched after this list).
binom.test(x.obs, n = n, p = p.null)
## 
##  Exact binomial test
## 
## data:  x.obs and n
## number of successes = 23, number of trials = 50, p-value = 0.01954
## alternative hypothesis: true probability of success is not equal to 0.3
## 95 percent confidence interval:
##  0.3181492 0.6067580
## sample estimates:
## probability of success 
##                   0.46
  2. Simply multiply the right-hand tail probability by two, to represent the idea that we want to reserve an equal amount of probability from the left tail for “at least as extreme observations” as we do from the right tail. This simple approximation works fine for practical purposes, but it does not give exactly the same P-value as binom.test().
2*p.right
## [1] 0.02455225
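To see what option 1 does, here is a manual version of that computation, reusing the objects n, p.null and x.obs defined above (a sketch; binom.test() does essentially this internally).

p.all = dbinom(0:n, size = n, prob = p.null) # probabilities of all possible outcomes 0,...,50 under the null
sum(p.all[p.all <= dbinom(x.obs, size = n, prob = p.null)]) # should reproduce binom.test's P-value of about 0.0195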

Both methods give a P-value of about 0.02. The conclusion is that there is some evidence that the success rate might be larger than 30% (a statistically significant deviation from the null at significance level 0.05), but more data would be needed before the evidence for the deviation becomes very convincing.

fisher.test()

In the binomial test, we had a fixed null hypothesis parameter value (like \(p=0.5\)) and we measured one property on each individual (e.g. being male or female). Going one step further, we may have our observations characterized by two properties, for example, being male or female and having the disease or being healthy. How can we test whether these two properties are independent of each other or whether they are associated with each other? Here, independence between sex and disease means that the disease incidence is the same in both sexes, i.e., it does not depend on sex. An association between sex and disease means that the disease is more common in one sex than in the other. In large samples, we can compute the disease incidence in each sex and compare those with a proportion test, as we will do in the next lecture. For small samples, we should use Fisher’s test, as we do now.

As an example, suppose we measure the carrier status of a particular rare genetic variant of the HLA-C gene on the immune system related HLA region of the human genome in healthy controls and in psoriasis patients, and we get the following counts.

            non-carriers  carriers
healthy               40         1
psoriasis             13         6

The question is whether the variant carrier status is associated with the disease status, or whether the two properties are independent of each other in the population where these data come from.

We can test this with Fisher’s test. In R, this works by making a 2x2 matrix or data frame of the data and applying fisher.test() on it. Let’s use a data frame and add column and row names.

x = data.frame(noncar = c(40,13), 
               carrier = c(1,6), 
               row.names = c("healthy", "psoriasis"))
x #always check that your data matrix looks correct before analysis!
##           noncar carrier
## healthy       40       1
## psoriasis     13       6
fisher.test(x)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  x
## p-value = 0.003011
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##    1.864826 865.022378
## sample estimates:
## odds ratio 
##    17.4474

Fisher’s exact test returns the P-value as the sum of probabilities of all such 2x2 tables that have the same row and column sums as the observed table, and that have at most as large a probability as the observed table. Thus, here the “at least as extreme” criterion of the P-value definition is defined by how probable each 2x2 table is if we keep the row and column sums fixed but otherwise randomly assign the numerical values to the 4 cells of the table. You can read more on Wikipedia.
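For the 2x2 case, the same P-value can also be computed “manually” from the hypergeometric distribution, which gives the probability of each table when the row and column sums are fixed. (A sketch; here the random quantity is the number of carriers among the 41 healthy individuals, when there are 7 carriers and 53 non-carriers in total, and the variable names are our own choices.)

x.tab = 0:7 # possible numbers of carriers among the 41 healthy individuals
p.tab = dhyper(x.tab, m = 7, n = 53, k = 41) # probability of each table with the observed margins
sum(p.tab[p.tab <= dhyper(1, m = 7, n = 53, k = 41)]) # should be close to fisher.test's P-value 0.003011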

Our observation is that only 1/41 healthy people carry the variant but 6/19 patients do. The P-value describes how statistically extreme this observation is. The P-value of 0.003 says that only in 3 out of 1000 random 2x2 tables with these row and column sums would the variant carriers be distributed between the two groups in a way at least this extreme.

Example 2.4

Of 30 people employed in a small workshop, 18 worked in one department and 12 in another. In one year, five of the 18 in the first department reported a hand injury, while one of the 12 in the second department did so. Is there a statistically significant difference between the safety of the departments, and how would you report this result?

In essence, we are comparing whether the proportions of 5/18 and 1/12 are statistically different, and we do that by Fisher’s test. For the sake of demonstration, we now use a matrix rather than a data frame.

# 1st row: healthy, 2nd row: injured, columns are departments
x = rbind(healthy = c(18-5, 12-1), injured = c(5,1)) 
colnames(x) = c("dep1","dep2")
x
##         dep1 dep2
## healthy   13   11
## injured    5    1
fisher.test(x)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  x
## p-value = 0.3575
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.004593763 2.709980565
## sample estimates:
## odds ratio 
##  0.2466137

Thus, these data do not indicate a statistically significant difference between the departments (the P-value is 0.36, so a deviation like this occurs in about every third 2x2 table of this type even under the null hypothesis). Note that we cannot conclude that there is no difference between the departments, only that, with this amount of data, we do not see statistical evidence for such a difference.