During the previous lectures we got familiar with basic statistics by using the binomial distribution, which models the number of “successes” in a sample of \(n\) trials. It is a discrete distribution where the outcome variable can only take the discrete values \(0,1,2,\ldots,n\).

Now we move to outcome variables, such as BMI or LDL cholesterol levels, that can take any value in a continuous interval, for example the interval (10.0, 150.0) for BMI measured in kg/m\(^2\).

Normal Distribution

The most prevalent continuous distribution is the Normal distribution (also called the Gaussian distribution after the German mathematician Carl F. Gauss, 1777-1855). A reason for its wide applicability is that, in numerous settings, the distribution of a sum of independent random variables tends towards a Normal distribution. (This deep mathematical fact, called the Central Limit Theorem, will be demonstrated briefly at the end of the lecture.) A consequence is that complex properties, such as variation between individuals in IQ or height, or susceptibility to coronary artery disease, that all result from an interplay between a huge number of individual factors, both from the genome and from the environment, approximately follow a Normal distribution in the population. Read more: http://www.bmj.com/content/310/6975/298.full.
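As a small preview of this effect (the full demonstration comes at the end of the lecture), consider sums of independent Uniform(0,1) variables, which individually are far from Normal: already the sum of 12 of them looks bell-shaped. This is just an illustrative sketch; the choice of 12 variables per sum is arbitrary.

```r
set.seed(1) #for reproducibility of this sketch
n.sums = 10000 #number of sums to generate
k = 12 #number of independent Uniform(0,1) variables per sum
#each column of the matrix holds k uniforms; colSums adds up each column
sums = colSums(matrix(runif(n.sums * k), nrow = k))
#one Uniform(0,1) has mean 0.5 and variance 1/12, so the sum has
#mean k*0.5 = 6 and sd sqrt(k*1/12) = 1
c(mean = mean(sums), sd = sd(sums)) #should be close to 6 and 1
hist(sums, breaks = 40, prob = TRUE, main = "Sum of 12 Uniform(0,1) variables")
```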

The Normal distribution is defined by two parameters: the mean (mu, \(\mu\)) and the standard deviation (sigma, \(\sigma\)). Often (outside R) the variance (sigma^2, \(\sigma^2\)) is used in place of the standard deviation to define the second parameter. Always pay attention to which one is in question, since mixing up the numerical values of these parameters badly mixes up all the statistics! For now, remember that the basic R functions take in the standard deviation sigma.
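As a concrete check of the parameterization, suppose a distribution is reported as N(0, variance = 4). To use the R functions correctly we must pass sd = sqrt(4) = 2; passing the variance as if it were the sd gives a very different answer. (The value 4 here is just an example.)

```r
sigma2 = 4 #suppose a distribution is reported with variance sigma^2 = 4
qnorm(0.975, 0, sqrt(sigma2)) #correct: pass sd = sqrt(variance) = 2, gives 3.919928
qnorm(0.975, 0, sigma2) #WRONG: passes the variance as if it were sd, gives 7.839856
```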

The standard normal distribution, N(0,1), has mean = 0 and sd = 1. Let’s plot the density function dnorm(,0, 1) and the cumulative distribution function pnorm(, 0, 1) of N(0,1) next to each other. We use par(mfrow = c(1,2)) to set the plotting parameter mfrow to split the plotting region into 1 row and 2 columns, and the two plots will be shown next to each other.

x = seq(-3, 3, 0.001) #range of x-values where we evaluate dnorm() and pnorm()
d = dnorm(x, 0, 1) #values of density function of N(0,1)
p = pnorm(x, 0, 1) #values of cumulative distribution function of N(0,1)
par(mfrow = c(1,2)) #plotting region split to 1 x 2 area to show two plots next to each other
plot(x, d, xlab = "", ylab = "Density function of N(0,1)", 
     main = "Standard Normal Distribution",
     t = "l", lwd = 1.3, col = "limegreen")
plot(x, p, xlab = "", ylab = "Cumulative distribution function of N(0,1)", 
     main = "mean = 0 and sd = 1",
     t = "l", lwd = 1.3, col = "darkgreen")

The density plot shows that the peak is at the mean of the distribution, and the density drops from there symmetrically, making the bell-shaped curve characteristic of a Normal distribution.
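The symmetry can also be verified numerically: the density at mean + a equals the density at mean - a for any offset a. (The values of m and a below are arbitrary choices for illustration.)

```r
m = 0 #mean of the example distribution; any value would do
a = 1.5 #an arbitrary offset from the mean
dnorm(m + a, m, 1) == dnorm(m - a, m, 1) #TRUE: density at mean+a equals density at mean-a
```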

The cumulative distribution plot shows how the probability mass accumulates as we move from small values (here starting from -3) to larger values (here up to 3). We know that 95% of the probability mass of the N(0,1) distribution is between values -1.96 and 1.96:

qnorm(c(0.025, 0.975), 0, 1) #95% of mass of N(0,1) is between these two points
## [1] -1.959964  1.959964

(This fact was actually the motivation to use 1.96 as the multiplier of the SE in Lecture 3 when we derived approximate 95% confidence intervals for a proportion parameter.)
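We can verify this connection directly by computing the probability mass between -1.96 and 1.96 as a difference of two values of the cumulative distribution function:

```r
pnorm(1.96, 0, 1) - pnorm(-1.96, 0, 1) #mass of N(0,1) between -1.96 and 1.96
## about 0.95 (more precisely 0.9500042)
```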

Let’s generate samples from a Normal distribution, say with mean = 4 and sd = 2, using rnorm() and plot the data using a histogram. The standard hist() shows on the y-axis the counts of observations in each bin, but by setting prob = TRUE we can make the y-axis scale to the values of a density function (making the total area of the histogram = 1). Then we can also show the theoretical density function dnorm(,4,2) in the same plot and compare the two.

n.samples = 5000 #number of samples to generate from the distribution
mu = 4 #mean
sigma = 2 #standard deviation

x = rnorm(n.samples, mu, sigma) #random sample from normal distribution
c(mean = mean(x), sd = sd(x)) #show empirical mean and sd of data
##     mean       sd 
## 3.985483 1.994015
hist(x, breaks = 40, col = "gray", prob = TRUE, main = "N(4,2)")
x.grid = seq(min(x), max(x), length.out = 1000) #grid of x-values to evaluate dnorm(, mu, sigma)
lines(x.grid, dnorm(x.grid, mu, sigma), col = "red", lwd = 1.5, lty = 2) #add dashed line to the current plot
#Let's make a legend text to appear in the top right corner
#pch is plotting symbol (15 square; -1 no symbol); lty is line type (0 no line; 2 dashed line)
legend("topright", col = c("gray","red"), legend = c("empirical","theoretical"), 
       lwd = 1.5, pch = c(15,-1), lty = c(0,2)) 

We see that the histogram of 5000 random samples from N(4,2) matches quite well with the theoretical density function of N(4,2), as it should. With more samples, the match would get even tighter.
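To see this convergence, we can compare the empirical mean and sd at a few increasing sample sizes. This is a sketch; the exact values vary between runs, but the estimates settle ever closer to the true parameters 4 and 2 as the sample size grows.

```r
set.seed(2) #for reproducibility of this sketch
for (n.s in c(100, 10000, 1000000)) { #increasing sample sizes
  x.n = rnorm(n.s, 4, 2) #sample of size n.s from N(4,2)
  print(c(n = n.s, mean = mean(x.n), sd = sd(x.n))) #estimates approach 4 and 2
}
```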

Examples 4.1

  1. The distribution of height in Finnish men is Normal with mean = 176 and sd = 7 (in cm). Generate an example data set of heights of \(n=10000\) Finnish men and plot a histogram of the data.
x = rnorm(10000, 176, 7)
hist(x, main = "", xlab = "", col = "black", breaks = 50)

  2. What should be the theoretical variance of these data? Compute the empirical variance of the simulated data in R.

Theoretical variance is sd squared, i.e., \(7^2=49\). Let’s validate:

c(theoretical = 7^2, empirical = var(x)) 
## theoretical   empirical 
##    49.00000    48.88782
  3. Use pnorm() to estimate the expected proportion of Finnish men that are \(\leq 165\) cm. Compare to the empirical estimate of the same proportion in the simulated data set.
pnorm(165, 176, 7) #theoretical proportion of values of N(176, var = 7^2) that are <= 165
## [1] 0.05804157
mean(x <= 165) #empirical proportion of values in x that are <= 165
## [1] 0.0571
  4. How tall is the tallest 1% of Finnish men? Use qnorm() for a theoretical estimate and quantile() for an empirical estimate from the simulated sample.
qnorm(0.99, 176, 7) #theoretical cut-point with 99% in the left tail so 1% in the right tail
## [1] 192.2844
quantile(x, 0.99) #empirical cut-point that leaves 99% of the values on the left side
##      99% 
## 192.2204

Assessing normality of data

Does my data set follow a Normal distribution?

Let’s generate two data sets of size \(n=1000\): the first from the uniform distribution in (0,1) and the second from N(0,1). The uniform distribution has “equal probability to pick any value in the interval”, and therefore its density function is a flat line. Let’s plot the two distributions using histograms.

n.samples = 1000
x.uni = runif(n.samples, 0, 1) #runif takes in left (0) and right (1) endpoints of interval
x.norm = rnorm(n.samples, 0, 1) #standard normal N(0,1)
par(mfrow = c(1,2))
#we can scale histogram to probability density by prob = TRUE
hist(x.uni, prob = TRUE, main = "Uniform(0,1)", col = "orange") 
hist(x.norm, prob = TRUE, main = "N(0,1)", col = "limegreen")
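One standard graphical tool for this question is the normal quantile-quantile (Q-Q) plot: qqnorm() plots the sample quantiles against the theoretical N(0,1) quantiles, and qqline() adds a reference line. A sketch with freshly simulated data of the same kind as above: normally distributed data fall close to the line, while the uniform data bend away from it at the tails.

```r
set.seed(4) #for reproducibility of this sketch
x.uni = runif(1000, 0, 1) #uniform data, as above
x.norm = rnorm(1000, 0, 1) #normal data, as above
par(mfrow = c(1,2))
qqnorm(x.uni, main = "Uniform(0,1) data")
qqline(x.uni, col = "red") #points bend away from the line at the tails
qqnorm(x.norm, main = "N(0,1) data")
qqline(x.norm, col = "red") #points should fall close to the line
```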