During the previous lectures we got familiar with basic statistics by using the binomial distribution that models the number of “successes” in a sample of \(n\) trials. It is a discrete distribution where the outcome variable can only take discrete values \(0,1,\ldots,n\).
Now we move to outcome variables, such as BMI or LDL cholesterol levels, that can take any value in a continuous interval, say, the interval (10.0, 150.0) for BMI measured in kg/m\(^2\).
The most prevalent continuous distribution is the Normal distribution (also called the Gaussian distribution after the German mathematician Carl F. Gauss). A reason for its wide applicability is that in very many settings the distribution of a sum of independent random variables tends towards a Normal distribution, and hence many complex properties, such as IQ, height or susceptibility to coronary artery disease, that result from an interplay between a huge number of individual factors, both from the genome and from the environment, follow approximately a Normal distribution in the population. Read more: http://www.bmj.com/content/310/6975/298.full.
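As a quick illustration of this tendency, we can sum several independent, clearly non-Normal random variables and watch a bell shape emerge. Below is a minimal sketch using Uniform(0,1) variables; the choices of 16 summands and 10,000 repetitions are arbitrary.

```r
set.seed(42)
n.vars = 16    #number of independent Uniform(0,1) variables in each sum
n.reps = 10000 #number of sums generated
s = rowSums(matrix(runif(n.vars * n.reps), nrow = n.reps)) #each row gives one sum
#Theoretical mean and sd of the sum are n/2 = 8 and sqrt(n/12) = 1.155
c(mean = mean(s), sd = sd(s))
hist(s, breaks = 40, prob = TRUE, main = "Sum of 16 Uniform(0,1) variables")
```

Even though each individual Uniform variable has a flat density, the histogram of their sums is already close to bell-shaped.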
The Normal distribution is defined by two parameters: the mean (mu, \(\mu\)) and the standard deviation (sigma, \(\sigma\)). Often (outside R) the variance (sigma^2, \(\sigma^2\)) is used in place of the standard deviation as the second parameter. Always pay attention to which one is in question, since mixing up these parameters badly mixes up all the statistics! For now, remember that the basic R functions take in the standard deviation.
The standard normal distribution, N(0,1), has mean = 0 and sd = 1. Let’s plot the density function dnorm(, 0, 1) and the cumulative distribution function pnorm(, 0, 1) of N(0,1) next to each other. We use par(mfrow = c(1,2)) to set the plotting parameter mfrow to split the plotting region into 1 row and 2 columns, whence the two plots show up next to each other.
x = seq(-3, 3, 0.001) #range of x-values where we evaluate dnorm() and pnorm()
d = dnorm(x, 0, 1) #values of density function of N(0,1)
p = pnorm(x, 0, 1) #values of cumulative distribution function of N(0,1)
par(mfrow = c(1,2)) #plotting region split to 1 x 2 area to show two plots next to each other
plot(x, d, xlab = "", ylab = "Density function of N(0,1)",
     main = "Standard Normal Distribution", t = "l", lwd = 1.3, col = "limegreen")
plot(x, p, xlab = "", ylab = "Cumulative distribution function of N(0,1)",
     main = "mean = 0 and sd = 1", t = "l", lwd = 1.3, col = "darkgreen")
The density plot shows that the peak is at the mean of the distribution, and the density drops from there symmetrically, making the bell-shaped curve characteristic of a Normal distribution.
The cumulative distribution plot shows how the probability mass accumulates as we move from small values (here starting from -3) to larger values (here up to 3). We know that 95% of the probability mass of the N(0,1) distribution is between values -1.96 and 1.96:
qnorm(c(0.025, 0.975), 0, 1) #95% of mass of N(0,1) is between these two points
## [1] -1.959964  1.959964
(This fact was actually the motivation to use 1.96 as the multiplier of the SE in Lecture 3 when we derived approximate 95% confidence intervals for a proportion parameter.)
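To make that connection explicit, the approximate 95% confidence interval for a proportion can be written with qnorm() instead of the hard-coded 1.96. Here is a sketch with made-up counts (40 successes out of 100 trials):

```r
x = 40; n = 100 #made-up example counts
p.hat = x / n #estimated proportion
se = sqrt(p.hat * (1 - p.hat) / n) #standard error of the estimate
z = qnorm(0.975) #= 1.959964..., the 97.5% quantile of N(0,1)
p.hat + c(-1, 1) * z * se #approximate 95% confidence interval, roughly (0.304, 0.496)
```

Using qnorm() also makes it easy to change the confidence level, e.g. qnorm(0.995) for a 99% interval.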
Let’s generate samples from a Normal distribution, say with mean = 4 and sd = 2, using rnorm() and plot the data using a histogram. The standard hist() shows on the y-axis the counts of observations in each bin, but by setting prob = TRUE we can make the y-axis scale to the values of a density function (making the total area of the histogram = 1). Then we can also show the theoretical density function dnorm(, 4, 2) in the same plot and compare the two.
n.samples = 5000 #samples from distribution
mu = 4 #mean
sigma = 2 #standard deviation
x = rnorm(n.samples, mu, sigma) #random sample from normal distribution
data.frame(mean = mean(x), sd = sd(x)) #show empirical mean and sd of data
##       mean       sd
## 1 3.985483 1.994015
hist(x, breaks = 40, col = "gray", prob = TRUE, main = "N(4,2)")
x.grid = seq(min(x), max(x), length.out = 1000) #grid of x-values to evaluate dnorm(, mu, sigma)
lines(x.grid, dnorm(x.grid, mu, sigma), col = "red", lwd = 1.5, lty = 2) #add dashed line to the current plot
#Let's make a legend text to appear in the top right corner
#pch is plotting symbol (15 square; -1 no symbol); lty is line type (0 no line; 2 dashed line)
legend("topright", col = c("gray","red"), legend = c("empirical","theoretical"),
       lwd = 1.5, pch = c(15,-1), lty = c(0,2))
We see that the histogram of 5000 random samples from N(4,2) matches quite well with the theoretical density function of N(4,2), as it should. With more samples, the match would get even tighter.
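To see the effect of the sample size, we can compare the empirical mean and sd for increasing n. The particular n values and the seed below are arbitrary choices:

```r
set.seed(123)
for (n in c(100, 10000, 1000000)) {
  x = rnorm(n, 4, 2) #n samples from N(4,2)
  cat("n =", n, " mean =", mean(x), " sd =", sd(x), "\n")
}
```

As n grows, the empirical mean and sd settle ever closer to the true parameter values 4 and 2.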
x = rnorm(10000, 2.3, 3.1)
hist(x, main = "", xlab = "", col = "black", breaks = 50)