# Statistical Methods in Medical Research -
# Laaketieteellisen tutkimuksen tilastolliset menetelmat
# At University of Helsinki
# 2.9.2021
# Matti Pirinen
###
### Topic 1 Learn R: R basics
###
# Open this file in RStudio. RStudio opens it as an "R script".
# You see this R script in the top-left panel of RStudio.
# You can adjust the sizes of the 4 panels on screen so that you have
# the most space for this script and the console below.
# Right panels can be smaller for now.
# The answers to the Test Yourself exercises 1.1, ..., 1.5 are at the end of this file.
# Try first to do them yourself and later check for correct answers.
# These exercises are for you to monitor your own learning. They are not to be returned.
# MAKING COMMENTS in R:
# Lines starting with "#" are comments that help reader to follow what's happening, and
# they are not statistical calculations that we do with R.
# For example, all lines so far in this Rscript have been simply comments and not calculations.
# If you highlight a comment line and hit "Run", it will be sent to Console but nothing happens.
# Try that with this line. Highlight this line and click "Run". Look at the console below.
# Now try the same but remove the "#" at the beginning of the line above.
# Now you get "Error: unexpected symbol ..." which means that R tried to interpret
# the line as an R command but couldn't understand it and reports an error.
#********************************
# 1.1 Using R as calculator
#********************************
# Run the expressions below and see the output in Console below.
# To run, highlight the code and click "Run" or press Ctrl+Enter (in Windows) or Cmd+Enter (in Mac).
5 + 4 #addition, yhteenlasku
3.4 - 1.4 #subtraction, vahennyslasku
4 * 4 #multiplication, kertolasku
12.5 / 2 #division, jakolasku
3^3 #exponentiation, potenssiinkorotus
# Make sure you understand these 5 operations.
# Note that the decimal separator in R is period '.'
# NOT comma ',' as it would be in the Finnish notation of numbers.
# Note that empty spaces between values and operators are optional,
# but they tend to add clarity to code so that's why they are used here.
# Scientific notation simplifies expressing large or small values
# 'xe+n' means 'x' times 10^n and 'xe-n' means 'x' divided by 10^n, where n is a positive value.
# The '+' sign is optional, so 1e+4 can also be written as 1e4.
1e4 #is same as 10000 (that is, 1 times 10^4)
1e-4 #is same as 0.0001 (that is, 1 divided by 10^4)
0.000000000000012 #you see that by default R expresses this in scientific notation as 1.2e-14
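# A minimal sketch, not part of the original exercises, for checking scientific notation.
# It uses the comparison operator '==' and the function 'format()', neither of which
# has been introduced above; the numbers are illustrative only.
2.5e3 == 2500 #TRUE, since 2.5e3 is 2.5 times 10^3
5e-1 == 0.5 #TRUE, since 5e-1 is 5 divided by 10^1
format(2500, scientific = TRUE) #asks R to show 2500 in scientific notation, "2.5e+03"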
# Some special values that are not numerical:
NA # Not Available, is used when a value is missing.
# If NA is a part of calculation, the result will always be NA. Run lines below to see this:
2 + NA
NA * 5
(3.4 + 1.5) / NA
Inf #Infinity is larger than any number that R can express.
# You may get to Inf if you take too large exponents,
# 1e308 seems to be the limit in my system after which R goes to Inf
1e308
1e309
# or if you divide by values close to 0 or exactly 0.
1 / 1e-500
1 / 0
#In calculations, Inf stays as Inf when a finite value is added to it
# or when it is multiplied by a positive finite value
2 + Inf
Inf * 2
# Division by Inf gives 0:
1 / Inf
# There is also -Inf, the infinity on the negative side:
2 - Inf #think that you are subtracting a huge value from 2. Result is -Inf
(-2 * Inf) # While 2*Inf is Inf, minus sign in front of Inf changes it to -Inf.
# Some operations with Infinities are undefined and produce
# "NaN = Not a number". This is not a missing value like NA,
# but rather a mathematically undefined value, such as result of undetermined operations:
Inf - Inf
Inf / Inf
0 / 0
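# A short sketch of how the special values above can be told apart in code.
# The functions 'is.na()', 'is.nan()' and 'is.finite()' are base R functions
# that are not covered elsewhere in this script.
is.na(NA) #TRUE
is.na(0 / 0) #TRUE, because is.na() treats NaN as missing too
is.nan(NA) #FALSE, because NA is a missing value, not an undefined result
is.nan(0 / 0) #TRUE
is.finite(1 / 0) #FALSE, because 1 / 0 is Inf
is.finite(5) #TRUE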
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.1. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Blood pressure of Mr. X was 132 units on the first measurement.
# Write computations in R language to show
# How much BP was at the second measurement when
# (a) it had increased from the 1st measurement by 4.6 units (kasvanut 4.6 yksikkoa)?
# (b) it had decreased from the 1st measurement by 4.6 units (vahentynyt 4.6 yksikkoa)?
# (c) it had increased from the 1st measurement by 10 percent (kasvanut 10%)?
# (d) it had decreased from the 1st measurement by 30 percent (pienentynyt 30%)?
# Write using R in scientific notation
# (e) 0.00459
# (f) 1449000
#*******************************************
# 1.2 Generating variables and vectors
#*******************************************
# We can assign values to variables and then operate with variable symbols
x = 3 #Sets symbol 'x' to have value 3 (numerical)
y = 4 #Sets symbol 'y' to have value 4 (numerical)
x + y #Calculates x + y, which, with current values, is 7 = 3 + 4
# Note: "<-" means the same as "=", so x <- 3 is another way to write x = 3
# Note: the top-right panel in RStudio, "Environment", now shows the variables and their current values.
# Example: Assign value 2 to x and 9 to y and compute x to the power of y.
x = 2
y = 9
x^y
# "vector" is a combination of individual values into a single object.
# We create vectors by putting the values to be combined within 'c( )' syntax.
# Remember it from 'c' standing for 'combine'. Values within c( ) should be separated by commas.
# Let's make a vector of three values 1, 4 and 89 and let's name it as 'x'
x = c(1, 4, 89)
x #running variable name simply shows the value of the variable.
# Note: If you used period '.' in place of comma ',' in vector definition, things go wrong!
# Vector can be indexed by [ ] notation
x[2] # returns element no.2 of vector x, here that is x[2] = 4.
# If you ask for an element that is not present in the vector, you'll get NA.
x[4] #The length of vector 'x' is 3, so element 4 does not exist, thus it is "NA, Not Available".
# Indexing a vector with a vector returns one element for each index value.
# Let's take elements with indexes 1 once, 3 once and 2 twice from vector 'x':
x[c(1,3,2,2)] #returns a vector of length 4, element no. 2 of original 'x' is repeated twice here.
# If an operation combines a single value with a vector, R applies the operation separately to every element of the vector.
2 + x #elementwise addition gives 3 = 2+1, 6 = 2+4, 91 = 2+89
2 * x #elementwise multiplication gives 2 = 2*1, 8 = 2*4, 178 = 2*89
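# A small illustration of indexing and elementwise operations on one vector.
# The vector 'height_cm' and its values are hypothetical, made up for this sketch.
height_cm = c(172, 165, 180)
height_cm / 100 #converts every height from centimeters to meters: 1.72 1.65 1.80
height_cm[c(3, 1)] #elements no. 3 and no. 1, in that order: 180 172
height_cm #the original vector is unchanged because nothing was assigned to it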
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.2. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Blood pressure of Mr. X has been measured 5 times.
# Results are 134, 145, 129, 133, 150.
# (a) Make a vector called 'bp' that contains Mr X's 5 BP measurements.
# (b) Use an operation on vector 'bp' to decrease every measurement by 5 units.
# (c) Use an operation on vector 'bp' to decrease every measurement by 5%.
# (d) Use vector indexing to extract element 3 from vector 'bp'.
# (e) Use vector indexing to extract elements 3 and 4 from vector 'bp' and compute their average.
# Let's generate another vector 'y' of same length as 'x'
y = c(9, 5.7, -19)
# Addition of two vectors of SAME length does elementwise addition.
# 1st elements are added together, 2nd elements together etc.
x + y #Returns vector of length 3 with elements 10 = 1 + 9, 9.7 = 4 + 5.7 and 70 = 89 - 19
x - y #elementwise subtraction
x * y #elementwise multiplication
x / y #elementwise division
# Add one more element at the end of 'y':
y = c(y, 3)
y #Check that now 'y' has length 4
x + y #This is still elementwise addition but now 'x' is recycled to become same length as 'y'.
# Computes 1 + 9, 4 + 5.7, 89 - 19 and 1 + 3, where x is recycled from beginning at the 4th element.
# A warning is produced when vectors of different sizes are combined as this is likely a mistake
# and users likely intended to do something else.
length(x) #get the length of a vector
length(y)
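# A sketch of the recycling described above in a case where R gives NO warning:
# when the longer length is an exact multiple of the shorter one.
# The vectors 'a' and 'b' are hypothetical and used only here.
a = c(1, 2, 3, 4)
b = c(10, 20)
a + b #'b' is recycled as 10, 20, 10, 20, giving 11 22 13 24
# Because no warning appears here, it is worth checking lengths with length()
# whenever two vectors are combined.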
# Example: Add '5' to the beginning of vector x and compute x + y
x = c(5, x) #adding '5' as the first element of x. The rest of the new 'x' stays as in the current 'x'.
x #Check what x contains now
x + y #elementwise addition of vectors x and y
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.3. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Blood pressure of Mr. X has been measured 2 times (1st morning, 2nd afternoon) on two consecutive days.
# Results are
# Day 1, morning = 134, afternoon = 145
# Day 2, morning = 129, afternoon = 133.
# Make vectors 'day1' and 'day2' that each have the two values and
# compute using vector operations
# (a) Add 2 units on both measurements on day 1.
# (b) Compute mean of morning measurements and mean of afternoon measurements.
# We can quickly generate some special vectors without typing them explicitly:
# vector from value 'n' to value 'm' is generated by 'n:m'
1:5 #is same as c(1,2,3,4,5)
5:1 #is same as c(5,4,3,2,1)
# When the step between values is something other than 1, we use the 'seq()' function, in 2 ways:
seq(1, 2, by = 0.25) #sequence from 1 to 2 by steps of 0.25
seq(1, 2, length.out = 10) #sequence from 1 to 2 using 10 equally spaced values
#Generating vectors by repeating some value
rep(1, 10) #repeat '1' 10 times
rep(c(1,2), c(3,5)) #repeat '1' 3 times and after that repeat '2' 5 times
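# A brief sketch of how the two ways of calling 'seq()' relate to each other.
# With 'length.out = k', the step is (to - from) / (k - 1).
# The names 's_by' and 's_len' are hypothetical; 'all.equal()' is a base R function
# not introduced above that compares numbers up to a small rounding tolerance.
s_by = seq(0, 10, by = 2.5) #0.0 2.5 5.0 7.5 10.0
s_len = seq(0, 10, length.out = 5) #5 values, so the step is (10 - 0) / 4 = 2.5
length(s_by) #5
all.equal(s_by, s_len) #TRUE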
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.4. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#(a) Generate sequence from 0 to 1 by steps of 0.05 using 'seq()'
#(b) Generate a vector of length 5 that is full of value 10 using 'rep()'
#(c) Generate a decreasing integer sequence from 95 to 85.
#******************************
# 1.3 Using functions
#******************************
# seq() and rep() above are examples of functions that take in "parameters" or "arguments"
# and process those parameters to produce output values.
# You can see help for any function by adding '?' in front of the function name.
# For example, help for the rep function opens by typing ?rep on console.
?rep
# Now the help page opens in the bottom-right panel.
# It says that we can use rep( ) in different ways using different arguments:
rep(1, times = 10) #repeats 1 for 10 times
rep(c(1,2), times = 5) #repeats vector c(1,2) 5 times, giving a vector of length 10
rep(c(1,2), each = 5) #repeats each element of vector c(1,2) 5 times on its own
# Generally, function arguments are included in parentheses, separated by comma.
# It is good practice to include the name of the parameter in function call to
# keep things clear. So while both function calls
rep(1, times = 10)
rep(1, 10)
# produce the same output, the first is more clear since we can explicitly
# see that 10 is value for 'times'.
# In the second version, we need to know that
# the second argument of rep is 'times' to interpret the command correctly.
# Importantly, we may easily make mistakes if we ignore names of parameters.
# For example, we cannot get the correct result of command
rep(c(1,2), each = 5)
#by simply typing
rep(c(1,2), 5)
#because standard form of rep() interprets the second value as 'times = ' not 'each = ' parameter.
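# A sketch of two more consequences of naming the arguments.
# First, named arguments can be given in any order.
# Second, 'each' and 'times' can be combined in a single call.
# 'identical()' is a base R function not introduced above that tests exact equality.
rep(c(1, 2), each = 2, times = 3) #1 1 2 2 1 1 2 2 1 1 2 2
identical(rep(c(1, 2), each = 2, times = 3), rep(c(1, 2), times = 3, each = 2)) #TRUE
identical(seq(1, 3, by = 0.5), seq(from = 1, to = 3, by = 0.5)) #TRUE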
#Some basic statistics applied to a vector of values
x = c(1, 2, 3)
mean(x) #arithmetic mean, (1+2+3)/3
sd(x) #sample standard deviation
var(x) #sample variance is sd^2
sum(x) #summing values, 1+2+3
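# A sketch of what var() and sd() compute, written out with the operations
# covered so far. The names 'v', 'n' and 'dev' are hypothetical and used only here.
# 'sqrt()' is the square root function.
v = c(1, 2, 3)
n = length(v) #number of values, here 3
dev = v - mean(v) #deviations from the mean: -1 0 1
sum(dev^2) / (n - 1) #sample variance divides by n - 1; here 1, same as var(v)
sqrt(sum(dev^2) / (n - 1)) #sample standard deviation; here 1, same as sd(v)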
#Open help page of 'sd()':
?sd
# It says: "If na.rm is TRUE then missing values are removed before computation proceeds."
# Suppose we have measured 7 patients but two values are missing
x = c(3.4, 5.1, 2.7, NA, 9.7, 5.5, NA)
sd(x) #Now sd(x) produces NA since one or more input values is NA
# How do you modify the sd(x) command in order to get the sd of
# the 5 values that are not NA?
sd(x, na.rm = TRUE) #according to help, parameter 'na.rm = TRUE' ReMoves NA values
# Let's check that this is the same as manually removing the two NAs, leave out indexes 4 and 7:
sd(x[c(1,2,3,5,6)])
# It is the same result.
# We could also remove indexes 4 and 7 by using minus sign in front of indexes like this:
sd(x[-c(4, 7)]) #vector x "minus indexes 4 and 7" produces vector x without elements 4 and 7.
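# A sketch of a third way to drop missing values, without typing their positions by hand.
# It uses 'is.na()', 'which()' and the negation operator '!', which are base R features
# not covered elsewhere in this script. The name 'w' is hypothetical.
w = c(3.4, 5.1, 2.7, NA, 9.7, 5.5, NA)
is.na(w) #TRUE at the positions where the value is missing
sum(is.na(w)) #number of missing values, here 2
which(is.na(w)) #positions of the missing values, here 4 and 7
w[!is.na(w)] #keeps only the values that are NOT missing
sd(w[!is.na(w)]) #same result as sd(w, na.rm = TRUE)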
x = 1:20 #vector of length 20 with elements 1,2,3,...,20.
median(x) #Median is the middle value, 50% of values are smaller than median
#We can also ask for other quantiles,
# that are the cut-off points that separate a given proportion of (sorted) values.
quantile(x, 0.25) #25% of values in 'x' are smaller than this value
quantile(x, 0.75) #75% of values in 'x' are smaller than this value
min(x) #minimum of 'x'
max(x) #maximum of 'x'
range(x) #returns two values: the smallest and the largest of 'x'
summary(x) #returns a set of summary values for vector 'x'
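# A sketch showing that 'quantile()' accepts a vector of proportions, so several
# cut-off points can be computed in one call. The name 'q' is hypothetical.
# The values in the comments assume R's default quantile method.
q = 1:20
quantile(q, c(0.25, 0.5, 0.75)) #5.75 10.50 15.25
quantile(q, 0.5) #the 50% quantile is the median, here 10.5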
sqrt(4) #square root function, in Finnish: neliojuuri
#random order of elements of vector can be produced by 'sample()' function
x = sample(x)
x #now values 1,...,20 have been randomly shuffled.
sort(x) #Sorting vector in ascending order by 'sort()'
sort(x, decreasing = TRUE) #Sorting it in descending order by 'sort(, decreasing = TRUE)'
# Note that sort(x) does not change 'x', which still remains randomly shuffled.
# sort(x) has just returned a new vector that was sorted.
# Check this by checking that the value of 'x' is still the shuffled one
x
# to sort it back we need to assign sorted result to x
x = sort(x)
x
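# A sketch of making a random shuffle repeatable. 'set.seed()' is a base R function,
# not covered elsewhere in this script, that fixes the starting point of the random
# number generator. The seed value 2021 and the names 's1' and 's2' are arbitrary choices.
set.seed(2021)
s1 = sample(1:20)
set.seed(2021) #resetting the same seed repeats the same "random" order
s2 = sample(1:20)
identical(s1, s2) #TRUE
identical(sort(s1), 1:20) #TRUE: shuffling only reorders values, sorting restores the order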
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.5. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# (a) Use 'rep()' to generate vector 'x' of nine elements that
# repeats value 6.5 four times and value 2.3 five times.
# Check that your x is as expected.
# (b) Compute arithmetic mean of values in x.
# (c) Compute sample variance of values in x.
# (d) Compute sample standard deviation of values in x.
# (e) Compute sum of values in x.
# (f) Compute median of values in x.
# (g) Compute cut-off point below which there are 33% of values in x.
# (h) Compute minimum of values in x.
# (i) Compute maximum of values in x.
# (j) Compute range of values in x.
# (k) Compute 'summary()' of values in x.
# (l) Sort x in increasing order.
#
##
### ANSWERS
##
#
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.1.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Blood pressure of Mr. X was 132 units on the first measurement.
# Write computations in R language to show
# How much BP was at the second measurement when
# (a) it had increased from the 1st measurement by 4.6 units (kasvanut 4.6 yksikkoa)?
x = 132 #assign 132 to variable x. This is the 1st measurement value.
x + 4.6 #is the answer to (a)
# (b) it had decreased from the 1st measurement by 4.6 units (vahentynyt 4.6 yksikkoa)?
x - 4.6 #is the answer to (b)
# (c) it had increased from the 1st measurement by 10 percent (kasvanut 10%)?
x * 1.10 #is the answer to (c); increasing by 10% is same as multiplying by 110% = 1.10
# (d) it had decreased from the 1st measurement by 30 percent?
x * (1 - 0.30) #is the answer to (d); decrease of 30% is x*0.30; remaining value is x*(1 - 0.30)
# Write using R in scientific notation
# (e) 0.00459
4.59e-3
# (f) 1449000
1.449e6
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.2.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Blood pressure of Mr. X has been measured 5 times.
# Results are 134, 145, 129, 133, 150.
# (a) Make a vector called 'bp' that contains Mr X's 5 BP measurements.
bp = c(134, 145, 129, 133, 150)
bp #always check what you have defined
# (b) Use an operation on vector 'bp' to decrease every measurement by 5 units.
bp - 5 #decreasing by 5 units means subtracting 5 from every element
# (c) Use an operation on vector 'bp' to decrease every measurement by 5%.
bp * (1 - 0.05)
# (d) Use vector indexing to extract element 3 from vector 'bp'.
bp[3]
# (e) Use vector indexing to extract elements 3 and 4 from vector 'bp' and compute their average.
(bp[3] + bp[4])/2 #(other ways are also possible)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.3.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Blood pressure of Mr. X has been measured 2 times (morning, afternoon) on two consecutive days.
# Results are
# Day 1, morning = 134, afternoon = 145
# Day 2, morning = 129, afternoon = 133.
# Make two vectors, one for day 1 and one for day 2 measurements.
day1 = c(134, 145)
day2 = c(129, 133)
# Compute using vector operations
# (a) Add 2 units on both measurements on day 1.
day1 + 2
# (b) Compute mean of morning measurements and mean of afternoon measurements.
(day1 + day2)/2 #returns two values, 1st = mean of mornings, 2nd = mean of afternoons
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.4.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#(a) Generate sequence from 0 to 1 by steps of 0.05 using 'seq()'
seq(0, 1, by = 0.05)
#(b) Generate a vector of length 5 that contains value 10 five times using 'rep()'
rep(10, times = 5)
#(c) Generate a decreasing integer sequence from 95 to 85.
95:85
#OR
seq(95, 85, by = -1)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.5.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# (a) Use 'rep()' to generate vector 'x' of nine elements that
# repeats value 6.5 four times and value 2.3 five times.
# Check that your x is as expected.
x = rep(c(6.5, 2.3), c(4, 5))
x
# (b) Compute arithmetic mean of values in x.
mean(x) #4.166667
# (c) Compute sample variance of values in x.
var(x) #4.9
# (d) Compute sample standard deviation of values in x.
sd(x) #2.213594
# (e) Compute sum of values in x.
sum(x) #37.5
# (f) Compute median of values in x.
median(x) #2.3
# (g) Compute cut-off point below which there are 33% of values in x.
quantile(x, 0.33) #2.3
# (h) Compute minimum of values in x.
min(x) #2.3
# (i) Compute maximum of values in x.
max(x) #6.5
# (j) Compute range of values in x.
range(x)
# (k) Compute 'summary()' of values in x.
summary(x)
# (l) Sort x in increasing order.
sort(x)