# Statistical Methods in Medical Research -
# Laaketieteellisen tutkimuksen tilastolliset menetelmat
# At University of Helsinki
# 4.8.2023
# Matti Pirinen

###
### Topic 1 Learn R: R basics
###

# Open this file in RStudio. RStudio opens this as "R script".
# You see this R script on the top-left panel in RStudio.
# You can adjust the sizes of the 4 panels on the screen so that
# you can have the most space for this script and the console below.
# Right-hand panels can be smaller for now.

# The answers to Test Yourself exercises 1.1, ..., 1.5 are at the end of this file.
# Try first to do them yourself and later check for correct answers.
# These exercises are for you to monitor your own learning. They are not to be returned.

# MAKING COMMENTS in R:
# Lines starting with "#" are comments that help the reader to follow what is happening
# and they are not statistical calculations that we do with R.
# For example, all the lines so far in this R script have been simply comments and not calculations.
# If you highlight one of the comment lines and hit "Run",
# it will be sent to the Console below but nothing happens (except comment being printed there).
# Try that with this line. Highlight this line and click "Run". Look at the console below.
# Now try the same but remove the "#" at the beginning of the line above.
# Now you get "Error: unexpected symbol ..." which means that R tried to interpret
# the line as an R command but couldn't understand it and therefore complains.

#********************************
#
# 1.1 Using R as calculator
#
#********************************

# Run the expressions below and see the output in the Console below.
# To run, highlight the code and click "Run" or press Ctrl+Enter (in Windows) or Cmd+Enter (in Mac).
5 + 4 #addition, yhteenlasku
3.4 - 1.4 #subtraction, vahennyslasku
4 * 4 #multiplication, kertolasku
12.5 / 2 #division, jakolasku
3^3 #exponentiation, potenssiinkorotus
# Make sure you understand these 5 operations.
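
# A minimal sketch of how the five operations above behave when combined in one expression.
# The numbers are made up for illustration and are not part of the original exercises.
# R follows the usual mathematical order: exponentiation first, then multiplication
# and division, and last addition and subtraction. Parentheses override that order.
2 + 3 * 4 #multiplication is done before addition, so this is 2 + 12 = 14
(2 + 3) * 4 #parentheses force the addition first, so this is 5 * 4 = 20
2 * 3^2 #exponentiation is done before multiplication, so this is 2 * 9 = 18
(2 * 3)^2 #parentheses force the multiplication first, so this is 6^2 = 36
# When in doubt about the order, adding parentheses is a safe choice.
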
# Note that the decimal separator in R is period '.'
# NOT comma ',' as it would be in the Finnish notation of numbers.
# Trying to use a comma as the decimal separator will cause an error in R!
# Note that the empty spaces between values and operators are optional,
# but they tend to add clarity to the code, which is why it is good to use them.

# Scientific notation simplifies expressing very large or very small values.
# There 'e+' or just 'e' is a shorthand for 10^ and 'e-' for 1 / 10^
# Thus, 'xe+n' means 'x' times 10^n, 'xe-n' means 'x' divided by 10^n, when n is any positive value.
1e4 #is same as 10000 (that is, 1 times 10^4)
1e-4 #is same as 0.0001 (that is, 1 divided by 10^4)
0.000000000000012 #you see that, by default, R expresses this in the scientific notation as 1.2e-14

# Some special values that are not numerical:
NA # Not Available, is used when a value is missing.
# If NA is a part of an arithmetic calculation, the result will be NA. Run lines below to see this:
2 + NA
NA * 5
(3.4 + 1.5) / NA

Inf #Infinity is larger than any number that R can express.
# You may get to Inf if you take too large exponents:
# 1e308 seems to be the limit in my system after which R goes to Inf
1e308 #This is still OK as a numeric value ...
1e309 # ... but this becomes Inf
# Another source of Inf is to divide by values very close to 0 or by exactly 0.
1 / 1e-500
1 / 0
#In calculations, Inf stays as Inf when combined with finite values
2 + Inf
Inf * 2
# Division by Inf gives 0:
1 / Inf
# There is also -Inf, the infinity on the negative side:
2 - Inf #think that you are subtracting a huge value from 2. Result is -Inf
(-2 * Inf) # While 2*Inf is Inf, minus sign in front of Inf changes it to -Inf.
# Some operations with Infinities are undefined and produce
# "NaN = Not a number".
# This is not a missing value like NA,
# but rather a mathematically undefined value, such as a result of undetermined operations:
Inf - Inf
Inf / Inf
0 / 0

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.1. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Blood pressure of Mr. X was 132 units on the first measurement.
# Write computations in R language to show
# How much BP was at the second measurement when
# (a) it had increased from the 1st measurement by 4.6 units (kasvanut 4.6 yksikkoa)?
# (b) it had decreased from the 1st measurement by 4.6 units (vahentynyt 4.6 yksikkoa)?
# (c) it had increased from the 1st measurement by 10 percent (kasvanut 10%)?
# (d) it had decreased from the 1st measurement by 30 percent (pienentynyt 30%)?
# Write using R in scientific notation
# (e) 0.00459
# (f) 1449000

#*******************************************
#
# 1.2 Generating variables and vectors
#
#*******************************************

# We can assign values to variables and then operate with the variable symbols
x = 3 #Sets symbol 'x' to have value 3 (numerical)
y = 4 #Sets symbol 'y' to have value 4 (numerical)
x + y #Calculates x + y, which, with current values, is 7 = 3 + 4
# Note: "<-" means the same as "=", so x <- 3 is another way to write x = 3
# Note: top-right panel in RStudio, "Environment", now shows the variables and their current values.

# Example: Assign value 2 to x and 9 to y and compute x to the power of y.
x = 2
y = 9
x^y

# A "vector" is a combination of individual numerical values into a single object.
# We create vectors by putting the values to be combined within 'c( )' syntax.
# Remember it from 'c' saying 'combine'. Values within c( ) should be separated by commas.
# Let's make a vector of three values 1, 4 and 89 and let's name it as 'x'
x = c(1, 4, 89)
x #running variable name simply shows the value of the variable.
# Note: If you used period '.'
# in place of comma ',' in vector definition, things go wrong!

# Vector can be indexed by [ ] notation
x[2] # returns element no.2 of vector x, here that is x[2] = 4.
# If you ask for an element that is not present in the vector, you'll get NA.
x[4] #The length of vector 'x' is 3, so element 4 does not exist, thus it is "NA, Not Available".

# Indexing a vector with a vector returns one element per each index value.
# Let's take elements with indexes 1 once, 3 once and 2 twice from vector 'x':
x[c(1,3,2,2)] #returns a vector of length 4, element no. 2 of original 'x' is repeated twice here.

# If a single value is operated on vector, R applies the operation separately to every element of vector.
2 + x #elementwise addition gives 3 = 2+1, 6 = 2+4, 91 = 2+89
2 * x #elementwise multiplication gives 2 = 2*1, 8 = 2*4, 178 = 2*89

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.2. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Blood pressure of Mr. X has been measured 5 times.
# Results are 134, 145, 129, 133, 150.
# (a) Make a vector called 'bp' that contains Mr X's 5 BP measurements.
# (b) Use an operation on vector 'bp' to decrease every measurement by 5 units.
# (c) Use an operation on vector 'bp' to decrease every measurement by 5%.
# (d) Use vector indexing to extract element 3 from vector 'bp'.
# (e) Use vector indexing to extract elements 3 and 4 from vector 'bp' and compute their average.

# Let's generate another vector 'y' of same length as 'x'
y = c(9, 5.7, -19)
# Addition of two vectors of SAME length does elementwise addition.
# 1st elements are added together, 2nd elements together etc.
x + y #Returns vector of length 3 with elements 10 = 1 + 9, 9.7 = 4 + 5.7 and 70 = 89 - 19
x - y #elementwise subtraction
x * y #elementwise multiplication
x / y #elementwise division

# Add one more element at the end of 'y':
y = c(y, 3)
y #Check that now 'y' has length 4
x + y #This is still elementwise addition but now 'x' is recycled to become same length as 'y'.
# Computes 1 + 9, 4 + 5.7, 89 - 19 and 1 + 3, where x is recycled from beginning at the 4th element.
# A warning is produced here because the longer length (4) is not a multiple of the shorter length (3).
# Such a mismatch is likely a mistake and the user likely had intended to do something else.
length(x) #get the length of a vector
length(y)

# Example: Include '5' at the beginning of vector x and compute x + y
x = c(5, x) #adding '5' as the first element of x. The rest of the new 'x' stays as in the current 'x'.
x #Check what x contains now
x + y #elementwise addition of vectors x and y. No warnings needed anymore since lengths are the same.

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.3. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Blood pressure of Mr. X has been measured 2 times (1st morning, 2nd afternoon) on two consecutive days.
# Results are
# Day 1, morning = 134, afternoon = 145
# Day 2, morning = 129, afternoon = 133.
# Make vectors 'day1' and 'day2' that each have the two values and
# compute using vector operations
# (a) Add 2 units on both measurements of day 1.
# (b) Compute mean of morning measurements and mean of afternoon measurements.
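
# A small sketch of the recycling rule that was described earlier in this section.
# The vectors 'a', 'b' and 'b3' are made-up names and values for illustration only,
# chosen so that the current 'x' and 'y' stay unchanged.
a = c(1, 2, 3, 4)
b = c(10, 20)
a + b #'b' is recycled exactly twice: 1 + 10, 2 + 20, 3 + 10, 4 + 20
# No warning is given above, because length 4 is a multiple of length 2.
b3 = c(10, 20, 30)
a + b3 #'b3' is recycled only partly: 1 + 10, 2 + 20, 3 + 30, 4 + 10
# Now R gives a warning, because length 4 is not a multiple of length 3.
# Checking lengths with length( ) before combining vectors helps to avoid silent recycling.
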
# We can quickly generate some special vectors without typing them explicitly:
# vector from value 'n' to value 'm' is generated by 'n:m'
1:5 #is same as c(1,2,3,4,5)
5:1 #is same as c(5,4,3,2,1)
# When step between values is something else than one unit we use 'seq()' function, in 2 ways:
seq(1, 2, by = 0.25) #sequence from 1 to 2 by steps of 0.25
seq(1, 2, length.out = 10) #sequence from 1 to 2 using 10 equally spaced values
#Generating vectors by repeating some value
rep(1, 10) #repeat '1' 10 times
rep(c(1,2), c(3,5)) #repeat '1' 3 times and after that repeat '2' 5 times

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.4. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#(a) Generate sequence from 0 to 1 by steps of 0.05 using 'seq()'
#(b) Generate a vector of length 5 that is full of value 10 using 'rep()'
#(c) Generate a decreasing integer sequence from 95 to 85.

#******************************
#
# 1.3 Using functions
#
#******************************

# seq( ) and rep( ) above are examples of functions that take in "parameters" or "arguments"
# and process those parameters to produce output values.
# You can see help for any function by adding '?' in front of the function name.
# For example, help for the rep function opens by typing ?rep on console.
?rep
# Now help opens on the window bottom right.
# It says that we can use rep( ) in different ways using different arguments:
rep(1, times = 10) #repeats 1 for 10 times
rep(c(1,2), times = 5) #repeats vector (1,2) for 5 times
rep(c(1,2), each = 5) #repeats each element of vector c(1,2) 5 times on its own

# Generally, function arguments are included within the parentheses, separated by commas.
# It is a good practice to include the name of each parameter in the function call to
# keep the code clear.
# So while both of these function calls
rep(1, times = 10)
rep(1, 10)
# produce the same output, the first is clearer since we can explicitly
# see that 10 is the value for parameter 'times'.
# In the second version, we need to know that
# the second argument of rep is 'times' to interpret the command correctly.
# Importantly, we may easily make mistakes if we ignore names of parameters.
# For example, we cannot get the correct result of command
rep(c(1,2), each = 5)
#by simply typing
rep(c(1,2), 5)
#because by default rep( ) interprets the second value as 'times' not 'each' parameter.
# You can learn this behavior from the help of rep( ) where it says that
# The default behavior of call rep(x) is as if the call was
# rep(x, times = 1, length.out = NA, each = 1),
# and here the first parameter after 'x' is 'times' not 'each'.

#Some basic statistics applied to a vector of values
x = c(1, 2, 3)
mean(x) #arithmetic mean, (1 + 2 + 3) / 3
sd(x) #sample standard deviation
var(x) #sample variance, which is same as sd(x)^2
sum(x) #summing values, 1 + 2 + 3

#Open help page of 'sd()':
?sd
# It says: "If na.rm is TRUE then missing values are removed before computation proceeds."
# Suppose we have measured 7 patients but two values are missing
x = c(3.4, 5.1, 2.7, NA, 9.7, 5.5, NA)
sd(x) #Now sd(x) produces NA since one or more input values is NA
# How do you modify the sd(x) command in order to get the sd of
# the 5 values that are not NA?
sd(x, na.rm = TRUE) #according to the help, parameter 'na.rm = TRUE' ReMoves NA values
# Let's check that this is the same as manually removing the two NAs, namely leaving out indexes 4 and 7:
sd(x[c(1,2,3,5,6)])
# It is the same result.
# We could also remove indexes 4 and 7 by using minus sign in front of indexes like this:
sd(x[-c(4, 7)]) #vector x "minus indexes 4 and 7" produces vector x without elements 4 and 7.

x = 1:20 #vector of length 20 with elements 1,2,3,...,20.
median(x) #Median is the middle value, 50% of values are smaller than median
#We can also ask for other quantiles,
# that are the cut-off points that separate a given proportion of (sorted) values.
quantile(x, 0.25) #25% of values in 'x' are smaller than this value
quantile(x, 0.75) #75% of values in 'x' are smaller than this value
min(x) #minimum of 'x'
max(x) #maximum of 'x'
range(x) #returns two values: the smallest and the largest of 'x'
summary(x) #returns a set of summary values for vector 'x'

sqrt(4) #square root function, in Finnish: neliojuuri

#random order of elements of vector can be produced by 'sample()' function
x = sample(x)
x #now values 1,...,20 have been randomly shuffled.
sort(x) #Sorting vector in ascending order by 'sort()'
sort(x, decreasing = TRUE) #Sorting it in descending order by 'sort(, decreasing = TRUE)'
# Note that sort(x) does not change 'x', which still remains randomly shuffled.
# sort(x) has just returned a new vector that was sorted.
# Check this by checking that the value of 'x' is still the shuffled one
x
# to sort it back we need to assign sorted result to x
x = sort(x)
x #now x is in order

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.5. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# (a) Use 'rep()' to generate vector 'x' of nine elements that
# repeats value 6.5 four times and value 2.3 five times.
# Check that your x is as expected.
# (b) Compute arithmetic mean of values in x.
# (c) Compute sample variance of values in x.
# (d) Compute sample standard deviation of values in x.
# (e) Compute sum of values in x.
# (f) Compute median of values in x.
# (g) Compute cut-off point below which there are 33% of values in x.
# (h) Compute minimum of values in x.
# (i) Compute maximum of values in x.
# (j) Compute range of values in x.
# (k) Compute 'summary()' of values in x.
# (l) Sort x in increasing order.
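
# A brief sketch related to the 'na.rm' argument that was used with sd( ) in section 1.3.
# The same argument is also accepted by mean( ), median( ) and sum( ).
# The vector 'z' is a made-up example (not course data), named so that 'x' stays unchanged.
z = c(4, NA, 6, 8)
mean(z) #produces NA because one value is missing
mean(z, na.rm = TRUE) #mean of the remaining values 4, 6 and 8, which is 6
median(z, na.rm = TRUE) #median of 4, 6 and 8, which is 6
sum(z, na.rm = TRUE) #4 + 6 + 8 = 18
# When unsure whether a function accepts 'na.rm', its help page (for example ?mean) tells that.
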
#
##
### ANSWERS
##
#

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.1.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Blood pressure of Mr. X was 132 units on the first measurement.
# Write computations in R language to show
# How much BP was at the second measurement when
# (a) it had increased from the 1st measurement by 4.6 units (kasvanut 4.6 yksikkoa)?
x = 132 #assign 132 to variable x. This is the 1st measurement value.
x + 4.6 #is the answer to (a)
# (b) it had decreased from the 1st measurement by 4.6 units (vahentynyt 4.6 yksikkoa)?
x - 4.6 #is the answer to (b)
# (c) it had increased from the 1st measurement by 10 percent (kasvanut 10%)?
x * 1.10 #is the answer to (c); increasing by 10% is same as multiplying by 110% = 1.10
# (d) it had decreased from the 1st measurement by 30 percent (pienentynyt 30%)?
x * (1 - 0.30) #is the answer to (d); decrease of 30% is x*0.30; remaining value is x*(1 - 0.30)
# Write using R in scientific notation
# (e) 0.00459
4.59e-3
# (f) 1449000
1.449e6

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.2.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Blood pressure of Mr. X has been measured 5 times.
# Results are 134, 145, 129, 133, 150.
# (a) Make a vector called 'bp' that contains Mr X's 5 BP measurements.
bp = c(134, 145, 129, 133, 150)
bp #always check what you have defined
# (b) Use an operation on vector 'bp' to decrease every measurement by 5 units.
bp - 5
# (c) Use an operation on vector 'bp' to decrease every measurement by 5%.
bp * (1 - 0.05)
# (d) Use vector indexing to extract element 3 from vector 'bp'.
bp[3]
# (e) Use vector indexing to extract elements 3 and 4 from vector 'bp' and compute their average.
(bp[3] + bp[4])/2
#and other ways are also possible, such as
mean(bp[3:4])

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.3.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Blood pressure of Mr. X has been measured 2 times (morning, afternoon) on two consecutive days.
# Results are
# Day 1, morning = 134, afternoon = 145
# Day 2, morning = 129, afternoon = 133.
# Make two vectors, one for day 1 and one for day 2 measurements.
day1 = c(134, 145)
day2 = c(129, 133)
# Compute using vector operations
# (a) Add 2 units on both measurements on day 1.
day1 + 2
# (b) Compute mean of morning measurements and mean of afternoon measurements.
(day1 + day2)/2 #returns two values, 1st = mean of mornings, 2nd = mean of afternoons

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.4.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#(a) Generate sequence from 0 to 1 by steps of 0.05 using 'seq()'
seq(0, 1, by = 0.05)
#(b) Generate a vector of length 5 that contains value 10 five times using 'rep()'
rep(10, times = 5)
#(c) Generate a decreasing integer sequence from 95 to 85.
95:85
#OR
seq(95, 85, by = -1)

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.5.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# (a) Use 'rep()' to generate vector 'x' of nine elements that
# repeats value 6.5 four times and value 2.3 five times.
# Check that your x is as expected.
x = rep(c(6.5, 2.3), c(4, 5))
x
# (b) Compute arithmetic mean of values in x.
mean(x) #4.166667
# (c) Compute sample variance of values in x.
var(x) #4.9
# (d) Compute sample standard deviation of values in x.
sd(x) #2.213594
# (e) Compute sum of values in x.
sum(x) #37.5
# (f) Compute median of values in x.
median(x) #2.3
# (g) Compute cut-off point below which there are 33% of values in x.
quantile(x, 0.33) #2.3
# (h) Compute minimum of values in x.
min(x) #2.3
# (i) Compute maximum of values in x.
max(x) #6.5
# (j) Compute range of values in x.
range(x)
# (k) Compute 'summary()' of values in x.
summary(x)
# (l) Sort x in increasing order.
sort(x)