# Statistical Methods in Medical Research
# Laaketieteellisen tutkimuksen tilastolliset menetelmat
# At University of Helsinki
# 15.8.2024
# Matti Pirinen


###
### Topic 1 Learn R: R basics
###


# Open this file in RStudio. RStudio opens it as an "R script".
# You see this R script in the top-left panel of RStudio.
# You can adjust the sizes of the 4 panels on the screen so that 
# you can have the most space for this script and the console below.
# Right-hand panels can be smaller for now.

# The answers to Test Yourself 1.1, ..., 1.5 are at the end of this file.
# Try first to do them yourself and later check for correct answers.
# These exercises are for you to monitor your own learning. They are not to be returned.


# MAKING COMMENTS in R:
# Lines starting with "#" are comments that help the reader to follow what is happening
#  and they are not statistical calculations that we do with R.
# For example, all the lines so far in this Rscript have been simply comments and not calculations.
# If you highlight one of the comment lines and click "Run",
# it is sent to the Console below, but nothing happens (except that the comment is printed there).

# Try that with this line. Highlight this line and click "Run". Look at the console below.

# Now try the same but remove the "#" at the beginning of the line above.
# Now you get "Error: unexpected symbol ..." which means that R tried to interpret
# the line as an R command but couldn't understand it and therefore complains.


#********************************
#
#   1.1 Using R as calculator
#
#********************************

# Run the expressions below and see the output in the Console below.
# To run, highlight the code and click "Run" or press Ctrl+Enter (in Windows) or Cmd+Enter (in Mac).

5 + 4 #addition, yhteenlasku
3.4 - 1.4 #subtraction, vahennyslasku
4 * 4 #multiplication, kertolasku
12.5 / 2 #division, jakolasku
3^3 #exponentiation, potenssiinkorotus
# Make sure you understand these 5 operations.
# Note that the decimal separator in R is period '.' 
#   NOT comma ',' as it would be in the Finnish notation of numbers.
#   And trying to use comma as decimal separator will cause error in R!
# Note that the empty spaces between values and operators are optional,
# but they tend to add clarity to the code so that's why it is good to use them.

# Scientific notation simplifies expressing very large or very small values.
# Here 'e+' or just 'e' is shorthand for "times 10^" and 'e-' for "divided by 10^".
# Thus 'xe+n' means 'x' times 10^n and 'xe-n' means 'x' divided by 10^n, where n is a positive integer.
1e4 #is same as 10000 (that is, 1 times 10^4)
1e-4 #is same as 0.0001 (that is, 1 divided by 10^4)
0.000000000000012 #you see that, by default, R expresses this in the scientific notation as 1.2e-14
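# A quick check of the notation above (a minimal sketch; the values are arbitrary illustrations
# and 'format()' is a base R function that is not otherwise used in this script):
2.5e3 == 2500 #TRUE, because 2.5e3 means 2.5 times 10^3
format(12300000, scientific = TRUE) #shows the value in scientific notation, "1.23e+07"
format(7.2e-4, scientific = FALSE) #shows the value in ordinary decimal notation, "0.00072"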

# Some special values that are not numerical:
NA # Not Available, is used when a value is missing.
# If NA is part of an arithmetic calculation, the result is also NA. Run the lines below to see this:
2 + NA
NA * 5
(3.4 + 1.5) / NA

Inf #Infinity is larger than any number that R can express. 
# You may get to Inf if you take too large exponents:
# 1e308 seems to be the limit in my system after which R goes to Inf
1e308 #This is still OK as a numeric value ...
1e309 # ... but this becomes Inf
# Another source of Inf is division by exactly 0 or by a value very close to 0.
1 / 1e-500 #1e-500 is too small for R to represent, so it is stored as 0
1 / 0
#In calculations, Inf stays as Inf when combined with finite values 
2 + Inf
Inf * 2
# Division by Inf gives 0:
1 / Inf
# There is also -Inf, the infinity on the negative side:
2 - Inf #think that you are subtracting a huge value from 2. Result is -Inf
(-2 * Inf) # While 2 * Inf is Inf, multiplying Inf by a negative value gives -Inf.

# Some operations with Infinities are undefined and produce
# "NaN = Not a number". This is not a missing value like NA,
# but rather a mathematically undefined value, such as a result of undetermined operations:
Inf - Inf 
Inf / Inf
0 / 0
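# NA and NaN can be told apart with the base R functions is.na() and is.nan().
# This is a minimal sketch; these two functions are not introduced elsewhere in this script.
is.na(NA) #TRUE
is.nan(NA) #FALSE, a missing value is not an undefined number
is.nan(0 / 0) #TRUE
is.na(0 / 0) #TRUE, note that is.na() also flags NaN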


#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.1. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#


# Blood pressure of Mr. X was 132 units on the first measurement. 
# Write computations in R language to show
# How much BP was at the second measurement when
# (a) it had increased from the 1st measurement by 4.6 units (kasvanut 4.6 yksikkoa)?
# (b) it had decreased from the 1st measurement by 4.6 units (vahentynyt 4.6 yksikkoa)?
# (c) it had increased from the 1st measurement by 10 percent (kasvanut 10%)?
# (d) it had decreased from the 1st measurement by 30 percent (pienentynyt 30%)?

# Write using R in scientific notation
# (e) 0.00459
# (f) 1449000


#*******************************************
#
#   1.2 Generating variables and vectors
#
#*******************************************

# We can assign values to variables and then operate with the variable symbols
x = 3 #Sets symbol 'x' to have value 3 (numerical)
y = 4 #Sets symbol 'y' to have value 4 (numerical)
x + y #Calculates x + y, which, with current values, is 7 = 3 + 4
# Note: for assigning values, "<-" means the same as "=", so x <- 3 is another way to write x = 3
# Note: the top-right panel in RStudio, "Environment", now shows the variables and their current values.

# Example: Assign value 2 to x and 9 to y and compute x to the power of y.
x = 2
y = 9
x^y

# A "vector" is a combination of individual numerical values into a single object.
# We create vectors by putting the values to be combined within 'c( )' syntax.
# As a memory aid, 'c' stands for 'combine'. Values within c( ) should be separated by commas.
# Let's make a vector of three values 1, 4 and 89 and let's name it as 'x'
x = c(1, 4, 89) 
x #running variable name simply shows the value of the variable.
# Note: If you used period '.' in place of comma ',' in vector definition, things go wrong!

# Vector can be indexed by [ ] notation
x[2] # returns element no.2 of vector x, here that is x[2] = 4.
# If you ask for an element that is not present in the vector, you'll get NA.
x[4] #The length of vector 'x' is 3, so element 4 does not exist, thus it is "NA, Not Available".
# Indexing a vector with a vector returns one element per each index value.
# Let's take elements with indexes 1 once, 3 once and 2 twice from vector 'x':
x[c(1,3,2,2)] #returns a vector of length 4, element no. 2 of original 'x' is repeated twice here.

# If an operation combines a single value with a vector, R applies it separately to every element of the vector.
2 + x #elementwise addition gives 3 = 2+1, 6 = 2+4, 91 = 2+89
2 * x #elementwise multiplication gives 2 = 2*1, 8 = 2*4, 178 = 2*89
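# The same elementwise rule holds for subtraction, division and powers.
# A small sketch with a hypothetical vector 'w' (chosen only for illustration, so that 'x' stays untouched):
w = c(2, 4, 10)
w - 1 #elementwise subtraction gives 1 = 2-1, 3 = 4-1, 9 = 10-1
w / 2 #elementwise division gives 1 = 2/2, 2 = 4/2, 5 = 10/2
w^2 #elementwise exponentiation gives 4 = 2^2, 16 = 4^2, 100 = 10^2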


#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.2. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

# Blood pressure of Mr. X has been measured 5 times.
# Results are 134, 145, 129, 133, 150.
# (a) Make a vector called 'bp' that contains Mr X's 5 BP measurements.
# (b) Use an operation on vector 'bp' to decrease every measurement by 5 units.
# (c) Use an operation on vector 'bp' to decrease every measurement by 5%.
# (d) Use vector indexing to extract element 3 from vector 'bp'.
# (e) Use vector indexing to extract elements 3 and 4 from vector 'bp' and compute their average.


# Let's generate another vector 'y' of same length as 'x'
y = c(9, 5.7, -19)
# Addition of two vectors of SAME length does elementwise addition.
# 1st elements are added together, 2nd elements together etc.
x + y #Returns vector of length 3 with elements 10 = 1 + 9, 9.7 = 4 + 5.7 and 70 = 89 - 19
x - y #elementwise subtraction
x * y #elementwise multiplication
x / y #elementwise division

# Add one more element at the end of 'y':
y = c(y, 3)
y #Check that now 'y' has length 4

x + y #This is still elementwise addition but now 'x' is recycled to become same length as 'y'.
# Computes 1 + 9, 4 + 5.7, 89 - 19 and 1 + 3, where x is recycled from beginning at the 4th element.
# A warning is produced here because the longer length (4) is not a multiple of the shorter length (3).
# Such a mismatch is likely a mistake, and the user likely intended to do something else.
length(x) #get the length of a vector
length(y)
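# Recycling gives no warning when the longer length is a multiple of the shorter one,
# so it is worth checking lengths yourself.
# A minimal sketch with hypothetical vectors 'a' and 'b' (not used elsewhere in this script):
a = c(1, 2, 3, 4)
b = c(10, 20)
a + b #gives 11 = 1+10, 22 = 2+20, 13 = 3+10, 24 = 4+20; 'b' is used twice in full and no warning is given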

# Example: Include '5' at the beginning of vector x and compute x + y
x = c(5, x) #adding '5' as the first element of x. The rest of the new 'x' stays as in the current 'x'.
x #Check what x contains now
x + y #elementwise addition of vectors x and y. No warnings needed anymore since lengths are the same.


#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.3. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

# Blood pressure of Mr. X has been measured 2 times (1st morning, 2nd afternoon) on two consecutive days.
# Results are 
# Day 1, morning = 134, afternoon = 145
# Day 2, morning = 129, afternoon = 133.
# Make vectors 'day1' and 'day2' that each have the two values and 
#  compute using vector operations
#  (a) Add 2 units on both measurements of day 1.
#  (b) Compute mean of morning measurements and mean of afternoon measurements.


# We can quickly generate some special vectors without typing them explicitly:
# vector from value 'n' to value 'm' is generated by 'n:m'
1:5 #is same as c(1,2,3,4,5)
5:1 #is same as c(5,4,3,2,1)
# When the step between values is something other than one unit, we use the 'seq()' function, in 2 ways:
seq(1, 2, by = 0.25) #sequence from 1 to 2 by steps of 0.25
seq(1, 2, length.out = 10) #sequence from 1 to 2 using 10 equally spaced values  

#Generating vectors by repeating some value
rep(1, 10) #repeat '1' 10 times
rep(c(1,2), c(3,5)) #repeat '1' 3 times and after that repeat '2' 5 times
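# 'length()' gives a quick check that a generated vector has the intended number of elements
# (a minimal sketch that reuses the calls above):
length(seq(1, 2, by = 0.25)) #5 values: 1, 1.25, 1.5, 1.75, 2
length(seq(1, 2, length.out = 10)) #10 values, as requested by 'length.out'
length(rep(c(1,2), c(3,5))) #8 values = 3 + 5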


#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.4. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#(a) Generate sequence from 0 to 1 by steps of 0.05 using 'seq()'
#(b) Generate a vector of length 5 that is full of value 10 using 'rep()'
#(c) Generate a decreasing integer sequence from 95 to 85.


#******************************
#
#   1.3 Using functions
#
#******************************

# seq( ) and rep( ) above are examples of functions that take in "parameters" or "arguments"
# and process those parameters to produce output values.
# You can see help for any function by adding '?' in front of the function name.
# For example, help for the rep function opens by typing ?rep on console.
?rep
# Now the help page opens in the bottom-right panel.
# It says that we can use rep( ) in different ways using different arguments:
rep(1, times = 10) #repeats 1 for 10 times
rep(c(1,2), times = 5) #repeats vector (1,2) 5 times, giving 10 values
rep(c(1,2), each = 5) #repeats each element of vector c(1,2) 5 times on its own

# Generally, function arguments are included within the parentheses, separated by commas.
# It is a good practice to include the name of each parameter in the function call to
# keep the code clear. So while both of these function calls
rep(1, times = 10) 
rep(1, 10) 
# produce the same output, the first is clearer since we can explicitly
# see that 10 is value for parameter 'times'.
# In the second version, we need to know that
# the second argument of rep is 'times' to interpret the command correctly.
# Importantly, we may easily make mistakes if we ignore names of parameters.
# For example, we cannot get the correct result of command
rep(c(1,2), each = 5)
#by simply typing
rep(c(1,2), 5)
#because by default rep( ) interprets the second value as 'times' not 'each' parameter.
#  You can learn this behavior from the help of rep( ) where it says that
#  The default behavior of call rep(x) is as if the call was
#  rep(x, times = 1, length.out = NA, each = 1),
#  and here the first parameter after 'x' is 'times' not 'each'.
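# Another benefit of naming parameters is that named arguments can be given in any order.
# A minimal sketch using 'seq()', whose first parameters are named 'from', 'to' and 'by':
seq(from = 1, to = 2, by = 0.5) #gives 1, 1.5, 2
seq(by = 0.5, to = 2, from = 1) #same result, although the order of arguments is different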


#Some basic statistics applied to a vector of values
x = c(1, 2, 3)
mean(x) #arithmetic mean, (1 + 2 + 3) / 3 
sd(x) #sample standard deviation
var(x) #sample variance, which is same as sd(x)^2
sum(x) #summing values, 1 + 2 + 3
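# What 'var()' and 'sd()' compute can also be written out with the vector operations from Section 1.2.
# A sketch with a hypothetical vector 'v' (same values as 'x' above); 'sqrt()' is the square root function.
v = c(1, 2, 3)
sum((v - mean(v))^2) / (length(v) - 1) #sample variance: squared deviations from the mean, summed, divided by n - 1
sqrt(sum((v - mean(v))^2) / (length(v) - 1)) #sample standard deviation is the square root of sample variance
# Both of these give 1 here, the same as var(v) and sd(v).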

#Open help page of 'sd()':
?sd
# It says: "If na.rm is TRUE then missing values are removed before computation proceeds."
# Suppose we have measured 7 patients but two values are missing
x = c(3.4, 5.1, 2.7, NA, 9.7, 5.5, NA)
sd(x) #Now sd(x) produces NA since one or more input values is NA
# How do you modify the sd(x) command in order to get the sd of
# the 5 values that are not NA?
sd(x, na.rm = TRUE) #according to the help, parameter 'na.rm = TRUE' ReMoves NA values
# Let's check that this is the same as manually removing the two NAs, namely leaving out indexes 4 and 7:
sd(x[c(1,2,3,5,6)])
# It is the same result.
# We could also remove indexes 4 and 7 by using minus sign in front of indexes like this:
sd(x[-c(4, 7)]) #vector x "minus indexes 4 and 7" produces vector x without elements 4 and 7.
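# Other summary functions, such as 'mean()' and 'sum()', accept the same 'na.rm' parameter.
# A minimal sketch with a hypothetical vector 'z' (so that 'x' stays untouched):
z = c(2, 4, NA)
mean(z) #NA, because one input value is missing
mean(z, na.rm = TRUE) #3, the mean of the two values that are not NA
sum(z, na.rm = TRUE) #6 = 2 + 4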

x = 1:20 #vector of length 20 with elements 1,2,3,...,20.
median(x) #Median is the middle value, 50% of values are smaller than median
#We can also ask for other quantiles, 
# that are the cut-off points that separate a given proportion of (sorted) values.
quantile(x, 0.25) #25% of values in 'x' are smaller than this value
quantile(x, 0.75) #75% of values in 'x' are smaller than this value
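# Several quantiles can be requested in one call by giving a vector of proportions.
# A minimal sketch with a hypothetical vector 'q' (same values as 'x' above),
# using the default method of 'quantile()', which interpolates between neighboring values:
q = 1:20
quantile(q, probs = c(0.25, 0.5, 0.75)) #gives 5.75, 10.5 and 15.25; the middle one equals median(q)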

min(x) #minimum of 'x'
max(x) #maximum of 'x'
range(x) #returns two values: the smallest and the largest of 'x'
summary(x) #returns a set of summary values for vector 'x'

sqrt(4) #square root function, in Finnish: neliojuuri

#random order of elements of vector can be produced by 'sample()' function
x = sample(x)
x #now values 1,...,20 have been randomly shuffled.
sort(x) #Sorting vector in ascending order by 'sort()'
sort(x, decreasing = TRUE) #Sorting it in descending order by 'sort(x, decreasing = TRUE)'
# Note that sort(x) does not change 'x', which still remains randomly shuffled.
# sort(x) has just returned a new vector that was sorted.
# Check this by checking that the value of 'x' is still the shuffled one
x
# to sort it back we need to assign sorted result to x
x = sort(x)
x #now x is in order


#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Test Yourself 1.5. (Answers are at the end of this file.)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

# (a) Use 'rep()' to generate vector 'x' of nine elements that
#     repeats value 6.5 four times and value 2.3 five times.
#     Check that your x is as expected.
# (b) Compute arithmetic mean of values in x.
# (c) Compute sample variance of values in x.
# (d) Compute sample standard deviation of values in x.
# (e) Compute sum of values in x.
# (f) Compute median of values in x.
# (g) Compute cut-off point below which there are 33% of values in x.
# (h) Compute minimum of values in x.
# (i) Compute maximum of values in x.
# (j) Compute range of values in x.
# (k) Compute 'summary()' of values in x.
# (l) Sort x in increasing order.


#
##
### ANSWERS
##
#


#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.1.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

# Blood pressure of Mr. X was 132 units on the first measurement.
# Write computations in R language to show
# How much BP was at the second measurement when
# (a) it had increased from the 1st measurement by 4.6 units (kasvanut 4.6 yksikkoa)?

x = 132 #assign 132 to variable x. This is the 1st measurement value.
x + 4.6 #is the answer to (a)

# (b) it had decreased from the 1st measurement by 4.6 units (vahentynyt 4.6 yksikkoa)?

x - 4.6 #is the answer to (b)


# (c) it had increased from the 1st measurement by 10 percent (kasvanut 10%)?

x * 1.10 #is the answer to (c); increasing by 10% is same as multiplying by 110% = 1.10


# (d) it had decreased from the 1st measurement by 30 percent?

x * (1 - 0.30) #is the answer to (d); decrease of 30% is x*0.30; remaining value is x*(1 - 0.30)


# Write using R in scientific notation
# (e) 0.00459

4.59e-3

# (f) 1449000

1.449e6


#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.2.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

# Blood pressure of Mr. X has been measured 5 times.
# Results are 134, 145, 129, 133, 150.
# (a) Make a vector called 'bp' that contains Mr X's 5 BP measurements.

bp = c(134, 145, 129, 133, 150)
bp #always check what you have defined

# (b) Use an operation on vector 'bp' to decrease every measurement by 5 units.

bp - 5

# (c) Use an operation on vector 'bp' to decrease every measurement by 5%.

bp * (1 - 0.05)

# (d) Use vector indexing to extract element 3 from vector 'bp'.

bp[3]

# (e) Use vector indexing to extract elements 3 and 4 from vector 'bp' and compute their average.


(bp[3] + bp[4])/2 
#and other ways are also possible, such as 
mean(bp[3:4])


#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.3.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

# Blood pressure of Mr. X has been measured 2 times (morning, afternoon) on two consecutive days.
# Results are 
# Day 1, morning = 134, afternoon = 145
# Day 2, morning = 129, afternoon = 133.
# Make two vectors, one for day 1 and one for day 2 measurements.

day1 = c(134, 145)
day2 = c(129, 133)

# Compute using vector operations
# (a) Add 2 units on both measurements on day 1.

day1 + 2

# (b) Compute mean of morning measurements and mean of afternoon measurements.

(day1 + day2)/2 #returns two values, 1st = mean of mornings, 2nd = mean of afternoons


#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.4.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#(a) Generate sequence from 0 to 1 by steps of 0.05 using 'seq()'

seq(0, 1, by = 0.05)

#(b) Generate a vector of length 5 that contains value 10 five times using 'rep()'

rep(10, times = 5)

#(c) Generate a decreasing integer sequence from 95 to 85.

95:85
#OR
seq(95, 85, by = -1)


#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# Answers to Test Yourself 1.5.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

# (a) Use 'rep()' to generate vector 'x' of nine elements that
#     repeats value 6.5 four times and value 2.3 five times.
#     Check that your x is as expected.

x = rep(c(6.5, 2.3), c(4, 5))
x 

# (b) Compute arithmetic mean of values in x.

mean(x) #4.166667

# (c) Compute sample variance of values in x.

var(x) #4.9

# (d) Compute sample standard deviation of values in x.

sd(x) #2.213594

# (e) Compute sum of values in x.

sum(x) #37.5

# (f) Compute median of values in x.

median(x) #2.3

# (g) Compute cut-off point below which there are 33% of values in x.

quantile(x, 0.33) #2.3

# (h) Compute minimum of values in x.

min(x) #2.3

# (i) Compute maximum of values in x.

max(x) #6.5

# (j) Compute range of values in x.

range(x)

# (k) Compute 'summary()' of values in x.

summary(x)

# (l) Sort x in increasing order.

sort(x)