--- title: "HDS 6.1 Breast cancer metastasis example" author: "Matti Pirinen, University of Helsinki" date: "9.1.2024" output: html_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` Let's look at a study of breast cancer metastasis published by Van't Veer et al. in 2002. Data (15 MB) were downloaded from http://web.as.uky.edu/statistics/users/pbreheny/603/ and can now be found from . In the study, biological samples were obtained from tumors of women with breast cancer. These samples were scanned with a microarray, that measures the expression of 10,000s of genes simultaneously, i.e, how much of each gene product is being produced by the cells in the sample. The patients were followed up to see how long it took for the cancer to metastasize (spread elsewhere, which is bad news). Clinically, the goal is to identify patients with poor prognosis in order to administer more aggressive follow-up treatment for them. Scientifically, knowledge of the genes related to the worse outcomes can help understand the disease better and to help develop some therapeutics in the future. ```{r} set.seed(21) D = read.table("vantveer.txt", header = T) # add your path here dim(D) # rows patients, cols outcome + gene expressions anyNA(D) y = as.numeric(D[,1]) # survival time X = as.matrix(D[,2:ncol(D)]) # gene expression measurements rm(D) n = length(y) p = ncol(X) summary(y) summary(apply(X, 2, mean)) # not mean centered summary(apply(X, 2, sd)) # not standardized ``` A possible filtering step could be to remove the predictors that have small variance, because those are less likely to be informative in statistical sense. Of course, any gene could be important biologically even if its expression varies only a little between individuals but we have less statistical power to find such effects, and therefore, if we needed to filter out predictors, those uninformative genes could be candidates. Our methods are efficient enough so, for now, let's keep all predictors in and standardize the columns. ```{r} X = scale(X) # now mean = 0 and var = 1 for each column y = y - mean(y) # mean-center, but do not scale to keep the interpretation of time units. # we can now ignore the intercept because everything is mean-centered ``` Let's first do a quick version of the ordinary least squares univariately for each gene $j$ using the model $y = x_j \beta_j + \varepsilon$. We do not want to do a for-loop to apply `lm()` for 24,000+ times but instead we use formulas from the Exercise 1.5. After standardization within the sample $\pmb x_j^T \pmb x_j=n-1$ for each column $j$ and thus $$\widehat{\beta}_j =\frac{\pmb x_j^T \pmb y}{\pmb x_j^T \pmb x_j} = \frac{\pmb x_j^T \pmb y}{n-1}.$$ Thus, the vector $\widehat{\pmb \beta}^{(UNI)} = \pmb X^T \pmb y/(n-1)$ has the univariate least squares estimates, and this can be computed by a single matrix-by-vector operation, so it is as quick as it can get. Similarly, we can compute the univariate estimate of $\sigma^2_j$ for each gene as $$\widehat{\sigma^2}_j = \frac{1}{n-2} \left (\pmb y- \pmb x_j \widehat{\beta}_j\right)^T \left(\pmb y- \pmb x_j \widehat{\beta}_j\right)$$ and then we have estimates for the standard errors as $$s_j = \sqrt{\frac{\widehat{\sigma^2}_j}{(n-1)}}.$$ ```{r} beta.uni = as.vector((t(X)%*%y)/(n-1)) sigma2.uni = colSums(( y - t( t(X)*beta.uni) )^2)/(n-2) #this is sigma2 formula above written in R se = sqrt(sigma2.uni/(n-1)) # Now we have fit > 24,000 linear models. 
# Check an arbitrary column against lm() output:
i = 10625
summary(lm(y ~ X[,i]))$coefficients[2, 1:2]
c(beta.uni[i], se[i]) # OK.

pval = pchisq((beta.uni/se)^2, df = 1, lower = F)
qqplot(-log10(ppoints(p)), -log10(pval), pch = 4) # huge deviation from the null
abline(0, 1)
summary(pval)
```

Let's fit a LASSO model to the data.

```{r, warning = FALSE}
library(glmnet)
cv.lasso = cv.glmnet(X, y, alpha = 1) # takes < 5 seconds even with p > 24,000
plot(cv.lasso)
plot(cv.lasso$glmnet.fit, xvar = "lambda")
abline(v = log(cv.lasso$lambda.min), lty = 2)
abline(v = log(cv.lasso$lambda.1se), lty = 2)
```

LASSO seems to choose only 12 predictors at `lambda.min`. Let's look at their statistics.

```{r}
lasso.var = as.matrix(coef(cv.lasso, s = "lambda.min"))
genes = names(which(abs(lasso.var[,1]) > 1e-10)) # names of the chosen genes
lasso.ind = which(abs(lasso.var[,1]) > 1e-10) - 1 # indexes in X, subtracting 1 to account for the intercept
data.frame(gene = genes, beta = beta.uni[lasso.ind], se = se[lasso.ind], pval = pval[lasso.ind])
sum(pval < 1e-7)
```

We see that the P-values of the variables chosen by LASSO are generally small (< 1e-5), but there are also many other genes with small P-values that LASSO has dropped: e.g., 13 genes have a P-value < 1e-7 and only 6 of them appear in the list above. So LASSO is not just about sorting P-values.

Can we also fit ridge regression to this data set with p > 24,000, even though it does not produce a sparse model the way LASSO does?

```{r, eval = T}
cv.ridge = cv.glmnet(X, y, alpha = 0) # takes only ~ 20 seconds even with p > 24,000
plot(cv.ridge)
plot(cv.ridge$glmnet.fit, xvar = "lambda")
abline(v = log(cv.ridge$lambda.min), lty = 2)
abline(v = log(cv.ridge$lambda.1se), lty = 2)
```

The cross-validated MSEs at the minimum seem similar between ridge regression (RR) and LASSO. Note also that $\lambda$ is large ($>\exp(9) \approx 8100$), so RR heavily penalizes the linear model, as it should when $p \gg n$. A big benefit of LASSO is its small number of predictors, here only 12 compared to the 24,188 used by RR. If we can build two prediction models that are almost equally good, and the first has 12 predictors while the second has 24,188, the first is more practical in many ways.

What about trying out an elastic net model that is a compromise between the two?

```{r}
cv.enet = cv.glmnet(X, y, alpha = 0.8) # takes only ~ 5 seconds even with p > 24,000
plot(cv.enet)
plot(cv.enet$glmnet.fit, xvar = "lambda")
abline(v = log(cv.enet$lambda.min), lty = 2)
abline(v = log(cv.enet$lambda.1se), lty = 2)
```

Elastic net has a few more predictors than LASSO but still produces a very sparse model compared to the original $p$. The cross-validated MSEs seem similar between the three methods.
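To back up that last claim, here is a short comparison sketch (not part of the original analysis; it assumes the `cv.lasso`, `cv.ridge` and `cv.enet` objects fitted above are still in the workspace). It tabulates, for each method, the minimum cross-validated MSE and the number of non-zero coefficients at the corresponding $\lambda$.

```{r}
cv.list = list(lasso = cv.lasso, ridge = cv.ridge, enet = cv.enet)
data.frame(
  cvm.min = sapply(cv.list, function(cv) min(cv$cvm)), # CV'd MSE at lambda.min
  nonzero = sapply(cv.list, function(cv) cv$nzero[which.min(cv$cvm)])) # non-zero coefficients at lambda.min
```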