--- title: "HDS 8. Bayesian variable selection" author: "Matti Pirinen, University of Helsinki" date: "27.11.2021" output: html_document: default --- {r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) set.seed(15)  In HDS6, we saw that LASSO has a property of setting many of the coefficients to exactly zero at the optimum of the objective function and hence we said that LASSO does *variable selection*. A technical reason for this behavior was the LASSO's prior distribution for coefficients that had a sharp peak near zero. In HDS7, we embedded the LASSO model into a fully Bayesian model blasso that allowed us to estimate the full posterior distribution of the coefficients but that was computationally more heavy to apply. Next we will consider conceptually simpler Bayesian models for variable selection, where the prior distribution of a coefficient states that coefficient is non-zero with probability$\pi_1$and has a positive probability mass$1-\pi_1$of being exactly zero. ## Spike and slab prior (SSP) Under SSP, we model individual coefficient$\beta_j$using a mixture distribution $$\beta_j\,|\, \pi_1, \tau^2 \sim (1-\pi_1) \delta_0 + \pi_1 \mathcal{N}(0, \tau^2)$$ where$\delta_0$is point mass at 0. In other words, $$\begin{cases} \beta_j = 0, & \text{with probability } 1-\pi_1,\\ \beta_j \sim \mathcal{N}(0, \tau^2), & \text{with probability } \pi_1. \end{cases}$$ {r, echo = F} pi1 = 0.5 tau = 1 plot(NULL, xlim = c(-3,3), ylim = c(0,1), main = expression(paste("Spike and slab with ",pi[1],"=0.5 ",tau,"=1")), xlab = expression(beta[j]), ylab = "density / probability") x = seq(-3, 3, 0.01) lines(x, dnorm(x, 0, tau), lwd = 2, col ="blue") arrows(0, 0, 0, pi1, length = 0, lwd = 3) points(0,0, pch = 19, cex=2) legend("topright", col = c("black", "blue"), leg = c("spike", "slab"), pch = c(19,-1), lty = c(0,1), lwd = c(0,2))  Suppose we have$n$samples and$p$predictors (columns of$\pmb X$) to model outcome vector$\pmb y$. 
Let $\pmb \gamma = (\gamma_j)_{j=1}^p$ be a vector of binary indicators showing which variables are non-zero, that is,
$$\gamma_j = \begin{cases} 1, & \text{if } \beta_j \neq 0,\\ 0, & \text{if } \beta_j = 0. \end{cases}$$
We can assign to each predictor, independently, an SSP($\pi_1,\tau^2$) prior where the values $\pi_1$ and $\tau^2$ are shared between the predictors. The value of $\pi_1$ could be determined based on what proportion of predictors we expect to be non-zero, and $\tau^2$ could be determined based on what magnitude of coefficient values we are expecting. Such a model is implemented in the `BoomSpikeSlab` package. (Later we will discuss extensions that estimate $\pi_1$ and $\tau^2$ from data.)

Let's try it on the same data we analyzed with Bayesian LASSO in HDS 7, which had $n=250$ and $p=30$, where the first 3 variables had an effect size of 0.229 while the remaining 27 were exactly 0. We use the prior $\pi_1 = 5/30 \approx 0.167$ (by setting `expected.model.size = 5`) and $\tau^2=1$ (by setting `prior.beta.sd = rep(1,p)`).

```{r, results = FALSE, message = FALSE}
set.seed(122)
p = 30
n = 250
phi = 0.05 #variance explained by x_1, should be 0 < phi < 1.
b = rep(c(sqrt(phi / (1-phi)), 0), c(3, p-3)) #effects 1,2,3 are non-zero, see Lect. 0.1 for "phi"
X = scale(matrix(rnorm(n*p), nrow = n)) #cols 1,2,3 of X have effects, other cols are noise
eps = scale(rnorm(n, 0, 1)) #epsilon, error term
y = scale(X%*%b + eps, scale = FALSE) #makes y have mean = 0
library(BoomSpikeSlab)
prior = IndependentSpikeSlabPrior(X, y, expected.model.size = 5,
                                  prior.beta.sd = rep(1,p))
lm.ss = lm.spike(y ~ X - 1, niter = 1000, prior = prior)
```

```{r}
summary(lm.ss)
```

We see that the 3 variables with true effects have posterior inclusion probabilities (PIP) > 0.90 whereas all the others have PIPs < 0.10. In the results, we have two sets of posterior means and sds: the first from the full posterior distribution and the second conditional on the variable being included in the model.
When a variable has a large PIP, the two are similar, but for small PIPs the two differ, and the full, unconditional mean is a shrunk version of the conditional mean, where the shrinkage factor is the PIP. For example, for variable X6 the posterior mean is `r summary(lm.ss)$coefficients["X6",3]` conditional on the variable being non-zero, and X6 is non-zero with a posterior probability of `r summary(lm.ss)$coefficients["X6",5]`. Thus, the (unconditional) posterior mean is `r summary(lm.ss)$coefficients["X6",5]` $\cdot$ `r summary(lm.ss)$coefficients["X6",3]` + `r 1-summary(lm.ss)$coefficients["X6",5]` $\cdot$ 0 = `r summary(lm.ss)$coefficients["X6",1]`.

Note that in a similar spirit, one could extend the LASSO method to [**relaxed LASSO**](https://glmnet.stanford.edu/articles/relax.html), where one first runs LASSO to choose the non-zero variables (corresponding to the unconditional analysis) and then fits an unpenalized model using only the chosen variables (corresponding to the conditional analysis). Statistical inference for such a stepwise procedure is, however, complicated, whereas in the Bayesian approach the posterior distribution is directly available and a natural basis for inference.

Let's visualize the PIPs and the posterior distributions of the coefficients.

```{r, fig.height = 10}
par(mfrow = c(1,2))
plot(lm.ss) #plots PIPs
plot(lm.ss, "coefficients") #plots coeffs
```

The posterior on the model size can also be plotted.

```{r}
plot(lm.ss, "size", freq = F)
```

### Extension to estimating $\pi_1$ and $\tau^2$

To make the model fully Bayesian, we would like to estimate also the parameters $\pi_1$ and $\tau^2$.
For that, we will consider $\tau^2$ as being specified relative to the error variance $\sigma^2$, i.e., we use the prior
$$\beta_j\,|\, \pi_1, \tau^2, \sigma^2 \sim (1-\pi_1)\, \delta_0 + \pi_1\, \mathcal{N}(0, \sigma^2 \tau^2).$$
For example, these parameters can be given priors $\pi_1 \sim \text{Beta}(b_0,b_1)$ and $\tau^2 \sim \text{Inverse-Gamma}(1/2,s^2/2)$, where the hyper-parameters $b_0, b_1$ and $s$ are given such fixed values that the priors cover all reasonable values of the parameters in a fairly uniform manner. Additionally, a flat prior on $\sigma^2$ could be an $\text{Inverse-Gamma}(a,a)$ where $a$ is a small positive value, such as 0.01, as such a distribution approximates a uniform prior on the $\log(\sigma^2)$ scale (the so-called Jeffreys prior for the variance parameter). By using such "flat priors" we hope to achieve a posterior distribution that has been guided by information from the data, while the effect of the prior distribution on the posterior remains small. In the following analysis, we will use the hyper-parameter values $b_0 = b_1 = 1$, $s = 0.5$ and $a=0.01$ that lead to the following prior densities on $\pi_1$ and $\tau^2$.
```{r, fig.width = 8, fig.height = 4, echo = F}
par(mfrow = c(1,2))
u = seq(0, 1, 0.01)
v = dbeta(u, 1, 1)
plot(u, v, t = "l", main = "Beta(1,1)", xlab = expression(pi[1]), ylab = "density")
u = seq(0, 2, 0.01)
s = 0.5
shape = 0.5
rate = 0.5*s^2
v = rate^shape / gamma(shape) * u^(-1-shape) * exp(-rate/u)
plot(u, v, t = "l", main = "Inv-Gamma(1/2,1/8)", xlab = expression(tau^2), ylab = "density")
```

The following Gibbs sampler code implementing this model is adapted from Fabian Dablander.

```{r, eval = T}
#' Spike-and-Slab Regression using Gibbs Sampling for p > 1 predictors
#'
#' @param y: vector of responses
#' @param X: matrix of predictor values
#' @param a1: shape parameter of the Inverse-Gamma prior on sigma2
#' @param a2: rate parameter of the Inverse-Gamma prior on sigma2
#' @param pi1: initial value of the mixture weight
#' @param a: 1st parameter of the Beta prior for pi1
#' @param b: 2nd parameter of the Beta prior for pi1
#' @param s: scale parameter of the Inverse-Gamma(1/2, s^2/2) prior on tau2
#' @param nr_samples: number of samples drawn
#' @param nr_burnin: number of initial samples we discard ('burn-in samples')
#'
#' @returns matrix of posterior samples of parameters gamma, beta, sigma2, tau2, pi1
ss_regress <- function(
  y, X, a1 = .01, a2 = .01, pi1 = .5,
  a = 1, b = 1, s = 1/2,
  nr_samples = 6000,
  nr_burnin = round(nr_samples / 4)
) {
  p <- ncol(X)
  n <- nrow(X)

  # res is where we store the posterior samples
  res <- matrix(NA, nrow = nr_samples, ncol = 2*p + 1 + 1 + 1)
  colnames(res) <- c(
    paste0('gamma', seq(p)),
    paste0('beta', seq(p)),
    'sigma2', 'tau2', 'pi1'
  )

  # take the MLE estimate as the values for the first sample
  m <- lm(y ~ X - 1)
  res[1, ] <- c(rep(0, p), coef(m), var(predict(m) - y), 1, .5)

  # compute only once
  XtX <- t(X) %*% X
  Xty <- t(X) %*% y

  # we start running the Gibbs sampler
  for (i in seq(2, nr_samples)) {

    # first, get all the values of the previous time point
    gamma_prev <- res[i-1, seq(p)]
    beta_prev <- res[i-1, seq(p + 1, 2*p)]
    sigma2_prev <- res[i-1, ncol(res) - 2]
    tau2_prev <- res[i-1, ncol(res) - 1]
    pi1_prev <- res[i-1, ncol(res)]

    ## Start sampling from the conditional posterior distributions
    ##############################################################

    # sample pi1 from a Beta
    pi1_new <- rbeta(1, a + sum(gamma_prev), b + sum(1 - gamma_prev))

    # sample sigma2 from an Inverse-Gamma
    err <- y - X %*% beta_prev
    sigma2_new <- 1 / rgamma(1, a1 + n/2, a2 + t(err) %*% err / 2)

    # sample tau2 from an Inverse-Gamma
    tau2_new <- 1 / rgamma(
      1, 1/2 + 1/2 * sum(gamma_prev),
      s^2/2 + t(beta_prev) %*% beta_prev / (2*sigma2_new)
    )

    # sample beta from a multivariate Gaussian
    beta_cov <- qr.solve((1/sigma2_new) * XtX + diag(1/(tau2_new*sigma2_new), p))
    beta_mean <- beta_cov %*% Xty * (1/sigma2_new)
    beta_new <- mvtnorm::rmvnorm(1, beta_mean, beta_cov)

    # sample each gamma_j in random order
    for (j in sample(seq(p))) {

      # get the betas for which beta_j is zero
      gamma0 <- gamma_prev
      gamma0[j] <- 0
      bp0 <- t(beta_new * gamma0)

      # compute the z variables and the conditional variance
      xj <- X[, j]
      z <- y - X %*% bp0
      cond_var <- sum(xj^2) + 1/tau2_new

      # compute the chance parameter of the conditional posterior of gamma_j (Bernoulli)
      l0 <- log(1 - pi1_new)
      l1 <- (
        log(pi1_new) - .5 * log(tau2_new*sigma2_new) +
        sum(xj*z)^2 / (2*sigma2_new*cond_var) + .5 * log(sigma2_new / cond_var)
      )

      # sample gamma_j from a Bernoulli
      gamma_prev[j] <- rbinom(1, 1, exp(l1) / (exp(l1) + exp(l0)))
    }
    gamma_new <- gamma_prev

    # add the new samples
    res[i, ] <- c(gamma_new, beta_new*gamma_new, sigma2_new, tau2_new, pi1_new)
  }

  # remove the first nr_burnin samples
  res[-seq(nr_burnin), ]
}
```

Let's fit the model using this Gibbs sampler and look at the posteriors of $\pi_1$ and $\tau^2$.
```{r, fig.width = 8, fig.height = 4}
set.seed(18)
ss = ss_regress(
  y, X, a1 = .01, a2 = .01, pi1 = .5,
  a = 1, b = 1, s = 1/2, nr_samples = 5000
)
par(mfrow = c(1,2))
plot(density(ss[,"pi1"]), xlab = expression(pi[1]), lwd = 2,
     main = "Posterior of pi1", col = "red")
plot(density(ss[,"tau2"]), xlab = expression(tau^2), lwd = 2,
     main = "Posterior of tau2", col = "orange", xlim = c(0,1))
summary(ss[,"pi1"])
summary(ss[,"tau2"])
```

The posterior of $\pi_1$ is quite wide and suggests that $\pi_1$ is somewhere between 0.05 and 0.4. For $\tau^2$, the bulk of the posterior is below 0.20.

We can also look at the PIPs and the coefficient estimates.

```{r, fig.width = 11, fig.height = 9, message = F}
par(mfrow = c(1,2))
barplot(colMeans(ss[,1:p]), names.arg = 1:30, col = "limegreen",
        horiz = TRUE, xlab = "PIP")
library(vioplot)
vioplot(ss[,(p+1):(2*p)], names = 1:30, col = "blue",
        horizontal = TRUE, xlab = expression(beta[j]))
abline(v = 0, lty = 2)
```

Unfortunately, this implementation of Bayesian variable selection with the spike and slab prior through a Gibbs sampler does not scale to large data sets. The first problem is that each iteration of the Gibbs sampler performs a decomposition of a $p\times p$ matrix related to $\pmb X^T \pmb X$. The second problem is that every predictor is updated at each iteration, even when the model might be very sparse, i.e., only a few coefficients are non-zero. These issues become too demanding when $p$ grows large. Let's next discuss some recent work that has made it possible to apply these kinds of models to very large $p$.

## Efficient computation of Bayes factors under SSP

Let's continue with the assumption that each predictor has an independent SSP($\pi_1,\tau^2$) prior.
It has been shown by [Benner et al.](https://www.biorxiv.org/content/10.1101/318618v1.full.pdf) that the Bayes factor (BF) comparing a model where the non-zero coefficients are specified by a configuration $\pmb \gamma$ to the null model $\pmb \gamma_0 = (0,\ldots,0)$, where all coefficients are zero, can be computed using data on only the non-zero variables (details in the slides). This allows computing BF($\pmb \gamma : \pmb \gamma_0$) between $\pmb \gamma$ and $\pmb \gamma_0$ in $\mathcal{O}(k^3)$ operations, where $k = \sum_{j=1}^p\gamma_j$ is the number of non-zero coefficients in $\pmb \gamma$, compared to the $\mathcal{O}(p^3)$ operations required when the computation is done in the standard way. When we consider sparse models where $k \sim 10$ and $p \sim 10^5$, this can make all the difference.

Note that under the SSP($\pi_1,\tau^2$) prior, the prior probability of a configuration $\pmb \gamma$ is $\pi_1^{k_\gamma}(1-\pi_1)^{p-k_\gamma}$, which depends only on $k_\gamma$. Thus, knowing the BFs for all possible configurations, we can compute the (unnormalized) posterior probabilities of the configurations as
$$\text{Pr}(\pmb \gamma\, | \, \text{Data}) \propto \text{BF}(\pmb \gamma : \pmb \gamma_0)\, \pi_1^{k_\gamma}(1-\pi_1)^{p-k_\gamma}.$$
By normalizing these quantities with respect to their sum over $\pmb \gamma$, we obtain the posterior of every configuration. However, if we allowed, say, 10 non-zero variables out of $10^5$ candidates, we would still have ${10^5 \choose 10} \approx 10^{43}$ configurations to evaluate, which is far too many even when every one of them can be computed quickly. For this reason, stochastic search algorithms have been introduced that approximate the posterior by finding the most relevant configurations in the whole space of configurations and then normalizing their unnormalized posteriors with respect to this relevant subset of configurations. (See the slides for an example.)
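For a small $p$, the exhaustive computation over all $2^p$ configurations can be written out directly. The sketch below is an illustration of the idea, not Benner et al.'s implementation: it assumes a known error variance $\sigma^2$ and the prior $\beta_j \sim \mathcal{N}(0,\tau^2)$, and evaluates $\log \text{BF}(\pmb \gamma : \pmb \gamma_0)$ with only $k \times k$ matrices via the matrix determinant lemma and the Woodbury identity. The function name `log.bf.gamma` and the toy data (suffix `.toy`, to avoid overwriting the data used above) are made up for the example.

```{r}
set.seed(17)
# log BF(gamma : gamma_0) using only the k selected columns (k x k matrices),
# assuming beta_j ~ N(0, tau2) for included variables and known error variance sigma2
log.bf.gamma = function(y, X, g, tau2, sigma2) {
  k = sum(g)
  if (k == 0) return(0) # BF of the null model against itself
  Xg = X[, g == 1, drop = FALSE]
  XtX = crossprod(Xg)    # k x k
  Xty = crossprod(Xg, y) # k x 1
  # matrix determinant lemma for the log-determinant, Woodbury for the quadratic form:
  A = diag(1/tau2, k) + XtX/sigma2
  as.numeric(-0.5*determinant(diag(k) + (tau2/sigma2)*XtX)$modulus +
               0.5*crossprod(Xty, solve(A, Xty))/sigma2^2)
}
# exhaustive enumeration on toy data with p = 5 (feasible only for small p)
n.toy = 200; p.toy = 5; pi1.toy = 0.2; tau2.toy = 1; sigma2.toy = 1
X.toy = scale(matrix(rnorm(n.toy*p.toy), n.toy, p.toy))
y.toy = as.vector(X.toy %*% c(0.5, 0.5, 0, 0, 0) + rnorm(n.toy)) # effects at 1 and 2
cfg = as.matrix(expand.grid(rep(list(0:1), p.toy))) # all 2^p configurations
log.post = apply(cfg, 1, function(g)
  log.bf.gamma(y.toy, X.toy, g, tau2.toy, sigma2.toy) +
    sum(g)*log(pi1.toy) + (p.toy - sum(g))*log(1 - pi1.toy))
post = exp(log.post - max(log.post))
post = post/sum(post)  # normalized posterior over configurations
cfg[which.max(post), ] # should pick out variables 1 and 2
```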
## Sum of single effects model (SUSIE)

Another recent approach to the Bayesian variable selection model uses the 'single-effect regression' (SER) model as a building block ([Wang et al.](https://rss.onlinelibrary.wiley.com/doi/10.1111/rssb.12388)). The SER model is a multiple-regression model in which exactly one of the $p$ predictors has a non-zero regression coefficient. The idea is to combine some number $K$ of these SERs together to get a model with $K$ non-zero predictors, or $K$ *effects*. The key is that fitting each SER model conditional on the other SER models is very quick.

#### SUSIE model with $K$ effects

Let's assume that the number of non-zero effects $K$, the error variance $\sigma^2$ and the prior variance of a non-zero effect $\tau^2$ are fixed. For each non-zero effect $k=1,\ldots,K$, define $\pmb \gamma^{(k)} \in \{0,1\}^p$ as the indicator vector showing which is the $k$th non-zero coefficient, that is, $\sum_{j=1}^p \gamma^{(k)}_j=1$. The model is
\begin{eqnarray*}
\pmb y &\sim& \mathcal{N}(\pmb X \pmb \beta, \sigma^2 \pmb I)\\
\pmb \beta &=& \sum_{k=1}^K \pmb \beta^{(k)} \\
\pmb \beta^{(k)} &=& \beta^{(k)}\pmb \gamma^{(k)}, \text{ for each } k=1,\ldots,K,\\
\beta^{(k)} &\sim& \mathcal{N}(0, \tau^2), \text{ for each } k=1,\ldots,K,\\
\pmb \gamma^{(k)} &\sim& \text{Multinomial}(1,(1/p,\ldots,1/p)), \text{ for each } k=1,\ldots,K.
\end{eqnarray*}
The last line means that exactly one of the elements of $\pmb \gamma^{(k)}$ is non-zero and, *a priori*, it is equally likely to be any one of the $p$ elements. Note that the model does not rule out the possibility that $\pmb \gamma^{(k)}=\pmb \gamma^{(\ell)}$ for some $k \neq \ell$, even though conceptually it would be natural to avoid such overlaps. However, by leaving the possibility of an overlap, we can treat each effect independently, and we get much simpler computations. Additionally, in practice, we won't observe overlapping effects in the posterior distribution estimated by the following algorithm.
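Before turning to the full algorithm, it may help to see a single SER fit in isolation. The sketch below assumes a known error variance $\sigma^2$ (whereas the full fitting code below estimates the variance of each univariate estimate from residuals); the function name `ser_fit` and the toy data (suffix `.toy`, so that the data of the running example are not overwritten) are illustrative.

```{r}
set.seed(21)
# Posterior of a single SER model: exactly one predictor has a non-zero effect.
ser_fit = function(y, X, tau2 = 1, sigma2 = 1) {
  xx = colSums(X^2)
  b.hat = as.vector(crossprod(X, y))/xx # univariate OLS estimates
  s2 = sigma2/xx                        # their sampling variances
  # log BF of "effect at j" vs "no effect", for each j:
  log.bf = 0.5*log(s2/(s2 + tau2)) + 0.5*b.hat^2*tau2/(s2*(tau2 + s2))
  alpha = exp(log.bf - max(log.bf))
  alpha = alpha/sum(alpha)     # PIPs under the uniform prior on the position
  post.var = 1/(1/tau2 + 1/s2) # posterior variance given the effect is at j
  list(pip = alpha, mean = post.var/s2*b.hat, var = post.var)
}
X.toy = scale(matrix(rnorm(200*4), 200, 4))
y.toy = as.vector(X.toy[,2]*0.4 + rnorm(200)) # a single effect at variable 2
round(ser_fit(y.toy, X.toy)$pip, 3) # PIP concentrates on variable 2
```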
#### Iterative Bayesian stepwise selection (IBSS)

A key motivation for the SUSIE model is that, given $(\pmb \beta^{(\ell)})_{\ell \neq k}$, fitting $\pmb \beta^{(k)}$ involves only fitting an SER model, whose posterior distribution on the indicator $\pmb \gamma^{(k)}$ and the parameter $\beta^{(k)}$ is simple to compute. This suggests an iterative algorithm (called IBSS) to fit the model. Below we apply it to our existing data set, allowing $K=5$ effects and setting the prior variance $\tau^2 = 1$.

```{r}
p = ncol(X)
n = nrow(X)
K = 5
tau2 = 1
xx = colSums(X^2) # = diagonal of t(X) %*% X. Will be needed later.
b = matrix(0, nrow = K, ncol = p) #initialize each effect beta_k = 0
res.b = res.pip = res.v = b #initialize results to 0
diff.in.post = rep(0, K) #initialize to 0
converged = FALSE
tol = 1e-6 # stop when difference between iterations < tol
iter = 0
while(!converged){
  iter = iter + 1
  for (k in 1:K){ #Fitting SER model for effect k
    r = as.vector(y - rowSums(X %*% t(b[-k,]))) #residuals without effect k
    #Use Exercise 1.5 results here to get MLE of linear model
    b.k = t(X) %*% r / xx #univariate OLS estimate
    B = matrix(b.k, byrow = T, nrow = n, ncol = p)
    M = ( X*B - r )^2
    v.k = colSums(M) / xx / (n-1) #MLE for variance of b.k
    #Combine MLEs with prior distributions to get posteriors
    post.v.k = 1/(1/v.k + 1/tau2)
    post.b.k = post.v.k/v.k * b.k
    #Conditional posterior for beta_k is N(post.b.k, post.v.k)
    # conditional on the effect being at that position
    #PIPs for effect being at position 1,...,p:
    log.bf = 0.5*log(v.k/(v.k + tau2)) + 0.5*b.k^2/v.k*tau2/(tau2 + v.k)
    tmp = exp(log.bf - max(log.bf))
    post.ip = tmp / sum(tmp)
    b[k,] = post.b.k*post.ip #save new effect k estimate for next iteration
    #compare to previous iteration to observe convergence:
    diff.in.post[k] = max(abs(b[k,] - res.b[k,]*res.pip[k,]))
    #save results
    res.b[k,] = post.b.k
    res.pip[k,] = post.ip
    res.v[k,] = post.v.k
  }
  converged = (all(diff.in.post < tol))
}
print(paste("Ended in", iter, "iterations"))
```

Let's examine the output of this algorithm. We allowed 5 non-zero effects. Let's look at the distributions of PIPs for each effect $k = 1,\ldots,5$.

```{r, fig.width = 11, fig.height = 7}
par(mfrow = c(1,5))
for(ii in 1:K){
  barplot(res.pip[ii,], names.arg = 1:p, horiz = TRUE, xlim = c(0,1),
          main = paste("Effect", ii), col = ii, xlab = "PIP")
}
```

We see that for the first 3 effects, there is a clear difference between the predictors in their PIPs. For the last two, none of the coefficients has a large PIP, and therefore we may want to ignore those effects. In other words, there are likely only 3 clear non-zero effects, and the remaining effects, if included in the SUSIE model, wouldn't provide any information about which of the predictors have a non-zero effect. Note, however, that despite this reasonable ad hoc criterion for determining the number of effects, the SUSIE model cannot make any probabilistic assessment of the number of non-zero effects, whereas the SSP model directly gave us the posterior on the number of non-zero coefficients.

We can next estimate the coefficients by summing over the $K$ SER models. Let's denote the posterior mean and variance of SER model $k$, conditional on the effect being at predictor $j$, by $\mu_j^{(k)}$ and $\eta_j^{(k)}$, respectively. Then we can compute the mean and variance of each coefficient (marginalized over all $K$ SER models) as follows.
\begin{eqnarray*}
\text{E}(\beta_j) &=& \text{E}\left(\sum_{k=1}^K \gamma^{(k)}_j \beta_j^{(k)}\right) = \sum_{k=1}^K \pi_j^{(k)} \mu_j^{(k)}\\
\text{Var}(\beta_j) &=& \text{Var}\left(\sum_{k=1}^K \gamma^{(k)}_j \beta_j^{(k)}\right) = \sum_{k=1}^K \text{Var} \left( \gamma^{(k)}_j \beta_j^{(k)}\right), \text{ treating the } K \text{ SER models as independent,} \\
&=& \sum_{k=1}^K\left( \text{E}\left({\gamma^{(k)}_j}^2 {\beta_j^{(k)}}^2\right) - \text{E}\left(\gamma^{(k)}_j \beta_j^{(k)}\right)^2\right) \\
&=& \sum_{k=1}^K\left( \pi_j^{(k)} \left(\eta_j^{(k)} + {\mu_j^{(k)}}^2\right) - {\pi_j^{(k)}}^2 {\mu_j^{(k)}}^2\right)\\
&=& \sum_{k=1}^K \left( \pi_j^{(k)} \eta_j^{(k)} + \pi_j^{(k)}(1-\pi_j^{(k)}){\mu_j^{(k)}}^2\right)
\end{eqnarray*}

Let's compute the posterior means and variances and then show the 95% credible intervals, assuming that the posteriors are Gaussian.

```{r, fig.width = 10}
b.mean = colSums(res.pip*res.b)
b.var = colSums(res.pip*res.v + res.pip*(1-res.pip)*res.b^2)
plot(NULL, xlim = c(0, p+1), ylim = c(-0.1, 0.5), ylab = expression(beta), xlab = "")
points(1:p, b.mean, pch = 19)
arrows(1:p, b.mean - 1.96*sqrt(b.var), 1:p, b.mean + 1.96*sqrt(b.var),
       code = 3, length = 0)
abline(h = 0)
```

#### Credible sets of non-zero variables

A nice property of the SUSIE model is that it allows a straightforward computation of a **credible set** of the non-zero variables.

**Definition.** A level $\rho$ credible set (CS) is a subset of variables that has probability $\geq \rho$ of containing at least one effect variable (i.e. a variable with a non-zero regression coefficient).

For each effect in the SUSIE model, we can compute a credible set by sorting the PIPs in decreasing order and forming the CS by including the smallest possible number of variables whose PIPs sum to a value $\geq \rho$. In our current example, the 95% CSs are below.
```{r}
rho = 0.95
cs = list()
for(ii in 1:K){
  ord = order(res.pip[ii,], decreasing = T)
  cs[[ii]] = ord[1:min(which(cumsum(res.pip[ii, ord]) >= rho))]
}
cs
```

For the first three effects, the CSs contain only one variable each, while the CSs for the last two effects contain almost all variables. This again shows that the first three effects are each clearly localized to one variable, whereas the two additional effects are not informative and we might want to simply ignore them. Note that the posterior distribution provided by the SSP model does not directly contain information from which one could build credible sets for the variables. Credible sets are useful when one needs to list the candidate variables that could replace each other in the model.

**Question.** Assume that there are three highly correlated variables, of which one truly affects the outcome, but no other variables are associated with the outcome. What kind of PIPs would SUSIE give, and what would the credible sets look like?