In HDS6, we saw that LASSO has the property of setting many of the coefficients to exactly zero at the optimum of the objective function, and hence we said that LASSO does variable selection. A technical reason for this behavior was that LASSO's prior distribution for the coefficients has a sharp peak at zero. In HDS7, we embedded the LASSO model into a fully Bayesian model blasso that allowed us to estimate the full posterior distribution of the coefficients, but that was computationally much heavier to apply than the standard LASSO. Next, we will consider conceptually simpler Bayesian models for variable selection, where the prior distribution of a coefficient states that the coefficient is non-zero with probability \(\pi_1\) and is exactly zero with the remaining probability mass \(1-\pi_1\).
Under the spike-and-slab prior (SSP), we model an individual regression coefficient \(\beta_j\) using a two-component mixture distribution \[\beta_j\,|\, \pi_1, \tau^2 \sim (1-\pi_1) \delta_0 + \pi_1 \mathcal{N}(0, \tau^2),\] where \(\delta_0\) is the point mass at 0 (known in mathematics as the Dirac delta function). Another way to describe this two-component mixture distribution is to say that \[ \begin{cases} \beta_j = 0, & \text{with probability } 1-\pi_1,\\ \beta_j \sim \mathcal{N}(0, \tau^2), & \text{with probability } \pi_1. \end{cases} \]
Thus, if we wanted to generate example data from such a mixture distribution, we could proceed in two steps. First, we would choose which component to sample from (component 1 with probability \(\pi_1\) and component 0 with probability \(1-\pi_1\)). Second, conditional on the chosen component, we would sample a value for \(\beta_j\) from the distribution specific to that component.
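As a concrete illustration, the following sketch draws from the spike-and-slab mixture using exactly these two steps (the values \(\pi_1 = 0.2\) and \(\tau^2 = 1\) are chosen only for illustration).
set.seed(1)
n.draws = 10000
pi1.ex = 0.2   # illustrative prior probability of a non-zero coefficient
tau2.ex = 1    # illustrative slab variance
# Step 1: choose the component for each draw (1 = slab, 0 = spike)
z = rbinom(n.draws, size = 1, prob = pi1.ex)
# Step 2: sample from the chosen component: exactly 0 for the spike, N(0, tau2) for the slab
b.ex = ifelse(z == 1, rnorm(n.draws, 0, sqrt(tau2.ex)), 0)
mean(b.ex == 0) # proportion of exact zeros, close to 1 - pi1.ex = 0.8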
The figure visualizes the two-component spike-and-slab distribution. The spike component sits at exactly 0 and carries a non-zero probability mass, while the slab component is a Gaussian distribution centered at 0 with SD > 0. The figure shows the probability mass (black) for the spike and the density function (blue) for the slab.
Suppose we have \(n\) samples and
\(p\) predictors (columns of \(\pmb X\)) to model the outcome vector \(\pmb y\). Let \(\pmb \gamma = (\gamma_j)_{j=1}^p\) be a
vector of binary indicators showing which coefficients are non-zero,
that is, \[\gamma_j =
\begin{cases}
1, & \text{if } \beta_j \neq 0,\\
0, & \text{if } \beta_j = 0.
\end{cases}
\] We can assign to each predictor, independently, an SSP(\(\pi_1,\tau^2\)) prior where the values
\(\pi_1\) and \(\tau^2\) are shared between the predictors.
The value of \(\pi_1\) could be chosen based on the proportion of predictors we expect to be non-zero, and
\(\tau^2\) based on the magnitude of coefficient values we expect for the non-zero
coefficients. Such a model is implemented in the BoomSpikeSlab
package. (Later we will discuss extensions that also estimate \(\pi_1\) and \(\tau^2\) from the data.)
Let’s try it on the same data that we analyzed with the Bayesian LASSO in HDS7: \(n=250\), \(p=30\), the first 3 variables have an effect size of 0.229 and the remaining 27 coefficients are exactly 0. We use as prior parameters \(\pi_1 = 5/30 \approx 0.167\) (by setting expected.model.size = 5) and \(\tau^2=1\) (by setting prior.beta.sd = rep(1,p)).
set.seed(122)
p = 30
n = 250
phi = 0.05 # variance explained by x_1, should be 0 < phi < 1.
b = rep(c(sqrt(phi / (1-phi)), 0), c(3, p-3)) # effects 1,2,3 are non-zero, see Lect. 0.1 for "phi"
X = scale(matrix(rnorm(n*p), nrow = n)) # cols 1,2,3 of X have effects, other cols are noise
eps = scale(rnorm(n, 0, 1)) # epsilon, error term
y = scale(X%*%b + eps, scale = FALSE) # makes y have mean = 0
library(BoomSpikeSlab)
prior = IndependentSpikeSlabPrior(X, y,
expected.model.size = 5,
prior.beta.sd = rep(1,p))
lm.ss = lm.spike(y ~ X - 1, niter = 1000, prior = prior)
summary(lm.ss)
## coefficients:
## mean sd mean.inc sd.inc inc.prob
## X3 3.18e-01 0.06410 0.31800 0.0641 1.000
## X1 2.62e-01 0.07150 0.26600 0.0626 0.982
## X2 2.41e-01 0.07780 0.25000 0.0621 0.961
## X6 1.02e-02 0.03780 0.12700 0.0550 0.080
## X27 -4.16e-03 0.02180 -0.08860 0.0520 0.047
## X23 -4.18e-03 0.02320 -0.10200 0.0570 0.041
## X15 -2.88e-03 0.01980 -0.08230 0.0694 0.035
## X11 3.33e-03 0.02030 0.09520 0.0559 0.035
## X12 -1.88e-03 0.01460 -0.07240 0.0569 0.026
## X9 -1.34e-03 0.01330 -0.06710 0.0686 0.020
## X24 -7.82e-04 0.01000 -0.04110 0.0617 0.019
## X20 5.57e-04 0.00831 0.02930 0.0542 0.019
## X17 -4.74e-04 0.01130 -0.02630 0.0820 0.018
## X13 -6.38e-04 0.01180 -0.03540 0.0832 0.018
## X10 -9.30e-04 0.01250 -0.05470 0.0819 0.017
## X22 5.21e-04 0.00831 0.03260 0.0590 0.016
## X30 -6.63e-04 0.00957 -0.04420 0.0668 0.015
## X25 6.91e-04 0.00903 0.04610 0.0598 0.015
## X21 6.10e-05 0.00822 0.00406 0.0693 0.015
## X16 7.82e-04 0.00998 0.05210 0.0651 0.015
## X8 -1.68e-04 0.00723 -0.01120 0.0600 0.015
## X5 -6.56e-04 0.00799 -0.04370 0.0504 0.015
## X26 3.69e-04 0.00621 0.02840 0.0485 0.013
## X28 3.50e-04 0.00996 0.02910 0.0900 0.012
## X29 1.50e-04 0.00625 0.01370 0.0608 0.011
## X7 -6.35e-04 0.00791 -0.05780 0.0511 0.011
## X19 8.87e-05 0.00658 0.00887 0.0687 0.010
## X14 -1.08e-04 0.00462 -0.01080 0.0474 0.010
## X4 2.90e-05 0.00575 0.00290 0.0605 0.010
## X18 3.59e-04 0.00631 0.03990 0.0566 0.009
##
## residual.sd =
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8630 0.9628 0.9931 0.9937 1.0237 1.1721
##
## r-square =
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1100 0.1532 0.2031 0.2005 0.2511 0.3982
We see that the 3 variables with true non-zero effects have posterior inclusion probabilities (PIPs) > 0.95, whereas all others have PIPs < 0.10. The results contain two sets of posterior means and standard deviations: the first (mean, sd) is from the full posterior distribution and the second (mean.inc, sd.inc) is conditional on the variable being included in the model. When a variable has a large PIP, the two are similar, but for small PIPs they differ, and the full, unconditional mean is a shrunken version of the conditional mean, where the shrinkage factor is the PIP. For example, for variable X6, the posterior mean is 0.1269525 conditional on the variable being non-zero, and X6 is non-zero with a posterior probability of 0.08. Thus, the (unconditional) posterior mean is
0.08 \(\cdot\) 0.1269525 + 0.92 \(\cdot\) 0 = 0.0101562.
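We can also check this shrinkage relation directly from the MCMC output (a quick sketch, assuming that lm.spike() stores the coefficient draws in lm.ss$beta with one named column per predictor; the numbers may differ slightly from the summary above if burn-in draws are handled differently).
b.x6 = lm.ss$beta[, "X6"]  # posterior draws of the coefficient of X6
mean(b.x6 != 0)            # PIP of X6
mean(b.x6[b.x6 != 0])      # posterior mean conditional on inclusion
mean(b.x6)                 # unconditional mean = PIP * conditional mean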
Note that, in a similar way, one could extend the LASSO method to relaxed LASSO, where one first runs LASSO to choose the non-zero variables (corresponding to the unconditional analysis) and then fits an unpenalized model using only the chosen variables (corresponding to the conditional analysis). Statistical inference for such a stepwise procedure is, however, complicated, whereas, in the Bayesian approach, the posterior distribution is directly available and forms a natural basis for inference.
Let’s visualize the PIPs and the posterior distributions of the coefficients.
par(mfrow = c(1, 2))
plot(lm.ss) # plots PIPs
plot(lm.ss, "coefficients") # plots coeffs
The posterior distribution of the model size can also be plotted.
plot(lm.ss, "size", freq = F)
To make the model fully Bayesian, we would like to also estimate the parameters \(\pi_1\) and \(\tau^2\) from the data. For that, we will consider \(\tau^2\) as being specified relative to the error variance \(\sigma^2\), i.e., we use the prior distribution \[\beta_j\,|\, \pi_1, \tau^2, \sigma^2 \sim (1-\pi_1) \delta_0 + \pi_1 \mathcal{N}(0, \sigma^2 \tau^2).\] For example, the parameters \(\pi_1\) and \(\tau^2\) can be given the priors \(\pi_1 \sim \text{Beta}(b_0,b_1)\) and \(\tau^2 \sim \text{Inverse-Gamma}(1/2,s^2/2)\), where the hyper-parameters \(b_0, b_1\) and \(s\) are given fixed values such that the priors cover all reasonable values of the parameters in a fairly uniform manner. Additionally, a flat prior on \(\sigma^2\) could be an \(\text{Inverse-Gamma}(a,a)\) where \(a\) is a small positive value, such as 0.01, as such a distribution approximates a uniform prior on the \(\log(\sigma^2)\) scale (the so-called Jeffreys prior for a variance parameter). By using such “flat priors” we hope to obtain a posterior distribution that is guided by information from the data while the effect of the prior distribution on the posterior remains small.
In the following analysis, we will set the hyper-parameters to \(b_0 = b_1 = 1\), \(s = 0.5\) and \(a=0.01\), which lead to the following prior densities on \(\pi_1\) and \(\tau^2\).
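As a rough sketch, these two prior densities could be drawn as follows (the inverse-gamma density is written out by hand here, since it is not available in base R).
dinvgamma = function(x, shape, scale) {scale^shape / gamma(shape) * x^(-shape - 1) * exp(-scale / x)}
par(mfrow = c(1, 2))
curve(dbeta(x, 1, 1), from = 0, to = 1, lwd = 2, col = "red",
      xlab = expression(pi[1]), ylab = "density", main = "Prior of pi1")
curve(dinvgamma(x, shape = 1/2, scale = 0.5^2 / 2), from = 0.005, to = 1, lwd = 2, col = "orange",
      xlab = expression(tau^2), ylab = "density", main = "Prior of tau2")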
The following Gibbs sampler code implementing this model is adapted from Fabian Dablander: https://fabiandablander.com/r/Spike-and-Slab.html
#' Spike-and-Slab Regression using Gibbs Sampling for p > 1 predictors
#'
#' @param y: vector of responses
#' @param X: matrix of predictor values
#' @param a1: shape parameter of the Inverse-Gamma(a1, a2) prior on the error variance sigma2
#' @param a2: scale parameter of the Inverse-Gamma(a1, a2) prior on the error variance sigma2
#' @param pi1: initial value of the mixture weight pi1
#' @param a: 1st parameter of the Beta prior for pi1
#' @param b: 2nd parameter of the Beta prior for pi1
#' @param s: scale parameter of the Inverse-Gamma(1/2, s^2/2) prior on tau2
#' @param nr_samples: number of posterior samples drawn
#' @param nr_burnin: number of initial samples discarded as burn-in
#'
#' @returns matrix of posterior samples of the parameters gamma, beta, sigma2, tau2, pi1
ss_regress <- function(
y, X, a1 = .01, a2 = .01, pi1 = .5,
a = 1, b = 1, s = 1/2, nr_samples = 6000, nr_burnin = round(nr_samples / 4, 2)
) {
p <- ncol(X)
n <- nrow(X)
# res is where we store the posterior samples
res <- matrix(NA, nrow = nr_samples, ncol = 2*p + 1 + 1 + 1)
colnames(res) <- c(
paste0('gamma', seq(p)),
paste0('beta', seq(p)),
'sigma2', 'tau2', 'pi1'
)
# take the MLE estimate as the values for the first sample
m <- lm(y ~ X - 1)
res[1, ] <- c(rep(0, p), coef(m), var(predict(m) - y), 1, pi1) # gammas at 0, betas at MLE, sigma2 at residual variance, tau2 at 1, pi1 at its initial value
# compute only once
XtX <- t(X) %*% X
Xty <- t(X) %*% y
# we start running the Gibbs sampler
for (i in seq(2, nr_samples)) {
# first, get all the values of the previous time point
gamma_prev <- res[i-1, seq(p)]
beta_prev <- res[i-1, seq(p + 1, 2*p)]
sigma2_prev <- res[i-1, ncol(res) - 2]
tau2_prev <- res[i-1, ncol(res) - 1]
pi1_prev <- res[i-1, ncol(res)]
## Start sampling from the conditional posterior distributions
##############################################################
# sample pi1 from a Beta
pi1_new <- rbeta(1, a + sum(gamma_prev), b + sum(1 - gamma_prev))
# sample sigma2e from an Inverse-Gamma
err <- y - X %*% beta_prev
sigma2_new <- 1 / rgamma(1, a1 + n/2, a2 + t(err) %*% err / 2)
# sample tau2 from an Inverse Gamma
tau2_new <- 1 / rgamma(
1, 1/2 + 1/2 * sum(gamma_prev),
s^2/2 + t(beta_prev) %*% beta_prev / (2*sigma2_new)
)
# sample beta from multivariate Gaussian
beta_cov <- qr.solve((1/sigma2_new) * XtX + diag(1/(tau2_new*sigma2_new), p))
beta_mean <- beta_cov %*% Xty * (1/sigma2_new)
beta_new <- mvtnorm::rmvnorm(1, beta_mean, beta_cov)
# sample each gamma_j in random order
for (j in sample(seq(p))) {
# get the betas for which beta_j is zero
gamma0 <- gamma_prev
gamma0[j] <- 0
bp0 <- t(beta_new * gamma0)
# compute the z variables and the conditional variance
xj <- X[, j]
z <- y - X %*% bp0
cond_var <- sum(xj^2) + 1/tau2_new
# compute chance parameter of the conditional posterior of gamma_j (Bernoulli)
l0 <- log(1 - pi1_new)
l1 <- (
log(pi1_new) - .5 * log(tau2_new*sigma2_new) +
sum(xj*z)^2 / (2*sigma2_new*cond_var) + .5 * log(sigma2_new / cond_var)
)
# sample gamma_j from a Bernoulli
gamma_prev[j] <- rbinom(1, 1, exp(l1) / (exp(l1) + exp(l0)))
}
gamma_new <- gamma_prev
# add new samples
res[i, ] <- c(gamma_new, beta_new*gamma_new, sigma2_new, tau2_new, pi1_new)
}
# remove the first nr_burnin number of samples
res[-seq(nr_burnin), ]
}
Let’s fit the model using this Gibbs sampler and visualize the posteriors of \(\pi_1\) and \(\tau^2\).
set.seed(18)
ss = ss_regress(
y, X, a1 = .01, a2 = .01, pi1 = .5,
a = 1, b = 1, s = 1/2, nr_samples = 5000
)
par(mfrow = c(1,2))
plot(density(ss[,"pi1"]), xlab = expression(pi[1]), lwd = 2,
main = "Posterior of pi1", col = "red")
plot(density(ss[,"tau2"]), xlab = expression(tau^2), lwd = 2,
main = "Posterior of tau2", col = "orange", xlim = c(0,1))
summary(ss[,"pi1"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.008434 0.126940 0.184983 0.199423 0.258069 0.688194
summary(ss[,"tau2"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01785 0.05738 0.09008 0.14707 0.15038 14.88438
The posterior of \(\pi_1\) is quite wide and suggests that \(\pi_1\) is somewhere between 0.05 and 0.4. For \(\tau^2\), the bulk of the posterior is below 0.20.
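We can back up these visual impressions with, for example, 95% credible intervals computed from the posterior samples (a quick sketch).
quantile(ss[, "pi1"], probs = c(0.025, 0.975))   # 95% credible interval for pi1
quantile(ss[, "tau2"], probs = c(0.025, 0.975))  # 95% credible interval for tau2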
We can also see the PIPs and the posterior distributions of the coefficients.
par(mfrow = c(1,2))
barplot(colMeans(ss[,1:p]), names.arg = 1:30,
col = "limegreen", horiz = TRUE, xlab = "PIP")
library(vioplot)
vioplot( ss[,(p+1):(2*p)], names = 1:30,
col = "blue", horizontal = TRUE, xlab = expression(beta[j]))
abline(v = 0, lty = 2)
Unfortunately, these implementations of Bayesian variable selection with the spike-and-slab prior via a Gibbs sampler do not scale to large data sets. The first problem is that, at each iteration of the Gibbs sampler, a decomposition of a \(p\times p\) matrix related to \(\pmb X^T \pmb X\) is carried out. The second problem is that, at each iteration, every predictor is updated even when the model might be very sparse, i.e., even when only a few coefficients are non-zero. These operations become too demanding when \(p\) grows truly large.
Let’s next discuss some recent work that has made it possible to apply these kinds of models to very large \(p\).
Let’s continue with the assumption that each predictor has an independent SSP(\(\pi_1,\tau^2\)) prior. It has been shown by Benner et al. that the Bayes factor (BF) comparing a model whose non-zero coefficients are specified by a configuration \(\pmb \gamma\) with the null model \(\pmb \gamma_0 = (0,\ldots,0)\), in which all coefficients are zero, can be computed using data on only the non-zero variables (details in the slides). This allows computing BF(\(\pmb \gamma : \pmb \gamma_0\)) between \(\pmb \gamma\) and \(\pmb \gamma_0\) in \(\mathcal{O}(k^3)\) operations, where \(k = \sum_{j=1}^p\gamma_j\) is the number of non-zero coefficients in \(\pmb \gamma\), compared to the \(\mathcal{O}(p^3)\) operations required when the computation is done in the standard way. When we consider sparse models where \(k \sim 10\) and \(p \sim 10^5\), this can make all the difference.
Note that under the SSP(\(\pi_1,\tau^2\)) prior, the prior probability of a configuration \(\pmb \gamma\) is \(\pi_1^{k_\gamma}(1-\pi_1)^{p-k_\gamma}\), which depends only on the number of non-zero coefficients \(k_\gamma\). Thus, by knowing the BFs for all possible configurations, we can compute the (unnormalized) posterior probability of each configuration as \[\text{Pr}(\pmb \gamma\, | \, \text{Data}) \propto \text{BF}(\pmb \gamma : \pmb \gamma_0)\, \pi_1^{k_\gamma}(1-\pi_1)^{p-k_\gamma}.\]
By normalizing these quantities with respect to their sum over \(\pmb \gamma\), we obtain the posterior probability of every configuration.
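As an illustration of the \(\mathcal{O}(k^3)\) claim, here is a minimal sketch of how such a BF and the unnormalized posterior of a configuration could be computed. It relies on assumptions not spelled out above (zero intercept, the slab variance \(\tau^2\) taken as absolute rather than relative to \(\sigma^2\), and \(\sigma^2\) treated as known), so that \(\pmb y \,|\, \pmb \gamma \sim \mathcal{N}(\pmb 0,\, \sigma^2 \pmb I + \tau^2 \pmb X_\gamma \pmb X_\gamma^T)\); the determinant lemma and the Woodbury identity then reduce the work to a \(k \times k\) matrix.
log.bf.conf = function(idx, X, y, sigma2 = 1, tau2 = 1) {
  # log BF(gamma : gamma_0) using only the selected columns (idx) of X
  if (length(idx) == 0) return(0)                          # the null model itself
  Xg = X[, idx, drop = FALSE]
  A = diag(length(idx)) + (tau2 / sigma2) * crossprod(Xg)  # k x k matrix
  q = crossprod(Xg, y)                                     # X_g^T y, a k-vector
  as.numeric(-0.5 * determinant(A)$modulus +
              0.5 * (tau2 / sigma2^2) * t(q) %*% solve(A, q))
}
log.post.conf = function(idx, X, y, pi1, sigma2 = 1, tau2 = 1) {
  # unnormalized log posterior of a configuration under the SSP(pi1, tau2) prior
  k = length(idx)
  log.bf.conf(idx, X, y, sigma2, tau2) + k * log(pi1) + (ncol(X) - k) * log(1 - pi1)
}
# Example: the true configuration {1,2,3} vs an equally large wrong one {4,5,6}
log.post.conf(1:3, X, y, pi1 = 5/30) - log.post.conf(4:6, X, y, pi1 = 5/30)
Note that only \(k\)-dimensional objects appear inside log.bf.conf, which is what makes the evaluation of a single sparse configuration cheap.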
However, if we allowed, say, 10 non-zero variables out of \(10^5\) candidates, we would still have \({10^5 \choose 10} \approx 10^{43}\) configurations to evaluate, and that is far too many even when we can do every one of them quickly. For this reason, stochastic search algorithms have been introduced to approximate the posterior by finding the most relevant configurations from the whole space of configurations and then normalizing their un-normalized posteriors with respect to the relevant subset of configurations. (See slides for an example.)
Another recent approach to the Bayesian variable selection model uses the ‘single-effect regression’ (SER) model as a building block (Wang et al.). The SER model is a multiple-regression model in which exactly one of the \(p\) predictors has a non-zero regression coefficient. The idea is to combine some number \(K\) of these SERs to get a model with \(K\) non-zero predictors, or \(K\) effects; this is the sum of single effects (SUSIE) model. The key property is that fitting each SER model conditional on the other SER models is very quick.
Let’s assume that the number of non-zero effects \(K\), the error variance \(\sigma^2\) and the prior variance on the non-zero effect \(\tau^2\) are fixed. For each non-zero effect \(k=1,\ldots,K\), define \(\pmb \gamma^{(k)} \in \{0,1\}^p\) as the indicator vector that shows which is the \(k\)th non-zero coefficient, that is, \(\sum_{j=1}^p \gamma^{(k)}_j=1\). The model is \[\begin{eqnarray*} \pmb y &\sim& \mathcal{N}(\pmb X \pmb \beta,\sigma^2 \pmb I)\\ \pmb \beta &=& \sum_{k=1}^K \pmb \beta^{(k)} \\ \pmb \beta^{(k)} &=& \beta^{(k)}\pmb \gamma^{(k)}, \text{ for each }k=1,\ldots,K,\\ \beta^{(k)} &\sim& \mathcal{N}(0, \tau^2), \text{ for each }k=1,\ldots,K,\\ \pmb \gamma^{(k)} &\sim& \text{Multinomial}(1,(1/p,\ldots,1/p)), \text{ for each }k=1,\ldots,K. \end{eqnarray*}\] The last line means that exactly one of the elements of \(\pmb \gamma^{(k)}\) is non-zero and, a priori, it is equally likely to be any one of the \(p\) elements.
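To make the last two lines of the model concrete, here is a small sketch (with illustration-only values p.toy, K.toy and tau2.toy) of generating a coefficient vector \(\pmb \beta\) from this prior.
set.seed(3)
p.toy = 10; K.toy = 3; tau2.toy = 1
# each column of g is one gamma^(k): a single 1 placed uniformly at random among p.toy zeros
g = rmultinom(K.toy, size = 1, prob = rep(1/p.toy, p.toy))
# effect size beta^(k) of each single effect
beta.k = rnorm(K.toy, 0, sqrt(tau2.toy))
# beta is the sum of the K single-effect vectors; it has at most K.toy non-zero entries
beta.toy = as.vector(g %*% beta.k)
beta.toy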
Note that the model does not rule out the possibility that \(\pmb \gamma^{(k)}=\pmb \gamma^{(\ell)}\) for some \(k \neq \ell\), even though conceptually it would be natural to avoid such overlaps. However, by allowing the possibility of an overlap, we can treat each effect independently, which leads to much simpler computations. Additionally, in practice, we do not observe overlapping effects in the posterior distribution estimated by the following algorithm.
A key motivation for the SUSIE model is that, given \((\pmb \beta^{(\ell)})_{\ell \neq k}\), fitting \(\pmb \beta^{(k)}\) involves only fitting an SER model, whose posterior distribution on the indicator \(\pmb \gamma^{(k)}\) and the parameter \(\beta^{(k)}\) is simple to compute. This suggests an iterative algorithm (iterative Bayesian stepwise selection, IBSS) to fit the model. Below we apply it to our existing data set, allowing \(K=5\) effects and setting the prior variance \(\tau^2 = 1\).
p = ncol(X)
n = nrow(X)
K = 5
tau2 = 1
xx = colSums(X^2) # = diagonal of t(X) %*% X. Will be needed later.
b = matrix(0, nrow = K, ncol = p) # initialize each effect beta_k = 0
res.b = res.pip = res.v = b # initialize results to 0
diff.in.post = rep(0, K) # max change per effect between iterations, used to check convergence
converged = FALSE
tol = 1e-6 # stop when difference between iterations < tol
iter = 0
while(!converged){
iter = iter + 1
for (k in 1:K){ # Fitting SER model for effect k
r = as.vector(y - rowSums(X%*%t(b[-k,]))) # residuals without effect k
# Use Exercise 1.5 results here to get MLE of linear model
b.k = t(X) %*% r / xx # univariate OLS estimate
B = matrix(b.k, byrow = T, nrow = n, ncol = p)
M = ( X*B - r )^2
v.k = colSums(M) / xx / (n-1) # MLE for variance of b.k
# Combine MLEs with prior distributions to get posteriors
post.v.k = 1/(1/v.k + 1/tau2)
post.b.k = post.v.k/v.k * b.k
# Conditional posterior for beta_k is N(post.b.k, post.v.k)
# conditional on the effect being at that position
# PIPs for the effect being at position 1,...,p:
log.bf = 0.5*log(v.k/(v.k + tau2)) + 0.5*b.k^2/v.k*tau2/(tau2 + v.k)
tmp = exp(log.bf - max(log.bf))
post.ip = tmp / sum(tmp)
b[k,] = post.b.k*post.ip # save new effect k estimate for next iteration
# compare to previous iteration to observe convergence:
diff.in.post[k] = max(abs(b[k,] - res.b[k,]*res.pip[k,]))
# save results
res.b[k,] = post.b.k
res.pip[k,] = post.ip
res.v[k,] = post.v.k
}
converged = (all(diff.in.post < tol))
}
print(paste("Ended in", iter, "iterations"))
## [1] "Ended in 8 iterations"
Let’s examine the output of this algorithm. We allowed 5 non-zero effects. Let’s see the distributions of PIPs for each effect \(k = 1,\ldots,5\).
par(mfrow = c(1,5))
for(ii in 1:K){
barplot(res.pip[ii,], names.arg = 1:p, horiz = TRUE, xlim = c(0,1),
main = paste("Effect",ii), col = ii, xlab = "PIP")
}
We see that for the first 3 effects there is a clear difference between
predictors in their PIPs. For the last two effects, none of the predictors has
a large PIP, and therefore we may want to ignore those effects. In other
words, there are likely only 3 clear non-zero effects in these data, and the
remaining effects, if included in the SUSIE model, would not provide any
information about which of the predictors has a non-zero effect.
Note, however, that while this gives a reasonable ad hoc criterion for
determining the number of effects, the SUSIE model cannot make a
probabilistic assessment of the number of non-zero effects, whereas the
SSP model directly gave us the posterior distribution of the number of
non-zero coefficients.
We can next estimate the coefficients by summing over the \(K\) SER models. Let’s denote the posterior mean and variance of the SER model \(k\), conditional on the effect being at predictor \(j\), by \(\mu_j^{(k)}\) and \(\eta_j^{(k)}\), respectively, and the posterior probability that effect \(k\) is at predictor \(j\) by \(\pi_j^{(k)}\). Then, treating the \(K\) SER posteriors as independent, we can compute the mean and variance for each coefficient (marginalized over all \(K\) SER models) as follows. \[\begin{eqnarray*} \text{E}(\beta_j) &=& \text{E}\left(\sum_{k=1}^K \gamma^{(k)}_j \beta_j^{(k)}\right) = \sum_{k=1}^K \pi_j^{(k)} \mu_j^{(k)}\\ \text{Var}(\beta_j) &=& \text{Var}\left(\sum_{k=1}^K \gamma^{(k)}_j \beta_j^{(k)}\right) = \sum_{k=1}^K \text{Var} \left( \gamma^{(k)}_j \beta_j^{(k)}\right) \\ &=& \sum_{k=1}^K\left( \text{E}\left({\gamma^{(k)}_j}^2 {\beta_j^{(k)}}^2\right) - \text{E}\left(\gamma^{(k)}_j \beta_j^{(k)}\right)^2\right) \\ &=& \sum_{k=1}^K\left( \pi_j^{(k)} \left(\eta_j^{(k)} + {\mu_j^{(k)}}^2\right) - {\pi_j^{(k)}}^2 {\mu_j^{(k)}}^2\right)\\ &=& \sum_{k=1}^K \left( \pi_j^{(k)} \eta_j^{(k)} + \pi_j^{(k)}(1-\pi_j^{(k)}){\mu_j^{(k)}}^2\right) \end{eqnarray*}\]
Let’s find the posterior means and variances and then show the 95% credible intervals, assuming that the posterior distribution of each coefficient is Gaussian.
b.mean = colSums(res.pip * res.b)
b.var = colSums(res.pip * res.v + res.pip * (1 - res.pip) * res.b^2)
plot(NULL, xlim = c(0, p + 1), ylim = c(-0.1, 0.5),
ylab = expression(beta), xlab = "")
points(1:p, b.mean, pch = 19)
arrows(1:p, b.mean - 1.96 * sqrt(b.var), 1:p, b.mean + 1.96 * sqrt(b.var),
code = 3, length = 0)
abline(h = 0)
A nice property of the SUSIE model is that it allows a straightforward computation of credible sets for the non-zero variables.
Definition. A level \(\rho\) credible set (CS) is a subset of variables that has a probability \(\geq \rho\) of containing at least one effect variable, i.e., a variable with non-zero regression coefficient.
For each effect in the SUSIE model, we can compute a credible set by sorting the PIPs in decreasing order and including in the set the smallest possible number of variables whose PIPs sum to a value \(\geq \rho\).
In our current example, the 95% CSs are below.
rho = 0.95
cs = list()
for(ii in 1:K){
ord = order(res.pip[ii,], decreasing = T)
cs[[ii]] = ord[1:min(which(cumsum(res.pip[ii,ord]) >= rho))]
}
cs
## [[1]]
## [1] 3
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 2
##
## [[4]]
## [1] 6 23 11 27 15 9 12 20 24 10 30 7 5 18 17 25 29 16 14 26 22 13 19 8 1
## [26] 4 28 3
##
## [[5]]
## [1] 6 23 11 27 15 9 12 20 24 10 30 7 5 18 17 25 29 16 14 26 22 13 19 8 1
## [26] 4 28 3
For the first three effects, the CSs only contain one variable each, while the CSs for the last two effects contain almost all variables. This again shows that the first three effects are clearly localized each to one variable whereas the two additional effects are not informative and we might want to just ignore them.
Note that the posterior distribution provided by the SSP model does not directly contain information by which we could build credible sets for the variables. Credible sets are useful when one needs to list the possible candidate variables that could replace each other in the model.
Question. Assume that there are three highly correlated variables, of which one truly affects the outcome, while no other variables are associated with the outcome. What kind of PIPs would SUSIE give and what would the credible sets look like?