--- title: "HDS 8. Bayesian variable selection" author: "Matti Pirinen, University of Helsinki" date: "27.11.2021" output: html_document: default --- {r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) set.seed(15)  In HDS6, we saw that LASSO has a property of setting many of the coefficients to exactly zero at the optimum of the objective function and hence we said that LASSO does *variable selection*. A technical reason for this behavior was the LASSO's prior distribution for coefficients that had a sharp peak near zero. In HDS7, we embedded the LASSO model into a fully Bayesian model blasso that allowed us to estimate the full posterior distribution of the coefficients but that was computationally more heavy to apply. Next we will consider conceptually simpler Bayesian models for variable selection, where the prior distribution of a coefficient states that coefficient is non-zero with probability$\pi_1$and has a positive probability mass$1-\pi_1$of being exactly zero. ## Spike and slab prior (SSP) Under SSP, we model individual coefficient$\beta_j$using a mixture distribution $$\beta_j\,|\, \pi_1, \tau^2 \sim (1-\pi_1) \delta_0 + \pi_1 \mathcal{N}(0, \tau^2)$$ where$\delta_0$is point mass at 0. In other words, $$\begin{cases} \beta_j = 0, & \text{with probability } 1-\pi_1,\\ \beta_j \sim \mathcal{N}(0, \tau^2), & \text{with probability } \pi_1. \end{cases}$$ {r, echo = F} pi1 = 0.5 tau = 1 plot(NULL, xlim = c(-3,3), ylim = c(0,1), main = expression(paste("Spike and slab with ",pi[1],"=0.5 ",tau,"=1")), xlab = expression(beta[j]), ylab = "density / probability") x = seq(-3, 3, 0.01) lines(x, dnorm(x, 0, tau), lwd = 2, col ="blue") arrows(0, 0, 0, pi1, length = 0, lwd = 3) points(0,0, pch = 19, cex=2) legend("topright", col = c("black", "blue"), leg = c("spike", "slab"), pch = c(19,-1), lty = c(0,1), lwd = c(0,2))  Suppose we have$n$samples and$p$predictors (columns of$\pmb X$) to model outcome vector$\pmb y$. 
Let $\pmb \gamma = (\gamma_j)_{j=1}^p$ be a vector of binary indicators showing which variables are non-zero, that is,
$$\gamma_j = \begin{cases} 1, & \text{if } \beta_j \neq 0,\\ 0, & \text{if } \beta_j = 0. \end{cases}$$
We can assign to each predictor, independently, an SSP($\pi_1,\tau^2$) prior where the values $\pi_1$ and $\tau^2$ are shared between the predictors. The value of $\pi_1$ could be determined based on what proportion of predictors we expect to be non-zero, and $\tau^2$ could be determined based on what magnitude of coefficient values we are expecting. Such a model is implemented in the `BoomSpikeSlab` package. (Later we will discuss extensions that estimate $\pi_1$ and $\tau^2$ from data.)

Let's try it on the same data we analyzed with Bayesian LASSO in HDS 7, which had $n=250$ and $p=30$, where the first 3 variables had an effect size of 0.229 while the remaining 27 were exactly 0. We use the prior $\pi_1 = 5/30 \approx 0.167$ (by setting `expected.model.size = 5`) and $\tau^2=1$ (by setting `prior.beta.sd = rep(1,p)`).

```{r, results = FALSE, message = FALSE}
set.seed(122)
p = 30
n = 250
phi = 0.05 #variance explained by x_1, should be 0 < phi < 1.
b = rep(c(sqrt(phi / (1-phi)), 0), c(3, p-3)) #effects 1,2,3 are non-zero, see Lect. 0.1 for "phi"
X = scale(matrix(rnorm(n*p), nrow = n)) #cols 1,2,3 of X have effects, other cols are noise
eps = scale(rnorm(n, 0, 1)) #epsilon, error term
y = scale(X%*%b + eps, scale = FALSE) #makes y have mean = 0
library(BoomSpikeSlab)
prior = IndependentSpikeSlabPrior(X, y, expected.model.size = 5,
                                  prior.beta.sd = rep(1,p))
lm.ss = lm.spike(y ~ X - 1, niter = 1000, prior = prior)
```

```{r}
summary(lm.ss)
```

We see that the 3 variables with true effects have posterior inclusion probabilities (PIP) > 0.90 whereas all the others have PIPs < 0.10. In the results, we have two sets of posterior means and sds: the first from the full posterior distribution and the second conditional on the variable being included in the model.
When a variable has a large PIP, the two are similar, but for small PIPs the two differ, and the full, unconditional mean is a shrunk version of the conditional mean, where the shrinkage factor is the PIP. For example, for variable X6 the posterior mean is `r summary(lm.ss)$coefficients["X6",3]` conditional on the variable being non-zero, and X6 is non-zero with a posterior probability of `r summary(lm.ss)$coefficients["X6",5]`. Thus, the (unconditional) posterior mean is `r summary(lm.ss)$coefficients["X6",5]` $\cdot$ `r summary(lm.ss)$coefficients["X6",3]` + `r 1-summary(lm.ss)$coefficients["X6",5]` $\cdot$ 0 = `r summary(lm.ss)$coefficients["X6",1]`.

Note that in a similar spirit, one could extend the LASSO method to [**relaxed LASSO**](https://glmnet.stanford.edu/articles/relax.html), where one first runs LASSO to choose the non-zero variables (corresponding to the unconditional analysis) and then fits an unpenalized model using only the chosen variables (corresponding to the conditional analysis). Statistical inference for such a stepwise procedure is, however, complicated, whereas in the Bayesian approach the posterior distribution is directly available and a natural basis for inference.

Let's visualize the PIPs and the posterior distributions of the coefficients.

```{r, fig.height = 10}
par(mfrow = c(1,2))
plot(lm.ss) #plots PIPs
plot(lm.ss, "coefficients") #plots coeffs
```

The posterior on the model size can also be plotted.

```{r}
plot(lm.ss, "size", freq = F)
```

### Extension to estimating $\pi_1$ and $\tau^2$

To make the model fully Bayesian, we would like to estimate also the parameters $\pi_1$ and $\tau^2$.
For that, we will consider $\tau^2$ as being specified relative to the error variance $\sigma^2$, i.e., we use the prior
$$\beta_j\,|\, \pi_1, \tau^2, \sigma^2 \sim (1-\pi_1)\, \delta_0 + \pi_1\, \mathcal{N}(0, \sigma^2 \tau^2).$$
For example, these parameters can be given priors $\pi_1 \sim \text{Beta}(b_0,b_1)$ and $\tau^2 \sim \text{Inverse-Gamma}(1/2,s^2/2)$, where the hyper-parameters $b_0, b_1$ and $s$ are given such fixed values that the priors cover all reasonable values of the parameters in a fairly uniform manner. Additionally, a flat prior on $\sigma^2$ could be an $\text{Inverse-Gamma}(a,a)$ where $a$ is a small positive value, such as 0.01, as such a distribution approximates a uniform prior on the $\log(\sigma^2)$ scale (the so-called Jeffreys prior for the variance parameter). By using such "flat priors" we hope to achieve a posterior distribution that has been guided by information from the data, while the effect of the prior distribution on the posterior remains small. In the following analysis, we will use the hyper-parameter values $b_0 = b_1 = 1$, $s = 0.5$ and $a=0.01$ that lead to the following prior densities on $\pi_1$ and $\tau^2$.
```{r, fig.width = 8, fig.height = 4, echo = F}
par(mfrow = c(1,2))
u = seq(0, 1, 0.01)
v = dbeta(u, 1, 1)
plot(u, v, t = "l", main = "Beta(1,1)", xlab = expression(pi[1]), ylab = "density")
u = seq(0, 2, 0.01)
s = 0.5
shape = 0.5
rate = 0.5*s^2
v = rate^shape / gamma(shape) * u^(-1-shape) * exp(-rate/u)
plot(u, v, t = "l", main = "Inv-Gamma(1/2,1/8)", xlab = expression(tau^2), ylab = "density")
```

The following Gibbs sampler code implementing this model is adapted from Fabian Dablander.

```{r, eval = T}
#' Spike-and-Slab Regression using Gibbs Sampling for p > 1 predictors
#'
#' @param y: vector of responses
#' @param X: matrix of predictor values
#' @param a1: shape parameter of the Inverse-Gamma prior on sigma2
#' @param a2: rate parameter of the Inverse-Gamma prior on sigma2
#' @param pi1: initial value of the mixture weight
#' @param a: 1st parameter of the Beta prior for pi1
#' @param b: 2nd parameter of the Beta prior for pi1
#' @param s: scale parameter of the Inverse-Gamma(1/2, s^2/2) prior on tau2
#' @param nr_samples: number of samples drawn
#' @param nr_burnin: number of initial samples we discard ('burn-in samples')
#'
#' @returns matrix of posterior samples of parameters gamma, beta, sigma2, tau2, pi1
ss_regress <- function(
  y, X, a1 = .01, a2 = .01, pi1 = .5,
  a = 1, b = 1, s = 1/2,
  nr_samples = 6000,
  nr_burnin = round(nr_samples / 4)
) {
  p <- ncol(X)
  n <- nrow(X)

  # res is where we store the posterior samples
  res <- matrix(NA, nrow = nr_samples, ncol = 2*p + 1 + 1 + 1)
  colnames(res) <- c(
    paste0('gamma', seq(p)),
    paste0('beta', seq(p)),
    'sigma2', 'tau2', 'pi1'
  )

  # take the MLE estimate as the values for the first sample
  m <- lm(y ~ X - 1)
  res[1, ] <- c(rep(0, p), coef(m), var(predict(m) - y), 1, .5)

  # compute only once
  XtX <- t(X) %*% X
  Xty <- t(X) %*% y

  # we start running the Gibbs sampler
  for (i in seq(2, nr_samples)) {

    # first, get all the values of the previous time point
    gamma_prev <- res[i-1, seq(p)]
    beta_prev <- res[i-1, seq(p + 1, 2*p)]
    sigma2_prev <- res[i-1, ncol(res) - 2]
    tau2_prev <- res[i-1, ncol(res) - 1]
    pi1_prev <- res[i-1, ncol(res)]

    ## Start sampling from the conditional posterior distributions
    ##############################################################

    # sample pi1 from a Beta
    pi1_new <- rbeta(1, a + sum(gamma_prev), b + sum(1 - gamma_prev))

    # sample sigma2 from an Inverse-Gamma
    err <- y - X %*% beta_prev
    sigma2_new <- 1 / rgamma(1, a1 + n/2, a2 + t(err) %*% err / 2)

    # sample tau2 from an Inverse-Gamma
    tau2_new <- 1 / rgamma(
      1, 1/2 + 1/2 * sum(gamma_prev),
      s^2/2 + t(beta_prev) %*% beta_prev / (2*sigma2_new)
    )

    # sample beta from a multivariate Gaussian
    beta_cov <- qr.solve((1/sigma2_new) * XtX + diag(1/(tau2_new*sigma2_new), p))
    beta_mean <- beta_cov %*% Xty * (1/sigma2_new)
    beta_new <- mvtnorm::rmvnorm(1, beta_mean, beta_cov)

    # sample each gamma_j in random order
    for (j in sample(seq(p))) {

      # get the betas for which beta_j is zero
      gamma0 <- gamma_prev
      gamma0[j] <- 0
      bp0 <- t(beta_new * gamma0)

      # compute the z variables and the conditional variance
      xj <- X[, j]
      z <- y - X %*% bp0
      cond_var <- sum(xj^2) + 1/tau2_new

      # compute the chance parameter of the conditional posterior of gamma_j (Bernoulli)
      l0 <- log(1 - pi1_new)
      l1 <- (
        log(pi1_new) - .5 * log(tau2_new*sigma2_new) +
        sum(xj*z)^2 / (2*sigma2_new*cond_var) + .5 * log(sigma2_new / cond_var)
      )

      # sample gamma_j from a Bernoulli
      gamma_prev[j] <- rbinom(1, 1, exp(l1) / (exp(l1) + exp(l0)))
    }
    gamma_new <- gamma_prev

    # add the new samples
    res[i, ] <- c(gamma_new, beta_new*gamma_new, sigma2_new, tau2_new, pi1_new)
  }

  # remove the first nr_burnin samples
  res[-seq(nr_burnin), ]
}
```

Let's fit the model using this Gibbs sampler and look at the posteriors of $\pi_1$ and $\tau^2$.
```{r, fig.width = 8, fig.height = 4}
set.seed(18)
ss = ss_regress(
  y, X, a1 = .01, a2 = .01, pi1 = .5,
  a = 1, b = 1, s = 1/2, nr_samples = 5000
)
par(mfrow = c(1,2))
plot(density(ss[,"pi1"]), xlab = expression(pi[1]), lwd = 2,
     main = "Posterior of pi1", col = "red")
plot(density(ss[,"tau2"]), xlab = expression(tau^2), lwd = 2,
     main = "Posterior of tau2", col = "orange", xlim = c(0,1))
summary(ss[,"pi1"])
summary(ss[,"tau2"])
```

The posterior of $\pi_1$ is quite wide and suggests that $\pi_1$ is somewhere between 0.05 and 0.4. For $\tau^2$, the bulk of the posterior is below 0.20.

We can also look at the PIPs and the coefficient estimates.

```{r, fig.width = 11, fig.height = 9, message = F}
par(mfrow = c(1,2))
barplot(colMeans(ss[,1:p]), names.arg = 1:30, col = "limegreen",
        horiz = TRUE, xlab = "PIP")
library(vioplot)
vioplot(ss[,(p+1):(2*p)], names = 1:30, col = "blue",
        horizontal = TRUE, xlab = expression(beta[j]))
abline(v = 0, lty = 2)
```

Unfortunately, this implementation of Bayesian variable selection with the spike and slab prior through a Gibbs sampler does not scale to large data sets. The first problem is that each iteration of the Gibbs sampler performs a decomposition of a $p\times p$ matrix related to $\pmb X^T \pmb X$. The second problem is that every predictor is updated at each iteration, even when the model might be very sparse, i.e., only a few coefficients are non-zero. These issues become too demanding when $p$ grows large. Let's next discuss some recent work that has made it possible to apply these kinds of models to very large $p$.

## Efficient computation of Bayes factors under SSP

Let's continue with the assumption that each predictor has an independent SSP($\pi_1,\tau^2$) prior.
It has been shown by [Benner et al.](https://www.biorxiv.org/content/10.1101/318618v1.full.pdf) that the Bayes factor (BF) comparing a model where the non-zero coefficients are specified by a configuration $\pmb \gamma$ to the null model $\pmb \gamma_0 = (0,\ldots,0)$, where all coefficients are zero, can be computed using data on only the non-zero variables (details in the slides). This allows computing BF($\pmb \gamma : \pmb \gamma_0$) between $\pmb \gamma$ and $\pmb \gamma_0$ in $\mathcal{O}(k^3)$ operations, where $k = \sum_{j=1}^p\gamma_j$ is the number of non-zero coefficients in $\pmb \gamma$, compared to the $\mathcal{O}(p^3)$ operations required when the computation is done in the standard way. When we consider sparse models where $k \sim 10$ and $p \sim 10^5$, this can make all the difference.

Note that under the SSP($\pi_1,\tau^2$) prior, the prior probability of a configuration $\pmb \gamma$ is $\pi_1^{k_\gamma}(1-\pi_1)^{p-k_\gamma}$, which depends only on $k_\gamma$. Thus, knowing the BFs for all possible configurations, we can compute the (unnormalized) posterior probabilities of the configurations as
$$\text{Pr}(\pmb \gamma\, | \, \text{Data}) \propto \text{BF}(\pmb \gamma : \pmb \gamma_0)\, \pi_1^{k_\gamma}(1-\pi_1)^{p-k_\gamma}.$$
By normalizing these quantities with respect to their sum over $\pmb \gamma$, we obtain the posterior of every configuration. However, if we allowed, say, 10 non-zero variables out of $10^5$ candidates, we would still have ${10^5 \choose 10} \approx 10^{43}$ configurations to evaluate, which is far too many even when every one of them can be computed quickly. For this reason, stochastic search algorithms have been introduced that approximate the posterior by finding the most relevant configurations in the whole space of configurations and then normalizing their unnormalized posteriors with respect to this relevant subset of configurations. (See the slides for an example.)
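For a small $p$, the exhaustive computation over all $2^p$ configurations can be written out directly. The sketch below is an illustration of the idea, not Benner et al.'s implementation: it assumes a known error variance $\sigma^2$ and the prior $\beta_j \sim \mathcal{N}(0,\tau^2)$, and evaluates $\log \text{BF}(\pmb \gamma : \pmb \gamma_0)$ with only $k \times k$ matrices via the matrix determinant lemma and the Woodbury identity. The function name `log.bf.gamma` and the toy data (suffix `.toy`, to avoid overwriting the data used above) are made up for the example.

```{r}
set.seed(17)
# log BF(gamma : gamma_0) using only the k selected columns (k x k matrices),
# assuming beta_j ~ N(0, tau2) for included variables and known error variance sigma2
log.bf.gamma = function(y, X, g, tau2, sigma2) {
  k = sum(g)
  if (k == 0) return(0) # BF of the null model against itself
  Xg = X[, g == 1, drop = FALSE]
  XtX = crossprod(Xg)    # k x k
  Xty = crossprod(Xg, y) # k x 1
  # matrix determinant lemma for the log-determinant, Woodbury for the quadratic form:
  A = diag(1/tau2, k) + XtX/sigma2
  as.numeric(-0.5*determinant(diag(k) + (tau2/sigma2)*XtX)$modulus +
               0.5*crossprod(Xty, solve(A, Xty))/sigma2^2)
}
# exhaustive enumeration on toy data with p = 5 (feasible only for small p)
n.toy = 200; p.toy = 5; pi1.toy = 0.2; tau2.toy = 1; sigma2.toy = 1
X.toy = scale(matrix(rnorm(n.toy*p.toy), n.toy, p.toy))
y.toy = as.vector(X.toy %*% c(0.5, 0.5, 0, 0, 0) + rnorm(n.toy)) # effects at 1 and 2
cfg = as.matrix(expand.grid(rep(list(0:1), p.toy))) # all 2^p configurations
log.post = apply(cfg, 1, function(g)
  log.bf.gamma(y.toy, X.toy, g, tau2.toy, sigma2.toy) +
    sum(g)*log(pi1.toy) + (p.toy - sum(g))*log(1 - pi1.toy))
post = exp(log.post - max(log.post))
post = post/sum(post)  # normalized posterior over configurations
cfg[which.max(post), ] # should pick out variables 1 and 2
```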
## Sum of single effects model (SUSIE)

Another recent approach to the Bayesian variable selection model uses the 'single-effect regression' (SER) model as a building block ([Wang et al.](https://rss.onlinelibrary.wiley.com/doi/10.1111/rssb.12388)). The SER model is a multiple-regression model in which exactly one of the $p$ predictors has a non-zero regression coefficient. The idea is to combine some number $K$ of these SERs together to get a model with $K$ non-zero predictors, or $K$ *effects*. The key is that fitting each SER model conditional on the other SER models is very quick.

#### SUSIE model with $K$ effects

Let's assume that the number of non-zero effects $K$, the error variance $\sigma^2$ and the prior variance of a non-zero effect $\tau^2$ are fixed. For each non-zero effect $k=1,\ldots,K$, define $\pmb \gamma^{(k)} \in \{0,1\}^p$ as the indicator vector showing which is the $k$th non-zero coefficient, that is, $\sum_{j=1}^p \gamma^{(k)}_j=1$. The model is
\begin{eqnarray*}
\pmb y &\sim& \mathcal{N}(\pmb X \pmb \beta, \sigma^2 \pmb I)\\
\pmb \beta &=& \sum_{k=1}^K \pmb \beta^{(k)} \\
\pmb \beta^{(k)} &=& \beta^{(k)}\pmb \gamma^{(k)}, \text{ for each } k=1,\ldots,K,\\
\beta^{(k)} &\sim& \mathcal{N}(0, \tau^2), \text{ for each } k=1,\ldots,K,\\
\pmb \gamma^{(k)} &\sim& \text{Multinomial}(1,(1/p,\ldots,1/p)), \text{ for each } k=1,\ldots,K.
\end{eqnarray*}
The last line means that exactly one of the elements of $\pmb \gamma^{(k)}$ is non-zero and, *a priori*, it is equally likely to be any one of the $p$ elements. Note that the model does not rule out the possibility that $\pmb \gamma^{(k)}=\pmb \gamma^{(\ell)}$ for some $k \neq \ell$, even though conceptually it would be natural to avoid such overlaps. However, by leaving the possibility of an overlap, we can treat each effect independently, and we get much simpler computations. Additionally, in practice, we won't observe overlapping effects in the posterior distribution estimated by the following algorithm.
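Before turning to the full algorithm, it may help to see a single SER fit in isolation. The sketch below assumes a known error variance $\sigma^2$ (whereas the full fitting code below estimates the variance of each univariate estimate from residuals); the function name `ser_fit` and the toy data (suffix `.toy`, so that the data of the running example are not overwritten) are illustrative.

```{r}
set.seed(21)
# Posterior of a single SER model: exactly one predictor has a non-zero effect.
ser_fit = function(y, X, tau2 = 1, sigma2 = 1) {
  xx = colSums(X^2)
  b.hat = as.vector(crossprod(X, y))/xx # univariate OLS estimates
  s2 = sigma2/xx                        # their sampling variances
  # log BF of "effect at j" vs "no effect", for each j:
  log.bf = 0.5*log(s2/(s2 + tau2)) + 0.5*b.hat^2*tau2/(s2*(tau2 + s2))
  alpha = exp(log.bf - max(log.bf))
  alpha = alpha/sum(alpha)     # PIPs under the uniform prior on the position
  post.var = 1/(1/tau2 + 1/s2) # posterior variance given the effect is at j
  list(pip = alpha, mean = post.var/s2*b.hat, var = post.var)
}
X.toy = scale(matrix(rnorm(200*4), 200, 4))
y.toy = as.vector(X.toy[,2]*0.4 + rnorm(200)) # a single effect at variable 2
round(ser_fit(y.toy, X.toy)$pip, 3) # PIP concentrates on variable 2
```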
#### Iterative Bayesian stepwise selection (IBSS)

A key motivation for the SUSIE model is that, given $(\pmb \beta^{(\ell)})_{\ell \neq k}$, fitting $\pmb \beta^{(k)}$ involves only fitting an SER model, whose posterior distribution on the indicator $\pmb \gamma^{(k)}$ and the parameter $\beta^{(k)}$ is simple to compute. This suggests an iterative algorithm (called IBSS) to fit the model. Below we apply it to our existing data set, allowing $K=5$ effects and setting the prior variance $\tau^2 = 1$.

```{r}
p = ncol(X)
n = nrow(X)
K = 5
tau2 = 1
xx = colSums(X^2) # = diagonal of t(X) %*% X. Will be needed later.
b = matrix(0, nrow = K, ncol = p) #initialize each effect beta_k = 0
res.b = res.pip = res.v = b #initialize results to 0
diff.in.post = rep(0, K) #initialize to 0
converged = FALSE
tol = 1e-6 # stop when difference between iterations < tol
iter = 0
while(!converged){
  iter = iter + 1
  for (k in 1:K){ #Fitting SER model for effect k
    r = as.vector(y - rowSums(X %*% t(b[-k,]))) #residuals without effect k
    #Use Exercise 1.5 results here to get MLE of linear model
    b.k = t(X) %*% r / xx #univariate OLS estimate
    B = matrix(b.k, byrow = T, nrow = n, ncol = p)
    M = ( X*B - r )^2
    v.k = colSums(M) / xx / (n-1) #MLE for variance of b.k
    #Combine MLEs with prior distributions to get posteriors
    post.v.k = 1/(1/v.k + 1/tau2)
    post.b.k = post.v.k/v.k * b.k
    #Conditional posterior for beta_k is N(post.b.k, post.v.k)
    # conditional on the effect being at that position
    #PIPs for effect being at position 1,...,p:
    log.bf = 0.5*log(v.k/(v.k + tau2)) + 0.5*b.k^2/v.k*tau2/(tau2 + v.k)
    tmp = exp(log.bf - max(log.bf))
    post.ip = tmp / sum(tmp)
    b[k,] = post.b.k*post.ip #save new effect k estimate for next iteration
    #compare to previous iteration to observe convergence:
    diff.in.post[k] = max(abs(b[k,] - res.b[k,]*res.pip[k,]))
    #save results
    res.b[k,] = post.b.k
    res.pip[k,] = post.ip
    res.v[k,] = post.v.k
  }
  converged = (all(diff.in.post < tol))
}
print(paste("Ended in", iter, "iterations"))
```

Let's examine the output of this algorithm. We allowed 5 non-zero effects. Let's look at the distributions of PIPs for each effect $k = 1,\ldots,5$.

```{r, fig.width = 11, fig.height = 7}
par(mfrow = c(1,5))
for(ii in 1:K){
  barplot(res.pip[ii,], names.arg = 1:p, horiz = TRUE, xlim = c(0,1),
          main = paste("Effect", ii), col = ii, xlab = "PIP")
}
```

We see that for the first 3 effects, there is a clear difference between the predictors in their PIPs. For the last two, none of the coefficients has a large PIP, and therefore we may want to ignore those effects. In other words, there are likely only 3 clear non-zero effects, and the remaining effects, if included in the SUSIE model, wouldn't provide any information about which of the predictors have a non-zero effect. Note, however, that despite this reasonable ad hoc criterion for determining the number of effects, the SUSIE model cannot make any probabilistic assessment of the number of non-zero effects, whereas the SSP model directly gave us the posterior on the number of non-zero coefficients.

We can next estimate the coefficients by summing over the $K$ SER models. Let's denote the posterior mean and variance of SER model $k$, conditional on the effect being at predictor $j$, by $\mu_j^{(k)}$ and $\eta_j^{(k)}$, respectively. Then we can compute the mean and variance of each coefficient (marginalized over all $K$ SER models) as follows.
\begin{eqnarray*}
\text{E}(\beta_j) &=& \text{E}\left(\sum_{k=1}^K \gamma^{(k)}_j \beta_j^{(k)}\right) = \sum_{k=1}^K \pi_j^{(k)} \mu_j^{(k)}\\
\text{Var}(\beta_j) &=& \text{Var}\left(\sum_{k=1}^K \gamma^{(k)}_j \beta_j^{(k)}\right) = \sum_{k=1}^K \text{Var} \left( \gamma^{(k)}_j \beta_j^{(k)}\right), \text{ treating the } K \text{ SER models as independent,} \\
&=& \sum_{k=1}^K\left( \text{E}\left({\gamma^{(k)}_j}^2 {\beta_j^{(k)}}^2\right) - \text{E}\left(\gamma^{(k)}_j \beta_j^{(k)}\right)^2\right) \\
&=& \sum_{k=1}^K\left( \pi_j^{(k)} \left(\eta_j^{(k)} + {\mu_j^{(k)}}^2\right) - {\pi_j^{(k)}}^2 {\mu_j^{(k)}}^2\right)\\
&=& \sum_{k=1}^K \left( \pi_j^{(k)} \eta_j^{(k)} + \pi_j^{(k)}(1-\pi_j^{(k)}){\mu_j^{(k)}}^2\right)
\end{eqnarray*}

Let's compute the posterior means and variances and then show the 95% credible intervals, assuming that the posteriors are Gaussian.

```{r, fig.width = 10}
b.mean = colSums(res.pip*res.b)
b.var = colSums(res.pip*res.v + res.pip*(1-res.pip)*res.b^2)
plot(NULL, xlim = c(0, p+1), ylim = c(-0.1, 0.5), ylab = expression(beta), xlab = "")
points(1:p, b.mean, pch = 19)
arrows(1:p, b.mean - 1.96*sqrt(b.var), 1:p, b.mean + 1.96*sqrt(b.var),
       code = 3, length = 0)
abline(h = 0)
```

#### Credible sets of non-zero variables

A nice property of the SUSIE model is that it allows a straightforward computation of a **credible set** of the non-zero variables.

**Definition.** A level $\rho$ credible set (CS) is a subset of variables that has probability $\geq \rho$ of containing at least one effect variable (i.e. a variable with a non-zero regression coefficient).

For each effect in the SUSIE model, we can compute a credible set by sorting the PIPs in decreasing order and forming the CS by including the smallest possible number of variables whose PIPs sum to a value $\geq \rho$. In our current example, the 95% CSs are below.
```{r}
rho = 0.95
cs = list()
for(ii in 1:K){
  ord = order(res.pip[ii,], decreasing = T)
  cs[[ii]] = ord[1:min(which(cumsum(res.pip[ii, ord]) >= rho))]
}
cs
```

For the first three effects, the CSs contain only one variable each, while the CSs for the last two effects contain almost all variables. This again shows that the first three effects are each clearly localized to one variable, whereas the two additional effects are not informative and we might want to simply ignore them. Note that the posterior distribution provided by the SSP model does not directly contain information from which one could build credible sets for the variables. Credible sets are useful when one needs to list the candidate variables that could replace each other in the model.

**Question.** Assume that there are three highly correlated variables, of which one truly affects the outcome, but no other variables are associated with the outcome. What kind of PIPs would SUSIE give, and what would the credible sets look like?