---
title: "HDS 8. Bayesian variable selection"
author: "Matti Pirinen, University of Helsinki"
date: "27.11.2021"
output:
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(15)
```
In HDS6, we saw that LASSO has a property of setting many of the
coefficients to exactly zero at the optimum of the objective function
and hence we said that LASSO does
*variable selection*. A technical reason for this behavior
was the LASSO's prior distribution for coefficients that had
a sharp peak near zero. In HDS7, we embedded the LASSO model into
a fully Bayesian model `blasso` that allowed us to estimate the
full posterior distribution of the coefficients but that
was computationally more heavy to apply.
Next we will consider conceptually simpler Bayesian models for variable selection,
where the prior distribution of a coefficient states that
coefficient is non-zero with probability $\pi_1$
and has a positive probability mass
$1-\pi_1$ of being exactly zero.
## Spike and slab prior (SSP)
Under SSP, we model individual coefficient $\beta_j$ using
a mixture distribution
$$\beta_j\,|\, \pi_1, \tau^2 \sim (1-\pi_1) \delta_0 + \pi_1 \mathcal{N}(0, \tau^2)$$
where $\delta_0$ is point mass at 0.
In other words,
$$
\begin{cases}
\beta_j = 0, & \text{with probability } 1-\pi_1,\\
\beta_j \sim \mathcal{N}(0, \tau^2), & \text{with probability } \pi_1.
\end{cases}
$$
```{r, echo = F}
pi1 = 0.5
tau = 1
plot(NULL, xlim = c(-3,3), ylim = c(0,1),
main = expression(paste("Spike and slab with ",pi[1],"=0.5 ",tau,"=1")),
xlab = expression(beta[j]), ylab = "density / probability")
x = seq(-3, 3, 0.01)
lines(x, dnorm(x, 0, tau), lwd = 2, col ="blue")
arrows(0, 0, 0, pi1, length = 0, lwd = 3)
points(0,0, pch = 19, cex=2)
legend("topright", col = c("black", "blue"), leg = c("spike", "slab"),
pch = c(19,-1), lty = c(0,1), lwd = c(0,2))
```
Suppose we have $n$ samples and $p$ predictors (columns of $\pmb X$)
to model outcome vector $\pmb y$.
Let $\pmb \gamma = (\gamma_j)_{j=1}^p$
be vector of binary indicators indicating which variables are non-zero,
that is,
$$\gamma_j =
\begin{cases}
1, & \text{if } \beta_j \neq 0,\\
0, & \text{if } \beta_j = 0.
\end{cases}
$$
We can assign to each predictor, independently, an SSP($\pi_1,\tau^2$)
prior where the values $\pi_1$ and $\tau^2$ are shared between the predictors.
Value of $\pi_1$ could be determined based on
which proportion of predictors we expect to be non-zero, and
$\tau^2$ could be determined based on what magnitude of coefficient values we are
expecting.
Such model is implemented in `BoomSpikeSlab` package.
(Later we will discuss extensions that estimate $\pi_1$ and $\tau^2$ from data.)
Let's try it on the same data we analyzed using Bayesian LASSO in HDS 7 that had
$n=250$, $p=30$ and first 3 variables had an effect size of 0.229
while the remaining 27 were exactly 0.
We use prior $\pi_1 = 5/30 \approx 0.167$ (by setting `expected.model.size = 5`)
and $\tau^2=1$ (by setting `prior.beta.sd = rep(1,p)`).
```{r, results = FALSE, message = FALSE}
set.seed(122)
p = 30
n = 250
phi = 0.05 #variance explained by x_1, should be 0 < phi < 1.
b = rep(c(sqrt(phi / (1-phi)), 0), c(3, p-3)) #effects 1,2,3 are non-zero, see Lect. 0.1 for "phi"
X = scale(matrix(rnorm(n*p), nrow = n)) #cols 1,2,3 of X have effects, other cols are noise
eps = scale(rnorm(n, 0, 1)) #epsilon, error term
y = scale(X%*%b + eps, scale = FALSE) #makes y have mean = 0
library(BoomSpikeSlab)
prior = IndependentSpikeSlabPrior(X, y,
expected.model.size = 5,
prior.beta.sd = rep(1,p))
lm.ss = lm.spike(y ~ X - 1, niter = 1000, prior = prior)
```
```{r}
summary(lm.ss)
```
We see that the 3 variables with true effects have posterior inclusion
probabilities (PIP) > 0.90 whereas all others have PIPs
< 0.10. In the results, we have two sets of posterior means and sds:
first from the full posterior distribution and second conditional
on variable being included in the model. When the variable
has a large PIP, then the two are similar, but for small PIPs,
the two differ and the full unconditional mean is a shrunk version of the conditional
mean where the shrinkage factor is the PIP. For example, for variable `X6`
the posterior mean is `r summary(lm.ss)$coefficients["X6",3]`
conditional on the variable being non-zero, and
`X6` is non-zero with a posterior probability of `r summary(lm.ss)$coefficients["X6",5]`
Thus, the (unconditional) posterior mean is
`r summary(lm.ss)$coefficients["X6",5]` $\cdot$ `r summary(lm.ss)$coefficients["X6",3]` + `r 1-summary(lm.ss)$coefficients["X6",5]` $\cdot$ 0 = `r summary(lm.ss)$coefficients["X6",1]`.
Note that in similar spirit, one could extend the LASSO method to [**relaxed LASSO**](https://glmnet.stanford.edu/articles/relax.html) where one first
runs LASSO to choose the non-zero variables (corresponding to unconditional analysis)
and then fits an unpenalized model using
only the chosen variables (corresponding to conditional analysis).
Statistical inference for such a stepwise procedure is, however, complicated
whereas in the Bayesian approach
the posterior distribution is directly available and
a natural basis for inference.
Let's visualize the PIPs and posterior distributions of the coefficients.
```{r, fig.height = 10}
par(mfrow = c(1,2))
plot(lm.ss) #plots PIPs
plot(lm.ss,"coefficients")#plots coeffs
```
Posterior on the model size can also be plotted.
```{r}
plot(lm.ss,"size", freq = F)
```
### Extension to estimating $\pi_1$ and $\tau^2$
To make the model fully Bayesian, we would like to estimate also
parameters $\pi_1$ and $\tau^2$.
For that, we will consider $\tau^2$ as being specified relative to the
error variance $\sigma^2$, i.e., we use prior
$$\beta_j\,|\, \pi_1, \tau^2, \sigma^2 \sim (1-\pi_1) \delta_0 + \pi_1 \mathcal{N}(0, \sigma^2 \tau^2).$$
For example, these parameters can be
given priors $\pi_1 \sim \text{Beta}(b_0,b_1)$ and
$\tau^2 \sim \text{Inverse-Gamma}(1/2,s^2/2)$ where the hyper-parameters
$b_0,b_1$ and $s$ are given such fixed values that the priors cover
all reasonable values of the parameters in a fairly uniform manner.
Additionally, a flat prior on $\sigma^2$ could be an
$\text{Inverse-Gamma}(a,a)$ where $a$ is a small positive value, such as 0.01,
as such a distribution approximates a uniform prior on $\log(\sigma^2)$ scale
(so called Jeffrey's prior for the variance parameter).
By using such "flat priors" we hope to achieve a posterior distribution that
has been guided by information from the data
while the effect of the prior distribution on the posterior remains small.
In the following analysis, we will use hyper-parameters $b_0 = b_1 = 1$, $s = 0.5$
and $a=0.01$ that lead to the following prior densities on $\pi_1$ and $\tau^2$.
```{r, fig.width = 8, fig.height = 4, echo = F}
par(mfrow = c(1,2))
u = seq(0, 1, 0.01)
v = dbeta(u, 1, 1)
plot(u,v, t = "l", main = "Beta(1,1)",
xlab = expression(pi[1]), ylab = "density")
u = seq(0, 2, 0.01)
s = 0.5
shape = 0.5
rate = 0.5*s^2
v = rate^shape / gamma(shape) * u^(-1-shape) * exp(-rate/u)
plot(u, v, t = "l", main = "Inv-Gamma(1/2,1/8)",
xlab = expression(tau^2), ylab = "density")
```
The following Gibbs sampler code to implement this model
is adapted from Fabian Dablander
```{r, eval = T}
#' Spike-and-Slab Regression using Gibbs Sampling for p > 1 predictors
#'
#' @param y: vector of responses
#' @param X: matrix of predictor values
#' @param nr_samples: indicates number of samples drawn
#' @param a1: parameter a1 of Gamma prior on variance sigma2e
#' @param a2: parameter a2 of Gamma prior on variance sigma2e
#' @param pi1: parameter of prior over mixture weight
#' @param a: 1st parameter of Beta prior for pi1
#' @param b: 2nd parameter of Beta prior for pi1
#' @param burnin: number of samples we discard ('burnin samples')
#'
#' @returns matrix of posterior samples from parameters gamma, beta, tau2, sigma2e, pi1
ss_regress <- function(
y, X, a1 = .01, a2 = .01, pi1 = .5,
a = 1, b = 1, s = 1/2, nr_samples = 6000, nr_burnin = round(nr_samples / 4, 2)
) {
p <- ncol(X)
n <- nrow(X)
# res is where we store the posterior samples
res <- matrix(NA, nrow = nr_samples, ncol = 2*p + 1 + 1 + 1)
colnames(res) <- c(
paste0('gamma', seq(p)),
paste0('beta', seq(p)),
'sigma2', 'tau2', 'pi1'
)
# take the MLE estimate as the values for the first sample
m <- lm(y ~ X - 1)
res[1, ] <- c(rep(0, p), coef(m), var(predict(m) - y), 1, .5)
# compute only once
XtX <- t(X) %*% X
Xty <- t(X) %*% y
# we start running the Gibbs sampler
for (i in seq(2, nr_samples)) {
# first, get all the values of the previous time point
gamma_prev <- res[i-1, seq(p)]
beta_prev <- res[i-1, seq(p + 1, 2*p)]
sigma2_prev <- res[i-1, ncol(res) - 2]
tau2_prev <- res[i-1, ncol(res) - 1]
pi1_prev <- res[i-1, ncol(res)]
## Start sampling from the conditional posterior distributions
##############################################################
# sample pi1 from a Beta
pi1_new <- rbeta(1, a + sum(gamma_prev), b + sum(1 - gamma_prev))
# sample sigma2e from an Inverse-Gamma
err <- y - X %*% beta_prev
sigma2_new <- 1 / rgamma(1, a1 + n/2, a2 + t(err) %*% err / 2)
# sample tau2 from an Inverse Gamma
tau2_new <- 1 / rgamma(
1, 1/2 + 1/2 * sum(gamma_prev),
s^2/2 + t(beta_prev) %*% beta_prev / (2*sigma2_new)
)
# sample beta from multivariate Gaussian
beta_cov <- qr.solve((1/sigma2_new) * XtX + diag(1/(tau2_new*sigma2_new), p))
beta_mean <- beta_cov %*% Xty * (1/sigma2_new)
beta_new <- mvtnorm::rmvnorm(1, beta_mean, beta_cov)
# sample each gamma_j in random order
for (j in sample(seq(p))) {
# get the betas for which beta_j is zero
gamma0 <- gamma_prev
gamma0[j] <- 0
bp0 <- t(beta_new * gamma0)
# compute the z variables and the conditional variance
xj <- X[, j]
z <- y - X %*% bp0
cond_var <- sum(xj^2) + 1/tau2_new
# compute chance parameter of the conditional posterior of gamma_j (Bernoulli)
l0 <- log(1 - pi1_new)
l1 <- (
log(pi1_new) - .5 * log(tau2_new*sigma2_new) +
sum(xj*z)^2 / (2*sigma2_new*cond_var) + .5 * log(sigma2_new / cond_var)
)
# sample gamma_j from a Bernoulli
gamma_prev[j] <- rbinom(1, 1, exp(l1) / (exp(l1) + exp(l0)))
}
gamma_new <- gamma_prev
# add new samples
res[i, ] <- c(gamma_new, beta_new*gamma_new, sigma2_new, tau2_new, pi1_new)
}
# remove the first nr_burnin number of samples
res[-seq(nr_burnin), ]
}
```
Let's fit the model using this Gibbs sampler and see posteriors of
$\pi_1$ and $\tau^2$.
```{r, fig.width = 8, fig.height = 4}
set.seed(18)
ss = ss_regress(
y, X, a1 = .01, a2 = .01, pi1 = .5,
a = 1, b = 1, s = 1/2, nr_samples = 5000
)
par(mfrow = c(1,2))
plot(density(ss[,"pi1"]), xlab = expression(pi[1]), lwd = 2,
main = "Posterior of pi1", col = "red")
plot(density(ss[,"tau2"]), xlab = expression(tau^2), lwd = 2,
main = "Posterior of tau2", col = "orange", xlim = c(0,1))
summary(ss[,"pi1"])
summary(ss[,"tau2"])
```
The posterior of $\pi_1$ is quite wide and suggest that
$\pi_1$ is somewhere between 0.05 and 0.4.
For $\tau^2$, the bulk of the posterior is below 0.20.
We can also see the PIPs and coefficient estimates.
```{r, fig.width = 11, fig.height = 9, message = F}
par(mfrow = c(1,2))
barplot(colMeans(ss[,1:p]), names.arg = 1:30,
col = "limegreen", horiz = TRUE, xlab = "PIP")
library(vioplot)
vioplot( ss[,(p+1):(2*p)], names = 1:30,
col = "blue", horizontal = TRUE, xlab = expression(beta[j]))
abline(v = 0, lty = 2)
```
Unfortunately, these implementation of Bayesian variable selection using
Spike and Slab prior through Gibbs sampler do not scale to large data sets.
First problem is that at each iteration of the Gibbs sampler
matrix decomposition on $p\times p$ matrix related to
$\pmb X^T \pmb X$ are done.
Second problem is that at each iteration every predictor
is updated even when the model might be very sparse,
i.e., only a few coefficients are non-zero.
These issues become too demanding when $p$ becomes large.
Let's next discuss some recent work that has made it possible
to apply these kinds of models to very large $p$.
## Efficient computation of Bayes factors under SSP
Let's continue with the assumption
that each predictor has an independent SSP($\pi_1,\tau^2$) prior.
It has been shown by [Benner et al.](https://www.biorxiv.org/content/10.1101/318618v1.full.pdf) that the Bayes factor (BF) comparing a model where the non-zero
coefficients are specified by a configuration $\pmb \gamma$ and
the null model
$\pmb \gamma_0 = (0,\ldots,0)$ where all coefficients are zero can be computed using data on
only the non-zero variables (Details in Slides).
This allows computing BF($\pmb \gamma : \pmb \gamma_0$)
between $\pmb \gamma$ and $\pmb \gamma_0$ in $\mathcal{O}(k^3)$
operations, where $k = \sum_{j=1}^p\gamma_j$ is the number of
non-zero coefficients in $\pmb \gamma$,
compared to $\mathcal{O}(p^3)$ operations that are required
when the computation is done in the standard way.
When we consider sparse models where
$k \sim 10$ and $p \sim 10^5$ this can make all the difference.
Note that under SSP($\pi_1,\tau^2$) prior, the prior probability of
a configuration $\pmb \gamma$ is
$\pi_1^{k_\gamma}(1-\pi_1)^{p-k_\gamma}$
and only depends on $k_\gamma$.
Thus, by knowing the BFs for all possible configurations, we can compute
an (unnormalized) posterior probabilities for the configurations as
$$\text{Pr}(\pmb \gamma\, | \, \text{Data}) \propto
\text{BF}(\pmb \gamma : \pmb \gamma_0) \pi_1^{k_\gamma}(1-\pi_1)^{p-k_\gamma}.$$
By normalizing these quantities with respect to their sum over $\pmb \gamma$,
we have computed the posteriors for every configuration.
However, if we allowed, say, 10 non-zero variables out of $10^5$
candidates, we would still have ${10^5 \choose 10} \approx 10^{43}$
configurations to evaluate, and that is far
too many even when we can do every one of them quickly.
For this reason, stochastic search algorithms have been introduced to
approximate the posterior by finding the most relevant configurations
from the whole space of configurations and then normalizing
their un-normalized posteriors with respect to
the relevant subset of configurations. (See slides for an example.)
## Sum of single effect model (SUSIE)
Another recent approach to analyze Bayesian variable selection model
is using the ‘single-effect regression’ (SER) model
as a building block ([Wang et al.](https://rss.onlinelibrary.wiley.com/doi/10.1111/rssb.12388)).
SER model is a multiple-regression model in
which exactly one of the $p$ predictors
has a non-zero regression coefficient.
The idea is to combine some number $K$ of these SERs together
to get a model where there are $K$ non-zero predictors, or
$K$ *effects*.
The key is that fitting each SER model conditional on the other SER models
is very quick.
#### SUSIE model with $K$ effects
Let's assume that the number of non-zero effects $K$,
the error variance $\sigma^2$
and the prior variance on the non-zero effect $\tau^2$ are fixed.
For each non-zero effect $k=1,\ldots,K$, define
$\pmb \gamma^{(k)} \in \{0,1\}^p$ as the indicator vector that
shows which is the $k$th non-zero coefficient, that is,
$\sum_{j=1}^p \gamma^{(k)}_j=1$.
The model is
\begin{eqnarray*}
\pmb y &\sim& \mathcal{N}(\pmb X \pmb \beta,\sigma^2 \pmb I)\\
\pmb \beta &=& \sum_{k=1}^K \pmb \beta^{(k)} \\
\pmb \beta^{(k)} &=& \beta^{(k)}\pmb \gamma^{(k)}, \text{ for each }k=1,\ldots,K,\\
\beta^{(k)} &\sim& \mathcal{N}(0, \tau^2), \text{ for each }k=1,\ldots,K,\\
\pmb \gamma^{(k)} &\sim& \text{Multinomial}(1,(1/p,\ldots,1/p)), \text{ for each }k=1,\ldots,K.
\end{eqnarray*}
The last line means that the exactly one of the
elements of $\pmb \gamma^{(k)}$ is non-zero and, *a priori*, it is
equally likely to be any one of the $p$ elements.
Note that the model does not rule out the possibility that
$\pmb \gamma^{(k)}=\pmb \gamma^{(\ell)}$ for some $k \neq \ell$
even though conceptually it would be natural to avoid such overlaps.
However, by leaving a possibility for an overlap,
we can treat each effect independently and we
get much simpler computations. Additionally, in practice,
we won't observe overlapping effects in the posterior
distribution estimated by the following algorithm.
#### Iterative Bayesian stepwise selection (IBSS)
A key motivation for the SUSIE model is that, given
$(\pmb \beta^{(\ell)})_{\ell \neq k}$ fitting
$\pmb \beta^{(k)}$ involves only fitting an SER model whose
posterior distribution on the indicator $\pmb \gamma^{(k)}$
and the parameter $\beta^{(k)}$ is simple to compute.
This suggests an iterative algorithm (called IBSS)
to fit the model. Below we apply it to our existing data set
by allowing $K=5$ effects and setting prior variance $\tau^2 = 1$.
```{r}
p = ncol(X)
n = nrow(X)
K = 5
tau2 = 1
xx = colSums(X^2) # = diagonal of t(X) %*% X. Will be needed later.
b = matrix(0, nrow = K, ncol = p) #initialize each effect beta_k = 0
res.b = res.pip = res.v = b #initialize results to 0
diff.in.post = rep(0,p) #initialize to 0
converged = FALSE
tol = 1e-6 # stop when difference between iterations < tol
iter = 0
while(!converged){
iter = iter + 1
for (k in 1:K){ #Fitting SER model for effect k
r = as.vector(y - rowSums(X%*%t(b[-k,]))) #residuals without effect k
#Use Exercise 1.5 results here to get MLE of linear model
b.k = t(X) %*% r / xx #univariate OLS estimate
B = matrix(b.k, byrow = T, nrow = n, ncol = p)
M = ( X*B - r )^2
v.k = colSums(M) / xx / (n-1) #MLE for variance of b.k
#Combine MLEs with prior distributions to get posteriors
post.v.k = 1/(1/v.k + 1/tau2)
post.b.k = post.v.k/v.k * b.k
#Conditional posterior for beta_k is N(post.b.k, post.v.k)
# conditional on the effect being at that position
#PIPs for effect being at position 1,...,p:
log.bf = 0.5*log(v.k/(v.k + tau2)) + 0.5*b.k^2/v.k*tau2/(tau2 + v.k)
tmp = exp(log.bf - max(log.bf))
post.ip = tmp / sum(tmp)
b[k,] = post.b.k*post.ip #save new effect k estimate for next iteration
#compare to previous iteration to observe convergence:
diff.in.post[k] = max(abs(b[k,] - res.b[k,]*res.pip[k,]))
#save results
res.b[k,] = post.b.k
res.pip[k,] = post.ip
res.v[k,] = post.v.k
}
converged = (all(diff.in.post < tol))
}
print(paste("Ended in",iter,"iterations"))
```
Let's examine the output of this algorithm.
We allowed 5 non-zero effects. Let's see the distributions
of PIPs for each effect $k = 1,\ldots,5$.
```{r, fig.width = 11, fig.height = 7}
par(mfrow = c(1,5))
for(ii in 1:K){
barplot(res.pip[ii,], names.arg = 1:p, horiz = TRUE, xlim = c(0,1),
main = paste("Effect",ii), col = ii, xlab = "PIP")
}
```
We see that for the first 3 effects, there is a clear difference between
predictors in their PIPs. For the last two, none of the coefficients
has a large PIP and therefore we may want to ignore those effects.
In other words, likely there are only 3 clear non-zero effects there
and the remaining effects, if included in SUSIE model,
wouldn't provide any information about which of the predictors
would have a non-zero effect.
Note, however, that despite this reasonable ad hoc criterion to determine
the number of effects, the SUSIE model cannot make any probabilistic
assessment of the number of non-zero effects whereas SSP model directly
gave us the posterior on the number of non-zero coefficients.
We can next estimate the coefficients by summing over
$K$ SER models. Let's denote
the posterior mean and variance of the SER model $k$ conditional
on the effect being at predictor $j$ by
$\mu_j^{(k)}$ and $\eta_j^{(k)}$, respectively.
Then we can compute the mean and variance
for each coefficient (marginalized over all $K$ SER models) as follows.
\begin{eqnarray*}
\text{E}(\beta_j) &=& \text{E}\left(\sum_{k=1}^K \gamma^{(k)}_j \beta_j^{(k)}\right) = \sum_{k=1}^K \pi_j^{(k)} \mu_j^{(k)}\\
\text{Var}(\beta_j) &=& \text{Var}\left(\sum_{k=1}^K \gamma^{(k)}_j \beta_j^{(k)}\right) = \sum_{k=1}^K \text{Var} \left( \gamma^{(k)}_j \beta_j^{(k)}\right) \\
&=& \sum_{k=1}^K\left( \text{E}\left({\gamma^{(k)}_j}^2 {\beta_j^{(k)}}^2\right) -
\text{E}\left(\gamma^{(k)}_j \beta_j^{(k)}\right)^2\right) \\
&=& \sum_{k=1}^K\left( \pi_j^{(k)} \left({\eta_j^{(k)}}^2 + {\mu_j^{(k)}}^2\right) - {\pi_j^{(k)}}^2 {\mu_j^{(k)}}^2\right)\\
&=& \sum_{k=1}^K \left( \pi_j^{(k)} \left({\eta_j^{(k)}}^2\right) + \pi_j^{(k)}(1-\pi_j^{(k)}){\mu_j^{(k)}}^2\right)
\end{eqnarray*}
Let's find the posterior means and variances
and then show the 95% credible intervals assuming that
posteriors are Gaussian.
```{r, fig.width = 10}
b.mean = colSums(res.pip*res.b)
b.var = colSums(res.pip*res.v + res.pip*(1-res.pip)*res.b^2)
plot(NULL, xlim = c(0,p+1), ylim = c(-0.1, 0.5),
ylab = expression(beta), xlab = "")
points(1:p, b.mean, pch = 19)
arrows(1:p, b.mean-1.96*sqrt(b.var), 1:p, b.mean+1.96*sqrt(b.var),
code = 3, length = 0)
abline(h=0)
```
#### Credible sets of non-zero variables
A nice property of SUSIE model is that it allows a straightforward
computation of **credible set** of the non-zero variables.
**Definition.** A level $\rho$ credible set (CS) is
a subset of variables that has probability $\geq \rho$
of containing at least one effect variable
(i.e. a variable with non-zero regression coefficient).
For each effect in SUSIE model, we can compute credible set
by sorting the PIPs in decreasing order and by forming the
CS by inlcuding the smallest possible number of
variables whose PIPs sum to a value $\geq \rho$.
In our current example, 95% CS are below.
```{r}
rho = 0.95
cs = list()
for(ii in 1:K){
ord = order(res.pip[ii,], decreasing = T)
cs[[ii]] = ord[1:min(which(cumsum(res.pip[ii,ord]) >= rho))]
}
cs
```
For the first three effects, the CSs only contain one variable each
while the CSs for the last two effects contain almost all variables.
This again shows that the first three effects are clearly localized each
to one variable whereas the two additional effects are not informative and
we might want to just ignore them.
Note that the posterior distribution provided by the SSP model
does not directly contain information by which one could build
credible sets for variables. Credible sets are useful when
one needs to list the possible candidate variables that could replace
each other in the model.
**Question.** Assume that there are three highly correlated variables
of which one truly affects the outcome but no other variables associated
with the outcome. What kind of PIPs
SUSIE would give and how would the credible sets look like?