We have learned how to think like a Bayesian, and we have learned three conjugate families.
Once we have a posterior model, we must be able to apply the results.
f(\theta|y) = \frac{f(\theta) L(\theta|y)}{f(y)} \propto f(\theta)L(\theta|y)
f(y) = \int_{\theta_1} \int_{\theta_2} \cdots \int_{\theta_k} f(\theta) L(\theta|y) \, d\theta_k \cdots d\theta_2 \, d\theta_1
When this normalizing constant is intractable, we instead simulate a sample
\left \{ \theta^{(1)}, \theta^{(2)}, ..., \theta^{(N)} \right \}
These \theta^{(i)} will have properties that reflect those of the posterior model for \theta.
To make these ideas concrete, we will apply these simulation techniques to Beta-Binomial and Gamma-Poisson models.
Note: we will use the following packages that may be new to you:
janitor
rstan
bayesplot
If you are using the server provided by HMCSE, they have been installed for you.
If you are working at home, please check to see if you have the libraries, then install if you do not.
install.packages(c("janitor", "rstan", "bayesplot"))

Suppose there is an image that you can’t view in its entirety.
We can see snippets along a grid that sweeps from left to right across the image.
The finer the grid, the clearer the image; if the grid is fine enough, the result is a good approximation.
This is the general idea behind Bayesian grid approximation.
Our target image is the posterior pdf, f(\theta|y).
We take a simulated sample
\left\{ \theta^{(1)}, \theta^{(2)}, ..., \theta^{(N)} \right\}
from a discretized approximation of the posterior pdf, f(\theta|y).
Consider a Beta-Binomial example:
\begin{align*} Y|\pi &\sim \text{Bin}(10, \pi) \\ \pi &\sim \text{Beta}(2, 2) \end{align*}
Having observed Y = 9 successes in 10 trials, conjugacy gives the exact posterior:
\pi | (Y = 9) \sim \text{Beta}(11, 3)
Instead of using the posterior we know, let’s approximate it using grid approximation.
First step: define a discrete grid of possible \theta values, as sketched below.
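A minimal sketch of this step in R (the grid size of 6 and the name grid_data are assumptions; pi_grid holds the grid values):

# STEP 1: define a coarse grid of possible pi values (grid size is an assumption)
grid_data <- data.frame(pi_grid = seq(from = 0, to = 1, length.out = 6))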
Second step: evaluate the prior pdf, f(\theta), and the likelihood function, L(\theta|y), at each \theta grid value. We will use dbeta() and dbinom() to evaluate the \text{Beta}(2,2) prior and \text{Bin}(10, \pi) likelihood with Y=9 at each \pi in pi_grid, as in the sketch below.
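Continuing the sketch:

# STEP 2: evaluate the Beta(2, 2) prior and the Bin(10, pi)
# likelihood with Y = 9 at each pi in pi_grid
grid_data$prior <- dbeta(grid_data$pi_grid, 2, 2)
grid_data$likelihood <- dbinom(9, 10, grid_data$pi_grid)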
Third step: obtain a discrete approximation of the posterior pdf, f(\theta|y), by calculating the product f(\theta) L(\theta|y) at each \theta grid value and normalizing the products to sum to 1 across all \theta; see the sketch below.
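Completing the sketch:

# STEP 3: product of prior and likelihood, normalized to sum to 1
grid_data$unnormalized <- grid_data$prior * grid_data$likelihood
grid_data$posterior <- grid_data$unnormalized / sum(grid_data$unnormalized)

# sanity check: the discrete approximation sums to 1
sum(grid_data$posterior)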
We now have a glimpse into the actual posterior pdf.
The more possible \theta values we use, the better we can “see” the resulting posterior. What happens if we try the following: n=50, n=100, n=500, n=1000?
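With the discrete approximation in hand, we can draw the simulated sample \left\{ \theta^{(1)}, ..., \theta^{(N)} \right\} described earlier; a minimal sketch (the sample size, seed, and name post_sample are assumptions):

# sample grid values with probabilities given by the discrete posterior
set.seed(84735)
post_sample <- sample(grid_data$pi_grid, size = 10000,
                      replace = TRUE, prob = grid_data$posterior)

# compare the sample to the known Beta(11, 3) posterior
hist(post_sample, freq = FALSE, breaks = 25)
curve(dbeta(x, 11, 3), add = TRUE)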
Markov chain Monte Carlo (MCMC) is an application of Markov chains to simulate probability models.
MCMC samples are not taken directly from the posterior pdf, f(\theta | y)… and they are not independent.
Suppose we have an N-length MCMC sample, \left\{ \theta^{(1)}, \theta^{(2)}, \theta^{(3)}, ..., \theta^{(N)} \right\}
Each value is drawn from a distribution that depends on the previous value,
f\left( \theta^{(i+1)} | \theta^{(i)}, y \right),
and this conditional pdf is not the posterior pdf itself:
f\left( \theta^{(i+1)} | \theta^{(i)}, y \right) \ne f\left(\theta^{(i+1)}|y \right)
We will build our MCMC simulation with rstan. Specifying a model in rstan notation requires three blocks:
data: in our example, Y is the observed number of successes in 10 trials. We must tell rstan that Y is an integer between 0 and 10.
parameters: in our example, our model depends on \pi. We must tell rstan that \pi can be any real number between 0 and 1.
model: in our example, we need to specify Y \sim \text{Bin}(10, \pi) and \pi \sim \text{Beta}(2,2).
We then simulate the posterior with rstan's stan() function, whose key arguments are listed below (a sketch of the full model definition follows the list):
model_code: the character string defining the model (in our case, bb_model).
data: a list of the observed data.
chains: how many parallel Markov chains to run.
iter: the desired number of iterations, or length of each Markov chain.
seed: used to set the seed of the RNG.
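A sketch of the model definition consistent with the description above (the source refers to this string as bb_model):

# STEP 1: define the model
bb_model <- "
  data {
    int<lower = 0, upper = 10> Y;
  }
  parameters {
    real<lower = 0, upper = 1> pi;
  }
  model {
    Y ~ binomial(10, pi);
    pi ~ beta(2, 2);
  }
"

# STEP 2: simulate the posterior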
bb_sim <- stan(model_code = bb_model, data = list(Y = 9),
chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 4e-06 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.04 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.013 seconds (Warm-up)
Chain 1: 0.014 seconds (Sampling)
Chain 1: 0.027 seconds (Total)
Chain 1:
(Output for Chains 2–4 is analogous and omitted.)
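The chains are stored in a stanfit object; a sketch of how the four-row preview below can be extracted (assuming it came from a call like this):

head(as.array(bb_sim, pars = "pi"), 4)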
, , parameters = pi

          chains
iterations   chain:1   chain:2   chain:3   chain:4
      [1,] 0.8278040 0.7523560 0.6386998 0.9627572
      [2,] 0.9087546 0.8922386 0.6254761 0.9413518
      [3,] 0.6143859 0.8682356 0.7222360 0.9489624
      [4,] 0.8462156 0.8792812 0.8100192 0.9413812
Remember, these are not a random sample from the posterior!
They are also not independent!
Each chain forms a dependent Markov chain of length 5,000: \left\{ \pi^{(1)}, \pi^{(2)}, ..., \pi^{(5000)}\right\}
We can use trace plots (mcmc_trace() from the bayesplot package) to see what the values did longitudinally, and we can visualize the approximate posterior with the mcmc_hist() and mcmc_dens() functions; a sketch follows.
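A minimal sketch of these visualizations (assuming bb_sim from above):

library(bayesplot)

# trace plot: pi across iterations, one line per chain
mcmc_trace(bb_sim, pars = "pi")

# histogram and density of the combined post-warmup values
mcmc_hist(bb_sim, pars = "pi")
mcmc_dens(bb_sim, pars = "pi")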
Let’s now consider a smaller simulation, where n=50 (recall, overall n=100, but half is for burn-in). Run the following code:
bb_sim_short <- stan(model_code = bb_model, data = list(Y = 9),
chains = 4, iter = 50*2, seed = 84735)
SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 1e-06 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.01 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: WARNING: There aren't enough warmup iterations to fit the
Chain 1: three stages of adaptation as currently configured.
Chain 1: Reducing each adaptation stage to 15%/75%/10% of
Chain 1: the given number of warmup iterations:
Chain 1: init_buffer = 7
Chain 1: adapt_window = 38
Chain 1: term_buffer = 5
Chain 1:
Chain 1: Iteration: 1 / 100 [ 1%] (Warmup)
Chain 1: Iteration: 10 / 100 [ 10%] (Warmup)
Chain 1: Iteration: 20 / 100 [ 20%] (Warmup)
Chain 1: Iteration: 30 / 100 [ 30%] (Warmup)
Chain 1: Iteration: 40 / 100 [ 40%] (Warmup)
Chain 1: Iteration: 50 / 100 [ 50%] (Warmup)
Chain 1: Iteration: 51 / 100 [ 51%] (Sampling)
Chain 1: Iteration: 60 / 100 [ 60%] (Sampling)
Chain 1: Iteration: 70 / 100 [ 70%] (Sampling)
Chain 1: Iteration: 80 / 100 [ 80%] (Sampling)
Chain 1: Iteration: 90 / 100 [ 90%] (Sampling)
Chain 1: Iteration: 100 / 100 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0 seconds (Warm-up)
Chain 1: 0 seconds (Sampling)
Chain 1: 0 seconds (Total)
Chain 1:
(Output for Chains 2–4, including the same warmup warning, is analogous and omitted.)
The more a dependent Markov chain behaves like an independent sample, the smaller the error in the resulting posterior approximation.
Effective sample size (N_{\text{eff}}): the number of independent sample values it would take to produce an equivalently accurate posterior approximation.
Effective sample size ratio:
\frac{N_{\text{eff}}}{N}
We will use the neff_ratio() function to find this ratio.
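neff_ratio() comes from the bayesplot package; a sketch of the call (the source does not label which simulations produced the two ratios printed below):

neff_ratio(bb_sim, pars = "pi")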
In our example data,
[1] 0.3462257
[1] 0.4950171
Autocorrelation allows us to evaluate whether our Markov chain sufficiently mimics the behavior of an independent sample.
Autocorrelation: the correlation between Markov chain values a fixed number of steps (the lag) apart; for example, the lag-1 autocorrelation measures the correlation between consecutive values.
Strong autocorrelation or dependence is a bad thing.
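We can check this visually with an autocorrelation plot; a minimal sketch using bayesplot's mcmc_acf():

# autocorrelation of the pi chain at increasing lags
mcmc_acf(bb_sim, pars = "pi")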
No obvious patterns in the trace plot; dependence is relatively weak.
Autocorrelation plot quickly drops off and is effectively 0 by lag 5.
Confirmation that our Markov chain is mixing quickly.
This is an “unhealthy” Markov chain.
Trace plot shows strong trends, indicating autocorrelation in the Markov chain values.
Slow decrease in autocorrelation plot indicates that the dependence between chain values does not quickly fade away.
To reduce the autocorrelation, we can thin our original simulation, bb_sim, to every tenth value using the thin argument in stan():

thinned_sim <- stan(model_code = bb_model, data = list(Y = 9),
chains = 4, iter = 5000*2, seed = 84735, thin = 10)
SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 2e-06 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.02 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.011 seconds (Warm-up)
Chain 1: 0.013 seconds (Sampling)
Chain 1: 0.024 seconds (Total)
Chain 1:
(Output for Chains 2–4 is analogous and omitted.)
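We can then recheck the diagnostics on the thinned simulation; a sketch reusing the functions above:

mcmc_acf(thinned_sim, pars = "pi")
neff_ratio(thinned_sim, pars = "pi")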
A second diagnostic, \hat{R}, compares the variability in sampled \theta values across all chains combined to the variability within each individual chain:
\hat{R} \approx \sqrt{\frac{\text{var}_{\text{combined}}}{\text{var}_{\text{within}}}}
We will use the rhat() function from the bayesplot package to find \hat{R}. We can see that our simulation is stable: a value of \hat{R} close to 1 indicates stability.
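A sketch of the call:

rhat(bb_sim, pars = "pi")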
If we were to find \hat{R} for the other (obviously bad) simulation, it would be 5.35 😱
This material is adapted from the Bayes Rules! textbook.
STA6349 - Applied Bayesian Analysis - Fall 2025