Conjugate Families

July 10, 2025
Thursday

Introduction: Beta-Binomial Model

  • On Tuesday, we learned how to think like a Bayesian.

  • Today, we will formalize the model we muddled through last time.

  • This is called the Beta-Binomial model.

    • The Beta distribution is the prior.
    • The Binomial distribution is the data distribution (or the likelihood).
    • The posterior also follows a Beta distribution.
  • Conjugate family: the prior and posterior belong to the same named family of distributions, just with different parameters.

Example Set Up

  • Consider the following scenario.
    • “Michelle” has decided to run for president and you’re her campaign manager for the state of Florida.
    • As such, you’ve conducted 30 different polls throughout the election season.
    • Though Michelle’s support has hovered around 45%, she polled at around 35% on her dreariest days and around 55% on her best days on the campaign trail.

Example Set Up

  • Past polls provide prior information about \pi, the proportion of Floridians that currently support Michelle.

    • In fact, we can reorganize this information into a formal prior probability model of \pi.
  • In a previous problem, we assumed that \pi could only be 0.2, 0.5, or 0.8, the corresponding chances of which were defined by a discrete probability model.

    • However, in the reality of Michelle’s election support, \pi \in [0, 1].
  • We can reflect this reality and conduct a Bayesian analysis by constructing a continuous prior probability model of \pi.

Example Set Up

  • A reasonable prior is represented by the curve on the right.

    • Notice that this curve preserves the overall information and variability in the past polls, i.e., Michelle’s support, \pi, can be anywhere between 0 and 1, but is most likely around 0.45.

Example Set Up

  • Incorporating this more nuanced, continuous view of Michelle’s support, \pi, will require some new tools.
    • No matter if our parameter \pi is continuous or discrete, the posterior model of \pi will combine insights from the prior and data.
    • \pi isn’t the only variable of interest that lives on [0,1].
  • Maybe we’re interested in modeling the proportion of people that use public transit, the proportion of trains that are delayed, the proportion of people that prefer cats to dogs, etc.
    • The Beta-Binomial model provides the tools we need to study the proportion of interest, \pi, in each of these settings.

Beta Prior

  • In building the Bayesian election model of Michelle’s election support among Floridians, \pi, we begin with the prior.

    • Our continuous prior probability model of \pi is specified by a probability density function (pdf).
  • What values can \pi take and which are more plausible than others?

Beta Prior

  • Let \pi be a random variable, where \pi \in [0, 1].

  • The variability in \pi may be captured by a Beta model with shape hyperparameters \alpha > 0 and \beta > 0,

    • hyperparameter: a parameter used in a prior model.

\pi \sim \text{Beta}(\alpha, \beta),
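
  • For reference, the Beta pdf (stated here for completeness; it reappears in the conjugacy derivation later) is

f(\pi) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \pi^{\alpha-1}(1-\pi)^{\beta-1}, \ \ \ \pi \in [0, 1]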

Beta Prior: Shapes

  • Let’s explore the shape of the Beta:
plot_beta(1, 5) + theme_bw() + ggtitle("Beta(1, 5)")

Beta Prior: Shapes

  • Let’s explore the shape of the Beta:
plot_beta(1, 2) + theme_bw() + ggtitle("Beta(1, 2)")

Beta Prior: Shapes

  • Let’s explore the shape of the Beta:
plot_beta(3, 7) + theme_bw() + ggtitle("Beta(3, 7)")

Beta Prior: Shapes

  • Let’s explore the shape of the Beta:
plot_beta(1, 1) + theme_bw() + ggtitle("Beta(1, 1)")
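
  • Note: plot_beta() comes from the bayesrules package. If it is unavailable, here is a minimal sketch of the same picture drawn directly from dbeta() (assuming ggplot2 is loaded):

library(ggplot2)

# Hypothetical stand-in for plot_beta(1, 5): evaluate the Beta(1, 5) pdf over [0, 1]
ggplot(data.frame(pi = c(0, 1)), aes(x = pi)) +
  stat_function(fun = dbeta, args = list(shape1 = 1, shape2 = 5)) +
  labs(x = expression(pi), y = "density", title = "Beta(1, 5)") +
  theme_bw()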

Beta Prior: Shapes

  • Your turn!

  • How would you describe the typical behavior of a:

    • Beta(\alpha, \beta) variable, \pi, when \alpha=\beta?
    • Beta(\alpha, \beta) variable, \pi, when \alpha>\beta?
    • Beta(\alpha, \beta) variable, \pi, when \alpha<\beta?
  • For which model is there greater variability in the plausible values of \pi, Beta(20, 20) or Beta(5, 5)?

Beta Prior: Shapes

  • How would you describe the typical behavior of a Beta(\alpha, \beta) variable, \pi, when \alpha=\beta?

Beta Prior: Shapes

  • How would you describe the typical behavior of a Beta(\alpha, \beta) variable, \pi, when \alpha>\beta?

Beta Prior: Shapes

  • How would you describe the typical behavior of a Beta(\alpha, \beta) variable, \pi, when \alpha<\beta?

Beta Prior: Shapes

  • For which model is there greater variability in the plausible values of \pi, Beta(20, 20) or Beta(5, 5)?
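
  • A quick numerical check of this question (a sketch; beta_var() is our own helper implementing the standard Beta variance formula):

beta_var <- function(a, b) a * b / ((a + b)^2 * (a + b + 1))
beta_var(20, 20)   # ~0.0061: more concentrated, less variability
beta_var(5, 5)     # ~0.0227: more spread out, greater variability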

Tuning the Beta Prior

  • We can tune the shape hyperparameters (\alpha and \beta) to reflect our prior information about Michelle’s election support, \pi.

  • In our example, we saw that she polled between 35 and 55 percentage points, with an average of 45 percentage points.

    • We want our Beta(\alpha, \beta) prior to have a similar pattern, so we should pick \alpha and \beta such that the expected value of \pi is around 0.45.

E[\pi] = \frac{\alpha}{\alpha+\beta} \approx 0.45

  • Using algebra (from \alpha \approx 0.45(\alpha+\beta), we get 0.55\,\alpha \approx 0.45\,\beta), we can tune the prior and find

\alpha \approx \frac{9}{11} \beta

Tuning the Beta Prior

  • Your turn!

    • Graph the following and determine which is best for the example.
plot_beta(9, 11) + theme_bw()
plot_beta(27, 33) + theme_bw()
plot_beta(45, 55) + theme_bw()
  • Recall, this is what we are going for: a prior centered around 0.45, with most of its mass between roughly 0.35 and 0.55.

Tuning the Beta Prior

plot_beta(9, 11) + theme_bw() + ggtitle("Beta(9, 11)")

Tuning the Beta Prior

plot_beta(27, 33) + theme_bw() + ggtitle("Beta(27, 33)")

Tuning the Beta Prior

plot_beta(45, 55) + theme_bw() + ggtitle("Beta(45, 55)")

Tuning the Beta Prior

  • Now that we have a prior, we “know” some things.

\pi \sim \text{Beta}(45, 55)

  • From the properties of the Beta distribution,

\begin{align*} E[\pi] &= \frac{\alpha}{\alpha + \beta} = \frac{45}{45+55} = 0.45 \\ \text{Var}[\pi] &= \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{(45)(55)}{(45+55)^2(45+55+1)} \approx 0.0025 \end{align*}
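
  • These values can be verified numerically (plain arithmetic in R, using the formulas above):

alpha <- 45
beta  <- 55
alpha / (alpha + beta)                                   # prior mean: 0.45
(alpha * beta) / ((alpha + beta)^2 * (alpha + beta + 1)) # prior variance: ~0.0025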

Binomial Data Model

  • A new poll of n = 50 Floridians recorded Y, the number that support Michelle.
    • The results depend upon \pi (as \pi increases, Y tends to increase).
  • To model the dependence of Y on \pi, we assume
    • voters answer the poll independently of one another;
    • the probability that any polled voter supports Michelle is \pi.
  • Under these assumptions, Y|\pi \sim \text{Bin}(50, \pi), with conditional pmf f(y|\pi) defined for y \in \{0, 1, ..., 50\}:

f(y|\pi) = P[Y = y|\pi] = {50 \choose y} \pi^y (1-\pi)^{50-y}

Binomial Data Model

  • The conditional pmf, f(y|\pi), gives us answers to a hypothetical question:

    • If Michelle’s support were given some value of \pi, then how many of the 50 polled voters (Y=y) might we expect to support her?
  • Let’s look at this graphically:

library(tidyverse)

# Example values (hypothetical choices for illustration): a poll of 50 voters
# and a supposed support level of pi = 0.45
sample_size <- 50
pi_value    <- 0.45

binom_prob <- tibble(n_success = 0:sample_size,   # Y can be 0 through 50
                     prob = dbinom(n_success, size = sample_size, prob = pi_value))

binom_prob %>%
  ggplot(aes(x = n_success, y = prob)) +
  geom_col(width = 0.2) +
  labs(x = "Number of Successes",
       y = "Probability") +
  theme_bw()

Binomial Data Model

  • It is observed that Y=30 of the n=50 polled voters support Michelle.

  • We now want to find the likelihood function – remember that we treat Y=30 as the observed data and \pi as unknown,

\begin{align*} f(y|\pi) &= {50 \choose y} \pi^y (1-\pi)^{50-y} \\ L(\pi|y=30) &= {50 \choose 30} \pi^{30} (1-\pi)^{20} \end{align*}

  • This is valid for \pi \in [0, 1].

Binomial Data Model

  • What is the likelihood of 30/50 voters supporting Michelle?
dbinom(x = 30, size = 50, prob = pi)   # substitute a numeric value for pi (e.g., 0.45)
  • You try this for \pi = \{0.25, 0.50, 0.75\}.
dbinom(30, 50, 0.25)
dbinom(30, 50, 0.5)
dbinom(30, 50, 0.75)

Binomial Data Model

  • What is the likelihood of 30/50 voters supporting Michelle?
dbinom(30, 50, 0.25)
[1] 1.29633e-07
dbinom(30, 50, 0.5)
[1] 0.04185915
dbinom(30, 50, 0.75)
[1] 0.007654701

Binomial Data Model

  • Challenge!

  • Create a graph showing what happens to the likelihood for different values of \pi.

    • i.e., have \pi on the x-axis and likelihood on the y-axis.
  • To get you started,

graph <- tibble(pi = seq(0, 1, 0.001)) %>%
  mutate(likelihood = dbinom(30, 50, pi))

Binomial Data Model

  • Create a graph showing what happens to the likelihood for different values of \pi.

  • Where is the maximum?

Binomial Data Model

  • Where is the maximum?
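
  • One possible completion of the challenge (a sketch; it reuses the starter tibble from above, and the dashed line marks the sample proportion 30/50 = 0.6, where the Binomial likelihood is maximized):

library(tidyverse)

graph <- tibble(pi = seq(0, 1, 0.001)) %>%
  mutate(likelihood = dbinom(30, 50, pi))

graph %>%
  ggplot(aes(x = pi, y = likelihood)) +
  geom_line() +
  geom_vline(xintercept = 30 / 50, linetype = "dashed") +  # maximum at pi = 0.6
  labs(x = expression(pi), y = "Likelihood") +
  theme_bw()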

The Beta Posterior Model

  • Looking at just the prior and the data distributions,

  • The prior is a bit more pessimistic about Michelle’s election support than the data obtained from the latest poll.

The Beta Posterior Model

  • Now including the posterior,

  • We can see that the posterior model of \pi is continuous and lives on [0, 1].

  • The posterior also appears to follow a Beta(\alpha, \beta) model.

    • The shape parameters (\alpha and \beta) have been updated.

The Beta Posterior Model

  • If we were to collect more information about Michelle’s support, we would use the current posterior as the new prior, then update our posterior.

    • How do we know what the updated parameters are?
summarize_beta_binomial(alpha = 45, beta = 55, y = 30, n = 50)

The Beta Posterior Model

  • We used Michelle’s election support to understand the Beta-Binomial model.

  • Let’s now generalize it for any appropriate situation.

\begin{align*} Y|\pi &\sim \text{Bin}(n, \pi) \\ \pi &\sim \text{Beta}(\alpha, \beta) \\ \pi | (Y=y) &\sim \text{Beta}(\alpha+y, \beta+n-y) \end{align*}

  • We can see that the posterior distribution reveals the influence of the prior (\alpha and \beta) and data (y and n).

The Beta Posterior Model

  • Under this updated distribution,

\pi | (Y=y) \sim \text{Beta}(\alpha+y, \beta+n-y)

  • we have updated moments:

\begin{align*} E[\pi | Y = y] &= \frac{\alpha + y}{\alpha + \beta + n} \\ \text{Var}[\pi|Y=y] &= \frac{(\alpha+y)(\beta+n-y)}{(\alpha+\beta+n)^2(\alpha+\beta+1)} \end{align*}
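
  • As a sketch, the update can also be carried out by hand for the running example (update_beta() is our own helper, not a bayesrules function):

# Beta-Binomial conjugate update: Beta(alpha, beta) prior + Binomial(n, pi) data
update_beta <- function(alpha, beta, y, n) {
  a <- alpha + y
  b <- beta + n - y
  c(alpha = a, beta = b,
    mean = a / (a + b),
    var  = a * b / ((a + b)^2 * (a + b + 1)))
}

update_beta(alpha = 45, beta = 55, y = 30, n = 50)
# posterior is Beta(75, 75): mean 0.5, variance ~0.0017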

The Beta Posterior Model

  • Let’s pause and think about this from a theoretical standpoint.

  • The Beta distribution is a conjugate prior for the Binomial likelihood.

    • Conjugate prior: the posterior is from the same model family as the prior.
  • Recall the Beta prior pdf, f(\pi),

f(\pi) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \pi^{\alpha-1}(1-\pi)^{\beta-1}

  • and the likelihood function, L(\pi|y),

L(\pi|y) = {n \choose y} \pi^y (1-\pi)^{n-y}

The Beta Posterior Model

  • We can put the prior and likelihood together to create the posterior,

\begin{align*} f(\pi|y) &\propto f(\pi)L(\pi|y) \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \pi^{\alpha-1}(1-\pi)^{\beta-1} \times {n \choose y} \pi^y (1-\pi)^{n-y} \\ &\propto \pi^{(\alpha+y)-1} (1-\pi)^{(\beta+n-y)-1} \end{align*}

  • Up to a normalizing constant, this is a Beta(\alpha+y, \beta+n-y) density; including the constant, the posterior pdf is

f(\pi|y) = \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+y) \Gamma(\beta+n-y)} \pi^{(\alpha+y)-1} (1-\pi)^{(\beta+n-y)-1}

Wrap Up: Beta-Binomial Model

  • We have built the Beta-Binomial model for \pi, an unknown proportion.

\begin{equation*} \begin{aligned} Y|\pi &\sim \text{Bin}(n,\pi) \\ \pi &\sim \text{Beta}(\alpha,\beta) & \end{aligned} \Rightarrow \begin{aligned} && \pi | (Y=y) &\sim \text{Beta}(\alpha+y, \beta+n-y) \\ \end{aligned} \end{equation*}

  • The prior model, f(\pi), is given by Beta(\alpha,\beta).

  • The data model, f(Y|\pi), is given by Bin(n,\pi).

  • The likelihood function, L(\pi|y), is obtained by plugging y into the Binomial pmf.

  • The posterior model is a Beta distribution with updated parameters \alpha+y and \beta+n-y.

Introduction: Gamma-Poisson Model

  • Recall the Beta-Binomial model,

    • y \sim \text{Bin}(n, \pi) (data distribution)
    • \pi \sim \text{Beta}(\alpha, \beta) (prior distribution)
    • \pi|y \sim \text{Beta}(\alpha+y, \beta+n-y) (posterior distribution)
  • The Beta-Binomial model is from a conjugate family (i.e., the posterior is from the same model family as the prior).

  • Now, we will learn about the Gamma-Poisson, another conjugate family.

Example Set Up

  • Suppose we are now interested in modeling the number of spam calls we receive.

    • This means that we are modeling the rate, \lambda.
  • We take a guess and say that the most likely value of \lambda is around 5,

    • … but reasonably ranges between 2 and 7 calls per day.
  • Why can’t we use the Beta distribution as our prior distribution?

    • \lambda is the mean of a count \to \lambda \in \mathbb{R}^+ \to \lambda is not limited to [0, 1] \to broken assumption for Beta distribution.
  • Why can’t we use the binomial distribution as our data distribution?

    • Y_i is a count with no fixed upper bound \to Y_i \in \{0, 1, 2, ...\} \to there is no fixed number of trials n \to broken assumption for the Binomial distribution.

Poisson Data Model

  • We will use the Poisson distribution to model the number of spam calls

Y \in \{0, 1, 2, ...\}

  • Y is the number of independent events that occur in a fixed amount of time or space.

  • \lambda > 0 is the rate at which these events occur.

  • Mathematically,

Y | \lambda \sim \text{Pois}(\lambda),

  • with pmf,

f(y|\lambda) = \frac{\lambda^y e^{-\lambda}}{y!}, \ \ \ y \in \{0,1, 2, ... \}
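
  • As with dbinom() earlier, dpois() evaluates this pmf in R; for example, with the prior guess \lambda = 5 from the setup:

dpois(3, lambda = 5)   # P(Y = 3 | lambda = 5), roughly 0.14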

Gamma Prior

  • If \lambda is a continuous random variable that can take on any positive value (\lambda > 0), then the variability may be modeled with the Gamma distribution with
    • shape hyperparameter s>0
    • rate hyperparameter r>0.
  • Thus,

\lambda \sim \text{Gamma}(s, r)

  • and the Gamma pdf is given by

f(\lambda) = \frac{r^s}{\Gamma(s)} \lambda^{s-1} e^{-r\lambda}

Gamma Prior: Shapes

  • Let’s explore the shape of the Gamma:
plot_gamma(1, 1) + theme_bw() + ggtitle("Gamma(1, 1)")

Gamma Prior: Shapes

  • Let’s explore the shape of the Gamma:
plot_gamma(2, 1) + theme_bw() + ggtitle("Gamma(2, 1)")

Gamma Prior: Shapes

  • Let’s explore the shape of the Gamma:
plot_gamma(10, 1) + theme_bw() + ggtitle("Gamma(10, 1)")

Gamma Prior: Shapes

  • Let’s explore the shape of the Gamma:
plot_gamma(10, 10) + theme_bw() + ggtitle("Gamma(10, 10)")

Gamma Prior: Shapes

  • What happens when we increase the shape parameter, s?

Gamma Prior: Shapes

  • What happens when we increase the rate parameter, r?

Gamma Prior: Shapes

  • Putting these on the same scale,
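
  • A sketch of that comparison drawn directly from dgamma() (assuming ggplot2 is loaded; the four priors are the ones plotted above):

library(ggplot2)

lambda <- seq(0.01, 20, by = 0.01)
dens <- rbind(
  data.frame(lambda, density = dgamma(lambda, shape = 1,  rate = 1),  prior = "Gamma(1, 1)"),
  data.frame(lambda, density = dgamma(lambda, shape = 2,  rate = 1),  prior = "Gamma(2, 1)"),
  data.frame(lambda, density = dgamma(lambda, shape = 10, rate = 1),  prior = "Gamma(10, 1)"),
  data.frame(lambda, density = dgamma(lambda, shape = 10, rate = 10), prior = "Gamma(10, 10)"))

ggplot(dens, aes(x = lambda, y = density, color = prior)) +
  geom_line() +
  labs(x = expression(lambda), y = "density") +
  theme_bw()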

Tuning the Gamma Prior

  • Let’s now tune our prior.

  • We are assuming \lambda \approx 5, somewhere between 2 and 7.

  • We know the mean of the gamma distribution,

E(\lambda) = \frac{s}{r} \approx 5 \to 5r \approx s

  • Your turn! Use the plot_gamma() function to figure out which values of s and r we need.

Tuning the Gamma Prior

  • Looking at different values:
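
  • A sketch of how those candidates could be drawn with plot_gamma() (the four priors are the ones revisited later in the lecture, all of which satisfy s \approx 5r):

plot_gamma(5, 1)  + theme_bw() + ggtitle("Gamma(5, 1)")
plot_gamma(10, 2) + theme_bw() + ggtitle("Gamma(10, 2)")
plot_gamma(15, 3) + theme_bw() + ggtitle("Gamma(15, 3)")
plot_gamma(20, 4) + theme_bw() + ggtitle("Gamma(20, 4)")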

Poisson Data Model

  • We will be taking samples from different days.
    • We assume that the daily number of calls may differ from day to day. On each day i,

Y_i|\lambda \sim \text{Pois}(\lambda)

  • Each day i contributes a pmf of the same form,

f(y_i|\lambda) = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!}

  • But really, we are interested in the joint information in our sample of n observations.
    • The joint pmf gives us this information.

Poisson Data Model

  • The joint pmf for the Poisson,

\begin{align*} f\left(\overset{\to}{y}|\lambda\right) &= \prod_{i=1}^n f(y_i|\lambda) \\ &= f(y_1|\lambda) \times f(y_2|\lambda) \times \cdots \times f(y_n|\lambda) \\ &= \frac{\lambda^{y_1}e^{-\lambda}}{y_1!} \times \frac{\lambda^{y_2}e^{-\lambda}}{y_2!} \times \cdots \times \frac{\lambda^{y_n}e^{-\lambda}}{y_n!} \\ &= \frac{\left( \lambda^{y_1} \lambda^{y_2} \cdots \lambda^{y_n} \right) \left( e^{-\lambda} e^{-\lambda} \cdots e^{-\lambda}\right)}{y_1! y_2! \cdots y_n!} \\ &= \frac{\lambda^{\sum y_i}e^{-n\lambda}}{\prod_{i=1}^n y_i !} \end{align*}

Poisson Data Model

  • If the joint pmf for the Poisson is

f\left(\overset{\to}{y}|\lambda\right) = \frac{\lambda^{\sum y_i}e^{-n\lambda}}{\prod_{i=1}^n y_i !}

  • then the likelihood function for \lambda > 0 is

\begin{align*} L\left(\lambda|\overset{\to}{y}\right) &= \frac{\lambda^{\sum y_i}e^{-n\lambda}}{\prod_{i=1}^n y_i !} \\ & \propto \lambda^{\sum y_i} e^{-n\lambda} \end{align*}

  • Please see page 102 in the textbook for full derivations.

Gamma-Poisson Conjugacy

  • Let \lambda > 0 be an unknown rate parameter and (Y_1, Y_2, ... , Y_n) be an independent sample from the Poisson distribution.

  • The Gamma-Poisson Bayesian model is as follows:

\begin{align*} Y_i | \lambda &\overset{ind}\sim \text{Pois}(\lambda) \\ \lambda &\sim \text{Gamma}(s, r) \\ \lambda | \overset{\to}y &\sim \text{Gamma}\left( s + \sum y_i, r + n \right) \end{align*}

  • The proof can be seen in section 5.2.4 of the textbook.

Gamma-Poisson Conjugacy

  • Suppose we use Gamma(10, 2) as the prior for \lambda, the daily rate of calls.

  • On four separate days in the second week of August (i.e., independent days), we received \overset{\to}y = (6, 2, 2, 1) calls.

  • We will use the plot_poisson_likelihood() function:

plot_poisson_likelihood(y = c(6, 2, 2, 1), lambda_upper_bound = 10)
  • Notes:
    • lambda_upper_bound limits the x axis – recall that \lambda \in (0, \infty)!
    • lambda_upper_bound’s default value is 10.

Gamma-Poisson Conjugacy

  • We can see that the sample average is 2.75 calls per day.
mean(c(6, 2, 2, 1))
[1] 2.75

Gamma-Poisson Conjugacy

  • We know our prior distribution is Gamma(10, 2), and the observed data give \sum y_i = 11 calls over n = 4 days.

  • Thus, the posterior is as follows,

\begin{align*} \lambda | \overset{\to}y &\sim \text{Gamma}\left( s + \sum y_i, r + n \right) \\ &\sim \text{Gamma}\left(10 + 11, 2 + 4 \right) \\ &\sim \text{Gamma}\left(21, 6 \right) \end{align*}
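
  • A quick check of what this update implies (plain arithmetic; the comparison labels are ours):

s <- 10; r <- 2            # prior Gamma(10, 2)
y <- c(6, 2, 2, 1)
n <- length(y)

s / r                      # prior mean: 5 calls per day
mean(y)                    # observed mean: 2.75 calls per day
(s + sum(y)) / (r + n)     # posterior mean: 21 / 6 = 3.5, a compromise between the two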

The Gamma Posterior Model

  • Looking at just the prior and the data distributions,

  • The prior expects more spam calls than what we observed.

The Gamma Posterior Model

  • Now including the posterior,

  • The posterior also follows a Gamma(s, r) model.

    • The shape and rate parameters (s and r) have been updated.

The Gamma Posterior Model

  • The plot_gamma_poisson() function:
# Template: fill in your prior (shape, rate) and data (sum_y, n)
plot_gamma_poisson(shape = prior_s, rate = prior_r,
                   sum_y = sum_of_obs, n = sample_size,
                   posterior = TRUE) +   # set to FALSE to show only the prior and likelihood
  theme_bw()
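
  • Applied to the running example (prior Gamma(10, 2); \sum y_i = 11 over n = 4 days):

plot_gamma_poisson(shape = 10, rate = 2,
                   sum_y = 11, n = 4,
                   posterior = TRUE) +
  theme_bw()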

The Gamma Posterior Model

  • Your turn! What is different if we had used one of the other priors?

  • Recall, we considered

    • Gamma(5, 1)
    • Gamma(10, 2)
    • Gamma(15, 3)
    • Gamma(20, 4)

The Gamma Posterior Model

  • Your turn! What is different if we had used Gamma(15, 3) as our prior?

The Gamma Posterior Model

  • We can use the summarize_gamma_poisson() function to summarize the distribution,
summarize_gamma_poisson(shape = 10, rate = 2, sum_y = 11, n = 4)

Wrap Up: Gamma-Poisson Model

  • We have built the Gamma-Poisson model for \lambda, an unknown rate.

\begin{equation*} \begin{aligned} Y|\lambda &\sim \text{Pois}(\lambda) \\ \lambda &\sim \text{Gamma}(s, r) & \end{aligned} \Rightarrow \begin{aligned} && \lambda | \overset{\to}y &\sim \text{Gamma}\left(s + \sum y_i, r + n\right) \\ \end{aligned} \end{equation*}

  • The prior model, f(\lambda), is given by Gamma(s, r).

  • The data model, f(Y|\lambda), is given by Pois(\lambda).

  • The posterior model is a Gamma distribution with updated parameters s+\sum y_i and r + n.

Introduction: Normal-Normal Model

  • So far, we have learned two conjugate families:
    • Beta-Binomial (binary outcomes)
      • y \sim \text{Bin}(n, \pi) (data distribution)
      • \pi \sim \text{Beta}(\alpha, \beta) (prior distribution)
      • \pi|y \sim \text{Beta}(\alpha+y, \beta+n-y) (posterior distribution)
    • Gamma-Poisson (count outcomes)
      • Y_i | \lambda \overset{ind}\sim \text{Pois}(\lambda) (data distribution)
      • \lambda \sim \text{Gamma}(s, r) (prior distribution)
      • \lambda | \overset{\to}y \sim \text{Gamma}\left( s + \sum y_i, r + n \right) (posterior distribution)
  • Now, we will learn about another conjugate family, the Normal-Normal, for continuous outcomes.

Example Set Up

  • As scientists learn more about brain health, the dangers of concussions are gaining greater attention.

  • We are interested in \mu, the average volume (in cm³) of a specific part of the brain: the hippocampus.

  • Wikipedia tells us that among the general population of human adults, each half of the hippocampus has volume between 3.0 and 3.5 cm³.

    • Total hippocampal volume of both sides of the brain is between 6 and 7 cm³.
    • Let’s assume that the mean hippocampal volume among people with a history of concussions is also somewhere between 6 and 7 cm³.
  • We will take a sample of n=25 participants and update our belief.

The Normal Model

  • Let Y \in \mathbb{R} be a continuous random variable.
    • The variability in Y may be represented with a Normal model with mean parameter \mu \in \mathbb{R} and standard deviation parameter \sigma > 0.
  • The Normal model’s pdf is as follows,

f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left\{ \frac{-(y-\mu)^2}{2\sigma^2} \right\}

The Normal Model

  • If we vary \mu,

The Normal Model

  • If we vary \sigma,

The Normal Model

  • Our data model is as follows,

Y_i | \mu \sim N(\mu, \sigma^2)

  • The joint pdf is as follows,

f(\overset{\to}y | \mu) = \prod_{i=1}^n f(y_i | \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left\{ \frac{-(y_i-\mu)^2}{2\sigma^2} \right\}

  • Meaning the likelihood is as follows,

L(\mu|\overset{\to}y) \propto \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left\{ \frac{-(y_i-\mu)^2}{2\sigma^2} \right\} \propto \exp \left\{ \frac{- \sum_{i=1}^n(y_i-\mu)^2}{2\sigma^2} \right\}

The Normal Model

  • Our data model is as follows,

Y_i | \mu \sim N(\mu, \sigma^2)

  • Returning to our brain analysis, we will assume that the hippocampal volumes of our n = 25 subjects have a normal distribution with mean \mu and standard deviation \sigma.
    • Right now, we are only interested in \mu, so we assume \sigma = 0.5 cm³.
    • This choice suggests that most people’s hippocampal volumes fall within 2\sigma = 1 cm³ of the mean.

Normal Prior

  • We know that with Y_i | \mu \sim N(\mu, \sigma^2), \mu \in \mathbb{R}.
    • We think a normal prior for \mu is reasonable.
  • Thus, we assume that \mu has a normal distribution around some mean, \theta, with standard deviation, \tau.

\mu \sim N(\theta, \tau^2),

  • meaning that \mu has prior pdf

f(\mu) = \frac{1}{\sqrt{2 \pi \tau^2}} \exp \left\{ \frac{-(\mu - \theta)^2}{2 \tau^2} \right\}

Tuning the Normal Prior

  • We can tune the hyperparameters \theta and \tau to reflect our understanding and uncertainty about the average hippocampal volume (\mu) among people with a history of concussions.

  • Wikipedia showed us that hippocampal volumes tend to be between 6 and 7 cm³ \to \theta=6.5.

  • Once we set the standard deviation, we can check the plausible range of values of \mu:

    • Follow up: why 2?

\theta \pm 2 \times \tau

  • If we assume \tau=0.4,

(6.5 \pm 2 \times 0.4) = (5.7, 7.3)

Tuning the Normal Prior

  • Thus, our tuned prior is \mu \sim N(6.5, 0.4^2).

  • This range incorporates our uncertainty; it is a bit wider than the Wikipedia range.

Normal-Normal Conjugacy

  • Let \mu \in \mathbb{R} be an unknown mean parameter and (Y_1, Y_2, ..., Y_n) be an independent N(\mu, \sigma^2) sample where \sigma is assumed to be known.

  • The Normal-Normal Bayesian model is as follows:

\begin{align*} Y_i | \mu &\overset{\text{iid}} \sim N(\mu, \sigma^2) \\ \mu &\sim N(\theta, \tau^2) \\ \mu | \overset{\to}y &\sim N\left( \theta \frac{\sigma^2}{n\tau^2 + \sigma^2} + \bar{y} \frac{n\tau^2}{n\tau^2 + \sigma^2}, \frac{\tau^2 \sigma^2}{n \tau^2 + \sigma^2} \right) \end{align*}

Normal-Normal Conjugacy

  • Let’s think about our posterior and some implications,

\mu | \overset{\to}y \sim N\left( \theta \frac{\sigma^2}{n\tau^2 + \sigma^2} + \bar{y} \frac{n\tau^2}{n\tau^2 + \sigma^2}, \frac{\tau^2 \sigma^2}{n \tau^2 + \sigma^2} \right)

  • What happens as n increases?

Normal-Normal Conjugacy

  • Let’s think about our posterior and some implications,

\mu | \overset{\to}y \sim N\left( \theta \frac{\sigma^2}{n\tau^2 + \sigma^2} + \bar{y} \frac{n\tau^2}{n\tau^2 + \sigma^2}, \frac{\tau^2 \sigma^2}{n \tau^2 + \sigma^2} \right)

  • What happens as n increases?

\begin{align*} \frac{\sigma^2}{n\tau^2 + \sigma^2} &\to 0 \\ \frac{n\tau^2}{n\tau^2 + \sigma^2} &\to 1 \\ \frac{\tau^2 \sigma^2}{n \tau^2 + \sigma^2} &\to 0 \end{align*}

Normal-Normal Conjugacy

  • Let’s think about our posterior and some implications,

\mu | \overset{\to}y \sim N\left( \theta \frac{\sigma^2}{n\tau^2 + \sigma^2} + \bar{y} \frac{n\tau^2}{n\tau^2 + \sigma^2}, \frac{\tau^2 \sigma^2}{n \tau^2 + \sigma^2} \right)

\begin{align*} \frac{\sigma^2}{n\tau^2 + \sigma^2} &\to 0 \\ \frac{n\tau^2}{n\tau^2 + \sigma^2} &\to 1 \\ \frac{\tau^2 \sigma^2}{n \tau^2 + \sigma^2} &\to 0 \end{align*}

  • The posterior mean places less weight on the prior mean and more weight on the sample mean \bar{y}.

  • The posterior certainty about \mu increases and becomes more in sync with the data.

The Normal Posterior Model

  • Let us now apply this to our example.

  • We have our prior model, \mu \sim N(6.5, 0.4^2).

  • Let’s look at the football dataset in the bayesrules package.

data(football)
concussion_subjects <- football %>% 
  filter(group == "fb_concuss")
  • What is the average hippocampal volume?

The Normal Posterior Model

  • Let us now apply this to our example.

  • We have our prior model, \mu \sim N(6.5, 0.4^2).

  • Let’s look at the football dataset in the bayesrules package.

data(football)
concussion_subjects <- football %>% 
  filter(group == "fb_concuss")
  • What is the average hippocampal volume?
mean(concussion_subjects$volume)
[1] 5.7346

The Normal Posterior Model

  • We can also plot the density!
concussion_subjects %>% ggplot(aes(x = volume)) + geom_density() + theme_bw()

The Normal Posterior Model

  • Now, we can plug in the information we have (n = 25, \bar{y} = 5.735, \sigma = 0.5) into our likelihood,

L(\mu|\overset{\to}y) \propto \exp \left\{ \frac{-(5.735 - \mu)^2}{2(0.5^2/25)} \right\}

The Normal Posterior Model

  • We are now ready to put together our posterior:
    • Data distribution, Y_i | \mu \overset{\text{iid}} \sim N(\mu, \sigma^2)
    • Prior distribution, \mu \sim N(\theta, \tau^2)
    • Posterior distribution, \mu | \overset{\to}y \sim N\left( \theta \frac{\sigma^2}{n\tau^2 + \sigma^2} + \bar{y} \frac{n\tau^2}{n\tau^2 + \sigma^2}, \frac{\tau^2 \sigma^2}{n \tau^2 + \sigma^2} \right)
  • Given our information (\theta=6.5, \tau=0.4, n=25, \bar{y}=5.735, \sigma=0.5), our posterior is

\mu | \overset{\to}y \sim N\left( \theta \frac{\sigma^2}{n\tau^2 + \sigma^2} + \bar{y} \frac{n\tau^2}{n\tau^2 + \sigma^2}, \frac{\tau^2 \sigma^2}{n \tau^2 + \sigma^2} \right)

The Normal Posterior Model

  • Given our information (\theta=6.5, \tau=0.4, n=25, \bar{y}=5.735, \sigma=0.5), our posterior is

\begin{align*} \mu | \overset{\to}y &\sim N\left( \theta \frac{\sigma^2}{n\tau^2 + \sigma^2} + \bar{y} \frac{n\tau^2}{n\tau^2 + \sigma^2}, \frac{\tau^2 \sigma^2}{n \tau^2 + \sigma^2} \right) \\ &\sim N\left( 6.5 \frac{0.5^2}{25 \cdot 0.4^2 + 0.5^2} + 5.735 \frac{25 \cdot 0.4^2}{25 \cdot 0.4^2 + 0.5^2}, \frac{0.4^2 \cdot 0.5^2}{25 \cdot 0.4^2 + 0.5^2} \right) \\ &\sim N(6.5 \cdot 0.0588 + 5.735 \cdot 0.9412, 0.097^2) \\ &\sim N(5.78, 0.097^2) \end{align*}

  • Looking at the posterior mean, we can see the weights:
    • about 94% on the data mean and 6% on the prior mean.
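
  • The same arithmetic in R (a short sketch; the values are the ones given above):

theta <- 6.5; tau <- 0.4        # prior N(6.5, 0.4^2)
sigma <- 0.5; n <- 25; y_bar <- 5.735

w_data  <- n * tau^2 / (n * tau^2 + sigma^2)    # ~0.941, weight on the data mean
w_prior <- sigma^2   / (n * tau^2 + sigma^2)    # ~0.059, weight on the prior mean

theta * w_prior + y_bar * w_data                # posterior mean: ~5.78
sqrt(tau^2 * sigma^2 / (n * tau^2 + sigma^2))   # posterior sd:   ~0.097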

The Normal Posterior Model

  • Looking at just the prior and data distributions,

The Normal Posterior Model

  • Now including the posterior,

The Normal Posterior Model

  • We can use the summarize_normal_normal() function to summarize the distribution,
summarize_normal_normal(mean = 6.5, sd = 0.4, sigma = 0.5, y_bar = 5.735, n = 25) 

Wrap Up: Normal-Normal Model

  • We have built the Normal-Normal model for \mu, an unknown mean.

\begin{equation*} \begin{aligned} Y_i | \mu &\overset{\text{iid}} \sim N(\mu, \sigma^2) \\ \mu &\sim N(\theta, \tau^2) & \end{aligned} \Rightarrow \begin{aligned} && \mu | \overset{\to}y &\sim N\left( \theta \frac{\sigma^2}{n\tau^2 + \sigma^2} + \bar{y} \frac{n\tau^2}{n\tau^2 + \sigma^2}, \frac{\tau^2 \sigma^2}{n \tau^2 + \sigma^2} \right) \\ \end{aligned} \end{equation*}

  • The prior model, f(\mu), is given by N(\theta,\tau^2).

  • The data model, f(Y|\mu), is given by N(\mu, \sigma^2).

  • The posterior model is a Normal distribution with updated parameters

    • mean = \theta \frac{\sigma^2}{n\tau^2 + \sigma^2} + \bar{y} \frac{n\tau^2}{n\tau^2 + \sigma^2}
    • variance = \frac{\tau^2 \sigma^2}{n \tau^2 + \sigma^2}

Wrap Up

  • Today we have learned about three conjugate families.

    • Beta-Binomial: binary outcomes
    • Gamma-Poisson: count outcomes
    • Normal-Normal: continuous outcomes
  • While we are not forced to analyze our data using conjugate families, our lives are much easier when we can use the known relationships.

  • Note that while we did not do so in this lecture, we can now move forward with drawing conclusions from the posterior distribution.

    • Probabilities
    • Inference

Homework

  • 3.3
  • 3.9
  • 3.10
  • 3.18
  • 5.3
  • 5.5
  • 5.6
  • 5.9
  • 5.10