On Monday, we talked about the Beta-Binomial model for binary outcomes with an unknown probability of success, \pi.
We will now discuss sequentiality in Bayesian analyses.
Working example:
In Alison Bechdel’s 1985 comic strip The Rule, a character states that they only see a movie if it satisfies the following three rules (Bechdel 1986):
the movie has to have at least two women in it;
these two women talk to each other; and
they talk about something besides a man.
These criteria constitute the Bechdel test for the representation of women in film.
Thinking of movies you’ve watched, what percentage of all recent movies do you think pass the Bechdel test? Is it closer to 10%, 50%, 80%, or 100%?
Introduction
Let \pi, a random value between 0 and 1, denote the unknown proportion of recent movies that pass the Bechdel test.
Three friends - the feminist, the clueless, and the optimist - have some prior ideas about \pi.
Reflecting upon movies that they have seen in the past, the feminist understands that the majority lack strong women characters.
The clueless doesn’t really recall the movies they’ve seen, and so is unsure whether passing the Bechdel test is common or uncommon.
Lastly, the optimist thinks that the Bechdel test is a really low bar for the representation of women in film, and thus assumes almost all movies pass the test.
The analysts agree to review a sample of n recent movies and record Y, the number that pass the Bechdel test.
Because the outcome is yes/no, the binomial distribution is appropriate for the data distribution.
We aren’t sure what the population proportion, \pi, is, so we will not restrict it to a fixed value.
Because we know \pi \in [0, 1], the beta distribution is appropriate for the prior distribution.
From the previous chapter, we know that this results in the following posterior distribution
\pi | (Y=y) \sim \text{Beta}(\alpha+y, \beta+n-y)
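To see this update in code, here is a minimal R helper; update_beta() is a hypothetical function written for these notes, not part of the bayesrules package:

update_beta <- function(alpha, beta, y, n) {
  # Conjugate update: a Beta(alpha, beta) prior combined with y successes
  # in n binomial trials yields a Beta(alpha + y, beta + n - y) posterior.
  c(alpha = alpha + y, beta = beta + n - y)
}

update_beta(alpha = 2, beta = 2, y = 3, n = 10)
# alpha  beta
#     5     9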
Wait!!
Everyone gets their own prior?
… is there a “correct” prior?
…… is the Bayesian world always this subjective?
More clearly defined questions that we can actually answer:
To what extent might different priors lead the analysts to three different posterior conclusions about the Bechdel test?
How might this depend upon the sample size and outcomes of the movie data they collect?
To what extent will the analysts’ posterior understandings evolve as they collect more and more data?
Will they ever come to agreement about the representation of women in film?!
Different Priors \to Different Posteriors
The differing prior means show disagreement about whether \pi is closer to 0 or 1.
The differing levels of prior variability show that the analysts have different degrees of certainty in their prior information.
The more certain we are about the prior information, the smaller the prior variability.
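As a concrete check, we can compute each prior’s mean and standard deviation from the Beta formulas, E(\pi) = \alpha/(\alpha+\beta) and \text{Var}(\pi) = \alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]. A small R sketch, using the three analysts’ priors that appear below; beta_summary() is a helper written here for illustration:

beta_summary <- function(alpha, beta) {
  # Mean and standard deviation of a Beta(alpha, beta) distribution
  c(mean = alpha / (alpha + beta),
    sd = sqrt(alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))))
}

beta_summary(5, 11)   # the feminist: mean 0.31, sd 0.11
beta_summary(1, 1)    # the clueless: mean 0.50, sd 0.29
beta_summary(14, 1)   # the optimist: mean 0.93, sd 0.06

The optimist’s small standard deviation signals an informative prior; the clueless’s large one signals a vague prior.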
Informative prior: reflects specific information about the unknown variable with high certainty, i.e., low variability.
Vague or diffuse prior: reflects little specific information about the unknown variable.
A flat prior, which assigns equal prior plausibility to all possible values of the variable, is a special case.
This is effectively saying “🤷.”
Okay, great - we have different priors.
How do the different priors affect the posterior?
We have data from FiveThirtyEight, reporting results of the Bechdel test.
library(dplyr)

set.seed(65821)
bechdel20 <- bayesrules::bechdel %>%
  sample_n(20)
head(bechdel20, n = 3)
# A tibble: 3 × 3
   year title               binary
  <dbl> <chr>               <chr>
1  2013 Her                 FAIL
2  1997 Grosse Pointe Blank PASS
3  2006 Volver              PASS
So how many pass the test in this sample?
library(janitor)

bechdel20 %>%
  tabyl(binary) %>%
  adorn_totals("row")
 binary  n percent
   FAIL 11    0.55
   PASS  9    0.45
  Total 20    1.00
Let’s look at the graphs of just the prior and likelihood.
library(bayesrules)
library(ggplot2)

plot_beta_binomial(alpha = 5, beta = 11, y = 9, n = 20, posterior = FALSE) + theme_bw()
plot_beta_binomial(alpha = 1, beta = 1, y = 9, n = 20, posterior = FALSE) + theme_bw()
plot_beta_binomial(alpha = 14, beta = 1, y = 9, n = 20, posterior = FALSE) + theme_bw()
Questions to think about:
Whose posterior do you anticipate will look the most like the scaled likelihood?
Whose do you anticipate will look the least like the scaled likelihood?
Find the posterior distributions. (i.e., What are the updated parameters?)
Analyst        Prior        Posterior
the feminist   Beta(5, 11)  Beta(14, 22)
the clueless   Beta(1, 1)   Beta(10, 12)
the optimist   Beta(14, 1)  Beta(23, 12)
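These updates can be verified with summarize_beta_binomial() from bayesrules, which reports the prior and posterior summaries side by side:

summarize_beta_binomial(alpha = 5, beta = 11, y = 9, n = 20)   # feminist: Beta(14, 22)
summarize_beta_binomial(alpha = 1, beta = 1, y = 9, n = 20)    # clueless: Beta(10, 12)
summarize_beta_binomial(alpha = 14, beta = 1, y = 9, n = 20)   # optimist: Beta(23, 12)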
Let’s now explore what the posteriors look like.
plot_beta_binomial(alpha = 5, beta = 11, y = 9, n = 20) + theme_bw()
plot_beta_binomial(alpha = 1, beta = 1, y = 9, n = 20) + theme_bw()
plot_beta_binomial(alpha = 14, beta = 1, y = 9, n = 20) + theme_bw()
Different Data \to Different Posteriors
In addition to the prior affecting our posterior distribution… the data also affect it.
Let’s now consider three new analysts: they all share the optimistic Beta(14, 1) prior for \pi; however, they have access to different data.
Morteza reviews n = 13 movies from the year 1991, among which Y=6 (about 46%) pass the Bechdel test.
Nadide reviews n = 63 movies from the year 2001, among which Y=29 (about 46%) pass the Bechdel test.
Ursula reviews n = 99 movies from the year 2013, among which Y=46 (about 46%) pass the Bechdel test.
plot_beta_binomial(alpha = 14, beta = 1, y = 6, n = 13, posterior = FALSE) + theme_bw()
plot_beta_binomial(alpha = 14, beta = 1, y = 29, n = 63, posterior = FALSE) + theme_bw()
plot_beta_binomial(alpha = 14, beta = 1, y = 46, n = 99, posterior = FALSE) + theme_bw()
How will the different data affect the posterior distributions?
Which posterior will be the most in sync with their data?
Which posterior will be the least in sync with their data?
Find the posterior distributions. (i.e., What are the updated parameters?)
Recall that all use the Beta(14, 1) prior.
Analyst   Data              Posterior
Morteza   Y = 6 of n = 13   Beta(20, 8)
Nadide    Y = 29 of n = 63  Beta(43, 35)
Ursula    Y = 46 of n = 99  Beta(60, 54)
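Reusing the hypothetical update_beta() helper sketched earlier, we can check these parameters directly:

update_beta(alpha = 14, beta = 1, y = 6, n = 13)    # Morteza: Beta(20, 8)
update_beta(alpha = 14, beta = 1, y = 29, n = 63)   # Nadide: Beta(43, 35)
update_beta(alpha = 14, beta = 1, y = 46, n = 99)   # Ursula: Beta(60, 54)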
Let’s also explore what the posteriors look like.
plot_beta_binomial(alpha = 14, beta = 1, y = 6, n = 13) + theme_bw()
plot_beta_binomial(alpha = 14, beta = 1, y = 29, n = 63) + theme_bw()
plot_beta_binomial(alpha = 14, beta = 1, y = 46, n = 99) + theme_bw()
What did we observe?
As n \to \infty, variance in the likelihood \to 0.
In Morteza’s small sample of 13 movies, the likelihood function is wide.
In Ursula’s larger sample size of 99 movies, the likelihood function is narrower.
We see that the narrower the likelihood, the more influence the data holds over the posterior.
Striking a Balance
Overall message: no matter how strong their priors are or how much those priors disagree, analysts will come to a common posterior understanding of \pi in light of strong data.
The posterior can either favor the data or the prior.
The rate at which the posterior balance tips in favor of the data depends upon the prior.
Left to right on the graph, the sample size increases from n=13 to n=99 movies, while preserving the proportion that pass (\approx 0.46).
The likelihood’s insistence and the data’s influence over the posterior increase with sample size.
This also means that the influence of our prior understanding diminishes as we gather new data.
Top to bottom on the graph, priors move from informative (Beta(14,1)) to vague (Beta(1,1)).
Naturally, the more informative the prior, the greater its influence on the posterior.
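A sketch to reproduce the grid described above, assuming the patchwork package for arranging the nine panels and assuming Beta(5, 11) as the middle, moderately informative row (the text only names the top and bottom priors):

library(bayesrules)
library(ggplot2)
library(patchwork)

priors <- list(c(14, 1), c(5, 11), c(1, 1))        # informative to vague, top to bottom
datasets <- list(c(6, 13), c(29, 63), c(46, 99))   # (y, n): n grows, ~46% pass throughout

panels <- list()
for (p in priors) {
  for (d in datasets) {
    panels <- c(panels, list(
      plot_beta_binomial(alpha = p[1], beta = p[2], y = d[1], n = d[2]) + theme_bw()
    ))
  }
}
wrap_plots(panels, nrow = 3)  # rows: priors; columns: increasing sample size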
Introduction: Sequentiality
Let’s now turn our thinking to the next step: okay, we’ve updated our beliefs… but now we have new data!
The evolution in our posterior understanding happens incrementally, as we accumulate new data.
Scientists’ understanding of climate change has evolved over the span of decades as they gain new information.
Presidential candidates’ understanding of their chances of winning an election evolve over months as new poll results become available.
Let’s revisit Milgram’s behavioral study of obedience from Chapter 3. Recall, \pi represents the proportion of people that will obey authority, even if it means bringing harm to others.
Prior to Milgram’s experiments, our fictional psychologist expected that few people would obey authority in the face of harming another: \pi \sim \text{Beta}(1,10).
Now, suppose that the psychologist collected the data incrementally, day by day, over a three-day period.
Find the following posterior distributions, each building off the last:
Day 0: \text{Beta}(1,10).
Day 1: Y=1 out of n=10.
Day 2: Y=17 out of n=20.
Day 3: Y=8 out of n=10.
Updating day by day, each posterior becomes the next day’s prior:
Day 1: Y=1 out of n=10: \text{Beta}(1,10) \to \text{Beta}(2, 19).
Day 2: Y=17 out of n=20: \text{Beta}(2, 19) \to \text{Beta}(19, 22).
Day 3: Y=8 out of n=10: \text{Beta}(19, 22) \to \text{Beta}(27, 24).
Recall from Chapter 3, our posterior was \text{Beta}(27,24)!
Sequential Bayesian Analysis or Bayesian Learning
In a sequential Bayesian analysis, a posterior model is updated incrementally as more data come in.
With each new piece of data, the previous posterior model, which reflects our understanding prior to observing this data, becomes the new prior model.
This is why we love Bayesian statistics!
We evolve our thinking as new data come in.
These types of sequential analyses also uphold two fundamental properties:
The final posterior model is data order invariant.
The final posterior depends only upon the cumulative data.
Returning to the Milgram example:
In order:
Day 0: \text{Beta}(1,10).
Day 1: Y=1 out of n=10: \text{Beta}(1,10) \to \text{Beta}(2, 19).
Day 2: Y=17 out of n=20: \text{Beta}(2, 19) \to \text{Beta}(19, 22).
Day 3: Y=8 out of n=10: \text{Beta}(19, 22) \to \text{Beta}(27, 24).
Out of order:
Day 0: \text{Beta}(1,10).
Day 3: Y=8 out of n=10: \text{Beta}(1,10) \to \text{Beta}(9, 12).
Day 2: Y=17 out of n=20: \text{Beta}(9, 12) \to \text{Beta}(26, 15).
Day 1: Y=1 out of n=10: \text{Beta}(26, 15) \to \text{Beta}(27, 24).
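We can confirm both properties numerically with the hypothetical update_beta() helper from earlier; the in-order, out-of-order, and all-at-once analyses all land on Beta(27, 24):

# In order: Day 1, Day 2, Day 3
p <- c(1, 10)
p <- update_beta(p[1], p[2], y = 1, n = 10)    # Beta(2, 19)
p <- update_beta(p[1], p[2], y = 17, n = 20)   # Beta(19, 22)
p <- update_beta(p[1], p[2], y = 8, n = 10)    # Beta(27, 24)

# Out of order: Day 3, Day 2, Day 1
q <- c(1, 10)
q <- update_beta(q[1], q[2], y = 8, n = 10)    # Beta(9, 12)
q <- update_beta(q[1], q[2], y = 17, n = 20)   # Beta(26, 15)
q <- update_beta(q[1], q[2], y = 1, n = 10)    # Beta(27, 24)

# All at once: Y = 26 successes in n = 40 trials
update_beta(1, 10, y = 26, n = 40)             # Beta(27, 24)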
Proving Data Order Invariance
Data order invariance:
Let \theta be any parameter of interest with prior pdf f(\theta).
Then a sequential analysis in which we first observe a data point y_1, and then a second data point y_2 will produce the same posterior model of \theta as if we first observe y_2 and then y_1.
f(\theta|y_1,y_2) = f(\theta|y_2,y_1)
Similarly, the posterior model is invariant to whether we observe the data all at once or sequentially.
Let’s first specify the structure of posterior pdf f(\theta|y_1,y_2), which evolves by sequentially observing data y_1, followed by y_2.
In step one, we construct the posterior pdf from our original prior pdf, f(\theta), and the likelihood function of \theta given the first data point y_1, L(\theta|y_1):

f(\theta|y_1) = \frac{f(\theta)L(\theta|y_1)}{f(y_1)}

In step two, we update this posterior in light of the second data point y_2, with f(\theta|y_1) now playing the role of the prior:

f(\theta|y_1,y_2) = \frac{f(\theta|y_1)L(\theta|y_2)}{f(y_2|y_1)} = \frac{f(\theta)L(\theta|y_1)L(\theta|y_2)}{f(y_1,y_2)}

Since this final expression is symmetric in y_1 and y_2, observing the data in the reverse order produces the exact same posterior: f(\theta|y_1,y_2) = f(\theta|y_2,y_1).

Finally, not only does the order of the data not influence the ultimate posterior model of \theta, but it doesn’t matter whether we observe the data all at once or sequentially.

Suppose we start with the original f(\theta) prior and observe data (y_1, y_2) together, not sequentially. Further, assume that these data points are independent, thus

f(y_1, y_2 | \theta) = f(y_1|\theta)f(y_2|\theta) = L(\theta|y_1)L(\theta|y_2),

and the all-at-once posterior,

f(\theta|y_1,y_2) = \frac{f(\theta)f(y_1,y_2|\theta)}{f(y_1,y_2)} = \frac{f(\theta)L(\theta|y_1)L(\theta|y_2)}{f(y_1,y_2)},

matches the sequential result.