Simple Normal Regression

STA6349: Applied Bayesian Analysis
Spring 2025

Introduction

  • Before today, we were effectively performing one-sample analyses of means, proportions, and counts.

  • Now, we will focus on incorporating just a single predictor into our analysis.

    • Overall question: what is the relationship between Y (outcome) and X (predictor)?
  • We now switch to thinking about analysis in terms of regression.

Working Example

  • Capital Bikeshare is a bike sharing service in the Washington, D.C. area. To best serve its registered members, the company must understand the demand for its service. We will analyze the number of rides taken on a random sample of n days, (Y_1, Y_2, ..., Y_n).

  • Because Y_i is a count variable, you might assume that ridership should be modeled with a Poisson distribution. However, past bike riding seasons have exhibited bell-shaped daily ridership with variability that far exceeds the typical ridership.

    • (i.e., the Poisson assumption that the mean equals the variance, E[Y_i] = \text{Var}(Y_i), does not hold here)
  • We will instead assume that the number of rides varies normally around some typical ridership, \mu, with standard deviation, \sigma.

Y_i|\mu,\sigma \overset{\text{ind}}{\sim} N(\mu, \sigma^2)

Working Example

  • In our example, Y is the number of rides and X is the temperature.

  • Our specific goal will be to model the relationship between ridership and temperature:

    • Does ridership tend to increase on warmer days?
    • If so, by how much?
    • How strong is this relationship?
  • It is reasonable to assume a positive relationship between temperature and number of rides.

    • As it warms up outside, folks are more likely to pursue outdoor activities, including biking.
  • Today we will focus on the model with a single predictor (y ~ x in R).

Building the Regression Model

  • Suppose we have n data pairs,

\{ (Y_1, X_1), (Y_2, X_2), ..., (Y_n, X_n) \}

  • where Y_i is the number of rides and X_i is the high temperature (°F) on day i.

  • Assuming that the relationship is linear, we can model

\mu_i = \beta_0 + \beta_1 X_i,

  • where \beta_0 and \beta_1 are model coefficients.

Building the Regression Model

  • What do we mean by “model coefficients?”

\mu_i = \beta_0 + \beta_1 X_i,

  • \beta_0 is the intercept, i.e., where our model crosses the y-axis when X_i = 0.
    • Is this meaningful here, where it describes the average ridership on a 0°F day in DC?
  • \beta_1 is the slope: the average change in the outcome (Y) for a one-unit increase in the predictor (X); see the identity below.
    • Interpretation: for a [1 unit of predictor] increase in [the predictor], [the outcome] [increases or decreases] by [abs(\beta_1)].
    • In our example, suppose \beta_1 = 4.5. For a 1°F increase in temperature, ridership increases by 4.5 riders on average.
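  • This interpretation follows directly from the linear form of \mu_i: increasing the predictor by one unit shifts the local mean by exactly \beta_1.

\mu(x+1) - \mu(x) = \left[\beta_0 + \beta_1 (x+1)\right] - \left[\beta_0 + \beta_1 x\right] = \beta_1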

Building the Regression Model

  • We are now interested in the model

Y_i|\beta_0,\beta_1,\sigma\overset{\text{ind}}{\sim} N(\mu_i, \sigma^2) \text{ with } \mu_i = \beta_0 + \beta_1 X_i

  • Note that \mu_i is the local mean (for a specific value of X).
    • In our data, \mu_i is the mean ridership for day i.
  • The global mean (regardless of the value of X) is given by \mu.
    • In our data, \mu is the mean ridership, regardless of day.
  • Under this model, \sigma is now measuring the variability from the local mean.

Building the Regression Model

  • We know the assumptions for linear regression in the frequentist framework.

  • In Bayesian Normal regression,

    • The observations are independent.
    • The mean of Y can be written as a linear function of X: \mu = \beta_0 + \beta_1 X.
    • For any value of X, Y varies normally around \mu with constant standard deviation \sigma.

Building the Regression Model

  • We know we must set priors for our parameters.

    • Here, we have three parameters: \beta_0, \beta_1, \sigma.
    • There are multiple ways to set up the priors, but we will stick to the default framework from rstanarm.
  • First, we will assume that our prior models of \beta_0, \beta_1, and \sigma are independent.

  • It is common to use a Normal prior for \beta_0 and \beta_1.

\begin{align*} \beta_0 &\sim N(m_0, s_0^2) \\ \beta_1 &\sim N(m_1, s_1^2) \end{align*}

  • Then, the default is to use an exponential prior for \sigma, since both \sigma and the exponential distribution are restricted to positive values.

\sigma \sim \text{Exp}(l)

  • Note that E[\sigma] = 1/l and \text{sd}[\sigma] = 1/l.
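  • As a reminder of where these moments come from, for \sigma \sim \text{Exp}(l),

E[\sigma] = \int_0^\infty s \, l e^{-ls} \, ds = \frac{1}{l} \quad \text{and} \quad \text{Var}[\sigma] = \frac{1}{l^2}, \ \text{ so } \ \text{sd}[\sigma] = \frac{1}{l}.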

Building the Regression Model

  • Thus, our regression model has the following formulation:

\begin{align*} &\text{data}: & Y_i|\beta_0, \beta_1, \sigma & \overset{\text{ind}}{\sim} N(\mu_i, \sigma^2) \text{ with } \mu_i = \beta_0 + \beta_1 X_i \\ &\text{priors:} &\beta_0 & \sim N(m_0, s_0^2) \\ & & \beta_1 &\sim N(m_1, s_1^2) \\ & & & \sigma \sim \text{Exp}(l) \end{align*}

Tuning the Prior Models

  • Based on past bikeshare analyses, we have the following centered prior understandings.
  1. On an average temperature day for DC (65°F-70°F), there are typically around 5000 riders, but this could vary between 3000 and 7000 riders.

\beta_{0\text{c}} \sim N(5000, 1000^2)

  2. For every one degree increase in temperature, ridership typically increases by 100 rides, but this could vary between 20 and 180 rides.

\beta_{1\text{c}} \sim N(100, 40^2)

  3. At any given temperature, the daily ridership will tend to vary with a moderate standard deviation of 1250 rides.

\sigma \sim \text{Exp}(0.0008)

  • Recall, E[\sigma] = 1/l = 1250, so l = 1/1250 = 0.0008.
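  • As a quick sanity check on these tunings, note that each stated range corresponds to roughly \pm 2 standard deviations of the corresponding Normal prior (a minimal sketch in R):
qnorm(c(0.025, 0.975), mean = 5000, sd = 1000) # central 95% of beta_0c: about 3000 to 7000 riders
qnorm(c(0.025, 0.975), mean = 100, sd = 40)    # central 95% of beta_1c: about 20 to 180 rides per degree
1 / 0.0008                                     # E[sigma] = 1/l = 1250 rides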

Tuning the Prior Models

\beta_{0\text{c}} \sim N(5000, 1000^2) \ \ \ \beta_{1\text{c}} \sim N(100, 40^2) \ \ \ \sigma \sim \text{Exp}(0.0008)

Simulating the Posterior

  • Let’s now update what we know using the bikes data in the bayesrules package.

  • We start by looking at the basic relationship between the number of rides and the temperature (as it feels outside), sketched below.
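  • A simple scatterplot of the raw data (a minimal sketch, assuming the bayesrules and ggplot2 packages are installed):
library(bayesrules) # provides the bikes data
library(ggplot2)

data(bikes)

# raw relationship: daily ridership vs. felt temperature
ggplot(bikes, aes(x = temp_feel, y = rides)) +
  geom_point(size = 0.5) +
  labs(x = "Temperature (feels like, degrees F)", y = "Number of rides") +
  theme_bw()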

Simulating the Posterior

  • Note: I am skipping the derivation / the true math behind how to find the posterior in this situation. (It involves multiple integrals!)

  • We will use the stan_glm() function from rstanarm – it contains pre-defined Bayesian regression models.

    • stan_glm() also applies to the wider family of generalized linear models (e.g., logistic and Poisson/negative binomial regression).
library(rstanarm)
bike_model <- stan_glm(rides ~ temp_feel, # data model
                       data = bikes, # dataset
                       family = gaussian, # distribution to apply
                       prior_intercept = normal(5000, 1000), # b0_c
                       prior = normal(100, 40), # b1_c
                       prior_aux = exponential(0.0008), # sigma
                       chains = 4, # 4 chains
                       iter = 5000*2, # 10000 iterations - throw out first 5000
                       seed = 84735) # starting place in RNG
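  • To confirm exactly which priors stan_glm() ended up using (rstanarm may rescale priors depending on the autoscale settings), we can inspect the fitted object with rstanarm’s prior_summary():
prior_summary(bike_model) # report the prior distributions used in the fitted model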

Simulating the Posterior

bike_model <- stan_glm(rides ~ temp_feel, # data model
                       data = bikes, # dataset
                       family = gaussian, # distribution to apply
                       prior_intercept = normal(5000, 1000), # b0_c
                       prior = normal(100, 40), # b1_c
                       prior_aux = exponential(0.0008), # sigma
                       chains = 4, # 4 chains
                       iter = 5000*2, # 10000 iterations - throw out first 5000
                       seed = 84735) # starting place in RNG
  • The stan_glm() call specifies three types of information:
    • Data information: The first three arguments specify the structure of our data.
      • We want to model ridership by temperature (rides ~ temp_feel) using data = bikes and assuming a Normal data model, aka family = gaussian.
    • Prior information: The prior_intercept, prior, and prior_aux arguments give the priors for \beta_{0\text{c}}, \beta_{1\text{c}}, \sigma.
    • Markov chain information: The chains, iter, and seed arguments specify the number of Markov chains to run, the length or number of iterations for each chain, and the starting place of the RNG.

Simulating the Posterior

  • Wait, how does this work when we are looking at three model parameters?

  • We will have three vectors – one for each model parameter.

\begin{align*} <\beta_0^{(1)}, & \ \beta_0^{(2)}, ..., \beta_0^{(5000)}> \\ <\beta_1^{(1)}, & \ \beta_1^{(2)}, ..., \beta_1^{(5000)} > \\ <\sigma^{(1)}, & \ \sigma^{(2)}, ..., \sigma^{(5000)} > \\ \end{align*}
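  • Concretely, these draws can be pulled into a data frame with one column per parameter; each chain contributes 5,000 post-warmup draws, so the combined data frame has 20,000 rows:
bike_draws <- as.data.frame(bike_model) # one row per posterior draw
dim(bike_draws)                         # 20000 rows x 3 columns
head(bike_draws)                        # columns: (Intercept), temp_feel, sigma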

Simulation Diagnostics

  • Run the simulation code:
library(rstanarm)
bike_model <- stan_glm(rides ~ temp_feel, # data model
                       data = bikes, # dataset
                       family = gaussian, # distribution to apply
                       prior_intercept = normal(5000, 1000), # b0_c
                       prior = normal(100, 40), # b1_c
                       prior_aux = exponential(0.0008), # sigma
                       chains = 4, # 4 chains
                       iter = 5000*2, # 10000 iterations - throw out first 5000
                       seed = 84735) # starting place in RNG
  • Then, run diagnostics:
library(bayesplot)                 # neff_ratio(), rhat(), mcmc_trace(), mcmc_dens_overlay()
neff_ratio(bike_model)             # effective sample size ratio for each parameter
rhat(bike_model)                   # R-hat convergence diagnostic for each parameter
mcmc_trace(bike_model, size = 0.1) # trace plots of the chains
mcmc_dens_overlay(bike_model)      # overlaid posterior density plots by chain

Simulation Diagnostics

bike_model <- stan_glm(rides ~ temp_feel, # data model
                       data = bikes, # dataset
                       family = gaussian, # distribution to apply
                       prior_intercept = normal(5000, 1000), # b0_c
                       prior = normal(100, 40), # b1_c
                       prior_aux = exponential(0.0008), # sigma
                       chains = 4, # 4 chains
                       iter = 5000*2, # 10000 iterations - throw out first 5000
                       seed = 84735) # starting place in RNG

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 3.6e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.36 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.126 seconds (Warm-up)
Chain 1:                0.196 seconds (Sampling)
Chain 1:                0.322 seconds (Total)
Chain 1: 

(Chains 2 through 4 produce similar sampling output.)

Simulation Diagnostics

  • Diagnostics:
neff_ratio(bike_model)
(Intercept)   temp_feel       sigma 
    1.03285     1.03505     0.96585 
rhat(bike_model)
(Intercept)   temp_feel       sigma 
  0.9998873   0.9999032   0.9999642 
  • Quick diagnostics indicate that the resulting chains are trustworthy.

  • The effective sample size ratios are slightly above 1 and the R-hat values are very close to 1.

Simulation Diagnostics

  • Diagnostics:

  • The trace plots show that the chains are stable, mixing quickly, and behaving much like an independent sample.

Simulation Diagnostics

  • Diagnostics:

  • The density plot lets us visualize and examine the posterior models for each of our regression parameters, \beta_0, \beta_1, \sigma.

Interpreting the Posterior

  • Okay… what does this mean, though?
library(broom.mixed) # tidy() method for rstanarm model objects
tidy(bike_model, effects = c("fixed", "aux"), conf.int = TRUE, conf.level = 0.80)
  • Thus, the posterior median relationship is

y = -2194.24 + 82.16x

  • For a 1 degree increase in temperature, we expect ridership to increase by about 82 rides, with 80% credible interval (75.7, 88.7).
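  • The same 80% intervals can also be pulled directly from the posterior draws with rstanarm’s posterior_interval(), as a check on the tidy() output above:
posterior_interval(bike_model, prob = 0.80) # 80% credible intervals for the intercept, slope, and sigma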

Interpreting the Posterior

  • We can look at alternatives by drawing from the simulated data in bike_model.
    • The add_fitted_draws() function is from the tidybayes package.
library(tidybayes) # add_fitted_draws()
library(dplyr)     # %>%
library(ggplot2)

bikes %>%
  add_fitted_draws(bike_model, n = 50) %>%          # 50 posterior plausible model lines
  ggplot(aes(x = temp_feel, y = rides)) +
  geom_line(aes(y = .value, group = .draw), alpha = 0.15) +
  geom_point(data = bikes, size = 0.05) +
  geom_smooth(method = "lm", se = FALSE) +          # frequentist fit for reference
  theme_bw()

Interpreting the Posterior

  • We can look at alternatives by drawing from the simulated data in bike_model.

  • We see that the plausible models are not super variable.
    • This means we’re more confident about the relationship we’re observing.

Interpreting the Posterior

  • How does this compare against the frequentist version of regression?
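  • For reference, the frequentist analogue is a one-line lm() fit; with this much data, its coefficient estimates should land close to the posterior medians above (a minimal sketch):
freq_model <- lm(rides ~ temp_feel, data = bikes) # ordinary least squares fit
summary(freq_model)                               # compare estimates to the posterior medians above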

Wrap Up

  • Today, we have learned basic linear regression in the Bayesian framework.

  • What we have learned so far still holds true re: simulation and diagnostics.

  • Monday next week:

    • Multiple predictors
    • Poisson & negative binomial regression
    • Logistic regression
  • Wednesday next week:

    • Assignment to practice regression
  • Monday & Wednesday before finals:

    • Project 2! (We will build regression models. Surprise!)
    • You will present on Wednesday.
    • We will have smaller groups this time.

Homework

  • 9.9, 9.10, 9.11, 9.12

  • 9.16, 9.17, 9.18