Evaluating Regression Models

Introduction

  • We now want to learn how to evaluate how good our regression model is.

  • How fair is the model?

    • How was the data collected?
    • By whom and for what purpose?
    • How might the results of the data collection or analysis impact individuals and society?
    • What biases might be baked into the analysis?

Introduction

  • We now want to learn how to evaluate how good our regression model is.

  • How wrong is the model?

    • “All models are wrong, but some are useful.” - George E.P. Box
    • Are our assumptions reasonable?
  • How accurate are the posterior predictive models?

    • How far are the posterior predictive models from reality?

Is the Model Fair?

  • Consider our example from last lecture, the Capital Bikeshare data.

  • We simulated the posterior under the following model:

\begin{align*} Y_i | \beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N(\beta_0 + \beta_1X_i,\sigma^2) \\ \beta_{0\text{c}} &\sim N(5000, 1000^2) \\ \beta_1 &\sim N(100, 40^2) \\ \sigma &\sim \text{Exp}(0.0008) \end{align*}

Is the Model Fair?

  • Consider our example from last lecture, the Capital Bikeshare data.

  • How was the data collected?

    • It is reasonable to assume that this was collected electronically.
    • We do not have individual-level information, only aggregated ride data.
  • By whom and for what purpose was the data collected?

    • It is reasonable for a bike sharing company to track data such as number of rides in a day.
    • The company is likely doing this to make evidence-based business decisions.

Is the Model Fair?

  • Consider our example from last lecture, the Capital Bikeshare data.

  • How might the results of the analysis, or the data collection itself, impact individuals and society?

    • Because there is no individual-level information, the analysis is unlikely to negatively impact customers.
    • Improving the service might help reduce car traffic in D.C. - this is good!
  • Note!

    • Just because we cannot imagine societal impacts does not mean that none exist.
    • It’s critical to recognize that our perspective is shaped and limited by our own experiences.
    • How can we truly determine the impacts?
      • We would need to engage with the company and its community.

Is the Model Fair?

  • What biases might be baked into this analysis?

    • The data is only for one company.
      • Other bikeshare companies may have different usage patterns.
    • The data is only for one city.
      • Other cities may have different usage patterns for bikeshare.
      • Other cities will have different weather patterns.

Is the Model Fair?

  • How do I approach thinking about bias in analysis?

    • Who is and is not represented in the data?
    • What assumptions were made about the data?
    • What assumptions are made about the truth/the population?
    • Is the analysis method appropriate for the question posed?
      • e.g., did they use logistic regression for a binary outcome?
    • Were there any “reaching” assumptions made in order to complete the analysis?
      • e.g., assuming data is MCAR when it is likely MAR or MNAR.
      • e.g., assuming independent observations when there is likely correlation.
    • Are the results stated appropriately? (vs. overstated)
      • Are broad, sweeping generalizations made from a narrow dataset?

How Wrong is the Model?

  • Recall the data model,

Y_i | \beta_0, \beta_1, \sigma \overset{\text{ind}}{\sim} N(\beta_0 + \beta_1X_i,\sigma^2)

  • What are the assumptions on our model?

    • The observed data, Y_i, is independent of other observed data, Y_j for i \neq j.
    • The mean of Y can be written as a linear function of X, \mu=\beta_0 + \beta_1 X.
    • At any X, Y varies normally around \mu with consistent variability, \sigma.

How Wrong is the Model?

  • The observed data, Y_i, is independent of other observed data, Y_j for i \neq j.

    • We can technically test for this, but such tests are not often employed by statisticians.
    • Instead, we rely on knowledge of the data collection process.
    • If the data collection process suggests dependence, we should consider a different model.
      • e.g., a modeling approach that will account for the correlation between observations.

How Wrong is the Model?

  • The observed data, Y_i, is independent of other observed data, Y_j for i \neq j.

  • In this example, the data are daily counts of bike rides.

    • There is inherently some correlation: ridership today is likely correlated with ridership yesterday.
    • However, much of this can be explained by the time of year and features associated with it, like temperature.
    • We should ask ourselves whether it is reasonable to assume that including temperature accounts for this temporal correlation.
  • “Reasonable” does not mean “perfect,” but instead, “good enough to proceed.”

How Wrong is the Model?

  • Returning to our bike data, we can look at the scatterplot of temperature (X) and number of rides (Y).
    • What kind of relationship do we suspect?
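  • A minimal ggplot2 sketch of this scatterplot (assuming the bikes data with temp_feel and rides used throughout) might look like:

library(ggplot2)
# Scatterplot of perceived temperature (X) vs. daily ride counts (Y)
ggplot(bikes, aes(x = temp_feel, y = rides)) +
  geom_point(alpha = 0.5) +
  labs(x = "Perceived Temperature", y = "Number of Rides") +
  theme_bw()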

How Wrong is the Model?

  • We can overlay a line of best fit using geom_smooth(method = "lm").

How Wrong is the Model?

  • Or we can overlay a locally estimated scatterplot smoothing (LOESS) line using geom_smooth().

How Wrong is the Model?

  • How different are the two lines?
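  • To compare them directly, we can overlay both smoothers on a single plot; a hedged sketch (again assuming the bikes data from above, with illustrative colors):

# Overlay the linear fit and the LOESS fit on the same scatterplot
ggplot(bikes, aes(x = temp_feel, y = rides)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  # linear fit
  geom_smooth(se = FALSE, color = "red") +                  # LOESS fit (geom_smooth() default for n < 1000)
  labs(x = "Perceived Temperature", y = "Number of Rides") +
  theme_bw()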

How Wrong is the Model?

  • When does “visualization” suggest that a linear model is not appropriate?

    • When the LOESS line deviates substantially from the fitted linear line.
    • Here, we see that the LOESS line is fairly close to the linear line.
    • Thus, it is reasonable to assume a linear relationship between X and Y.
  • Are there “issues” with this approach?

    • When we have more than one predictor, yes.
    • We are only visualizing one predictor at a time.
    • We will need to use other tools to assess linearity in multiple regression.

How Wrong is the Model?

  • To assess the combined model assumptions, including linearity when we have multiple predictors, we will perform a posterior predictive check.

  • If the combined model assumptions are reasonable, then our posterior model should be able to simulate ridership data that is similar to the original 500 observed ride counts.

  • For each of the 20,000 posterior plausible parameter sets, \{\beta_0, \beta_1, \sigma\}, we can predict ridership data from the observed temperature data.
    • This gives us 20,000 unique datasets of predicted ridership, each with n=500.
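  • As a concrete (hedged) illustration, we can pull a single posterior parameter set out of the fitted bike_model object (recalled on the next slide) and simulate 500 ride counts by hand; repeating this across all 20,000 draws is what the tools below automate:

bike_model_df <- as.data.frame(bike_model)   # 20,000 draws of (Intercept), temp_feel, sigma
one_draw <- bike_model_df[1, ]

# Simulate one predicted ride count at each of the 500 observed temperatures
set.seed(84735)
mu <- one_draw$`(Intercept)` + one_draw$temp_feel * bikes$temp_feel
simulated_rides <- rnorm(nrow(bikes), mean = mu, sd = one_draw$sigma)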

How Wrong is the Model?

  • Recall our analysis from last lecture,
bike_model <- stan_glm(rides ~ temp_feel, 
                       data = bikes, 
                       family = gaussian,
                       prior_intercept = normal(5000, 1000), 
                       prior = normal(100, 40), 
                       prior_aux = exponential(0.0008), # sigma
                       chains = 4, iter = 5000*2, seed = 84735) 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000184 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.84 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.124 seconds (Warm-up)
Chain 1:                0.204 seconds (Sampling)
Chain 1:                0.328 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 1.2e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 2: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 2: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 2: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 2: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 2: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 2: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 2: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 2: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 2: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 2: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 2: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.153 seconds (Warm-up)
Chain 2:                0.201 seconds (Sampling)
Chain 2:                0.354 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 7e-06 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.07 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 3: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 3: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 3: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 3: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 3: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 3: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 3: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 3: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 3: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 3: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 3: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.121 seconds (Warm-up)
Chain 3:                0.194 seconds (Sampling)
Chain 3:                0.315 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 6e-06 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 4: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 4: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 4: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 4: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 4: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 4: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 4: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 4: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 4: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 4: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 4: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 0.104 seconds (Warm-up)
Chain 4:                0.204 seconds (Sampling)
Chain 4:                0.308 seconds (Total)
Chain 4: 

How Wrong is the Model?

  • Recall our analysis from last lecture,
library(broom.mixed)  # provides tidy() methods for stanreg objects
tidy(bike_model, conf.int = TRUE) %>% select(-std.error)
  • This gives the model,

\hat{y} = -2195.3 + 82.2x \ \ \ \to \ \ \ \hat{\text{rides}} = -2195.3 + 82.2\text{temp}

How Wrong is the Model?

  • How do we perform the posterior predictive check?

  • The pp_check() function plots a number of these simulated datasets alongside the observed data.

pp_check(bike_model, nreps=50, seed = 84735) + xlab("Number of Rides") + theme_bw()

How Accurate are Posterior Predictive Models?

  • Although we are ultimately interested in predicting the future, we can begin by evaluating how well the model predicts the data that was used to build it.

  • We can use the posterior_predict() function to generate posterior predictions for either our observed data or new data.

predictions <- posterior_predict(bike_model, newdata=bikes, seed = 84735)
dim(predictions)
[1] 20000   500
  • We have 20,000 posterior predictive datasets, each with 500 predicted ride counts.

How Accurate are Posterior Predictive Models?

  • We can also use the ppc_intervals() function to visualize the posterior predictive intervals for each observation.
ppc_intervals(bikes$rides, yrep = predictions, x = bikes$temp_feel, 
              prob = 0.5, prob_outer = 0.95) + theme_bw()

How Accurate are Posterior Predictive Models?

  • Let Y_1, Y_2, ..., Y_n denote n observed outcomes. Each Y_i has a corresponding posterior predictive model with mean Y_i' and standard deviation \text{sd}_i

  • Median Absolute Error (MAE): the typical difference between the observed value and posterior predictive mean.

\text{MAE} = \text{median} | Y_i - Y_i'|

  • Scaled Median Absolute Error (MAE scaled): the typical number of posterior predictive standard deviations between the observed value and the posterior predictive mean.

\text{MAE scaled} = \text{median} \frac{| Y_i - Y_i'|}{\text{sd}_i}
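  • A minimal sketch of both calculations by hand, assuming the predictions matrix produced by posterior_predict() earlier:

# Posterior predictive mean and sd for each observed Y_i
pred_means <- colMeans(predictions)
pred_sds   <- apply(predictions, 2, sd)

# Median absolute error and its scaled version
mae        <- median(abs(bikes$rides - pred_means))
mae_scaled <- median(abs(bikes$rides - pred_means) / pred_sds)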

How Accurate are Posterior Predictive Models?

  • Let Y_1, Y_2, ..., Y_n denote n observed outcomes. Each Y_i has a corresponding posterior predictive model with mean Y_i' and standard deviation \text{sd}_i

  • Within 50: Proportion of observed values that fall within their 50% posterior prediction interval.

  • Within 95: Proportion of observed values that fall within their 95% posterior prediction interval.

  • We can obtain all four of these values from the prediction_summary() function in the bayesrules package,

set.seed(84735)
prediction_summary(bike_model, data = bikes)
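  • For intuition, the interval-coverage summaries can also be approximated by hand from the predictions matrix; these proportions should be close to the prediction_summary() output:

# 50% and 95% posterior prediction intervals for each observation
lower_50 <- apply(predictions, 2, quantile, probs = 0.25)
upper_50 <- apply(predictions, 2, quantile, probs = 0.75)
lower_95 <- apply(predictions, 2, quantile, probs = 0.025)
upper_95 <- apply(predictions, 2, quantile, probs = 0.975)

# Proportion of observed ride counts inside their intervals
within_50 <- mean(bikes$rides >= lower_50 & bikes$rides <= upper_50)
within_95 <- mean(bikes$rides >= lower_95 & bikes$rides <= upper_95)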

How Accurate are Posterior Predictive Models?

  • Let’s now interpret these values.
set.seed(84735)
prediction_summary(bike_model, data = bikes)
  • The MAE of 989.7 means that the typical difference between the observed ride count and the posterior predictive mean is about 990 rides.
    • The scaled MAE of 0.77 means that the typical difference between the observed ride count and the posterior predictive mean is about 0.77 standard deviations.
  • Only 43.8% of the observed ride counts fall within their respective 50% prediction interval, while 95.8% fall within their respective 95% prediction interval.

How Accurate are Posterior Predictive Models?

  • Looking at these side by side,

How Accurate are Posterior Predictive Models?

  • What is our final conclusion? Does our Bayesian model produce accurate predictions?

  • Unfortunately, the answer is subjective and dependent on context.

    • Is an MAE of 990 rides acceptable for the business?
    • Is having only 43.8% of observations within the 50% prediction interval acceptable?
  • As usual, we are just the statisticians; the business must decide what is acceptable.

How Accurate are Posterior Predictive Models?

  • How can we improve the model’s posterior predictive accuracy?

    • Add additional predictors or terms (e.g., interactions) to the model.
    • Consider different distributions for the outcome variable.
      • e.g., Poisson, negative binomial, binomial, etc.
    • Collect more data.
      • This is always the cheeky answer.

Practice / Homework