Evaluating Regression Models

Introduction

  • We now want to learn how to evaluate how good our regression model is.

  • How fair is the model?

    • How was the data collected?
    • By whom and for what purpose?
    • How might the results of the data collection or analysis impact individuals and society?
    • What biases might be baked into the analysis?

Introduction

  • We now want to learn how to evaluate how good our regression model is.

  • How wrong is the model?

    • “All models are wrong, but some are useful.” - George E.P. Box
    • Are our assumptions reasonable?
  • How accurate are the posterior predictive models?

    • How far are the posterior predictive models from reality?

Is the Model Fair?

  • Consider our example from last lecture, the Capital Bikeshare data.

  • We simulated the posterior under the following model:

\begin{align*} Y_i | \beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N(\beta_0 + \beta_1X_i,\sigma^2) \\ \beta_{0\text{c}} &\sim N(5000, 1000^2) \\ \beta_1 &\sim N(100, 40^2) \\ \sigma &\sim \text{Exp}(0.0008) \end{align*}

Is the Model Fair?

  • Consider our example from last lecture, the Capital Bikeshare data.

  • How was the data collected?

    • It is reasonable to assume that this was collected electronically.
    • We do not have individual-level information, only aggregated ride data.
  • By whom and for what purpose was the data collected?

    • It is reasonable for a bike sharing company to track data such as number of rides in a day.
    • The company is likely doing this to make evidence-based business decisions.

Is the Model Fair?

  • Consider our example from last lecture, the Capital Bikeshare data.

  • How might the results of the analysis, or the data collection itself, impact individuals and society?

    • Because there is no individual-level information, the analysis is unlikely to negatively impact customers.
    • Improving the service might help reduce car traffic in D.C. - this is good!
  • Note!

    • Just because we cannot imagine societal impacts does not mean that none exist.
    • It’s critical to recognize that our perspective is shaped and limited by our own experiences.
    • How can we truly determine the impacts?
      • We would need to engage with the company and its community.

Is the Model Fair?

  • What biases might be baked into this analysis?

    • The data is only for one company.
      • Other bikeshare companies may have different usage patterns.
    • The data is only for one city.
      • Other cities may have different usage patterns for bikeshare.
      • Other cities will have different weather patterns.

Is the Model Fair?

  • How do I approach thinking about bias in analysis?

    • Who is and is not represented in the data?
    • What assumptions were made about the data?
    • What assumptions are made about the truth/the population?
    • Is the analysis method appropriate for the question posed?
      • e.g., did they use logistic regression for a binary outcome?
    • Were there any “reaching” assumptions made in order to complete the analysis?
      • e.g., assuming data is MCAR when it is likely MAR or MNAR.
      • e.g., assuming independent observations when there is likely correlation.
    • Are the results stated appropriately? (vs. overstated)
      • Are broad, sweeping generalizations made from a narrow dataset?

How Wrong is the Model?

  • Recall the data model,

Y_i | \beta_0, \beta_1, \sigma \overset{\text{ind}}{\sim} N(\beta_0 + \beta_1X_i,\sigma^2)

  • What are the assumptions on our model?

    • The observed data, Y_i, is independent of other observed data, Y_j for i \neq j.
    • The mean of Y can be written as a linear function of X, \mu=\beta_0 + \beta_1 X.
    • At any X, Y varies normally around \mu with consistent variability, \sigma.

How Wrong is the Model?

  • The observed data, Y_i, is independent of other observed data, Y_j for i \neq j.

    • We can technically test for this, but such tests are not often employed by statisticians.
    • Instead, we rely on knowledge of the data collection process.
    • If the data collection process suggests dependence, we should consider a different model.
      • e.g., a modeling approach that will account for the correlation between observations.

How Wrong is the Model?

  • The observed data, Y_i, is independent of other observed data, Y_j for i \neq j.

  • In this example, the data are daily counts of bike rides.

    • There is inherently some correlation: ridership today is likely correlated with ridership yesterday.
    • However, much of this can be explained by the time of year and features associated with it, like temperature.
    • We should ask ourselves whether it is reasonable to assume that including temperature accounts for this temporal correlation.
  • “Reasonable” does not mean “perfect,” but instead, “good enough to proceed.”

How Wrong is the Model?

  • Returning to our bike data, we can look at the scatterplot of temperature (X) and number of rides (Y).
    • What kind of relationship do we suspect?
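  • A minimal ggplot2 sketch of this scatterplot (assuming the bikes data with temp_feel and rides used throughout) might look like:

library(ggplot2)
# Scatterplot of perceived temperature (X) vs. daily ride counts (Y)
ggplot(bikes, aes(x = temp_feel, y = rides)) +
  geom_point(alpha = 0.5) +
  labs(x = "Perceived Temperature", y = "Number of Rides") +
  theme_bw()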

How Wrong is the Model?

  • We can overlay a line of best fit using geom_smooth(method = "lm").

How Wrong is the Model?

  • Or we can overlay a locally estimated scatterplot smoothing (LOESS) line using geom_smooth().

How Wrong is the Model?

  • How different are the two lines?
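  • To compare them directly, we can overlay both smoothers on a single plot; a hedged sketch (again assuming the bikes data from above, with illustrative colors):

# Overlay the linear fit and the LOESS fit on the same scatterplot
ggplot(bikes, aes(x = temp_feel, y = rides)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  # linear fit
  geom_smooth(se = FALSE, color = "red") +                  # LOESS fit (geom_smooth() default for n < 1000)
  labs(x = "Perceived Temperature", y = "Number of Rides") +
  theme_bw()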

How Wrong is the Model?

  • When does “visualization” suggest that a linear model is not appropriate?

    • When the LOESS line deviates substantially from the fitted linear line.
    • Here, we see that the LOESS line is fairly close to the linear line.
    • Thus, it is reasonable to assume a linear relationship between X and Y.
  • Are there “issues” with this approach?

    • When we have more than one predictor, yes.
    • We are only visualizing one predictor at a time.
    • We will need to use other tools to assess linearity in multiple regression.

How Wrong is the Model?

  • To assess the combined model assumptions, including linearity when we have multiple predictors, we will perform a posterior predictive check.

  • If the combined model assumptions are reasonable, then our posterior model should be able to simulate ridership data that is similar to the original 500 observed ride counts.

  • For each of the 20,000 posterior plausible parameter sets, \{\beta_0, \beta_1, \sigma\}, we can predict ridership data from the observed temperature data.
    • This gives us 20,000 unique datasets of predicted ridership, each with n=500.
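  • As a concrete (hedged) illustration, we can pull a single posterior parameter set out of the fitted bike_model object (recalled on the next slide) and simulate 500 ride counts by hand; repeating this across all 20,000 draws is what the tools below automate:

bike_model_df <- as.data.frame(bike_model)   # 20,000 draws of (Intercept), temp_feel, sigma
one_draw <- bike_model_df[1, ]

# Simulate one predicted ride count at each of the 500 observed temperatures
set.seed(84735)
mu <- one_draw$`(Intercept)` + one_draw$temp_feel * bikes$temp_feel
simulated_rides <- rnorm(nrow(bikes), mean = mu, sd = one_draw$sigma)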

How Wrong is the Model?

  • Recall our analysis from last lecture,
bike_model <- stan_glm(rides ~ temp_feel, 
                       data = bikes, 
                       family = gaussian,
                       prior_intercept = normal(5000, 1000), 
                       prior = normal(100, 40), 
                       prior_aux = exponential(0.0008), # sigma
                       chains = 4, iter = 5000*2, seed = 84735) 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000184 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.84 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.124 seconds (Warm-up)
Chain 1:                0.204 seconds (Sampling)
Chain 1:                0.328 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 1.2e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 2: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 2: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 2: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 2: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 2: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 2: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 2: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 2: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 2: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 2: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 2: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.153 seconds (Warm-up)
Chain 2:                0.201 seconds (Sampling)
Chain 2:                0.354 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 7e-06 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.07 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 3: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 3: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 3: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 3: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 3: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 3: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 3: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 3: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 3: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 3: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 3: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.121 seconds (Warm-up)
Chain 3:                0.194 seconds (Sampling)
Chain 3:                0.315 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 6e-06 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 4: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 4: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 4: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 4: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 4: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 4: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 4: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 4: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 4: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 4: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 4: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 0.104 seconds (Warm-up)
Chain 4:                0.204 seconds (Sampling)
Chain 4:                0.308 seconds (Total)
Chain 4: 

How Wrong is the Model?

  • Recall our analysis from last lecture,
library(broom.mixed)  # provides tidy() methods for stanreg objects
tidy(bike_model, conf.int = TRUE) %>% select(-std.error)
  • This gives the model,

\hat{y} = -2195.3 + 82.2x \ \ \ \to \ \ \ \hat{\text{rides}} = -2195.3 + 82.2\text{temp}

How Wrong is the Model?

  • How do we perform the posterior predictive check?

  • The pp_check() function plots a number of these simulated datasets alongside the observed data.

pp_check(bike_model, nreps=50, seed = 84735) + xlab("Number of Rides") + theme_bw()

How Accurate are Posterior Predictive Models?

  • Although we are ultimately interested in predicting the future, we can begin by evaluating how well the model predicts the data that was used to build it.

  • We can use the posterior_predict() function to generate posterior predictions for either our observed data or new data.

predictions <- posterior_predict(bike_model, newdata=bikes, seed = 84735)
dim(predictions)
[1] 20000   500
  • We have 20,000 posterior predictive datasets, each with 500 predicted ride counts.

How Accurate are Posterior Predictive Models?

  • We can also use the ppc_intervals() function to visualize the posterior predictive intervals for each observation.
ppc_intervals(bikes$rides, yrep = predictions, x = bikes$temp_feel, 
              prob = 0.5, prob_outer = 0.95) + theme_bw()

How Accurate are Posterior Predictive Models?

  • Let Y_1, Y_2, ..., Y_n denote n observed outcomes. Each Y_i has a corresponding posterior predictive model with mean Y_i' and standard deviation \text{sd}_i

  • Median Absolute Error (MAE): the typical difference between the observed value and posterior predictive mean.

\text{MAE} = \text{median} | Y_i - Y_i'|

  • Scaled Median Absolute Error (MAE scaled): the typical number of posterior predictive standard deviations between the observed value and the posterior predictive mean.

\text{MAE scaled} = \text{median} \frac{| Y_i - Y_i'|}{\text{sd}_i}
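  • A minimal sketch of both calculations by hand, assuming the predictions matrix produced by posterior_predict() earlier:

# Posterior predictive mean and sd for each observed Y_i
pred_means <- colMeans(predictions)
pred_sds   <- apply(predictions, 2, sd)

# Median absolute error and its scaled version
mae        <- median(abs(bikes$rides - pred_means))
mae_scaled <- median(abs(bikes$rides - pred_means) / pred_sds)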

How Accurate are Posterior Predictive Models?

  • Let Y_1, Y_2, ..., Y_n denote n observed outcomes. Each Y_i has a corresponding posterior predictive model with mean Y_i' and standard deviation \text{sd}_i

  • Within 50: Proportion of observed values that fall within their 50% posterior prediction interval.

  • Within 95: Proportion of observed values that fall within their 95% posterior prediction interval.

  • We can obtain all four of these values from the prediction_summary() function in the bayesrules package,

set.seed(84735)
prediction_summary(bike_model, data = bikes)
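  • For intuition, the interval-coverage summaries can also be approximated by hand from the predictions matrix; these proportions should be close to the prediction_summary() output:

# 50% and 95% posterior prediction intervals for each observation
lower_50 <- apply(predictions, 2, quantile, probs = 0.25)
upper_50 <- apply(predictions, 2, quantile, probs = 0.75)
lower_95 <- apply(predictions, 2, quantile, probs = 0.025)
upper_95 <- apply(predictions, 2, quantile, probs = 0.975)

# Proportion of observed ride counts inside their intervals
within_50 <- mean(bikes$rides >= lower_50 & bikes$rides <= upper_50)
within_95 <- mean(bikes$rides >= lower_95 & bikes$rides <= upper_95)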

How Accurate are Posterior Predictive Models?

  • Let’s now interpret these values.
set.seed(84735)
prediction_summary(bike_model, data = bikes)
  • The MAE of 989.7 means that the typical difference between the observed ride count and the posterior predictive mean is about 990 rides.
    • The scaled MAE of 0.77 means that the typical difference between the observed ride count and the posterior predictive mean is about 0.77 standard deviations.
  • Only 43.8% of the observed ride counts fall within their respective 50% prediction interval, while 95.8% fall within their respective 95% prediction interval.

How Accurate are Posterior Predictive Models?

  • Looking at these side by side,

How Accurate are Posterior Predictive Models?

  • What is our final conclusion? Does our Bayesian model produce accurate predictions?

  • Unfortunately, the answer is subjective and dependent on context.

    • Is an MAE of 990 rides acceptable for the business?
    • Is having only 43.8% of observations within the 50% prediction interval acceptable?
  • As usual, we are just the statisticians; the business must decide what is acceptable.

How Accurate are Posterior Predictive Models?

  • How can we improve the model’s posterior predictive accuracy?

    • Add additional predictors or terms (e.g., interactions) to the model.
    • Consider different distributions for the outcome variable.
      • e.g., Poisson, negative binomial, binomial, etc.
    • Collect more data.
      • This is always the cheeky answer.

Practice / Homework