y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon
In this model, \varepsilon is the error term.
Recall that the residual (how far away the observation is from the response surface) estimates this error and is defined as e = y - \hat{y}
Note that
y is the observed value
\hat{y} is the predicted value
In addition to checking assumptions, we should also perform model diagnostics to identify potential issues with our model.
Common diagnostics include:
Assessing overall model fit (e.g., R^2, AIC, BIC).
Identifying outliers (points that are far from the predicted values).
Identifying influential observations (points that have a large impact on the model’s estimates).
Checking for multicollinearity (high correlation between predictors).
The diagnostics needed depend on the context of our analysis and the specific model we are using.
Context matters! Applied statistics: we do not want our model to be perfect, we want it to be generalizable. Does the interpretation make intuitive (scientific) sense? If not, why? Is there a weird data point (or points) affecting our analysis?
Context matters! Machine learning/predictive modeling: we care about predictive accuracy, not interpretability. Does the model predict well on new data? If not, why? Is there overfitting or underfitting? Is that due to weird data points?
We can assess overall model fit using metrics such as:
R^2: proportion of variance in the response variable explained by the model.
Adjusted R^2: adjusts R^2 for the number of predictors in the model.
AIC (Akaike Information Criterion): measures the relative quality of the specified model for the given data. This balances goodness-of-fit and model complexity.
BIC (Bayesian Information Criterion): same concept as AIC, but with a stronger penalty as the number of estimated parameters (p) increases.
In our course, we only estimate the \beta_i coefficients: slopes for i = 1, ..., k predictors, plus 1 intercept. Thus, p = k + 1.
If we are accounting for correlated data (not in this course), we are introducing additional parameters to estimate (e.g., correlation structure parameters).
If we want to know how well the model fits this particular dataset, we can look at the R^2.
R^2 = \frac{\text{SSReg}}{\text{SSTot}}
Because it is a proportion, R^2 \in [0, 1] and is independent of the units of y.
If R^2 \to 0, the model does not fit the data well.
If R^2 \to 1, the model fits the data well.
Remember that we are partitioning the variability in y (SSTot), which is constant, into two pieces:
The variability explained by the regression model (SSReg).
The variability explained by outside sources (SSE).
As predictors are added to the model, we necessarily increase SSReg / decrease SSE.
We want a measure of model fit that is resistant to this fluctuation,
R^2_{\text{adj}} = 1 - \left( \frac{n-1}{n-k-1} \right) \left( 1 - R^2 \right),
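As a quick check of the two formulas above, here is a minimal base-R sketch (simulated data; all variable names are made up for illustration) comparing hand-computed R^2 and adjusted R^2 against `summary(lm)`:

```r
# Minimal sketch (simulated data): R^2 and adjusted R^2 by hand vs. summary(lm)
set.seed(123)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 3 + 1.5 * x1 - 0.8 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
k   <- 2                                  # number of predictors

ss_tot <- sum((y - mean(y))^2)            # SSTot: total variability in y
ss_e   <- sum(resid(fit)^2)               # SSE: unexplained variability
r2     <- 1 - ss_e / ss_tot               # = SSReg / SSTot
r2_adj <- 1 - ((n - 1) / (n - k - 1)) * (1 - r2)

all.equal(r2,     summary(fit)$r.squared)      # TRUE
all.equal(r2_adj, summary(fit)$adj.r.squared)  # TRUE
```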
We will use the ssstats package to check diagnostics.
For model fit, we can use the r_squared() function.
m1: \hat{y} = 3 + 1.5 x
m6: \hat{y} = 2 + 1.1 x_1 - 0.8 x_2
What do I do when there is a very low R^2 value? The questions I ask:
What did the assumption graphs look like?
Do we actually meet the normality assumption?
If not, should we consider taking a different methodological approach?
I should graph the data to see what it looks like.
Is the relationship non-linear?
Is there an outlier affecting the relationship?
Is the data just super noisy?
Is there a variable missing from the model that could help explain the response?
Remember that the R^2 value is not the end-all, be-all.
Like everything in statistics/data science, context matters.
In some applications, low R^2 values are common.
In other applications, high R^2 values are expected.
Always interpret R^2 in the context of your specific field and research question.
Because the calculation of the AIC and BIC depends on the likelihood, we will not dive into the formulas here.
Instead, let’s discuss how the AIC and BIC penalize our models.
Both AIC and BIC balance model fit and model complexity.
AIC imposes a penalty of 2p.
BIC imposes a penalty of p \times \log(n).
Remember, p is the number of parameters estimated in the model.
In this course, we will not see a great difference between the AIC and BIC values.
However, in more complex models (e.g., analyzing dependent data, where p > k), the difference can be more pronounced.
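The penalties above can be verified directly in R. Note one wrinkle: R's `logLik()` also counts the residual variance as an estimated parameter, so the p it reports for a Gaussian GLM is k + 2, not the k + 1 convention we use for the betas alone. A minimal sketch with simulated data:

```r
# Minimal sketch (simulated data): AIC and BIC by hand from the
# log-likelihood, checked against R's AIC() and BIC().
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)

fit <- glm(y ~ x, family = "gaussian")

ll <- as.numeric(logLik(fit))
p  <- attr(logLik(fit), "df")  # note: R also counts the residual variance,
                               # so p = k + 2 here (intercept, slope, sigma)

aic_hand <- -2 * ll + 2 * p        # AIC penalty: 2p
bic_hand <- -2 * ll + p * log(n)   # BIC penalty: p * log(n)

all.equal(aic_hand, AIC(fit))  # TRUE
all.equal(bic_hand, BIC(fit))  # TRUE
```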
What do the values actually mean?
How do we use these?
We use these to compare models.
The model with the lowest AIC or BIC fits the data “better”.
Consider an example dataset with y, x_1, x_2, and x_3; we fit the following models:
m1: y \sim x_1 (AIC = 250, BIC = 255)
m2: y \sim x_1 + x_2 (AIC = 240, BIC = 248)
m3: y \sim x_1 + x_2 + x_3 (AIC = 245, BIC = 256)
We would then declare m2 the best fitting model as its AIC and BIC values are the lowest.
Recall one of the models we examined in the visualization lecture.
h1 <- glm(clubhouse_happiness ~ time_with_friends + goofy_laughs + donald_grumbles,
family = "gaussian",
data = clubhouse)
h1 %>% tidy()

\hat{\text{happiness}} = 47.58 + 3.58 \text{ time} + 0.66 \text{ laughs} - 1.06 \text{ grumbles}
Let’s now consider a set of models,
h1: clubhouse_happiness ~ time_with_friends + goofy_laughs + donald_grumbles
h2: clubhouse_happiness ~ time_with_friends + goofy_laughs
h3: clubhouse_happiness ~ goofy_laughs + donald_grumbles
h1 <- glm(clubhouse_happiness ~ time_with_friends + goofy_laughs + donald_grumbles, family = "gaussian", data = clubhouse)
h2 <- glm(clubhouse_happiness ~ time_with_friends + goofy_laughs, family = "gaussian", data = clubhouse)
h3 <- glm(clubhouse_happiness ~ goofy_laughs + donald_grumbles, family = "gaussian", data = clubhouse)

AIC values for h1, h2, and h3:

[1] 2222.259
[1] 2274.454
[1] 2252.592

The model with the lowest AIC (h1: clubhouse_happiness ~ time_with_friends + goofy_laughs + donald_grumbles) fits the data best.

BIC values for h1, h2, and h3:

[1] 2240.778
[1] 2289.269
[1] 2267.407

The model with the lowest BIC (h1: clubhouse_happiness ~ time_with_friends + goofy_laughs + donald_grumbles) fits the data best.

| Model | AIC | BIC |
|---|---|---|
| h1 | 2222.259 | 2240.778 |
| h2 | 2274.454 | 2289.269 |
| h3 | 2252.592 | 2267.407 |
Both the AIC and BIC indicate that h1 is the best fitting model.
We can see that BIC > AIC consistently, which is expected here: the BIC penalty p \log(n) exceeds the AIC penalty 2p whenever \log(n) > 2 (i.e., n \geq 8).
The word “outlier” is part of our everyday language.
Mathematically, an outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
Outliers in y are observations with large residuals; these are what we mean by “outliers” in the formal sense.
Outliers can have a significant impact on the regression model.
They can skew the results, leading to biased estimates of the regression coefficients.
They can also affect the overall fit of the model, leading to misleading conclusions.
e_i = y_i - \hat{y}_i
e_{i_{\text{std}}} = \frac{e_i}{\sqrt{\text{MSE}(1-h_i)}}
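The standardized residual formula above can be reproduced in base R and checked against `rstandard()` (a minimal sketch with simulated data):

```r
# Minimal sketch (simulated data): standardized residuals by hand,
# checked against base R's rstandard().
set.seed(1)
n <- 40
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

fit <- lm(y ~ x)

e   <- resid(fit)                  # raw residuals e_i = y_i - yhat_i
h   <- hatvalues(fit)              # leverages h_i
mse <- sum(e^2) / df.residual(fit) # mean squared error

e_std <- e / sqrt(mse * (1 - h))   # the formula from the slide

all.equal(e_std, rstandard(fit))   # TRUE
```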
Ultimately, we want all of our residuals as close to 0 as possible. However, some will always be larger than others; a common rule of thumb flags observations with |e_{i_{\text{std}}}| > 2.5 as suspected outliers.
We will use the rstandard() function in base R to find the standardized residuals.
For ease of examining in large datasets, we will use it to create a “flag.”
Let’s return to our Mickey and friends example, recalling the models we created for the AIC and BIC example.
h1: clubhouse_happiness ~ time_with_friends + goofy_laughs + donald_grumbles
h2: clubhouse_happiness ~ time_with_friends + goofy_laughs
h3: clubhouse_happiness ~ goofy_laughs + donald_grumbles
Let’s examine the outliers. First, we create the flags.
Note!!! I am adding the flags for each model because the outliers can differ by model, and I want to demonstrate that here.
Finally, how do we handle outliers?
My approach:
Investigate the outlier: try to determine if it is a data entry error or a valid observation.
This requires discussion with those involved with data collection and is not possible in the classroom setting.
e.g., BMI of 93 in the Jackson Heart Study
Sensitivity analysis: fit the model with and without the outlier(s) to see how much the results change.
Please note that we should always be cautious when removing outliers from the dataset.
We do not want to curate data to fit our preconceived notions.
Always document any decisions made regarding outliers and justify them based on nonstatistical reasoning (e.g., data entry error that cannot be remedied).
Now, what about outliers in x_i (the predictors)?
These points can have high leverage and/or be influential.
Leverage measures how far an observation’s predictor values are from the mean of the predictor values.
Influential points are observations that have a significant impact on the regression model’s estimates.
We will use Cook’s distance to assess leverage and influence.
Like outliers, an observation’s leverage and influence values depend on the model that was specified.
h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}
Conceptually, leverage measures how far an observation’s predictor values are from the mean of the predictor values, relative to the overall variability in the predictor values.
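For a single predictor, the leverage formula above matches what base R's `hatvalues()` computes (a minimal sketch with simulated data):

```r
# Minimal sketch (simulated data): the single-predictor leverage formula,
# checked against base R's hatvalues().
set.seed(7)
n <- 30
x <- rnorm(n)
y <- 5 + 3 * x + rnorm(n)

fit <- lm(y ~ x)

# h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
h_hand <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

all.equal(as.numeric(hatvalues(fit)), h_hand)  # TRUE
```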
On the previous slide, we said we will instead use Cook’s distance, which combines information about both the residual and leverage of each observation.
D_i = \frac{e_i^2}{(k+1)\,\text{MSE}} \left( \frac{h_i}{(1-h_i)^2} \right)
where
e_i is the residual for observation i,
h_i is the leverage for observation i,
k is the number of predictor terms in the model, and
MSE is the mean squared error of the model.
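Putting the pieces above together, Cook's distance can be computed by hand and checked against base R's `cooks.distance()` (a minimal sketch with simulated data):

```r
# Minimal sketch (simulated data): Cook's distance by hand,
# checked against base R's cooks.distance().
set.seed(99)
n  <- 40
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 - 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
k   <- 2                            # number of predictor terms

e   <- resid(fit)                   # residuals
h   <- hatvalues(fit)               # leverages
mse <- sum(e^2) / df.residual(fit)  # mean squared error

d_hand <- (e^2 / ((k + 1) * mse)) * (h / (1 - h)^2)

all.equal(d_hand, cooks.distance(fit))  # TRUE
```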
How do we interpret Cook’s distance?
There are different cutoff rules suggested in the literature.
Common cutoff: if D_i > 4/(n - k - 1), then observation i is considered influential.
For a more intuitive approach, we will graph the Cook’s distance values.
I look for observations with substantially larger Cook’s distance values than the rest of the observations.
These are the influential points.
We will load the ssstats package and use the cooks() function to generate the graph we need.

Finally, how do we handle influential points? Recall: they are technically outliers….
My approach:
Investigate the outlier: try to determine if it is a data entry error or a valid observation.
Sensitivity analysis: fit the model with and without the outlier(s) to see how much the results change.
Again, we should not formally remove influential points without investigation.
Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated.
Consequences of multicollinearity:
Inflated standard errors of the regression coefficients.
Unstable coefficient estimates that can change significantly with small changes in the data.
Difficulty in determining the individual effect of each predictor on the response variable.
Note that multicollinearity doesn’t necessarily “break” the model; instead, it makes the coefficients hard to trust.
To assess multicollinearity, we can use the variance inflation factor (VIF).
Mathematically,
\text{VIF}_j = \frac{1}{1 - R_j^2}
here,
j indexes the predictor number; j = 1, 2, ..., k, and
R_j^2 is the R^2 when regressing predictor j on all other predictors in the model.
How do we make the call that multicollinearity exists?
Like Cook’s distance, there are different cutoff rules suggested in the literature.
Typical cutoff: if VIF > 5 (or 10), then multicollinearity is a concern.
\text{VIF}_j = \frac{1}{1 - R_j^2}
The closer VIF is to 1, the less multicollinearity there is.
If R_j^2 \to 0 (the other k-1 predictors do a terrible job of predicting x_j) then \text{VIF}_j \to 1 (no multicollinearity).
If R_j^2 \to 1 (the other k-1 predictors do a fantastic job of predicting x_j), then \text{VIF}_j \to \infty (severe multicollinearity).
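The VIF definition above can be computed directly with base R (a minimal sketch; the data are simulated, with x1 and x2 deliberately constructed to be correlated so their VIFs are inflated, while x3 is independent):

```r
# Minimal sketch (simulated data): VIFs by hand via the definition
# VIF_j = 1 / (1 - R_j^2), regressing each predictor on the others.
set.seed(5)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)  # strongly correlated with x1
x3 <- rnorm(n)                 # independent of x1 and x2
y  <- 1 + x1 + x2 + x3 + rnorm(n)

vifs <- c(
  x1 = 1 / (1 - summary(lm(x1 ~ x2 + x3))$r.squared),
  x2 = 1 / (1 - summary(lm(x2 ~ x1 + x3))$r.squared),
  x3 = 1 / (1 - summary(lm(x3 ~ x1 + x2))$r.squared)
)
round(vifs, 2)  # x1 and x2 well above 5; x3 near 1
```

The vif() function discussed next computes the same quantities for a fitted model.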
We will use the vif() function from the car package. Note that the car package overwrites some functions from tidyverse, so I typically do not load the full library.

Chip and Dale are running “Snack Recovery Missions” around Mickey’s Clubhouse after Pluto keeps “misplacing” everyone’s treats.
For each mission, they record:
| distance_m | steps | calories | wrong_turns | rainbow_map |
|---|---|---|---|---|
| 19.567651 | 16.198916 | 14.287939 | 1.005378 | 1.007974 |
YIKES! Distance, steps, and calories indicate multicollinearity.
How do we handle multicollinearity?
Simple: remove one or more of the correlated predictors from the model.
Advanced: combine the correlated predictors into a single predictor using techniques such as principal component analysis (PCA) or factor analysis.
When this happens, I will figure out specifically what variables do not play well with others.
Then, I will have a conversation with the person I’m working with to determine which variable(s) make the most sense to keep in the model.
The concept of sensitivity analysis has been mentioned on previous slides.
This is where we remove specific “problem points” (outliers or influential points) from the dataset and refit the model to see how much the results change.
Let’s consider the final model from the Chip and Dale example,
\hat{\text{recover}} = 16.95 + 0.18 \text{ calories} + 1.92 \text{ wrong turns} - 3.48 \text{ rainbow map}
chip_dale <- chip_dale %>%
  mutate(outlier = if_else(abs(rstandard(m_reduced)) > 2.5, "Suspected", "Not Suspected"))

chip_dale %>% count(outlier)

Descriptive statistics for the chip_dale variables:

# A tibble: 5 × 3
  variable           mean_sd        median_iqr
  <chr>              <chr>          <chr>
1 calories           70.2 (17.7)    70.4 (23.3)
2 distance_m         1200.5 (338.6) 1196.0 (419.4)
3 minutes_to_recover 22.0 (5.1)     21.9 (6.8)
4 steps              1606.7 (505.3) 1615.8 (691.5)
5 wrong_turns        3.0 (1.6)      3.0 (2.0)
In our diagnostics for extreme values, we found that observation 29 is flagged as both an outlier and a leverage point, while observation 40 is flagged only as an outlier.
Let’s consider the following modeling situations:
We will then compare the results across these three situations.
There are multiple ways to approach this in R.
We could use tidyverse functions to filter within the modeling function. For simplicity, we will use three datasets:

chip_dale: original (full) dataset.
chip_dale_29: dataset without observation 29.
chip_dale_29_40: dataset without observations 29 and 40.

The data steps are as follows,
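The code that creates the reduced datasets is not shown on the slide; a minimal sketch using dplyr (assuming observations 29 and 40 refer to row positions) might look like the following. The chip_dale data frame below is a simulated stand-in for the course dataset; only the two filter() calls reflect the actual workflow.

```r
# Hypothetical data steps: drop observations by row position with dplyr.
# chip_dale here is a simulated stand-in (the real data are not shown).
library(dplyr)

set.seed(10)
chip_dale <- data.frame(
  minutes_to_recover = rnorm(50, 22, 5),
  calories           = rnorm(50, 70, 18),
  wrong_turns        = rpois(50, 3),
  rainbow_map        = rbinom(50, 1, 0.5)
)

chip_dale_29    <- chip_dale %>% filter(row_number() != 29)
chip_dale_29_40 <- chip_dale %>% filter(!row_number() %in% c(29, 40))

nrow(chip_dale_29)     # 49
nrow(chip_dale_29_40)  # 48
```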
m_full <- glm(minutes_to_recover ~ calories + wrong_turns + rainbow_map,
family = "gaussian",
data = chip_dale)
m_29 <- glm(minutes_to_recover ~ calories + wrong_turns + rainbow_map,
family = "gaussian",
data = chip_dale_29)
m_29_40 <- glm(minutes_to_recover ~ calories + wrong_turns + rainbow_map,
family = "gaussian",
data = chip_dale_29_40)

| Term | Full | No 29 | No 29 or 40 |
|---|---|---|---|
| (Intercept) | 7.10 (p < 0.001) | 6.90 (p < 0.001) | 7.15 (p < 0.001) |
| calories | 0.16 (p < 0.001) | 0.16 (p < 0.001) | 0.16 (p < 0.001) |
| wrong turns | 1.75 (p < 0.001) | 1.81 (p < 0.001) | 1.79 (p < 0.001) |
| rainbow map | -3.88 (p < 0.001) | -3.20 (p < 0.001) | -3.29 (p < 0.001) |
Questions to ask:
Do the coefficient estimates change meaningfully? Is the scientific interpretation the “same”?
Do the statistical significances change?
Removing these observations did not meaningfully change the coefficient estimates or their statistical significance.
Across models, the scientific interpretations remained approximately the same.
Direction did not change.
Value of coefficients is similar across models.
All predictors remained statistically significant at \alpha = 0.05, specifically with p < 0.001.
So what would I do?
My intention would be to keep all observations.
I would bring the results of the sensitivity analysis up when discussing the results with collaborators, and ask if they see anything “weird” about those specific observations.
This lecture has covered the basics of model diagnostics related to model fit, outliers, leverage/influence, and multicollinearity.
We also discussed sensitivity analysis as a way to assess the impact of “problem points”.
We must have a nonstatistical justification for removing any data points from the final analysis.
We are helping others explore their data on the quantitative side, but we need their domain expertise to make final decisions.
Next module: incorporating categorical predictors