Categorical Predictors: Statistical Testing

Introduction

  • Recall the general linear model, y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon

  • In the last lecture, we introduced the inclusion of categorical predictors in our model.

    • Remember, if a categorical predictor has c classes, we include c-1 indicator variables in the model.
  • We also discussed how to deal with categorical variables either as a factor variable or as a set of indicator variables.
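As a quick illustration of the c-1 coding, here is a minimal sketch in base R (with hypothetical data, not the duck dataset): a factor with c = 3 classes is expanded into c - 1 = 2 indicator columns, with the first level alphabetically serving as the reference class.

```r
# A factor with c = 3 classes ("Backyard", "Garage", "Kitchen")
location <- factor(c("Backyard", "Garage", "Kitchen", "Garage"))

# model.matrix() shows the design matrix R builds behind the scenes:
# an intercept plus c - 1 = 2 indicator columns ("Backyard" is the reference)
X <- model.matrix(~ location)
colnames(X)
```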

Lecture Example Set Up

  • Recall our example dataset of duck-related incidents,
library(tidyverse) # provides read_csv(), mutate(), and the %>% pipe
duck_incidents <- read_csv("https://raw.githubusercontent.com/samanthaseals/SDSII/refs/heads/main/files/data/lectures/W2_duck_incidents.csv") %>%
  mutate(loc_yard = if_else(location == "Backyard", 1, 0),
         loc_garage = if_else(location == "Garage", 1, 0),
         loc_kitchen = if_else(location == "Kitchen", 1, 0),
         loc_living = if_else(location == "Living Room", 1, 0))
duck_incidents %>% head()

Categorical Variables in the Model

  • We now want to discuss statistical inference of categorical predictors.

  • Recall that in the case of c \ge 3 classes, we include c-1 indicator variables in the model.

    • How do I determine if the predictor as a whole is statistically significant?

    • How do I determine if individual classes are statistically significant?

      • What does this mean?

Inference for Categorical Variables (c \ge 2)

  • Consider the case of a categorical predictor with c \ge 2 classes.

  • To determine if the categorical predictor as a whole is statistically significant, we can use an ANOVA F-test (or \chi^2 likelihood ratio test for other GzLMs).

    • This tests if all coefficients for the indicator variables are equal to zero.

    • If we reject this null hypothesis, we conclude that at least one class is different.

  • Then, to determine if individual classes are statistically significant, we can look at the t-tests for each coefficient.

    • Each t-test tests the null hypothesis that the coefficient for that class is equal to zero.

    • If we reject for a specific class, we conclude that the difference between that class and the reference class is non-zero.

Inference for Categorical Variables (c \ge 2)

  • Hypotheses

    • H_0: \ \beta_1 = ... = \beta_{c-1} = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic and p-Value

    • F_0 = [\text{value from R}], p = [\text{value from R}]
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = [\text{assumed } \alpha].
  • Conclusion/Interpretation

    • [Reject (if p < \alpha) or fail to reject (if p \ge \alpha)] H_0. There [is (if reject) or is not (if FTR)] sufficient evidence to suggest that [categorical predictor] significantly predicts [outcome variable].

Inference for Categorical Variables (c \ge 2) (R)

  • How do we perform this omnibus (or global) test?

  • If we are using a factor variable,

m <- glm(y ~ x1 + x2 + ... + xk, family = "gaussian", data = dataset_name)
car::Anova(m, type = 3, test = "F")
  • If we are using indicator variables,
full <- glm(y ~ x1 + x2 + ... + xk, family = "gaussian", data = dataset_name) # ALL predictors
reduced <- glm(y ~ x1 + ... + x(k-c+1), family = "gaussian", data = dataset_name) # all predictors WITHOUT the indicators
anova(reduced, full, test = "F")

Example 1: One Categorical Predictor

  • Recall our example from the last lecture,
m1 <- glm(damage_cost ~ location, 
          family = "gaussian",
          data = duck_incidents)
tidy(m1, conf.int = TRUE) # tidy() comes from the broom package
  • Here, location is a factor variable with four classes.

    • Because there are four classes and location is a factor, we will use the car::Anova() approach.

Example 1: One Categorical Predictor

  • To determine if location is statistically significant as a whole,
car::Anova(m1, type = 3, test = "F")
  • We see that location significantly predicts damage cost (p = 0.017).

Example 1: One Categorical Predictor

  • Hypotheses

    • H_0: \ \beta_{\text{garage}} = \beta_{\text{kitchen}} = \beta_{\text{living room}} = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic and p-Value

    • F_0 = 3.450, p = 0.017
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = 0.05.
  • Conclusion/Interpretation

    • Reject H_0. There is sufficient evidence to suggest that location significantly predicts damage cost.

Inference for Categorical Variables (c \ge 2)

  • When we have determined that a categorical predictor is statistically significant as a whole, we can look at the individual t-tests for each class.

  • Then, our hypotheses for each class are:

    • H_0: \ \beta_i = 0 (class i is not significantly different from the reference class)
    • H_1: \ \beta_i \ne 0 (class i is significantly different from the reference class)
  • Remember that these are testing that particular group against the reference group and are similar to posthoc tests.

    • Note! Because this is similar to posthoc testing, we should adjust our \alpha.

    • A simple way to do this is to use the Bonferroni correction: \alpha_{\text{B}} = \frac{\alpha}{c-1}.
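The Bonferroni adjustment above is just a division; a minimal sketch (with c = 4, matching the location example, but swap in your own c):

```r
alpha <- 0.05
c_classes <- 4                      # hypothetical c; use your predictor's class count
alpha_B <- alpha / (c_classes - 1)  # Bonferroni-adjusted alpha for the c - 1 t-tests
round(alpha_B, 4)                   # each class-level t-test is compared to this
```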

Inference for Categorical Variables (c \ge 2)

  • Hypotheses

    • H_0: \ \beta_i = 0
    • H_1: \ \beta_i \ne 0
  • Test Statistic and p-Value

    • t_0 = [\text{value from R}], p = [\text{value from R}]
  • Rejection Region

    • Reject H_0 if p < \alpha_{\text{B}}; \alpha_{\text{B}} = [(\text{assumed } \alpha)/(c-1)].
  • Conclusion/Interpretation

    • [Reject (if p < \alpha) or fail to reject (if p \ge \alpha)] H_0. There [is (if reject) or is not (if FTR)] sufficient evidence to suggest that [class name] is significantly different from [reference class name] when predicting [outcome variable].

Example 1: One Categorical Predictor

  • In our example,
tidy(m1, conf.int = TRUE)
  • The garage is not significantly different from the backyard (p = 0.126). This difference is estimated to be \$64 (-\$18, \$147).

  • The kitchen is significantly different from the backyard (p = 0.001). We estimate this difference to be \$131 (\$51, \$212).

  • The living room is not significantly different from the backyard (p = 0.132). This difference is estimated to be \$59 (-\$17, \$135).

Example 2: Two Categorical Predictors

  • Recall our second example,
m4 <- glm(damage_cost ~ location + nephew,
          family = "gaussian",
          data = duck_incidents)
tidy(m4, conf.int = TRUE)
  • Here, we have two categorical predictors: location (4 classes) and nephew (2 classes).

    • Because location is a factor variable with four classes, we will use the car::Anova() approach again.

Example 2: Two Categorical Predictors

  • To determine if location and nephew are statistically significant as a whole,
car::Anova(m4, type = 3, test = "F")
  • We see that both location (p = 0.034) and nephew (p < 0.001) significantly predict damage cost.

Example 2: Two Categorical Predictors

  • We mentioned earlier that if we are using indicator variables, we will take the reduced/full approach.

  • We can still do that here – let’s try it out.

full <- glm(damage_cost ~ location + nephew, family = "gaussian", data = duck_incidents) # full model
reduced_loc <- glm(damage_cost ~ nephew, family = "gaussian", data = duck_incidents) # reduced without location
reduced_neph <- glm(damage_cost ~ location, family = "gaussian", data = duck_incidents) # reduced without nephew
anova(reduced_loc, full, test = "F") # test location
anova(reduced_neph, full, test = "F") # test nephew

Example 2: Two Categorical Predictors

car::Anova(m4, type = 3, test = "F") # car::Anova() approach
anova(reduced_loc, full, test = "F") # full/reduced for location
anova(reduced_neph, full, test = "F") # full/reduced for nephew

Example 2: Two Categorical Predictors

  • Hypotheses

    • H_0: \ \beta_{\text{garage}} = \beta_{\text{kitchen}} = \beta_{\text{living room}} = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic and p-Value

    • F_0 = 2.91, p = 0.034
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = 0.05.
  • Conclusion/Interpretation

    • Reject H_0. There is sufficient evidence to suggest that location significantly predicts damage cost.

Example 2: Two Categorical Predictors

  • Hypotheses

    • H_0: \ \beta_{\text{Huey}} = \beta_{\text{Louie}} = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic and p-Value

    • F_0 = 26.69, p < 0.001
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = 0.05.
  • Conclusion/Interpretation

    • Reject H_0. There is sufficient evidence to suggest that nephew significantly predicts damage cost.

Example 2: Two Categorical Predictors

  • Let’s re-examine this model with the location indicators,
m4a <- glm(damage_cost ~ location + nephew, family = "gaussian", data = duck_incidents)
coefficients(m4a)
        (Intercept)      locationGarage     locationKitchen locationLiving Room 
         139.288148           51.491118          113.783352           66.873723 
         nephewHuey         nephewLouie 
          -7.974478          206.606636 
m4b <- glm(damage_cost ~ loc_garage + loc_kitchen + loc_living + nephew, family = "gaussian", data = duck_incidents)
coefficients(m4b)
(Intercept)  loc_garage loc_kitchen  loc_living  nephewHuey nephewLouie 
 139.288148   51.491118  113.783352   66.873723   -7.974478  206.606636 

Example 2: Two Categorical Predictors

  • What happens when I run these models through car::Anova()?
car::Anova(m4a, type = 3, test = "F") # factor variable
car::Anova(m4b, type = 3, test = "F") # indicator variables

Example 2: Two Categorical Predictors

  • If you have (manually) included indicator variables, you must take the full/reduced approach when examining omnibus tests for categorical predictors.
car::Anova(m4b, type = 3, test = "F") # indicator variables
reduced_loc2 <- glm(damage_cost ~ nephew, family = "gaussian", data = duck_incidents) # reduced without location
anova(reduced_loc2, m4b, test = "F") # test location

Example 2: Two Categorical Predictors

  • Now that we know that location and nephew are statistically significant predictors (p = 0.034 and p < 0.001, respectively), we can look at the individual t-tests for each class.
tidy(m4, conf.int = TRUE)

Example 2: Two Categorical Predictors

  • Location:

    • The garage is not significantly different from the backyard (p = 0.197). This difference is estimated to be $51 (-$27, $130).

    • The kitchen is significantly different from the backyard (p = 0.004). We estimate this difference to be $114 ($37, $190).

    • The living room is not significantly different from the backyard (p = 0.071). This difference is estimated to be $67 (-$5, $139).

  • Nephew

    • There is not a difference in cost between Huey and Dewey (p = 0.813). We estimate that Huey causes $8 less in damage (-$74, $58).

    • There is a difference between Louie and Dewey (p < 0.001). We estimate that Louie causes $207 more in damage ($141, $272).

Example 3: Mixing Continuous and Categorical Predictors

  • Let’s now look at location, nephew, and sugar intake as predictors.

    • What changes when continuous predictors are in the mix?
m5 <- glm(damage_cost ~ location + nephew + sugar_grams,
          family = "gaussian",
          data = duck_incidents)
tidy(m5, conf.int = TRUE)

Example 3: Mixing Continuous and Categorical Predictors

  • Thankfully, nothing new is introduced – we can take either approach!

    • Note that when I have a factor as a predictor, I always use car::Anova().
car::Anova(m5, type = 3, test = "F")
  • We can see that nephew is significant (p < 0.001) while location and sugar intake are not (p = 0.140 and p = 0.073, respectively).

    • If we were to do the pairwise testing, we would only consider the nephew comparisons.

    • We would not look at the location comparisons because location is not significant at the global level.

Example 3: Mixing Continuous and Categorical Predictors

  • Wait… sugar intake can be examined using car::Anova() too?
car::Anova(m5, type = 3, test = "F")
  • Yes! We do not need separate tidy() output to examine continuous predictors.
tidy(m5, conf.int = TRUE) %>% tail(n=4)

Example 3: Mixing Continuous and Categorical Predictors

  • Recall the relationship between a t-test and an F-test:

F \approx t^2

  • and that their p-values will be the same.

  • We see that in this example,

| Method       | Test Statistic | p-Value |
|--------------|----------------|---------|
| car::Anova() | F = 3.22       | 0.073   |
| tidy()       | t = 1.80       | 0.073   |
  • To confirm the relationship, 1.80^2 = 3.24 \approx 3.22.
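We can also verify this relationship directly in R. A hedged sketch using the built-in mtcars data (not the duck data): for a single-df predictor in a Gaussian GLM, the Wald t from the coefficient table squares to the F for that term.

```r
# Fit a Gaussian GLM with one continuous predictor
m <- glm(mpg ~ wt, family = "gaussian", data = mtcars)

# Wald t-statistic for wt from the coefficient table
t0 <- summary(m)$coefficients["wt", "t value"]

# F-statistic for wt from the analysis of deviance table
F0 <- anova(m, test = "F")["wt", "F"]

c(t_squared = t0^2, F = F0) # the two values agree
```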

Example 4: Binary Predictors

  • Recall that we created a binary indicator for punished – it is stored as 0/1 (numeric).
duck_incidents <- duck_incidents %>%
  mutate(donald_action = case_when(donald_reaction %in% c("Assigns Chores", "Grounds") ~ "Punished",
                                   donald_reaction %in% c("Laughs", "Yells") ~ "No Punishment")) %>%
  mutate(punished = if_else(donald_action == "Punished", 1, 0))
duck_incidents %>% count(punished)

Example 4: Binary Predictors

  • Let’s now model damage cost using sugar_grams and punished.
m6 <- glm(damage_cost ~ sugar_grams + punished,
          family = "gaussian",
          data = duck_incidents)
tidy(m6, conf.int = TRUE)
  • Because punished is a binary (0/1) indicator, we can read its hypothesis test just like that of a continuous predictor.

  • In this model, sugar intake is significant (p < 0.001) while punishment is not (p = 0.773).

Example 4: Binary Predictors

  • Comparing this to the car::Anova() approach,
tidy(m6, conf.int = TRUE)
car::Anova(m6, type = 3, test = "F")
  • The p-values are the same.

    • Sugar intake is significant (p < 0.001) while punishment is not (p = 0.773).

Summary

  • When dealing with categorical predictors with c \ge 2, we must remember to test the global significance.

    • If using a factor variable, we can use car::Anova().

    • If using indicator variables, we must use the full/reduced approach.

  • When the omnibus test is significant, we then look at the individual t-tests for each class for pairwise significance.

    • Remember that we are always comparing against the reference group.

    • If the omnibus test is not significant, we do not consider the individual t-tests.

  • Like with most things in R, there is more than one way to complete the analysis.

    • Choose the method that (1) is appropriate and (2) makes the most sense to you/your brain.