Categorical Predictors: Modeling & Interpretation

Introduction

  • Recall the general linear model, y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon

  • In the last lecture, we introduced the concept of categorical predictors.

    • Remember, if a categorical predictor has c classes, we include c-1 indicator variables in the model.
  • We also discussed how to deal with categorical variables either as a factor variable or as a set of indicator variables.

Lecture Example Set Up

  • Recall our example dataset of duck-related incidents,
duck_incidents <- read_csv("https://raw.githubusercontent.com/samanthaseals/SDSII/refs/heads/main/files/data/lectures/W2_duck_incidents.csv") %>%
  mutate(loc_yard = if_else(location == "Backyard", 1, 0),
         loc_garage = if_else(location == "Garage", 1, 0),
         loc_kitchen = if_else(location == "Kitchen", 1, 0),
         loc_living = if_else(location == "Living Room", 1, 0))
duck_incidents %>% head()

Modeling with Categorical Predictors

  • Recall that for a categorical variable with c categories, we include c-1 indicator variables in the model.

    • If we specify a factor variable in our model, R takes care of the indicator variable creation for us.
  • The category that is not included in the model is called the reference group.

    • Choice of reference group is important for interpretation, but is arbitrary for omnibus tests, model fit, and predictions.

Modeling with Categorical Predictors

  • How does R handle reference groups when modeling?

    • If we include a factor variable in the model, R automatically creates the indicator variables and selects the reference group (default is the “first” level).

    • If we include indicator variables in the model, we must ensure that we include only c-1 of them.

      • The cth one is the reference group.

      • If we include all c, R will not return an estimate for one of the categorical terms.

        • If you are using SAS, it will likewise skip an estimate for one of the categorical terms, but it will warn you about it; the exact message depends on the PROC in use.
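A quick way to see this behavior in R (a sketch, assuming the indicator columns created from duck_incidents earlier):

```r
# Sketch: deliberately include all four location indicators. Because they sum
# to 1 (a linear combination of the intercept), R detects the dependency and
# reports NA for the aliased coefficient rather than an estimate.
m_all <- glm(damage_cost ~ loc_yard + loc_garage + loc_kitchen + loc_living,
             family = "gaussian",
             data = duck_incidents)
coefficients(m_all)  # one of the indicator coefficients will be NA
```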

Modeling with Categorical Predictors

  • Why will all programs skip an estimate? Let’s consider the nephew variable in our dataset,
Nephew x_{\text{H}} x_{\text{D}} x_{\text{L}}
Huey 1 0 0
Dewey 0 1 0
Louie 0 0 1
  • We only need two indicators to represent the three categories.

    • Suppose Louie is the reference group. That means we include x_{\text{H}} and x_{\text{D}} in the model.

    • When x_{\text{H}} = 1 and x_{\text{D}} = 0, we know nephew = \text{Huey}.

    • When x_{\text{H}} = 0 and x_{\text{D}} = 1, we know nephew = \text{Dewey}.

    • When x_{\text{H}} = 0 and x_{\text{D}} = 0, we know nephew = \text{Louie}.
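We can watch R build exactly this coding with model.matrix(); this is a standalone sketch (the nephew vector below is constructed by hand, not pulled from duck_incidents):

```r
# Sketch: a three-level factor, with Huey listed as the first level.
nephew <- factor(c("Huey", "Dewey", "Louie"),
                 levels = c("Huey", "Dewey", "Louie"))

# model.matrix() shows the design matrix R would use: an intercept column
# plus c - 1 indicator columns. The first level (Huey) has no column of its
# own -- it is absorbed into the intercept as the reference group.
model.matrix(~ nephew)
```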

Modeling with Categorical Predictors

  • Now, let’s consider this from the mathematical standpoint.

  • If we included all three indicators, we would have the following in our model:

y = \beta_0 + \beta_1 x_{\text{H}} + \beta_2 x_{\text{D}} + \beta_3 x_{\text{L}} + \varepsilon

  • However, we know that x_{\text{L}} = 1 - x_{\text{H}} - x_{\text{D}}; i.e., x_{\text{L}} is a linear combination of x_{\text{H}} and x_{\text{D}}. Substituting into the model,

y = \beta_0 + \beta_1 x_{\text{H}} + \beta_2 x_{\text{D}} + \beta_3 (1 - x_{\text{H}} - x_{\text{D}}) + \varepsilon

  • The algebraically simplified model is not identifiable: we can estimate only the combinations (\beta_0 + \beta_3), (\beta_1 - \beta_3), and (\beta_2 - \beta_3), not all four \beta terms individually.

y = (\beta_0 + \beta_3) + (\beta_1 - \beta_3) x_{\text{H}} + (\beta_2 - \beta_3) x_{\text{D}} + \varepsilon

Modeling with Categorical Predictors

  • How do we change reference groups for modeling/testing purposes?

    • If using a factor variable, we can relevel the factor using the factor() function with the levels argument.

    • If using indicator variables, we simply update the set of c-1 indicators in the model.
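For the factor-variable case, releveling looks like this (a sketch; the level names are assumed to match the location values in duck_incidents):

```r
# Sketch: re-order the factor levels so "Kitchen" is listed first and
# therefore becomes the reference group.
duck_incidents <- duck_incidents %>%
  mutate(location = factor(location,
                           levels = c("Kitchen", "Backyard",
                                      "Garage", "Living Room")))

# Equivalently, relevel() moves one level to the front without our having to
# list the rest:
# duck_incidents <- duck_incidents %>%
#   mutate(location = relevel(factor(location), ref = "Kitchen"))
```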

  • Prepare yourself to see a lot of “if you are using a factor variable… but if you are using indicator variables…” throughout the course.

    • This is because of the different ways that categorical variables can be stored in datasets.

Modeling with Categorical Predictors

  • What is my approach (in practice) to choosing a reference group?

    • If there is some natural order, I select the lowest as the reference.

    • If there is not an order, I will select one that “makes sense” as the comparison group when thinking of the context surrounding the question being asked of the data.

    • When I am communicating results to collaborators, I will explain my choice and then ask if they would like a different reference group.

Modeling with Categorical Predictors (R)

  • To actually model with categorical predictors in R, we use the appropriate modeling function as usual.

    • Note that the assumed distribution drives the choice of function, while the type of predictor affects the syntax within the function.
  • If we are using stored-as-factor variables, we simply include them,

m <- glm(y ~ continuous_var_1 + continuous_var_2 + factor_var + ...)
  • If we are using numerically stored categorical predictors, we can wrap them with as.factor(),
m <- glm(y ~ continuous_var_1 + continuous_var_2 + as.factor(numeric_var) + ...)
  • If we are using indicator variables, we include c-1 of them,
m <- glm(y ~ continuous_var_1 + continuous_var_2 + indicator_1 + ... + indicator_(c-1))

Example 1: One Categorical Predictor

  • Let’s model damage cost using only the location variable.
m1 <- glm(damage_cost ~ location, 
          family = "gaussian",
          data = duck_incidents)
tidy(m1, conf.int = TRUE)
  • Here, R has automatically created the indicator variables for us and selected “Backyard” as the reference group (alphabetically first).

Example 1: One Categorical Predictor

  • Let’s now model damage cost using the related indicator variables instead.
m2 <- glm(damage_cost ~ loc_garage + loc_kitchen + loc_living,
          family = "gaussian",
          data = duck_incidents)
tidy(m2, conf.int = TRUE)
  • This is the same output we saw when using the factor variable.

Example 1: One Categorical Predictor

  • Putting them side-by-side,
tidy(m1, conf.int = TRUE)
tidy(m2, conf.int = TRUE)

Example 1: One Categorical Predictor

  • Remember, we will want to restate the model in mathematical form:
coefficients(m1)
        (Intercept)      locationGarage     locationKitchen locationLiving Room 
          202.77983            64.34182           131.84687            58.69676 

\hat{\text{cost}} = 202.78 + 64.34 \text{ garage} + 131.85 \text{ kitchen} + 58.70 \text{ living}

  • From here, we can easily create the estimated costs for each location:

    • Backyard (reference): $202.78
    • Garage: $202.78 + $64.34 = $267.12
    • Kitchen: $202.78 + $131.85 = $334.63
    • Living Room: $202.78 + $58.70 = $261.48
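These by-hand sums can be checked with predict(); a sketch assuming m1 and duck_incidents from above:

```r
# Sketch: ask m1 for the fitted damage cost at each location instead of
# adding coefficients by hand. The fitted values should match the sums above.
new_locs <- data.frame(location = c("Backyard", "Garage",
                                    "Kitchen", "Living Room"))
predict(m1, newdata = new_locs)
```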

Interpreting Categorical Predictors

  • We know that for continuous predictors, the interpretation is “for a one unit increase in x_i, we expect the outcome to [increase or decrease] by \beta_i units, holding all other predictors constant.”

  • For categorical predictors, what is a one unit increase in x_j?

    • For an indicator variable, a one unit increase means going from not being in that category to being in that category.

    • For a factor variable, a one unit increase means going from the reference group to the category represented by that term.

Interpreting Categorical Predictors

  • Thus, the interpretation for categorical predictors becomes “for those in category j, as compared to the reference group, we expect the outcome to [increase or decrease] by \beta_j units, holding all other predictors constant.”

    • That is, \beta_j represents the difference in the expected outcome between those in category j and those in the reference group.

Example 1: Interpretations

  • Using our previous model output,
coefficients(m1)
        (Intercept)      locationGarage     locationKitchen locationLiving Room 
          202.77983            64.34182           131.84687            58.69676 
  • We can interpret the model estimates as follows:

    • The average cost of damage for incidents that occurred in the Backyard is $202.78.

    • For incidents that occurred in the Garage, we expect the damage cost to increase by $64.34 as compared to the Backyard.

    • For incidents that occurred in the Kitchen, we expect the damage cost to increase by $131.85 as compared to the Backyard.

    • For incidents that occurred in the Living Room, we expect the damage cost to increase by $58.70 as compared to the Backyard.

Example 1: Changing Reference Groups

  • We have said that the choice of reference group is arbitrary mathematically. Let’s prove this to ourselves.

  • Let’s now model damage cost with “Kitchen” as the reference group, using the indicator variables.

m3 <- glm(damage_cost ~ loc_yard + loc_garage + loc_living,
          family = "gaussian",
          data = duck_incidents)
tidy(m3, conf.int = TRUE)
  • We see that the estimates themselves have changed.

    • In future lectures, we will see that model fit and diagnostics remain the same.

Example 1: Changing Reference Groups

  • Let’s compare the estimates from the two models side-by-side.
coefficients(m2)
(Intercept)  loc_garage loc_kitchen  loc_living 
  202.77983    64.34182   131.84687    58.69676 
coefficients(m3)
(Intercept)    loc_yard  loc_garage  loc_living 
  334.62670  -131.84687   -67.50505   -73.15011 
  • The intercepts tell us the expected damage cost for the reference groups.
    • The average cost for the backyard is $202.78.
    • The average cost for the kitchen is $334.63.
  • Remember that the coefficient for loc_kitchen was $131.85; this is the difference between the kitchen and backyard averages.
    • $334.63 - $202.78 = $131.85

Example 2: Two Categorical Predictors

  • Let’s now model damage cost using both the location and nephew variables.
m4 <- glm(damage_cost ~ location + nephew,
          family = "gaussian",
          data = duck_incidents)
tidy(m4, conf.int = TRUE)

Example 2: Two Categorical Predictors

  • Restating our model mathematically,
coefficients(m4)
        (Intercept)      locationGarage     locationKitchen locationLiving Room 
         139.288148           51.491118          113.783352           66.873723 
         nephewHuey         nephewLouie 
          -7.974478          206.606636 

\begin{align*} \hat{\text{cost}} = 139.29 &+ 51.49 \text{ garage} + 113.78 \text{ kitchen} + 66.87 \text{ living} \\ & - 7.97 \text{ Huey} + 206.61 \text{ Louie} \end{align*}

  • We now have two reference groups!

    • Location: Backyard
    • Nephew: Dewey

Example 2: Two Categorical Predictors

\begin{align*} \hat{\text{cost}} = 139.29 &+ 51.49 \text{ garage} + 113.78 \text{ kitchen} + 66.87 \text{ living} \\ & - 7.97 \text{ Huey} + 206.61 \text{ Louie} \end{align*}

  • The intercept represents the expected damage cost for incidents that occurred in the backyard with Dewey present.

  • Locations (relative to the Backyard, holding nephew constant):

    • The garage increases expected damage cost by $51.49.
    • The kitchen increases expected damage cost by $113.78.
    • The living room increases expected damage cost by $66.87.
  • Nephews (relative to Dewey, holding location constant):

    • Having Huey present decreases expected damage cost by $7.97.
    • Having Louie present increases expected damage cost by $206.61.

Example 3: Mixing Continuous and Categorical Predictors

  • Let’s now model damage cost using location, nephew, and sugar_grams.
m5 <- glm(damage_cost ~ location + nephew + sugar_grams,
          family = "gaussian",
          data = duck_incidents)
tidy(m5, conf.int = TRUE)

Example 3: Mixing Continuous and Categorical Predictors

  • Restating our model mathematically,
coefficients(m5)
        (Intercept)      locationGarage     locationKitchen locationLiving Room 
          62.162513           54.773465           92.647154           58.373882 
         nephewHuey         nephewLouie         sugar_grams 
          -0.899301          198.792202            1.835897 

\begin{align*} \hat{\text{cost}} = 62.16 &+ 54.77 \text{ garage} + 92.65 \text{ kitchen} + 58.37 \text{ living} \\ & - 0.90 \text{ Huey} + 198.79 \text{ Louie} \\ & + 1.84 \text{ sugar} \end{align*}

Example 3: Mixing Continuous and Categorical Predictors

\begin{align*} \hat{\text{cost}} = 62.16 &+ 54.77 \text{ garage} + 92.65 \text{ kitchen} + 58.37 \text{ living} \\ & - 0.90 \text{ Huey} + 198.79 \text{ Louie} \\ & + 1.84 \text{ sugar} \end{align*}

  • The interpretation of predictors remains the same as we have previously learned.

    • The garage costs an additional $54.77 (vs. the backyard).
    • The kitchen costs an additional $92.65 (vs. the backyard).
    • The living room costs an additional $58.37 (vs. the backyard).
    • Having Huey present (vs. Dewey) decreases expected damage cost by $0.90.
    • Having Louie present (vs. Dewey) increases expected damage cost by $198.79.
    • For each additional gram of sugar intake, we expect the damage cost to increase by $1.84.
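To combine these pieces into a single prediction, we can again lean on predict(); a sketch assuming m5 from above (the 50-gram sugar value is an arbitrary illustration):

```r
# Sketch: expected damage cost for a kitchen incident with Louie present and
# 50 grams of sugar. This should match plugging the values into the fitted
# equation by hand.
scenario <- data.frame(location = "Kitchen",
                       nephew = "Louie",
                       sugar_grams = 50)
predict(m5, newdata = scenario)
```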

Example 4: Binary Predictors

  • Recall the punished binary indicator we created, which identifies whether Donald punishes the nephews (either “Assigns Chores” or “Grounds”) or not (either “Laughs” or “Yells”).
duck_incidents <- duck_incidents %>%
  mutate(donald_action = case_when(donald_reaction %in% c("Assigns Chores", "Grounds") ~ "Punished",
                                   donald_reaction %in% c("Laughs", "Yells") ~ "No Punishment")) %>%
  mutate(punished = if_else(donald_action == "Punished", 1, 0))
  • Double checking,
kable(duck_incidents %>% n_pct(donald_reaction, punished))
donald_reaction   0             1
Assigns Chores    0 (0.0%)      83 (39.3%)
Grounds           0 (0.0%)      128 (60.7%)
Laughs            70 (29.3%)    0 (0.0%)
Yells             169 (70.7%)   0 (0.0%)

Example 4: Binary Predictors

  • Let’s now model damage cost using sugar_grams and punished.
m6 <- glm(damage_cost ~ sugar_grams + punished,
          family = "gaussian",
          data = duck_incidents)
tidy(m6, conf.int = TRUE)

Example 4: Binary Predictors

  • Comparing this to the model using the donald_action factor variable instead,
m7 <- glm(damage_cost ~ sugar_grams + donald_action,
          family = "gaussian",
          data = duck_incidents)
tidy(m7, conf.int = TRUE)

Example 4: Binary Predictors

  • Looking at the coefficients side-by-side,
m6 %>% coefficients()
(Intercept) sugar_grams    punished 
  86.614516    3.800135    8.276618 
m7 %>% coefficients()
          (Intercept)           sugar_grams donald_actionPunished 
            86.614516              3.800135              8.276618 

Example 4: Binary Predictors

  • Mathematically, our model is

\hat{\text{cost}} = 86.61 + 3.80 \text{ sugar} + 8.28 \text{ punished}

  • Interpreting our model coefficients,

    • The average cost of damage when Donald does not punish the nephews and no sugar is involved is $86.61.

    • For each additional gram of sugar intake, we expect the damage cost to increase by $3.80.

    • When Donald punishes the nephews, we expect the damage cost to increase by $8.28.

Wrap Up

  • In this lecture, we have introduced modeling with categorical predictors.

  • Always remember that if there are c categories, there will be c-1 predictor terms in the model.

  • Whether we use a true factor variable or we use a set of indicator variables, estimation, model fit, and omnibus test results will remain the same.

  • The choice of reference group is arbitrary mathematically, but important for interpretation.

  • In the next lecture, we will discuss statistical inference with categorical predictors.