Visualization of the Model:
Categorical Predictors

Introduction

  • Last week, we discussed how to visualize the resulting model when there are only continuous predictors.

  • Categorical predictors change our graphing approach because they are inherently non-numeric, while our graph axes are entirely numeric.

  • Thus, we are adding categorical predictors today to get an idea of how to handle them.

    • We will cover models with only categorical predictors and models with both categorical and continuous predictors.

Lecture Example Set Up

  • Recall our example dataset of duck-related incidents,
duck_incidents <- read_csv("https://raw.githubusercontent.com/samanthaseals/SDSII/refs/heads/main/files/data/lectures/W2_duck_incidents.csv") %>%
  mutate(loc_yard = if_else(location == "Backyard", 1, 0),
         loc_garage = if_else(location == "Garage", 1, 0),
         loc_kitchen = if_else(location == "Kitchen", 1, 0),
         loc_living = if_else(location == "Living Room", 1, 0))
duck_incidents %>% head()

Example 1: One Categorical Predictor

  • Recall our example from the last lectures,
m1 <- glm(damage_cost ~ location, 
          data = duck_incidents)
tidy(m1, conf.int = TRUE)

Example 1: One Categorical Predictor

  • This model has only one categorical predictor (location), whose levels have no inherent order.

\hat{\text{cost}} = 202.78 + 64.34 \text{ garage} + 131.85 \text{ kitchen} + 58.70 \text{ living}

  • Because this is effectively a simple linear regression (one predictor with 4 levels - location), the way we visualized the model before does not make sense.

    • Why? We do not have a variable to put on the x-axis.
  • Instead, we will use a different type of plot to visualize the results.

    • In this case, I would graph the predicted damage cost for each location and include error bars for the confidence intervals.
  • First, we need to find the marginal means and the simultaneous confidence intervals.

Side Note: Marginal Means & Simultaneous CI

  • What are marginal means and simultaneous confidence intervals?

  • Marginal means are the predicted means for each level of a categorical predictor, averaging (or “marginalizing”) over the other predictors in the model.

  • Simultaneous confidence intervals are confidence intervals that account for multiple comparisons, ensuring that the overall confidence level is maintained across all intervals.

    • That is, we know that together, the confidence intervals have an overall 95% confidence level.
  • This means that the resulting CI will depend on the number of comparisons we are performing.

    • When using the Bonferroni correction, we divide \alpha by only the number of comparisons we are interested in.

    • The simultaneous CI accounts for all possible comparisons.
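  • To see the arithmetic behind these adjustments, here is a small sketch (assuming an overall \alpha = 0.05 and, as an example, all choose(4, 2) = 6 pairwise comparisons among the four locations):

```r
# Compare Bonferroni and Sidak per-comparison alpha levels
k <- 6        # example: all pairwise comparisons among 4 locations, choose(4, 2)
alpha <- 0.05

alpha_bonferroni <- alpha / k            # 0.05 / 6 ~ 0.00833
alpha_sidak <- 1 - (1 - alpha)^(1 / k)   # ~ 0.00851, slightly less conservative

alpha_bonferroni
alpha_sidak
```

  • Note that the Sidak per-comparison level is slightly larger (less conservative) than Bonferroni's, while still guaranteeing the overall 95% level under independence.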

Side Note: Marginal Means & Simultaneous CI (R)

  • In R, we will use the emmeans() function from the emmeans package to calculate the marginal means.
emm <- emmeans(model_name, ~ categorical_variable) 
  • Then, we use the confint() function to calculate the simultaneous confidence intervals.
emm_ci <- confint(emm, adjust = "sidak") 
  • Finally, we can create the dataset for graphing.
conf_data <- as_tibble(emm_ci)

Example 1: One Categorical Predictor

  • In our example,
emm <- emmeans(m1, ~ location) 
emm_ci <- confint(emm, adjust = "sidak") 
graph <- as_tibble(emm_ci)
  • Looking at the resulting data,
Location Average Cost LCL UCL
Backyard 202.7798 131.7132 273.8465
Garage 267.1216 189.7416 344.5017
Kitchen 334.6267 260.6045 408.6489
Living Room 261.4766 195.1439 327.8093

Graphing Means and Simultaneous CIs

  • Now that our dataset has been created, we can construct the graph.
graph %>% ggplot(aes(x = categorical_variable, y = emmean)) +
    geom_point() +
    geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
    labs(x = "Categorical Label",
         y = "Outcome Label") +
    theme_bw()

Example 1: One Categorical Predictor

  • In our example,
graph %>% ggplot(aes(x = location, y = emmean)) +
    geom_point() +
    geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
    labs(x = "Location",
         y = "Average Damage Cost") +
    theme_bw()

Example 3: Mixing Continuous and Categorical Predictors

  • Recall the model examining damage cost as a function of location, nephew, and sugar_grams.
m5 <- glm(damage_cost ~ location + nephew + sugar_grams,
          data = duck_incidents)
tidy(m5, conf.int = TRUE)

Visualizing Models with Mixed Predictor Types

  • Now we have a model with one continuous predictor and two categorical predictors.

  • This means that we can return to our previous method of visualizing the model.

    • We will graph the predicted damage cost (y) as a function of sugar grams (x); our predicted values will depend on location and nephew.
  • How do we include categorical predictors in this graph? There are two approaches:

    • Look at individual graphs for different categories.
    • Create different lines for specific categories.
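  • These two approaches map directly onto ggplot2 features: facets give individual graphs per category, while the color aesthetic gives different lines. A minimal sketch (assuming a column `pred` of model-predicted costs has already been added to duck_incidents):

```r
# Sketch: one panel per nephew, lines colored by location
duck_incidents %>%
  ggplot(aes(x = sugar_grams)) +
  geom_point(aes(y = damage_cost)) +
  geom_line(aes(y = pred, color = location)) +
  facet_wrap(~ nephew) +
  theme_bw()
```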

Example 3: Mixing Continuous and Categorical Predictors

  • Our current model has both categorical and continuous predictors,

\begin{align*} \hat{\text{cost}} = 62.16 &+ 54.77 \text{ garage} + 92.65 \text{ kitchen} + 58.37 \text{ living} \\ & - 0.90 \text{ Huey} + 198.79 \text{ Louie} \\ & + 1.84 \text{ sugar} \end{align*}

  • In our example,

    • We could create four graphs: one for each location with the lines defined by nephew.
    • We could create three graphs: one for each nephew with the lines defined by location.
    • We could create a single graph with lines defined by both location and nephew.
    • etc.

Visualizing Models with Mixed Predictor Types

  • Wait! There are multiple approaches!

  • Which approach is appropriate?

    • It depends on the context of the problem and what comparisons are most relevant.
    • Note: I do not put a ton of effort into graphs when drafting them for collaborators. I will draft something basic for demonstration purposes and discussion.
      • e.g., maybe a graph for a single nephew or a single location, then ask which is preferred before I draft the graphs for the additional nephews or locations.
  • For class credit: I am looking for competency in graphing. I am not looking for perfect graphs. Remember that we are practicing and cannot talk to our (hypothetical) collaborators in our projects.

Creating Predicted Values with Categorical Predictors

  • Recall how we defined our indicator variables for nephew,
Nephew x_{\text{H}} x_{\text{D}} x_{\text{L}}
Huey 1 0 0
Dewey 0 1 0
Louie 0 0 1
  • That means, to create predicted values for specific nephews, we need to set the indicator variables accordingly.

    • e.g., for Huey, set x_{\text{H}} = 1, x_{\text{D}} = 0, and x_{\text{L}} = 0.
    • e.g., for Dewey, set x_{\text{H}} = 0, x_{\text{D}} = 1, and x_{\text{L}} = 0.
    • e.g., for Louie, set x_{\text{H}} = 0, x_{\text{D}} = 0, and x_{\text{L}} = 1.
  • Remember, we only plug in for two of the nephews in the model (the third is the reference category).
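  • We can peek at the indicator (dummy) coding R builds behind the scenes; the reference category is the level whose indicators are all 0 (by default, the first level alphabetically):

```r
# Inspect the design matrix R constructs for the nephew term
model.matrix(~ nephew, data = duck_incidents) %>% head()
```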

Creating Predicted Values with Categorical Predictors

  • We will take the same approach for location.
Location x_{\text{B}} x_{\text{G}} x_{\text{K}} x_{\text{L}}
Backyard 1 0 0 0
Garage 0 1 0 0
Kitchen 0 0 1 0
Living Room 0 0 0 1
  • Then, the indicators will be plugged in as follows,

    • e.g., for backyard, set x_{\text{B}} = 1, x_{\text{G}} = 0, x_{\text{K}} = 0, and x_{\text{L}} = 0.
    • e.g., for garage, set x_{\text{B}} = 0, x_{\text{G}} = 1, x_{\text{K}} = 0, and x_{\text{L}} = 0.
    • e.g., for kitchen, set x_{\text{B}} = 0, x_{\text{G}} = 0, x_{\text{K}} = 1, and x_{\text{L}} = 0.
    • e.g., for living room, set x_{\text{B}} = 0, x_{\text{G}} = 0, x_{\text{K}} = 0, and x_{\text{L}} = 1.

Example 3: Mixing Continuous and Categorical Predictors

  • Back to the current model,

\begin{align*} \hat{\text{cost}} = 62.16 &+ 54.77 \text{ garage} + 92.65 \text{ kitchen} + 58.37 \text{ living} \\ & - 0.90 \text{ Huey} + 198.79 \text{ Louie} \\ & + 1.84 \text{ sugar} \end{align*}

  • Let’s first create what I would call the “draft” for my collaborator, then we will create the full graph.

  • We will create the graphs for the nephews and let the lines define the location.

  • That means sugar will vary on the x-axis.

Example 3: Mixing Continuous and Categorical Predictors

  • Defining our predicted values, we will need one for each location (because they are defining our lines) and for each nephew (because they are defining our graphs).

\begin{align*} \hat{\text{cost}} = 62.16 &+ 54.77 \text{ g} + 92.65 \text{ k} + 58.37 \text{ lr} - 0.90 \text{ H} + 198.79 \text{ L} + 1.84 \text{ sug.} \end{align*}

  • For Huey,
Location Equation for Predicted Cost
Backyard costH = 62.16 - 0.90 + 1.84 (sugar)
Garage costH = 62.16 + 54.77 - 0.90 + 1.84 (sugar)
Kitchen costH = 62.16 + 92.65 - 0.90 + 1.84 (sugar)
Living Room costH = 62.16 + 58.37 - 0.90 + 1.84 (sugar)

Example 3: Mixing Continuous and Categorical Predictors

  • Defining our predicted values, we will need one for each location (because they are defining our lines) and for each nephew (because they are defining our graphs).

\begin{align*} \hat{\text{cost}} = 62.16 &+ 54.77 \text{ g} + 92.65 \text{ k} + 58.37 \text{ lr} - 0.90 \text{ H} + 198.79 \text{ L} + 1.84 \text{ sug.} \end{align*}

  • For Dewey,
Location Equation for Predicted Cost
Backyard costD = 62.16 + 1.84 (sugar)
Garage costD = 62.16 + 54.77 + 1.84 (sugar)
Kitchen costD = 62.16 + 92.65 + 1.84 (sugar)
Living Room costD = 62.16 + 58.37 + 1.84 (sugar)

Example 3: Mixing Continuous and Categorical Predictors

  • Defining our predicted values, we will need one for each location (because they are defining our lines) and for each nephew (because they are defining our graphs).

\begin{align*} \hat{\text{cost}} = 62.16 &+ 54.77 \text{ g} + 92.65 \text{ k} + 58.37 \text{ lr} - 0.90 \text{ H} + 198.79 \text{ L} + 1.84 \text{ sug.} \end{align*}

  • For Louie,
Location Equation for Predicted Cost
Backyard costL = 62.16 + 198.79 + 1.84 (sugar)
Garage costL = 62.16 + 54.77 + 198.79 + 1.84 (sugar)
Kitchen costL = 62.16 + 92.65 + 198.79 + 1.84 (sugar)
Living Room costL = 62.16 + 58.37 + 198.79 + 1.84 (sugar)

Example 3: Mixing Continuous and Categorical Predictors

  • Yes… unfortunately this is a lot of equations \to a lot of coding \to why we start with a draft.

  • But, for fun, let’s get detailed. First, our predicted values.

duck_incidents <- duck_incidents %>%
  mutate(pred_H_bk = 62.16 - 0.90 + 1.84 * sugar_grams,
         pred_H_gar = 62.16 + 54.77 - 0.90 + 1.84 * sugar_grams,
         pred_H_kit = 62.16 + 92.65 - 0.90 + 1.84 * sugar_grams,
         pred_H_lr = 62.16 + 58.37 - 0.90 + 1.84 * sugar_grams,
         pred_D_bk = 62.16 + 1.84 * sugar_grams,
         pred_D_gar = 62.16 + 54.77 + 1.84 * sugar_grams,
         pred_D_kit = 62.16 + 92.65 + 1.84 * sugar_grams,
         pred_D_lr = 62.16 + 58.37 + 1.84 * sugar_grams,
         pred_L_bk = 62.16 + 198.79 + 1.84 * sugar_grams,
         pred_L_gar = 62.16 + 54.77 + 198.79 + 1.84 * sugar_grams,
         pred_L_kit = 62.16 + 92.65 + 198.79 + 1.84 * sugar_grams,
         pred_L_lr = 62.16 + 58.37 + 198.79 + 1.84 * sugar_grams)
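
  • As an alternative to hand-coding the rounded coefficients, we could let predict() compute the same fitted values directly from m5. One possible sketch, using tidyr's expand_grid() to build a prediction grid:

```r
# Alternative: generate predicted values from the fitted model itself
pred_grid <- expand_grid(
  location = c("Backyard", "Garage", "Kitchen", "Living Room"),
  nephew = c("Huey", "Dewey", "Louie"),
  sugar_grams = seq(min(duck_incidents$sugar_grams),
                    max(duck_incidents$sugar_grams),
                    length.out = 100)
)

pred_grid <- pred_grid %>%
  mutate(pred_cost = predict(m5, newdata = pred_grid))
```

  • This avoids rounding error and updates automatically if the model changes.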

Example 3: Mixing Continuous and Categorical Predictors

  • Now, graphs for each nephew.
g_H <- duck_incidents %>%
  ggplot(aes(x = sugar_grams, y = damage_cost)) +
  geom_point(data = filter(duck_incidents, nephew == "Huey")) +
  geom_line(aes(y = pred_H_bk, color = "Backyard")) +
  geom_line(aes(y = pred_H_gar, color = "Garage")) +
  geom_line(aes(y = pred_H_kit, color = "Kitchen")) +
  geom_line(aes(y = pred_H_lr, color = "Living Room")) +
  labs(x = "Sugar (grams)",
       y = "Predicted Damage Cost",
       title = "Huey") +
  scale_color_discrete(name = "Location") +
  theme_bw()

Example 3: Mixing Continuous and Categorical Predictors

  • Now, graphs for each nephew.
g_D <- duck_incidents %>%
  ggplot(aes(x = sugar_grams, y = damage_cost)) +
  geom_point(data = filter(duck_incidents, nephew == "Dewey")) +
  geom_line(aes(y = pred_D_bk, color = "Backyard")) +
  geom_line(aes(y = pred_D_gar, color = "Garage")) +
  geom_line(aes(y = pred_D_kit, color = "Kitchen")) +
  geom_line(aes(y = pred_D_lr, color = "Living Room")) +
  labs(x = "Sugar (grams)",
       y = "Predicted Damage Cost",
       title = "Dewey") +
  scale_color_discrete(name = "Location") +
  theme_bw()

Example 3: Mixing Continuous and Categorical Predictors

  • Now, graphs for each nephew.
g_L <- duck_incidents %>%
  ggplot(aes(x = sugar_grams, y = damage_cost)) +
  geom_point(data = filter(duck_incidents, nephew == "Louie")) +
  geom_line(aes(y = pred_L_bk, color = "Backyard")) +
  geom_line(aes(y = pred_L_gar, color = "Garage")) +
  geom_line(aes(y = pred_L_kit, color = "Kitchen")) +
  geom_line(aes(y = pred_L_lr, color = "Living Room")) +
  labs(x = "Sugar (grams)",
       y = "Predicted Damage Cost",
       title = "Louie") +
  scale_color_discrete(name = "Location") +
  theme_bw()

Example 3: Mixing Continuous and Categorical Predictors

  • Finally, we can display the graphs together.
[Figure: scatterplots with regression lines overlaid for each nephew and location]
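  • One way to arrange the three nephew panels side by side is with the patchwork package (an assumption here; other packages such as gridExtra also work):

```r
# Combine the three nephew graphs into one display with a shared legend
library(patchwork)
g_H + g_D + g_L + plot_layout(ncol = 3, guides = "collect")
```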

Example 3: Mixing Continuous and Categorical Predictors

  • Making some edits (see .qmd file),
[Figure: scatterplots with regression lines overlaid for each nephew and location]

Example 3: Mixing Continuous and Categorical Predictors

  • We could also create a graph for each location with lines defined by nephew. For example,
g_kit <- duck_incidents %>%
  ggplot(aes(x = sugar_grams, y = damage_cost)) +
  geom_point(data = filter(duck_incidents, location == "Kitchen")) +
  geom_line(aes(y = pred_H_kit, color = "Huey")) +
  geom_line(aes(y = pred_D_kit, color = "Dewey")) +
  geom_line(aes(y = pred_L_kit, color = "Louie")) +
  labs(x = "",
       y = "Predicted Damage Cost",
       title = "Kitchen") +
  ylim(0, 2750) +
  scale_color_discrete(name = "Nephew") +
  theme_bw() + 
  theme(legend.position = "none")

Example 3: Mixing Continuous and Categorical Predictors

  • Similar to our last example,
[Figure: scatterplots with regression lines overlaid for each nephew and location]

Example 3: Mixing Continuous and Categorical Predictors

  • Wait… there are three nephews…
[Figure: scatterplots with regression lines overlaid for each nephew and location]

Example 3: Mixing Continuous and Categorical Predictors

  • Let’s take a peek at the data…
duck_incidents %>%
  select(pred_H_kit, pred_D_kit, pred_L_kit) %>%
  head()
  • Ah… Huey and Dewey are similar while Louie causes much more damage than the other two.

Using Graphs to Explain Model Results

  • When presenting model results, graphs can help demonstrate what your math is showing.

    • e.g., Louie causes much more damage than the other two nephews, especially as sugar intake increases. It is also difficult to predict the amount of damage Louie will cause because of the large variability in his damage costs.
  • We can also use graphs to help us answer questions about the data.

    • e.g., Is there a location that has higher damage costs than the others?

    • e.g., Is there a nephew that causes significantly more damage than the others?

Using Graphs to Explain Model Results

  • Recall the analysis from last lecture,
m5 <- glm(damage_cost ~ location + nephew + sugar_grams,
          family = "gaussian",
          data = duck_incidents)
tidy(m5, conf.int = TRUE)

Using Graphs to Explain Model Results

  • Recall the analysis from last lecture,
car::Anova(m5, type = 3, test = "F")

Using Graphs to Explain Model Results

  • Revisiting the graph,
[Figure: scatterplots with regression lines overlaid for each nephew and location]

Using Graphs to Explain Model Results

  • When we initially graphed this, we saw that Huey and Dewey caused similar amounts of damage while Louie caused much more damage than the other two.

  • We can formally test the pairwise comparisons and quantify the differences,

pairs(emmeans(m5, ~ nephew), adjust = "tukey")
 contrast      estimate   SE  df t.ratio p.value
 Dewey - Huey     0.899 33.8 443   0.027  0.9996
 Dewey - Louie -198.792 33.5 443  -5.939 <0.0001
 Huey - Louie  -199.691 34.4 443  -5.805 <0.0001

Results are averaged over the levels of: location 
P value adjustment: tukey method for comparing a family of 3 estimates 
  • Note that an alternative to this method would be to change the reference category and re-run the model multiple times.

  • Remember, we need to adjust \alpha for multiple comparisons!
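  • The change-the-reference-category alternative mentioned above might look like the following sketch (remember that with this approach we must adjust \alpha by hand):

```r
# Alternative: make Louie the reference category and refit the model;
# the Huey and Dewey coefficients are then comparisons against Louie
duck_incidents <- duck_incidents %>%
  mutate(nephew = relevel(factor(nephew), ref = "Louie"))

m5_relevel <- glm(damage_cost ~ location + nephew + sugar_grams,
                  data = duck_incidents)
tidy(m5_relevel, conf.int = TRUE)
```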

Using Graphs to Explain Model Results

[Figure: scatterplots with regression lines overlaid for each nephew and location]
  • Our graph visualizes the differences we see in the pairwise comparisons.

    • Louie causes significantly more damage than Huey and Dewey.

    • Huey and Dewey cause similar amounts of damage.

      • They are so similar that their regression lines appear overlaid.

Wrap Up

  • In this lecture, we discussed how to visualize models with categorical predictors.

    • What we have previously discussed about visualizing models with continuous predictors still applies.
  • The examples I am presenting are examples – not the one and only way to visualize these models.

  • Always consider the context of the problem and what comparisons are most relevant when deciding how to visualize your model.

    • Who is your audience?
    • What level of detail do they need?
  • Next week, we will discuss complicating our models further with interaction terms.