Visualization of the Model:
Categorical Predictors

Introduction

  • Last week, we discussed how to visualize the resulting model when there are only continuous predictors.

  • Categorical predictors change our graphing approach because they are inherently non-numeric, while our graph axes are entirely numeric.

  • Thus, we are adding categorical predictors today to get an idea of how to handle them.

    • We will cover models with only categorical predictors and models with both categorical and continuous predictors.

Lecture Example Set Up

  • Recall our example dataset of duck-related incidents,
duck_incidents <- read_csv("https://raw.githubusercontent.com/samanthaseals/SDSII/refs/heads/main/files/data/lectures/W2_duck_incidents.csv") %>%
  mutate(loc_yard = if_else(location == "Backyard", 1, 0),
         loc_garage = if_else(location == "Garage", 1, 0),
         loc_kitchen = if_else(location == "Kitchen", 1, 0),
         loc_living = if_else(location == "Living Room", 1, 0))
duck_incidents %>% head()

Example 1: One Categorical Predictor

  • Recall our example from the last lectures,
m1 <- glm(damage_cost ~ location, 
          data = duck_incidents)
tidy(m1, conf.int = TRUE)

Example 1: One Categorical Predictor

  • This model has only one categorical predictor (location), whose levels have no inherent order.

\hat{\text{cost}} = 202.78 + 64.34 \text{ garage} + 131.85 \text{ kitchen} + 58.70 \text{ living}

  • Because this is effectively a simple linear regression (one predictor with 4 levels - location), the way we visualized the model before does not make sense.

    • Why? We do not have a variable to put on the x-axis.
  • Instead, we will use a different type of plot to visualize the results.

    • In this case, I would graph the predicted damage cost for each location and include error bars for the confidence intervals.
  • First, we need to find the marginal means and the simultaneous confidence intervals.

Side Note: Marginal Means & Simultaneous CI

  • What are marginal means and simultaneous confidence intervals?

  • Marginal means are the predicted means for each level of a categorical predictor, averaging (or “marginalizing”) over the other predictors in the model.

  • Simultaneous confidence intervals are confidence intervals that account for multiple comparisons, ensuring that the overall confidence level is maintained across all intervals.

    • That is, we know that together, the confidence intervals have an overall 95% confidence level.
  • This means that the resulting CI will depend on the number of comparisons we are performing.

    • When using the Bonferroni correction, we divide \alpha by only the number of comparisons we are interested in.

    • The simultaneous CI accounts for all possible comparisons.
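  • To see the arithmetic behind these adjustments, here is a small sketch (assuming an overall \alpha = 0.05 and, as an example, all choose(4, 2) = 6 pairwise comparisons among the four locations):

```r
# Compare Bonferroni and Sidak per-comparison alpha levels
k <- 6        # example: all pairwise comparisons among 4 locations, choose(4, 2)
alpha <- 0.05

alpha_bonferroni <- alpha / k            # 0.05 / 6 ~ 0.00833
alpha_sidak <- 1 - (1 - alpha)^(1 / k)   # ~ 0.00851, slightly less conservative

alpha_bonferroni
alpha_sidak
```

  • Note that the Sidak per-comparison level is slightly larger (less conservative) than Bonferroni's, while still guaranteeing the overall 95% level under independence.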

Side Note: Marginal Means & Simultaneous CI (R)

  • In R, we will use the emmeans() function from the emmeans package to calculate the marginal means.
emm <- emmeans(model_name, ~ categorical_variable) 
  • Then, we use the confint() function to calculate the simultaneous confidence intervals.
emm_ci <- confint(emm, adjust = "sidak") 
  • Finally, we can create the dataset for graphing.
conf_data <- as_tibble(emm_ci)

Example 1: One Categorical Predictor

  • In our example,
emm <- emmeans(m1, ~ location) 
emm_ci <- confint(emm, adjust = "sidak") 
graph <- as_tibble(emm_ci)
  • Looking at the resulting data,
Location Average Cost LCL UCL
Backyard 202.7798 131.7132 273.8465
Garage 267.1216 189.7416 344.5017
Kitchen 334.6267 260.6045 408.6489
Living Room 261.4766 195.1439 327.8093

Graphing Means and Simultaneous CIs

  • Now that our dataset has been created, we can construct the graph.
graph %>% ggplot(aes(x = categorical_variable, y = emmean)) +
    geom_point() +
    geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
    labs(x = "Categorical Label",
         y = "Outcome Label") +
    theme_bw()

Example 1: One Categorical Predictor

  • In our example,
graph %>% ggplot(aes(x = location, y = emmean)) +
    geom_point() +
    geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
    labs(x = "Location",
         y = "Average Damage Cost") +
    theme_bw()

Example 3: Mixing Continuous and Categorical Predictors

  • Recall the model examining damage cost as a function of location, nephew, and sugar_grams.
m5 <- glm(damage_cost ~ location + nephew + sugar_grams,
          data = duck_incidents)
tidy(m5, conf.int = TRUE)

Visualizing Models with Mixed Predictor Types

  • Now we have a model with one continuous predictor and two categorical predictors.

  • This means that we can return to our previous method of visualizing the model.

    • We will graph the predicted damage cost (y) as a function of sugar grams (x); our predicted values will depend on location and nephew.
  • How do we include categorical predictors in this graph? There are two approaches:

    • Look at individual graphs for different categories.
    • Create different lines for specific categories.
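  • These two approaches map directly onto ggplot2 features: facets give individual graphs per category, while the color aesthetic gives different lines. A minimal sketch (assuming a column `pred` of model-predicted costs has already been added to duck_incidents):

```r
# Sketch: one panel per nephew, lines colored by location
duck_incidents %>%
  ggplot(aes(x = sugar_grams)) +
  geom_point(aes(y = damage_cost)) +
  geom_line(aes(y = pred, color = location)) +
  facet_wrap(~ nephew) +
  theme_bw()
```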

Example 3: Mixing Continuous and Categorical Predictors

  • Our current model has both categorical and continuous predictors,

\begin{align*} \hat{\text{cost}} = 62.16 &+ 54.77 \text{ garage} + 92.65 \text{ kitchen} + 58.37 \text{ living} \\ & - 0.90 \text{ Huey} + 198.79 \text{ Louie} \\ & + 1.84 \text{ sugar} \end{align*}

  • In our example,

    • We could create four graphs: one for each location with the lines defined by nephew.
    • We could create three graphs: one for each nephew with the lines defined by location.
    • We could create a single graph with lines defined by both location and nephew.
    • etc.

Visualizing Models with Mixed Predictor Types

  • Wait! There are multiple approaches!

  • Which approach is appropriate?

    • It depends on the context of the problem and what comparisons are most relevant.
    • Note: I do not put a ton of effort into graphs when drafting them for collaborators. I will draft something basic for demonstration purposes and discussion.
      • e.g., maybe a graph for a single nephew or a single location, then ask which is preferred before I draft the graphs for the additional nephews or locations.
  • For class credit: I am looking for competency in graphing. I am not looking for perfect graphs. Remember that we are practicing and cannot talk to our (hypothetical) collaborators in our projects.

Creating Predicted Values with Categorical Predictors

  • Recall how we defined our indicator variables for nephew,
Nephew x_{\text{H}} x_{\text{D}} x_{\text{L}}
Huey 1 0 0
Dewey 0 1 0
Louie 0 0 1
  • That means, to create predicted values for specific nephews, we need to set the indicator variables accordingly.

    • e.g., for Huey, set x_{\text{H}} = 1, x_{\text{D}} = 0, and x_{\text{L}} = 0.
    • e.g., for Dewey, set x_{\text{H}} = 0, x_{\text{D}} = 1, and x_{\text{L}} = 0.
    • e.g., for Louie, set x_{\text{H}} = 0, x_{\text{D}} = 0, and x_{\text{L}} = 1.
  • Remember, we only plug in for two of the nephews in the model (the third is the reference category).
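  • We can peek at the indicator (dummy) coding R builds behind the scenes; the reference category is the level whose indicators are all 0 (by default, the first level alphabetically):

```r
# Inspect the design matrix R constructs for the nephew term
model.matrix(~ nephew, data = duck_incidents) %>% head()
```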

Creating Predicted Values with Categorical Predictors

  • We will take the same approach for location.
Location x_{\text{B}} x_{\text{G}} x_{\text{K}} x_{\text{L}}
Backyard 1 0 0 0
Garage 0 1 0 0
Kitchen 0 0 1 0
Living Room 0 0 0 1
  • Then, the indicators will be plugged in as follows,

    • e.g., for backyard, set x_{\text{B}} = 1, x_{\text{G}} = 0, x_{\text{K}} = 0, and x_{\text{L}} = 0.
    • e.g., for garage, set x_{\text{B}} = 0, x_{\text{G}} = 1, x_{\text{K}} = 0, and x_{\text{L}} = 0.
    • e.g., for kitchen, set x_{\text{B}} = 0, x_{\text{G}} = 0, x_{\text{K}} = 1, and x_{\text{L}} = 0.
    • e.g., for living room, set x_{\text{B}} = 0, x_{\text{G}} = 0, x_{\text{K}} = 0, and x_{\text{L}} = 1.

Example 3: Mixing Continuous and Categorical Predictors

  • Back to the current model,

\begin{align*} \hat{\text{cost}} = 62.16 &+ 54.77 \text{ garage} + 92.65 \text{ kitchen} + 58.37 \text{ living} \\ & - 0.90 \text{ Huey} + 198.79 \text{ Louie} \\ & + 1.84 \text{ sugar} \end{align*}

  • Let’s first create what I would call the “draft” for my collaborator, then we will create the full graph.

  • We will create the graphs for the nephews and let the lines define the location.

  • That means sugar will vary on the x-axis.

Example 3: Mixing Continuous and Categorical Predictors

  • Defining our predicted values, we will need one for each location (because they are defining our lines) and for each nephew (because they are defining our graphs).

\begin{align*} \hat{\text{cost}} = 62.16 &+ 54.77 \text{ g} + 92.65 \text{ k} + 58.37 \text{ lr} - 0.90 \text{ H} + 198.79 \text{ L} + 1.84 \text{ sug.} \end{align*}

  • For Huey,
Location Equation for Predicted Cost
Backyard costH = 62.16 - 0.90 + 1.84 (sugar)
Garage costH = 62.16 + 54.77 - 0.90 + 1.84 (sugar)
Kitchen costH = 62.16 + 92.65 - 0.90 + 1.84 (sugar)
Living Room costH = 62.16 + 58.37 - 0.90 + 1.84 (sugar)

Example 3: Mixing Continuous and Categorical Predictors

  • Defining our predicted values, we will need one for each location (because they are defining our lines) and for each nephew (because they are defining our graphs).

\begin{align*} \hat{\text{cost}} = 62.16 &+ 54.77 \text{ g} + 92.65 \text{ k} + 58.37 \text{ lr} - 0.90 \text{ H} + 198.79 \text{ L} + 1.84 \text{ sug.} \end{align*}

  • For Dewey,
Location Equation for Predicted Cost
Backyard costD = 62.16 + 1.84 (sugar)
Garage costD = 62.16 + 54.77 + 1.84 (sugar)
Kitchen costD = 62.16 + 92.65 + 1.84 (sugar)
Living Room costD = 62.16 + 58.37 + 1.84 (sugar)

Example 3: Mixing Continuous and Categorical Predictors

  • Defining our predicted values, we will need one for each location (because they are defining our lines) and for each nephew (because they are defining our graphs).

\begin{align*} \hat{\text{cost}} = 62.16 &+ 54.77 \text{ g} + 92.65 \text{ k} + 58.37 \text{ lr} - 0.90 \text{ H} + 198.79 \text{ L} + 1.84 \text{ sug.} \end{align*}

  • For Louie,
Location Equation for Predicted Cost
Backyard costL = 62.16 + 198.79 + 1.84 (sugar)
Garage costL = 62.16 + 54.77 + 198.79 + 1.84 (sugar)
Kitchen costL = 62.16 + 92.65 + 198.79 + 1.84 (sugar)
Living Room costL = 62.16 + 58.37 + 198.79 + 1.84 (sugar)

Example 3: Mixing Continuous and Categorical Predictors

  • Yes… unfortunately this is a lot of equations \to a lot of coding \to why we start with a draft.

  • But, for fun, let’s get detailed. First, our predicted values.

duck_incidents <- duck_incidents %>%
  mutate(pred_H_bk = 62.16 - 0.90 + 1.84 * sugar_grams,
         pred_H_gar = 62.16 + 54.77 - 0.90 + 1.84 * sugar_grams,
         pred_H_kit = 62.16 + 92.65 - 0.90 + 1.84 * sugar_grams,
         pred_H_lr = 62.16 + 58.37 - 0.90 + 1.84 * sugar_grams,
         pred_D_bk = 62.16 + 1.84 * sugar_grams,
         pred_D_gar = 62.16 + 54.77 + 1.84 * sugar_grams,
         pred_D_kit = 62.16 + 92.65 + 1.84 * sugar_grams,
         pred_D_lr = 62.16 + 58.37 + 1.84 * sugar_grams,
         pred_L_bk = 62.16 + 198.79 + 1.84 * sugar_grams,
         pred_L_gar = 62.16 + 54.77 + 198.79 + 1.84 * sugar_grams,
         pred_L_kit = 62.16 + 92.65 + 198.79 + 1.84 * sugar_grams,
         pred_L_lr = 62.16 + 58.37 + 198.79 + 1.84 * sugar_grams)
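
  • As an alternative to hand-coding the rounded coefficients, we could let predict() compute the same fitted values directly from m5. One possible sketch, using tidyr's expand_grid() to build a prediction grid:

```r
# Alternative: generate predicted values from the fitted model itself
pred_grid <- expand_grid(
  location = c("Backyard", "Garage", "Kitchen", "Living Room"),
  nephew = c("Huey", "Dewey", "Louie"),
  sugar_grams = seq(min(duck_incidents$sugar_grams),
                    max(duck_incidents$sugar_grams),
                    length.out = 100)
)

pred_grid <- pred_grid %>%
  mutate(pred_cost = predict(m5, newdata = pred_grid))
```

  • This avoids rounding error and updates automatically if the model changes.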

Example 3: Mixing Continuous and Categorical Predictors

  • Now, graphs for each nephew.
g_H <- duck_incidents %>%
  ggplot(aes(x = sugar_grams, y = damage_cost)) +
  geom_point(data = filter(duck_incidents, nephew == "Huey")) +
  geom_line(aes(y = pred_H_bk, color = "Backyard")) +
  geom_line(aes(y = pred_H_gar, color = "Garage")) +
  geom_line(aes(y = pred_H_kit, color = "Kitchen")) +
  geom_line(aes(y = pred_H_lr, color = "Living Room")) +
  labs(x = "Sugar (grams)",
       y = "Predicted Damage Cost",
       title = "Huey") +
  scale_color_discrete(name = "Location") +
  theme_bw()

Example 3: Mixing Continuous and Categorical Predictors

  • Now, graphs for each nephew.
g_D <- duck_incidents %>%
  ggplot(aes(x = sugar_grams, y = damage_cost)) +
  geom_point(data = filter(duck_incidents, nephew == "Dewey")) +
  geom_line(aes(y = pred_D_bk, color = "Backyard")) +
  geom_line(aes(y = pred_D_gar, color = "Garage")) +
  geom_line(aes(y = pred_D_kit, color = "Kitchen")) +
  geom_line(aes(y = pred_D_lr, color = "Living Room")) +
  labs(x = "Sugar (grams)",
       y = "Predicted Damage Cost",
       title = "Dewey") +
  scale_color_discrete(name = "Location") +
  theme_bw()

Example 3: Mixing Continuous and Categorical Predictors

  • Now, graphs for each nephew.
g_L <- duck_incidents %>%
  ggplot(aes(x = sugar_grams, y = damage_cost)) +
  geom_point(data = filter(duck_incidents, nephew == "Louie")) +
  geom_line(aes(y = pred_L_bk, color = "Backyard")) +
  geom_line(aes(y = pred_L_gar, color = "Garage")) +
  geom_line(aes(y = pred_L_kit, color = "Kitchen")) +
  geom_line(aes(y = pred_L_lr, color = "Living Room")) +
  labs(x = "Sugar (grams)",
       y = "Predicted Damage Cost",
       title = "Louie") +
  scale_color_discrete(name = "Location") +
  theme_bw()

Example 3: Mixing Continuous and Categorical Predictors

  • Finally, we can display the graphs together.
[Figure: scatterplots with regression lines overlaid for each nephew and location]
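  • One way to arrange the three nephew panels side by side is with the patchwork package (an assumption here; other packages such as gridExtra also work):

```r
# Combine the three nephew graphs into one display with a shared legend
library(patchwork)
g_H + g_D + g_L + plot_layout(ncol = 3, guides = "collect")
```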

Example 3: Mixing Continuous and Categorical Predictors

  • Making some edits (see .qmd file),
[Figure: scatterplots with regression lines overlaid for each nephew and location]

Example 3: Mixing Continuous and Categorical Predictors

  • We could also create a graph for each location with lines defined by nephew. For example,
g_kit <- duck_incidents %>%
  ggplot(aes(x = sugar_grams, y = damage_cost)) +
  geom_point(data = filter(duck_incidents, location == "Kitchen")) +
  geom_line(aes(y = pred_H_kit, color = "Huey")) +
  geom_line(aes(y = pred_D_kit, color = "Dewey")) +
  geom_line(aes(y = pred_L_kit, color = "Louie")) +
  labs(x = "",
       y = "Predicted Damage Cost",
       title = "Kitchen") +
  ylim(0, 2750) +
  scale_color_discrete(name = "Nephew") +
  theme_bw() + 
  theme(legend.position = "none")

Example 3: Mixing Continuous and Categorical Predictors

  • Similar to our last example,
[Figure: scatterplots with regression lines overlaid for each nephew and location]

Example 3: Mixing Continuous and Categorical Predictors

  • Wait… there are three nephews…
[Figure: scatterplots with regression lines overlaid for each nephew and location]

Example 3: Mixing Continuous and Categorical Predictors

  • Let’s take a peek at the data…
duck_incidents %>%
  select(pred_H_kit, pred_D_kit, pred_L_kit) %>%
  head()
  • Ah… Huey and Dewey are similar while Louie causes much more damage than the other two.

Using Graphs to Explain Model Results

  • When presenting model results, graphs can help demonstrate what your math is showing.

    • e.g., Louie causes much more damage than the other two nephews, especially as sugar intake increases. It is also difficult to predict the amount of damage Louie will cause because of the large variability in his damage costs.
  • We can also use graphs to help us answer questions about the data.

    • e.g., Is there a location that has higher damage costs than the others?

    • e.g., Is there a nephew that causes significantly more damage than the others?

Using Graphs to Explain Model Results

  • Recall the analysis from last lecture,
m5 <- glm(damage_cost ~ location + nephew + sugar_grams,
          family = "gaussian",
          data = duck_incidents)
tidy(m5, conf.int = TRUE)

Using Graphs to Explain Model Results

  • Recall the analysis from last lecture,
car::Anova(m5, type = 3, test = "F")

Using Graphs to Explain Model Results

  • Revisiting the graph,
[Figure: scatterplots with regression lines overlaid for each nephew and location]

Using Graphs to Explain Model Results

  • When we initially graphed this, we saw that Huey and Dewey caused similar amounts of damage while Louie caused much more damage than the other two.

  • We can formally test the pairwise comparisons and quantify the differences,

pairs(emmeans(m5, ~ nephew), adjust = "tukey")
 contrast      estimate   SE  df t.ratio p.value
 Dewey - Huey     0.899 33.8 443   0.027  0.9996
 Dewey - Louie -198.792 33.5 443  -5.939 <0.0001
 Huey - Louie  -199.691 34.4 443  -5.805 <0.0001

Results are averaged over the levels of: location 
P value adjustment: tukey method for comparing a family of 3 estimates 
  • Note that an alternative to this method would be to change the reference category and re-run the model multiple times.

  • Remember, we need to adjust \alpha for multiple comparisons!
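  • The change-the-reference-category alternative mentioned above might look like the following sketch (remember that with this approach we must adjust \alpha by hand):

```r
# Alternative: make Louie the reference category and refit the model;
# the Huey and Dewey coefficients are then comparisons against Louie
duck_incidents <- duck_incidents %>%
  mutate(nephew = relevel(factor(nephew), ref = "Louie"))

m5_relevel <- glm(damage_cost ~ location + nephew + sugar_grams,
                  data = duck_incidents)
tidy(m5_relevel, conf.int = TRUE)
```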

Using Graphs to Explain Model Results

[Figure: scatterplots with regression lines overlaid for each nephew and location]
  • Our graph visualizes the differences we see in the pairwise comparisons.

    • Louie causes significantly more damage than Huey and Dewey.

    • Huey and Dewey cause similar amounts of damage.

      • They are so similar that their regression lines appear overlaid.

Wrap Up

  • In this lecture, we discussed how to visualize models with categorical predictors.

    • What we have previously discussed about visualizing models with continuous predictors still applies.
  • The examples I am presenting are examples – not the one and only way to visualize these models.

  • Always consider the context of the problem and what comparisons are most relevant when deciding how to visualize your model.

    • Who is your audience?
    • What level of detail do they need?
  • Next week, we will discuss complicating our models further with interaction terms.