y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon
In the last lecture, we began learning about interactions and focused on continuous \times continuous interactions.
In this lecture, we continue learning about interactions and expand to categorical \times categorical interactions.
Recall that if a categorical predictor with c classes is included in the model, we will include c-1 terms to represent it.
This holds true for interactions!
Note that a special (and easy!) case is when our categorical variable is binary: c-1 = 1.
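To see the c-1 rule in action, here is a minimal sketch using a made-up factor. The data frame `toy` and its single column are assumptions for illustration, not the lecture data.

```r
# Hypothetical toy factor with c = 4 classes (not the lecture data)
toy <- data.frame(
  location = factor(c("Backyard", "Garage", "Kitchen", "Living Room"))
)

# model.matrix() expands the factor into an intercept plus c - 1 indicator columns;
# the first level (Backyard) is the reference and gets no column of its own
X <- model.matrix(~ location, data = toy)
colnames(X)

# Number of terms representing location (everything except the intercept)
ncol(X) - 1
```

For a binary variable, the same call would produce a single indicator column.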
Consider factor A, with a=3 levels, and factor B, with b=4 levels.
Like before, we have the option of using either the : operator or the * operator.
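However, which operator we can use depends on whether we are using indicator variables or a factor variable!
With indicator variables, we must use :.
With factor variables, we can use * or :.
If we try to use the * operator with indicators, we will end up with invalid terms (x_{\text{cat}_i} \times x_{\text{cat}_j}) in our model, that is, products of two indicators that belong to the same categorical variable.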
Keep in mind that we are only talking about the very simple case here – two categorical variables interacting with each other.
More complicated models make the * operator harder to use.
Suppose we have a categorical variable A with a=3 levels and a categorical variable B with b=4 levels and would like to look at the model with A, B, and A \times B.
Regardless of the method (indicator variables vs. factor variables), we will have 11 terms in the model: 2 for A, 3 for B, and 6 for A \times B.
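The count of 11 follows from applying the c-1 rule to each piece of the model (the intercept is not counted):

\begin{align*}
\underbrace{(a-1)}_{A} + \underbrace{(b-1)}_{B} + \underbrace{(a-1)(b-1)}_{A \times B}
  &= (3-1) + (4-1) + (3-1)(4-1) \\
  &= 2 + 3 + 6 = 11
\end{align*}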
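Continuing our hypothetical example,
If we have indicators, we write out the five main-effect indicators and each of the six cross-products with :.
If we have factors, we write A + B + A:B, or equivalently A*B.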
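As a rough sketch of the two approaches, the code below fits the same model both ways on simulated data. The data frame `toy`, the level names a1 to a3 and b1 to b4, and the indicator column names are all assumptions made up for this illustration.

```r
# Hypothetical balanced toy data: A has 3 levels, B has 4 levels (not the lecture data)
set.seed(1)
toy <- expand.grid(A = c("a1", "a2", "a3"),
                   B = c("b1", "b2", "b3", "b4"),
                   rep = 1:5)
toy$y <- rnorm(nrow(toy))

# Factor version: R expands A, B, and A:B into indicators for us
# (y ~ A * B is shorthand for the same model)
m_factor <- glm(y ~ A + B + A:B, family = "gaussian", data = toy)

# Indicator version: build the c - 1 indicators by hand (a1 and b1 are the references)
toy$a2 <- as.numeric(toy$A == "a2")
toy$a3 <- as.numeric(toy$A == "a3")
toy$b2 <- as.numeric(toy$B == "b2")
toy$b3 <- as.numeric(toy$B == "b3")
toy$b4 <- as.numeric(toy$B == "b4")

# Every cross-product is written out with `:`; there are no a2:a3 or b2:b3 terms
m_indicator <- glm(y ~ a2 + a3 + b2 + b3 + b4 +
                     a2:b2 + a2:b3 + a2:b4 +
                     a3:b2 + a3:b3 + a3:b4,
                   family = "gaussian", data = toy)

# Number of terms in each model, excluding the intercept
length(coef(m_factor)) - 1
length(coef(m_indicator)) - 1

# The two parameterizations describe the same fitted model
all.equal(unname(fitted(m_factor)), unname(fitted(m_indicator)))
```

Writing (a2 + a3) * (b2 + b3 + b4) instead would also request a2:a3, b2:b3, and similar products, which is why : is the safer choice with indicators.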
We have access to the following from the reports:
Let’s consider modeling the damage cost as a function of nephew and location.
m4 <- glm(damage_cost ~ nephew + location + nephew:location,
          family = "gaussian",
          data = duck_incidents)
m4 %>% coefficients()
                    (Intercept)                      nephewHuey
                    225.2708571                     -69.3045781
                    nephewLouie                  locationGarage
                     10.6386023                     -12.1471905
                locationKitchen             locationLiving Room
                    -43.2614632                     -44.7758571
      nephewHuey:locationGarage      nephewLouie:locationGarage
                      0.1549114                     186.9909743
     nephewHuey:locationKitchen     nephewLouie:locationKitchen
                    157.5751841                     303.9966549
 nephewHuey:locationLiving Room nephewLouie:locationLiving Room
                     71.8660897                     275.2918031
As with continuous interactions, the interpretation of the main effects changes when an interaction is included in the model.
For example, in our model with nephew and location, each interaction coefficient represents how much the difference in damage cost between that nephew and the reference nephew (Dewey) changes in that location, as compared to the reference location (Backyard).
The difference in damage cost between Louie and Dewey is $186.99 greater in the Garage than it is in the Backyard.
The difference in damage cost between Huey and Dewey is $0.15 greater in the Garage than it is in the Backyard.
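These statements come from adding the interaction coefficient to the corresponding nephew coefficient. Using the estimates above rounded to two decimals, with Dewey and the Backyard as the reference levels:

\begin{align*}
(\text{Louie} - \text{Dewey}) \mid \text{Backyard} &= 10.64 \\
(\text{Louie} - \text{Dewey}) \mid \text{Garage} &= 10.64 + 186.99 = 197.63 \\
(\text{Huey} - \text{Dewey}) \mid \text{Backyard} &= -69.30 \\
(\text{Huey} - \text{Dewey}) \mid \text{Garage} &= -69.30 + 0.15 = -69.15
\end{align*}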
We are not going to harp on this too much – I generally do not provide specific interaction interpretations, but instead, provide stratified interpretations when justified.
When categorical variables are involved in interaction terms, the hypothesis testing requires a partial F test.
Hypotheses
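For the nephew \times location model, the six interaction coefficients are tested together. A sketch of the hypotheses, where the subscripts (H, L for Huey and Louie; G, K, LR for Garage, Kitchen, and Living Room) are shorthand assumed here rather than notation from the lecture:

\begin{align*}
H_0&: \ \beta_{\text{H:G}} = \beta_{\text{L:G}} = \beta_{\text{H:K}} = \beta_{\text{L:K}} = \beta_{\text{H:LR}} = \beta_{\text{L:LR}} = 0 \\
H_1&: \ \text{at least one of these } \beta \text{ is not } 0
\end{align*}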
Test Statistic & p-Value
Obtain these with the car::Anova() approach, car::Anova(m, type = 3), or with anova(), anova(m_reduced, m_full, test = "LRT").
Yes, there is an interaction (p = 0.023).
Let’s now consider modeling damage cost as a function of location, Donald’s presence, and their interaction.
m5 <- glm(damage_cost ~ location + donald_present + location:donald_present,
          family = "gaussian",
          data = duck_incidents)
m5 %>% coefficients()
                          (Intercept)                        locationGarage
                           205.662545                             33.906108
                      locationKitchen                   locationLiving Room
                           218.711971                             74.131572
                    donald_presentYes      locationGarage:donald_presentYes
                            -5.525212                             64.917225
    locationKitchen:donald_presentYes locationLiving Room:donald_presentYes
                          -210.685440                            -45.919757
Yes, the interaction is significant (p = 0.009).
Because the interaction is significant, let’s stratify our model by Donald’s presence.
Overall model,
\begin{align*} \hat{\text{cost}} = 205.66 & \\ & + 33.9 \text{ G} + 218.71 \text{ K} + 74.13 \text{ LR} \\ &-5.53 \text{ Donald} \\ &+ 64.92 \text{ G:Donald} - 210.69 \text{ K:Donald} - 45.92 \text{ LR:Donald} \end{align*}
To create the model for when Donald is present, we plug in Donald = 1.
To create the model for when Donald is not present, we plug in Donald = 0.
\begin{align*} \hat{\text{cost}} = 205.66 & \\ & + 33.9 \text{ G} + 218.71 \text{ K} + 74.13 \text{ LR} \\ &-5.53(1) \\ &+ 64.92(1) \text{ G} - 210.69(1) \text{ K} - 45.92(1) \text{ LR} \\ = 200.13 &+ 98.82 \text{ G} + 8.02 \text{ K} + 28.21 \text{ LR} \\ \end{align*}
\begin{align*} \hat{\text{cost}} = 205.66 & \\ & + 33.9 \text{ G} + 218.71 \text{ K} + 74.13 \text{ LR} \\ &-5.53(0) \\ &+ 64.92(0) \text{ G} - 210.69(0) \text{ K} - 45.92(0) \text{ LR} \\ = 205.66 &+ 33.90 \text{ G} + 218.71 \text{ K} + 74.13 \text{ LR} \\ \end{align*}
\begin{align*} \hat{\text{cost|Donald present}} &= 200.13 + 98.82 \text{ G} + \ \ \ \ 8.02 \text{ K} + 28.21 \text{ LR} \\ \hat{\text{cost|Donald not present}} &= 205.66 + 33.90 \text{ G} + 218.71 \text{ K} + 74.13 \text{ LR} \\ \end{align*}
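The plug-in steps above can also be sketched in R directly from the coefficient estimates. The helper name `stratify` is hypothetical, introduced only for this illustration. Because the sketch works with the unrounded estimates, its results can differ from the hand calculation in the last digit.

```r
# Coefficient estimates copied from the m5 output
b <- c(
  "(Intercept)"                           =  205.662545,
  "locationGarage"                        =   33.906108,
  "locationKitchen"                       =  218.711971,
  "locationLiving Room"                   =   74.131572,
  "donald_presentYes"                     =   -5.525212,
  "locationGarage:donald_presentYes"      =   64.917225,
  "locationKitchen:donald_presentYes"     = -210.685440,
  "locationLiving Room:donald_presentYes" =  -45.919757
)

# Hypothetical helper: collapse the overall model to one value of Donald (0 or 1)
# by adding donald times each Donald-related coefficient to its partner term
stratify <- function(b, donald) {
  c(
    "Intercept"   = b[["(Intercept)"]] +
      donald * b[["donald_presentYes"]],
    "Garage"      = b[["locationGarage"]] +
      donald * b[["locationGarage:donald_presentYes"]],
    "Kitchen"     = b[["locationKitchen"]] +
      donald * b[["locationKitchen:donald_presentYes"]],
    "Living Room" = b[["locationLiving Room"]] +
      donald * b[["locationLiving Room:donald_presentYes"]]
  )
}

present <- stratify(b, donald = 1)  # Donald present
absent  <- stratify(b, donald = 0)  # Donald not present

round(present, 2)
round(absent, 2)
```

With a fitted model in hand, `b` could instead be taken from coefficients(m5), which avoids retyping the estimates.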
m6 <- glm(damage_cost ~ outside + present + present:outside,
          family = "gaussian",
          data = duck_incidents)
m6 %>% tidy()
Comparing tidy() and car::Anova(),
m7 <- glm(damage_cost ~ outside + donald_present + donald_present:outside,
          family = "gaussian",
          data = duck_incidents)
m7 %>% tidy()
When including categorical \times categorical interactions in the model, we must use partial F tests to determine if the interaction is significant.
If we are using indicator variables, we must take the full/reduced approach with glm() and anova().
If we are using factor variables, we can use either the full/reduced approach with glm() and anova() or test the interaction in one step with car::Anova().
In the case of binary \times binary, we can use the results directly from tidy().
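The summary above can be sketched with base R on simulated data. Everything about `toy` (its column names, levels, and values) is an assumption for illustration and not the real duck_incidents data. The sketch uses coef(summary()) rather than tidy() so that it runs without extra packages; tidy() reports the same statistic for each term.

```r
# Hypothetical simulated data (not the lecture data)
set.seed(2)
toy <- expand.grid(nephew   = c("Dewey", "Huey", "Louie"),
                   location = c("Backyard", "Garage", "Kitchen", "Living Room"),
                   donald   = c("No", "Yes"),
                   rep      = 1:10)
toy$outside <- ifelse(toy$location == "Backyard", "Yes", "No")
toy$damage_cost <- 200 + rnorm(nrow(toy), sd = 50)

# categorical x categorical: full/reduced approach with glm() and anova()
m_full    <- glm(damage_cost ~ nephew + location + nephew:location,
                 family = "gaussian", data = toy)
m_reduced <- glm(damage_cost ~ nephew + location,
                 family = "gaussian", data = toy)
partial_f <- anova(m_reduced, m_full, test = "F")
partial_f  # the second row tests all six interaction terms at once

# binary x binary: the interaction is a single term, so its t test
# carries the same information as the partial F test
m_bin         <- glm(damage_cost ~ outside + donald + outside:donald,
                     family = "gaussian", data = toy)
m_bin_reduced <- glm(damage_cost ~ outside + donald,
                     family = "gaussian", data = toy)

t_value <- coef(summary(m_bin))["outsideYes:donaldYes", "t value"]
f_value <- anova(m_bin_reduced, m_bin, test = "F")$F[2]

# For a one-term interaction, t squared equals the partial F statistic
all.equal(t_value^2, f_value)
```

If the car package is installed and the predictors are factors, car::Anova(m_full, type = 3) is the one-call alternative to fitting the reduced model by hand.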