```r
# load packages for data wrangling and model tidying
library(tidyverse)
library(broom)

pluto <- read_csv("https://raw.githubusercontent.com/samanthaseals/SDSII/refs/heads/main/files/data/lectures/W6_pluto.csv")
pluto %>% head()
```

We have previously discussed continuous outcomes. This week, we will consider count outcomes:
\ln\left( y \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k
This is similar to gamma regression in that both use the \ln() link.
Poisson regression is used specifically for count response variables.
Examples: the number of ER visits in a month, the number of insurance claims filed in a year, the number of squirrels a dog chases per hour.
How is this different than gamma regression?
Gamma regression is used for continuous response variables that are positive and right-skewed.
Poisson regression is used for count response variables that are non-negative integers.
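To make the distinction concrete, here is a minimal sketch with simulated data (the variables `x` and `y` are invented for this demo, not from the Pluto data): Poisson counts whose log-mean is linear in a predictor, fit with `glm()`.

```r
set.seed(1)

# simulate counts whose log-mean is linear in x
n  <- 500
x  <- runif(n, 0, 10)
mu <- exp(0.5 + 0.2 * x)     # true mean on the count scale
y  <- rpois(n, lambda = mu)  # non-negative integer counts

# Poisson regression with the log link approximately recovers the coefficients
fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)
```

Note that `y` is a non-negative integer, unlike the positive, continuous outcomes we modeled with gamma regression.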
Pluto spends his days at Mickey’s park chasing squirrels and interacting with guests. Disney researchers are interested in understanding what factors influence the number of squirrels Pluto chases per hour.
For 300 observation periods, researchers recorded:

- the number of squirrels chased per hour (`squirrels_chased`),
- the temperature (`temperature`),
- the crowd size (`crowd`),
- whether Mickey was present (`mickey_present`), and
- the time of day (`time_of_day`).
We obtain the coefficients with the `glm()` function, specifying the Poisson distribution:

```
      (Intercept)       temperature             crowd temperature:crowd
     0.1857624458      0.0363002481      0.0393306623     -0.0003112087
```
\ln(\hat{y}) = 0.19 + 0.04 \text{ temp} + 0.04 \text{ crowd} - 0.0003 \text{ temp} \times \text{crowd}
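To see what the fitted equation says on the count scale, we can plug in example values, say a temperature of 75 with a crowd of 50, and exponentiate (a quick sketch using the full-precision coefficients from above):

```r
# full-precision coefficients from the fitted model above
b0      <- 0.1857624458
b_temp  <- 0.0363002481
b_crowd <- 0.0393306623
b_int   <- -0.0003112087

temp  <- 75
crowd <- 50

log_yhat <- b0 + b_temp * temp + b_crowd * crowd + b_int * temp * crowd
yhat     <- exp(log_yhat)  # back-transform to the count scale
round(yhat, 1)             # roughly 40.8 squirrels chased per hour
```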
\ln\left( y \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k
We are modeling the log of the count.
\hat{\beta}_i represents an additive effect on the log of the count.
e^{\hat{\beta}_i} represents a multiplicative effect on the count.
Thus, when interpreting the slope for x_i, we will look at the Incident Rate Ratio, e^{\hat{\beta}_i}.
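As a quick illustration, exponentiating the temperature coefficient from above gives its IRR (in practice, `tidy()` from broom can report exponentiated coefficients via its `exponentiate` argument):

```r
# temperature coefficient from the fitted model above
b_temp   <- 0.0363002481
irr_temp <- exp(b_temp)  # multiplicative change in the count per 1-degree increase
round(irr_temp, 3)       # roughly 1.037
```

Keep in mind this one-coefficient IRR ignores the interaction; with an interaction in the model, we interpret conditional slopes instead, as shown next.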
From the tidy() output, we see that we have an interaction in our model.
This means that we have to interpret the slope of each predictor at fixed values of the other predictor.
Let’s look at summary statistics for each:

Temperature:

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  51.00   68.00   74.50   74.42   80.00  100.00
```

Crowd:

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.30   39.23   48.70   49.00   58.75   91.20
```
Let’s interpret crowd when temperature = 65, 75, and 85.
Let’s interpret temperature when crowd = 40, 50, and 60.
| Temperature | Crowd’s Slope | Crowd’s IRR |
|---|---|---|
| 65 | 0.04 + (-0.0003 * 65) = 0.0205 | e^{0.0205} = 1.021 |
| 75 | 0.04 + (-0.0003 * 75) = 0.0175 | e^{0.0175} = 1.017 |
| 85 | 0.04 + (-0.0003 * 85) = 0.0145 | e^{0.0145} = 1.015 |
As temperature increases, crowd’s effect on the number of squirrels chased decreases.
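The conditional slopes above can be computed directly (a sketch using the full-precision coefficients, so the results differ slightly from the rounded table values):

```r
# full-precision coefficients from the fitted model above
b_crowd <- 0.0393306623
b_int   <- -0.0003112087

temps       <- c(65, 75, 85)
crowd_slope <- b_crowd + b_int * temps  # crowd's conditional slope at each temperature
crowd_irr   <- exp(crowd_slope)         # corresponding conditional IRRs
round(data.frame(temperature = temps, slope = crowd_slope, IRR = crowd_irr), 4)
```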
| Crowd | Temperature’s Slope | Temperature’s IRR |
|---|---|---|
| 40 | 0.04 + (-0.0003 * 40) = 0.028 | e^{0.028} = 1.028 |
| 50 | 0.04 + (-0.0003 * 50) = 0.025 | e^{0.025} = 1.025 |
| 60 | 0.04 + (-0.0003 * 60) = 0.022 | e^{0.022} = 1.022 |
As crowd size increases, temperature’s effect on the number of squirrels chased decreases.
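The same computation works in this direction, too: temperature's conditional slope at fixed crowd sizes (again using full-precision coefficients, so the values differ slightly from the rounded table):

```r
# full-precision coefficients from the fitted model above
b_temp <- 0.0363002481
b_int  <- -0.0003112087

crowds     <- c(40, 50, 60)
temp_slope <- b_temp + b_int * crowds  # temperature's conditional slope at each crowd size
temp_irr   <- exp(temp_slope)          # corresponding conditional IRRs
round(data.frame(crowd = crowds, slope = temp_slope, IRR = temp_irr), 4)
```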
Our approach to determining significance remains the same.
For continuous and binary predictors, we can use the Wald test (z-test from tidy()).
For omnibus tests, we can use the likelihood ratio test (LRT).
We can test the interaction with car::Anova(type = 3) or a full/reduced-model LRT:

```r
m1_full <- glm(squirrels_chased ~ temperature + crowd + temperature:crowd,
               data = pluto, family = poisson(link = "log"))
m1_reduced <- glm(squirrels_chased ~ temperature + crowd,
                  data = pluto, family = poisson(link = "log"))
anova(m1_reduced, m1_full, test = "LRT")
```

Yes, the interaction between temperature and crowd is significant (p < 0.001).
Note! The interaction is significant and there are no “plain” main effects in our model. Our formal inference stops here.
Because the three-way interaction (temperature × time of day × Mickey’s presence) is significant, the effect of temperature on the number of squirrels chased depends on both the time of day and whether Mickey is present.
Because the three-way interaction is significant, we will now perform a stratified analysis: stratify by Mickey’s presence and fit a temperature × time of day model within each stratum.
| Term | Mickey Present | Mickey Not Present |
|---|---|---|
| Intercept | 2.37 | 3.09 |
| Temperature | 0.02 (p < 0.001) | 0.009 (p < 0.001) |
| Time of Day (Evening) | -0.41 (p = 0.087) | -1.44 (p < 0.001) |
| Time of Day (Morning) | 0.41 (p = 0.111) | -1.69 (p < 0.001) |
| Temp x Evening | -0.001 (p = 0.661) | 0.01 (p = 0.005) |
| Temp x Morning | -0.01 (p = 0.003) | 0.02 (p < 0.001) |
Exponentiating the coefficients gives the IRRs:

| Term | Mickey Present | Mickey Not Present |
|---|---|---|
| Temperature | 1.02 | 1.01 |
| Time of Day (Evening) | 0.67 | 0.24 |
| Time of Day (Morning) | 1.51 | 0.18 |
| Temp x Evening | 0.9986 | 1.01 |
| Temp x Morning | 0.9899 | 1.02 |
```r
m2_mickey_full <- pluto %>%
  filter(mickey_present == "Yes") %>%
  glm(squirrels_chased ~ temperature + time_of_day + temperature:time_of_day,
      data = ., family = poisson(link = "log"))
m2_mickey_reduced <- pluto %>%
  filter(mickey_present == "Yes") %>%
  glm(squirrels_chased ~ 1, data = ., family = poisson(link = "log"))
anova(m2_mickey_reduced, m2_mickey_full, test = "LRT")

m2_no_mickey_full <- pluto %>%
  filter(mickey_present == "No") %>%
  glm(squirrels_chased ~ temperature + time_of_day + temperature:time_of_day,
      data = ., family = poisson(link = "log"))
m2_no_mickey_reduced <- pluto %>%
  filter(mickey_present == "No") %>%
  glm(squirrels_chased ~ 1, data = ., family = poisson(link = "log"))
anova(m2_no_mickey_reduced, m2_no_mickey_full, test = "LRT")
```

When Mickey is present:

\ln(\hat{y}) = 2.37 + 0.02 \text{ temp} - 0.41 \text{ E} + 0.41 \text{ M} - 0.001 \text{ temp} \times \text{E} - 0.01 \text{ temp} \times \text{M}
| Time of Day | Temperature’s Slope | Temperature’s IRR |
|---|---|---|
| Morning | 0.02 - 0.01 = 0.01 | e^{0.01} = 1.010 |
| Afternoon | 0.02 | e^{0.02} = 1.020 |
| Evening | 0.02 - 0.001 = 0.019 | e^{0.019} = 1.019 |
When Mickey is not present:

\ln(\hat{y}) = 3.09 + 0.009 \text{ temp} - 1.44 \text{ E} - 1.69 \text{ M} + 0.01 \text{ temp} \times \text{E} + 0.02 \text{ temp} \times \text{M}
| Time of Day | Temperature’s Slope | Temperature’s IRR |
|---|---|---|
| Morning | 0.009 + 0.02 = 0.029 | e^{0.029} = 1.029 |
| Afternoon | 0.009 | e^{0.009} = 1.009 |
| Evening | 0.009 + 0.01 = 0.019 | e^{0.019} = 1.019 |
| Time of Day | IRR for Temp - Mickey | IRR for Temp - No Mickey |
|---|---|---|
| Morning | 1.010 | 1.029 |
| Afternoon | 1.020 | 1.009 |
| Evening | 1.019 | 1.019 |
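These conditional IRRs can be reproduced directly from the stratified coefficient tables above (a sketch using the rounded coefficients, so treat the results as approximate):

```r
# rounded coefficients from the stratified models above
b_temp_mickey      <- 0.02    # temperature slope, Mickey present
int_morning_mickey <- -0.01   # temp x morning interaction, Mickey present
int_evening_mickey <- -0.001  # temp x evening interaction, Mickey present

b_temp_no      <- 0.009       # temperature slope, Mickey not present
int_morning_no <- 0.02        # temp x morning interaction, Mickey not present
int_evening_no <- 0.01        # temp x evening interaction, Mickey not present

# conditional slope = main temperature slope + the relevant interaction term
irr_mickey <- exp(c(morning   = b_temp_mickey + int_morning_mickey,
                    afternoon = b_temp_mickey,
                    evening   = b_temp_mickey + int_evening_mickey))
irr_no     <- exp(c(morning   = b_temp_no + int_morning_no,
                    afternoon = b_temp_no,
                    evening   = b_temp_no + int_evening_no))
round(rbind(mickey = irr_mickey, no_mickey = irr_no), 3)
```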
In this lecture, we have introduced Poisson regression.
We again saw the log link function.
We interpret coefficients as multiplicative effects on the count of the outcome.
These are called the incident rate ratios (IRR).
In the next lecture, we will review negative binomial regression.