y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon
Until this week, we focused on applying the normal distribution to our model.
What happens when we have a response variable that is continuous but skewed right?
f(y|\mu, \gamma) = \frac{1}{\Gamma(\gamma) \left( \frac{\mu}{\gamma} \right)^\gamma} y^{\gamma-1} \exp\left\{ \frac{-y \gamma}{\mu} \right\}
where: y > 0, \mu > 0, \gamma > 0, and \Gamma(\cdot) is the Gamma function
Note that the distribution is defined by the shape parameter, \gamma, and the scale parameter, \mu/\gamma. Under this parameterization, E(y) = \mu and Var(y) = \mu^2/\gamma.
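As a quick sketch (the parameter values here are invented for illustration), the (\mu, \gamma) parameterization maps onto R's rgamma() via shape = \gamma and scale = \mu/\gamma:

```r
# Simulate from the gamma distribution in the (mu, gamma) parameterization:
# shape = gamma, scale = mu / gamma, so E(y) = mu and Var(y) = mu^2 / gamma.
set.seed(152)
mu <- 20       # hypothetical mean wait time
gamma <- 4     # hypothetical shape parameter
y <- rgamma(10000, shape = gamma, scale = mu / gamma)

mean(y)  # close to mu = 20
var(y)   # close to mu^2 / gamma = 100
```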
Why shouldn’t we use the normal distribution here?
When we apply the normal distribution, we assume that the outcome (response) can take any value. i.e., y \in (-\infty, \infty)
In this case, our continuous response variable is bounded between 0 and \infty. i.e., y \in (0, \infty)
The gamma distribution is appropriate for modeling continuous, positive, right-skewed data.
Note that the support for the gamma distribution is (0, \infty).
What does that mean for us?
If any observed responses are exactly 0, they fall outside the gamma support; a common workaround is to shift every observation by the same small constant:
y_{\text{new}} = y + 0.01
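A minimal sketch of this shift in R (the wait times below are invented for illustration):

```r
# Hypothetical wait times containing an exact zero.
y <- c(0, 5, 12, 30, 45)

# The gamma support is (0, Inf), so exact zeros must be handled first.
if (any(y == 0)) {
  y_new <- y + 0.01  # shift every observation by the same small constant
} else {
  y_new <- y
}

all(y_new > 0)  # TRUE
```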
We use the glm() function to perform gamma regression. Note the new-to-us piece of syntax: link = "log" attached to family.
This specifies that we are using the log link function.
Technically we used the identity link with normal regression. It is the default, but we could have specified family = gaussian(link = "identity").
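A minimal sketch of the syntax, fit on simulated data (the variable names temp_f, crowd_index, and wait_time mirror the example below; the data themselves are invented):

```r
# Simulate a small dataset and fit a gamma GLM with a log link.
set.seed(152)
n <- 200
temp_f <- runif(n, 70, 95)
crowd_index <- runif(n, 1, 10)
mu <- exp(1.3 + 0.015 * temp_f + 0.11 * crowd_index)  # log link: ln(mu) is linear
wait_time <- rgamma(n, shape = 4, scale = mu / 4)
mk_sim <- data.frame(wait_time, temp_f, crowd_index)

m_sim <- glm(wait_time ~ temp_f + crowd_index,
             data = mk_sim,
             family = Gamma(link = "log"))
coef(m_sim)  # estimates should land near the true values (1.3, 0.015, 0.11)
```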
We are working with the Magic Kingdom operations team. For a sample of ride observations taken throughout several days, WDW has recorded:
(Intercept)      temp_f  crowd_index  ride_typefamily  ride_typethrill
  1.2815640   0.0149594    0.1094566       -0.1904717        0.4291482
\ln(y) = 1.28 + 0.015 \text{ temp} + 0.11 \text{ crowd} - 0.19 \text{ family} + 0.43 \text{ thrill}
Uh oh. We are now modeling ln(y) and not y directly…
We will transform the coefficients:
\begin{align*} \ln(\hat{y}) &= \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + ... \hat{\beta}_k x_k \\ \hat{y} &= \exp\left\{\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + ... \hat{\beta}_kx_k \right\} \\ \hat{y} &= e^{\hat{\beta}_0} e^{\hat{\beta}_1x_1} e^{\hat{\beta}_2 x_2} \cdot \cdot \cdot e^{\hat{\beta}_k x_k} \end{align*}
These are multiplicative effects, as compared to the additive effects we saw under the normal distribution.
Note: Please do not write your model as a function of \hat{y} – the above is only to demonstrate the multiplicative effect.
This multiplicative effect is called the incidence rate ratio.
Let’s think about multiplicative effects:
Examples:
In our example, \ln(\hat{\text{wait}}) = 1.28 + 0.015 \text{ temp} + 0.11 \text{ crowd} - 0.19 \text{ family} + 0.43 \text{ thrill}
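To see these multiplicative effects numerically, we can exponentiate the fitted coefficients from the output above:

```r
# Exponentiate the log-link coefficients to get multiplicative effects.
b <- c(temp_f = 0.0149594, crowd_index = 0.1094566,
       ride_typefamily = -0.1904717, ride_typethrill = 0.4291482)
round(exp(b), 3)
# temp_f:      each 1-degree increase multiplies expected wait by ~1.015 (+1.5%)
# crowd_index: each 1-unit increase multiplies expected wait by ~1.116 (+11.6%)
# family:      ~0.827 times the dark-ride wait (about 17.3% lower)
# thrill:      ~1.536 times the dark-ride wait (about 53.6% higher)
```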
Our approach to determining significance remains the same.
For continuous and binary predictors, we can use the Wald test (z-test from tidy()).
For omnibus tests, we can use the likelihood ratio test (LRT).
(e.g., car::Anova(type = 3) or a full/reduced LRT).
m1_full <- glm(wait_time ~ temp_f + crowd_index + ride_type, data = mk_wait, family = Gamma(link = "log"))
m1_reduced <- glm(wait_time ~ 1, data = mk_wait, family = Gamma(link = "log"))
anova(m1_reduced, m1_full, test = "LRT")
Yes, this is a significant regression line (p < 0.001).
Temperature is a significant predictor (p < 0.001).
Crowd index is a significant predictor (p < 0.001).
Ride type…?
Ride type is a significant predictor (p < 0.001).
tidy() results for ride type:
m1 %>%
  tidy() %>%
  filter(term %in% c("ride_typefamily", "ride_typethrill")) %>%
  select(term, estimate, p.value)
Family rides have significantly lower wait times than dark rides (p < 0.001).
Thrill rides have significantly higher wait times than dark rides (p < 0.001).
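Once a model is chosen, predictions can be pulled back to the original scale. A hedged sketch on simulated data (the slides' fitted object m1 would work the same way; predict() with type = "response" applies the exp() for us):

```r
# Back-transforming predictions from a log-link gamma GLM:
# type = "link" returns ln(y-hat); type = "response" returns y-hat.
set.seed(152)
sim <- data.frame(x = runif(100, 0, 10))
sim$y <- rgamma(100, shape = 4, scale = exp(1 + 0.1 * sim$x) / 4)
m <- glm(y ~ x, data = sim, family = Gamma(link = "log"))

new_obs <- data.frame(x = 5)
eta  <- predict(m, newdata = new_obs, type = "link")      # ln(y-hat)
yhat <- predict(m, newdata = new_obs, type = "response")  # y-hat

all.equal(unname(exp(eta)), unname(yhat))  # TRUE: exp(link) = response
```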
In this lecture, we have introduced gamma regression.
We introduced modeling with the log-link and how that affects interpretations.
In the next lecture, we will review visualizing gamma regression.