y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon
Until this week, we focused on applying the normal distribution to our model.
What happens when we have a response variable that is continuous but bounded between 0 and 1?
Beta regression is used specifically for continuous response variables that fall between 0 and 1.
Why shouldn’t we use the normal distribution here?
When we apply the normal distribution, we assume that the outcome (response) can take any value, i.e., y \in (-\infty, \infty).
However, in this case, our response variable is bounded between 0 and 1, i.e., y \in (0, 1).
The beta distribution is a continuous probability distribution defined on the interval (0, 1).
It is characterized by two shape parameters, \alpha and \beta, which determine the distribution’s shape.
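To build intuition for the two shape parameters, here is a small sketch using only base R's beta-distribution functions; the particular values of \alpha and \beta are just illustrative. One useful fact: the mean of a \text{Beta}(\alpha, \beta) distribution is \alpha / (\alpha + \beta).

```r
# Mean of Beta(alpha, beta) is alpha / (alpha + beta);
# the two shape parameters control location, spread, and skew.
alpha <- 2
beta  <- 6
alpha / (alpha + beta)              # theoretical mean: 0.25

set.seed(1)
mean(rbeta(1e5, alpha, beta))       # simulated mean, close to 0.25

dbeta(0.5, shape1 = 2, shape2 = 2)  # symmetric Beta(2, 2): density 1.5 at 0.5
```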
Note that the support for the beta distribution is (0, 1).
What does that mean for us?
One common transformation, from Smithson & Verkuilen (2006), maps observed values in [0, 1] into the open interval (0, 1):
y_{\text{new}} = \frac{y(n-1)+0.5}{n}
where n is the sample size.
| n | y = 0 | y = 0.25 | y = 0.5 | y = 0.75 | y = 1 |
|---|---|---|---|---|---|
| 10 | 0.05 | 0.275 | 0.5 | 0.725 | 0.95 |
| 100 | 0.005 | 0.2525 | 0.5 | 0.7475 | 0.995 |
| 1000 | 0.0005 | 0.25025 | 0.5 | 0.74975 | 0.9995 |
| 10000 | 0.00005 | 0.250025 | 0.5 | 0.749975 | 0.99995 |
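The table values can be reproduced directly; a minimal R sketch of the transformation (the function name `squeeze` is ours, not from the paper):

```r
# Smithson & Verkuilen (2006): pull y in [0, 1] into the open interval (0, 1)
squeeze <- function(y, n) (y * (n - 1) + 0.5) / n

y <- c(0, 0.25, 0.5, 0.75, 1)
squeeze(y, n = 10)    # 0.05 0.275 0.5 0.725 0.95
squeeze(y, n = 1000)  # 0.0005 0.25025 0.5 0.74975 0.9995
```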
Under the normal model, we had
y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon
With beta regression, we instead model the mean through a link function:
\text{logit}(\mu) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k
Wait… \text{logit()}?
The \text{logit()} function is defined as
\text{logit}(\mu) = \ln\left(\frac{\mu}{1 - \mu}\right)
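In R, the logit and its inverse are built in as qlogis() and plogis(); a minimal sketch:

```r
logit     <- function(mu) log(mu / (1 - mu))   # same as qlogis(mu)
inv_logit <- function(eta) 1 / (1 + exp(-eta)) # same as plogis(eta)

logit(0.5)       # 0: a mean of 0.5 sits at 0 on the logit scale
inv_logit(2.38)  # about 0.915: back-transforming a linear predictor
```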
In R, we can fit a beta regression model using the betareg package, then run the resulting model through tidy() (from the broom package).
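A minimal, self-contained sketch of that workflow on simulated data (the variable names mirror the example that follows, but the data and numbers here are made up):

```r
library(betareg)
library(broom)

set.seed(42)
# simulated stand-in: completion_adj already lies strictly in (0, 1)
d  <- data.frame(task_load = runif(150, 0, 10))
mu <- plogis(2 - 0.1 * d$task_load)           # true mean, defined on the logit scale
d$completion_adj <- rbeta(150, mu * 20, (1 - mu) * 20)

fit <- betareg(completion_adj ~ task_load, data = d, link = "logit")
tidy(fit)  # one row per term: estimate, std.error, statistic, p.value
```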
Minnie and Daisy have been hired as “Efficiency Leads” for three different park operations teams, and they want to understand what affects how well a shift goes.
Minnie or Daisy records information about their work shift:
operations %>%
  summarize(min = min(completion_rate),
            median = median(completion_rate),
            max = max(completion_rate))

Coefficients from the fitted model:

                   (Intercept)                characterMinnie 
                    2.37702571                     0.82984905 
                     task_load                    shift_hours 
                   -0.03904442                    -0.08308716 
locationMain_Street_Operations          locationToontown_Crew 
                    0.01773734                    -0.30528088 
     characterMinnie:task_load                          (phi) 
                   -0.04533085                     6.82672202 
\begin{align*} \text{logit}(\mu) = \ & 2.38 + 0.83 \text{ Minnie} - 0.04 \text{ task load} - 0.08 \text{ shift hours} + \\ & 0.02 \text{ Main Street} - 0.31 \text{ Toontown} - \\ & 0.05 \text{ Minnie} \times \text{task load} \end{align*}
Recall the general linear model: when y \sim N(\mu, \sigma^2), y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon,
But when y \sim \text{Beta}(\alpha, \beta), our model becomes \text{logit}(\mu) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k,
Recall the \text{logit()} function, \text{logit}(\mu) = \ln\left(\frac{\mu}{1 - \mu}\right)
To interpret the coefficients on the original scale of the outcome, we can exponentiate them to get the odds ratio (OR) of the mean.
\text{OR} = e^{\beta_i}
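For instance, exponentiating the (rounded) coefficients from the example model:

```r
# rounded coefficients from the fitted beta regression
b <- c(task_load = -0.04, shift_hours = -0.08,
       Main_Street = 0.02, Toontown = -0.31)
round(exp(b), 2)  # odds ratios: 0.96 0.92 1.02 0.73
```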
Example interpretations:
Full model:
\begin{align*} \text{logit}(\mu) = \ & 2.38 + 0.83 \text{ Minnie} - 0.04 \text{ task load} - 0.08 \text{ shift hours} + \\ & 0.02 \text{ Main Street} - 0.31 \text{ Toontown} - \\ & 0.05 \text{ Minnie} \times \text{task load} \end{align*}
For Minnie (Minnie = 1):
\begin{align*} \text{logit}(\mu) = \ & 3.21 - 0.09 \text{ task load} - 0.08 \text{ shift hours} + \\ & 0.02 \text{ Main Street} - 0.31 \text{ Toontown} \end{align*}
For Daisy (reference level, Minnie = 0):
\begin{align*} \text{logit}(\mu) = \ & 2.38 - 0.04 \text{ task load} - 0.08 \text{ shift hours} + \\ & 0.02 \text{ Main Street} - 0.31 \text{ Toontown} \end{align*}
Now that we have stratified models, we can find the corresponding odds ratios.
Minnie:
[1] 0.9139312 0.9231163 1.0202013 0.7334470
Daisy:
[1] 0.9607894 0.9231163 1.0202013 0.7334470
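Both vectors come from exponentiating the stratified (rounded) coefficients for task load, shift hours, Main Street, and Toontown:

```r
# Minnie: task-load slope = main effect + interaction = -0.04 + (-0.05) = -0.09
exp(c(-0.09, -0.08, 0.02, -0.31))  # 0.9139 0.9231 1.0202 0.7334
# Daisy (reference level): main effect only
exp(c(-0.04, -0.08, 0.02, -0.31))  # 0.9608 0.9231 1.0202 0.7334
```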
Note that the ORs for shift hours and location are the same between the two models. That is because we did not include an interaction between character and location.
As the number of shift hours increases by 1 hour, the odds of the mean completion rate are multiplied by 0.92 – this is a decrease of approximately 8%.
As compared to Epcot’s World Showcase, the odds of the mean completion rate for Main Street Operations are multiplied by 1.02 – this is an increase of approximately 2%.
As compared to Epcot’s World Showcase, the odds of the mean completion rate for Toontown Crew are multiplied by 0.73 – this is a decrease of approximately 27%.
Minnie:
[1] 0.9139312 0.9231163 1.0202013 0.7334470
Daisy:
[1] 0.9607894 0.9231163 1.0202013 0.7334470
Now, the OR for task load differs between the two characters due to the interaction.
For Minnie, as the task load increases by 1 unit, the odds of the mean completion rate are multiplied by 0.91 – this is a decrease of approximately 9%.
For Daisy, as the task load increases by 1 unit, the odds of the mean completion rate are multiplied by 0.96 – this is a decrease of approximately 4%.
Another statement we can make is that as additional tasks are assigned during the shift, Minnie’s odds of the mean completion rate decrease faster than Daisy’s.
Our approach to determining significance remains the same.
For continuous and binary predictors, we can use the Wald test (z-test from tidy()).
For omnibus tests, we can use the likelihood ratio test (LRT).
With beta regression, we use lrtest() from the lmtest package.
full <- betareg(completion_adj ~ character + task_load + shift_hours + location + character:task_load,
data = operations,
link = "logit")
reduced <- betareg(completion_adj ~ 1,
data = operations,
link = "logit")
lrtest(reduced, full)

The interaction between character and task load is significant (p < 0.001).
Shift length is significant (p < 0.001).
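Under the hood, lrtest() compares twice the log-likelihood difference between the nested models to a chi-squared distribution. A self-contained sketch on simulated data (the data here are made up purely to illustrate the computation):

```r
library(betareg)

set.seed(7)
x  <- runif(200)
mu <- plogis(0.5 + 1.2 * x)
y  <- rbeta(200, mu * 20, (1 - mu) * 20)

full    <- betareg(y ~ x)
reduced <- betareg(y ~ 1)

lr  <- as.numeric(2 * (logLik(full) - logLik(reduced)))  # LRT statistic
dfd <- attr(logLik(full), "df") - attr(logLik(reduced), "df")
pchisq(lr, df = dfd, lower.tail = FALSE)  # same p-value lrtest(reduced, full) reports
```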
Let’s look at the omnibus test for location.
full <- betareg(completion_adj ~ character + task_load + shift_hours + location + character:task_load,
data = operations,
link = "logit")
reduced <- betareg(completion_adj ~ character + task_load + shift_hours + character:task_load,
data = operations,
link = "logit")
lrtest(reduced, full)

m1 %>%
  tidy() %>%
  select(term, estimate, p.value) %>%
  filter(term %in% c("locationMain_Street_Operations", "locationToontown_Crew"))

The reference group is Epcot’s World Showcase.
There is not a difference in the mean completion rate between Epcot’s World Showcase and Main Street (p = 0.867).
There is a difference in the mean completion rate between Epcot’s World Showcase and Toontown (p = 0.010).
Note that beta regression is unique in that it allows us to simultaneously model the mean and the precision, \phi (larger precision corresponds to smaller variance).
Let’s return to our example.
We modeled the mean as a function of character, task load, shift hours, location, and the interaction between character and task load.
Let’s now additionally model the precision as a function of character, location, and the interaction between the two.
m2_full <- betareg(completion_adj ~ character + task_load + shift_hours + location + character:task_load |
character + location + character:location,
data = operations,
link = "logit")
m2_reduced <- betareg(completion_adj ~ character + task_load + shift_hours + location + character:task_load |
character + location,
data = operations,
link = "logit")
lrtest(m2_reduced, m2_full)

This week we are turning our attention to continuous responses that are not suitable for the normal distribution.
Beta regression is used for continuous response variables that fall between 0 and 1.
The interpretations are now updated to reflect the use of the logit link function.
Next lecture: visualizing beta regression.
Next week: logistic regression for categorical outcomes.