STA4173: Biostatistics
Spring 2025
We have previously discussed continuous outcomes and the normal distribution.
Let’s now consider categorical outcomes:
Binary
Ordinal
Multinomial
\ln \left( \frac{\pi}{1-\pi} \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k,
where \pi = \text{P}[Y = 1] = the probability of the outcome/event.
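Equivalently, solving for \pi puts the model on the probability scale:
\pi = \frac{e^{\beta_0 + \beta_1 x_1 + ... + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + ... + \beta_k x_k}}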
How is this different from linear regression?
y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k
# fit the logistic regression of dating status on F-bomb percentage and IMDB rating
m1 <- glm(dating ~ F_perc + IMDB,
          data = richmondway,
          family = binomial(link = "logit"))
summary(m1)
Call:
glm(formula = dating ~ F_perc + IMDB, family = binomial(link = "logit"),
data = richmondway)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.76166 1.08995 -1.616 0.106
F_perc 0.03323 0.02506 1.326 0.185
IMDB 0.37986 0.72250 0.526 0.599
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46.662 on 33 degrees of freedom
Residual deviance: 44.261 on 31 degrees of freedom
AIC: 50.261
Number of Fisher Scoring iterations: 4
\ln \left( \frac{\hat{\pi}}{1-\hat{\pi}} \right) = -1.76 + 0.03 x_1 + 0.38 x_2,
where
x_1 is the percentage of the episode's F-bombs that came from Roy Kent
x_2 indicates whether the episode's IMDB rating is at least 8.5 (vs. less than 8.5)
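As a quick aside, a minimal sketch of how the fitted log odds can be turned into predicted probabilities in R, using the fitted model m1 from above (pi_hat is just an illustrative name):
# predicted probability of dating for each episode (inverse logit of the fitted log odds)
pi_hat <- predict(m1, type = "response")
head(pi_hat)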
\ln \left( \frac{\pi}{1-\pi} \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k,
We are modeling the log odds, which are not intuitive to interpret.
To be able to discuss the odds, we will “undo” the natural log by exponentiation.
i.e., if we want to interpret the slope for x_i, we will look at e^{\hat{\beta}_i}.
When interpreting \hat{\beta}_i, it is an additive effect on the log odds.
When interpreting e^{\hat{\beta}_i}, it is a multiplicative effect on the odds.
\begin{align*} \ln \left( \frac{\pi}{1-\pi} \right) &= \beta_0 + \beta_1 x_1 + ... + \beta_k x_k \\ \exp\left\{ \ln \left( \frac{\pi}{1-\pi} \right) \right\} &= \exp\left\{ \beta_0 + \beta_1 x_1 + ... + \beta_k x_k \right\} \\ \frac{\pi}{1-\pi} &= e^{\beta_0} e^{\beta_1 x_1} \cdots e^{\beta_k x_k} \end{align*}
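In R, one minimal way to obtain the estimated odds ratios from the fitted model m1 is to exponentiate its coefficients (the confidence intervals are optional and use profile likelihood):
# exponentiate the coefficients to move from the log odds scale to the odds scale
exp(coef(m1))
# profile-likelihood confidence intervals, also on the odds ratio scale
exp(confint(m1))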
Odds ratios: compared to linear regression, where a slope is an additive change in the mean response, an exponentiated slope here is a multiplicative change in the odds.
Let's interpret the odds ratios from our model:
For a 1 percentage point increase in the percentage of f-bombs that came from Roy Kent, the odds of Roy and Keeley dating increase by 3%.
As compared to when episodes have less than an IMDB rating of 8.5, the odds of Roy and Keeley dating are 46% higher in episodes with an IMDB rating of at least 8.5.
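These percentages come straight from exponentiating the estimated slopes:
e^{\hat{\beta}_1} = e^{0.033} \approx 1.03 \quad \text{and} \quad e^{\hat{\beta}_2} = e^{0.380} \approx 1.46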
What we've learned so far about testing the significance of predictors carries over to logistic regression.
Looking at the results from summary() (a quick check of the reported test for F_perc follows the output):
Call:
glm(formula = dating ~ F_perc + IMDB, family = binomial(link = "logit"),
data = richmondway)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.76166 1.08995 -1.616 0.106
F_perc 0.03323 0.02506 1.326 0.185
IMDB 0.37986 0.72250 0.526 0.599
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46.662 on 33 degrees of freedom
Residual deviance: 44.261 on 31 degrees of freedom
AIC: 50.261
Number of Fisher Scoring iterations: 4
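For example, the z value and p-value reported for F_perc are the Wald statistic (estimate divided by standard error) and its two-sided p-value from the standard normal; a quick check in R using the numbers in the output above:
# Wald z statistic for F_perc
z <- 0.03323 / 0.02506   # about 1.326
# two-sided p-value from the standard normal distribution
2 * pnorm(abs(z), lower.tail = FALSE)   # about 0.185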
For each test of significance (e.g., for the slope of each predictor), we work through the same steps as before:
Hypotheses
Test Statistic and p-Value
Rejection Region
Conclusion / Interpretation
(An overall deviance test is sketched below as one worked example.)
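As one example of filling in these steps, the deviances reported above give an overall likelihood ratio test of our model against the intercept-only model; a sketch, assuming the usual chi-square approximation with degrees of freedom equal to the number of predictors:
# overall (likelihood ratio) test: null deviance minus residual deviance
lr_stat <- 46.662 - 44.261   # change in deviance on 2 degrees of freedom
pchisq(lr_stat, df = 2, lower.tail = FALSE)
Alternatively, anova(m1, test = "Chisq") reports sequential deviance tests for the predictors.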
That’s it for new material for our course.
The rest of our class meetings will be devoted to working on the project.
It is crucial that you are present in class.