Logistic Regression

STA4173: Biostatistics
Spring 2025

Introduction

  • We have previously discussed continuous outcomes and the normal distribution.

  • Let’s now consider categorical outcomes:

    • Binary

    • Ordinal

    • Multinomial

Binary Logistic Regression

  • We model binary outcomes using logistic regression,

\ln \left( \frac{\pi}{1-\pi} \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k,

  • where \pi = \text{P}[Y = 1] = the probability of the outcome/event.

  • How is this different from linear regression?

y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k

Binary Logistic Regression

  • Why isn’t linear regression appropriate?

    • A regression line is unbounded, so fitted values can fall below 0 or above 1, which probabilities cannot.

    • A binary outcome also violates the normality and constant-variance assumptions of linear regression, as the sketch below illustrates.
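
  • A minimal simulated sketch (hypothetical data, not today’s dataset) showing that lm() can return fitted “probabilities” outside [0, 1]:

set.seed(1)                          # hypothetical simulated example
x <- seq(0, 10, length.out = 50)
y <- rbinom(50, size = 1, prob = plogis(-4 + 0.8 * x))
fit_lm <- lm(y ~ x)                  # linear probability model
range(fitted(fit_lm))                # typically extends below 0 and above 1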

R Syntax

  • In the glm() function, we specify the binomial family.
m <- glm(binary_outcome ~ predictor_1 + predictor_2 + ... + predictor_k, 
         data = dataset_name, 
         family = "binomial")

Today’s Data

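  • A minimal setup sketch: the examples below assume the richmondway data with the binary variables dating and IMDB already constructed. The package source and raw column names (Dating_flag, Imdb_rating) are assumptions; adjust to however your copy of the data is stored.

# Assumed source: the {richmondway} package on GitHub
# remotes::install_github("deepshamenghani/richmondway")
library(richmondway)
data("richmondway")

# Binary variables used in these slides (raw column names assumed)
richmondway$dating <- ifelse(richmondway$Dating_flag == "Yes", 1, 0)
richmondway$IMDB   <- ifelse(richmondway$Imdb_rating >= 8.5, 1, 0)
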
Example: Roy Kent’s F-Bombs

  • Let’s model the odds of Roy Kent and Keeley Jones dating in a particular episode (dating) as a function of the percentage of F-bombs that belong to Roy Kent (F_perc) and whether the IMDB rating is 8.5 or better (IMDB).
m1 <- glm(dating ~ F_perc + IMDB, 
          data = richmondway, 
          family = binomial(link = "logit"))
summary(m1)

Call:
glm(formula = dating ~ F_perc + IMDB, family = binomial(link = "logit"), 
    data = richmondway)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.76166    1.08995  -1.616    0.106
F_perc       0.03323    0.02506   1.326    0.185
IMDB         0.37986    0.72250   0.526    0.599

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 46.662  on 33  degrees of freedom
Residual deviance: 44.261  on 31  degrees of freedom
AIC: 50.261

Number of Fisher Scoring iterations: 4

Example: Roy Kent’s F-Bombs

coefficients(m1)
(Intercept)      F_perc        IMDB 
-1.76166167  0.03323012  0.37986267 
  • The model is as follows,

\ln \left( \frac{\hat{\pi}}{1-\hat{\pi}} \right) = -1.76 + 0.03 x_1 + 0.38 x_2,

  • where

    • x_1 is the percentage of the episode’s F-bombs that came from Roy Kent

    • x_2 is the IMDB rating categorization of the episode

      • 0 = IMDB rating < 8.5
      • 1 = IMDB rating \ge 8.5
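
  • To turn fitted log-odds into an estimated probability, invert the logit: \hat{\pi} = e^{\eta}/(1+e^{\eta}). A quick sketch with hypothetical predictor values (40% of F-bombs from Roy, IMDB rating \ge 8.5):

eta <- -1.76166 + 0.03323 * 40 + 0.37986 * 1   # fitted log-odds
exp(eta) / (1 + exp(eta))                      # estimated probability, about 0.49

# Equivalently, using the fitted model directly:
predict(m1, newdata = data.frame(F_perc = 40, IMDB = 1), type = "response")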

Odds Ratios

  • Recall the binary logistic regression model,

\ln \left( \frac{\pi}{1-\pi} \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k,

  • We are modeling the log odds, which are not intuitive to interpret.

  • To be able to discuss the odds, we will “undo” the natural log by exponentiation.

  • i.e., if we want to interpret the slope for x_i, we will look at e^{\hat{\beta}_i}.

  • When interpreting \hat{\beta}_i, it is an additive effect on the log odds.

  • When interpreting e^{\hat{\beta}_i}, it is a multiplicative effect on the odds.

Odds Ratios

  • Why is it a multiplicative effect?

\begin{align*} \ln \left( \frac{\pi}{1-\pi} \right) &= \beta_0 + \beta_1 x_1 + ... + \beta_k x_k \\ \exp\left\{ \ln \left( \frac{\pi}{1-\pi} \right) \right\} &= \exp\left\{ \beta_0 + \beta_1 x_1 + ... + \beta_k x_k \right\} \\ \frac{\pi}{1-\pi} &= e^{\beta_0} e^{\beta_1 x_1} \cdots e^{\beta_k x_k} \end{align*}
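
  • A quick numeric check of the multiplicative effect using the fitted model m1 (the F_perc values 30 and 31 are arbitrary choices):

b <- coef(m1)
odds_30 <- exp(b[1] + b[2] * 30 + b[3] * 1)   # odds when F_perc = 30, IMDB = 1
odds_31 <- exp(b[1] + b[2] * 31 + b[3] * 1)   # F_perc increased by one unit
odds_31 / odds_30                             # equals exp(b[2]), about 1.03
exp(b["F_perc"])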

Odds Ratios

  • Odds ratios:

    • For a 1 [unit of predictor] increase in [predictor name], the odds of [outcome] are multiplied by [e^{\hat{\beta}_i}].
    • For a 1 [unit of predictor] increase in [predictor name], the odds of [outcome] are [increased or decreased] by [100(e^{\hat{\beta}_i}-1)% or 100(1-e^{\hat{\beta}_i})%].
  • Compared to linear regression:

    • For a [k] [units of predictor] increase in [predictor], we expect [outcome] to [increase or decrease] by [k \times |\hat{\beta}_1|] [units of outcome].

Example: Roy Kent’s F-Bombs

round(exp(coefficients(m1)),2)
(Intercept)      F_perc        IMDB 
       0.17        1.03        1.46 
  • Let’s interpret the odds ratios:

    • For a 1 percentage point increase in the percentage of F-bombs that came from Roy Kent, the odds of Roy and Keeley dating increase by 3%.

    • As compared to when episodes have less than an IMDB rating of 8.5, the odds of Roy and Keeley dating are 46% higher in episodes with an IMDB rating of at least 8.5.
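
  • In practice, odds ratios are usually reported with confidence intervals; a minimal sketch (output not shown), assuming profile-likelihood intervals from confint():

# 95% CIs on the odds-ratio scale
round(exp(confint(m1)), 2)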

Test for Significant Predictors

  • What we’ve learned so far about testing the significance of predictors carries over to logistic regression.

  • Looking at the results from summary():

summary(m1)

Call:
glm(formula = dating ~ F_perc + IMDB, family = binomial(link = "logit"), 
    data = richmondway)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.76166    1.08995  -1.616    0.106
F_perc       0.03323    0.02506   1.326    0.185
IMDB         0.37986    0.72250   0.526    0.599

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 46.662  on 33  degrees of freedom
Residual deviance: 44.261  on 31  degrees of freedom
AIC: 50.261

Number of Fisher Scoring iterations: 4

Test for Significant Predictors

  • Hypotheses

    • H_0: \ \beta_{\text{perc}} = 0
    • H_1: \ \beta_{\text{perc}} \ne 0
  • Test Statistic and p-Value

    • z_0 = 1.326
    • p = 0.185
  • Rejection Region

    • Reject H_0 if p<\alpha; \alpha=0.05.
  • Conclusion / Interpretation

    • Fail to reject H_0. There is not sufficient evidence to suggest a relationship between the odds of Roy and Keeley dating and the percentage of F-bombs from Roy.
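
  • The test statistic and p-value come straight from the coefficient table; a small sketch:

# Coefficient-table row for F_perc: estimate, SE, z, p
coef(summary(m1))["F_perc", ]                  # z0 = 1.326, p = 0.185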

Test for Significant Predictors

  • Hypotheses

    • H_0: \ \beta_{\text{IMDB}} = 0
    • H_1: \ \beta_{\text{IMDB}} \ne 0
  • Test Statistic and p-Value

    • z_0 = 0.526
    • p = 0.599
  • Rejection Region

    • Reject H_0 if p<\alpha; \alpha=0.05.
  • Conclusion / Interpretation

    • Fail to reject H_0. There is not sufficient evidence to suggest a relationship between the odds of Roy and Keeley dating and the IMDB rating categorization.

Test for Significant Regression Line

  • We will take a different approach when testing for a significant regression line.
full <- glm(outcome ~ predictor_1 + predictor_2 + ... + predictor_k, 
            data = dataset_name, 
            family = binomial(link = "logit"))
reduced <- glm(outcome ~ 1, data = dataset_name, family = binomial(link = "logit"))
anova(reduced, full, test = "LRT")

Test for Significant Regression Line

  • Hypotheses

    • H_0: \ \beta_1 = \beta_2 = ... = \beta_k = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic

    • \chi^2_0 = from R
  • p-Value

    • p = P[\chi^2_{k} \ge \chi^2_0], where the degrees of freedom, k, equal the number of predictors being tested
  • Rejection Region

    • Reject H_0 if p<\alpha.

Test for Significant Regression Line

full <- glm(dating ~ F_perc + IMDB, data = richmondway, family = binomial(link = "logit"))
reduced <- glm(dating ~ 1, data = richmondway, family = binomial(link = "logit")) # intercept-only model
anova(reduced, full, test = "LRT")
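
  • Equivalently, the LRT statistic is the drop in deviance reported by summary(m1), compared to a \chi^2_2 distribution:

chisq0 <- 46.662 - 44.261                         # null deviance - residual deviance
pchisq(chisq0, df = 33 - 31, lower.tail = FALSE)  # about 0.301, matching anova()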

Test for Significant Regression Line

  • Hypotheses

    • H_0: \ \beta_{\text{perc}} = \beta_{\text{IMDB}} = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic and p-Value

    • \chi^2_0 = 2.402
    • p = 0.301
  • Rejection Region

    • Reject H_0 if p<\alpha; \alpha=0.05.
  • Conclusion/Interpretation

    • Fail to reject H_0. There is not sufficient evidence to suggest that at least one predictor has a non-zero slope, i.e., the regression line is not significant.

Wrap Up

  • That’s it for new material for our course.

  • The rest of our class meetings will be devoted to working on the project.

    • Remember that the OUR Symposium is on April 17!
  • It is crucial that you are present in class.