Multinomial Logistic Regression

Introduction

  • We have previously discussed continuous outcomes:

    • y \sim N(\mu, \sigma^2)
    • y \sim \text{gamma}(\mu, \gamma)
    • y \sim \text{beta}(\alpha, \beta)
  • This week, we will consider categorical outcomes:

    • Binary
    • Multinomial

Logistic Regressions

  • In the last lecture, we discussed binary logistic regression for binary outcomes,

\ln \left( \frac{\pi}{1-\pi} \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k

  • In this lecture, we will extend our understanding to multinomial logistic regression for multinomial outcomes,

\ln \left( \frac{\pi_j}{\pi_{\text{ref}}} \right) = \hat{\beta}_{0j} + \hat{\beta}_{1j} x_1 + ... + \hat{\beta}_{kj} x_k

Multinomial (or Nominal) Logistic Regression

  • We are now fitting c - 1 models, where c is the number of outcome categories.

    • Each model will compare outcome group j to the reference group.
    • We have unique estimates for every pairwise model.
  • The general form of each model is:

\ln \left( \frac{\hat{\pi}_j}{\hat{\pi}_{\text{ref}}} \right) = \hat{\beta}_{0j} + \hat{\beta}_{1j} x_1 + ... + \hat{\beta}_{kj} x_k

  • Yes, this means we are now reporting multiple models at the same time.
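  • For reference, the c - 1 logits jointly determine all c category probabilities. Writing \eta_j = \hat{\beta}_{0j} + \hat{\beta}_{1j} x_1 + ... + \hat{\beta}_{kj} x_k for the linear predictor of group j, a standard identity gives

\hat{\pi}_j = \frac{e^{\eta_j}}{1 + \sum_{l \neq \text{ref}} e^{\eta_l}}, \qquad \hat{\pi}_{\text{ref}} = \frac{1}{1 + \sum_{l \neq \text{ref}} e^{\eta_l}}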

Multinomial Logistic Regression (R)

  • We will now use the multinom() function from the nnet package for modeling,
m <- multinom(y ~ x_1 + x_2 + ... + x_k,
              data = dataset_name,
              trace = FALSE)
  • Note that trace = FALSE only affects printed output.

    • The trace is iteration information printed to the console while the model is fitting, to let the user know “where” the algorithm is.

Data Management for multinom()

  • Fortunately for us, we know data management for this already, as our outcome will be a factor variable. Recall,

    • When we include factor variables, R defaults to the “first” level as the reference group.

      • “First” means alphabetically first for strings and numerically smallest for numbers.
    • There are (more than) two approaches we can take to “relevel” a variable.

      • Use the factor() function with the levels argument to set the order of the levels.

      • “Brute force” by defining a new character variable using if_else() statements and defining the levels as “1 - first category name”, “2 - second category name”, etc.

  • When dealing with factor variables, ask yourself what the reference group should be.

    • Now that the outcome itself is multi-category, the choice of its reference group matters even more than it does for categorical predictors.
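The two releveling approaches above can be sketched in base R; the toy outcome vector below is hypothetical, for illustration only:

```r
# Toy outcome vector (hypothetical values)
outcome <- c("Messy", "Solid", "Iconic", "Solid", "Disaster")

# Approach 1: factor() with an explicit level order; the first level
# becomes the reference group
outcome_f <- factor(outcome, levels = c("Solid", "Disaster", "Messy", "Iconic"))
levels(outcome_f)   # "Solid" is listed (and used as reference) first

# Approach 2: relevel() moves one level of an existing factor to the front
outcome_f2 <- relevel(factor(outcome), ref = "Solid")
levels(outcome_f2)  # "Solid" first; remaining levels keep alphabetical order
```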

Data Management for multinom()

  • Previously, I have mentioned that you can use as.factor() within the formula of a model to treat a variable as a factor…

    • …it doesn’t work with multinom().
  • Thus, we must make sure that any categorical variables are coded appropriately in the dataset before modeling.

    • Watch out for categorical variables that are stored numerically! Remember that R treats them as continuous.
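Since as.factor() cannot be used inside the multinom() formula, a numerically coded categorical variable must be converted in the dataset first. A minimal sketch with a hypothetical 1-4 coding mapped to the plan names:

```r
# Hypothetical data: plan stored numerically, which R treats as continuous
d <- data.frame(plan_code = c(1, 3, 2, 4, 1))

# Convert to a factor before modeling, attaching meaningful labels
d$plan <- factor(d$plan_code,
                 levels = c(1, 2, 3, 4),
                 labels = c("Balanced", "Goofy", "Max", "Chaotic"))

levels(d$plan)   # "Balanced" "Goofy" "Max" "Chaotic"
```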

Lecture Example Set Up

  • During performance weekends, things don’t just go “well” or “poorly” — they can unfold in several very different ways. Depending on how Max prepares and which plan he follows with Goofy, the final outcome of the performance falls into one of four categories:

    • a disaster
    • a messy (but fun) performance,
    • a solid performance, or
    • an iconic performance.
  • Each day, Max tracks his practice time, amount of sleep, caffeine intake, and the preparation plan he used.

Lecture Example Set Up

  • Pulling in the data,
goofy <- read_csv("https://raw.githubusercontent.com/samanthaseals/SDSII/refs/heads/main/files/data/lectures/W5_goofy.csv")
goofy %>% head(n=4)

Example: Data Management

  • Let’s examine the characteristics of the two categorical variables:
levels(goofy$plan)
NULL
levels(goofy$performance_outcome)
NULL
  • Oops! They aren’t factor variables…
goofy <- goofy %>% 
  mutate(plan = factor(plan, levels = c("Balanced", "Goofy", "Max", "Chaotic")),
         performance_outcome = factor(performance_outcome, levels = c("Solid", "Disaster", "Messy", "Iconic")))
levels(goofy$plan)
[1] "Balanced" "Goofy"    "Max"      "Chaotic" 
levels(goofy$performance_outcome)
[1] "Solid"    "Disaster" "Messy"    "Iconic"  

Example: Multinomial Logistic Regression

  • Let’s now model Max’s performance outcome based on his preparation plan, practice time, and amount of sleep.
m1 <- multinom(performance_outcome ~ plan + practice_hours + sleep_hours,
               data = goofy, trace = FALSE)
tidy(m1)

Example: Multinomial Logistic Regression

  • … that’s a lot of output! o_o

  • Remember that we are fitting three separate models here:

    • Disaster vs. Solid
    • Messy vs. Solid
    • Iconic vs. Solid
coefficients(m1)
         (Intercept)  planGoofy   planMax planChaotic practice_hours
Disaster    6.587583  0.5249369  0.686156    1.749325     -0.5475893
Messy       4.450541  1.2985642  2.933813    1.048919     -0.2963095
Iconic      1.962543 -2.0882399 -2.534283   -1.738306      0.2184171
         sleep_hours
Disaster  -0.9967134
Messy     -0.8866107
Iconic    -0.1948100

Example: Multinomial Logistic Regression

  • Our models are as follows,

\begin{align*} \ln \left( \frac{\pi_{\text{Disaster}}}{\pi_{\text{Solid}}} \right) &= 6.59 + 0.52\,\text{G} + 0.69\,\text{M} + 1.75\,\text{C} - 0.55\,\text{prac} - 1.00\,\text{sleep} \\ \ln \left( \frac{\pi_{\text{Messy}}}{\pi_{\text{Solid}}} \right) &= 4.45 + 1.30\,\text{G} + 2.93\,\text{M} + 1.05\,\text{C} - 0.30\,\text{prac} - 0.89\,\text{sleep} \\ \ln \left( \frac{\pi_{\text{Iconic}}}{\pi_{\text{Solid}}} \right) &= 1.96 - 2.09\,\text{G} - 2.53\,\text{M} - 1.74\,\text{C} + 0.22\,\text{prac} - 0.19\,\text{sleep} \end{align*}

Interpreting the Model

  • Recall the general form of our model,

\ln \left( \frac{\hat{\pi}_j}{\hat{\pi}_{\text{ref}}} \right) = \hat{\beta}_{0j} + \hat{\beta}_{1j} x_1 + ... + \hat{\beta}_{kj} x_k

  • As in binary logistic regression, we are modeling the log odds; for interpretation, we will instead look to the odds ratio, e^{\hat{\beta}_i}

    • \hat{\beta}_i represents an additive effect on the log odds.

    • e^{\hat{\beta}_i} represents a multiplicative effect on the odds.
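As a quick check of the log-odds vs. odds-ratio relationship, we can exponentiate a fitted coefficient by hand; the value below is the Disaster-model coefficient for Goofy's plan from the coefficients(m1) output above:

```r
# Coefficient on the log-odds scale (from coefficients(m1))
b_goofy <- 0.5249369

# Exponentiating converts the additive log-odds effect into a
# multiplicative effect on the odds (the odds ratio)
exp(b_goofy)   # approximately 1.69
```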

Interpreting the Model

  • We interpret odds ratios much like binary logistic regression.

    • \beta_i < 0 \to 0 < \text{OR}_i < 1 \to a decrease in odds

    • \beta_i = 0 \to \text{OR}_i = 1 \to no change in odds

    • \beta_i > 0 \to \text{OR}_i > 1 \to an increase in odds

Interpreting the Model

  • Continuous predictors:

    • For a one [predictor unit] increase in [predictor], the odds of [response category j], as compared to [response reference category], are multiplied by e^{\hat{\beta}_{ij}}.
  • Categorical predictors:

    • As compared to [predictor reference category], the odds of [response category j_{\text{resp}}], as compared to [response reference category], for [predictor category j_{\text{pred}}] are multiplied by e^{\hat{\beta}_{j{_{\text{pred}}}}}.

      • AI enhanced: Individuals in [predictor category j_{\text{pred}}] have e^{\hat{\beta}_{j{_{\text{pred}}}}} times the odds of being in [response category j_{\text{resp}}] rather than [response reference category], relative to those in [predictor reference category].

Example: Interpreting the Model

  • Returning to our example,

\begin{align*} \ln \left( \frac{\pi_{\text{Disaster}}}{\pi_{\text{Solid}}} \right) &= 6.59 + 0.52\,\text{G} + 0.69\,\text{M} + 1.75\,\text{C} - 0.55\,\text{prac} - 1.00\,\text{sleep} \\ \ln \left( \frac{\pi_{\text{Messy}}}{\pi_{\text{Solid}}} \right) &= 4.45 + 1.30\,\text{G} + 2.93\,\text{M} + 1.05\,\text{C} - 0.30\,\text{prac} - 0.89\,\text{sleep} \\ \ln \left( \frac{\pi_{\text{Iconic}}}{\pi_{\text{Solid}}} \right) &= 1.96 - 2.09\,\text{G} - 2.53\,\text{M} - 1.74\,\text{C} + 0.22\,\text{prac} - 0.19\,\text{sleep} \end{align*}

Example: Odds Ratios

  • Moving to odds ratios, but in table form:
Predictor                    OR (Disaster)  OR (Messy)  OR (Iconic)
Plan: Goofy vs. Balanced              1.69        3.66         0.12
Plan: Max vs. Balanced                1.99       18.80         0.08
Plan: Chaotic vs. Balanced            5.75        2.85         0.18
Practice Time (hours)                 0.58        0.74         1.24
Hours Slept                           0.37        0.41         0.82
round(exp(coefficients(m1)), 2)
         (Intercept) planGoofy planMax planChaotic practice_hours sleep_hours
Disaster      726.02      1.69    1.99        5.75           0.58        0.37
Messy          85.67      3.66   18.80        2.85           0.74        0.41
Iconic          7.12      0.12    0.08        0.18           1.24        0.82

Example: Interpreting the Model

  • When looking at the odds ratios:

    • As compared to following a balanced plan, following:

      • a chaotic plan multiplies the odds of a disaster performance by 5.75 (475% increase).
      • Goofy’s plan multiplies the odds of a disaster performance by 1.69 (69% increase).
      • Max’s plan multiplies the odds of a disaster performance by 1.99 (99% increase).
    • For each additional hour of practice, the odds of a disaster performance are multiplied by 0.58 (42% decrease).

    • For each additional hour of sleep, the odds of a disaster performance are multiplied by 0.37 (63% decrease).
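The percent changes quoted above come directly from the odds ratios; a tiny helper makes the conversion explicit (the function name is ours, not from the lecture):

```r
# Convert an odds ratio to a percent change in the odds
or_to_pct <- function(or) (or - 1) * 100

or_to_pct(5.75)   # 475: a 475% increase
or_to_pct(0.58)   # -42: a 42% decrease
```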

Example: Interpreting the Model

  • We may also be interested in comparing odds ratios across models,

  • Examining practice time:

    • For each additional hour of practice, the odds of:

      • a disaster performance are multiplied by 0.58 (42% decrease).
      • a messy performance are multiplied by 0.74 (26% decrease).
      • an iconic performance are multiplied by 1.24 (24% increase).
  • This underscores the importance of practice time for Max’s performance outcome!

Statistical Inference

  • As with other regression models, we will want to determine the significance of our model and predictors.

  • Significance of regression line: no change.

    • Full (all predictors) / reduced (intercept only) with anova(reduced, full, test = "Chisq").
  • Significance of predictors: … which way?

    • omnibus
    • model-specific omnibus
    • model-specific individual

Statistical Inference

  • Significance of predictors (omnibus): no change.

    • Tells us if a predictor is significant across all outcome levels.
    • car::Anova(type = "III")
    • Full (all predictors) / reduced (predictor set minus predictor of interest) with anova(reduced, full, test = "Chisq").
  • Significance of predictors (model-specific omnibus): new.

    • Tells us if a predictor is significant within each model.
    • Approach: Fit separate models for each pairwise comparison of the categorical outcome levels. Then,
      • car::Anova(type = "III") on each model, or
      • Full (all predictors) / reduced (predictor set minus predictor of interest) with anova(reduced, full, test = "Chisq") on each model.

Statistical Inference

  • Significance of predictors (model-specific individual): no change.

    • Tells us if a continuous or binary predictor significantly predicts the outcome within each model.
    • Examines pairwise differences within a categorical variable within each model.
    • tidy()

Example: Significance of the Regression Line

  • We know how to determine the significance of the regression line,
m1_full <- multinom(performance_outcome ~ plan + practice_hours + sleep_hours, data = goofy, trace = FALSE)
m1_reduced <- multinom(performance_outcome ~ 1, data = goofy, trace = FALSE)
anova(m1_reduced, m1_full, test = "Chisq") 
  • Yes, this predictor set significantly predicts Max’s performance outcome (p < 0.001).

Example: Significance of Predictors (Omnibus)

  • We also know how to determine if individual predictors are significant overall (\text{outcome} \sim x_i),
m1 %>% car::Anova(type = "III") %>% tidy()
  • Yes, plan, practice hours, and sleep hours are all significant predictors of Max’s performance outcome (all p < 0.001).

Significance of Predictors (Model-Specific Omnibus)

  • But how do we determine if individual predictors are significant within each model (\pi_{\text{group}}/\pi_{\text{ref}}\sim x_i)?

    • This is straightforward for continuous and binary predictors via tidy().
  • If we want to get into the model-specific categorical predictor significance, we have a “rough” approach:

    • Fit separate models for each pairwise comparison of the categorical outcome levels.
dataset_name %>%
  filter(y %in% c("Group j","Reference")) %>% 
  multinom(y ~ x_1 + ... + x_k,
           data = .,
           trace = FALSE) %>%
  car::Anova(type=3) %>%
  tidy()

Example: Significance of Predictors (Model-Specific Omnibus)

  • Let’s try this for the Iconic vs. Solid model,
goofy %>%
  filter(performance_outcome %in% c("Iconic","Solid")) %>% 
  multinom(performance_outcome ~ plan + practice_hours + sleep_hours,
           data = .,
           trace = FALSE) %>%
  car::Anova(type=3) %>%
  tidy()
  • Plan (p<0.001) is a significant predictor of Max achieving an iconic performance vs. a solid performance, but practice time (p=0.094) and hours of sleep (p=0.197) are not.

Example: Significance of Predictors (Model-Specific Omnibus)

  • Let’s try this for the Messy vs. Solid model,
goofy %>%
  filter(performance_outcome %in% c("Messy","Solid")) %>% 
  multinom(performance_outcome ~ plan + practice_hours + sleep_hours,
           data = .,
           trace = FALSE) %>%
  car::Anova(type=3) %>%
  tidy()
  • Plan (p<0.001) and hours of sleep (p<0.001) are significant predictors of Max achieving a messy performance vs. a solid performance, but practice time (p=0.105) is not.

Example: Significance of Predictors (Model-Specific Omnibus)

  • Let’s try this for the Disaster vs. Solid model,
goofy %>%
  filter(performance_outcome %in% c("Disaster","Solid")) %>% 
  multinom(performance_outcome ~ plan + practice_hours + sleep_hours,
           data = .,
           trace = FALSE) %>%
  car::Anova(type=3) %>%
  tidy()
  • Plan (p=0.008), practice hours (p = 0.016), and hours of sleep (p<0.001) are significant predictors of Max achieving a disastrous performance vs. a solid performance.

Significance of Predictors (Model-Specific Individual)

  • Finally, we can look at the model-specific individual predictor significance.

    • What is this adding to our analysis? We can now examine pairwise comparisons of categorical predictors within each model.
  • However, our tidy() output is totally overwhelming. We will filter,

model %>% 
  tidy(conf.int = TRUE, exponentiate = TRUE) %>% 
  filter(y.level == "Group j") %>% 
  select(term, estimate, conf.low, conf.high, p.value)

Example: Significance of Predictors (Model-Specific Individual)

  • In our example, for Disaster vs. Solid,
m1 %>% tidy(conf.int = TRUE, exponentiate = TRUE) %>% filter(y.level == "Disaster") %>% select(term, estimate, conf.low, conf.high, p.value)

Example: Significance of Predictors (Model-Specific Individual)

  • In our example, for Messy vs. Solid,
m1 %>% tidy(conf.int = TRUE, exponentiate = TRUE) %>% filter(y.level == "Messy") %>% select(term, estimate, conf.low, conf.high, p.value)

Example: Significance of Predictors (Model-Specific Individual)

  • In our example, for Iconic vs. Solid,
m1 %>% tidy(conf.int = TRUE, exponentiate = TRUE) %>% filter(y.level == "Iconic") %>% select(term, estimate, conf.low, conf.high, p.value)

Reporting Results

  • In the paper,
Example of multinomial logistic regression results

Reporting Results

  • In a supplement,
Example of multinomial logistic regression results

Wrap Up

  • In this lecture, we have introduced multinomial logistic regression.

    • Multinomial logistic regression is appropriate for nominal outcomes.
  • We again saw the logit link function.

    • We interpret coefficients as multiplicative effects on the odds of one outcome compared to the reference outcome.
  • In the next lecture, we will review data visualization for logistic regressions.