Multiple Regression

STA4173: Biostatistics
Spring 2025

Introduction

  • We have learned simple linear regression, y = \beta_0 + \beta_1 x + \varepsilon

  • Today, we will expand to multiple regression, which allows us to include multiple predictors, y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + \varepsilon

    • Simple linear regression is just a special case of multiple regression, where k=1.
  • The good news is that everything we learned for simple linear regression carries over to multiple regression! 😎

R Syntax

  • We again use the lm() function to define the model and summary() to see the results.
m <- lm(outcome ~ predictor_1 + predictor_2 + ... + predictor_k, 
        data = dataset_name)
summary(m)
  • Additionally, we will find the confidence intervals for the \beta_i using confint().
confint(m) # for 95% CI
confint(m, level = conf_level) # for other levels

Interpretations

  • We interpret coefficients much the same as in simple linear regression.

    • Intercept: when [all predictors = 0], the average [outcome] is [\hat{\beta}_0].

    • Slope: For a 1 [units of predictor i] increase in [predictor i], we expect [outcome] to [increase or decrease] by [|\hat{\beta}_i|] [units of outcome], after controlling for [all other predictors in the model].

      • If \hat{\beta}_i > 0, there is an increase.
      • If \hat{\beta}_i < 0, there is a decrease.

Example

  • A family doctor wishes to further examine the variables that affect their female patients’ total cholesterol.

  • They randomly select 14 female patients, record each patient’s age, measure their total cholesterol, and ask them to report their average daily consumption of saturated fat.

  • The data is as follows,

library(tidyverse)
data <- tibble(age = c(25, 25, 28, 32, 32, 32, 38, 42, 48, 51, 51, 58, 62, 65), 
               chol = c(180, 195, 186, 180, 210, 197, 239, 183, 204, 221, 243, 208, 228, 269), 
               fat = c(19, 28, 19, 16, 24, 20, 31, 20, 26, 24, 32, 21, 21, 30))
head(data, n = 2)

Example

  • Model total cholesterol (y) as a function of age (x_1) and daily consumption of saturated fat (x_2).
m <- lm(chol ~ age + fat, data = data)
summary(m)

Call:
lm(formula = chol ~ age + fat, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.874  -8.192   3.479   8.151  14.907 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  90.8415    15.9887   5.682 0.000142 ***
age           1.0142     0.2427   4.179 0.001540 ** 
fat           3.2443     0.6632   4.892 0.000478 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.42 on 11 degrees of freedom
Multiple R-squared:  0.8473,    Adjusted R-squared:  0.8196 
F-statistic: 30.53 on 2 and 11 DF,  p-value: 3.239e-05
  • The resulting model is \hat{\text{cholesterol}}_i = 90.842 + 1.014 \text{ age}_i + 3.244 \text{ fat}_i
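  • As a quick check of the fitted equation, we can plug in hypothetical values, say a 40-year-old patient who averages 25 grams of saturated fat per day (illustrative numbers, not observations from the dataset):
predict(m, newdata = tibble(age = 40, fat = 25)) # approx. 90.842 + 1.014*40 + 3.244*25 = 212.5 mg/dL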

Example

  • Let’s now interpret the slopes from the cholesterol example. \hat{\text{cholesterol}}_i = 90.842 + 1.014 \text{ age}_i + 3.244 \text{ fat}_i

    • For a 1 year increase in age, the cholesterol level is expected to increase by 1.014 mg/dL, after adjusting for the average daily consumption of saturated fat.

    • For a 1 gram increase in average daily consumption of saturated fat, the total cholesterol is expected to increase by 3.244 mg/dL, after adjusting for age.

Significance of \beta_i

  • Hypotheses

    • H_0: \ \beta_i = 0
    • H_1: \ \beta_i \ne 0
  • Test Statistic

    • t_0 = \frac{\hat{\beta}_i}{\text{SE of }\hat{\beta}_i}
  • p-Value

    • p = 2 \times P[t_{n-k-1} \ge |t_0|]
  • Rejection Region

    • Reject H_0 if p<\alpha.
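  • As a sketch of how these pieces fit together in R, we can recompute the test by hand for one coefficient of the cholesterol model (here age; the column names are those returned by coef(summary())):
b  <- coef(summary(m))                        # estimates, SEs, t values, p-values
t0 <- b["age", "Estimate"] / b["age", "Std. Error"]
p  <- 2 * pt(abs(t0), df = nrow(data) - 2 - 1, lower.tail = FALSE)
c(t0 = t0, p = p)                             # should match the age row of summary(m)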

Example

  • Which, if any, are significant predictors of cholesterol?
summary(m)

Call:
lm(formula = chol ~ age + fat, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.874  -8.192   3.479   8.151  14.907 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  90.8415    15.9887   5.682 0.000142 ***
age           1.0142     0.2427   4.179 0.001540 ** 
fat           3.2443     0.6632   4.892 0.000478 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.42 on 11 degrees of freedom
Multiple R-squared:  0.8473,    Adjusted R-squared:  0.8196 
F-statistic: 30.53 on 2 and 11 DF,  p-value: 3.239e-05

Example

  • Hypotheses

    • H_0: \ \beta_{\text{age}} = 0
    • H_1: \ \beta_{\text{age}} \ne 0
  • Test Statistic and p-Value

    • t_0 = 4.179

    • p = 0.002

  • Rejection Region

    • Reject H_0 if p<\alpha; \alpha=0.05
  • Conclusion/Interpretation

    • Reject H_0.

    • There is sufficient evidence to suggest that there is a relationship between age and cholesterol.

Example

  • Hypotheses

    • H_0: \ \beta_{\text{fat}} = 0
    • H_1: \ \beta_{\text{fat}} \ne 0
  • Test Statistic and p-Value

    • t_0 = 4.892

    • p < 0.001

  • Rejection Region

    • Reject H_0 if p<\alpha; \alpha=0.05
  • Conclusion/Interpretation

    • Reject H_0.

    • There is sufficient evidence to suggest that there is a relationship between fat and cholesterol.

Example

  • Let’s find the 95% confidence intervals for the regression coefficients.
confint(m)
                 2.5 %     97.5 %
(Intercept) 55.6507255 126.032306
age          0.4800054   1.548405
fat          1.7845942   4.703937
  • The 95% CI for \beta_{\text{age}} is (0.48, 1.55).

  • The 95% CI for \beta_{\text{fat}} is (1.78, 4.70).

Reporting Results

  • How do I report regression models in the real world?

    • I always give a table of the \hat{\beta}_i, their 95% CIs, and the p-values.
  • In our example,

Predictor   Estimate (95% CI)    p-Value
Age         1.01 (0.48, 1.55)    0.002
Fat         3.24 (1.78, 4.70)    < 0.001
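  • A minimal sketch of how this table could be assembled directly from the fitted model (the formatting choices here are my own, not prescribed):
est <- coef(m)[-1]                        # drop the intercept
ci  <- confint(m)[-1, , drop = FALSE]     # matching 95% CIs
p   <- coef(summary(m))[-1, "Pr(>|t|)"]   # matching p-values
tibble(Predictor = names(est),
       `Estimate (95% CI)` = sprintf("%.2f (%.2f, %.2f)", est, ci[, 1], ci[, 2]),
       `p-Value` = ifelse(p < 0.001, "< 0.001", sprintf("%.3f", p)))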

Significant Regression Line

  • We can use an F-test to test for a significant regression line.

    • A significant regression line means that at least one of the slopes is non-zero.

    • This makes use of an ANOVA table, however, we will not concern ourselves with the computation.

  • The results can be found at the bottom of the summary() output.

Significant Regression Line

  • Hypotheses

    • H_0: \ \beta_1 = ... = \beta_k = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic

    • F_0 (pulled from the summary() output)
  • p-Value

    • p = \text{P}[F_{k, n-k-1} \ge F_0]
  • Rejection Region

    • Reject H_0 if p < \alpha.

Example

  • Is this a significant regression line?

    • Is at least one of the x_i a significant predictor?

    • Is at least one slope non-zero?

Example

  • Hypotheses

    • H_0: \ \beta_1 = \beta_2 = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic and p-Value

    • F_0 = 30.53

    • p < 0.001

  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha=0.05.
  • Conclusion/Interpretation

    • Reject H_0.
    • There is sufficient evidence to suggest that at least one slope is non-zero.
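  • We can also verify this p-value directly from the F distribution, using F_0 = 30.53 on 2 and 11 degrees of freedom from the summary() output:
pf(30.53, df1 = 2, df2 = 11, lower.tail = FALSE) # P[F_{2, 11} >= 30.53], approx. 3.2e-05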

Line Fit

  • We can assess how well the regression model fits the data using R^2. R^2 = \frac{\text{SS}_{\text{Reg}}}{\text{SS}_{\text{Tot}}}

  • Thus, R^2 is the proportion of variation explained by the model (i.e., predictor set).

  • R^2 \in [0, 1]

    • R^2 \to 0 indicates that the model fits “poorly.”

    • R^2 \to 1 indicates that the model fits “well.”

    • R^2 = 1 indicates that all points fall on the response surface.

Line Fit

  • Recall that the error term in ANOVA is the “catch all” …

    • The \text{SS}_{\text{Tot}} is constant for the outcome of interest.

    • As we add predictors to the model, we necessarily increase \text{SS}_{\text{Reg}}.

      • The variance is moving from \text{SS}_{\text{E}} to \text{SS}_{\text{Reg}}.
  • We do not want to arbitrarily increase R^2, so we will use an adjusted version: R^2_{\text{adj}} = 1 - \frac{\text{MS}_{\text{E}}}{\text{SS}_{\text{Tot}}/\text{df}_{\text{Tot}}}
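  • As a sketch of where these quantities come from, both R^2 and R^2_{\text{adj}} can be computed directly from the fitted model's sums of squares (recall \text{df}_{\text{Tot}} = n - 1):
sse   <- sum(resid(m)^2)                         # SS_E
sstot <- sum((data$chol - mean(data$chol))^2)    # SS_Tot
c(r2    = 1 - sse / sstot,                       # R^2 = SS_Reg / SS_Tot
  r2adj = 1 - (sse / df.residual(m)) / (sstot / (nrow(data) - 1)))
# both should match the values reported by summary(m)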

Example

  • The R^2 and R^2_{\text{adj}} both come out of the summary() function.
  • R^2 is 0.847.

  • R^2_{\text{adj}} is 0.820; 82.0% of the variability in cholesterol is explained by the model with age and fat.

Outliers

  • Definition: data values that are much larger or smaller than the rest of the values in the dataset.

  • We will look at the standardized residuals, e_{i_{\text{standardized}}} = \frac{e_i}{\sqrt{\text{MSE}(1-h_i)}}, where

    • e_i = y_i - \hat{y}_i is the residual of the ith observation
    • h_i is the leverage of the ith observation
  • If |e_{i_{\text{standardized}}}| > 2.5 \ \to \ outlier.

  • If |e_{i_{\text{standardized}}}| > 3 \ \to \ extreme outlier.

Outliers

  • We will use the rstandard() function to find the residuals, then we will count the outliers,
dataset_name %>% count(abs(rstandard(m)) > 2.5)
  • In our example data,
data %>% count(abs(rstandard(m)) > 2.5)
  • There are no outliers.

    • (This is Happy Textbook Land!)
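  • Had any observations been flagged, wrapping the same logical check in which() would list their row numbers:
which(abs(rstandard(m)) > 2.5) # empty here, consistent with the count above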

Multicollinearity

  • Collinearity/multicollinearity: correlation among two or more predictor variables that affects the estimation procedure.

  • We will use the variance inflation factor (VIF) to check for multicollinearity. \text{VIF}_j = \frac{1}{1-R^2_j},

  • where

    • j = the predictor of interest and j \in \{1, 2, ..., k \},
    • R^2_j results from regressing x_j on the remaining (k-1) predictors.
  • We say that multicollinearity is present if VIF > 10.

Multicollinearity

  • How do we deal with multicollinearity?

    • Easy answer: remove at least one predictor from the collinear set, then reassess VIF.

    • More complicated: how do we know which predictor should be the one removed?

      • (We will likely need to consult with the research team.)

Multicollinearity

  • We will use the vif() function from the car package.
car::vif(m)
  • There will be a value for each predictor in the model.
  • In our cholesterol model,
car::vif(m)
     age      fat 
1.117404 1.117404 
  • No multicollinearity is present.
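  • To see where these values come from, we can compute the VIF for one predictor by hand: regress age on the remaining predictor and apply the formula (a sketch for our two-predictor model):
r2_age <- summary(lm(age ~ fat, data = data))$r.squared # R^2_j for age
1 / (1 - r2_age)                                        # should match car::vif(m)["age"]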

Conclusions

  • This is just scratching the surface for multiple regression.

  • Other statistics courses go deeper into regression topics.

    • Categorical predictors.

    • Interaction terms.

    • Other regression diagnostics.

    • How to handle non-continuous outcomes.