STA4173: Biostatistics
Spring 2025
We have learned simple linear regression, y = \beta_0 + \beta_1 x + \varepsilon
Today, we will expand to multiple regression, which allows us to include multiple predictors, y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + \varepsilon
The good news is that everything we learned for simple linear regression carries over to multiple regression! 😎
We use the lm() function to define the model, the summary() function to see the results, and the confint() function to construct confidence intervals for the coefficients.
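In R, the workflow looks like this (a minimal sketch using the cholesterol example introduced below; the object name fit is arbitrary):

# Fit the multiple regression model and examine the results:
fit <- lm(chol ~ age + fat, data = data)
summary(fit)    # coefficient table, R-squared, overall F-test
confint(fit)    # 95% confidence intervals for the coefficients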
We interpret coefficients much the same as in simple linear regression.
Intercept: when [all predictors = 0], the average [outcome] is [\hat{\beta}_0].
Slope: For a 1 [units of predictor i] increase in [predictor i], we expect [outcome] to [increase or decrease] by [|\hat{\beta}_i|] [units of outcome], after controlling for [all other predictors in the model].
A family doctor wishes to further examine the variables that affect their female patients’ total cholesterol.
They randomly select 14 female patients, record each patient's age, measure their total cholesterol, and ask the patients to estimate their average daily consumption of saturated fat.
Fitting the model in R gives the following output:
Call:
lm(formula = chol ~ age + fat, data = data)
Residuals:
Min 1Q Median 3Q Max
-19.874 -8.192 3.479 8.151 14.907
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 90.8415 15.9887 5.682 0.000142 ***
age 1.0142 0.2427 4.179 0.001540 **
fat 3.2443 0.6632 4.892 0.000478 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.42 on 11 degrees of freedom
Multiple R-squared: 0.8473, Adjusted R-squared: 0.8196
F-statistic: 30.53 on 2 and 11 DF, p-value: 3.239e-05
Let’s now interpret the slopes from the cholesterol example. \hat{\text{cholesterol}}_i = 90.842 + 1.014 \text{ age}_i + 3.244 \text{ fat}_i
For a 1 year increase in age, the cholesterol level is expected to increase by 1.014 mg/dL, after adjusting for the average daily consumption of saturated fat.
For a 1 gram increase in average daily consumption of saturated fat, the total cholesterol is expected to increase by 3.244 mg/dL, after adjusting for age.
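As an illustration, we can plug hypothetical values into the fitted equation (the age and fat values below are made up for the example; fit is the model object from the sketch above):

# Predicted total cholesterol for a hypothetical 50-year-old who consumes
# 20 g of saturated fat per day (illustrative values only):
predict(fit, newdata = data.frame(age = 50, fat = 20))
# by hand: 90.842 + 1.014*50 + 3.244*20 = 206.42 mg/dL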
We can test whether each individual slope differs from zero using a t-test.
Hypotheses
H_0: \beta_i = 0 vs. H_1: \beta_i \ne 0
Test Statistic
t_0 = \frac{\hat{\beta}_i}{\text{SE}(\hat{\beta}_i)} (reported in the summary() output)
p-Value
p = 2 P(t_{n-k-1} \ge |t_0|)
Rejection Region
Reject H_0 if p < \alpha.
Using the summary() output above, we first test the slope for age.
Hypotheses
H_0: \beta_{\text{age}} = 0 vs. H_1: \beta_{\text{age}} \ne 0
Test Statistic and p-Value
t_0 = 4.179
p = 0.002
Rejection Region
Reject H_0 if p < \alpha = 0.05; because 0.002 < 0.05, we reject.
Conclusion/Interpretation
Reject H_0.
There is sufficient evidence to suggest that there is a relationship between age and cholesterol, after adjusting for saturated fat consumption.
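As a quick check, the test statistic is just the estimate divided by its standard error from the coefficient table:

# Reproducing t_0 for age from the summary() output:
1.0142 / 0.2427   # = 4.179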
Next, we test the slope for fat.
Hypotheses
H_0: \beta_{\text{fat}} = 0 vs. H_1: \beta_{\text{fat}} \ne 0
Test Statistic and p-Value
t_0 = 4.892
p < 0.001
Rejection Region
Reject H_0 if p < \alpha = 0.05; because p < 0.001 < 0.05, we reject.
Conclusion/Interpretation
Reject H_0.
There is sufficient evidence to suggest that there is a relationship between fat and cholesterol, after adjusting for age.
We can also obtain 95% confidence intervals for the coefficients with confint():
2.5 % 97.5 %
(Intercept) 55.6507255 126.032306
age 0.4800054 1.548405
fat 1.7845942 4.703937
The 95% CI for \beta_{\text{age}} is (0.48, 1.55).
The 95% CI for \beta_{\text{fat}} is (1.78, 4.70).
How do I report regression models in the real world?
In our example,
| Predictor | Estimate (95% CI) | p-Value |
|-----------|-------------------|---------|
| Age       | 1.01 (0.48, 1.55) | 0.002   |
| Fat       | 3.24 (1.78, 4.70) | < 0.001 |
We can use an F-test to test for a significant regression line.
A significant regression line means that at least one of the slopes is non-zero.
This makes use of an ANOVA table; however, we will not concern ourselves with the computation. The results can be found at the bottom of the summary() output.
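If you want the statistic itself rather than the printed summary, one way is to pull it out of the summary object (a sketch, again using fit from above):

# Extracting the overall F-statistic and its degrees of freedom:
summary(fit)$fstatistic   # value = 30.53, numdf = 2, dendf = 11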
Hypotheses
H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 vs. H_1: at least one \beta_i \ne 0
Test Statistic
F_0 = \frac{\text{MS}_{\text{Reg}}}{\text{MS}_{\text{E}}} (found at the bottom of the summary() output)
p-Value
p = P(F_{k, \, n-k-1} \ge F_0)
Rejection Region
Reject H_0 if p < \alpha.
Is this a significant regression line?
Is at least one of the x_i a significant predictor?
Is at least one slope non-zero?
Hypotheses
H_0: \beta_{\text{age}} = \beta_{\text{fat}} = 0 vs. H_1: at least one slope is non-zero
Test Statistic and p-Value
F_0 = 30.53
p < 0.001
Rejection Region
Reject H_0 if p < \alpha = 0.05; because p < 0.001 < 0.05, we reject.
Conclusion/Interpretation
Reject H_0. There is sufficient evidence to suggest that at least one of age and fat is a significant predictor of cholesterol; the regression line is significant.
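As a check, the p-value can be recomputed from the F distribution:

# Recomputing the overall p-value for F_0 = 30.53 with 2 and 11 df:
pf(30.53, df1 = 2, df2 = 11, lower.tail = FALSE)   # 3.239e-05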
We can assess how well the regression model fits the data using R^2. R^2 = \frac{\text{SS}_{\text{Reg}}}{\text{SS}_{\text{Tot}}}
Thus, R^2 is the proportion of variation explained by the model (i.e., predictor set).
R^2 \in [0, 1]
R^2 \to 0 indicates that the model fits “poorly.”
R^2 \to 1 indicates that the model fits “well.”
R^2 = 1 indicates that all points fall on the response surface.
Recall that the error term in ANOVA is the "catch all" for any variation not explained by the predictors.
The \text{SS}_{\text{Tot}} is constant for the outcome of interest.
As we add predictors to the model, we are necessarily increasing \text{SS}_{\text{Reg}} (and decreasing \text{SS}_{\text{E}}), so R^2 never decreases.
We do not want to arbitrarily increase R^2, so we will use an adjusted version: R^2_{\text{adj}} = 1 - \frac{\text{MS}_{\text{E}}}{\text{SS}_{\text{Tot}}/\text{df}_{\text{Tot}}}
Both are reported by the summary() function.
R^2 is 0.847.
R^2_{\text{adj}} is 0.820 – 82.0% of the variability in cholesterol is explained by the model with age and fat.
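Both quantities can also be pulled out of the summary object directly (a sketch, using fit from above):

# Extracting R-squared and adjusted R-squared:
summary(fit)$r.squared       # 0.8473
summary(fit)$adj.r.squared   # 0.8196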
Definition: outliers are data values that are much larger or smaller than the rest of the values in the dataset.
We will look at the standardized residuals, e_{i_{\text{standardized}}} = \frac{e_i}{\sqrt{\text{MSE}(1-h_i)}}, where e_i is the i^{\text{th}} residual and h_i is the leverage of the i^{\text{th}} observation.
If |e_{i_{\text{standardized}}}| > 2.5 \ \to \ outlier.
If |e_{i_{\text{standardized}}}| > 3 \ \to \ extreme outlier.
We use the rstandard() function to find the standardized residuals, then count the outliers. In our example, there are no outliers.
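A minimal sketch of that check, using fit from above:

# Standardized residuals and outlier counts:
std_res <- rstandard(fit)
sum(abs(std_res) > 2.5)   # outliers (0 in our example)
sum(abs(std_res) > 3)     # extreme outliers (0 in our example)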
Collinearity/multicollinearity: correlation between two or more predictor variables that affects the estimation procedure.
We will use the variance inflation factor (VIF) to check for multicollinearity, \text{VIF}_j = \frac{1}{1-R^2_j},
where R^2_j is the R^2 from regressing the j^{\text{th}} predictor on the remaining predictors.
We say that multicollinearity is present if VIF > 10.
How do we deal with multicollinearity?
Easy answer: remove at least one predictor from the collinear set, then reassess VIF.
More complicated: how do we know which predictor should be the one removed?
We use the vif() function from the car package.
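A minimal sketch, assuming the car package is installed:

# Variance inflation factors for each predictor in the model:
library(car)
vif(fit)   # values > 10 indicate multicollinearity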
This is just scratching the surface of multiple regression.
Other statistics courses go deeper into regression topics:
Categorical predictors.
Interaction terms.
Other regression diagnostics.
How to handle non-continuous outcomes.