We have discussed quantifying the relationship between two continuous variables.
Recall that the correlation describes the strength and the direction of the relationship.
Pearson’s correlation: describes the linear relationship; assumes normality of both variables.
Spearman’s correlation: describes the monotone relationship; assumes both variables are at least ordinal.
Further, recall that correlation is unitless and bounded to [-1, 1].
Now we will discuss a different way of representing/quantifying the relationship.
Using simple linear regression, we will model y (the outcome) as a function of x (the predictor).
y = \beta_0 + \beta_1 x + \varepsilon
\beta_0 is the y-intercept.
\beta_1 is the slope describing the relationship between x and y.
\varepsilon (estimated by e) is the error term; remember, from ANOVA (😱):
\varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2)
y = \beta_0 + \beta_1 x + \varepsilon
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x, \quad e = y - \hat{y}
\hat{y} estimates y.
\hat{\beta}_0 estimates \beta_0.
\hat{\beta}_1 estimates \beta_1.
e estimates \varepsilon.
\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i - \frac{\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n}}{\sum_{i=1}^n x_i^2 - \frac{\left(\sum_{i=1}^n x_i\right)^2}{n}} = r \frac{s_y}{s_x}
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, where \bar{x} and \bar{y} are the sample means of x and y.
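As a sketch, the closed-form estimates above can be checked against base R's lm() on a small hypothetical data set (the x and y values below are invented for illustration):

```r
# Hypothetical toy data (not the gummy_data from the example)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
n <- length(x)

# Slope from the computational formula
b1 <- (sum(x * y) - sum(x) * sum(y) / n) /
      (sum(x^2) - sum(x)^2 / n)

# Equivalent form: b1 = r * s_y / s_x
b1_alt <- cor(x, y) * sd(y) / sd(x)

# Intercept: b0 = ybar - b1 * xbar
b0 <- mean(y) - b1 * mean(x)

# Base R's lm() reproduces both estimates
fit <- lm(y ~ x)
coef(fit)
```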
We will use the linear_regression() function from library(ssstats) to construct our regression model.
In the case of simple linear regression, we specify the outcome and a single predictor.
Pinkie Pie suspects that Gummy enjoys snacks more the longer he chews them.
Pinkie Pie records both the chew time (chew_time) and Gummy’s satisfaction level (satisfaction) of the last 300 snacks given to Gummy (gummy_data).
Let’s now construct the simple linear regression model. How should we update the code?
Our updated code,
\hat{\text{satisfaction}} = 20.92 + 0.90 \text{ chew\_time}
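The linear_regression() call itself comes from the course's ssstats package and is not reproduced here; as a hedged sketch, the same model can be fit with base R's lm(). The data frame below is a hypothetical stand-in for gummy_data:

```r
# Hypothetical stand-in for gummy_data (values invented for illustration)
gummy_demo <- data.frame(
  chew_time    = c(1, 2, 3, 4, 5),
  satisfaction = c(22, 23, 23, 25, 25)
)

# Simple linear regression: outcome ~ predictor
fit <- lm(satisfaction ~ chew_time, data = gummy_demo)
coef(fit)  # intercept (beta0-hat) and slope (beta1-hat)
```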
y = \beta_0 + \beta_1 x + \varepsilon
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + \varepsilon
We will use the linear_regression() function from library(ssstats) to construct our regression model.
In the case of multiple linear regression, we specify the outcome and all predictors of interest.
Pinkie Pie suspects that Gummy enjoys snacks more the longer he chews them. She also suspects that crunch factor plays a role in his satisfaction and now wants to incorporate this into the analysis.
Pinkie Pie records the chew time (chew_time), the crunch factor (snack_crunch) and Gummy’s satisfaction level (satisfaction) of the last 300 snacks given to Gummy (gummy_data).
Let’s now construct the corresponding multiple regression model. How should we update the code?
Our updated code,
\hat{\text{satisfaction}} = 23.02 + 0.92 \text{ chew\_time} - 0.04 \text{ snack\_crunch}
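A hedged base-R sketch of the same two-predictor fit (again with an invented stand-in for gummy_data; satisfaction is constructed to depend only on chew_time so the fitted slopes are easy to verify):

```r
# Hypothetical stand-in for gummy_data
gummy_demo <- data.frame(
  chew_time    = c(1, 2, 3, 4, 5, 6),
  snack_crunch = c(2, 1, 4, 3, 6, 5)
)
gummy_demo$satisfaction <- 20 + 1 * gummy_demo$chew_time

# Multiple linear regression: predictors joined with +
fit <- lm(satisfaction ~ chew_time + snack_crunch, data = gummy_demo)
coef(fit)  # snack_crunch's slope is (numerically) zero here
```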
\hat{\beta}_i \pm t_{\alpha/2, n-k-1} \text{ SE}_{\hat{\beta}_i}
Confidence intervals are reported by default in the linear_regression() function.
The 95% CI for \beta_{\text{time chewed}} is (0.81, 1.02).
The 95% CI for \beta_{\text{crunch factor}} is (-0.22, 0.14).
The 99% CI for \beta_{\text{time chewed}} is (0.78, 1.06).
The 99% CI for \beta_{\text{crunch factor}} is (-0.28, 0.20).
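A hedged sketch of the CI formula, checked against base R's confint() on hypothetical data:

```r
# Hypothetical data; one predictor, so k = 1
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 4, 5, 4, 6, 7)
fit <- lm(y ~ x)

n <- length(x)
k <- 1
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]

# beta1-hat +/- t_{alpha/2, n-k-1} * SE
tcrit <- qt(0.975, df = n - k - 1)
ci <- c(est - tcrit * se, est + tcrit * se)
ci  # agrees with confint(fit)["x", ]
```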
Hypotheses: H_0: \beta_i = 0 vs. H_1: \beta_i \neq 0
Test Statistic: t_0 = \frac{\hat{\beta}_i}{\text{SE}_{\hat{\beta}_i}}
p-Value: p = 2P(t_{n-k-1} \geq |t_0|)
This is default output in our linear_regression() function.
How do I report regression models in the real world?
In our example,
| Predictor | Estimate (95% CI) | p-Value |
|---|---|---|
| Time Chewed | 0.92 (0.81, 1.02) | < 0.001 |
| Crunch Factor | -0.04 (-0.22, 0.14) | 0.663 |
We can use the likelihood ratio test to test for a “significant regression” line.
Hypotheses: H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 vs. H_1: at least one \beta_i \neq 0
Test Statistic and p-Value: the likelihood ratio statistic is compared against a \chi^2_k distribution to obtain the p-value.
We will use the significant_line() function from library(ssstats) to test for a significant model.
Let’s now determine if we have a significant regression line. How should we update the following code?
Let’s now determine if we have a significant regression line. Our updated code,
Likelihood Ratio Test for Significant Regression Line:
Null: H₀: β₁ = β₂ = ... = βₖ = 0
Alternative: H₁: At least one βᵢ ≠ 0
Test statistic: χ²(2) = 57158.677
p-value: p < 0.001
Conclusion: Reject the null hypothesis (p < 0.001 < α = 0.05).
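The significant_line() helper is specific to ssstats; a hedged sketch of the underlying likelihood ratio test compares the full model to an intercept-only model on simulated stand-in data:

```r
# Simulated stand-in data (not gummy_data); y depends strongly on x1
set.seed(1)
d <- data.frame(x1 = 1:30, x2 = rep(1:6, 5))
d$y <- 5 + 2 * d$x1 + rnorm(30)

full <- lm(y ~ x1 + x2, data = d)
null <- lm(y ~ 1, data = d)   # intercept-only: all slopes set to zero

# LRT statistic: 2 * (logLik_full - logLik_null), chi-square with k df
lr <- as.numeric(2 * (logLik(full) - logLik(null)))
p  <- pchisq(lr, df = 2, lower.tail = FALSE)
```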
R^2 = \frac{\text{SS}_{\text{Reg}}}{\text{SS}_{\text{Tot}}}
Thus, R^2 is the proportion of variation explained by the model (i.e., predictor set).
R^2 \in [0, 1]
R^2 \to 0 indicates that the model fits “poorly.”
R^2 \to 1 indicates that the model fits “well.”
R^2 = 1 indicates that all points fall on the response surface.
Recall that the error term in ANOVA is the “catch all” for variability not explained by the model.
The SSTot is constant for the outcome of interest.
As we add predictors to the model, SSReg can only stay the same or increase, so R^2 never decreases, even for useless predictors.
We do not want to arbitrarily increase R^2, so we will use an adjusted version:
R^2_{\text{adj}} = 1 - \frac{\text{MS}_{\text{E}}}{\text{SS}_{\text{Tot}}/\text{df}_{\text{Tot}}}
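A hedged sketch computing R^2 and adjusted R^2 from the sums of squares, checked against summary.lm() on hypothetical data:

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 4, 5, 4, 6, 7)
fit <- lm(y ~ x)

ss_tot <- sum((y - mean(y))^2)   # SS_Tot
ss_err <- sum(resid(fit)^2)      # SS_E
r2 <- 1 - ss_err / ss_tot        # = SS_Reg / SS_Tot

# Adjusted R^2: penalize by degrees of freedom
n <- length(x); k <- 1
r2_adj <- 1 - (ss_err / (n - k - 1)) / (ss_tot / (n - 1))
```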
We will use the r_squared() function from library(ssstats) to find the R^2 value.
Let’s now find the R^2 value for the model we constructed earlier. How should we update the following code?
Our updated code,
Running the code,
\hat{\text{satisfaction}} = 23.02 + 0.92 \text{ chew\_time} - 0.04 \text{ snack\_crunch}
Note that we will plug in a fixed value for every predictor except one; the remaining predictor varies and is plotted on the x-axis.
Let’s say we have 3 predictors: x1, x2, and x3.
If we want x1 on the x-axis, then
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 \, \text{median}(x_2) + \hat{\beta}_3 \, \text{median}(x_3)
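A hedged sketch of building that plotted line variable: fit a three-predictor model on made-up data, then compute fitted values with x1 varying and x2, x3 held at their medians (all names below are invented):

```r
# Made-up data with three predictors
set.seed(7)
d <- data.frame(x1 = 1:20, x2 = runif(20), x3 = runif(20))
d$y <- 1 + 2 * d$x1 + d$x2 - d$x3 + rnorm(20)

fit <- lm(y ~ x1 + x2 + x3, data = d)
b <- coef(fit)

# Line to plot against x1: other predictors fixed at their medians
d$line_x1 <- b[1] + b[2] * d$x1 +
             b[3] * median(d$x2) + b[4] * median(d$x3)
```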
To visualize our model, we will overlay the regression line on a scatter plot of the data.
Here, we take the scatterplot code from last week and add a geom_line() for the regression line.
dataset_name %>% ggplot(aes(x = x_variable, y = y_variable)) + # specify x and y
geom_point(size = 2.5, color = "gray50") + # plot points
geom_line(aes(y = line_variable), size = 1, color = "black") + # plot line
labs(x = "x-axis label", # edit x-axis label
y = "y-axis label") + # edit y-axis label
theme_bw() # change theme of graph

For our example, we will create two visualizations: one with chew_time on the x-axis and one with snack_crunch on the x-axis.
The base code for the visualization is:
gummy_data %>% ggplot(aes(x = x_variable, y = satisfaction)) + # specify x and y
geom_point(size = 2.5, color = "gray50") + # plot points
geom_line(aes(y = line_variable), size = 1, color = "black") + # plot line
labs(x = "x-axis label", # edit x-axis label
y = "Gummy's Satisfaction") + # edit y-axis label
theme_bw() # change theme of graph

gummy_data %>% ggplot(aes(x = chew_time, y = satisfaction)) + # specify x and y
geom_point(size = 2.5, color = "gray50") + # plot points
geom_line(aes(y = chew_on_x), size = 1, color = "black") + # plot line
labs(x = "Time Spent Chewing (min)", # edit x-axis label
y = "Gummy's Satisfaction") + # edit y-axis label
theme_bw() # change theme of graph

gummy_data %>% ggplot(aes(x = snack_crunch, y = satisfaction)) + # specify x and y
geom_point(size = 2.5, color = "gray50") + # plot points
geom_line(aes(y = snack_on_x), size = 1, color = "black") + # plot line
labs(x = "Crunch Factor of Snack", # edit x-axis label
y = "Gummy's Satisfaction") + # edit y-axis label
theme_bw() # change theme of graph

Whoa! It looks like there is not much of a relationship between crunch factor and satisfaction!
Remember that regression looks at the relationship between y and each x_i while adjusting for all other predictors in the model.
The variability in satisfaction explained by snack_crunch is being adjusted for chew_time.
This is why we saw a non-significant slope for snack_crunch earlier.
This means that chew_time explains the variance in satisfaction, not snack_crunch.
With r_s = 0.82, we see that snack_crunch and chew_time are highly correlated.
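A hedged sketch of this masking effect on simulated data: a second predictor that is highly correlated with the first looks predictive on its own, but its adjusted slope is near zero once both are in the model (names and values invented, not the actual gummy_data):

```r
set.seed(42)
chew   <- runif(300, 0, 10)
crunch <- chew + rnorm(300)             # strongly correlated with chew
satis  <- 20 + 0.9 * chew + rnorm(300)  # outcome depends only on chew

cor(chew, crunch, method = "spearman")  # high, like the r_s = 0.82 in the text

coef(lm(satis ~ crunch))                # alone, crunch looks predictive
coef(lm(satis ~ chew + crunch))         # adjusted for chew, its slope shrinks toward 0
```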
This is just scratching the surface for multiple regression.
Other statistics courses go deeper into regression topics.
Categorical predictors.
Interaction terms.
Regression diagnostics.
How to handle non-continuous outcomes.
Advanced data visualization for models.
STA4173 - Biostatistics - Fall 2025