July 22, 2025
Tuesday
Before today, we discussed methods for comparing continuous outcomes across two or more groups.
We now will begin exploring the relationships between two continuous variables.
We will first focus on data visualization and the corresponding correlation.
Then we will quantify the relationship using regression analysis.
Scatterplot or scatter diagram:
Each individual in the dataset is represented by a point on the scatterplot.
The explanatory variable is on the x-axis and the response variable is on the y-axis.
It is super important for us to plot the data!
Positive relationship: As x increases, y increases.
Negative relationship: As x increases, y decreases.
We will use the ggplot() function from library(tidyverse) (or library(ggplot2)) to construct the scatterplot.
In Ponyville, Pinkie Pie is curious about how Gummy’s snack satisfaction (0 to 100) relates to the duration of chew time (in seconds) he spends on crunchy treats. She suspects that Gummy enjoys snacks more the longer he chews them.
Pinkie Pie records both the chew time (chew_time) and Gummy’s satisfaction level (satisfaction) of the last 300 snacks given to Gummy (gummy_data).
How should we update the code for a scatterplot?
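A minimal sketch of one way to build this scatterplot with ggplot2. The lecture's gummy_data is not reproduced here, so the data frame below is simulated and its values are purely illustrative assumptions; only the column names chew_time and satisfaction come from the example.

```r
library(ggplot2)

# Simulated stand-in for gummy_data (hypothetical values, not the lecture data)
set.seed(1)
gummy_data <- data.frame(chew_time = runif(300, min = 5, max = 60))
gummy_data$satisfaction <- 20 + 0.9 * gummy_data$chew_time + rnorm(300, sd = 8)

# Explanatory variable on the x-axis, response variable on the y-axis
p <- ggplot(gummy_data, aes(x = chew_time, y = satisfaction)) +
  geom_point() +
  labs(x = "Chew time (seconds)", y = "Satisfaction (0 to 100)") +
  theme_bw()

p
```

Swapping in angel_data with carrot_weight on the x-axis and happiness on the y-axis follows the same pattern.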
Fluttershy wants to determine how the amount of carrots Angel Bunny is given (grams) affects his happiness level (0 to 100). She believes that Angel Bunny tends to be happiest with a moderate amount of carrots.
Fluttershy records both the weight of the carrot (carrot_weight) and Angel Bunny’s happiness level (happiness) of the last 200 snacks given to Angel Bunny (angel_data).
How should we update the code for a scatterplot?
Creating the scatterplot allows us to visualize a potential relationship.
Now, let’s discuss quantifying that relationship.
Initial quantification: correlation.
Further quantification: regression.
Correlation: A unitless measure of the strength and direction of the linear relationship between two quantitative variables.
\rho represents the population correlation coefficient.
r represents the sample correlation coefficient.
Correlation is bounded to [-1, 1].
r=-1 represents perfect negative correlation.
r=1 represents perfect positive correlation.
r=0 represents no linear correlation.
r = \frac{\sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s_x} \right)\left( \frac{y_i - \bar{y}}{s_y} \right)}{n-1}
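The formula can be checked by hand on a small sample. The sketch below uses made-up numbers (an assumption for illustration only) and compares the result with base R's cor(), which computes Pearson's correlation by default.

```r
# Small made-up sample (illustrative values only)
x <- c(12, 20, 31, 38, 45, 52)
y <- c(30, 41, 47, 58, 60, 71)
n <- length(x)

# Standardize each variable using its sample mean and sample standard deviation
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Sum the products of the z-scores and divide by n - 1
r_by_hand <- sum(z_x * z_y) / (n - 1)

# Base R's built-in Pearson correlation, for comparison
r_builtin <- cor(x, y)

all.equal(r_by_hand, r_builtin)
```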
We will use the correlation() function from library(ssstats) to examine correlation.
For a single pairwise correlation,
How should we update the code for the correlation between satisfaction level and the chew time?
Our updated code,
How should we update the following code to get the correlation matrix?
Our updated code,
We can determine if the correlation is significantly different from 0 (i.e., a relationship exists).
Hypotheses:
H_0: \rho = 0
H_1: \rho \ne 0
Test Statistic and p-Value
This is default output in our correlation() function.
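For readers without library(ssstats), a test of zero correlation can be sketched with base R's cor.test(). This is a stand-in under stated assumptions, not the course function: the data are simulated, and the t statistic shown is the one cor.test() reports for Pearson's correlation, which may or may not match what correlation() prints.

```r
# Simulated stand-in data (hypothetical values, not the lecture data)
set.seed(2)
chew_time <- runif(300, min = 5, max = 60)
satisfaction <- 20 + 0.9 * chew_time + rnorm(300, sd = 8)

# Pearson correlation test of H0: rho = 0 against H1: rho != 0
pearson_test <- cor.test(chew_time, satisfaction, method = "pearson")

pearson_test$estimate   # sample correlation r
pearson_test$statistic  # t statistic with n - 2 degrees of freedom
pearson_test$p.value    # two-sided p-value

# The t statistic can be reproduced from r and n
r <- unname(pearson_test$estimate)
n <- length(chew_time)
t_by_hand <- r * sqrt(n - 2) / sqrt(1 - r^2)
```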
Looking at correlation results for this specific relationship,
The assumption for Pearson’s correlation is that both x and y are normally distributed.
We will use the correlation_qq() function from library(ssstats) to examine the normality of x and y.
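As a rough base R alternative, normal Q-Q plots for each variable can be drawn with qqnorm() and qqline(). This sketch is an assumption about what such a check looks like, not the output of correlation_qq(), and the data are simulated for illustration.

```r
# Simulated stand-in data (hypothetical values, not the lecture data)
set.seed(4)
chew_time <- rnorm(300, mean = 30, sd = 8)
satisfaction <- 20 + 0.9 * chew_time + rnorm(300, sd = 8)

# One Q-Q plot per variable, side by side
par(mfrow = c(1, 2))
qq_x <- qqnorm(chew_time, main = "Chew time")
qqline(chew_time)
qq_y <- qqnorm(satisfaction, main = "Satisfaction")
qqline(satisfaction)
```

Points that track the reference line closely are consistent with normality; systematic curvature or heavy tails suggest the assumption is questionable.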
Let’s now check the assumption that both satisfaction level and the chew time are normally distributed. How should we change the following code?
Our updated code,
What do we do when we do not meet the normality assumption?
Spearman’s Correlation: A unitless measure of the strength and direction of the monotone relationship between two variables.
Spearman’s correlation is interpreted the same as Pearson’s correlation.
To find Spearman’s correlation, the following algorithm is followed:
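A minimal sketch of the usual procedure, assuming the standard definition of Spearman's correlation: rank each variable separately, then compute Pearson's correlation on the ranks. The numbers are made up for illustration, and the comparison uses base R's cor() with method = "spearman".

```r
# Small made-up sample (illustrative values only)
x <- c(12, 20, 31, 38, 45, 52, 58)
y <- c(30, 41, 39, 58, 60, 95, 71)

# Step 1: rank each variable separately (tied values would share an average rank)
rank_x <- rank(x)
rank_y <- rank(y)

# Step 2: compute Pearson's correlation on the ranks
r_s_by_hand <- cor(rank_x, rank_y)

# Base R's built-in Spearman correlation, for comparison
r_s_builtin <- cor(x, y, method = "spearman")

all.equal(r_s_by_hand, r_s_builtin)
```

Because only the ordering of the values matters, a single extreme value such as 95 has less pull on Spearman's correlation than on Pearson's.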
We will use the correlation() function from library(ssstats) to examine correlation.
For a single pairwise correlation,
How should we update the code for the Spearman correlation between satisfaction level and the chew time?
Our updated code,
We can determine if the correlation is significantly different from 0 (i.e., a relationship exists).
Hypotheses:
H_0: \rho_S = 0
H_1: \rho_S \ne 0
(Here \rho_S denotes the population Spearman correlation.)
Test Statistic and p-Value
This is default output in our correlation() function.
Looking at Spearman’s correlation results for this specific relationship,
We have discussed quantifying the relationship between two continuous variables.
Recall that the correlation describes the strength and the direction of the relationship.
Pearson’s correlation: describes the linear relationship; assumes normality of both variables.
Spearman’s correlation: describes the monotone relationship; assumes both variables are at least ordinal.
Further, recall that correlation is unitless and bounded to [-1, 1].
Now we will discuss a different way of representing/quantifying the relationship.
Using simple linear regression, we will model y (the outcome) as a function of x (the predictor).
y = \beta_0 + \beta_1 x + \varepsilon
\beta_0 is the y-intercept.
\beta_1 is the slope describing the relationship between x and y.
\varepsilon (estimated by e) is the error term; remember, from ANOVA (😱):
\varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2)
y = \beta_0 + \beta_1 x + \varepsilon
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x, with residual e = y - \hat{y}
\hat{y} estimates y.
\hat{\beta}_0 estimates \beta_0.
\hat{\beta}_1 estimates \beta_1.
e estimates \varepsilon.
\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i - \frac{\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n}}{\sum_{i=1}^n x_i^2 - \frac{\left(\sum_{i=1}^n x_i\right)^2}{n}} = r \frac{s_y}{s_x}
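\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} where \bar{x} and \bar{y} are the sample means, s_x and s_y are the sample standard deviations, and r is the sample correlation.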
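Both forms of the slope and the intercept formula can be verified numerically. The sketch below uses a small made-up sample (an assumption for illustration) and compares the hand calculations with base R's lm().

```r
# Small made-up sample (illustrative values only)
x <- c(12, 20, 31, 38, 45, 52)
y <- c(30, 41, 47, 58, 60, 71)
n <- length(x)

# Slope from the computational (sums) formula
b1_sums <- (sum(x * y) - sum(x) * sum(y) / n) / (sum(x^2) - sum(x)^2 / n)

# Slope from the correlation form, r * s_y / s_x
b1_corr <- cor(x, y) * sd(y) / sd(x)

# Intercept from the sample means and the slope
b0 <- mean(y) - b1_sums * mean(x)

# Least squares fit from base R, for comparison
fit <- lm(y ~ x)
coef(fit)
```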
We will use the linear_regression() function from library(ssstats) to construct our regression model.
In the case of simple linear regression,
Let’s now construct the simple linear regression model. How should we update the code?
Our updated code,
\hat{\text{satisfaction}} = 20.92 + 0.90 \text{ chew\_time}
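To see what the fitted equation says, it can be evaluated at a few chew times. The helper name predict_satisfaction is hypothetical and introduced only for this sketch; the two coefficients are the ones reported in the equation.

```r
# Coefficients as reported in the fitted equation
b0 <- 20.92  # intercept
b1 <- 0.90   # slope for chew_time

# Hypothetical helper: predicted satisfaction for a given chew time (seconds)
predict_satisfaction <- function(chew_time) {
  b0 + b1 * chew_time
}

# Predicted satisfaction for a 10-second chew: 20.92 + 0.90 * 10
pred_10 <- predict_satisfaction(10)

# One additional second of chewing changes the prediction by the slope
slope_effect <- predict_satisfaction(11) - predict_satisfaction(10)
```

Here pred_10 is 29.92 and slope_effect is 0.90: each additional second of chew time is associated with a 0.90-point increase in predicted satisfaction.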
y = \beta_0 + \beta_1 x + \varepsilon
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + \varepsilon
We will use the linear_regression() function from library(ssstats) to construct our regression model.
In the case of multiple linear regression,
Pinkie Pie suspects that Gummy enjoys snacks more the longer he chews them. She also suspects that crunch factor plays a role in his satisfaction and now wants to incorporate this into the analysis.
Pinkie Pie records the chew time (chew_time), the crunch factor (snack_crunch) and Gummy’s satisfaction level (satisfaction) of the last 300 snacks given to Gummy (gummy_data).
Let’s now construct the corresponding multiple regression model. How should we update the code?
Our updated code,
\hat{\text{satisfaction}} = 23.02 + 0.92 \text{ chew\_time} - 0.04 \text{ snack\_crunch}
\hat{\beta}_i \pm t_{\alpha/2, n-k-1} \text{ SE}_{\hat{\beta}_i}
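The interval formula can be sketched in base R with lm() and confint(). This is an illustration under assumptions: the data are simulated (so the numbers will not match the lecture's), and lm() stands in for linear_regression(), whose exact arguments are not shown here.

```r
# Simulated stand-in for gummy_data (hypothetical values, not the lecture data)
set.seed(3)
gummy_data <- data.frame(
  chew_time    = runif(300, min = 5, max = 60),
  snack_crunch = runif(300, min = 0, max = 10)
)
gummy_data$satisfaction <- 23 + 0.9 * gummy_data$chew_time + rnorm(300, sd = 8)

fit <- lm(satisfaction ~ chew_time + snack_crunch, data = gummy_data)

# Pieces of the formula: estimate, standard error, and t critical value
est <- coef(summary(fit))[, "Estimate"]
se <- coef(summary(fit))[, "Std. Error"]
t_crit <- qt(1 - 0.05 / 2, df = df.residual(fit))  # df = n - k - 1 = 297

# Estimate plus or minus the critical value times the standard error
ci_by_hand <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

# Base R's built-in intervals, for comparison
ci_builtin <- confint(fit, level = 0.95)
```

Changing level (and the matching quantile in qt()) gives intervals at other confidence levels; a higher level produces a wider interval.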
This is default output in our linear_regression() function.
The 95% CI for \beta_{\text{time chewed}} is (0.81, 1.02).
The 95% CI for \beta_{\text{crunch factor}} is (-0.22, 0.14).
The 99% CI for \beta_{\text{time chewed}} is (0.78, 1.06).
The 99% CI for \beta_{\text{crunch factor}} is (-0.28, 0.20).
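Hypotheses
H_0: \beta_i = 0
H_1: \beta_i \ne 0
Test Statistic
t_0 = \frac{\hat{\beta}_i}{\text{SE}_{\hat{\beta}_i}}
p-Value
p = 2 \times P\left[t_{n-k-1} \ge |t_0|\right]
This is default output in our linear_regression() function.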
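A sketch of the per-coefficient t test in base R, assuming the usual test of a zero slope (estimate divided by its standard error, compared with a t distribution on n - k - 1 degrees of freedom). The data are simulated and lm() stands in for linear_regression().

```r
# Simulated stand-in for gummy_data (hypothetical values, not the lecture data)
set.seed(3)
gummy_data <- data.frame(
  chew_time    = runif(300, min = 5, max = 60),
  snack_crunch = runif(300, min = 0, max = 10)
)
gummy_data$satisfaction <- 23 + 0.9 * gummy_data$chew_time + rnorm(300, sd = 8)

fit <- lm(satisfaction ~ chew_time + snack_crunch, data = gummy_data)
coef_table <- coef(summary(fit))

# Test statistic: estimate divided by its standard error
t_by_hand <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Two-sided p-value from the t distribution with n - k - 1 degrees of freedom
p_by_hand <- 2 * pt(abs(t_by_hand), df = df.residual(fit), lower.tail = FALSE)

coef_table
```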
How do I report regression models in the real world?
In our example,
| Predictor | Estimate (95% CI) | p-Value |
|---|---|---|
| Time Chewed | 0.92 (0.81, 1.02) | < 0.001 |
| Crunch Factor | -0.04 (-0.22, 0.14) | 0.663 |
We can use the likelihood ratio test to test for a “significant regression” line.
Hypotheses
Test Statistic and p-Value
We will use the significant_line() function from library(ssstats) to test for a significant model.
Let’s now determine if we have a significant regression line. How should we update the following code?
Our updated code,
Likelihood Ratio Test for Significant Regression Line:
Null: H₀: β₁ = β₂ = ... = βₖ = 0
Alternative: H₁: At least one βᵢ ≠ 0
Test statistic: χ²(2) = 57158.677
p-value: p < 0.001
Conclusion: Reject the null hypothesis (p = < 0.001 < α = 0.05)
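One common way to carry out a likelihood ratio test for the whole model is to compare the fitted model with an intercept-only model. The sketch below does this in base R on simulated data; it is an assumption about the general technique and is not claimed to reproduce how significant_line() computes its statistic.

```r
# Simulated stand-in for gummy_data (hypothetical values, not the lecture data)
set.seed(3)
gummy_data <- data.frame(
  chew_time    = runif(300, min = 5, max = 60),
  snack_crunch = runif(300, min = 0, max = 10)
)
gummy_data$satisfaction <- 23 + 0.9 * gummy_data$chew_time + rnorm(300, sd = 8)

# Full model and intercept-only (null) model
fit_full <- lm(satisfaction ~ chew_time + snack_crunch, data = gummy_data)
fit_null <- lm(satisfaction ~ 1, data = gummy_data)

# Likelihood ratio statistic: twice the gain in log-likelihood
lrt_stat <- as.numeric(2 * (logLik(fit_full) - logLik(fit_null)))

# Degrees of freedom: number of slopes set to zero under the null (k = 2)
lrt_df <- length(coef(fit_full)) - length(coef(fit_null))

# p-value from the chi-square distribution
lrt_p <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)
```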
R^2 = \frac{\text{SS}_{\text{Reg}}}{\text{SS}_{\text{Tot}}}
Thus, R^2 is the proportion of variation explained by the model (i.e., predictor set).
R^2 \in [0, 1]
R^2 \to 0 indicates that the model fits “poorly.”
R^2 \to 1 indicates that the model fits “well.”
R^2 = 1 indicates that all points fall on the response surface.
Recall that the error term in ANOVA is the “catch all” …
The SSTot is constant for the outcome of interest.
As we add predictors to the model, SSReg can only increase or stay the same, so R^2 never decreases, even when a new predictor is unrelated to the outcome.
We do not want to arbitrarily increase R^2, so we will use an adjusted version:
R^2_{\text{adj}} = 1 - \frac{\text{MS}_{\text{E}}}{\text{SS}_{\text{Tot}}/\text{df}_{\text{Tot}}}
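Both quantities can be computed from the sums of squares and compared with what base R's summary() reports for an lm() fit. The data are simulated for illustration, and lm() is a stand-in for the course's modeling function.

```r
# Simulated stand-in for gummy_data (hypothetical values, not the lecture data)
set.seed(3)
gummy_data <- data.frame(
  chew_time    = runif(300, min = 5, max = 60),
  snack_crunch = runif(300, min = 0, max = 10)
)
gummy_data$satisfaction <- 23 + 0.9 * gummy_data$chew_time + rnorm(300, sd = 8)

fit <- lm(satisfaction ~ chew_time + snack_crunch, data = gummy_data)
y <- gummy_data$satisfaction
n <- length(y)

# Sums of squares: total, error, and regression
ss_tot <- sum((y - mean(y))^2)
ss_err <- sum(residuals(fit)^2)
ss_reg <- ss_tot - ss_err

# R^2: proportion of total variation explained by the model
r2 <- ss_reg / ss_tot

# Adjusted R^2: 1 - MSE / (SSTot / dfTot), with dfTot = n - 1
ms_err <- ss_err / df.residual(fit)
r2_adj <- 1 - ms_err / (ss_tot / (n - 1))

c(r2 = r2, r2_adj = r2_adj)
```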
We will use the r_squared() function from library(ssstats) to find the R^2 value.
Let’s now find the R^2 value for the model we constructed earlier. How should we update the following code?
Our updated code,
Running the code,
This is just scratching the surface for multiple regression.
Other statistics courses go deeper into regression topics.
Categorical predictors.
Interaction terms.
Regression diagnostics.
How to handle non-continuous outcomes.
Data visualization for models.
STA4173 - Biostatistics - Summer 2025