Before today, we discussed methods for comparing continuous outcomes across two or more groups.
We will now begin exploring the relationship between two continuous variables.
We will first focus on data visualization and the corresponding correlation.
Then we will quantify the relationship using regression analysis.
Scatterplot or scatter diagram:
Each individual in the dataset is represented by a point on the scatterplot.
The explanatory variable is on the x-axis and the response variable is on the y-axis.
It is super important for us to plot the data!
Positive relationship: As x increases, y increases.
Negative relationship: As x increases, y decreases.
To construct a scatterplot, we will use the ggplot() function from library(tidyverse) (or library(ggplot2)).
In Ponyville, Pinkie Pie is curious about how Gummy’s snack satisfaction (0 to 100) relates to the duration of chew time (in seconds) he spends on crunchy treats. She suspects that Gummy enjoys snacks more the longer he chews them.
Pinkie Pie records both the chew time (chew_time) and Gummy’s satisfaction level (satisfaction) of the last 300 snacks given to Gummy (gummy_data).
How should we update the code for a scatterplot?
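One possible version of the scatterplot code is sketched below. The real gummy_data is not available here, so the data frame is simulated (the numbers are hypothetical stand-ins, not Pinkie Pie's records); only the column names chew_time and satisfaction come from the example.

```r
library(ggplot2)

# Simulated stand-in for gummy_data (hypothetical values)
set.seed(4173)
gummy_data <- data.frame(chew_time = runif(300, min = 5, max = 60))
gummy_data$satisfaction <- pmin(100, pmax(0,
  20 + 1.1 * gummy_data$chew_time + rnorm(300, sd = 10)))

# Explanatory variable on the x-axis, response variable on the y-axis
p <- ggplot(gummy_data, aes(x = chew_time, y = satisfaction)) +
  geom_point() +
  labs(x = "Chew time (seconds)", y = "Satisfaction (0 to 100)") +
  theme_bw()
p
```

Because chew time is the explanatory variable, it is mapped to x; satisfaction is the response, so it is mapped to y.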
Fluttershy wants to determine how the amount of carrots Angel Bunny is given (grams) affects his happiness level (0 to 100). She believes that Angel Bunny tends to be happiest with a moderate amount of carrots.
Fluttershy records both the weight of the carrot (carrot_weight) and Angel Bunny’s happiness level (happiness) of the last 200 snacks given to Angel Bunny (angel_data).
How should we update the code for a scatterplot?
Our updated code,
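A sketch of what the updated code might look like follows. As before, angel_data is simulated here (hypothetical values, built so that happiness peaks at a moderate carrot weight); only the column names carrot_weight and happiness come from the example.

```r
library(ggplot2)

# Simulated stand-in for angel_data (hypothetical values)
set.seed(4173)
angel_data <- data.frame(carrot_weight = runif(200, min = 0, max = 100))
angel_data$happiness <- pmin(100, pmax(0,
  90 - 0.03 * (angel_data$carrot_weight - 50)^2 + rnorm(200, sd = 8)))

# Carrot weight is the explanatory variable (x); happiness is the response (y)
p <- ggplot(angel_data, aes(x = carrot_weight, y = happiness)) +
  geom_point() +
  labs(x = "Carrot weight (grams)", y = "Happiness (0 to 100)") +
  theme_bw()
p
```

A curved pattern like the one Fluttershy expects is neither a positive nor a negative relationship, which is one reason to plot the data before summarizing it with a single number.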
Creating the scatterplot allows us to visualize a potential relationship.
Now, let’s discuss quantifying that relationship.
Initial quantification: correlation.
Further quantification: regression.
Correlation: A unitless measure of the strength and direction of the linear relationship between two quantitative variables.
\rho represents the population correlation coefficient.
r represents the sample correlation coefficient.
Correlation is bounded to [-1, 1].
r=-1 represents perfect negative correlation.
r=1 represents perfect positive correlation.
r=0 represents no linear correlation.
r = \frac{\sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s_x} \right)\left( \frac{y_i - \bar{y}}{s_y} \right)}{n-1}
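To see the formula in action, the sketch below computes r by hand from standardized x and y values and compares it with base R's cor(). The vectors x and y are simulated for illustration only.

```r
# Simulated illustration data (hypothetical values)
set.seed(4173)
x <- runif(300, min = 5, max = 60)
y <- 20 + 1.1 * x + rnorm(300, sd = 10)
n <- length(x)

# Standardize each variable using the sample mean and sample standard deviation
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Sum the products of the standardized values and divide by n - 1
r_manual <- sum(z_x * z_y) / (n - 1)

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)
```

Standardizing removes the units of x and y, which is why r is unitless.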
We will use the correlation() function from library(ssstats) to examine correlation.
For a single pairwise correlation,
How should we update the code for the correlation between satisfaction level and the chew time?
Our updated code,
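The correlation() call from library(ssstats) is not reproduced here. As a stand-in, this sketch uses base R's cor(), which returns Pearson's r by default. The data frame is simulated (hypothetical values); only the column names come from the example.

```r
# Simulated stand-in for gummy_data (hypothetical values)
set.seed(4173)
gummy_data <- data.frame(chew_time = runif(300, min = 5, max = 60))
gummy_data$satisfaction <- pmin(100, pmax(0,
  20 + 1.1 * gummy_data$chew_time + rnorm(300, sd = 10)))

# Pearson correlation between chew time and satisfaction
r_pearson <- cor(gummy_data$chew_time, gummy_data$satisfaction)
r_pearson
```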
How should we update the following code to get the correlation matrix?
Our updated code,
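As a base R stand-in for the ssstats call (which is not reproduced here), passing a data frame of numeric columns to cor() returns the full correlation matrix. The data are simulated (hypothetical values).

```r
# Simulated stand-in for gummy_data (hypothetical values)
set.seed(4173)
gummy_data <- data.frame(chew_time = runif(300, min = 5, max = 60))
gummy_data$satisfaction <- pmin(100, pmax(0,
  20 + 1.1 * gummy_data$chew_time + rnorm(300, sd = 10)))

# Correlation matrix for every pair of the selected numeric columns
cor_matrix <- cor(gummy_data[, c("chew_time", "satisfaction")])
cor_matrix
```

The matrix is symmetric with 1s on the diagonal, because each variable is perfectly correlated with itself.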
We can determine if the correlation is significantly different from 0 (i.e., a linear relationship exists).
Hypotheses: H_0: \rho = 0 versus H_1: \rho \neq 0
Test Statistic and p-Value
This is default output in our correlation() function.
Looking at correlation results for this specific relationship,
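The ssstats output is not reproduced here. As a base R sketch of the same test, cor.test() reports the t statistic, degrees of freedom, and p-value for H_0: \rho = 0. The sketch also recomputes the statistic by hand as t = r \sqrt{n-2} / \sqrt{1-r^2}, which has n-2 degrees of freedom. The data are simulated (hypothetical values).

```r
# Simulated stand-in for gummy_data (hypothetical values)
set.seed(4173)
gummy_data <- data.frame(chew_time = runif(300, min = 5, max = 60))
gummy_data$satisfaction <- pmin(100, pmax(0,
  20 + 1.1 * gummy_data$chew_time + rnorm(300, sd = 10)))

# Two-sided test of H0: rho = 0 using Pearson's correlation
pearson_test <- cor.test(gummy_data$chew_time, gummy_data$satisfaction,
                         method = "pearson")
pearson_test

# Recompute the test statistic by hand from r and n
n <- nrow(gummy_data)
r <- unname(pearson_test$estimate)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_manual
```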
The assumption for Pearson’s correlation is that both x and y are normally distributed.
We will use the correlation_qq() function from library(ssstats) to examine the normality of x and y.
Let’s now check the assumption that both satisfaction level and the chew time are normally distributed. How should we change the following code?
Our updated code,
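The correlation_qq() call is not reproduced here. As a stand-in, the sketch below draws one normal Q-Q plot per variable with ggplot2's stat_qq() and stat_qq_line(). The data are simulated (hypothetical values), so the shape of these plots says nothing about Gummy's real data.

```r
library(ggplot2)

# Simulated stand-in for gummy_data (hypothetical values)
set.seed(4173)
gummy_data <- data.frame(chew_time = runif(300, min = 5, max = 60))
gummy_data$satisfaction <- pmin(100, pmax(0,
  20 + 1.1 * gummy_data$chew_time + rnorm(300, sd = 10)))

# Q-Q plot for the explanatory variable
qq_x <- ggplot(gummy_data, aes(sample = chew_time)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Chew time", x = "Theoretical quantiles", y = "Sample quantiles") +
  theme_bw()

# Q-Q plot for the response variable
qq_y <- ggplot(gummy_data, aes(sample = satisfaction)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Satisfaction", x = "Theoretical quantiles", y = "Sample quantiles") +
  theme_bw()

qq_x
qq_y
```

Points that follow the reference line closely are consistent with normality; systematic curvature away from the line suggests the assumption is not met.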
What do we do when we do not meet the normality assumption?
Spearman’s Correlation: A unitless measure of the strength and direction of the monotone relationship between two variables.
Spearman’s correlation is interpreted the same as Pearson’s correlation.
To find Spearman’s correlation, the following algorithm is followed:
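A standard way to compute Spearman's correlation is to rank the x values, rank the y values (ties receive their average rank), and then compute Pearson's correlation on the ranks. The sketch below checks this against cor() with method = "spearman", using simulated vectors for illustration only.

```r
# Simulated illustration data (hypothetical values)
set.seed(4173)
x <- runif(300, min = 5, max = 60)
y <- 20 + 1.1 * x + rnorm(300, sd = 10)

# Step 1: rank each variable separately (ties get the average rank)
rank_x <- rank(x)
rank_y <- rank(y)

# Step 2: compute Pearson's correlation on the ranks
rho_manual <- cor(rank_x, rank_y)

# Built-in Spearman correlation for comparison
rho_builtin <- cor(x, y, method = "spearman")

all.equal(rho_manual, rho_builtin)
```

Because only the ranks are used, Spearman's correlation is less sensitive to outliers and does not require normality.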
We will again use the correlation() function from library(ssstats), this time to examine Spearman’s correlation.
For a single pairwise correlation,
How should we update the code for the Spearman correlation between satisfaction level and the chew time?
Our updated code,
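The ssstats call is not reproduced here. As a base R stand-in, cor() accepts method = "spearman". The data are simulated (hypothetical values); only the column names come from the example.

```r
# Simulated stand-in for gummy_data (hypothetical values)
set.seed(4173)
gummy_data <- data.frame(chew_time = runif(300, min = 5, max = 60))
gummy_data$satisfaction <- pmin(100, pmax(0,
  20 + 1.1 * gummy_data$chew_time + rnorm(300, sd = 10)))

# Spearman correlation between chew time and satisfaction
r_spearman <- cor(gummy_data$chew_time, gummy_data$satisfaction,
                  method = "spearman")
r_spearman
```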
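We can determine if Spearman’s correlation is significantly different from 0 (i.e., a monotone relationship exists).
Hypotheses: H_0: \rho_s = 0 versus H_1: \rho_s \neq 0, where \rho_s is the population Spearman correlation.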
Test Statistic and p-Value
This is default output in our correlation() function.
Looking at Spearman’s correlation results for this specific relationship,
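The ssstats output is not reproduced here. As a base R sketch, cor.test() with method = "spearman" tests whether Spearman's correlation differs from 0. Setting exact = FALSE is a choice made for this sketch: it requests the asymptotic t approximation, which also avoids warnings if the data contain ties. The data are simulated (hypothetical values).

```r
# Simulated stand-in for gummy_data (hypothetical values)
set.seed(4173)
gummy_data <- data.frame(chew_time = runif(300, min = 5, max = 60))
gummy_data$satisfaction <- pmin(100, pmax(0,
  20 + 1.1 * gummy_data$chew_time + rnorm(300, sd = 10)))

# Two-sided test that Spearman's correlation is 0
spearman_test <- cor.test(gummy_data$chew_time, gummy_data$satisfaction,
                          method = "spearman", exact = FALSE)
spearman_test

# The estimate matches cor() with method = "spearman"
rho_hat <- unname(spearman_test$estimate)
rho_hat
```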
STA4173 - Biostatistics - Fall 2025