Correlation and
Simple Linear Regression

STA4173: Biostatistics
Spring 2025

Introduction: Correlation

  • Before today, we discussed methods for comparing continuous outcomes across two or more groups.

  • We now will begin exploring the relationships between two continuous variables.

  • We will first focus on data visualization and correlation.

  • Then we will quantify the relationship using regression analysis.

Scatterplot

  • Scatterplot or scatter diagram:

    • A graph that shows the relationship between two quantitative variables measured on the same subject.
  • Each individual in the dataset is represented by a point on the scatterplot.

  • The explanatory variable is on the x-axis and the response variable is on the y-axis.

  • It is super important for us to plot the data!

    • Plotting the data is a first step in identifying issues or potential relationships.

Scatterplots

  • Positive relationship: As x increases, y increases.

  • Negative relationship: As x increases, y decreases.


Example

  • A golf pro wants to investigate the relation between the club-head speed of a golf club (measured in miles per hour) and the distance (in yards) that the ball will travel.

  • The pro uses a single model of club and ball, one golfer, and a clear, 70-degree day with no wind.

  • The pro records the club-head speed, measures the distance the ball travels, and collects the data below.

library(tidyverse)
golf <- tibble(speed = c(100, 102, 103, 101, 105, 100, 99, 105),
               distance = c(257, 264, 274, 266, 277, 263, 258, 275))
head(golf, n=3)

Example

  • As we have done before, we will graph using the ggplot2 package.
golf %>% ggplot(aes(x = speed, y = distance)) + 
  geom_point(size = 5) + # plot points; make dots bigger
  labs(x = "Club Head Speed (mph)",
       y = "Distance (yards)") + # define labels 
  theme_bw() # change background color

Correlation

  • Creating the scatterplot allows us to visualize a potential relationship.

    • e.g., we know from the scatterplot for the golf data, as speed increases, distance increases.
  • Now, let’s talk about how to quantify that relationship.

    • Initial quantification: correlation.

    • Further quantification: regression.

Correlation

  • Correlation: A measure of the strength and direction of the linear relationship between two quantitative variables.

    • \rho represents the population correlation coefficient.

    • r represents the sample correlation coefficient.

  • Correlation is bounded to [-1, 1].

    • r=-1 represents perfect negative correlation.

    • r=1 represents perfect positive correlation.

    • r=0 represents no linear correlation.

Pearson’s Correlation

  • Pearson’s correlation coefficient:

r = \frac{\sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s_x} \right)\left( \frac{y_i - \bar{y}}{s_y} \right)}{n-1}

  • x_i is the ith observation of x; y_i is the ith observation of y
  • \bar{x} is the sample mean of x; \bar{y} is the sample mean of y
  • s_x is the sample standard deviation of x; s_y is the sample standard deviation of y
  • n is the sample size (number of paired observations)
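
  • As a quick check, here is a minimal sketch that computes r by hand with this formula for the golf data from earlier and compares it to R’s built-in cor():
x <- golf$speed
y <- golf$distance
n <- length(x) # number of paired observations
sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1) # 0.9386958, matching cor(x, y)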

Pearson’s Correlation

  • In R, we will use the cor() function to find the correlation.

    • We can plug in the two specific variables we are interested in.

    • We can also plug in the dataset we are working with and all pairwise correlations will be reported.

      • (This will be useful later!)
cor(dataset_name$variable_1, dataset_name$variable_2) # request correlation between two variables 
cor(dataset_name) # request correlation from a tibble (dataset)

Example

  • Let’s find the correlation of the golf data.

  • First, let’s plug in the specific variables we are interested in.

cor(golf$distance, golf$speed)
[1] 0.9386958
  • Looking at the correlation coefficient for distance and speed, r = 0.94.

    • This is a strong positive correlation.

Example

  • Now, let’s plug the whole dataset in.
cor(golf)
             speed  distance
speed    1.0000000 0.9386958
distance 0.9386958 1.0000000
  • Notes:

    • The correlation between a variable and itself is 1.

      • We can see this above: r_{\text{speed, speed}} = 1 and r_{\text{distance, distance}} = 1.
    • The correlation between x and y is the same as the correlation between y and x.

      • We can see this above: r_{\text{speed, distance}} = r_{\text{distance, speed}}

Spearman’s Correlation

  • An assumption for Pearson’s correlation is that both x and y are normally distributed.

  • What do we do when we do not meet the normality assumption?

  • Spearman’s Correlation: A measure of the strength and direction of the monotone relationship between two variables.

    • Spearman’s correlation only assumes that the data are at least ordinal.

    • It does not work for nominal data!

  • Spearman’s correlation is interpreted the same as Pearson’s correlation.

  • To find Spearman’s correlation, the following algorithm is followed:

    1. Rank the x and y values.

      • (x, y) \to (R_x, R_y)
    2. Find Pearson’s correlation for the ranked data (see the sketch below).
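
  • A minimal sketch of this two-step algorithm in R, using rank() on the golf data from earlier:
rx <- rank(golf$speed)     # step 1: rank the x values
ry <- rank(golf$distance)  # step 1: rank the y values
cor(rx, ry)                # step 2: Pearson's correlation on the ranks (0.9277782)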

Spearman’s Correlation

  • We will again use the cor() function, but now we will specify Spearman.
cor(dataset_name$variable_1, dataset_name$variable_2, method = "spearman") 
cor(dataset_name, method = "spearman") 

Example

  • Let’s find the Spearman correlation of the golf data.
cor(golf$distance, golf$speed, method = "spearman")
[1] 0.9277782
  • Spearman’s correlation is r_{\text{S}} = 0.93.

    • This is still a strong, positive correlation.
  • Compare this to Pearson’s correlation, r = 0.94.

    • In this dataset, Spearman’s correlation is not vastly different.

    • While we see that Spearman’s correlation is smaller than Pearson’s correlation, this is not always the case.

Conclusions

  • Correlation allows us to quantify the relationship between two variables without units.

    • It does not matter if x and y have the same units! Correlation is unitless.
  • The closer to -1 or 1, the stronger the relationship.

    • As correlation approaches -1 or 1, x is better at predicting y.
  • For fun: play Guess the Correlation.

Introduction: Simple Linear Regression

  • We have discussed quantifying the relationship between two continuous variables.

  • Recall that the correlation describes the strength and the direction of the relationship.

    • Pearson’s correlation: describes the linear relationship; assumes normality of both variables.

    • Spearman’s correlation: describes the monotone relationship; assumes both variables are at least ordinal.

  • Further, recall that correlation is unitless and bounded to [-1, 1].

Introduction: Simple Linear Regression

  • Now we will discuss a different way of representing/quantifying the relationship.

    • We will now construct a line of best fit, called the ordinary least squares (OLS) regression line.
  • Using simple linear regression, we will model y (the outcome) as a function of x (the predictor).

    • This is called simple linear regression because there is only one predictor.

    • Next week, we will venture into multiple regression, which has multiple predictors.

Simple Linear Regression

  • Population regression line:

y = \beta_0 + \beta_1 x + \varepsilon

  • \beta_0 is the y-intercept.

  • \beta_1 is the slope describing the relationship between x and y.

  • \varepsilon (estimated by e) is the error term; remember, from ANOVA (😱):

\varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2)

Simple Linear Regression

  • Population regression line:

y = \beta_0 + \beta_1 x + \varepsilon

  • Sample regression line:

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x

  • \hat{y} estimates y.

  • \hat{\beta}_0 estimates \beta_0.

  • \hat{\beta}_1 estimates \beta_1.

  • e = y - \hat{y} estimates \varepsilon.

Simple Linear Regression

  • We first calculate the slope, \hat{\beta}_1

\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i - \frac{\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n}}{\sum_{i=1}^n x_i^2 - \frac{\left(\sum_{i=1}^n x_i\right)^2}{n}} = r \frac{s_y}{s_x}

  • r is the Pearson correlation,
  • s_x is the standard deviation of x, and
  • s_y is the standard deviation of y
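
  • As a quick sketch, the shortcut form r \, s_y/s_x can be computed directly for the golf data:
b1 <- cor(golf$speed, golf$distance) * sd(golf$distance) / sd(golf$speed)
b1 # 3.1661, matching the lm() output we will see shortly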

Simple Linear Regression

  • Then we can calculate the intercept, \hat{\beta}_0

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

where

  • \bar{x} is the mean of x, and
  • \bar{y} is the mean of y.
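
  • Continuing the sketch from the previous slide, the intercept follows directly from the slope estimate b1:
b0 <- mean(golf$distance) - b1 * mean(golf$speed) # ybar - b1 * xbar
b0 # -55.7966, matching the lm() output we will see shortly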

Simple Linear Regression

  • We already know the syntax for modeling!

    • ANOVA = regression 😎
  • We will use the lm() function to define the model and summary() to see the results.

m <- lm(outcome_variable ~ predictor_variable, data = dataset_name)
summary(m)
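
  • Once the model is fit, base R extractor functions return the quantities we have defined:
coef(m)      # the estimates, beta0_hat and beta1_hat
fitted(m)    # the fitted values, yhat_i
residuals(m) # the residuals, e_i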

Example

  • Recall the golf data,
library(tidyverse)
golf <- tibble(speed = c(100, 102, 103, 101, 105, 100, 99, 105), 
               distance = c(257, 264, 274, 266, 277, 263, 258, 275))
head(golf, n=3)

Example

  • Let’s now construct the simple linear regression model.

    • We want to model distance as a function of speed.
m <- lm(distance ~ speed, data = golf)
summary(m)

Call:
lm(formula = distance ~ speed, data = golf)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8136 -2.0195  0.3542  2.0619  3.6881 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -55.7966    48.3713  -1.154  0.29257    
speed         3.1661     0.4747   6.670  0.00055 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.883 on 6 degrees of freedom
Multiple R-squared:  0.8811,    Adjusted R-squared:  0.8613 
F-statistic: 44.48 on 1 and 6 DF,  p-value: 0.0005498
  • This results in the model,

\hat{\text{distance}} = -55.80 + 3.17 \text{ speed}
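
  • As a quick sketch of putting the fitted line to use, we can predict the distance at a club-head speed of 103 mph:
predict(m, newdata = tibble(speed = 103)) # -55.7966 + 3.1661 * 103, about 270.31 yards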

Interpretations

  • We need to provide interpretations of both the slope and the y-intercept.

  • Interpretation of the slope, \hat{\beta}_1

    • For a [k] [units of x] increase in [x], we expect [y] to [increase or decrease] by [k \times |\hat{\beta}_1|] [units of y].
  • Interpretation of the y-intercept, \hat{\beta}_0

    • When [x] is 0, we expect the mean of [y] to be [\hat{\beta}_0].

    • Caveat:

      • It does not always make sense for x to be 0. In this situation, we do not interpret the y-intercept.

      • Some applications (e.g., psychology) will center the x variable around its mean. Then, when x=0, we are interpreting in terms of the “average value of x.”

Example

  • Recall the regression line from the golf pro example, \hat{\text{distance}} = -55.80 + 3.17 \text{ speed}

    • For a 1 mph increase in club-head speed, we expect the distance the golf ball travels to increase by 3.17 yards.

    • For a 5 mph increase in club-head speed, we expect the distance the golf ball travels to increase by 15.83 yards.

    • When the club-head speed is 0 mph, the average distance the golf ball travels is -55.80 yards. (A negative distance is nonsensical and 0 mph is far outside the observed speeds, so we would not interpret the y-intercept here.)

Hypothesis Testing for \beta_1

  • First, the residual:

e_i = y_i - \hat{y}_i

  • Then, we find the standard error of the estimate, s_e,

s_e = \sqrt{\frac{\sum_{i=1}^n e_i^2}{n-2}}

  • Finally, we can find the standard error of \hat{\beta}_1, s_{\hat{\beta}_1},

s_{\hat{\beta}_1} = \frac{s_e}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}} = \frac{s_e}{s_x \sqrt{n-1}}
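
  • A minimal sketch verifying these quantities for the fitted golf model m:
e <- residuals(m)                    # e_i = y_i - yhat_i
n <- length(e)
s_e <- sqrt(sum(e^2) / (n - 2))      # 2.883, the "Residual standard error" in summary(m)
s_e / (sd(golf$speed) * sqrt(n - 1)) # 0.4747, the SE of the slope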

Hypothesis Testing for \beta_1

  • Hypotheses

    • H_0: \ \beta_1 = 0
    • H_1: \ \beta_1 \ne 0
  • Test Statistic

    • t_0 = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \frac{\hat{\beta}_1}{\text{SE of }\hat{\beta}_1}
  • p-Value

    • p = 2 \times P[t_{n-2} \ge |t_0|]
  • Rejection Region

    • Reject H_0 if p<\alpha.
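
  • A sketch of this test in R for the golf model (here, df = n-2 = 6):
b1    <- coef(summary(m))["speed", "Estimate"]   # 3.1661
se_b1 <- coef(summary(m))["speed", "Std. Error"] # 0.4747
t0 <- b1 / se_b1                                 # 6.670
2 * pt(abs(t0), df = 6, lower.tail = FALSE)      # p = 0.00055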

Example

  • From the golf example,
summary(m)

Call:
lm(formula = distance ~ speed, data = golf)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8136 -2.0195  0.3542  2.0619  3.6881 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -55.7966    48.3713  -1.154  0.29257    
speed         3.1661     0.4747   6.670  0.00055 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.883 on 6 degrees of freedom
Multiple R-squared:  0.8811,    Adjusted R-squared:  0.8613 
F-statistic: 44.48 on 1 and 6 DF,  p-value: 0.0005498
  • t_0 for speed is 6.67

    • Slope for speed is 3.1661 and the SE of the slope for speed is 0.4747

    • We can see that 3.1661/0.4747 = 6.67

  • The corresponding p-value is p = 0.00055, which we will represent as p < 0.001.

Example

  • Hypotheses

    • H_0: \ \beta_{\text{speed}} = 0
    • H_1: \ \beta_{\text{speed}} \ne 0
  • Test Statistic

    • t_0 = 6.67
  • p-Value

    • p < 0.001
  • Rejection Region

    • Reject H_0 if p<\alpha; \alpha=0.05

Confidence Interval for \beta_1

  • Recall that a point estimate by itself does not tell us how good our estimation is.

  • Thus, we are also interested in finding the confidence interval for \beta_1.

  • (1-\alpha)100\% CI for \beta_1: \hat{\beta}_1 \pm t_{\alpha/2,n-2} s_{\hat{\beta}_1} where the SE of the slope, s_{\hat{\beta}_1}, is as defined earlier.

  • In R, we reuse the fitted model object with the confint() function.

confint(m) # for 95% CI
confint(m, level = conf_level) # for other levels

Example

  • Let’s find the confidence intervals for the golf data,
confint(m) # 95% CI
                  2.5 %    97.5 %
(Intercept) -174.157039 62.563818
speed          2.004539  4.327664
confint(m, level = 0.90) # 90% CI
                    5 %     95 %
(Intercept) -149.790863 38.19764
speed          2.243664  4.08854
  • The 95% CI for \beta_{\text{speed}} is (2.00, 4.33).
  • The 90% CI for \beta_{\text{speed}} is (2.24, 4.09).
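
  • As a check, the 95% interval can be reproduced by hand from the formula (a sketch using the slope estimate and its SE):
3.1661 + c(-1, 1) * qt(0.975, df = 6) * 0.4747 # (2.00, 4.33), matching confint(m)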

Wrap Up

  • Today we discussed the modeling of continuous data.

  • We have discussed simple linear regression, which means we have a single predictor in the model.

  • Next week, we will extend to multiple regression (more than one predictor).

  • Regression is ANOVA; ANOVA is regression.

    • Keep this in mind if you are asked to assess assumptions.