Review of Linear Regression

Introduction

  • Recall the general linear model, y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k

  • This is a multiple regression model because it has multiple predictors (x_i).

    • A special case is simple linear regression, when there is a single predictor: y = \beta_0 + \beta_1 x_1
  • \beta_0 is the y-intercept, or the average outcome (y) when all x_i = 0.

  • \beta_i is the slope for predictor i and describes the relationship between the predictor and the outcome, after adjusting (or accounting) for the other predictors in the model.

Lecture Example Set Up

  • On a busy day at the clubhouse, Mickey Mouse wants to understand what drives “happiness” at the end of the day. For each day, he records (in the clubhouse dataset):

    • Time with friends (in hours; time_with_friends): how many hours Mickey spends hanging out with his friends.
    • Goofy Laughs (a count; goofy_laughs): how many big goofy laughs happen that day.
    • Donald Grumbles (a count; donald_grumbles): how many times Donald gets frustrated and grumbles.
    • Clubhouse Happiness (a score; clubhouse_happiness): an overall happiness score at the end of the day.
  • In this lecture, we will use a linear model to explore the relationships between happiness and laughs, grumbles, and time spent with friends.

Lecture Example Set Up

  • Let’s pull in our data,
library(tidyverse) # provides read_csv() and the pipe
library(broom)     # provides tidy(), used throughout this lecture
clubhouse <- read_csv("https://raw.githubusercontent.com/samanthaseals/SDSII/refs/heads/main/files/data/lectures/W1_mickey_clubhouse.csv")
head(clubhouse)

Constructing the Linear Model (R)

  • We will use the glm() function to construct our linear model.

    • glm() stands for Generalized Linear Model.
    • For linear regression, we will specify family = gaussian.
m <- glm(y ~ x_1 + x_2 + ... + x_k,
         family = "gaussian",
         data = dataset_name)

Constructing the Linear Model

  • In our example,
m <- glm(clubhouse_happiness ~ time_with_friends + goofy_laughs + donald_grumbles,
         family = "gaussian",
         data = clubhouse)
tidy(m)

\begin{align*} \hat{y} &= 47.58 + 3.58 \ x_1 + 0.65 \ x_2 - 1.06 \ x_3 \\ \hat{\text{happiness}} &= 47.58 + 3.58 \text{ time} + 0.65 \text{ laughs} - 1.06 \text{ grumbles} \end{align*}

Interpretation of Slope

  • We want to put the slope into perspective for our collaborators.

  • Basic interpretation: for every 1 [units of x_i] increase in [x_i], [y] [increases or decreases] by \left[ \left| \hat{\beta}_i \right| \right] [units of y].

    • We say that y is decreasing if \hat{\beta}_i < 0 and y is increasing if \hat{\beta}_i > 0.
  • Note that in the case of multiple regression, there is an unspoken “after adjusting for everything else in the model” at the end of the sentence.

    • Always remember that we are looking at the relationship between y and x_i after adjusting for all other predictors included in the predictor set.

Interpretation of Slope

  • Let’s interpret the slopes for the model we constructed.
m %>% tidy()

Interpretation of Slope

  • Let’s interpret the slopes for the model we constructed.
time_with_friends      goofy_laughs   donald_grumbles 
        3.5760335         0.6594346        -1.0632922 
  • For every 1 hour increase in time spent with friends, clubhouse happiness increases by 3.58 points, after adjusting for goofy laughs and Donald grumbles.

  • For every 1 big goofy laugh increase, clubhouse happiness increases by 0.65 points, after adjusting for time spent with friends and Donald grumbles.

  • For every 1 Donald grumble increase, clubhouse happiness decreases by 1.06 points, after adjusting for time spent with friends and goofy laughs.

Interpretation of Slope

  • We can also scale our interpretations. e.g.,

    • For every k [units of x_i] increase in [x_i], [y] [increases or decreases] by \left[ k \times \left| \hat{\beta}_i \right| \right] [units of y].
  • In our example,

    • For every increase of 1 Donald grumble, clubhouse happiness decreases by 1.06 points, after adjusting for time spent with friends and goofy laughs.
    • For every increase of 5 Donald grumbles, clubhouse happiness decreases by 5.3 points, after adjusting for time spent with friends and goofy laughs.
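This scaling is just multiplication of the fitted coefficient by k. A minimal sketch, using simulated data rather than the clubhouse file (so the numbers are purely illustrative):

```r
# Illustrative sketch: scaling a fitted slope by k (simulated data, not the
# clubhouse dataset).
set.seed(42)
grumbles  <- rpois(50, 4)
happiness <- 50 - 1.1 * grumbles + rnorm(50)
fit <- glm(happiness ~ grumbles, family = "gaussian")

b1 <- unname(coef(fit)["grumbles"])  # change in happiness per 1 grumble
b5 <- 5 * b1                         # change in happiness per 5 grumbles
```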

Interpretation of Intercept

  • The intercept is the average of the outcome when all predictors in the model are equal to 0.

    • We can think of this as the “baseline” level of the outcome before any predictors have an effect.
  • In our example,

(Intercept) 
   47.57612 
  • The average clubhouse happiness when time with friends = 0 hours, goofy laughs = 0, and Donald grumbles = 0 is 47.58 points.
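We can confirm the "all predictors equal to 0" reading with predict(): the fitted value at x = 0 is exactly the intercept. A sketch with simulated data (illustrative only):

```r
# Sketch: the intercept equals the model's prediction when every predictor
# is 0 (simulated data for illustration).
set.seed(1)
x <- rnorm(30)
y <- 10 + 2 * x + rnorm(30)
fit <- glm(y ~ x, family = "gaussian")

at_zero <- predict(fit, newdata = data.frame(x = 0))
# at_zero agrees with coef(fit)["(Intercept)"]
```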

Confidence Intervals for \beta_i (R)

  • Recall confidence intervals – they quantify how precise our estimate is.

  • In general, CIs will take the form

    point estimate \pm margin of error

  • The margin of error is a critical value (e.g., t_{1-\alpha/2}) multiplied by the standard error of the point estimate.

  • Recall that the standard error accounts for the sample size: larger samples yield smaller standard errors.

  • In R, we will run the model results through the tidy() function, but ask for the confidence intervals.

m %>% tidy(conf.int = TRUE)
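The "point estimate \pm margin of error" recipe can also be computed by hand as a Wald-style interval. A sketch with simulated data; note that tidy(conf.int = TRUE) on a glm object profiles the likelihood instead, so its intervals can differ slightly from this hand computation:

```r
# Sketch of "estimate ± critical value × SE" (Wald interval), simulated data.
set.seed(2)
x <- rnorm(100)
y <- 3 + 1.5 * x + rnorm(100)
fit <- glm(y ~ x, family = "gaussian")

est <- summary(fit)$coefficients["x", "Estimate"]
se  <- summary(fit)$coefficients["x", "Std. Error"]
# qnorm(0.975) ~ 1.96; a t critical value is slightly more exact for
# gaussian models, where the dispersion is estimated
wald_95 <- est + c(-1, 1) * qnorm(0.975) * se  # (lower, upper)
```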

Confidence Intervals for \beta_i

  • In our example,
m %>% tidy(conf.int = TRUE)
  • We have the following CIs:

    • 95% CI for \beta_{\text{time}} is (2.37, 4.78)
    • 95% CI for \beta_{\text{goofy}} is (0.52, 0.79)
    • 95% CI for \beta_{\text{Donald}} is (-1.34, -0.79)

Confidence Intervals for \beta_i

  • What if we want something other than a 95% CI?
m %>% tidy(conf.int = TRUE, conf.level = insert_level_here)

Confidence Intervals for \beta_i

  • In our example,
m %>% tidy(conf.int = TRUE, conf.level = 0.99)
  • We have the following CIs:

    • 99% CI for \beta_{\text{time}} is (1.99, 5.16)
    • 99% CI for \beta_{\text{goofy}} is (0.48, 0.84)
    • 99% CI for \beta_{\text{Donald}} is (-1.42, -0.71)

Confidence Intervals for \beta_i

  • Let’s compare the two sets of CIs.

  • 95% CIs for \beta_i:

    • 95% CI for \beta_{\text{time}} is (2.37, 4.78)
    • 95% CI for \beta_{\text{goofy}} is (0.52, 0.79)
    • 95% CI for \beta_{\text{Donald}} is (-1.34, -0.79)
  • 99% CIs for \beta_i:

    • 99% CI for \beta_{\text{time}} is (1.99, 5.16)
    • 99% CI for \beta_{\text{goofy}} is (0.48, 0.84)
    • 99% CI for \beta_{\text{Donald}} is (-1.42, -0.71)

Significant Regression Line

  • We now will ask ourselves if all of our slopes are flat.

H_0: \ \beta_1 = \beta_2 = ... = \beta_k = 0

  • What does it mean if the slopes are flat?

    • If a slope is flat, it means that there is no relationship between that predictor and the outcome, after adjusting for the other predictors in the model.
  • We will use hypothesis testing to determine if at least one slope is non-zero.

    • This is often referred to as the omnibus F-test in linear regression.

    • We will instead use the likelihood ratio test.

      • This is because the likelihood ratio test carries over when we move beyond the normal distribution.

Significant Regression Line

  • For any model constructed, we can compute deviance.

    • Deviance measures the unexplained variability in the outcome under the fitted model.
    • Lower deviance means the model fits the data better.
      • The actual value is not meaningful on its own; we only compare deviances between models.
  • To determine the test statistic, we examine the difference between two models:

    • Full model: includes all predictors.
    • Reduced model: includes only the intercept (no predictors).
  • Then, we look at the difference between the deviances of the full and reduced models.

    • This difference has an approximate \chi^2 distribution.
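The deviance arithmetic above can be sketched directly. This uses simulated data, not the clubhouse file; note that for gaussian models the dispersion must be estimated, so the deviance drop is scaled by it before the chi-square comparison (anova(..., test = "LRT") handles this for us):

```r
# Sketch: likelihood ratio test via the drop in deviance (simulated data).
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 5 + 2 * x1 + rnorm(n)

full    <- glm(y ~ x1 + x2, family = "gaussian")
reduced <- glm(y ~ 1,       family = "gaussian")

dev_drop <- deviance(reduced) - deviance(full)        # unscaled statistic
df_drop  <- df.residual(reduced) - df.residual(full)  # chi-square df
phi      <- summary(full)$dispersion                  # estimated dispersion
p_val    <- pchisq(dev_drop / phi, df_drop, lower.tail = FALSE)
```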

Significant Regression Line

  • Hypotheses

    • H_0: \ \beta_1 = ... = \beta_k = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic and p-Value

    • \chi^2_0 = [\text{value from R}], p = [\text{value from R}]
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = [\text{assumed } \alpha].
  • Conclusion/Interpretation

    • [Reject (if p < \alpha) or fail to reject (if p \ge \alpha)] H_0. There [is (if reject) or is not (if FTR)] sufficient evidence to suggest that at least one slope is non-zero.

Significant Regression Line (R)

  • We will compare models to determine the significance of the line.

    • Full model: includes all predictors.
    • Reduced model: includes only the intercept (no predictors).
  • In R,

full <- glm(y ~ x_1 + x_2 + ... + x_k, data = dataset_name, family = "gaussian")
reduced <- glm(y ~ 1, data = dataset_name, family = "gaussian")
anova(reduced, full, test = "LRT")

Significant Regression Line

  • In our example,
full <- glm(clubhouse_happiness ~ time_with_friends + goofy_laughs + donald_grumbles,
            data = clubhouse,
            family = "gaussian")
reduced <- glm(clubhouse_happiness ~ 1, 
               data = clubhouse, 
               family = "gaussian")
anova(reduced, full, test = "LRT")

Significant Regression Line

  • Hypotheses

    • H_0: \ \beta_{\text{time}} = \beta_{\text{goofy}} = \beta_{\text{Donald}} = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic and p-Value

    • \chi^2_0 = 17305, p < 0.001
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = 0.05.
  • Conclusion/Interpretation

    • Reject H_0. There is sufficient evidence to suggest that at least one slope is non-zero.

Significant Predictors of y

  • Hypotheses

    • H_0: \ \beta_i = 0
    • H_1: \ \beta_i \ne 0
  • Test Statistic and p-Value

    • t_0 = [\text{value from R}], p = [\text{value from R}]
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = [\text{assumed } \alpha].
  • Conclusion/Interpretation

    • [Reject (if p < \alpha) or fail to reject (if p \ge \alpha)] H_0. There [is (if reject) or is not (if FTR)] sufficient evidence to suggest that [predictor name] significantly predicts [outcome], after adjusting for the other predictors in the model.

Significant Predictors of y (R)

  • Because we currently are only dealing with continuous predictors, we will use the results from tidy().
m %>% tidy()

Significant Predictors of y

  • In our example,
m %>% tidy()
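The t statistic in the tidy()/summary() output is simply the estimate divided by its standard error. A sketch with simulated data (illustrative only):

```r
# Sketch: the reported t statistic is estimate / SE (simulated data).
set.seed(5)
x <- rnorm(60)
y <- 2 + 0.8 * x + rnorm(60)
fit <- glm(y ~ x, family = "gaussian")

co <- summary(fit)$coefficients
t0 <- co["x", "Estimate"] / co["x", "Std. Error"]
# t0 matches the "t value" column, co["x", "t value"]
```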

Significant Predictors of y

  • Hypotheses

    • H_0: \ \beta_{\text{time}} = 0
    • H_1: \ \beta_{\text{time}} \ne 0
  • Test Statistic and p-Value

    • t_0 = 5.80, p < 0.001
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = 0.05.
  • Conclusion/Interpretation

    • Reject H_0. There is sufficient evidence to suggest that time with friends significantly predicts clubhouse happiness, after adjusting for the other predictors in the model.

Significant Predictors of y

  • Hypotheses

    • H_0: \ \beta_{\text{goofy}} = 0
    • H_1: \ \beta_{\text{goofy}} \ne 0
  • Test Statistic and p-Value

    • t_0 = 9.55, p < 0.001
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = 0.05.
  • Conclusion/Interpretation

    • Reject H_0. There is sufficient evidence to suggest that goofy laughs significantly predict clubhouse happiness, after adjusting for the other predictors in the model.

Significant Predictors of y

  • Hypotheses

    • H_0: \ \beta_{\text{Donald}} = 0
    • H_1: \ \beta_{\text{Donald}} \ne 0
  • Test Statistic and p-Value

    • t_0 = -7.66, p < 0.001
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = 0.05.
  • Conclusion/Interpretation

    • Reject H_0. There is sufficient evidence to suggest that Donald’s grumbles significantly predict clubhouse happiness, after adjusting for the other predictors in the model.

Reporting Results

  • I typically use a table to provide results to collaborators.
Predictor           \hat{\beta}_i (95% CI for \beta_i)   p-value
time with friends   3.58 (2.37, 4.78)                    < 0.001
goofy laughs        0.65 (0.52, 0.79)                    < 0.001
Donald grumbles     -1.06 (-1.34, -0.79)                 < 0.001
  • As time with friends increases by 1 hour, clubhouse happiness increases by 3.58 points.
  • As goofy laughs increase by 1, clubhouse happiness increases by 0.65 points.
  • As Donald grumbles increase by 1, clubhouse happiness decreases by 1.06 points.
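A table like this can be assembled directly from tidy() output. A sketch with simulated data; the column names (estimate, conf.low, conf.high, p.value) are broom's standard output columns:

```r
# Sketch: building a collaborator-facing results table from tidy() output
# (simulated data, not the clubhouse dataset).
library(broom)
library(dplyr)
set.seed(4)
x1 <- rnorm(80)
x2 <- rnorm(80)
y  <- 1 + 2 * x1 - x2 + rnorm(80)
fit <- glm(y ~ x1 + x2, family = "gaussian")

report <- tidy(fit, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  transmute(
    Predictor = term,
    Estimate  = round(estimate, 2),
    `95% CI`  = sprintf("(%.2f, %.2f)", conf.low, conf.high),
    `p-value` = format.pval(p.value, eps = 0.001)
  )
```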

Wrap Up

  • This lecture reviewed the key components of linear regression:

    • Constructing the model in R
    • Interpreting slopes and intercepts
    • Constructing confidence intervals for slopes
    • Testing for a significant regression line
    • Testing for significant predictors
  • We will continue reminding ourselves of these concepts as we move into other types of regression models.

  • In the next lecture, we will review how to visualize our models.