```r
# load packages (read_csv() is from readr/tidyverse; tidy() is from broom)
library(tidyverse)
library(broom)

clubhouse <- read_csv("https://raw.githubusercontent.com/samanthaseals/SDSII/refs/heads/main/files/data/lectures/W1_mickey_clubhouse.csv")
head(clubhouse)
```

Recall the general linear model, y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k.
This is a multiple regression model because it has multiple predictors (x_i).
\beta_0 is the y-intercept, or the average outcome (y) when all x_i = 0.
\beta_i is the slope for predictor i and describes the relationship between the predictor and the outcome, after adjusting (or accounting) for the other predictors in the model.
On a busy day at the clubhouse, Mickey Mouse wants to understand what drives “happiness” at the end of the day. For each day, he records (in the clubhouse dataset) the clubhouse happiness score (clubhouse_happiness), the hours of time spent with friends (time_with_friends), the number of goofy laughs (goofy_laughs), and the number of Donald grumbles (donald_grumbles).
In this lecture, we will use a linear model to explore the relationships between happiness and laughs, grumbles, and time spent with friends.
We will use the glm() function to construct our linear model.
glm() stands for "generalized linear model." For ordinary linear regression, we specify family = gaussian.

```r
m <- glm(clubhouse_happiness ~ time_with_friends + goofy_laughs + donald_grumbles,
         family = gaussian,
         data = clubhouse)
tidy(m)
```

This gives the fitted model:

\begin{align*} \hat{y} &= 47.58 + 3.58 \ x_1 + 0.65 \ x_2 - 1.06 \ x_3 \\ \hat{\text{happiness}} &= 47.58 + 3.58 \text{ time} + 0.65 \text{ laughs} - 1.06 \text{ grumbles} \end{align*}
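To make the fitted equation concrete, we can plug a hypothetical day into it by hand. The values below are made up for illustration; the coefficients are the rounded estimates from above:

```r
# rounded coefficient estimates from the fitted model above
b0 <- 47.58; b_time <- 3.58; b_laughs <- 0.65; b_grumbles <- -1.06

# a hypothetical day: 3 hours with friends, 10 goofy laughs, 2 Donald grumbles
yhat <- b0 + b_time * 3 + b_laughs * 10 + b_grumbles * 2
yhat  # predicted happiness: 62.7
```

In practice, we would get the same prediction (with unrounded coefficients) from `predict(m, newdata = ...)`.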
We want to put the slopes into perspective for our collaborators.
Basic interpretation: for every 1-[unit of x_i] increase in [x_i], [y] [increases or decreases] by \left| \hat{\beta}_i \right| [units of y].
Note that in the case of multiple regression, there is an unspoken “after adjusting for everything else in the model” at the end of the sentence.
```
time_with_friends      goofy_laughs   donald_grumbles
        3.5760335         0.6594346        -1.0632922
```
For every 1 hour increase in time spent with friends, clubhouse happiness increases by 3.58 points, after adjusting for goofy laughs and Donald grumbles.
For every 1 big goofy laugh increase, clubhouse happiness increases by 0.65 points, after adjusting for time spent with friends and Donald grumbles.
For every 1 Donald grumble increase, clubhouse happiness decreases by 1.06 points, after adjusting for time spent with friends and goofy laughs.
We can also scale our interpretations to units that are more natural for the audience.
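For example, to restate the time-with-friends slope per 30 minutes rather than per hour, we can simply rescale the estimate (a quick sketch using the rounded slope from above):

```r
b_time <- 3.58             # points of happiness per 1 hour with friends
b_time_30min <- 0.5 * b_time
b_time_30min               # 1.79 points per additional 30 minutes
```

So: for every additional 30 minutes spent with friends, clubhouse happiness increases by about 1.79 points, after adjusting for goofy laughs and Donald grumbles.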
The intercept is the average of the outcome when all predictors in the model are equal to 0. In our example, the intercept is the average clubhouse happiness on a day with 0 hours with friends, 0 goofy laughs, and 0 Donald grumbles:

```
(Intercept)
   47.57612
```
Recall confidence intervals – they allow us to determine how “good” our estimation is.
In general, CIs will take the form
point estimate \pm margin of error
The margin of error is a critical value (e.g., t_{1-\alpha/2}) multiplied by the standard error of the point estimate.
Recall that the standard error accounts for the sample size.
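As a sketch of "point estimate \pm margin of error," we could assemble a CI by hand from the model object m fit above, using the t critical value and the slope's standard error:

```r
est  <- coef(m)["time_with_friends"]              # point estimate
se   <- sqrt(diag(vcov(m)))["time_with_friends"]  # standard error
crit <- qt(0.975, df = df.residual(m))            # t critical value for a 95% CI
c(lower = est - crit * se, upper = est + crit * se)
```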
In R, we will run the model results through the tidy() function, but ask for the confidence intervals.
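With broom, `conf.int = TRUE` adds the interval columns, and `conf.level` controls the level. This continues with the model m fit above:

```r
tidy(m, conf.int = TRUE)                     # 95% CIs (the default level)
tidy(m, conf.int = TRUE, conf.level = 0.99)  # 99% CIs
```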
We have the following CIs at both the 95% and 99% levels:
Let’s compare the two sets of CIs.
95% CIs for \beta_i: time with friends (2.37, 4.78); goofy laughs (0.52, 0.80); Donald grumbles (-1.34, -0.79)
99% CIs for \beta_i:
H_0: \ \beta_1 = \beta_2 = ... = \beta_k = 0
What does it mean if the slopes are flat?
We will use hypothesis testing to determine if at least one slope is non-zero.
This is often referred to as the omnibus F-test in linear regression.
We will instead use the likelihood ratio test.
For any model constructed, we can compute deviance.
To determine the test statistic, we examine the difference between two models: the full model (with all predictors of interest) and the reduced model (without them; here, intercept-only).
Then, we look at the difference between the deviances of the full and reduced models.
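As a sketch (for Gaussian models, software typically scales the deviances by an estimate of the dispersion), the test statistic takes the form

$$G^2 = D_{\text{reduced}} - D_{\text{full}} \ \overset{\cdot}{\sim} \ \chi^2_{df}, \qquad df = (\text{parameters in full model}) - (\text{parameters in reduced model}),$$

where large values of G^2 indicate that the full model fits substantially better than the reduced model.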
Hypotheses
Test Statistic and p-Value
Rejection Region
Conclusion/Interpretation
We will compare models to determine the significance of the line.
In R,
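A sketch of this comparison in R, assuming the full model m fit earlier; the reduced model is intercept-only, which is what H_0 claims:

```r
# reduced model: intercept only (all slopes set to 0 under the null)
reduced <- glm(clubhouse_happiness ~ 1, family = gaussian, data = clubhouse)

# likelihood ratio (deviance) test of reduced vs. full
anova(reduced, m, test = "LRT")
```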
Hypotheses
Test Statistic and p-Value
Rejection Region
Conclusion/Interpretation
For the individual predictors, we use the test statistics and p-values reported by tidy().
Hypotheses
Test Statistic and p-Value
Rejection Region
Conclusion/Interpretation
Hypotheses
Test Statistic and p-Value
Rejection Region
Conclusion/Interpretation
Hypotheses
Test Statistic and p-Value
Rejection Region
Conclusion/Interpretation
| Predictor | \hat{\beta}_i (95% CI for \beta_i) | p-value |
|---|---|---|
| time with friends | 3.58 (2.37, 4.78) | < 0.001 |
| goofy laughs | 0.65 (0.52, 0.80) | < 0.001 |
| Donald grumbles | -1.06 (-1.34, -0.79) | < 0.001 |
This lecture reviewed the key components of linear regression: fitting the model with glm(), interpreting the slopes and intercept, constructing confidence intervals, and testing hypotheses (both the omnibus test and tests of the individual predictors).
We will continue reminding ourselves of these concepts as we move into other types of regression models.
In the next lecture, we will review how to visualize our models.