Module 3 Review

Putting It All Together

Relationships between continuous variables.
Correlation:
- Unitless measure of linear (Pearson’s) or monotonic (Spearman’s) pairwise relationships.
  - Does not adjust for other variables.
- Quantifies strength and direction of relationship.
  - Allows us to compare strength of relationships across different pairs of variables.
Linear regression:
- Models relationship between continuous outcome and one or more predictors.
  - Units are attached, so the relationship can be stated in context.
- Provides estimates of association adjusted for other variables in the model.
- Allows for prediction of outcome based on predictor values.
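
For reference, the two tools summarized above can be written out in standard textbook notation. The notation is an assumption here, since this review does not define symbols for them. Pearson's r divides the co-variation of x and y by the product of their individual variations, so the units cancel. Each regression slope \beta_j keeps the units of the outcome per one unit of predictor x_j.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon
```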

Example

During a puzzle-heavy session in Ponyville, Twilight wants to understand what helps small teams solve logic puzzles faster. For each puzzle, she records:
- MindPractice: hours of individual “mind” drills the day before.
- SleepHours: hours slept the night before.
- FocusMinutes: minutes of calm focus/box-breathing right before starting.
- PuzzleComplexity: a continuous difficulty score Twilight assigns from 1–10.
- SolveTime: minutes to solve the puzzle (lower is better).
Our task is to determine how MindPractice, SleepHours, FocusMinutes, and PuzzleComplexity relate to SolveTime.

Example

Let’s first look at pairwise scatterplots.
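
The pairwise scatterplots might be produced along these lines. This is a sketch, not necessarily the code used in the course. The real puzzles data are not included in this review, so the block first simulates a stand-in data set. The sample size (n = 150) and the residual SD (5) are assumptions; the means, SDs, and coefficients are borrowed from the summaries reported later in this review.

```r
library(tidyverse)

# --- simulated stand-in for the real `puzzles` data (assumption) ---
set.seed(1)
n <- 150 # hypothetical sample size; the source does not state n
puzzles <- data.frame(
  MindPractice     = rnorm(n, 2.9, 1.0),
  SleepHours       = rnorm(n, 7.1, 1.2),
  FocusMinutes     = rnorm(n, 10.6, 4.7),
  PuzzleComplexity = rnorm(n, 6.0, 1.9)
)
puzzles$SolveTime <- 37.54 - 4.12 * puzzles$MindPractice -
  1.82 * puzzles$SleepHours - 0.02 * puzzles$FocusMinutes +
  3.10 * puzzles$PuzzleComplexity + rnorm(n, 0, 5) # noise SD is assumed

# stack the four predictors so each gets its own panel
long <- puzzles %>%
  pivot_longer(cols = -SolveTime, names_to = "predictor", values_to = "value")

# one scatterplot of SolveTime against each predictor
p <- long %>%
  ggplot(aes(x = value, y = SolveTime)) +
  geom_point(color = "gray50") +
  facet_wrap(~ predictor, scales = "free_x") +
  labs(x = "Predictor value", y = "Time to Complete Puzzle (min)") +
  theme_bw()
p
```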

Example

Next, let’s determine which method of correlation we should use by checking normality.

Example

Because we met the normality assumption, we can use Pearson’s correlation to quantify pairwise relationships.
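
This review does not show how normality was checked. One common approach, offered only as a sketch, is to run a Shapiro-Wilk test on each variable; a QQ plot from qqnorm() is a common visual alternative. The data below are a simulated stand-in for puzzles, with n = 150 and a residual SD of 5 assumed.

```r
# --- simulated stand-in for the real `puzzles` data (assumption) ---
set.seed(1)
n <- 150 # hypothetical sample size
puzzles <- data.frame(
  MindPractice     = rnorm(n, 2.9, 1.0),
  SleepHours       = rnorm(n, 7.1, 1.2),
  FocusMinutes     = rnorm(n, 10.6, 4.7),
  PuzzleComplexity = rnorm(n, 6.0, 1.9)
)
puzzles$SolveTime <- 37.54 - 4.12 * puzzles$MindPractice -
  1.82 * puzzles$SleepHours - 0.02 * puzzles$FocusMinutes +
  3.10 * puzzles$PuzzleComplexity + rnorm(n, 0, 5)

# Shapiro-Wilk p-value for every variable; small p-values suggest non-normality
vars <- c("MindPractice", "SleepHours", "FocusMinutes",
          "PuzzleComplexity", "SolveTime")
shapiro_p <- sapply(puzzles[vars], function(x) shapiro.test(x)$p.value)
round(shapiro_p, 3)
```

If normality looked doubtful for a pair of variables, Spearman’s correlation would be the fallback, as noted at the start of this review.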

Mind practice is significantly correlated with solve time (p < 0.001).
- The correlation is negative and moderate (r \approx -0.42), indicating that more mind practice is associated with shorter solve times (i.e., puzzles are solved faster).

Sleep duration is significantly correlated with solve time (p < 0.001).
- The correlation is negative and weak (r \approx -0.27), indicating that more sleep is associated with shorter solve times (i.e., puzzles are solved faster).

Focus is not significantly correlated with solve time (p = 0.219).
- The sample correlation is negative and weak (r \approx -0.10), but we do not have sufficient evidence of a linear relationship between focus and solve time.

Puzzle complexity is significantly correlated with solve time (p < 0.001).
- The correlation is positive and moderate (r \approx 0.64), the strongest of the four pairwise relationships, indicating that higher complexity is associated with longer solve times (i.e., puzzles are solved more slowly).
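
The four correlations above could be computed with base R’s cor.test(). This is a sketch under assumptions: puzzles is simulated here (n = 150 and residual SD 5 are made up), so the printed r and p-values will not match the ones reported above.

```r
# --- simulated stand-in for the real `puzzles` data (assumption) ---
set.seed(1)
n <- 150 # hypothetical sample size
puzzles <- data.frame(
  MindPractice     = rnorm(n, 2.9, 1.0),
  SleepHours       = rnorm(n, 7.1, 1.2),
  FocusMinutes     = rnorm(n, 10.6, 4.7),
  PuzzleComplexity = rnorm(n, 6.0, 1.9)
)
puzzles$SolveTime <- 37.54 - 4.12 * puzzles$MindPractice -
  1.82 * puzzles$SleepHours - 0.02 * puzzles$FocusMinutes +
  3.10 * puzzles$PuzzleComplexity + rnorm(n, 0, 5)

# Pearson correlation of each predictor with SolveTime
predictors <- c("MindPractice", "SleepHours", "FocusMinutes", "PuzzleComplexity")
results <- do.call(rbind, lapply(predictors, function(v) {
  ct <- cor.test(puzzles[[v]], puzzles$SolveTime, method = "pearson")
  data.frame(predictor = v, r = unname(ct$estimate), p_value = ct$p.value)
}))
results
```

Switching to method = "spearman" would give the rank-based version for data that fail the normality check.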

Example

Let’s now examine the multiple regression model.

This gives the fitted regression model:

\hat{\text{solve time}} = 37.54 - 4.12 \text{ practice} - 1.82 \text{ sleep} - 0.02 \text{ focus} + 3.10 \text{ complexity}

Example

If I were reporting the table of results, I would submit:

Predictor	Estimate (95% CI)	p-value
Mind Practice	-4.12 (-5.02, -3.22)	<0.001
Sleep Hours	-1.82 (-2.58, -1.06)	<0.001
Focus Minutes	-0.02 (-0.21, 0.17)	0.857
Puzzle Complexity	3.10 (2.63, 3.57)	<0.001
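
A table in this layout could be assembled from a fitted model roughly as follows. This is a sketch: the review does not show the fitting call, so lm() with confint() is an assumption, and puzzles is simulated here (n = 150, residual SD 5), so the numbers will differ from the table above.

```r
# --- simulated stand-in for the real `puzzles` data (assumption) ---
set.seed(1)
n <- 150 # hypothetical sample size
puzzles <- data.frame(
  MindPractice     = rnorm(n, 2.9, 1.0),
  SleepHours       = rnorm(n, 7.1, 1.2),
  FocusMinutes     = rnorm(n, 10.6, 4.7),
  PuzzleComplexity = rnorm(n, 6.0, 1.9)
)
puzzles$SolveTime <- 37.54 - 4.12 * puzzles$MindPractice -
  1.82 * puzzles$SleepHours - 0.02 * puzzles$FocusMinutes +
  3.10 * puzzles$PuzzleComplexity + rnorm(n, 0, 5)

# fit the multiple regression model
fit <- lm(SolveTime ~ MindPractice + SleepHours + FocusMinutes + PuzzleComplexity,
          data = puzzles)

# pull estimates, 95% confidence limits, and p-values
est   <- coef(fit)
ci    <- confint(fit, level = 0.95)
pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]

# format as "Estimate (95% CI)" with p-values floored at <0.001
tab <- data.frame(
  Predictor   = names(est),
  Estimate_CI = sprintf("%.2f (%.2f, %.2f)", est, ci[, 1], ci[, 2]),
  p_value     = ifelse(pvals < 0.001, "<0.001", sprintf("%.3f", pvals))
)
tab[-1, ] # drop the intercept row, as in the reported table
```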

Example

Before we make inference on each predictor, let’s examine the significance of the regression model.

Likelihood Ratio Test for Significant Regression Line:
Null: H₀: β₁ = β₂ = ... = βₖ = 0
Alternative: H₁: At least one βᵢ ≠ 0
Test statistic: χ²(4) = 8661.511
p-value: p < 0.001
Conclusion: Reject the null hypothesis (p = < 0.001 < α = 0.05)

Example

Hypotheses
- H_0: \beta_{\text{mind}} = \beta_{\text{sleep}} = \beta_{\text{focus}} = \beta_{\text{complex}} = 0
- H_1: at least one \beta_i \ne 0
Test Statistic and p-Value
- \chi^2_0 = 8661.51; p < 0.001
Rejection Region
- Reject H_0 if p<\alpha; \alpha=0.05
Conclusion and Interpretation
- Reject H_0. There is sufficient evidence to conclude that at least one of the predictors is significantly associated with solve time, after adjusting for the other predictors.
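
The test above compares the full model against an intercept-only model. One way to sketch that comparison in base R is shown below. The call used in the course is not shown in this review, so glm() with anova(..., test = "LRT") is an assumption. The data are simulated (n = 150, residual SD 5), so the statistic will not match the 8661.51 reported above.

```r
# --- simulated stand-in for the real `puzzles` data (assumption) ---
set.seed(1)
n <- 150 # hypothetical sample size
puzzles <- data.frame(
  MindPractice     = rnorm(n, 2.9, 1.0),
  SleepHours       = rnorm(n, 7.1, 1.2),
  FocusMinutes     = rnorm(n, 10.6, 4.7),
  PuzzleComplexity = rnorm(n, 6.0, 1.9)
)
puzzles$SolveTime <- 37.54 - 4.12 * puzzles$MindPractice -
  1.82 * puzzles$SleepHours - 0.02 * puzzles$FocusMinutes +
  3.10 * puzzles$PuzzleComplexity + rnorm(n, 0, 5)

# null model: intercept only (all slopes equal to 0)
null_fit <- glm(SolveTime ~ 1, data = puzzles, family = gaussian)

# full model: all four predictors
full_fit <- glm(SolveTime ~ MindPractice + SleepHours + FocusMinutes +
                  PuzzleComplexity, data = puzzles, family = gaussian)

# likelihood ratio test of the null model against the full model (4 df)
lrt <- anova(null_fit, full_fit, test = "LRT")
lrt
```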

Example

Now, let’s provide some interpretations and comment on significance.

Predictor	Estimate (95% CI)	p-value
Mind Practice	-4.12 (-5.02, -3.22)	<0.001
Sleep Hours	-1.82 (-2.58, -1.06)	<0.001
Focus Minutes	-0.02 (-0.21, 0.17)	0.857
Puzzle Complexity	3.10 (2.63, 3.57)	<0.001

Mind practice is significantly associated with solve time (p < 0.001). After adjusting for sleep, focus, and complexity, each additional hour of mind practice is associated with an average decrease in solve time of approximately 4.12 minutes (95% CI for the slope: -5.02, -3.22).

Sleep hours is significantly associated with solve time (p < 0.001). After adjusting for mind practice, focus, and complexity, each additional hour of sleep is associated with an average decrease in solve time of approximately 1.82 minutes (95% CI for the slope: -2.58, -1.06).

Focus minutes is not significantly associated with solve time (p = 0.857). After adjusting for mind practice, sleep, and complexity, the estimated change in solve time for each additional minute of focus is -0.02 minutes (95% CI: -0.21, 0.17); the interval includes 0, so we cannot distinguish this from no association.

Puzzle complexity is significantly associated with solve time (p < 0.001). After adjusting for mind practice, sleep, and focus, each one-unit increase in puzzle complexity is associated with an average increase in solve time of approximately 3.10 minutes (95% CI: 2.63, 3.57).
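
Because the model is linear, these per-unit interpretations scale to larger changes. As an illustration (the two-hour change is a made-up example, not from the data), two additional hours of mind practice, with the other predictors held fixed, would correspond to:

```latex
2 \times (-4.12) = -8.24 \text{ minutes}, \qquad
95\%\ \text{CI: } \bigl(2 \times (-5.02),\; 2 \times (-3.22)\bigr) = (-10.04,\; -6.44)
```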

Example

Let’s create a visualization of the model.
- We know that solve time will be on the y-axis.
- We can choose one predictor to put on the x-axis; let’s choose sleep duration.
- The other predictors will be set to their median values.

# A tibble: 5 × 3
  variable         mean_sd    median_iqr 
  <chr>            <chr>      <chr>      
1 FocusMinutes     10.6 (4.7) 10.7 (6.8) 
2 MindPractice     2.9 (1.0)  2.8 (1.4)  
3 PuzzleComplexity 6.0 (1.9)  5.9 (3.0)  
4 SleepHours       7.1 (1.2)  7.1 (1.6)  
5 SolveTime        30.9 (9.4) 31.1 (12.5)

Example

Our first step is to create predicted values using the fitted regression model.
- Let sleep duration vary.
- Plug in medians for other predictors.

\hat{\text{solve time}} = 37.54 - 4.12 \text{ practice} - 1.82 \text{ sleep} - 0.02 \text{ focus} + 3.10 \text{ complexity}
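
As a worked check, we can plug the medians from the summary table into the equation above (practice 2.8, focus 10.7, complexity 5.9). These are the rounded values shown in the table, so the result is approximate.

```latex
\hat{\text{solve time}}
  = 37.54 - 4.12(2.8) - 1.82 \text{ sleep} - 0.02(10.7) + 3.10(5.9)
  = 37.54 - 11.536 - 0.214 + 18.29 - 1.82 \text{ sleep}
  = 44.08 - 1.82 \text{ sleep}
```

At the median sleep duration of 7.1 hours this gives 44.08 - 1.82(7.1) \approx 31.16 minutes, close to the median solve time of 31.1 in the summary table.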

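library(tidyverse) # provides %>%, mutate(), and ggplot()

# hold practice, focus, and complexity at their medians; let sleep vary
puzzles <- puzzles %>%
  mutate(predicted = 37.54 - 4.12 * median(MindPractice)
                           - 1.82 * SleepHours
                           - 0.02 * median(FocusMinutes)
                           + 3.10 * median(PuzzleComplexity))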
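
An alternative sketch builds the same kind of predictions with predict(), which uses the unrounded coefficients from the fitted model instead of hand-typed ones. The lm() fit and the simulated puzzles data (n = 150, residual SD 5) are assumptions, as is the helper data frame named grid.

```r
# --- simulated stand-in for the real `puzzles` data (assumption) ---
set.seed(1)
n <- 150 # hypothetical sample size
puzzles <- data.frame(
  MindPractice     = rnorm(n, 2.9, 1.0),
  SleepHours       = rnorm(n, 7.1, 1.2),
  FocusMinutes     = rnorm(n, 10.6, 4.7),
  PuzzleComplexity = rnorm(n, 6.0, 1.9)
)
puzzles$SolveTime <- 37.54 - 4.12 * puzzles$MindPractice -
  1.82 * puzzles$SleepHours - 0.02 * puzzles$FocusMinutes +
  3.10 * puzzles$PuzzleComplexity + rnorm(n, 0, 5)

fit <- lm(SolveTime ~ MindPractice + SleepHours + FocusMinutes + PuzzleComplexity,
          data = puzzles)

# grid (hypothetical name): sleep varies, other predictors fixed at their medians
grid <- data.frame(
  SleepHours       = seq(min(puzzles$SleepHours), max(puzzles$SleepHours),
                         length.out = 50),
  MindPractice     = median(puzzles$MindPractice),
  FocusMinutes     = median(puzzles$FocusMinutes),
  PuzzleComplexity = median(puzzles$PuzzleComplexity)
)

# predicted solve time along the sleep grid
grid$predicted <- predict(fit, newdata = grid)
head(grid)
```

The grid can then be passed to geom_line(data = grid, aes(y = predicted)) in place of the predicted column computed by hand.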

Example

Then we can graph.
- Scatterplot: x = SleepHours, y = SolveTime
- Line: x = SleepHours, y = predicted

puzzles %>% ggplot(aes(x = SleepHours, y = SolveTime)) + # specify x and y
  geom_point(size = 2.5, color = "gray50") + # plot points
  geom_line(aes(y = predicted), linewidth = 1, color = "black") + # plot line
  labs(x = "Sleep Duration (hours)", # edit x-axis label
       y = "Time to Complete Puzzle (min)") + # edit y-axis label
  theme_bw() # change theme of graph

Example

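The resulting graph plots observed solve times against sleep duration, with the fitted regression line drawn at the median values of the other predictors.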

Wrap Up

This module covers the basics of linear regression.
As a reminder, this is just the beginning. There are many more advanced topics to explore, including:
- Interaction terms
- Nonlinear relationships
- Model selection
- Diagnostics and remedial measures
STA4231 - Statistics for Data Science II dives deeper into regression topics.