y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon
In this model, \varepsilon is the residual error term.
Recall that the residual error (how far away the observation is from the response surface) is defined as
\varepsilon = y - \hat{y}
Note that
y is the observed value
\hat{y} is the predicted value
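The residual definition above can be sketched in a few lines of plain Python (not the ssstats package; the data and the fitted coefficients \hat{y} = 2 + 0.5x below are made up for illustration):

```python
# Minimal sketch: residuals from a hypothetical fitted model y-hat = 2 + 0.5*x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.4, 3.1, 3.4, 4.2]          # observed values y
y_hat = [2 + 0.5 * x for x in xs]  # predicted values y-hat
residuals = [y - yh for y, yh in zip(ys, y_hat)]  # epsilon-hat = y - y-hat
print(residuals)
```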
To ensure the validity of the least squares estimation, we have several assumptions:
Homogeneity of variance
Independence of observations
Linearity
Existence
Gaussian errors
Think: HILE+G
Homogeneity of variance (also called homoscedasticity) means that the variance of the residuals is constant across all levels of the predictors.
If the variance of the residuals changes at different levels of the predictors, we have heteroscedasticity.
We will check this assumption graphically using a scatterplot of residuals vs. fitted values.
We do not want to see a pattern.
A “funnel” or “open fan” shape indicates heteroscedasticity.
A general “cloud” shape indicates homoscedasticity.
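The funnel-versus-cloud distinction can also be made numerically. A quick sketch in plain Python (not the ssstats package; the simulated data and the "compare the spread in the two halves" rule are illustrative only):

```python
import random
import statistics

random.seed(1)
n = 2000
x = [random.uniform(0, 10) for _ in range(n)]

# Homoscedastic errors: constant spread regardless of x ("cloud").
res_homo = [random.gauss(0, 1) for _ in range(n)]
# Heteroscedastic errors: spread grows with x ("funnel" in a residual plot).
res_hetero = [random.gauss(0, 0.2 + 0.3 * xi) for xi in x]

def spread_by_half(xs, res):
    """Residual standard deviation for low-x vs. high-x observations."""
    low = [r for xi, r in zip(xs, res) if xi < 5]
    high = [r for xi, r in zip(xs, res) if xi >= 5]
    return statistics.stdev(low), statistics.stdev(high)

print(spread_by_half(x, res_homo))    # roughly equal spreads
print(spread_by_half(x, res_hetero))  # high-x spread much larger
```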
We will use the ssstats package to check assumptions.
For variance, we can use the variance_check() function.
m1: \hat{y} = 3 + 1.5 x + \hat{\varepsilon}
m2: \hat{y} = 4 + 1.2 x_1 - 0.8 x_2 + \hat{\varepsilon}
m3: \hat{y} = 5 + 1.1 x_1 - 0.9 x_2 + \hat{\varepsilon}
m4: \hat{y} = 4 + 1.0 x_1 - 0.7 x_2 + \hat{\varepsilon}
m5: \hat{y} = 3 + 1.3 x_1 - 0.9 x_2 + \hat{\varepsilon}
m6: \hat{y} = 2 + 1.1 x_1 - 0.8 x_2 + \hat{\varepsilon}
We also assume that our observations are independent.
Examples of independent data: measurements on distinct subjects drawn in a simple random sample; responses from unrelated survey participants.
Examples of dependent data: repeated measurements on the same subject over time; observations taken at nearby locations or time points.
Violations of independence often occur in longitudinal, time series, and spatial data.
When analyzing correlated data, the methodology learned in this course is not appropriate.
Honestly, it’s really not a big deal: we include a covariance structure in the model to account for dependence.
However, this methodology is beyond the scope of this course – we will focus only on independent data.
What about squishy cases?
We know that perfect independence may not be achievable in practice.
Mild violations of independence may not severely impact results, but this is context-dependent.
Ultimately, we do the best with what we can and notate any potential limitations in our analysis.
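For time-ordered residuals, one simple numeric check of independence is the lag-1 autocorrelation of the residual sequence (this diagnostic is not part of this course's graphical toolkit; the AR(1)-style simulation below is purely illustrative):

```python
import random

def lag1_autocorr(res):
    """Sample lag-1 autocorrelation of a residual sequence."""
    m = sum(res) / len(res)
    num = sum((res[i] - m) * (res[i - 1] - m) for i in range(1, len(res)))
    den = sum((r - m) ** 2 for r in res)
    return num / den

random.seed(2)
# Independent residuals: autocorrelation near 0.
indep = [random.gauss(0, 1) for _ in range(1000)]
# Dependent residuals: each value carries over 0.8 of the previous one,
# as in many time series.
dep = [0.0]
for _ in range(999):
    dep.append(0.8 * dep[-1] + random.gauss(0, 1))

print(lag1_autocorr(indep))  # near 0
print(lag1_autocorr(dep))    # well above 0
```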
We assume that the relationship between each predictor and the response is linear in the parameters (i.e., \beta_i).
In this course, all of the models we will construct are linear models.
The language gets tricky because we can have non-linear relationships in a linear model.
For example, we can include polynomial terms (e.g., x^2, x^3) or interaction terms (e.g., x_1 \times x_2) in a linear model.
As long as the model is linear in the parameters, it is considered a linear model.
Consider the following models – which are linear in \beta?
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \hat{\varepsilon}
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \hat{\varepsilon}
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 x_3 + \hat{\varepsilon}
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2^2 x_2 + \hat{\varepsilon}
\hat{y} = \beta_0 + \beta_1 \log(x_1) + \beta_2 x_2 + \hat{\varepsilon}
\hat{y} = \beta_0 + \log(\beta_1) x_1 + \beta_2 x_2 + \hat{\varepsilon}
Consider the following models – which are linear in \beta?
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \hat{\varepsilon} – linear
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \hat{\varepsilon} – linear
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 x_3 + \hat{\varepsilon} – linear
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2^2 x_2 + \hat{\varepsilon} – broken by \beta_2^2
\hat{y} = \beta_0 + \beta_1 \log(x_1) + \beta_2 x_2 + \hat{\varepsilon} – linear
\hat{y} = \beta_0 + \log(\beta_1) x_1 + \beta_2 x_2 + \hat{\varepsilon} – broken by \log(\beta_1)
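"Linear in the parameters" means the prediction is a dot product between a fixed feature vector (which may contain transformed predictors) and the coefficient vector. A small sketch in plain Python (the coefficient values are made-up numbers for illustration):

```python
# A quadratic-with-interaction model is still linear in beta:
# y-hat = b0 + b1*x1 + b2*x1^2 + b3*x1*x2.

def features(x1, x2):
    """Design-matrix row: the predictors are transformed, but the
    model remains a linear combination of the betas."""
    return [1.0, x1, x1 ** 2, x1 * x2]

beta = [2.0, 1.5, -0.3, 0.4]  # hypothetical fitted coefficients

def predict(x1, x2):
    return sum(b * f for b, f in zip(beta, features(x1, x2)))

print(predict(2.0, 3.0))  # 2 + 3 - 1.2 + 2.4
```

A model like \hat{y} = \beta_0 + \beta_2^2 x_2 cannot be written this way, because the coefficient itself is transformed; that is what breaks linearity in \beta.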
Existence means that the residuals have a finite mean and variance.
This assumption is almost always satisfied in practical applications.
This is not an assumption we check – we will see estimation errors when this assumption is violated.
Note! The normal distribution is also known as the Gaussian distribution.
For our inference to be valid, we must assume that the residuals are normally distributed.
We will check this assumption using a quantile-quantile (q-q) plot of the residuals.
We want to see points roughly along the line.
Large deviations from the line indicate non-normality.
We can also use a histogram to “back up” our decision from the q-q plot.
We want to see it roughly mound-shaped and symmetric.
Skewed or multi-modal histograms indicate non-normality.
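The q-q plot idea can be expressed numerically: correlate the sorted residuals with the theoretical normal quantiles, and values near 1 mean the points hug the line. A stdlib Python sketch (not the ssstats package; the simulated residuals are illustrative):

```python
import random
import statistics

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def qq_correlation(res):
    """Correlation between sorted residuals and theoretical normal quantiles."""
    n = len(res)
    theo = [statistics.NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return pearson(sorted(res), theo)

random.seed(3)
normal_res = [random.gauss(0, 2) for _ in range(500)]       # roughly normal
skewed_res = [random.expovariate(1.0) for _ in range(500)]  # right-skewed

print(qq_correlation(normal_res))  # close to 1
print(qq_correlation(skewed_res))  # noticeably lower
```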
We will use the ssstats package to check assumptions.
For normality, we can use the normality_check() function.
m1:\hat{y} = 3 + 1.5 x + \hat{\varepsilon}
m2:\hat{y} = 4 + 1.2 x_1 - 0.8 x_2 + \hat{\varepsilon}
m3:\hat{y} = 5 + 1.1 x_1 - 0.9 x_2 + \hat{\varepsilon}
m4:\hat{y} = 4 + 1.0 x_1 - 0.7 x_2 + \hat{\varepsilon}
m5:\hat{y} = 3 + 1.3 x_1 - 0.9 x_2 + \hat{\varepsilon}
m6:\hat{y} = 2 + 1.1 x_1 - 0.8 x_2 + \hat{\varepsilon}
\varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2)
This notation combines our checkable assumptions:
Residuals are independent and identically distributed (I)
Residuals follow a normal distribution (G), centered at 0 and with some constant variance, \sigma^2 (H).
We can look at a graph with all of these checks at once using the reg_check() function in the ssstats package.
Formal tests for normality (e.g., Shapiro-Wilk test, Kolmogorov-Smirnov test) are generally not recommended in practice.
These tests are sensitive to sample size: with large samples they reject normality even for minor departures that do not affect our inferential conclusions.
Instead, graphical methods (q-q plots, histograms) and practical considerations are preferred for assessing normality.
This lecture has covered the assumptions on the linear model for continuous outcomes assuming independent data and Gaussian errors.
My approach in “real life” is:
Fit the model.
Check assumptions using graphical diagnostics.
If assumptions are violated, consider an alternative approach.
Document any assumption violations and their potential impact on results.
Discuss findings (including relevant failed assumption checks) with collaborators/stakeholders.
Next lecture: Model Diagnostics