STA4173: Biostatistics
Spring 2025
We previously discussed testing three or more means using ANOVA.
We also discussed that ANOVA is an extension of the two-sample t-test.
Recall that the t-test has two assumptions:
Equal variance between groups.
Normal distribution.
We will extend our knowledge of checking assumptions today.
y_{ij} = \mu + \tau_i + \varepsilon_{ij}
where y_{ij} is the j-th observation in group i, \mu is the overall mean, \tau_i is the effect of group (treatment) i, and \varepsilon_{ij} is the random error term.
We assume that the error term follows a normal distribution with mean 0 and a constant variance, \sigma^2. i.e., \varepsilon_{ij} \overset{\text{iid}}{\sim} N(0, \sigma^2)
Very important note: the assumption is on the error term and NOT on the outcome!
We will use the residual (the difference between the observed value and the predicted value) to assess assumptions: e_{ij} = y_{ij} - \hat{y}_{ij}
Normality: quantile-quantile plot
Variance: scatterplot of the residuals against the predicted values
Like with t-tests, we will assess these assumptions graphically.
We will return to the classpackage package and use the anova_check() function.
library(tidyverse)

# Bonding strength measurements for four adhesive systems (n = 5 per group)
strength <- c(15.4, 12.9, 17.2, 16.6, 19.3,
              17.2, 14.3, 17.6, 21.6, 17.5,
               5.5,  7.7, 12.2, 11.4, 16.4,
              11.0, 12.4, 13.5,  8.9,  8.1)
system <- c(rep("Cojet", 5), rep("Silistor", 5), rep("Cimara", 5), rep("Ceramic", 5))
data <- tibble(system, strength)

# Fit the one-way ANOVA model and summarize
m <- aov(strength ~ system, data = data)
summary(m)
Df Sum Sq Mean Sq F value Pr(>F)
system 3 200.0 66.66 7.545 0.00229 **
Residuals 16 141.4 8.84
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
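A minimal sketch of the graphical checks. The commented anova_check() call is an assumption about the course helper's signature (only the function name appears in the lecture); the base-R lines reproduce the two plots described above.
# Assumed usage of the course helper (signature not shown in the lecture):
# library(classpackage)
# anova_check(m)

# Equivalent base-R diagnostics on the fitted model m:
qqnorm(residuals(m))             # normality: quantile-quantile plot
qqline(residuals(m))
plot(fitted(m), residuals(m))    # constant variance: residuals vs. predicted
abline(h = 0, lty = 2)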
We can formally check the variance assumption with the Brown-Forsythe-Levene (BFL) test.
The test statistic is calculated as follows, F_0 = \frac{\sum_{i=1}^k n_i (\bar{z}_i - \bar{z})^2/(k-1)}{\sum_{i=1}^k \sum_{j=1}^{n_i}(z_{ij}-\bar{z}_i)^2/(n-k)}, where z_{ij} = |y_{ij} - \tilde{y}_i|, \tilde{y}_i is the median of group i, \bar{z}_i is the mean of the z_{ij} in group i, and \bar{z} is the overall mean of the z_{ij}.
Hypotheses: H_0: \sigma^2_1 = \sigma^2_2 = \cdots = \sigma^2_k vs. H_1: at least one \sigma^2_i differs.
Test Statistic: F_0, as defined above.
p-Value: P(F_{k-1, n-k} \ge F_0).
Rejection Region: reject H_0 if p < \alpha (equivalently, if F_0 > F_{\alpha, k-1, n-k}).
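As a sketch (not part of the lecture code): F_0 is simply the one-way ANOVA F statistic applied to the absolute deviations from the group medians, so it can be computed directly.
med <- tapply(strength, system, median)   # group medians
z   <- abs(strength - med[system])        # z_ij = |y_ij - group i median|
summary(aov(z ~ system))                  # the F value here is F_0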
We will use the leveneTest() function from the car package; a sketch of the call appears after the outline below.
We will then detach the car package, because it overwrites a necessary function in tidyverse.
Hypotheses
Test Statistic and p-Value
Rejection Region
Conclusion/Interpretation
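A sketch of the call for these data (leveneTest() defaults to center = median, which gives the Brown-Forsythe version):
library(car)                                 # for leveneTest()
leveneTest(strength ~ system, data = data)   # center = median by default
detach("package:car", unload = TRUE)         # car masks a tidyverse function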
\varepsilon_{ij} \overset{\text{iid}}{\sim} N(0, \sigma^2)
We also discussed how to assess the assumptions:
Graphically, using the anova_check() function.
Formally checking the variance assumption using the BFL test.
If we break an assumption, we will turn to the nonparametric alternative, the Kruskal-Wallis.
When ANOVA assumptions are broken, we implement the nonparametric alternative: the Kruskal-Wallis test.
The Kruskal-Wallis test determines if k independent samples come from populations with the same distribution.
Our new hypotheses are H_0: the k populations all follow the same distribution vs. H_1: at least one population's distribution differs. The test statistic is
\chi^2_0 = \frac{12}{n(n+1)} \sum_{i=1}^k \frac{R_i^2}{n_i} - 3(n+1),
where n is the total sample size, n_i is the size of group i, and R_i is the sum of the ranks (over the combined sample) of the observations in group i.
The statistic \chi^2_0 (often denoted H) approximately follows a \chi^2 distribution with k-1 degrees of freedom.
Hypotheses: H_0: the k populations follow the same distribution vs. H_1: at least one population's distribution differs.
Test Statistic: \chi^2_0, as defined above.
p-Value: P(\chi^2_{k-1} \ge \chi^2_0).
Rejection Region: reject H_0 if p < \alpha (equivalently, if \chi^2_0 > \chi^2_{\alpha, k-1}).
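As a sketch, \chi^2_0 can be computed directly from the formula in base R (kruskal.test() additionally applies a tie correction, so results differ slightly when ties are present):
r      <- rank(strength)              # ranks of the combined sample
n      <- length(strength)            # total sample size
Ri     <- tapply(r, system, sum)      # rank sums R_i by group
ni     <- tapply(r, system, length)   # group sizes n_i
chisq0 <- 12 / (n * (n + 1)) * sum(Ri^2 / ni) - 3 * (n + 1)
pchisq(chisq0, df = length(Ri) - 1, lower.tail = FALSE)   # p-value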
We will use the kruskal.test() function to perform the Kruskal-Wallis test; a sketch of the call appears after the outline below.
Hypotheses
Test Statistic and p-Value
Rejection Region
Conclusion/Interpretation
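A sketch of the call for these data:
kruskal.test(strength ~ system, data = data)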
We can also perform posthoc testing in the Kruskal-Wallis setting.
The setup is just like Tukey's: we perform all pairwise comparisons while controlling the overall Type I error rate.
Instead of using |\bar{y}_i - \bar{y}_j|, we will use |\bar{R}_i - \bar{R}_j|, where \bar{R}_i is the average rank of group i.
The comparison we are making: conclude groups i and j differ if |\bar{R}_i - \bar{R}_j| \ge z_{\alpha/(k(k-1))} \sqrt{\frac{n(n+1)}{12}\left(\frac{1}{n_i}+\frac{1}{n_j}\right)}, where z_{\alpha/(k(k-1))} is the upper-tail critical value of the standard normal distribution.
We will use the kruskalmc() function from the pgirmess package to perform the Kruskal-Wallis post-hoc test; a sketch of the call follows the output below.
Multiple comparison test after Kruskal-Wallis
alpha: 0.01
Comparisons
obs.dif critical.dif stat.signif
Ceramic-Cimara 0.2 11.7637 FALSE
Ceramic-Cojet 7.9 11.7637 FALSE
Ceramic-Silistor 10.3 11.7637 FALSE
Cimara-Cojet 8.1 11.7637 FALSE
Cimara-Silistor 10.5 11.7637 FALSE
Cojet-Silistor 2.4 11.7637 FALSE
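The output above is consistent with a call along these lines (a sketch; probs = 0.01 matches the reported alpha):
library(pgirmess)                                        # for kruskalmc()
kruskalmc(strength ~ system, data = data, probs = 0.01)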
Today we have talked about assessing ANOVA assumptions and performing the nonparametric alternative, the Kruskal-Wallis.
As usual, we should only look at post-hoc tests when we have detected an overall difference with the Kruskal-Wallis.
Next lecture: two-way ANOVA.