Recall the general linear model, y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon
Until now, we have discussed continuous predictors.
Now we will introduce the use of categorical, or qualitative, predictors.
This means that we will include predictors that categorize the observations.
In this dataset (duck_incidents), we have access to several variables recorded in the incident reports.
Categorical variables can show up in datasets two ways:
As ordinal variables: there is a natural order to the categories.
As nominal variables: there is no natural order to the categories.
Further, they can be stored multiple ways in a dataset:
As character/factor variables.
As numeric variables (e.g., 1, 2, 3 for treatment groups A, B, C).
As binary indicator variables (i.e., 0/1 where 1 indicates “yes” for the characteristic).
When first sitting down to examine categorical variables, I look at frequency tables (using count() or something similar) to see what responses are possible.
We also want to evaluate the number of responses in each category.
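As a sketch (using a toy stand-in for duck_incidents, since the full data are not shown here), count() gives one row per category along with the number of observations in it:

```r
library(dplyr)

# Toy stand-in for the duck_incidents data (values are hypothetical)
duck_demo <- tibble(
  location = c("Backyard", "Garage", "Kitchen", "Backyard", "Living Room")
)

# One row per category, with the number of observations in each
duck_demo %>%
  count(location)
```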
All of the variables we just explored are stored as character (string) variables.
There are other ways to store variables in R (and other programs).
In R, we can convert character variables to factors using the factor() function.
We can check how a variable is stored using the class() function. The levels of the factor are stored in the variable as strings.
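A minimal base-R sketch of the conversion (the location values here are hypothetical):

```r
# A character vector of locations (hypothetical values)
location <- c("Backyard", "Garage", "Kitchen", "Backyard")
class(location)       # "character"

# Convert to a factor
location_f <- factor(location)
class(location_f)     # "factor"

# The levels are stored as strings, sorted alphabetically by default
levels(location_f)    # "Backyard" "Garage" "Kitchen"
```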
When we include factor variables, R defaults to the “first” level as the reference group.
There are (more than) two approaches we can take to “relevel” a variable.
Use the factor() function with the levels argument to set the order of the levels.
“Brute force” by defining a new character variable using if_else() statements and defining the levels as “1 - first category name”, “2 - second category name”, etc.
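A sketch of the first approach, using the levels argument to choose which category is the reference group (values hypothetical):

```r
location <- c("Backyard", "Garage", "Kitchen", "Backyard")

# Default: levels are alphabetical, so "Backyard" is the reference group
levels(factor(location))[1]   # "Backyard"

# Reorder so "Kitchen" comes first and becomes the reference group
location_f <- factor(location, levels = c("Kitchen", "Backyard", "Garage"))
levels(location_f)            # "Kitchen" "Backyard" "Garage"
```

Base R's relevel(x, ref = "...") is another common option when you only need to move one level to the front.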
We can create indicator (or dummy) variables to include in our model.
For a categorical (or factor) variable with c classes, we define binary indicators as follows:
x_i = \begin{cases} 1 & \textnormal{if category $i$} \\ 0 & \textnormal{if another category} \end{cases}
```r
duck_incidents <- duck_incidents %>%
  mutate(loc_yard    = if_else(location == "Backyard", 1, 0),
         loc_garage  = if_else(location == "Garage", 1, 0),
         loc_kitchen = if_else(location == "Kitchen", 1, 0),
         loc_living  = if_else(location == "Living Room", 1, 0))

duck_incidents %>%
  select(location,
         loc_yard, loc_garage, loc_kitchen, loc_living) %>%
  head()
```

Alternatively, the dummy_cols() function from the fastDummies package (loaded with library(fastDummies)) creates these indicator variables automatically. The manual approach and the fastDummies approach produce equivalent 0/1 indicators.
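A sketch of the fastDummies approach, assuming the package is installed (toy data stand-in):

```r
library(dplyr)
library(fastDummies)

duck_demo <- tibble(
  location = c("Backyard", "Garage", "Kitchen", "Living Room")
)

# dummy_cols() creates one 0/1 indicator column per category,
# named <variable>_<category> by default
duck_demo_dummies <- duck_demo %>%
  dummy_cols(select_columns = "location")

names(duck_demo_dummies)
```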
Sometimes, we will be interested in combining categories. Reasons include:
We can combine categories using either case_when() or if_else() statements.
| donald_reaction | n (pct) |
|---|---|
| Assigns Chores | 83 (18.4%) |
| Grounds | 128 (28.4%) |
| Laughs | 70 (15.6%) |
| Yells | 169 (37.6%) |
Although we have sufficient sample size in each group, let’s look at this from a different perspective: Donald took action vs. Donald did not take action.
I always look at a frequency table after combining categories to ensure everything looks correct.
In our example, “Assigns Chores” and “Grounds” count as taking action (83 + 128 = 211 punished), while “Laughs” and “Yells” do not (70 + 169 = 239 not punished).
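A sketch of this recode using if_else() on a toy stand-in; the grouping of reactions into “Punished” vs. “No Punishment” is inferred from the counts in the two tables (83 + 128 = 211 and 70 + 169 = 239):

```r
library(dplyr)

# Toy stand-in: one row per reaction category
reactions <- tibble(
  donald_reaction = c("Assigns Chores", "Grounds", "Laughs", "Yells")
)

# Combine the four reactions into two action categories
reactions <- reactions %>%
  mutate(donald_action = if_else(
    donald_reaction %in% c("Assigns Chores", "Grounds"),
    "Punished",
    "No Punishment"
  ))

# Frequency table after combining, to confirm the recode worked
reactions %>% count(donald_action)
```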
So far, we have focused on categorical predictors with more than two categories.
Sometimes, we have categorical predictors that are just two categories.
Our new variable, donald_action, is binary. In this case, we can include the variable as a factor or as a single indicator variable.
| donald_action | n (pct) |
|---|---|
| No Punishment | 239 (53.1%) |
| Punished | 211 (46.9%) |
Let’s now create a binary indicator for this variable.
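A sketch of the indicator, with a hypothetical column name (ind_punished) and toy data:

```r
library(dplyr)

actions <- tibble(
  donald_action = c("Punished", "No Punishment", "Punished")
)

# 1 = Punished, 0 = No Punishment (indicator name is hypothetical)
actions <- actions %>%
  mutate(ind_punished = if_else(donald_action == "Punished", 1, 0))

actions$ind_punished   # 1 0 1
```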
In this lecture, we have introduced the concept of categorical predictors.
We focused only on the data management associated with categorical predictors.
In the next lecture, we will focus on including categorical predictors in our linear models.