Code
ggplot(penguins) +
aes(x=species, y=body_mass_g) +
geom_point() +
ggtitle("Body mass of penguins by species") +
ylab("Mass (g)")
Categorical data are a common occurrence in regression modeling. These are any data that consist of categories rather than numbers. For instance, in a forestry experiment, you may thin the trees in some plots and do a prescribed burn in others. “Prescribed burn” and “thinning” aren’t numerical values, they are categories of treatment.
Sometimes categories have a natural order, like low, medium, high, which blurs the distinction between categorical and numerical data. We’ll also cover ways to handle these.
In R, categorical variables are called factors. Deep down in the machinery of a regression model, factor effects are handled the same way as continuous features. To fully appreciate this, you have to understand that factors are coded differently than continuous features. To begin, note that linear regression for factor variables is also a kind of scatterplot smoother. Let’s look at an example:
Returning to the palmerpenguins
dataset, the plot below shows the mass of penguins of three different species.
ggplot(penguins) +
aes(x=species, y=body_mass_g) +
geom_point() +
ggtitle("Body mass of penguins by species") +
ylab("Mass (g)")
Look carefully at the x-axis and you’ll see that the coordinates are species names, not numbers. A line drawn to fit the points would imply that there is an order to the species, and that there are some intermediate values between the species.
ggplot(penguins) +
aes(x=species, y=body_mass_g) +
geom_point() +
geom_abline(intercept=1800, slope=1200, color="blue") +
ggtitle("Body mass of penguins by species with linear fit") +
ylab("Mass (g)")
Clearly, we need to treat the species as categories, rather than as coordinates along a continuum. Looking at the summary of a regression model gives a clue as to how that works.
= lm(body_mass_g ~ species, data=penguins)
penguin_lm summary(penguin_lm)
Call:
lm(formula = body_mass_g ~ species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1126.02 -333.09 -33.09 316.91 1223.98
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3700.66 37.62 98.37 <2e-16 ***
speciesChinstrap 32.43 67.51 0.48 0.631
speciesGentoo 1375.35 56.15 24.50 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 462.3 on 339 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6697, Adjusted R-squared: 0.6677
F-statistic: 343.6 on 2 and 339 DF, p-value: < 2.2e-16
Now let’s see the mean masses of the three species.
group_by(penguins, species) |>
summarize(mass = mean(body_mass_g, na.rm=TRUE))
# A tibble: 3 × 2
species mass
<fct> <dbl>
1 Adelie 3701.
2 Chinstrap 3733.
3 Gentoo 5076.
Notice that the fitted value of the (Intercept)
is identical to the average mass of an Adelie penguin, and that there are two rows of species effects, instead of the one row that we saw for the continuous effects so far. The values of the Estimate
column in those rows are the estimated effects for Chinstrap and Gentoo penguins - and they are equal to the difference between the average mass of those penguins and Adelie penguins.
The reason that the results look like this can be made more clear if you look at the way R converts the categories to numbers. The model.matrix()
is the function that R uses internally to prepare data for an lm()
, but we can call it ourselves. Remember that linear regression works by multiplying each term by a coefficient, adding adding the results together. Here, we have three terms for the three species. Adelie has been automatically selected as the reference level because it appears first in the data.
tail(model.matrix(penguin_lm))
(Intercept) speciesChinstrap speciesGentoo
339 1 1 0
340 1 1 0
341 1 1 0
342 1 1 0
343 1 1 0
344 1 1 0
Here, we have three terms: (Intercept)
, speciesChinstrap
, and speciesGentoo
. So to calculate the mass that is estimated for row 12, we will add 1 * (Intercept) + 1 * speciesGentoo
. Since the effect (Intercept)
is the average mass of the Adelie penguins, the effect speciesGentoo
must be the difference between the mean mass of Gentoo penguins and the mean mass of Adelie penguins. Why no term for speciesAdelie
? It’s because the fit would then depend on how mass was apportioned between the penguin species and the intercept. You could increase the intercept by 10 grams and reduce all three species estimates by 10 grams and end up with the same model fit, despite different effects. The computer has no way of deciding between those options, because it is just minimizing the model error.
You don’t have to set one factor level to be the reference for estimation, but that goes beyond the scope of this introductory workshop. You may need to change which level is the reference, and that is within our scope. R has a function relevel()
, which takes the argument ref=
, which specifies the reference level of a factor. Here is how it would work to set Gentoo as the reference level:
$species = relevel(penguins$species, ref="Gentoo")
penguins= lm(body_mass_g ~ species, data=penguins)
releveled_penguin_model summary(releveled_penguin_model)
Call:
lm(formula = body_mass_g ~ species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1126.02 -333.09 -33.09 316.91 1223.98
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5076.02 41.68 121.78 <2e-16 ***
speciesAdelie -1375.35 56.15 -24.50 <2e-16 ***
speciesChinstrap -1342.93 69.86 -19.22 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 462.3 on 339 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6697, Adjusted R-squared: 0.6677
F-statistic: 343.6 on 2 and 339 DF, p-value: < 2.2e-16
With Gentoo as the reference level for species, the summary of model results tells us directly that the body mass of Adelie and Chinstrap penguins are both significantly less than that of Gentoo penguins. However we set the factor levels, the model predictions remain unchanged, which helps emphasize that the factor coding affects which interpretation(s) are emphasized in the summary but does not change the model.
# compare predictions between the models with different contrasts:
= data.frame(species=c("Adelie", "Chinstrap", "Gentoo"))
pred_df predict(penguin_lm, pred_df)
1 2 3
3700.662 3733.088 5076.016
predict(releveled_penguin_model, pred_df)
1 2 3
3700.662 3733.088 5076.016
Of course, continuous and categorical features don’t have to be kept separate. We’ll return to our original example and consider whether the relationship between a penguin’s mass and its flipper length is different between the three species. You combine categorical and continuous features by adding them together in the model formula.
# create a model with both species and flipper length as features
= lm(body_mass_g ~ flipper_length_mm + species,
combo_model data=penguins)
Here is a visualization of the model fit:
$nointeract = predict(combo_model, newdata=penguins)
penguinsggplot(penguins) +
aes(x=flipper_length_mm, y=body_mass_g, color=species) +
geom_point() +
xlab("Flipper length (mm)") +
ylab("Penguin mass (g)") +
geom_smooth(data=penguins,
mapping=aes(x=flipper_length_mm, y=nointeract, color=species),
method='lm',
se=FALSE)
As you can see, adding a categorical factor to the model has resulted in a vertical offset between the regression lines for each species, and the three regression lines all have the same slope. You should therefore interpret the species effects as species-specific intercepts for the regression line. Now, let’s generate the summary plots and the model summary.
# plot the model diagnostics
layout(matrix(1:4, 2, 2))
plot(combo_model)
# check out the model summary
summary(combo_model)
Call:
lm(formula = body_mass_g ~ flipper_length_mm + species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-927.70 -254.82 -23.92 241.16 1191.68
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3764.667 667.844 -5.637 3.65e-08 ***
flipper_length_mm 40.705 3.071 13.255 < 2e-16 ***
speciesAdelie -266.810 95.264 -2.801 0.00539 **
speciesChinstrap -473.320 86.746 -5.456 9.41e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 375.5 on 338 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.7826, Adjusted R-squared: 0.7807
F-statistic: 405.7 on 3 and 338 DF, p-value: < 2.2e-16
The summary shows how big the differences are between the species-specific intercepts. The diagnostic plots still exhibit a “U” shape in the Fitted Vs. Residual plot, so we haven’t yet found an ideal regression model for this data.
We can improve this model further by adding an interaction between the species and the flipper length. An interaction allows the regression lines to have different slopes for the different species, as seen here:
ggplot(penguins) +
aes(x=flipper_length_mm, y=body_mass_g, color=species) +
geom_point() +
xlab("Flipper length (mm)") +
ylab("Penguin mass (g)") +
geom_smooth(method='lm', se=FALSE)
There will still be different intercepts between species because I have retained the so-called “main effect” of species. An intercept is written in an R formula by placing a colon (:
) between two variables.
# create a model interacting species and flipper length as features
=
interaction_model lm(body_mass_g ~ flipper_length_mm + species + flipper_length_mm:species,
data=penguins)
And now we can look at the diagnostics.
# plot the model diagnostics
layout(matrix(1:4, 2, 2))
plot(interaction_model)
# check out the model summary
summary(interaction_model)
Call:
lm(formula = body_mass_g ~ flipper_length_mm + species + flipper_length_mm:species,
data = penguins)
Residuals:
Min 1Q Median 3Q Max
-911.18 -251.93 -31.77 197.82 1144.81
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6787.281 1124.195 -6.037 4.14e-09 ***
flipper_length_mm 54.623 5.174 10.557 < 2e-16 ***
speciesAdelie 4251.444 1427.332 2.979 0.00311 **
speciesChinstrap 3750.085 1676.687 2.237 0.02597 *
flipper_length_mm:speciesAdelie -21.791 6.941 -3.139 0.00184 **
flipper_length_mm:speciesChinstrap -20.049 8.190 -2.448 0.01487 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 370.6 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.7896, Adjusted R-squared: 0.7864
F-statistic: 252.2 on 5 and 336 DF, p-value: < 2.2e-16
The “U” shape in the residuals is gone! Also, the significantly negative coefficients for the interactions of flipper length with species Adelie and Chinstrap tells us that body mass for these species increases less quickly with flipper length than for Gentoo penguins. You can see this same relationship in the steeper slope of the red line in the scatter plot with interactions.