31.1 Basics of {recipes}

The {recipes} package is designed to replace the stats::model.matrix() function that you might be familiar with. For example, suppose you fit a model like the one below.

library(palmerpenguins)
m1 <- lm(bill_length_mm ~ species, data = penguins)
summary(m1)
## 
## Call:
## lm(formula = bill_length_mm ~ species, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9338 -2.2049  0.0086  2.0662 12.0951 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       38.7914     0.2409  161.05   <2e-16 ***
## speciesChinstrap  10.0424     0.4323   23.23   <2e-16 ***
## speciesGentoo      8.7135     0.3595   24.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.96 on 339 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7078, Adjusted R-squared:  0.7061 
## F-statistic: 410.6 on 2 and 339 DF,  p-value: < 2.2e-16

You can see that our species column, which has the values Adelie, Chinstrap, and Gentoo, is automatically dummy-coded for us, with the first level of the factor (Adelie) set as the reference group. The {recipes} package asks you to be a bit more explicit about these decisions, but it also supports a much wider range of modifications to your data.
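As a rough sketch (the object name rec is just illustrative), the same dummy coding made explicit with {recipes} might look like this:

library(recipes)
library(palmerpenguins)

# Declare the variables' roles and the dummy-coding step explicitly;
# nothing is computed or modified at this point.
rec <- recipe(bill_length_mm ~ species, data = penguins) %>%
  step_dummy(species)

rec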

Using {recipes} also allows you to more easily separate the pre-processing and modeling stages of your data analysis workflow. In the above example, you may not have even realized stats::model.matrix() was doing anything for you because it’s wrapped within the stats::lm() modeling code. But with {recipes}, you make the modifications to your data first and then conduct your analysis.
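To make that separation concrete, here is a minimal sketch continuing the rec object defined above (object names are illustrative): prep() estimates whatever the steps need from the training data, bake() applies the stored steps, and only then does the modeling function see the data.

# prep() estimates anything the steps require from the training data
prepped <- prep(rec, training = penguins)

# bake() applies the stored steps; new_data = NULL returns the
# processed training data itself
penguins_processed <- bake(prepped, new_data = NULL)

# Modeling is now a separate step on already-processed data; this should
# reproduce the dummy coding that model.matrix() handled implicitly above.
m2 <- lm(bill_length_mm ~ ., data = penguins_processed)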

How does this separation work? With {recipes}, you create a blueprint (or recipe) of operations to apply to a given dataset, without actually carrying them out right away. You can then apply this blueprint repeatedly across sets of data (e.g., folds) as well as to new, potentially unseen data that has the same structure (the same variables). This process helps avoid data leakage because the operations are stored and carried forward together, and none of them is executed until explicitly requested.
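For instance, continuing with the prepped object from the sketch above, the same blueprint can be applied to data it has never seen; here an arbitrary subset of penguins stands in for a fold or for future observations:

# The prepped recipe is the blueprint: it can be applied to any data
# with the same columns, such as a held-out fold or newly collected rows.
held_out <- penguins[1:20, ]   # illustrative stand-in for new data
held_out_processed <- bake(prepped, new_data = held_out)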