31.2 Creating a recipe

Let’s read in some data and begin creating a basic recipe. We’ll work with the simulated statewide testing data introduced previously. This is a fairly large dataset, and since we’re just illustrating concepts here, we’ll pull a random sample of 2% of the total data to make everything run a bit quicker. We’ll also remove the classification variable, which is just a categorical version of score, our outcome.

In the chunk below, we read in the data, sample a random 2% of the data (being careful to set a seed first so our results are reproducible), split it into training and test sets, and extract just the training dataset. We’ll hold off on splitting it into CV folds for now.

library(tidyverse)
library(tidymodels)

set.seed(8675309)
full_train <- read_csv("https://github.com/uo-datasci-specialization/c4-ml-fall-2020/raw/master/data/train.csv") %>% 
  slice_sample(prop = 0.02) %>% 
  select(-classification)

splt <- initial_split(full_train)
train <- training(splt)

A quick reminder, the data look like this

And you can see the full data dictionary on the Kaggle website here.

When creating recipes, we can still use the formula interface to define how the data will be modeled. In this case, we’ll say that the score column is predicted by everything else in the data frame.

rec <- recipe(score ~ ., data = train)

Notice that I still declare the dataset (in this case, the training data), even though this is just a blueprint. It uses the dataset I provide to get the names of the columns, but it doesn’t actually do anything with this dataset (unless we ask it to). Let’s look at what this recipe looks like.

rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38

Notice it just states that this is a data recipe in which we have specified 1 outcome variable and 38 predictors.

We can prep this recipe to learn more.

prep(rec)
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38
## 
## Training data contained 2841 data points and 2841 incomplete rows.

Notice we now get an additional message about how many rows are in the data, and how many of these rows contain missing (incomplete) data. So the recipe is the blueprint, and we prep the recipe to get it to actually go into the data and conduct the operations. The dataset it has now, however, is just a placeholder that can be substituted for any other dataset with an equivalent structure.

But of course, modeling score as the outcome with everything else predicting it is not a reasonable choice in this case. For example, we have many ID variables, and we also have multiple categorical variables. For some methods (like tree-based models) it might be okay to leave these categorical variables as they are, but for others (like any model in the linear regression family) we’ll want to encode them somehow (e.g., dummy code).

We can address these concerns by adding steps to our recipe. In the first step, we’ll update the role of all the ID variables so they are not included among the predictors. In the second, we will dummy code all nominal (i.e. categorical) variables.

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_dummy(all_nominal())

When updating the roles, we can change the variable label (text passed to the new_role argument) to be anything we want, so long as it’s not "predictor" or "outcome".

Notice in the above I am also using helper functions to apply the operations to all variables of a specific type. There are five main helper functions for creating recipes: all_predictors(), all_outcomes(), all_nominal(), all_numeric() and has_role(). You can use these together, including with negation (e.g., -all_outcomes() to specify the operation should not apply to the outcome variable(s)) to select any set of variables you want to apply the operation to.
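
To make the roles and selectors concrete, here is a small sketch on made-up data (the toy data frame, its column names, and toy_rec are all hypothetical). It uses a categorical outcome so that the negation -all_outcomes() has a visible effect: the nominal predictor is dummy coded while the outcome is left as a factor. summary() on a recipe lists each variable with its current role.

```r
library(tidymodels)

# Hypothetical toy data with an ID column and a categorical outcome
toy <- tibble(
  student_id = 1:6,
  grp        = factor(c("a", "b", "c", "a", "b", "c")),
  x          = c(1.2, 3.4, 2.2, 5.1, 4.8, 0.7),
  passed     = factor(c("no", "yes", "no", "yes", "yes", "no"))
)

toy_rec <- recipe(passed ~ ., data = toy) %>%
  update_role(student_id, new_role = "id vars") %>%
  step_dummy(all_nominal(), -all_outcomes())  # dummy code nominal columns, except the outcome

# One row per variable, including its role
toy_roles <- summary(toy_rec)
toy_roles

toy_baked <- toy_rec %>%
  prep() %>%
  bake(new_data = NULL)

names(toy_baked)
```

Without the negation, the outcome would be dummy coded along with the predictor, which is rarely what you want for a classification problem.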

Let’s try to prep this updated recipe.

prep(rec)
## Error: Only one factor level in lang_cd

Uh oh! We have an error. Our recipe is trying to dummy code the lang_cd variable, but it has only one level. It’s kind of hard to dummy-code a constant!

Luckily, we can expand our recipe to first remove any zero-variance predictors, like so:

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal())

The _zv part stands for “zero variance” and should take care of this problem. Let’s try again.

prep(rec)
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    id vars          6
##    outcome          1
##  predictor         32
## 
## Training data contained 2841 data points and 2841 incomplete rows. 
## 
## Operations:
## 
## Zero variance filter removed calc_admn_cd, lang_cd [trained]
## Dummy variables from gndr, ethnic_cd, tst_bnch, tst_dt, ... [trained]

Beautiful! Note we do still get a warning here, but I’ve omitted it in the text (we’ll come back to this in the section on Missing data). Our recipe says we now have 6 ID variables, 1 outcome, and 32 predictors, with 2841 data points (rows of data). The calc_admn_cd and lang_cd variables have been removed because they have zero variance, and several variables have been dummy coded, including gndr and ethnic_cd, among others.
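
The zero-variance filter is easy to see in isolation. The sketch below uses a made-up data frame (toy, hours, and const are hypothetical names) with one constant column and checks that step_zv() drops it while leaving the other predictor alone.

```r
library(tidymodels)

# Hypothetical toy data: `const` takes a single value in every row
toy <- tibble(
  score = c(10, 12, 9, 15),
  hours = c(1, 3, 2, 5),
  const = c(7, 7, 7, 7)
)

toy_baked <- recipe(score ~ ., data = toy) %>%
  step_zv(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

# `const` should no longer be among the columns
names(toy_baked)
```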

Let’s dig just a bit deeper here though. What’s going on with these zero-variance variables? Let’s look back at the training data.

train %>% 
  count(calc_admn_cd)
## # A tibble: 1 x 2
##   calc_admn_cd     n
##   <lgl>        <int>
## 1 NA            2841
train %>% 
  count(lang_cd)
## # A tibble: 2 x 2
##   lang_cd     n
##   <chr>   <int>
## 1 S          80
## 2 <NA>     2761

So at least in our sample, calc_admn_cd really is just fully missing, which means it might as well be dropped because it’s providing us exactly nothing. But that’s not the case with lang_cd. It has two values, NA and "S" (for “Spanish”). This variable represents the language the test was administered in, and the NA values are actually meaningful here because they are the “default” administration, meaning English. So rather than dropping these, let’s mutate them to transform the NA values to "E" for English. We could reasonably do this inside or outside the recipe, but a good rule of thumb is, if it can go in the recipe, put it in the recipe. It can’t hurt, and doing operations outside of the recipe risks data leakage.

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_mutate(lang_cd = ifelse(is.na(lang_cd), "E", lang_cd)) %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal())
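
One benefit of keeping the recode inside the recipe is that it travels with the recipe to any new data. The sketch below illustrates this on made-up data (toy_train, toy_new, and their values are hypothetical). Two choices here are assumptions on my part rather than part of the recipe above: wrapping the column in as.character() guards against it having already been converted to a factor by the time the step runs, and setting the factor levels explicitly keeps the levels identical across datasets.

```r
library(tidymodels)

# Hypothetical toy data: NA means the default (English) administration
toy_train <- tibble(score = c(10, 12, 9, 15), lang_cd = c("S", NA, NA, "S"))
toy_new   <- tibble(score = c(11, 13),        lang_cd = c(NA, "S"))

toy_rec <- recipe(score ~ ., data = toy_train) %>%
  step_mutate(
    lang_cd = factor(
      ifelse(is.na(lang_cd), "E", as.character(lang_cd)),
      levels = c("E", "S")
    )
  )

toy_prepped <- prep(toy_rec)

# The same recode is applied to the training data and to new data
toy_train_baked <- bake(toy_prepped, new_data = NULL)
toy_new_baked   <- bake(toy_prepped, new_data = toy_new)

toy_new_baked
```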

Let’s take a look at what our data would actually look like when applying this recipe now. First, we’ll prep the recipe:

prepped <- prep(rec)
prepped
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    id vars          6
##    outcome          1
##  predictor         32
## 
## Training data contained 2841 data points and 2841 incomplete rows. 
## 
## Operations:
## 
## Variable mutation for lang_cd [trained]
## Zero variance filter removed calc_admn_cd [trained]
## Dummy variables from gndr, ethnic_cd, tst_bnch, tst_dt, ... [trained]

And we see that lang_cd is no longer being caught by the zero variance filter. Next we’ll bake the prepped recipe to actually apply it to our data. If we specify new_data = NULL, bake() will return the processed version of the data we originally specified in the recipe. But we can also pass a different dataset to new_data, and it will apply the operations to that data instead.

bake(prepped, new_data = NULL)
## # A tibble: 2,841 x 106
##        id attnd_dist_inst… attnd_schl_inst… enrl_grd partic_dist_ins…
##     <dbl>            <dbl>            <dbl>    <dbl>            <dbl>
##  1  62576             2083             1353        7             2083
##  2  71424             2180              878        6             2180
##  3 179893             2244             1334        3             2244
##  4 136083             2142             4858        5             2142
##  5 196809             2212             1068        3             2212
##  6  13931             2088              581        8             2088
##  7 103344             1926              102        6             1926
##  8 105122             2142              766        6             2142
##  9 172543             1965              197        4             1965
## 10  45153             2083              542        6             2083
## # … with 2,831 more rows, and 101 more variables: partic_schl_inst_id <dbl>,
## #   lang_cd <fct>, ncessch <dbl>, lat <dbl>, lon <dbl>, score <dbl>,
## #   gndr_M <dbl>, ethnic_cd_B <dbl>, ethnic_cd_H <dbl>, ethnic_cd_I <dbl>,
## #   ethnic_cd_M <dbl>, ethnic_cd_P <dbl>, ethnic_cd_W <dbl>,
## #   tst_bnch_X2B <dbl>, tst_bnch_X3B <dbl>, tst_bnch_G4 <dbl>,
## #   tst_bnch_G6 <dbl>, tst_bnch_G7 <dbl>, tst_dt_X3.21.2018.0.00.00 <dbl>,
## #   tst_dt_X3.22.2018.0.00.00 <dbl>, tst_dt_X3.23.2018.0.00.00 <dbl>,
## #   tst_dt_X3.8.2018.0.00.00 <dbl>, tst_dt_X3.9.2018.0.00.00 <dbl>,
## #   tst_dt_X4.10.2018.0.00.00 <dbl>, tst_dt_X4.11.2018.0.00.00 <dbl>,
## #   tst_dt_X4.12.2018.0.00.00 <dbl>, tst_dt_X4.13.2018.0.00.00 <dbl>,
## #   tst_dt_X4.16.2018.0.00.00 <dbl>, tst_dt_X4.17.2018.0.00.00 <dbl>,
## #   tst_dt_X4.18.2018.0.00.00 <dbl>, tst_dt_X4.19.2018.0.00.00 <dbl>,
## #   tst_dt_X4.2.2018.0.00.00 <dbl>, tst_dt_X4.20.2018.0.00.00 <dbl>,
## #   tst_dt_X4.23.2018.0.00.00 <dbl>, tst_dt_X4.24.2018.0.00.00 <dbl>,
## #   tst_dt_X4.25.2018.0.00.00 <dbl>, tst_dt_X4.26.2018.0.00.00 <dbl>,
## #   tst_dt_X4.27.2018.0.00.00 <dbl>, tst_dt_X4.30.2018.0.00.00 <dbl>,
## #   tst_dt_X4.5.2018.0.00.00 <dbl>, tst_dt_X4.6.2018.0.00.00 <dbl>,
## #   tst_dt_X4.9.2018.0.00.00 <dbl>, tst_dt_X5.1.2018.0.00.00 <dbl>,
## #   tst_dt_X5.10.2018.0.00.00 <dbl>, tst_dt_X5.11.2018.0.00.00 <dbl>,
## #   tst_dt_X5.14.2018.0.00.00 <dbl>, tst_dt_X5.15.2018.0.00.00 <dbl>,
## #   tst_dt_X5.16.2018.0.00.00 <dbl>, tst_dt_X5.17.2018.0.00.00 <dbl>,
## #   tst_dt_X5.18.2018.0.00.00 <dbl>, tst_dt_X5.2.2018.0.00.00 <dbl>,
## #   tst_dt_X5.21.2018.0.00.00 <dbl>, tst_dt_X5.22.2018.0.00.00 <dbl>,
## #   tst_dt_X5.23.2018.0.00.00 <dbl>, tst_dt_X5.24.2018.0.00.00 <dbl>,
## #   tst_dt_X5.25.2018.0.00.00 <dbl>, tst_dt_X5.29.2018.0.00.00 <dbl>,
## #   tst_dt_X5.3.2018.0.00.00 <dbl>, tst_dt_X5.30.2018.0.00.00 <dbl>,
## #   tst_dt_X5.31.2018.0.00.00 <dbl>, tst_dt_X5.4.2018.0.00.00 <dbl>,
## #   tst_dt_X5.7.2018.0.00.00 <dbl>, tst_dt_X5.8.2018.0.00.00 <dbl>,
## #   tst_dt_X5.9.2018.0.00.00 <dbl>, tst_dt_X6.1.2018.0.00.00 <dbl>,
## #   tst_dt_X6.4.2018.0.00.00 <dbl>, tst_dt_X6.5.2018.0.00.00 <dbl>,
## #   tst_dt_X6.6.2018.0.00.00 <dbl>, tst_dt_X6.7.2018.0.00.00 <dbl>,
## #   tst_dt_X6.8.2018.0.00.00 <dbl>, migrant_ed_fg_Y <dbl>, ind_ed_fg_Y <dbl>,
## #   sp_ed_fg_Y <dbl>, tag_ed_fg_Y <dbl>, econ_dsvntg_Y <dbl>, ayp_lep_B <dbl>,
## #   ayp_lep_E <dbl>, ayp_lep_F <dbl>, ayp_lep_M <dbl>, ayp_lep_N <dbl>,
## #   ayp_lep_S <dbl>, ayp_lep_W <dbl>, ayp_lep_X <dbl>, ayp_lep_Y <dbl>,
## #   stay_in_dist_Y <dbl>, stay_in_schl_Y <dbl>, dist_sped_Y <dbl>,
## #   trgt_assist_fg_Y <dbl>, ayp_dist_partic_Y <dbl>, ayp_schl_partic_Y <dbl>,
## #   ayp_dist_prfrm_Y <dbl>, ayp_schl_prfrm_Y <dbl>, rc_dist_partic_Y <dbl>,
## #   rc_schl_partic_Y <dbl>, rc_dist_prfrm_Y <dbl>, rc_schl_prfrm_Y <dbl>,
## #   tst_atmpt_fg_Y <dbl>, grp_rpt_dist_partic_Y <dbl>,
## #   grp_rpt_schl_partic_Y <dbl>, grp_rpt_dist_prfrm_Y <dbl>, …

And now we can actually see the dummy-coded categorical variables, along with the other operations we requested. For example, calc_admn_cd is not in the dataset. Notice the ID variables are output though, which makes sense because they are often necessary for joining with other data sources. But it’s important to realize that they are output (i.e., all variables are returned, regardless of role) because if we passed this directly to a model they would be included as predictors. Note that there may be reasons you would want to include a school and/or district level ID variable in your modeling, but you certainly would not want redundant variables.
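
If you do want a version of the baked data without the ID columns, one option is to pass selectors to bake() so that only variables with particular roles are returned. The sketch below shows this on made-up data (toy, student_id, and the other names are hypothetical).

```r
library(tidymodels)

# Hypothetical toy data with an ID column
toy <- tibble(
  student_id = 101:104,
  hours      = c(1, 3, 2, 5),
  grp        = factor(c("a", "b", "a", "b")),
  score      = c(10, 12, 9, 15)
)

toy_prepped <- recipe(score ~ ., data = toy) %>%
  update_role(student_id, new_role = "id vars") %>%
  step_dummy(all_nominal()) %>%
  prep()

# Default: every column comes back, whatever its role
all_cols <- bake(toy_prepped, new_data = NULL)

# Selectors restrict the output to predictors and the outcome
model_cols <- bake(toy_prepped, new_data = NULL, all_predictors(), all_outcomes())

names(all_cols)
names(model_cols)
```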

We do still have one minor issue with this recipe though, which is pretty evident when looking at the column names of our baked dataset. The tst_dt variable, which specifies the date the test was taken, was treated as a categorical variable because it was read in as a character vector. That means all the dates are being dummy coded! Let’s fix this by just transforming it to a date within our step_mutate.

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_mutate(lang_cd = factor(ifelse(is.na(lang_cd), "E", lang_cd)),
              tst_dt = lubridate::mdy_hms(tst_dt)) %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal())

And now when we prep/bake the dataset it’s still a date variable, which is what we probably want (it will be modeled as a numeric variable).

rec %>% 
  prep() %>% 
  bake(new_data = NULL)
## # A tibble: 2,841 x 55
##        id attnd_dist_inst… attnd_schl_inst… enrl_grd tst_dt             
##     <dbl>            <dbl>            <dbl>    <dbl> <dttm>             
##  1  62576             2083             1353        7 2018-05-16 00:00:00
##  2  71424             2180              878        6 2018-04-24 00:00:00
##  3 179893             2244             1334        3 2018-05-25 00:00:00
##  4 136083             2142             4858        5 2018-05-24 00:00:00
##  5 196809             2212             1068        3 2018-05-16 00:00:00
##  6  13931             2088              581        8 2018-06-06 00:00:00
##  7 103344             1926              102        6 2018-06-04 00:00:00
##  8 105122             2142              766        6 2018-05-08 00:00:00
##  9 172543             1965              197        4 2018-05-23 00:00:00
## 10  45153             2083              542        6 2018-05-10 00:00:00
## # … with 2,831 more rows, and 50 more variables: partic_dist_inst_id <dbl>,
## #   partic_schl_inst_id <dbl>, ncessch <dbl>, lat <dbl>, lon <dbl>,
## #   score <dbl>, gndr_M <dbl>, ethnic_cd_B <dbl>, ethnic_cd_H <dbl>,
## #   ethnic_cd_I <dbl>, ethnic_cd_M <dbl>, ethnic_cd_P <dbl>, ethnic_cd_W <dbl>,
## #   tst_bnch_X2B <dbl>, tst_bnch_X3B <dbl>, tst_bnch_G4 <dbl>,
## #   tst_bnch_G6 <dbl>, tst_bnch_G7 <dbl>, migrant_ed_fg_Y <dbl>,
## #   ind_ed_fg_Y <dbl>, sp_ed_fg_Y <dbl>, tag_ed_fg_Y <dbl>,
## #   econ_dsvntg_Y <dbl>, ayp_lep_B <dbl>, ayp_lep_E <dbl>, ayp_lep_F <dbl>,
## #   ayp_lep_M <dbl>, ayp_lep_N <dbl>, ayp_lep_S <dbl>, ayp_lep_W <dbl>,
## #   ayp_lep_X <dbl>, ayp_lep_Y <dbl>, stay_in_dist_Y <dbl>,
## #   stay_in_schl_Y <dbl>, dist_sped_Y <dbl>, trgt_assist_fg_Y <dbl>,
## #   ayp_dist_partic_Y <dbl>, ayp_schl_partic_Y <dbl>, ayp_dist_prfrm_Y <dbl>,
## #   ayp_schl_prfrm_Y <dbl>, rc_dist_partic_Y <dbl>, rc_schl_partic_Y <dbl>,
## #   rc_dist_prfrm_Y <dbl>, rc_schl_prfrm_Y <dbl>, lang_cd_E <dbl>,
## #   tst_atmpt_fg_Y <dbl>, grp_rpt_dist_partic_Y <dbl>,
## #   grp_rpt_schl_partic_Y <dbl>, grp_rpt_dist_prfrm_Y <dbl>,
## #   grp_rpt_schl_prfrm_Y <dbl>
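
To see what the date conversion is doing, here is a small standalone sketch. The raw strings are an assumption about the original format, inferred from the dummy-coded column names shown earlier (e.g., tst_dt_X3.21.2018.0.00.00). Once parsed, a date-time is stored as a number under the hood, which is the sense in which it can be treated as numeric.

```r
library(lubridate)

# Assumed raw format (month/day/year hour:minute:second)
raw_dates <- c("3/21/2018 0:00:00", "5/16/2018 0:00:00")

parsed <- mdy_hms(raw_dates)   # parsed as UTC by default
class(parsed)

# Underlying numeric representations
as.numeric(parsed)             # seconds since 1970-01-01 UTC
as.numeric(as_date(parsed))    # days since 1970-01-01

# The two test dates are 56 days apart
diff(as.numeric(as_date(parsed)))
```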

31.2.1 Order matters

It’s important to realize that the order of the steps matters. In our recipe, we first declare ID variables as having a different role than predictors or outcomes. We then modify two variables, remove zero-variance predictors, and finally dummy code all categorical (nominal) variables. What happens if we instead dummy code and then remove zero-variance predictors?

rec <- recipe(score ~ ., train) %>% 
  step_dummy(all_nominal()) %>% 
  step_zv(all_predictors()) 

prep(rec)
## Error: Only one factor level in lang_cd

We end up with our original error. We don’t get this error if we remove zero variance predictors and then dummy code.

rec <- recipe(score ~ ., train) %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal()) 

prep(rec)
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38
## 
## Training data contained 2841 data points and 2841 incomplete rows. 
## 
## Operations:
## 
## Zero variance filter removed calc_admn_cd, lang_cd [trained]
## Dummy variables from gndr, ethnic_cd, tst_bnch, tst_dt, ... [trained]

The fact that order matters may occasionally require that you apply the same operation at multiple steps (e.g., a near zero variance filter could be applied before and after dummy-coding).
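
Here is a minimal sketch of filtering both before and after dummy coding, using made-up data (toy and its columns are hypothetical) and the step_zv() filter from above rather than a near-zero-variance filter. The factor grp declares a level, "c", that never occurs in the data, so its dummy column is all zeros and can only be caught by a filter that runs after step_dummy().

```r
library(tidymodels)

# Hypothetical toy data: `const` is constant, and level "c" of `grp` is never observed
toy <- tibble(
  score = c(10, 12, 9, 15),
  const = c(7, 7, 7, 7),
  grp   = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))
)

toy_baked <- recipe(score ~ ., data = toy) %>%
  step_zv(all_predictors()) %>%   # drops `const`
  step_dummy(all_nominal()) %>%   # creates grp_b and grp_c
  step_zv(all_predictors()) %>%   # drops grp_c, which is all zeros
  prep() %>%
  bake(new_data = NULL)

names(toy_baked)
```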

All of the above serves as a basic introduction to developing a recipe, and what follows goes into more detail on specific feature engineering pieces. For complete information on all possible recipe steps, please see the documentation.