31 Feature Engineering

Feature engineering is a fancy machine learning way of saying “prepare your data for analysis”. But it’s also more than just getting your data in the right format. It’s about transforming variables to help the model provide more stable predictions – for example, encoding, modeling, or omitting missing data; creating new variables from the available dataset; and otherwise helping improve the model performance through transformations and modifications to the existing data before analysis.

One of the biggest challenges with feature engineering is what’s referred to as data leakage, which occurs when information from the test dataset leaks into the training dataset. Consider the simple example of normalizing or standardizing a variable (subtracting the variable mean from each value and then dividing by the standard deviation). Normalizing your variables is necessary for a wide range of predictive models, including any model using regularization methods, principal components analysis, and \(k\)-nearest neighbor models, among others. But when we normalize the variable, it is critical we do so relative to the training set, not relative to the full dataset. If we normalize the numeric variables in our full dataset and then divide it into test and training datasets, information from the test dataset has leaked into the training dataset. This seems simple enough - just wait to standardize until after splitting - but it becomes more complicated when we consider tuning a model. If we use a process like \(k\)-fold cross-validation, then we have to normalize within each fold to get an accurate representation of out-of-sample predictive accuracy.

In this chapter, we’ll introduce feature engineering using the {recipes} package from the {tidymodels} suite of packages which, as we will see, helps you to be explicit about the decisions you’re making, while avoiding potential issues with data leakage.