## 31.5 Missing data

If we look closely at our statewide testing data, we will see that there is a considerable amount of missingness. In fact, every row of the data frame has at least one value that is missing. The amount to which missing data impacts your work varies by field, but in most fields you’re likely to run into situations where you have to handle missing data in some way. The purpose of this section is to discuss a few approaches using **{recipes}** to handle missingness for predictive modeling purposes. Note that this is not a comprehensive discussion on the topic (for which, we recommend Little and Rubin (2002)), but is instead an applied discussion of what you *can* do. As with many aspects of data analysis, generally, there is no single approach that will always work best, and it’s worth trying a few different approaches in your model development to see how different choices impact your model performance.

There are three basic ways of handling missing data:

- **Omit** rows of the data frame that include missing values
- **Encode** or **Impute** the missing data
- **Ignore** the missing data and estimate from the available data

The last option is not always feasible and will depend on the modeling framework you’re working within. Some estimation procedures lend themselves to efficient handling of missing data (for example, imputation via the posterior distribution with Bayesian estimation). In this section, we’ll mostly focus on the first two approaches. Additionally, we will only be concerned with missingness on the predictor variables here, rather than the outcome. Generally, missing data in the predictors is a much more difficult problem than missingness in the outcome, because most models assume you have complete data across your predictors. This is not to say that missingness on your outcome is not challenging (it can be, and it can make model performance evaluation more difficult). However, without handling missing data on your predictors, you generally cannot even *fit* the model. So we’ll mostly focus there.

### 31.5.1 Omission

We can **omit** missing data with `step_naomit()`. This will remove any row that has any missing data. Let’s see how this impacts our data, working with the same recipe we finished up with in the Creating a recipe section. I’ve placed the recipe here again so we don’t have to go back to remind ourselves what we did previously.

```
rec <- recipe(score ~ ., train) %>%
  update_role(contains("id"), ncessch, new_role = "id vars") %>%
  step_mutate(lang_cd = factor(ifelse(is.na(lang_cd), "E", lang_cd)),
              tst_dt = lubridate::mdy_hms(tst_dt)) %>%
  step_zv(all_predictors())

na_omit_data <- rec %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal()) %>%
  prep() %>%
  bake(new_data = NULL)

nrow(na_omit_data)
```

`## [1] 573`

As can be seen above, when we omit any row with missing data we end up with only 573 of the original 2841 rows in the training data (approximately 20% of the original data). This level of data omission is highly likely to introduce systematic biases into your model predictions. Generally, `step_naomit()` should only be used when developing preliminary models, where you’re just trying to get code to run. When you get to the point where you’re actually trying to improve performance, you should consider alternative approaches.
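Omission keeps only rows that are complete across every column. The same idea can be illustrated with base R’s `complete.cases()`, here using the built-in `airquality` data (which we use again later in this section):

```r
# complete.cases() returns TRUE for rows with no missing values
nrow(airquality)                  # total rows
sum(complete.cases(airquality))   # rows with complete data
```

In this dataset, 111 of the 153 rows are complete, so omission discards more than a quarter of the data.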

### 31.5.2 Encoding and simple imputation

Encoding missing data is similar to imputation. With imputation, we replace the missing value with something we think could reasonably have been the real value, had it been observed. When we encode missing data, we instead create a value that flags the missingness and is included in the modeling process. For example, with categorical variables, we could replace the missingness with a “missing” level, which would then get its own dummy code (if we were using dummy coding to encode the categorical variables).
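The categorical case can be sketched in base R (the toy vector here is hypothetical):

```r
# A factor with one missing value
x <- factor(c("a", "b", NA, "a"))

# Make NA an explicit level, then rename that level "missing"
x <- addNA(x)
levels(x)[is.na(levels(x))] <- "missing"

table(x)
```

The “missing” level would then be dummy coded along with the observed levels.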

I mentioned in the Creating a recipe section that we were getting warnings, but I was omitting them in the text. The reason for these warnings is that some of these columns have missing data. If we want to avoid the warning, we have to add an additional step to our recipe to encode the missing data in the categorical variables. This step is called `step_unknown()`, and it replaces missing values with `"unknown"`. Let’s do this for all categorical variables and omit any rows that are missing on numeric columns.

```
na_encode_data <- rec %>%
  step_unknown(all_nominal()) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal()) %>%
  prep() %>%
  bake(new_data = NULL)

nrow(na_encode_data)
```

`## [1] 2788`

Notice in the above that when I call `step_naomit()` I state that it should be applied to `all_predictors()`, because I’ve already encoded the nominal predictors in the previous step. This approach allows us to capture 98% of the original data. And as a bonus, we’ve removed the warnings. (*Note:* we might also want to apply `step_novel()` to handle any future data that have levels outside of our training data - see Handling new levels.)

Just a slight step up in complexity from omitting rows with missing data is imputing the missing values with sample descriptive statistics, such as the mean or the median. Generally, I’ve found that median imputation works better than mean imputation, but that could be related to the types of data I work with most frequently. Let’s switch datasets so we can see what’s happening more directly.

Let’s look at the `airquality` dataset, which ships with R.
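The first few rows shown below were printed with `head()`:

```r
head(airquality)
```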

```
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
```

As we can see, `Solar.R` is missing for observations 5 and 6. Let’s compute the sample mean and median for this column.
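The two values below are the mean and median of `Solar.R`, computed with the missing values removed:

```r
mean(airquality$Solar.R, na.rm = TRUE)
median(airquality$Solar.R, na.rm = TRUE)
```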

`## [1] 185.9315`

`## [1] 205`

If we use mean or median imputation, we just replace the missing values with these sample statistics. Let’s do this in a new recipe, assuming we’ll be fitting a model where `Ozone` is the outcome, predicted by all other variables in the dataset.

```
recipe(Ozone ~ ., data = airquality) %>%
  step_impute_mean(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
```

```
## # A tibble: 153 x 6
## Solar.R Wind Temp Month Day Ozone
## <int> <dbl> <int> <int> <int> <int>
## 1 190 7.4 67 5 1 41
## 2 118 8 72 5 2 36
## 3 149 12.6 74 5 3 12
## 4 313 11.5 62 5 4 18
## 5 186 14.3 56 5 5 NA
## 6 186 14.9 66 5 6 28
## 7 299 8.6 65 5 7 23
## 8 99 13.8 59 5 8 19
## 9 19 20.1 61 5 9 8
## 10 194 8.6 69 5 10 NA
## # … with 143 more rows
```

As we can see, the value \(186\) has been imputed for rows 5 and 6, which is the integer version of the sample mean (an integer was imputed because the column was already an integer, and not a double).
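The same replacement can be done directly in base R (the `aq` copy is just to avoid modifying the original data):

```r
aq <- airquality

# Replace missing Solar.R values with the rounded sample mean,
# kept as an integer to match the column's type
aq$Solar.R[is.na(aq$Solar.R)] <- as.integer(round(mean(aq$Solar.R, na.rm = TRUE)))

aq$Solar.R[5:6]   # both imputed values are now 186
```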

Let’s try the same thing with median imputation:

```
recipe(Ozone ~ ., data = airquality) %>%
  step_impute_median(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
```

```
## # A tibble: 153 x 6
## Solar.R Wind Temp Month Day Ozone
## <int> <dbl> <int> <int> <int> <int>
## 1 190 7.4 67 5 1 41
## 2 118 8 72 5 2 36
## 3 149 12.6 74 5 3 12
## 4 313 11.5 62 5 4 18
## 5 205 14.3 56 5 5 NA
## 6 205 14.9 66 5 6 28
## 7 299 8.6 65 5 7 23
## 8 99 13.8 59 5 8 19
## 9 19 20.1 61 5 9 8
## 10 194 8.6 69 5 10 NA
## # … with 143 more rows
```

And as we would expect, the missingness has now been replaced with values of \(205\).

Sometimes you have time series data, or a date variable in the dataset accounts for a meaningful proportion of the variance. In these cases, you might consider `step_impute_roll()`, which provides a conditional median imputation based on time, with a window whose size you can set for calculating the median. In still other cases it may make sense to impute with the lowest observed value (i.e., assume a very small amount of the predictor), which can be accomplished with `step_impute_lower()`.

These simple imputation techniques are fine to use when developing models. However, it’s an area that may be worth returning to as you start to refine your model to see if you can improve performance.

### 31.5.3 Modeling the missingness

Another alternative for imputation is to fit a statistical model in which the column you want to impute is the outcome, with all other columns (minus the actual outcome) as predictors. We then use that model for the imputation. Let’s first consider a linear regression model. We’ll fit the same model we specified in our recipe, using the `airquality` data.
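The summary output below was produced by a model like the following (the call itself is echoed in the `Call:` line; the object name is my own):

```r
# Predict Solar.R from the remaining variables, dropping Ozone (column 1)
solar_mod <- lm(Solar.R ~ ., data = airquality[, -1])
summary(solar_mod)
```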

```
##
## Call:
## lm(formula = Solar.R ~ ., data = airquality[, -1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -182.945 -67.348 5.295 73.781 170.068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.5960 84.4424 -0.173 0.863016
## Wind 2.1661 2.2633 0.957 0.340171
## Temp 3.7023 0.9276 3.991 0.000105 ***
## Month -13.2640 5.4525 -2.433 0.016242 *
## Day -1.0631 0.8125 -1.308 0.192875
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 85.14 on 141 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.131, Adjusted R-squared: 0.1063
## F-statistic: 5.313 on 4 and 141 DF, p-value: 0.0005145
```

Notice that I’ve dropped the first column here, which is `Ozone`, our actual outcome. The model above was fit using the equivalent of `step_naomit()`, otherwise known as *listwise deletion*, where any row with any missing data is removed.

We can now use the coefficients from this model to impute the missing values in `Solar.R`. For example, row 6 in the dataset had a missing value on `Solar.R`, with Wind = 14.9, Temp = 66, Month = 5, and Day = 6 for the other variables.

Using our model, we would predict the following score for this missing value:
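This prediction comes from `predict()` with row 6’s observed values (the model object name here is my own):

```r
# Fit the imputation model, then predict row 6's missing Solar.R value
m <- lm(Solar.R ~ ., data = airquality[, -1])
predict(m, newdata = data.frame(Wind = 14.9, Temp = 66, Month = 5, Day = 6))
```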

```
## 1
## 189.3325
```

Let’s try this using **{recipes}**.

```
recipe(Ozone ~ ., data = airquality) %>%
  step_impute_linear(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
```

```
## # A tibble: 153 x 6
## Solar.R Wind Temp Month Day Ozone
## <int> <dbl> <int> <int> <int> <int>
## 1 190 7.4 67 5 1 41
## 2 118 8 72 5 2 36
## 3 149 12.6 74 5 3 12
## 4 313 11.5 62 5 4 18
## 5 152 14.3 56 5 5 NA
## 6 189 14.9 66 5 6 28
## 7 299 8.6 65 5 7 23
## 8 99 13.8 59 5 8 19
## 9 19 20.1 61 5 9 8
## 10 194 8.6 69 5 10 NA
## # … with 143 more rows
```

And we see two important things here. First, row 6 for `Solar.R` is indeed what we expected it to be (albeit in integer form). Second, the imputed values for rows 5 and 6 are now *different*, which is the first time we’ve seen this across our imputation approaches.

The same basic approach can be used for essentially any statistical model. The **{recipes}** package has currently implemented linear imputation (as above), \(k\)-nearest neighbor imputation, and bagged imputation (via bagged trees). Let’s see how rows 5 and 6 differ with these approaches.

```
# k-nearest neighbor imputation
recipe(Ozone ~ ., data = airquality) %>%
  step_impute_knn(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
```

```
## # A tibble: 153 x 6
## Solar.R Wind Temp Month Day Ozone
## <int> <dbl> <int> <int> <int> <int>
## 1 190 7.4 67 5 1 41
## 2 118 8 72 5 2 36
## 3 149 12.6 74 5 3 12
## 4 313 11.5 62 5 4 18
## 5 159 14.3 56 5 5 NA
## 6 220 14.9 66 5 6 28
## 7 299 8.6 65 5 7 23
## 8 99 13.8 59 5 8 19
## 9 19 20.1 61 5 9 8
## 10 194 8.6 69 5 10 NA
## # … with 143 more rows
```

```
# bagged imputation
recipe(Ozone ~ ., data = airquality) %>%
  step_impute_bag(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
```

```
## # A tibble: 153 x 6
## Solar.R Wind Temp Month Day Ozone
## <int> <dbl> <int> <int> <int> <int>
## 1 190 7.4 67 5 1 41
## 2 118 8 72 5 2 36
## 3 149 12.6 74 5 3 12
## 4 313 11.5 62 5 4 18
## 5 99 14.3 56 5 5 NA
## 6 252 14.9 66 5 6 28
## 7 299 8.6 65 5 7 23
## 8 99 13.8 59 5 8 19
## 9 19 20.1 61 5 9 8
## 10 194 8.6 69 5 10 NA
## # … with 143 more rows
```

These models are quite a bit more flexible than linear regression, and can potentially overfit. You can, however, control some of the parameters of the models through additional arguments (e.g., \(k\) for \(k\)-nearest neighbor imputation, which defaults to 5). The benefit of these models is that they may provide better estimates of what the imputed value *would* have been, were it not missing, which may in turn improve model performance. The downside is that they are quite a bit more computationally intensive. Generally, you use recipes within processes like \(k\)-fold cross-validation, with the recipe applied to each fold. In that case, a computationally expensive approach may significantly bog down hyperparameter tuning.

### 31.5.4 A few words of caution

Missing data is a highly complex topic. This section was meant to provide a basic overview of some of the options you can choose from when building a predictive model. **None** of these approaches, however, will “fix” data that are missing not at random (MNAR). Unfortunately, it is usually impossible to know whether your data are MNAR, so we typically assume that data are missing at random (MAR) - that is, missing at random conditional on the observed data. For example, if boys were more likely than girls to have missing data on the outcome, we could account for this by including a gender variable in the model, and the resulting data would be MAR.

If you have significant missing data, this section is surely incomplete. We recommended Little and Rubin (2002) previously, and there are a number of other good resources, including a chapter in Kuhn and Johnson (2019).