## 31.4 Dealing with low variance predictors

Occasionally you have (or can create) variables that are highly imbalanced. A common example is a gender variable that takes on the values “male”, “female”, “non-binary”, “other”, and “refused to answer”. Once you dummy-code a variable like this, one or more of the categories may be so infrequent that modeling that category becomes difficult. This is not to say that these categories are unimportant, particularly when considering how well your training dataset represents the real-world population you will apply the model to (and any demographic variable is going to be associated with issues of ethics). Ignoring this variation may lead to systematic biases in model predictions. However, you also regularly have to make compromises to get models to work and be useful. One of those compromises often includes (with many types of variables, not just demographics) dropping highly imbalanced predictors.

Let’s look back at our statewide testing data. Let’s `bake` the final recipe from our Creating a recipe section on the training data and look at the dummy variables that are created.

```
rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_mutate(lang_cd = factor(ifelse(is.na(lang_cd), "E", lang_cd)),
              tst_dt = lubridate::mdy_hms(tst_dt)) %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal())

baked <- rec %>% 
  prep() %>% 
  bake(new_data = NULL)
```

Below is a table of just the categorical variables and the frequency of each value.

The relative frequency of many of these looks fine, but for some, one category has very low frequency. For example, `ayp_lep_M` has 576 observations (from our random 2% sample) that were \(0\), and only 2 that were \(1\). The same is true for `ayp_lep_S`. We may therefore consider applying a *near-zero variance filter* to drop these columns. Let’s try this, and then we’ll talk a bit more about what the filter is actually doing.

```
rec_nzv <- rec %>% 
  step_nzv(all_predictors())

baked_rm_nzv <- rec_nzv %>% 
  prep() %>% 
  bake(new_data = NULL)
```

Let’s look at the columns that are in `baked` but are missing from `baked_rm_nzv` (i.e., the columns the filter removed).
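This set difference isn’t computed in the code above, but it can be obtained the same way as `removed_columns2` is later in this section; a minimal sketch:

```
# Columns present after the original recipe but dropped by the
# near-zero variance filter (set difference on column names)
removed_columns <- names(baked)[!(names(baked) %in% names(baked_rm_nzv))]
removed_columns
```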

```
## [1] "ethnic_cd_B" "ethnic_cd_I" "ethnic_cd_P"
## [4] "migrant_ed_fg_Y" "ind_ed_fg_Y" "ayp_lep_B"
## [7] "ayp_lep_M" "ayp_lep_S" "ayp_lep_W"
## [10] "stay_in_dist_Y" "stay_in_schl_Y" "dist_sped_Y"
## [13] "trgt_assist_fg_Y" "ayp_dist_partic_Y" "ayp_schl_partic_Y"
## [16] "ayp_dist_prfrm_Y" "ayp_schl_prfrm_Y" "rc_dist_partic_Y"
## [19] "rc_schl_partic_Y" "rc_dist_prfrm_Y" "rc_schl_prfrm_Y"
## [22] "lang_cd_E" "tst_atmpt_fg_Y" "grp_rpt_dist_partic_Y"
## [25] "grp_rpt_schl_partic_Y" "grp_rpt_dist_prfrm_Y" "grp_rpt_schl_prfrm_Y"
```

As you can see, the near-zero variance filter has been quite aggressive here, removing 27 columns. Looking back at our table of variables, we can see that, for example, there are 55 students coded Black out of 2841, and it could be reasonably argued that this column is worth keeping in the model.

So how is `step_nzv` working, and how can we adjust it to be not quite so aggressive? Variables are flagged as near-zero variance if they

- Have very few unique values, and
- Have a large frequency ratio between the most common value and the second most common value

These criteria are implemented in `step_nzv` through the `unique_cut` and `freq_cut` arguments, respectively. `unique_cut` is estimated as the number of unique values divided by the total number of samples (the length of the column), times 100 (i.e., it is a percentage). `freq_cut` is estimated by dividing the frequency of the most common level by the frequency of the second most common level. The default for `unique_cut` is \(10\), while the default for `freq_cut` is \(95/5 = 19\). For a column to be “caught” by a near-zero variance filter, and removed from the training set, it must be *below* the specified `unique_cut` and *above* the specified `freq_cut`.
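Spelled out explicitly, those defaults look like this (equivalent to calling `step_nzv(all_predictors())` with no extra arguments):

```
# The defaults made explicit: flag columns with fewer than 10% unique
# values AND a most-common/second-most-common frequency ratio above 95/5
rec_nzv_default <- rec %>% 
  step_nzv(all_predictors(), freq_cut = 95/5, unique_cut = 10)
```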

In the case of `ethnic_cd_B`, we see that there are two unique values, \(0\) and \(1\) (because it’s a dummy-coded variable). There are 2841 rows, so the `unique_cut` value is \((2 / 2841) \times 100 = 0.07\). The frequency ratio is \(2786/55 = 50.65\). It therefore meets both of the default criteria (below `unique_cut` and above `freq_cut`) and is removed.
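We can check these numbers directly. The sketch below (the function name is illustrative, not part of **recipes**) computes both statistics for a dummy variable with the same distribution as `ethnic_cd_B`:

```
# Illustrative re-implementation of the two statistics step_nzv() uses
nzv_stats <- function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(tab) > 1) tab[[1]] / tab[[2]] else Inf
  pct_unique <- length(unique(x)) / length(x) * 100
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# 2786 zeros and 55 ones, matching ethnic_cd_B
ethnic_cd_B <- rep(c(0, 1), times = c(2786, 55))
nzv_stats(ethnic_cd_B)
# freq_ratio ~ 50.65 (> 19) and pct_unique ~ 0.07 (< 10), so it is removed
```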

If you’re applying a near-zero variance filter to dummy variables, there will always be only two unique values, leading to a small `unique_cut`. This might encourage you to raise `freq_cut` to a higher value. Let’s try this approach.

```
rec_nzv2 <- rec %>% 
  step_nzv(all_predictors(), 
           freq_cut = 99/1)

baked_rm_nzv2 <- rec_nzv2 %>% 
  prep() %>% 
  bake(new_data = NULL)

removed_columns2 <- names(baked)[!(names(baked) %in% names(baked_rm_nzv2))]
removed_columns2
```

```
## [1] "ethnic_cd_P" "ayp_lep_M" "ayp_lep_S"
## [4] "ayp_lep_W" "dist_sped_Y" "ayp_dist_partic_Y"
## [7] "ayp_schl_partic_Y" "rc_dist_partic_Y" "rc_schl_partic_Y"
## [10] "tst_atmpt_fg_Y" "grp_rpt_dist_partic_Y" "grp_rpt_schl_partic_Y"
## [13] "grp_rpt_dist_prfrm_Y"
```

Removing near-zero variance dummy variables can be a bit tricky because they will essentially always meet the `unique_cut` criterion. But it can be achieved by adjusting the `freq_cut` argument, which could, in fact, be considered part of your model tuning process. In this case, we’ve set it so variables will be removed if more than 99 out of every 100 cases are the same. This led to only 13 variables being flagged and removed. But we could go further still, specifying, for example, that 499 out of every 500 cases must be the same for the variable to be flagged. At some point, however, you’ll end up with variables that have such low variance that model estimation becomes difficult, which is the purpose of applying the near-zero variance filter in the first place.