34.2 Random Forests

One potential problem with bagged trees is that the predictions across trees often correlate strongly. This is because, even though each tree is built from a new bootstrap sample, all of the features are available to every tree. Thus, if a given feature is strongly predictive of the outcome, it is likely to be used as the root node in nearly every tree. Because decision trees are built recursively, with each optimal split determined only after the previous splits are in place, we may be missing out on potential predictive accuracy by never allowing other features to “start” the tree (be the root node). This does not mean that bagging is not useful. As we just saw, bagging gave us fairly substantial gains in model performance compared to what we observed with a single tree. Yet, if we could decorrelate the trees, we might be able to improve performance even more.

Random forests extend bagged trees by introducing a stochastic component to the features. Specifically, rather than building each tree with all of the features, each tree is built using a random sample of features at each split. Any individual tree is then unlikely to perform as well as a tree that can consider all the features at each split, but the forest of trees (the aggregate prediction) can regularly provide better predictions than any single tree.

Assume we have a dataset with one very strong predictor and a few moderately strong predictors. If we used bagging, the very strong predictor would sit at the root of nearly every tree, and the moderately strong predictors would likely be internal nodes, meaning their contribution would be conditional on the strong predictor. Yet these variables might provide important information for specific samples on their own. If the very strong predictor is not available for the root node split, that tree will be built with one of the moderately strong predictors at the root node. The very strong predictor may then be selected at one of the internal nodes and, while that tree may not be as predictive overall, it may result in better predictions for specific cases. If we average over all of these diverse trees, we can often get better predictions for the entire dataset.

In essence, a random forest is just a bagged tree ensemble with a stochastic component introduced to decorrelate the trees. Each individual tree is therefore more different from the others than the trees in a bagged ensemble. When we average across all the trees, we reduce this variance and may get improved overall performance compared to a single tree or a bagged tree.

Random forests work like bagged trees, but include a random selection of features at each split. This helps decorrelate the trees and can help get at the unique features of the data.
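
To make that random selection concrete, here is a purely illustrative sketch (with made-up feature names, not anything we use below) of what “a random sample of features at each split” means:

# Purely illustrative: at each split, only a random subset of the
# features (of size mtry) is considered as a candidate for splitting
set.seed(42)
features <- paste0("x", 1:5) # five hypothetical predictors
mtry <- 2

sample(features, mtry) # candidate features for one split
sample(features, mtry) # a different random draw for the next split

Each split in each tree gets its own draw, which is what keeps the trees from all looking alike.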

Random forest models tend to provide very good “out of the box” model performance. Because bagged trees are just a special case of random forests, and the number of predictors to select for each tree is a hyperparameter that can be tuned, it’s generally worth starting with a random forest and only moving to a bagged tree if, for your specific situation, using all the predictors for each tree works better than using a sample.

34.2.1 Fitting random forests

In the previous section we used the rand_forest() function to fit bagged classification trees and obtain performance metrics on the OOB samples. We did this by setting mtry = p, where p is equal to the number of predictor variables. When fitting a random forest, we just change mtry to a smaller value, which represents the number of features to randomly select from at each split. Everything else is the same: trees represents the number of bags, or the number of trees in the forest, while min_n represents the minimum sample size for a terminal node (i.e., limiting the depth to which each tree is grown).

A good place to start is mtry = p/3 for regression problems and mtry = sqrt(p) for classification problems. Generally, higher values of mtry will work better when the data are fairly noisy and the predictors are not overly strong. Lower values of mtry will work better when there are a few very strong predictors.
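
As a quick sketch of these rules of thumb (p here is just the number of predictor columns; the value 5 matches the classification example we fit below):

# A quick sketch of the rules of thumb above; `p` is the number of
# predictor columns in the processed training data
p <- 5
floor(p / 3)   # starting mtry for a regression problem
floor(sqrt(p)) # starting mtry for a classification problem; 2 here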

Just like bagged trees, we need a sufficient number of trees for the performance estimate to stabilize. Including more trees will not hurt your model performance, but it will increase computation time. Generally, random forests will need more trees to stabilize than bagged trees, because each individual tree is “noisier” than those in a bagged ensemble. A good place to start is at about 1,000 trees, but if your learning curve suggests the model is still trending toward better performance (rather than stabilizing/flatlining), then you should add more trees. Note that the number of trees you need will also depend on the other hyperparameters: lower values of mtry and min_n will likely require a greater number of trees.

Because a random forest uses bagging, we can again choose whether to use \(k\)-fold cross-validation or to evaluate performance based on the OOB samples. If we were in a situation where we wanted to be extra sure we had conducted the hyperparameter tuning correctly, we might start by tuning on the OOB metrics, then evaluate a small set of candidate hyperparameters with \(k\)-fold CV to finalize them. In this case, we will only use the OOB metrics to save on computation time.
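
If we did want that final \(k\)-fold check, it might look something like the sketch below (not run here). It assumes the rec recipe and train data from the previous section, and the specific candidate values in the grid are just for illustration:

# A sketch of finalizing hyperparameters with k-fold CV; assumes `rec`
# and `train` from the previous section, with a small candidate grid
# chosen from the OOB results
cv <- vfold_cv(train, v = 10)

rf_tune_mod <- rand_forest() %>% 
  set_mode("classification") %>% 
  set_engine("ranger") %>% 
  set_args(mtry = tune(), min_n = tune(), trees = 1000)

small_grid <- expand.grid(mtry = c(2, 3), min_n = c(20, 25))

rf_tune_cv <- tune_grid(rf_tune_mod, rec, resamples = cv,
                        grid = small_grid,
                        metrics = metric_set(roc_auc))
show_best(rf_tune_cv, metric = "roc_auc")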

34.2.1.1 A classification example

Let’s continue with our classification example, using the same training data and recipe we used in the previous section. As a reminder, the recipe looked like this:

rec <- recipe(accuracy_group ~ ., data = train) 

And we prepared the data for analysis so it looked like this:

processed_train <- rec %>% 
  prep() %>% 
  bake(new_data = NULL)

processed_train
## # A tibble: 65,047 x 6
##    event_count event_code game_time title               world     accuracy_group
##          <dbl>      <dbl>     <dbl> <fct>               <fct>     <ord>         
##  1          43       3020     25887 Bird Measurer (Ass… TREETOPC… 0             
##  2          28       3121     19263 Cauldron Filler (A… MAGMAPEAK 2             
##  3          76       4020     76372 Chest Sorter (Asse… CRYSTALC… 1             
##  4           1       2000         0 Bird Measurer (Ass… TREETOPC… 2             
##  5          39       4025     44240 Chest Sorter (Asse… CRYSTALC… 0             
##  6          21       4020     14421 Cart Balancer (Ass… CRYSTALC… 3             
##  7          30       3110     21045 Cart Balancer (Ass… CRYSTALC… 3             
##  8          40       3121     28769 Cart Balancer (Ass… CRYSTALC… 3             
##  9           2       2020        51 Cart Balancer (Ass… CRYSTALC… 3             
## 10          15       3021      8658 Mushroom Sorter (A… TREETOPC… 3             
## # … with 65,037 more rows

We will again optimize on the model AUC. Recall that we have 5 predictor variables. Let’s start by fitting a random forest with 1,000 trees, randomly sampling two predictors at each split (\(\sqrt{5} \approx 2.24\), rounded down to 2). We’ll set min_n to 2 again so we start our modeling process by growing very deep trees.

rf_mod1 <- rand_forest() %>% 
    set_mode("classification") %>% 
    set_engine("ranger") %>% 
    set_args(mtry = 2,
             min_n = 2,
             trees = 1000)
tic()
rf_fit1 <- fit(rf_mod1, 
               formula = accuracy_group ~ .,
               data = processed_train)
toc()
## 74.525 sec elapsed

Note that even with 1,000 trees, the model fits very quickly. What does our AUC look like for this model?

pred_frame <- rf_fit1$fit$predictions %>% 
  as_tibble() %>% 
  mutate(observed = processed_train$accuracy_group)

roc_auc(pred_frame, observed, `0`:`3`)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc hand_till      0.751

Looks pretty good! Note that this is quite similar to the best value we estimated using a bagged tree.

But did our model stabilize? Did we use enough trees? Let’s investigate. First, we’ll write a function so we can easily refit our model with any number of trees, and return the OOB AUC estimate.

fit_rf <- function(tree_n, mtry = 2, min_n = 2) {
  
  # specify the model
  rf_mod <- rand_forest() %>% 
    set_mode("classification") %>% 
    set_engine("ranger") %>% 
    set_args(mtry = mtry,
             min_n = min_n,
             trees = tree_n)
  
  # fit the model
  rf_fit <- fit(rf_mod, 
                formula = accuracy_group ~ .,
                data = processed_train)

  # Create a data frame from which to make predictions
  pred_frame <- rf_fit$fit$predictions %>% 
    as_tibble() %>% 
    mutate(observed = processed_train$accuracy_group)

  # Make the predictions, and output other relevant information
  roc_auc(pred_frame, observed, `0`:`3`) %>% 
    mutate(trees = tree_n, 
           mtry = mtry, 
           min_n = min_n)
}

Notice in the above we’ve made the function a bit more general so we can come back to check the number of trees with different values of mtry and min_n. Let’s look at values from 1 (a single tree, which is almost certainly too low) to 1,201 by increments of 50 trees. First, we loop through these values using map_df() so the results are all bound into a single data frame.

The code below took about 10 minutes to run on a local computer.

tic()
test_n_trees <- map_df(seq(1, 1201, 50), fit_rf)
toc()

Next, we plot the result!

ggplot(test_n_trees, aes(trees, .estimate)) +
  geom_line()

And as we can see, we’re actually pretty safe with a much lower value, perhaps even as low as 100. However, because the models run very fast, we won’t worry about this too much. When we conduct our model tuning, we’ll drop to 500 trees instead of 1000 (still well above what it appears is needed).

Let’s see if we can improve performance by changing mtry or min_n. Because we only have five predictor variables in this case, we don’t have a huge range to evaluate with mtry. But let’s set up a regular grid looking at values of 2, 3, 4, and 5 for mtry, and min_n values of 2, 5, 10, 15, 20, and 25.

grid <- expand.grid(mtry = 2:5,
                    min_n = c(2, seq(5, 25, 5)))

We can then use our fit_rf() function again, but this time setting the number of trees and looping through each of our mtry and min_n values.

The code below took about 18 and a half minutes to run on a local computer.

rf_tune <- map2_df(grid$mtry, grid$min_n, ~fit_rf(500, .x, .y))

rf_tune %>% 
  arrange(desc(.estimate))
## # A tibble: 24 x 6
##    .metric .estimator .estimate trees  mtry min_n
##    <chr>   <chr>          <dbl> <dbl> <int> <dbl>
##  1 roc_auc hand_till      0.752   500     2    25
##  2 roc_auc hand_till      0.752   500     2    15
##  3 roc_auc hand_till      0.752   500     2    20
##  4 roc_auc hand_till      0.752   500     2    10
##  5 roc_auc hand_till      0.751   500     2     5
##  6 roc_auc hand_till      0.751   500     2     2
##  7 roc_auc hand_till      0.749   500     3    25
##  8 roc_auc hand_till      0.748   500     3    20
##  9 roc_auc hand_till      0.745   500     3    15
## 10 roc_auc hand_till      0.744   500     4    25
## # … with 14 more rows

Notice that all of our top models here use mtry = 2, the smallest value we evaluated, generally paired with larger values of min_n. So in this case, randomly sampling a subset of predictors at each split, i.e., a true random forest, is looking like a better option than a bagged tree.

34.2.1.2 A regression example

For completeness, let’s run through another example using our statewide testing data. We didn’t use these data when fitting a bagged tree model, but we can start with a random forest anyway and see if it simplifies to a bagged tree.

First, let’s read in the data and split off our training set. We’ll only sample 20% of the data so things run more quickly.

state_tests <- get_data("state-test") %>% 
  select(-classification) %>% 
  slice_sample(prop = 0.2)

splt_reg <- initial_split(state_tests)
train_reg <- training(splt_reg)

Now we’ll create a recipe for these data. It is a bit more complicated than the last example because the data are a fair amount more complicated. The recipe looks like this:

rec_reg <- recipe(score ~ ., data = train_reg)  %>% 
  step_mutate(tst_dt = lubridate::mdy_hms(tst_dt),
              time_index = as.numeric(tst_dt)) %>%
  update_role(tst_dt, new_role = "time_index")  %>% 
  update_role(contains("id"), ncessch, new_role = "id vars")  %>% 
  step_novel(all_nominal())  %>% 
  step_unknown(all_nominal())  %>% 
  step_rollimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% 
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # necessary when date is NA
  step_zv(all_predictors()) 

The recipe above does the following:

  • Defines score as the outcome and all other variables as predictors
  • Changes the tst_dt column (date the assessment was taken) to be a date (rather than character) and creates a new column that is a numeric version of the date
  • Changes the role of tst_dt from a predictor to a time_index, which is a special role that can be used to calculate date windows
  • Changes the role of all ID variables to an arbitrary "id vars" role, just so they are not used as predictors in the model
  • Recodes nominal columns such that previously unencountered levels of the variable will be recoded to a "new" level
  • Imputes an "unknown" category for all nominal data
  • Uses a rolling time window imputation (the median of the closest 5 data points) to impute all numeric columns
  • In cases where the time window is missing, imputes with the median of the column for all numeric columns
  • Removes zero variance predictors

Next, we want to process the data using this recipe. Note that after we bake() the data, we remove those variables that are not predictors. It might be worth considering keeping a categorical version of the school ID in the model (given that the school in which a student is enrolled is likely related to their score), but to keep things simple we’ll just remove all ID variables for now.

processed_reg <- rec_reg %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  select(-contains("id"), -ncessch, -tst_dt)

processed_reg
## # A tibble: 28,414 x 32
##    gndr  ethnic_cd enrl_grd tst_bnch migrant_ed_fg ind_ed_fg sp_ed_fg tag_ed_fg
##    <fct> <fct>        <dbl> <fct>    <fct>         <fct>     <fct>    <fct>    
##  1 M     H                6 G6       N             N         N        N        
##  2 M     W                7 G7       N             N         N        Y        
##  3 M     W                8 3B       N             N         N        N        
##  4 F     W                7 G7       N             N         N        N        
##  5 M     H                5 2B       N             N         Y        N        
##  6 F     W                3 1B       N             N         Y        N        
##  7 F     W                8 3B       N             N         N        N        
##  8 M     M                6 G6       N             N         N        N        
##  9 M     H                7 G7       N             N         N        N        
## 10 M     H                7 G7       N             N         Y        N        
## # … with 28,404 more rows, and 24 more variables: econ_dsvntg <fct>,
## #   ayp_lep <fct>, stay_in_dist <fct>, stay_in_schl <fct>, dist_sped <fct>,
## #   trgt_assist_fg <fct>, ayp_dist_partic <fct>, ayp_schl_partic <fct>,
## #   ayp_dist_prfrm <fct>, ayp_schl_prfrm <fct>, rc_dist_partic <fct>,
## #   rc_schl_partic <fct>, rc_dist_prfrm <fct>, rc_schl_prfrm <fct>,
## #   lang_cd <fct>, tst_atmpt_fg <fct>, grp_rpt_dist_partic <fct>,
## #   grp_rpt_schl_partic <fct>, grp_rpt_dist_prfrm <fct>,
## #   grp_rpt_schl_prfrm <fct>, lat <dbl>, lon <dbl>, score <dbl>,
## #   time_index <dbl>

We’ll now use the processed data to fit a random forest, evaluating our model using the OOB samples. Let’s start by writing a function that fits the model for a given set of hyperparameters and returns a data frame with the hyperparameter values and our performance metric (RMSE). Notice I’ve set num.threads = 16 because the local computer I’m working on has (surprise!) 16 cores.

rf_fit_reg <- function(tree_n, mtry, min_n) {
  rf_mod_reg <- rand_forest() %>% 
    set_mode("regression") %>% 
    set_engine("ranger") %>% 
    set_args(mtry = mtry,
             min_n = min_n,
             trees = tree_n,
             num.threads = 16)

  rf_fit <- fit(rf_mod_reg, 
                formula = score ~ .,
                data = processed_reg)
  
  # output RMSE
  tibble(rmse = sqrt(rf_fit$fit$prediction.error)) %>% 
    mutate(trees = tree_n, 
           mtry = mtry, 
           min_n = min_n)
}

In the above I’ve used sqrt(rf_fit$fit$prediction.error) for the root mean square error, rather than computing it from the model predictions with a {yardstick} function. This is because the default OOB error for a regression model with {ranger} is the mean square error, so we don’t have to generate any predictions on our own - we just need to take the square root of this value. Most of the rest of this function is the same as before.
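
If you want to convince yourself these are equivalent, one quick (hypothetical) check is to compute the RMSE from the OOB predictions that {ranger} stores on the fitted object and compare it to the square root of prediction.error. A sketch, with arbitrary hyperparameter values:

# A sketch of an equivalence check; the hyperparameter values are
# arbitrary and the model is refit just for this comparison
rf_check <- rand_forest() %>% 
  set_mode("regression") %>% 
  set_engine("ranger") %>% 
  set_args(mtry = 10, min_n = 2, trees = 1000) %>% 
  fit(score ~ ., data = processed_reg)

# RMSE computed from the stored OOB predictions with {yardstick}
tibble(observed  = processed_reg$score,
       predicted = rf_check$fit$predictions) %>% 
  rmse(observed, predicted)

# should match (up to rounding)
sqrt(rf_check$fit$prediction.error)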

Let’s now conduct our tuning. First, let’s check how many predictors we have:

ncol(processed_reg) - 1
## [1] 31

Remember, for regression problems, a good place to start is around \(p/3\), which is about 10.33 here. Let’s make a grid of mtry values centered around this value. For min_n, we’ll use the same values we did before: 2, 5, 10, 15, 20, and 25. We’ll then evaluate the OOB RMSE across these different values and see if we need to conduct any more hyperparameter tuning.

grid_reg <- expand.grid(mtry = 5:15,
                        min_n = c(2, seq(5, 25, 5)))

And now we’ll use our function from above to fit each of these models, evaluating each with the OOB RMSE. We’ll start out with 1,000 trees, then check with our finalized model that this number is sufficient. Note that, even though we’re using the OOB samples for our tuning, the following code takes a decent amount of time to run.

The code below took about 8 and half minutes to run on a local computer.

rf_reg_fits <- map2_df(grid_reg$mtry, grid_reg$min_n, 
                       ~rf_fit_reg(1000, .x, .y))

rf_reg_fits %>% 
  arrange(rmse)
## # A tibble: 66 x 4
##     rmse trees  mtry min_n
##    <dbl> <dbl> <int> <dbl>
##  1  86.1  1000     9    25
##  2  86.1  1000     9    20
##  3  86.2  1000     8    25
##  4  86.2  1000     8    20
##  5  86.2  1000     8    15
##  6  86.2  1000     7    15
##  7  86.2  1000     7    20
##  8  86.2  1000     7    25
##  9  86.2  1000    10    25
## 10  86.2  1000     7    10
## # … with 56 more rows

Let’s evaluate the hyperparameters we searched by plotting the mtry values against the RMSE, faceting by min_n.

ggplot(rf_reg_fits, aes(mtry, rmse)) +
  geom_line() +
  facet_wrap(~min_n)

It looks like mtry values of around 7 to 9 all perform essentially equivalently, and our min_n is likely somewhere north of 20. Let’s see if we can improve performance a bit more by tuning further on min_n. We’ll evaluate values from 21 to 31 in increments of 2, while using mtry values of 7 and 8.

grid_reg2 <- expand.grid(mtry = c(7, 8),
                         min_n = seq(21, 31, 2))

tic()
rf_reg_fits2 <- map2_df(grid_reg2$mtry, grid_reg2$min_n, 
                       ~rf_fit_reg(1000, .x, .y))
toc()
## 279.37 sec elapsed
rf_reg_fits2 <- rf_reg_fits2 %>% 
  arrange(rmse)
rf_reg_fits2
## # A tibble: 12 x 4
##     rmse trees  mtry min_n
##    <dbl> <dbl> <dbl> <dbl>
##  1  86.6  1000     8    31
##  2  86.6  1000     8    21
##  3  86.6  1000     8    29
##  4  86.6  1000     8    25
##  5  86.6  1000     8    23
##  6  86.6  1000     8    27
##  7  86.6  1000     7    23
##  8  86.6  1000     7    21
##  9  86.6  1000     7    31
## 10  86.7  1000     7    25
## 11  86.7  1000     7    29
## 12  86.7  1000     7    27

Notice the RMSE is essentially equivalent for all models listed above. We’ll go forward with the first model, but any of these combinations of hyperparameters likely provide similar out-of-sample predictive accuracy (according to RMSE).

Finally, before moving on to evaluating our model against the test set, we would likely want to make sure that we fit the model with a sufficiently large number of trees. Let’s investigate with the model that had the lowest RMSE. We’ll use trees from 500 to 1,500 by increments of 100.

tic()
rf_reg_ntrees <- map_df(seq(500, 1500, 100), 
                        ~rf_fit_reg(.x, 
                                    mtry = rf_reg_fits2$mtry[1], 
                                    min_n = rf_reg_fits2$min_n[1])
                        )
toc()
## 268.152 sec elapsed

And now we’ll plot the number of trees against the RMSE, looking for a point of stability.

ggplot(rf_reg_ntrees, aes(trees, rmse)) +
  geom_line()

Hmm… that doesn’t look terrifically stable. BUT WAIT! Notice that the range from the bottom to the top of the y-axis is only about one-tenth of a point. So this is not really a problem of instability; the RMSE is essentially stable across all of these values. Just to prove this to ourselves, let’s look at the same plot, but with the y-axis spanning a full point (which is still not very much), going from 86 to 87.

ggplot(rf_reg_ntrees, aes(trees, rmse)) +
  geom_line() +
  ylim(86, 87)

And now it looks more stable. So even with 500 trees we likely would have been fine.

From here, we could continue on to evaluate our final model against the test set using the last_fit() function, but we will leave that as an exercise for the reader. Note also that we only worked with 20% of the total data, so the hyperparameters we settled on may be different if we used more of the data. We could also likely improve performance by including additional information in the model - e.g., information on staffing and funding, the size of the school or district, indicators of the economic and demographic makeup of the surrounding area, or practices related to school functioning (e.g., use of suspension/expulsion).
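
If you do want to carry out that last step, its general shape might look something like the following sketch (not run here), assuming splt_reg and rec_reg from above and the tuned values stored in rf_reg_fits2:

# A sketch of the final evaluation step; assumes splt_reg, rec_reg, and
# rf_reg_fits2 (already arranged by RMSE) are still in the environment
final_mod <- rand_forest() %>% 
  set_mode("regression") %>% 
  set_engine("ranger") %>% 
  set_args(mtry = rf_reg_fits2$mtry[1],
           min_n = rf_reg_fits2$min_n[1],
           trees = 1000)

final_res <- last_fit(final_mod, preprocessor = rec_reg, split = splt_reg)
collect_metrics(final_res)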