35.3 Fitting boosted tree models

Let’s use the same statewide testing data we have used at numerous points now, along with the same recipe.

library(sds)
library(tidyverse)
library(tidymodels)
library(tictoc)

set.seed(41920)
full_train <- get_data("state-test") %>% 
  slice_sample(prop = 0.05) %>% 
  select(-classification)

splt <- initial_split(full_train)
train <- training(splt)
cv <- vfold_cv(train)

Next we’ll specify a basic recipe. One thing to note about the below is that we generally don’t have to worry about dummy-coding variables for decision trees. However, dummy-coding (or one-hot encoding) is necessary when using {xgboost}.

rec <- recipe(score ~ ., train) %>% 
  step_mutate(tst_dt = as.numeric(lubridate::mdy_hms(tst_dt))) %>% 
  update_role(contains("id"), -ncessch, new_role = "id vars") %>%
  step_zv(all_predictors()) %>% 
  step_novel(all_nominal()) %>% 
  step_unknown(all_nominal()) %>% 
  step_medianimpute(all_numeric(), 
                    -all_outcomes(), 
                    -has_role("id vars"))  %>% 
  step_dummy(all_nominal(), 
             -has_role("id vars"),
             one_hot = TRUE) %>% 
  step_nzv(all_predictors(), freq_cut = 995/5)

XGBoost is “an optimized distributed gradient boosting library designed to be highly efficient” (emphasis in the original, from the XGBoost documentation). The distributed part means the models are built to work efficiently with parallel processing, which makes the library (relative to alternative methods) extremely fast. We specify the model the same basic way we always do in the {tidymodels} framework, using parsnip::boost_tree() and setting the engine to xgboost. We’ll also set the mode to regression.

mod <- boost_tree() %>% 
  set_engine("xgboost") %>% 
  set_mode("regression") 

We can translate this model to essentially a raw {xgboost} model using the translate() function.

translate(mod)
## Boosted Tree Model Specification (regression)
## 
## Computational engine: xgboost 
## 
## Model fit template:
## parsnip::xgb_train(x = missing_arg(), y = missing_arg(), nthread = 1, 
##     verbose = 0)

Note that the data have not yet been specified (we pass these arguments when we use fit_resamples() or similar). Additionally, the nthread argument specifies the number of processors to use in parallel processing. By default, {parsnip} sets this argument to 1, whereas {xgboost} on its own would default to the maximum number of processors available. We can change this by passing the nthread argument either directly to set_engine() or to set_args().
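
For example, a minimal sketch of both approaches (here we hypothetically assume we want four processors):

mod_4core <- boost_tree() %>% 
  set_engine("xgboost", nthread = 4) %>% # engine-specific argument
  set_mode("regression")

# or, equivalently, after the model has been specified
mod_4core <- mod_4core %>% 
  set_args(nthread = 4)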

Additionally, we want to use early stopping. To do this, we have to pass three additional arguments: trees, stop_iter, and validation. These represent, respectively, the maximum number of trees to use, the number of consecutive trees that must show no improvement on the validation set before the algorithm stops, and the proportion of the data that should be used as the validation set (to monitor the objective function and potentially stop early). The maximum number of trees just needs to be sufficiently large that early stopping will kick in at some point. The stop_iter argument is a hyperparameter, but it is less important than others - we generally use somewhere between 20 and 50, although values around 100 may also work well (because improvement is evaluated against the validation set, you generally just need a number that is sufficiently large). Finally, you must also pass a validation argument specifying the proportion of the data you would like to use to determine early stopping, or early stopping will not work (it can’t evaluate changes against a validation set when no validation set exists, and there is no default value for validation). We typically use 20% for validation. It is also worth noting that, at least at present, the methods for establishing early stopping differ somewhat between the two implementations (see here).

Let’s specify our model with these additional arguments, setting nthread to the number of processors we want to use for parallel processing (you can check how many are available with parallel::detectCores()).

parallel::detectCores()
## [1] 3
mod <- mod %>% 
    set_args(nthread = 16,
             trees = 5000,
             stop_iter = 20,
             validation = 0.2)

And finally, we can estimate using the default hyperparameter values.

tic()
gbt_def_tm <- fit_resamples(mod, rec, cv)
toc(log = TRUE)
## 25.644 sec elapsed

Notice we’re only working with 5% of the overall sample (to save time in rendering this chapter). But this was relatively quick for 10-fold CV. How does the default performance look?

collect_metrics(gbt_def_tm)
## # A tibble: 2 x 5
##   .metric .estimator   mean     n std_err
##   <chr>   <chr>       <dbl> <int>   <dbl>
## 1 rmse    standard   89.5      10  1.28  
## 2 rsq     standard    0.418    10  0.0134

Not bad! But there’s a lot of tuning we can still do to potentially improve performance from here. Notice that this took about 25.6 seconds. What if we do the same thing with {xgboost} directly?

First, we’ll need to process our data.

processed_train <- rec %>% 
  prep() %>% 
  bake(train)

Next, we’re going to drop the ID variables and the outcome and transform the data frame into a matrix, which is what {xgboost} will want.

features <- processed_train %>% 
  select(-score, -contains("id"), -ncessch) %>% 
  as.matrix()

And now we can fit the model using these data directly with {xgboost}. We’ll still use 10-fold cross-validation, but we’ll have {xgboost} handle it for us directly. Notice in the below that some of the arguments are named a bit differently, but they all map directly to what we used with {tidymodels}. We’re using squared error as the objective function, which is equivalent to minimizing the mean squared error. Notice also that we do not have to specify the nthread argument, because it will default to the maximum number of processors.

library(xgboost)
tic()
fit_default_xgb <- xgb.cv(
  data = features,
  label = processed_train$score,
  nrounds = 5000, # number of trees
  objective = "reg:squarederror", 
  early_stopping_rounds = 20, 
  nfold = 10,
  verbose = FALSE
)
toc(log = TRUE)
## 10.548 sec elapsed

And this took about 10.5 seconds to fit, so a bit faster. But this approach also requires additional data processing time before estimation (processing that happens automatically within the {tidymodels} implementation).

How do the results compare? When using {xgboost} directly, the history of the model is stored in an evaluation_log data frame inside the returned object. There is also a best_iteration value that specifies which iteration had the best value for the objective function for the given model. We can use these together to look at our final model results.

fit_default_xgb$evaluation_log %>% 
  dplyr::slice(fit_default_xgb$best_iteration)
##    iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
## 1:   22        78.28245      0.4560459       88.80643       2.20707

And as you can see, the RMSE value for the validation set is very similar to what we got with {tidymodels}.

One nice feature of using {xgboost} directly is that we can use the evaluation_log to plot the learning curve. We just have to rearrange the data frame a bit first.

log_def <- fit_default_xgb$evaluation_log %>% 
  pivot_longer(-iter,
               names_to = c("set", "metric", "measure"),
               names_sep = "_") %>% 
  pivot_wider(names_from = "measure",
              values_from = "value")
log_def
## # A tibble: 84 x 5
##     iter set   metric  mean    std
##    <dbl> <chr> <chr>  <dbl>  <dbl>
##  1     1 train rmse   1753. 0.246 
##  2     1 test  rmse   1753. 2.79  
##  3     2 train rmse   1230. 0.157 
##  4     2 test  rmse   1230. 2.79  
##  5     3 train rmse    864. 0.109 
##  6     3 test  rmse    864. 2.70  
##  7     4 train rmse    609. 0.0808
##  8     4 test  rmse    609. 2.60  
##  9     5 train rmse    431. 0.0666
## 10     5 test  rmse    431. 2.45  
## # … with 74 more rows

And then we can plot the test and train curves together.

ggplot(log_def, aes(iter, mean)) +
  geom_line(aes(color = set)) 

As we can see, in this case there is very little separation between the test and train sets, which is a good thing. If the curves started to converge but then separated again, or never really converged and always had a big gap between them, this would be strong evidence that we were overfitting (i.e., the training error continues to go down while the test/validation error goes up).
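
Because the error in the first few iterations is so large, it can also help to drop those iterations before plotting so the gap (or lack thereof) between the two curves is easier to see. A small sketch:

log_def %>% 
  filter(iter > 5) %>% # drop the first few (very high error) iterations
  ggplot(aes(iter, mean)) +
  geom_line(aes(color = set))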

35.3.1 Model tuning

Let’s now tune our model, starting with the learning rate. In what follows, we illustrate this step using both the {tidymodels} implementation and {xgboost} directly. After tuning the learning rate, however, we proceed with further tuning using only the {tidymodels} implementation.

The models here take a long time to fit. In fact, we did not estimate them here directly. Rather, we sampled half of the total data (rather than 5% as above) and ran each of the models below using a high performance computing cluster. We recommend not running these models on your local computer, and instead using high-performance computing.

First, we need to specify a grid of values to search over. Remember, smaller learning rate values lead to the model learning more slowly (and thus taking longer to fit) and risk getting stuck in local minima. However, lower learning rates are also more likely to find the global minimum. Higher learning rates fit more quickly, taking larger steps downhill, but risk “stepping over” the global minimum.

Let’s create a grid of 20 learning rate values to evaluate.

grd <- expand.grid(learn_rate = seq(0.001, 0.2, length.out = 20))

We’ll start with the {tidymodels} approach. Let’s update our previous {parsnip} model to set the learning rate to tune().

mod <- mod %>% 
  set_args(learn_rate = tune())
mod
## Boosted Tree Model Specification (regression)
## 
## Main Arguments:
##   trees = 5000
##   learn_rate = tune()
##   stop_iter = 20
## 
## Engine-Specific Arguments:
##   nthread = 16
##   validation = 0.2
## 
## Computational engine: xgboost

Next, we just use tune_grid() like normal, passing it our model, cv object, and grid to search over.

tune_tree_lr_tm <- tune_grid(mod, rec, cv, grid = grd)

Let’s try the same thing with {xgboost} directly. In this case, we have to loop over the learning rate candidate values “manually”. In the below, we’ve used purrr::pmap() because it’s the most general approach. In other words, you could easily modify the code to include additional hyperparameters by providing them within the list argument and then referring to them in the function by their position in the list (e.g., ..1 for the first argument, as below, ..2 for the second, ..7 for the seventh, etc.). In this case, however, purrr::map() or base::lapply() would work just as well, and you could always use a for loop if you prefer (you could also use named arguments within pmap() by supplying a full function via function(...)).

tune_tree_lr_xgb <- pmap(list(grd$learn_rate), ~ {
 xgb.cv(
   data = features,
   label = processed_train$score,
   nrounds = 5000, # number of trees
   objective = "reg:squarederror",
   early_stopping_rounds = 20, 
   nfold = 10,
   verbose = FALSE,
   params = list(eta = ..1) # learning rate = eta
 ) 
})
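
As noted above, you could also use named arguments by supplying a full function to pmap(). For example, a sketch (not run here, with purely illustrative candidate values) that extends the same loop to search over the learning rate and the maximum tree depth jointly:

lr_depth_grd <- expand.grid(
  eta = c(0.01, 0.05, 0.1),
  max_depth = c(4, 6, 8)
)

tune_lr_depth_xgb <- pmap(lr_depth_grd, function(eta, max_depth) {
  xgb.cv(
    data = features,
    label = processed_train$score,
    nrounds = 5000,
    objective = "reg:squarederror",
    early_stopping_rounds = 20,
    nfold = 10,
    verbose = FALSE,
    params = list(eta = eta, max_depth = max_depth)
  )
})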

How do the results compare? Let’s first look at the best hyperparameter values using {tidymodels}.

lr_tm_best <- collect_metrics(tune_tree_lr_tm) %>% 
  filter(.metric == "rmse") %>% 
  arrange(mean)
lr_tm_best
## # A tibble: 20 x 7
##    learn_rate .metric .estimator  mean     n std_err .config
##         <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <fct>  
##  1     0.0429 rmse    standard    84.0    10   0.400 Model05
##  2     0.0534 rmse    standard    84.0    10   0.388 Model06
##  3     0.127  rmse    standard    84.0    10   0.400 Model13
##  4     0.106  rmse    standard    84.0    10   0.391 Model11
##  5     0.0848 rmse    standard    84.0    10   0.418 Model09
##  6     0.0115 rmse    standard    84.0    10   0.382 Model02
##  7     0.0324 rmse    standard    84.0    10   0.384 Model04
##  8     0.0953 rmse    standard    84.0    10   0.401 Model10
##  9     0.0638 rmse    standard    84.0    10   0.394 Model07
## 10     0.0743 rmse    standard    84.0    10   0.368 Model08
## 11     0.137  rmse    standard    84.0    10   0.383 Model14
## 12     0.0219 rmse    standard    84.1    10   0.352 Model03
## 13     0.148  rmse    standard    84.1    10   0.436 Model15
## 14     0.158  rmse    standard    84.1    10   0.392 Model16
## 15     0.116  rmse    standard    84.1    10   0.415 Model12
## 16     0.179  rmse    standard    84.1    10   0.396 Model18
## 17     0.190  rmse    standard    84.1    10   0.394 Model19
## 18     0.169  rmse    standard    84.1    10   0.386 Model17
## 19     0.2    rmse    standard    84.1    10   0.412 Model20
## 20     0.001  rmse    standard    87.1    10   0.339 Model01

To get these same values with the {xgboost} model we have to work a little bit harder. The models are all stored in a single list. We’ll loop through that list and pull out the best iteration. If we name the list first by the learning rate values the entry represents, then we can use the .id argument with map_df() to return a nice data frame.

names(tune_tree_lr_xgb) <- grd$learn_rate
lr_xgb_best <- map_df(tune_tree_lr_xgb, ~{
  .x$evaluation_log[.x$best_iteration, ]
  }, .id = "learn_rate") %>% 
  arrange(test_rmse_mean) %>% 
  mutate(learn_rate = as.numeric(learn_rate))
lr_xgb_best
##     learn_rate iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
##  1: 0.05336842  699        77.22848     0.07556896       83.69208     0.5359507
##  2: 0.03242105 1171        77.23337     0.11004818       83.70918     0.4975113
##  3: 0.02194737 1488        77.98110     0.12425554       83.71486     0.8445702
##  4: 0.08478947  410        77.50933     0.07902634       83.72119     0.9068654
##  5: 0.04289474  800        77.70967     0.13914770       83.72758     0.7172070
##  6: 0.15810526  230        77.21855     0.13016396       83.74998     0.8471658
##  7: 0.07431579  507        77.18520     0.09973896       83.75593     0.8895591
##  8: 0.10573684  358        77.09417     0.11564633       83.75641     0.8914690
##  9: 0.09526316  377        77.35261     0.14526839       83.75908     0.6121835
## 10: 0.11621053  299        77.47077     0.10450500       83.76679     0.5882256
## 11: 0.01147368 2582        78.40452     0.10640421       83.77338     0.8008329
## 12: 0.06384211  493        78.06430     0.11867213       83.78236     1.2322519
## 13: 0.14763158  245        77.22341     0.06415112       83.80426     0.9773060
## 14: 0.12668421  270        77.60423     0.12680544       83.83385     0.5904568
## 15: 0.13715789  284        76.87336     0.07113273       83.84141     0.8814912
## 16: 0.16857895  193        77.70561     0.15511678       83.89016     1.2144844
## 17: 0.18952632  182        77.38573     0.09436432       83.91741     0.8508369
## 18: 0.20000000  161        77.76660     0.08792396       83.91885     1.1856841
## 19: 0.17905263  204        77.18137     0.07499242       83.96743     0.7913820
## 20: 0.00100000 5000        85.99580     0.06931381       87.11882     0.5379044

Notice that we do get slightly different values here, but it appears clear from both implementations that the best learning rate is somewhere around 0.05. Let’s go ahead and set it at this value and proceed with tuning our tree depth. We’ll use a space filling design (max entropy) to tune the minimum \(n\) for a terminal node, and the maximum depth of the tree. We’ll fit 30 total models.

tree_grd <- grid_max_entropy(min_n(c(0, 50)), # min_child_weight
                             tree_depth(), # max_depth
                             size = 30)

We can visualize this grid to see how well we’re doing in actually filling the full space of this grid.

ggplot(tree_grd, aes(min_n, tree_depth)) +
  geom_point()

Looks pretty good! At least as a starting point. But we’ll want to look at the trends for each of these hyperparameters after tuning to see if we need to further refine them.

Going forward from here with just the {tidymodels} implementation, we’ll set the learning rate at 0.05 and set each of the tree depth parameters to tune.

mod <- mod %>% 
  set_args(learn_rate = 0.05,
           min_n = tune(),
           tree_depth = tune())

We’ll then fit the model (using high-performance computing) for each of the combinations in our hyperparameter space-filling design using tune_grid().

tune_treedepth_tm <- tune_grid(mod, rec, cv, grid = tree_grd)

Let’s evaluate the trends for each of the hyperparameters.

collect_metrics(tune_treedepth_tm) %>% 
  filter(.metric == "rmse") %>% 
  pivot_longer(c("min_n", "tree_depth")) %>% 
  ggplot(aes(value, mean)) +
  geom_line() +
  geom_point() +
  facet_wrap(~name, scales = "free_x")

And we can see that the optimal value for min_n is a little bit less clear than for tree_depth. The former looks to be somewhere in the range of 12-25, but a few of the higher values (e.g., around 33) look pretty good as well. For the latter, it seems pretty clear that a tree depth of about 8 is optimal. Note that this is a bit higher than is typical. Let’s also look at the raw values.

collect_metrics(tune_treedepth_tm) %>% 
  filter(.metric == "rmse") %>% 
  arrange(mean)
## # A tibble: 30 x 8
##    min_n tree_depth .metric .estimator  mean     n std_err .config
##    <int>      <int> <chr>   <chr>      <dbl> <int>   <dbl> <fct>  
##  1    33          8 rmse    standard    83.8    10   0.411 Model17
##  2    42          8 rmse    standard    83.8    10   0.409 Model10
##  3    37         10 rmse    standard    83.8    10   0.403 Model16
##  4    18          9 rmse    standard    83.8    10   0.392 Model13
##  5    21          6 rmse    standard    83.8    10   0.414 Model03
##  6    13          7 rmse    standard    83.8    10   0.419 Model24
##  7    21         12 rmse    standard    83.9    10   0.397 Model14
##  8    43         11 rmse    standard    83.9    10   0.381 Model07
##  9    34          6 rmse    standard    83.9    10   0.404 Model25
## 10    27          6 rmse    standard    83.9    10   0.420 Model30
## # … with 20 more rows

And as we can see, there are actually multiple combinations of our two hyperparameters that lead to essentially equivalent RMSE values. The variability (estimated by the standard error) is fairly consistent as well, although it is a bit lower for a few models. This is fairly common for boosted tree models and, really, any model with multiple hyperparameters - various combinations of the hyperparameters lead to similar results. Just to be certain this is really what we want, we could use a regular grid to search over min_n values from, say, 12 to 25 and 30 to 35 while tuning the tree depth over a smaller range, perhaps 6 to 9 (see the sketch below). For now, however, we’ll just move forward with our best combination from the grid search we’ve conducted. Let’s first update our model with the best hyperparameter combination we found. We’ll then proceed with model tuning to see if we can make additional gains by limiting the depth of the tree by the change in the cost function (i.e., tuning the loss reduction).
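
For reference, a sketch of what that follow-up regular grid might look like (we don’t fit it here):

refined_grd <- expand.grid(
  min_n = c(seq(12, 25, by = 3), seq(30, 35, by = 2)), # the two ranges above
  tree_depth = 6:9
)
nrow(refined_grd) # 32 candidate combinations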

mod <- mod %>% 
  finalize_model(
    select_best(tune_treedepth_tm, metric = "rmse")
  ) %>% 
  set_args(loss_reduction = tune())

mod
## Boosted Tree Model Specification (regression)
## 
## Main Arguments:
##   trees = 5000
##   min_n = 33
##   tree_depth = 8
##   learn_rate = 0.05
##   loss_reduction = tune()
##   stop_iter = 20
## 
## Engine-Specific Arguments:
##   nthread = 16
##   validation = 0.2
## 
## Computational engine: xgboost

Let’s search 15 different values of loss reduction, using a slightly modified version of the default range provided by the dials::loss_reduction() function.

lossred_grd <- grid_regular(loss_reduction(c(-4, 1.7)), levels = 15)
lossred_grd$loss_reduction
##  [1] 1.000000e-04 2.553541e-04 6.520572e-04 1.665055e-03 4.251786e-03
##  [6] 1.085711e-02 2.772408e-02 7.079458e-02 1.807769e-01 4.616212e-01
## [11] 1.178769e+00 3.010034e+00 7.686246e+00 1.962715e+01 5.011872e+01

Note that I have changed the default range here (the values are specified on the log10 scale) so there aren’t quite so many values down close to zero, and there are a few more in the upper range.
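
Because the range is specified on the log10 scale, the 15 levels span 10^-4 = 0.0001 up to 10^1.7 (roughly 50). A quick check reproduces the same values shown above:

10^seq(-4, 1.7, length.out = 15)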

Let’s tune the model.

tune_lossred_tm <- tune_grid(mod, rec, cv, grid = lossred_grd)

And evaluate how the loss reduction relates to the RMSE:

collect_metrics(tune_lossred_tm) %>% 
  filter(.metric == "rmse") %>% 
  arrange(mean)
## # A tibble: 15 x 7
##    loss_reduction .metric .estimator  mean     n std_err .config
##             <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <fct>  
##  1       0.0109   rmse    standard    83.7    10   0.401 Model06
##  2       0.00425  rmse    standard    83.7    10   0.404 Model05
##  3       3.01     rmse    standard    83.8    10   0.399 Model12
##  4       0.000652 rmse    standard    83.8    10   0.402 Model03
##  5       7.69     rmse    standard    83.8    10   0.396 Model13
##  6       0.000255 rmse    standard    83.8    10   0.378 Model02
##  7       0.0277   rmse    standard    83.8    10   0.406 Model07
##  8      50.1      rmse    standard    83.8    10   0.397 Model15
##  9       0.181    rmse    standard    83.8    10   0.380 Model09
## 10       0.0708   rmse    standard    83.8    10   0.409 Model08
## 11      19.6      rmse    standard    83.8    10   0.410 Model14
## 12       0.00167  rmse    standard    83.8    10   0.378 Model04
## 13       0.0001   rmse    standard    83.8    10   0.405 Model01
## 14       0.462    rmse    standard    83.9    10   0.399 Model10
## 15       1.18     rmse    standard    83.9    10   0.426 Model11

Our best model above is slightly better than our previous best, and the standard error (the variance in means across folds) is essentially unchanged. Let’s look at these values quickly to make sure we don’t see any trends that would warrant additional tuning. Note that I’ve put a log10 transformation on the x-axis because that was the transformation used when creating the hyperparameter values, which allows us to view them on an equal-interval scale.

collect_metrics(tune_lossred_tm) %>% 
  filter(.metric == "rmse") %>% 
  ggplot(aes(loss_reduction, mean)) +
  geom_line() +
  geom_point() +
  scale_x_log10()

We don’t see much of a trend, and it’s important to note that the range of the y-axis is severely restricted (meaning all models are pretty comparable), but the two best values are on the lower end of the scale.

Let’s move forward with these values and tune the stochastic components. We’ll start with the proportion of rows and the number of columns to sample for each tree (as opposed to each split, as we do with random forests). We will again set up a grid for these parameters. To do this most efficiently, we’ll first bake our data so we can determine the number of columns. We will also use the dials::finalize() function to help us determine the range of reasonable values. This is because functions like dials::mtry() have a lower bound of 1 (one column at a time) but an upper bound that is unknown and depends on the number of columns in the data.

baked <- rec %>% 
  prep() %>% 
  bake(train) %>% 
  select(-contains("id"), -ncessch)

cols_range <- finalize(mtry(), baked)
cols_range
## # Randomly Selected Predictors  (quantitative)
## Range: [1, 62]

For the sample_size argument, {parsnip} expects (rather confusingly) a proportion of rows to be exposed to the fitting algorithm for each tree, rather than a number (i.e., at least as of this writing, you will get an error if you use dials::sample_size() to determine the number of rows). We will therefore use sample_prop() and keep the default range from 10% to 100% of the rows.

Let’s again use a space-filling design to specify our grid, evaluating 30 different combinations of these hyperparameters.

stoch_grd <- grid_max_entropy(cols_range, 
                              sample_size = sample_prop(),
                              size = 30)
stoch_grd
## # A tibble: 30 x 2
##     mtry sample_size
##    <int>       <dbl>
##  1    60       0.350
##  2     4       0.221
##  3    59       0.655
##  4    53       0.361
##  5    61       0.762
##  6    56       0.204
##  7    43       0.171
##  8    35       0.167
##  9    19       0.827
## 10     4       0.656
## # … with 20 more rows

And finally, we can update our previous model with the best loss reduction, setting the stochastic parameters to tune. We then tune over our grid with tune_grid().

mod <- mod %>% 
  finalize_model(select_best(tune_lossred_tm, metric = "rmse")) %>% 
  set_args(sample_size = tune(),
           mtry = tune())
mod
## Boosted Tree Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 5000
##   min_n = 33
##   tree_depth = 8
##   learn_rate = 0.05
##   loss_reduction = 0.010857111194022
##   sample_size = tune()
##   stop_iter = 20
## 
## Engine-Specific Arguments:
##   nthread = 16
##   validation = 0.2
## 
## Computational engine: xgboost
tune_stoch_tm <- tune_grid(mod, rec, cv, grid = stoch_grd)
stoch_mets <- collect_metrics(tune_stoch_tm) %>% 
  filter(.metric == "rmse") %>% 
  arrange(mean)
stoch_mets
## # A tibble: 30 x 8
##     mtry sample_size .metric .estimator  mean     n std_err .config
##    <int>       <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <fct>  
##  1    32       0.921 rmse    standard    83.6    10   0.391 Model12
##  2    51       0.984 rmse    standard    83.7    10   0.391 Model05
##  3    27       0.822 rmse    standard    83.7    10   0.400 Model14
##  4    62       0.962 rmse    standard    83.7    10   0.395 Model27
##  5    18       0.810 rmse    standard    83.7    10   0.398 Model11
##  6    36       0.809 rmse    standard    83.7    10   0.383 Model02
##  7    32       0.637 rmse    standard    83.7    10   0.393 Model29
##  8    54       0.747 rmse    standard    83.8    10   0.411 Model07
##  9    10       0.954 rmse    standard    83.8    10   0.407 Model08
## 10    47       0.693 rmse    standard    83.8    10   0.392 Model16
## # … with 20 more rows

And once again our best model is very marginally better than our previous best. Let’s see if there are any trends in these hyperparameters.

stoch_mets %>% 
  pivot_longer(c(mtry, sample_size)) %>% 
  ggplot(aes(value, mean)) +
  geom_line() + 
  geom_point() +
  facet_wrap(~name, scales = "free_x")

There doesn’t appear to be much of a trend in mtry, although values above about 10 seem to be preferred. For sample_size, there is a fairly clear downward trend in RMSE as the proportion goes up, but the best value comes before we use all of the data. We might want to re-tune here with a more limited range on sample size, somewhere between 0.9 and 1.0, but we’ll leave it here for now.
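
If we did want to re-tune, a sketch of that more restricted grid might look like the following (using a range-limited sample_prop(); not fit here):

stoch_grd2 <- grid_max_entropy(
  cols_range,
  sample_size = sample_prop(c(0.9, 1)), # restrict to 90-100% of rows
  size = 15
)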

The final step for tuning this model would be to set the stochastic components and re-tune the learning rate to see if we could improve performance further. Given where we’ve been, and the progress we’ve made, our model is unlikely to see substantial performance improvements from here, but we might be able to make some marginal gains. In this case, we also didn’t have much evidence that our model was overfitting. If we did, we could try to tune hyperparameters that are not currently implemented in parsnip::boost_tree(), like the L1 and L2 regularizers, by passing these arguments directly to the {xgboost} function through set_args(). We just need to use the name of the parameter as it appears in {xgboost}. For example:

mod <- mod %>% 
  finalize_model(select_best(tune_stoch_tm, metric = "rmse")) %>% 
  set_args(
    alpha = tune(), # L1
    lambda = tune() # L2
  )

Then we set a grid. I’ve used a space-filling design below, and the method is a bit hacky because grid_max_entropy() won’t let us supply two penalty() values, and it also won’t (currently) let us supply two custom names. So we create the grid with one value named differently (lambda) and then manually rename the first column to alpha.

penalized_grid <- grid_max_entropy(penalty(), lambda = penalty(), size = 15)
names(penalized_grid)[1] <- "alpha"
penalized_grid
## # A tibble: 15 x 2
##       alpha   lambda
##       <dbl>    <dbl>
##  1 3.15e-10 1.49e-10
##  2 5.67e- 4 3.60e- 3
##  3 4.66e- 3 1.28e- 9
##  4 8.06e- 3 9.35e- 6
##  5 6.09e- 6 4.85e- 2
##  6 5.06e- 8 3.41e- 8
##  7 2.56e- 2 6.72e- 1
##  8 1.48e- 8 7.90e- 1
##  9 1.95e- 5 3.30e- 6
## 10 7.86e- 8 1.25e- 4
## 11 1.54e- 1 2.62e- 4
## 12 3.21e-10 1.05e- 5
## 13 9.16e- 1 1.15e- 6
## 14 1.72e- 5 2.17e-10
## 15 6.71e-10 1.56e- 2

And then we can tune. Note that we are not actually fitting this model here, but rather we are illustrating how you can tune hyperparameters with {tidymodels} that are not yet implemented in the given {parsnip} model.

tune_penalized_tm <- tune_grid(mod, rec, cv, grid = penalized_grid)

Note that if we wanted to try a different booster, such as DART (which uses dropout), we would need to use {xgboost} directly.
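
For instance, a minimal sketch of a DART model fit with xgb.cv() directly might look like the following (the hyperparameter values are purely illustrative; rate_drop and skip_drop are DART-specific dropout parameters):

fit_dart <- xgb.cv(
  data = features,
  label = processed_train$score,
  nrounds = 500,
  objective = "reg:squarederror",
  nfold = 10,
  verbose = FALSE,
  params = list(
    booster = "dart",
    eta = 0.05,
    rate_drop = 0.1, # proportion of trees dropped each round
    skip_drop = 0.5  # probability of skipping dropout on a given round
  )
)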

35.3.2 Wrapping up

Gradient boosted trees provide a powerful framework for constructing highly performant models. They differ from other tree-based ensemble models by building the trees sequentially, with each new tree fit to the errors of the previous. The learning is guided by gradient descent, an algorithm that finds the optimal direction for the model to move in to minimize the objective function. The rate at which we move in that direction is called the learning rate.
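
As a toy illustration of that last idea (with entirely made-up numbers), a single gradient descent step on the squared-error loss for one observation moves the prediction toward the observed value by a fraction, the learning rate, of the negative gradient:

y    <- 100 # observed value
pred <- 80  # current prediction
lr   <- 0.1 # learning rate
gradient <- -(y - pred)  # gradient of 0.5 * (y - pred)^2 with respect to pred
pred - lr * gradient     # updated prediction: 82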

Gradient boosted trees, particularly when implemented with {xgboost}, have a large number of hyperparameters - many more than we’ve seen in previous models. This means that multiple combinations of these hyperparameters may lead to similarly performing models. It can also become overwhelming if you are searching for any combination that even marginally improves performance, as with online machine learning competition platforms like Kaggle. In most social science problems, however, the practical difference between a model with an RMSE of (in our example) 83.8 and one with, say, 83.6 is unlikely to be a game changer. If it is, then you should likely go through several cycles of hyperparameter tuning to try to identify the optimal combination. But if it’s not, we suggest proceeding in the manner shown here: use early stopping to control the number of trees, tune the learning rate, then the tree-specific hyperparameters, and then the stochastic components. After conducting those steps, re-evaluate the learning rate and, in particular, see whether a reduced learning rate increases model performance. From there, we would likely move on and call our model complete. Knowing when to push for more model performance, and when to abandon your efforts and live with what you have, is part of the “art” of building predictive models, and this intuition builds both with experience and with your own knowledge of the data.