35.2 Boosted trees

Like all boosted models, boosted trees are built with weak models so the overall algorithm learns slowly. We want the model to learn slowly so that it can pick up on the unique characteristics of the data and find the overall optimal solution. If it learns too quickly it may “jump over” the optimal solution.

In the feature engineering chapter, we discussed methods for handling things like nonlinearity through, for example, splines. Within a linear regression framework, splines can create smooth curves that model non-linear relations. Recall, however, that decision trees work by splitting the feature space into distinct regions. In other words, decision trees do not, by themselves, create curves. But if the model learns slowly enough, it is possible to approximate curves non-parametrically, even if we do not know the underlying functional form of the data a priori.

For example, the figure below shows a boosted tree model fit to the same log data we saw previously using 1, 5, 50, 100, 200 and 500 trees.

Initially, the errors (the average distance between a point and the line) are extremely high. But as more trees are built from the residuals of the previous trees, the algorithm starts to learn the underlying functional form. Note that this is still not as good as what we would get if we just log transformed the \(x\) variable and fit a simple linear regression model because, of course, that is how the data were generated. Once we get to 500 trees we could argue that the model is starting to overfit in some regions, while still badly underfitting at the bottom tail. This is an important illustration of the no free lunch theorem: here we have fit a relatively advanced machine learning model to a simple set of data, and the results would be easily outperformed by simple linear regression. No single model will always work best across all applications. However, with proper hyperparameter tuning, boosted decision trees are regularly among the most performant models “out of the box”.
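As a rough sketch of how a figure like this could be generated, the code below simulates data of the form \(y = \log(x) + \epsilon\) (an assumption standing in for the data from the earlier chapter) and fits shallow boosted trees with an increasing number of trees:

```r
library(tidymodels)

# simulated data roughly matching the shape described above (assumed)
set.seed(42)
sim <- tibble(x = runif(500, 1, 100)) |>
  mutate(y = log(x) + rnorm(500, sd = 0.3))

# fit a boosted stump ensemble with a given number of trees and return predictions
fit_boosted <- function(n_trees) {
  boost_tree(trees = n_trees, tree_depth = 1, learn_rate = 0.1) |>
    set_engine("xgboost") |>
    set_mode("regression") |>
    fit(y ~ x, data = sim) |>
    predict(new_data = sim) |>
    bind_cols(sim) |>
    mutate(trees = n_trees)
}

preds <- map_dfr(c(1, 5, 50, 100, 200, 500), fit_boosted)

# one panel per number of trees, points plus the fitted line
ggplot(preds, aes(x, y)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = .pred), color = "firebrick") +
  facet_wrap(~trees)
```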

35.2.1 Hyperparameters and engines

Standard boosted tree hyperparameters

  • Number of trees
  • Number of splits for each tree
  • Minimum \(n\) size for a terminal node
  • Learning rate

We have talked about each of the hyperparameters listed above previously. However, it’s worth thinking more about how they work together in a boosted tree model.

Generally, the minimum sample size for a terminal node does not factor in heavily with boosted tree models, because the trees are typically grown quite shallow (so the model learns slowly). The number of splits is often between one (a stump) and six, although occasionally models with deeper trees can work well. The depth of the tree (generally controlled more by the number of splits than by the terminal \(n\) size) will determine the number of trees needed to arrive at an optimal solution. Deeper trees learn more from the data, and will therefore generally require fewer trees. Deeper trees also allow for interactions among variables - i.e., the splits are conditional on previous splits, and thus the prediction depends on the interaction among the features in the tree. If important interactions exist in the data, the trees will need to be grown to a sufficient depth to capture them. An ensemble of stumps (trees with just a single split) is relatively common and computationally efficient, although it generally requires more trees, as each tree learns very little about the data (as illustrated above). Because they learn more slowly, stump ensembles are less likely to overfit. They will, however, miss any interactions that may be present in the data.
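As a point of reference, the standard hyperparameters above map onto a {parsnip} model specification roughly as follows (a sketch only; note that {parsnip}’s tree_depth argument controls the depth of each tree rather than the number of splits directly):

```r
library(tidymodels)

# standard boosted tree hyperparameters, marked for tuning
bt_spec <- boost_tree(
  trees      = tune(),  # number of trees
  tree_depth = tune(),  # maximum depth of each tree (controls the number of splits)
  min_n      = tune(),  # minimum n for a terminal node
  learn_rate = tune()   # learning rate (shrinkage)
) |>
  set_engine("xgboost") |>
  set_mode("regression")
```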

Standard boosted tree tuning strategy

  1. Set learning rate around \(0.1\)
  2. Tune the number of trees
  3. Fix the number of trees, tune number of splits (and potentially minimum terminal \(n\) size)
  4. Fix tree depth & terminal \(n\), tune learning rate
  5. Fix all other hyperparameters, evaluate the number of trees again. If (more than marginally) different, set the number of trees to the new value and continue tuning.

The learning rate is a new parameter that we haven’t seen with any previous tree-based model (although we also didn’t have to tune the number of trees previously; we just had to make sure we had enough trees to reach a stable estimate). We discussed the learning rate in the section on gradient descent, but it’s worth reiterating that, along with the number of trees, the learning rate is perhaps the most important hyperparameter. If the learning rate is too high, the model is likely to “jump over” the optimal solution, and if it is too low the model may take far too long to reach an optimal solution. The learning rate is also called the shrinkage parameter because it is the amount by which we multiply the partial derivatives of our cost function (the gradient). Smaller values result in smaller “steps” downhill and, in turn, reduce the amount each tree learns. As a practical recommendation, we suggest starting with a learning rate around \(0.1\). Values higher than this do not typically work well in our experience, but \(0.1\) is high enough that the model should find a value close to the minimum relatively quickly. Once you’ve conducted your model tuning on the other hyperparameters with this learning rate, consider reducing the learning rate to see if you can increase model performance further (e.g., to values of \(0.01\) or \(0.001\)).
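For example, the first two steps of the tuning strategy above might look something like the following sketch, where the learning rate is fixed at \(0.1\) and the number of trees is tuned (the outcome, formula, and resamples are placeholders you would replace with your own objects):

```r
library(tidymodels)

# Steps 1 and 2: fix the learning rate near 0.1 and tune the number of trees
bt_spec <- boost_tree(trees = tune(), learn_rate = 0.1) |>
  set_engine("xgboost") |>
  set_mode("regression")

bt_wf <- workflow() |>
  add_model(bt_spec) |>
  add_formula(outcome ~ .)  # placeholder formula

bt_grid <- grid_regular(trees(range = c(100L, 3000L)), levels = 10)

bt_res <- tune_grid(
  bt_wf,
  resamples = cv_folds,  # cross-validation folds created elsewhere (placeholder)
  grid = bt_grid
)

show_best(bt_res, metric = "rmse")
```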

The {gbm} package (for gradient boosted models) is perhaps the most common package for estimating standard boosted trees. Throughout this book, however, we have been using the {tidymodels} suite of packages and, although multiple boosting engines have been implemented, {gbm} is not one of them. Although it is possible to add new engines or models to {parsnip}, the {gbm} package is in many ways a simpler implementation of boosted models than those already implemented in {parsnip}, specifically the {xgboost} package, which includes a number of additional hyperparameters (listed below).
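You can check which boosting engines are currently registered for use with {parsnip}:

```r
library(parsnip)

# list the engines (and modes) available for boosted trees
show_engines("boost_tree")
```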

{xgboost} hyperparameters

  • Number of trees
  • Number of splits for each tree
  • Minimum \(n\) size for a terminal node
  • Learning rate

  • Number of predictors to be randomly sampled at each split
  • Proportion of cases to sample for each tree
  • Limiting tree depth by the cost function (the minimum reduction in the cost function required to continue splitting the tree)
  • Early stopping criteria (stop growing additional trees during training if there is no change in the cost function on the validation set for \(k\) consecutive trees)

As can be seen, the same hyperparameters used with standard boosted trees are available when using the {xgboost} engine, but additional stochastic-based parameters are also available (randomly sampling columns at each split and cases for each tree). These stochastic parameters are based on the same underlying principles that help improve the performance of bagged trees and random forests. Exposing only a sample of the cases and features to the algorithm at a time means each tree is likely to learn less (making the model learn even more slowly), but it may also pick up on meaningful aspects of the data that are otherwise missed. Of course, the extent to which including these stochastic components helps your model performance will depend on your data. For example, random selection of columns will likely help more when you have a few features that are strongly related to the outcome and many that are weakly related (i.e., so each subsequent tree isn’t built starting from the same feature at the root node).
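A sketch of a {parsnip} specification exposing these additional hyperparameters is shown below. The validation argument passed to the engine (used here to supply a hold-out set for early stopping) is an engine-specific argument, so treat the exact setup as an assumption to verify against the {parsnip} documentation for the {xgboost} engine.

```r
bt_spec_xgb <- boost_tree(
  trees          = 3000,     # intentionally large; early stopping decides when to quit
  tree_depth     = tune(),
  min_n          = tune(),
  learn_rate     = 0.1,
  mtry           = tune(),   # predictors randomly sampled at each split
  sample_size    = tune(),   # proportion of cases sampled for each tree
  loss_reduction = tune(),   # reduction in the cost function required to split further
  stop_iter      = 20        # stop if no improvement for 20 consecutive trees
) |>
  set_engine("xgboost", validation = 0.2) |>  # assumed engine argument: 20% hold-out for early stopping
  set_mode("regression")
```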

Tuning {xgboost} models

The early stopping rules change the way we tune our model. We can use a very large number of trees with early stopping to

  • Tune the learning rate, then
  • Tune tree depth, then
  • Evaluate if stochastic components help, then
  • Re-tune learning rate

If the cross-validation curve suggests overfitting, include regularization methods (e.g., limiting tree depth by changes in the cost function, L1 or L2 regularizers).

Beyond these additional stochastic-based hyperparameters, {xgboost} provides an alternative method by which tree depth can be limited. Specifically, if a split does not reduce the cost function by at least a given threshold, the tree stops growing. Further, when tuning a model, {xgboost} allows for the use of an early stopping rule. This works by evaluating changes in the cost function on the validation set across iterations (trees) during training. If no improvements are made after a specified number of trees, the algorithm stops. Early stopping can provide immense gains in computational efficiency (by not continuing to grow trees when no improvements are being made) while also helping to prevent overfitting. Rather than being an additional parameter to tune, you can use early stopping to your advantage to fit only the number of trees necessary, while remaining adaptive across different model specifications (e.g., tree depth).
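Using {xgboost} directly, early stopping and the cost-based limit on splitting look something like the sketch below (the training and validation matrices are assumed to have been created elsewhere):

```r
library(xgboost)

# assumed: train_x/train_y and valid_x/valid_y created elsewhere
dtrain <- xgb.DMatrix(data = train_x, label = train_y)
dvalid <- xgb.DMatrix(data = valid_x, label = valid_y)

fit <- xgb.train(
  params = list(
    objective = "reg:squarederror",
    eta       = 0.1,   # learning rate
    max_depth = 3,     # tree depth
    gamma     = 1      # minimum reduction in the cost function required to split
  ),
  data      = dtrain,
  nrounds   = 5000,                    # intentionally large number of trees
  watchlist = list(val = dvalid),      # evaluated after each tree
  early_stopping_rounds = 20,          # stop if no improvement for 20 consecutive trees
  verbose   = 0
)

fit$best_iteration  # number of trees actually used
```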

The {xgboost} package itself includes even more hyperparameters, including slightly alternative methods for fitting the model altogether. For example, the dart booster includes a random dropout component. The idea, which originated with neural nets, is that trees added early in the ensemble will be more important than trees added late in the ensemble. So, particularly in cases where the model overfits quickly, it can be helpful to randomly remove (drop) specific trees from the overall ensemble. This is a rather extreme regularization approach, but has been shown to improve performance in some situations.
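If you do want to try the dart booster, a minimal sketch with {xgboost} directly might look like the following (the dropout values shown are purely illustrative):

```r
fit_dart <- xgb.train(
  params = list(
    booster   = "dart",
    objective = "reg:squarederror",
    eta       = 0.1,
    max_depth = 3,
    rate_drop = 0.1,   # proportion of existing trees randomly dropped at each iteration
    skip_drop = 0.5    # probability of skipping the dropout step entirely
  ),
  data    = dtrain,    # same xgb.DMatrix as in the early stopping sketch above
  nrounds = 500
)
```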

In what follows, we discuss implementation with {tidymodels}. However, we will also provide a few examples with {xgboost} directly, including direct comparisons between the two implementations. Our motivation here is not necessarily to advocate for the use of one versus the other, but to highlight that there are some differences and, depending on your specific situation, you might prefer one over the other. For example, if you do not need to move outside of the hyperparameters that have already been implemented in {parsnip} for the {xgboost} engine, there’s likely no need to move outside of the {tidymodels} framework. If, however, you are having problems with severe overfitting and want to try using dropout for regularization, then you will need to move to {xgboost} directly with the dart booster.