35 Boosting and Boosted Trees

Similar to bagging, boosting is a general algorithm that can be applied to any model. Frequently, however, it is applied to trees. Recall that when we use bagging (often through bagged tree or random forest models), we need a sufficient number of bags, or bootstrap resamples, of the training data for our model performance metrics to stabilize. Including additional bags (trees) once performance has stabilized will not hurt model performance, but it will make the model take longer to fit (i.e., it is more computationally expensive).

Similar to bagging, boosting works by building an ensemble of models. However, rather than each model being built independently on a new bootstrap resample of the original data, each subsequent model is constructed from the residuals of the previous model. This means that the number of models controls how much is learned from the data, making it an important hyperparameter.
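To make the sequential structure concrete, here is a minimal sketch of boosting on residuals in Python. The toy data, variable names, and the choice of scikit-learn's DecisionTreeRegressor as the base learner are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# illustrative toy data: a noisy sine wave
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

n_models = 50          # number of base learners (a key hyperparameter)
models = []
residual = y.copy()    # start from the raw outcome

for _ in range(n_models):
    tree = DecisionTreeRegressor(max_depth=1)  # a shallow, weak learner
    tree.fit(X, residual)
    models.append(tree)
    # each subsequent model is fit to what the ensemble still gets wrong
    residual -= tree.predict(X)

# the ensemble prediction is the sum of every tree's contribution
y_hat = np.sum([m.predict(X) for m in models], axis=0)
```

Contrast this with bagging, where each tree would be fit independently to a bootstrap resample of the outcome itself, rather than to the running residual.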

Boosting versus Bagging

  • Bagged models use \(b\) bootstrap resamples of the training data. Models are built independently on each resample, and model results are aggregated to form a single prediction.

  • Boosted models are built sequentially, with each model fit to the residuals of the previous model.

  • For bagged models, \(b\) needs to be sufficiently large that performance metrics stabilize. Additional bags beyond this point only hurt computational efficiency.

  • For boosted models, the number of models determines how much is learned from the data: too few, and the model will underfit (leading to excess bias); too many, and it will overfit (leading to excess variance).

The model that is iteratively fit when applying boosting is called the base learner. This can be any model but is typically a decision tree because, in part, it is relatively straightforward to control how much each individual tree learns from the data (i.e., shallow trees learn less from the data than deeper trees).

Base learner: The model that is iteratively fit within a boosting algorithm. Often these are shallow decision trees, as in a boosted tree model.

Weak models & slow learning: Each model builds off the errors of the previous model. To find an optimal solution, we want our base learner to learn slowly, so the amount that is learned is controlled by the number of models.

We also typically want our base learner to be a weak model, otherwise known as a slow learner. Imagine, for example, that you are wearing a blindfold in a meadow with many small hills. Some rise as high as ten feet above where you are currently standing, while others dip as much as five feet beneath you. You are challenged to find the lowest point in the meadow while remaining blindfolded, with the additional directive that every step you take must be the same length. If you take off at a sprint, you may, beyond falling on your face, accidentally skip right over the lowest point, because the length of your stride would be sufficient to carry you past it. Worse, you might never be able to settle at the lowest point, because each subsequent stride may carry you over it again (remember, each stride must be the same length). This is equivalent to a model that learns fast. You would likely get close quickly (running downhill), but your final solution may not be optimal.

If, on the other hand, you take smaller, more careful steps, it would take you longer to get to the lowest point, but your stride would be less likely to carry you over it, and you would be able to feel when the gradient beneath you steepens. This does not guarantee you'll find the lowest point, of course, but there's probably a better chance than if you were sprinting. This is the idea behind a slow learner: we take more steps (i.e., fit more models), but each step gets us closer to the optimal solution.
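The fixed stride length in this analogy corresponds to what boosting implementations commonly call the learning rate (or shrinkage). Extending the earlier sketch, a small learning rate scales down each tree's contribution so the ensemble learns slowly; the specific values below are illustrative, not recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

learning_rate = 0.1    # the fixed "stride length"; an illustrative value
n_models = 500         # smaller strides mean we need more of them

models = []
residual = y.copy()
for _ in range(n_models):
    tree = DecisionTreeRegressor(max_depth=1)
    tree.fit(X, residual)
    models.append(tree)
    # take a small, fixed-size step toward the remaining error
    residual -= learning_rate * tree.predict(X)

# each tree contributes only a shrunken fraction of its prediction
y_hat = learning_rate * np.sum([m.predict(X) for m in models], axis=0)
```

The trade-off is exactly the one in the analogy: a smaller learning rate requires more models (steps) to reach a good solution, but it is less likely to step right over it.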

How do we determine what's optimal? We use our objective function, along with a process called gradient descent. The objective function defines the criterion we want to minimize or maximize, and we can, in theory, evaluate it for any set of model parameters. Gradient descent is the process by which we optimize the model parameters with respect to that objective function.
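As a concrete (and deliberately simple) illustration, here is gradient descent on a one-parameter objective, \(f(w) = (w - 3)^2\), whose minimum we know is at \(w = 3\). The objective, starting point, step size, and iteration count are all assumptions made for the example.

```python
# minimal gradient descent on f(w) = (w - 3)^2; the minimum is at w = 3
def objective(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)   # derivative of the objective

w = 0.0                  # arbitrary starting point (blindfolded, somewhere in the meadow)
step_size = 0.1          # the fixed stride length

for _ in range(100):
    # move downhill: step against the gradient, scaled by the step size
    w -= step_size * gradient(w)

print(w, objective(w))   # w ends up very close to 3, the minimum
```

Loosely, each new base learner in boosting plays the role of one of these steps: it moves the ensemble's predictions a small, fixed-rate step in the direction that most reduces the objective function.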