34 Bagging and Random Forests

In the last chapter, we talked about how decision trees are highly flexible but often not the most performant model on their own, because they easily overfit, leading to poor generalizability to new/unseen data. One way to avoid this problem is to build an ensemble of trees from random bootstrap samples of the data and aggregate the predictions across the entire ensemble. This general approach is known as bagging, or bootstrap aggregation. It can be applied to any modeling framework, but it generally provides large performance gains only when the base learner, like a decision tree, has high variance.

Bagging

A general process to improve the performance of highly variable models, regularly applied to decision trees.

  1. Create \(b\) bootstrap resamples of the original data
  2. Fit the model (base learner) to each of the \(b\) resamples
  3. Aggregate predictions across all models
  • For regression problems, the final prediction is the average of all model predictions
  • For classification problems, either (a) average the model classification probabilities or (b) take the mode of the model classifications (a minimal sketch of this aggregation step follows below).

Benefits: Leads to more stable model predictions (reduces model variability)
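To make step 3 concrete for classification, here is a minimal sketch of the two aggregation options. The objects preds (a \(b \times n\) character matrix of hard classifications, one row per model) and probs (a list of \(b\) data frames of class probabilities, one per model) are hypothetical placeholders, not objects defined in this chapter.

# (a) average class probabilities across the b models, then take the largest
avg_probs <- Reduce(`+`, probs) / length(probs)
pred_probs <- colnames(avg_probs)[max.col(as.matrix(avg_probs))]

# (b) take the mode (most frequent class) of the b hard classifications
pred_mode <- apply(preds, 2, function(x) names(which.max(table(x))))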

As mentioned previously, bagging does not tend to help much (and increases the computational burden) when the model already has low variance. The predictions of models like linear regression will generally change very little when bagged. For a very small sample, bagging may help a linear model, but as the sample size grows, the bagged and non-bagged predictions become nearly identical while the computational cost continues to increase, as illustrated in the sketch below. Decision trees, however, can be highly variable, and bagging helps reduce their variance and leads to more stable model predictions.
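As a quick illustration (using simulated data of our own, not from the text), we can compare the predictions from a single linear regression to predictions averaged over 100 bagged linear regressions. With even a moderate sample size, the two are nearly indistinguishable.

set.seed(123)
n <- 1000
sim <- data.frame(x = rnorm(n))
sim$y <- 2 + 3 * sim$x + rnorm(n)

# single linear model
single_fit <- lm(y ~ x, data = sim)

# bagged linear models: fit to 100 bootstrap resamples, then average predictions
b <- 100
boot_preds <- replicate(b, {
  boot <- sim[sample(nrow(sim), replace = TRUE), ]
  predict(lm(y ~ x, data = boot), newdata = sim)
})

# largest difference between the single-model and bagged predictions
max(abs(predict(single_fit) - rowMeans(boot_preds)))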

How many bags?

Put differently, how many bootstrap resamples?

There is no hard and fast rule. The important thing is to have enough bags to reach a stable model.

Noisier data will require more bags. Data with strongly predictive features will require fewer bags.

Start somewhere between 50 and 500 bags, depending on how variable you think your data are, evaluate the learning curve, and then adjust up or down from there accordingly.

There is no fixed rule for the number of “bags”, or bootstrap resamples, needed to create a stable model. The number of bags just needs to be sufficiently high that the model becomes stable; once stability is achieved, additional bags will not improve model performance. In other words, there is no upper bound on the number of bags (the only cost is computational), but it is critical that there are enough bags to create a stable model. Datasets with highly predictive features will generally need fewer bags to reach stability, while noisier data will need more. A practical way to choose is to plot model error against the number of bags and look for where the curve flattens, as in the sketch below.
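As a minimal sketch of that check (using the built-in mtcars data and the randomForest package, which are our own choices for illustration), we can fit a bagged-tree model and plot the out-of-bag error as bags are added. Setting mtry to the number of predictors makes a random forest equivalent to bagged trees, and the fitted object's mse component stores the out-of-bag MSE after each additional tree.

library(randomForest)

set.seed(42)
bag_fit <- randomForest(mpg ~ ., data = mtcars,
                        mtry = ncol(mtcars) - 1, # use all predictors = bagging
                        ntree = 500)

# out-of-bag MSE after each additional bag; look for where the curve flattens
plot(bag_fit$mse, type = "l",
     xlab = "Number of bags (trees)", ylab = "OOB MSE")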

34.0.1 Bagging “by hand”

To understand bagging we first have to be clear about what bootstrap resampling implies. When we take bootstrap resamples from our dataset, we sample \(n\) rows of our dataset with replacement, where \(n\) represents the total number of rows in our data. For example, suppose we had a dataset that had the first five letters of the alphabet, each with an associated score.

lets <- data.frame(letters = c("a", "b", "c", "d", "e"),
                   score = c(5, 7, 2, 4, 9))
lets
##   letters score
## 1       a     5
## 2       b     7
## 3       c     2
## 4       d     4
## 5       e     9

Bootstrap resampling would imply sampling five rows from the above dataset with replacement. This means some rows may be represented multiple times, and others not at all. Let’s do this and see what the first three datasets look like.

# set seed for reproducibility
set.seed(42)

# specify the number of bootstrap resamples
b <- 3
resamples <- replicate(b, 
                       lets[sample(1:5, 5, replace = TRUE), ],
                       simplify = FALSE)
resamples
## [[1]]
##     letters score
## 1         a     5
## 5         e     9
## 1.1       a     5
## 1.2       a     5
## 2         b     7
## 
## [[2]]
##     letters score
## 4         d     4
## 2         b     7
## 2.1       b     7
## 1         a     5
## 4.1       d     4
## 
## [[3]]
##     letters score
## 1         a     5
## 5         e     9
## 4         d     4
## 2         b     7
## 2.1       b     7

Notice that in the first bootstrap resample, a is represented three times, b once, c and d not at all, and e once. Similar patterns, with different distributional frequencies, are represented in the second and third datasets.

Why is this useful? It turns out that if we do this enough times, we develop a sampling distribution. Fitting the model to each of these different samples gives us an idea of the variability of the model, which we can reduce by averaging across all samples. Bootstrap resampling is useful in many different ways in statistics. In the above, our observed mean across the letters is 5.4. We can compute the standard error of this mean analytically by \(\sigma/\sqrt{n}\), or sd(lets$score)/sqrt(5), which is equal to 1.2083046. We can also approximate this same standard error by computing the mean of many bootstrap resamples and estimating the standard deviation among these means. For example:

library(tidyverse)

b <- 5000
resamples <- replicate(b, 
                       lets[sample(1:5, 5, replace = TRUE), ],
                       simplify = FALSE)
means <- map_dbl(resamples, ~mean(.x$score))
sd(means)
## [1] 1.108987

In this case, the difference between the analytic standard error and the bootstrap estimate is larger than it typically would be because the sample size is so small.

The process of bagging is essentially equivalent to the above, except that instead of computing the mean within each bootstrap resample, we fit a full model to each. We then compute the predictions from all of these models and either average the resulting predictions (for regression) or take the mode of the classifications (for classification), as in the sketch below.
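Here is a minimal sketch of that process for a regression problem, using the built-in mtcars data and rpart trees as the base learner (these choices are ours, purely for illustration; the lets data above are too small to fit a meaningful tree).

library(rpart)
library(tidyverse)

set.seed(42)
b <- 100

# 1. create b bootstrap resamples
resamples <- replicate(b,
                       mtcars[sample(nrow(mtcars), replace = TRUE), ],
                       simplify = FALSE)

# 2. fit a (deliberately deep) tree to each resample
trees <- map(resamples, ~rpart(mpg ~ ., data = .x,
                               control = rpart.control(cp = 0, minsplit = 2)))

# 3. average predictions across all trees
preds <- map(trees, ~predict(.x, newdata = mtcars))
bagged_pred <- reduce(preds, `+`) / b

head(bagged_pred)

Note that we predict back onto the training data here only to keep the sketch short; evaluating the bagged predictions on held-out data would give a fairer picture of performance.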