33.1 Determining optimal splits

Regression trees work by optimally splitting each node until some criterion is met. But how do we determine what’s optimal? We use an objective function. For regression problems, this is usually the sum of the squared errors, defined by

\[ \sum_{i \in R_1}(y_i - c_1)^2 + \sum_{i \in R_2}(y_i - c_2)^2 \]

where \(c_1\) and \(c_2\) are the predictions for the two regions (each is simply the mean of the cases falling in that region), and \(R_1\) and \(R_2\) are the regions created by the split. The algorithm searches through every possible split for every possible feature to identify the split that minimizes the sum of the squared errors.
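To make that search concrete, here is a minimal sketch in Python (NumPy assumed) of how a single node could be split on one feature: every observed value is tried as a threshold, and the split whose two regions give the smallest total sum of squared errors is kept. The function name and toy data are illustrative, not from the text; a real tree repeats this search over every feature.

```python
import numpy as np

def best_sse_split(x, y):
    """Exhaustively search one feature for the split threshold that
    minimizes the sum of squared errors across the two regions."""
    best = {"threshold": None, "sse": np.inf}
    for threshold in np.unique(x):
        left, right = y[x <= threshold], y[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue  # a split must create two non-empty regions
        # c1 and c2 are the region means; SSE is measured around them
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best["sse"]:
            best = {"threshold": threshold, "sse": sse}
    return best

# Toy example: y jumps once x crosses 5, so the best split lands there
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1.1, 0.9, 1.0, 1.2, 1.0, 4.8, 5.1, 5.0, 4.9, 5.2])
print(best_sse_split(x, y))
```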

For classification problems, the Gini impurity index is most common. Note that this is not the same index that is regularly used in spatial demography and similar fields to estimate inequality. Confusingly, both terms are referred to as the Gini index (with the latter also being referred to as the Gini coefficient or Gini ratio), but they are entirely separate.

For a two-class situation, the Gini impurity index, used in decision trees, is defined by

\[ D_i = 1 - P(c_1)^2 - P(c_2)^2 \]

where \(P(c_1)\) and \(P(c_2)\) are the probabilities of being in Class 1 or Class 2, respectively, for node \(i\). This formula generalizes to the multiclass situation as

\[ D_i = 1 - \sum_{k=1}^{K} p_k^2 \]

where \(p_k\) is the proportion of cases in node \(i\) that belong to class \(k\).

In either case, when \(D = 0\), the node is “pure” and the classification is perfect. In the two-class case, when \(D = 0.5\), the node is random (flip a coin). As an example, consider a terminal node with 75% of cases in one class. The Gini impurity index would be calculated as

\[ \begin{aligned} D &= 1 - (0.75^2 + 0.25^2) \\ &= 1 - (0.5625 + 0.0625) \\ &= 1 - 0.625 \\ &= 0.375 \end{aligned} \]
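As a quick check of the arithmetic, a small Python helper (an illustrative sketch, not from the text) computes the impurity directly from class proportions:

```python
def gini_impurity(proportions):
    """Gini impurity for a node given its class proportions: D = 1 - sum(p_k^2)."""
    return 1 - sum(p ** 2 for p in proportions)

print(gini_impurity([0.75, 0.25]))  # 0.375, matching the worked example
print(gini_impurity([1.0, 0.0]))    # 0.0 -> a pure node
print(gini_impurity([0.5, 0.5]))    # 0.5 -> a coin flip
```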

As the proportion in one class goes up, the value of \(D\) goes down. Similar to regression problems, the algorithm searches through all possible features and split points to find the optimal split that minimizes \(D\).
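In practice, implementations commonly score a candidate split by the size-weighted average impurity of the two child nodes. The sketch below (Python, illustrative names and toy data, not from the text) mirrors the regression search above but uses this weighted Gini criterion:

```python
import numpy as np

def gini(labels):
    """D = 1 - sum(p_k^2) for the class labels in one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_gini_split(x, y):
    """Try every observed value of one feature as a threshold and keep the
    split with the lowest size-weighted Gini impurity of its two children."""
    best = {"threshold": None, "impurity": np.inf}
    n = len(y)
    for threshold in np.unique(x):
        left, right = y[x <= threshold], y[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if weighted < best["impurity"]:
            best = {"threshold": threshold, "impurity": weighted}
    return best

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array(["a", "a", "a", "b", "b", "b", "b", "b"])
print(best_gini_split(x, y))  # threshold 3.0 separates the classes perfectly
```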

Regardless of whether the decision tree is built to solve a regression or classification problem, the tree is built recursively, with each new optimal split determined within the regions created by previous splits. This “top down” approach is one example of a greedy algorithm, in which a series of locally optimal decisions are made. In other words, the optimal split is determined for each node. There is no going backwards to check if a different combination of features and splits would lead to better performance.
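To illustrate the greedy, top-down recursion, here is a bare-bones regression sketch in Python (illustrative names, a single feature, and a simple minimum-size stopping rule, all assumptions for the example): each call takes the best split available to its own cases and then recurses into the two children, never revisiting earlier decisions.

```python
import numpy as np

def grow_tree(x, y, min_size=3):
    """Greedy, top-down tree growth on a single feature.

    Each call finds the locally optimal SSE split for its own cases and
    recurses on the two resulting regions; earlier splits are never revisited.
    """
    # Stopping criterion: too few cases, or no variation left to explain
    if len(y) < min_size or np.all(y == y[0]):
        return {"prediction": float(np.mean(y))}  # leaf: c = mean of the region

    best = None
    for threshold in np.unique(x)[:-1]:  # largest value cannot create two regions
        left, right = x <= threshold, x > threshold
        sse = np.sum((y[left] - y[left].mean()) ** 2) + \
              np.sum((y[right] - y[right].mean()) ** 2)
        if best is None or sse < best["sse"]:
            best = {"threshold": threshold, "sse": sse}

    if best is None:  # every case has the same x value: nothing to split on
        return {"prediction": float(np.mean(y))}

    left, right = x <= best["threshold"], x > best["threshold"]
    return {
        "threshold": best["threshold"],
        "left": grow_tree(x[left], y[left], min_size),
        "right": grow_tree(x[right], y[right], min_size),
    }

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 5.0, 4.9, 5.1, 5.2, 9.0])
print(grow_tree(x, y))  # nested dict of thresholds and leaf predictions
```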