33.3 Fitting a decision tree

As an applied example, let’s fit a decision tree model to data from the 2019 Data Science Bowl on Kaggle. The data come from PBS KIDS and were collected through Measure Up!, an educational gaming app. You can actually still submit to this competition (as of the time of this writing) if you’d like to see how well your model performs. The objective is to fit a model that predicts kiddos’ (ages 3-5) scores on an in-game assessment.

The outcome is accuracy_group, which is an ordered categorical variable ranging from 0-3. See the Kaggle description of the data for more information.

33.3.1 Load the data

First, we need to read in the train.csv and train_labels.csv data. Note that I’ve removed the event_data column, which holds JSON data on event_count, event_code, and game_time; these are already represented as columns in train.csv. I’ve also read in only the first 10,000 rows of the data so that things will run more quickly.

k_train <- get_data("ds-bowl-2019")
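If you don’t have the get_data() helper handy, the steps described above (read a limited number of rows, drop event_data, attach the labels) can be sketched in base R. This is only an illustration, not the code behind get_data(): the toy data frames, the temporary files, the choice of game_session and installation_id as join keys, and the use of a left join are all assumptions made so the example runs on its own.

```r
# Toy stand-ins for train.csv and train_labels.csv (made-up values), written
# to temporary files so the sketch runs without the Kaggle download.
train_file  <- tempfile(fileext = ".csv")
labels_file <- tempfile(fileext = ".csv")

toy_train <- data.frame(
  game_session    = c("s1", "s1", "s2", "s3"),
  installation_id = c("a", "a", "a", "b"),
  event_data      = c("{}", "{}", "{}", "{}"),
  event_count     = c(1L, 2L, 1L, 1L),
  event_code      = c(2000L, 4100L, 2000L, 2000L),
  game_time       = c(0L, 1500L, 0L, 0L)
)
toy_labels <- data.frame(
  game_session    = c("s1", "s3"),
  installation_id = c("a", "b"),
  accuracy_group  = c(3L, 0L)
)
write.csv(toy_train, train_file, row.names = FALSE)
write.csv(toy_labels, labels_file, row.names = FALSE)

# Read only the first n_rows rows, then drop the JSON column.
n_rows <- 3  # the chapter uses 10,000; 3 keeps the toy example small
train  <- read.csv(train_file, nrows = n_rows, stringsAsFactors = FALSE)
train$event_data <- NULL
labels <- read.csv(labels_file, stringsAsFactors = FALSE)

# Left join (assumed): keep every event row and attach a label where the
# session has one; sessions without a label get NA for accuracy_group.
joined <- merge(
  train, labels,
  by    = c("game_session", "installation_id"),
  all.x = TRUE
)
joined
```

With the real files you would point read.csv() at train.csv and train_labels.csv and set nrows = 10000. A left join keeps events from sessions that have no assessment label; an inner join (all.x = FALSE) would drop them instead, so which one is appropriate depends on how the data will be used downstream.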

After joining the training data with the labels, our full training dataset looks like this