Forecasting

Keep on Splitting with Decision Tree Regressors

January 10, 2024

Who would have guessed that a tall stack of weak learners would work well for predicting daily energy use? Well, it does, and this is one of the results from some research we did recently on Decision Tree Regression models. The results of out-of-sample accuracy tests for a series of modeling approaches are shown below. First, look at the red bar. That is for the benchmark model (a proven least squares regression model), and the mean absolute percent error (MAPE) in 100 out-of-sample tests averaged 1.51%. You can see the comparable test statistics for five flavors of decision tree models in the five preceding bars. The Gradient Boosting, Extreme Boosting, and Boosted Forest methods are very competitive, and that is without spending much time on feature engineering. In comparison, plain old Decision Tree Regressor (a single deep, wide tree) and Random Forest Regressor (a group of deep, wide trees) did not fare so well.

Graph 1

A couple of mixed methods (shown below the red Regression bar) did do a bit better. We will dig into that at the end. First, some background.

Decision Tree Models. Decision tree models were initially developed for classification problems, but they can also be applied to prediction problems with continuous outcome variables. When so applied, they are called Decision Tree Regressors or Decision Tree Regression Models. Recently, we took a look at how well these methods work when applied to forecasting daily energy use. The full results are summarized in a White Paper titled Overview of Decision Tree Modeling Methods which you can download from the forecasting section of the Itron website. The rest of this blog provides more background and a few kernels of wisdom.

The Data – Outcomes and Features. We start with data values for an outcome variable, which is daily energy deliveries for an electric utility, and corresponding values for explanatory variables. The data covers the three years before the pandemic, giving a total of 1,095 observations. The main explanatory variable is daily average temperature, but there are others like average cloud cover, average wind speed, month of the year, and day of the week. The following chart shows the outcome variable (daily energy) on the Y axis and daily average temperature on the X axis. Each point represents energy use on one day, and the color coding shows the day type (Weekday, Saturday, Sunday, Holiday). The average value (275.5 GWh) is represented by the thick black line. The mean absolute deviation (MAD) statistic is the average absolute deviation around the black line. The mean absolute percent error (MAPE) statistic is the average absolute percentage deviation around the black line. So, any model that we build really needs to do better than a MAPE value of 12.5%, because that is what we get if we just predict daily energy to be at the average value no matter what the conditions are.

Graph 2
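
For readers who like to see the arithmetic, here is a minimal sketch of that no-model baseline in Python. The file and column names are hypothetical, just for illustration: predict the average for every day and measure the deviations around it.

```python
import pandas as pd

# Hypothetical data: one row per day with a daily energy column (GWh)
df = pd.read_csv("daily_energy.csv")
y = df["energy_gwh"]

mean_energy = y.mean()                              # the thick black line (about 275.5 GWh here)
mad = (y - mean_energy).abs().mean()                # mean absolute deviation around the mean
mape = ((y - mean_energy).abs() / y).mean() * 100   # about 12.5% for this data

print(f"Mean: {mean_energy:.1f} GWh, MAD: {mad:.1f} GWh, MAPE: {mape:.1f}%")
```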

Variables vs. Features. In econometrics, variables used to explain an outcome are called explanatory variables. In Data Science, these variables are called features. It’s all the same thing. A feature is basically a column of data. And if we transform a variable or interact variables, in data science, this is called feature engineering. In econometrics, it is just part of the model specification.

Keep on Splitting. Decision tree regressors come in many flavors. However, they all have a common building block, which is the idea of splitting. First, we dump all the data (outcome values and feature values) into a big bucket called the root node. Then split the data into two branch nodes based on a factor value, for example, split days with daily average temperature less than 75 degrees to the left branch and other days to the right branch. Then split each branch node into two sub-branch nodes. Then split each sub-branch node into two sub-sub-branch nodes. Keep on splitting until it is time to stop. At the end, each terminal node is called a leaf. Compute the average outcome value in each leaf, and that is the predicted value. 

That is how a decision tree regressor works. Any combination of explanatory variable values that you put in will lead to a terminal leaf, and the model will return the average value computed from the training data that ended up in that leaf. 
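
As a concrete sketch, here is roughly how that flow looks with scikit-learn's DecisionTreeRegressor (one common implementation, not necessarily the one we used; the data file and feature names are assumptions):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data: daily energy plus a few features
df = pd.read_csv("daily_energy.csv")
features = ["avg_temp", "avg_wind", "avg_clouds", "day_of_week", "month"]  # assumed columns
X, y = df[features], df["energy_gwh"]

# Grow the tree: fit() keeps splitting nodes until a stopping rule is hit
tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)  # illustrative stopping rule
tree.fit(X, y)

# Prediction: any new combination of feature values is routed to a terminal leaf,
# and the model returns the average training outcome for that leaf
new_day = pd.DataFrame([[78.0, 4.0, 0.2, 2, 7]], columns=features)
print(tree.predict(new_day))
```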

Splits are Greedy. At any decision node in a tree, the logic for splitting is the same. First, for each explanatory factor and each possible split value, compute an improvement statistic (for example, the sum of squared deviations). Then find the best split value for that factor. Pick the factor for which the best split value gives the biggest improvement, and that defines the split rule for the node. When making a split, don’t look backward to past splits or strategize about future splits. Just focus on the split at hand. That’s the greedy part. There is no strategy, just a rule to minimize errors or maximize improvement at the current split.
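
To make the greedy logic concrete, here is a from-scratch sketch of the split search for a single node, using the sum of squared deviations as the improvement statistic (illustrative code, not an efficient production routine):

```python
import numpy as np

def best_split(X, y):
    """Greedy split search for one node: try every feature and every candidate
    threshold, and keep the split with the lowest total sum of squared
    deviations across the two child nodes."""
    best = (None, None, np.inf)               # (feature index, threshold, score)
    for j in range(X.shape[1]):
        for threshold in np.unique(X[:, j]):
            left = y[X[:, j] <= threshold]
            right = y[X[:, j] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue
            # improvement statistic: SSE of each child around its own mean
            score = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if score < best[2]:
                best = (j, threshold, score)
    return best   # no look-back, no look-ahead: just the best split right now
```

Run the same search again inside each child node and you have the whole tree-growing procedure.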

It's Nonparametric. Decision tree methods are called nonparametric. That is because the model has splitting rules, but no slope parameters. If you think about a regression model, slope parameters are estimated that tell us the impact of changes in a factor value. For example, a slope parameter on daily average temperature would tell us how much daily energy is expected to increase if it is one degree hotter. In contrast, decision trees have splitting rules that navigate us to a terminal leaf, but within that leaf, there is no slope, only a single predicted value. So, if the terminal leaf includes data for all days with temperatures greater than 85 degrees, those days all get the same predicted value. 
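
You can see this directly from the fitted tree in the earlier sketch: days with different temperatures that land in the same terminal leaf get an identical prediction.

```python
# Using the `tree` and `features` from the earlier sketch. tree.apply() reports
# which terminal leaf each day lands in; days that share a leaf share a prediction,
# because the leaf stores one average value, not a slope.
hot_days = pd.DataFrame([[88.0, 4.0, 0.2, 2, 7],
                         [95.0, 4.0, 0.2, 2, 7]], columns=features)
print(tree.apply(hot_days))     # leaf indices
print(tree.predict(hot_days))   # identical whenever the leaf indices match
```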

Preventing Overfitting. A decision tree regressor that is not constrained will continue to split until there is no advantage to splitting further. With continuous factor values, that could take us to the point where there is only one observation in each leaf. It turns out that going that far is not very useful in terms of predicting outside of the data used to train the model. There are various ways to stop the splitting process, like setting the maximum number of splitting levels (the depth of the tree), or the smallest allowable leaf size. The usual approach is to use out-of-sample testing to identify a good level for these limits. The limits then become the settings (sometimes called hyperparameters) for the model.
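
In scikit-learn terms, those limits show up as hyperparameters, and one common way to set them is a grid search scored out of sample with cross-validation. A sketch, assuming the X and y from the earlier example:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Candidate settings for the stopping rules (hyperparameters)
param_grid = {
    "max_depth": [3, 5, 8, None],        # maximum number of splitting levels
    "min_samples_leaf": [1, 5, 20, 50],  # smallest allowable leaf size
}

# Out-of-sample testing (5-fold cross-validation here) picks the settings
# that predict best on data the tree was not trained on
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_percentage_error",
    cv=5,
)
# search.fit(X, y)            # X, y as built in the earlier sketch
# print(search.best_params_)  # the chosen settings
```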

Benchmark Model. The benchmark model is a regression model applied to the data shown above. Daily energy is the variable to be explained and the explanatory variables are daily weather (average temperature, wind speed, cloud cover) and calendar variables (like day of the week, month, and holidays). The benchmark model uses a well-proven and robust specification with multiple heating degree variables (degrees below a specified base value) and cooling degree variables (degrees above a specified base value). It also includes variables that interact wind and clouds with the cooling degree and heating degree variables.
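
A rough sketch of that style of feature construction is shown below. The base temperatures and column names here are illustrative assumptions, not the actual benchmark specification.

```python
import numpy as np
import pandas as pd

def add_degree_day_features(df, heat_bases=(55, 60, 65), cool_bases=(65, 70, 75)):
    """Build heating/cooling degree variables from daily average temperature,
    plus simple wind and cloud interactions. Bases and names are illustrative."""
    out = df.copy()
    for b in heat_bases:
        out[f"hdd_{b}"] = np.maximum(b - out["avg_temp"], 0)             # degrees below base
        out[f"hdd_{b}_wind"] = out[f"hdd_{b}"] * out["avg_wind"]          # wind interaction
    for b in cool_bases:
        out[f"cdd_{b}"] = np.maximum(out["avg_temp"] - b, 0)             # degrees above base
        out[f"cdd_{b}_clouds"] = out[f"cdd_{b}"] * out["avg_clouds"]      # cloud interaction
    return out
```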

Benchmark Performance. To compare performance, models are estimated 100 times using different 90% random subsets of the data for estimation (training) and using the remaining 10% of the data for out-of-sample testing. The test statistic we are using here is the mean absolute percentage error or MAPE, and the value for the benchmark regression is 1.51%. 
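
Here is a sketch of that testing scheme using scikit-learn's ShuffleSplit, written as a helper that works for any of the models discussed below (the estimator passed in is a stand-in, and X and y are the hypothetical data from the earlier sketches):

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import ShuffleSplit

def out_of_sample_mape(model, X, y, n_splits=100, test_size=0.10, seed=0):
    """Estimate the model on 100 random 90% subsets and report the average
    MAPE on the held-out 10% of each split."""
    splitter = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
    mapes = []
    for train_idx, test_idx in splitter.split(X):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        pred = model.predict(X.iloc[test_idx])
        mapes.append(mean_absolute_percentage_error(y.iloc[test_idx], pred))
    return 100 * np.mean(mapes)   # in percent (1.51% for the benchmark here)
```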

Types of Decision Tree Models. To start, there is a basic Decision Tree Regressor (DTR) model. This is a single tree that is potentially very wide and very deep. The second method is called a Gradient Boosting Regressor (GBR). This is a stack of sequential models, and at each step, a decision tree model is applied to the errors from the prior step. The third method is a Random Forest Regressor (RFR), and this is a group of wide and deep decision trees where each tree is exposed to a randomly selected subset of the training data and the factors available for splits are a randomly selected subset of the explanatory factors.  
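
In scikit-learn, those three methods correspond to the following estimators (the settings shown are only illustrative):

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

models = {
    # One tree, potentially very wide and very deep
    "DTR": DecisionTreeRegressor(random_state=0),
    # A stack of small trees, each fit to the errors of the stack so far
    "GBR": GradientBoostingRegressor(random_state=0),
    # Many deep trees, each grown on a random subset of rows and features
    "RFR": RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0),
}
# Each one could be scored with the out_of_sample_mape helper sketched above.
```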

That’s all a bit too much to go through in detail. But just to get the idea, the following is a Decision Tree Regressor model using daily average temperature as the only explanatory feature. The tree is restricted to allow only 10 terminal leaves. At the top, all observations appear in the root node. The best place for the first split is at a temperature of 75.4 degrees. Colder days go to the left and warmer days go to the right. If a box is shaded brown (like a tree trunk), it is a decision node. If it is light green, it is a terminal or leaf node. In each box, “samples” is the number of observations in the box, “value” is the average daily energy for days that land in the box, and “mse” is the mean squared error of daily energy values for days that land in the box.

Graph 3

With this simple 10-leaf decision tree, the in-sample MAPE value is down to 4.2%. If you look at the green boxes, on the left we have the coldest days, and on the right, we have the hottest days. Each green box represents a temperature range, and its predicted value is the average energy use of the training days that fall in that range. That is how a DTR model with a single feature works.
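
If you want to reproduce something like Graph 3, a sketch along these lines would do it (again assuming the hypothetical df from the earlier sketches; scikit-learn's plot_tree draws the boxes):

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Temperature is the only feature; the tree may grow at most 10 terminal leaves
small_tree = DecisionTreeRegressor(max_leaf_nodes=10, random_state=0)
# small_tree.fit(df[["avg_temp"]], df["energy_gwh"])   # df as in the earlier sketches

# Draw the tree: decision nodes show the split temperature, and every node shows
# samples, value (average daily energy), and the squared-error statistic
# plot_tree(small_tree, feature_names=["avg_temp"], filled=True)
# plt.show()
```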

There are lots of ways to look at this, and a lot more to dig into, but let’s leave it there. If you are interested in a deeper dive, check out the white paper, which goes about as deep as you would ever want to go.

A Note on Boosting and Weak Learners. Clearly, of the decision tree methods, the boosting methods worked the best. As mentioned above, boosting builds a tall stack of sequential models, each of which is a very small decision tree. In fact, the best model (out-of-sample) had about 800 levels, and each level was a decision tree with only four leaves. Trees this small are called weak learners because individually they have little predictive power.

Also, in boosting, there is a learning rate. If you have ever looked at optimization algorithms, this is similar to a step size in gradient descent. The best model (out of sample) had a learning rate of 0.1, or 10%. So, at each step we model the cumulative residuals from the prior steps and add 10% of the predicted residual values to the model prediction. Then we proceed to the next step. The amazing thing is that this combination of many layers of weak learners, to each of which we pay only partial attention, produces a powerful result.
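
Putting those two settings together, a gradient boosting model in the spirit of the best one we found might be configured like this (a sketch with scikit-learn; not necessarily the exact tool or tuning we used):

```python
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(
    n_estimators=800,      # about 800 sequential stages
    max_leaf_nodes=4,      # each stage is a tiny four-leaf tree (a weak learner)
    learning_rate=0.1,     # add only 10% of each stage's fitted residuals
    random_state=0,
)
# gbr.fit(X, y)                          # X, y as in the earlier sketches
# print(out_of_sample_mape(gbr, X, y))   # using the test harness sketched above
```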

Final Thoughts. The gradient boosting idea looked so promising that we took it a step further by combining it with the regression model. The results are presented below the red bar in the initial figure, which is repeated below. 

To produce the Regression with Boosting result, first estimate the regression model. Then apply gradient boosting to the residuals from this model. The boosting model figures out if there are combinations of conditions for which the regression model residuals are positive on average or negative on average. So, the boosting component is an error correction model. The figure also shows the same result for Regression with an AR1 time-series error correction. When we start forecasting, these error corrections behave differently. The AR1 correction dies out exponentially as we move beyond the end of the data. The Boosting error correction does not, so it is a permanent part of the structural model.
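
A sketch of the Regression with Boosting idea, using a plain linear regression as a stand-in for the benchmark specification (this is an illustration of the idea, not our production code):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

class RegressionWithBoosting:
    """First a regression model, then gradient boosting on its residuals
    as an error-correction layer."""

    def __init__(self):
        self.base = LinearRegression()   # stand-in for the benchmark specification
        self.correction = GradientBoostingRegressor(
            n_estimators=800, max_leaf_nodes=4, learning_rate=0.1, random_state=0
        )

    def fit(self, X, y):
        self.base.fit(X, y)
        residuals = y - self.base.predict(X)
        self.correction.fit(X, residuals)   # learn where the regression is off, on average
        return self

    def predict(self, X):
        return self.base.predict(X) + self.correction.predict(X)
```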

For the data we looked at, both error correction approaches are equally promising in terms of reducing out-of-sample errors. And that is the proof of the pudding.

Graph 4

By Stuart McMenamin


Dr. J. Stuart McMenamin is the Managing Director of Forecasting at Itron, where he specializes in the fields of energy economics, statistical modeling, and software development. Over the last 35 years, he has managed numerous projects in the areas of system load forecasting, price forecasting, retail load forecasting, end-use modeling, regional modeling, load shape development, and utility data analysis. In addition to directing large analysis projects, Dr. McMenamin directs the development of Itron’s forecasting software products (MetrixND, MetrixLT, Forecast Manager, and Itron Load Research System). Prior to these efforts, he directed the development and support of the EPRI end-use models (REEPS, COMMEND, and INFORM). Related to this work, Dr. McMenamin is the author of several dozen papers relating to statistical modeling, energy forecasting, and the application of neural networks to data analysis problems. In prior jobs, Dr. McMenamin conducted research in the telecommunications industry, worked on the President's Council of Wage and Price Stability under the Carter Administration, and lectured in economics at the University of California, San Diego. Dr. McMenamin received his B.A. in Mathematics and Economics from Occidental College and his Ph.D. in Economics from UCSD.