Abstract

Renewable energy has become increasingly popular compared with traditional energy such as coal. The relative demand for renewable energy compared to traditional energy is an important index for determining the energy supply structure, so forecasting this relative demand index has become essential. Data mining methods such as decision trees are quite effective in such time series forecasting, but the theory behind them is rarely discussed in research. In this paper, we explore some theory of decision trees, including the behavior of bias, variance, and squared prediction error of tree predictions, together with prediction interval analysis. Real UK grid data are then used in an interval forecasting application. In this renewable energy ratio forecasting application, the ratio of renewable energy supply over that of traditional energy can be dynamically forecasted with an interval coverage accuracy higher than 80% and a small width of around 22, which is similar to the ratio's standard deviation.

1. Introduction

Renewable energy such as solar and wind has been playing an integral role in sustaining the power supply and relieving environmental pollution and the global warming crisis. With the increasing penetration of renewable energy, determining the amount of renewable energy generation is critical to maintaining the energy balance and the stability and reliability of power networks. Forecasting the mixing shares of energy generation offers guidance for setting up the power generation of each energy source and ensures that the load demand of power networks is satisfied [1, 2]. Data-based prediction methods, in particular machine learning methods, provide a promising solution to infer the required ratios of energy generation, among which the decision tree is a well-recognized approach due to its satisfactory accuracy and interpretability [3–6].

Although the decision tree provides an effective forecasting method, the theory explaining when and how it performs well is rarely discussed. The required ratios of renewable energy generation can be seen as a linear time series. In this context, we explore how the tree model performs in terms of bias, variance, and prediction error. In addition, point prediction alone is not sufficient in time series prediction, so we also provide prediction interval choices, such as Gaussian and quantile intervals, in theory, with an application to renewable energy ratio forecasting.

The decision tree [7] is a nonparametric supervised learning method used for discovery- and prediction-oriented classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Compared to other data mining methods, the decision tree has its own advantages. (1) For causal relationships, it can deal with nonlinear models. In most cases, economists pay more attention to linear models, and a nonlinear model is often transformed into a linear one. In problems like consumer behavior analysis, the number of variables can reach tens or even hundreds, which inevitably leads to high correlation among variables; in that case, coefficients may carry the wrong meaning in reality. The decision tree, however, provides variable importance ranking criteria, which helps greatly. (2) In terms of comprehensibility, it tends to be better than “black-box” models like neural networks, which means it can describe the data structure more clearly and help readers understand the information involved. These properties bring convenience to decision making in medical treatment [8–10], e-commerce [11–13], and so on.

We now explore the performance of trees when fitted to data generated from a linear model. The corresponding bias, variance, and prediction error between the fitted simplified tree and the true simple linear model will be calculated. Then, we explore how those errors vary when the linear data distribution changes. The motivation is to understand how trees perform under different distributions. Afterwards, prediction intervals are proposed using Gaussian and quantile intervals, which explains why the quantile interval was chosen in the study by Zhao et al. [14]. The simple linear model in use is given below, where the deterministic component is the true model. It is supposed throughout this paper that the predictor follows a uniform distribution and the noise is independently Gaussian. The uniform distribution guarantees that if the tree has k terminal nodes, the sample size in each node will be equal, which is convenient in theory and simulation analysis. Decision tree analysis under the uniform distribution assumption includes the work by Hancock [15], Jackson and Servedio [16], and White and Liu [17]. Other distributions can also be considered, but the analysis would be much more complex, as the sample size of each terminal node would depend on many parameters.

The expected squared prediction error (SPE) is one of the important metrics measuring how well the trained model applies to further unseen data. As shown in Hastie et al. [18], the SPE of a regression fit f̂(x) at an input point x0 is

Err(x0) = E[(Y − f̂(x0))² | X = x0] = σ_ε² + [E f̂(x0) − f(x0)]² + E[f̂(x0) − E f̂(x0)]². (2)

In (2), the first term is the variance of the target around its true mean and cannot be avoided no matter how well f(x0) is estimated, unless σ_ε² = 0. The second term is the squared bias, the amount by which the average of the estimate differs from the true mean; the last term is the variance, the expected squared deviation of f̂(x0) around its mean. Typically, the more complex the model, the lower the (squared) bias but the higher the variance [18].
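As an illustration of this decomposition, the three terms can be estimated by Monte Carlo. The sketch below uses our own minimal setup (a sample mean fitted to Gaussian data, not the paper's exact model); all parameter values are illustrative.

```python
import numpy as np

# Minimal Monte Carlo sketch (assumed setup, not the paper's exact model):
# estimate the three terms of the SPE decomposition for the simplest fit,
# the sample mean ybar of y = mu + eps with eps ~ N(0, sigma^2).
rng = np.random.default_rng(0)
mu, sigma, n, trials = 2.0, 1.0, 50, 20000

means = np.array([rng.normal(mu, sigma, n).mean() for _ in range(trials)])
irreducible = sigma**2               # Var(y) around its true mean: unavoidable
bias_sq = (means.mean() - mu) ** 2   # squared bias: (E[fit] - true mean)^2
variance = means.var()               # variance: spread of the fit itself

# SPE for a fresh observation is the sum of the three terms
spe = irreducible + bias_sq + variance
print(round(spe, 3))                 # close to sigma^2 * (1 + 1/n) = 1.02
```

For the sample mean, the bias is essentially zero and the variance is σ²/n, so increasing n shrinks the reducible part of the error toward zero while the irreducible σ² remains.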

In Section 2, the performance of regression trees is analyzed when fitted to data whose predictor follows a uniform distribution, with additive Gaussian noise. When we predict this time series using simplified trees, the prediction error is calculated and decomposed into variance and other errors. When the Gaussian or uniform effect is strong, those errors behave differently. Further exploration is conducted in Section 3, including the best tree depth with minimum prediction error and the performance of Gaussian and quantile prediction intervals under different conditions. A real interval forecasting application is conducted in Section 4. Conclusions are drawn in Section 5. All calculations were done using R [19]; ‘waveslim’ [20] was used for wavelet decomposition and ‘ctree’ [21] for CTree.

2. Bias-Variance Exploration

2.1. Decomposition Background

For the observation , the (unconditional) expectation is

and the variance is

Neither of them depends on . In that case, and . Accordingly, for observations, the expectation and variance of the average are shown in

2.2. Decomposition in the Context of Decision Trees

In the context of decision trees, the fitted model takes the simplified form below, where k is the number of terminal nodes in the tree and the fitted value in each terminal node is the mean of the responses in that node. In a tree with only the root node, k = 1, and the fitted model is the overall mean. Then, for a point , the expectation is

and the variance is

Thus, the SPE at the point is

Then, the mean squared prediction error (MSPE) is

The comprising variance is

and the squared bias is

Now the number of terminal nodes in the decision tree is extended from 1 to a general k; then, the MSPE, squared bias, and variance for each terminal node are equal to those for any other node, since the decision tree is assumed to make equal terminal nodes with the same number of observations in each. In that case, for a general k, the MSPE is

with variance

and squared bias

It is easy to see that, with a smaller noise variance and predictor range and a larger sample size, the variance, squared bias, and MSPE will all decrease.
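These trends can be checked numerically. The sketch below uses our own assumed setup (y = x + eps with x uniform on (0, a), Gaussian noise, and a tree with k equal-width terminal nodes; the parameter values are illustrative) and estimates the MSPE on fresh data for several values of k:

```python
import numpy as np

# Hedged sketch of the k-equal-node simplified tree: x ~ Uniform(0, a),
# y = x + eps with eps ~ N(0, sigma^2); the fit in each of k equal-width
# nodes is the node mean of y.
rng = np.random.default_rng(1)
a, sigma, n = 10.0, 1.0, 600

def mspe_for_k(k, trials=200):
    errs = []
    for _ in range(trials):
        x = rng.uniform(0, a, n)
        y = x + rng.normal(0, sigma, n)
        edges = np.linspace(0, a, k + 1)
        idx = np.clip(np.digitize(x, edges) - 1, 0, k - 1)
        # node means; guard against a (very rare) empty node
        fit = np.array([y[idx == j].mean() if np.any(idx == j) else 0.0
                        for j in range(k)])
        # evaluate on fresh test data from the same model
        xt = rng.uniform(0, a, n)
        yt = xt + rng.normal(0, sigma, n)
        it = np.clip(np.digitize(xt, edges) - 1, 0, k - 1)
        errs.append(np.mean((yt - fit[it]) ** 2))
    return float(np.mean(errs))

# The squared bias shrinks like (a/k)^2 / 12 while the variance of the
# node means grows roughly like k * sigma^2 / n, so the MSPE falls fast
# at first and then flattens out.
for k in (1, 2, 5, 10, 30):
    print(k, round(mspe_for_k(k), 3))
```

The printed values reproduce the bias-variance trade-off described above: the drop in MSPE from k = 1 to k = 5 is dominated by the shrinking squared bias.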

2.3. Optimal k to Minimize MSPE

The ideal number of terminal nodes can be found by minimizing the MSPE with respect to k. Since k is a discrete integer, the target is the nearest integer to the continuous minimizer. Calculating the first derivative of the MSPE, we get

and the second derivative of the MSPE is always positive. Therefore, we only need to solve

The real root of (18) is

Having

we can approximate the root by

In addition, the root is also constrained to the admissible range. If the root does not lie in that range, the MSPE may decrease monotonically.
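Since the displayed formulas did not survive here, one plausible reconstruction (an assumption on our part) is MSPE(k) ≈ σ² + a²/(12k²) + kσ²/n for x uniform on (0, a) with n observations and noise variance σ²; setting the derivative to zero then gives k* ≈ (a²n/(6σ²))^(1/3). A quick numeric check of this reconstruction:

```python
import numpy as np

# Assumed reconstruction of the MSPE as a function of k (our notation):
# irreducible noise + squared bias of equal-width nodes + variance of node means.
def approx_mspe(k, a, sigma, n):
    return sigma**2 + a**2 / (12 * k**2) + k * sigma**2 / n

# Closed-form continuous minimizer under the same assumption
def k_star(a, sigma, n):
    return (a**2 * n / (6 * sigma**2)) ** (1 / 3)

a, sigma, n = 10.0, 1.0, 600
ks = np.arange(1, 101)
k_grid = ks[np.argmin([approx_mspe(k, a, sigma, n) for k in ks])]
print(k_grid, round(k_star(a, sigma, n), 1))  # grid minimum vs. closed form
```

The grid search lands on the integer nearest the closed-form root, matching the rounding step described above.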

By substituting the root in (19) back into (16), we get

and it is easy to see that, with the increase of the two parameters while the third is fixed, the quantity will increase. The others will be shown as figures.

Accordingly, how will the ratios vary when the parameters change? Since several parameters always appear together, they are regarded as one parameter. For the remaining pair, only their difference matters, so just one of them is changed. Here, the optimal k is calculated from the given parameters using (19), and if it does not exist, the results are not shown. The results in Figure 1 (changing one parameter) and Figure 2 (changing the other) show that, under both circumstances, the ratios all increase.

In Figure 1, when the parameter gets bigger, the data are more likely to be uniformly distributed, and the ratio increases as the data are more accurately described by a uniform distribution; besides, the ratios over the MSPE get larger as it increases. In Figure 2, when the parameter gets bigger, the Gaussian distribution plays a bigger role in data generation and the optimal depth decreases, which is why the ratio increases; the others generally decrease. The speed of decrease slows for bigger parameter values, as expected.

2.4. Simulation

In this simulation, a simplified tree model is designed to confirm the theoretical results using simulated data. That is, when the parameters of the simulated data change, the distributions of the predictor and noise will also change. The question is, how will the statistics of the variance, squared bias, and MSPE change accordingly?

In the simplified tree, the predictor range is evenly split into intervals. For specific parameter values, we calculate the statistics of the MSPE, variance, and squared bias for each interval from simulated data. Thus, for one interval, the range is

The number of observations in each interval is defined accordingly.

(i) Step 1: for the data in the interval, we train a model from them, where the fitted value is the averaged value of the simulated responses in that interval.
(ii) Step 2: repeat Step 1 to obtain a set of trained models.
(iii) Step 3: simulate one predictor value uniformly from the range of the interval. We calculate the variance, squared bias, and MSPE for this specific value.
(iv) Step 4: simulate values of the response using the model.
(v) Step 5: calculate the statistics of the variance, squared bias, and MSPE for this specific value.
(vi) Step 6: repeat Step 3 to Step 5 and calculate the means of the variance, squared bias, and MSPE.

Following Steps 1 to 6 for all intervals, we calculate the overall means as the MSPE, variance, and squared bias.
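A compact sketch of Steps 1 to 6 for a single interval follows (under our assumed model y = x + eps with x uniform on (0, a) and Gaussian eps; all parameter values and names are illustrative, not the paper's):

```python
import numpy as np

# Steps 1-6 for one interval: train T node-mean models, then decompose
# the error at points x* drawn uniformly from that interval.
rng = np.random.default_rng(2)
a, sigma, n, k, T = 10.0, 1.0, 500, 5, 400
j = 2                                  # index of the interval under study
lo, hi = j * a / k, (j + 1) * a / k    # its range, as in the steps above

# Steps 1-2: T trained models; each fit is the mean of y falling in [lo, hi)
fits = []
for _ in range(T):
    x = rng.uniform(0, a, n)
    y = x + rng.normal(0, sigma, n)
    in_node = (x >= lo) & (x < hi)
    fits.append(y[in_node].mean())
fits = np.array(fits)

# Steps 3-6: for x* in the interval, the true mean is x*; average over draws
xs = rng.uniform(lo, hi, 200)
bias_sq = np.mean((fits.mean() - xs) ** 2)   # squared bias across the node
variance = fits.var()                        # spread of the trained fits
mspe = sigma**2 + bias_sq + variance         # decomposition of Section 2.1
print(round(bias_sq, 3), round(variance, 3), round(mspe, 3))
```

The squared bias settles near the within-node predictor variance (a/k)²/12, and the fit variance scales like the node's conditional variance divided by the node sample size, matching the theory above.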

The results of simulations with 200 trials are shown in Figures 3 and 4. In the setting of Figure 3, there is a minimum MSPE; however, in the setting of Figure 4, the MSPE keeps decreasing.

3. Prediction Interval

Instead of only a point prediction, a prediction interval is also desirable, especially for time series with high variance. If both the point prediction and the prediction interval can be provided, we can be more confident in the prediction. This study also helps us decide the proper prediction interval method for decision-tree-based regression problems. The Gaussian-based prediction interval and the quantile interval are compared under different parameter distributions.

3.1. Probability Function of

Since our linear model is the sum of uniform and Gaussian distributions, the probability function for the response is

By letting , we obtain

Now we have the probability function of the response as (29). However, it is in a complex form, meaning that the parameters are not easily solvable in theory for a given probability value.

3.2. Prediction Interval as a Gaussian Distribution

If we want to obtain the prediction interval, say at a given level, the theoretical way is to obtain the endpoints from the equations

However, the integral of the density is not analytically solvable without approximating it with other suitable expressions, and the results would be quite complex. If we know the parameter values, then the endpoints can easily be found numerically.

From Figure 5, if the uniform (Gaussian) distribution plays the main role, then the response can be approximately described by a uniform (Gaussian) distribution. Under the conditions that the uniform range is not too large, the Gaussian variance is not too small, and there is only one interval, we approximate the distribution of the response by a Gaussian distribution:

Then, the prediction interval under criteria for this Gaussian distribution is around

Then, for a general k, the prediction interval becomes

which is the form of a typical Gaussian prediction interval.

3.3. Prediction Simulation Using Gaussian Prediction Interval and Quantile Interval

In this simulation, we explore the performance of Gaussian prediction intervals and quantile intervals under different parameter combinations. The parameters include the uniform range, the Gaussian variance, the sample size, and the number of terminal nodes. When the other parameters are fixed, a higher Gaussian variance means a stronger Gaussian distribution effect, in which case the Gaussian prediction interval may work well. When the uniform range is large, the uniform distribution plays a bigger role, and then the Gaussian prediction interval may not work as well. Both the Gaussian prediction interval and the quantile interval are influenced by the observation size of the terminal node: when the sample size is large, they can have stable performance, but when the sample size is small, their performance differs.

The Gaussian prediction interval in use is the point prediction plus or minus the critical value times the RMSPE, where the critical value is 1.96 and the RMSPE is the root mean squared prediction error estimated from the training data in each terminal node.
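As a minimal check of this construction (our own toy node, not the paper's data), the interval center ± 1.96 · RMSPE recovers roughly nominal coverage when the node's responses are close to Gaussian:

```python
import numpy as np

# One terminal node with Gaussian responses around a constant level of 5.0
# (an assumed toy setup); the interval is node mean +/- 1.96 * node RMSPE.
rng = np.random.default_rng(3)
sigma, z = 1.0, 1.96
y_node = 5.0 + rng.normal(0, sigma, 1000)   # training responses in the node

center = y_node.mean()
rmspe = np.sqrt(np.mean((y_node - center) ** 2))
lower, upper = center - z * rmspe, center + z * rmspe

y_new = 5.0 + rng.normal(0, sigma, 10000)   # fresh draws from the same node
coverage = np.mean((y_new >= lower) & (y_new <= upper))
print(round(coverage, 3))                   # near the nominal 0.95
```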

The quantile interval comes from the 0.025 and 0.975 quantiles of each terminal node of the training data.

(i) Step 1: training data generation. Using the given parameters, data are generated according to the model, so we obtain the true fitted values.
(ii) Step 2: RMSPE and quantiles from the training data. From the training data, the trained model, RMSPE, and quantiles are calculated in the following steps.
Step 2.1: model training. For the training data (the rest of the data being the test data), we sort the data in ascending order, so the responses are rearranged following the predictor, and the predictor range is divided into roughly equal successive folds. For each fold, the predicted value is the averaged response value, as in the trees context: the predicted value of a tree model is the averaged response value of each terminal node, and samples split into a terminal node take the corresponding averaged value as their predicted value.
Step 2.2: RMSPE and quantile calculation. When the model is trained, the predicted values for the training data are obtained; the RMSPE is then calculated from the training residuals, and the quantile interval endpoints are the 0.025 and 0.975 quantiles of the training data.
(iii) Step 3: test data generation and model testing. Using the same parameters as in Step 1, new data are generated, the test data are put into the trained model, and the coverage is computed.
(iv) Step 4: repeat Steps 1 to 3.

Repeat Steps 1 to 3 a number of times to get an averaged coverage.
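The comparison in Steps 1 to 4 can be sketched for a single terminal node (our own setup, with illustrative parameter values: the response within the node is uniform-plus-Gaussian). When the Gaussian part dominates, both intervals sit near nominal coverage, while under a dominant uniform part the Gaussian interval over-covers and only the quantile interval stays near 0.95:

```python
import numpy as np

# Within one node the response is x + eps, x ~ Uniform(0, node_width),
# eps ~ N(0, sigma); compare coverage of the Gaussian interval
# (mean +/- 1.96 * sd) with the empirical 0.025/0.975 quantile interval.
rng = np.random.default_rng(4)

def coverages(node_width, sigma, n_train=2000, n_test=20000):
    def draw(n):
        return rng.uniform(0, node_width, n) + rng.normal(0, sigma, n)
    y = draw(n_train)
    g_lo, g_hi = y.mean() - 1.96 * y.std(), y.mean() + 1.96 * y.std()
    q_lo, q_hi = np.quantile(y, [0.025, 0.975])
    yt = draw(n_test)
    return (float(np.mean((yt >= g_lo) & (yt <= g_hi))),
            float(np.mean((yt >= q_lo) & (yt <= q_hi))))

gauss_dominant = coverages(node_width=0.5, sigma=2.0)   # Gaussian part dominates
unif_dominant = coverages(node_width=10.0, sigma=0.5)   # uniform part dominates
print(gauss_dominant, unif_dominant)
```

Under the dominant uniform part, the Gaussian interval's ±1.96 standard deviations reaches beyond the distribution's bounded support, so its coverage climbs toward 1 at the cost of a needlessly wide interval, mirroring the conclusion drawn below.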

Using the given parameters, the results are shown in Figure 6.

The results show that the quantile interval coverages are closer to the 0.95 reference line for fixed parameters. The Gaussian prediction interval is only close to 0.95 coverage when the sample size is large; otherwise, its coverage exceeds 0.95 at the cost of a wider interval. When the number of terminal nodes is chosen as the best k, the coverages get closer to the 0.95 reference line as the sample size increases for both quantile and Gaussian prediction intervals. However, when the uniform distribution effect gets stronger, all the coverages move far from 0.95. Accordingly, when the number of observations in each terminal node is large and the data distribution is not obviously Gaussian, quantile intervals are suggested; when the data follow an obviously Gaussian distribution, Gaussian prediction intervals are recommended.

4. Real Application

We have explored the performance of decision trees under different circumstances; a real application is conducted in this section. The data come from UK Gridwatch (http://www.gridwatch.templar.co.uk/) and comprise the grid demand data and the supply data of each energy source. The time series spans the years 2011 to 2020, a total of 953824 observations with a record every 5 minutes. The details are shown in Figure 7.

From the figure, we can see that the grid demand changes periodically, as expected, since there are daily and seasonal peak and valley values, while the general trend of grid demand changes little. Some kinds of energy, like wind and biomass, have increased substantially in supply over these years; they will be used more frequently in the future than traditional energy like coal, as they are more environmentally friendly. We construct a metric to measure the ratio of other energy supply over that of coal. After deleting observations with missing or zero coal values, 847922 observations are left, as shown in Figure 8.

We average the time series from a frequency of 5 minutes to a daily basis, ending with 2954 observations. Forecasting is then conducted on the ratio to determine how much renewable energy is needed in the near future. The interval forecasting method we use is our previously designed method ctreeone (Zhao et al. [14]), which uses the tree method ctree in a dynamic interval forecasting context. The only parameter we change is the time gap, set to 7 (weekly dynamic forecasting), leaving the other parameters unchanged.
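The preprocessing described above can be sketched as follows (a hedged illustration on synthetic data: the column names and the exact Gridwatch export format are our assumptions, not the real file's headers):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Gridwatch export: 5 days of 5-minute records
# with assumed column names for coal, wind, and biomass supply.
rng = np.random.default_rng(5)
idx = pd.date_range("2011-01-01", periods=5 * 288, freq="5min")
df = pd.DataFrame({
    "coal": rng.uniform(1, 5, len(idx)),
    "wind": rng.uniform(0, 10, len(idx)),
    "biomass": rng.uniform(0, 3, len(idx)),
}, index=idx)

df = df[df["coal"] > 0]                     # drop zero/missing coal records
ratio = (df["wind"] + df["biomass"]) / df["coal"]
daily = ratio.resample("D").mean()          # 5-minute series -> daily average
print(len(daily))                           # one averaged value per day
```

The daily-averaged ratio series produced this way is the kind of input the dynamic interval forecaster operates on.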

Interval forecasting provides not only the point forecast but also the prediction interval that the predicted point belongs to. Small changes in the ratio happen often and have little influence on the energy supply and demand system, so no action is needed in those circumstances. When the predicted ratio changes a lot, beyond a preset limit, an alarm can be raised to help the system accommodate the new circumstance, for example, by producing more renewable energy in advance to meet the instant demand. The interval forecasting model provides such an alerting system for adjusting energy production.

The results are shown in Table 1 and Figure 9. The coverage and width strike a good balance; that is, a higher coverage costs a relatively larger width. We end with a coverage of 80.31% and a suitable width of 22.95, which is similar to the ratio's standard deviation of 19.78.

5. Conclusion

In this paper, the data are constructed using a simple model that includes both Gaussian and uniform distributions. We explore the squared prediction error in the context of trees and decompose that error into bias, variance, and irreducible error. The bias decreases when the tree gets bigger; however, for the squared prediction error and variance, the relationship is not monotonic. We also calculate the best tree depth with minimum mean squared prediction error. When the Gaussian effect dominates, the best tree depth decreases; when the uniform effect dominates, the best tree depth increases. Under both circumstances, the mean squared prediction error, variance, and bias all increase.

After that, two options are given for the prediction interval: the Gaussian prediction interval and the quantile interval. When the Gaussian distribution is clearly dominant, Gaussian prediction intervals are suggested; otherwise, quantile intervals are suggested, which is also why quantile intervals are chosen as the prediction intervals in our regression application, although both perform poorly when the uniform distribution is quite strong. When the number of observations in the terminal node is small, both interval constructions perform poorly in terms of coverage.

In the real data application, we applied our method to the UK grid energy supply and demand data to forecast the ratio of renewable energy supply over that of coal. We obtained good forecasting results: 80.31% interval coverage and 22.95 interval width. The method can be extended to models other than decision trees as well.

We use the decision tree model for interval forecasting; in practice, other models can also be considered. For example, Hall et al. [22] used multiple nonlinear regression to forecast and analyze changes in climate and weather dynamics and proposed a simple model averaging approach to reduce model and prediction uncertainty. Besides the decision tree, other dynamic regression models can also be considered; for example, Gu et al. [23] used a dynamic regression model to predict the dynamics of a specific space weather index and proposed a new approach for prediction uncertainty analysis using point-cloud model parameters. A dynamic regression model was also applied to social dynamic behavior modeling and analysis [24].

In future research, the model can be applied to more kinds of datasets to test its generalization ability. In the simulation, nonlinear models can also be considered instead of the linear model to test tree performance.

Data Availability

The source code and simulation data in the theory exploration are available from the corresponding author upon request. The real data in application can be openly accessed from Elexon Portal (cited June 2020) [25].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research is improved from the PhD thesis of Xin Zhao. She is grateful for the financial support of the Fundamental Research Funds for the Central Universities (nos. 2242020R40073 and 2242020R10051) and Jiangsu Science Foundation for Youths (no. SBK2020040696) during this research, which was mostly completed during her PhD studies at the University of Leeds. Xiaokai Nie is grateful for the financial support of the Fundamental Research Funds for the Central Universities (no. 2242020R10053), Nanjing Prioritized Fund for Science and Technology Innovation (no. 1108000241), and Essential Science Indicator Improvement Funds of Southeast University (no.4016002011) during this research.