Abstract
Electricity price forecasts affect the decision-making of power generation enterprises, power supply enterprises, and power consumers. Big data contain many irrelevant samples and features, which often lead to low forecasting accuracy and high time-cost. Therefore, this paper proposes a forecasting framework based on big data processing, which selects a small quantity of data to achieve accurate forecasting while reducing the time-cost. First, sample selection based on grey correlation analysis (GCA) is established to eliminate useless samples according to periodicity. Second, feature selection based on GCA is established, considering the feature classification and the temporal correlation features, to further eliminate useless features. Third, principal component analysis is applied to reduce the noise among the data. Then, a support-vector machine (SVM) combined with a differential evolution algorithm (DE) is applied to forecast the price. Finally, the proposed framework is applied to the New England electricity market to forecast the short-term electricity price. The results show that, compared with DE-SVM without data processing, the forecasting accuracy is improved from 81.68% to 91.44% and the time-cost is decreased from 35,074 s to 1,809 s, which shows that the proposed method and model can provide a valuable tool for data processing and forecasting.
1. Introduction
In the power market, the electricity price is an essential element because it influences the behaviour of power generation enterprises, power supply enterprises, and other buyers [1]. In recent years, with the rapid development of smart grids, the increase in the number of power users, and the large-scale development of renewable energy, more and more factors have begun affecting the electricity price, which makes price forecasting more difficult. Therefore, extracting and mining useful information from the electricity price and its influencing factors is extremely important for accurate and timely forecasting.
Currently, the methods of forecasting include traditional forecasting methods [2–6] and intelligence methods [7–12]. Among these methods, support-vector machine (SVM) has higher generalization ability and good robustness, so it is widely applied to forecast electricity price. A framework composed of time series segmentation, recursive feature elimination, and minimum redundancy maximum relevance was developed based on SVM [13]. Chen et al. [14] proposed a combination of nonlinear regression and SVM to forecast the price in peak and off-peak periods. In [15], a least-squares SVM, combining radial basis function (RBF) and universal SVM kernel function, was applied to forecast electricity price.
Data processing has a great impact on forecasting accuracy. An efficient sparse autoencoder was proposed to extract features and a nonlinear autoregressive network was applied to forecast electricity price [16]. In [17], the improved wavelet transform was applied to process the input data, and the new feature selection was applied to filter the input data. XG-boost, decision tree, recursive feature elimination, and random forest were applied to feature selection and feature extraction [18]. Jahangir et al.[19] used grey correlation analysis (GCA) to select features, and the data were denoised by using a deep neural network with stacked denoising autoencoders.
To the best of the authors’ knowledge, research on electricity price forecasting based on data analysis still has the following gaps:
(1) Electricity price data are massive and contain a great deal of useless information, which tends to lower the forecasting accuracy. Moreover, SVM has a high time-cost when using big data. Previous studies have applied data processing to the features only, ignoring that too many useless samples also affect the forecasting accuracy and the time-cost.
(2) Existing forecasting models are generally based on a single sample set composed of features, combined with several data processing models, which tends to extract excessive data and results in poor forecasting accuracy.
The main contributions of this paper are as follows:
(1) Sample selection, feature selection, and feature extraction are applied to clean the massive data, decreasing the amount of data while making full use of the useful information, so as to improve the accuracy of the SVM model and reduce the time-cost.
(2) Considering the periodicity and the temporal correlation of the electricity price, feature classification is proposed to avoid excessive extraction of data and to reduce the probability of the forecasting model falling into local optima.
(3) A framework of electricity price forecasting based on big data is proposed to realize 24-hour forecasting in one pass, rather than repeating single-point forecasting 24 times.
The main structure of this paper is as follows: Section 2 introduces the model of electricity price forecasting in this paper; Section 3 introduces the method of data processing; Section 4 shows the numerical results and details of a case study; and Section 5 discusses the results of this study.
2. Price Forecasting Model Based on DE-SVM
2.1. SVM Model
SVM is based on the Vapnik–Chervonenkis (VC) dimension theory [20] and the principle of structural risk minimization [21]. It seeks the best compromise between model complexity and learning ability. Compared with artificial neural networks (ANN) and random forest (RF), SVM has a better generalization ability and higher forecasting accuracy in solving nonlinear problems. Therefore, SVM is selected as the predictor, as shown in [22].
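The SVM problem can be transformed into the following expression:
$$\min_{w,\,b}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n} L_{\varepsilon}\big(f(x_i), y_i\big), \qquad f(x_i) = w^{T}\varphi(x_i) + b, \tag{1}$$
where $w$ is a weight vector; $b$ is a bias; $C$ is the regularization constant; $L_{\varepsilon}$ is the insensitive loss function; $f(x_i)$ is the actual output of sample $i$; $x_i$ is the feature related to the electricity price; and $y_i$ is the actual price.
The insensitive loss function is defined as follows:
$$L_{\varepsilon}\big(f(x_i), y_i\big) = \begin{cases} 0, & |y_i - f(x_i)| \le \varepsilon, \\ |y_i - f(x_i)| - \varepsilon, & \text{otherwise}. \end{cases} \tag{2}$$
SVM needs to introduce a kernel function. Because the kernel function based on RBF has strong stability in dealing with nonlinear problems, the RBF kernel function is adopted:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}}\right), \tag{3}$$
where $\sigma$ is the width coefficient.
The final model is as follows:
$$f(x) = \sum_{i=1}^{n}\big(\alpha_i - \alpha_i^{*}\big)K(x_i, x) + b, \tag{4}$$
where $\alpha_i$ and $\alpha_i^{*}$ are the Lagrange multipliers of the dual problem.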
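The building blocks of the SVM model can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB implementation: it assumes the usual ε-insensitive loss, the RBF kernel in the form exp(−‖x − z‖² / (2σ²)), and a kernel-expansion predictor. The function names and the argument `dual_coefs` (standing for the fitted coefficients of the support vectors) are hypothetical.

```python
import math

def eps_insensitive_loss(y_true, y_pred, eps=0.01):
    """Zero inside the eps-tube around the target, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

def rbf_kernel(x, z, sigma=1.0):
    """Assumed RBF form: exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svr_predict(x, support_vectors, dual_coefs, bias, sigma=1.0):
    """Kernel expansion f(x) = sum_i coef_i * K(x_i, x) + bias."""
    return sum(c * rbf_kernel(sv, x, sigma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias
```

Errors smaller than `eps` cost nothing, which is what keeps the number of support vectors, and hence the prediction cost, bounded.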
2.2. Parameter Optimization Based on DE
DE is a population-based adaptive global optimization algorithm with a simple structure and strong robustness. Two SVM parameters, the kernel width coefficient $\sigma$ and the penalty factor $C$, need to be optimized because they affect the forecasting accuracy of SVM. However, the sample size for training SVM is huge, and it takes a long time for DE to find the optimal parameters. Wang et al. [23] proposed an improved DE for this problem and introduced a dynamically adjusted scale factor in mutation, which can speed up the optimization process. Therefore, the improved DE is used to optimize the parameters $\sigma$ and $C$.
DE includes four steps: initialization, mutation, crossover, and selection, in which the scale factor of dynamic adjustment is introduced into the mutation operation. The formulas for this step arewhere , is the most suitable individual in the th generation; controls the mutation scale of the th iteration; and are the maximum and minimum ; and is the total number of iterations.
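The four DE steps can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of [23]: the scale factor is assumed to decrease linearly from its maximum to its minimum over the iterations, mutation is assumed to be built around the best individual, and the population size of 20 and the toy objective are hypothetical. In the framework, the objective would be the SVM forecasting error as a function of the two parameters being tuned.

```python
import random

def de_minimize(objective, bounds, pop_size=20, generations=50,
                f_max=0.8, f_min=0.2, cr=0.5, seed=0):
    """Differential evolution with a scale factor that shrinks each generation."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Initialization: uniform random individuals inside the search box.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fitness = [objective(ind) for ind in pop]
    for g in range(generations):
        # Assumed schedule: linear decrease from f_max to f_min.
        f_g = f_max - (f_max - f_min) * g / max(1, generations - 1)
        best = pop[min(range(pop_size), key=fitness.__getitem__)]
        for i in range(pop_size):
            r1, r2 = rng.sample([j for j in range(pop_size) if j != i], 2)
            # Mutation around the current best individual.
            mutant = [best[d] + f_g * (pop[r1][d] - pop[r2][d]) for d in range(dim)]
            # Binomial crossover; j_rand guarantees at least one mutant gene.
            j_rand = rng.randrange(dim)
            trial = [mutant[d] if (rng.random() < cr or d == j_rand) else pop[i][d]
                     for d in range(dim)]
            # Keep the trial vector inside the bounds.
            trial = [min(max(v, lo), hi) for v, (lo, hi) in zip(trial, bounds)]
            # Greedy selection: keep the trial only if it is no worse.
            f_trial = objective(trial)
            if f_trial <= fitness[i]:
                pop[i], fitness[i] = trial, f_trial
    k = min(range(pop_size), key=fitness.__getitem__)
    return pop[k], fitness[k]

def toy_objective(params):
    """Hypothetical stand-in for the validation error of an SVM."""
    sigma, c = params
    return (sigma - 1.0) ** 2 + (c - 2.0) ** 2

best, best_val = de_minimize(toy_objective, [(0.01, 10.0), (0.01, 10.0)])
```

A large scale factor early on favours exploration, while a small one later refines the search around the best individual, which is the motivation for adjusting it dynamically.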
3. Forecasting Model Based on Data Analysis
3.1. Framework of Forecasting
The time-cost and forecasting accuracy of DE-SVM will be affected when the irrelevant data in the samples and features of the electricity price are applied to the forecasting process. Therefore, this paper proposes feature classification and establishes three models of data processing to select and extract the valid samples and features, which ultimately reduces the time-cost and improves the accuracy. Figure 1 shows the framework of the price forecasting, including four models: sample selection based on GCA, feature selection based on GCA, feature extraction based on principal component analysis (PCA), and price forecasting based on DE-SVM. Each model plays an important role in the framework. First, sample selection based on GCA is applied to select the valid samples, and feature selection based on GCA is applied to select the important features of the samples. Then, feature extraction is used to extract features of the samples. Finally, DE-SVM is used to forecast the electricity price.

3.2. Sample Selection Based on GCA
3.2.1. GCA Method
GCA determines correlation by quantifying the “closeness” between two data sequences: the more similar the two sequences, the greater the correlation between them. GCA is fast, gives stable results, and performs well in removing irrelevant data, and it has been widely studied [24–26]. Therefore, this paper uses GCA to select important samples and their important features.
(1) The comparison sequence matrix D is defined as
$$D = \begin{bmatrix} x_1(1) & x_1(2) & \cdots & x_1(m) \\ x_2(1) & x_2(2) & \cdots & x_2(m) \\ \vdots & \vdots & \ddots & \vdots \\ x_n(1) & x_n(2) & \cdots & x_n(m) \end{bmatrix}, \tag{6}$$
where $x_i(k)$ is the $k$th feature in the $i$th sample.
(2) The reference sequence is defined as
$$x_0 = \big(x_0(1), x_0(2), \ldots, x_0(m)\big), \tag{7}$$
where $x_0$ is the target sequence.
(3) Then, the grey coefficient [27] is determined as
$$\xi_i(k) = \frac{\Delta_{\min} + \rho\,\Delta_{\max}}{\Delta_i(k) + \rho\,\Delta_{\max}}, \qquad \Delta_i(k) = \big|x_0'(k) - x_i'(k)\big|, \tag{8}$$
where $x_0'(k)$ and $x_i'(k)$ are the normalized components of the data sequence; $\Delta_{\min}$ and $\Delta_{\max}$, respectively, represent the minimum and maximum of the absolute value of the difference between the reference sequence and the comparison sequences; $\Delta_i(k)$ represents the absolute value of the difference between the reference sequence and the $i$th comparison sequence; and $\rho$ is the distinguishing factor, usually set to 0.5 [28].
(4) The final grey correlation grade is expressed as follows:
$$r_i = \frac{1}{m}\sum_{k=1}^{m}\xi_i(k). \tag{9}$$
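The GCA computation can be sketched as below. This is a minimal illustration that assumes the standard grey relational coefficient and grade, and inputs that are already normalized to a common scale. The function name, the toy sequences, and the threshold of 0.7 are hypothetical.

```python
def grey_relational_grades(reference, comparisons, rho=0.5):
    """Grey correlation grade of each comparison sequence against the reference."""
    # Absolute differences between the reference and every comparison sequence.
    deltas = [[abs(r - c) for r, c in zip(reference, seq)] for seq in comparisons]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0.0:
        # Every sequence coincides with the reference.
        return [1.0 for _ in comparisons]
    grades = []
    for row in deltas:
        # Grey coefficient at each position, then its mean as the grade.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / len(coeffs))
    return grades

reference = [0.2, 0.5, 0.9]
comparisons = [
    [0.2, 0.5, 0.9],   # identical to the reference
    [0.3, 0.4, 0.8],   # close to the reference
    [0.9, 0.1, 0.2],   # far from the reference
]
grades = grey_relational_grades(reference, comparisons)
# Keep the sequences whose grade exceeds a control threshold (0.7 is illustrative).
selected = [i for i, g in enumerate(grades) if g > 0.7]
```

Sequences closer to the reference receive grades nearer to 1, so thresholding the grade keeps only the most similar ones.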
3.2.2. Sample Input considering Periodicity
Each sample is composed of multiple features. Sample selection based on GCA should not use too many features. Otherwise, the samples in the final sample set will be extremely similar to one another, resulting in poor fault tolerance and ultimately affecting the forecasting accuracy. In this paper, a sample corresponds to a moment, and the electricity price changes periodically with time [29]. Therefore, a small number of features that change periodically with time (the periodic features) are selected here to form sample set 1 and serve as the input of the model of sample selection based on GCA.
3.2.3. Sample Selection considering Forecasting Period
There are many useless samples in the input samples, which can affect the forecasting accuracy and time-cost of SVM. GCA can quickly determine the importance of different sequences with a stable result, so GCA has good performance for removing useless samples. Therefore, GCA is applied to eliminate useless samples. An example of GCA sample selection is shown in Figure 2. GCA removes useless samples and finally obtains important samples.

Sample selection based on GCA, as described so far, applies to single-point forecasting, so forecasting the electricity price over a time period would require repeating it many times. For this reason, a model of electricity price forecasting for a given time period is proposed, and the following elements are introduced into GCA.
(1) The forecasting period s is introduced.
(2) The reference sequence of multiple targets is defined as
$$X_0 = \big\{x_0^{1}, x_0^{2}, \ldots, x_0^{s}\big\}, \tag{10}$$
where $x_0^{t}$ is the reference sequence for time $t$ in the forecasting period.
The process is shown in Figure 3. In sample set 1, the important samples at all times in the forecasting period are found and then superimposed and combined to form an important sample set for the forecasting period.
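The superposition of important samples over the forecasting period can be sketched as a set union. This is an illustrative reading of the process, assuming the standard grey correlation grade and already-normalized inputs. The helper names, the toy data, and the threshold of 0.8 are hypothetical.

```python
def grades(reference, samples, rho=0.5):
    """Grey correlation grade of every sample against one reference sequence."""
    deltas = [[abs(r - x) for r, x in zip(reference, s)] for s in samples]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0.0:
        return [1.0] * len(samples)
    return [sum((d_min + rho * d_max) / (d + rho * d_max) for d in row) / len(row)
            for row in deltas]

def select_samples_for_period(references, samples, threshold, rho=0.5):
    """Union of the important samples found for each time point of the period."""
    selected = set()
    for ref in references:  # one reference sequence per time point t = 1..s
        for idx, g in enumerate(grades(ref, samples, rho)):
            if g > threshold:
                selected.add(idx)
    return sorted(selected)

# Toy sample set with two periodic features per sample, and a period of s = 2.
samples = [[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 1.0], [0.5, 0.5]]
references = [[0.0, 0.0], [1.0, 1.0]]
important = select_samples_for_period(references, samples, threshold=0.8)
```

Here samples 0 and 1 are important for the first time point and samples 2 and 3 for the second, while sample 4 is close to neither reference and is dropped.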

3.3. Feature Selection Based on GCA
3.3.1. Sample Input considering Temporal Correlation
The number of features required for feature selection based on GCA is different from that for sample selection. The more the features are selected, the more comprehensive the factors affecting electricity price. If a single sample set composed of features is successively subjected to sample selection and feature extraction, then the sample set lacks pertinence. At the same time, as the model of the data processing increases, it becomes more likely that excessive data will be extracted, which ultimately affects the forecasting accuracy. Therefore, feature classification is proposed. This paper classifies the features and introduces them into two models, as shown in Figure 4. First, periodic features, comprising sample set 1, are applied to the sample selection. Next, temporal correlation features, combined with the important sample set to form sample set 2, are applied to the feature selection. The temporal correlation features, including historical electricity price and temperature of the previous day, are related to the electricity price in the adjacent period.

3.3.2. Feature Selection
There are many features of the electricity price, some of which are useless. The data obtained by sample selection are calculated by the GCA to determine the importance between the target price and each feature. Compared with RF, relief, and other methods of feature selection, the calculation speed of GCA is faster and the results are more stable. Therefore, GCA is applied to analyse the price features and remove the useless features. An example is shown in Figure 5. Finally, important features are retained and used as the input of feature extraction based on PCA.

3.4. Feature Extraction Based on PCA
Two GCAs can remove useless samples and features, but they cannot remove the redundant information between features, and this redundant information will lead to poor forecasting accuracy. PCA can quickly calculate the results and effectively remove the redundancy. Therefore, PCA is applied to reduce the dimensions of the original features into several comprehensive indexes that contain most of the information.
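The $i$th principal component contribution [30] can be expressed as follows:
$$\eta_i = \frac{\lambda_i}{\sum_{j=1}^{p}\lambda_j}, \tag{11}$$
where $\lambda_i$ represents the $i$th eigenvalue and $p$ is the number of original features.
The cumulative contribution is as follows:
$$\eta_{\Sigma}(i) = \sum_{j=1}^{i}\eta_j = \frac{\sum_{j=1}^{i}\lambda_j}{\sum_{j=1}^{p}\lambda_j}. \tag{12}$$
The principal component can be expressed as
$$z_i = a_i^{T}x, \tag{13}$$
where $a_i$ is the unit eigenvector associated with $\lambda_i$ and $x$ is the vector of standardized features.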
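Choosing how many principal components to keep follows directly from the contribution and cumulative contribution. The sketch below assumes that the eigenvalues are already available and that components are kept until a target cumulative contribution is reached. The function names and the eigenvalues are hypothetical.

```python
def contributions(eigenvalues):
    """Share of the total variance carried by each eigenvalue."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

def n_components_for(eigenvalues, target=0.95):
    """Smallest number of leading components whose cumulative share reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for i, share in enumerate(contributions(ordered), start=1):
        cumulative += share
        if cumulative >= target:
            return i
    return len(ordered)

# Illustrative eigenvalues: the shares are 0.60, 0.20, 0.17, 0.02, 0.01.
eigenvalues = [6.0, 2.0, 1.7, 0.2, 0.1]
k = n_components_for(eigenvalues, target=0.95)
```

With these values, three of the five components are enough to pass the 95% target.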
3.5. Process of Electricity Price Forecasting
The electricity price forecasting method proposed in this paper classifies features, selects the important samples and features using GCA, extracts the features using PCA, and finally forecasts the price using DE-SVM. The specific steps are as follows:
(1) Set , , and .
(2) Select periodic features to form sample set 1 similar to equation (6), and the sample sequence . The reference sequence of multiple targets is defined by equation (10).
(3) Calculate equation (9) using to obtain . Then, important samples at time t are obtained, which are larger than the control threshold of the first GCA. Set .
(4) If , return to Step (3); otherwise, go to Step (5). Finally, the important sample set of the forecast period s is obtained.
(5) Considering the temporal correlation features, the important sample set of the forecast period s is expanded to sample set 2 similar to equation (6).
(6) Calculate equation (9) using sample set 2. Then, important features are obtained, which are larger than the control threshold of the second GCA.
(7) Calculate equation (12) to obtain . Set .
(8) If , calculate the first i principal components using equation (13) and return to Step (7); otherwise, go to Step (9).
(9) Solve equation (4) to obtain f(x) and optimize the parameters and using DE [24].
4. Results
4.1. Implementation Platform and Data Selection
This paper uses MATLAB R2016a simulation. During the simulation, MATLAB is running on a platform with an Intel Core i5, 4 GB RAM, and 500 GB hard disk. Location marginal price (LMP) and its features in the New England electricity market from 2017 to 2018 are used to verify the proposed model. Periodic features are shown in Table 1, and temporal correlation features are shown in Table 2.
4.2. Model Evaluation Index
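To verify the effectiveness of the method in this paper, mean absolute percent error (MAPE) and root-mean-square error (RMSE) are selected as the evaluation indexes of the forecast model:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%, \tag{14}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^{2}}, \tag{15}$$
where $y_i$ is the real value, $\hat{y}_i$ is the predicted value, and $n$ is the number of forecast points.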
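The two evaluation indexes can be computed as in the sketch below, which assumes the standard definitions of MAPE (in percent) and RMSE. The prices in the example are invented for illustration and are not taken from the case study.

```python
import math

def mape(actual, predicted):
    """Mean absolute percent error, in percent."""
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-mean-square error, in the units of the price."""
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))

actual = [40.0, 50.0, 80.0]
predicted = [44.0, 45.0, 80.0]
example_mape = mape(actual, predicted)   # mean of 10%, 10%, and 0%
example_rmse = rmse(actual, predicted)   # sqrt((16 + 25 + 0) / 3)
```

MAPE is dimensionless and undefined when a real price is zero, whereas RMSE keeps the unit of the price.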
4.3. Parameter Settings
The parameters in the framework are set as follows: (i) sample selection based on GCA: the distinguishing factors , the control threshold ; (ii) feature selection based on GCA: the distinguishing factors , the control threshold ; (iii) DE-SVM: population size , the maximum number of iterations is 50, the scaling factor range is [0.2, 0.8], the crossover probability is 0.5, the insensitivity coefficient of SVM is set to 0.01, and the change range of and is set to [0.01, 10].
4.4. Result Analysis
4.4.1. Analysis of Complexity
Both the data processing and the electricity price forecasting have a time-cost. The complexity of each step is analysed as follows.
Regarding the data processing, the complexity of sample selection is , the complexity of feature selection is , and the complexity of feature extraction is , where . When , , and change, the complexity is shown in Figure 6. As can be seen from Figure 6, the time-cost of data processing is very small.

As for the electricity price forecasting based on DE-SVM, the complexity of SVM is and DE-SVM is , where is the number of support vectors. However, after data processing, the number of training samples and features decreases greatly, thereby reducing and time-cost.
4.4.2. Daily Forecasting Results
(1)Sample Selection Based on GCA. Taking November 29, 2018, as an example, the important samples in sample set 1 are selected, and the grey correlation grades of 24 : 00 are shown in Figure 7.

In the forecasting period, each sample at each time-point can obtain a grey correlation grade, similar to the results presented in Figure 7. The samples with a correlation degree greater than 0.983 are selected, and finally 3,466 important samples are obtained, which avoids the problems of forecasting accuracy and time-cost caused by the samples with large differences.
(2)Feature Selection Based on GCA. This paper proposes feature classification according to periodicity and temporal correlation and inputs the framework twice. The comparison results are shown in Table 3.
In Table 3, we find that the forecasting results of this paper are better. When 3,466 samples are selected, the MAPE is the smallest at 10.05%.
The importance of each feature is shown in Figure 8, and the forecasting results of DE-SVM under different thresholds are shown in Table 4.

In Table 4, when the threshold is 0.64, the forecasting results are the best, with MAPE = 8.83% and RMSE = 5.4862. As Figure 8 shows, some features (1, 2, 3, 5, 6, 20, 21, 22, 23, 24, 25, 26, 34, and 35) are finally removed. Among these are historical-price features with a long time distance; the larger the time distance, the weaker the correlation, so it is reasonable to remove them. The removed features also include hour, week, temperature, and season, which are not directly related to the electricity price; their correlation is small, so they can be removed as well.
(3)Feature Extraction Based on PCA. PCA transforms the 21 original features into 21 principal components. The corresponding contribution and cumulative contribution are shown in Figure 9.

The first nine principal components, which contain 95% of the information in the original features, are shown in Figure 9; they can replace the original features as the input of the DE-SVM forecast model.
(4)Analysis of Daily Forecasting Results. Figure 10 shows the final forecasting results and the absolute value of the relative error.

In Figure 10, 62.5% of the forecast points have a relative error within 10%, and 92% have a relative error within 20%. Thus, for most of the data, the forecasting results are within the acceptable range.
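Shares of this kind can be computed from the hourly relative errors as sketched below. The error values are invented for illustration and are not the paper's data; they are chosen only so that 15 of 24 points fall within the 10% band.

```python
def share_within(rel_errors, limit):
    """Percentage of points whose absolute relative error is at most `limit`."""
    return 100.0 * sum(1 for e in rel_errors if abs(e) <= limit) / len(rel_errors)

# Hypothetical relative errors for 24 hourly points.
rel_errors = [0.03] * 15 + [0.15] * 7 + [0.30] * 2
within_10 = share_within(rel_errors, 0.10)   # 15 of 24 points
within_20 = share_within(rel_errors, 0.20)   # 22 of 24 points
```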
In Table 5, four models are applied to forecast the electricity price, and the results are shown in Figure 11.

From Figure 11, the forecasting results of the proposed GGDS and GGPDS are the best, and the forecasting results of DS without data processing are the worst. Compared with DS without data processing, the forecasting accuracy is improved from 81.68% to 91.44%. More detailed results can be seen in Table 6.
In Table 6, the accuracy of DE-SVM improves with each added data processing step: sample selection by GCA, feature selection by GCA, and feature extraction by PCA. This is because each step further reduces the useless information and lessens the interference, thereby improving the forecasting accuracy.
Regarding time, the time-cost of only using DS is 35,074 s, but after the proposed data processing, the time is greatly reduced to 1,809 s.
Finally, regression algorithm (REG), RF, BP neural network (BP), DE-SVM, and the proposed model are each used for forecasting, and the results are shown in Table 7.
Compared with the four benchmark models in Table 7, the results of the proposed method are better, which further reflects the advantages of the method.
4.4.3. Seasonal Forecasting Results
This paper takes one week in each season and forecasts the electricity price of that week. The results are shown in Figures 12–15.




From Figures 12–15, we find that the electricity price in the representative week of each season fluctuates greatly and follows no clear pattern. Moreover, the electricity price in Figure 12 is several times greater than that in Figures 13–15.
To better reflect the advantages of this model, we compare it with other models, and the results are shown in Table 8.
From Table 8, it can be seen that the MAPE of the proposed method is smaller than that of the other methods. MAPE, a standard statistical index of forecasting accuracy, considers not only the error between the predicted value and the real value but also the ratio of that error to the real value. By this index the proposed method is the most accurate, but in winter and spring the RMSE of RF and REG is better. This is because RMSE averages squared errors and is therefore sensitive to outliers, and the electricity price in spring and winter is unstable, with a large number of peak-load prices that have a great impact on RMSE.
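A small numerical example shows how the two indexes can rank models differently when a price peak is present. The prices and the two forecasts are invented for illustration: hypothetical model A is exact off-peak but misses the peak badly, while hypothetical model B is slightly off everywhere.

```python
import math

def mape(actual, predicted):
    """Mean absolute percent error, in percent."""
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-mean-square error."""
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))

actual = [20.0, 20.0, 20.0, 200.0]    # the last point is a price peak
model_a = [20.0, 20.0, 20.0, 140.0]   # exact off-peak, large miss at the peak
model_b = [24.0, 24.0, 24.0, 190.0]   # small errors at every point

mape_a, rmse_a = mape(actual, model_a), rmse(actual, model_a)
mape_b, rmse_b = mape(actual, model_b), rmse(actual, model_b)
```

Model A has the lower MAPE (7.5% against 16.25%), yet its RMSE is 30 against about 6.1 for model B, because the single squared error at the peak dominates the average.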
5. Discussion
In this paper, a period of the electricity price can be forecasted, rather than a single-point price. This provides a valuable tool for period forecasting.
Considering the periodicity and temporal correlation of electricity price, feature classification is proposed, and two kinds of features are input into the framework respectively. Compared with the features as the input directly, this method has higher forecasting accuracy. A key observation point of this paper is sample selection. It can be found that the number of samples has an impact on the forecasting results. Therefore, how to automatically select the optimal sample set will be worthy of attention.
The sample selection and feature processing proposed in this paper are combined to reduce the dimension of the data, and DE-SVM is applied to forecast the electricity price of one day (24 time points), namely, the GGPDS model. It was compared with other models (DS, GDS, and GGDS). The time-cost of the DS model is 35,074 s, and the time-cost of the GGDS model is 1,803 s. The time-cost of the GGPDS model, 1,809 s, is slightly higher than that of the GGDS model. However, the total time is greatly reduced relative to DS, which reflects an advantage of this framework. Overall, this method has high forecasting accuracy and low time-cost, which provides a new idea for data processing and can be applied to other fields. Moreover, this method will be significant for dealing with the “big data” problem.
To further verify the feasibility and correctness, the results of the proposed models are compared with those of other benchmark methods (REG, RF, BP, and DS). The comparison indicates that the proposed method is superior. Finally, the model is applied to forecast the electricity price of different seasons to verify its robustness. It is found that the results of May 27–June 3 are poor because the price fluctuation during this part of the year is large and unstable, which makes it difficult to find the underlying pattern. In contrast, the prices of July 9–July 15 are relatively stable, and their results are good. Therefore, the proposed method may be more suitable for forecasting electricity prices during periods of the year that have historically shown stable electricity prices.
6. Conclusion
To address the high time-cost and low accuracy of electricity price forecasting caused by useless samples and features in big data, a forecasting framework based on big data analysis is proposed. The framework includes feature classification considering the periodicity and the temporal correlation, sample selection based on GCA, feature selection based on GCA, feature extraction based on PCA, and electricity price forecasting based on DE-SVM.
We apply the framework to forecast the price in the New England electricity market. The results show that the probability of over-extracting data can be reduced. In addition, compared with DS without data processing, the forecasting accuracy improves from 81.68% to 91.44% and the time-cost decreases from 35,074 s to 1,809 s. The framework provides a method of electricity price forecasting with strong applicability, and it can also be applied to other fields of forecasting.
Data Availability
The data used in this paper are all from ISO New England: http://www.iso-ne.com.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Shuang Wu and Li He contributed to the conception and method. Shuang Wu and Zhaolong Zhang contributed to the computation. All of the authors contributed to the validation.
Acknowledgments
This research was funded by the National Key Research and Development Program of China (No. 2019YFB2102703-004), the National Natural Science Foundation of China (No. 51309094), and the Science, Technology and Innovation Commission of Shenzhen Municipality (No. 20200125).