Abstract
The main control factors of heavy oil production are difficult to determine because of the complex geological conditions of heavy oil reservoirs, including high viscosity, a wide range of variation in the crude oil, and large differences in production between recovery methods. In this context, the main control factors of heavy oil production under different recovery methods are identified using the Apriori algorithm. Heavy oil production prediction also suffers from low prediction precision and insufficient use of data. Therefore, a novel data-driven intelligent simulation and prediction model of heavy oil production with time-varying characteristics is established based on differential simulation, machine learning, and intelligent optimization theory; it addresses the nonlinearity, multifactor dependence, and low fitting precision encountered with the dynamic data of heavy oil development. The parameters of the time-varying simulation model are identified by the least squares support vector machine (LSSVM) to realize intelligent prediction of production. Numerical experiments show that the novel model predicts better than the BP neural network model and the GM (1, N) model. This study provides a feasible new method for data-driven heavy oil production prediction and can support further study in this area.
1. Introduction
Production forecasting is of great significance in heavy oil development planning. Because of the complex geological conditions of heavy oil, many oil recovery methods have been attempted, such as cyclic steam stimulation (CSS), steam flooding, steam-assisted gravity drainage (SAGD), in situ combustion, and toe-to-heel air injection (THAI). Many development and exploitation factors influence the prediction of heavy oil production; therefore, it is difficult to establish a heavy oil production prediction model with high accuracy and applicability.
The methods of oilfield development production prediction mainly include the decline curve method and mechanism model prediction method. The Arps declining method is the earliest research on the analysis of production decline curves, and research on the decline curve is widely used in reservoir gas development dynamic prediction, which is the basis of reservoir dynamic prediction [1, 2]. Agarwal et al. [3] presented novel production decline curves for analyzing well production data from radial and vertically fractured oil and gas wells. Li et al. [4] presented a decline analysis model derived based on fluid flow mechanisms, which was proposed and used to analyze oil production data from naturally fractured reservoirs developed by water flooding. Ling and He [5] derived the governing equations of production decline for different reservoirs by combining static geological and reservoir data with dynamic production data. Jongkittinarukorn et al. [6] presented a new method to improve production forecasts and reserve estimation for a multilayer well in the early stages of production, using the Arps hyperbolic decline method to model the decline rate of each layer. There are various recovery methods for heavy oil reservoirs, and many factors affect production. The traditional production decline method is mostly an empirical equation, which only considers the relationship between production and time, and has some limitations due to its harsh application conditions and small application scope. The model parameters of the mechanism model prediction method have clear physical significance and are an important means of studying the reservoir fluid flow mechanism. A complex numerical simulation model and historical fitting are required for the production prediction; subsequently, the new parameters are substituted into the modified model to predict the production [7]. However, the solution of the model takes a long time.
Recently, artificial intelligence (AI) technology has been gradually applied in petroleum exploration and development. Research has shown that the data-driven model performs better than the experience-based prediction model [8]. At present, the application of AI technology in the field of oil and gas field exploration and development includes the prediction of reservoir physical properties, oil and gas properties, recoverable reserves, optimization of well layout plans, optimization of hydraulic fracturing design, and production prediction. For example, Ahmadi et al. [9] compared the effectiveness of conventional models with the developed GA-LSSVM and GA-FL models. Al-Marhoun et al. [10] used AI technology to predict the crude oil viscosity curve in Canadian oil fields. Bhattacharya et al. [11] used a random forest and an artificial neural network to establish a supervised data-driven machine learning model and used the support vector machine (SVM) algorithm to understand oil well dynamics and predict daily gas production. Davtyan et al. [12] constructed an expected dynamic regression model based on the machine learning method of sliding window regression and obtained a stable crude oil production prediction model with long-term prediction ability. Most engineers do not have the time to analyze the geological conditions, production system, and production decline of most wells in detail. However, the application of AI plans can quickly and efficiently evaluate oil well performance and predict production [13] and can enable fast and accurate analysis of thousands of wells or even tens of thousands of wells. Compared with conventional reservoir engineering and numerical simulation methods, the data-driven AI model is simple, has strong generalization ability, and can accurately reflect the nonlinear response of the relationship between the production data [14]. 
Because AI models can learn from data, the resulting production models are adaptive and can meet the needs of intelligent oil field production and development.
Machine learning is an important part of AI prediction methods. Machine learning algorithms include neural networks, decision-making tree, k-means, SVM, Apriori algorithm, expectation maximization algorithm, k-nearest neighbor algorithm, and naive Bayes algorithm. The Apriori algorithm [15], as a classical method of association rule mining, is often used in factor analysis. This algorithm can find the correlation between data items in a large relational dataset and has been applied in many fields. The SVM has high fitting precision and a close relationship between the prediction index and the influencing factor; therefore, it is often used in the dynamic analysis of oil and gas reservoir development and production prediction. For example, Zhong et al. [16] used an SVM to predict the development index of ultrahigh water cut oil fields in eastern China. Elhaj et al. [17] combined a neural network with an SVM to predict the single well flow of a gas reservoir. Peng et al. [18] combined SVM with LSSVM and particle swarm optimization algorithm to predict oil and gas field production. According to the historical data of oil well production, Machado de Almeida Duque et al. [19] analyzed the prediction effect of six machine learning algorithms on crude oil production, and the verification set showed that the SVM and logistic regression models had the best prediction effect. However, the SVM is often used as a nonlinear fitting tool in crude oil production prediction to realize a univariate time sequence prediction or multiple regression prediction of production, without fully considering the variation trend of the oil field development index itself.
Grey theory was proposed by Deng [20] and has been widely applied in recent years [21–23]. Kumar and Jain [21] used the Grey–Markov and GM (1, 1) models with a rolling mechanism to predict energy consumption in India. In [22], a grey model was used to predict cumulative plastic deformation under cyclic loads. Pao and Tsai [23] used the GM (1, 1) model to predict energy consumption in Brazil and compared it with the ARIMA model. The mainstream grey prediction models include univariate and multivariate grey prediction models, such as the GM (1, 1) model [20], discrete grey model DGM (1, 1) [24], fractional cumulative grey model FAGM (1, 1) [25], GM (1, N) model [26], and DGM (1, N) model [27]. Grey prediction pays more attention to the variation trend of the prediction index itself, which compensates for the SVM's neglect of that trend. The parameters of a grey prediction model strongly affect its performance. Generally, the least squares method is used to estimate the parameters [28]. Because it minimizes the residual sum of squares, the least squares method easily falls into a local minimum. When it is applied to the prediction of heavy oil production with strong nonlinearity, the results deviate significantly. In addition, the least squares method has poor stability and cannot fit well in medium- and long-term production prediction, which lowers the precision of the prediction model [29].
In this study, the Apriori algorithm is used to determine the main control factors of different heavy oil recovery methods, a time-varying multivariable grey prediction model is established, and the LSSVM is used to identify the model parameters for production prediction. Applying the LSSVM to the parameter identification of the grey model ensures high-precision fitting of the historical data while fully considering the variation trend of the state variable itself. This resolves the large deviation of the grey model in production prediction and realizes intelligent prediction of heavy oil production.
The remainder of this paper is organized as follows. In Section 2, the main control factors for heavy oil production are identified. In Section 3, an intelligent simulation and prediction model for data-driven heavy oil production is established. In Section 4, we conduct a numerical experiment and an analysis of the model application. Finally, the conclusions are drawn in Section 5.
2. Determination of Main Control Factors of Heavy Oil Production
Many factors affect the production of heavy oil. Under different recovery methods, these factors may play different roles, which makes it difficult to determine the main control factors of production. In this study, the Apriori algorithm is introduced to conduct association analysis, and strong association rules related to crude oil production are mined. Then, the degree of correlation between each factor and heavy oil production is ranked using the Pearson correlation coefficient, and the main control factors of heavy oil production are finally determined.
2.1. K-Means-Based Data Discretization
The numerical values in each column of the original dynamic dataset of heavy oil development are similar, strongly continuous, and poorly differentiated, which is not conducive to analysis with an association rule algorithm. Therefore, before the Apriori algorithm is used to analyze the heavy oil production data, the data are discretized and converted into the corresponding logical characteristic set data. In this study, the k-means clustering algorithm is used to discretize the original oil field production dataset.
The specific steps are as follows:
(1) Select k samples C1, C2, …, Ck from the dataset as the initial cluster centroids.
(2) Calculate the Euclidean distance from each sample to each cluster centroid, and assign each sample to the most similar set according to the nearest distance principle. That is, for each point Vi, find the cluster centroid Cj for which the distance d(Vi, Cj) is minimum, and assign Vi to the jth set. The average value of all objects in a cluster represents its centroid.
(3) Once all data samples have been allocated to their corresponding sets, recalculate the centroid Cj of each cluster as the average of its members.
(4) Repeat Steps (2) and (3) until the division of the data no longer changes.
(5) Obtain the minimum value of the clustering criterion (the sum of squared distances from the samples to their cluster centroids).
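The steps above can be sketched for a single data column as follows. This is a minimal illustration rather than the implementation used in the study: the evenly spread initial centroids, the function names `kmeans_1d` and `discretize`, and the sample values are all assumptions made here.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster 1-D values into k groups; return (labels, centroids)."""
    ordered = sorted(values)
    # Assumption: initial centroids are spread evenly over the sorted data
    # (the paper does not state how the initial centroids are chosen).
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
                      for v in values]
        if new_labels == labels:   # Step 4: stop when the partition is stable
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:            # keep the old centroid if a cluster empties
                centroids[j] = sum(members) / len(members)
    return labels, centroids


def discretize(values, k):
    """Map each value to a level 1..k, with levels ordered by centroid value."""
    labels, centroids = kmeans_1d(values, k)
    order = sorted(range(k), key=lambda j: centroids[j])
    rank = {j: r + 1 for r, j in enumerate(order)}
    return [rank[lab] for lab in labels]


# Hypothetical steam-injection volumes (illustrative numbers only).
steam = [10.0, 11.0, 12.0, 50.0, 52.0, 49.0, 100.0, 98.0, 200.0, 205.0]
levels = discretize(steam, k=4)
```

Ordering the levels by centroid value keeps the labels interpretable (level 1 is the lowest range, level k the highest), which is convenient when the levels are later used as items in association rules.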
Figure 1 shows the data discretization flowchart of the k-means-based clustering algorithm.

2.2. Apriori Algorithm-Based Analysis of Factors Influencing Heavy Oil Production
The Apriori algorithm uses an iterative, layer-by-layer search to find the frequent item sets related to crude oil production that satisfy a given minimum support degree, and then derives strong association rules related to crude oil production from a given minimum confidence degree. Apriori association rule analysis has four basic definitions: frequent item set, association rule, support degree, and confidence. A frequent item set is a set of items that appears often in the data, where the frequency is judged by the support degree, that is, the probability that the set appears in the data. With X and Y representing the two individuals to be analyzed, the support degree is defined as follows:
$$\mathrm{Support}(X \rightarrow Y) = P(X \cup Y) \tag{1}$$
Here, $P(X \cup Y)$ denotes the probability that a transaction contains all items of both X and Y.
The association rule refers to the association between two individuals, which is measured by confidence, as shown in the following equation:
$$\mathrm{Confidence}(X \rightarrow Y) = P(Y \mid X) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)} \tag{2}$$
For a given rule X ⟶ Y, the higher the confidence value, the more likely Y is to appear in a transaction involving X. Confidence is thus an estimate of the conditional probability of Y given X. An association rule that satisfies both the minimum support threshold (min_sup) and the minimum confidence threshold (min_conf) is called a strong rule. Both thresholds lie between 0% and 100%. The task of association rule mining here is to determine the strong association rules related to crude oil production data.
The specific steps of the Apriori algorithm are as follows:
(1) Find all 1-item sets by scanning the data, evaluate each against the minimum support degree, and eliminate those below it; the rest are the frequent 1-item sets.
(2) Combine the frequent 1-item sets to form candidate 2-item sets, screen them against the minimum support degree, and eliminate those with low support; the rest are the frequent 2-item sets. Continuing the iteration, candidate (l + 1)-item sets are generated from the frequent l-item sets and screened against the minimum support degree, until no further frequent item sets are found.
Figure 2 shows the flowchart of the Apriori algorithm.
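The level-wise search and the support and confidence measures can be illustrated with the short sketch below. It is not the program used in the study: the function names, the "factor=level" item labels, and the four sample records are hypothetical, and the candidate pruning shown is the standard downward-closure check.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori(transactions, min_sup):
    """Return {frozenset: support} for all frequent item sets (level-wise)."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    size = 1
    while level:
        # Keep only the candidates that reach the minimum support degree.
        kept = {c: support(c, transactions) for c in level}
        kept = {c: s for c, s in kept.items() if s >= min_sup}
        frequent.update(kept)
        # Join frequent item sets into candidates one item larger, and prune
        # any candidate that has an infrequent subset.
        size += 1
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        level = [c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, size - 1))]
    return frequent


def strong_rules(frequent, min_conf):
    """Yield (antecedent, consequent, support, confidence) for strong rules."""
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[ante]
                if conf >= min_conf:
                    yield ante, itemset - ante, sup, conf


# Hypothetical discretized records; each item is "<factor>=<level>".
records = [
    frozenset({"steam=4", "dryness=3", "Q=4"}),
    frozenset({"steam=4", "dryness=3", "Q=4"}),
    frozenset({"steam=4", "dryness=2", "Q=4"}),
    frozenset({"steam=1", "dryness=2", "Q=1"}),
]
freq = apriori(records, min_sup=0.5)
rules = list(strong_rules(freq, min_conf=0.9))
```

In this toy dataset the rule "steam=4" ⟶ "Q=4" has support 0.75 and confidence 1.0, so it qualifies as a strong rule under the thresholds chosen here.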

2.3. Sequence of Influence Factors of Heavy Oil Production
Through the analysis of the factors influencing heavy oil production using the Apriori algorithm, the association rules of factors that may influence heavy oil production were obtained. These factors were then ranked using the Pearson correlation coefficient to determine the main control factors of heavy oil production. With each potential influencing factor taken as an independent variable and heavy oil production as the dependent variable y, the correlation coefficient is calculated as follows:
$$r_{ij} = \frac{\sum_{k=1}^{n}\left(x_{ki}-\bar{x}_i\right)\left(x_{kj}-\bar{x}_j\right)}{\sqrt{\sum_{k=1}^{n}\left(x_{ki}-\bar{x}_i\right)^2\,\sum_{k=1}^{n}\left(x_{kj}-\bar{x}_j\right)^2}} \tag{3}$$
where i = 1, 2, …, p, j = 1, 2, …, p, and $\bar{x}_i = \frac{1}{n}\sum_{k=1}^{n} x_{ki}$. Here, $x_{ki}$ is the kth observation of the ith variable and n is the number of observations; the correlation with production is obtained by taking y as one of the two variables.
The Pearson correlation coefficient is the ratio of the covariance of two variables to the product of their standard deviations; it is a dimensionless, standardized covariance. A linear change of scale or origin does not affect the result, so changing the unit of either variable does not change the value of r, that is, r values computed from data in different units are comparable. According to equation (3), the correlation coefficient can be calculated to judge the influence of the independent variables on the dependent variable, to sort the influencing factors, and to determine the main control factors of heavy oil production.
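A minimal sketch of this ranking step is given below. The helper names `pearson` and `rank_factors` and the sample series are hypothetical and serve only to show how candidate factors can be ordered by the absolute value of r with respect to production.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_factors(factors, production):
    """Sort (name, r) pairs by |r| with production, strongest first."""
    scores = {name: pearson(series, production)
              for name, series in factors.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical series (illustrative numbers, not taken from the paper's tables).
production = [10.0, 12.0, 15.0, 19.0, 24.0]
factors = {
    "steam_injection_volume": [1.0, 2.0, 3.0, 4.0, 5.0],
    "steam_injection_pressure": [5.0, 3.0, 6.0, 2.0, 4.0],
}
ranking = rank_factors(factors, production)
```

Ranking by the absolute value treats strong negative correlations as equally informative as strong positive ones; whether that is appropriate depends on how the main control factors are to be interpreted.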
3. Intelligent Simulation and Prediction Model of Data-Driven Heavy Oil Production
Many factors affect heavy oil production, and their dynamic variations differ; therefore, it is difficult to predict heavy oil production precisely. In the previous section, the main control factors of heavy oil production were determined using the Apriori algorithm. On this basis, a heavy oil production simulation and prediction model is established in this section from the perspective of a multifactor time-varying system, considering the influence of the main control factors on heavy oil production. In addition, the parameter identification of the model is improved, and a new intelligent simulation and prediction model for heavy oil production based on a time-varying system is obtained.
3.1. Establishment of Time-Varying Intelligent Simulation Model
3.1.1. Establishment of the Main Control Factor Dataset for Heavy Oil Production
The accurate selection of data that carry valid information and the elimination of interference data are essential to the predictive ability of a data-driven model. Abnormal data can generally be eliminated. For a dynamic oil field development system, however, treating the development index as a time sequence better reflects the dynamic law of the whole development process, and deleting points would complicate the subsequent modeling. Therefore, for a missing or abnormal data point x(i) in this study, the geometric mean of the previous and subsequent adjacent data, x(a) and x(b), is taken as the estimated value of x(i) to repair the data:
$$x(i) = \sqrt{x(a)\,x(b)} \tag{4}$$
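The repair rule can be sketched as follows. The function name `repair_series`, the `is_bad` predicate, and the handling of points at the ends of the series are assumptions made for illustration; the geometric mean also presumes positive neighbouring values.

```python
from math import sqrt


def repair_series(series, is_bad):
    """Replace flagged points by the geometric mean of their valid neighbours."""
    repaired = list(series)
    for i, value in enumerate(series):
        if not is_bad(value):
            continue
        # Nearest valid neighbour before and after position i.
        prev = next((series[a] for a in range(i - 1, -1, -1)
                     if not is_bad(series[a])), None)
        nxt = next((series[b] for b in range(i + 1, len(series))
                    if not is_bad(series[b])), None)
        if prev is not None and nxt is not None:
            repaired[i] = sqrt(prev * nxt)
        elif prev is not None or nxt is not None:
            # Assumption: a point at the edge copies its one valid neighbour.
            repaired[i] = prev if prev is not None else nxt
    return repaired


# Hypothetical monthly production with one missing and one non-physical value.
monthly = [400.0, None, 900.0, -1.0, 1600.0]
fixed = repair_series(monthly, is_bad=lambda v: v is None or v <= 0)
```

Repairing in place keeps the series length unchanged, which preserves the time ordering needed for the accumulation steps that follow.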
A historical data table of production and its main control factors is then established. According to the main control factors of production corresponding to each heavy oil recovery method, heavy oil production is taken as the output of the system and the main control factors of production as the input of the system; the resulting data table (Table 1) is arranged in chronological order.
The data on the factors influencing heavy oil production differ in dimension and in order of magnitude. Without dimensionless processing, large values tend to swamp small ones, resulting in imprecise processing results. In this study, mean-value conversion was adopted for dimensionless processing:
$$x'(k) = \frac{x(k)}{\bar{x}}, \qquad \bar{x} = \frac{1}{n}\sum_{k=1}^{n} x(k) \tag{5}$$
where x(k) is the original value of an index at the kth time point and n is the number of time points.
To weaken the randomness of historical data, the dimensionless data should be accumulated and an association model is established.
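Dimensionless processing is performed on the original data of the heavy oil production Q and the main control factor indices $x_j$ in the historical data table according to equation (5) to construct the dimensionless time sequences (superscript 0):
$$Q^{(0)} = \left(Q^{(0)}(1), Q^{(0)}(2), \ldots, Q^{(0)}(n)\right), \qquad x_j^{(0)} = \left(x_j^{(0)}(1), x_j^{(0)}(2), \ldots, x_j^{(0)}(n)\right) \tag{6}$$
The corresponding first-order accumulation sequences of the dimensionless data in equation (6) are constructed and denoted as follows:
$$Q^{(1)} = \left(Q^{(1)}(1), Q^{(1)}(2), \ldots, Q^{(1)}(n)\right), \qquad x_j^{(1)} = \left(x_j^{(1)}(1), x_j^{(1)}(2), \ldots, x_j^{(1)}(n)\right) \tag{7}$$
where $Q^{(1)}(k) = \sum_{i=1}^{k} Q^{(0)}(i)$ and $x_j^{(1)}(k) = \sum_{i=1}^{k} x_j^{(0)}(i)$.
Superscript 1 represents the first accumulation of dimensionless data.
Furthermore, the corresponding quadratic accumulation sequences are constructed from equation (7) and denoted as follows:
$$Q^{(2)} = \left(Q^{(2)}(1), Q^{(2)}(2), \ldots, Q^{(2)}(n)\right), \qquad x_j^{(2)} = \left(x_j^{(2)}(1), x_j^{(2)}(2), \ldots, x_j^{(2)}(n)\right) \tag{8}$$
where $Q^{(2)}(k) = \sum_{i=1}^{k} Q^{(1)}(i)$ and $x_j^{(2)}(k) = \sum_{i=1}^{k} x_j^{(1)}(i)$.
Superscript 2 represents the second accumulation of the original dimensionless data.
After dimensionless processing and secondary accumulation, the heavy oil production data table (Table 1) yields the processed series of heavy oil production and its main control factors, which reduces the randomness of the historical data.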
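The preprocessing chain described above (mean-value conversion followed by two accumulations) can be sketched as below. The function names `mean_normalize` and `ago` and the sample series are hypothetical; the series mean is returned because it is needed later to restore the physical units.

```python
from itertools import accumulate


def mean_normalize(series):
    """Divide each value by the series mean (mean-value conversion)."""
    mean = sum(series) / len(series)
    return [v / mean for v in series], mean


def ago(series, order=1):
    """Accumulated generating operation: running sums applied `order` times."""
    out = list(series)
    for _ in range(order):
        out = list(accumulate(out))
    return out


# Hypothetical production series (illustrative numbers only).
q_raw = [2.0, 4.0, 6.0, 8.0]
q0, q_mean = mean_normalize(q_raw)   # dimensionless series
q1 = ago(q0, order=1)                # first-order accumulation
q2 = ago(q0, order=2)                # second-order accumulation
```

Each accumulation smooths the series and makes it monotonically increasing for positive data, which is the property grey models rely on when fitting an exponential-like trend.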
3.1.2. Establishment of Model
According to grey theory, the quadratic accumulation time sequence changes approximately exponentially; therefore, a differential simulation association model of heavy oil production can be established:
In equation (9), the state variable is the function describing the change over time of heavy oil production after dimensionless processing and secondary accumulation, and the input variables are the functions describing the change over time of the jth influence factor of heavy oil production after dimensionless processing and secondary accumulation.
Based on the historical data, the parameters A and B1, B2, …, Bj are identified using the least squares method for equation (9) and discretized to obtain the following:
If equation (10) is used for extrapolating one more step, multistep prediction can be performed:
When equation (11) is used for extrapolation and multistep prediction, the parameters are not updated with the latest information, resulting in inaccurate predictions. Therefore, for multistep prediction, the system described by equation (9) should be treated as time-varying, which gives the association model of heavy oil production based on a time-varying system in equation (12):
In equation (12), the identification of parameters A(t) and B1(t), B2(t), …, Bj(t) is obtained based on repeated cycles of historical data.
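As a rough guide to the structure of the models referred to in equations (9) to (12), a multivariable grey-type differential model is commonly written in the state-space form sketched below. The notation is an assumption introduced here for illustration: $Q^{(2)}$ and $x_j^{(2)}$ denote the twice-accumulated dimensionless production and jth main control factor, m is the number of main control factors, and a forward difference with unit time step is assumed for the discretization. The sign conventions and discretization used in the original equations may differ.

```latex
% Constant-parameter differential model (assumed form)
\frac{dQ^{(2)}(t)}{dt} = A\,Q^{(2)}(t) + \sum_{j=1}^{m} B_j\,x_j^{(2)}(t)

% Forward-difference discretization with unit time step (assumed)
Q^{(2)}(k+1) = (1 + A)\,Q^{(2)}(k) + \sum_{j=1}^{m} B_j\,x_j^{(2)}(k)

% Time-varying counterpart: parameters re-identified at each step
Q^{(2)}(k+1) = \bigl(1 + A(k)\bigr)\,Q^{(2)}(k) + \sum_{j=1}^{m} B_j(k)\,x_j^{(2)}(k)
```

In the constant-parameter form, extrapolating several steps reuses the same A and $B_j$ throughout, which is the limitation that motivates the time-varying form.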
3.2. LSSVM-Based Model Parameter Identification
When multistep prediction is performed with the time-varying intelligent simulation model, the latest information is used to identify the model parameters at each step. Because the model parameters change with time, the prediction precision is higher. In equation (12), the time-varying parameters A(t) and B1(t), B2(t), …, Bj(t) of the model are obtained from repeated cycles over the historical data. Although they fully reflect both the relationship between the prediction results and the main control factors and the variation trend of the state variable itself, the calculation deviation increases exponentially with the number of time steps, which makes the multistep simulation and prediction imprecise. The essence of this imprecision is the accumulated error of the least squares method when it is used for parameter identification of the time-varying system.
To overcome this disadvantage of the traditional least squares method and to improve the precision of the time-varying intelligent simulation model in multistep prediction, an LSSVM-based method is proposed to estimate the model parameters. Because the LSSVM fits with high precision, it avoids the rapid growth of accumulated errors caused by least squares parameter identification. This method follows the principle of structural risk minimization [30], and the algorithm is fast and accurate.
Set abbreviated as and abbreviated as , and the training sample is .
Consider the nonlinear regression model:where is a nonlinear function. The nonlinear mapping , where F is the high-dimensional feature space, is obtained as follows:where .
Substituting equation (13) into equation (14), the nonlinear regression model can be written as follows:
The error variable is defined as follows:
Consider the following optimization problem, where C is the penalty factor:
To solve the optimization problem in equation (17), a Lagrange function is constructed to transform it into an unconstrained problem, with one Lagrange multiplier attached to each constraint:
The Karush–Kuhn–Tucker (KKT) conditions are as follows:
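For reference, the textbook LSSVM regression formulation is sketched below in generic notation. The symbols $w$, $\varphi$, $e_i$, and $\alpha_i$ are the conventional ones and are assumptions here; they may not match the notation or numbering of the equations in this section. Only the penalty factor C and the bias b are named in the text.

```latex
% Primal problem: equality-constrained ridge regression in feature space
\min_{w,\,b,\,e}\; J(w, e) = \tfrac{1}{2}\, w^{\mathsf T} w + \tfrac{C}{2} \sum_{i=1}^{n} e_i^{2}
\quad \text{s.t.} \quad y_i = w^{\mathsf T} \varphi(x_i) + b + e_i, \qquad i = 1, \dots, n

% Lagrange function with multipliers \alpha_i
L(w, b, e, \alpha) = J(w, e) - \sum_{i=1}^{n} \alpha_i \bigl[ w^{\mathsf T} \varphi(x_i) + b + e_i - y_i \bigr]

% KKT (stationarity) conditions
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i\, \varphi(x_i), \qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i = 0,

\frac{\partial L}{\partial e_i} = 0 \;\Rightarrow\; \alpha_i = C\, e_i, \qquad
\frac{\partial L}{\partial \alpha_i} = 0 \;\Rightarrow\; w^{\mathsf T} \varphi(x_i) + b + e_i - y_i = 0
```

Because the constraints are equalities, the KKT conditions are linear in the unknowns, which is why the LSSVM reduces to solving a linear system rather than a quadratic program.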
Substituting equations (18) and (21) into equation (22) eliminates the weight vector and the error variables, and combining the result with equation (20) yields the following system of linear equations:
where I is the n-order identity matrix. The elements of the kernel matrix are defined as
where the kernel function satisfies Mercer’s condition [31]. In this study, the radial basis function (RBF) was selected as the kernel function:
Substituting equations (19) and (24) into the nonlinear regression model (15), the following can be obtained:
The values of the Lagrange multipliers and b can be solved directly from the linear system in equation (23), and these values can be substituted into equation (26) to obtain the discrete estimates of the time-varying parameters in equation (12): A(t) and B1(t), B2(t), …, Bj(t).
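The solve-then-predict procedure can be sketched with a small, dependency-free LSSVM regressor. This is a generic illustration under stated assumptions, not the program used in the study: it uses the textbook linear system [[0, 1ᵀ], [1, K + I/C]] [b; α] = [0; y] with an RBF kernel, and the function names, the kernel width `sigma`, and the toy data are hypothetical.

```python
from math import exp


def rbf(u, v, sigma):
    """RBF kernel exp(-||u - v||^2 / (2 sigma^2))."""
    return exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / (2.0 * sigma ** 2))


def solve(a, rhs):
    """Solve a dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def lssvm_fit(xs, ys, c=100.0, sigma=1.0):
    """Return (alpha, b) from [[0, 1^T], [1, K + I/c]] [b; alpha] = [0; y]."""
    n = len(xs)
    a = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf(xs[i], xs[j], sigma) for j in range(n)]
        row[i + 1] += 1.0 / c          # ridge term from the penalty factor
        a.append(row)
    sol = solve(a, [0.0] + list(ys))
    return sol[1:], sol[0]


def lssvm_predict(x, xs, alpha, b, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) + b."""
    return sum(al * rbf(x, xi, sigma) for al, xi in zip(alpha, xs)) + b


# Hypothetical 1-D training data sampled from y = x^2 (illustrative only).
xs = [[0.0], [0.5], [1.0], [1.5], [2.0]]
ys = [x[0] ** 2 for x in xs]
alpha, b = lssvm_fit(xs, ys, c=1000.0, sigma=1.0)
fitted = [lssvm_predict(x, xs, alpha, b) for x in xs]
```

A larger penalty factor makes the fit follow the training data more closely, since the training residual at each point equals the corresponding multiplier divided by the penalty factor. In practice a linear algebra library would replace the hand-written solver.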
3.3. Intelligent Prediction of Heavy Oil Production
The state equation can be obtained by performing discretization processing according to the association model in equation (12) for heavy oil production based on a time-varying system:
The LSSVM is used to identify the time-varying parameters A and B1, B2, …, Bj at each time step.
Finally, the predicted twice-accumulated series is reduced twice (inverse accumulation) to recover the dimensionless production series, and the dimension is then restored to obtain the predicted value of heavy oil production. Therefore, according to equation (27), if the main control factors of production at a given prediction time (such as steam injection volume, oil well opening times, steam injection dryness, and steam injection pressure) are known, the predicted production can be obtained.
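The reduction step can be sketched as two successive differencing passes followed by multiplication by the mean that was used for the mean-value conversion. The function names `iago` and `restore_production` and the sample numbers are hypothetical.

```python
def iago(series, order=1):
    """Inverse accumulation: undo `order` running sums by differencing."""
    out = list(series)
    for _ in range(order):
        out = [out[0]] + [b - a for a, b in zip(out, out[1:])]
    return out


def restore_production(q2_pred, q_mean):
    """Map a twice-accumulated dimensionless series back to production units."""
    return [v * q_mean for v in iago(q2_pred, order=2)]


# Hypothetical twice-accumulated prediction and the training-series mean
# used for the mean-value conversion (illustrative numbers only).
q2_pred = [0.4, 1.6, 4.0, 8.0]
production = restore_production(q2_pred, q_mean=5.0)
```

Because differencing amplifies small errors in the accumulated series, the accuracy of the restored production depends directly on how well the twice-accumulated series is fitted.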
4. Model Application and Result Analysis
The intelligent simulation and prediction method for data-driven heavy oil production is applied to an oil field in China, and several examples of heavy oil production prediction are obtained. In this study, production prediction is examined using CSS and SAGD as examples.
4.1. Determination of Main Control Factors of Heavy Oil
The analysis data were derived from the production data of 200 CSS development units and 200 SAGD development units in a heavy oil reservoir in China. The Apriori algorithm and the Pearson correlation coefficient were used to determine the main control factors of heavy oil production.
Before the Apriori algorithm was used to analyze the heavy oil production data, the data were preprocessed and converted into the corresponding logical characteristic set data. In this study, the k-means algorithm was used in Python, the K value was set to 4, and the continuous sample data were discretized into four levels for numbering. The discretization results of CSS and SAGD production and steam injection data are shown in the following figures (Figures 3–6).




Using the same method, the sample data of all possible influencing factors of CSS and SAGD development were discretized into four levels for numbering, and the full labeling rules are shown in Tables 2 and 3.
The CSS and SAGD production data were then converted into logical data tables (Tables 4 and 5).
The Apriori association rules were obtained with a Python program, as shown in Tables 6 and 7.
The Pearson correlation coefficients between heavy oil production and each influencing factor were calculated, and the correlation heat maps were drawn (Figures 7 and 8).


By comparing the relationship between the factors influencing the CSS and SAGD production of the heavy oil, the main control factors of CSS and SAGD production were obtained (Table 8).
4.2. Intelligent Simulation and Prediction of Heavy Oil Production
Three statistical criteria were used to appraise the predictive performance of the model: coefficient of determination (R2), root mean squared error (RMSE), and mean absolute percentage error (MAPE).
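R2 measures the goodness of fit between the observed and modeled data values and is defined as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2} \tag{28}$$
The RMSE is one of the most commonly reported measures of disagreement and indicates the accuracy of predictions; it is defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2} \tag{29}$$
MAPE is used to evaluate the overall forecast performance of the prediction models and is defined as follows:
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right| \tag{30}$$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the average of the actual values, and n is the number of data points.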
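The three criteria can be computed as in the sketch below, which uses their standard definitions. The function names and the sample values are hypothetical, and MAPE is expressed in percent here.

```python
from math import sqrt


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the units of the data."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error in percent; actual values must be nonzero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual and predicted production (illustrative numbers only).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
scores = {"R2": r2(actual, predicted),
          "RMSE": rmse(actual, predicted),
          "MAPE": mape(actual, predicted)}
```

RMSE is scale dependent, so it is only comparable between models evaluated on the same series, whereas R2 and MAPE are dimensionless.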
Continuous monitoring data covering 24 months from 2018 to 2019 were collected for the CSS and SAGD developments of the oil field, as shown in Tables 9 and 10. Table 9 lists the CSS production of heavy oil (Q) together with its main control factors: steam injection volume (x1), oil well opening times (x2), round of CSS (x3), steam injection dryness (x4), and steam injection pressure (x5). Table 10 lists the SAGD production (Q) together with its main control factors: steam injection volume (x1), oil well opening times (x2), steam injection dryness (x3), steam injection pressure (x4), and steam injection temperature (x5).
In this study, two traditional prediction methods, the BP neural network model (M1) [32] and the GM (1, N) model (M2) [33], were selected for comparison with the prediction effect of the data-driven intelligent simulation and prediction model (M3) for heavy oil production. The first 18 months of heavy oil production data were used for model parameter training, and the last six months of production data were used to evaluate the model’s prediction effect. The three prediction models were used for predicting CSS and SAGD production, respectively, and the results are shown in Tables 11 and 12.
Table 11 shows that the MAPE values of the training and prediction data of M3 are 7.65 and 4.89, respectively. The RMSE of model M3 is 10872, which is lower than that of the other two models, and the R2 of model M3 is closer to 1, indicating that model M3 has a good prediction effect. Based on the three evaluation indexes, model M3 had the best prediction effect on CSS production among the three prediction models.
Table 12 shows that the MAPE values of the training and prediction data of M3 are 7.65 and 2.78, respectively. The RMSE of model M3 is 4878, which is lower than that of the other two models, and the R2 of model M3 is closer to 1, indicating that model M3 has a good prediction effect. Based on the three evaluation indexes, model M3 had the best prediction effect on SAGD production among the three prediction models.
Figure 9 compares the production predictions of models M1, M2, and M3 for CSS with the original data, and Figure 10 shows the same comparison for SAGD. There are 24 data sets in total: the first 18 were used to train the model parameters, and the remaining 6 were used to test the model predictions.


As shown in Figures 9 and 10, the predictive ability of the intelligent simulation model (M3) of data-driven heavy oil production is better than that of the other two prediction models. Model M3 has a higher prediction precision for heavy oil production and better adaptability in heavy oil reservoirs with different recovery methods.
5. Conclusions
Based on the characteristics of different heavy oil recovery methods, this study uses a hybrid data-driven method to determine the main control factors of heavy oil production and establishes an intelligent simulation and prediction model for heavy oil production based on these factors. Under the same data conditions, the data-driven intelligent simulation and prediction model forecasts heavy oil production better than the BP neural network model and the GM (1, N) model. The model accurately predicts both the CSS and SAGD production of heavy oil, indicating that it adapts well to heavy oil production prediction under different recovery methods and can be used to predict heavy oil production.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was partially supported by the Major Program of Sichuan Province (no. 20QYCX0030).