Abstract
Unconventional resources have recently gained considerable attention, and consequently research interest in predicting total organic carbon (TOC), a crucial quality indicator, has increased. TOC is commonly measured experimentally; however, due to sampling restrictions, obtaining continuous TOC data is difficult. Different empirical correlations for TOC have therefore been presented, but there are concerns about their generalization and accuracy. In this paper, different machine learning (ML) techniques were utilized to develop models that predict TOC from well logs, including formation resistivity (FR), spontaneous potential (SP), sonic transit time (Δt), bulk density (RHOB), neutron porosity (CNP), gamma ray (GR), and spectrum logs of thorium (Th), uranium (Ur), and potassium (K). Over 1250 data points from the Devonian Duvernay shale were utilized to create and validate the models. These datasets were obtained from three wells; the first was used to train the models, while the datasets from the other two wells were utilized to test and validate them. Support vector machine (SVM), random forest (RF), and decision tree (DT) were the ML approaches tested, and their predictions were compared with three empirical correlations. Various parameters of the ML methods were tested to ensure the best possible accuracy in terms of correlation coefficient (R) and average absolute percentage error (AAPE) between the actual and predicted TOC. The three ML methods yielded good matches; however, the RF-based model had the best performance. The RF model was able to predict the TOC for the different datasets with R values ranging between 0.93 and 0.99 and AAPE values less than 14%. In terms of average error, the ML-based models outperformed the three empirical correlations.
This study shows the capability and robustness of ML models to predict the total organic carbon from readily available logging data without the need for core analysis or additional well interventions.
1. Introduction
Conventional hydrocarbon reserves are gradually depleting due to continuous oil and gas exploitation, and the production rates of current reservoirs have dropped dramatically [1, 2]. Source rock and unconventional reservoirs have recently attracted interest as a result [3, 4]. Because unconventional resources are more complex, tighter, and less permeable, exploring unconventional reservoirs is more challenging and demanding than exploring conventional ones [5]. However, considerable discoveries of unconventional resources have been announced around the globe, namely, in North and South America, the Middle East, and North Africa, which represent a significant addition to the total world oil reserves [6, 7]. Unconventional resources, in contrast to conventional reservoirs, are self-storing and self-generating reservoirs; consequently, evaluating their hydrocarbon generation potential is critical. Characterization, development, and hydrocarbon extraction of unconventional resources are sophisticated and costly operations, which underscores the importance of evaluating the ability of unconventional resources to generate hydrocarbons in a cost-effective and precise manner [4, 6].
Total organic carbon (TOC), which has been widely considered as a quantification of the hydrocarbon generation potential [8–10], is one of the most effective parameters for evaluating the quality of unconventional resources [11]. In general, the rock pyrolysis experiment is used to determine TOC content in the laboratory [12, 13]. Due to the high cost of the experiments, the number of laboratory TOC measurements is limited. Consequently, it is very difficult to obtain a complete TOC profile for the formations of interest, which severely affects the reservoir evaluation [14].
Several authors developed empirical correlations to determine TOC based on experimental TOC measurements on drilling cuttings or core samples and the corresponding well logs. These correlations were then proposed for determining the TOC in different wells [15–19]. These correlations are summarized in Table 1 [20–24]. One concern about these empirical correlations is the low accuracy of their predictions when they are applied to different datasets.
Artificial intelligence (AI) has been used in different industries [25]. AI techniques are known for their capability to generate highly accurate models; therefore, several studies have utilized them in TOC prediction [26, 27]. In the appendix, Table 2 summarizes the research studies that utilized AI techniques to estimate the TOC from well logs [8, 9, 14, 17, 18, 26, 28–42]. The well logs used include formation resistivity (FR), spontaneous potential (SP), sonic transit time (Δt), bulk density (RHOB), neutron porosity (CNP), gamma ray (GR), and spectrum logs of thorium (Th), potassium (K), and uranium (Ur).
TOC is a vital parameter for characterizing unconventional resources. Experimental analysis can be used to measure the TOC, but it is expensive and time-consuming, and it does not give a continuous profile of the total organic carbon with depth. Empirical correlations can be used to estimate TOC. However, there are concerns about the generalization and accuracy of these correlations. In this paper, the application of different AI techniques to TOC prediction from well logs in a Devonian shale formation is tested. These well logs include sonic transit time, resistivity, bulk gamma ray, bulk density, neutron porosity, and spectral GR logs of Th, Ur, and K. The following sections describe the methodology used to build the ML models and then the model development. The models are then tested and validated with different datasets, and a sensitivity analysis is conducted to investigate the importance of the input logging parameters for the predicted TOC values.
2. Methodology
In this study, three machine learning tools were used to estimate the TOC as a function of eight well log records. Figure 1 summarizes the methodology applied for optimizing and validating the ML models.

2.1. Data Description
Experimental TOC data from three different wells were collected together with their corresponding well logs. In total, 891, 291, and 82 data points from Well-A, Well-B, and Well-C were used to train, test, and validate the AI models, respectively. All wells are in the Devonian Duvernay shale, which is an organic liquid-rich source rock. The sedimentary basin is located in Alberta, Canada, with oil and gas reserves of 61.7 billion barrels and 443 trillion cubic feet, respectively [43, 44].
2.2. Core Samples Testing
Rock-Eval 6 was employed to measure the actual TOC values of drilling cuttings from the different wells. The test procedure is shown in Figure 2. A more detailed discussion of the TOC experimental procedure is presented by Chen et al. [44].

2.3. Data Preprocessing
Prior to training the AI models, outliers and incomplete or unrealistic data points were eliminated from the data used to construct the models. A data point was considered an outlier if any of its values lay more than three standard deviations from the mean of that variable. The statistical characteristics of Well-A’s dataset are presented in Table 3.
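The three-standard-deviation rule described above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB workflow used in the study; the column names, the use of `None` for missing values, and the choice of the population standard deviation are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev


def remove_outliers(rows, n_sigma=3.0):
    """Return rows that are complete and lie within n_sigma standard
    deviations of the column mean for every column."""
    # Discard incomplete records first (None marks a missing value here).
    complete = [r for r in rows if all(v is not None for v in r.values())]
    if not complete:
        return []
    columns = list(complete[0])
    # Column-wise mean and population standard deviation (assumed choice).
    stats = {}
    for c in columns:
        values = [r[c] for r in complete]
        stats[c] = (mean(values), pstdev(values))
    # Keep a row only if every value lies inside mean +/- n_sigma * std.
    return [
        r for r in complete
        if all(abs(r[c] - stats[c][0]) <= n_sigma * stats[c][1] for c in columns)
    ]
```

Each row is a dictionary holding one depth sample, for example `{"rhob": 2.5, "gr": 110.0, "toc": 3.2}` (hypothetical keys). Note that the statistics are computed once on the complete records, so a single pass is performed rather than an iterative trimming.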
2.4. Model Development
The AI models were trained and optimized in this work using the Well-A dataset, which contains 891 data points with wide ranges of TOC and well log values. The effect of various parameters inside the AI algorithms was tested to optimize the models by running the AI tools inside multiple for loops in MATLAB.
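The loop-based parameter search described above can be illustrated with a short sketch. It is written in Python rather than MATLAB and is not the authors' script; the function name `sweep`, the grid keys, and the `fit_and_score` callback (which would train a model and return an error such as AAPE) are hypothetical placeholders.

```python
from itertools import product


def sweep(grid, fit_and_score):
    """Try every combination of values in `grid` and return the
    (parameters, score) pair with the lowest score."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    # product() is equivalent to one nested for loop per parameter.
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(params)  # lower is better (e.g., an error)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    # Stand-in scoring function; a real one would train and evaluate a model.
    def toy_score(p):
        return abs(p["n_trees"] - 100) + abs(p["max_depth"] - 10)

    grid = {"n_trees": [50, 100, 200], "max_depth": [5, 10, 20]}
    print(sweep(grid, toy_score))
```

An exhaustive sweep of this kind is simple and reproducible, but its cost grows with the product of the grid sizes, so the grids are usually kept coarse.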
In the SVM models, different kernel functions and values for the kernel options, epsilon, and regularization were tested, while in RF, different combinations of the number of trees, the maximum number of levels in each tree, and the maximum number of features were used. In DT, various values for the maximum tree depth, minimum sample split, and maximum number of features were tested. The correlation coefficient (R) between the known and predicted TOC and the average absolute percentage error (AAPE) were used as evaluation criteria.
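For reference, the two evaluation criteria can be computed as in the sketch below. The text does not state the formulas explicitly, so the standard definitions are assumed here: AAPE as the mean of |actual − predicted| / |actual| expressed in percent, and R as the Pearson correlation coefficient. The function names are illustrative.

```python
from math import sqrt


def aape(actual, predicted):
    """Average absolute percentage error, in percent.
    Assumes no actual value is zero."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


def correlation_coefficient(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)
```

The two measures are complementary: R captures how well the predictions track the trend of the measured TOC, while AAPE captures the typical relative size of the error, which is why a model can have a high R and still a noticeable AAPE.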
In addition, different sets of inputs were used to evaluate the significance of each well log parameter in TOC prediction. Seven sets were considered; the smallest includes four parameters, and the most comprehensive consists of all eight parameters, as shown in Table 4.
The accuracies of the generated models were tested and validated using 291 and 82 data points from Well-B and Well-C, respectively. The two wells are in the same field as Well-A. The performance of the AI models was also compared with that of existing correlations, namely, the Schmoker model, the ΔlogR approach, and the Zhao et al. model.
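To make the comparison concrete, the sketch below implements two of these correlations in their commonly published forms: the Schmoker and Hester density relation and the sonic/resistivity ΔlogR relation of Passey et al. The constants are those widely quoted in the literature and are an assumption here, since the exact forms used in this study (Table 1) are not reproduced in the text; the function and argument names are illustrative. The Zhao et al. correlation is omitted because its form is not given here.

```python
from math import log10


def schmoker_toc(rhob):
    """Density-based TOC (wt%); rhob is bulk density in g/cm3.
    Constants follow the commonly quoted Schmoker and Hester form."""
    return 154.497 / rhob - 57.261


def delta_log_r(resistivity, sonic, resistivity_base, sonic_base):
    """Separation between the resistivity and sonic curves.
    Resistivity in ohm.m, sonic transit time in microseconds per foot;
    the baselines are read in a non-source, fine-grained interval."""
    return log10(resistivity / resistivity_base) + 0.02 * (sonic - sonic_base)


def passey_toc(dlogr, lom):
    """TOC (wt%) from the curve separation and level of maturity (LOM)."""
    return dlogr * 10 ** (2.297 - 0.1688 * lom)
```

Both relations depend on very few logs and on locally calibrated quantities (the baselines and LOM for ΔlogR, the regression constants for the density relation), which is one reason their accuracy can degrade when they are applied to a different formation.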
3. Results and Discussion
The AI models were trained for TOC estimation based on eight well logs: RHOB, Δt, CNP, FR, GR, and the spectral GR logs (Th, Ur, and K). The training dataset consisted of 891 data points from Well-A, while the testing dataset contained 291 data points from Well-B. This section presents the results obtained using each method.
3.1. Support Vector Machine
Using the datasets from Well-A and Well-B, different trials were performed with SVM while varying tuning parameters inside the algorithm, such as the kernel function and regularization. The best results were achieved using the Gaussian kernel function. Good results were achieved in the training dataset with a 7.1% average error; however, the accuracy in the testing dataset was relatively low, with an average error of 19.7%. The correlation coefficients were 0.974 and 0.856 for training and testing, respectively. Figure 3 presents the cross-plots of the actual and SVM-predicted TOC values for the training and testing datasets.

[Figure 3: Cross-plots of the actual and SVM-predicted TOC for the (a) training and (b) testing datasets.]
3.2. Decision Tree
This technique resulted in a near-perfect fit to the training dataset with a correlation coefficient of 0.994, as shown in Figure 4(a). However, the prediction performance was significantly less accurate in testing, with an R value of 0.864, which indicates a reduced ability of the model to generalize. In Figure 4(b), it is noticeable that some points fall relatively far from the 45° line.

[Figure 4: Cross-plots of the actual and DT-predicted TOC for the (a) training and (b) testing datasets.]
3.3. Random Forest
In comparison with DT, RF performed better in testing with a correlation coefficient of 0.931, and its performance with the training dataset was very close to that of DT, with an R of 0.989. Figure 5 shows the deviation between the actual and predicted TOC. Compared with the DT testing results in Figure 4(b), the points in Figure 5(b) fall closer to the 45° line.

[Figure 5: Cross-plots of the actual and RF-predicted TOC for the (a) training and (b) testing datasets.]
4. Input Parameters Sensitivity
From the previous analysis, RF was chosen to perform the sensitivity analysis on the inputs. Seven different sets of well log inputs were tested and are reported in Table 5. The best performance was achieved when all eight parameters were incorporated (case 7), while the worst fitting accuracy occurred when the gamma ray and spectral gamma ray logs were excluded (case 1). The smallest effect on TOC prediction was observed in the cases that exclude density, porosity, or sonic transit time alone (cases 3, 4, and 6). It is noteworthy that the effect of changing the input set is more obvious in the testing dataset than in the training dataset.
5. Models Validation
As shown in the previous results, good matching accuracies for Devonian shale’s TOC were achieved by the three methods in the training and testing datasets. As an additional validation, eighty-two data points from Well-C were kept hidden from the AI tools during model construction and utilized later to ensure the generalization of the new models. This well is located in the vicinity of Well-A and Well-B, which were used in model building. The validation data points fall within a 140 ft depth range. The accuracy of the AI techniques was compared with that of three of the available models for TOC estimation, namely, the Schmoker, ΔlogR, and Zhao et al. correlations.
The TOC values obtained from the different empirical correlations and AI-based models are plotted against the actual values of the Well-C dataset in Figure 6. This figure shows that the RF-based model outperformed all other models, with an AAPE of only 14% and a high R of 0.94. The DT and SVM models were less accurate than RF; however, in terms of AAPE, both were better than the three correlations, with AAPE values less than 16.4%. The results of the Zhao et al. [24] correlation were the closest to those of the AI-based models and not far from the ΔlogR predictions, with AAPE values ranging between 20% and 25% and R values of about 0.83, while the least favorable results in the validation dataset were from the Schmoker model, with an AAPE above 48%.

The TOC predictions for the different datasets and their comparison with the existing correlations show the accuracy of the developed models for Devonian shale. These models determine the TOC using the conventional well logs and spectral GR logs. They outperformed the existing models that calculate the TOC based on the RHOB log only (Schmoker model), on a combination of FR and Δt logs and level of maturity (LOM) (ΔlogR method), or on bulk gamma ray, FR, and Δt or RHOB logs (Zhao et al. [24] correlation). This result and the previous results in this study demonstrate the reliability of the AI models for TOC estimation in Devonian shale.
The developed models were able to accurately predict the TOC from conventional well logs, including CNP, RHOB, GR, Δt, FR, K, Th, and Ur logs, which helps to obtain a continuous TOC profile with depth without the need for core analysis or additional well interventions. As with any data-driven model, we recommend applying the developed models only to input parameters within the ranges of the training inputs to ensure reliable results. For future work, more data will be collected to validate the developed models, and other ML techniques will be applied.
6. Conclusions
This study established three models for TOC prediction in Devonian shale from conventional well logs and spectral GR logs using different machine learning techniques and approximately 1250 data points from three wells. The employed ML techniques were support vector machine (SVM), decision tree (DT), and random forest (RF). A summary of the findings reported in this paper is as follows [45]:
(i) In the training and testing datasets, the three AI algorithms produced good matches; however, the RF-based model had the best accuracy. The RF model was able to predict the TOC for the training and testing datasets with R values of 0.99 and 0.93, respectively, and AAPE values of 5.3% and 13.8% in the same order.
(ii) Data from a different well were hidden entirely from the AI tools and used to validate the built models. In this dataset, the RF model produced a correlation coefficient of 0.94 and an AAPE of 14%.
(iii) The predictions of the AI-based models were compared with those of three empirical correlations. The AI models yielded more accurate results than the correlations, which resulted in AAPEs greater than 20%.
Abbreviations
AAPE: | Average absolute percentage error |
ANFIS: | Adaptive neuro-fuzzy inference system |
ANN: | Artificial neural network |
CNN: | Convolutional neural network |
CNP: | Neutron porosity |
DT: | Decision tree |
ELM: | Extreme learning machine |
FL: | Fuzzy logic |
FN: | Functional network |
FR: | Formation resistivity |
GPR: | Gaussian process regression |
GR: | Gamma ray |
K: | Spectrum logs of potassium |
ML: | Machine learning |
NF: | Neuro-fuzzy |
NN: | Neural network |
R: | Correlation coefficient |
RF: | Random forest |
RHOB: | Bulk density |
SP: | Spontaneous potential |
SVM: | Support vector machine |
Th: | Spectrum logs of thorium |
TOC: | Total organic carbon |
Ur: | Spectrum logs of uranium |
ρ: | Density |
Δt: | Sonic transit time. |
Data Availability
Most of the data are included in the manuscript. A detailed data sample will be provided upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
The authors would like to thank KFUPM for giving permission to publish this work.