Abstract

The total organic carbon (TOC) content is a core indicator for shale gas reservoir evaluation. Machine learning-based models can quickly and accurately predict TOC, which is of great significance for shale gas production. Based on conventional logs and measured TOC values from 9 typical wells in the Jiaoshiba area of the Sichuan Basin, this paper performed Bayesian linear regression and applied a random forest machine learning model to predict TOC values of the shale from the Wufeng Formation and the lower part of the Longmaxi Formation. The results showed that, in overmature and tight shale, the well-trained machine learning models improved the TOC prediction accuracy by more than 50% compared with the traditional ΔlgR method. Using the halving random search cross-validation method to optimize hyperparameters greatly improved the speed of building the model. Furthermore, excluding factors other than TOC that affect the log values and taking the corrected data as training input improved the prediction accuracy of the random forest model by approximately 5%. Machine learning models can also be easily updated with new data, which is of primary importance for improving the efficiency of shale gas exploration and development.

1. Introduction

Shale gas is a very important unconventional energy resource. Shale gas production in the United States constitutes a major part of its energy structure, and China has also made breakthroughs in the shale gas field in recent years [1]. Quickly identifying sweet spots where oil and gas are enriched in shale formations has an important impact on guiding the economic and effective exploitation of shale oil and gas resources [2–4]. The total organic carbon content (TOC) is an important index for evaluating the enrichment of oil and gas resources, which can effectively indicate the organic matter-rich intervals in a shale formation [5]. TOC values are often obtained through laboratory testing of cores. However, shale formations have strong heterogeneity, being controlled by sedimentary space, material source supply, and other factors [6]. Moreover, in the early stage of shale oil and gas exploration, the continuity and integrity of core data cannot be guaranteed. As a result, the use of discrete measured TOC values from core tests may lead to a misunderstanding of organic matter-rich intervals. In contrast, geophysical log data are complete and continuous. Continuous vertical TOC profiles can be obtained from log data, and the distribution of organic-rich layers can then be predicted [7–10].

In the 1980s, Schmoker first identified the relationship between log data and organic matter abundance and used the density log to calculate the organic carbon content. With the continuous development of technology, many methods using log information to predict TOC values have been developed, such as the log curve superposition evaluation method (the ΔlgR method and its modifications) [11–13], multiple linear regression evaluation methods [14], machine learning, and other mathematical analysis evaluation methods [15–17]. However, different methods have different scopes of application. The ΔlgR method has a comparably wide range of applications among these methods because it is driven by a physical model. Nevertheless, neither the traditional ΔlgR method nor its improved versions can fully cover the variety of formation conditions (such as abnormal fluid pressure, overmaturity, and tight reservoirs). In addition, most TOC evaluation methods based on ΔlgR need to manually determine the baseline values of the porosity curve and the resistivity curve, which is a relatively cumbersome process.

In recent years, machine learning methods have become a useful tool for building prediction models, as they can reveal hidden patterns and unknown correlations between independent and dependent variables [18, 19]. In a machine learning model, TOC prediction is a multiple regression problem. The machine learning algorithm can automatically determine the comprehensive relationship between the TOC values and the corresponding log values by learning from samples. Machine learning methods are driven by data and thus are not subject to changes in geological conditions. A large amount of stratum information can be used to comprehensively predict TOC values, so the accuracy will not be greatly reduced by the distortion of a single curve [20, 21]. The disadvantages of machine learning-based models are that they may have multiple solutions and may overfit with a limited number of samples. In recent years, some machine learning methods have shown good application effects and prospects in TOC prediction of source rocks. Zhao et al. [22] used Bayesian methods to predict TOC values and achieved good results. Handhal et al. [23] used ensemble learning methods, which not only guaranteed the accuracy of the model but also alleviated overfitting and improved the generalization ability of the model.

This paper takes the shale of the Wufeng Formation and the lower part of the Longmaxi Formation in the Jiaoshiba area of the Sichuan Basin as the main research object. Based on a large number of measured TOC values from drill cuttings and cores in this area, Bayesian linear regression and the relatively stable random forest algorithm are selected to predict the TOC values. By comparing the results with those of the traditional ΔlgR method, this paper discusses which method is more suitable for TOC prediction in this area and how to improve the calculation speed and accuracy of the existing methods.

2. Data and Methods

2.1. Geological Background and Data Source

The shale of the Upper Ordovician Wufeng Formation and the Lower Silurian Longmaxi Formation in the southeastern Sichuan Basin is a shelf deposit that was maintained over a long time in a deep-water anoxic environment. Black shale and silty shale with stable thickness and wide distribution were deposited; they have high siliceous contents and are rich in graptolites. The organic matter is mainly derived from high-productivity marine organisms and has a high degree of thermal evolution, with Ro values of approximately 2.0% to 3.5% [24]. This basin is currently the most important shale gas reservoir in China [25, 26]. The first shale gas field in China was built in the Jiaoshiba area (Figure 1), and the main production layer of this gas field is the black shale section of the Wufeng Formation and the lower part of the Longmaxi Formation. The geothermal field is relatively stable. The roof and floor plates constitute excellent sealing units, and fault damage is limited. Overall, the area has good preservation conditions, which are conducive to the enrichment and storage of shale gas [26–28]. However, the black shale in the Wufeng Formation and the lower part of the Longmaxi Formation was once buried to a depth of 6000 m. The diagenesis of the shale was relatively thorough, which led to changes in the porosity log values under the effect of many factors, so the application effect of the traditional TOC prediction method is not ideal [29]. In this paper, the Jiaoshiba area is taken as the research object to investigate the TOC prediction performance of different machine learning methods based on log data and measured TOC values from 9 typical wells (JY1, JY2, JY3, JY4, JY5, JY6, JY7, JY8, and JY9) in this area. Among them, the data from wells JY1, JY3, JY4, JY5, JY8, and JY9 are used for model establishment and verification, and the data from wells JY2, JY6, and JY7 are used to verify the universality of the model.

2.2. Methods
2.2.1. Traditional ΔlgR Method

The traditional ΔlgR method was proposed by Passey et al. [11]. This method uses a porosity curve, the deep lateral resistivity curve, and a maturity parameter to predict the TOC values. Normally, the porosity curve (usually the acoustic transit time curve) and the resistivity curve overlap in the fine-grained organic-poor texture layer but show an amplitude difference (defined as ΔlgR) in the organic-rich texture layer. This amplitude difference has a linear relationship with the TOC values and is a function of maturity, which can be used to calculate TOC values.

The formula for calculating the amplitude difference from the acoustic transit time and resistivity is as follows:

$$\Delta \lg R = \lg \left( \frac{R}{R_{\text{baseline}}} \right) + 0.02 \times \left( \Delta t - \Delta t_{\text{baseline}} \right),$$

where $\Delta \lg R$ is the curve amplitude difference measured in logarithmic resistivity units; $R$ is the deep lateral resistivity in Ω·m; $\Delta t$ is the acoustic transit time in μs/ft; $R_{\text{baseline}}$ and $\Delta t_{\text{baseline}}$ are the baseline values of the resistivity curve and the acoustic transit time curve, respectively, which correspond to the overlapping section of the two curves in the fine-grained organic-poor texture layer; and 0.02 is the calibration coefficient.

The TOC value is obtained by the following empirical relationship:

$$\mathrm{TOC} = \Delta \lg R \times 10^{\left( 2.297 - 0.1688 \times \mathrm{LOM} \right)},$$

where TOC is the total organic carbon content (%) and LOM is the maturity parameter, which can be replaced with the Ro value.

2.2.2. Bayesian Linear Regression Method

The Bayesian linear regression model is based on Bayesian inference in statistics [31, 32]. It regards the parameters of the linear model as random variables and derives their posterior distribution from the prior distribution of the model parameters (the weight coefficients). This model has the basic properties of a Bayesian statistical model and can yield the probability density function of the weight coefficients. In addition, it supports online learning and model hypothesis testing based on Bayes factors [32, 33].

The purpose of Bayesian linear regression is not to find the single best value of the model parameters but to determine the posterior distribution of the model parameters. The response variables, as well as the model parameters, are drawn from probability distributions. The posterior distribution of the model parameters is based on the input and output of the training data [34]:

$$P(\beta \mid y, X) = \frac{P(y \mid \beta, X) \times P(\beta \mid X)}{P(y \mid X)},$$

where $P(\beta \mid y, X)$ is the posterior probability distribution of the model parameters given the input and output, $P(y \mid \beta, X)$ is the likelihood probability of the output, $P(\beta \mid X)$ is the prior probability of the parameters given the input, and $P(y \mid X)$ is the normalization constant. This formula is a simple expression of Bayes' theorem, which is the basis of Bayesian inference.

The probability density function of the parameter posterior distribution is as follows:

$$p(w \mid X, y) \propto N\left( y \mid Xw, \alpha I \right) \, N\left( w \mid 0, \lambda^{-1} I \right).$$

In the formula, $w$ is the vector of weight coefficients of the model, $\alpha$ is the noise variance, $\lambda$ is an individual hyperparameter that measures the precision of $w$, $y$ is the real target value, $X$ contains the vector values of each log, $Xw$ is the model prediction value, and $I$ is the identity matrix [35, 36].

For the TOC prediction problem, accurate prior information is not available for the weight proportion of each log vector; thus, a noninformative prior must be introduced. That is, the prior distribution of the parameters is taken to be a spherical Gaussian distribution. After that, the specific process is as follows. (1) Use the training data to build the model. In this process, the parameters $\alpha$ and $\lambda$ can be obtained by maximum likelihood estimation; alternatively, initial values can be specified artificially and then updated continuously until the log marginal likelihood is maximized, at which point the model is most consistent with the actual situation. (2) Use the validation data to verify the accuracy of the model. In this process, the grid search method is used to optimize the hyperparameters of the gamma distributions that $\alpha$ and $\lambda$ obey, so as to obtain the optimal model. (3) Use the optimal model to predict the test data. If the minimum accuracy requirement is met, the model parameters are returned and the whole dataset is used to fit the final model. (4) Use the final model to predict the TOC values of other sections or wells.
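
This workflow can be sketched with scikit-learn's BayesianRidge, whose gamma-prior hyperparameters (alpha_1, alpha_2, lambda_1, lambda_2) correspond to the distributions mentioned in step (2). The following is a minimal illustration with synthetic placeholder data, not the authors' actual implementation:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import GridSearchCV

# Synthetic placeholders for the selected log curves and measured TOC values
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))            # e.g., GR, SP, DEN, AC, CNL
y_train = X_train @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)
X_test = rng.normal(size=(20, 5))

# Step (1): alpha and lambda are re-estimated iteratively during fitting
# until the log marginal likelihood is maximized.
# Step (2): grid search over the gamma-prior hyperparameters of alpha and lambda.
param_grid = {
    "alpha_1": [1e-6, 1e-2],
    "alpha_2": [1e-6, 1e-2],
    "lambda_1": [1e-6, 1e-2],
    "lambda_2": [1e-6, 1e-2],
}
search = GridSearchCV(BayesianRidge(), param_grid,
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)

# Steps (3)-(4): predict TOC for new samples; BayesianRidge also returns a
# per-sample standard deviation that reflects the posterior distribution.
toc_pred, toc_std = search.best_estimator_.predict(X_test, return_std=True)
```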

Bayesian linear regression can alleviate the overfitting seen in maximum likelihood estimation: the parameters are regarded as unknown fixed values in maximum likelihood linear regression, whereas they are regarded as random variables in Bayesian linear regression, which is widely used in the field of machine learning. The Bayesian linear regression method uses 100% of the data samples, and the complexity of the model can be determined effectively and accurately using the training samples alone. This method is suitable for processing small datasets such as log data [37]. It has been applied to lithology recognition, fluid classification, etc., with good results [38, 39].

2.2.3. Random Forest Method

The random forest method is an ensemble learning method for classification, regression, and other tasks; it uses the prediction results of multiple decision trees to determine the final classification results and regression values. It is essentially a bagging method that uses limited data to obtain many new samples through repeated sampling, constructs multiple independent estimators, and averages their results for the overall prediction. When determining the final output, multiple decision trees are combined. Although a single decision tree has a large variance, the variance of the final combined result can be very low because each decision tree is trained on its own specific sample and the results are averaged.

The corresponding basic steps of the algorithm are as follows: (1) Bootstrap sampling with replacement is carried out from the training data to generate several datasets, and each dataset generates a decision tree through training. (2) When a decision tree splits a node, several features are randomly selected from all the log vectors, and the branches of the optimal feature are grown fully until they cannot split further; pruning is not performed in this process. (3) Out-of-bag data (the unselected data) are used to test the performance and generalization ability of the model, determine the optimal number of decision trees, and rebuild the model. (4) The determined model is used to predict new data [20, 40].
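
These steps map directly onto scikit-learn's RandomForestRegressor. The sketch below is illustrative (synthetic stand-in data, not the authors' code); bootstrap sampling and random feature selection are handled internally, and out-of-bag scoring implements step (3):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                       # stand-in for 5 log curves
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300)

# Steps (1)-(2): bootstrap sampling and random feature selection at each split;
# trees are grown fully (no pruning) by default.
# Step (3): oob_score=True evaluates the ensemble on out-of-bag samples.
rf = RandomForestRegressor(
    n_estimators=200,       # number of decision trees
    max_features="sqrt",    # features considered at each split
    bootstrap=True,
    oob_score=True,
    random_state=0,
)
rf.fit(X, y)
print(f"out-of-bag R^2: {rf.oob_score_:.3f}")       # generalization estimate

# Step (4): predict on new data
y_new = rf.predict(rng.normal(size=(10, 5)))
```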

The main advantage of the random forest algorithm is that each decision tree only uses part of the samples and only extracts some of the attributes for modelling, which enhances the diversity of learners, corrects the habit of the decision tree for overadapting to its training set, and improves the generalization of the model [41]. Especially for the high-dimensional regression problem like TOC prediction, the stability and generalization of the model are more important than the small deviation to some extent. It has been successfully applied in many aspects, such as lithology identification [42], source rock prediction [43], and seismic reservoir prediction [44], and it has the advantages of simplicity and interpretability.

3. Results and Analysis

3.1. Traditional Method

According to the results of previous studies [45], the TOC values predicted by the ΔlgR method correlate poorly with the measured TOC values, and the predicted results cannot objectively reflect the actual total organic carbon content (Figure 2).

3.2. Machine Learning Methods

The machine learning models used for TOC prediction involve four steps: data preprocessing, log series selection, hyperparameter selection and model establishment, and model verification and application [46–49]. The specific workflow is as follows: find enough data points and preprocess the data, including depth matching, data cleaning, and data resampling; divide the data into a training set, a validation set, and a test set; use the training set to optimize the hyperparameters before the learning process, because these parameters affect the performance of the model and cannot be learned by the machine learning algorithm; use the optimized hyperparameters to build the model; use the well-trained model to evaluate the test set; and extrapolate and apply the evaluated model.

3.2.1. Data Preprocessing

There is often a deviation between the core depth and the log depth because of low core recovery rates and inaccurate estimation of the core depth. This leads to inconsistencies between the geological characteristics recorded by the core and the log records and affects the accuracy of geological feature recognition from log data. Under actual geological conditions, the depth of the log records is more accurate than that of the cores. Therefore, the core depth needs to be corrected so that the TOC test sampling points can be calibrated to the log depth.

The minimum resolution of the log data is currently taken to be 0.1 m, so two TOC test sampling points less than 0.1 m apart cannot be distinguished. In addition, data without statistical significance often affect the establishment of the model; thus, it is necessary to screen the TOC data. This study uses the DataFrame structure of the third-party Pandas library in Python to screen out the TOC test points whose depth difference is less than 0.1 m, as well as invalid values beyond the mean value plus or minus 3 times the standard deviation.
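
As an illustration, this screening step could be implemented as follows; the column names ("depth", "toc") and values are placeholders, not the authors' actual code, and the depth screen is a simple one-pass version (a strict version would iterate after each removal):

```python
import pandas as pd

# Measured TOC points; "depth" in m and "toc" in % are illustrative columns
df = pd.DataFrame({"depth": [2410.00, 2410.05, 2410.30, 2410.55, 2410.80],
                   "toc":   [2.1, 2.2, 3.0, 2.8, 2.5]})
df = df.sort_values("depth")

# Drop points closer than the 0.1 m log resolution to the previous point
df = df[df["depth"].diff().fillna(0.1) >= 0.1]

# Drop invalid values outside mean +/- 3 standard deviations
mu, sigma = df["toc"].mean(), df["toc"].std()
df = df[df["toc"].between(mu - 3 * sigma, mu + 3 * sigma)]
```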

The log data are sampled uniformly and densely throughout the well section and can be approximately regarded as a continuous variable, while the measured TOC values are discrete data at fixed depths, so there is a certain mismatch between the two in sampling depth. A resampling operation is required for two kinds of data with different sampling intervals. At present, the commonly used resampling methods in log technology include the fast Fourier transform, Gaussian convolution, window data shift, linear transformation, and linear antialiasing [50, 51]. Considering that log data generally present a continuous linear variation along the well section, this paper chose the linear transformation resampling method and set the maximum interval to 0.25 m (2 log intervals), which avoids the subjectivity of manually selecting data.
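
The linear resampling step might be sketched with NumPy's interpolation as below, skipping TOC samples farther than the 0.25 m maximum interval from the nearest log sample; the depths and values are synthetic and the variable names illustrative:

```python
import numpy as np

# Densely sampled log curve (here every 0.125 m) and a synthetic GR-like signal
log_depth = np.arange(2400.0, 2450.0, 0.125)
log_value = np.sin(log_depth / 5.0) * 30 + 90

# Discrete depths of the measured TOC samples
toc_depth = np.array([2405.3, 2417.8, 2431.1, 2444.6])

# Keep only TOC samples within 0.25 m (2 log intervals) of a log sample,
# then linearly interpolate the log onto the retained TOC depths
nearest_gap = np.min(np.abs(log_depth[:, None] - toc_depth[None, :]), axis=0)
valid = nearest_gap <= 0.25
log_at_toc = np.interp(toc_depth[valid], log_depth, log_value)
```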

After preprocessing, 386 groups of modelling data from JY1, JY3, JY4, JY5, JY8, and JY9 and 242 groups of prediction test data from JY2, JY6, and JY7 were finally obtained. Each group of data includes the RT, GR, AC, CAL, CNL, DEN, and SP log values and the corresponding measured TOC value. Their statistical information is shown in Table 1 and Table 2. Statistical analyses show that most measured TOC values are less than 6%, which is generally low. Individual values exceeding this range were regarded as outliers and removed.

3.2.2. Log Series Selection

The accuracy of machine learning methods used to predict TOC values largely depends on the input data. If the correlation between the logs and the TOC values is weak or too complicated, the algorithm can easily learn a wrong functional relationship from a small number of samples, which may result in overfitting. Therefore, it is necessary to analyze the correlation between the log data and the TOC values before building the machine learning model. Generally, more selected features correspond to more log series, more information coverage, and a more accurate model. However, redundant features will also affect the accuracy of the calculation and the generalization of the model.

Considering the difficulty of acquiring log data, this paper mainly uses commonly available conventional log parameters (GR, SP, CAL, DEN, CNL, AC, and RT) to predict TOC. Before modelling, it is necessary to perform a preliminary correlation analysis between the selected log series and the measured TOC values. This process can avoid overfitting caused by weak correlations or complex relationships between the logs and TOC values. In statistical analyses, the Pearson product-moment correlation coefficient (Pearson's r) is widely used to measure the degree of linear correlation between two variables, and its value lies between -1 and 1. A positive number indicates a positive correlation, and a negative number indicates a negative correlation. The closer the absolute value is to 1, the higher the correlation between the two variables [52]. The Pearson matrix can be used to analyze the correlation between the different log curves and the measured TOC (Figure 3). In Figure 3, the number in each block is the Pearson's r of the two variables corresponding to the row and column. The Pearson's r values between the measured TOC and GR, DEN, AC, and CNL are 0.65, -0.9, 0.48, and -0.67, respectively, showing relatively good correlations. For the regression prediction model, redundant information with low correlation needs to be excluded. According to the actual conditions in the work area, TOC prediction is carried out using the log series with absolute correlation coefficients greater than or equal to 0.2 as input data, namely, GR, SP, DEN, AC, and CNL.
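
This screening can be reproduced with pandas' corr method; the sketch below uses synthetic stand-in values (the column names and coefficients are illustrative, not the measured data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 386
toc = rng.uniform(0.1, 6.0, n)
df = pd.DataFrame({
    "GR":  toc * 20 + rng.normal(scale=15, size=n),      # synthetic stand-ins
    "DEN": 2.7 - toc * 0.05 + rng.normal(scale=0.02, size=n),
    "SP":  rng.normal(size=n),                           # weakly correlated
    "TOC": toc,
})

# Pearson correlation of each log with TOC (Figure 3 analogue)
r = df.corr(method="pearson")["TOC"].drop("TOC")

# Keep log series whose |r| with TOC is at least 0.2
selected = r[r.abs() >= 0.2].index.tolist()
print(selected)
```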

3.2.3. Hyperparameter Selection and Model Establishment

To enhance the generalization ability of the machine learning models, it is necessary to use cross-validation to optimize the hyperparameters; the selected optimal hyperparameters are then used to overcome overfitting and improve the prediction performance [53]. In this paper, the modelling data are divided into a training dataset, a verification dataset, and a test dataset at a ratio of 6 : 3 : 1. The TOC distributions of the different datasets are similar to ensure that the cross-validation results are meaningful. The training data are the initial learning data for building the model. The verification data are used to test the accuracy of the model under different hyperparameters and to select the best hyperparameters. The test data participate in neither the establishment nor the selection of the model but are used to test the accuracy of the final model. The accuracy obtained on the test dataset can reflect the extrapolation ability of the model to a certain extent, which increases the credibility of the model.
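
A 6 : 3 : 1 split can be obtained with two calls to scikit-learn's train_test_split; the arrays here are synthetic stand-ins for the selected logs and measured TOC values:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(386, 5))            # stand-in for the selected log series
y = rng.uniform(0.1, 6.0, size=386)      # stand-in for measured TOC (%)

# First hold out the 10% test set, then split the remainder 6 : 3
# (one third of the remaining 90% goes to validation)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.1,
                                                  random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest,
                                                  test_size=1 / 3,
                                                  random_state=0)
```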

The loss functions commonly used in cross-validation are the mean square error (MSE), the mean absolute error (MAE), the explained variance score (EVS), and the coefficient of determination ($R^2$) [54]. The calculation must be repeated many times during cross-validation. To save computing cost without losing accuracy, the mean absolute error (MAE) is used as the loss function in this paper. The formula is as follows [55]:

$$\mathrm{MAE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,$$

where $y$ is the real target value, $\hat{y}$ is the estimated target value, $n$ is the number of samples, $y_i$ is the real target value of the $i$-th sample, and $\hat{y}_i$ is the estimated target value of the $i$-th sample.

Moreover, the coefficient of determination ($R^2$) is used as the evaluation standard on the test set. A value closer to 1 corresponds to a better final regression prediction, and a value closer to 0 corresponds to a worse one. The formula is as follows [56]:

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2},$$

where $y$ is the true target value, $\hat{y}$ is the estimated target value, $n$ is the number of samples, $y_i$ is the true target value of the $i$-th sample, $\hat{y}_i$ is the estimated target value of the $i$-th sample, and $\bar{y}$ is the average of the true target values.
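
Both criteria are available in sklearn.metrics; a quick check on illustrative values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([1.2, 2.5, 3.8, 0.9, 4.4])   # measured TOC (%), illustrative
y_pred = np.array([1.0, 2.7, 3.5, 1.1, 4.6])   # predicted TOC (%)

print(mean_absolute_error(y_true, y_pred))     # MAE, per the formula above
print(r2_score(y_true, y_pred))                # coefficient of determination
```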

In general, it is important to choose the initial values of the regularization parameters ($\alpha$, $\lambda$) when fitting a polynomial curve by the Bayesian linear regression method, because the regularization parameters are determined by an iterative process that depends on the initial values [57]. In this study, whether the regularization parameters take the default values or relatively extreme values, the training results are good. The sample data can therefore be considered relatively consistent with the Gaussian prior, and the results do not depend on the initial values. Moreover, a comparison between the test set and the training set (Figure 4) shows that the generalization of the model is relatively good, with an $R^2$ score of 0.8997.

The random forest regression model also uses cross-validation to optimize the hyperparameters. However, the difference is that the random forest method has many hyperparameters, such as the maximum depth of the tree (max_depth), the minimum number of samples required to split an internal node (min_samples_split), the maximum number of features to be considered when looking for the best split (max_features), and the number of samples for training each basic estimator (max_samples) [33]. Conventional search methods, such as grid search algorithms, can exhaust all parameter combinations, but their efficiency is too low, which wastes computing power. Therefore, it is necessary to use a randomized search algorithm for optimization. Randomized search cross-validation samples a fixed number of hyperparameter combinations from a given distribution. Because not all combinations are evaluated, it improves the speed of operation. However, the speed of searching for a good combination of parameters is still not ideal, taking approximately 20 minutes.

To solve this problem, previous studies proposed halving random search cross-validation [58, 59], an iterative selection process in which all parameter combinations (referred to as candidates below) are evaluated with a small amount of resources in the first iteration, and only some candidates are selected for the next iteration, to which more resources are allocated. In other words, the search strategy begins by using a small amount of resources to evaluate all candidates and uses an increasing amount of resources to iteratively select the best candidate. The resource usually refers to the number of training samples but can also be any numeric parameter, such as the number of basic estimators in the random forest algorithm.

As shown in Figure 5, in the first iteration, a small amount of resources (the number of samples) was used to evaluate all candidates. In the second iteration, only the better half of the previous candidates were evaluated, while the amount of allocated resources doubled. This process was repeated until the last iteration, in which only 2 candidates remained. With the iterations and the increase in input samples, candidates were eliminated according to their scores on the verification set, and the hyperparameters were thereby optimized. The line segments with different colors in Figure 5 represent different candidates (parameter combinations) and reflect the changes in their verification set scores during the iterative process. As the number of iterations (abscissa) increases, candidates with low scores (ordinate) are eliminated, and candidates with high scores continue to the next iteration, until only one candidate remains at the end of the process. The best candidate is the one with the highest score in the last iteration (the black line), giving the best hyperparameter combination: {‘bootstrap’: False, ‘criterion’: ‘mae’, ‘max_depth’: None, ‘max_features’: 1, ‘min_samples_split’: 4}. In this case, the $R^2$ of the verification set is 0.9464.
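
A hedged sketch of this search using scikit-learn's HalvingRandomSearchCV follows (the estimator is experimental and must be enabled explicitly; 'mae' is spelled 'absolute_error' in recent scikit-learn releases). The parameter lists echo the hyperparameters named above, and the data are synthetic stand-ins rather than the authors' dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# HalvingRandomSearchCV is experimental and must be enabled before import
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(386, 5))              # stand-ins for the selected logs
y = rng.uniform(0.1, 6.0, size=386)        # stand-in for measured TOC (%)

param_distributions = {
    "bootstrap": [True, False],
    "criterion": ["absolute_error"],       # the paper's 'mae'
    "max_depth": [None, 5, 10, 20],
    "max_features": [1, "sqrt", None],
    "min_samples_split": [2, 4, 8],
}

search = HalvingRandomSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    resource="n_samples",    # the resource grown between iterations
    factor=2,                # resources doubled, candidates halved each round
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```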

The performance of the halving random search cross-validation method on the test set was statistically analyzed. As shown in Figure 6, the best $R^2$ value is approximately 0.9 and the lowest MAE is approximately 0.3. The whole search was completed in only 19.2 seconds, a speedup of more than 60 times.

However, the prediction effect of the above random forest model is not ideal, especially when the TOC value is less than 0.6%, where the relative error between the predicted and measured values is large. After excluding factors such as the algorithm and parameters, the cause is considered to be related to the input data structure. During log data acquisition, noise is inevitably generated by environmental interference and random factors, which introduces errors into the calculation of geological parameters [60]. In addition to TOC variation, the factors that affect changes in log values include abnormal fluid pressure, hydrocarbons, tight reservoirs, overmature organic matter, and other formation information [61]. The log value fluctuations caused by these factors may mask the log variation caused by the TOC, especially when the TOC value is small; the contribution of other formation information to the variation in log values may then be much greater than that of the TOC, resulting in inaccurate predictions. Therefore, when using log data to predict TOC values, it is better to exclude the interference of unrelated factors in advance. Following previous research [45], the log base value corresponding to a TOC of 0% is found from the correlation between the log values and the measured TOC values, and the input data are then changed to the absolute value of the actual log value minus the log base value. For example, the linear relationship between the GR value and the measured TOC value of well JY1 is shown in Figure 7; its intersection with TOC = 0% defines the GR base value.

As a result, the base value $\mathrm{GR}_b$ is 124 API at a TOC value of 0%. If the corrected input value of this part of well JY1 is defined as $\mathrm{GR}'$, then $\mathrm{GR}' = \left| \mathrm{GR} - \mathrm{GR}_b \right| = \left| \mathrm{GR} - 124 \right|$. If a measured TOC value is not available to confirm the base value where TOC is 0%, the average value over a predicted shale interval with relatively small GR, AC, and RT, relatively large DEN and CNL, and no obvious changes in log values is taken as the base value. Following the above method, the modelling process was repeated after reprocessing the input data. The result is shown in Figure 8: $R^2$ increased by approximately 5%, and the large errors for TOC values below 0.6% were reduced.
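
The correction itself is a one-liner; the sketch below uses the GR base value of 124 API quoted above, with illustrative array values:

```python
import numpy as np

GR = np.array([150.2, 131.7, 118.4, 162.9])  # raw GR values (API), illustrative
GR_b = 124.0                                 # base value where TOC = 0% (well JY1)

# The input feature becomes the absolute deviation from the base value
GR_input = np.abs(GR - GR_b)
```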

3.2.4. Extrapolation and Application of the Model

Based on the models established by the two machine learning methods, TOC prediction for the other three wells in the study area, JY2, JY6, and JY7, was carried out. The comprehensive results are shown in Figure 9: $R^2$ is above 0.85, the mean absolute error (MAE) is approximately 0.3, and the mean relative error (MRE) is approximately 0.2 (Table 3). The values with large relative errors occur mainly where the TOC value is less than 0.6%, because the log values of this part are strongly disturbed by factors other than the TOC. On the whole, however, the model has strong extrapolation ability and good generalization.

In addition, a comparison of the two machine learning methods with the traditional method by extrapolation (Figure 10) showed that these two machine learning methods lead to great accuracy improvements in the results.

4. Discussion

The greatest advantage of the traditional ΔlgR method is that it can eliminate the influence of porosity on the log response of organic carbon. However, it is not reasonable to predict TOC using a fixed empirical coefficient with many limitations [62, 63]. In addition, only the amplitude difference of two log curves is used to calculate the organic matter content, and other important log information may be ignored, resulting in poor anti-interference ability of the model. Machine learning-based methods can synthesize a variety of log information to predict TOC. The results show that a large amount of geophysical information can reflect the changes in the composition of formation materials through different physical quantities, and a method that integrates a variety of information has relatively better anti-interference ability.

In this study, using a variety of shale data from the Wufeng Formation and the lower part of the Longmaxi Formation in the Sichuan Basin, the accuracy of TOC prediction by Bayesian linear regression and the random forest method is more than 50% higher than that of the traditional ΔlgR method. Of the two, the Bayesian linear regression model is more accurate. This method incorporates relevant domain knowledge and prior guesses of the model parameters; it does not assume that all the required parameter information will be provided by the available data, breaking through the limitations of the data itself. If no prior expectation exists, a noninformative prior can be used for the parameters, which facilitates the construction of the model.

In the field of machine learning, the random forest method is more suitable for regression problems than other common algorithms, especially problems with nonlinear or complex relationships between features and labels, such as the TOC prediction problem. It fits a set of uncorrelated decision trees on subsamples of the data and improves the prediction accuracy and reduces the variance by averaging. It is insensitive to noise in the training set and thus is more conducive to obtaining a robust model that avoids overfitting. However, because a large number of decision trees must be connected together, general parameter optimization methods require considerable training time. Therefore, attention should be given to the method of parameter optimization in TOC prediction. In this paper, the halving random search cross-validation method was used to optimize the hyperparameters of the random forest model, which greatly improved the learning efficiency and increased the calculation speed by more than 60 times. In other words, a well-trained machine learning model can quickly and easily predict the organic carbon content of shale.

In addition, machine learning models can be updated conveniently: if a new dataset becomes available, the model can be upgraded and applied more broadly. Compared with traditional methods, machine learning models are data-driven, thereby avoiding a large number of theoretical assumptions and mathematical derivations. Moreover, it should be noted that the input data structure has a great impact on the building of the machine learning model, so data preprocessing before training is very important. For TOC prediction, this paper provides a new data preprocessing strategy: eliminating the log value changes caused by factors other than TOC before input, which improved the prediction accuracy by approximately 5%.

The machine learning models proposed in this paper can provide accurate prediction results for both training data and test data with reasonable extrapolation. However, from the perspective of application, they have certain limitations. First, the data used in this paper are from the same research area with similar geological conditions; thus, the reliability of the models needs to be further verified in areas with substantially different geological conditions. Second, owing to the frequent changes in the properties of the sedimentary water in geological history, the heterogeneity of shale strata is strong and the TOC values vary greatly. The limited TOC measurements may not fully reflect the characteristics of the entire formation, and the applicability of the models is unknown for strata not covered by TOC tests. In addition, the measured TOC values used in this study range from 0.01% to 6.02%, so the applicability of the models to higher TOC values is not discussed. For intervals with TOC values less than 0.6%, further research is necessary to improve the prediction accuracy. In this regard, TOC test data and corresponding log data from different basins, sedimentary environments, and structural backgrounds could be collected, and a more comprehensive model could be built within a large numerical framework. In this way, a general relationship between TOC values and log values could be found.

5. Conclusion

This paper uses Bayesian linear regression and random forest algorithms to predict TOC values. Compared with the ΔlgR method, both machine learning methods achieve higher TOC prediction accuracy and better generalization in the overmature and tight shale of the study area. When the random forest method is used for modelling, halving random search cross-validation can be applied to find the optimal hyperparameters and improve the training speed. Log data with the corresponding log base values removed can be taken as the input data for modelling; in this way, factors other than TOC that affect the log values can be excluded, ensuring the accuracy of the predicted results. In addition, when a new dataset is provided, the machine learning model can be updated conveniently, which is of great significance for improving the efficiency of shale gas exploration and development.

Data Availability

The main data used to support the findings of this study are included within the article, and the others are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study is supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA14010202) and the National Science and Technology Major Project (2017ZX05008–004).